Note: these lexicons were extracted from an old version of CCGbank.
They will soon be replaced by ones extracted from the final version of CCGbank (which you can get from the LDC).


Here you can download lexicons that we have extracted from CCGbank, our translation of the Penn Treebank
to a corpus of Combinatory Categorial Grammar derivations. You are free to use these for research purposes;
however, we would appreciate it if you could acknowledge us by citing the following reference:

Julia Hockenmaier and Mark Steedman. Acquiring Compact Lexicalized Grammars from a Cleaner Treebank.
Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, 2002.


Data format

Each entry has five columns:
  1. The word (mail)
  2. Its lexical category (N)
  3. The word probability P(word=mail|cat=N)
  4. The category probability P(cat=N|word=mail)
  5. The frequency of the word--lexical category combination in the corpus

For example, the entries for the word mail are:
mail            N                   0.00013495       0.508772         29 
mail            N/N                 0.000168503      0.438596         25 
mail            S[b]\NP             0.000367647      0.0175439         1  
mail            ((S[b]\NP)/NP)/NP   0.00359712       0.0175439         1 
mail            ((S[dcl]\NP)/PP)/NP 0.000619963      0.0175439         1 
The probabilities are simple relative frequency estimates obtained from the observed frequency counts.
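
As an illustration of the relative frequency estimates, the following Python sketch parses the sample entries for mail and recomputes the category probability P(cat|word) from the frequency column. It assumes that the five lines shown are all of the entries for mail and that columns are separated by whitespace with no spaces inside a category; the names parse_entry, word_totals and recomputed are only illustrative and are not part of the distribution.

```python
from collections import defaultdict

# Sample entries copied from the listing above (raw string keeps the backslashes).
SAMPLE = r"""
mail            N                   0.00013495       0.508772         29
mail            N/N                 0.000168503      0.438596         25
mail            S[b]\NP             0.000367647      0.0175439         1
mail            ((S[b]\NP)/NP)/NP   0.00359712       0.0175439         1
mail            ((S[dcl]\NP)/PP)/NP 0.000619963      0.0175439         1
"""


def parse_entry(line):
    """Split one lexicon line into its five columns.

    Assumption: fields are whitespace-separated and contain no spaces.
    """
    word, cat, p_word, p_cat, freq = line.split()
    return word, cat, float(p_word), float(p_cat), int(freq)


entries = [parse_entry(line) for line in SAMPLE.splitlines() if line.strip()]

# Total frequency of each word, summed over all of its lexical categories.
word_totals = defaultdict(int)
for word, _cat, _p_word, _p_cat, freq in entries:
    word_totals[word] += freq

# Relative frequency estimate: P(cat | word) = freq(word, cat) / freq(word).
recomputed = {
    (word, cat): freq / word_totals[word]
    for word, cat, _p_word, _p_cat, freq in entries
}

for word, cat, _p_word, p_cat, _freq in entries:
    print(f"{word}\t{cat}\tlisted={p_cat}\trecomputed={recomputed[(word, cat)]:.6f}")
```

The word probability P(word|cat) cannot be checked in the same way from this excerpt alone, because its denominator is the total frequency of the category over the whole lexicon.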

In the .tags files, we have appended the POS-tag to each word:
mail|NN         N                   0.000134947      0.537037         29 
mail|NN         N/N                 0.000168487      0.462963         25 
mail|VB         S[b]\NP             0.000367647      0.5              1 
mail|VB         ((S[b]\NP)/NP)/NP   0.00359712       0.5              1 
mail|VBP        ((S[dcl]\NP)/PP)/NP 0.000619963      1                1 
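
A minimal sketch of reading .tags entries, assuming the POS tag is whatever follows the last "|" in the first column (the helper split_tagged and the variable names are hypothetical). It also shows that, in the sample above, the category probability is consistent with conditioning on the word and POS tag together, for example 29/54 for mail|NN rather than 29/57 for mail.

```python
from collections import defaultdict

# Sample .tags entries copied from the listing above.
TAGGED_SAMPLE = r"""
mail|NN         N                   0.000134947      0.537037         29
mail|NN         N/N                 0.000168487      0.462963         25
mail|VB         S[b]\NP             0.000367647      0.5              1
mail|VB         ((S[b]\NP)/NP)/NP   0.00359712       0.5              1
mail|VBP        ((S[dcl]\NP)/PP)/NP 0.000619963      1                1
"""


def split_tagged(token):
    """Split 'word|POS' at the last '|' (assumption: tags contain no '|')."""
    word, _sep, pos = token.rpartition("|")
    return word, pos


pair_totals = defaultdict(int)  # frequency of each word|POS pair
word_totals = defaultdict(int)  # frequency of each word, ignoring the POS tag
rows = []
for line in TAGGED_SAMPLE.splitlines():
    if not line.strip():
        continue
    token, cat, _p_word, p_cat, freq = line.split()
    word, pos = split_tagged(token)
    rows.append((word, pos, cat, float(p_cat), int(freq)))
    pair_totals[token] += int(freq)
    word_totals[word] += int(freq)

for word, pos, cat, p_cat, freq in rows:
    # Relative frequency given the word and its POS tag.
    estimate = freq / pair_totals[f"{word}|{pos}"]
    print(f"{word}\t{pos}\t{cat}\tlisted={p_cat}\trecomputed={estimate:.6f}")
```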


The files