
Comparisons using English

The first experiment was a comparison on the original English corpus. Figure 9 shows learning curves for CHILL when using the lexicons learned by WOLFIE (CHILL+Wolfie) and by Siskind's system (CHILL+Siskind). The uppermost curve (CHILL+handbuilt) shows CHILL's performance when given the hand-built lexicon. CHILL-testlex shows the performance when words that never appear in the training data (i.e., words that appear only in the test sentences) are deleted from the hand-built lexicon, since a learning algorithm has no chance of acquiring these. Finally, the horizontal line shows the performance of the GEOBASE benchmark.
Figure 9: Accuracy on English Geography Corpus

The results show that a lexicon learned by WOLFIE led to parsers that were almost as accurate as those generated using a hand-built lexicon. The best accuracy is achieved by parsers using the hand-built lexicon, followed by the hand-built lexicon with test-set-only words removed, followed by WOLFIE, followed by Siskind's system. All of the systems perform at least as well as GEOBASE by the time they reach 125 training examples. The differences between WOLFIE and Siskind's system are statistically significant at all training set sizes. These results show that WOLFIE can learn lexicons that support the learning of successful parsers, and that are better from this perspective than those learned by a competing system. Also, comparing to the CHILL-testlex curve, we see that most of the drop in accuracy from a hand-built lexicon is due to words in the test set that the system has not seen during training. In fact, none of the differences between CHILL+Wolfie and CHILL-testlex are statistically significant.

One of the implicit hypotheses of our problem definition is that coverage of the training data implies a good lexicon. WOLFIE covered 100% of the 225 training examples, versus 94.4% for Siskind's system. In addition, the lexicons learned by Siskind's system were larger and more ambiguous than those learned by WOLFIE. After 225 training examples, WOLFIE's lexicons had an average of 1.1 meanings per word and an average size of 56.5 entries, versus 1.7 meanings per word and 154.8 entries for Siskind's lexicons. For comparison, the hand-built lexicon had 1.2 meanings per word and 88 entries. These differences, summarized in Table 3, undoubtedly contribute to the final performance differences.
Table 3: Lexicon Comparison

Lexicon      Coverage   Ambiguity (meanings/word)   Entries
hand-built   100%       1.2                         88
WOLFIE       100%       1.1                         56.5
Siskind      94.4%      1.7                         154.8
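As an illustration of how the "Ambiguity" and "Entries" statistics in Table 3 are defined, the following sketch computes them from a toy lexicon; the lexicon contents and representation here are invented for illustration and are not the actual learned lexicons.

```python
# Toy lexicon mapping each word to its list of candidate meanings
# (hypothetical entries, loosely in the style of the geography domain).
toy_lexicon = {
    "capital": ["capital(C,S)"],
    "state": ["state(S)"],
    "largest": ["largest(X,G)", "highest(X,G)"],  # an ambiguous word
}

# "Entries" counts every word-meaning pair in the lexicon.
entries = sum(len(meanings) for meanings in toy_lexicon.values())

# "Ambiguity" is the average number of meanings per word.
ambiguity = entries / len(toy_lexicon)

print(entries)               # 4
print(round(ambiguity, 2))   # 1.33
```

By these definitions, a less ambiguous and smaller lexicon (such as WOLFIE's 1.1 meanings per word over 56.5 entries) gives the parser fewer competing word meanings to choose among during training.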

Cindi Thompson