
DOES IT REALLY WORK

To obtain a more realistic assessment of how these models treat unknown words, we processed the first section of the WSJ Penn Treebank [9]. This consists of 39923 words of news text. Using our standard OALD lexicon, we find that 1775 words (4.6%) are not in the lexicon, 943 of which are unique. These unknown words are distributed as follows:

Category	Occurs	%
names	1360	76.6
unknown	351	19.8
American spelling	57	3.2
typos	7	0.4
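
As a quick consistency check, the percentage column in the table above can be re-derived from the raw counts. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original experiment.

```python
# Counts of unknown-word categories, copied from the table above.
counts = {
    "names": 1360,
    "unknown": 351,
    "American spelling": 57,
    "typos": 7,
}

# Total number of unknown word tokens (should match the 1775 quoted in the text).
total = sum(counts.values())

# Percentage share of each category, rounded to one decimal place.
shares = {name: round(100.0 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}\t{counts[name]}\t{share}")
```

The four counts sum to the 1775 unknown tokens reported above, and the rounded shares agree with the percentage column.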

American spellings (e.g. ``honor'', ``center'') are distinguished here because they are so systematic. As OALD is a British English lexicon it does not contain such spellings, though for TTS use it clearly should. As WSJ is more carefully edited than texts such as email, typos are almost negligible. We carried out a similar analysis of unknown words in Time magazine articles and found a very similar distribution and proportion of unknowns, so we believe the above is typical of news-story text.

We listened to each of the 1775 words as pronounced by a number of the models discussed above, and made a yes/no decision about the acceptability of each. Note that a number of these words have multiple acceptable pronunciations; if any one of them was predicted, the word was deemed acceptable. For example, the pronunciations of ``Reagan'' as /r ey g ah n/ and as /r iy g ah n/ were both considered acceptable.
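
The judgments in this experiment were made by listening, not automatically. Purely as an illustration of the scoring rule (a prediction counts as acceptable if it matches any accepted variant), here is a minimal Python sketch. It assumes the acceptable pronunciations have been written down as phone sequences; the `ACCEPTABLE` table, the function names, and the third prediction in the usage example are hypothetical.

```python
# Hypothetical reference data: each word maps to the set of pronunciations
# judged acceptable. The two "Reagan" variants are those given in the text.
ACCEPTABLE = {
    "Reagan": {
        ("r", "ey", "g", "ah", "n"),
        ("r", "iy", "g", "ah", "n"),
    },
}


def is_acceptable(word, predicted, acceptable=ACCEPTABLE):
    """Yes/no decision: True if the predicted phones match any accepted variant."""
    return tuple(predicted) in acceptable.get(word, set())


def acceptance_rate(predictions, acceptable=ACCEPTABLE):
    """Percentage of (word, phones) tokens judged acceptable."""
    if not predictions:
        return 0.0
    hits = sum(is_acceptable(word, phones, acceptable) for word, phones in predictions)
    return 100.0 * hits / len(predictions)


# Usage: two accepted variants and one made-up incorrect prediction.
predictions = [
    ("Reagan", "r ey g ah n".split()),
    ("Reagan", "r iy g ah n".split()),
    ("Reagan", "r eh g ah n".split()),  # hypothetical wrong output
]
rate = acceptance_rate(predictions)
print(f"{rate:.2f}% acceptable")
```

Scoring per token rather than per unique word mirrors the way the 1775 occurrences are counted in this section.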

The best results, shown above for OALD, were obtained by building the deepest possible trees. When those models were applied to these unknown words, however, the results showed that although the models were not over-trained with respect to the unseen test set extracted from the lexicon itself, they were over-trained with respect to the unknown words. The following table shows the results of varying the stop value used in CART building.

Stop	Lexicon test set	Unknown test set	Size
1	74.56%	62.14%	39500
4	65.17%	67.66%	17948
5	63.15%	70.65%	14968
6	61.65%	67.49%	12782

Thus the best model for unknown words is not the best model for the held-out lexical entries. What is more, the best model for unknown words is less than 40% of the size of the best model for the lexical test set. These figures reflect two facts: first, the held-out data in the lexical test set (every tenth entry) is often just a morphological variant of the entries around it; second, the lexical test set does not take into account the frequency with which unknown words occur.
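
The comparison can be checked directly from the figures in the table. The following Python sketch picks the best stop value for each test set and computes the size ratio; the tuple layout and variable names are illustrative assumptions.

```python
# (stop value, lexicon test accuracy %, unknown-word accuracy %, model size),
# copied from the table of results for varying stop values.
results = [
    (1, 74.56, 62.14, 39500),
    (4, 65.17, 67.66, 17948),
    (5, 63.15, 70.65, 14968),
    (6, 61.65, 67.49, 12782),
]

# Best model according to each test set.
best_lexicon = max(results, key=lambda row: row[1])
best_unknown = max(results, key=lambda row: row[2])

# Size of the best unknown-word model relative to the best lexicon model.
size_ratio = best_unknown[3] / best_lexicon[3]

print(f"best stop on lexicon test set: {best_lexicon[0]}")
print(f"best stop on unknown words:    {best_unknown[0]}")
print(f"size ratio: {size_ratio:.1%}")
```

The two test sets select different stop values (1 and 5 respectively), which is why the choice of evaluation data matters when tuning the tree depth.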

Looking at the words that are pronounced wrongly, we find that some mistakes are still recognizable (e.g. Chrysler as /k r ih s l ah er/), but many are unacceptable and unrecognizable, showing that there is still work to be done. Further analysis of these words shows the following distribution:

Category	Occurs	%
names	413	79
unknown	94	18
American spelling	7	0
typos	2	0

One would expect proper names, especially those of foreign origin, to be the hardest to pronounce. Although they do appear to be slightly harder, our model seems to do nearly as well on them as on non-names.
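
To make "slightly harder" concrete, the per-category error rate can be estimated by dividing each error count by the corresponding count of unknown words. The Python sketch below assumes that the two distribution tables in this section describe the same 1775 tokens and the same categories; that pairing is an assumption of this example, not something stated explicitly in the text.

```python
# Occurrences of each category among all unknown words (first table).
totals = {"names": 1360, "unknown": 351, "American spelling": 57, "typos": 7}

# Occurrences of each category among wrongly pronounced words (second table).
errors = {"names": 413, "unknown": 94, "American spelling": 7, "typos": 2}

# Per-category error rate in percent, assuming both tables cover the same tokens.
error_rate = {name: 100.0 * errors[name] / totals[name] for name in totals}

for name, rate in error_rate.items():
    print(f"{name}\t{rate:.1f}%")
```

Under that assumption, names come out a few percentage points worse than other unknown words, which is consistent with the observation above.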

Further analysis of the names that are still mispronounced shows a larger proportion of non-Anglo-Saxon origin than among those that are correctly pronounced. Many of the languages these names originate from (e.g. Polish, Italian, Japanese in its romanized form) have a more regular relationship between spelling and pronunciation than English does, so knowing the origin of an unknown word may allow more specific rules to be applied. We have not yet investigated this area.


Alan W Black
1999-03-20