next up previous
Next: DOES IT REALLY WORK Up: Issues in Building General Previous: BUILDING RULES


The importance and realization of lexical stress vary between languages, but producing a reasonable pronunciation from a string of letters often requires more than a string of phones: lexical stress markings are also needed. In English, lexical stress may differ with syntactic class, and it may even move under some morphological derivations. Predicting lexical stress for each vowel in the predicted string therefore cannot, in general, be done from the letter context alone. However, results in [12] suggest that combining phone and stress prediction in a single model gives better results.

We tested this on the OALD data set. We first built letter to phone models in which lexical stress information was removed from the phones, and trained a separate stress prediction model using the same test set, with features such as syllable position in word, vowel length, vowel height, number of syllables from end of word, and part of speech. On held out data from the OALD the per syllable results are:

                       Predicted
Actual        unstressed   stressed   % correct
unstressed          7390        378       95.1%
stressed             512       8207       94.1%

Total correct: 15597/16487 (94.6%)
This model was combined with the output of the letter to phone model (LTP+S).
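
The percentages in the per syllable table follow directly from the counts. As a quick check, here is a minimal sketch that recomputes them; the dictionary layout and function names are illustrative assumptions, not part of the original system.

```python
# Per-syllable confusion counts from the table above
# (outer key = actual class, inner key = predicted class).
confusion = {
    "unstressed": {"unstressed": 7390, "stressed": 378},
    "stressed": {"unstressed": 512, "stressed": 8207},
}


def per_class_accuracy(confusion):
    """Fraction of each actual class that was predicted correctly."""
    return {
        actual: row[actual] / sum(row.values())
        for actual, row in confusion.items()
    }


def overall_accuracy(confusion):
    """Return (correct, total, accuracy) over all syllables."""
    correct = sum(row[actual] for actual, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct, total, correct / total


for label, acc in per_class_accuracy(confusion).items():
    print(f"{label}: {acc:.1%}")

correct, total, acc = overall_accuracy(confusion)
print(f"total correct {correct}/{total} ({acc:.1%})")
```

The per-class figures are recall over the actual class (the row totals), which is how the 95.1% and 94.1% values in the table are obtained.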

The second model introduced two versions of each vowel phone, stressed and unstressed. The standard LTS model building technique was then applied, so that the CART trees themselves produced phone and stress information directly (LTPS).
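
The idea of doubling the vowel inventory can be sketched as a simple relabelling of the training data before the trees are built. The vowel set, the "1" suffix for stressed vowels, and the toy transcription below are all assumptions made for illustration; they are not taken from the OALD phone set or from the original implementation.

```python
# Hypothetical vowel inventory; a real phone set would supply its own.
VOWELS = {"a", "e", "i", "o", "u", "@", "aa", "ii", "uu", "ei", "ou"}


def merge_stress(phones, stresses):
    """Fold per-vowel stress flags into the phone symbols.

    phones:   list of phone symbols for one word
    stresses: one 0/1 flag per vowel, in order of occurrence
    Stressed vowels get a "1" suffix (an assumed convention), so the
    learner sees stressed and unstressed vowels as distinct phones.
    """
    flags = iter(stresses)
    merged = []
    for phone in phones:
        if phone in VOWELS:
            merged.append(phone + "1" if next(flags) else phone)
        else:
            merged.append(phone)
    return merged


def strip_stress(phones):
    """Inverse mapping: remove the assumed stress suffix again."""
    return [p[:-1] if p.endswith("1") else p for p in phones]


# Toy transcription of "photograph", made up for this example.
phones = ["f", "ou", "t", "@", "g", "r", "a", "f"]
print(merge_stress(phones, [1, 0, 0]))
```

With labels of this kind, a single classifier predicts phone and stress together, and the stress-free phone string can still be recovered for scoring that ignores stress.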

           LTP+S     LTPS
LNS       96.36%   96.27%
Letter        --   95.80%
WNS       76.92%   74.69%
Word      63.68%   74.56%

(LNS = letter/phone ignoring stress, WNS = word ignoring stress)
A score for "letters correct" for the separate model is not available, as the stress prediction model does not preserve alignment.
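
For a model that does preserve the letter alignment, the four kinds of score (letter and word, with and without stress) could be computed along the following lines. This is a sketch under stated assumptions: each word is a list with one phone label per letter, stressed vowels carry a hypothetical "1" suffix, and the two-word data set is invented purely to exercise the function.

```python
def strip(label):
    """Drop the assumed "1" stress suffix from a phone label."""
    return label[:-1] if label.endswith("1") else label


def score(gold_words, predicted_words, ignore_stress=False):
    """Return (letter_accuracy, word_accuracy) over aligned label sequences.

    Each word is a list with one label per letter, so the gold and
    predicted sequences for the same word have equal length.
    """
    norm = strip if ignore_stress else (lambda label: label)
    letters = letters_ok = words_ok = 0
    for gold, pred in zip(gold_words, predicted_words):
        matches = [norm(g) == norm(p) for g, p in zip(gold, pred)]
        letters += len(matches)
        letters_ok += sum(matches)
        # A word counts as correct only if every letter's label matches.
        words_ok += all(matches)
    return letters_ok / letters, words_ok / len(gold_words)


# Invented toy data: the second word is right except for its stress mark.
gold = [["k", "a1", "t"], ["b", "@", "n", "aa1", "n", "@"]]
pred = [["k", "a1", "t"], ["b", "@", "n", "aa", "n", "@"]]

print(score(gold, pred))                      # letter and word scores
print(score(gold, pred, ignore_stress=True))  # LNS- and WNS-style scores
```

A single wrong stress mark costs one letter but a whole word, which is why the word scores are so much more sensitive to stress errors than the letter scores.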

Thus, although higher per word scores are possible when stress is ignored (WNS), a separate stress model applied afterwards gives a much lower word score (63.68%) than predicting phones and stress levels with a single model (74.56%).

We also found that including part of speech information in the phone prediction models themselves improved accuracy. Without POS information the combined model gives 95.32% letters correct and 71.28% words correct, against 95.80% and 74.56% with it. Part of speech therefore helps, and it is readily available in a TTS system with a standard POS tagger, even for unknown words.

Ultimately, stress cannot be predicted from local context alone, as there are a number of examples in English where local context is insufficient (cf. photograph/photography). Ideally, morphological decomposition is required for such prediction, but we have not yet investigated this area.

Alan W Black