

To make model building easier, we want a standardized alignment between the letters in an entry and the phones in its pronunciation.

The number of letters in a word and the number of phones in its pronunciation are in general not a one-to-one match. For the languages we have investigated, a letter can map to zero, one, two or, very exceptionally, three phones. Even when there are the same number of letters and phones, the ``correct'' alignment may not be the simplest one. In general there seem to be fewer phones than letters.

The cases where a letter goes to more than one phone are fairly restricted (e.g. ``x'' to /k s/, ``o'' to /w uh/ as in ``one''). Almost all letters can in some context correspond to no phone, which we will call _epsilon_.

A more complex model, mapping multi-letter clusters to zero or more phones, is also possible, though this introduces complexities in the model learning and alignment processes that we preferred to avoid.

Ideally we would like a purely automatic method for finding the best single-letter alignments, but so far we have achieved better results with a hand-seeded method.

The hand-seeded method requires an explicit list of the phones (or multi-phones) that each letter in the alphabet may correspond to, irrespective of context. This is relatively easy to do and can be done as an interactive process over the training set, with new correspondences added to the allowables list as they are found. For example, the letter ``c'' may be realized as any one of

     _epsilon_ k ch s sh t-s

Vowel letters typically have a much longer list of potential phones.
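
As an illustration of the allowables list, the sketch below stores it as a Python dictionary and adds a small check that could support the interactive process of extending it. Only the entry for ``c'' comes from the text; the other entries, the phone names, and the helper `uncovered_phones` are assumptions made for this example, not part of the original system.

```python
EPSILON = "_epsilon_"

# Letter -> phones (or hyphen-joined multi-phones) it may correspond to.
ALLOWABLES = {
    "c": [EPSILON, "k", "ch", "s", "sh", "t-s"],  # taken from the text
    "a": [EPSILON, "ae", "ey", "aa"],              # illustrative guess
    "t": [EPSILON, "t"],                           # illustrative guess
}

def uncovered_phones(word, phones, allowables):
    """Return the phones that no letter of `word` is allowed to produce.

    A non-empty result hints that the allowables list is missing an entry
    (or that the word is too opaque to align at all).
    """
    reachable = set()
    for letter in word:
        for unit in allowables.get(letter, []):
            if unit != EPSILON:
                # A multi-phone such as "t-s" covers each of its parts.
                reachable.update(unit.split("-"))
    return [p for p in phones if p not in reachable]

print(uncovered_phones("cat", ["k", "ae", "t"], ALLOWABLES))
print(uncovered_phones("cat", ["k", "ae", "th"], ALLOWABLES))
```

This check is only a necessary condition: a word can pass it and still have no complete alignment.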

The hand-seeded algorithm takes the list of allowables and finds all possible alignments between each entry's letters and phones. It counts how often each correspondence is used across these alignments and from the counts estimates a table of probabilities of a phone (or epsilon, or multi-phone) given a letter, again irrespective of context. The entries are then re-aligned, each possible alignment is scored with the estimated probabilities, and the best alignment is selected. The alignments generated by this algorithm are close to what would be produced by hand, and it is very rare to find alignments that would be considered unacceptable.
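
The steps just described can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the original implementation: the toy lexicon, the allowables table (apart from the ``c'' entry), the phone names, and the function names are all hypothetical, and alignments are enumerated by brute force.

```python
from collections import Counter, defaultdict
from math import log

EPSILON = "_epsilon_"

# Hypothetical allowables and a three-word toy lexicon.
ALLOWABLES = {
    "c": [EPSILON, "k", "ch", "s", "sh", "t-s"],
    "h": [EPSILON, "hh"], "e": [EPSILON, "eh"], "k": [EPSILON, "k"],
    "i": ["ih"], "t": ["t"], "n": ["n"],
}
ENTRIES = [("check", ["ch", "eh", "k"]),
           ("kit", ["k", "ih", "t"]),
           ("kin", ["k", "ih", "n"])]

def alignments(letters, phones):
    """Yield every allowable alignment as a list of (letter, unit) pairs."""
    if not letters:
        if not phones:          # both exhausted: one complete alignment
            yield []
        return
    for unit in ALLOWABLES.get(letters[0], []):
        used = [] if unit == EPSILON else unit.split("-")
        if phones[:len(used)] == used:
            for tail in alignments(letters[1:], phones[len(used):]):
                yield [(letters[0], unit)] + tail

def estimate(entries):
    """Estimate P(unit | letter) from counts over all possible alignments."""
    counts = defaultdict(Counter)
    for word, phones in entries:
        for path in alignments(word, phones):
            for letter, unit in path:
                counts[letter][unit] += 1
    return {l: {u: n / sum(c.values()) for u, n in c.items()}
            for l, c in counts.items()}

def best_alignment(word, phones, probs):
    """Return the highest-scoring alignment, or None if none exists."""
    paths = list(alignments(word, phones))
    if not paths:
        return None
    return max(paths, key=lambda p: sum(log(probs[l][u]) for l, u in p))

probs = estimate(ENTRIES)
for word, phones in ENTRIES:
    print(word, best_alignment(word, phones, probs))
```

In this toy data ``check'' has two possible alignments (either ``c'' or ``k'' carries the final /k/); the counts from ``kit'' and ``kin'' make the letter ``k'' the more probable carrier, so that alignment is selected. The sketch only re-aligns entries that were used for estimation, so every correspondence it scores has a non-zero probability.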

Building the allowables table is simple and quick. It does require some skill, but it can be done even without in-depth knowledge of the language the lexicon is for. A few words do not produce alignments (they would require new entries in the allowables table); these typically belong to classes for which the relationship between the letter form and the phones is too opaque. They are typically abbreviations, such as ``dept'' as /d ih p aa r t m ah n t/; words with very unusual pronunciations, e.g. ``lieutenant'' (British English); foreign words, e.g. ``Lvov''; and what could be considered mistakes in the lexicon, e.g. ``cannibalistic'' with two /l/ phones. Typically the number of entries that fail to align is well under 1%.

The second alignment method is an application of the expectation maximization (EM) algorithm [7], which we call the ``epsilon scattering method''. The idea is to estimate the probabilities for one letter L to match one phoneme P, and to use dynamic time warping (DTW) to introduce epsilons at the positions that maximize the probability of the word's alignment path. Once the dictionary is aligned, the association probabilities can be computed again, and so on until convergence; for example, five iterations are necessary on the CMU lexicon.

     /* initialize prob(L,P) */
     1  foreach word in training_set
            count with DTW all possible L/P
            associations for all possible epsilon
            positions in the phonetic transcription
     /* EM loop */
     2  foreach word in training_set
            compute new_p(L,P) on alignment_path
     3  if (prob != new_p) goto 2

This differs from [6] in that the probabilities are distributed equally (``scattered'') among the possible alternatives, rather than an arbitrary weight being assigned to each shift.
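
One way to read the pseudocode is sketched below in Python. It is an interpretation under stated assumptions, not the original code: each letter maps to exactly one phone or to epsilon, the initial counts are gathered by brute-force enumeration of epsilon positions rather than by DTW, the re-estimation uses only the single best path per word, and the toy lexicon, the `floor` value, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log

EPS = "_epsilon_"

def paddings(n_letters, phones):
    """Every way of padding `phones` with epsilons to length n_letters."""
    for keep in combinations(range(n_letters), len(phones)):
        padded = [EPS] * n_letters
        for pos, phone in zip(keep, phones):
            padded[pos] = phone
        yield padded

def normalize(counts):
    return {l: {p: n / sum(c.values()) for p, n in c.items()}
            for l, c in counts.items()}

def best_path(word, phones, prob, floor=1e-6):
    """Dynamic-programming search for the most probable epsilon placement."""
    n, m = len(word), len(phones)
    neg = float("-inf")
    # score[i][j]: best log-probability of the first i letters
    # having produced the first j phones.
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(1, n + 1):
        table = prob.get(word[i - 1], {})
        for j in range(min(i, m) + 1):
            best, unit = score[i - 1][j] + log(table.get(EPS, floor)), EPS
            if j > 0:
                take = score[i - 1][j - 1] + log(table.get(phones[j - 1], floor))
                if take > best:
                    best, unit = take, phones[j - 1]
            score[i][j], back[i][j] = best, unit
    if score[n][m] == neg:
        return None                      # more phones than letters
    path, j = [], m
    for i in range(n, 0, -1):            # trace the best path backwards
        path.append((word[i - 1], back[i][j]))
        if back[i][j] != EPS:
            j -= 1
    return path[::-1]

def epsilon_scattering(entries, max_iter=20):
    counts = defaultdict(Counter)        # step 1: scatter over all positions
    for word, phones in entries:
        for padded in paddings(len(word), phones):
            for letter, phone in zip(word, padded):
                counts[letter][phone] += 1
    prob = normalize(counts)
    for _ in range(max_iter):            # steps 2 and 3: re-align, re-estimate
        counts = defaultdict(Counter)
        for word, phones in entries:
            for letter, phone in best_path(word, phones, prob) or []:
                counts[letter][phone] += 1
        new_prob = normalize(counts)
        if new_prob == prob:
            break
        prob = new_prob
    return prob

ENTRIES = [("kit", ["k", "ih", "t"]), ("nit", ["n", "ih", "t"]),
           ("kin", ["k", "ih", "n"]), ("tin", ["t", "ih", "n"]),
           ("knit", ["n", "ih", "t"])]
prob = epsilon_scattering(ENTRIES)
print(best_path("knit", ["n", "ih", "t"], prob))
```

On this toy lexicon the initial scattering spreads the counts for ``knit'' over four epsilon positions; the unambiguous words then pull the best path towards a silent ``k''. Words with more phones than letters cannot be aligned under the one-letter-one-phone assumption and are simply skipped here.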

When we build models from the results of alignment using each of the above algorithms on the OALD, we get the following results:

     Method               Letters correct   Words correct
     Epsilon scattering   90.69%            63.97%
     Hand-seeded          93.97%            78.13%

``Letters correct'' is the proportion of letter-phone pairs that are correctly predicted with respect to the test set. ``Words correct'' is the proportion of complete words for which the complete predicted phone string (minus epsilons, but including stress markers) is correct with respect to the test set.
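
The two measures can be computed along the following lines. This is a small sketch with made-up data: the function name `accuracy`, the phone names, and the example predictions are assumptions for illustration, and stress markers are left out.

```python
EPSILON = "_epsilon_"

def accuracy(predicted, reference):
    """Return (letters correct, words correct) as fractions.

    Both arguments hold, for each word, one phone unit per letter.
    """
    letters_total = letters_ok = words_ok = 0
    for pred, ref in zip(predicted, reference):
        letters_total += len(ref)
        letters_ok += sum(p == r for p, r in zip(pred, ref))
        # Words are compared on the phone string with epsilons removed.
        strip = lambda units: [u for u in units if u != EPSILON]
        words_ok += strip(pred) == strip(ref)
    return letters_ok / letters_total, words_ok / len(reference)

reference = [["ch", EPSILON, "eh", EPSILON, "k"],   # check
             ["k", "ih", "t"],                      # kit
             [EPSILON, "n", "ih", "t"]]             # knit
predicted = [["ch", EPSILON, "eh", "k", EPSILON],   # same phones, shifted
             ["k", "ih", "t"],
             ["k", "n", "ih", "t"]]                 # spurious /k/

letters, words = accuracy(predicted, reference)
print(f"letters {letters:.2%}, words {words:.2%}")
```

The first example word shows why the two figures can diverge: the predicted epsilon sits on a different letter, so two letter-phone pairs count as wrong even though the word's phone string is correct.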

So we can see clearly that the hand-seeded method is better. We still feel that hand seeding is a simple task, but we also feel that we have not yet fully investigated ways of improving the automatic method to reach the level of the hand-seeded method.

Alan W Black