
An experiment

Our basic experiment involved taking recordings of two different dialects of Spanish. We chose Spanish as the test language, even though it is a well-defined case, because we had already built Spanish synthesizers. It was that familiarity that led us to choose it for this experiment.

Also, the relationship between the writing system and the phonology is relatively close. However, although it is not complex, it is also not simply a one-to-one relationship.

In order to build a voice without using phonetic information, we used the letter set as the phoneme set. Thus our phoneset consists of the 26 standard English letters plus the accented characters á, é, í, ó, ú, and ñ, and also SIL (silence). We did use our knowledge of the language in that we made all letters lower case and omitted rarer accented characters like ç.
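A minimal sketch of such a letter-based phoneset is given below. The names `PHONESET`, `ACCENTED`, and `SILENCE` are hypothetical and chosen only for illustration; they are not taken from our synthesizer.

```python
import string

# Hypothetical names, for illustration only.
ACCENTED = "áéíóúñ"   # the accented characters kept in the set
SILENCE = "SIL"       # the silence symbol

# 26 lower-case letters + 6 accented characters + SIL
PHONESET = list(string.ascii_lowercase) + list(ACCENTED) + [SILENCE]


def in_phoneset(symbol: str) -> bool:
    """Return True if the symbol is one of the letter-based 'phones'."""
    return symbol in PHONESET
```

Under these assumptions the set holds 33 symbols, and a character such as ç is simply not a member.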

The texts we used for recording had been automatically selected from various newspaper texts to give the best diphone coverage for a general Spanish synthesizer. More elaborate selection techniques, such as [3], were not available to us, as they would require a more detailed phonetic and acoustic analysis of the language. We are aware that the prompts used in our recordings therefore relied on some phonetic knowledge in their construction, but we still feel the basic experiment is valid.

The lexicon, the process that provides pronunciations for words, simply takes each word, converts the characters in it to lower case, and returns them as a list of phones. As no vowel/consonant information is available, each word is coded as a single syllable.
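The lexicon step described above can be sketched as follows. This is an illustration under stated assumptions rather than our actual implementation: the function name `lookup`, the nested-list return format, the Unicode normalization step, and the choice to silently skip characters outside the letter set are all assumptions made for the example.

```python
import string
import unicodedata

# Assumed letter inventory: 26 letters plus the six accented characters.
LETTERS = set(string.ascii_lowercase) | set("áéíóúñ")


def lookup(word: str) -> list[list[str]]:
    """Return a 'pronunciation' as a list of syllables, each a list of phones.

    Every word is coded as one syllable, because no vowel/consonant
    information is available to split it further.
    """
    # Compose accents so that 'a' + combining acute becomes the single 'á'.
    normalized = unicodedata.normalize("NFC", word).lower()
    # Assumption: characters outside the letter set are skipped.
    phones = [ch for ch in normalized if ch in LETTERS]
    return [phones]
```

For example, `lookup("Niño")` yields a single syllable containing the phones n, i, ñ, o.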

Another knowledge-based expansion of the data is the conversion of numeric strings to number words, as is conventional in text-to-speech synthesizers. As our text was selected from newspapers, a number of digit strings and abbreviations appeared in it. The pronunciation of such tokens is not closely related to their letter sequence. In a standard Spanish synthesizer, token expansion rules are used to expand these "non-standard words" into explicit, complete words. For this experiment we used the same expansion rules on the data, thus using some knowledge of the language. However, this is equivalent to requiring each word to be written in full.
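To make the idea of token expansion concrete, the sketch below expands one- and two-digit strings into Spanish number words. It is a deliberately small illustration, not the rule set of our synthesizer: the function names are hypothetical, only the range 0 to 99 is covered, gender agreement is ignored, and abbreviations are left untouched.

```python
import re

UNITS = ["cero", "uno", "dos", "tres", "cuatro",
         "cinco", "seis", "siete", "ocho", "nueve"]

# 10..29 are written as single words in Spanish.
TEN_TO_TWENTY_NINE = [
    "diez", "once", "doce", "trece", "catorce", "quince", "dieciséis",
    "diecisiete", "dieciocho", "diecinueve", "veinte", "veintiuno",
    "veintidós", "veintitrés", "veinticuatro", "veinticinco", "veintiséis",
    "veintisiete", "veintiocho", "veintinueve",
]

TENS = {3: "treinta", 4: "cuarenta", 5: "cincuenta", 6: "sesenta",
        7: "setenta", 8: "ochenta", 9: "noventa"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (illustrative subset only)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 10:
        return UNITS[n]
    if n < 30:
        return TEN_TO_TWENTY_NINE[n - 10]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else f"{TENS[tens]} y {UNITS[unit]}"


def expand_tokens(text: str) -> list[str]:
    """Replace one- and two-digit tokens with number words; keep the rest."""
    words = []
    for token in text.split():
        if re.fullmatch(r"[0-9]{1,2}", token):
            words.extend(number_to_words(int(token)).split())
        else:
            words.append(token)
    return words
```

After expansion, every token is an ordinary word, so the letter-based lexicon can treat a digit string exactly as it would treat the same number written out in full.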

The prompt list of 419 utterances was recorded by a female Castilian Spanish speaker and by a male Colombian Spanish speaker. The prompts contain 5044 words, and the number of units in each database is around 28,000. The exact number of units varies between speakers because of the number of leading, trailing, and inter-phrasal SIL phones required: the speakers did not deliver the data at exactly the same speed, nor with the same phrasing. The data was recorded in professional recording studios, sampled at 16 kHz, with a simultaneous EGG (laryngograph) channel.


Alan W Black 2002-10-01