
Festival: Building new voices

Audio files are 16-bit Microsoft .WAV at (mostly) 16 kHz sampling.

Festival is designed to allow new voices and languages to be added easily and consistently. In many cases no new C++ code is required. The document Building Voices in Festival includes instructions and scripts to aid the building of new voices. The site http://www.festvox.org/ hosts the document, scripts and example databases.

Building Diphone Databases

One major part is building a diphone database (if that route is taken). This involves collecting all phone-phone transitions in the language. For example, in (US) English this is done by recording 1348 nonsense words. We synthesize prompts (if we have an existing synthesizer in the target language):

pau t aa b aa b aa pau
pau t aa p aa p aa pau
...

These are then spoken by our target speaker:

pau t aa b aa b aa pau
pau t aa p aa p aa pau
...

After recording, these are autoaligned (using cross-language aligning if necessary) and a voice is built automatically. Although autoaligning is usually good, it sometimes fails, requiring some amount of hand correction.

Fully automatic diphone based voice
After some hand correction

For this voice, recording took one morning, aligning took about an hour, and hand correction took two hours.

This technique has also been used for building synthesizers in other languages, including Greek, Polish, Basque, Spanish, various English dialects, Swedish and German.
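The coverage idea behind the nonsense-word prompts in this section can be sketched in a few lines of Python. This is only an illustration, not the festvox scripts: the function names and the tiny phone inventory are assumptions, and treating every phone-phone pair as required is a simplification, since a real language does not use all of them.

```python
from itertools import product

def diphones(prompt):
    """Return the set of adjacent phone pairs in a space-separated prompt."""
    phones = prompt.split()
    return set(zip(phones, phones[1:]))

def missing_diphones(phone_set, prompts):
    """Return the phone-phone pairs over phone_set that no prompt covers.

    Assumption: every ordered pair is wanted; a real diphone list would
    drop pairs that cannot occur in the language.
    """
    covered = set()
    for prompt in prompts:
        covered |= diphones(prompt)
    wanted = set(product(phone_set, repeat=2))
    return wanted - covered

if __name__ == "__main__":
    prompts = ["pau t aa b aa b aa pau", "pau t aa p aa p aa pau"]
    # Hypothetical three-phone inventory, for illustration only.
    for pair in sorted(missing_diphones({"aa", "b", "p"}, prompts)):
        print("-".join(pair))
```

Running a check like this over a prompt list shows which transitions still need a carrier word before recording starts.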

Building Prosodic Models

Initially, simple rule models can be written for phrasing, duration and intonation. These can be improved by building statistical models from suitable data. Of course, high quality, natural, controllable prosody is still a research issue, but simple forms should be possible in most languages.

Depending on time and data availability, we can build more complex models:

Duration: fixed, averages, simple rule, transplanted from other languages, or fully trained.
Intonation: fixed, declining, hat patterns on content words, by rule, or fully trained.

To show the degradation, here are such models applied to English sentences:

No prosody, fixed F0 and fixed durations.
Simple declining F0.
"hat" accents on stressed syllables.
"hat" accents on stressed syllables and end tones.
Statistically trained F0 and durations.
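The simpler intonation models in the list above can be illustrated with a small Python sketch that assigns one F0 target per syllable. It is not Festival's implementation: the function names and all the Hz values are assumptions chosen for illustration, and end tones and durations are left out.

```python
def fixed_f0(n_syllables, base=120.0):
    """No prosody: the same F0 target (in Hz) for every syllable."""
    return [base] * n_syllables

def declining_f0(n_syllables, start=130.0, end=100.0):
    """Simple declination: F0 falls linearly from start to end."""
    if n_syllables == 1:
        return [start]
    step = (start - end) / (n_syllables - 1)
    return [start - i * step for i in range(n_syllables)]

def hat_f0(stressed, start=130.0, end=100.0, boost=20.0):
    """Declining baseline with a raised ("hat") target on stressed syllables.

    stressed is a list of booleans, one per syllable. The boost value is
    an assumed constant, not a figure from the text.
    """
    baseline = declining_f0(len(stressed), start, end)
    return [f0 + boost if s else f0 for f0, s in zip(baseline, stressed)]

if __name__ == "__main__":
    print(fixed_f0(4))
    print(declining_f0(4))
    print(hat_f0([False, True, False, True]))
```

A fully trained model would replace these hand-set constants with values predicted from data.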

Building Lexicons and Letter-to-Sound Rules

The task of producing a pronunciation given a word varies in complexity from language to language. In Spanish the task is mostly trivial, but in Japanese, kanji characters often have several readings and choosing between them may require quite high level linguistic and pragmatic information.

In many languages, although there is a historical relationship between the alphabetic written form and the pronunciation, that relationship isn't so obvious. English and German are good examples. To synthesize these languages, a lexicon (a list of words and their pronunciations) is required. But any list of words, no matter how big, will not contain all the words that appear in text, so a method for pronouncing out-of-vocabulary words is necessary. Although such rule systems can be written by hand, it is a slow and skilled process. We have developed a fully trainable method for producing letter-to-sound models from lexicons. We have successfully used it for various dialects of English, French and German. See Black, Lenzo and Pagel 1998 (html or postscript) for technical details, and the Festival manual for instructions.
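The lookup order described above (lexicon first, letter-to-sound fallback for out-of-vocabulary words) can be sketched as follows. This is a toy under stated assumptions: the lexicon entry, the phone symbols and the one-letter-to-one-phone table are hypothetical, and the fallback is a hand-written stand-in, not the trained letter-to-sound models the text refers to.

```python
# Toy lexicon: word -> phone list (the entry is an illustrative assumption).
LEXICON = {
    "festival": ["f", "eh", "s", "t", "ih", "v", "ax", "l"],
}

# Hypothetical fallback table mapping one letter to one phone.
# A trained model would use letter context instead.
LETTER_TO_PHONE = {
    "a": "ae", "b": "b", "d": "d", "e": "eh", "i": "ih",
    "k": "k", "l": "l", "m": "m", "n": "n", "p": "p",
    "s": "s", "t": "t", "v": "v",
}

def pronounce(word):
    """Return a phone list: lexicon entry if present, else letter fallback."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    # Out-of-vocabulary: map each known letter; unknown letters are skipped.
    return [LETTER_TO_PHONE[ch] for ch in key if ch in LETTER_TO_PHONE]

if __name__ == "__main__":
    print(pronounce("Festival"))  # found in the lexicon
    print(pronounce("bat"))       # falls back to the letter table
```

The point of the design is that the fallback guarantees some pronunciation for every input word, while the lexicon overrides it wherever spelling is unreliable.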

Text analysis

Text isn't as easy to say as one might think: numbers, symbols and abbreviations are common and need to be expanded into words if they are to be spoken. In English, for example, numbers are pronounced differently depending on their type. The digits 1998 are pronounced as one thousand (and) nine hundred and ninety-eight if it is a quantity; nineteen ninety-eight if a year; and one nine nine eight if a phone number or part number. Statistical methods can be trained to choose between these. In other languages, number pronunciation may be affected by the gender, case, tense etc. of the item being counted, even though no textual indication is given.
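The three readings of 1998 can be made concrete with a short Python sketch. It assumes the number type is already known (choosing it is the hard, trainable part), handles only integers from 0 to 9999, and uses function names invented for this example; it is not Festival's text analysis code.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def as_quantity(n):
    """Cardinal reading for 0..9999."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(rest))
    return " ".join(parts)

def as_year(n):
    """Year reading as two pairs of digits; a simplification that only
    suits four-digit years whose last two digits are 10 or more."""
    high, low = divmod(n, 100)
    return two_digits(high) + " " + two_digits(low)

def as_digits(text):
    """Digit-by-digit reading, as for phone or part numbers."""
    return " ".join(ONES[int(d)] for d in text)

def expand(text, kind):
    """Expand a digit string given its (already decided) type."""
    if kind == "quantity":
        return as_quantity(int(text))
    if kind == "year":
        return as_year(int(text))
    return as_digits(text)

if __name__ == "__main__":
    for kind in ("quantity", "year", "digits"):
        print(kind, "->", expand("1998", kind))
```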

Festival offers a flexible rule-based and trainable system for building text analysis front ends. A much more detailed discussion of text analysis for synthesis (and language processing in general) was the subject of a project at the Johns Hopkins Summer Workshop 99 and is detailed here


This page is maintained by Alan W Black awb@cs.cmu.edu