
Festival: General examples

Audio files are 16-bit Microsoft .WAV at (mostly) 16 kHz sampling.

Simple text-to-speech

This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.

Festival currently uses a diphone synthesizer; both residual-excited LPC and PSOLA methods are supported. The upper levels, duration and intonation, are generated from statistically trained models built from databases of natural speech. The architecture of the system is designed to be flexible, including various tools that allow new modules to be added easily.
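Festival is controlled through a Scheme command interpreter. A minimal session sketch, using the documented `SayText` and `tts` commands (the filename here is only an example):

```scheme
;; At the festival> prompt, speak a sentence with the current voice:
(SayText "This is a short introduction to the Festival Speech Synthesis System.")

;; Or synthesize an entire text file (nil selects the default text mode):
(tts "intro.text" nil)
```

From the shell, `festival --tts file.txt` reads a file aloud without entering the interpreter.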

Multi-lingual text-to-speech

Festival is a multilingual synthesizer. The default language may be set at start-up time or changed easily during a session.
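Languages can be switched from the Scheme prompt mid-session. A sketch, assuming the relevant language and voice packages are installed (the function names below follow Festival's `language_*` naming convention):

```scheme
;; Select Spanish as the current language (loads its default voice)
(language_spanish)
(SayText "hola")

;; Switch back to English within the same session
(language_english)
(SayText "hello")
```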

This Welsh synthesizer was ported from a previous CSTR Welsh synthesizer.

A Castilian Spanish synthesizer was built from diphones collected during an MSc project.

Two German synthesizers were developed as part of a summer project at the Oregon Graduate Institute.

Statistical text analysis aids speech synthesis

A statistical phrase break prediction system ensures that breaks are distributed evenly and inserted at appropriate points, so that similar contexts for breaks are not confused. A statistical part-of-speech tagger allows Festival to identify the correct pronunciation of homographs. Certain character sequences may be a Roman numeral pronounced as a simple number, as an ordinal, or as a letter sequence; these cases can be differentiated by a statistically trained model that takes the context into account.

TTS modes

Special modes, covering tokenization, lexicon, and prosody, can be built to deal with particular types of text. For example, a list of addresses is better read in an address mode.
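A text mode is selected by passing its name to the `tts` command. A sketch; the `address` mode name and filename here are hypothetical examples of a mode one might define, not modes shipped with Festival:

```scheme
;; Read a file of addresses using a (hypothetical) address-specific text mode
(tts "addresses.txt" 'address)
```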

Unit selection

In order to improve the quality of the waveform itself, we can select sub-word units from a larger corpus rather than simply one example of each diphone. This is an example from a new implementation of Hunt and Black (ICASSP 96).

The first was produced from 460 (TIMIT) phonetically balanced sentences, using only phonetic context and pitch as selection features, with hand-tuned weights. No signal processing to modify pitch and duration was applied to the selected units. The units selected typically contain 2-3 phones. The second example was synthesized using a diphone database from the same speaker. Only the waveform synthesizers differ; that is, they use the same target phones.

A different technique for finding appropriate units is described in Black and Taylor 97 (postscript, html). Here appropriate sub-word units (diphones or demi-phones) are clustered using an acoustic measure.

Note: in both of the above techniques the good examples are good, but the bad examples are much worse than diphones alone. These techniques still need further research before they are stable enough to produce high quality all of the time.

This page is maintained by Alan W Black awb@cs.cmu.edu