
All of the people some of the time

For any particular application of speech synthesis, the output it has to generate is not everything that could possibly be said. Although we try to build synthesizers that are general enough to be good at everything, they are typically not tuned for particular applications.

[7] takes an extreme view of how to get good synthesis all of the time by restricting what the synthesizer can say. Thus a simple talking clock can easily be built that sounds better than a general speech synthesizer, though of course it can only tell the time and nothing else.

This is cheating, though it takes advantage of what unit selection synthesis does best. By designing the data explicitly to cover the expected output, one can achieve near-perfect synthesis for that domain. This is sufficient for many applications.

We have built a number of voices specifically designed for applications. Apart from trivial talking clocks, weather information is a useful but constrained domain. Note that construction is easiest, and results are best, when the generation part of the system is developed in conjunction with the synthesizer itself.

For example, in [7] we report on a simple weather system for any US city, based on live web data, giving time, temperature, outlook and wind direction. A total of 100 utterances were recorded, each following the basic templated form of the intended synthetic utterances. The quality is excellent, though of course it can only say the weather.
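
A templated domain of this kind can be sketched in a few lines of Python. This is a minimal illustration only: the template wording, the slot values and the function names `weather_utterance` and `recording_prompts` are assumptions made for the example, not the prompt set or tooling used in [7]. The sketch fills a fixed template from slot values and derives a small recording script in which every slot value occurs at least once inside the template.

```python
from itertools import cycle, islice

# Hypothetical template and slot values, chosen only for illustration.
TEMPLATE = ("The weather in {city} at {time}: {temperature} degrees, "
            "{outlook}, with winds from the {wind}.")

SLOT_VALUES = {
    "city": ["Pittsburgh", "Boston", "Denver"],
    "time": ["9 am", "3 pm"],
    "temperature": ["32", "54", "71", "90"],
    "outlook": ["sunny", "partly cloudy", "rain"],
    "wind": ["north", "south west"],
}


def weather_utterance(**slots):
    """Fill the fixed template with the values for one report."""
    return TEMPLATE.format(**slots)


def recording_prompts(slot_values):
    """Build a recording script in which every slot value appears at least once.

    Shorter value lists are cycled so that the script has as many prompts
    as the longest list has values.
    """
    n = max(len(values) for values in slot_values.values())
    columns = {slot: list(islice(cycle(values), n))
               for slot, values in slot_values.items()}
    return [TEMPLATE.format(**{slot: col[i] for slot, col in columns.items()})
            for i in range(n)]


if __name__ == "__main__":
    for prompt in recording_prompts(SLOT_VALUES):
        print(prompt)
```

A real recording script would probably repeat values in more than one context; the point here is only that the prompts to record and the utterances to synthesize share the same template.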

With the CMU DARPA Communicator system, a telephone-based flight information spoken dialog system [15], a much more general spoken output structure was required. We first analysed what the system had said (using a previous general TTS synthesizer) and built a set of prompts that covered that space. The resulting synthesizer says in-domain text very well, as it is designed to, though it is sometimes required to deliver out-of-domain text, e.g. when a new airport is referred to or some change is made to the language generation systems.
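
One way to picture the step from logged system output to a covering prompt set is a greedy selection over the logged sentences. The sketch below is an assumption about how such a selection could be done, not a description of the procedure used for the Communicator voice: it uses word bigrams as a crude stand-in for synthesis units, and the names `units`, `select_prompts` and `coverage` are hypothetical. The same coverage measure gives a rough signal for when new text falls outside the domain.

```python
def units(sentence):
    """Word bigrams with boundary markers, a crude stand-in for synthesis units."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return set(zip(words, words[1:]))


def select_prompts(logged):
    """Greedy selection: repeatedly pick the sentence adding the most unseen units."""
    candidates = list(dict.fromkeys(logged))  # drop duplicates, keep order
    covered, chosen = set(), []
    while True:
        best = max(candidates, key=lambda s: len(units(s) - covered), default=None)
        if best is None or not (units(best) - covered):
            break  # nothing left that adds new units
        chosen.append(best)
        covered |= units(best)
    return chosen, covered


def coverage(sentence, covered):
    """Fraction of a sentence's units already present in the prompt set."""
    u = units(sentence)
    return len(u & covered) / len(u)


if __name__ == "__main__":
    log = [
        "your flight leaves boston at nine",
        "your flight leaves denver at nine",
        "your flight leaves boston at nine",
        "your flight arrives in denver at ten",
    ]
    prompts, covered = select_prompts(log)
    print(prompts)
    # A sentence mentioning an unseen airport scores below 1.0.
    print(coverage("your flight leaves tokyo at nine", covered))
```

A coverage value below 1.0 marks text that the domain database does not fully support, such as a sentence naming a new airport.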

It is clear, through simple blind listening tests, that domain synthesizers can sound much better than general synthesizers. Knowing the desired style and context of the voice allows much more appropriate delivery.

Examples like weather are the extreme cases where the domain can be fully defined and a reasonable set of prompts can be explicitly designed to cover the space. In more general cases there is still a well-defined core of expected output. Thus the database can be designed as a mixture of domain-specific prompts and general prompts.

In fact we have defined this relationship in more detail [16]. One route is to construct different voices for different tasks (e.g. weather, stocks, email reading) and make explicit changes in voice when changing domains. We call this tiering. The second route is to combine domain-related prompts into a single voice, typically with a significant number of prompts to support general synthesis. This we call blending.
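
The difference between tiering and blending can be shown as two ways of organizing the same prompt sets. This is a schematic sketch only: the `Voice` class, the function names and the general fallback voice in the tiered case are assumptions made for the example and are not taken from [16].

```python
from dataclasses import dataclass, field


@dataclass
class Voice:
    """A voice is represented here only by its name and its recorded prompts."""
    name: str
    prompts: list = field(default_factory=list)


def tiered(domain_prompts, general_prompts):
    """Tiering: one voice per domain, here with a general voice as fallback."""
    voices = {domain: Voice(domain + "_voice", list(prompts))
              for domain, prompts in domain_prompts.items()}
    voices["general"] = Voice("general_voice", list(general_prompts))
    return voices


def pick_voice(voices, domain):
    """Explicit voice change when the domain changes."""
    return voices.get(domain, voices["general"])


def blended(domain_prompts, general_prompts):
    """Blending: all domain prompts plus general prompts in a single voice."""
    merged = [p for prompts in domain_prompts.values() for p in prompts]
    merged.extend(general_prompts)
    return Voice("blended_voice", merged)


if __name__ == "__main__":
    domains = {"weather": ["it is sunny in boston"],
               "stocks": ["the index closed higher"]}
    general = ["the quick brown fox"]
    voices = tiered(domains, general)
    print(pick_voice(voices, "weather").name)
    print(pick_voice(voices, "email").name)
    print(len(blended(domains, general).prompts))
```

In the tiered arrangement the caller must know the domain in order to choose a voice; in the blended arrangement there is nothing to choose, which is why no switch between voice types is needed.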

Blending potentially allows a smaller footprint and less firm boundaries between the domains, so switching between voice types is not required. However, blended voices are harder to get right, while small, well-defined tiered voices are probably the easiest in which to guarantee high quality all of the time.

However, it should be noted that this is only really a solution if the amount of work needed to design and build a domain-directed synthesizer is sufficiently less than that needed to build a general voice.

Alan W Black 2002-09-30