Unit selection synthesis, in which appropriate sub-word units are selected from databases of natural speech, seems to hold the promise of high-quality, natural-sounding speech synthesis. However, the quality of such systems is inherently tied to the quality and appropriateness of the database from which the units are selected.
In the extreme case, it has been shown that if the database is deliberately and closely tailored to the intended application, high-quality synthesis can be produced robustly and reliably. In many applications the domain can be adequately specified, such as time, weather, and even apparently open dialog systems like the CMU DARPA Communicator flight information system. But not all applications are of this sort, and a more general solution is desired. The work presented here addresses exactly that problem: how can we find the best set of utterances to record in order to cover the domain in which we intend our synthesizer to perform well? The techniques presented here are suitable for domains ranging from relatively limited ones to completely open ones, such as reading arbitrary stories or names and addresses.
When a unit selection synthesizer produces high-quality output, it is bringing forward the implicit style within the unit database, as well as other factors in the system. In this paper we do not address the harder problems of appropriately modifying the units we select to allow for a richer output. Unlike our earlier work on limited domain synthesis, however, we are building synthesizers that can say anything, though their style will still be most appropriate for the domain for which they were designed. That is, a synthesizer based on a database that covers names and addresses well will be able to read arbitrary new stories, but will sound odd, slow, and hyperarticulated. Likewise, a unit selection synthesizer built using newswire text will make every turn in a dialog system sound like a CNN report.