A third and more serious limited domain synthesizer we have built using these techniques is for the CMU DARPA Communicator system. The Communicator is a telephone-based, mixed-initiative dialog system for planning trips and flights, and for booking cars and hotels. At first the domain does not appear to be closed, as it includes greetings to registered users by name and allows reference to (at least in principle) any airport in the world.
Since the project began some two years ago, we have kept logs of everything the system has said. To develop our recording corpus, we selected the latest three months of logs and found the most frequent phrases used by the system. Around 100 phrases are what could be termed fixed form, in that they contain no variable parts, such as ``Welcome to the CMU Communicator,'' and ``I'm sorry, I don't understand that.'' We then extracted the set of basic templates used by the language generation system and collected the possible slot values: cities, airports, airlines, and the closed classes of dates, times, prices, etc.
For the obvious closed class slots, namely dates, flight numbers, prices, times, etc., we constructed a small number of fillers which provided word coverage for each class, without having to list the members of each class exhaustively.
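One way to picture choosing a small set of fillers with full word coverage is as a set cover problem. The following is a minimal sketch, assuming a greedy selection over a list of candidate fillers; the function name `greedy_word_cover` and the greedy strategy are illustrative assumptions, not the procedure actually used to design the prompts.

```python
def greedy_word_cover(candidates):
    """Pick a small subset of candidate fillers that covers every word.

    Assumption: fillers are plain strings of whitespace-separated words,
    and a greedy choice is good enough for illustration.
    """
    remaining = {word for c in candidates for word in c.split()}
    chosen = []
    while remaining:
        # Take the filler that covers the most still-uncovered words.
        best = max(candidates, key=lambda c: len(remaining & set(c.split())))
        chosen.append(best)
        remaining -= set(best.split())
    return chosen


# Hypothetical time fillers: four candidates, seven distinct words.
times = ["one thirty p m", "two thirty a m", "one fifteen a m", "two fifteen p m"]
fillers = greedy_word_cover(times)
```

Here three of the four candidate fillers are enough to cover every word in the class, which is the effect described above on a much smaller scale.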
For cities and airports, which are essentially an open class, we used the frequency information in our logs to select which set to include in our recordings. For the more frequently mentioned cities we included more than one occurrence in our prompts (in differing prosodic positions), and for less frequent names we included only one occurrence, in an intended prosodically neutral position. With around 300 cities and airports we could cover all of the cities in the three-month logs. On checking through previous logs, the percentage of out of domain words was very small.
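The frequency-based selection can be sketched as follows. This is an illustration only: the threshold of 10 mentions, the choice of exactly two occurrences for frequent names, and the function names are assumptions, since the text gives no exact values.

```python
from collections import Counter


def plan_city_prompts(mentions, frequent_threshold=10):
    """Map each logged city to the number of prompt occurrences to record.

    Assumptions: frequent cities get two occurrences (to allow differing
    prosodic positions), all others get a single neutral occurrence.
    """
    counts = Counter(mentions)
    return {city: (2 if n >= frequent_threshold else 1)
            for city, n in counts.items()}


def out_of_domain_rate(covered, older_mentions):
    """Fraction of mentions in older logs that are not in the covered set."""
    if not older_mentions:
        return 0.0
    missing = sum(1 for city in older_mentions if city not in covered)
    return missing / len(older_mentions)


# Hypothetical log extracts, not real Communicator data.
recent = ["Boston"] * 12 + ["Pittsburgh"] * 10 + ["Erie"] * 2
plan = plan_city_prompts(recent)
rate = out_of_domain_rate(plan, ["Boston", "Erie", "Altoona", "Boston"])
```

The second function mirrors the check against previous logs: it measures how often an older mention falls outside the recorded set.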
The templates were filled out with actual values, giving rise to around 500 more prompts. These were recorded in the style of a helpful agent and labelled, and a unit selection synthesizer was built. To test the system we synthesized phrases from our existing logs and listened to many examples. This revealed errors in labelling, which were corrected. The most common form of error was a misplaced silence (pause). We had constructed the sentences to use punctuation where a pause is desired, but the utterances generated by the language generation system do not always use punctuation consistently. Also, the speaker did not always insert a pause where the synthesizer expected one. These problems are easily corrected by hand, and we also used automatic techniques to find labelled pauses containing an unusually large amount of power, which tended to be mislabelled sections.
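The automatic check for pauses with unusually high power could look roughly like the sketch below. It assumes labels are available as `(start_seconds, end_seconds, label)` tuples, that pauses carry the label `pau`, and that "unusually large" means more than ten times the median pause power; all three are assumptions made for illustration, not details given in the text.

```python
from statistics import median


def mean_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def suspicious_pauses(samples, sample_rate, segments,
                      pause_label="pau", ratio=10.0):
    """Return (start, end) of pause segments with unusually high power."""
    pauses = []
    for start, end, label in segments:
        if label != pause_label:
            continue
        lo, hi = int(start * sample_rate), int(end * sample_rate)
        pauses.append((start, end, mean_power(samples[lo:hi])))
    if not pauses:
        return []
    # Compare each pause against the typical (median) pause power.
    typical = max(median(p for _, _, p in pauses), 1e-12)
    return [(s, e) for s, e, p in pauses if p > ratio * typical]


# Synthetic signal at 100 samples per second: the labelled pause from
# 3 s to 4 s actually contains loud signal.
quiet, loud = [0.01] * 100, [0.5] * 100
signal = quiet + loud + quiet + loud + quiet
labels = [(0, 1, "pau"), (1, 2, "hh"), (2, 3, "pau"),
          (3, 4, "pau"), (4, 5, "pau")]
flagged = suspicious_pauses(signal, 100, labels)
```

Flagged segments are candidates for hand inspection rather than automatic relabelling.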
Various text processing rules were also included in this voice to deal properly with flight numbers and with homographs such as the ``US'' in ``US Airways''.
Although we had built an initial test voice for Communicator using this technique, we changed many of the basic prompts and styles for a later version, so we built a new voice once we were confident that the system was stable and the code was thoroughly debugged. The final voice was built in under one man-week, with a breakdown of approximately one day to design the prompts, one day to record the prompts and build the basic voice, and the rest of the time for tuning and correction.
After this version was running, we made some changes to the language generation system and decided to add some extra airport names and some more (foreign) city names. We constructed a further 50 utterances, recorded them, and added them to the system in another morning's work. This exercise was important to us because many domains, although limited, do not remain static, so the ability to add new content easily matters.
In an open domain like Communicator we also have to deal with out of vocabulary words. As the unit selection algorithm deliberately fails when an unknown word is present, we must provide a backup. We initially intended to use a diphone synthesizer for the out of vocabulary word alone. However, it is very obvious when listening to such examples that the switch in voice quality midway through a sentence is extremely distracting, even though the diphone synthesizer is based on the same voice as our limited domain voice, especially as the unknown word is typically an important content word such as a place name. Thus if a phrase contains an out of vocabulary word we back off to the diphone synthesizer for the whole phrase, which, although not ideal, is much more understandable.
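The whole-phrase back-off decision can be sketched as below. The tokenizer, the vocabulary representation, and the synthesizer names `limited_domain` and `diphone` are hypothetical stand-ins for illustration, not the system's actual interfaces.

```python
import re


def tokenize(phrase):
    """Lowercase and split into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", phrase.lower())


def choose_synthesizer(phrase, vocabulary):
    """Pick one synthesizer for the whole phrase.

    If any word is outside the limited domain vocabulary, the entire
    phrase goes to the diphone back-off, so the voice never switches
    partway through a sentence.
    """
    unknown = [w for w in tokenize(phrase) if w not in vocabulary]
    if unknown:
        return "diphone", unknown
    return "limited_domain", []


# Hypothetical vocabulary taken from the recorded prompts.
vocab = {"a", "flight", "to", "boston"}
in_domain = choose_synthesizer("A flight to Boston.", vocab)
backed_off = choose_synthesizer("A flight to Ouagadougou.", vocab)
```

Returning the list of unknown words alongside the decision also makes it easy to log which names should be added to a later recording session.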
We have also considered backing off to a more general unit selection synthesizer for the unknown word, as this would, perhaps, better preserve voice quality. However, although the quality of this is sometimes good, it can also be very bad, and we as yet have no automatic way of distinguishing the two. It is this wide variation in quality in unit selection that limited domain synthesis addresses, hence using a diphone synthesizer is currently the best solution for us.
During recent evaluations of the whole dialog system by external parties, we logged the number of utterances synthesized and how many of them contained out of vocabulary words and hence required the backup diphone synthesizer. Over a three-week period 18,276 phrases were synthesized. Of these, 459 (2.5%) contained out of vocabulary words (71 distinct words). These were all less frequent (or forgotten) place names.
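The reported back-off rate follows directly from the logged counts. The short sketch below reproduces the arithmetic; the function name and the derived ratio of affected phrases per distinct word are illustrative additions, not figures reported in the evaluation.

```python
def oov_summary(total_phrases, oov_phrases, distinct_oov_words):
    """Summarize back-off usage from logged counts."""
    return {
        # Share of phrases that needed the diphone back-off, in percent.
        "backoff_percent": round(100.0 * oov_phrases / total_phrases, 1),
        # Rough ratio of affected phrases to distinct unknown words.
        "phrases_per_word": round(oov_phrases / distinct_oov_words, 2),
    }


summary = oov_summary(18276, 459, 71)
```

With the counts above, the back-off share rounds to the 2.5% quoted in the text.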
It is important to note that, although Communicator was not designed as a system that would have a limited output vocabulary, using these limited domain synthesis techniques we have more than adequately given it a more interesting and higher quality voice than a conventional TTS system.