The speech synthesis component was built by Cepstral, LLC, using techniques that allow high-quality unit selection synthesis in the small footprint demanded by the intended platform.

The English male voice (a female voice is also available) offers clear speech in a command-like style. This voice had been created for previous projects and was specifically designed for delivering short dialog utterances such as those needed in this application.

The voice is a general speech synthesizer that can say anything and is not limited to a particular domain. At 11 kHz, a suitable sample rate for the PDA hardware, the voice plus the language front end (including the lexicon) is about 9 megabytes.

The Arabic synthesizer was built specially for this project. An initial test voice was built in the Festival Speech Synthesis System [9]. This allowed a certain amount of tuning before a small footprint delivery was attempted.

We used the romanization decided on for the recognizer and translation engine, as predicting vowels in Arabic script is a non-trivial problem. Starting from the list of translations generated for the translation part of the system, we used the method described in [10] to select an optimal subset of these utterances that best covers the acoustic-phonetic space. Thus from a list of around 7500 sentences we selected 666 sentences, to which we added 102 hand-constructed sentences covering numbers and 52 general greetings (for both male and female speakers).
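The selection step can be illustrated with a generic greedy coverage sketch. This is not the method of [10], whose details are not given here. It assumes that coverage is measured over adjacent phone pairs (diphones), and that each utterance is already available as a list of romanized phone symbols. The function names and the toy utterances are hypothetical.

```python
def units(phones):
    """Return the set of adjacent phone pairs (diphones) in one utterance."""
    return set(zip(phones, phones[1:]))


def greedy_select(utterances, max_count):
    """Greedily pick utterances that add the most not-yet-covered diphones.

    utterances: dict mapping an utterance id to its list of phone symbols.
    Returns (chosen ids in selection order, set of covered diphones).
    """
    covered = set()
    chosen = []
    remaining = dict(utterances)
    while remaining and len(chosen) < max_count:
        # Pick the utterance contributing the most new diphones.
        best = max(remaining, key=lambda uid: len(units(remaining[uid]) - covered))
        gain = units(remaining[best]) - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy, made-up phone strings (assumption: purely illustrative data).
toy = {
    "u1": ["m", "a", "r", "h", "a", "b", "a"],
    "u2": ["m", "a", "r"],
    "u3": ["s", "a", "l", "a", "m"],
}
chosen, covered = greedy_select(toy, max_count=3)
print(chosen, len(covered))
```

In this toy run, "u2" is never chosen because every diphone it contains is already covered by "u1". A real selection would likely also weight units by frequency and account for prosodic context.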

One notable aspect of the building of an Arabic voice for this system was that we found our native speaker slightly reluctant to have their voice used in a device that could later be used by the military for unspecified use, potentially in their own country. With the improvement in speech synthesis to the level where the output voice is recognizable as the particular person who recorded the database, we must be sensitive to the uses of the system we build. Although we are very careful to explain to all our voice talent what the consequences of recording a synthesis voice are, people may not be fully aware until they see the complete system. Because of this, we used a different speaker for the final recordings.

Evaluation of speech synthesis is always hard, but there are simple diagnostic tests that can be run to identify problems in the synthesizer. In this case we carried out three specific tests. Based on the Diagnostic Rhyme Test (DRT) [11] and the Modified Rhyme Test (MRT), we constructed simple monosyllabic words that differed in one phonetic feature. For English, this test typically covers features such as voicing, nasality and sustention. We modified this list for Arabic and added emphaticness to the set of features. A second-level test involved sentences that were not part of the recorded database but were still considered "in-domain". The DRT/MRT words and the in-domain sentences were then played to native Arabic speakers, who were asked to mark any words that "sounded bad" for any reason, a deliberately vague term. The results below show the percentage of "good" words in the synthesized utterances:

Test             DRT    MRT    Sentence
Good words (%)   78.3   72.0   84.7

These numbers are comparable to those for English voices of similar size and degree of development.
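A percentage of "good" words of the kind reported above could be computed as in the following minimal sketch. It assumes that judgments are pooled over all listeners and utterances, which the text does not state, and the function name and sample data are hypothetical rather than taken from the evaluation.

```python
def percent_good(judgments):
    """Percentage of words NOT marked as sounding bad.

    judgments: list of (word_count, bad_word_indices) pairs, one pair per
    listener per utterance (assumed pooling scheme).
    """
    total = sum(count for count, _ in judgments)
    if total == 0:
        raise ValueError("no words were judged")
    # Count each marked position once per judgment.
    bad = sum(len(set(marked)) for _, marked in judgments)
    return 100.0 * (total - bad) / total


# Made-up judgments: 20 words heard in total, 3 marked as bad.
sample = [(5, [1]), (5, []), (10, [0, 3])]
score = percent_good(sample)
print(f"{score:.1f}")
```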

Alan W Black 2003-10-27