The task of building a voice consists of the following processes:
- Design the corpus
- Synthesize each utterance
- Record the voice talent
- Annotate (label) the recordings
- Extract pitchmarks
- Extract pitch-synchronous parameters
- Build a cluster unit selection synthesizer
- Test and tune, repeating as necessary
Issues in designing prompts are discussed above. We synthesize the prompts for a number of reasons. First, to ensure that all the appropriate tokens are expanded properly: for example, in our Communicator dialog domain, we must ensure that flight numbers and dates (both strings of numeric characters) are given the correct expansion. Second, we use the synthesized utterances to estimate the time required for recording. We can, optionally, play the prompt to the human voice talent, but that often has the adverse effect of making the human speak more like the synthesizer, so we generally present only the text. The final reason to synthesize the output is that we use the synthesized prompt in labeling the human spoken utterance.
Although recording with studio-quality equipment can give better results, we are interested in making the process as accessible as possible. When studios are used to record to DAT tape, the transfer process and splitting of the files are laborious and time-consuming. For all of the limited domain systems we have built, we have recorded directly to computer files. Most commonly we use a laptop (not connected to mains power) in a quiet room (i.e. without other computers or air-conditioning) to reduce background noise. Once the audio devices are set up appropriately, the recording quality is acceptable, though taking care at this point is important. More information on the recording process is given in , including the use of a GUI tool for recording session management (pointyclicky).
The prompts are recorded in the desired style for the synthesizer. A talking clock, consisting of 24 simple utterances, is one of our standard baseline examples. Building clocks with ``funny voices'' is easy, but importantly the resulting synthesizer retains the style of the speaker exactly: Scottish accents, falsetto, ``laid-back'' speakers, and even cartoonish voices are all captured well.
After recording, we label the text using a simple but effective technique based on : we use DTW to align the mel-scale cepstral coefficients (and deltas) of the synthesized and recorded waveforms. As we know the positions of the labels in the synthesized prompt, we can map them onto the collected recording. This technique was originally developed for labeling diphone data, where the phonetics are much more clearly defined, but we have found it perfectly adequate for this task as well. In fact, this often loose labeling has distinct advantages over hand-crafted low-level phonetic labeling. For example, when a speaker with a Scottish accent pronounces the word ``Tuesday'', it might better be phonetically labeled as /ch y uw z d ey/, while the synthesizer labels (US English) are given as /t uw z d ey/. The alignment will match the label /t/ to the spoken /ch y/, and hence when a /t/ followed by /uw/ is selected for synthesis, the appropriate piece of speech is selected, preserving the original speaker's idiolect. The speaker must produce utterances that are close to the desired form, but they do not need to be phonetically exact.
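The alignment step can be sketched as follows. This is an illustrative Python sketch, not the FestVox implementation; it assumes the MFCC matrices for the synthesized and recorded utterances have already been computed, and all function names are invented for the example:

```python
import numpy as np

def dtw_path(A, B):
    """Dynamic time warping between two MFCC sequences.
    A: (n, d) synthesized frames; B: (m, d) recorded frames.
    Returns the warping path as a list of (syn_frame, rec_frame) pairs."""
    n, m = len(A), len(B)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])  # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
    return path[::-1]

def map_boundaries(boundaries, path):
    """Map label boundaries (frame indices in the synthesized prompt)
    onto frames of the recorded utterance via the warping path."""
    syn_to_rec = {}
    for si, rj in path:
        syn_to_rec[si] = rj  # keep the last recorded frame aligned to each syn frame
    return [syn_to_rec[b] for b in boundaries]
```

Because the known label positions in the synthesized prompt are simply pushed through the warping path, no phonetic decisions are made on the recorded speech itself, which is what makes the technique tolerant of accent differences like the /t/ vs /ch y/ example above.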
Although the labeling is often good, it is never perfect. Hand correction will improve it, though with diminishing returns. After labeling, we extract mel-scale cepstral coefficients; we have found that our unit selection techniques work much better if this is done pitch-synchronously rather than at a fixed frame rate. As we do not (normally) record these databases with an EGG (electro-glottograph) signal, we extract the pitchmarks directly from the waveform, although this is less accurate than extracting them from an EGG signal.
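A crude illustration of waveform-based pitchmark extraction is given below. This is a simplified stand-in for the real tools (which also handle filtering and voicing decisions); the function names and the peak-picking scheme are invented for the example, assuming a mono signal array:

```python
import numpy as np

def estimate_period(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate the pitch period (in samples) of a voiced frame by
    autocorrelation, searching lags within the plausible F0 range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    return lo + int(np.argmax(ac[lo:hi]))

def pitch_marks(signal, period):
    """Place one mark per period at a local waveform peak, a crude
    stand-in for EGG-derived glottal closure instants."""
    marks, pos = [], 0
    while pos + period <= len(signal):
        # strongest peak within one period of the current position
        peak = pos + int(np.argmax(signal[pos:pos + period]))
        marks.append(peak)
        pos = peak + period // 2  # skip ahead to avoid re-finding the same peak
    return marks
```

The marks then define the analysis instants at which the pitch-synchronous cepstral parameters are extracted.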
The unit selection technique we use is an updated version of that more fully described in . However, the algorithm has seen a number of substantive improvements since we last published it, as well as some specific tuning we have found useful for limited domain synthesis.
The general algorithm takes all units of the same type and calculates an acoustic distance between each pair, using a weighted Mahalanobis distance over cepstrum parameters plus F0. Selected features, including phonetic and prosodic context, are used to build a decision tree that minimizes acoustic distance within each partition. Although  makes similar use of decision trees for clustering, we do not first label with HMMs, nor do we use sub-phonetic acoustic states; nor do we build the tree to its maximum depth, but (optionally) stop with 5 to 10 instances in each cluster.
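The clustering criterion can be sketched as below. This is a heavily simplified stand-in, not the Festival code: each unit is reduced to a single cepstral-mean-plus-F0 vector, the Mahalanobis weighting to a diagonal inverse-standard-deviation vector, and only a single split of the tree is shown; all names are illustrative:

```python
import numpy as np

def unit_distance(u, v, inv_std, f0_weight=1.0):
    """Diagonal-covariance (weighted) Mahalanobis distance between two
    units, each a mean cepstral vector with F0 appended as the last dim."""
    w = inv_std.copy()
    w[-1] *= f0_weight  # F0 carries its own weight
    return float(np.sqrt(np.sum(((u - v) * w) ** 2)))

def impurity(units, inv_std):
    """Summed pairwise acoustic distance within a cluster."""
    return sum(unit_distance(units[i], units[j], inv_std)
               for i in range(len(units)) for j in range(i + 1, len(units)))

def best_split(units, features, inv_std):
    """Pick the yes/no feature question (phonetic/prosodic context)
    that minimizes the summed within-partition acoustic distance.
    features: one dict of booleans per unit.  Returns None if no
    question improves on leaving the cluster unsplit."""
    best = (None, impurity(units, inv_std))
    for k in features[0].keys():
        yes = [u for u, f in zip(units, features) if f[k]]
        no = [u for u, f in zip(units, features) if not f[k]]
        if not yes or not no:
            continue
        score = impurity(yes, inv_std) + impurity(no, inv_std)
        if score < best[1]:
            best = (k, score)
    return best[0]
```

Applying `best_split` recursively, and stopping once a partition holds only a handful of instances, yields the kind of shallow cluster tree described above.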
At synthesis time, we select the appropriate cluster using the decision tree, and then find the best path through the candidates, taking into account the costs (and optimal positions) of joins using another acoustically based cost (cf. optimal coupling ).
For limited domain synthesis, we have determined that certain parameter settings are more likely to give reliable synthesis. First, in addition to taking candidates from the selected cluster, we also include any units that immediately follow, in the database, units selected for the previous segment, provided they are of the right type. Thus we are not just selecting among candidate units; we are effectively selecting the beginnings of longer units.
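The search described above can be sketched as a standard Viterbi pass in which a candidate that directly continues the previously chosen unit joins for free, so contiguous stretches of the database are favoured. This is illustrative code with invented names, units represented as integer database indices, and the target and join costs passed in as black boxes:

```python
def select_units(clusters, target_cost, join_cost, successor):
    """Viterbi search over candidate clusters.
    clusters[i]: candidate units for segment i.
    successor(u): the unit following u in the recorded database (or None).
    A candidate that is the successor of the previous segment's unit
    incurs zero join cost, favouring longer contiguous units."""
    # each layer maps a candidate to (best cumulative cost, backpointer)
    layers = [{u: (target_cost(u), None) for u in clusters[0]}]
    for cands in clusters[1:]:
        prev = layers[-1]
        cur = {}
        for u in cands:
            cur[u] = min(
                (c + target_cost(u) +
                 (0.0 if successor(p) == u else join_cost(p, u)), p)
                for p, (c, _) in prev.items())
        layers.append(cur)
    # backtrack the cheapest path through the lattice
    u = min(layers[-1], key=lambda k: layers[-1][k][0])
    path = [u]
    for i in range(len(layers) - 1, 0, -1):
        path.append(layers[i][path[-1]][1])
    return path[::-1]
```

With a contiguous run available in the database, the zero join cost makes the search snap to it, which is exactly the ``beginning of a longer unit'' effect described above.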
Normally for general unit selection we have used the phone name as the unit ``type name'', though the acoustic distance may also include X% of the previous phone, so these units are much closer to diphones than to phones. In the limited domain synthesizers, we construct the type from the phone plus the word the phone comes from. Thus a /d/ in the word ``limited'' is distinct from a /d/ in the word ``domain''. This apparently severe restriction may give rise to the claim that we are doing ``merely'' word concatenation, but this is not true. We are still selecting individual phones, though they come from some instance of the word to be synthesized. In fact, a word is often synthesized from phones drawn from different instances of the desired word, with the join point between the parts chosen dynamically at the best position, typically in mid-vowel, in a fricative, or in the silence of a stop.
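The phone-plus-word type naming amounts to the following (hypothetical helper names; the label tuple format is invented for the example):

```python
from collections import defaultdict

def unit_type(phone, word):
    """Limited-domain unit type: the phone qualified by its source word."""
    return f"{phone}_{word.lower()}"

def index_units(labels):
    """labels: (phone, word, start_time, end_time) tuples for the database.
    Groups unit occurrences under their phone-plus-word type name, so the
    cluster tree for /d/ in ``limited'' never mixes with /d/ in ``domain''."""
    index = defaultdict(list)
    for phone, word, start, end in labels:
        index[unit_type(phone, word)].append((start, end))
    return index
```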
This choice of unit type means there are now far fewer instances of each type, which has the distinct advantage of much faster synthesis: this was the initial motivation for the restriction. However, we have also found that words not in the original vocabulary are often poorly synthesized. Therefore, at present, we see this as a good cut-off point at which we can guarantee high-quality synthesis. Although this restriction may be disappointing to some, what we are presenting is limited domain synthesis, and we find the restriction acceptable for many applications; work continues on methods of backing off acceptably.
We now have the selection system working in slightly less time than it takes to do standard diphone synthesis. Although the unit selection process is computationally more expensive than diphone selection, in the unit selection case we do not (usually) do prosodic modification, though we do apply pitch-synchronous smoothing for some databases. The unit selection database is substantially larger than a diphone database. We have not yet experimented with data compression algorithms, but as the quality of unit selection synthesis depends on the variety of units available, all but the smallest limited domain synthesizers will always require more space than diphone synthesizers.