Next: Cluster unit selection Up: Optimal Data Selection for Previous: Background

Initial attempts

For limited domain voices, the process of designing what to record is easier, even if there is not an automated process. To a first approximation, the objective is to include in the database at least one occurrence of each word in the domain in each desired prosodic context. Hand design can often be adequate for such domains.

Our initial attempt to do this automatically for new domains was to take a large sample of the intended output and greedily select sentences that maximized word coverage. To some extent this works, but it does not take into account the phonetic coverage of the database, so word joins may be poor. Alternatively, we can take each sentence, generate the diphones required to synthesize it, and then greedily select the sentences that give the best diphone coverage. We also investigated selecting for demisyllables rather than diphones to see what sort of coverage was necessary.
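The greedy selection described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names, the punctuation handling in `word_units`, and the silence symbol `pau` used to pad utterance edges are all assumptions, and the phone sequences passed to `diphone_units` are assumed to come from a synthesizer front end.

```python
from typing import Callable, List, Sequence, Set, TypeVar

T = TypeVar("T")


def word_units(sentence: str) -> Set[str]:
    """Units for word coverage: lower-cased tokens with edge punctuation removed."""
    return {w.strip(".,;:!?\"'()").lower() for w in sentence.split()} - {""}


def diphone_units(phones: List[str]) -> Set[str]:
    """Units for diphone coverage: adjacent phone pairs, padded with silence.

    The silence symbol "pau" is an assumption made for this sketch.
    """
    padded = ["pau"] + phones + ["pau"]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def greedy_select(items: Sequence[T], units_of: Callable[[T], Set[str]]) -> List[int]:
    """Return indices of items, in pick order, that together cover every unit.

    At each step the item adding the most not-yet-covered units is chosen;
    ties go to the earliest item.
    """
    unit_sets = [units_of(item) for item in items]
    uncovered = set().union(*unit_sets)
    chosen: List[int] = []
    while uncovered:
        best = max(range(len(items)), key=lambda i: len(unit_sets[i] & uncovered))
        chosen.append(best)
        uncovered -= unit_sets[best]
    return chosen


if __name__ == "__main__":
    sentences = ["the cat sat", "the dog sat", "a cat", "the cat sat down"]
    print(greedy_select(sentences, word_units))
```

Swapping `word_units` for `diphone_units` (or a demisyllable extractor) changes the selection criterion without changing the selection loop. Greedy selection of this kind is a heuristic for set cover and is not guaranteed to find the smallest covering set.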

As a test we took the text of Lewis Carroll's "Alice's Adventures in Wonderland" [5] to see how many sentences are sufficient under each of these criteria. The Gutenberg version of Alice (alice29.txt) contains 26,457 words, from which Festival finds 1,918 utterances. The following table shows the number of utterances needed to cover each criterion, as selected by a simple greedy algorithm.

Optimize for     Utterances   % of total
Words                   979        51%
Diphones                196        10.2%
Demisyllables           312        16.2%

There are, of course, many ways to describe and define demisyllables. Here we take the onset to be the initial consonantal gestures (if any) together with an initial portion of the vocalic nucleus, and the coda to be the remainder of the vocalic portion through the final consonantal gestures (if present). Syllable affixes were not treated distinctly from the coda. Thus, the units can be written as onset cluster - vowel and vowel - coda cluster, respectively. The demisyllable inventory and feature set is based upon [8], with some simplifications.
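As a rough illustration of the onset cluster - vowel and vowel - coda cluster notation, the sketch below splits a single syllable, given as a list of phone symbols, into its two demisyllable units. The vowel set and phone names are an illustrative subset chosen for this example, not the inventory of [8], and the sketch assumes exactly one vocalic nucleus per syllable.

```python
from typing import FrozenSet, List, Tuple

# Illustrative subset of vowel symbols; an assumption, not the paper's phone set.
VOWELS: FrozenSet[str] = frozenset(
    {"aa", "ae", "ah", "ao", "eh", "er", "ih", "iy", "uh", "uw"}
)


def split_demisyllables(
    syllable: List[str], vowels: FrozenSet[str] = VOWELS
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Split one syllable into (onset cluster + vowel, vowel + coda cluster).

    The vowel appears in both halves, since each demisyllable includes
    part of the vocalic nucleus.
    """
    nucleus = [i for i, phone in enumerate(syllable) if phone in vowels]
    if len(nucleus) != 1:
        raise ValueError("expected exactly one vowel in the syllable")
    i = nucleus[0]
    return tuple(syllable[: i + 1]), tuple(syllable[i:])


if __name__ == "__main__":
    # "strength" as a single syllable, in the illustrative phone symbols above
    print(split_demisyllables(["s", "t", "r", "eh", "ng", "th"]))
```

A syllable with no onset or no coda simply yields a unit containing the vowel alone on that side.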

We can easily add other factors to the units we are trying to optimize for, such as lexical stress, pitch and/or metrical accents, position in phrase, etc. As pointed out in [13], covering all possible features in all contexts is prohibitive: each added feature multiplies the amount of data required to systematically cover the space.
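The multiplicative growth can be made concrete with a small calculation. All numbers below are hypothetical, chosen only to show the arithmetic; they are not measurements from any database, and the calculation is an upper bound that assumes every feature combination can occur with every unit.

```python
from math import prod
from typing import Dict


def context_space_size(base_units: int, feature_values: Dict[str, int]) -> int:
    """Upper bound on distinct unit types when every feature value is crossed
    with every base unit."""
    return base_units * prod(feature_values.values())


if __name__ == "__main__":
    # Hypothetical figures: 1000 base units and three added features.
    features = {"lexical stress": 2, "pitch accent": 2, "phrase position": 3}
    print(context_space_size(1000, features))  # 1000 * 2 * 2 * 3 = 12000
```

Even three small features turn the hypothetical 1000 units into 12000 types, each of which would need at least one recorded example for systematic coverage.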

Another way to find the right data to record is to define the space and then explicitly create a comprehensive database by design. Simple diphone databases are a prototypical example of this: we define the phone set, determine which diphones can appear in the language, and then carefully design a database to ensure it has one example of each of the token types defined (e.g. [11]). This approach seems feasible for smaller inventories, but as the combination of features grows we have to make more and more decisions about pruning that space, since collecting everything would be a monumental task, let alone the post-processing steps necessary.

In these two methodologies - define features and greedily select from data, and define features and expertly design the data - two distinct aspects are missing.

  1. The frequency of the units is ignored. Some unit types are much more frequent than others; this fact could help define which corners could be cut without degrading the synthesis except in rarer cases.
  2. Acoustic distinctions, or their lack, are ignored. When multiplying out the various units, some will not be phonetically realized as distinct, e.g. some vowels preceding /t/ may be indistinguishable from those preceding /d/.

What we want to do is find exactly which units are acoustically distinct, and take into account the distribution of units, in order to design a corpus for collection that will exactly cover all the variations with minimal redundancy.
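The first point, about unit frequency, can be illustrated with a short sketch that counts how many of the most frequent unit types are needed to account for a given share of all unit tokens in a text. This is only an illustration of the frequency argument, not the selection method itself; the function name and the toy data are assumptions.

```python
from collections import Counter
from typing import Hashable, Iterable


def types_needed(unit_tokens: Iterable[Hashable], target: float) -> int:
    """Number of most-frequent unit types whose tokens make up at least
    `target` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(unit_tokens)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(counts)


if __name__ == "__main__":
    # Toy data: three unit types with very uneven frequencies.
    tokens = ["a"] * 6 + ["b"] * 3 + ["c"]
    print(types_needed(tokens, 0.8))
```

When the distribution is skewed, a small number of types covers most tokens, which is what makes it possible to cut corners on rare types.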

Alan W Black 2001-08-26