It is clear that, although we have had some success, a number of other related experiments must be tried, and we are in the process of carrying them out.
First is a proper investigation into how the initial cluster tree affects the selection, with respect both to the distribution of the original data and to the speaker. It is clear that each speaker will differentiate units acoustically in a different way, as physiologies and speaking strategies differ. Thus, using speaker-specific acoustic model cluster trees and selecting data appropriate to them may give better results.
We have built a number of other voices from the 367-utterance set and, although they are surprisingly good, none is as good as the KAL voice. This may, of course, be due to other factors, such as the speaker's style and experience in delivering speech to be made into a synthesizer.
We have made initial studies using a tree built for AWB (a Scottish English speaker) from a 452-utterance (US) TIMIT database. When this tree is used against the same 34,796 utterances, a different set of utterances is selected. Although the first utterance is the same as in the KAL selection list, the second and subsequent utterances differ. However, we have not yet had time to compare the quality of a voice built for AWB from that selection with one built for AWB from the KAL-based selection.
An important use of this work is to find, given a finite known corpus, the best subset to record in order to synthesize the whole corpus reasonably. For example, which utterances should be recorded to build a voice capable of synthesizing a particular text in a high-quality, characteristic voice? Initial experiments on selection for Alice suggest that, at the quality we are producing here, we would need around 257 utterances, about 13.3% of the total, and that the growth is sub-linear in the size of the text. Such compression, while maintaining character and quality, can be very useful for audio books, manuals, and other materials.
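The subset-selection problem described above can be viewed as a form of greedy set cover: repeatedly choose the utterance that adds the most not-yet-covered unit types of the target text. The sketch below is only illustrative, not the selection procedure used in this work; letter bigrams stand in for the acoustic unit types, and the corpus and function names are hypothetical.

```python
def unit_types(utterance):
    """Return the set of unit types in an utterance.

    Letter bigrams serve here as a toy stand-in for the phone/diphone
    types that a real unit-selection inventory would track.
    """
    s = utterance.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(corpus, target_text):
    """Greedily pick utterances covering the target text's unit types.

    At each step, choose the utterance whose unit types cover the most
    still-uncovered types; stop when no utterance adds coverage.
    """
    needed = unit_types(target_text)
    selected = []
    while needed:
        best = max(corpus, key=lambda u: len(unit_types(u) & needed))
        gain = unit_types(best) & needed
        if not gain:
            break  # remaining types do not occur anywhere in the corpus
        selected.append(best)
        needed -= gain
    return selected

# Hypothetical mini-corpus and target text, for illustration only.
corpus = [
    "the cat sat on the mat",
    "a quick brown fox",
    "she sells sea shells",
]
chosen = greedy_select(corpus, "the fox sat")
```

A real selection would of course score candidates by their coverage of cluster-tree leaves weighted by frequency in the target text, rather than by raw type counts, but the greedy structure is the same.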