To investigate the optimal unit size, we built synthesizers with four different unit types: syllable, diphone, phone and half phone.
The phone synthesizer, the base case, was built with the phone set, letter-to-sound rules and syllabification rules defined for Indian languages.
To build the diphone synthesizer, we tagged each phone with its preceding phone. The units were thus still one phone in length, but were sub-typed by their previous phone.
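The tagging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `prev_phone` naming scheme and the `pau` boundary symbol used for the first phone are assumptions made here.

```python
def tag_with_previous(phones, boundary="pau"):
    """Return one context-tagged unit name per input phone.

    Each unit is still one phone long, but its type records the
    preceding phone. The "prev_phone" naming and the "pau" boundary
    symbol are illustrative assumptions.
    """
    tagged = []
    prev = boundary
    for phone in phones:
        tagged.append(f"{prev}_{phone}")
        prev = phone
    return tagged


units = tag_with_previous(["k", "a", "m", "a", "l"])
print(units)
```

Note that the two occurrences of the vowel receive different unit types here (`k_a` and `m_a`), which is the sub-typing effect described above.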
For the syllable-based synthesizer, we treated the 2344 distinct syllables in the database as "phones" and listed them in our phoneset. These syllable-sized phones were assigned phonetic features based on their combined consonant and vowel parts, with the consonant in the onset given preference over the consonant in the coda. Thus the units in the inventory became full syllables rather than traditional phonemes. The lexicon parser was modified accordingly to generate these syllable-based phones rather than traditional phone names.
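One way to read the feature assignment is sketched below. The vowel inventory, the consonant classes, the choice of the onset consonant adjacent to the vowel, and the `_`-joined unit name are all hypothetical; the text only states that the onset consonant takes preference over the coda consonant.

```python
# Hypothetical vowel inventory and consonant classes, for illustration only.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}
CONSONANT_TYPE = {"k": "stop", "t": "stop", "m": "nasal", "n": "nasal",
                  "l": "liquid", "r": "liquid", "s": "fricative"}


def syllable_features(phones):
    """Derive coarse features for a syllable given as a list of phones."""
    # The first vowel is taken as the nucleus (an assumption).
    nucleus = next((i for i, p in enumerate(phones) if p in VOWELS), None)
    if nucleus is None:
        raise ValueError(f"no vowel found in syllable {phones!r}")
    onset, coda = phones[:nucleus], phones[nucleus + 1:]
    # Onset consonant takes preference; fall back to the coda consonant.
    if onset:
        consonant = onset[-1]
    elif coda:
        consonant = coda[0]
    else:
        consonant = None
    ctype = "none" if consonant is None else CONSONANT_TYPE.get(consonant, "unknown")
    return {"unit": "_".join(phones), "vowel": phones[nucleus], "ctype": ctype}


print(syllable_features(["k", "a", "m"]))  # consonant type comes from the onset /k/
print(syllable_features(["a", "m"]))       # no onset, so the coda /m/ is used
```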
In implementing the half-phone synthesizer, each vowel was represented by two half phones, while the consonants remained full phones. Two phone symbols were defined for each vowel in the phoneset; for example, the vowel /a/ was represented by /a_1/ and /a_2/. Labels at the half-phone level were derived by dividing each vowel segment equally into two half phones. The lexicon parser was also modified accordingly to generate the appropriate phone strings.
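The equal division of vowel segments can be sketched as below. The label format (phone, start time, end time in seconds) and the small vowel set are assumptions for illustration; only the `_1`/`_2` naming comes from the text.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # illustrative subset, not the actual phoneset


def split_vowels(labels, vowels=VOWELS):
    """Split each vowel label at its midpoint into two half-phone labels.

    labels: list of (phone, start_sec, end_sec) tuples.
    Consonants are passed through unchanged as full phones.
    """
    out = []
    for phone, start, end in labels:
        if phone in vowels:
            mid = (start + end) / 2.0
            out.append((f"{phone}_1", start, mid))
            out.append((f"{phone}_2", mid, end))
        else:
            out.append((phone, start, end))
    return out


labels = [("k", 0.0, 0.125), ("a", 0.125, 0.375), ("m", 0.375, 0.5)]
for label in split_vowels(labels):
    print(label)
```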
For perceptual evaluation of these synthesizers, we selected a set of 24 sentences from a Hindi news bulletin. The content of this bulletin was mostly about world political affairs in mid-March 2003. The syllables and diphones present in these 24 sentences were covered in the corresponding synthesizers. The sentences were synthesized by the phone, diphone, syllable and half-phone synthesizers and evaluated in perceptual tests by native Hindi speakers. The participants were working professionals and graduate students, none of whom had any experience in speech synthesis. Each listener took an AB test, i.e., the same sentence synthesized by two different synthesizers was played in random order, and the listener was asked to decide which one sounded better. Listeners could also indicate no preference.
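A presentation schedule for one such AB comparison could be built as in the sketch below. The function name, the trial dictionary layout, and the use of a fixed seed are assumptions; the text only says that the two versions of each sentence were played in random order.

```python
import random


def make_ab_trials(sentence_ids, system_a, system_b, seed=None):
    """Build one AB trial per sentence, randomizing which system plays first."""
    rng = random.Random(seed)  # seeded so a schedule can be reproduced
    trials = []
    for sid in sentence_ids:
        pair = [system_a, system_b]
        rng.shuffle(pair)
        trials.append({"sentence": sid, "first": pair[0], "second": pair[1]})
    return trials


# 24 sentences, comparing the syllable and phone synthesizers.
trials = make_ab_trials(range(1, 25), "syllable", "phone", seed=0)
for trial in trials[:3]:
    print(trial)
```

Randomizing the play order per sentence keeps listeners from associating a position (first or second) with a particular synthesizer.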
The results of the AB tests, conducted with 11 listeners in the case of the syllable and diphone synthesizers and with 5 listeners for the rest of the synthesizers, are shown in Tables 1-6, with a summary in Table 7. Each row in these tables gives the evaluation results of one native speaker. For example, the first row of Table 1 indicates that the listener rated 8 utterances in favor of syllable, 6 utterances in favor of phone and 10 utterances as equally good or bad. The last row of each table summarizes the results in that table.
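Table 1: Syllable vs. phone
|Test No.||Syllable||Phone||No Preference|
Table 2: Syllable vs. half phone
|Test No.||Syllable||Halfphone||No Preference|
Table 3: Syllable vs. diphone
|Test No.||Syllable||Diphone||No Preference|
Table 4: Diphone vs. phone
|Test No.||Diphone||Phone||No Preference|
Table 5: Diphone vs. half phone
|Test No.||Diphone||Halfphone||No Preference|
Table 6: Phone vs. half phone
|Test No.||Phone||Halfphone||No Preference|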
Table 7: Summary of AB-test results ("=" denotes no preference)
|Rank||Syl vs Diph||Syl vs Ph||Syl vs Halfph||Diph vs Ph||Diph vs Halfph||Ph vs Halfph|
|I||syl 43%||= 45%||= 63%||= 47%||= 57%||= 65%|
|II||= 32%||syl 33%||syl 23%||diph 29%||halfph 24%||halfph 18%|
|III||diph 24%||ph 21%||halfph 14%||ph 23%||diph 19%||ph 17%|
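A ranked summary of this kind can be derived from per-listener counts roughly as sketched below. The only data used is the single example row quoted in the text (8, 6, 10); the function name and the rounding to whole percentages are assumptions, and the paper may have rounded differently.

```python
def summarize(rows, name_a, name_b):
    """Pool per-listener counts and rank the three outcomes by share.

    rows: list of (prefer_a, prefer_b, no_preference) counts, one per listener.
    Returns (label, percent) pairs sorted from most to least frequent,
    with "=" standing for no preference.
    """
    total_a = sum(r[0] for r in rows)
    total_b = sum(r[1] for r in rows)
    total_eq = sum(r[2] for r in rows)
    total = total_a + total_b + total_eq
    counts = {name_a: total_a, name_b: total_b, "=": total_eq}
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Rounding to the nearest whole percent is an assumption.
    return [(label, round(100 * count / total)) for label, count in ranked]


# The one example row given in the text: 8 for syllable, 6 for phone, 10 equal.
rows = [(8, 6, 10)]
for rank, (label, percent) in enumerate(summarize(rows, "syl", "ph"), start=1):
    print(rank, label, f"{percent}%")
```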