Experiment

To investigate the optimal unit size, we built synthesizers with four different unit types: syllable, diphone, phone and half phone.

The phone synthesizer, the base case, was built with the phone set, letter-to-sound rules and syllabification rules defined for Indian languages.

To build the diphone synthesizer, we tagged each phone with its preceding phone. The units were therefore still one phone in length, but they were sub-typed according to the previous phone.
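The tagging step can be illustrated with a minimal sketch. The function name `tag_with_previous`, the `_` separator and the `pau` boundary symbol are assumptions made for illustration; the paper does not specify the naming scheme that was actually used.

```python
def tag_with_previous(phones, boundary="pau"):
    """Sub-type each phone by the phone that precedes it.

    `boundary` is a hypothetical silence symbol used as the left
    context of the first phone; the "_" separator is also assumed.
    """
    tagged = []
    prev = boundary
    for phone in phones:
        # The unit is still one phone long; only its type name changes.
        tagged.append(f"{prev}_{phone}")
        prev = phone
    return tagged


print(tag_with_previous(["k", "a", "m", "a", "l"]))
```

Each output unit keeps the duration of a single phone, but two instances of the same phone with different left neighbours now belong to different unit types.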

For the syllable-based synthesizer, we treated the 2344 distinct syllables in the database as "phones" and listed them in our phoneset. These syllable-sized phones were assigned phonetic features based on their combined consonant and vowel parts, with the onset consonant given preference over the coda consonant. Thus the units in the inventory became full syllables rather than traditional phonemes. The lexicon parser was modified accordingly to generate these syllable-based phones rather than traditional phone names.

In the half phone synthesizer, each vowel was represented by two half phones, while the consonants remained full phones. Two phone symbols were defined for each vowel in the phoneset; for example, the vowel /a/ was represented by /a_1/ and /a_2/. Labels at the half phone level were derived by dividing each vowel segment equally into two half phones. The lexicon parser was also modified accordingly to generate the appropriate phone strings.
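The derivation of half phone labels can be sketched as follows. This is only an illustration: the label format `(phone, start_ms, end_ms)`, the function name `split_vowels` and the small vowel set are assumptions, not the actual label files or phoneset used in the experiment.

```python
# Hypothetical vowel set, for illustration only.
VOWELS = {"a", "i", "u", "e", "o"}


def split_vowels(labels, vowels=VOWELS):
    """Split every vowel segment at its midpoint into two half phones.

    `labels` is a list of (phone, start_ms, end_ms) tuples.
    Consonants are passed through unchanged as full phones.
    """
    out = []
    for phone, start, end in labels:
        if phone in vowels:
            mid = (start + end) / 2
            out.append((f"{phone}_1", start, mid))
            out.append((f"{phone}_2", mid, end))
        else:
            out.append((phone, start, end))
    return out


print(split_vowels([("k", 0, 100), ("a", 100, 300), ("m", 300, 380)]))
```

Only the vowel is split, so the consonants on either side keep their original boundaries.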

For perceptual evaluation of these synthesizers, we selected a set of 24 sentences from a Hindi news bulletin. The bulletin mostly covered world political affairs in mid-March 2003. The syllables and diphones present in these 24 sentences were covered in the corresponding synthesizers. The sentences were synthesized by the phone, diphone, syllable and half phone synthesizers and evaluated in perceptual tests by native Hindi speakers. The listeners were working people and graduate students, none of whom had any experience in speech synthesis. Each listener took an AB test, i.e., the same sentence synthesized by two different synthesizers was played in random order, and the listener was asked to decide which one sounded better. Listeners could also rate the two versions as equal.
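As a rough sketch of the AB procedure described above, the following code builds one trial per sentence with a random playback order and tallies a listener's responses. The function names, the fixed seed and the response encoding (`None` for "no preference") are assumptions for illustration; the example responses reproduce the counts in the first row of Table 1.

```python
import random


def make_ab_trials(sentence_ids, system_a, system_b, seed=0):
    """Return one (sentence_id, first, second) trial per sentence,
    with the playback order of the two systems chosen at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    trials = []
    for sid in sentence_ids:
        order = [system_a, system_b]
        rng.shuffle(order)
        trials.append((sid, order[0], order[1]))
    return trials


def tally(responses, system_a, system_b):
    """Count preferences; a response of None means 'no preference'."""
    return (
        responses.count(system_a),
        responses.count(system_b),
        responses.count(None),
    )


trials = make_ab_trials(range(1, 25), "syllable", "phone")
# Counts taken from the first row of Table 1: 8 syllable, 6 phone, 10 equal.
responses = ["syllable"] * 8 + ["phone"] * 6 + [None] * 10
print(len(trials), tally(responses, "syllable", "phone"))
```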

The AB tests were conducted with 11 listeners for the syllable vs. diphone comparison and with 5 listeners for each of the other comparisons. The results are shown in Tables 1-6, with a summary in Table 7. Each row in these tables gives the evaluation results of one native speaker. For example, the entry "8, 6, 10" in the first row of Table 1 indicates that the listener rated 8 utterances in favor of syllable, 6 utterances in favor of phone and 10 utterances as equally good or bad. The last row of each table gives the totals over all listeners in that table.

Table 1: AB Test: Syllable Vs Phone

	Listener Preference
Test No.	Syllable	Phone	No Preference
1.	8	6	10
2.	5	4	15
3.	9	-	15
4.	9	9	6
5.	9	7	8
Total	40	26	54

Table 2: AB Test: Syllable Vs Halfphone

	Listener Preference
Test No.	Syllable	Halfphone	No Preference
1.	2	4	18
2.	9	3	12
3.	10	6	8
4.	4	-	20
5.	3	4	17
Total	28	17	75

Table 3: AB Test: Syllable Vs Diphone

	Listener Preference
Test No.	Syllable	Diphone	No Preference
1.	13	8	3
2.	7	2	15
3.	4	4	16
4.	8	5	11
5.	11	6	7
6.	13	5	6
7.	10	8	6
8.	11	8	5
9.	11	6	7
10.	14	1	9
11.	12	12	-
Total	114	65	85

Table 4: AB Test: Diphone Vs Phone

	Listener Preference
Test No.	Diphone	Phone	No Preference
1.	7	8	9
2.	4	4	16
3.	3	4	17
4.	8	6	10
5.	13	6	5
Total	35	28	57

Table 5: AB Test: Diphone Vs Halfphone

	Listener Preference
Test No.	Diphone	Halfphone	No Preference
1.	6	5	13
2.	5	7	12
3.	11	5	8
4.	1	5	18
5.	-	7	17
Total	23	29	68

Table 6: AB Test: Phone Vs Halfphone

	Listener Preference
Test No.	Phone	Halfphone	No Preference
1.	5	3	16
2.	5	6	13
3.	7	8	9
4.	2	-	22
5.	1	5	18
Total	20	22	78

Table 7: Summary of AB Test (scores in %; "=" denotes no preference)

Rank	Syl vs Diph	Syl vs Ph	Syl vs Halfph	Diph vs Ph	Diph vs Halfph	Ph vs Halfph
I	syl 43%	= 45%	= 63%	= 47%	= 57%	= 65%
II	= 32%	syl 33%	syl 23%	diph 29%	halfph 24%	halfph 18%
III	diph 24%	ph 21%	halfph 14%	ph 23%	diph 19%	ph 17%
Sum.	syl	syl	syl	diph	halfph	halfph
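The summary in Table 7 can be recomputed from the totals in the last rows of Tables 1-6, as in the sketch below. The dictionary layout and the function names are illustrative assumptions. The percentages it produces agree with Table 7 to within one percentage point; the small differences appear to come from rounding.

```python
# Totals from the last rows of Tables 1-6; "=" stands for "no preference".
totals = {
    "Syl vs Diph": {"syl": 114, "diph": 65, "=": 85},
    "Syl vs Ph": {"syl": 40, "ph": 26, "=": 54},
    "Syl vs Halfph": {"syl": 28, "halfph": 17, "=": 75},
    "Diph vs Ph": {"diph": 35, "ph": 28, "=": 57},
    "Diph vs Halfph": {"diph": 23, "halfph": 29, "=": 68},
    "Ph vs Halfph": {"ph": 20, "halfph": 22, "=": 78},
}


def percentages(counts):
    """Convert raw judgement counts into percentages of all judgements."""
    n = sum(counts.values())
    return {label: 100.0 * c / n for label, c in counts.items()}


def preferred_unit(counts):
    """Return the unit with more votes, ignoring 'no preference'."""
    units = {label: c for label, c in counts.items() if label != "="}
    return max(units, key=units.get)


for pair, counts in totals.items():
    ranked = sorted(percentages(counts).items(), key=lambda kv: -kv[1])
    line = ", ".join(f"{label} {pct:.1f}%" for label, pct in ranked)
    print(f"{pair}: {line} -> {preferred_unit(counts)}")
```

The last value printed for each comparison corresponds to the "Sum." row of Table 7, i.e. the unit preferred more often once the "no preference" judgements are set aside.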