
Evaluation

We took the two sets of sentences (221 and 146) and recorded KAL delivering them. We built basic unit selection voices using the standard FestVox build process and hand-corrected the phonetic labels of each. Three test voices were built: one with just the 221 sentences, one with just the 146, and one with the combined set. We spent some time tuning the voices to find good parameters, such as the right cluster size and the features used for selection, although for the best quality we know we would need to do further correction of the labels.

A set of 20 sentences had previously been created for testing purposes; one of these (the first sentence of Alice) was contained in the 34,796-utterance training set but not in the selected set, while the rest of the test set was independent.

Judging speech synthesis quality is not easy, even (perhaps especially) if you listen to a lot of it. In general it is fairly easy to reliably determine what is much better, but in close cases, where multiple factors may affect the quality of the speech (e.g. joins, prosodic smoothness and segmental quality), such judgments become less clear, and subjects, when questioned, may differ.

We synthesized the twenty sentences, and one of the authors listened to randomly ordered examples from each comparison. Opinions were collected in terms of A being better than B, B being better than A, or the two being of equal quality.

                        A    B   A=B   better
  txt_221 vs txt_146    4    8    8    txt_146
  txt_221 vs txt_367    6   10    4    txt_367
  txt_146 vs txt_367    3   12    3    txt_367

It is clear that the largest set is best, but it is interesting that it is not so clear which of the 221 and 146 sets is better, even though one is 50% larger than the other. The 146-sentence voice is often smoother, but typically has less variation in prosody.
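For concreteness, the sketch below shows one way such a pairwise preference test can be administered and tallied. The file names, prompt and script are hypothetical illustrations, not the tool actually used for the listening test.

  import random

  # Hypothetical layout: each entry pairs the same test sentence as
  # synthesized by two voices; paths are illustrative only.
  pairs = [("txt_221", f"wav_221/test_{i:02d}.wav",
            "txt_146", f"wav_146/test_{i:02d}.wav") for i in range(20)]

  def run_ab_test(pairs):
      """Present each pair in random order and tally A / B / equal preferences."""
      tally = {"A": 0, "B": 0, "A=B": 0}
      for voice_a, wav_a, voice_b, wav_b in pairs:
          # Randomize presentation order so the listener cannot tell
          # which voice produced which stimulus.
          stimuli = [(voice_a, wav_a), (voice_b, wav_b)]
          random.shuffle(stimuli)
          print(f"First: {stimuli[0][1]}   Second: {stimuli[1][1]}")
          answer = input("Which sounded better? (1/2/=): ").strip()
          if answer == "=":
              tally["A=B"] += 1
          else:
              chosen_voice = stimuli[int(answer) - 1][0]
              tally["A" if chosen_voice == voice_a else "B"] += 1
      return tally

  if __name__ == "__main__":
      print(run_ab_test(pairs))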

Relative quality helps in deciding directions, but it does not determine whether the resulting voice is good enough for real applications. A second test was run on these voices with respect to five further test sets.

alice
20 sentences from ``Alice'', which was part of the training set (none of these test sentences were actually in the set of recorded utterances).
timit
20 sentences from the TIMIT database, a phonetically balanced set of 452 sentences [7].
comm
20 sentences from the CMU Communicator testing suite, used in a speech dialog system.
festvox
20 sentences from the abstract of [1].
story
20 sentences from a novel that was not part of the original database.

A five-point scale was used: 5 being indistinguishable or nearly indistinguishable from recorded speech, 4 having errors but understandable, 3 understandable with difficulty, 2 bad but with some parts discernible, and 1 nearly incomprehensible or worse. We had four people listen to these examples.



  testset    Listener 1  Listener 2  Listener 3  Listener 4   Mean   Rank
  alice         4.4        4.15        3.3         3.95        3.95    1
  timit         3.75       3.75        2.95        2.85        3.32    4
  comm          3.7        3.9         2.4         3.25        3.31    5
  festvox       3.75       4.05        3.25        3.4         3.61    3
  story         4.0        4.3         3.05        3.95        3.82    2
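The Mean and Rank columns follow directly from the per-listener means; a minimal sketch of that aggregation, using the values from the table above:

  # Per-listener mean scores copied from the table above (listeners 1-4).
  listener_means = {
      "alice":   [4.4, 4.15, 3.3, 3.95],
      "timit":   [3.75, 3.75, 2.95, 2.85],
      "comm":    [3.7, 3.9, 2.4, 3.25],
      "festvox": [3.75, 4.05, 3.25, 3.4],
      "story":   [4.0, 4.3, 3.05, 3.95],
  }

  def overall_means(per_listener):
      """Average each test set's per-listener means into one overall score."""
      return {name: sum(vals) / len(vals) for name, vals in per_listener.items()}

  def rank(means):
      """Rank test sets from best (1) to worst by overall mean score."""
      ordered = sorted(means, key=means.get, reverse=True)
      return {name: i + 1 for i, name in enumerate(ordered)}

  means = overall_means(listener_means)
  ranks = rank(means)
  for name in listener_means:
      print(f"{name:8s} mean={means[name]:.2f} rank={ranks[name]}")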



It is clear to anyone listening to the voices that they are most appropriate for reading stories, which is expected as that is the domain the data was selected for. The TIMIT sentences are, of course, more complex, as they were deliberately chosen to have good phonetic coverage. The Communicator data is in a different style, and although it is understandable its quality is not as good; it does not have the fluency and appropriateness that the limited domain voice has, which supports our hypothesis that limited domain voices can sound better than general purpose voices, because they are more appropriate to the task at hand.

