The goal of this experiment was to determine whether understanding of the content of an oral message heard over the telephone lessened as people aged and as the source of the speech went from natural to synthetic. The design of the experiment had to incorporate uncomplicated instructions so that elderly subjects would have no problem understanding what they were to do. It also needed to be short so that subjects who became frustrated at not hearing a token would not give up before the end of the test. We were specifically interested in understandability rather than other aspects of usability, and therefore concentrated on only a limited number of variables. The basic scenario was to play several simple utterances over the phone, some naturally spoken by a human and some synthesized by a speech synthesis system, and to have users write down what they heard.
Evaluation of synthetic speech is a notoriously difficult problem. Subjects' views of synthesis quality are easily influenced by order of presentation, familiarity with synthesized speech, and how questions are asked about their perception of quality. Naturalness and understandability can be independent. This work is just the start of a longer project to improve the quality of speech synthesis in general by focussing on a class of users who are known to have particular difficulty in understanding synthetic speech. In this experiment we decided to use standard diphone speech rather than more recent unit selection techniques, which can often produce speech indistinguishable from recorded prompts, especially in limited domains. With diphone quality speech we also have the opportunity to impose predicted prosody.
Four different voices were used to deliver the utterances.
The first utterance was of the form
Please write down the following time ...
while the second utterance was of the form
Please write down the following words ...
The first utterance was intended to be much easier to understand because the set of possible times is much more constrained than the set of possible words. We deliberately chose to play the more predictable utterance first so that the human listener could become more accustomed to the voice (be it natural or synthetic).
The sets of words consisted of relatively high-frequency bigrams; in all, only four different pairs were used: rose bush, Rose Bowl, holiday season and holiday shopping. In designing the prompts we aimed at a medium level of difficulty. We thus avoided selecting more confusable word pairs, for example pairs that were one phone away from more common bigrams, e.g. baseball pat. Although we were prepared to change the lexical content to ensure the right level of difficulty for our audience, our pre-testing showed that our first selection was adequate.