Experimental Design

The goal of this experiment was to determine whether understanding of an oral message heard over the telephone declines as listeners age and as the source of the speech changes from natural to synthetic. The design had to use uncomplicated instructions so that elderly subjects would have no trouble understanding what they were to do. It also needed to be short, so that subjects frustrated at not hearing a token would not give up before the end of the test. We were specifically interested in understandability rather than other aspects of usability, and therefore concentrated on a limited number of variables. The basic scenario was to play several simple utterances over the phone, some naturally spoken by a human and some produced by a speech synthesis system, and to have users write down what they heard.

Evaluation of synthetic speech is a notoriously difficult problem [3]. Subjects' views of synthesis quality are easily influenced by order of presentation, familiarity with synthesized speech and how questions are asked about their perception of quality. Naturalness and understandability can be independent. This work is just the start of a longer project to improve the quality of speech synthesis in general by focussing on a class of users who are known to have particular difficulty in understanding synthetic speech. In this experiment we decided to use standard diphone speech rather than more recent unit selection techniques, which can often produce speech indistinguishable from recorded prompts, especially in limited domains. With diphone-quality speech we also have the opportunity to impose predicted prosody.

Four different voices were used to deliver the utterances.

: was a natural spoken utterance by a female speaker
: was a natural spoken utterance by the same female speaker, but after she was told that the listener couldn't understand what was being said; thus the utterance was slower, more articulated and partly "shouted".
: was synthesized using a standard "diphone" female synthesis voice.
: was again synthesized by the same standard "diphone" voice, but unlike SN, where the prosody was predicted by a statistical model, here the natural durations and F0 were extracted from NS and imposed on the synthetic segmental form.

For each voice two utterances were presented. The first was of the form:
Please write down the following time ...
while the second was of the form:
Please write down the following words ...
The first utterance was intended to be much easier to understand, because the set of possible times is much more constrained than the set of possible words. We deliberately chose to play the more predictable utterance first so the human listener could become more accustomed to the voice (be it natural or synthetic).
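
The resulting set of stimuli can be sketched as a small enumeration. This is an illustration only, not the software used in the experiment: the condition names and the function build_trials are hypothetical, and the order in which the voices are listed is arbitrary, since only the within-voice order (time first, then words) is specified above.

```python
# Hypothetical condition names, chosen for readability; they are not the
# labels used in the paper.
VOICES = [
    "natural",
    "natural_clarified",            # slower, more articulated, partly "shouted"
    "synthetic_predicted_prosody",  # diphone voice, prosody from a statistical model
    "synthetic_natural_prosody",    # diphone voice, natural durations and F0 imposed
]

# Order matters: the more predictable utterance (a time) comes first.
UTTERANCE_TYPES = ["time", "words"]


def build_trials(voices=VOICES, utterance_types=UTTERANCE_TYPES):
    """Return (voice, utterance_type) pairs, two per voice, time before words."""
    return [(voice, utt) for voice in voices for utt in utterance_types]


trials = build_trials()
for voice, utt in trials:
    print(f"{voice}: Please write down the following {utt} ...")
```

Keeping the utterance types in an ordered list makes the "predictable utterance first" constraint explicit, so it cannot be lost if the voice order is later shuffled.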

The sets of words consisted of pairs forming relatively high-frequency bi-grams. In all, only four different pairs were used: rose bush, Rose Bowl, holiday season and holiday shopping. In designing the prompts we aimed at a medium level of difficulty. We thus avoided selecting more confusable word pairs, for example pairs which were one phone away from more common bi-grams, e.g. baseball pat (one phone away from baseball bat). Although we were prepared to change the lexical content to ensure the right level of difficulty for our audience, our pre-testing showed that our first selection was adequate.
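
To make the notion of "one phone away" concrete, the following sketch computes an edit distance over phone sequences. It is an illustration under stated assumptions and was not part of the prompt selection described here: the function name phone_edit_distance is hypothetical, and the ARPAbet-style transcriptions (stress marks omitted) are supplied by hand for the example.

```python
def phone_edit_distance(a, b):
    """Levenshtein distance between two phone sequences (lists of strings)."""
    # prev[j] holds the distance between the first i-1 phones of a
    # and the first j phones of b.
    prev = list(range(len(b) + 1))
    for i, phone_a in enumerate(a, start=1):
        curr = [i]
        for j, phone_b in enumerate(b, start=1):
            cost = 0 if phone_a == phone_b else 1
            curr.append(min(
                prev[j] + 1,         # delete a phone from a
                curr[j - 1] + 1,     # insert a phone from b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Assumed transcriptions, for illustration only.
baseball_pat = "B EY S B AO L P AE T".split()
baseball_bat = "B EY S B AO L B AE T".split()

distance = phone_edit_distance(baseball_pat, baseball_bat)
print(distance)  # 1: the two sequences differ only in P versus B
```

A candidate prompt at distance 1 from a more common bi-gram would count as confusable in the sense above, and would be too hard for the medium level of difficulty we aimed at.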

Alan W Black 2002-06-18