Emotional Speech

Unit selection synthesizers inherit the qualities of the database they are built from. Thus we can synthesize various emotions if we record databases of the appropriate type.

However, before we give some examples of this direction, it is worth defining more precisely what is meant by emotional speech and, more importantly, how we might actually use such synthesizers in applications.

Traditionally, emotional speech is split into four groups: neutral, happy, sad, and angry (hot and/or cold anger). Various studies show that listeners can fairly reliably distinguish between happy and sad, though they may confuse these with hot anger and cold anger in ambiguous situations. Testing output quality is hard. Studies usually use lexically neutral statements so that only the spectral and prosodic properties vary, while in real-life situations lexical content and context are probably bigger clues to the emotional state of the speaker.

The following experiment highlights how lexical choice influences human perception of voice characteristics.

In developing a child voice synthesizer, we specifically required a gender-neutral voice. Our recordings were based on an adult voice-over actress with experience in performing child voices. When we first tested recordings from her with a group of potential users, we found that most people identified the voice as an adult pretending to be a child. However, we noted that the sentence contents, designed for phonetic and metrical coverage, are not typical of sentences that would be spoken by children. It is difficult to imagine situations where a child might say:

A sense of psychological certainty is no proof in itself of epistemological validity.

Thus in later tests we synthesized child-specific utterances to test how the voice was perceived:

Are we there yet?
Please read me a story.
Can't I do it tomorrow?
...

We also synthesized girl-specific and boy-specific sentences:

Can I go to the Mall with Kimmy?
I like to go shopping for new clothes.
When I grow up I want to help animals.
...

Last weekend my Dad took me to a ball game.
I'm starving, is there anything to eat?
My Mom says I'm not old enough to watch Wrestling.
...

We played these utterances to parents who were not familiar with synthesis and, rather than asking them the gender of the speaker, asked them to suggest a suitable name and an age for the speaker. Overwhelmingly, listeners gave boy names when listening to the "boy" sentences, and girl names for the "girl" sentences. However, in general the listeners considered the boy to be younger than the girl.

These informal tests show that people's perception of voice type is subtle, and content can easily overwhelm prosodic and spectral qualities of voices.

In our experience of building speech synthesis systems, users rarely request these standard categories of emotion. Instead, much more subtle notions of emotion and style are needed.

Alan W Black 2003-09-07