

Even for something as basic as emphasis, we quickly discover that speech synthesis gives us very little control over the stylistic aspects of speech. When humans speak they use a number of different variations to denote emphasis. These include phrasing, duration, F0 excursions, and power. Different speakers may render emphasis with different combinations of these, and even individual speakers may change their strategies across different styles of speech.

In Festival [7], emphasis is implemented by rather naive rules. For words marked up in SABLE [8], emphasis is realized by inserting short pauses before and after the emphasized word, extending its duration, and intensifying the F0. In simple cases this is adequate, but it is very crude, and it is easy to find cases where it sounds unnatural. In almost all cases it is clear that the synthesizer is emphasizing the word, but potentially in an unnatural way, especially in polysyllabic words and phrases.
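
The three rules just described (pauses around the word, longer duration, stronger F0) can be illustrated with a minimal Python sketch. This is not Festival's implementation: the `Word` record, the function name and all three constants are hypothetical values chosen only to make the idea concrete.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical constants, for illustration only (not Festival's values).
PAUSE_SECONDS = 0.05    # short pause inserted on each side of the word
DURATION_SCALE = 1.3    # stretch factor for the emphasized word
F0_SCALE = 1.2          # scaling of the F0 peak on the emphasized word


@dataclass(frozen=True)
class Word:
    text: str
    duration: float          # seconds
    f0_peak: float           # Hz
    emphasized: bool = False
    pause_before: float = 0.0
    pause_after: float = 0.0


def apply_emphasis_rules(words: List[Word]) -> List[Word]:
    """Return a copy of `words` with the naive emphasis rules applied."""
    result = []
    for word in words:
        if word.emphasized:
            # Apply all three rules to the emphasized word only.
            word = replace(
                word,
                duration=word.duration * DURATION_SCALE,
                f0_peak=word.f0_peak * F0_SCALE,
                pause_before=PAUSE_SECONDS,
                pause_after=PAUSE_SECONDS,
            )
        result.append(word)
    return result
```

Because the same fixed changes are applied to every emphasized word regardless of its length or context, a rule set of this kind marks the word clearly but gives no guarantee that the result sounds natural.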

To improve the quality of such a basic speech variation as emphasis, we tried explicitly recording examples of naturally emphasized speech. As we wished to use these recordings in a standard unit selection synthesizer, we had to ensure that there was sufficient phonetic, metrical and prosodic coverage within the database.

We therefore started from a database originally designed with such coverage. We used the techniques described in [6] to select sentences that provide the best coverage, based on an explicit acoustic model of the voice talent's speech. This database consists of 548 sentences selected from out-of-copyright books [9].

Then, to address coverage for emphasis, we labelled every other word in each sentence as emphasized. The voice talent (AWB) then read the sentences with emphasis on each word as marked. This was harder than expected: it is not easy to read a sentence and put natural emphasis on arbitrary words. This fact is important when eliciting varying styles for unit selection databases. It is hard for a voice talent, even a trained one, to deliver a desired style consistently. When the request is something as unnatural as emphasis on several words in the same sentence, the result may not always sound natural.

Each word to be emphasized was marked with a leading underscore:

_Allow me _to interpret _this interesting _silence.
_Tarzan and _Jane raised _their heads.
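
The marking scheme can be reproduced with a short Python sketch. It assumes, as in the two examples above, that marking starts with the first word and that words are separated by whitespace; the function name is hypothetical.

```python
def mark_every_other_word(sentence: str, marker: str = "_") -> str:
    """Prefix every other word, starting with the first, with `marker`."""
    words = sentence.split()
    return " ".join(
        marker + word if index % 2 == 0 else word
        for index, word in enumerate(words)
    )


print(mark_every_other_word("Allow me to interpret this interesting silence."))
print(mark_every_other_word("Tarzan and Jane raised their heads."))
```

Run on the unmarked text, this reproduces the two prompt sentences shown above.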

The recordings were automatically labelled and a cluster-based unit selection synthesizer was built [10]. In the default case, units of the same phone type are clustered using a CART method that indexes the clusters by high-level features such as phone context and metrical structure. In this case we also tagged each phone with an emphasis feature. Thus phones from emphasized words can only be used in the synthesis of emphasized words, while phones from non-emphasized words can only be used in the synthesis of non-emphasized words.
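
The effect of the emphasis feature on unit choice can be sketched in Python as follows. This is only an illustration of the constraint, not the CART clustering of [10]: the `Unit` record, the phone labels and both function names are hypothetical, and a plain filter stands in for the cluster index.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    phone: str        # phone type, e.g. "ae" (labels here are illustrative)
    word: str         # word the unit was recorded in
    emphasized: bool  # emphasis feature inherited from that word


def tag_words(marked_sentence: str, marker: str = "_") -> List[Tuple[str, bool]]:
    """Split an underscore-marked prompt into (word, emphasized) pairs."""
    pairs = []
    for token in marked_sentence.split():
        emphasized = token.startswith(marker)
        word = token[len(marker):] if emphasized else token
        pairs.append((word.strip(".,!?").lower(), emphasized))
    return pairs


def candidate_units(inventory: List[Unit], phone: str, emphasized: bool) -> List[Unit]:
    """Keep only units of the right phone type AND the same emphasis value."""
    return [
        unit for unit in inventory
        if unit.phone == phone and unit.emphasized == emphasized
    ]
```

For example, with an inventory containing an "ae" from an emphasized word and an "ae" from a non-emphasized word, `candidate_units(inventory, "ae", True)` returns only the first. The two sets of units never mix.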

Once the synthesizer was built, we took a number of short sentences not in the original database and synthesized each of them several times, emphasizing each word in turn:

_This is a short example.
This _is a short example.
This is _a short example.
This is a _short example.
This is a short _example.
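
A test set of this form can be generated with a few lines of Python. The sketch below assumes whitespace-separated words and the same leading-underscore convention; the function name is hypothetical.

```python
from typing import List


def emphasize_each_word(sentence: str, marker: str = "_") -> List[str]:
    """Return one variant of `sentence` per word, with that word marked."""
    words = sentence.split()
    return [
        " ".join(
            marker + word if index == target else word
            for index, word in enumerate(words)
        )
        for target in range(len(words))
    ]


for variant in emphasize_each_word("This is a short example."):
    print(variant)
```

For the sentence above this prints the five variants listed, one per word.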

In all cases it was easy to identify the emphasized word in the synthesized phrase. However, in about 15% of the examples the emphasis was judged to be unnatural, although other problems with this fully automatically built unit selection system partially interfere with this result.

Despite the limitations of this particular database, it is clear that this technique does work. If appropriate data is recorded with sufficient coverage, it is possible to synthesize that style in a natural way.

Alan W Black 2003-09-07