Speech synthesis has now reached the stage where high-quality, human-sounding speech can be synthesized for many applications. Unit selection speech synthesis, where appropriate sub-word units are selected from a large database of natural speech, is best when the domain is known, but can still produce pleasant-sounding speech in the general case. However, there are restrictions in such systems. The high quality of the speech is directly linked to the quality of the speech database itself. Unlike previous synthesis technologies such as diphone and formant synthesis, unit selection does not currently offer variation outside its recorded style.
This work is part of an NSF grant to investigate more varied synthesis. Specifically, we are looking at modeling prosodic variation in the reading of children's stories, where a human reader adds different voice qualities, appropriate intonation, etc., to carry the meaning of the story over to the listener.
To do this well, we need to analyze the text to uncover some of its underlying structure. This paper looks at one particular aspect of text analysis for speech synthesis. We examine how to automatically identify spoken text within a story and, by identifying the characters in the story, assign each quote to a particular character. The result is a system, which we call ESPER, that takes in raw text stories and produces markup identifying who speaks when. The markup can then be rendered in a speech synthesis markup language such as SABLE or SSML and transformed into speech.
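To make the target output concrete, the following is a minimal sketch of the kind of markup described above. It is not the ESPER method. The function names `mark_up` and `to_ssml`, the small list of speech verbs, the "narrator" and "unknown" labels, and the restriction to straight double quotes are all assumptions made for illustration. The sketch assigns a quote to a character only when a speech verb and a capitalized name directly follow it, and it emits an SSML-style fragment that omits the attributes a complete SSML document would carry.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumption: quotes are delimited by straight double quotes only.
QUOTE = re.compile(r'"([^"]+)"')
# Hypothetical attribution cue: a speech verb followed by a capitalized name.
CUE = re.compile(r'\s*,?\s*(?:said|asked|replied|cried)\s+([A-Z][a-z]+)')


def mark_up(text, narrator="narrator"):
    """Split text into (speaker, text) segments; unquoted text goes to the narrator."""
    segments = []
    pos = 0
    for m in QUOTE.finditer(text):
        if m.start() > pos:
            segments.append((narrator, text[pos:m.start()]))
        # Look for an attribution cue immediately after the closing quote.
        cue = CUE.match(text, m.end())
        speaker = cue.group(1) if cue else "unknown"
        segments.append((speaker, m.group(1)))
        pos = m.end()
    if pos < len(text):
        segments.append((narrator, text[pos:]))
    # Drop segments that are empty after trimming whitespace.
    return [(s, t.strip()) for s, t in segments if t.strip()]


def to_ssml(segments):
    """Render segments as an SSML-style fragment, one voice element per segment."""
    body = "".join(
        f"<voice name={quoteattr(s)}>{escape(t)}</voice>" for s, t in segments
    )
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    story = '"Who is there?" asked Alice. The door stayed shut.'
    print(to_ssml(mark_up(story)))
```

A quote with no adjacent cue is labeled "unknown" rather than guessed, which reflects the point that attribution often needs more context than the surrounding clause provides.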
Although we are dealing with well-written, published texts, there is still significant variety in how the information within a story is presented. Hence, identifying the speakers within a story is a non-trivial task, even when we choose to ignore clearly difficult cases, e.g. those where detailed semantics of the story or external world knowledge is required for the identification.