An important problem in text-to-speech (TTS) synthesis is finding suitable places in the text for prosodic phrase breaks. In a typical TTS system, phrase breaks are used by a number of modules, including:
Past reviews [Ostendorf and Veilleux, 1994], [Wang and Hirschberg, 1992] describe two approaches. The first exploits the relationship between prosodic structure and syntactic structure, and uses syntactic information to predict prosodic boundaries, often by means of heuristic rules. This approach has several disadvantages that make it unattractive for real TTS systems. Rule-driven parsers are notoriously unreliable and can provide poor input to the syntax-to-prosody module. In addition, a rule-driven syntax-to-prosody module suffers from the same disadvantages as all rule-driven systems: such systems are often difficult to write, modify, maintain and adapt to new domains and languages.
In light of these shortcomings, some researchers have tried a second approach whereby prosodic structure is derived from robust, if crude, features of the input text. The simplest of these is based on the content word/function word rule (e.g. Silverman's thesis), whereby a phrase break is placed before every function word that follows a content word. Despite its simplicity, such an approach can sometimes produce reasonable results. A number of other methods based on either rule-driven or statistical analysis of superficial features of the text have also been proposed [Wang and Hirschberg, 1992], [Hirschberg and Prieto, 1994], [Ostendorf and Veilleux, 1994], [Veilleux et al., 1990].
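To make the content word/function word rule concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions and not the implementation used in any of the cited systems: the function-word list `FUNCTION_WORDS` is a small hypothetical sample, the input is assumed to be already tokenised with punctuation removed, and the names `is_function_word`, `content_function_breaks` and `render` are invented for this example.

```python
# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration only). A practical system would derive the distinction from
# a fuller lexicon or from part-of-speech tags.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "by", "for", "with",
    "and", "but", "or", "that", "which", "is", "was", "he", "she", "it", "they",
}


def is_function_word(token):
    """Treat a token as a function word if it appears in the sample list."""
    return token.lower() in FUNCTION_WORDS


def content_function_breaks(tokens):
    """Return each index i where a break is placed before tokens[i].

    A break is placed before every function word that immediately
    follows a content word.
    """
    breaks = []
    for i in range(1, len(tokens)):
        if is_function_word(tokens[i]) and not is_function_word(tokens[i - 1]):
            breaks.append(i)
    return breaks


def render(tokens, breaks, marker="|"):
    """Join the tokens into a string, inserting the marker at each break."""
    break_set = set(breaks)
    out = []
    for i, token in enumerate(tokens):
        if i in break_set:
            out.append(marker)
        out.append(token)
    return " ".join(out)


tokens = "the cat sat on the mat and the dog slept".split()
breaks = content_function_breaks(tokens)
print(render(tokens, breaks))  # the cat sat | on the mat | and the dog slept
```

Note that no break is placed between two consecutive function words ("on the", "and the"), because the rule fires only at a transition from a content word to a function word. This is also why the rule tends to over-generate breaks in text where such transitions are frequent.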
This paper describes an algorithm of the second type which assigns phrase breaks using global optimisation techniques on sequences of part-of-speech (POS) tags.