
Introduction

An important problem in text-to-speech (TTS) synthesis is to find suitable places in the text for the placement of prosodic phrase breaks. In a typical TTS system, phrase breaks are used by a number of modules, including:

1. Fundamental frequency contour generation: Major phrase boundaries delimit intonation phrases and are the only positions where boundary tones can occur. Correct phrasing also facilitates suitable accentuation, as the last accent in a phrase is treated as the nuclear accent.

2. Duration: The duration module lengthens segments which occur immediately prior to a phrase boundary.

3. Pause insertion: Pauses can be inserted in the middle of a sentence. The main deciding factor in this is whether a major phrase break has just occurred.

The performance of these modules is heavily dependent on the ability of the phrase break component to place its boundaries in appropriate places.

Past reviews [Ostendorf and Veilleux, 1994], [Wang and Hirschberg, 1992] describe two approaches. The first makes use of the fact that prosodic structure and syntactic structure are related, and uses some sort of syntactic information to predict prosodic boundaries (often in the form of heuristic rules). This approach has several disadvantages which make it unattractive for real TTS systems. Rule-driven parsers are notoriously unreliable and can provide poor input to the syntax-to-prosody module. In addition, a rule-driven syntax-to-prosody module suffers from the same disadvantages as all rule-driven systems: such systems are often difficult to write, modify, maintain and adapt to new domains and languages.

In light of these shortcomings, some researchers have tried a second approach whereby prosodic structure is derived from robust, if crude, features of the input text. The simplest of these is based on the content word/function word rule (e.g. Silverman's thesis) whereby a phrase break is placed before every function word that follows a content word. Despite its simplicity, such an approach can sometimes produce reasonable results. A number of other methods based on either rule-driven or statistical superficial analysis of the text have also been proposed [Wang and Hirschberg, 1992], [Hirschberg and Prieto, 1994], [Ostendorf and Veilleux, 1994], [Veilleux et al., 1990].
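As an illustration of the content word/function word rule just described, the following is a minimal sketch in Python. The function-word list and the names `FUNCTION_WORDS`, `place_breaks` and `render` are assumptions made for this example only; they are not taken from any of the cited systems, and a real implementation would use a much fuller lexicon or a POS tagger to separate the two word classes.

```python
# Hypothetical, deliberately small function-word list (an assumption for
# illustration); a real system would use a full lexicon or POS tags.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "with",
    "and", "but", "or", "that", "is", "was",
}


def is_function_word(word):
    """Return True if the word is in the illustrative function-word list."""
    return word.lower() in FUNCTION_WORDS


def place_breaks(words):
    """Apply the content word/function word rule.

    Returns the indices i such that a phrase break is placed immediately
    before words[i], i.e. words[i] is a function word and words[i - 1]
    is a content word.
    """
    breaks = []
    for i in range(1, len(words)):
        if is_function_word(words[i]) and not is_function_word(words[i - 1]):
            breaks.append(i)
    return breaks


def render(words, breaks, marker="|"):
    """Join the words into a string, inserting a marker at each break."""
    break_set = set(breaks)
    out = []
    for i, word in enumerate(words):
        if i in break_set:
            out.append(marker)
        out.append(word)
    return " ".join(out)


words = "the cat sat on the mat and the dog barked".split()
breaks = place_breaks(words)
print(render(words, breaks))
# prints: the cat sat | on the mat | and the dog barked
```

The rule looks only at adjacent word pairs, which is what makes it robust but crude: it needs no parser, yet it cannot distinguish a major boundary from a minor one or suppress a break in a very short phrase.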

This paper describes an algorithm of the second type which assigns phrase breaks using global optimisation techniques on sequences of part-of-speech (POS) tags.


Alan W Black
1999-03-20