Speech synthesis by concatenation of sub-word units (e.g. diphones) has become a basic technology. It produces reliable, clear speech and is the basis for a number of commercial systems. However, although speech built from simple diphones is clear, it lacks the naturalness of real speech. In an attempt to improve naturalness, a variety of recently reported techniques expand the inventory of units used in concatenation beyond the basic diphone schema (e.g. [7] [5] [6]). This has been done in a number of directions: changing the size of the units, the classification of the units themselves, and the number of occurrences of each unit.

A convenient term for these approaches is selection-based synthesis. In general, there is a large database of speech containing a variable number of units from each class. The goal of these algorithms is to select the best sequence of units from all the possibilities in the database, and to concatenate them to produce the final speech.

The higher-level (linguistic) components of the system produce a target specification, which is a sequence of target units, each of which is associated with a set of features. In the algorithm described here the database units are phones, but they can be diphones or units of other sizes. In the work of Sagisaka et al. [9], units are of variable length, giving rise to the term non-uniform unit synthesis. In that sense our units are uniform. The features include both phonetic and prosodic context, for instance the duration of the unit or its position in a syllable. The selection algorithm has two jobs: (1) to find units in the database which best match this target specification and (2) to find units which join together smoothly.
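
The two jobs of the selection algorithm can be illustrated with a small dynamic-programming search. The following is only a sketch under stated assumptions, not the algorithm described in this paper: the names Target, Unit, target_cost, join_cost and select_units are hypothetical, the features are limited to duration and syllable position, and the join cost is a made-up pitch mismatch at the boundary. It shows how a cost for job (1) and a cost for job (2) might be combined to pick one database unit per target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One item of the target specification (hypothetical feature set)."""
    phone: str
    duration: float   # desired duration in seconds
    position: str     # position in syllable, e.g. "onset", "nucleus", "coda"


@dataclass(frozen=True)
class Unit:
    """One database unit (hypothetical feature set)."""
    phone: str
    duration: float
    position: str
    start_f0: float = 0.0   # pitch at the unit's left edge, in Hz (assumed feature)
    end_f0: float = 0.0     # pitch at the unit's right edge, in Hz (assumed feature)


def target_cost(target, unit):
    """Job (1): how badly a unit matches the target features (assumed weights)."""
    cost = abs(target.duration - unit.duration)
    if target.position != unit.position:
        cost += 1.0
    return cost


def join_cost(left, right):
    """Job (2): how badly two adjacent units join (assumed pitch-mismatch measure)."""
    return abs(left.end_f0 - right.start_f0) / 100.0


def select_units(targets, database):
    """Return (best unit sequence, total cost) using a Viterbi-style search."""
    candidates = [[u for u in database if u.phone == t.phone] for t in targets]
    if not targets or any(not c for c in candidates):
        raise ValueError("every target needs at least one candidate unit")

    # best[i][j] = (cheapest cost of any path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            prev_cost, prev_idx = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], unit), prev_idx))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Calling select_units(targets, database) with a list of Target objects and a list of Unit objects returns the chosen units and their summed cost. Because the two costs are summed along the whole path, a unit that matches its target slightly worse can still be chosen when it joins more smoothly to its neighbours.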

Alan W Black
Tue Jul 1 17:10:58 BST 1997