

For each syllable in the database bearing a Tilt event label, a set of 40 features was extracted. The features include the number of syllables, stressed syllables, and accented syllables preceding and following the syllable within the phrase; the distance, in syllables, from the previous event and to the next event; the number of non-major phrase breaks since the last major break; onset and rhyme length [11] [8]; the percentage of the syllable that is unvoiced; and the position of the syllable within the word (e.g. initial, final, medial). The features also include, with a two-syllable window on either side, accentedness, lexical stress, onset and coda types (cf. [11]), Tilt event type and syllable break values. These are specifically the features that are available at F0 generation time during synthesis from raw text.
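As an illustration of the count, distance and window features described above, the following is a minimal sketch in Python. The `Syllable` record, the function names and the feature names are hypothetical and are not taken from the system described here; only a handful of the 40 features are shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Syllable:
    # Hypothetical record; field names are assumptions for this sketch.
    stressed: bool
    accented: bool
    event: Optional[str] = None  # Tilt event label, or None if no event


def context_features(phrase, i):
    """Count and distance features for syllable i of a phrase (list of Syllable)."""
    before, after = phrase[:i], phrase[i + 1:]
    # Distance in syllables back to the previous event (None if there is none).
    prev_dist = next(
        (d for d, s in enumerate(reversed(before), start=1) if s.event), None
    )
    # Distance in syllables forward to the next event (None if there is none).
    next_dist = next((d for d, s in enumerate(after, start=1) if s.event), None)
    return {
        "syls_before": len(before),
        "syls_after": len(after),
        "stressed_before": sum(s.stressed for s in before),
        "stressed_after": sum(s.stressed for s in after),
        "accented_before": sum(s.accented for s in before),
        "accented_after": sum(s.accented for s in after),
        "dist_prev_event": prev_dist,
        "dist_next_event": next_dist,
    }


def window(values, i, width=2, pad=None):
    """Values in a window of `width` syllables either side of position i.

    Positions outside the phrase are filled with `pad`.
    """
    return [
        values[j] if 0 <= j < len(values) else pad
        for j in range(i - width, i + width + 1)
    ]
```

For a four-syllable phrase, `window([s.stressed for s in phrase], 0)` would yield two padding values followed by the stress values of the first three syllables, which is one way to realise the two-syllable window on either side.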

Once the features have been extracted, training sets are created on the basis of event type (accent, boundary, connection, silence), and an individual model is built for each Tilt parameter (starting F0, amplitude, duration, tilt, peak position).
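The partitioning step can be sketched as follows. The record layout (`event_type`, `features`, `tilt`) and the parameter names are assumptions made for this example. Which parameters are modelled for which event type is not stated in this section, so the sketch simply uses whichever parameters are present in each record.

```python
from collections import defaultdict

# Hypothetical parameter names for the five Tilt parameters listed above.
PARAMETERS = ("start_f0", "amplitude", "duration", "tilt", "peak_position")


def build_training_sets(records, parameters=PARAMETERS):
    """Group (features, target) pairs by (event type, Tilt parameter).

    Each record is assumed to be a dict with keys "event_type",
    "features" (a feature vector) and "tilt" (parameter name -> value).
    """
    sets = defaultdict(list)
    for rec in records:
        for p in parameters:
            # Only parameters actually present for this event are used.
            if p in rec["tilt"]:
                sets[(rec["event_type"], p)].append((rec["features"], rec["tilt"][p]))
    return dict(sets)
```

Each key of the returned dictionary then identifies one training set, and hence one model to be trained.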

A CART training algorithm [3] is used to develop a decision tree for each parameter, using an optimised subset of the features extracted. The twelve decision trees are used to generate an intonation description file composed of Tilt events and their parameters. The description files are processed to generate the final F0 contours.
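To make the tree-building step concrete, the sketch below grows a small CART-style regression tree by choosing, at each node, the threshold split that minimises the summed squared error of the two children. It is a generic illustration, not the implementation of [3]: it handles numeric features only (categorical features such as event type would need encoding), it omits the feature-subset optimisation mentioned above, and the `min_leaf` stopping rule is an assumption.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (feature index, threshold, cost) with the lowest child SSE, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds lie midway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (f, t, cost)
    return best


def grow(rows, ys, min_leaf=2):
    """Grow a tree; a leaf is a float (mean target), a node is a 4-tuple."""
    mean = sum(ys) / len(ys)
    split = best_split(rows, ys) if len(ys) >= 2 * min_leaf else None
    if split is None or split[2] >= sse(ys):
        return mean
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    if len(left) < min_leaf or len(right) < min_leaf:
        return mean
    return (
        f,
        t,
        grow([r for r, _ in left], [y for _, y in left], min_leaf),
        grow([r for r, _ in right], [y for _, y in right], min_leaf),
    )


def predict(tree, row):
    """Follow threshold tests down to a leaf and return its mean value."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

Under these assumptions, one such tree would be trained per training set, and at synthesis time `predict` would supply the value of the corresponding Tilt parameter for each event.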

Kurt Dusterhoff
Tue Jul 1 11:51:11 BST 1997