Discussion

The goal of this study was to determine whether high-quality F₀ contours can be generated using only information available to a text-to-speech system at synthesis time. The results of our contour generation experiments show that this is possible. The contours in Figure 2 illustrate the similarity between the generated contours and the originals. Informal listening tests confirm that the generated contours are acceptable; they sometimes differ from the originals, but the differences are usually unimportant.

The features used in the generation are all readily available during text-to-speech synthesis. Lexical stress, syllable, word, phonetic and phrasing information are routinely generated as part of the synthesis process before F₀ generation is required.

One aspect that we do not cover in this paper is the automatic assignment of accent and boundary event labels during the synthesis process. Likewise, [2] and [6] assume an existing labelling from which they then generate the F₀. Although we have not yet tested this, predicting Tilt accents and boundaries seems a much easier task than predicting the various ToBI pitch accent labels and end tones.

The work in [2] is very similar to this study and was the starting point for our experiments, in terms of the feature sets and the speech database. The phrasal features (e.g. phrase breaks, syllable distances) and stress and accentedness features from that study were all incorporated into our experiment. However, unlike [2], which predicts F₀ for every syllable in the database, we only predict parameters for events. Thus, if a number of syllables fall within a single event, they do not have individual values predicted for them. Our approach is more focused on generating intonation from an accent structure within an utterance than on deriving F₀ from the utterance itself.
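
The difference in prediction units can be illustrated with a minimal sketch. It is not the code used in our experiments: the `Syllable` class, its `event_id` and `event_type` fields, and the example utterance are hypothetical, and the sketch assumes that each syllable has already been assigned to exactly one intonation event. It shows only how consecutive syllables sharing an event collapse into a single unit for which parameters would be predicted.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Syllable:
    text: str
    event_id: int    # hypothetical index of the event covering this syllable
    event_type: str  # "accent", "boundary", "silence" or "connection"


def event_units(syllables):
    """Group consecutive syllables that share an event into one prediction unit."""
    units = []
    for event_id, group in groupby(syllables, key=lambda s: s.event_id):
        members = list(group)
        units.append((event_id, members[0].event_type, [s.text for s in members]))
    return units


# Invented example: five syllables covered by four events.
utterance = [
    Syllable("the", 0, "connection"),
    Syllable("mar", 1, "accent"),
    Syllable("ket", 2, "connection"),
    Syllable("is", 2, "connection"),
    Syllable("closed", 3, "boundary"),
]

units = event_units(utterance)
```

In this invented example a syllable-based approach would make five predictions, whereas the event-based grouping yields four units, because "ket" and "is" fall within the same connection event.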

The work in [6] is also similar to our approach. By taking similar phrasal and segmental characteristics into account, and working with an accent inventory of minimal size (4 accents, 3 intermediate boundaries, and 2 final boundaries), they generate good quality intonation contours. They also incorporate energy prediction into their experiments, with successful results.

One advantage our study has over [2] and [6] is that the intonation event inventory (accent, boundary, silence, connection) is very simple. Both of the previous studies were forced to collapse the large ToBI inventory into a smaller number of classes in order to achieve their results. The use of the Tilt labelling system eliminates this requirement.

Kurt Dusterhoff
Tue Jul 1 11:51:11 BST 1997