Rendering natural-sounding speech from raw text involves many subtasks, one of which is generating natural-sounding intonation. A number of intonation theories have been used in various systems for this purpose. As the quality of speech synthesis improves, greater demands are placed on the intonation system to produce more varied intonation tunes. Because of this demand, and the need to add new voices and new accents to our systems quickly and easily, intonation systems should be trainable, where appropriate, from natural speech data.
ToBI offers a well-defined intonation phonology for labelled speech. It is probably still the most widely available standard labelling system. The ToBI labelling system itself does not define a mechanism for going from the labels to an F0 contour, or the reverse. However, both hand-written rule systems and statistically trained methods exist that perform this task.
The Tilt intonation theory has been shown to be a good representational method for natural F0 contours, but prior to the work presented here it had not been shown that Tilt parameters could be predicted reliably from text input. Tilt and ToBI typify two major classes of intonation system. Tilt comes from a data-driven approach that attempts to form an abstraction of the natural contour while maintaining a mechanism to recreate it. ToBI takes a more linguistic, or phonological, approach, specifying a small set of discrete labels which identify the intonational space of accents and tones.
There are other intonation theories, but we highlight ToBI because it has been used in an experiment similar to the one described below, on the same database, and hence offers a chance to compare these theories directly on the same task.
For an intonation theory to be suitable for use in a speech synthesis system, it must be possible to predict its parameters adequately from the information available in the raw text (or information that is automatically derivable from such text). Thus a good representation of the natural F0 is not sufficient on its own; we must also be able to predict the parameters of that representation.