Data and Evaluation

We used the Spoken English Corpus [Arnfield, 1994] for all the experiments described here. This is a database of spoken British English recorded from BBC Radio 4. It consists mostly of speech read from scripts, such as news stories and weather reports. The corpus has been labelled with POS information and hand-labelled with major and minor prosodic phrase breaks. The section of the corpus that we use has 39,369 words and contains 7,750 breaks (major and minor combined). It comprises 40 separate stories, of which 30 were used for training and 10 for testing. This gives a training set of 31,707 words and 6,346 breaks, and a test set of 7,662 words and 1,404 breaks.
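The counts above can be checked with a short sketch. It confirms that the training and test figures sum to the totals for the whole section, and it computes a rough break rate for each part. The names `TOTAL`, `TRAIN`, `TEST`, `break_rate` and `split_is_consistent` are illustrative only, and the rate is breaks per word, which assumes one potential break site after every word; the text does not state how sites are counted.

```python
# Counts as reported in the text for the section of the corpus used.
TOTAL = {"words": 39369, "breaks": 7750}
TRAIN = {"words": 31707, "breaks": 6346}
TEST = {"words": 7662, "breaks": 1404}


def break_rate(part):
    """Return breaks per word.

    Assumption: each word is followed by one potential break site,
    so the word count stands in for the number of sites.
    """
    return part["breaks"] / part["words"]


def split_is_consistent(total, train, test):
    """Check that training and test counts add up to the totals."""
    return all(train[key] + test[key] == total[key] for key in total)


if __name__ == "__main__":
    print("split consistent:", split_is_consistent(TOTAL, TRAIN, TEST))
    for name, part in (("all", TOTAL), ("train", TRAIN), ("test", TEST)):
        print(f"{name}: {break_rate(part):.3f} breaks per word")
```

Under that assumption, roughly one word in five is followed by a break in each part, so a predictor that never places a break would already be right at about four sites in five. Any accuracy figure for break prediction is best read against that background.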

Alan W Black