We define our problem as follows: the input text consists of a sequence of words, with a word juncture between each pair of adjacent words. There is a set of T word juncture types, and it is the task of a phrase-break algorithm to assign the most appropriate type to each juncture. Most experiments in this paper use two types of juncture, break and non-break. In principle any number of types is possible: for example, splitting the break type into minor and major gives 3 types, and following the ToBI scheme gives 5 types [Silverman et al., 1992]. Here we give an overview of the standard version of our algorithm, which assigns breaks and non-breaks to arbitrary input text.
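The setup above can be sketched directly: a sentence of n words has n-1 junctures, each of which receives one label from the juncture type set. The sentence and labels below are invented for illustration, not taken from the paper's data.

```python
# A minimal sketch of the problem setup: for a sequence of n words there
# are n-1 word junctures, each to be labelled with one juncture type.
words = ["the", "cat", "sat", "on", "the", "mat"]
juncture_types = ["break", "non-break"]  # the standard two-type case

# One candidate labelling: one juncture label between each adjacent word pair.
junctures = ["non-break"] * (len(words) - 1)

assert len(junctures) == len(words) - 1
```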
The algorithm is trained and tested on a database of spoken English in which the original text has been hand-annotated with a break label at each juncture perceived as a phrase break. The text is tagged with a hidden Markov model (HMM) part-of-speech (POS) tagger, which replaces each word by its POS tag. The tags, c, are chosen from a tagset V = {v1, ..., vK} of size K. The juncture between every pair of words is then marked as one of the word juncture types: in the standard case this is either break or non-break.
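The preprocessing step described above can be illustrated as follows. The tag names and the toy tagset here are placeholders, not the actual tagset or HMM tagger used in the paper.

```python
# Hypothetical illustration: each word is replaced by its POS tag drawn
# from a tagset V = {v1, ..., vK}, and every juncture between adjacent
# words is marked break or non-break.
tagset = ["DET", "NOUN", "VERB", "PREP"]  # a toy tagset of size K = 4

# Output of a (hypothetical) POS tagger on one sentence.
sentence = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]

tags = [tag for _, tag in sentence]        # the observation sequence
labels = ["non-break"] * (len(tags) - 1)   # juncture labels to be predicted

assert all(t in tagset for t in tags)
```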
The algorithm uses a Markov model in which states represent juncture types, and the transitions between states represent the likelihood of particular sequences of breaks and non-breaks occurring. Each state has an observation probability distribution specifying how likely that state is to have produced a given sequence of POS tags. The state observation probabilities are called the POS sequence model, and the set of transition probabilities is called the phrase break model. Bayes' rule is used to combine the two, and the most likely juncture sequence for a given input is found by searching through the model and picking the most likely path.
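The search for the most likely path can be sketched with a standard Viterbi decoder over the two juncture states. All probabilities below are made-up placeholders, and for simplicity each tag position emits one juncture state (a simplification of the paper's setup, where the POS sequence model scores a window of tags around each juncture).

```python
import math

# Two juncture states, with placeholder transition probabilities
# (the phrase break model) and initial probabilities.
states = ["break", "non-break"]
trans = {("break", "break"): 0.1, ("break", "non-break"): 0.9,
         ("non-break", "break"): 0.3, ("non-break", "non-break"): 0.7}
init = {"break": 0.2, "non-break": 0.8}

def obs_prob(state, tag):
    # Placeholder for the POS sequence model P(tag context | state);
    # these numbers are invented for illustration.
    table = {"break": {"DET": 0.1, "NOUN": 0.5, "VERB": 0.2, "PREP": 0.2},
             "non-break": {"DET": 0.3, "NOUN": 0.2, "VERB": 0.3, "PREP": 0.2}}
    return table[state][tag]

def viterbi(tags):
    """Return the most likely juncture-state sequence for a tag sequence."""
    v = [{s: math.log(init[s]) + math.log(obs_prob(s, tags[0]))
          for s in states}]
    back = []
    for tag in tags[1:]:
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[-1][p] + math.log(trans[(p, s)]))
            col[s] = (v[-1][best] + math.log(trans[(best, s)])
                      + math.log(obs_prob(s, tag)))
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Working in log probabilities, as here, avoids numerical underflow on long sequences without changing which path is chosen.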
Training the model involves estimating the POS sequence model (the observation probabilities) and the phrase break model (the transition probabilities).
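A hedged sketch of this training step: both sets of probabilities can be estimated by relative-frequency counts over the hand-labelled corpus (maximum-likelihood estimation). The tiny corpus below is invented for illustration; the paper's models may use longer tag contexts and smoothing.

```python
from collections import Counter

# Invented training data: (POS tag, juncture state) pairs from a
# hypothetical hand-labelled database.
corpus = [
    ("DET", "non-break"), ("NOUN", "break"), ("VERB", "non-break"),
    ("DET", "non-break"), ("NOUN", "break"),
]
state_seq = [s for _, s in corpus]

# Phrase break model: P(state_t | state_{t-1}) from bigram counts.
bigrams = Counter(zip(state_seq, state_seq[1:]))
unigrams = Counter(state_seq[:-1])
trans = {(p, s): bigrams[(p, s)] / unigrams[p] for (p, s) in bigrams}

# POS sequence model (simplest form): P(tag | state) from joint counts.
state_counts = Counter(state_seq)
joint = Counter(corpus)
obs = {(tag, s): joint[(tag, s)] / state_counts[s] for (tag, s) in joint}
```

By construction the transition probabilities out of each state sum to one, as do the observation probabilities conditioned on each state.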