Next: Varying the Order of Up: pos_phrase_nice Previous: Minor and Major

Using Distance Information: the Phrase Break Model

It has been previously noted [Gee and Grosjean, 1983], [Bachenko and Fitzpatrick, 1990], [Wang and Hirschberg, 1992] [Sanders and Taylor, 1995] that phrases occur at somewhat regular intervals and that the likelihood of a break occurring increases with the distance since the last break. The distribution of phrase lengths in words is shown in figure 3. It is clear that the majority of phrases are between 3 and 6 words long.

Previous work has attempted to use this information [Wang and Hirschberg, 1992] [Sanders and Taylor, 1995] but has been hampered by the problem that in a real situation we cannot be certain where the previous break has occurred, and hence it is impossible to reliably compute the distance since the last break. One could simply work in a left to right fashion, placing breaks where they seem most likely and then using that as a definite point to measure the distance to the next hypothesised break. Although simple, this approach is not recommended as a single error cannot be recovered from and may cause errors in all the subsequent decisions. We adopt a different solution, made possible by the use of probabilities throughout the system. Although we can't say when the previous break to juncture j_i occurred, we can estimate the probability that a break occurred at any of the previous junctures j_i-1, j_i-2, j_i-3 and thus try to examine all the possibilities.

**Figure 3:** Plot of frequency of occurrence of phrases of different lengths in words
$\begin{figure} \centering\leavevmode \epsfxsize = 7 cm \epsffile{border1.eps}\end{figure}$

Instead of explicitly modelling the distance from the last break, we use a n-gram model which gives the probability of all sequences of juncture types in a window of size N. (Note this window is completely separate from the window used in the POS model). N-Grams are the most popular choice of language model in speech recognition systems. The number of possible n-grams is given by T^N, where T is the size of the vocabulary. Because of this, speech recognition language models are usually bigrams or trigrams. Here T is the number of juncture types and in our standard system we have a vocabulary of T=2 (break and non-break) and so it is possible to model n-grams as high as 6 and 7 robustly.

The overall goal of the algorithm is to find the most likely sequence of juncture types for the input POS sequence. The probability of a sequence of juncture types (P(J|C)) is calculated using Bayes' Rule by multiplying the POS sequence likelihood by the phrase probability. In principle, the probability of all possible sequences of juncture types is calculated and the highest is chosen. If calculated directly this entails ST^S possibilities where S is the number of words in the sentence. Even for the case when the juncture types are simply break and non-break (T=2), the number of paths can be very large for long sentences (e.g. over a million for a 20 word sentence) and if the T is larger the number of paths becomes intractable. Instead we use the Viterbi algorithm which finds the most likely path in about T²N calculations. For a detailed explanation see Rabiner and Juang rabiner&juang:94.

Varying the Order of the N-gram

Next: Varying the Order of Up: pos_phrase_nice Previous: Minor and Major

Alan W Black
1999-03-20