Next: Using Distance Information: the Up: Part-of-Speech Sequence Models Previous: Varying POS sequence length


Minor and Major


Table 4: Minor and major phrase breaks

Phrase break model   Minor correct   Major correct   Junctures correct   Juncture insertions
1-gram               60.125          56.415          90.128              2.035
6-gram               68.703          59.502          90.211              3.923

Most of our work has centred on a model with two types of juncture, "break" and "non-break", but the algorithm can easily be applied to an arbitrary number of juncture types. The MARSEC corpus is in fact labelled with two types of break, "major" and "minor", so it was straightforward to build a system that predicts both. The method is as before, but three POS sequence models are now required (one each for non-break, minor break and major break), and the phrase break model has a vocabulary of 3, which implies higher perplexities and more severe sparse data problems. Note that when more than one type of break is used, substitution errors must also be counted: these occur when a major break is placed where a minor break occurs, or vice versa. Table 4 reports the results.
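The error bookkeeping described above can be sketched as follows. This is only an illustration, not the scoring code used for Table 4: the label symbols (`N`, `m`, `M`), the function name `score_junctures` and the toy sequences are all assumptions, and the exact definitions behind the table's columns are not given in this section. The sketch assumes one reference label and one predicted label per word juncture.

```python
from collections import Counter

# Hypothetical label symbols for the three juncture types (an assumption).
NON_BREAK, MINOR, MAJOR = "N", "m", "M"


def score_junctures(reference, predicted):
    """Count correct junctures and the three kinds of error.

    insertion:    a break predicted where the reference has a non-break
    deletion:     a non-break predicted where the reference has a break
    substitution: a major break predicted for a minor one, or vice versa
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have one label per juncture")

    counts = Counter()
    for ref, hyp in zip(reference, predicted):
        if ref == hyp:
            counts["correct"] += 1
        elif ref == NON_BREAK:
            counts["insertion"] += 1
        elif hyp == NON_BREAK:
            counts["deletion"] += 1
        else:
            # Both labels are breaks, but of different types.
            counts["substitution"] += 1
    return counts


# Toy data, invented purely for illustration.
reference = ["N", "m", "N", "N", "M", "N", "m", "M"]
predicted = ["N", "M", "N", "m", "M", "N", "N", "M"]
scores = score_junctures(reference, predicted)
print(dict(scores))
```

With only "break" and "non-break" the final branch can never be reached, which is why substitution errors only become relevant once a second break type is introduced.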

We have not yet analysed multiple break types as comprehensively as the single type, but have noted that the best tagsets for the two cases differ. While the V23 tagset was best for the single break case, larger tagsets gave better results for the multiple break case, despite the more severe sparse data problems they bring. This suggests that finer distinctions are needed to discriminate between major and minor breaks.


Alan W Black
1999-03-20