
Punctuation, Content Words and Function Words

First, we report results from two deterministic algorithms: one that uses punctuation only, and another that adds the content/function word distinction. No training is necessary in either case. Row 1 (Det. P) in Table 1 shows the results for the deterministic punctuation-only algorithm, which simply places a phrase break after every punctuation mark in the text. As one might expect, punctuation is a very reliable indicator of phrase-break presence: the insertion rate is very low at 0.852%, showing that punctuation nearly always implies a phrase break. This algorithm correctly finds about half the breaks and inserts very few false ones.
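A minimal sketch of this deterministic rule follows; the particular punctuation set is an assumption for illustration, not taken from the paper.

```python
# Sketch of the deterministic punctuation-only algorithm (Det. P):
# place a phrase break after every punctuation token.
# The punctuation set is a hypothetical choice, not the paper's.
PUNCTUATION = {",", ".", ";", ":", "?", "!"}

def det_p_breaks(tokens):
    """Return a break/no-break decision for the juncture after each token."""
    return ["break" if tok in PUNCTUATION else "no-break" for tok in tokens]

tokens = ["He", "left", ",", "and", "she", "stayed", "."]
print(det_p_breaks(tokens))
# -> ['no-break', 'no-break', 'break', 'no-break', 'no-break', 'no-break', 'break']
```

Because the rule fires only at punctuation, it can never insert a break mid-clause, which is why its insertion rate is so low.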

Row 2 (Det. PCF) shows the results when breaks are placed after every punctuation mark and before every function word that follows a content word. The number of breaks correct increases considerably, but only because of a massive over-prediction of break placement. The junctures-correct score and the insertion score are the worst of any experiment described here.
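The Det. PCF rule can be sketched as below; the word lists used to classify tokens are hypothetical stand-ins, not the paper's actual function-word inventory.

```python
# Sketch of the Det. PCF rule: break after every punctuation mark and at
# every content-word -> function-word juncture. The word classes below are
# hypothetical illustrations, not the paper's actual lists.
PUNCTUATION = {",", ".", ";", ":", "?", "!"}
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "in", "she", "he"}

def word_class(tok):
    if tok in PUNCTUATION:
        return "punct"
    return "function" if tok.lower() in FUNCTION_WORDS else "content"

def det_pcf_breaks(tokens):
    """Decide break/no-break for the juncture after each token."""
    decisions = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if word_class(tok) == "punct":
            decisions.append("break")
        elif nxt is not None and word_class(tok) == "content" \
                and word_class(nxt) == "function":
            decisions.append("break")
        else:
            decisions.append("no-break")
    return decisions

print(det_pcf_breaks(["dogs", "bark", "in", "the", "night", "."]))
# -> ['no-break', 'break', 'no-break', 'no-break', 'no-break', 'break']
```

Since content-to-function transitions are extremely common in English text, this rule fires at many junctures that carry no prosodic break, which accounts for the over-prediction reported above.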

Figure 2: Plot of junctures-correct for a 6-gram (upper line) and 1-gram (lower line) against K, the size of the tagset.

Rows 3 (Prob P 1-gram) and 4 (Prob P 6-gram) show the results from our algorithm using a POS sequence model with one tag following and one preceding the juncture (L=2, M=1, V={non-punctuation-word, punctuation}). Row 3 shows that the probabilistic punctuation-only algorithm produces very similar results to its deterministic counterpart when a 1-gram phrase break model is used, but that break prediction accuracy increases when a higher-order n-gram phrase break model (N=6) is used (row 4). Row 5 (Prob. PCF 1-gram) shows that dividing the non-punctuation class into content words and function words (L=2, M=1, V={function, content, punctuation}) has no significant effect on performance compared to row 3. If we look at the relevant POS sequence frequency counts in the training data, we find that there are 1751 non-break instances and 1002 break instances for the content-function sequence. Thus a non-break is always more probable, and given a 1-gram phrase-break model, breaks will never be inserted. However, if we use a higher-order phrase-break model, the combined probability becomes high enough to make a break more probable once the distance from the last break exceeds a few words. The figures in row 6 (Prob PCF 6-gram) show this effect.
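The effect of the phrase-break model order on the content-function juncture can be illustrated with the counts above. This is only a sketch, not the paper's model: the two prior functions standing in for the 1-gram and higher-order break models are invented for illustration.

```python
# Counts from the training data for the content -> function juncture.
N_NONBREAK, N_BREAK = 1751, 1002
P_NB = N_NONBREAK / (N_NONBREAK + N_BREAK)   # ~0.636
P_B = N_BREAK / (N_NONBREAK + N_BREAK)       # ~0.364

def decide(distance, break_prior):
    """Combine the juncture likelihood with a phrase-break model prior.
    break_prior(d) is a hypothetical P(break | d words since last break)."""
    p_break = P_B * break_prior(distance)
    p_nobreak = P_NB * (1.0 - break_prior(distance))
    return "break" if p_break > p_nobreak else "no-break"

# A 1-gram break model is blind to distance: with any constant prior that
# keeps non-break more probable, breaks are never inserted.
unigram_prior = lambda d: 0.3
# A higher-order model can let the prior grow with the distance from the
# last break (the exact form here is invented for illustration).
ngram_prior = lambda d: min(0.9, 0.1 * d)

print([decide(d, unigram_prior) for d in (2, 5, 8)])
# -> ['no-break', 'no-break', 'no-break']
print([decide(d, ngram_prior) for d in (2, 5, 8)])
# -> ['no-break', 'no-break', 'break']
```

The constant prior reproduces the row 5 behaviour (no breaks ever inserted at this juncture), while the distance-sensitive prior reproduces the row 6 effect: the same likelihood counts eventually favour a break once enough words have passed since the last one.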

In summary, the deterministic punctuation algorithm is ``safe'' but massively under-predicts, while the deterministic punctuation/content/function word algorithm massively over-predicts. The probabilistic counterparts only perform acceptably if a high-order phrase break model is used.

Alan W Black