Next: Varying POS sequence length Up: Part-of-Speech Sequence Models Previous: Larger POS Tagsets

Smoothing POS Sequence Models

The standard model uses 23 POS tags in a POS sequence model of length 3, which gives 23^3 = 12,167 possible observations. With a training set of only 31,707 words, a large number of POS sequences will never occur or will occur only once. In the basic model, sequences with zero counts are assigned a small fixed floor probability. These cases are not particularly important, as the chance of a break or non-break being inserted is then governed by the phrase break model. More worrying are single occurrences. If a POS sequence is observed only once, with a break at the juncture, it is assigned the same probability as a POS sequence for which a large number of breaks and zero non-breaks are observed. Clearly the second case is a much better indicator that the POS sequence in question really does carry a high likelihood of a break.
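The sparsity argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the tagset size, sequence length and training-set size come from the text, while the helper `break_proportion` and the counts of 1 and 40 are hypothetical, and the helper looks at the simple proportion of breaks for one POS sequence rather than the full likelihood model.

```python
# Figures taken from the text.
NUM_TAGS = 23
SEQ_LEN = 3
TRAINING_WORDS = 31_707

# Size of the observation space for POS sequences of length 3.
possible_sequences = NUM_TAGS ** SEQ_LEN

# Average number of training tokens available per possible sequence.
avg_tokens_per_sequence = TRAINING_WORDS / possible_sequences


def break_proportion(breaks, non_breaks):
    """Unsmoothed proportion of junctures with a break for one POS sequence.

    Hypothetical helper: a simplified stand-in for the unsmoothed estimate.
    """
    return breaks / (breaks + non_breaks)


# Hypothetical counts: one sequence seen once, another seen 40 times,
# both always followed by a break.
rare = break_proportion(1, 0)
frequent = break_proportion(40, 0)

print(possible_sequences)        # 12167
print(rare == frequent)          # True: the estimate ignores the evidence size
```

On average fewer than three training tokens are available per possible sequence, and the unsmoothed estimate cannot distinguish a single observation from forty consistent ones.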

Table 2: Effect of smoothing on accuracy

Experiment    Phrase Break Model    Breaks-Correct    Junctures-Correct    Juncture-Insertions
unsmoothed    1-gram                69.940            91.56                3.600
smoothed      1-gram                68.376            91.46                3.227
unsmoothed    6-gram                77.070            91.49                5.270
smoothed      6-gram                79.274            91.60                5.569

To counter this problem we employ a smoothing technique which adjusts the frequency counts of rare and non-occurring POS sequences. First, Good-Turing smoothing (explained in Church and Gale, 1991) is used to adjust the frequency counts of all occurrences for both the break and non-break models. This effectively gives zero counts a small value and reduces the counts of rare cases. Next, a form of backing-off is applied whereby a juncture likelihood $P(c_{k-1}, c_k, c_{k+1} \mid j_k)$ is discarded if its adjusted frequency count falls below a threshold, and the estimate $P(c_k, c_{k+1} \mid j_k)$ is used instead. A threshold of 3 usually gave the best results. Table 2 compares results for the V23 tagset with and without smoothing. With the 6-gram phrase break model, the smoothed POS sequence models are significantly better than their unsmoothed equivalents: breaks-correct rises from 77.070 to 79.274 and junctures-correct from 91.49 to 91.60, at the cost of only a slight increase in juncture insertions (5.270 to 5.569).
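The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it applies the basic Good-Turing formula r* = (r + 1) N_{r+1} / N_r without the additional smoothing of the N_r values that a full treatment uses, it keeps the raw count wherever N_{r+1} is zero, and it returns adjusted counts rather than normalised probabilities. The function names and the toy counts are hypothetical.

```python
from collections import Counter

THRESHOLD = 3  # the value the text reports as usually giving the best results


def good_turing_adjusted(counts, num_possible):
    """Map each raw count r to a basic Good-Turing adjusted count r*.

    counts: Counter of observed sequences for one juncture type.
    num_possible: size of the event space, used to derive N_0.
    Simplification: where N_{r+1} is 0 the raw count r is kept.
    """
    freq_of_freq = Counter(counts.values())        # N_r for r >= 1
    freq_of_freq[0] = num_possible - len(counts)   # N_0: unseen sequences
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        if n_r > 0 and n_next > 0:
            adjusted[r] = (r + 1) * n_next / n_r
        else:
            adjusted[r] = float(r)
    return adjusted


def backed_off_count(trigram, tri_counts, bi_counts, tri_adj, bi_adj,
                     threshold=THRESHOLD):
    """Return (order, adjusted count), backing off to (c_k, c_{k+1})."""
    r_star = tri_adj[tri_counts.get(trigram, 0)]
    if r_star >= threshold:
        return 3, r_star
    bigram = trigram[1:]                           # drop c_{k-1}
    return 2, bi_adj[bi_counts.get(bigram, 0)]


# Hypothetical toy counts for the break model only.
tri_counts = Counter({
    ("DT", "NN", "IN"): 5,
    ("NN", "IN", "DT"): 1,
    ("JJ", "NN", "VB"): 1,
    ("NN", "VB", "DT"): 2,
})
bi_counts = Counter()
for tri, c in tri_counts.items():
    bi_counts[tri[1:]] += c

tri_adj = good_turing_adjusted(tri_counts, 23 ** 3)
bi_adj = good_turing_adjusted(bi_counts, 23 ** 2)

frequent = backed_off_count(("DT", "NN", "IN"), tri_counts, bi_counts, tri_adj, bi_adj)
rare = backed_off_count(("NN", "IN", "DT"), tri_counts, bi_counts, tri_adj, bi_adj)
```

In this toy data the frequent trigram keeps its length-3 estimate, while the trigram seen once falls below the threshold and is replaced by the length-2 estimate. The same procedure would be run separately on the non-break counts.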

The table shows that smoothing significantly increases performance when used with a high-order n-gram phrase break model.

Alan W Black