Smoothing POS Sequence Models

The standard model uses 23 POS tags in a POS sequence model of length 3, which gives 23³ = 12,167 possible observations. With a training set of only 31,707 words, a large number of POS sequences will never occur or will occur only once. In the basic model, sequences with zero counts are assigned a small fixed floor probability. These cases are not particularly important, as the chance of a break or non-break being inserted is then governed by the phrase break model. More worrying are single occurrences. If a POS sequence is observed only once, with a break at the juncture, it is assigned the same probability as a POS sequence for which a large number of breaks and zero non-breaks are observed. Clearly the second case is a better indicator that the POS sequence in question really does carry a high likelihood of a break.
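The sparsity problem can be made concrete with a little arithmetic. The following minimal Python sketch assumes a plain relative-frequency estimate of a break given a POS sequence; the function name `break_relative_frequency` and the count of 50 are illustrative choices, not values taken from the experiments.

```python
NUM_TAGS = 23            # size of the POS tagset
WINDOW = 3               # length of the POS sequence model
TRAINING_WORDS = 31_707  # size of the training set

# Number of distinct POS sequences the model can observe.
possible_sequences = NUM_TAGS ** WINDOW

# Average number of training words available per possible sequence.
words_per_sequence = TRAINING_WORDS / possible_sequences


def break_relative_frequency(breaks, non_breaks):
    """Unsmoothed estimate of a break at a juncture (illustrative assumption)."""
    return breaks / (breaks + non_breaks)


# One observation and fifty observations yield the same estimate.
single_observation = break_relative_frequency(1, 0)
many_observations = break_relative_frequency(50, 0)

print(possible_sequences, round(words_per_sequence, 2))
print(single_observation, many_observations)
```

With fewer than three training words per possible sequence on average, singletons are unavoidable, and the unsmoothed estimate cannot distinguish them from well-attested sequences.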

Table 2: Effect of smoothing on accuracy

Experiment	Phrase Break Model	Breaks-Correct	Junctures-Correct	Juncture-Insertions
unsmoothed	1-gram	69.940	91.56	3.600
smoothed	1-gram	68.376	91.46	3.227
unsmoothed	6-gram	77.070	91.49	5.270
smoothed	6-gram	79.274	91.60	5.569

To counter this problem we employ a smoothing technique which adjusts the frequency counts of rare and non-occurring POS sequences. First, Good-Turing smoothing (explained in Church and Gale, 1991) is used to adjust the frequency counts of all POS sequences in both the break and non-break models. This effectively gives zero counts a small value and reduces the counts of rare cases. Next, a form of backing-off is applied whereby a juncture likelihood P(c_{k-1}, c_k, c_{k+1} | j_k) is discarded if its adjusted frequency count falls below a threshold, and the estimate P(c_k, c_{k+1} | j_k) is used instead. A threshold of 3 usually gave the best results. Table 2 compares results for the V₂₃ tagset with and without smoothing. The smoothed POS sequence models with the 6-gram phrase break model are significantly better than the unsmoothed equivalents, with both breaks-correct and junctures-correct increasing at the cost of only a slight rise in juncture insertions.

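The table shows that smoothing significantly increases performance when used with a high-order n-gram phrase break model: with the 6-gram model breaks-correct rises from 77.070 to 79.274, whereas with the 1-gram model it falls slightly, from 69.940 to 68.376.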
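The two-stage smoothing procedure (Good-Turing adjustment followed by thresholded back-off) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it applies the basic Good-Turing formula r* = (r + 1) N_{r+1} / N_r without any smoothing of the N_r values themselves, keeps the raw count when N_r or N_{r+1} is zero, and normalises by the raw total count. The function names, the three-tag toy tagset and all counts are hypothetical.

```python
from collections import Counter


def good_turing_adjuster(counts, num_possible):
    """Return a function mapping a raw count r to r* = (r + 1) * N_{r+1} / N_r."""
    freq_of_freq = Counter(counts.values())       # N_r for observed sequences
    freq_of_freq[0] = num_possible - len(counts)  # N_0: sequences never seen

    def adjusted(r):
        n_r = freq_of_freq.get(r, 0)
        n_next = freq_of_freq.get(r + 1, 0)
        if n_r == 0 or n_next == 0:
            # Assumption: fall back to the raw count when the formula is undefined
            # or would collapse to zero.
            return float(r)
        return (r + 1) * n_next / n_r

    return adjusted


def juncture_likelihood(trigram, tri_counts, bi_counts, tri_adjust, bi_adjust,
                        total, threshold=3.0):
    """Estimate P(c_{k-1}, c_k, c_{k+1} | j_k), backing off to P(c_k, c_{k+1} | j_k)."""
    r_star = tri_adjust(tri_counts.get(trigram, 0))
    if r_star >= threshold:
        return r_star / total
    # Adjusted count too low: discard the trigram and use the last two tags.
    bigram = trigram[1:]
    return bi_adjust(bi_counts.get(bigram, 0)) / total


# Hypothetical counts for the break model over a three-tag toy tagset.
TAGS = ("N", "V", "D")
break_tri_counts = {
    ("N", "V", "D"): 5,
    ("N", "N", "V"): 2,
    ("D", "N", "V"): 1,
    ("V", "D", "N"): 1,
    ("D", "D", "N"): 1,
}

# Derive the (c_k, c_{k+1}) counts by dropping the first tag of each trigram.
break_bi_counts = Counter()
for (_, c_k, c_next), n in break_tri_counts.items():
    break_bi_counts[(c_k, c_next)] += n

total = sum(break_tri_counts.values())
tri_adjust = good_turing_adjuster(break_tri_counts, len(TAGS) ** 3)
bi_adjust = good_turing_adjuster(break_bi_counts, len(TAGS) ** 2)

frequent = juncture_likelihood(("N", "V", "D"), break_tri_counts, break_bi_counts,
                               tri_adjust, bi_adjust, total)
rare = juncture_likelihood(("D", "N", "V"), break_tri_counts, break_bi_counts,
                           tri_adjust, bi_adjust, total)
print(tri_adjust(0), tri_adjust(1), frequent, rare)
```

In this toy example the singleton count is reduced to about 0.67 and unseen sequences receive a small positive count. The frequent sequence keeps its trigram estimate, while the singleton falls below the threshold of 3 and is estimated from its last two tags instead.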