

Part-of-Speech Tagging


Table 1: Comparison of basic models (all figures are percentages).

Experiment         Breaks-correct   Junctures-correct   Juncture-insertions
Det. P                 54.274            90.758                 0.852
Det. PCF               84.402            71.288                31.728
Prob P 1-gram          54.986            91.099                 0.799
Prob P 6-gram          58.547            88.006                 5.385
Prob PCF 1-gram        54.886            91.099                 0.799
Prob PCF 6-gram        68.305            89.393                 5.849

We use a standard HMM-based tagging framework of the kind used in many systems (e.g. [DeRose, 1988]). The model consists of two parts: an n-gram model of part-of-speech sequences, and a model of the likelihood of each word given its part-of-speech tag. These are combined using Bayes' theorem, and the Viterbi algorithm is used to find the most probable part-of-speech sequence for a given word sequence.
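As a concrete illustration, the sketch below shows Viterbi decoding for such a tagger. It assumes bigram transition probabilities and word-given-tag likelihoods stored in plain dictionaries; the names (trans, emit), the probability floor, and the bigram simplification are assumptions made for the example, not details of our implementation, which uses longer n-gram histories in the same recursion.

    import math

    def viterbi(words, tags, trans, emit, start="<s>"):
        """Most probable tag sequence for `words` under a bigram HMM.

        trans[(prev, tag)] -- transition probability P(tag | prev)
        emit[(word, tag)]  -- likelihood P(word | tag)
        (Both names are illustrative; unseen events get a small floor.)
        """
        floor = 1e-12
        # best[i][t]: log-prob of the best path ending in tag t at word i
        best = [{} for _ in words]
        back = [{} for _ in words]
        for t in tags:
            best[0][t] = (math.log(trans.get((start, t), floor)) +
                          math.log(emit.get((words[0], t), floor)))
            back[0][t] = None
        for i in range(1, len(words)):
            for t in tags:
                e = math.log(emit.get((words[i], t), floor))
                score, prev = max(
                    (best[i - 1][p] + math.log(trans.get((p, t), floor)) + e, p)
                    for p in tags)
                best[i][t], back[i][t] = score, prev
        # Trace back from the best final tag
        tag = max(best[-1], key=best[-1].get)
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            tag = back[i][tag]
            path.append(tag)
        return list(reversed(path))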

Although the Spoken English Corpus is marked with POS tags, it contains too few words to train an HMM POS tagger. Instead we used the Penn Treebank [Marcus et al., 1993], which consists of around 1.2 million words from the Wall Street Journal (WSJ). Apart from size, we do not believe the two corpora differ significantly in POS behaviour. The words in the WSJ data were tagged automatically and then hand-corrected. Punctuation is collapsed to a single tag, giving a base tagset of K=37. A generic unknown-word POS distribution is built from the POS distributions of a set of less frequent words, and a separate distribution is used for words consisting only of digits. The tagger correctly tagged 94.4% of the words in an independent test set of 113,000 words.
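The unknown-word handling just described can be sketched as follows; the frequency cutoff and the function names are assumptions made for the example only:

    import re
    from collections import Counter

    def unknown_word_distribution(tagged_corpus, max_freq=5):
        """Pool the tag counts of infrequent words into one generic POS
        distribution for unknown words (the cutoff of 5 is illustrative)."""
        word_freq = Counter(w for w, _ in tagged_corpus)
        pooled = Counter(t for w, t in tagged_corpus
                         if word_freq[w] <= max_freq)
        total = sum(pooled.values())
        return {t: c / total for t, c in pooled.items()}

    def tag_distribution(word, lexicon, unknown_dist, digit_dist):
        """Pick the POS distribution for a word: its own if seen in
        training, the digit distribution if it is all digits, otherwise
        the generic unknown-word distribution."""
        if word in lexicon:
            return lexicon[word]
        if re.fullmatch(r"\d+", word):
            return digit_dist
        return unknown_dist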

The parameters of our POS sequence model are estimated from POS tag occurrences, and it is clear that while the full tagset is potentially the most discriminative, it also leads to sparse-data problems. A series of experiments found that a tagset of size 23 was the best overall. The reduction in size can be carried out in two ways: by mapping the tagger's output onto the smaller tagset, or by training the tagger directly on the smaller set. Post-mapping the tagset gave a tagging accuracy of 97.0%, while training and testing on the reduced set gave a worse figure of 96.2%. Hence we always use the full tagset for POS tagging and reduce the set afterwards.
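Post-mapping amounts to a simple table lookup over the tagger's full-tagset output. The entries below are purely illustrative, since the actual 37-to-23 mapping is not reproduced here:

    # Hypothetical fragment of a full-to-reduced tagset mapping.
    TAG_MAP = {
        "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
        "VB": "V", "VBD": "V", "VBG": "V",
        "VBN": "V", "VBP": "V", "VBZ": "V",
        "JJ": "J", "JJR": "J", "JJS": "J",
    }

    def post_map(tag_sequence, tag_map=TAG_MAP):
        """Map full-tagset output onto the reduced set; tags without an
        entry are left unchanged."""
        return [tag_map.get(t, t) for t in tag_sequence]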

