Part of speech tagging

The POS tags for our training and test data are assigned by a fully automatic POS tagger. Part-of-speech tagging is by now a fairly mature field, and we simply follow established technology. We use a standard HMM-based tagger (as in [4]), which estimates the probability of a part-of-speech tag sequence given a sequence of words. To train the tagger we use the WSJ corpus in the Penn Treebank [7], which consists of around one million words. The Treebank's basic tagset, after some simple reduction, consists of 37 tags. We treat each punctuation symbol as a word with the tag punc. Using a tri-gram model we achieved 94.03% accuracy on a held-out test set. Given our experiments on tagset size described below, we also investigated the accuracy of POS taggers that use a reduced tagset. We found that reducing the tagset first and then building a model gives better results (96.18%) than using the full 37 tags. Even better results (97.04%) were achieved by tagging the data with the full tagset and then reducing the output to the smaller set. Hence we use a tri-gram model built on the full 37-tag tagset and reduce its output as required.
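The "tag with the full tagset, then reduce" step can be sketched as follows. This is a minimal illustration, not the tagger used here: the mapping REDUCE, the helper names and the toy tag sequences are all hypothetical, since the actual reduced tagset is not listed in this section. It shows one way measured accuracy can rise after reduction: confusions between tags that fall into the same reduced class no longer count as errors.

```python
# Hypothetical reduction map (an assumption for illustration only);
# Penn Treebank tags on the left, invented reduced classes on the right.
REDUCE = {
    "NN": "n", "NNS": "n", "NNP": "n", "NNPS": "n",
    "VB": "v", "VBD": "v", "VBG": "v", "VBN": "v", "VBP": "v", "VBZ": "v",
    "JJ": "j", "JJR": "j", "JJS": "j",
}


def reduce_tags(tags, mapping=REDUCE):
    """Map each full-tagset tag to its reduced class; unmapped tags pass through."""
    return [mapping.get(tag, tag) for tag in tags]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have equal length")
    if not gold:
        raise ValueError("tag sequences must be non-empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy data, invented for illustration (not taken from the WSJ corpus).
gold = ["DT", "NN", "VBD", "DT", "JJ", "NNS", "punc"]
predicted = ["DT", "NN", "VBN", "DT", "JJ", "NN", "punc"]

# Scored against the full tagset: VBN/VBD and NN/NNS count as errors.
full_acc = accuracy(predicted, gold)

# Scored after reducing both sequences: those two confusions collapse
# into the same class and are no longer errors.
reduced_acc = accuracy(reduce_tags(predicted), reduce_tags(gold))

print(f"full tagset: {full_acc:.4f}  reduced tagset: {reduced_acc:.4f}")
```

In this sketch the tagger itself is left out; `predicted` stands in for the output of a tri-gram model trained on the full tagset, and only the post-hoc reduction and scoring are shown.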

Alan W Black
Tue Jul 1 17:09:00 BST 1997