Exploiting Dictionaries in Named Entity Extraction via Semi-Markov Models

Sunita Sarawagi


  Entities involved in typical information extraction (IE) tasks often consist of multiple words or tokens. Most state-of-the-art named entity recognition (NER) systems perform sequential labeling, in which each token of the input is assigned a label. We argue that NER tasks should instead classify segments of multiple adjacent words rather than single words. We formalize this as a semi-Markov process, which relaxes the usual Markov assumptions of word-based labeling tasks. This formalism allows the direct use of entity-level rather than word-level features, and provides a more natural formulation of the NER problem than sequential word classification. In particular, it gives a natural way to incorporate noisy external dictionaries of multi-word entities through high-performance string similarity measures from the record linkage literature. I will present how Conditional Random Fields (CRFs), a popular and high-performance IE model, can be extended to perform such semi-Markov sequential labeling. Experiments in multiple domains show that the new model can substantially improve extraction performance relative to previously published methods for using external dictionaries in NER.

(This is joint work with William Cohen)
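
The segment-level decoding idea in the abstract can be illustrated with a minimal sketch. It is not the model from the talk: the scores below are hand-set rather than learned CRF weights, token-set Jaccard overlap stands in for the record-linkage similarity measures, and the names `jaccard`, `make_scorer`, `semi_markov_viterbi`, the labels "O" and "ENT", and the 0.4 offset are all assumptions made for illustration. The point it shows is that a whole candidate segment, not a single word, is scored against the dictionary.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
Scorer = Callable[[Sequence[str], int, int, str, Optional[str]], float]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Token-set Jaccard similarity (a simple stand-in similarity measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def make_scorer(dictionary: List[str]) -> Scorer:
    """Build a hand-set segment scorer that uses a noisy dictionary (illustrative only)."""
    entries = [entry.lower().split() for entry in dictionary]

    def score(tokens, start, end, label, prev_label):
        # prev_label is unused here; a trained model could condition on it.
        words = [t.lower() for t in tokens[start:end]]
        if label == "ENT":
            # Entity-level feature: similarity of the whole segment to any entry.
            sim = max((jaccard(words, e) for e in entries), default=0.0)
            return sim - 0.4  # assumed offset so dissimilar segments score below "O"
        # Non-entity segments are restricted to single tokens with score 0.
        return 0.0 if end - start == 1 else float("-inf")

    return score


def semi_markov_viterbi(tokens: Sequence[str], labels: List[str],
                        score: Scorer, max_len: int) -> List[Segment]:
    """Return the highest-scoring segmentation with segments of at most max_len tokens."""
    n = len(tokens)
    if n == 0:
        return []
    neg_inf = float("-inf")
    # best[i][y]: best score of a segmentation of tokens[:i] whose last segment has label y.
    best = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(max_len, i) + 1):
                j = i - d  # candidate segment is tokens[j:i]
                prevs = [(None, 0.0)] if j == 0 else [(p, best[j][p]) for p in labels]
                for prev_label, base in prevs:
                    if base == neg_inf:
                        continue
                    s = base + score(tokens, j, i, y, prev_label)
                    if s > best[i][y]:
                        best[i][y] = s
                        back[i][y] = (j, prev_label)
    # Trace back from the best final label.
    y = max(labels, key=lambda lab: best[n][lab])
    i, segments = n, []
    while i > 0:
        j, prev_label = back[i][y]
        segments.append((j, i, y))
        i, y = j, prev_label
    segments.reverse()
    return segments


if __name__ == "__main__":
    tokens = "talk by William W Cohen today".split()
    scorer = make_scorer(["William Cohen"])
    print(semi_markov_viterbi(tokens, ["O", "ENT"], scorer, max_len=3))
```

In this toy run the three tokens "William W Cohen" come out as a single "ENT" segment even though the dictionary entry is only "William Cohen", because the segment as a whole is compared with the entry. A token-by-token labeler would have to judge "W" in isolation.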


Pradeep Ravikumar
Last modified: Thu Apr 29 18:23:24 EDT 2004