Corpus-based WSD

In the last fifteen years, empirical and statistical approaches have had a steadily growing impact on NLP. Of increasing interest are algorithms and techniques from the machine-learning (ML) community, which have been applied to a large variety of NLP tasks with remarkable success. The reader can find an excellent introduction to ML, and its relation to NLP, in Mitchell1997, Manning1999, and Cardie1999. The NLP problems initially addressed by statistical and machine-learning techniques are those of language-ambiguity resolution, in which the correct interpretation must be selected from a set of alternatives in a particular context (e.g., word-choice selection in speech recognition or machine translation, part-of-speech tagging, word-sense disambiguation, co-reference resolution, etc.). ML techniques are particularly well suited to these problems because they can be cast as classification problems, which have been studied extensively in the ML community.

Regarding automatic WSD, one of the most successful approaches in the last ten years is supervised learning from examples, in which statistical or ML classification models are induced from semantically annotated corpora and then used to decide which word sense to choose in a given context. Generally, supervised systems have obtained better results than unsupervised ones, a conclusion that is based on experimental work and international competitions². The words in such annotated corpora are tagged manually using semantic classes taken from a particular lexical semantic resource (most commonly WordNet).
Many standard ML techniques have been tried, including Bayesian learning Bruce1994, maximum entropy suarezCICLING2002, exemplar-based learning Ng1997, Hoste2002, decision lists Yarowsky1994, Agirre2001, neural networks Towell1998, and, recently, margin-based classifiers such as boosting Escudero2000ecml and support vector machines Cabezas2001.
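
To make the classification view concrete, the following is a minimal sketch of a supervised WSD classifier in the Bayesian family. It is an illustration only and not the method of any of the works cited above: the toy corpus, the sense labels (which are not real WordNet identifiers), the function names `train_naive_bayes` and `disambiguate`, and the use of bag-of-words context features with add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus for the target word "bank":
# (tokenised context, manually assigned sense label).
TRAIN = [
    ("deposit money in the bank account".split(), "bank/finance"),
    ("the bank approved the loan".split(), "bank/finance"),
    ("the bank raised interest rates".split(), "bank/finance"),
    ("we sat on the river bank".split(), "bank/river"),
    ("the boat drifted to the muddy bank".split(), "bank/river"),
]


def train_naive_bayes(examples):
    """Count sense frequencies and per-sense context-word frequencies."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab


def disambiguate(context, model):
    """Return the sense maximising log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train_naive_bayes(TRAIN)
print(disambiguate("she opened a bank account to deposit money".split(), model))
# bank/finance
print(disambiguate("a boat near the muddy river bank".split(), model))
# bank/river
```

Even in this toy setting, every training example had to be labelled by hand, and a separate set of examples is needed for each ambiguous word; realistic systems also use richer features (collocations, part-of-speech tags, syntactic relations) than the plain context words assumed here.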

Corpus-based methods are called "supervised" when they learn from previously sense-annotated data, and they therefore usually require a large amount of human intervention to annotate the training data Ng1997. Although several attempts have been made to reduce this annotation effort (e.g., Leacock1998, Mihalcea1999, Cuadros2004), the knowledge acquisition bottleneck (too many languages, too many words, too many senses, too many examples per sense) is still an open problem that poses serious challenges to the supervised learning approach for WSD.

Footnotes

² http://www.senseval.org