Word-Sense Disambiguation

Word sense ambiguity is a central problem for many established Human Language Technology applications (e.g., machine translation, information extraction, question answering, information retrieval, text classification, and text summarization) Ide1998. This is also the case for associated subtasks (e.g., reference resolution, acquisition of subcategorization patterns, parsing, and, obviously, semantic interpretation). For this reason, many international research groups are working on WSD, using a wide range of approaches. However, to date, no large-scale, broad-coverage, accurate WSD system has been built Snyder2004. With current state-of-the-art accuracy in the range 60-70%, WSD is one of the most important open problems in NLP.

Although most WSD techniques are presented as stand-alone methods, we believe, following McRoy1992, that full-fledged lexical ambiguity resolution will require integrating several information sources and techniques.

In this paper, we present two complementary WSD methods based on two different methodological approaches, one knowledge-based and one corpus-based, as well as several methods that combine both into hybrid approaches.

The knowledge-based method disambiguates nouns by matching context with information from a prescribed knowledge source. WordNet is used because it combines the characteristics of a dictionary and a structured semantic network: it provides definitions for the different senses of English words and groups synonymous words into synsets, each of which represents a distinct lexical concept. WordNet also organizes words into a conceptual structure by representing a number of semantic relationships (hyponymy, hypernymy, meronymy, etc.) among synsets.
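As a rough illustration of matching context against a structured lexical resource, the following sketch uses a tiny hand-made sense inventory and a simplified gloss-overlap (Lesk-style) heuristic. It is not the method proposed in this paper and does not use the WordNet API: the `Synset` class, the `INVENTORY` entries, their glosses, and the function names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Synset:
    """Toy stand-in for a WordNet synset (hypothetical, not the WordNet API)."""
    name: str                       # illustrative identifier
    lemmas: Tuple[str, ...]         # synonymous words sharing this concept
    gloss: str                      # invented definition, not a real WordNet gloss
    hypernym: Optional[str] = None  # name of the parent synset, if any


# Invented toy data: two senses of "bank" plus one hypernym for each.
INVENTORY = {
    "bank.n.01": Synset("bank.n.01", ("bank",),
                        "sloping land beside a body of water such as a river",
                        "slope.n.01"),
    "bank.n.02": Synset("bank.n.02", ("bank", "depository"),
                        "a financial institution that accepts deposits and lends money",
                        "institution.n.01"),
    "slope.n.01": Synset("slope.n.01", ("slope", "incline"),
                         "an elevated geological formation"),
    "institution.n.01": Synset("institution.n.01", ("institution",),
                               "an organization founded for a purpose"),
}

STOPWORDS = {"a", "an", "the", "of", "that", "and", "such", "as", "for", "to", "on", "in"}


def senses(word):
    """Return every synset in the toy inventory that lists `word` as a lemma."""
    return [s for s in INVENTORY.values() if word in s.lemmas]


def hypernym_chain(synset):
    """Follow hypernym links upwards and return the ancestors in order."""
    chain = []
    while synset.hypernym is not None:
        synset = INVENTORY[synset.hypernym]
        chain.append(synset)
    return chain


def signature(synset):
    """Collect content words from the synset and its hypernyms (glosses and lemmas)."""
    words = set()
    for s in [synset] + hypernym_chain(synset):
        words.update(s.gloss.lower().split())
        words.update(s.lemmas)
    return words - STOPWORDS


def disambiguate(word, context):
    """Pick the sense whose signature shares the most words with the context."""
    context_words = set(context.lower().split()) - STOPWORDS - {word}
    return max(senses(word), key=lambda s: len(signature(s) & context_words))


if __name__ == "__main__":
    print(disambiguate("bank", "he sat on the bank of the river and watched the water").name)
    print(disambiguate("bank", "the bank lends money to customers").name)
```

Including the hypernyms in each signature is one simple way to exploit the conceptual structure in addition to the definitions; a real system would rely on the full WordNet hierarchy and a more careful similarity measure.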

The corpus-based method implements a supervised machine-learning (ML) algorithm that learns from annotated sense examples. A corpus-based system usually represents the linguistic information in the context of each example (e.g., a sentence containing a usage of an ambiguous word) in the form of a feature vector. These features may be of different kinds: word collocations, part-of-speech labels, keywords, topic and domain information, grammatical relationships, etc.

Based on these two approaches, the main objectives of the work presented in this paper are:

To study the performance of different mechanisms of combining information sources by using knowledge-based and corpus-based WSD methods together.
To show that a knowledge-based method can help a corpus-based method perform the disambiguation process better, and vice versa.
To show that the combination of both approaches outperforms each of the methods taken individually, demonstrating that the two approaches can play complementary roles.
Finally, to show that both approaches can be applied to several languages. In particular, we perform several experiments in Spanish and English.
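To make the feature-vector representation used by corpus-based systems more concrete, the sketch below turns the context of one ambiguous word into a sparse set of binary features and then into a fixed-length vector. It covers only two of the feature types mentioned above (positional word collocations and sentence keywords); the function names, the feature naming scheme, and the window size are assumptions made for the example, not the feature set used in this work.

```python
def extract_features(tokens, target_index, window=2):
    """Build a sparse binary feature dict for the word at `target_index`.

    Hypothetical feature scheme:
      w[+k]=x / w[-k]=x : word x occurs k positions after / before the target
      kw=x              : word x occurs anywhere else in the sentence
    """
    features = {}
    # Positional collocations within a fixed window around the target word.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            features[f"w[{offset:+d}]={tokens[pos].lower()}"] = 1
    # Bag-of-words keywords drawn from the rest of the sentence.
    for i, tok in enumerate(tokens):
        if i != target_index:
            features[f"kw={tok.lower()}"] = 1
    return features


def vectorize(feature_dicts):
    """Map sparse feature dicts onto dense 0/1 vectors over a shared vocabulary."""
    vocab = sorted({name for fd in feature_dicts for name in fd})
    index = {name: i for i, name in enumerate(vocab)}
    vectors = []
    for fd in feature_dicts:
        vec = [0] * len(vocab)
        for name in fd:
            vec[index[name]] = 1
        vectors.append(vec)
    return vocab, vectors


if __name__ == "__main__":
    s1 = "He deposited cash at the bank yesterday".split()
    s2 = "They walked along the bank of the river".split()
    f1 = extract_features(s1, s1.index("bank"))
    f2 = extract_features(s2, s2.index("bank"))
    vocab, vectors = vectorize([f1, f2])
    print(sorted(f1))
    print(len(vocab), vectors[0])
```

Vectors of this kind, paired with the annotated sense of each example, are what a supervised learner would be trained on; part-of-speech, domain, and grammatical features would need external tools and are left out here.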

The following section summarizes the background of word sense disambiguation. Sections 2.1 and 2.2 describe the knowledge-based and corpus-based systems used in this work. Section 3 describes two WSD methods: the specification marks method and the maximum entropy-based method. Section 4 presents an evaluation of our results using different system combinations. Finally, some conclusions are presented, along with a brief discussion of work in progress.