Domain WSD Heuristic

This heuristic uses a derived resource, ``relevant domains'' Montoyo2003, which is obtained by combining the WordNet glosses with WordNet Domains Magnini20003. WordNet Domains establishes a semantic relation between word senses by grouping them under the same semantic domain (Sports, Medicine, etc.). The word bank, for example, has ten senses in WordNet 2.0; three of them, ``bank#1'', ``bank#3'' and ``bank#6'', are grouped under the same domain label, Economy, whereas ``bank#2'' and ``bank#7'' are grouped under the domain labels Geography and Geology. These domain labels are selected from a hierarchically organized set of 165 labels. In this way, a domain connects words that belong to different subhierarchies and parts of speech.

``Relevant domains'' is a lexicon derived from the WordNet glosses using WordNet Domains. In effect, we use WordNet as a corpus categorized with domain labels. For each English word appearing in the glosses of WordNet, we obtain a list of its most representative domain labels. The relevance of each candidate label is obtained by weighting it with the ``Association Ratio'' (AR) formula, where $w$ is a word and $D$ is a domain:

\begin{displaymath}
AR(w\vert D) = P(w\vert D) \ast \log \frac{P(w\vert D)}{P(w)} \qquad (1)
\end{displaymath}
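As an illustration of the AR formula, the following minimal sketch computes the weights from a toy domain-tagged corpus. Several things here are assumptions rather than part of the original description: the probabilities are estimated as simple relative frequencies, the logarithm is taken in base 2, and the function name `association_ratios` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def association_ratios(tagged_glosses):
    """Compute AR(w|D) = P(w|D) * log2(P(w|D) / P(w)).

    tagged_glosses: iterable of (domain_label, list_of_words) pairs.
    Assumption: probabilities are relative frequencies over this input.
    Returns a dict mapping word -> {domain_label: AR weight}.
    """
    word_counts = Counter()                    # occurrences of w overall
    domain_word_counts = defaultdict(Counter)  # occurrences of w inside D
    for domain, words in tagged_glosses:
        word_counts.update(words)
        domain_word_counts[domain].update(words)

    total = sum(word_counts.values())
    ar = defaultdict(dict)
    for domain, counts in domain_word_counts.items():
        domain_total = sum(counts.values())
        for word, n in counts.items():
            p_w_given_d = n / domain_total
            p_w = word_counts[word] / total
            ar[word][domain] = p_w_given_d * math.log2(p_w_given_d / p_w)
    return dict(ar)


# Toy input, invented for illustration only (not real WordNet glosses).
toy = [
    ("Economy", ["money", "deposit", "bank"]),
    ("Geography", ["river", "slope", "bank"]),
]
scores = association_ratios(toy)
```

In this toy input, a word that occurs in only one domain (such as `money`) receives a positive weight for that domain, while a word spread evenly over both domains (such as `bank`) receives a weight of zero, which is the behaviour one would expect from a relevance measure.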

This list of weighted domain labels can also be seen as a weighted vector (a point in a multidimensional space with one dimension per domain). From the ``Relevant domains'' vectors of individual words we can derive new vectors that represent sets of words, for instance contexts or glosses. We can then measure the similarity between a given context and each possible sense of a polysemous word, for instance with the cosine function.
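For reference, the cosine between two domain vectors $u$ and $v$ over $n$ domain labels is the standard one given below. How the vectors of individual words are combined into the vector of a set of words $S$ is not spelled out here, so the component-wise sum in the second formula is only an assumption made for illustration.

\begin{displaymath}
\cos(u, v) = \frac{\sum_{i=1}^{n} u_i \, v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}
\end{displaymath}

\begin{displaymath}
v_S = \sum_{w \in S} v_w
\end{displaymath}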

Figure 8 shows an example of disambiguating the word genotype in the following text: ``There are a number of ways in which the chromosome structure can change, which will detrimentally change the genotype and phenotype of the organism.'' First, the context and the glosses of the word to be disambiguated are POS-tagged and morphologically analyzed. Second, we build the context vector (CV), which combines in one structure the most relevant and representative domains of the words in the text to be disambiguated. Third, in the same way, we build the sense vectors (SV), which group the most relevant and representative domains of the gloss associated with each word sense. In this example the senses are genotype#1 (a group of organisms sharing a specific genetic constitution) and genotype#2 (the particular alleles at specified loci present in an organism). Finally, to select the appropriate sense, we compare every sense vector with the context vector and select the sense whose vector is closest to the context vector. In this example, we show the sense vector for genotype#1, and we select genotype#1 because its cosine with the context vector is higher.
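The steps of this example can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the original implementation: the input words are assumed to be already POS-tagged and lemmatized, vectors of word sets are built by summing word vectors, and every weight and domain label in `relevant_domains` is invented for illustration and does not come from the actual lexicon. The names `vector_for`, `cosine` and `disambiguate` are hypothetical.

```python
import math
from collections import Counter


def vector_for(words, relevant_domains):
    """Sum the domain vectors of the given words (assumed combination rule)."""
    vec = Counter()
    for w in words:
        for domain, weight in relevant_domains.get(w, {}).items():
            vec[domain] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse domain vectors."""
    dot = sum(u[d] * v[d] for d in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context_words, sense_glosses, relevant_domains):
    """Return the sense whose gloss vector is closest to the context vector."""
    cv = vector_for(context_words, relevant_domains)
    scores = {
        sense: cosine(cv, vector_for(gloss, relevant_domains))
        for sense, gloss in sense_glosses.items()
    }
    return max(scores, key=scores.get), scores


# Invented weights and placeholder domain labels, for illustration only.
relevant_domains = {
    "chromosome": {"Biology": 0.9, "Genetics": 0.4},
    "phenotype": {"Biology": 0.6, "Genetics": 0.5},
    "organism": {"Biology": 0.8},
    "genetic": {"Biology": 0.7, "Genetics": 0.3},
    "group": {"Factotum": 0.2},
    "allele": {"Genetics": 0.9, "Biology": 0.2},
    "locus": {"Genetics": 0.6},
}
context = ["chromosome", "structure", "change", "phenotype", "organism"]
senses = {
    "genotype#1": ["group", "organism", "share", "genetic", "constitution"],
    "genotype#2": ["particular", "allele", "locus", "present", "organism"],
}
best, scores = disambiguate(context, senses, relevant_domains)
```

With these invented weights the call selects `genotype#1`, but that outcome is a property of the toy numbers only. Words missing from the lexicon (here `structure`, `change`, `share` and others) simply contribute nothing to the vectors.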

Figure 8: Example of Domain WSD Heuristic
\includegraphics[width=13cm, clip]{heuristica_dominios.eps}

Classifying this heuristic as ``knowledge-based'' or ``corpus-based'' is debatable, because it uses the WordNet glosses (and WordNet Domains) as a corpus from which the ``relevant domains'' are derived; that is, it applies corpus techniques to WordNet. However, WordNet Domains was constructed semi-automatically (prescribed), following the hierarchy of WordNet.