The main hypothesis of this work is that WSD requires different kinds of knowledge sources (linguistic information, statistical information, structural information, etc.) and techniques. The aim of this paper was to explore some methods of collaboration between complementary knowledge-based and corpus-based WSD methods. Two complementary methods have been presented: specification marks (SM) and maximum entropy (ME). Individually, both have benefits and drawbacks. We have shown that both methods can collaborate to obtain better results on WSD.
To test this hypothesis, we have presented three schemes for combining the two approaches, each a different mechanism for sharing information between knowledge-based and corpus-based WSD methods. We have also shown that the combination outperforms each of the methods individually, which indicates that the two approaches can be considered complementary. Finally, we have shown that a knowledge-based method can help a corpus-based method to perform the disambiguation process better, and vice versa.
To help the specification marks method, ME disambiguates some nouns in the context of the target word. ME selects these nouns through a prior analysis of the training data that identifies which ones appear to be disambiguated with high accuracy. This preprocessing step fixes the senses of some nouns, thereby reducing the search space of the knowledge-based method. In turn, SM helps ME by providing domain information for the nouns in the context. This information is incorporated into the learning process in the form of features.
By comparing the accuracy of both methods, with and without the contribution of the other, it was demonstrated that such combining schemes of WSD methods are possible and successful.
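The selection step in which ME identifies the nouns it can disambiguate reliably can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function name `select_reliable_nouns`, the accuracy threshold `min_accuracy`, and the minimum number of examples `min_examples` are hypothetical, and the sketch assumes that ME predictions on held-out training examples are available as (noun, predicted sense, gold sense) triples.

```python
from collections import defaultdict


def select_reliable_nouns(predictions, min_accuracy=0.9, min_examples=5):
    """Return the nouns whose held-out accuracy reaches min_accuracy.

    predictions: iterable of (noun, predicted_sense, gold_sense) triples.
    min_accuracy and min_examples are illustrative values, not taken
    from the experiments reported in the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for noun, predicted, gold in predictions:
        total[noun] += 1
        if predicted == gold:
            correct[noun] += 1
    # Keep only nouns seen often enough and disambiguated accurately enough.
    return {
        noun
        for noun in total
        if total[noun] >= min_examples
        and correct[noun] / total[noun] >= min_accuracy
    }
```

Only the nouns returned by such a filter would have their senses fixed before the knowledge-based method is run; all remaining nouns would be left for SM to disambiguate.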
Finally, we presented a voting system for nouns that included four classifiers, three of them based on ME, and one of them based on SM. This cooperation scheme obtained the best score for nouns when compared with the systems submitted to the SENSEVAL-2 Spanish lexical-sample task and comparable results to those submitted to the SENSEVAL-2 English lexical-sample task.
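As an illustration of how four classifiers can be combined by voting, the following sketch implements a simple plurality vote. It is a sketch under stated assumptions rather than the exact scheme evaluated above: the function name `vote` and the tie-breaking rule (defer to one designated classifier, selected by the hypothetical parameter `fallback_index`) are assumptions introduced for this example.

```python
from collections import Counter


def vote(predictions, fallback_index=0):
    """Return the sense proposed by the most classifiers.

    predictions: list with one sense label per classifier, e.g. three
    ME-based classifiers and one SM-based classifier.
    fallback_index: classifier whose answer breaks ties (an assumption
    of this sketch, not a rule stated in the paper).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    winners = [sense for sense, n in counts.items() if n == best]
    if len(winners) == 1:
        return winners[0]
    # Tie: prefer the designated classifier if its answer is among the
    # tied senses; otherwise take the first tied sense in input order.
    fallback = predictions[fallback_index]
    return fallback if fallback in winners else winners[0]
```

For example, `vote(["s1", "s1", "s2", "s3"])` selects `"s1"`, the only sense proposed by more than one classifier.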
We are presently studying possible improvements in the collaboration between these methods, both by extending the information that the two methods provide to each other and by taking advantage of the merits of each one.
The authors wish to thank the anonymous reviewers of the Journal of Artificial Intelligence Research and COLING 2002, the 19th International Conference on Computational Linguistics, for helpful comments on earlier drafts of the paper. An earlier paper [Suárez & Palomar, 2002b] about the corpus-based method (subsection 3.2) was presented at COLING 2002.
This research has been partially funded by the Spanish Government under project CICyT number TIC2000-0664-C02-02 and PROFIT number FIT-340100-2004-14, by the Valencia Government under project number GV04B-276, and by the EU-funded project MEANING (IST-2001-34460).