SENSEVAL-2 Spanish lexical-sample task

vME+SM extends vME: we added the SM classifier to the combination of the three ME systems in vME (see Section 3.3). The results on the Spanish lexical-sample task of SENSEVAL-2 are shown in Table 17. Because SM only works with nouns, vME+SM improves accuracy only for nouns, where it obtains the same score as JHU(R); its overall score ranks second.

Table 17: vME+SM in the Spanish lexical-sample task of SENSEVAL-2

ALL: score	ALL: system	Nouns: score	Nouns: system
0.713	jhu(R)	0.702	jhu(R)
0.684	vME+SM	0.702	vME+SM
0.682	jhu	0.683	MEbfs.pos
0.677	MEbfs.pos	0.681	jhu
0.676	vME	0.678	vME
0.670	css244	0.661	MEbfs
0.667	MEbfs	0.652	css244
0.658	MEfix	0.646	MEfix
0.627	umd-sst	0.621	duluth 8
0.617	duluth 8	0.612	duluth Z
0.610	duluth 10	0.611	duluth 10
0.595	duluth Z	0.603	umd-sst
0.595	duluth 7	0.592	duluth 6
0.582	duluth 6	0.590	duluth 7
0.578	duluth X	0.586	duluth X
0.560	duluth 9	0.557	duluth 9
0.548	ua	0.514	duluth Y
0.524	duluth Y	0.464	ua

These results show that methods like SM and ME can be combined to achieve good disambiguation results. Our results are in line with those of Pedersen2002, which also presents a comparative evaluation of the systems that participated in the Spanish and English lexical-sample tasks of SENSEVAL-2. That work focuses on pairwise comparisons between systems to assess the degree to which they agree, and on measuring the difficulty of the test instances included in these tasks. If several systems are largely in agreement, there is little benefit in combining them, since they are redundant and will simply reinforce each other. However, if some systems disambiguate instances that others do not, the systems are complementary, and it may be possible to combine them to exploit their different strengths and improve overall accuracy.
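The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the procedure used in Pedersen2002: the function name `pair_outcomes` and the toy correctness vectors are hypothetical, and each system is assumed to give exactly one answer per test instance.

```python
from typing import Dict, Sequence


def pair_outcomes(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Dict[str, float]:
    """Proportion of instances that both, exactly one, or neither system answered correctly."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same instances")
    n = len(correct_a)
    if n == 0:
        raise ValueError("at least one instance is required")
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    zero = sum((not a) and (not b) for a, b in zip(correct_a, correct_b))
    one = n - both - zero  # remaining instances: exactly one system is correct
    return {"both_ok": both / n, "one_ok": one / n, "zero_ok": zero / n}


# Hypothetical toy data: per-instance correctness of two systems (not SENSEVAL-2 data).
sys_a = [True, True, False, False, True]
sys_b = [True, False, True, False, False]
outcomes = pair_outcomes(sys_a, sys_b)
print(outcomes)
```

A large `one_ok` share signals complementary systems, while a large `both_ok` share with a small `one_ok` share signals redundancy.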

The results for nouns (the only part of speech to which SM applies), shown in Table 18, indicate that SM has a low level of agreement with all the other methods. However, the measure of optimal combination, that is, the proportion of instances that at least one system of the pair answers correctly, is quite high, reaching 89% (1.00 - 0.11) for the pairing of SM and JHU. In fact, all seven of the other methods achieved their highest optimal combination value when paired with the SM method.
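As a quick check of the arithmetic, the optimal combination for each pairing can be recomputed as 1.00 minus the Zero OK value reported in Table 18. The sketch below only restates those reported values; the variable names are illustrative.

```python
# Zero OK values for each system paired with SM, copied from Table 18.
zero_ok = {
    "JHU": 0.11,
    "Duluth7": 0.12,
    "DuluthY": 0.12,
    "Duluth8": 0.13,
    "Cs224": 0.13,
    "Umcp": 0.14,
    "Duluth9": 0.16,
}

# Optimal combination: share of instances where at least one system of the pair is correct.
optimal = {name: round(1.0 - z, 2) for name, z in zero_ok.items()}
best_partner = max(optimal, key=optimal.get)

for name, value in optimal.items():
    print(f"SM and {name}: {value:.2f}")
print("Highest optimal combination with SM:", best_partner)
```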

Table 18: Optimal combination between the systems that participated in the Spanish lexical-sample task of SENSEVAL-2

System pair for nouns	Both OK¹	One OK²	Zero OK³	Kappa⁴
SM and JHU	0.29	0.32	0.11	0.06
SM and Duluth7	0.27	0.34	0.12	0.03
SM and DuluthY	0.25	0.35	0.12	0.01
SM and Duluth8	0.28	0.32	0.13	0.08
SM and Cs224	0.28	0.32	0.13	0.09
SM and Umcp	0.26	0.33	0.14	0.06
SM and Duluth9	0.26	0.31	0.16	0.14

¹ Proportion of instances where both systems' answers were correct.

² Proportion of instances where only one answer was correct.

³ Proportion of instances where neither answer was correct.

⁴ The kappa statistic Cohen1960 is a measure of agreement between two systems (or judges) that is corrected for the agreement that would be expected just by chance. A value of 1.00 indicates complete agreement, while 0.00 indicates agreement no better than chance.
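For reference, Cohen's kappa for two systems can be computed as in the sketch below. It assumes that agreement is measured on the sense label each system assigns to each instance, which the text here does not state explicitly; the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both systems pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always assign the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy sense labels (not SENSEVAL-2 data).
a = ["s1", "s1", "s2", "s2"]
b = ["s1", "s2", "s1", "s2"]
print(cohen_kappa(a, b))  # agreement at chance level
print(cohen_kappa(a, a))  # complete agreement
```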

This combination of low agreement and high optimal combination suggests that SM, being a knowledge-based method, is fundamentally different from the other (i.e., corpus-based) methods, and is able to disambiguate a certain set of instances where the other methods fail. In fact, SM is the only method that uses the structure of WordNet.