
## Evaluation of the Architectural Components

In this section, we evaluate the architectural components of our algorithm using degradation (ablation) studies. We perform experiments without each component in turn, and then with none of them, to observe the impact on the system's performance. Such studies have been useful in developing practical methods for other kinds of anaphora resolution as well (see, for example, [24]). Specifically, an experiment was performed testing each of the following variations.

1. The certainty factors of all of the rules are set to 1.

Recall that all rules are applied to each utterance, and each rule that matches produces a Partial-Augmented-ILT (which is assigned the certainty factor of the rule). All maximal mergings of the Partial-Augmented-ILTs are then formed, to create a set of Augmented-ILTs. Then, the final interpretation of the utterance is chosen from among the set of Augmented-ILTs. The certainty factor of each Augmented-ILT is the sum of the certainty factors of the Partial-Augmented-ILTs composing it. Thus, setting the certainty factors to 1 implements the scheme in which the more partial results are merged into an interpretation, the higher the overall certainty factor of that interpretation. In other words, this scheme favors the Augmented-ILT resulting from the greatest number of rule applications.

2. The certainty factors of all of the rules are set to 0.

This scheme is essentially random selection among the Augmented-ILTs that make sense according to the critics. If the critics did not exist, then setting the rule certainty factors to 0 would result in random selection. With the critics, any Augmented-ILTs to which the critics apply are excluded from consideration, because the critics will lower their certainty factors to negative numbers.

3. No merging of the rule results is performed.

That is, the Partial-Augmented-ILTs are not merged prior to selection of the final Augmented-ILT. The effect of this is that the result of one single rule is chosen to be the final interpretation.

4. The critics are not used.

5. The distance factors are not used.

In this case, the certainty factors for rules that access the focus list are not adjusted based on how far back the chosen focus list item is.

6. All variations are applied, excluding case 2.

Specifically, neither the critics nor the distance factors are used, no merging of partial results is performed, and the rules are all given the same certainty factor (namely, 1).
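The scoring scheme that variations 1 and 2 modify can be sketched as follows. This is a minimal illustration, not the system's implementation: the class `PartialILT`, the functions `score` and `top_candidates`, the rule names, and the certainty-factor values are all hypothetical, and the merging step and the critics are not modeled (each candidate is simply given as the list of partial results already merged into it).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PartialILT:
    """Stand-in for a Partial-Augmented-ILT; both fields are hypothetical."""
    rule: str   # name of the rule that produced this partial result
    cf: float   # certainty factor inherited from that rule


# One Augmented-ILT, represented by the partial results merged into it.
Candidate = List[PartialILT]


def score(candidate: Candidate, cf_override: Optional[float] = None) -> float:
    """Certainty factor of an Augmented-ILT: the sum over its partial results.

    cf_override replaces every rule's certainty factor with one constant,
    which is how variations 1 (1.0) and 2 (0.0) are imitated here.
    """
    if cf_override is not None:
        return cf_override * len(candidate)
    return sum(p.cf for p in candidate)


def top_candidates(candidates: List[Candidate],
                   cf_override: Optional[float] = None) -> List[Candidate]:
    """Return every candidate that ties for the highest certainty factor."""
    scores = [score(c, cf_override) for c in candidates]
    best = max(scores)
    return [c for c, s in zip(candidates, scores) if s == best]


# Illustrative values only: one strong rule versus two weaker merged rules.
single = [PartialILT("rule_a", 0.9)]
merged = [PartialILT("rule_b", 0.3), PartialILT("rule_c", 0.4)]
candidates = [single, merged]

normal = top_candidates(candidates)                      # original CFs
all_ones = top_candidates(candidates, cf_override=1.0)   # variation 1
all_zeros = top_candidates(candidates, cf_override=0.0)  # variation 2
```

With these toy values, the normal configuration prefers the single high-certainty result, setting every certainty factor to 1 prefers the candidate built from more rule applications, and setting them all to 0 leaves a tie that would have to be broken at random. Variation 3 (no merging) corresponds to passing only single-element candidates.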

Table 10 shows the results for each variation when run over the unambiguous but uncorrected CMU training data. For comparison, the first row shows the results for the system as normally configured. As with the previous evaluations, accuracy is the percentage of the correct answers the system produces, while precision is the percentage of the system's answers that are correct.

Table 10: Evaluation of the Variations on CMU Unambiguous/Uncorrected Data

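| Variation | Cor | Inc | Mis | Ext | Nul | Act | Poss | Acc | Prec |
|---|---|---|---|---|---|---|---|---|---|
| system as is | 1283 | 44 | 112 | 37 | 574 | 1938 | 2013 | 0.923 | 0.958 |
| all CFs 1.0 | 1261 | 77 | 101 | 50 | 561 | 1949 | 2000 | 0.911 | 0.935 |
| all CFs 0.0 | 1202 | 118 | 119 | 49 | 562 | 1931 | 2001 | 0.882 | 0.914 |
| -critics | 1228 | 104 | 107 | 354 | 667 | 2353 | 2106 | 0.900 | 0.805 |
| -dist. factors | 1265 | 52 | 122 | 50 | 591 | 1958 | 2030 | 0.914 | 0.948 |
| -merge | 1277 | 46 | 116 | 54 | 577 | 1954 | 2016 | 0.920 | 0.949 |
| combo | 1270 | 53 | 116 | 67 | 594 | 1984 | 2033 | 0.917 | 0.940 |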

Legend

- Cor(rect): System and key agree on non-null value
- Inc(orrect): System and key differ on non-null value
- Mis(sing): System has null value for non-null key
- Ext(ra): System has non-null value for null key
- Nul(l): Both system and key give null answer
- Poss(ible): Correct + Incorrect + Missing + Null
- Act(ual): Correct + Incorrect + Extra + Null
- Base(line)Acc(uracy): Baseline accuracy (input used as is)
- Acc(uracy): % Key values matched correctly ((Correct + Null)/Possible)
- Prec(ision): % System answers matching the key ((Correct + Null)/Actual)
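The legend's definitions can be checked directly against the table. The short sketch below recomputes accuracy and precision from the five raw counts; the function name `accuracy_precision` is ours, and the counts are those of the "system as is" row of Table 10.

```python
from typing import Tuple


def accuracy_precision(cor: int, inc: int, mis: int, ext: int, nul: int) -> Tuple[float, float]:
    """Apply the legend's formulas to the raw counts of one table row."""
    possible = cor + inc + mis + nul   # Poss(ible)
    actual = cor + inc + ext + nul     # Act(ual)
    matched = cor + nul                # answers on which system and key agree
    return matched / possible, matched / actual


# Counts from the "system as is" row of Table 10.
acc, prec = accuracy_precision(cor=1283, inc=44, mis=112, ext=37, nul=574)
print(f"accuracy={acc:.3f} precision={prec:.3f}")  # accuracy=0.923 precision=0.958
```

The same call on the "-critics" row shows why precision suffers most there: the 354 extra answers inflate Actual without adding to the matched count.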

Only two of the differences are statistically significant ( ), namely, the precision of the system's performance when the critics are not used, and the accuracy of the system's performance when all of the certainty factors are 0. The significance analysis was performed using paired t-tests comparing the results for each variation with the results for the system as normally configured.
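For readers unfamiliar with the test, a paired t statistic can be computed as sketched below. This is an assumption-laden illustration: the text does not say which units were paired, so the per-unit scores here are made-up toy numbers, and the function `paired_t` is ours. Only the t statistic and its degrees of freedom are computed; turning them into a p-value requires a t-distribution table or a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(full: Sequence[float], variant: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for two matched score lists."""
    if len(full) != len(variant) or len(full) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [f - v for f, v in zip(full, variant)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy per-unit scores, invented for illustration only (not from the paper).
toy_full = [0.95, 0.91, 0.93, 0.90]
toy_variant = [0.92, 0.90, 0.88, 0.89]
t_stat, dof = paired_t(toy_full, toy_variant)
print(f"t={t_stat:.2f}, df={dof}")
```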

The performance difference when the critics are not used is due to extraneous alternatives that the critics would have weeded out. The drop in accuracy when the certainty factors are all 0 shows that the certainty factors have some effect. Experimenting with statistical methods to derive them would likely lead to further improvement.

The remaining figures are all only slightly lower than those for the full system, and are all much higher than the baseline accuracies.

It is interesting to note that the unimportance of the distance factors (variation 5) is consistent with the findings presented in Section 8.1 that the last mentioned time is an acceptable antecedent in the vast majority of cases. Otherwise, we might have expected to see an improvement in variation 5, since the distance factors penalize going further back on the focus list.
