The evaluations presented in this section required detailed, time-consuming manual annotations. The system's annotations would not suffice, because the implementation does not perfectly recognize when a rule is applicable. We annotated four randomly selected dialogs from the CMU training set and the four dialogs in the NMSU training set.
The counts derived from the manual annotations for this section are defined below. Because this section focuses on the relations, we consider them at the more specific level of the deictic and anaphoric rules presented in Online Appendix 1. In addition, we do not allow trivial extensions of the relations, as we did in the evaluation of the focus model (Section 8.1). The criterion for correctness in this section is the same as for the evaluation of the system: a field-by-field exact match with the manually annotated correct interpretations. There is one exception: the start and end time-of-day fields are ignored, since these are known weaknesses of the rules, and they represent a relatively minor proportion of the overall temporal interpretation.
The following counts were derived from the manual annotations:

TimeRefs: the total number of temporal references.
TimeRefsC: the number of temporal references for which at least one rule yields the correct interpretation.
DiffI: the total number of different interpretations produced.
DiffICorr: the number of different interpretations produced for the references counted in TimeRefsC.
CorrI: the total number of correct interpretations produced.

The values for each data set, together with the coverage and ambiguity evaluations, are presented in Table 7.
CMU data set:
Coverage (TimeRefsC / TimeRefs) = 95%
Ambiguity (DiffICorr / TimeRefsC) = 1.15
Overall Ambiguity (DiffI / TimeRefs) = 1.17
Redundancy (CorrI / TimeRefsC) = 142 / 74 = 1.92

NMSU data set:
Coverage (TimeRefsC / TimeRefs) = 85%
Ambiguity (DiffICorr / TimeRefsC) = 1.28
Overall Ambiguity (DiffI / TimeRefs) = 1.32
Redundancy (CorrI / TimeRefsC) = 154 / 83 = 1.86
The ambiguity for both data sets is very low. The Ambiguity figure in Table 7 represents the average number of interpretations per temporal reference, considering only those for which the correct interpretation is possible (i.e., it is DiffICorr / TimeRefsC). The table also shows the ambiguity when all temporal references are included (i.e., DiffI / TimeRefs). As can be seen from the table, the average ambiguity in both data sets is much less than two interpretations per utterance.
The coverage of the relations can be evaluated as (TimeRefsC / TimeRefs), the percentage of temporal references for which at least one rule yields the correct interpretation. While the coverage of the NMSU data set, 85%, is not perfect, it is good, considering that the system was not developed on the NMSU data.
The data also show that there is often more than one way to achieve the correct interpretation. This is another type of redundancy: redundancy of the data with respect to the model. It is calculated in Table 7 as (CorrI / TimeRefsC), that is, the number of correct interpretations over the number of temporal references that have a correct interpretation. For both data sets, there are, on average, roughly two different ways to achieve the correct interpretation.
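As a concrete illustration, the three Table 7 metrics can be computed directly from the underlying counts. The function names below are our own; the only actual figures used are the CMU redundancy counts from the section (142 correct interpretations over 74 covered references).

```python
def coverage(time_refs_c, time_refs):
    """Fraction of temporal references with at least one correct interpretation."""
    return time_refs_c / time_refs

def ambiguity(diff_i_corr, time_refs_c):
    """Average interpretations per reference, over covered references only."""
    return diff_i_corr / time_refs_c

def redundancy(corr_i, time_refs_c):
    """Average number of distinct ways to reach the correct interpretation."""
    return corr_i / time_refs_c

# CMU redundancy from the section: 142 correct interpretations / 74 covered references
print(round(redundancy(142, 74), 2))  # → 1.92
```

Note that Overall Ambiguity is simply the `ambiguity` ratio taken over all references (DiffI / TimeRefs) rather than over the covered ones.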
Table 8 shows the number of times each rule applies in total (column 3) and the number of times each rule is correct (column 2), according to our manual annotations. Column 4 shows the accuracies of the rules, i.e., (column 2 / column 3). The rule labels are the ones used in Online Appendix 1 to identify the rules.
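The per-rule accuracy in column 4 is a simple ratio of the two tallies. The sketch below uses the rule labels from Online Appendix 1, but the counts themselves are hypothetical, not the actual Table 8 values.

```python
# Hypothetical tallies: rule label -> (times correct, times applied).
# Labels follow Online Appendix 1; the counts are illustrative only.
rule_counts = {
    "D2ii": (30, 33),
    "A1":   (20, 24),
    "A3ii": (15, 20),
    "A4":   (12, 16),
}

# Accuracy of each rule = times correct / times applied (column 2 / column 3).
accuracy = {rule: correct / applied
            for rule, (correct, applied) in rule_counts.items()}

for rule, acc in sorted(accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{rule}: {acc:.2f}")
```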
The same four rules are responsible for the majority of applications in both data sets, the ones labeled D2ii, A1, A3ii, and A4. The first is an instance of the frame of reference deictic relation, the second is an instance of the co-reference anaphoric relation, the third is an instance of the frame of reference anaphoric relation, and the fourth is an instance of the modify anaphoric relation.
How often the system considers and actually uses each rule is shown in Table 9. Specifically, the column labeled Fires shows how often each rule applies, and the column labeled Used shows how often each rule is used to form the final interpretation. To help isolate the accuracies of the rules, these experiments were performed on unambiguous data. Comparing this table with Table 8, we see that the same four rules shown to be the most important by the manual annotations are also responsible for the majority of the system's interpretations. This holds for both the CMU and NMSU data sets.
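The Fires/Used distinction amounts to two tallies over the system's rule applications: every application increments Fires, while only applications that contribute to the final interpretation increment Used. A minimal sketch, assuming a hypothetical log of (rule, used-in-final) events:

```python
from collections import Counter

# Hypothetical application log: (rule label, whether it contributed to
# the final interpretation). Labels follow Online Appendix 1.
events = [
    ("D2ii", True), ("A1", True), ("A3ii", False),
    ("D2ii", True), ("A4", False), ("A1", True),
]

fires = Counter(rule for rule, _ in events)            # how often each rule applies
used = Counter(rule for rule, ok in events if ok)      # how often it shapes the result

for rule in fires:
    print(f"{rule}: fires={fires[rule]}, used={used.get(rule, 0)}")
```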