Conclusions

Scheduling dialogs, during which people negotiate the times of appointments, are common in everyday life. This paper reports the results of an in-depth empirical investigation of resolving explicit temporal references in scheduling dialogs. The work has four basic phases: data annotation, model development, system implementation and evaluation, and model evaluation and analysis. The system and model were developed primarily on one set of data (the CMU dialogs) and then applied to a much more complex set of data (the NMSU dialogs) to assess how well the model generalizes for the task being performed. Many different types of empirical methods were applied to both data sets to pinpoint the strengths and weaknesses of the approach.

In the data annotation phase, detailed coding instructions were developed and an intercoder reliability study involving naive subjects was performed. The results of the study are very good, supporting the viability of the instructions and annotations. During the model development phase, we performed an iterative process of implementing a proposed set of anaphoric and deictic relations and then refining them based on system performance (on the CMU training data), until we settled on the set presented here. We also developed our focus model during this phase. What type of focus model is required for various tasks is a question of ongoing importance in the literature. Contrary to what we expected, our initial observations of the data suggested that a recency-based focus model might be adequate. To test this hypothesis, we made the strategic decision to limit ourselves to a recency-based model, rather than build some kind of hybrid model whose success or failure would not have told us as much.
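
The flavor of model we have in mind can be pictured roughly as follows. This is a minimal sketch in Python; the class and method names are illustrative conveniences of our own, not the system's actual data structures or compatibility tests:

    # Minimal sketch of a recency-based temporal focus model (illustrative only;
    # the real system maintains richer temporal representations).
    class RecencyFocusModel:
        def __init__(self):
            self.times = []          # resolved times, most recent first

        def update(self, resolved_time):
            # Each fully resolved time becomes the new focus.
            self.times.insert(0, resolved_time)

        def resolve(self, new_reference, compatible):
            # Return the most recently mentioned time with which the
            # (possibly partial) new reference is compatible, if any.
            for candidate in self.times:
                if compatible(new_reference, candidate):
                    return candidate
            return None              # e.g., fall back to a deictic reading

The point of the sketch is only that antecedent selection is driven by order of mention: the most recent compatible time wins, with no hierarchical discourse structure consulted.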

During system implementation and evaluation, a system implementing the model was built and evaluated on unseen test data, using a challenging field-by-field comparison of system and human answers. To be counted as correct, the information must not only be right, but must also appear in the correct field of the output representation. Taking as input the ambiguous output of a semantic grammar, the system achieves an overall accuracy of 81% on unseen CMU test data, a large improvement over the baseline accuracy of 43%. On an unseen test set from the more complex NMSU data, the results are very respectable: an overall accuracy of 69%, against a much lower baseline accuracy of 29%. This also shows the robustness of the CMU semantic parser [20,21], which was given the NMSU dialogs as input without being modified in any way to handle them.
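
The field-by-field scoring can be pictured along these lines. This is a hypothetical sketch: the field names and the handling of empty fields are simplifications of our own, not the actual output representation:

    # Sketch of field-by-field scoring: a value is credited only if it matches
    # the human answer and sits in the same field of the output representation.
    # The field names below are illustrative.
    FIELDS = ["month", "date", "day_of_week", "start_time", "end_time"]

    def overall_accuracy(system_records, human_records):
        correct = total = 0
        for sys_rec, gold_rec in zip(system_records, human_records):
            for field in FIELDS:
                total += 1
                if sys_rec.get(field) == gold_rec.get(field):
                    correct += 1
        return correct / total if total else 0.0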

The implementation is an important proof of concept. However, it is not a direct evaluation of the model, because there are errors due to factors we do not focus on in this work. Some of the error is simply due to utterance components being outside the coverage of the CMU parser, or having high semantic ambiguity. The only information we use to perform semantic disambiguation is the temporal context. The Enthusiast researchers have already developed better techniques for resolving the semantic ambiguity in these dialogs [32], which could be used to improve performance.

Thus, in the model evaluation and analysis phase, we performed extensive additional evaluation of the algorithm itself. We focus on the relations and the focus model, because they are the main contributions of this work. Our degradation studies support this emphasis, showing that the other aspects of the algorithm, such as the distance factors and the merging process, are responsible for little of the system's success (see Section 8.3).

Our evaluations show the strength of the focus model for the task, not only on the CMU data on which it was developed, but also on the more complex NMSU data. Although that data is more complex, there are few cases in which the last-mentioned time is not an appropriate antecedent, highlighting the importance of recency [39]; see Section 8.1. We characterized those cases along a number of dimensions to identify the particular types of challenges they pose (see Figure 10).

To compare our work with that of others, we formally defined subdialogs and the multiple-thread structures addressed by Rosé et al. [31] with respect to our model and the specific problem of temporal reference resolution. An interesting finding is that, while subdialogs of the types addressed by Grosz and Sidner [10] do occur in the data, no cases of multiple threads were found. That is, some subdialogs, all in the NMSU data, mention times that potentially interfere with the correct antecedent. But in none of these cases would subsequent errors result if, upon exiting the subdialog, the offending information were popped off a discourse stack or otherwise made inaccessible. Changes in tense, aspect, and modality are promising clues for recognizing subdialogs in this data, which we plan to explore in future work.
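
To make the stack-based treatment concrete, the following sketch is in the spirit of Grosz and Sidner's focus spaces; our own model is recency-based and does not actually maintain such a stack, so the sketch only illustrates what "popping the offending information" would mean:

    # Illustrative focus stack: times mentioned inside a subdialog become
    # inaccessible when its focus space is popped on exit.
    class FocusStack:
        def __init__(self):
            self.spaces = [[]]               # one focus space per open segment

        def enter_subdialog(self):
            self.spaces.append([])

        def mention_time(self, time):
            self.spaces[-1].append(time)

        def exit_subdialog(self):
            self.spaces.pop()                # subdialog times are discarded

        def accessible_times(self):
            # Most recent mentions first, across the still-open segments.
            return [t for space in reversed(self.spaces) for t in reversed(space)]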

To assess whether using a simpler focus model forces one to adopt a highly ambiguous set of relations, we performed a separate evaluation of the relations, based on detailed manual annotations of a set of dialogs. The ambiguity of the relations is very low for both data sets, and the coverage is good (see Table 7). In a comparison of system and human annotations, the same four rules identified as most important in the manual annotations are responsible for the majority of the system's interpretations for both data sets (see Tables 8 and 9), suggesting that the system is a faithful implementation of the model.
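
One way to picture the two measures, under a simplified counting scheme of our own (not necessarily the exact scheme behind Table 7): ambiguity is the average number of relations applicable to a reference, and coverage is the fraction of references to which at least one relation applies.

    # Sketch of the two measures over manually annotated references:
    # 'annotations' is a list with one entry per temporal reference, each
    # entry being the set of relations that apply to it.
    def ambiguity_and_coverage(annotations):
        counts = [len(relations) for relations in annotations]
        covered = [c for c in counts if c > 0]
        coverage = len(covered) / len(counts) if counts else 0.0
        ambiguity = sum(covered) / len(covered) if covered else 0.0
        return ambiguity, coverage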

Recently, many researchers in computational discourse processing have turned to empirical studies of discourse, with the goal of developing general theories by analyzing specific discourse phenomena and the systems that process them [38]. We contribute to this general enterprise. We performed many different evaluations, both on the CMU data upon which the model was developed and on the more complex NMSU data. The task and model components were explicitly specified to facilitate evaluation and comparison. Each evaluation is directed toward answering a particular question; together, the evaluations paint an overall picture of the difficulty of the task and of the success of the proposed model.

As a contribution of this work, we have made available on the project web page the coding instructions, the NMSU dialogs, and the various kinds of manual annotations we performed.

