To our knowledge, the research reported here is the first to automatically analyze a corpus of logs from a spoken dialogue system in order to learn to predict problematic situations. This work builds on two strands of earlier research. First, the approach was inspired by work on the PARADISE evaluation framework for spoken dialogue systems, which uses both multivariate linear regression and CART to predict user satisfaction as a function of a number of other metrics [Walker et al. 1997, Walker et al. 2000a]. Research using PARADISE has found that task completion is always a major predictor of user satisfaction, and has examined predictors of task completion. Our goals here are similar in that we attempt to understand the factors that predict task completion. Second, this work builds on earlier research on learning to identify dialogues in which the user experienced poor speech recognizer performance [Litman et al. 1999]. Because that work was based on features synthesized over the entire dialogue, the learned hypotheses could not be used for prediction at runtime. In addition, in contrast to the current study, the previous work automatically approximated the notion of a good or bad dialogue using a threshold on the percentage of recognition errors. This approach risks circularity when recognition performance at the utterance level is a primary predictor of a good or bad dialogue. In this work, good (TASKSUCCESS) and bad (PROBLEMATIC) dialogues were labelled by humans.
In previous work, [Walker et al. 2000b] reported results from training a problematic dialogue predictor and noted the extent to which the hand-labelled SLU-success feature improves classifier performance. Following that analysis, in this work we report results from training an auto-SLU-success classifier for each exchange and using its predictions as an input feature to the Problematic Dialogue Predictor (PDP). A number of previous studies on predicting recognition errors and user corrections are related to the auto-SLU-success predictor that we report on here [Hirschberg et al. 1999, Hirschberg et al. 2000, Hirschberg et al. 2001b, Levow 1998, Litman et al. 2000, Swerts et al. 2000].
[Hirschberg et al. 1999] apply RIPPER to predict recognition errors in a corpus of 2067 utterances. In contrast to our work, they utilize prosodic features in combination with acoustic confidence scores. They report a best-classifier accuracy of 89%, which is a 14% improvement over their baseline of 74%. This result can be compared with our binary auto-SLU-success predictor (RCORRECT vs. RINCORRECT) discussed in Section 5. Examination of the rules learned by their classifier suggests that durational features are important. While we do not use amplitude or F0 features, we do have an asr-duration feature which is logged by the recognizer. Without any of the other prosodic features, the auto-SLU-success predictor has an accuracy of 92.4%, a 29.4% improvement over the baseline of 63%. It is possible that including prosodic features in the auto-SLU-success predictor could improve this result even further.
Previous studies on recognizing error corrections are also related to our method of recognizing misunderstandings. [Levow 1998] applied similar techniques to learn to distinguish between utterances in which the user originally provided some information to the system, and corrections, which provided the same information a second time, following a misunderstanding. This may be more closely related to our research than it first appears, since corrections are often misunderstood due to hyperarticulation. Levow's experiments train a decision tree using features such as duration, tempo, pitch, amplitude, and within-utterance pauses. Examination of the trained tree in that study also reveals that the durational features are the most discriminatory. Similarly, in our experiments, RIPPER uses asr-duration frequently in the learned rule set. Levow obtains an accuracy rate of 75% with a baseline of 50%.
[Swerts et al. 2000] and [Hirschberg et al. 2001b] perform similar studies for automatically identifying corrections using prosody, ASR features and dialogue context. Corrections are likely to be misrecognized due to hyperarticulation. They observe that corrections that are more distant from the error they correct are more likely to exhibit prosodic differences. Their system automatically differentiates corrections from non-corrections with an error rate of 15.72%. Dialogue context is also used in the study by [Hirschberg et al. 2001a], which incorporates whether the user is aware of a mistake at the current utterance to help predict misunderstandings and misrecognitions of the previous utterances. This study is similar to ours in that they use a predicted feature about an utterance (the 'aware' feature) to predict concept or word accuracy, just as we use the predicted feature auto-SLU-success in the PDP. However, our auto-SLU-success feature is automatically available at the time the prediction is being made, whereas they make their predictions retroactively. In addition, they train their system on the hand-labelled feature rather than the predicted one, which they leave as further work.
[Kirchhoff 2001] performs error correction identification using task-independent acoustic and discourse variables. This is a two-way distinction between positive and negative error correction. She uses two cascaded classifiers. The first is a decision tree trained on 80% of the data and validated on 10%. Examples whose confidence scores fall below a threshold go into an exception training set for a second classifier. During testing, an utterance whose confidence score is below the threshold is passed on to the second classifier. She finds that the most discriminatory features are dialogue context (the type of the previous system utterance) followed by lexical features, with prosodic features being the least discriminatory. The system recognizes error corrections with an accuracy of 90% compared to a baseline of 81.9%. In this study, [Kirchhoff 2001] deliberately eschews the use of system-specific features, while in our work we examine the separate contribution of different feature sets. Our results suggest that the use of more general features does not negatively impact performance.
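The cascade described above can be sketched as follows. This is only an illustration of the general confidence-threshold pattern, not Kirchhoff's implementation: the function names, the toy stage classifiers, the feature names (`prev_system_act`, `has_no`, `duration`) and the threshold value of 0.7 are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, object]
Scored = Tuple[str, float]  # (label, confidence in [0, 1])
Classifier = Callable[[Features], Scored]


def cascade_predict(first: Classifier, second: Classifier,
                    x: Features, threshold: float = 0.7) -> str:
    """Use the first stage unless its confidence falls below the threshold."""
    label, confidence = first(x)
    if confidence >= threshold:
        return label
    return second(x)[0]


def split_exceptions(first: Classifier,
                     examples: List[Tuple[Features, str]],
                     threshold: float = 0.7) -> List[Tuple[Features, str]]:
    """Collect the low-confidence examples used to train the second stage."""
    return [(x, y) for x, y in examples if first(x)[1] < threshold]


# Hypothetical stand-ins for the two trained classifiers.
def first_stage(x: Features) -> Scored:
    # Confident only after an explicit verification question.
    if x["prev_system_act"] == "explicit_verify":
        return ("correction" if x["has_no"] else "non-correction", 0.9)
    return ("non-correction", 0.55)


def second_stage(x: Features) -> Scored:
    # Falls back on a durational cue.
    return ("correction" if x["duration"] > 2.0 else "non-correction", 0.6)


confident = {"prev_system_act": "explicit_verify", "has_no": True, "duration": 0.5}
unsure = {"prev_system_act": "open_prompt", "has_no": False, "duration": 3.1}
print(cascade_predict(first_stage, second_stage, confident))
print(cascade_predict(first_stage, second_stage, unsure))
```

The second utterance receives a low first-stage confidence, so its label comes from the second stage; the same threshold decides which training examples `split_exceptions` routes to the exception set.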
[Krahmer et al. 1999a] and [Krahmer et al. 1999b] look at different features related to responses to problematic system turns. The disconfirmations they discuss are responses to explicit or implicit system verification questions. They observe that disconfirmations are longer, have a marked word order, and contain specific lexical items such as "no". In addition, there are specific prosodic cues such as boundary tones and pauses. Some of these features, such as length and choice of words, are captured in our RIPPER ruleset as discussed above.
As described in Section 5, two methodologies were compared for incorporating the feature SLU-success into the PDP. The first was to use the hand-labelled feature in the training set; the second was to perform separate experiments to predict the feature for the training set. In the second case, because the feature values in the training set are automatically predicted, the hope is that the learner will pick up the idiosyncrasies of the noisy data. This training method was used previously in [Wright 2000], where automatically identified intonation event features are used to train an automatic speech-act detector. In that work, the automatically derived features provided a better training model than the hand-labelled ones. The same holds in the current study, as discussed in Section 6.1.
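One common way to obtain a predicted feature for the training set itself is to predict each training example from a model that never saw it (out-of-fold prediction). The sketch below illustrates that idea under stated assumptions: the paper does not specify the procedure here, and the toy threshold learner, the fold assignment, the field names (`asr_duration`, `slu_success`, `auto_slu_success`) and the data are all invented for illustration.

```python
def fit_threshold(rows):
    """Toy learner: choose the asr_duration cut-off that best separates labels.

    Utterances at or below the cut-off are predicted to be understood.
    """
    best_cut, best_correct = 0.0, -1
    for cut in sorted({r["asr_duration"] for r in rows}):
        correct = sum((r["asr_duration"] <= cut) == r["slu_success"] for r in rows)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def out_of_fold_feature(rows, k=3):
    """Predict SLU success for every row using a model trained on the other folds."""
    predicted = [None] * len(rows)
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        cut = fit_threshold(train)
        for i, r in enumerate(rows):
            if i % k == fold:
                predicted[i] = r["asr_duration"] <= cut
    return predicted


# Invented exchanges: hand label in "slu_success", duration in seconds.
rows = [
    {"asr_duration": 0.8, "slu_success": True},
    {"asr_duration": 1.0, "slu_success": True},
    {"asr_duration": 1.2, "slu_success": True},
    {"asr_duration": 3.0, "slu_success": False},
    {"asr_duration": 3.5, "slu_success": False},
    {"asr_duration": 4.0, "slu_success": False},
]

predicted = out_of_fold_feature(rows)

# Method 1 trains the downstream predictor on the hand label;
# method 2 trains it on the noisy, automatically predicted value.
training_rows = [dict(r, auto_slu_success=p) for r, p in zip(rows, predicted)]
print(predicted)
```

In this toy data the predicted value disagrees with the hand label for the third exchange, which is the kind of noise a downstream predictor trained on the predicted feature would see at both training and test time.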