This paper reports results on automatically training a Problematic Dialogue Predictor to predict problematic human-computer dialogues using a corpus of 4692 dialogues collected with the How May I Help You spoken dialogue system. The Problematic Dialogue Predictor can be immediately applied to the system's decision of whether to transfer the call to a human customer care agent, or be used as a cue to the system's Dialogue Manager to modify its behavior to repair the problems identified. The results show that: (1) most feature sets significantly improve over the baseline; (2) using automatic features from the whole dialogue, we can identify problematic dialogues 20% better than the baseline; (3) the first exchange alone provides significantly better prediction (3%) than the baseline; (4) the second exchange provides an additional significant improvement (13%); and (5) a classifier based on task-independent automatic features performs slightly better than one trained on the full automatic feature set.
The improved ability to predict problematic dialogues is important for fielding the HMIHY system without the need for the oversight of a human customer care agent. These results are promising and we expect to be able to improve upon them, possibly by incorporating prosody into the feature set [Hirschberg et al. 1999] or expanding on the SLU feature sets. In addition, the results suggest that the current PDP is likely to generalize to other dialogue systems.
In future work, we plan to integrate the learned rulesets into the HMIHY dialogue system and evaluate their impact on the system's overall performance. There are several ways this might be demonstrated. Recall that one use of the PDP is to improve the system's decision of whether and when to transfer a call to a human customer care agent; the other is as input to the Dialogue Manager's dialogue strategy selection mechanism. Demonstrating the utility of the PDP for dialogue strategy selection requires experiments that test several different ways the Dialogue Manager could use this information. Demonstrating its utility for the transfer decision necessarily involves examining the tradeoffs among different kinds of errors, because every call that HMIHY handles successfully saves a company the cost of assigning a human customer care agent to that call. When HMIHY transfers a call unnecessarily, we call this cost the lost automation cost. On the other hand, every call that HMIHY attempts to handle and fails potentially accrues a different cost, namely the lost revenue from customers who become irritated with faulty customer service and take their business elsewhere. We call this the system failure cost. In the results presented here, we report only overall accuracy and treat lost automation cost and system failure cost as equally costly. However, in any particular installation of the HMIHY system, these costs may differ and would need to be accounted for in the training of the PDP. If the costs were known, RIPPER could accommodate them through its ability to vary the loss ratio.
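The cost tradeoff above can be made concrete with a small sketch. The error counts, the 3:1 cost ratio, and the two hypothetical classifiers below are illustrative assumptions, not figures from our experiments; only the two cost names come from the discussion above.

```python
def expected_cost(false_transfers, missed_failures,
                  lost_automation_cost, system_failure_cost):
    """Total error cost of a classifier on a test set.

    false_transfers: calls transferred that HMIHY could have handled
                     (each incurs the lost automation cost)
    missed_failures: problematic calls HMIHY attempted and failed
                     (each incurs the system failure cost)
    """
    return (false_transfers * lost_automation_cost
            + missed_failures * system_failure_cost)

# Two hypothetical classifiers with the same total number of errors:
# one balanced, one that transfers more eagerly (fewer missed failures),
# analogous to training RIPPER with a loss ratio favoring the
# problematic class.
balanced = {"false_transfers": 50, "missed_failures": 50}
cautious = {"false_transfers": 90, "missed_failures": 20}

# With equal costs (as in our accuracy results) the balanced classifier
# is cheaper; if a failed call costs 3x an unnecessary transfer, the
# eager-transfer classifier becomes preferable.
for ratio in (1, 3):
    cost_b = expected_cost(**balanced,
                           lost_automation_cost=1, system_failure_cost=ratio)
    cost_c = expected_cost(**cautious,
                           lost_automation_cost=1, system_failure_cost=ratio)
    print(f"ratio {ratio}: balanced={cost_b}, cautious={cost_c}")
```

The point of the sketch is that the ranking of classifiers flips with the cost ratio, which is why an installation-specific loss ratio would need to be folded into PDP training rather than applied after the fact.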
Another potential issue for future work is the relative utility of a dialogue-level predictor, e.g. the PDP, vs. an utterance-level predictor, e.g. the auto-SLU-success predictor, for the goal of automatically adapting a system's dialogue strategy. Dialogue-level prediction is shown to be effective in [Litman and Pan 2000], where a problematic dialogue detector is used to adapt the dialogue strategy of a train enquiry system. Others have argued [Levow 1998, Hirschberg et al. 1999, Kirchhoff 2001] that the dialogue manager's adaptation decisions can be made on the basis of local behavior, i.e. on the basis of recognizing that the current utterance has been misunderstood, or that the current utterance is a correction. However, the decision to transfer the call to a human customer care agent cannot be made on the basis of local information alone, because the system can often recover from a single error. Thus, we expect that the ability to predict the dialogue outcome, as we do here, will remain important even in systems that use local predictors for understanding and correction.