This section presents results for predicting problematic dialogues. Because a problematic dialogue must be predicted early enough for the system to act on the prediction, we compare prediction accuracy after seeing only the first exchange, or the first two exchanges, with identification accuracy after seeing the whole dialogue. For each of these conditions, we also compare results for the AUTOMATIC feature set (as described earlier) with and without the auto-SLU-success feature, and with the hand-labelled feature SLU-success.
Table 5 summarizes the overall accuracy results. The three columns present results for Exchange 1, Exchanges 1&2 and over the whole dialogue. The first row gives the baseline result which represents the prediction accuracy from always guessing the majority class. Since 67.1% of the dialogues are TASKSUCCESS dialogues, we can achieve 67.1% accuracy from simply guessing TASKSUCCESS for each dialogue. The second row gives results using only automatic features, but without the auto-SLU-success feature. The third row uses the same automatic features but adds in auto-SLU-success. This feature is obtained for both the training and the test set, using the cross-validation method discussed in Section 3. The fourth and fifth rows show results using the subset of features that are both fully automatic and task-independent as described in Section 4.
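The majority-class baseline in row 1 can be sketched as follows. The 67.1% / 32.9% split mirrors the corpus proportions described in the text, but the label list itself is constructed here purely for illustration:

```python
# Hypothetical sketch of the majority-class baseline: always guess the most
# frequent label in the corpus. The counts below are scaled to match the
# reported 67.1% TASKSUCCESS proportion; they are not the actual corpus.
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    majority, freq = Counter(labels).most_common(1)[0]
    return majority, freq / len(labels)

labels = ["TASKSUCCESS"] * 671 + ["PROBLEMATIC"] * 329
label, acc = majority_baseline(labels)
# acc == 0.671, matching the 67.1% baseline reported in Table 5
```

Any classifier must beat this accuracy to be useful, which is why each row of Table 5 is compared against it.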
The automatic results given in row 2 are significantly higher than the baseline, by a paired t-test, for all three portions of the dialogue (df=866, t=2.1, p=0.035; df=866, t=7.2, p=0.001; df=866, t=13.4, p=0.001).
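A paired t-test of this kind compares two classifiers on the same set of dialogues, pairing their per-dialogue outcomes. The following is a minimal sketch of that computation; the toy 0/1 outcome vectors are invented for illustration, since the per-dialogue scores behind the reported statistics (df=866, i.e. 867 paired observations) are not given:

```python
# Hypothetical sketch of a paired t-test over per-dialogue correctness,
# the kind of comparison used to test a classifier against the baseline.
from math import sqrt

def paired_t(xs, ys):
    """Return (t statistic, degrees of freedom) for paired samples xs, ys."""
    assert len(xs) == len(ys)
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = sqrt(var / n)  # standard error of the mean difference
    return mean / se, n - 1

# Toy data: 1 = dialogue classified correctly, 0 = misclassified.
model    = [1, 1, 0, 1, 1, 1, 0, 1]
baseline = [1, 0, 0, 1, 0, 1, 0, 0]
t, df = paired_t(model, baseline)
```

The resulting t statistic is compared against the t distribution with df degrees of freedom to obtain the p-values reported above.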
Rows 6 and 7 show the accuracy improvements gained by adding hand-labelled features. These rows give a TOPLINE against which to compare the results in rows 2, 3, 4 and 5. Row 6 gives results using all the automatic features plus the hand-labelled SLU-success; in these experiments, the hand-labelled SLU-success feature is used for both training and testing. Comparing this result with the second row shows that a perfect predictor of auto-SLU-success in the training and test sets would increase accuracy by 5.5% for Exchange 1 (from 70.1% to 75.6%), by 7.6% for Exchanges 1&2 (from 78.1% to 85.7%), and by 5.9% for the whole dialogue (from 87.0% to 92.9%). These increases are significant by a paired t-test (df=866, t=5.1, p=0.0001; df=866, t=2.1, p=0.035; df=866, t=6.7, p=0.001).
Comparing the result in row 6 with the result in row 3 shows that the auto-SLU-success predictor we have trained can improve performance, but might help more under different training methods. Ideally, the result in row 3, for automatic features plus auto-SLU-success, should fall between the figures in rows 2 and 6, and be closer to row 6. For Exchanges 1&2, adding auto-SLU-success yields an increase of 1.1%, which is not significant (compare rows 2 and 3). For Exchange 1 alone, RIPPER does not use the auto-SLU-success feature in its ruleset, so there is no improvement over the system trained only on the automatic features. Likewise, the system trained on the whole dialogue with automatic features plus auto-SLU-success does not improve on the system trained without it.