As mentioned in Section 2, there are three types of problematic dialogues: TASKFAILURE, WIZARD and HANGUP. To determine whether some of these types are more difficult to predict than others, we conducted a post-hoc analysis of the proportion of prediction failures for each type. Since we were primarily interested in the performance of the PDP using the full automatic feature set after having seen Exchanges 1&2, we conducted our analysis on this version of the PDP. Table 10 shows the distribution of the four types of dialogue (TASKSUCCESS and the three problematic types) in the test set, and whether the Exchanges 1&2 PDP correctly predicted each dialogue to be TASKSUCCESS or PROBLEMATIC. The worst-performing category is TASKFAILURE: the PDP incorrectly predicts 68.5% of the TASKFAILURE dialogues to be TASKSUCCESS.
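The per-type analysis described above amounts to grouping the test dialogues by their true type and computing, for each group, the fraction that the binary predictor got wrong. The following Python fragment is a minimal sketch of that bookkeeping, not the code used in the study; the function name `failure_rate_by_type`, the representation of each dialogue as a `(true_type, predicted_label)` pair, and the toy records are all assumptions made for illustration.

```python
from collections import Counter

def failure_rate_by_type(records):
    """Return the proportion of prediction failures for each dialogue type.

    `records` is an iterable of (true_type, predicted_label) pairs, where
    true_type is one of TASKSUCCESS, TASKFAILURE, WIZARD, HANGUP and
    predicted_label is the binary output TASKSUCCESS or PROBLEMATIC.
    (This data layout is hypothetical, chosen only for the sketch.)
    """
    totals = Counter()
    errors = Counter()
    for true_type, predicted in records:
        # Every type other than TASKSUCCESS counts as PROBLEMATIC.
        expected = "TASKSUCCESS" if true_type == "TASKSUCCESS" else "PROBLEMATIC"
        totals[true_type] += 1
        if predicted != expected:
            errors[true_type] += 1
    return {t: errors[t] / totals[t] for t in totals}

# Toy records, invented for illustration; these are not the paper's data.
toy_records = [
    ("TASKFAILURE", "TASKSUCCESS"),
    ("TASKFAILURE", "PROBLEMATIC"),
    ("HANGUP", "PROBLEMATIC"),
    ("TASKSUCCESS", "TASKSUCCESS"),
]
rates = failure_rate_by_type(toy_records)
```

A rate computed this way for TASKFAILURE corresponds to the kind of figure reported above, namely the share of TASKFAILURE dialogues wrongly predicted to be TASKSUCCESS.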
One reason this might occur is that this sub-category of dialogues is much more difficult to predict, since in this case the HMIHY system has no indication that it is not succeeding in the task. Another possibility is that the PDP performs poorly on this category because there are fewer examples in the training set, although it does better on the HANGUP subset, which accounts for about the same proportion. We can test the first possibility by examining how a learner performs when trained on equal proportions of TASKSUCCESS and TASKFAILURE dialogues. We conducted an experiment using a subset of TASKSUCCESS dialogues in the same proportion as TASKFAILURE dialogues for both the training and the test set, and trained a second PDP using the fully automatic Exchanges 1&2 features. This resulted in a training set of 690 dialogues and a test set of 216. The resulting binary classifier has an accuracy of 70%; the corresponding recognition matrix is presented in Table 11. The results show that fewer TASKFAILURES are predicted as successful, suggesting that TASKFAILURES are not inherently more difficult to predict than other classes of problematic dialogues. Below we discuss the potential of using RIPPER's loss ratio to weight different types of classification errors in future work.
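The balanced experiment above relies on drawing a subset of TASKSUCCESS dialogues equal in size to the TASKFAILURE set. The sketch below shows one way such a subset could be drawn by random downsampling. It is an illustration under stated assumptions rather than the procedure actually used: the function name `balanced_subset`, the `(label, features)` tuple layout, the fixed seed, and the toy data are all hypothetical, and the learner itself (RIPPER) is not shown.

```python
import random

def balanced_subset(dialogues, label_a="TASKSUCCESS", label_b="TASKFAILURE", seed=0):
    """Return a shuffled subset with equal numbers of label_a and label_b items.

    `dialogues` is a list of (label, features) tuples (a hypothetical layout).
    Dialogues with any other label are dropped, and the larger of the two
    classes is randomly downsampled to the size of the smaller one.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    items_a = [d for d in dialogues if d[0] == label_a]
    items_b = [d for d in dialogues if d[0] == label_b]
    k = min(len(items_a), len(items_b))
    subset = rng.sample(items_a, k) + rng.sample(items_b, k)
    rng.shuffle(subset)
    return subset

# Toy data, invented for illustration: 10 successes, 3 failures, 2 hang-ups.
toy = (
    [("TASKSUCCESS", {"id": i}) for i in range(10)]
    + [("TASKFAILURE", {"id": i}) for i in range(3)]
    + [("HANGUP", {"id": i}) for i in range(2)]
)
balanced = balanced_subset(toy)
labels = [label for label, _ in balanced]
```

Applying the same procedure separately to the training and test partitions keeps the two classes in equal proportion in both, which is the condition the experiment requires.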