Next: Problematic Dialogue Predictor Up: Automatically Training a Problematic Previous: The Features

Auto-SLU-success Predictor

The goal of the auto-SLU-success predictor is to identify, for each exchange, whether or not the system correctly understood the user's utterance. As mentioned above, when the dialogues were transcribed by humans after the data collection was completed, the human labelers not only transcribed the users' utterances, but also labelled each utterance with a semantic category representing the task that the user was asking HMIHY to perform. This label is called the human-label. The system's Dialogue Manager decides among several different hypotheses produced by the SLU module, and logs its hypothesis about what task the user was asking HMIHY to perform; the Dialogue Manager's hypothesis is known as the sys-label.

We distinguish four classes of spoken language understanding outcomes based on comparing the human-label, the sys-label and recognition results for card and telephone numbers: (1) RCORRECT: SLU correctly identified the task and any digit strings were also correctly recognized; (2) RPARTIAL-MATCH: SLU correctly recognized the task but there was an error in recognizing a calling card number or a phone number; (3) RMISMATCH: SLU did not correctly identify the user's task; (4) NO-RECOG: the recognizer did not get any input to process and so the SLU module did not either. The NO-RECOG case can arise either because the user did not say anything or because the recognizer was not listening when the user spoke.

The RCORRECT class accounts for 7481 (36.1%) of the exchanges in the corpus. The RPARTIAL-MATCH class accounts for 109 (0.5%) of the exchanges. The RMISMATCH class accounts for 4197 (20.2%) of the exchanges and the NO-RECOG class accounts for 8943 (43.1%) of the exchanges.
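The four-way labelling scheme described above can be sketched as a small decision function. This is only an illustrative reading of the class definitions, not the authors' labelling code: the function name, its parameters, and the task label BILLING-CREDIT in the usage lines are hypothetical, and the order in which the conditions are checked is an assumption. The class counts are the corpus figures quoted in the text.

```python
from typing import Dict, Optional


def slu_outcome(human_label: str,
                sys_label: Optional[str],
                digits_ok: bool = True,
                recognizer_got_input: bool = True) -> str:
    """Assign one of the four SLU outcome classes to an exchange.

    Hypothetical helper: the check order below is an assumption.
    """
    if not recognizer_got_input:
        # Nothing reached the recognizer, so SLU had nothing to process.
        return "NO-RECOG"
    if sys_label != human_label:
        # The Dialogue Manager's hypothesis disagrees with the human label.
        return "RMISMATCH"
    if not digits_ok:
        # Task is right, but a card or phone number was misrecognized.
        return "RPARTIAL-MATCH"
    return "RCORRECT"


# Class counts for the corpus, as reported in the text.
CORPUS_COUNTS: Dict[str, int] = {
    "RCORRECT": 7481,
    "RPARTIAL-MATCH": 109,
    "RMISMATCH": 4197,
    "NO-RECOG": 8943,
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's share of all exchanges, in percent (1 decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Usage (BILLING-CREDIT is a made-up task label for illustration).
print(slu_outcome("DIAL-FOR-ME", "DIAL-FOR-ME"))                   # RCORRECT
print(slu_outcome("DIAL-FOR-ME", "BILLING-CREDIT"))                # RMISMATCH
print(slu_outcome("DIAL-FOR-ME", "DIAL-FOR-ME", digits_ok=False))  # RPARTIAL-MATCH
print(class_shares(CORPUS_COUNTS))
```

Applied to the reported counts, `class_shares` reproduces the percentages given in the text (36.1%, 0.5%, 20.2% and 43.1%).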

The auto-SLU-success predictor is trained using 45 fully automatic features. These are the Acoustic/ASR features, the SLU features, and the Dialogue Manager and Discourse History features given in Figure 7. Hand-labelled features were not used.

We evaluate the four-way auto-SLU-success classifier by reporting accuracy, precision, recall and the categorization confusion matrix. This classifier is trained on all the features for the whole training set, and then tested on the held-out test set.

Table 1 summarizes the overall accuracy of the classifier trained on the whole training set and tested on the test set described in Section 3. The first row of Table 1 gives the accuracy obtained by always guessing the majority class (NO-RECOG); this is the BASELINE against which the other results should be compared. The second row, labelled AUTOMATIC, shows the accuracy obtained using all the features available from the system modules. This classifier identifies SLU errors with an accuracy 47.0 percentage points above the baseline. An experiment was also run to see whether the cross-validation method described in Section 3 performs worse than training on the whole training set when evaluated on the same test set. It showed little loss of accuracy when using cross-validation (0.6%).

Table 1: Results for detecting SLU Errors using RIPPER
Features Used              Accuracy
BASELINE (majority class)  43.1%
AUTOMATIC                  90.1%
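The comparison against the majority-class baseline is simple arithmetic on the two accuracies in Table 1, sketched below. The function names are hypothetical. The relative error reduction is an additional way of reading the same two numbers and is not a figure reported in the text.

```python
def absolute_gain(accuracy: float, baseline: float) -> float:
    """Improvement over the baseline in percentage points."""
    return round(accuracy - baseline, 1)


def relative_error_reduction(accuracy: float, baseline: float) -> float:
    """Share of the baseline's errors that the classifier removes, in percent.

    Not reported in the source; derived here from the Table 1 accuracies.
    """
    return round(100 * (accuracy - baseline) / (100 - baseline), 1)


BASELINE = 43.1   # always guess NO-RECOG (Table 1)
AUTOMATIC = 90.1  # RIPPER with all automatic features (Table 1)

print(absolute_gain(AUTOMATIC, BASELINE))             # 47.0
print(relative_error_reduction(AUTOMATIC, BASELINE))  # 82.6
```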

Figure 10 shows some of the top-performing rules that RIPPER learns when given all the features. These rules directly reflect the usefulness of the SLU features. Note that some of the rules use ASR features in combination with SLU features such as salpertime. Previous studies [Walker et al. 2000c] have also shown SLU features to be useful. We had also hypothesized that features from the Dialogue Manager and the discourse history might be useful predictors of SLU errors; however, with the exception of sys-label, these features rarely appear in the rules. This is in accordance with previous experiments, which show that these features do not add significantly to the performance of the SLU ONLY feature set [Walker et al. 2000c].

Figure 10: A subset of rules learned by RIPPER when given the automatic features for determining auto-SLU-success
if (sys-label = DIAL-FOR-ME) ∧ ... then rmismatch

We also report precision and recall for each category on the held-out test set. The results are shown in Tables 2 and 3. Table 2 shows that the overall classification accuracy results from a high rate of correct classification for the RCORRECT and NO-RECOG classes, at the cost of a lower rate for RMISMATCH and RPARTIAL-MATCH. This is probably because there are fewer examples of the latter two categories in the training set.

Table 2: Precision and Recall for Test set using Automatic features
Class           Recall  Precision
RCORRECT        92.6%   86.8%
NO-RECOG        98.5%   97.5%
RMISMATCH       70.6%   81.0%
RPARTIAL-MATCH  22.7%   40.0%

Table 3: Confusion Matrix for Test set using Automatic features (rows: human-assigned class; columns: predicted class)
           RCORRECT  NO-RECOG  RMISMATCH  RPARTIAL-MATCH
RCORRECT       2784         6        211               5
NO-RECOG          9      3431         44               0
RMISMATCH       409        83       1204              10
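The recall figures in Table 2 can be checked against the rows of Table 3, as in the sketch below. It assumes that each row holds the exchanges of one true class and that the columns are the predicted classes in the order RCORRECT, NO-RECOG, RMISMATCH, RPARTIAL-MATCH; this reading is an assumption, although it reproduces the reported recall values. Only the three rows that appear in Table 3 are used, so per-class precision (which needs every row) is not computed.

```python
from typing import Dict, List

# Assumed column order of Table 3 (predicted class).
CLASSES: List[str] = ["RCORRECT", "NO-RECOG", "RMISMATCH", "RPARTIAL-MATCH"]

# Rows of Table 3 as given in the text (true class -> predicted counts).
ROWS: Dict[str, List[int]] = {
    "RCORRECT": [2784, 6, 211, 5],
    "NO-RECOG": [9, 3431, 44, 0],
    "RMISMATCH": [409, 83, 1204, 10],
}


def recall_per_class(rows: Dict[str, List[int]],
                     classes: List[str]) -> Dict[str, float]:
    """Recall = correctly predicted count / row total, in percent."""
    result = {}
    for label, row in rows.items():
        correct = row[classes.index(label)]  # diagonal cell for this class
        result[label] = round(100 * correct / sum(row), 1)
    return result


print(recall_per_class(ROWS, CLASSES))
# {'RCORRECT': 92.6, 'NO-RECOG': 98.5, 'RMISMATCH': 70.6}
```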

In some situations, one might not need to distinguish between the different misunderstanding categories: NO-RECOG, RMISMATCH and RPARTIAL-MATCH. Therefore, experiments were performed that collapsed these three problematic categories into one category (RINCORRECT). This resulted in a classification accuracy of 92.4%, an improvement of 29.4 percentage points over the baseline of 63%, which is the percentage of RINCORRECT exchanges. Precision and recall are given in Table 4.

Table 4: Precision and Recall for Test set using Automatic features
Class       Recall  Precision
RCORRECT    91.2%   89.0%
RINCORRECT  93.1%   94.5%
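The collapsing of the three problematic categories can be illustrated with a simple label mapping. The sketch below shows only that mapping and its effect on accuracy for a handful of invented labels; the helper names and the toy data are hypothetical, and it does not reproduce how the binary classifier in the experiments was trained.

```python
from typing import List

PROBLEMATIC = {"NO-RECOG", "RMISMATCH", "RPARTIAL-MATCH"}


def collapse(label: str) -> str:
    """Map the three problematic classes onto the single class RINCORRECT."""
    return "RINCORRECT" if label in PROBLEMATIC else label


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Invented toy labels, for illustration only.
gold = ["RCORRECT", "RMISMATCH", "NO-RECOG", "RPARTIAL-MATCH"]
pred = ["RCORRECT", "NO-RECOG", "NO-RECOG", "RMISMATCH"]

print(accuracy(gold, pred))  # 0.5
print(accuracy([collapse(g) for g in gold],
               [collapse(p) for p in pred]))  # 1.0
```

In the toy data, confusions among the problematic classes count as errors in the four-way setting but not once the classes are merged, which is why the two-way accuracy can only be equal or higher for the same predictions.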

Helen Hastie