
Feature Sets

It is interesting to examine which types of features are the most discriminatory in determining whether a dialogue is problematic. RIPPER was trained separately on feature sets based on the groups given in Figure 7, namely Acoustic/ASR, SLU, Dialogue and Hand-labelled (including SLU-success). The results are given in Table 9.
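As a concrete illustration of this experimental setup, the sketch below trains a classifier separately on each feature group and reports cross-validated accuracy. It is only a sketch under stated assumptions: scikit-learn's DecisionTreeClassifier stands in for RIPPER (which is not available in scikit-learn), and the column names, the group definitions and the file exchange1_features.csv are hypothetical placeholders rather than the actual feature names used in these experiments.

    # Sketch: train a classifier on each feature group and compare accuracy.
    # A DecisionTreeClassifier stands in for RIPPER; columns and CSV are hypothetical.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    FEATURE_GROUPS = {
        "ASR":              ["asr_duration", "num_recognized_words", "recog_grammar"],
        "SLU":              ["slu_confidence", "num_slots_filled"],
        "Dialogue":         ["prompt_type", "num_reprompts"],
        "Auto-SLU-success": ["auto_slu_success"],
        "Hlt-SLU-success":  ["hlt_slu_success"],
    }

    df = pd.read_csv("exchange1_features.csv")   # hypothetical feature table
    y = df["problematic"]                        # 1 = problematic dialogue, 0 = not

    for name, cols in FEATURE_GROUPS.items():
        X = pd.get_dummies(df[cols])             # one-hot encode categorical features
        clf = DecisionTreeClassifier(random_state=0)
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name:18s} accuracy = {scores.mean():.1%}")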

For Exchange 1, the SLU features are the only automatic feature set that yields an improvement over the baseline. Interestingly, training the system on the ASR features yields the best result of the automatic feature sets for Exchange 1&2 and for the whole dialogue. These systems use, for example, asr-duration, the number of recognized words, and the type of recognition grammar as features in their rulesets.

Finally, we give results for systems trained only on auto-SLU-success and on hlt-SLU-success. There is not much difference between the two sets of results. For Exchanges 1&2, however, the system trained on hlt-SLU-success has a significantly higher accuracy than the system trained on auto-SLU-success by a paired t-test (df=866, t=3.0, p=0.03). On examining the rulesets, one finds that the hlt-SLU-success ruleset uses RPARTIAL-MISMATCH where the auto-SLU-success ruleset does not. The lower accuracy may be due to the fact that the auto-SLU-success predictor has low recall and precision for RPARTIAL-MISMATCH, as seen in Table 2.
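The significance comparison above can be reproduced in outline with a paired t-test over per-dialogue correctness, pairing the two systems on the same dialogues (the reported df of 866 suggests 867 paired items). The sketch below uses scipy.stats.ttest_rel; the label and prediction arrays are random placeholders, not the actual system outputs.

    # Sketch of the paired t-test comparing the two SLU-success variants.
    # labels, preds_auto and preds_hlt are hypothetical per-dialogue arrays.
    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=867)       # gold problematic/not-problematic labels
    preds_auto = rng.integers(0, 2, size=867)   # predictions of the auto-SLU-success system
    preds_hlt = rng.integers(0, 2, size=867)    # predictions of the hlt-SLU-success system

    correct_auto = (preds_auto == labels).astype(float)  # 1 where the prediction is correct
    correct_hlt = (preds_hlt == labels).astype(float)

    t_stat, p_value = ttest_rel(correct_hlt, correct_auto)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}, df = {labels.size - 1}")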


 
Table 9: Accuracy (%) results for subsets of features

Features            Exchange 1   Exchange 1&2   Whole
Baseline                  67.1           67.1    67.1
ASR                       66.7           75.9    85.6
SLU                       67.7           71.9    79.8
Dialogue                  65.5           74.5    82.6
Hand-labelled             76.9           84.7    86.2
Auto-SLU-success          69.0           70.9    77.1
Hlt-SLU-success           69.0           74.1    77.2

