
Task-independent Features

Rows 4 and 5 give the results using the AUTO, TASK-INDEPT feature set described in Figure 9, without and with the auto-SLU-success feature, respectively. These results are significantly above the baseline by a paired t-test; for Exchanges 1&2, the TASK-INDEPT features with auto-SLU-success give an increase of 13.1% (df=866, t=8.6, p=0.001). Comparing rows 4 and 5 shows that adding the auto-SLU-success feature to the AUTO, TASK-INDEPT feature set improves accuracy for both Exchanges 1&2 and the whole dialogue: the 1.9% increase for Exchanges 1&2 shows a trend (df=866, t=1.7, p=0.074), whereas the 2% increase for the whole dialogue is statistically significant (df=866, t=3.0, p=0.003).
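The significance tests above are paired t-tests over per-dialogue correctness for two predictors on the same dialogues. The following sketch computes the paired t statistic from scratch on hypothetical 0/1 correctness vectors (the accuracy figures mirror Table 5, but the individual outcomes are simulated, not the paper's data):

```python
import math
import random

random.seed(0)
n = 867  # df = n - 1 = 866, matching the paper's tests

# Hypothetical per-dialogue correctness (1 = predicted correctly) for the
# baseline (~67.1%) and for TASK-INDEPT + auto-SLU-success (~80.3%).
baseline = [1 if random.random() < 0.671 else 0 for _ in range(n)]
model = [1 if random.random() < 0.803 else 0 for _ in range(n)]

# Paired t statistic: mean of per-dialogue differences over its standard error.
diffs = [m - b for m, b in zip(model, baseline)]
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
t = mean / math.sqrt(var / n)
print(f"paired t = {t:.2f} with df = {n - 1}")
```

With a true accuracy gap this large and 867 paired observations, the statistic lands well above conventional significance thresholds, consistent with the p-values reported in the text.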

Although the TASK-INDEPT feature set is a subset of the features used in row 3, it can perform better because the TASK-INDEPT features are more general and because RIPPER uses a greedy algorithm to discover its rule sets. For Exchanges 1&2, the increase from row 3 to row 5 (both of which use auto-SLU-success) is not significant. Comparing rows 2 and 4, neither of which uses auto-SLU-success, there is a slight degradation for the whole dialogue using TASK-INDEPT features. However, the increase from row 2 to row 5 for Exchanges 1&2, from 78.1% to 80.3%, is statistically significant (df=866, t=2.0, p=0.042). This shows that using auto-SLU-success in combination with the TASK-INDEPT feature set produces a statistically significant increase in accuracy over a set of automatic features that does not include this feature.
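The point about greediness can be illustrated with a toy covering learner: it commits to the locally best condition and never backtracks, so enlarging the feature pool does not guarantee better rules. This is not RIPPER itself (RIPPER adds incremental reduced-error pruning and optimization passes), and the feature names below are invented for illustration:

```python
def best_condition(examples, conditions):
    """Greedily pick the condition with highest precision on positives."""
    def precision(cond):
        covered = [e for e in examples if cond(e)]
        if not covered:
            return 0.0
        return sum(e["problematic"] for e in covered) / len(covered)
    return max(conditions, key=precision)

def grow_rules(examples, conditions, max_rules=3):
    """Sequential covering: learn a rule, drop covered examples, repeat."""
    rules, remaining = [], list(examples)
    for _ in range(max_rules):
        if not any(e["problematic"] for e in remaining):
            break  # all problematic dialogues covered
        cond = best_condition(remaining, conditions)
        rules.append(cond)
        remaining = [e for e in remaining if not cond(e)]
    return rules

# Invented task-independent features: an ASR confidence score and a binary
# auto-SLU-success flag per dialogue.
data = [
    {"asr_conf": 0.3, "slu_success": 0, "problematic": 1},
    {"asr_conf": 0.9, "slu_success": 1, "problematic": 0},
    {"asr_conf": 0.4, "slu_success": 0, "problematic": 1},
    {"asr_conf": 0.8, "slu_success": 1, "problematic": 0},
]
conds = [lambda e: e["asr_conf"] < 0.5, lambda e: e["slu_success"] == 0]
rules = grow_rules(data, conds)
print(f"learned {len(rules)} rule(s)")
```

On this toy data one rule covers every problematic dialogue, so the learner stops early; with a larger, more specific feature pool, the same greedy commitment can lock the learner out of better rule sets.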

Since the main purpose of these experiments is to determine whether a dialogue is potentially problematic, features drawn from the whole dialogue are of little use in a deployed system: by the time they are available, it is too late to intervene. Predicting from Exchanges 1&2 produces accurate results and would enable the system to adapt in time to complete the dialogue in an appropriate manner.
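How such an early prediction might be used online can be sketched as follows; the function names and the rerouting strategy are invented here, not taken from the paper:

```python
def handle_dialogue(exchanges, predict_problematic, adapt):
    """After two exchanges, consult the predictor and adapt if needed.

    predict_problematic: classifier over the first two exchanges
    adapt: fallback strategy, e.g. rerouting to a human operator
    (both are hypothetical hooks supplied by the caller).
    """
    if len(exchanges) >= 2 and predict_problematic(exchanges[:2]):
        return adapt()
    return "continue"

# Toy usage with stub predictor and fallback.
result = handle_dialogue(
    ["exchange 1", "exchange 2"],
    predict_problematic=lambda ex: True,
    adapt=lambda: "operator",
)
print(result)
```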


 
Table 5: Accuracy (%) results for predicting problematic dialogues.

Row  Features                                   Exchange 1  Exchanges 1&2  Whole dialogue
1    Baseline                                   67.1        67.1           67.1
2    AUTO (no auto-SLU-success)                 70.1        78.1           87.0
3    AUTO + auto-SLU-success                    69.6        79.2           84.9
4    AUTO, TASK-INDEPT (no auto-SLU-success)    70.1        78.4           83.4
5    AUTO, TASK-INDEPT + auto-SLU-success       69.2        80.3           85.4
6    AUTO + SLU-success                         75.6        85.7           92.9
7    ALL (AUTO + Hand-labelled)                 77.1        86.9           91.7
 


Helen Hastie
2002-05-09