
## Experimental Results

The results of this experiment generally supported our hypothesis with respect to efficiency. We provide figures that show average values over all users in a particular group, with error bars showing the 95% confidence intervals. The x axis always shows the progression of a user's interactions with the system over time: each point is for the nth conversation, which the user completed either by finding an acceptable restaurant or by quitting.

Figure 2 shows that, for the modeling group, the average number of interactions required to find an acceptable restaurant decreased from 8.7 to 5.5, whereas for the control group this quantity actually increased from 7.6 to 10.3. We used linear regression to characterize the trend for each group and compared the resulting lines. The slope for the modeling line differed significantly (p=0.017) from that for the control line, with the former smaller than the latter, as expected.

The difference in interaction times (Figure 3) was even more dramatic. For the modeling group, this quantity started at 181 seconds and ended at 96 seconds, whereas for the control group, it started at 132 seconds and ended at 152 seconds. We again used linear regression to characterize the trends for each group over time and again found a significant difference (p=0.011) between the two lines, with the slope for the modeling subjects being smaller than that for the control subjects. We should also note that these measures include some time for system initialization (which could be up to 10% of the total dialogue time). If we had instead used the first system utterance of each dialogue as the start time, the difference between the two conditions would be even clearer.
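The text does not say which procedure was used to compare the two regression lines. As a minimal sketch of one common choice, the following fits a least-squares line to each group and computes the pooled-variance t statistic for the difference between the two slopes. The function names `fit_line` and `slope_difference_t` are hypothetical, and the choice of test is an assumption rather than the method reported here.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, sxx, sse): the fitted slope, the sum of squared
    deviations of x from its mean, and the residual sum of squares.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sxx, sse


def slope_difference_t(xs1, ys1, xs2, ys2):
    """t statistic and degrees of freedom for H0: the two slopes are equal.

    Assumption (not from the paper): both groups share a common residual
    variance, which is estimated by pooling the two residual sums of squares.
    """
    b1, sxx1, sse1 = fit_line(xs1, ys1)
    b2, sxx2, sse2 = fit_line(xs2, ys2)
    df = len(xs1) + len(xs2) - 4  # two parameters estimated per line
    pooled_var = (sse1 + sse2) / df
    if pooled_var == 0:
        raise ValueError("both fits are exact; the t statistic is undefined")
    se_diff = sqrt(pooled_var * (1.0 / sxx1 + 1.0 / sxx2))
    return (b1 - b2) / se_diff, df
```

Here `xs1` would hold the conversation index n and `ys1` the corresponding measure (number of interactions or seconds) for one group, with `xs2` and `ys2` holding the same for the other. A two-sided p-value follows from the t distribution with `df` degrees of freedom, for example `2 * scipy.stats.t.sf(abs(t), df)` if SciPy is available.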

The speech recognizer rejected 28 percent of the interactions in our study. Rejections slow down the conversation but do not introduce errors. Misrecognition was much less frequent, occurring in only 7 percent of the interactions in our experiment. We feel both of these rates are acceptable, but expanding the number of supported utterances could reduce the first number further, while potentially increasing the second. In the most common recognition error, the ADAPTIVE PLACE ADVISOR inserted extra constraints that the user did not intend.

The results for effectiveness were more ambiguous. Figure 4 plots the rejection rate, which here refers to users rejecting constraints proposed by the system rather than to speech recognizer rejections, as a function of the number of sessions. A decrease in rejection rate over time would mean that, as the system gains experience with the user, it asks about fewer features irrelevant to that user. However, for this dependent variable we found no significant difference (p=0.515) between the regression slopes for the two conditions and, indeed, the rejection rate does not appear to decrease with experience for either group. These negative results may be due to the rarity of rejection speech acts in the experiment. Six people never rejected a constraint, and on average each person used only 0.53 REJECT speech acts after an ATTEMPT-CONSTRAIN per conversation (standard deviation = 0.61).

Figure 5 shows the results for hit rate, which indicate that suggestion accuracy stayed stable over time for the modeling group but decreased for the control group. One explanation for the latter, which we did not expect, is that control users became less satisfied with the PLACE ADVISOR's suggestions over time and thus carried out more exploration at item presentation time. However, we are more concerned here with the difference between the two groups. Unfortunately, the slopes for the two regression lines were not significantly different (p=0.1354) in this case.

We also analyzed the questionnaire presented to subjects after the experiment. The first six questions (see Appendix A) had check boxes to which we assigned numerical values; none of these questions revealed a significant difference between the two groups. The second part of the questionnaire contained more open-ended questions about the user's experience with the ADAPTIVE PLACE ADVISOR. In general, most subjects in both groups liked the system and said they would use it fairly often if given the opportunity.

Cindi Thompson
2004-03-29