Readings on Spoken Dialog Systems: Discussions

(see the readings page for full bibliography)


PARADISE: A framework for evaluating spoken dialogue agents.

March 12, 1999

Summary:
PARADISE (PARAdigm for Dialogue System Evaluation) is a framework for evaluating spoken dialog systems. Using PARADISE, it is possible to compare the performance of different dialog strategies, evaluate performance over subdialogs as well as whole dialogs, and compare systems designed for different tasks by normalizing for task complexity.

PARADISE uses user satisfaction ratings as an indicator of usability, and calculates the contribution of two potential factors (task success and dialog costs) to user satisfaction using a decision-theoretic framework. To measure task success, PARADISE uses the Kappa coefficient calculated from a confusion matrix, which is built by comparing the attribute-value matrices of the dialogue and of the scenario key. Examples of dialog costs are the total number of utterances and the number of repair utterances.
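As a rough illustration of how task success and dialog costs could be combined into a single score, here is a minimal sketch. It assumes a weighted sum of normalized measures, with the weights taken as given (for example, from a regression against user satisfaction). The function names, the weights, and the numbers are all made up for illustration and are not taken from the paper.

```python
from statistics import mean, stdev


def z_normalize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def performance(kappas, costs, alpha, weights):
    """Per-dialog score: alpha * N(kappa) - sum_i w_i * N(cost_i).

    kappas  -- one task-success value per dialog
    costs   -- dict mapping a cost name to its per-dialog values
    alpha, weights -- assumed to be given (hypothetical values below)
    """
    norm_kappa = z_normalize(kappas)
    norm_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(kappas)):
        penalty = sum(weights[name] * norm_costs[name][d] for name in costs)
        scores.append(alpha * norm_kappa[d] - penalty)
    return scores


# Made-up numbers for three dialogs, for illustration only.
kappas = [0.9, 0.5, 0.7]
costs = {"utterances": [20, 40, 30], "repairs": [1, 5, 3]}
scores = performance(
    kappas, costs, alpha=0.5, weights={"utterances": 0.2, "repairs": 0.3}
)
```

Normalizing each measure before weighting keeps a cost counted in tens (utterances) from swamping one counted in single digits (repairs).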

Discussion Points:
Overall, this evaluation framework seems to be a good way of measuring and comparing performance of spoken dialog systems. It will allow us to compare different dialog strategies and empirically find the main contributing factors to user satisfaction, which we consider to be the major evidence of performance. One concern we have is the amount of work that would be required to tag the dialogs with the attributes for the task. For a complex task, defining an adequate set of attributes can actually be difficult. We are also not sure that (or how) this evaluation framework will apply to subdialogs as claimed in section 2.5.


Second discussion on April 7, 1999

This evaluation framework operationalizes the success rate of the information transfer. This is done using an attribute-value matrix, by comparing the matrices for the system and the user. A confusion matrix, where the off-diagonal entries represent mismatches between the system and the user matrices, is used to calculate the Kappa statistic.
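The Kappa calculation can be sketched as follows. This assumes the usual form kappa = (P(A) - P(E)) / (1 - P(E)), with chance agreement P(E) estimated from the column totals of the confusion matrix. The row/column orientation and the example matrices are our own assumptions.

```python
def kappa(confusion):
    """Kappa for a square confusion matrix given as a list of rows.

    Assumed orientation: rows are values observed in the dialogs,
    columns are values in the scenario key.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): proportion of entries on the diagonal (agreement with the key).
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    # P(E): chance agreement, estimated from the column totals.
    col_totals = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((t / total) ** 2 for t in col_totals)
    return (p_agree - p_chance) / (1 - p_chance)


# Made-up matrices: no mismatches vs. a few off-diagonal mismatches.
perfect = [[10, 0], [0, 10]]
noisy = [[8, 2], [2, 8]]
kappa_perfect = kappa(perfect)
kappa_noisy = kappa(noisy)
```

With these toy matrices, the mismatch-free case gives a Kappa of 1, and moving counts off the diagonal lowers it.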

The appropriateness of their decision to use "user satisfaction" as the top-level goal can be debated, but it doesn't seem to matter for the purpose of evaluating this evaluation framework. Depending on the system one needs to evaluate, the top-level goal can be anything--amount of money saved, total elapsed time, etc. Another question that was raised concerns the set of values for an attribute. This framework seems to presuppose that there is a finite set of values and that the set of values depends on the dialogs used in the study, not the whole domain. There may be cases where the set of values is not finite, or the whole set is not represented in the dialogs used in the evaluation. Also, what does this framework do to capture a non-uniform distribution of values?

To be continued...


Empirically evaluating an adaptable spoken dialogue system.

March 15, 1999

We liked this paper for providing experimental results to show the effects of different dialog strategies. We can also see that adapting the dialogue strategy increases the performance (as measured by task success and user satisfaction). This paper also shows that users may be willing to sacrifice efficiency for getting the task done. "Our findings draw into question a frequently made assumption in the field regarding the centrality of efficiency to performance ... " We would have liked to see how exactly the users changed the dialog strategies, and in which circumstances. Understanding and somehow generalizing the circumstances in which the users changed the initiative or confirmation (and in which direction) would make it possible to automate the adaptation. We look forward to reading about their current experiments with other dialog strategies (other than the two extreme initial dialog strategy configurations in this paper). We would also like to see whether the results would be different for "expert" users.


Second discussion on April 7, 1999

Coming soon...


From Novice to Expert: The Effect of Tutorials on User Expertise with Spoken Dialogue Systems.

March 19, 1999

This paper presents the effect of a tutorial on novice users' performance, quantified by a carefully designed experiment. The result is as expected: the performance and usability ratings for the novice group with the tutorial are significantly higher than those for the novice group without it. However, as is noted in the paper, many systems do not get regular use by a single user, and in that case a 4-minute tutorial would not be justified. It would only make sense for applications with regular use, such as the ones in the paper. We had a few questions about the experiment. First, why did the experiment not include an expert-notutorial group? Also, were the different tasks presented in random order for each of the subjects? If not, would there be an effect of ordering on the results?

This paper is written in the context of a well-thought-out dialog system, so in addition to the main point about the effect of tutorial, this paper also gave us some good ideas about the general design of dialog systems and evaluation methods. For example, having context-sensitive help messages for each state seems to make sense for getting the user out of difficulty. This paper (and the next) also helped to clarify some of the questions we had about the PARADISE evaluation framework.


Second discussion on April 12, 1999

There were 3 groups: experts (with tutorial), novice-notutorial, novice-tutorial. By the end of the 3 tasks, the differences between the three groups were small, although the novice-notutorial group still performed worse than the other two groups. We would like to see the experiment extended to more than 3 tasks. At the end of 5 or 6 tasks, we might see smaller, or even no significant, differences between the groups. Also, for the notutorial group, what would happen if they had 4 minutes (the length of the tutorial) to spend on just using the system? This may be a better evaluation of the effect of tutorials. One thing to notice from this paper is that user performance usually goes up with repeated use; to evaluate the overall system performance, we should look not only at how the system performs the first time a novice user uses it, but also at how it performs after repeated use. We also noted that the ANOVA they used did not consider the interaction terms.


Learning optimal dialogue strategies: a case study of a spoken dialogue agent for email.

March 19, 1999

This paper presents a method by which the dialogue agent learns to choose an optimal dialogue strategy for each of the states. This paper gives an example of using PARADISE for sub-dialogues by varying the combination of strategies and then applying the PARADISE framework to those dialogues. The drawback would be that it requires a large number of dialogues (108 for initiative and 124 for presentation). As in the other papers from AT&T, MRS turns out to be a big factor.


An evaluation of strategies for selectively verifying utterance meanings in spoken natural language dialog.

March 29, 1999

In a spoken dialog system, verifying the user's utterances increases the rate of correct interpretation, but at the cost of unnecessary verification. The author suggests that verification should be selective, tunable, and should operate at the semantic level. The goal of selective verification is to minimize both the rate of underverification and the rate of oververification. To achieve this goal, the author tried 4 different strategies: parse cost only, context only, parse cost/context, and domain-dependent exceptions. Overall, for task-oriented domains, a context-dependent verification strategy keeps both underverification and oververification rates down. One question was raised about the flexibility of the dialog: does this dialog system allow mixed initiative and skipping some steps? Or is it in the nature of this domain to have a relatively strict dialog structure? To answer this question and to learn more about their system, we will be reading their systems level paper (Smith, Hipp, and Biermann CL 1995).
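To make the two error rates concrete, here is a minimal sketch of how they could be computed from a log of utterances. It assumes our reading of the terms: an underverification is a misunderstood utterance that was not verified, and an oververification is a correctly understood utterance that was verified anyway. Expressing both as a fraction of all utterances, the function name, and the log itself are assumptions for illustration.

```python
def verification_rates(utterances):
    """Return (underverification_rate, oververification_rate).

    utterances -- list of (understood_correctly, verified) boolean pairs
    """
    n = len(utterances)
    # Misunderstood and not verified: the error slips through.
    under = sum(1 for correct, verified in utterances
                if not correct and not verified)
    # Understood correctly but verified anyway: a wasted exchange.
    over = sum(1 for correct, verified in utterances
               if correct and verified)
    return under / n, over / n


# Made-up log of four utterances, one for each combination.
log = [(True, False), (True, True), (False, True), (False, False)]
under_rate, over_rate = verification_rates(log)
```

A strategy that verifies everything drives the first rate to zero and inflates the second; a selective strategy tries to keep both low.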


Automatic Detection of Poor Speech Recognition at the Dialogue Level.

March 29, 1999

Coming soon...


Dialogue strategies guiding users to their communicative goal.

March 29, 1999

This system uses an underspecified feature structure to figure out what the next system question should be. The advantage of this approach is that the choice of what to disambiguate next is determined by how much that disambiguation would decrease the entropy. It is not clear, though, that all questions generated this way are reasonable to ask the user; i.e., the question that would reduce the entropy the most may not be one that the user can answer; e.g., "What are the first 3 digits of the restaurant's street address?" It is also not clear how the notion of sub-dialogs would be handled.
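The entropy-based choice of the next question can be illustrated with a small sketch. It assumes that the remaining candidates are equally likely and are described by flat attribute-value records; this is a simplification of our own, not the paper's feature-structure computation, and the restaurant data is made up.

```python
import math
from collections import Counter


def expected_entropy_after(candidates, attribute):
    """Expected entropy (bits) of the candidate set after the user answers
    a question about `attribute`, assuming equally likely candidates."""
    n = len(candidates)
    groups = Counter(c[attribute] for c in candidates)
    # An answer that leaves k candidates leaves log2(k) bits of entropy.
    return sum((k / n) * math.log2(k) for k in groups.values())


def best_question(candidates, attributes):
    """Pick the attribute whose answer is expected to leave the least entropy."""
    return min(attributes, key=lambda a: expected_entropy_after(candidates, a))


# Made-up candidate set.
restaurants = [
    {"cuisine": "thai", "area": "north"},
    {"cuisine": "thai", "area": "south"},
    {"cuisine": "italian", "area": "north"},
    {"cuisine": "french", "area": "north"},
]
choice = best_question(restaurants, ["cuisine", "area"])
```

The concern about unanswerable questions could be handled in such a scheme by restricting the attribute list to those a user can plausibly answer before picking the minimum.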


Second discussion on April 12, 1999

The domain-specific objects are organized in typed feature structures. Within a domain, an assumption is made that there is a limited set of actions. The information the user gives is unified with the feature structure for a goal, and the result of the unification triggers a rule for the next system action. If there is enough information for the goal, such as directions to a restaurant, the system will carry out the action of giving the directions. Otherwise, the system will ask a question to disambiguate the solution set, and the decision of which question to ask is computed from the underspecified feature structure.
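The unify-then-decide loop described above might look roughly like the following sketch. It represents feature structures as nested dictionaries and ignores types entirely, so it is only an approximation of typed feature structure unification. The `next_action` rule (ask for the first missing feature) is a placeholder of our own, not the paper's question-selection method.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None if two atomic values conflict.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            merged = unify(result[key], b_val)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != b_val:
            return None
    return result


def next_action(goal, required):
    """Act if every required feature is filled; otherwise ask for one."""
    missing = [f for f in required if f not in goal]
    return ("act", None) if not missing else ("ask", missing[0])


# Hypothetical goal structure and user input.
goal = {"type": "restaurant", "address": {"city": "Pittsburgh"}}
user_input = {"cuisine": "thai", "address": {"street": "Forbes"}}
merged = unify(goal, user_input)
action = next_action(merged, ["cuisine", "name"])
conflict = unify({"cuisine": "thai"}, {"cuisine": "italian"})
```

Here the user's input unifies with the goal, but the goal still lacks a name, so the system would ask rather than act; the conflicting pair fails to unify.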

This seems to be similar to the Circuit Fix-It system, only with a richer structure. One good thing about this approach is that what to ask next is computed directly from the feature structures, not inferred from the dialog state. This approach was designed so that it would be domain-independent. All of the domain-specific information can be captured in the typed feature structures, but that also means the quality of the interaction depends largely on how the feature structure is designed. One way to exploit this approach is to have a database of different feature structures for different (classes of) users, so as to provide a smoother interaction according to users' preferences. In this approach, the dialog strategy is independent of the domain task. It may be difficult to remove information or do repairs. Also, it may be hard to capture open domains (where the domain may not be defined clearly in terms of feature structures).


New features for confidence annotation.

April 19, 1999

This paper introduced two new features for confidence annotation in a spoken natural language system. Previous research in confidence annotation has focused on calculating the probability of correctness of the ASR output. A typical approach is to use n-best homogeneity, i.e., to look at the number of different words within one time segment, weighted by probabilities.
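A simplified version of an n-best homogeneity score is sketched below. It assumes that each hypothesis comes with a probability, and it ignores time alignment: a word counts as supported if it appears anywhere in a hypothesis, not only in the same time segment. The function name and the example hypotheses are made up.

```python
def nbest_homogeneity(nbest):
    """Score each word of the top hypothesis by the share of probability
    mass held by the n-best hypotheses that also contain it.

    nbest -- list of (words, probability) pairs, best hypothesis first
    """
    total = sum(p for _, p in nbest)
    top_words = nbest[0][0]
    return {
        w: sum(p for words, p in nbest if w in words) / total
        for w in top_words
    }


# Made-up 3-best list.
nbest = [
    (["show", "my", "mail"], 0.5),
    (["show", "my", "male"], 0.3),
    (["so", "my", "mail"], 0.2),
]
homogeneity = nbest_homogeneity(nbest)
```

Words that every hypothesis agrees on score highest, while words that compete with alternatives score lower and are the natural candidates for a low confidence annotation.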

One new feature introduced in this paper is how much the overall likelihood depends on each of the words.

The second new feature is how many neighbors will change the word.

To be continued...


Confidence scoring for speech understanding systems.

April 19, 1999

This paper presents the different approaches they've tried and combined. Some of the differences from other papers include the following:
1. the entire sentence is annotated at the utterance level (as opposed to word level)
2. the annotation of training data automated the training process

To be continued...


Designing a task-based evaluation methodology for a spoken machine translation system.

April 26, 1999

Coming soon...


Evaluation methodology for a telephone-based conversational system.

April 26, 1999

This paper discusses the evaluation framework used in the Jupiter system (telephone-based weather information) at MIT. They use two major types of evaluations: a component-based evaluation and a whole-system evaluation. This is a good catalog of what one should do when designing a spoken language dialog system.

To be continued...


Last Updated: June 15, 1999 by Alice Oh