dissertation overview
One of the most important and persistent problems in the development of spoken language interfaces is their lack of robustness when faced with unreliable inputs. The problems stem mostly from current limitations in speech recognition technology and as a consequence appear in most domains and interaction types. Left unchecked, speech recognition errors propagate to the higher levels of these systems where they can lead to misunderstandings or non-understandings. In a misunderstanding, the system incorrectly understands the user: for instance, the user says "Boston" but the system understands "Austin". In a non-understanding, the system does not understand the user at all: for instance, the user says "Urbana-Champaign", but the speech recognition engine produces "okay in that same pay" - which makes no sense semantically in the current context. These errors exert a significant negative impact on the overall quality and success of the interactions.

Two pathways towards increased robustness can be easily envisioned: (1) prevent the errors altogether (e.g. build better speech recognition, etc.) and (2) assume the errors will always be there and build the capabilities for recovering from errors through conversation. Of course, these two approaches do not stand in opposition; rather, a combined effort would lead to the best results.

My dissertation work focuses on the second approach. I believe that three key components need to be in place for gracefully handling errors through interaction: first, we need to be able to accurately detect errors, preferably as soon as they happen. Second, we need to endow systems with a large repertoire of error recovery strategies (e.g. asking the user to repeat, rephrase, speak softer or louder, providing help, confirming a piece of information, etc.). Last, but definitely not least, we need to develop techniques that allow these systems to optimally choose between these strategies at runtime, i.e. we need to build good error recovery policies (when should we ask the user to repeat? when should we ask the user to rephrase? when should we provide more help?).

Together with the two types of problems introduced above, these three issues define the coordinates for the research program I have articulated in my dissertation. My work brings several contributions in this problem-space, and raises a number of additional interesting scientific and technical questions. The complete list of contributions (numbered C1 through C10) is summarized in the figure below; more details and related papers for each item are listed in the sequel.

confidence annotation [C1 and C2]

Spoken language interfaces typically rely on confidence scores to assess the reliability of their inputs and detect potential misunderstandings. In my dissertation, I have performed a thorough investigation of supervised-learning based approaches for developing confidence annotation models (C1). The focus was on a number of issues which have not previously received much attention in the literature: (1) what is an appropriate metric for evaluating confidence annotation performance in the context of a conversational spoken language interface? (2) what are the advantages and disadvantages of various supervised learning techniques for this task? (3) what is the relationship between training set size and confidence annotation performance? (4) can we successfully transfer a confidence annotator across domains?
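To make the supervised-learning setup concrete, here is a toy sketch of a confidence annotator: a classifier trained to predict whether an utterance was correctly understood, evaluated by classification error against a majority-class baseline (one of the evaluation questions above). The features, the synthetic data, and the logistic regression model are all illustrative assumptions, not the dissertation's actual setup.

```python
import math, random

# Hypothetical features per turn: [acoustic confidence, parse coverage,
# n-best agreement]; label = 1 if the utterance was correctly understood.
# The data generator below is synthetic, purely for illustration.
random.seed(0)
def make_turn():
    ok = random.random() < 0.6
    base = 0.75 if ok else 0.35
    f = [min(1.0, max(0.0, random.gauss(base, 0.15))) for _ in range(3)]
    return f, 1 if ok else 0

train = [make_turn() for _ in range(400)]
test = [make_turn() for _ in range(200)]

# Logistic regression trained by batch gradient descent (stdlib only).
w = [0.0, 0.0, 0.0]
b = 0.0
lr = 1.0

def predict(f):
    z = b + sum(wi * fi for wi, fi in zip(w, f))
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(500):
    gw = [0.0, 0.0, 0.0]
    gb = 0.0
    for f, y in train:
        e = predict(f) - y
        gb += e
        for i in range(3):
            gw[i] += e * f[i]
    b -= lr * gb / len(train)
    for i in range(3):
        w[i] -= lr * gw[i] / len(train)

# Classification error rate, compared against the majority-class baseline.
ones = sum(y for _, y in test)
baseline = 1 - max(ones, len(test) - ones) / len(test)
err = sum((predict(f) >= 0.5) != (y == 1) for f, y in test) / len(test)
print(f"baseline error {baseline:.2f}, model error {err:.2f}")
```

Even this crude model beats the baseline on the synthetic data, which is the basic bar any confidence annotator has to clear before the finer evaluation questions become interesting.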

Additionally, I have also proposed a novel, implicit-learning approach for developing confidence annotation models (C2). Traditional supervised learning approaches require a corpus of labeled instances, which is often costly and difficult to acquire. In contrast, no developer supervision is required for the proposed implicit-learning approach; instead, the system exploits a certain, naturally-occurring correction pattern in the conversation, and learns from its own experience with the users. Experimental results indicate that the proposed approach can attain 70-80% of the performance of a fully supervised model. I believe this novel learning approach (dubbed implicit-learning) holds a lot of promise and opens up a path towards autonomously self-improving systems. [see more here]
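The implicit-learning idea can be sketched in a few lines: treat the user's reaction to an explicit confirmation as a free label for the hypothesis being confirmed, and use those self-generated labels to recalibrate the raw recognizer scores. The log, the reactions, and the simple binning recalibration below are invented for illustration; they are not the dissertation's actual model.

```python
from collections import defaultdict

# Hypothetical interaction log: (raw recognizer confidence, user reaction to
# the system's explicit confirmation of that hypothesis). A "reject" reaction
# implicitly labels the hypothesis as misunderstood, an "accept" as correctly
# understood -- no human annotation is required.
log = [
    (0.91, "accept"), (0.88, "accept"), (0.85, "reject"),
    (0.74, "accept"), (0.70, "reject"), (0.66, "accept"),
    (0.52, "reject"), (0.48, "reject"), (0.45, "accept"),
    (0.31, "reject"), (0.28, "reject"), (0.22, "reject"),
]

# Recalibrate by binning: the corrected score for a bin is the empirical
# fraction of implicitly-accepted hypotheses that fell into that bin.
bins = defaultdict(lambda: [0, 0])   # bin -> [accepts, total]
for conf, reaction in log:
    b = int(conf * 4)                # four coarse bins: [0,.25) ... [.75,1]
    bins[b][0] += reaction == "accept"
    bins[b][1] += 1

def recalibrated(conf):
    accepts, total = bins[int(conf * 4)]
    return accepts / total if total else conf   # fall back to the raw score

print(recalibrated(0.9), recalibrated(0.3))
```

As the system accumulates experience, the recalibrated scores track the true reliability of its hypotheses, which is the sense in which the approach lets a system improve itself from its own interactions.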

Here is a list of relevant papers published so far (more to come soon):

- Bohus, D., and Rudnicky A. (2002) - Integrating Multiple Knowledge Sources for Utterance-Level Confidence Annotation in the CMU Communicator Spoken Dialog System, Technical Report CS-190, Carnegie Mellon University, Pittsburgh, PA
- Carpenter P., Jin C., Wilson D., Zhang R., Bohus, D., and Rudnicky A. (2001) - Is This Conversation on Track?, in Eurospeech-2001, Aalborg, Denmark

belief updating [C3 and C4]

A central contribution in my dissertation is the development of a data-driven belief updating framework (C3). Spoken dialog systems typically rely on confidence scores to form an initial assessment of the reliability of the information obtained from the speech recognizer. Ideally, however, systems should continuously monitor and improve the accuracy of their beliefs by integrating evidence across multiple turns in a conversation. In this work, I introduce and formalize this belief updating problem, and propose a scalable, supervised-learning based solution. An empirical evaluation with a deployed spoken language interface shows that the proposed approach constructs significantly more accurate beliefs than previous heuristic solutions and leads to large gains (equivalent to a 13.5% absolute reduction in word-error-rate) in both the effectiveness and the efficiency of the interaction. This work stemmed from an earlier collaboration with Tim Paek and Eric Horvitz at Microsoft Research.
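A single Bayesian update illustrates the kind of cross-turn evidence integration the belief updating problem is about (the dissertation's solution is a learned, supervised model; the hypotheses and likelihood numbers below are made up for the example).

```python
# Initial belief from recognizer confidence: did the user say "Boston"
# or "Austin"?
belief = {"Boston": 0.6, "Austin": 0.4}

# The system implicitly confirms ("a flight to Boston, leaving when?") and
# the user replies "no, Austin". Likelihood of that response under each
# hypothesis (illustrative numbers, not learned from data):
likelihood = {"Boston": 0.05,   # users rarely correct a correct value
              "Austin": 0.90}   # very likely if the system got it wrong

# Bayes rule: posterior proportional to prior times likelihood.
posterior = {h: belief[h] * likelihood[h] for h in belief}
z = sum(posterior.values())
posterior = {h: p / z for h, p in posterior.items()}
print(posterior)
```

One turn of evidence is enough to flip the belief decisively toward "Austin"; the hard part, and the subject of the C3 work, is learning how much each kind of user response should actually move the belief.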

Furthermore, in order to gain a better understanding of confirmation strategies, and of the challenges we are facing in the belief updating process, I have performed an in-depth empirical analysis of user responses to explicit and implicit confirmations (C4). In this work, I have followed a methodology that has previously been used in a similar analysis by Krahmer and Swerts. My results corroborate their previous observations (in a different domain) and indicate that user responses to these confirmation strategies cover a wide language spectrum, especially when the information to be confirmed is incorrect.

Here is a list of relevant papers:

- Bohus, D., and Rudnicky, A. (2006) - A K Hypotheses + Other Belief Updating Model, in AAAI Workshop on Statistical and Empirical Approaches to Spoken Dialogue Systems, 2006, Boston, MA
- Bohus, D., and Rudnicky, A. (2005) - Constructing Accurate Beliefs in Spoken Dialog Systems, in ASRU-2005, San Juan, Puerto Rico

rejection threshold optimization [C5]

A common design pattern in spoken dialog systems is to reject an input when the recognition confidence score falls below a preset rejection threshold. However, this introduces a potentially non-optimal tradeoff between various types of errors, such as misunderstandings and false rejections. I have proposed a data-driven method for determining the relative costs of these errors, and then used these costs to optimize state-specific rejection thresholds. Experimental results confirm the intuition that the costs differ at different points in the dialog, and that different rejection thresholds should therefore be used. The resulting data-driven thresholds corroborate previous anecdotal evidence from observing the system.
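Once the costs are known, optimizing the threshold for a given dialog state reduces to a simple sweep: pick the threshold that minimizes expected cost on held-out data from that state. The cost values and the tiny data set below are illustrative, not the dissertation's.

```python
# Illustrative per-error costs for one dialog state (assumed, not learned
# here; contribution C8 describes how such costs can be obtained from data).
COST_MISUNDERSTANDING = 3.0   # accepting an incorrect hypothesis
COST_FALSE_REJECTION = 1.0    # rejecting a correct one

# Held-out data for this state: (confidence score, hypothesis was correct?)
data = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
        (0.70, True), (0.60, False), (0.55, True), (0.40, False),
        (0.35, False), (0.20, False)]

def expected_cost(threshold):
    cost = 0.0
    for conf, correct in data:
        if conf >= threshold and not correct:
            cost += COST_MISUNDERSTANDING   # misunderstanding
        elif conf < threshold and correct:
            cost += COST_FALSE_REJECTION    # false rejection
    return cost / len(data)

# Only the observed scores (plus the two extremes) can change the decision.
candidates = sorted({conf for conf, _ in data} | {0.0, 1.01})
best = min(candidates, key=expected_cost)
print(best, expected_cost(best))
```

Because the cost ratio and the confidence/correctness distribution both vary from state to state, repeating this sweep per state yields the state-specific thresholds argued for above.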

Here is a paper on this subject:

- Bohus, D., and Rudnicky, A. (2005) - A Principled Approach for Rejection Threshold Optimization in Spoken Dialog Systems, in Interspeech-2005, Lisbon, Portugal

non-understanding recovery strategies [C6]

In an effort to gain a better understanding of non-understanding recovery strategies, I have performed an in-depth empirical investigation of ten such strategies (C6) (e.g. asking the user to repeat, asking the user to rephrase, repeating the system prompt, notifying the user that a non-understanding has occurred, providing various levels of help, ignoring the non-understanding and moving on with a different dialog plan, etc.). The focus was primarily on understanding the relationships between each strategy and subsequent user responses, and on identifying which behaviors are more likely to lead to successful recovery. Additionally, I have also investigated how various non-understanding recovery strategies compare to each other, when engaged in an uninformed manner and when engaged using a smarter policy. The results of this investigation add to our understanding of the pros and cons of various strategies, and highlight the importance of good non-understanding recovery policies.

Here is a paper on this subject:

- Bohus, D., and Rudnicky, A. (2005) - Sorry, I Didn't Catch That! - An Investigation of Non-understanding Errors and Recovery Strategies, in SIGdial-2005, Lisbon, Portugal

on-line learning of non-understanding recovery policies [C7]

Developing well-performing non-understanding recovery policies is a challenging task, especially when the set of strategies available to the system is large. Spoken dialog systems typically use a limited number of such strategies and simple heuristic policies to engage them; for instance: first ask the user to repeat, then give help, then transfer to an operator. I have proposed and evaluated a novel online-learning based approach for developing non-understanding recovery policies over a large set of strategies (C7). The proposed approach consists of two steps: first, we construct runtime estimates for the likelihood of success of each recovery strategy, together with confidence bounds for those estimates. Then, we use these estimates to construct a policy, while balancing the system's exploration and exploitation goals. An initial experiment with a publicly available spoken dialog system shows that the learned policy produced a 12.5% relative improvement in the non-understanding recovery rate.
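The two steps above can be sketched with a UCB1-style bandit: maintain a success estimate plus a confidence bound per strategy, and always engage the strategy with the highest upper bound. This is only a sketch of the exploration/exploitation idea; the actual estimator in the paper differs, and the strategy names and "true" success rates below are invented for the simulation.

```python
import math, random

# Invented per-strategy recovery rates for the simulation (not real data).
random.seed(1)
TRUE_SUCCESS = {"ask_repeat": 0.25, "ask_rephrase": 0.30,
                "give_help": 0.65, "move_on": 0.35}

stats = {s: [0, 0] for s in TRUE_SUCCESS}   # strategy -> [successes, trials]

def choose(t):
    # Engage the strategy with the highest upper confidence bound:
    # empirical success rate + an exploration bonus that shrinks with trials.
    def ucb(s):
        wins, n = stats[s]
        if n == 0:
            return float("inf")             # try every strategy at least once
        return wins / n + math.sqrt(2 * math.log(t) / n)
    return max(stats, key=ucb)

for t in range(1, 3001):                    # 3000 simulated non-understandings
    s = choose(t)
    success = random.random() < TRUE_SUCCESS[s]
    stats[s][0] += success
    stats[s][1] += 1

most_used = max(stats, key=lambda s: stats[s][1])
print(most_used, stats[most_used])
```

The policy converges on the best-performing strategy while still occasionally re-testing the others, which is exactly the balance between learning and performing that an online approach has to strike in a deployed system.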

Here is a paper on this subject:

- Bohus, D., Langner, B., Raux, A., Black, A., Eskenazi, M. and Rudnicky A. (2006) - Online Supervised Learning of Non-understanding Recovery Policies, to appear in SLT-2006, Palm Beach, Aruba

error cost assessment [C8]

In this work, I propose a novel data-driven approach for assessing the costs of various types of errors (C8) committed by a spoken dialog system. The method uses a regression model to relate the number of errors of different types and at different points in the dialog to a chosen global dialog performance metric, and in the process allows us to infer the costs of these errors from data. I have applied this method to data from a deployed spoken dialog system, and the costs determined in this manner corroborated our intuition and prior experience with the system. These costs can then further be used to adjust various error handling behaviors. For instance, I have shown how they can be used to determine state-specific rejection thresholds in a principled manner in contribution C5 [see more]. More generally, I believe these costs could also be used to build misunderstanding recovery policies.
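The regression idea is simple enough to show end-to-end on synthetic data: regress a global per-dialog performance score on the per-dialog error counts, and read the implied per-error costs off the fitted coefficients. The "true" costs, the score scale, and the data generator below are all made up to illustrate the mechanics.

```python
import random

# Synthetic ground truth: each misunderstanding costs 8 performance points,
# each false rejection 3 (invented numbers for the illustration).
random.seed(2)
TRUE_COST = {"misunderstanding": 8.0, "false_rejection": 3.0}

dialogs = []
for _ in range(300):
    mis = random.randint(0, 5)
    rej = random.randint(0, 5)
    score = (100 - TRUE_COST["misunderstanding"] * mis
                 - TRUE_COST["false_rejection"] * rej + random.gauss(0, 2))
    dialogs.append((mis, rej, score))

# Ordinary least squares via the normal equations (X^T X) w = X^T y,
# solved with Gaussian elimination (stdlib only). Columns: 1, mis, rej.
X = [(1.0, m, r) for m, r, _ in dialogs]
y = [s for _, _, s in dialogs]
A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(3)]
     for i in range(3)]
b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(3)]

for i in range(3):                     # forward elimination with pivoting
    p = max(range(i, 3), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, 3):
        f = A[r][i] / A[i][i]
        for c in range(i, 3):
            A[r][c] -= f * A[i][c]
        b[r] -= f * b[i]
w = [0.0, 0.0, 0.0]
for i in (2, 1, 0):                    # back substitution
    w[i] = (b[i] - sum(A[i][c] * w[c] for c in range(i + 1, 3))) / A[i][i]

# The negated slopes are the estimated per-error costs.
print(f"misunderstanding cost ~ {-w[1]:.1f}, false rejection cost ~ {-w[2]:.1f}")
```

The fitted slopes recover the planted costs to within the noise, and the same recipe extends naturally to more error types and to errors occurring at different points in the dialog.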

Here are some related papers (the 2005 one is the most complete):

- Bohus, D., and Rudnicky, A. (2005) - A Principled Approach for Rejection Threshold Optimization in Spoken Dialog Systems, in Interspeech-2005, Lisbon, Portugal
- Bohus, D., and Rudnicky A. (2002) - Integrating Multiple Knowledge Sources for Utterance-Level Confidence Annotation in the CMU Communicator Spoken Dialog System, Technical Report CS-190, Carnegie Mellon University, Pittsburgh, PA
- Bohus, D., and Rudnicky, A. (2001) - Modeling the Cost of Misunderstandings in the CMU Communicator Dialog System, in ASRU-2001, Madonna di Campiglio, Italy

error handling infrastructure [C9 & C10]

The experimental platform for evaluating the solutions proposed in this dissertation is formed by a number of real-world, deployed spoken language interfaces. All these systems use RavenClaw, a plan-based, task-independent dialog management framework (C9) that I have developed to enable research on various spoken language interface issues. Apart from supporting the error handling research program that constitutes the main focus of this dissertation, RavenClaw in itself represents an important contribution to the field of plan-based dialog management. The framework provides a clean separation between the domain-specific and domain-independent aspects of the dialog control logic, and in the process significantly lessens the system development effort. To date, it has been used by a number of authors to build and successfully deploy about a dozen spoken dialog systems spanning different domains and interaction types. Furthermore, the framework provides a robust basis for several other current research projects addressing issues such as timing and turn-taking and multi-participant conversation.

Finally, to support the error handling work, I have developed a scalable, task-independent error handling architecture (C10) in the context of the plan-based RavenClaw dialog management framework. The proposed error handling architecture decouples the set of error handling strategies, as well as the mechanisms used for engaging them, from the domain-specific aspects of the dialog control logic. This in turn significantly lessens the system development effort. System authors describe the domain-specific aspects under the assumption that recognition always works perfectly; the responsibility for ensuring that the system uses valid information and that the conversation advances normally towards its goals is delegated to the error handling mechanisms in the dialog engine. The decoupling also facilitates the reuse of error recovery strategies and policies across domains and ensures a certain degree of uniformity and consistency both within and across domains. While such encapsulated approaches to error handling have been developed before in different contexts, to my knowledge this is the first systematic error handling architecture developed in the context of a complex, plan-based dialog management framework.
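A toy sketch can illustrate the decoupling: the domain-side code below never mentions errors, while a separate task-independent layer decides per concept whether to accept, confirm, or reject based on confidence. The class names, thresholds, and three-way policy are purely illustrative; they are not RavenClaw's actual API.

```python
# Task-independent error handling layer, shared by every domain concept.
ACCEPT, CONFIRM, REJECT = "accept", "confirm", "reject"

def error_handling_decision(confidence):
    # Illustrative fixed thresholds; a deployed policy would be learned
    # and state-specific (see contributions C5 and C8).
    if confidence >= 0.8:
        return ACCEPT
    if confidence >= 0.4:
        return CONFIRM
    return REJECT

class Concept:
    """A slot the domain logic reads as if recognition were perfect."""
    def __init__(self, name):
        self.name, self.value, self.grounded = name, None, False

    def update(self, value, confidence):
        action = error_handling_decision(confidence)
        if action == ACCEPT:
            self.value, self.grounded = value, True
        elif action == CONFIRM:
            self.value, self.grounded = value, False  # engine will confirm
        return action   # REJECT leaves the concept untouched

# The domain author's view: just fill the slot; grounding is someone
# else's job.
city = Concept("departure_city")
print(city.update("Boston", 0.92))   # high confidence: silently accepted
print(city.update("Austin", 0.55))   # medium: a confirmation is scheduled
print(city.update("???", 0.20))      # low: rejected, engine asks to repeat
```

Because the decision function lives outside the domain code, swapping in a better recovery policy, or reusing it in a new domain, requires no change to the task specification, which is the practical payoff of the decoupling described above.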

Here are some related papers describing RavenClaw and the error handling architecture:

- Bohus, D., and Rudnicky A. (2003) - RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda, in Eurospeech-2003, Geneva, Switzerland
- Bohus, D., and Rudnicky, A. (2005) - Error Handling in the RavenClaw dialog management architecture, in HLT-EMNLP-2005, Vancouver, Canada