Anuj Kumar
Speech Toolkit for Non-Speech Experts

Speech-user interfaces are fast becoming a popular alternative to many graphical user interfaces. However, the expertise required to develop a recognizer with reasonable accuracy precludes many individuals (particularly non-speech experts) from participating in the development process. While building a baseline recognizer is easy, subsequent optimizations, such as adapting the recognizer to the application's acoustic and language context, are significantly more challenging. To address this problem, in this project we are developing a toolkit that guides non-experts in building accurate speech recognizers for any context of use. The toolkit has several modules: (a) a rule-based formalization of the tacit knowledge that experts use to analyze an acoustic context, (b) a visualization module that helps non-experts understand the impact of various degrading factors on recognition accuracy, and (c) a recommendation module that automatically analyzes the context and suggests appropriate optimization techniques to the non-expert. This research aims to take the "black art" out of developing speech-user interfaces.   Interspeech 2013

C, C++, PocketSphinx Speech Recognizer, Machine Learning, Contextual Interviews, Knowledge Representation
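The recommendation module described in (c) can be pictured as a small rule base mapping observed context properties to expert-suggested optimizations. A minimal sketch, assuming hypothetical context fields and suggestion strings (none of these names come from the actual toolkit):

```python
# Hypothetical sketch of a rule-based recommendation module: each rule
# maps an observed property of the acoustic/language context to an
# optimization an expert might suggest. All names are illustrative.

def recommend(context):
    """Return a list of optimization suggestions for a recognition context."""
    suggestions = []
    if context.get("snr_db", 30) < 15:
        # Low signal-to-noise ratio: the acoustic model should see matched noise.
        suggestions.append("retrain acoustic model with matched noisy data")
    if context.get("non_native_speakers", False):
        suggestions.append("apply speaker adaptation (e.g., MLLR/MAP)")
    if context.get("vocabulary_size", 10000) < 500:
        # Small vocabularies are often better served by a fixed grammar.
        suggestions.append("use a fixed grammar instead of an n-gram LM")
    return suggestions

print(recommend({"snr_db": 10, "vocabulary_size": 200}))
```

Encoding each rule as an independent condition keeps the knowledge base easy to extend as more expert heuristics are formalized.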

Voice Typing: A New Speech Interaction Technique for Dictation

Dictation using speech recognition could potentially serve as an efficient input method for touchscreen devices. However, dictation systems today follow a mentally disruptive speech interaction model: users must first formulate utterances and then produce them, as they would with a voice recorder. Because utterances are not transcribed until users have finished speaking, the entire output appears at once, after which users must break their train of thought to verify and correct it. In this project, we designed, developed, and deployed Voice Typing, a new speech interaction model in which users' utterances are transcribed as they produce them, enabling real-time error identification. For fast correction, users leverage a marking menu operated with touch gestures. Voice Typing aspires to create an experience akin to having a secretary type for you while you monitor and correct the text.   CHI 2012

C#, C, Microsoft Speech API
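The core of the interaction model is that partial recognition hypotheses are surfaced while the user is still speaking, rather than only after an utterance ends. A minimal sketch of that loop, using an invented event interface (not the actual Microsoft Speech API):

```python
# Illustrative sketch of a Voice Typing-style interaction loop. The event
# names and class are hypothetical; a real recognizer would drive them.

class IncrementalDictation:
    def __init__(self):
        self.committed = []   # words the user has already verified
        self.pending = []     # current partial hypothesis, still revisable

    def on_partial_hypothesis(self, words):
        # Fired repeatedly as recognition refines; the display updates in
        # place, so text appears while the user is still speaking.
        self.pending = list(words)

    def on_segment_final(self):
        # Fired when a phrase is finalized; the user can correct it via
        # touch gestures (marking menu) before moving on.
        self.committed.extend(self.pending)
        self.pending = []

    def display(self):
        return " ".join(self.committed + self.pending)

d = IncrementalDictation()
d.on_partial_hypothesis(["voice"])
d.on_partial_hypothesis(["voice", "typing"])
d.on_segment_final()
print(d.display())  # "voice typing"
```

Separating committed from pending text is what lets errors be spotted and fixed phrase by phrase instead of all at once at the end.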

SMART: Speech-enabled Mobile Assisted Reading Technology

Project SMART is an interdisciplinary research project at Carnegie Mellon University that aims to improve reading skills for early-age second-language learners, both in the United States and in the developing world, such as India. Using games on mobile phones, SMART applications give children ample opportunities to practice reading aloud, thereby scaffolding their ability to understand written text. SMART applications also use speech recognition on mobile devices to understand children's speech and give appropriate feedback whenever necessary. Learning how to read and understand text in a new language is time- and practice-intensive; SMART applications therefore enable children to practice both inside and outside regular school hours, gaining additional exposure at convenient times.   CHI 2012    AIED 2011    IUI 2011

ActionScript3.0, Python, PocketSphinx Speech Recognizer
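One way such read-aloud feedback can work is to align the recognizer's output against the target sentence and flag words the child skipped or misread. A simplified sketch under that assumption (the function and its output format are illustrative, not SMART's actual implementation):

```python
# Hypothetical sketch of read-aloud feedback: compare recognized words
# against the target sentence and mark each target word as heard or not,
# so the game can praise or prompt accordingly.

def reading_feedback(target, recognized):
    """Return (word, was_heard) pairs for each word in the target text."""
    heard = set(recognized.lower().split())
    return [(w, w in heard) for w in target.lower().split()]

print(reading_feedback("the cat sat", "the sat"))
# [('the', True), ('cat', False), ('sat', True)]
```

A production system would use time-aligned recognition output rather than a bag of words, but the feedback loop (compare, flag, respond) is the same.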

Legends: A Mobile Assistant for Everyday Social Introductions

Legends is a mobile application that facilitates serendipitous social introductions between two strangers. It mines users' social network data and recommends commonalities as potential discussion topics when two people meet. These recommendations are meant to serve as ice-breakers and provide ground for interesting conversations. Because of its direct integration with mobile devices, the system has high utility in everyday settings: in a coffee shop, in an office, or on a playground, i.e., places where you are most likely to meet new people. In this project, we assess the capabilities and usability of our approach through a business-card-sharing application, which we designed, developed, and deployed with fourteen users.

iOS development, JSON, Data Mining
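At its simplest, mining commonalities amounts to intersecting the two users' profile fields and surfacing the overlap as conversation topics. A minimal sketch, with invented field names and profiles (Legends' actual mining is more involved):

```python
# Hypothetical sketch of commonality mining for ice-breakers: given two
# users' social-network profiles, return shared items per field.

def common_topics(profile_a, profile_b, top_k=3):
    """Return up to top_k (field, shared_item) pairs across known fields."""
    shared = []
    for field in ("interests", "employers", "schools", "places"):
        overlap = set(profile_a.get(field, [])) & set(profile_b.get(field, []))
        shared.extend((field, item) for item in sorted(overlap))
    return shared[:top_k]

alice = {"interests": ["hiking", "jazz"], "schools": ["CMU"]}
bob = {"interests": ["jazz"], "schools": ["CMU"], "places": ["Pittsburgh"]}
print(common_topics(alice, bob))
```

In practice the candidates would also be ranked, e.g., by how rare or recent a shared item is, so the suggested ice-breaker is specific rather than generic.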

Emotion Recognition in Interactive Voice Response Systems

Users of Interactive Voice Response (IVR) systems are often frustrated when their speech is constantly misrecognized. Current IVR implementations could benefit from incorporating user emotions to improve the dialog flow (and ultimately, usability) of the system. In this research, we classify emotions using only acoustic features, with a particular focus on short utterances (1-2 seconds long), as is typical of IVR systems. We conducted two experiments. In the first, we compared five classification algorithms, ZeroR, OneR, Naïve Bayes, Decision Trees, and Support Vector Machines, for classifying two user emotions (angry and neutral). SVM performed best at 68% accuracy, with an additional 9% improvement when the classifier was personalized with just 20 audio files per speaker. In the second experiment, we included additional acoustic features to improve SVM accuracy up to 91%, and added four more emotions: happy, sad, surprised, and fearful, to support better decision making in IVR systems.

C#, Microsoft Speech API, Matlab, Weka
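The classification setup can be illustrated with an SVM trained on utterance-level acoustic statistics. The sketch below uses scikit-learn and synthetic features (the original work used Weka and Matlab, and the feature values here are invented for illustration):

```python
# Illustrative sketch of angry-vs-neutral classification from acoustic
# features (e.g., pitch and energy statistics). Features are synthetic;
# real systems extract them from 1-2 second utterances.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy features per utterance: [mean pitch (Hz), pitch variance, mean energy]
angry = rng.normal([220, 900, 0.8], [20, 100, 0.1], size=(40, 3))
neutral = rng.normal([150, 300, 0.4], [20, 100, 0.1], size=(40, 3))
X = np.vstack([angry, neutral])
y = [1] * 40 + [0] * 40   # 1 = angry, 0 = neutral

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict([[215, 850, 0.75]]))  # likely angry (1)
```

Personalization, as in the first experiment, would correspond to augmenting the training set with a handful of labeled utterances from the target speaker before refitting.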

MILLEE: Mobile and Immersive Learning for Literacy in Emerging Economies

MILLEE aims to design and develop applications on low-cost, affordable mobile phones that enable children in the developing world to acquire literacy skills in immersive, game-like environments. We aim to target localized language-learning needs and to make digital literacy resources available to children at times and places more convenient than schools. Our design methodology draws on contextual interviews with school teachers, children, and parents, as well as inspiration from the traditional village games that children play and enjoy most.   CHI 2010    CHI 2009    ICTD 2009    DIS 2008

ActionScript 3.0, J2ME