Speech user interfaces (SUIs) such as Apple’s Siri, Samsung’s S Voice, and Google Now are becoming increasingly popular. However, despite years of research, such interfaces work well only for specific users, such as adult native speakers of English, even though many other users, such as non-native speakers or children, stand to benefit at least as much, if not more. The obstacle to developing SUIs for such users, or for other acoustic or language situations, is the expertise, time, and cost required to build an initial system that works reasonably well and can be deployed to collect more data or to establish a group of loyal users. In particular, application developers or researchers who are not speech experts find it excruciatingly difficult to build a testable speech interface on their own, and instead routinely resort to Wizard-of-Oz experiments.
To address this problem, we take the view that while it can take a prohibitive amount of time and cost to train non-experts in the nuances of speech recognition and user-interface development, well-trained speech experts and user-interface specialists who routinely build working recognizers have accumulated years of experiential knowledge that we can study and formalize for the benefit of non-experts. Moreover, core speech recognition technology has reached a point where, given enough expertise and in-domain data, a working system can be developed for almost every user group and acoustic or language situation. To this end, we design, develop, and evaluate a speech toolkit called SToNE, which embeds expert knowledge and lowers the barrier to entry for non-experts into the design and development space of speech systems. Our goal is not to render the speech expert superfluous, but to make it easier for non-speech experts to figure out why a speech system is failing, and to guide their efforts in the right direction.
We investigate three research questions: (i) how can we elicit and formalize the tacit knowledge that speech experts employ in building an accurate recognizer, (ii) what analysis supports (automatic or semi-automatic) can we develop to enable speech recognizer development by non-experts, and (iii) to what extent do non-experts benefit from SToNE? Through lab experiments with new datasets and summative evaluations with non-experts, we show that with the support of SToNE, non-experts are able to build recognizers with accuracy similar to that of experts, and achieve significant gains over what they achieve when SToNE support is unavailable to them.
This work aims to support the “black art” in SUI development. It contributes to human-computer interaction by developing tools that support non-speech experts in building usable SUIs. It also contributes to speech technologies by formalizing expert knowledge and offering a set of tools to analyze speech data systematically.
Florian Metze (Co-Chair)
Matthew Kam (Co-Chair, CMU & American Institutes for Research)
Tim Paek (Microsoft Research)