Speech Recognition: Past, Present, and Future

(A Carnegie Mellon University Perspective)

Kevin Lenzo, Paul Placeway, Kristie Seymore, Matthew A. Siegler

 

Introduction

Since the formation of its speech group, Carnegie Mellon University has been an important part of the global speech recognition community. Although some advances can be attributed to increased computing power, a greater understanding of the speech recognition problem has been achieved through theoretical developments in linguistics, representation, search, and machine learning. CMU has often led innovation in these areas.

In addition, research at CMU has focused on increasingly complex yet realistic problems. From phones, to isolated words, to connected and continuous speech, and most recently to spontaneous speech, the CMU perspective has been to consider the practical consequences of each refinement, balancing parsimony and power.

Phonetic Models

Raj Reddy founded the speech recognition research effort at CMU when he joined the university in 1969, building upon his work on phoneme recognition and phoneme-based word recognition. Since then, a recurrent theme of speech research at CMU has been the search for the proper scale and detail of speech modeling.

Although most early systems used word-level templates and patterns as fundamental modeling units, early work at CMU pointed towards a relatively small inventory of composable units. The phonemes used in the Harpy [Lowerre76], Hearsay [Lesser75], and early Sphinx [Lee89] systems gave way to tied states (senones) in the more recent Sphinx systems [Hwang93]. Over time these fundamental modeling units have shortened, while more training data has been shared among similar contexts and generalized to unseen contexts to reduce the effects of sparse data.

Statistical Models and HMMs

Jim Baker, then at CMU, first proposed the use of network representations for speech recognition in the form of hidden Markov models [Baker75]. Previous systems had largely implemented context-free grammars with little regard for probabilities. Although other work in the field was moving in the same direction, his work was highly influential, and HMMs have since come to dominate the approaches used by the speech recognition community at large.

However, as the number of detailed acoustic models increased, the computational effort to search the space of possible solutions became substantial. In his 1976 thesis, Bruce Lowerre demonstrated the potential of the beam search for reducing the search computation [Lowerre76]. In addition, the beam search permitted scalable, linear-time algorithms for the first time, and opened the door for practical resource-bound applications of speech recognition.
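
To make the idea concrete, the following Python sketch (an illustration, not the Harpy implementation) applies time-synchronous Viterbi scoring with beam pruning: at each frame, hypotheses that score more than a fixed margin below the current best are discarded before being extended. The state, transition, and emission structures are assumed placeholders, and the backpointers needed to recover a word sequence are omitted.

```python
import math

def beam_search(frames, states, trans, emit, init, beam=10.0):
    """Time-synchronous Viterbi scoring with beam pruning (log domain).

    frames : list of observation vectors, one per time step
    states : list of state ids
    trans  : dict mapping (prev_state, state) -> log transition probability
    emit   : function (state, frame) -> log emission probability
    init   : dict mapping state -> log initial probability
    beam   : hypotheses scoring more than `beam` below the frame's best
             hypothesis are discarded before being extended
    """
    # Start from the initial distribution, scored against the first frame.
    active = {s: init[s] + emit(s, frames[0]) for s in init}
    for frame in frames[1:]:
        best = max(active.values())
        # Prune: keep only hypotheses within the beam of the current best.
        survivors = {s: sc for s, sc in active.items() if sc >= best - beam}
        new_active = {}
        for prev, prev_score in survivors.items():
            for s in states:
                if (prev, s) not in trans:
                    continue
                score = prev_score + trans[(prev, s)] + emit(s, frame)
                if score > new_active.get(s, -math.inf):
                    new_active[s] = score
        active = new_active
    # Best final state and its score (backpointers omitted in this sketch).
    best_state = max(active, key=active.get)
    return best_state, active[best_state]
```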

Typical speech recognition systems use independently trained acoustic models (HMMs) and language models, and these models need to be integrated in a meaningful way. In general, each model is applied separately to incoming speech features, and recognition hypotheses are then generated using the combined scores. In 1989, Xuedong Huang demonstrated that joint rather than independent optimization of these models improves robustness given a fixed amount of training data [Huang93].
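
In practice the combination is usually log-linear, with a language model weight and a word insertion penalty absorbing the mismatched dynamic ranges of the two models. The sketch below shows that combination only to fix ideas; the weights are illustrative placeholders, and the joint optimization described above goes further than this.

```python
def combined_score(acoustic_logprob, lm_logprob, n_words,
                   lm_weight=9.5, word_penalty=-0.5):
    """Log-linear combination of separately trained model scores.

    The language model weight and the word insertion penalty compensate
    for the mismatched dynamic ranges of the two models; the values here
    are placeholders, not tuned settings from any particular system.
    """
    return acoustic_logprob + lm_weight * lm_logprob + word_penalty * n_words
```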

Although the mixture-of-Gaussians statistical model [Rabiner85] used by most HMM-based systems has been successful, it may not be the best representation for speech [Bahl87]. In comparison, neural networks, hybrid systems, and scene analysis each make different assumptions about the nature of the statistical distributions, and each leads to a different decision surface. Alex Waibel's original thesis work on TDNNs [Waibel86, Waibel89] has led to a large and successful group at CMU that continues research in speech recognition.
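
For reference, the observation density in such systems is typically a mixture of diagonal-covariance Gaussians evaluated in the log domain. The following sketch computes that log-likelihood for a single feature vector; it is illustrative code, not drawn from any Sphinx release.

```python
import numpy as np

def gmm_loglike(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance
    Gaussian mixture, the usual observation model in HMM-based systems.

    x         : (D,) feature vector
    weights   : (M,) mixture weights summing to one
    means     : (M, D) component means
    variances : (M, D) component variances
    """
    diff = x - means                                   # (M, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=1))
    # Log-sum-exp over the mixture components for numerical stability.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))
```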

Lexical and Language Modeling

One of the goals at CMU and U. Karlsruhe is to find model weaknesses that stem from faulty assumptions and to develop correction strategies that are constructed automatically with data-driven methods. For example, maintaining consistency across a pronunciation lexicon of over 64,000 pronunciations currently requires a linguist. Data-driven systems that can automatically create new pronunciations from training examples are being devised [Sloboda96].
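
The sketch below gives a toy flavor of the data-driven idea: given surface phone strings observed in training data, the most frequent variants of each word are retained as its lexicon entries. The words, phone strings, and selection rule are invented for illustration and are far simpler than the techniques in [Sloboda96].

```python
from collections import Counter

# Toy surface pronunciations "observed" in training data; the words and
# phone strings are invented for illustration only.
observed = {
    "either": ["IY DH ER", "AY DH ER", "IY DH ER"],
    "tomato": ["T AH M EY T OW", "T AH M AA T OW"],
}

def learn_pronunciations(observations, keep=2):
    """Keep the most frequent surface pronunciations of each word."""
    lexicon = {}
    for word, phone_strings in observations.items():
        counts = Counter(phone_strings)
        lexicon[word] = [p for p, _ in counts.most_common(keep)]
    return lexicon

print(learn_pronunciations(observed))
```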

The traditional n-gram statistical language model has been difficult to beat in terms of modeling power and computational efficiency. In 1994, Ronald Rosenfeld extended language modeling work at CMU by demonstrating that exponential models built using a maximum entropy criterion can outperform traditional n-gram models and accommodate an arbitrary set of model constraints [Rosenfeld96]. Since maximum entropy models remain computationally intensive to build, there is active research into fast training techniques. In addition, language models that can automatically adapt to detected changes in vocabulary, topic, and speaking style are being developed.
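
As a point of reference, the baseline n-gram model itself is straightforward to implement; the sketch below collects bigram counts and applies add-alpha smoothing. Deployed systems use more careful smoothing and backoff, and the maximum entropy models discussed above are considerably more involved.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect unigram (context) and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(w | w_prev)."""
    return ((bigrams[(w_prev, w)] + alpha)
            / (unigrams[w_prev] + alpha * vocab_size))
```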

Speaker Independence and Acoustic Robustness

It is often more practical to obtain a small amount of example speech from a very large number of speakers than to obtain several hours of speech from each potential user. In some cases, it may not be possible to get training speech from the user in advance. Realizing this, Kai-Fu Lee led the way in 1989 with the first practical, accurate speaker-independent system, Sphinx, which became a clear milestone in the speech recognition community [Lee89].

Although work on baseline acoustic models has improved performance overall, more serious variations due to speaker, microphone, and environmental noise require robustness under unpredictable acoustic conditions. Richard Stern and his students have explored a variety of successful techniques for addressing these issues, using both explicit and implicit models of noise and distortion [Acero93, Stern96, Stern97]. Their work continues, emphasizing dynamic and automatic model adaptation to speaker, channel, noise, accent, pronunciation, speaking style, and dialogue, with the goal of producing agile systems that can adjust their number of parameters on the fly.
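
One of the simplest explicit compensation techniques in this family is cepstral mean normalization, which removes the additive offset that a stationary linear channel introduces in the cepstral domain. The sketch below shows the per-utterance version, included only to fix ideas; the work cited above covers far more sophisticated adaptation.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension.

    A stationary linear channel appears as an additive offset in the
    cepstral domain, so removing the mean discards much of the
    microphone and channel mismatch.

    cepstra : (T, D) array of cepstral feature vectors for one utterance
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```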

Spoken Language Systems

Speech recognition systems at CMU have primarily been intended to extract semantic content from a speaker. More generally, spoken language systems that perform some useful task and incorporate speech as an input modality have always been at the heart of the speech group. The earliest example is the CMU Chessboard project, which demonstrated that an imperfect recognizer could be used effectively for a particular task if the search space is sufficiently restricted [CITE Chessboard].

In order to help extract meaning from computer recognition of spoken language, Wayne Ward developed the Phoenix parser and grammars [Ward90, Issar94]. Despite the apparent simplicity of semantic frame parsing, he showed that the technique can be used for non-trivial spoken language problems such as dialogues between user and machine.
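
To convey the flavor of frame-and-slot parsing, the sketch below fills a flight-query frame from whatever slot patterns match and ignores the rest of the utterance. It is not the Phoenix parser; the frame, slots, and patterns are invented for illustration.

```python
import re

# A toy frame-and-slot grammar; the frame, slots, and patterns are
# invented for illustration and are not taken from the Phoenix grammars.
FLIGHT_FRAME = {
    "origin":      re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
    "day":         re.compile(r"\bon (monday|tuesday|wednesday|thursday|"
                              r"friday|saturday|sunday)"),
}

def parse_flight(utterance):
    """Fill whatever slots match and ignore the rest of the utterance."""
    slots = {}
    for slot, pattern in FLIGHT_FRAME.items():
        match = pattern.search(utterance.lower())
        if match:
            slots[slot] = match.group(1)
    return slots

print(parse_flight("uh I need a flight from Pittsburgh to Boston on Friday"))
# -> {'origin': 'pittsburgh', 'destination': 'boston', 'day': 'friday'}
```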

Presently, research at CMU is focused on developing recognition capabilities in the context of several applications.

Error Analysis

Speech applications have demonstrated that optimizing recognizer model probabilities and blindly reducing word errors do not necessarily improve utility. A valuable asset to recognition systems would be the ability to analyze errors so that performance can be tuned toward the important words and phrases specific to each application. To accomplish this, a better understanding of what constitutes good speech recognition performance is needed, noting that performance will vary within an application as often as among different applications.

Because today's systems process so much data that manual error analysis is infeasible, tools that automatically measure performance are needed. Lin Chase's work in this area is a milestone in understanding what types of recognition errors occur and why [Chase97]. Currently, new tools are being developed that classify errors into categories so that the sources of error can be remedied.
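
The most common automatic performance measure remains the word error rate, computed from a Levenshtein alignment of the recognizer's hypothesis against a reference transcript; a minimal sketch follows. Tools of the kind described in [Chase97] go well beyond this single number.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate from a Levenshtein alignment of word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("show me flights to boston",
                      "show flights to austin"))  # 0.4
```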

Looking Forward

The search for the appropriate units of representation for speech, language, and dialogue remains active and continues to change the way we look at problems. Units that challenge the phonemic basis of speech recognition, such as syllabic, polyphonetic, and automatically generated acoustic subword units, are reshaping the modeling landscape.

On the horizon are new possibilities for expanding the speech recognition feature set at the acoustic level. Models of human auditory perception and speech are beginning to resurface and become integrated with other methods, and these may provide some advantage.

The development of new applications using speech technology has sparked interest in determining when and how to incorporate speech as a user input modality, and is leading to the design of more efficient dialogues. Along these lines, class grammars may provide better modeling of sentence, topic, and discourse structure. In addition, an accurate prosodic model would be useful for choosing among competing parses in natural language understanding, as well as for producing natural speech for both language learning and instruction.

By and large, deploying systems with speech interfaces, even in controlled environments, will lead to greater innovations in continuous speech recognition than persistent study of aging databases.

References