10 October 2003
Computer-based speech recognition systems came into existence in the 1950s, shortly after the development of the high-speed analog-to-digital converter. The earliest systems were based on explicit matching of the recorded signal against pre-stored templates (e.g. Forgie and Forgie, 1959). Later advances included dynamic time-warping algorithms for explicit pattern matching (e.g. Sakoe and Chiba, 1978) and the use of phonotactic rules (e.g. Church, 1983). The 1980s saw great progress in rule-based methods for speech recognition, which combined expert knowledge about the signal and spectral characteristics of speech with AI techniques in order to perform recognition (e.g. Zue, 1985). In spite of all these advances, the goal of speaker-independent recognition of continuous speech remained elusive to these early researchers.
In the early 1970s, James Baker, then a graduate student at Carnegie Mellon University (CMU) working under the supervision of Professor Raj Reddy, proposed an alternative method for automatic speech recognition - a purely statistical approach based on a then-obscure mathematical model called the Hidden Markov Model (HMM) (Baker, 1975). The HMM models each sound unit as a sequence of states, where each state has its own distribution over the measurements of the speech signal. The parameters of the state distributions must be learned from training data, and recognition using these models involves a fairly computationally intensive decoding algorithm, in which the computer evaluates HMMs for several hypotheses in order to determine the most probable one. The HMM-based recognizer represented a paradigm shift from the signal and rule-based approaches followed by almost all researchers until that time, in that it relied almost entirely on statistical knowledge of the speech signal gained through automated analysis of large quantities of speech data, and not on codified human expertise.
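The decoding step described above can be illustrated with a small sketch. The text does not name the decoding algorithm; the example below assumes the Viterbi algorithm, a standard choice for finding the most probable state sequence of an HMM. The function name viterbi, the two-state model, and all probabilities are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Return (best_state_path, log_probability) for a discrete-output HMM.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][o]    : probability that state s emits symbol o
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(initial)
    # score[s] holds the best log probability of any path ending in state s.
    score = [log(initial[s]) + log(emission[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log(transition[p][s]))
            new_score.append(score[best_prev]
                             + log(transition[best_prev][s])
                             + log(emission[s][obs]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical two-state left-to-right model over two discrete symbols.
initial = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]
emission = [[0.9, 0.1], [0.2, 0.8]]
path, log_prob = viterbi([0, 0, 1, 1], initial, transition, emission)
print(path, math.exp(log_prob))
```

In a full recognizer the same dynamic programming runs over a much larger network of HMMs built from the vocabulary and the grammar, usually with pruning, which is where most of the computational cost mentioned above arises.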
The HMM approach was initially considered computationally infeasible, since it required large quantities of data to learn the parameters of the HMMs, and large amounts of computation to perform the actual recognition. Soon, however, researchers realized that it also presented a conceptually powerful modeling paradigm for speech sounds, with greater flexibility than any of the then-current techniques. By the 1980s, several research teams around the world were working intensively on this and related statistical paradigms. Notable among these were the teams at AT&T Bell Laboratories and IBM, both based in the USA. By the mid-1980s, these teams had succeeded in developing statistical speech recognition systems that worked well for narrow acoustic domains, such as for individual speakers (speaker-dependent systems) or in speaker-independent recognition of words that were spoken distinctly apart and recorded as separate speech signals (isolated-word systems).
True speaker-independent recognition of continuously spoken speech still presented a problem. By the late 1980s, researchers were beginning to discover ways to deal with it, although it was still generally felt that the computational resources of the time, which consisted of slow processors and tiny memory by today's standards, could simply not support an HMM-based speaker-independent, continuous speech recognition system. This belief was shattered in 1988 by the unveiling of the HMM-based Sphinx speech recognition system at CMU, which incorporated several innovations in the modeling of spoken sounds and in the engineering of the actual algorithms used for recognition. This was a continuous-speech, speaker-independent system that not only recognized speech with high accuracy, but did so at the natural speed at which words are spoken by an average person (real-time recognition). The system was developed by Kai-Fu Lee, then a doctoral student under the supervision of Professor Raj Reddy at CMU, and Roberto Bisiani, a research scientist at CMU. Sphinx demonstrated to the world that automatic speaker-independent recognition of continuous speech was not only possible, but also achievable with the computational resources of the day.
From 1988 to the present day, as computers and algorithms have both grown in sophistication, Sphinx has morphed into a suite of recognition systems, each marking a milestone in HMM-based speech recognition technology. All of these retain the name Sphinx and are labeled with different version numbers, in keeping with the contemporary style of referencing software. In the paragraphs that follow, we describe some key technical aspects of these systems.
There are currently four versions of Sphinx in existence:
Sphinx-1: Sphinx-1 was written in the C programming language and was, as described in the paragraphs above, the world's first high-performance speaker-independent continuous speech recognition system. It was based on the then-viable technology of discrete HMMs, i.e. HMMs that used discrete distributions or simple histograms to model the distributions of the measurements of speech sounds. Since speech is a signal whose values are continuous over some range, this modeling paradigm required a quantization of speech into a discrete set of symbols. Sphinx-1 accomplished this using a vector quantization algorithm. A primary sequence of LPC-cepstral vectors was derived from the speech signals, and from this sequence, two secondary sequences of difference parameter vectors were derived. The vector quantization algorithm computed codebooks from the vectors in the sequences, and replaced each vector by codeword indices from the codebooks. HMMs were then trained with these sequences of quantized vectors. During recognition, incoming speech was also converted into sequences of codeword indices using the same codebooks.
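The quantization step described above can be sketched in a few lines. The text does not say which algorithm Sphinx-1 used to compute its codebooks; the example below assumes plain Lloyd (k-means) iterations, and the function names, the two-dimensional toy vectors, and the two-entry codebook are all hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))


def train_codebook(vectors, initial_codebook, iterations=10):
    """Refine a codebook with Lloyd (k-means) iterations (assumed algorithm)."""
    codebook = [list(c) for c in initial_codebook]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest_codeword(v, codebook)].append(v)
        for k, members in enumerate(clusters):
            if members:  # keep the old codeword if no vector was assigned to it
                dim = len(members[0])
                codebook[k] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def quantize(vectors, codebook):
    """Replace every feature vector by the index of its nearest codeword."""
    return [nearest_codeword(v, codebook) for v in vectors]


# Toy 2-dimensional "feature vectors"; real cepstral vectors have more dimensions.
training_vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.8, 5.0]]
codebook = train_codebook(training_vectors, [[0.0, 0.0], [1.0, 1.0]])
symbols = quantize(training_vectors, codebook)
print(codebook, symbols)
```

The resulting index sequence is what a discrete HMM consumes: each state only needs a histogram over codeword indices rather than a density over continuous vectors.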
The sound units that the system modeled with discrete HMMs were called generalized triphones. A triphone is simply a distinct phonetic unit labeled with its immediately adjacent phonetic contexts. Triphones were, and remain, one of the most effective innovations in modeling speech sounds in HMM-based systems. In a system based on generalized triphones, a number of triphones are modeled by a common HMM. In Sphinx-1, this was done for logistical reasons, since computers in those days were not powerful enough to handle all triphones separately. Also, the large stored databases required to train the vast repository of triphones found in normal everyday speech did not exist.
In a recognition system, the actual process of recognition is guided by a grammar or language model, which encapsulates prior knowledge about the structure of the language. Sphinx-1 used a simple word-pair grammar, which indicates which word pairs are permitted in the language and which are not.
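As a rough illustration of what a word-pair grammar encodes, the sketch below stores the permitted pairs as a set and checks whether every adjacent pair in a word sequence is allowed. The vocabulary, the pairs, and the function name are hypothetical; how Sphinx-1 actually stored or applied its grammar is not described here.

```python
def allowed_by_word_pair_grammar(words, allowed_pairs):
    """True if every pair of adjacent words appears in the set of allowed pairs."""
    return all((first, second) in allowed_pairs
               for first, second in zip(words, words[1:]))


# Hypothetical grammar for a tiny command vocabulary.
allowed_pairs = {
    ("show", "all"),
    ("show", "ships"),
    ("all", "ships"),
    ("ships", "nearby"),
}

print(allowed_by_word_pair_grammar(["show", "all", "ships"], allowed_pairs))
print(allowed_by_word_pair_grammar(["ships", "show", "all"], allowed_pairs))
```

A grammar of this kind makes a yes-or-no decision about each pair and assigns no probabilities, so during search it can only prune word sequences, not rank the permitted ones.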
Sphinx-1 achieved word recognition accuracies of about 90% on 1000-word vocabulary tasks. Such performance represented a major breakthrough at the time. The system ran in real time on the best machines of the day, such as the SUN-3 and DEC-3000.
Sphinx-2: Sphinx-1 triggered a phase of phenomenal development in HMM-based continuous speech recognition technology. Within five years of its introduction, technology based on semi-continuous HMMs was ready to be used, and much of it had been developed at CMU. Sphinx-2 came into existence in 1992, and was again a pioneer system based on the new technology of semi-continuous HMMs. The system was developed by Xuedong Huang at CMU, then a post-doctoral researcher working with Professor Raj Reddy. Like Sphinx-1, it was written in the C programming language.
The essence of semi-continuous HMM based technology was that speech was no longer required to be modeled as a sequence of quantized vectors. State distributions of a semi-continuous HMM were modeled by mixtures of Gaussian densities. Rather than the speech vectors themselves, it was the parameters of the Gaussian densities that were quantized. Sphinx-2 used 4 parallel feature streams, three of which were secondary streams derived from a primary stream of 13-dimensional cepstral vectors computed from the speech signal. All components of the feature streams were permitted to take any real value. For each feature stream, Gaussian density parameters were allowed to take one of 256 values (the number 256 being the number of distinct values representable in 8 bits). The actual values of the 256 sets of parameters were themselves learned during training.
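The idea of quantizing the Gaussian parameters rather than the speech vectors can be sketched as follows: all states draw on one shared codebook of Gaussians, and each state differs only in the mixture weights it places on them. The code is a minimal sketch with a hypothetical two-entry, one-dimensional codebook (Sphinx-2 used 256 entries per feature stream); diagonal covariances, the function names, and all numbers are assumptions made for illustration.

```python
import math


def gaussian_density(x, mean, var):
    """Density of a diagonal-covariance Gaussian at the vector x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)


def state_likelihoods(x, shared_codebook, weights_per_state):
    """Likelihood of x under each state of a semi-continuous HMM.

    The codebook densities are evaluated once per feature vector and then
    reused by every state; only the mixture weights are state specific.
    """
    densities = [gaussian_density(x, mean, var) for mean, var in shared_codebook]
    return {state: sum(w * d for w, d in zip(weights, densities))
            for state, weights in weights_per_state.items()}


# Hypothetical shared codebook: (mean, variance) pairs for 1-dimensional features.
shared_codebook = [([0.0], [1.0]), ([4.0], [1.0])]
weights_per_state = {"state_a": [0.9, 0.1], "state_b": [0.1, 0.9]}

print(state_likelihoods([0.0], shared_codebook, weights_per_state))
```

Because the expensive density evaluations are shared, adding more states costs only one weighted sum per state, which is one reason this design was attractive on the hardware of the period.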
In order to achieve real-time speeds, the system was hardwired to use 5-state Bakis topology HMMs for all sound units. Each sound unit was a triphone, but unlike Sphinx-1, this system did not use generalized triphones. Instead, state distributions of the HMMs for the triphones were tied, i.e. states of the HMMs for several triphones were constrained to have the same distribution. The state tying was performed using decision trees. This technique of sharing distribution parameters at the state level, invented by Mei-Yuh Hwang, then a doctoral student at CMU under the supervision of Professor Raj Reddy, was yet another major milestone in HMM-based speech recognition technology, as it made it possible to train large numbers of parameters for a recognition system with relatively modest amounts of data.
Sphinx-2 also improved upon its predecessor, Sphinx-1, in being able to use statistical N-gram language models during search, where N could, in principle, be any number. An N-gram language model represents the probability of any word in the language, given the N-1 prior words in a sentence, and is significantly superior to the word-pair grammar used by Sphinx-1 as a representation of the structure of a language.
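The definition above, the probability of a word given the N-1 words before it, can be made concrete with a small counting sketch. It uses unsmoothed maximum-likelihood estimates on a hypothetical three-sentence corpus; the sentence boundary markers, the function names, and the absence of smoothing or backoff are simplifying assumptions and do not describe the language model format used by Sphinx-2.

```python
from collections import Counter


def train_ngram(sentences, n):
    """Count N-grams and their (N-1)-word contexts over tokenized sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        # Pad so that the first real word also has N-1 words of context.
        tokens = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_probability(word, context, ngram_counts, context_counts):
    """Maximum-likelihood estimate of P(word | context); 0.0 for unseen contexts."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]


# Hypothetical training corpus.
corpus = [["call", "home"], ["call", "office"], ["go", "home"]]
bigrams, contexts = train_ngram(corpus, n=2)
print(ngram_probability("home", ["call"], bigrams, contexts))
```

Unlike a word-pair grammar, the model grades word sequences instead of merely permitting or forbidding them; practical systems add smoothing so that unseen N-grams do not receive zero probability.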
The semi-continuous HMM based Sphinx-2 required much greater computation to perform recognition than Sphinx-1 did, and consequently the early Sphinx-2 decoders (the part of the recognizer that actually performs the recognition) were relatively slow, and took longer than real time on the hardware of the day. These were soon replaced by the FBS-8 decoder, written by Mosur Ravishankar of CMU (Ravishankar, 1996), which used a lexical-tree-based search strategy that represents the recognizer vocabulary as a tree of phonemes. This lextree decoder could perform recognition in real time on standard computers. The name FBS-8 recalls an interesting history of this decoder. FBS-8 stands for Fast Beam Search version 8. This decoder was actually the 8th in a string of quite different decoders. In fact its most successful predecessor, FBS-6, used a "flat" search strategy that represents each word in the recognizer's lexicon separately, and thus was coded very differently from its immediate successor FBS-7, which was lextree based. Sphinx-2 was able to achieve an accuracy of about 90% on a 30,000 word vocabulary (e.g. Wall Street Journal) recognition task.
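The difference between the flat and lexical-tree representations mentioned above can be illustrated with a small sketch. It builds a prefix tree over phoneme sequences and compares its size with the total number of phonemes a flat lexicon would hold. The three-word lexicon, its pronunciations, and the function names are hypothetical, and the sketch covers only the data structure, not the beam search that FBS-8 runs over it.

```python
def build_lextree(lexicon):
    """Build a prefix tree of phonemes; words sharing a prefix share nodes."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []})
        node["words"].append(word)  # the word ends at this node
    return root


def count_nodes(node):
    """Number of phoneme nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


# Hypothetical lexicon with illustrative pronunciations.
lexicon = {
    "start": ["S", "T", "AA", "R", "T"],
    "star": ["S", "T", "AA", "R"],
    "stop": ["S", "T", "AA", "P"],
}

tree = build_lextree(lexicon)
flat_size = sum(len(phones) for phones in lexicon.values())
print(flat_size, count_nodes(tree))
```

Sharing prefixes means that the common beginnings of many words are evaluated once rather than once per word, which is the main source of the speed advantage of a lextree decoder; the trade-off is that the identity of a word is known only at the node where it ends.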
Sphinx-3: The next milestone in recognition technology was marked by the creation of Sphinx-3. It was developed four years after Sphinx-2 was unveiled, and incorporated technology that allowed the modeling of speech with continuous density HMMs in a fully continuous vector space. No vector space quantization was required at any level. Naturally, this resulted in better recognition performance than that of its predecessor, Sphinx-2. At the same time, however, the statistical requirements imposed by this technology meant that large amounts of data were required to train such HMMs well. Fortunately, hardware improvements in computers allowed the storage and processing of large amounts of data by this time. Sphinx-3 was thus a very viable system at the time it was developed. It had many new features, but was at the same time designed to be backward compatible with Sphinx-2. This meant that it could handle both semi-continuous and fully continuous HMMs. The system was written in the C programming language, and developed jointly by two people at CMU: its HMM-training modules were developed by Eric Thayer, and its decoding modules were developed by Mosur Ravishankar. Sphinx-3 was more versatile in its feature-type handling capacity than its predecessors. The primary feature stream was no longer constrained to be 13-dimensional. It could use single-stream and 4-stream feature sets. The HMM topology was user-specifiable. Like its predecessors, it used triphone HMMs, with state sharing information obtained using decision trees.
Sphinx-3 currently has two different decoders, both written by Mosur Ravishankar. Both decoders use statistical N-gram language models with N<=3 during search. They differ in several respects, including their search strategies. One decoder, generally referred to as S3.0, uses the flat search strategy. The second decoder has two variants, usually referred to as S3.2 and S3.3. These differ in their ability to handle streaming input and return partial hypotheses during decoding: S3.2 does not have these capabilities, while S3.3 does. Both variants use a lextree-based search strategy. The use of lextrees makes these versions much faster than S3.0, and their speed is further enhanced through a subvector quantization strategy for Gaussian selection. Both decoders, and the trainer, have hardwired requirements to some degree, though to a much lesser extent than the Sphinx-2 system. Such requirements are often an essential part of the engineering that goes into implementing speech recognition systems, and are necessitated by the limitations of the computational platforms of the day.
Sphinx-4: The years after the development of Sphinx-3 saw two major advancements in speech research. First, multimodal speech recognition came into existence, with the development of associated algorithms. Second, it became feasible for speech recognition systems to be deployed pervasively in wide-ranging, even mobile, environments.
In multimodal speech recognition, the information in the speech signal is augmented by evidence from other concurrent phenomena, such as gestures and expressions. Although these information streams are related, they differ both in the manner in which they must be modeled and in the range of contexts that they capture. The contexts that affect the evidence for any sound can often be highly asymmetric and time dependent. It therefore becomes necessary for the structure and the distributions of the HMMs, and the context dependencies of the sounds they represent, to be variable and user controllable. Also, wide-ranging practical applications require the system to function in widely varying language scenarios, and so it also becomes necessary to be able to use language models other than statistical N-grams, such as context free grammars (CFGs), finite state automata (FSAs), or stochastic FSAs, which might be better suited to the task domain at hand. Statistical language models are just one of a set of different types of well-researched models of human language.
With these considerations, by the beginning of the year 2000, it was clear that Sphinx-3 would need to be upgraded. Sphinx-3 could only use triphone context sound units and required a uniform HMM topology for all sound units. It also required a uniform type of HMM, in the sense that the basic type of statistical density used by each HMM would have to be the same, with the same number of modes, for all sound units, regardless of the amount of training data or the type of feature being used for recognition. Moreover, Sphinx-3 was only partially ready for multimodal speech recognition: although it could use multiple feature streams, it could combine them only at the state level.
Additionally, in the time since the development of Sphinx-3, the world had seen exponential growth and innovation in the computational resources, both hardware and software, available to the general user. The Internet, markup languages, and programming languages that augment and integrate smoothly with markup languages and the Internet have developed rapidly. Even better recognition algorithms are now available, and Internet-based technology permits new ways in which these can be integrated into any real medium. High flexibility and high modularity are the norms by which today's pervasive software is being developed.
Thus, in response to the demands and opportunities presented by the new technology available to speech recognition systems, and to the demands of the rapidly maturing area of multimodal recognition, Sphinx-4 was initiated in 2001 by a team of researchers from Carnegie Mellon University, Mitsubishi Electric Research Labs and Sun Microsystems. In 2002, researchers from Hewlett Packard Inc. joined the effort. At the time of writing this article, Sphinx-4 is nearing completion. It is written entirely in the Java programming language, which is an extremely powerful language for Internet-based applications and provides vastly superior interfacing capabilities. The system is being developed on an open platform, and is freely available to researchers, developers, and commercial organizations at any stage of development. It is in principle an effort of the world community.
Baker, J. K. (1975). Stochastic modeling for automatic speech understanding, in D. R. Reddy, ed., Speech Recognition, Academic Press, New York, pp. 521-542.
Church, K. (1983). Allophonic and Phonotactic Constraints are Useful, International Joint Conference on Artificial Intelligence, Karlsruhe, West Germany.
Forgie, J. & Forgie, C. (1959). Results obtained from a vowel recognition computer program, Journal of the Acoustical Society of America, 31(11), 1480-1489.
Hwang, M.-Y. (1993). Subphonetic Acoustic Modelling for Speaker-Independent Continuous Speech Recognition, PhD Thesis, CMU-CS-93-230, Carnegie Mellon University.
Huang, X., Alleva, F., Hon, H.-W., Hwang, M.-Y., Lee, K.-F. & Rosenfeld, R. (1992). The SPHINX-II speech recognition system: an overview, Computer Speech and Language, 7(2), 137-148.
Lamere, P., Kwok, P., Walker, W., Gouvea, E., Singh, R., Raj, B. & Wolf, P. (2003). Design of the CMU Sphinx-4 decoder, Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003).
Lee, K.-F., Hon, H.-W. & Reddy, R. (1990). An overview of the SPHINX speech recognition system, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-38(1), 35-44.
Placeway, P., Chen, S., Eskenazi, M., Jain, U., Parikh, V., Raj, B., Ravishankar, M., Rosenfeld, R., Seymore, K., Siegler, M., Stern, R. & Thayer, E. (1997). The 1996 Hub-4 Sphinx-3 System, Proceedings of the 1997 ARPA Speech Recognition Workshop, 85-89.
Ravishankar, M. K. (1996). Efficient Algorithms for Speech Recognition, PhD Thesis, CMU-CS-96-143, Carnegie Mellon University.
Sakoe, H. & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-26(1), 43-49.
Zue, V. (1985). The use of speech knowledge in automatic speech recognition, Proceedings of the IEEE, 73(11), 1602-1615.