
CMU Sphinx's Development Page

If you don't know what CMU Sphinx is, here is some brief information: CMU Sphinx ("Sphinx" for short) is a speaker-independent, large-vocabulary, continuous speech recognizer of industrial strength. It is the product of many years of speech recognition research by the CMU Speech group.

In February 2000, Sphinx 2 became an open source project under a BSD-style license. Later, an acoustic model trainer (SphinxTrain) and an even more accurate speech recognizer (Sphinx 3) were also released. These releases allow researchers around the world to use the Sphinx recognizers and to understand their underlying principles.

Motivated by the CALO project, development of Sphinx restarted in December 2003.

To distinguish this work from the Sphinx 3 development done before 2004, several developers named the new effort Sphinx 3.X, where the X represents an "extension" of the original Sphinx 3.

Sphinx 3.X is a speech recognizer that can run in real time and has the capability of learning. It also has a complete live-mode decoder API that allows users to incorporate Sphinx 3.X into their applications.

This page is not maintained anymore!!!!

As of 2007-08-01, this page is obsolete. David Huggins-Daines, the new maintainer of Sphinx{2,3,Base,Train}, is now working on Sphinx 3.7. His wiki page will give you a much better idea of the current state of Sphinx development. Please visit: http://lima.lti.cs.cmu.edu/mediawiki/index.php/Main_Page

Summary of Development of Sphinx 3.X (X < 7)

The following is a summary of what we have done in each minor version of Sphinx 3.X.

Current Development of Sphinx 3.X (X=6)

The major goal of Sphinx 3.6 is to enhance the Viterbi search in the decoders of Sphinx 3.X (X=5). Here is a glimpse of some completed features:

1. Full triphones will be correctly supported in decode_anytopo.

2. FST search will be enabled using the Sphinx 2.6 format.

3. Further speed-up of GMM computation by enhancing Context-Independent senone-based GMM Selection (CIGMMS); a sketch of the idea follows this list.

4. Word-level confidence estimation.

5. The fast GMM computation routines will also be usable in legacy s3.0 applications.

6. Multiple regression classes for MLLR and MAP adaptation in SphinxTrain.
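
To give a flavour of how CIGMMS works, here is a small sketch in C. All names, sizes and stub functions are made up for illustration; the real Sphinx 3 code is organized quite differently. The idea: evaluate the few context-independent (CI) senone GMMs exactly in every frame, then fully evaluate a context-dependent (CD) senone only when its CI parent scores within a beam of the best CI score, backing off to the parent's score otherwise.

    /* Sketch of CI-senone-based GMM selection (CIGMMS).
     * All names, sizes and stubs are illustrative, not the real s3 code. */
    #include <float.h>

    #define N_CI 50     /* number of CI senones (made-up size) */
    #define N_CD 5000   /* number of CD senones (made-up size) */

    /* Illustrative stand-ins for full GMM evaluation and the CD->CI mapping. */
    static float eval_ci_gmm(int ci, const float *frame) { (void)ci; (void)frame; return 0.0f; }
    static float eval_cd_gmm(int cd, const float *frame) { (void)cd; (void)frame; return 0.0f; }
    static int   ci_parent(int cd) { return cd % N_CI; }

    void compute_senone_scores(const float *frame, float *score, float beam)
    {
        float ci_score[N_CI];
        float best_ci = -FLT_MAX;

        /* 1. Evaluate every CI senone exactly (cheap: there are few of them). */
        for (int ci = 0; ci < N_CI; ci++) {
            ci_score[ci] = eval_ci_gmm(ci, frame);
            if (ci_score[ci] > best_ci)
                best_ci = ci_score[ci];
        }

        /* 2. Fully evaluate a CD senone only if its CI parent falls within the
         *    beam of the best CI score; otherwise back off to the CI score. */
        for (int cd = 0; cd < N_CD; cd++) {
            int parent = ci_parent(cd);
            if (ci_score[parent] > best_ci - beam)
                score[cd] = eval_cd_gmm(cd, frame);   /* full computation    */
            else
                score[cd] = ci_score[parent];         /* cheap approximation */
        }
    }

The beam is the usual speed/accuracy knob: a tighter beam skips more CD senones and runs faster at some cost in accuracy.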

(New!) Sphinx 3.6 Release Candidate I was released on Mar 30, 2006. See the release notes here.

Development of Sphinx 3.X (X=5)

The major feature in Sphinx 3.5 is speaker adaptation using the maximum likelihood linear regression (MLLR) method. This allows the models to be transformed into a parameter space that is closer to the individual speaker's voice.
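
For reference, the standard MLLR mean transform (the textbook formulation, not a line-by-line description of the Sphinx 3.5 code) adapts every Gaussian mean with a shared affine transform estimated from the adaptation data:

    % \mu is a speaker-independent Gaussian mean; A and b are the
    % regression matrix and bias estimated by maximizing the likelihood
    % of the adaptation data.
    \hat{\mu} = A\mu + b
    % Equivalently, with the extended mean vector \xi = [1, \mu^{\top}]^{\top}:
    \hat{\mu} = W\xi, \qquad W = [\, b \;\; A \,]

When multiple regression classes are used (a feature listed for Sphinx 3.6 above), the Gaussians are grouped into classes and a separate transform is estimated for each class.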

Sphinx 3.5 also marks our first effort to merge two legacy decoder code bases, Sphinx 3 slow and Sphinx 3 fast (both predecessors of Sphinx 3.X). The combined package provides a comprehensive set of tools that allows users to run both the fast decoder (decode) and the slow decoder (decode_anytopo), as well as functionality such as forced alignment (align), best-path search (dag), N-best generation (astar) and phoneme recognition (allphone).

Sphinx 3.5 was officially released on Jan 13, 2005.

Development of Sphinx 3.X (X=4)

As of 3.3, the recognizer could not run under 1xRT (i.e., using less time than the length of the waveform). Our major achievement in Sphinx 3.4 was speeding up the recognizer with a framework called four-level GMM computation categorization, which can incorporate multiple types of fast GMM computation techniques. This gives Sphinx 3.4 under-1xRT performance for tasks with fewer than 20,000 words.
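
The four levels are, roughly, the frame level (skip or downsample whole frames), the GMM level (skip whole senones and back off, as CIGMMS does), the Gaussian level (evaluate only a shortlist of Gaussians per senone) and the component level (approximate the per-Gaussian computation). The skeleton below only sketches how these layers nest; every name and policy is made up for illustration and the per-layer decisions are trivial stubs, not the real s3 code.

    /* Skeleton of a four-level fast GMM computation loop (illustrative only). */
    #define LOG_ZERO (-1.0e30f)

    /* Made-up stand-ins for the per-layer decisions. */
    static int   senone_active(int s, int t)                  { (void)s; (void)t; return 1; }
    static float backoff_score(int s, int t)                  { (void)s; (void)t; return LOG_ZERO; }
    static int   gaussian_shortlist(int s, int t, int *glist) { (void)s; (void)t; glist[0] = 0; return 1; }
    static float approx_gaussian_score(int s, int g, int t)   { (void)s; (void)g; (void)t; return 0.0f; }
    static float log_add(float a, float b)                    { return a > b ? a : b; } /* max as a crude stand-in */

    void score_all_frames(float **score, int n_frames, int n_senones)
    {
        int glist[64];   /* assumed maximum shortlist size */

        for (int t = 0; t < n_frames; t++) {
            /* Frame level: here, naively compute only every other frame. */
            if (t % 2 == 1) {
                for (int s = 0; s < n_senones; s++)
                    score[t][s] = score[t - 1][s];
                continue;
            }
            for (int s = 0; s < n_senones; s++) {
                /* GMM level: skip inactive senones, back off to a cheap score. */
                if (!senone_active(s, t)) {
                    score[t][s] = backoff_score(s, t);
                    continue;
                }
                /* Gaussian level: only a shortlist of mixture components. */
                int n = gaussian_shortlist(s, t, glist);
                float sc = LOG_ZERO;
                for (int i = 0; i < n; i++)
                    /* Component level: approximate per-Gaussian scoring. */
                    sc = log_add(sc, approx_gaussian_score(s, glist[i], t));
                score[t][s] = sc;
            }
        }
    }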

We also incorporated the first live-mode decoder API (livedecodeAPI.c) into Sphinx 3.X. This allows developers to embed the recognizer in their own applications.
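
The typical calling pattern is: initialize the decoder, begin an utterance, feed raw audio as it arrives, end the utterance, then retrieve the hypothesis. The sketch below shows that loop; the declarations are stand-ins for including the real live-decode header, and the function names and signatures are approximations from memory, so please check livedecodeAPI.c for the exact prototypes.

    /* Illustrative live-mode decoding loop; API names/signatures approximate. */
    #include <stdio.h>

    typedef short int16;

    /* Stand-ins for the real live-decode header (assumed, not verbatim). */
    typedef struct live_decoder_s live_decoder_t;
    int ld_init_with_args(live_decoder_t *d, int argc, char **argv);
    int ld_begin_utt(live_decoder_t *d, char *uttid);
    int ld_process_raw(live_decoder_t *d, int16 *samples, int num_samples);
    int ld_end_utt(live_decoder_t *d);
    int ld_retrieve_hyps(live_decoder_t *d, char **hyp_str);
    int ld_finish(live_decoder_t *d);

    /* Assumed application-side audio source. */
    int read_audio_block(int16 *buf, int max_samples);

    int decode_live(live_decoder_t *d, int argc, char **argv)
    {
        int16 buf[4096];
        int   n;
        char *hyp = NULL;

        ld_init_with_args(d, argc, argv);   /* load models from command line */
        ld_begin_utt(d, NULL);              /* start a new utterance         */

        /* Feed raw audio to the decoder as it arrives. */
        while ((n = read_audio_block(buf, 4096)) > 0)
            ld_process_raw(d, buf, n);

        ld_end_utt(d);                      /* flush and finish the search   */
        ld_retrieve_hyps(d, &hyp);          /* get the recognized string     */
        if (hyp)
            printf("HYP: %s\n", hyp);

        ld_finish(d);                       /* release decoder resources     */
        return 0;
    }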

Sphinx 3.4 was officially released on July 13, 2004.

Why this page?

This page is not a replacement for the official Sphinx web page, which is maintained by Evandro Gouvea and me. Rather, it is aimed at providing information for the developers of Sphinx. In the future, we may merge several of the Sphinx web pages together. If you have any questions about any of the speech tools, please contact me at archan at cs dot cmu dot edu.

Freedom of Human-Computer Speech Interaction in the Future

The research and development of open-source speech recognition systems are important because speech is probably the most common form of human interaction. In the future, speech will also be an important form of interaction between humans and computers. Unfortunately, most state-of-the-art research algorithms and recognizer code are not open to the general public, or users have to pay a high cost to obtain a license for a recognizer. This leaves users with less freedom to talk to a computer. More subtly, they also have less freedom to choose how they talk to a computer, because the grammar and the mode of recognition are usually predefined.

Sphinx gives users and developers around the world a zero-cost recognizer. At the same time, researchers benefit because they can learn from existing source code how a speech recognizer can be implemented. We believe that making Sphinx open source and free can benefit the world community.

We need your help.

Many developers enjoy using the Sphinx recognizers. However, the current development versions of Sphinx, Sphinx 3 and Sphinx 4, are still imperfect in many ways. For example, in terms of interfaces, it would be desirable if Sphinx 3 could also accept acoustic models trained by other speech recognition suites. In terms of language coverage, it would be desirable if users could enjoy speech recognition in other languages. The current efforts by CMU and Sun Microsystems are limited to refining the core speech recognition search engine. We need more developers' support to make Sphinx a better recognizer in the future. This will truly benefit people around the world.

Here are a few resources we wish researchers/developers could contribute:

- models trained for different languages.

- models trained for different corpora.

- help with regression testing on different platforms.

- building better interfaces for the recognizer and trainer.

- making the trainer and the recognizer easier to use: we need better scripts for training and testing, and we also need a portable GUI. For details of this project, please read the Sphinx Open Projects page of this site.

Please contact Arthur Chan at archan at cs dot cmu dot edu if you are interested.