A comparison of different open source speech recognizers

Hidden Markov Model Toolkit (HTK)

HTK is the first speech recognizer and trainer that most speech researchers work with if they started after '99. (I guess it could be much earlier.) I have used it for most of my speech hacker's life. Many describe it as "very modular" and as implementing "good engineering practice". My understanding of these statements is that each application you can find in the toolkit has a single, well-defined use. For example, HLEd is a very versatile editor for HTK transcriptions in the .MLF format. Using HLEd is usually much better than using Unix tools. The interface itself is also very handy. Many people who are familiar with HTK can run training just by typing at the command prompt.
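To give a feel for the kind of file HLEd works on, here is a minimal sketch of a reader for the simplest form of the .MLF format: a "#!MLF!#" header, a quoted file pattern per entry, one label per line (optionally preceded by start and end times), and a lone "." closing each entry. The function name parse_mlf is my own and is not part of HTK, and the sketch ignores the rarer variants of the format (quoted labels, multiple label levels, labels that are themselves numbers).

```python
def parse_mlf(text):
    """Parse a simple HTK master label file into {pattern: [labels]}.

    Assumption: only the common layout is handled, i.e. each label line
    is either "label" or "start end label [score]".
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")

    entries = {}
    current = None  # name of the entry we are inside, if any
    for ln in lines[1:]:
        if current is None:
            # A new entry starts with a quoted file name or pattern.
            current = ln.strip('"')
            entries[current] = []
        elif ln == ".":
            # A lone period closes the current entry.
            current = None
        else:
            fields = ln.split()
            timed = (len(fields) >= 3
                     and fields[0].isdigit() and fields[1].isdigit())
            entries[current].append(fields[2] if timed else fields[0])
    return entries
```

Even this small reader shows why ad hoc Unix tools get awkward: an edit has to respect entry boundaries and optional time stamps, which is exactly the bookkeeping HLEd does for you.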

In terms of algorithms, HTK uses the token passing algorithm for decoding and Baum-Welch for training. The most amazing thing is that their derivations and implementations of these algorithms are very elegant. Even small minutiae such as the incorporation of null nodes are correct. This lets many newbies in speech recognition enjoy these advanced training facilities. Many advisors ask their students to match their own speech recognizer against HTK. This is a very good way to start programming speech recognition. (However, HTK can have bugs. So when you put this amazing ability of yours on your resume, remember to phrase it as "my recognizer has an HTK-compatible mode". :-) )
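For readers who have not met token passing before, here is a minimal sketch of the idea on a single small HMM. It is not HTK's code: HTK passes tokens through a whole recognition network and records word boundaries, and null nodes are not modelled here. The names token_passing and safe_log, and the layout of the input tables, are my own assumptions for illustration.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def token_passing(obs_loglik, log_trans, entry_state=0):
    """Viterbi decoding phrased as token passing.

    obs_loglik[t][s] : log-likelihood of frame t in state s
    log_trans[i][j]  : log transition probability i -> j (-inf if no arc)
    Returns (best_log_score, best_state_path).
    """
    n_states = len(log_trans)
    # Every state holds at most one token: (log score, state history).
    tokens = [(NEG_INF, [])] * n_states
    tokens[entry_state] = (obs_loglik[0][entry_state], [entry_state])

    for t in range(1, len(obs_loglik)):
        new_tokens = [(NEG_INF, [])] * n_states
        for i, (score, history) in enumerate(tokens):
            if score == NEG_INF:
                continue  # no live token in this state
            for j in range(n_states):
                if log_trans[i][j] == NEG_INF:
                    continue  # no arc from i to j
                # Copy the token along the arc and add the new costs.
                cand = score + log_trans[i][j] + obs_loglik[t][j]
                # Each state keeps only the best token that reaches it.
                if cand > new_tokens[j][0]:
                    new_tokens[j] = (cand, history + [j])
        tokens = new_tokens

    return max(tokens, key=lambda tok: tok[0])
```

Matching a recognizer against HTK, as advisors like to ask, amounts to checking that the best score and path from a loop like this agree with what HVite reports on the same models and data.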

Yet there are many people who don't really like HTK. Most of the time they are already PhDs or very experienced researchers in the field. The reason they don't like HTK is that it is pretty hard to change. You could say this is because it strictly follows many software design principles. For example, HTK's recognizer (HVite) and trainer (HERest) share the same model data structure. It is good to have code reused, but it also makes the code hard to change: every time you make a change, you need to consider the other routines that depend on it.

People also feel lost when they finally realize that HVite actually does full fan-in and fan-out for cross-word triphones. (Usually they find out when they run phone recognition on TIMIT. :-) ) This is obviously the most "correct" way to implement a recognizer. However, it may not be a "clever" way. Simplifying assumptions that replace full fan-in/fan-out have existed for a long time, and some of them have been found to be quite close to full fan-in/fan-out in terms of performance.
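To see why full fan-in and fan-out is expensive, here is a small sketch that enumerates the context-dependent variants of one word. It only illustrates the counting argument and is not how HVite builds its network. The function names are hypothetical; the "left-phone+right" naming follows the usual HTK triphone convention.

```python
def triphone(left, phone, right):
    """Name a triphone in the left-phone+right style."""
    return f"{left}-{phone}+{right}"


def expand_word(phones, left_contexts, right_contexts):
    """List the triphone variants needed at each position of a word.

    Interior phones see fixed neighbours, so they need one variant.
    The first phone fans in over every possible preceding phone and
    the last phone fans out over every possible following phone.
    """
    variants = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        lefts = left_contexts if i == 0 else [phones[i - 1]]
        rights = right_contexts if i == last else [phones[i + 1]]
        variants.append([triphone(l, phone, r)
                         for l in lefts for r in rights])
    return variants


def count_variants(phones, left_contexts, right_contexts):
    """Total number of triphone instances the word expands into."""
    return sum(len(v) for v in
               expand_word(phones, left_contexts, right_contexts))
```

With a three-phone word the cost grows with the sum of the two context sets, but a single-phone word needs every left and right combination, so its cost grows with their product. In phone recognition every "word" is a single phone, which is why the full expansion becomes so noticeable there.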

There are also places where the code is not consistent. The impression I got when I changed HHEd is that it seems to have been written by a different programmer. I was also pretty lost when I saw data structures lying everywhere. Maybe that was because I was not using emacs at the time. :-)

Many application developers obviously don't like HTK because of its license. Well, you can use HTK-trained models in your application, but you have to write your own recognizer. This has caused a lot of trouble for many groups, because writing a recognizer these days still takes about three months to half a year before it is well polished.

ISIP Foundation Class and Speech Recognition System (IFC)

I have never been able to really try this software, so I only have comments from other people. I feel highly dubious about some of these comments, and I will try to explain why.

Obviously, the whole aim of this toolkit is to create software that is even better than the existing ones. This hope is not well received by many. (If you look at their mailing list, you will see what I mean.) However, if you look at their algorithms and implementations, some of them are actually pretty thoughtful and insightful. So when I see people shrug when they hear ISIP, I feel a little uncomfortable.

For example, I have heard many people say that ISIP runs much slower than HTK. However, the ISIP people have spent research effort on building a faster speech recognizer, which includes many nice algorithms for fast GMM computation and search. These features do not appear in things like HTK. This makes me doubt whether people have really benchmarked the two packages before making that statement. Also, it seems to me quite hard to compare two recognizers when one only accepts CFGs and the other can accept higher-order LMs. They are apples and oranges!

Another feeling among people is that the ISIP Foundation Class is not really necessary from an SR research perspective. I buy this argument more, mainly because data structures can be obtained quite easily and many simple things (such as linked lists and hashes) can be created quite handily. However, I am not totally convinced, because I haven't seen any performance figures comparing their data structures against standard ones. When I play with it more, maybe I will get more ideas.