I saw a couple of mails about using SphinxTrain. It isn't easy for us (the developers of Sphinx) to answer some of those mails because they are scattered across multiple forums. Jerry was very kind to answer some of these questions. I really appreciate that. For me, I just don't even know where to start.
So I decided to say something else: something about how to use SphinxTrain in general.
SphinxTrain is a run-through script. Treated as a whole, it is a system that people can run from start to finish to generate acoustic models. There are six to nine steps one needs to go through in training.
What I want to do here is list some common pitfalls (10 of them) in using SphinxTrain. They are not very technical. This is basically an expansion of Section 1, "Before you train", of Rita's Sphinx manual. So I won't touch issues such as the usage of individual commands. I just want to give a big picture of what training with SphinxTrain looks like in general.
Acoustic model training is a very involved process. As far as I can tell, this is true for both HTK and SphinxTrain. Though both have fairly detailed documentation, it is still pretty difficult to go through the whole process once. That said, I want to say I love this experience. :-)
There are several things you need to have ready before you go through the training process. I will list them here: enough training data, enough computation time, enough expertise, and enough patience.
In the next several pitfalls, I will talk about each of the above points in more detail.
Using the trainer is very different from using the decoder, because training is a very different process from decoding. Most people get frustrated in training, mainly because they could use the decoder but then found that the training system is harder to use.
About models: there are two main types of models used by the Sphinxen, semi-continuous HMMs (SCHMM) and continuous-density HMMs (CDHMM). In the past, Sphinx 2 supported only SCHMM, while Sphinx 4 and Sphinx 3.x decode supported only CDHMM. The Sphinx 3.0 family of tools (e.g. decode_anytopo, align, allphone) supported both SCHMM and CDHMM. Rita's SphinxTrain manual has a very detailed summary of how to train both SCHMM and CDHMM. This support tends to change, e.g. in 2004 Sphinx 2 started to support CDHMM as well.
This is a point I always want to stress. Did you check out Rita's manual on how to use SphinxTrain? Most of the content of the manual is still correct at this point. The only thing that has changed is probably that we now use Perl scripts to automate the job.
The SphinxTrain scripts, when they were designed, were mainly used by researchers. So you will find that most problems you face can be quickly solved by inspecting the script and seeing what's going on.
Why don't we write a very thorough package so that the complexity for the user is lower? As a matter of fact, we did, but surprisingly it caused a lot of problems internally at CMU. Training is a process where a lot of parts need to change and researchers need to fiddle. Writing a script with an extremely rigid structure brings a lot of that work to a stop. The current script is probably a good balance between the two extremes. Of course, we are constantly improving the quality of the script.
Do you have enough training data? For every Gaussian in a senone, you probably need 100 frames as samples.
In a small system such as TIDIGITS, each HMM was trained on at least 100 waveforms. If you want mixture models to be effective, you need even more.
Usually a medium system (such as RM) needs 10 hours of speech to train. Most successful large vocabulary systems used more than 50 hours of speech to train, and some used even more when a larger user coverage was to be achieved.
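To put the 100-frames-per-Gaussian rule of thumb into numbers, here is a minimal sketch in Python. It is only a back-of-the-envelope calculation, not part of SphinxTrain. The function name, the model sizes in the loop, and the frame rate of 100 frames per second (one frame every 10 ms) are my assumptions for illustration. It also assumes the frames are spread evenly over all senones, which real data never does, so treat the result as a lower bound.

```python
# Rough lower bound on the amount of speech needed, based on the
# rule of thumb of 100 frames per Gaussian.
FRAMES_PER_GAUSSIAN = 100   # rule of thumb from the text
FRAMES_PER_SECOND = 100     # assumption: one frame every 10 ms


def hours_of_speech_needed(num_senones, gaussians_per_senone,
                           frames_per_gaussian=FRAMES_PER_GAUSSIAN,
                           frames_per_second=FRAMES_PER_SECOND):
    """Return the minimum hours of speech implied by the rule of thumb."""
    total_gaussians = num_senones * gaussians_per_senone
    total_frames = total_gaussians * frames_per_gaussian
    return total_frames / frames_per_second / 3600.0


if __name__ == "__main__":
    # Hypothetical model sizes, for illustration only.
    for senones, gaussians in [(1000, 1), (1000, 8), (4000, 16)]:
        hours = hours_of_speech_needed(senones, gaussians)
        print(f"{senones} senones x {gaussians} Gaussians: "
              f"at least {hours:.1f} hours of speech")
```

Silence, noise and rarely seen senones all push the real requirement well above this number, which is consistent with the much larger figures quoted for working systems.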
Data need to be collected in a balanced way. That means you should not collect a lot of samples for one type of sentence but only a few for another. This simply hurts the system.
The Baum-Welch algorithm is fairly time-consuming. SphinxTrain's implementation is optimized so that the time required is significantly reduced. However, training a model with 80 hours of data still required approximately 200 machine hours (on a P4 1.1G).
In most users' cases, the training set is significantly smaller. However, the training time still overwhelms many people (literally). I have heard of a lot of people giving up training just because they don't like to wait for a long time. This is unfortunately not the expectation one should have with acoustic model training.
Ask yourself again. Do you have enough computation time? Usually, using one machine, a small vocabulary system such as TIDIGITS could take you a few hours. RM probably takes half a day. A lot of the things I am working on take 2 to 3 weeks if the script is not parallelized.
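If you want a first guess at how long you will wait, a small Python sketch like the one below can help. It simply scales the figure quoted above (about 200 machine hours for 80 hours of speech) linearly and divides by the number of machines. Both the linear scaling and the perfect split across machines are assumptions of mine, and the function names are made up for this example. Real training time also depends on the model size, the number of Baum-Welch iterations and your hardware, so read the output as an order of magnitude only.

```python
# Figure quoted in the text: 80 hours of speech took about 200 machine hours.
REFERENCE_SPEECH_HOURS = 80.0
REFERENCE_MACHINE_HOURS = 200.0


def estimate_machine_hours(speech_hours):
    """Scale the quoted figure linearly (an assumption, not a measured law)."""
    return speech_hours * REFERENCE_MACHINE_HOURS / REFERENCE_SPEECH_HOURS


def estimate_wall_clock_hours(speech_hours, num_machines=1):
    """Assume the work splits evenly across machines with no overhead."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    return estimate_machine_hours(speech_hours) / num_machines


if __name__ == "__main__":
    for machines in (1, 4, 10):
        hours = estimate_wall_clock_hours(80, machines)
        print(f"80 hours of speech on {machines} machine(s): "
              f"about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Even with this optimistic arithmetic, a single machine needs more than a week for 80 hours of speech, which is why parallelizing the script matters so much for large training sets.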
Do you have enough expertise? I would judge that acoustic model training requires at least a bright undergraduate student.
There are a lot of things you need to learn in training and in speech recognition in general: HMMs, signal processing, language modeling. None of them is trivial. Rabiner's book on speech recognition is a must for a beginner. XD Huang's "Spoken Language Processing" is necessary for you to become an expert.
What if you cannot afford these expensive books? I would recommend a more accessible source. For example:
"A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". This is a short version of Prof. Rabiner's book. You can get a very good idea of what's going on in speech recognition. It uses discrete HMMs but the general idea is the same.
This might sound very scary but hey! Don't worry :-), you can always learn when you work on SphinxTrain. It is actually a satisfying process. I feel pretty happy myself when I work out an acoustic model. (Feeling similar to cooking.) I also learn a lot in this process.
Though remember this: no pain, no gain. I have seen a lot of Sphinx users who just want to do something to complete their homework or term project. They don't even have the motivation to learn what's going on in speech recognition. I usually consider these cases hopeless.
Do you have enough patience? By patience I mean patience in learning, patience in waiting and patience in debugging.
Using the trainer is harder than using the decoder. This is a universal characteristic of all systems, including HTK and Sphinx (I have used them both, and I have also written some training algorithms myself). What I found is that if you have the patience to learn, you will solve most issues by yourself.
I say this because I have seen a lot of you posting to the Sphinx forums just because you were stopped by a small problem. And a lot of times people have promised their boss that they could train a system in 1 day. You had better be prepared, because no serious speech person thinks this way.
This is generally not a SphinxTrain-specific question. It is actually a very general problem that happens with every piece of software in the Sphinx family.
In general, it makes a lot of sense to ask questions in this forum. I also love to answer users' questions because I can learn a lot from them. Though I have found that a certain netiquette makes the experience more pleasant for all of us.
The documentation is actually there and it is pretty well organized on cmusphinx.org. Things which are harder to find usually mean we have no intention to open them or we haven't tested them thoroughly. We chose to do it this way because we genuinely hope that our code can get into users' hands as soon as possible.
Though I have to admit one thing here. Not everything in Sphinx can be described as perfect. Despite our continuous effort to improve it, there are still a lot of aspects I personally feel could be even better. Documentation is one of those. Multiple authors have been contributing and inconsistency is somewhat unavoidable. Our effort in merging the manuals is always going on. Of course, this is not something that can be done in one or two days.
Final word: I hope this mail is useful for all of you, and I will replicate it somewhere on my web page.