Sphinx 3.X with more than 65536 words

Disclaimer: Information provided here may not represent the standpoints of Carnegie Mellon University or the CMU Sphinx Group.

How?

Here are the steps.

Download and compile sphinx 3.x with 32-bit extension
Download and compile CMU-Cambridge Language Modeling Toolkit with 32-bit extension
Create and use language model with more than 65536 words

1, Download and compile sphinx 3.x with 32-bit extension

You can download the extended code here.

This is not a standard distribution (such as the one you can find on SourceForge), so compilation is slightly different. Try the following:

> ./autogen.sh ; ./autogen.sh

> make

The compilation takes about 5 minutes. When it finishes, you will find the programs you need in ./src/programs/. You will need decode, decode_anytopo and lm_convert.

Make sure you also test the code. If you grep for PASS in the test log, you should get 81 matching lines.
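
The PASS check can be scripted. The following is a minimal sketch; the log file name test.log is an assumption, so substitute whatever file you captured the test output in.

```shell
#!/bin/sh
# Count the lines containing PASS in a captured test log.
count_pass() {
    grep -c PASS "$1"
}

# "test.log" is a hypothetical file name; substitute your own log.
if [ -f test.log ]; then
    if [ "$(count_pass test.log)" -eq 81 ]; then
        echo "PASS count is 81, as expected"
    else
        echo "unexpected PASS count: $(count_pass test.log)"
    fi
fi
```

The same function can be pointed at the log of the LM toolkit test in step 2.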

2, Download and compile CMU-Cambridge Language Modeling Toolkit with 32-bit extension

You can download the extension here.

You **need** to compile the code in 32-bit mode. I am not very good with make and configure, so here is one hack that works.

a, In ./src/Makefile.in, change "CFLAGS := @CFLAGS@" to "CFLAGS := @CFLAGS@ -DTHIRTYTWOBITS".

Then do the standard dance: configure, make.

b, Make sure you test the code with make test32. You should get a count of 22 in this case.
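
If you prefer not to edit the file by hand, the change in step a can be sketched as a small script. This is only an illustration: the helper name patch_cflags is made up here, and it assumes the CFLAGS line appears exactly as quoted above.

```shell
#!/bin/sh
# Append -DTHIRTYTWOBITS to the CFLAGS line of a Makefile.in.
# Writes to a temporary file first, so it works with any POSIX sed.
# The pattern only matches the unmodified line, so a second run
# leaves the file unchanged.
patch_cflags() {
    makefile_in=$1
    sed 's/^CFLAGS := @CFLAGS@[[:space:]]*$/CFLAGS := @CFLAGS@ -DTHIRTYTWOBITS/' \
        "$makefile_in" > "$makefile_in.tmp" &&
        mv "$makefile_in.tmp" "$makefile_in"
}

# Run from the top of the toolkit source tree.
if [ -f ./src/Makefile.in ]; then
    patch_cflags ./src/Makefile.in
    # Show the patched line so you can confirm the change took effect.
    grep -n 'THIRTYTWOBITS' ./src/Makefile.in
fi
```

If the grep prints nothing, the line in your copy differs from the one quoted in step a, and you should make the edit by hand.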

3, Create and use the language model.

To create the LM, just follow the standard procedure, as you would with version 2.

To use the LM, you can feed it directly into decode and decode_anytopo. However, we strongly recommend that you first use lm_convert to convert the model to DMP format.

By default, if an LM in ARPA format contains more than 65536 words, lm_convert automatically chooses a binary layout with 32-bit data structures. If you want to force this layout, specify the format type DMP32 as the output format.
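
To see in advance which side of the 65536-word threshold a model falls on, you can read the unigram count from the \data\ header of the ARPA file. The sketch below is an illustration only: the function names and the file name my.arpa are made up, it assumes the header line is written as "ngram 1=N" with no spaces around the equals sign, and the header count may not match exactly what lm_convert counts as words.

```shell
#!/bin/sh
# Print the unigram count declared in the \data\ header of an ARPA LM.
unigram_count() {
    sed -n 's/^ngram 1=\([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed if the declared vocabulary has more than 65536 words.
needs_32bit() {
    [ "$(unigram_count "$1")" -gt 65536 ]
}

# "my.arpa" is a hypothetical file name; substitute your own model.
if [ -f my.arpa ]; then
    if needs_32bit my.arpa; then
        echo "more than 65536 words: 32-bit layout expected"
    else
        echo "65536 words or fewer: 32-bit layout not required"
    fi
fi
```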

Some caveats of the tools

Binary layouts in both Sphinx 3.x and the CMU-Cambridge Language Modeling Toolkit were the most difficult issue in development. Here are some hints on how to use the tools without getting hurt.

The LM toolkit is not backward compatible when compiled in 32-bit mode. Our observation, though, is that not many people actually use the binary LM format produced by the toolkit. So as long as you create the LM with a consistent set of tools from the toolkit, nothing should go wrong.
The decoder is designed to be compatible with both the DMP and DMP32 formats. It uses the version information in the DMP file to decide which binary layout to read. In other words, users should not need to pay any attention to this. If that is not the case, please inform me at archan at cs dot cmu dot edu.

As a final note, the code at this point (20060415) is still not very well tested, so please inform me of any positive or negative results you get. At some point, we will also incorporate the code into the canonical Sphinx. I hope you have fun using this toolkit.