Sphinx-II User Guide

CMU Sphinx Group

Original by Mosur Ravishankar (Ravi)
Maintained by Kevin A. Lenzo (lenzo@cs.cmu.edu)

School of Computer Science
Carnegie Mellon University
Copyright (c) 1997-2001 Carnegie Mellon University.


This document is not complete, but should be helpful during construction.
Last updated 2001-03-29.

Introduction

Sphinx2 is a decoding engine for the Sphinx-II speech recognition system developed at Carnegie Mellon University. It can be used to build small, medium, or large vocabulary applications. Its main features are:

Sphinx2 consists of a set of libraries that include core speech recognition functions as well as auxiliary ones such as low-level audio capture. The libraries are written in C and have been compiled on several Unix platforms (DEC Alpha, Sun Sparc, HPs) and Pentium/PentiumPro PCs running WindowsNT or Windows95. A number of demo applications based on this recognition engine are also provided.

Several features specifically intended for developing real applications have been included in Sphinx2. For example, many aspects of the decoder can be reconfigured at run time. New language models can be loaded or switched dynamically. Similarly, new words and pronunciations can be added. The audio input data can be automatically logged to files for any future analysis.

The rest of this document is structured as follows:


Sphinx2 Software

The Sphinx2 software is available on SourceForge.
Note: This section is under construction.

Models for Running Sphinx2 Applications

In order to run Sphinx2 applications, several model files are needed. Applications must configure these for the decoder through several arguments that are described later in this document. The decoder must be initialized with the following databases:

There are also other databases, such as pre-recorded speech data, which are generally only accessible within CMU-SCS.


The Recognition Engine

The core speech decoder operates on finite-length segments of speech, or utterances, one utterance at a time. An utterance can be up to some tens of seconds long, though most human-computer interactions would probably proceed in short phrases or sentences.

Basic Recognition

Each utterance is decoded using up to three passes, two of which are optional: The optional passes improve accuracy; however, the second pass (the flat Viterbi search) can increase latency significantly. The active passes are configured once at initialization. From then on, the presence of the multiple passes is invisible to the application, which only receives the result from the final pass. However, the word lattice can subsequently be searched by the application for additional, alternative--or N-best--hypotheses.

Details of the recognition engine can be found in Ravishankar's Ph.D. thesis (PostScript file).


Configuring the Active Language Model and Vocabulary

Several language models can be loaded into the recognizer, but exactly one is active during the recognition of any utterance. Language models are identified by a string name. Typically, the main language model for an application is the unnamed one, whose name is the empty string.

The active vocabulary is the intersection of the words in the active language model and the pronunciation dictionary. The recognizer can only output words from this intersection.

The recognition engine can be reconfigured in several ways, but generally only in between utterances:


Forced Alignment and Allphone Recognition Modes

The recognizer can be run to time-align given transcripts to input speech, producing time segmentations for the input transcripts, as well as identifying silence regions. Time-alignment is only available in batch mode. It is covered in more detail below.

Sphinx2 can also be used in allphone mode to produce a purely phonetic recognition instead of the normal word recognition. The allphone recognition API is available to user-written applications as well. However, the input can only be from pre-recorded files.

Note: The recognition engine is configured in one of normal, forced-alignment, or allphone modes during initialization. It cannot be dynamically switched between these modes later.


The Application Programming Interface

There are three main groups of functions, or application programming interfaces (APIs), available with Sphinx2: raw audio access, continuous listening/silence filtering, and the core decoder itself.

As we shall see below, none of the core decoder API functions directly accesses any audio device. Rather, the application is responsible for collecting audio data to be decoded. This gives applications the freedom to decode audio data originating at any source at all---standard audio devices, pre-recorded files, data from a remote location over a network, etc. Since most applications ultimately need to access common audio devices and to perform some form of silence filtering to detect speech/no-speech conditions, the two additional modules are provided with Sphinx2 as a convenience.

(NOTE: The APIs often use int32 and int16 for 32-bit and 16-bit integer types. These are #defined at compile time, usually as int and short, respectively.)


Low-Level Audio Access

No two platforms provide the same interface to audio devices. To accommodate this diversity, the platform-dependent code is encapsulated within a generic interface for low-level audio recording and playback. The following functions are for recording. Complete details can be found in include/ad.h.
  • ad_open:
  • Opens an audio device for recording. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)
  • ad_start_rec:
  • Starts recording on the audio device associated with the specified handle.
  • ad_read:
  • Reads up to a specified number of samples into a given buffer. It returns the number of samples actually read, which may be less than the number requested; in particular, it may return 0 samples if no data is available. Most systems typically have a limited amount of internal buffering (at most a few seconds). Hence, this function must be called frequently enough to avoid buffer overflow.
  • ad_stop_rec:
  • Stops recording. (However, the system may still have internally buffered data remaining to be read.)
  • ad_close:
  • Closes the audio device associated with the specified audio handle.
    See examples/adrec.c and examples/adpow.c for two examples demonstrating the use of the above functions.
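
    For orientation, the following sketch records about five seconds of audio using these functions. It is illustrative only; the exact signatures and return conventions are in include/ad.h, and error handling is omitted.

        #include <stdio.h>
        #include "ad.h"

        int main(void)
        {
            ad_rec_t *ad = ad_open();        /* 16KHz, 16-bit PCM device */
            int16 buf[4096];
            int32 k, total = 0;

            ad_start_rec(ad);
            while (total < 16000 * 5) {      /* ~5 sec at 16000 samples/sec */
                k = ad_read(ad, buf, 4096);  /* may legitimately return 0 */
                if (k < 0)
                    break;                   /* device error; stop reading */
                total += k;
            }
            ad_stop_rec(ad);
            ad_close(ad);
            printf("Recorded %d samples\n", total);
            return 0;
        }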

    A similar set of playback functions is provided (currently implemented only on WindowsNT/Windows95 PC platforms):
  • ad_open_play:
  • Opens an audio device for playback. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)
  • ad_start_play:
  • Starts playback on the device associated with the given handle.
  • ad_write:
  • Sends a buffer of samples for playback. The function may accept fewer samples than provided, depending on available internal buffers; it returns the number of samples actually accepted. The application must provide data sufficiently rapidly to avoid breaks in playback.
  • ad_stop_play:
  • Ends playback; playback continues until all internally buffered data has been consumed.
  • ad_close_play:
  • Closes the audio device associated with the specified handle.
    Finally, the audio library includes a function ad_mu2li for converting 8-bit mu-law samples into 16-bit linear PCM samples.

    See examples/adplay.c for an example that plays back audio samples from a given input file.

    The implementation of the audio API for various platforms is contained in the analog-to-digital (libad) library for the given architecture.


    Continuous Listening and Silence Filtering

    As mentioned earlier, Sphinx2 can only decode utterances that are limited to less than about 30 sec at a time. However, one often wants to leave the audio recording running continuously and automatically determine utterance boundaries based on pauses in the input speech. The continuous listening module in Sphinx2 provides the mechanisms for this purpose.

    The silence filtering module is interposed between the raw audio input source and the application. The application calls the function cont_ad_read instead of directly reading the raw A/D input source (e.g., via the ad_read function described above). cont_ad_read returns only those segments of input audio that it determines to be non-silence. Additional timestamp information is provided to inform the application about silence regions that have been dropped.

    The complete continuous listening API is defined in include/cont_ad.h and is summarized below:
  • cont_ad_init:
  • Associates a new continuous listening module instance with a specified raw A/D handle and a corresponding read function pointer. E.g., these may be the handle returned by ad_open and function ad_read described above.
  • cont_ad_calib:
  • Calibrates the background silence level by reading the raw audio for a few seconds. It should be done once immediately after cont_ad_init, and after any environmental change.
  • cont_ad_read:
  • Reads and returns the next available block of non-silence data in a given buffer. (Uses the read function and handle supplied to cont_ad_init to obtain the raw A/D data.) More details are provided below.
  • cont_ad_reset:
  • Flushes any data buffered inside the module. Useful for discarding accumulated, but unprocessed speech.
  • cont_ad_set_thresh:
  • Useful for adjusting the silence and speech thresholds.
  • cont_ad_detach:
  • Detaches the specified continuous listening module from the associated audio device.
  • cont_ad_attach:
  • Attaches the specified continuous listening module to the specified audio device. (Similar to cont_ad_init, but without the need to calibrate the audio device.)
  • cont_ad_close:
  • Closes the continuous listening module.
    Some more details on the cont_ad_read function: Operationally, every call to cont_ad_read causes the module to read the associated raw A/D source (as much data as possible and available), scan it for speech (non-silence) segments and enqueue them internally. It returns the first available segment of speech data, if any. In addition to returning non-silence data, the function also updates a couple of parameters that may be of interest to the application:

    So, for example, if on two successive calls to cont_ad_read the timestamps are 100000 and 116000, respectively, the application can determine that 1 sec (16000 samples) of silence has been gobbled up between the two calls.

    Silence regions aren't chopped off completely. About 50-100ms worth of silence is preserved at either end of a speech segment and passed on to the application.

    Finally, the continuous listener won't concatenate speech segments separated by silence. That is, the data returned by a single call to cont_ad_read will not span raw audio separated by silence that has been gobbled up.

    cont_ad_read must be called frequently enough to avoid loss of input data owing to buffer overflow. The application is responsible for turning actual recording on and off, if applicable. In particular, it must ensure that recording is on during calibration and normal operation.
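
    The following sketch ties these pieces together: it calibrates, then polls for the first non-silence segment from live audio. It is illustrative only; the exact signatures are in include/cont_ad.h, and the read_ts field is an assumption based on the timestamp bookkeeping described above.

        #include <stdio.h>
        #include "ad.h"
        #include "cont_ad.h"

        int main(void)
        {
            ad_rec_t *ad = ad_open();
            cont_ad_t *cont = cont_ad_init(ad, ad_read);
            int16 buf[4096];
            int32 k;

            ad_start_rec(ad);          /* recording must be on for calibration */
            cont_ad_calib(cont);

            for (;;) {                 /* poll until some speech arrives */
                k = cont_ad_read(cont, buf, 4096);
                if (k > 0) {
                    /* read_ts: the timestamp bookkeeping mentioned above
                     * (assumed to be exposed as a cont_ad_t field) */
                    printf("speech: %d samples (ts %d)\n", k, cont->read_ts);
                    break;
                }
                /* k == 0: only silence so far; keep calling frequently */
            }

            ad_stop_rec(ad);
            cont_ad_close(cont);
            ad_close(ad);
            return 0;
        }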

    See examples/cont_adseg.c for an example that uses the continuous listening module to segment live audio input into separate utterances. Similarly, examples/cont_fileseg.c segments a given pre-recorded file containing audio data into utterances.

    The implementation of continuous listening is in src/libfe/cont_ad.c. Applications that use this module are required to link with libfe and libcommon (and libad if necessary).


    Speech-to-Text Decoding

    There are several aspects to speech decoding: initialization, basic speech decoding, dynamic management of domains (LMs), logging and book-keeping, etc. This section briefly describes the related Sphinx2 API functions. The complete specification can be found in include/fbs.h.

    The two functions pertaining to initialization and final cleanup are:
  • fbs_init:
  • Initializes the decoder. The input arguments (in the form of the common command line argument list argc,argv) specify the input databases (acoustic, lexical, and language models) and various other decoder configuration options. (See Arguments Reference.) If batch-mode processing is indicated (see -ctlfn option below) it happens as part of this initialization.
  • fbs_end:
  • Cleans up the internals of the decoder before the application exits.
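
    A minimal application skeleton therefore brackets all decoding work between these two calls (a sketch only; the recognized arguments are listed in the Arguments Reference):

        #include "fbs.h"

        int main(int argc, char *argv[])
        {
            fbs_init(argc, argv);   /* loads models; runs batch mode if -ctlfn given */
            /* ... decode utterances here (see the functions below) ... */
            fbs_end();
            return 0;
        }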


    Sphinx2 applications can use the following functions to decode speech into text, one utterance at a time:
  • uttproc_begin_utt:
  • Begins decoding the next utterance. The application can assign an id string to it. If not, one is automatically created and assigned.
  • uttproc_rawdata:
  • Processes (decodes) the next chunk of raw A/D data in the current utterance. This can be non-blocking, in which case much of the data may be simply queued internally for later processing. Note that only 16-bit linear PCM-encoded samples can be processed. The A/D library provides a separate function ad_mu2li for converting 8-bit mu-law encoded data into 16-bit PCM format.
  • uttproc_cepdata:
  • This is an alternative to uttproc_rawdata if the application wishes to decode cepstrum data instead of raw A/D data.
  • uttproc_end_utt:
  • Indicates that no more input data is forthcoming in the current utterance.
  • uttproc_result:
  • Finishes processing internally queued up data and returns the final recognition result string. It can also be non-blocking, in which case it may return after processing only some of the internally queued up data.
  • uttproc_result_seg:
  • Like uttproc_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string. One can use either this function or uttproc_result to finish decoding, but not both.
  • uttproc_partial_result:
  • Before the final result is available, this function can be used to obtain the most up-to-date partial result (for example, as feedback to the user).
  • uttproc_partial_result_seg:
  • Like uttproc_partial_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string.
  • uttproc_abort_utt:
  • This is an alternative to uttproc_end_utt that terminates the current utterance. No further recognition results can be obtained for it.
  • search_get_alt:
  • Returns N-best hypotheses for the utterance (see further details in include/fbs.h).
    The non-blocking option in some of the above functions is useful if decoding is slower than real-time, and there is a chance of losing input A/D data if processing them takes too long. In the non-blocking mode, the data may simply be queued up internally and processed only after all the input data for the current utterance has been acquired. Similarly, the non-blocking option in uttproc_result allows the application to respond to user-interface events in real-time.

    The application code fragment for decoding one utterance typically looks as follows:

        uttproc_begin_utt (...);
        while (not end of utterance) {       /* indicated externally, somehow */
            read any available A/D data;     /* possibly 0 length */
            uttproc_rawdata (A/D data read above, non-blocking);
        }
        uttproc_end_utt ();
        uttproc_result (..., blocking);
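
    Fleshed out for one common case, the sketch below decodes a single pre-recorded utterance (headerless 16-bit linear PCM) from a file. The file layout and buffer size are illustrative; the exact signatures are in include/fbs.h, and fbs_init is assumed to have been called already.

        #include <stdio.h>
        #include "fbs.h"

        static void decode_file(char const *path)
        {
            FILE *fp = fopen(path, "rb");
            int16 buf[4096];
            int32 k, frm;
            char *hyp;

            uttproc_begin_utt(NULL);          /* NULL: auto-assign an utterance id */
            while ((k = (int32) fread(buf, sizeof(int16), 4096, fp)) > 0)
                uttproc_rawdata(buf, k, 1);   /* 1 = blocking, 0 = non-blocking */
            fclose(fp);

            uttproc_end_utt();
            uttproc_result(&frm, &hyp, 1);    /* block for the final result */
            printf("%s: %s\n", path, hyp);
        }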
    
    See demo applications in examples for several variations:


    Multiple named LMs can be resident in the decoder module, either read in during initialization or loaded dynamically at run time. However, exactly one LM must be selected and active for decoding any given utterance. As mentioned earlier, the active vocabulary for each utterance is given by the intersection of the pronunciation dictionary and the currently active LM. The following auxiliary functions allow the application to control language-modelling aspects of the decoder:
  • lm_read:
  • Reads in a new language model from a given file and associates it with a given name. The application only needs this function to load LMs dynamically at run time rather than at initialization.
  • lm_delete:
  • Deletes the named LM from the decoder repertory.
  • uttproc_set_lm:
  • Sets the currently active LM to the named one. Must only be invoked in-between utterances.
  • uttproc_set_context:
  • Sets a two-word history for the next utterance to be decoded, giving its first words additional context that can be exploited by the LM.
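
    For example, the following sketch loads a second LM at run time and makes it active for subsequent utterances. The file and LM names are hypothetical, and the trailing lm_read arguments are assumed to be the language weight, unigram weight, and word insertion penalty (cf. -langwt, -ugwt, -inspen); see include/fbs.h for the exact signature.

        #include "fbs.h"

        void switch_to_digits_domain(void)
        {
            /* "digits.arpa" and "digits" are hypothetical names */
            lm_read("digits.arpa", "digits", 6.5, 0.5, 0.65);
            uttproc_set_lm("digits");   /* only between utterances */
        }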


    The raw input data for each utterance and/or the cepstrum data derived from it can be logged to specified directories:
  • uttproc_set_rawlogdir:
  • Specifies the directory to which utterance A/D data should be logged. An utterance is logged to file <id>.raw, where <id> is the string assigned to it by uttproc_begin_utt.
  • uttproc_set_mfclogdir:
  • Specifies the directory to which utterance cepstrum data should be logged. Like A/D files above, an utterance is logged to file <id>.mfc.
  • uttproc_get_uttid:
  • Retrieves the string id for the current or most recent utterance. Useful for locating the logged A/D data and cepstrum files, for example.


    Allphone Decoding

    The API for allphone decoding includes a single function that supports recognition from pre-recorded files:
  • uttproc_allphone_cepfile:
  • Performs allphone recognition on the given file and returns the resulting phone segmentation.


    Application Examples

    Two simple speech decoding applications, implemented with a tty-based interface as well as with a Windows interface, are included in directory examples:


    Compiling the Libraries and Demos

    To compile Sphinx2 libraries on Unix platforms:


    Allphone Mode

    Sphinx2 runs in allphone mode if the -allphone flag is TRUE during the initialization. In this mode, no language model should be provided; i.e., the -lmfn and -lmctlfn arguments should be omitted.


    Forced Time-Alignment Mode

    Sphinx2 (in batch mode) can be used for aligning transcripts to speech, in order to obtain time-segmentations at the word, phone, or state levels. In this mode, no language model should be provided; i.e., the -lmfn and -lmctlfn arguments should be omitted. The set of utterances (speech data) is given by the -ctlfn argument, as usual. In addition, the corresponding transcripts should be given in a parallel file, specified with the -tactlfn argument. Each line in this file should contain the transcript for one utterance (and nothing else; in particular, no utterance-id). The first line of this file should contain just the string *align_all*.
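
    For example, a transcript file for a two-utterance control file might look as follows (the sentences are, of course, hypothetical):

        *align_all*
        SHOW ME THE LATEST REPORT
        THANK YOU VERY MUCH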

    Alignments at the word, phone and state levels can be obtained by setting the flags -taword, -taphone, and -tastate individually to TRUE or FALSE. Alignments are written to stdout (the log file).


    Arguments Reference

    The core Sphinx2 decoding engine accepts a long list of arguments during initialization. These are the arguments to the library function fbs_init(int argc, char *argv[]) defined in include/fbs.h. (Applications built around the Sphinx2 libraries, of course, can have additional arguments.) Many arguments, such as the input model databases, must be specified by the user. We cover the more important ones below; the remaining ones have reasonable default values:

    Input Model Databases

    -lmfn
        Optional DARPA-format bigram/trigram backoff LM file with the empty string as its name. Default: none.
    -lmctlfn
        Optional LM control file with a list of LM files and associated names (one line per entry). This is how multiple LMs can be loaded during initialization. Default: none.
    -kbdumpdir
        Optional directory containing precompiled binary versions of LM files (see Building LM Dump Files). Default: none.
    -dictfn
        Main pronunciation dictionary file. Default: none.
    -oovdictfn
        Optional out-of-vocabulary (OOV) pronunciation dictionary. These words are added to the unnamed LM (read from the -lmfn file) with unigram probability given by -oovugprob. Default: none.
    -ndictfn
        Optional "noise" word pronunciation dictionary. Noise words are not part of any LM and, like silence, can be inserted transparently anywhere in the utterance. Default: none.
    -phnfn, -mapfn
        Phone and map files with senone mapping information for the given dictionary and acoustic model. Default: none.
    -hmmdir, -hmmdirlist, -cbdir
        Directories with Sphinx-II semi-continuous HMM acoustic models and codebooks. Default: none.
    -sendumpfn, -8bsen
        Optional 8-bit senone model file created from the 32-bit HMM models (see Building 8-Bit Senone Dump Files); -8bsen should be TRUE if the 8-bit senones are used. Default: none.


    Decoder Configuration

    -ctlfn, -ctloffset, -ctlcount
        Batch-mode control file listing utterance files (without their file extension) to decode. -ctloffset is the number of initial utterances in the file to be skipped, and -ctlcount the number to be processed (after the skip, if any). -ctlfn must not be specified for live-mode or application-driven operation. Defaults: none, 0, all.
    -datadir
        If the control file (-ctlfn argument) entries are relative pathnames, an optional directory prefix for them may be specified using this argument. Default: none.
    -allphone
        Should be TRUE to configure the recognition engine for allphone mode operation. Default: FALSE.
    -tactlfn
        Input transcript file, parallel to the control file (-ctlfn), in forced alignment mode. Default: none.
    -adcin, -adcext, -adchdr, -adcendian
        In batch mode, -adcin selects A/D (TRUE) or cepstrum (FALSE) input data. If TRUE, -adcext is the file extension appended to names listed in the -ctlfn file, -adchdr the number of bytes of header in each input file, and -adcendian their byte ordering: 0 for big-endian, 1 for little-endian. With these flags, most A/D data file formats can be processed directly. Defaults: FALSE, raw, 0, 1.
    -normmean, -nmprior
        Cepstral mean normalization (CMN) options. If -nmprior is FALSE, CMN is computed on the current utterance only (usually batch mode); otherwise it is based on past history (live mode). Defaults: TRUE, FALSE.
    -compress, -compressprior
        Silence deletion (within the decoder; not related to continuous listening). If -compressprior is FALSE, deletion is based on current utterance statistics (batch mode); otherwise on past history (live mode). -compress should be FALSE if continuous listening is used. Defaults: FALSE, FALSE.
    -agcmax, -agcemax
        Automatic gain control (AGC) options. In batch mode only -agcmax should be TRUE; in live mode only -agcemax. Defaults: FALSE, FALSE.
    -live
        Forces some live-mode flags: -nmprior, -compressprior, and (if any AGC is on) -agcemax to TRUE. Default: FALSE.
    -samp
        Sampling rate; must be 8000 or 16000. Default: 16000.
    -fwdflat
        Run the flat-lexical Viterbi search after the tree-structured pass (for better accuracy). Usually FALSE in live mode. Default: TRUE.
    -bestpath
        Run the global best-path search over the Viterbi search word lattice output (for better accuracy). Default: TRUE.
    -compallsen
        Compute all senones, whether active or inactive, in each frame. Default: FALSE.
    -latsize
        Number of word lattice entries to be allocated. Longer sentences need larger lattices. Default: 50000.


    Beam Widths

    -top
        Number of codewords computed per frame. Usually narrowed to 1 in live mode. Default: 4.
    -beam, -npbeam
        Main pruning thresholds for the tree search. Usually narrowed to 2e-6 in live mode. Defaults: 1e-6, 1e-6.
    -lpbeam
        Additional pruning threshold for transitions to leaf nodes of the lexical tree. Usually narrowed to 2e-5 in live mode. Default: 1e-5.
    -lponlybeam, -nwbeam
        Further pruning thresholds for leaf nodes and exits from the lexical tree. Usually narrowed to 5e-4 in live mode. Defaults: 3e-4, 3e-4.
    -fwdflatbeam, -fwdflatnwbeam
        Main and word-exit pruning thresholds for the optional flat-lexical Viterbi search. Defaults: 1e-8, 3e-4.
    -topsenfrm, -topsenthresh
        Number of lookahead frames for predicting active base phones (if <= 1, all base phones are assumed active in every frame). -topsenthresh is the log(pruning threshold) applied to raw senone scores to determine the active phones in each frame. Defaults: 1, -60000.


    Language Weights/Penalties

    -langwt, -fwdflatlw, -rescorelw
        Language weights applied during the lexical tree Viterbi search, the flat-structured Viterbi search, and the global word lattice search, respectively. Defaults: 6.5, 8.5, 9.5.
    -ugwt
        Unigram weight for interpolating unigram probabilities with a uniform distribution. Typically in the range 0.5-0.8. Default: 1.0.
    -inspen, -silpen, -fillpen
        Word insertion penalty or probability (for words in the LM), insertion penalty for the silence word, and insertion penalty for noise words (from the -ndictfn file), if any. Defaults: 0.65, 0.005, 1e-8.
    -oovugprob
        Unigram probability (logprob) for OOV words from the -oovdictfn file, if any. Default: -4.5.


    Output Specifications

    -matchfn
        File to which the final recognition string for each utterance is written. (Old format, word-id at the end.) Default: none.
    -matchsegfn
        Like -matchfn, but contains word segmentation information: startframe #frames word... (New format, word-id at the beginning.) Default: none.
    -reportpron
        Causes word pronunciations to be included in output files. Default: FALSE.
    -rawlogdir
        If specified, logs raw A/D input samples for each utterance to the indicated directory (one file per utterance, named <uttid>.raw). Default: none.
    -mfclogdir
        If specified, logs cepstrum data for each utterance to the indicated directory (one file per utterance, named <uttid>.mfc). Default: none.
    -dumplatdir
        If specified, dumps the word lattice for each utterance to a file in this directory. Default: none.
    -logfn
        File to which decoder logging information is written. Default: stdout/stderr.
    -backtrace
        Includes detailed word backtrace information in the log file. Default: TRUE.
    -nbest
        Number of N-best hypotheses to be produced. Currently this flag is only useful in batch mode, but an application can always invoke search_get_alt directly to obtain them. The current implementation is also lacking in some details (e.g., in returning detailed scores). Default: 0.
    -nbestdir
        Directory to which N-best files are written (one per utterance). Default: current directory.
    -taword, -taphone, -tastate
        Whether word, phone, and state alignment output should be produced when running in forced-alignment mode. Defaults: TRUE, TRUE, FALSE.


    Finally, one of the arguments can be: -argfile filename. This causes additional arguments to be read in from the given file. Lines beginning with the '#' character in this file are ignored. Recursive -argfile specifications are not allowed.
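
    For example, a hypothetical argument file might look as follows (all file names are illustrative):

        # input model databases
        -dictfn  mydict.dic
        -phnfn   myphones.phone
        -mapfn   mymap.map
        -hmmdir  model/hmm
        -lmfn    mylm.arpa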


    Alphabetical List of Arguments

  • 8bsen:
  • Use 8-bit senone dump file.
  • adcendian:
  • A/D input file byte-ordering.
  • adcext:
  • A/D input file extension.
  • adchdr:
  • No. bytes of header in A/D input file.
  • adcin:
  • Input file contains A/D samples or cepstra (TRUE/FALSE).
  • agcemax:
  • Compute AGC (max C0 normalized to 0; estimated, live mode).
  • agcmax:
  • Compute AGC (max C0 normalized to 0 based on current utterance).
  • argfile:
  • Arguments file.
  • backtrace:
  • Provide detailed backtrace in log file.
  • beam:
  • Main pruning beamwidth.
  • bestpath:
  • Run global best path algorithm on word lattice.
  • cbdir:
  • Codebooks directory.
  • compallsen:
  • Compute all senones.
  • compress:
  • Remove silence frames (based on C0 statistics).
  • compressprior:
  • Remove silence frames (based on C0 statistics from prior history).
  • ctlcount:
  • No. of utterances to decode in batch mode.
  • ctlfn:
  • Control file listing utterances to decode in batch mode.
  • ctloffset:
  • No. of initial utterances to be skipped from control file.
  • datadir:
  • Directory prefix for control file entries.
  • dictfn:
  • Main pronunciation dictionary.
  • dumplatdir:
  • Directory for dumping word lattices.
  • fillpen:
  • Noise word penalty (probability).
  • fwdflat:
  • Run flat-lexical Viterbi search.
  • fwdflatbeam:
  • Main beam width for flat search.
  • fwdflatlw:
  • Language weight for flat search.
  • fwdflatnwbeam:
  • Word-exit beam width for flat search.
  • hmmdir:
  • Directory containing acoustic models.
  • hmmdirlist:
  • Directory containing acoustic models.
  • inspen:
  • Word insertion penalty (probability).
  • kbdumpdir:
  • Directory containing LM dump files.
  • langwt:
  • Language weight for lexical tree search.
  • latsize:
  • Size of word lattice to be allocated.
  • live:
  • Live mode.
  • lmctlfn:
  • Control file listing named language model files to be loaded at initialization.
  • lmfn:
  • Unnamed language model file to load at initialization.
  • logfn:
  • Output log file.
  • lpbeam:
  • Transition to last phone beam width.
  • lponlybeam:
  • Last phone internal beam width.
  • mapfn:
  • Senone mapping file.
  • matchfn:
  • Output match file.
  • matchsegfn:
  • Output match file with word segmentation.
  • mfclogdir:
  • Directory for logging cepstrum data for each utterance.
  • nbest:
  • No. of N-best hypotheses to be produced/utterance.
  • nbestdir:
  • Directory for writing N-best hypotheses files.
  • ndictfn:
  • Noise words dictionary.
  • nmprior:
  • Cepstral mean normalization based on statistics from prior utterances.
  • normmean:
  • Cepstral mean normalization.
  • npbeam:
  • Next phone beam width for tree search.
  • nwbeam:
  • Word-exit beam width for tree search.
  • oovdictfn:
  • Out-of-vocabulary words pronunciation dictionary.
  • oovugprob:
  • Unigram probability for OOV words.
  • phnfn:
  • Phone file (senone mapping information).
  • rawlogdir:
  • Directory for logging A/D data for each utterance.
  • reportpron:
  • Show actual word pronunciation in output match files.
  • rescorelw:
  • Language weight for best path search.
  • samp:
  • Input audio sampling rate (16000/8000).
  • sendumpfn:
  • (8-bit) Senone dump file.
  • silpen:
  • Silence word penalty (probability).
  • tactlfn:
  • Forced alignment transcript file.
  • taphone:
  • Whether phone-level alignment information should be output.
  • tastate:
  • Whether state-level alignment information should be output.
  • taword:
  • Whether word-level alignment information should be output.
  • top:
  • No. of top codewords to evaluate in each frame.
  • topsenfrm:
  • No. of frames to look ahead to determine active base phones.
  • topsenthresh:
  • Pruning threshold applied to determine active base phones.
  • ugwt:
  • Unigram weight for interpolating unigram probability with uniform probability.


    Frequently Asked Questions

    Speeding up Decoding

    There are several ways to speed up decoding:


    Building LM Dump Files

    LM files are usually ASCII files. If they are large, it is time consuming to read them into the decoder. A binary "dump" file is much faster to read and more compact.

    LM dump files can be created either by a standalone program (examples/lm3g2dmp.c) or by the decoder itself. The standalone version can be compiled from the examples directory. The program takes two arguments: the LM source file and a directory in which the dump file is to be created. It reads the header from the original LM file to determine the size of the LM, then forms the binary dump file name by appending a .DMP extension to the LM file name. This file is written to the second (directory) argument. (NOTE: the dump file must not already exist.)
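
    For example (with illustrative file names), running:

        lm3g2dmp mylm.arpa /my/dumpdir

    would create the dump file /my/dumpdir/mylm.arpa.DMP.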

    Any version of the decoder can also automatically create binary "dump" files similar to the standalone version described above. It first looks for the dump file in the directory given by the -kbdumpdir argument. If the dump file is present it reads it and ignores the rest of the original LM file. Otherwise, it reads the LM file and creates a dump file in the -kbdumpdir directory so that it can be used in subsequent decoder runs.

    The decoder does not create dump files for small LMs that have fewer than an internally defined number of bigrams and trigrams.


    Building 8-Bit Senone Dump Files

    The Sphinx-II senonic acoustic model files contain 32-bit data. (These are in the directory specified by the -hmmdir argument.) However, they can be clustered down to 8-bits for memory efficiency, without loss of recognition accuracy. The clustering is carried out by an offline process as follows:
    1. Create a temporary 32-bit senone dump file by running the decoder with the -sendumpfn flag set to the temporary file name, the -8bsen flag set to FALSE, and omitting the -lmfn argument. The decoder can be killed after it creates the 32-bit senone dump file, which happens during the initialization and is announced in the log output.
    2. Run: /afs/cs/project/plus-2/s2/Sphinx2/bin/alpha/pdf32to8b 32bit-file 8bit-file
      to create the 8-bit senone dump file. That is, the first argument to pdf32to8b is the temporary 32-bit dump file created above, and the second argument is the 8-bit output file.
    3. Delete the temporary 32-bit file.
    The 8-bit senone dump file can now be used as the -sendumpfn argument to the decoder with the -8bsen argument set to TRUE.