Sphinx-II User Guide

CMU Sphinx Group

Original by Mosur Ravishankar (Ravi)
Maintained by Kevin A. Lenzo (lenzo@cs.cmu.edu)

School of Computer Science
Carnegie Mellon University
Copyright (c) 1997-2001 Carnegie Mellon University.


This document is not complete, but should be helpful during construction.
Last updated 2001-03-29.

Introduction

Sphinx2 is a decoding engine for the Sphinx-II speech recognition system developed at Carnegie Mellon University. It can be used to build small, medium, or large vocabulary applications. Its main features are:

Sphinx2 consists of a set of libraries that include core speech recognition functions as well as auxiliary ones such as low-level audio capture. The libraries are written in C and have been compiled on several Unix platforms (DEC Alpha, Sun Sparc, HPs) and Pentium/PentiumPro PCs running WindowsNT or Windows95. A number of demo applications based on this recognition engine are also provided.

Several features specifically intended for developing real applications have been included in Sphinx2. For example, many aspects of the decoder can be reconfigured at run time. New language models can be loaded or switched dynamically. Similarly, new words and pronunciations can be added. The audio input data can be automatically logged to files for any future analysis.

The rest of this document is structured as follows:


Sphinx2 Software

The Sphinx2 software is available on SourceForge.
Note: This section is under construction.

Models for Running Sphinx2 Applications

In order to run Sphinx2 applications, several model files are needed. Applications must configure these for the decoder through several arguments that are described later in this document. The decoder must be initialized with the following databases:

There are also other databases, such as pre-recorded speech data, which are generally only accessible within CMU-SCS.


The Recognition Engine

The core speech decoder operates on finite-length segments of speech, or utterances, one utterance at a time. An utterance can be up to some tens of seconds long, though most human-computer interactions would probably proceed in short phrases or sentences.

Basic Recognition

Each utterance is decoded using up to three passes, two of which are optional: The optional passes improve accuracy; however, the second pass (the flat Viterbi search) can increase latency significantly. The active passes are configured once at initialization. From then on, the presence of the multiple passes is invisible to the application, which only receives the result from the final pass. However, the word lattice can subsequently be searched by the application for additional, alternative--or N-best--hypotheses.

Details of the recognition engine can be found in Ravishankar's Ph.D. thesis (PostScript file).


Configuring the Active Language Model and Vocabulary

Several language models can be loaded into the recognizer, but exactly one is active during the recognition of any utterance. Language models are identified by a string name. Typically, the main language model for an application is the unnamed one, whose name is the empty string.

The active vocabulary is the intersection of the words in the active language model and the pronunciation dictionary. The recognizer can only output words from this intersection.

The recognition engine can be reconfigured in several ways, but generally only in between utterances:


Forced Alignment and Allphone Recognition Modes

The recognizer can be run to time-align given transcripts to input speech, producing time segmentations for the input transcripts, as well as identifying silence regions. Time-alignment is only available in batch mode. It is covered in more detail below.

Sphinx2 can also be used in allphone mode to produce a purely phonetic recognition instead of the normal word recognition. The allphone recognition API is available to user-written applications as well. However, the input can only be from pre-recorded files.

Note: The recognition engine is configured in one of normal, forced-alignment, or allphone modes during initialization. It cannot be dynamically switched between these modes later.


The Application Programming Interface

There are three main groups of functions, or application programming interfaces (APIs), available with Sphinx2: raw audio access, continuous listening/silence filtering, and the core decoder itself.

As we shall see below, none of the core decoder API functions directly accesses any audio device. Rather, the application is responsible for collecting audio data to be decoded. This gives applications the freedom to decode audio data originating at any source at all---standard audio devices, pre-recorded files, data from a remote location over a network, etc. Since most applications ultimately need to access common audio devices and to perform some form of silence filtering to detect speech/no-speech conditions, the two additional modules are provided with Sphinx2 as a convenience.

(NOTE: The APIs often use int32 and int16 for 32-bit and 16-bit integer types. These are #defined at compile time, usually as int and short, respectively.)


Low-Level Audio Access

No two platforms provide the same interface to audio devices. To accommodate this diversity, the platform-dependent code is encapsulated within a generic interface for low-level audio recording and playback. The following functions are for recording. Complete details can be found in include/ad.h.
  • ad_open:
  • Opens an audio device for recording. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)
  • ad_start_rec:
  • Starts recording on the audio device associated with the specified handle.
  • ad_read:
  • Reads up to a specified number of samples into a given buffer. It returns the number of samples actually read, which may be less than the number requested; in particular, it may return 0 samples if no data is available. Most systems typically have a limited amount of internal buffering (at most a few seconds). Hence, this function must be called frequently enough to avoid buffer overflow.
  • ad_stop_rec:
  • Stops recording. (However, the system may still have internally buffered data remaining to be read.)
  • ad_close:
  • Closes the audio device associated with the specified audio handle.
    See examples/adrec.c and examples/adpow.c for two examples demonstrating the use of the above functions.
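
    For orientation, the following sketch records about five seconds of audio using these functions. It is illustrative only; the exact signatures and return conventions are in include/ad.h, and error handling is omitted.

        #include <stdio.h>
        #include "ad.h"

        int main(void)
        {
            ad_rec_t *ad = ad_open();        /* 16KHz, 16-bit PCM device */
            int16 buf[4096];
            int32 k, total = 0;

            ad_start_rec(ad);
            while (total < 16000 * 5) {      /* ~5 sec at 16000 samples/sec */
                k = ad_read(ad, buf, 4096);  /* may legitimately return 0 */
                if (k < 0)
                    break;                   /* device error; stop reading */
                total += k;
            }
            ad_stop_rec(ad);
            ad_close(ad);
            printf("Recorded %d samples\n", total);
            return 0;
        }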

    A similar set of playback functions is provided (currently implemented only on WindowsNT/Windows95 PC platforms):
  • ad_open_play:
  • Opens an audio device for playback. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)
  • ad_start_play:
  • Starts playback on the device associated with the given handle.
  • ad_write:
  • Sends a buffer of samples for playback. The function may accept fewer samples than provided, depending on available internal buffers; it returns the number of samples actually accepted. The application must provide data sufficiently rapidly to avoid breaks in playback.
  • ad_stop_play:
  • Ends playback; playback continues until all internally buffered data has been consumed.
  • ad_close_play:
  • Closes the audio device associated with the specified handle.
    Finally, the audio library includes a function ad_mu2li for converting 8-bit mu-law samples into 16-bit linear PCM samples.

    See examples/adplay.c for an example that plays back audio samples from a given input file.

    The implementation of the audio API for various platforms is contained in the analog-to-digital (libad) library for the given architecture.


    Continuous Listening and Silence Filtering

    As mentioned earlier, Sphinx2 can only decode utterances that are limited to less than about 30 sec at a time. However, one often wants to leave the audio recording running continuously and automatically determine utterance boundaries based on pauses in the input speech. The continuous listening module in Sphinx2 provides the mechanisms for this purpose.

    The silence filtering module is interposed between the raw audio input source and the application. The application calls the function cont_ad_read instead of directly reading the raw A/D input source (e.g., via the ad_read function described above). cont_ad_read returns only those segments of input audio that it determines to be non-silence. Additional timestamp information is provided to inform the application about silence regions that have been dropped.

    The complete continuous listening API is defined in include/cont_ad.h and is summarized below:
  • cont_ad_init:
  • Associates a new continuous listening module instance with a specified raw A/D handle and a corresponding read function pointer. E.g., these may be the handle returned by ad_open and function ad_read described above.
  • cont_ad_calib:
  • Calibrates the background silence level by reading the raw audio for a few seconds. It should be done once immediately after cont_ad_init, and after any environmental change.
  • cont_ad_read:
  • Reads and returns the next available block of non-silence data in a given buffer. (Uses the read function and handle supplied to cont_ad_init to obtain the raw A/D data.) More details are provided below.
  • cont_ad_reset:
  • Flushes any data buffered inside the module. Useful for discarding accumulated, but unprocessed speech.
  • cont_ad_set_thresh:
  • Useful for adjusting the silence and speech thresholds.
  • cont_ad_detach:
  • Detaches the specified continuous listening module from the associated audio device.
  • cont_ad_attach:
  • Attaches the specified continuous listening module to the specified audio device. (Similar to cont_ad_init, but without the need to calibrate the audio device.)
  • cont_ad_close:
  • Closes the continuous listening module.
    Some more details on the cont_ad_read function: Operationally, every call to cont_ad_read causes the module to read the associated raw A/D source (as much data as possible and available), scan it for speech (non-silence) segments and enqueue them internally. It returns the first available segment of speech data, if any. In addition to returning non-silence data, the function also updates a couple of parameters that may be of interest to the application:

    So, for example, if on two successive calls to cont_ad_read the timestamps are 100000 and 116000, respectively, the application can determine that 1 sec (16000 samples) of silence has been gobbled up between the two calls.

    Silence regions aren't chopped off completely. About 50-100ms worth of silence is preserved at either end of a speech segment and passed on to the application.

    Finally, the continuous listener won't concatenate speech segments separated by silence. That is, the data returned by a single call to cont_ad_read will not span raw audio separated by silence that has been gobbled up.

    cont_ad_read must be called frequently enough to avoid loss of input data owing to buffer overflow. The application is responsible for turning actual recording on and off, if applicable. In particular, it must ensure that recording is on during calibration and normal operation.
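
    The following sketch ties these pieces together: it calibrates, then polls for the first non-silence segment from live audio. It is illustrative only; the exact signatures are in include/cont_ad.h, and the read_ts field is an assumption based on the timestamp bookkeeping described above.

        #include <stdio.h>
        #include "ad.h"
        #include "cont_ad.h"

        int main(void)
        {
            ad_rec_t *ad = ad_open();
            cont_ad_t *cont = cont_ad_init(ad, ad_read);
            int16 buf[4096];
            int32 k;

            ad_start_rec(ad);          /* recording must be on for calibration */
            cont_ad_calib(cont);

            for (;;) {                 /* poll until some speech arrives */
                k = cont_ad_read(cont, buf, 4096);
                if (k > 0) {
                    /* read_ts: the timestamp bookkeeping mentioned above
                     * (assumed to be exposed as a cont_ad_t field) */
                    printf("speech: %d samples (ts %d)\n", k, cont->read_ts);
                    break;
                }
                /* k == 0: only silence so far; keep calling frequently */
            }

            ad_stop_rec(ad);
            cont_ad_close(cont);
            ad_close(ad);
            return 0;
        }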

    See examples/cont_adseg.c for an example that uses the continuous listening module to segment live audio input into separate utterances. Similarly, examples/cont_fileseg.c segments a given pre-recorded file containing audio data into utterances.

    The implementation of continuous listening is in src/libfe/cont_ad.c. Applications that use this module are required to link with libfe and libcommon (and libad if necessary).


    Speech-to-Text Decoding

    There are several aspects to speech decoding: initialization, basic speech decoding, dynamic management of domains (LMs), logging and book-keeping, etc. This section briefly describes the related Sphinx2 API functions. The complete specification can be found in include/fbs.h.

    The two functions pertaining to initialization and final cleanup are:
  • fbs_init:
  • Initializes the decoder. The input arguments (in the form of the common command line argument list argc,argv) specify the input databases (acoustic, lexical, and language models) and various other decoder configuration options. (See Arguments Reference.) If batch-mode processing is indicated (see -ctlfn option below) it happens as part of this initialization.
  • fbs_end:
  • Cleans up the internals of the decoder before the application exits.
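
    A minimal application skeleton therefore brackets all decoding work between these two calls (a sketch only; the recognized arguments are listed in the Arguments Reference):

        #include "fbs.h"

        int main(int argc, char *argv[])
        {
            fbs_init(argc, argv);   /* loads models; runs batch mode if -ctlfn given */
            /* ... decode utterances here (see the functions below) ... */
            fbs_end();
            return 0;
        }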


    Sphinx2 applications can use the following functions to decode speech into text, one utterance at a time:
  • uttproc_begin_utt:
  • Begins decoding the next utterance. The application can assign an id string to it. If not, one is automatically created and assigned.
  • uttproc_rawdata:
  • Processes (decodes) the next chunk of raw A/D data in the current utterance. This can be non-blocking, in which case much of the data may be simply queued internally for later processing. Note that only 16-bit linear PCM-encoded samples can be processed. The A/D library provides a separate function ad_mu2li for converting 8-bit mu-law encoded data into 16-bit PCM format.
  • uttproc_cepdata:
  • This is an alternative to uttproc_rawdata if the application wishes to decode cepstrum data instead of raw A/D data.
  • uttproc_end_utt:
  • Indicates that no more input data is forthcoming in the current utterance.
  • uttproc_result:
  • Finishes processing internally queued up data and returns the final recognition result string. It can also be non-blocking, in which case it may return after processing only some of the internally queued up data.
  • uttproc_result_seg:
  • Like uttproc_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string. One can use either this function or uttproc_result to finish decoding, but not both.
  • uttproc_partial_result:
  • Before the final result is available, this function can be used to obtain the most up-to-date partial result (for example, as feedback to the user).
  • uttproc_partial_result_seg:
  • Like uttproc_partial_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string.
  • uttproc_abort_utt:
  • This is an alternative to uttproc_end_utt that terminates the current utterance. No further recognition results can be obtained for it.
  • search_get_alt:
  • Returns N-best hypotheses for the utterance (see further details in include/fbs.h).
    The non-blocking option in some of the above functions is useful if decoding is slower than real-time, and there is a chance of losing input A/D data if processing them takes too long. In the non-blocking mode, the data may simply be queued up internally and processed only after all the input data for the current utterance has been acquired. Similarly, the non-blocking option in uttproc_result allows the application to respond to user-interface events in real-time.

    The application code fragment for decoding one utterance typically looks as follows:

        uttproc_begin_utt (...);
        while (not end of utterance) {       /* indicated externally, somehow */
            read any available A/D data;     /* possibly 0 length */
            uttproc_rawdata (A/D data read above, non-blocking);
        }
        uttproc_end_utt ();
        uttproc_result (..., blocking);
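
    Fleshed out for one common case, the sketch below decodes a single pre-recorded utterance (headerless 16-bit linear PCM) from a file. The file layout and buffer size are illustrative; the exact signatures are in include/fbs.h, and fbs_init is assumed to have been called already.

        #include <stdio.h>
        #include "fbs.h"

        static void decode_file(char const *path)
        {
            FILE *fp = fopen(path, "rb");
            int16 buf[4096];
            int32 k, frm;
            char *hyp;

            uttproc_begin_utt(NULL);          /* NULL: auto-assign an utterance id */
            while ((k = (int32) fread(buf, sizeof(int16), 4096, fp)) > 0)
                uttproc_rawdata(buf, k, 1);   /* 1 = blocking, 0 = non-blocking */
            fclose(fp);

            uttproc_end_utt();
            uttproc_result(&frm, &hyp, 1);    /* block for the final result */
            printf("%s: %s\n", path, hyp);
        }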
    
    See demo applications in examples for several variations:


    Multiple named LMs can be resident in the decoder module, either read in during initialization or loaded dynamically at run time. However, exactly one LM must be selected and active for decoding any given utterance. As mentioned earlier, the active vocabulary for each utterance is given by the intersection of the pronunciation dictionary and the currently active LM. The following auxiliary functions allow the application to control language-modelling aspects of the decoder:
  • lm_read:
  • Reads in a new language model from a given file and associates it with a given name. The application only needs this function to load LMs dynamically at run time rather than at initialization.
  • lm_delete:
  • Deletes the named LM from the decoder repertory.
  • uttproc_set_lm:
  • Sets the currently active LM to the named one. Must only be invoked in-between utterances.
  • uttproc_set_context:
  • Sets a two-word history for the next utterance to be decoded, giving its first words additional context that can be exploited by the LM.
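
    For example, the following sketch loads a second LM at run time and makes it active for subsequent utterances. The file and LM names are hypothetical, and the trailing lm_read arguments are assumed to be the language weight, unigram weight, and word insertion penalty (cf. -langwt, -ugwt, -inspen); see include/fbs.h for the exact signature.

        #include "fbs.h"

        void switch_to_digits_domain(void)
        {
            /* "digits.arpa" and "digits" are hypothetical names */
            lm_read("digits.arpa", "digits", 6.5, 0.5, 0.65);
            uttproc_set_lm("digits");   /* only between utterances */
        }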


    The raw input data for each utterance and/or the cepstrum data derived from it can be logged to specified directories:
  • uttproc_set_rawlogdir:
  • Specifies the directory to which utterance A/D data should be logged. An utterance is logged to file <id>.raw, where <id> is the string assigned to it by uttproc_begin_utt.
  • uttproc_set_mfclogdir:
  • Specifies the directory to which utterance cepstrum data should be logged. Like A/D files above, an utterance is logged to file <id>.mfc.
  • uttproc_get_uttid:
  • Retrieves the string id for the current or most recent utterance. Useful for locating the logged A/D data and cepstrum files, for example.


    Allphone Decoding

    The API for allphone decoding includes a single function that supports recognition from pre-recorded files:
  • uttproc_allphone_cepfile:
  • Performs allphone recognition on the given file and returns the resulting phone segmentation.


    Application Examples

    Two simple speech decoding applications, implemented with a tty-based interface as well as with a Windows interface, are included in directory examples:


    Compiling the Libraries and Demos

    To compile Sphinx2 libraries on Unix platforms:


    Allphone Mode

    Sphinx2 runs in allphone mode if the -allphone flag is TRUE during the initialization. In this mode, no language model should be provided; i.e., the -lmfn and -lmctlfn arguments should be omitted.


    Forced Time-Alignment Mode

    Sphinx2 (in batch mode) can be used for aligning transcripts to speech, in order to obtain time-segmentations at the word, phone, or state levels. In this mode, no language model should be provided; i.e., the -lmfn and -lmctlfn arguments should be omitted. The set of utterances (speech data) is given by the -ctlfn argument, as usual. In addition, the corresponding transcripts should be given in a parallel file, specified with the -tactlfn argument. Each line in this file should contain the transcript for one utterance (and nothing else; in particular, no utterance-id). The first line of this file should contain just the string *align_all*.
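
    For example, a transcript file for a two-utterance control file might look as follows (the sentences are, of course, hypothetical):

        *align_all*
        SHOW ME THE LATEST REPORT
        THANK YOU VERY MUCH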

    Alignments at the word, phone and state levels can be obtained by setting the flags -taword, -taphone, and -tastate individually to TRUE or FALSE. Alignments are written to stdout (the log file).


    Arguments Reference

    The core Sphinx2 decoding engine accepts a long list of arguments during initialization. These are the arguments to the library function fbs_init(int argc, char *argv[]) defined in include/fbs.h. (Applications built around the Sphinx2 libraries, of course, can have additional arguments.) Many arguments, such as the input model databases, must be specified by the user. We cover the more important ones below; the remaining ones have reasonable default values:

    Input Model Databases

    -lmfn
        Optional DARPA-format bigram/trigram backoff LM file with the empty string as its name. Default: none.
    -lmctlfn
        Optional LM control file with a list of LM files and associated names (one line per entry). This is how multiple LMs can be loaded during initialization. Default: none.
    -kbdumpdir
        Optional directory containing precompiled binary versions of LM files (see Building LM Dump Files). Default: none.
    -dictfn
        Main pronunciation dictionary file. Default: none.
    -oovdictfn
        Optional out-of-vocabulary (OOV) pronunciation dictionary. These words are added to the unnamed LM (read from the -lmfn file) with unigram probability given by -oovugprob. Default: none.
    -ndictfn
        Optional "noise" word pronunciation dictionary. Noise words are not part of any LM and, like silence, can be inserted transparently anywhere in the utterance. Default: none.
    -phnfn, -mapfn
        Phone and map files with senone mapping information for the given dictionary and acoustic model. Default: none.
    -hmmdir, -hmmdirlist, -cbdir
        Directories with Sphinx-II semi-continuous HMM acoustic models and codebooks. Default: none.
    -sendumpfn, -8bsen
        Optional 8-bit senone model file created from the 32-bit HMM models (see Building 8-Bit Senone Dump Files); -8bsen should be TRUE if the 8-bit senones are used. Default: none.


    Decoder Configuration

    -ctlfn, -ctloffset, -ctlcount
        Batch-mode control file listing utterance files (without their file extension) to decode. -ctloffset is the number of initial utterances in the file to be skipped, and -ctlcount the number to be processed (after the skip, if any). -ctlfn must not be specified for live-mode or application-driven operation. Defaults: none, 0, all.
    -datadir
        If the control file (-ctlfn argument) entries are relative pathnames, an optional directory prefix for them may be specified using this argument. Default: none.
    -allphone
        Should be TRUE to configure the recognition engine for allphone mode operation. Default: FALSE.
    -tactlfn
        Input transcript file, parallel to the control file (-ctlfn), in forced alignment mode. Default: none.
    -adcin, -adcext, -adchdr, -adcendian
        In batch mode, -adcin selects A/D (TRUE) or cepstrum (FALSE) input data. If TRUE, -adcext is the file extension appended to names listed in the -ctlfn file, -adchdr the number of bytes of header in each input file, and -adcendian their byte ordering: 0 for big-endian, 1 for little-endian. With these flags, most A/D data file formats can be processed directly. Defaults: FALSE, raw, 0, 1.
    -normmean, -nmprior
        Cepstral mean normalization (CMN) options. If -nmprior is FALSE, CMN is computed on the current utterance only (usually batch mode); otherwise it is based on past history (live mode). Defaults: TRUE, FALSE.
    -compress, -compressprior
        Silence deletion (within the decoder; not related to continuous listening). If -compressprior is FALSE, deletion is based on current utterance statistics (batch mode); otherwise on past history (live mode). -compress should be FALSE if continuous listening is used. Defaults: FALSE, FALSE.
    -agcmax, -agcemax
        Automatic gain control (AGC) options. In batch mode only -agcmax should be TRUE; in live mode only -agcemax. Defaults: FALSE, FALSE.
    -live
        Forces some live-mode flags: -nmprior, -compressprior, and (if any AGC is on) -agcemax to TRUE. Default: FALSE.
    -samp
        Sampling rate; must be 8000 or 16000. Default: 16000.
    -fwdflat
        Run the flat-lexical Viterbi search after the tree-structured pass (for better accuracy). Usually FALSE in live mode. Default: TRUE.
    -bestpath
        Run the global best-path search over the Viterbi search word lattice output (for better accuracy). Default: TRUE.
    -compallsen
        Compute all senones, whether active or inactive, in each frame. Default: FALSE.
    -latsize
        Number of word lattice entries to be allocated. Longer sentences need larger lattices. Default: 50000.


    Beam Widths

    -top
        Number of codewords computed per frame. Usually narrowed to 1 in live mode. Default: 4.
    -beam, -npbeam
        Main pruning thresholds for the tree search. Usually narrowed to 2e-6 in live mode. Defaults: 1e-6, 1e-6.
    -lpbeam
        Additional pruning threshold for transitions to leaf nodes of the lexical tree. Usually narrowed to 2e-5 in live mode. Default: 1e-5.
    -lponlybeam, -nwbeam
        Further pruning thresholds for leaf nodes and exits from the lexical tree. Usually narrowed to 5e-4 in live mode. Defaults: 3e-4, 3e-4.
    -fwdflatbeam, -fwdflatnwbeam
        Main and word-exit pruning thresholds for the optional flat-lexical Viterbi search. Defaults: 1e-8, 3e-4.
    -topsenfrm, -topsenthresh
        Number of lookahead frames for predicting active base phones (if <= 1, all base phones are assumed active in every frame). -topsenthresh is the log(pruning threshold) applied to raw senone scores to determine the active phones in each frame. Defaults: 1, -60000.


    Language Weights/Penalties

    -langwt, -fwdflatlw, -rescorelw
        Language weights applied during the lexical tree Viterbi search, the flat-structured Viterbi search, and the global word lattice search, respectively. Defaults: 6.5, 8.5, 9.5.
    -ugwt
        Unigram weight for interpolating unigram probabilities with a uniform distribution. Typically in the range 0.5-0.8. Default: 1.0.
    -inspen, -silpen, -fillpen
        Word insertion penalty or probability (for words in the LM), insertion penalty for the silence word, and insertion penalty for noise words (from the -ndictfn file), if any. Defaults: 0.65, 0.005, 1e-8.
    -oovugprob
        Unigram probability (logprob) for OOV words from the -oovdictfn file, if any. Default: -4.5.


    Output Specifications

    -matchfn
        File to which the final recognition string for each utterance is written. (Old format, word-id at the end.) Default: none.
    -matchsegfn
        Like -matchfn, but contains word segmentation information: startframe #frames word... (New format, word-id at the beginning.) Default: none.
    -reportpron
        Causes word pronunciations to be included in output files. Default: FALSE.
    -rawlogdir
        If specified, logs raw A/D input samples for each utterance to the indicated directory (one file per utterance, named <uttid>.raw). Default: none.
    -mfclogdir
        If specified, logs cepstrum data for each utterance to the indicated directory (one file per utterance, named <uttid>.mfc). Default: none.
    -dumplatdir
        If specified, dumps the word lattice for each utterance to a file in this directory. Default: none.
    -logfn
        File to which decoder logging information is written. Default: stdout/stderr.
    -backtrace
        Includes detailed word backtrace information in the log file. Default: TRUE.
    -nbest
        Number of N-best hypotheses to be produced. Currently this flag is only useful in batch mode, but an application can always invoke search_get_alt directly to obtain them. The current implementation is also lacking in some details (e.g., in returning detailed scores). Default: 0.
    -nbestdir
        Directory to which N-best files are written (one per utterance). Default: current directory.
    -taword, -taphone, -tastate
        Whether word, phone, and state alignment output should be produced when running in forced-alignment mode. Defaults: TRUE, TRUE, FALSE.


    Finally, one of the arguments can be: -argfile filename. This causes additional arguments to be read in from the given file. Lines beginning with the '#' character in this file are ignored. Recursive -argfile specifications are not allowed.
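
    For example, a hypothetical argument file might look as follows (all file names are illustrative):

        # input model databases
        -dictfn  mydict.dic
        -phnfn   myphones.phone
        -mapfn   mymap.map
        -hmmdir  model/hmm
        -lmfn    mylm.arpa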


    Alphabetical List of Arguments

  • 8bsen:
  • Use 8-bit senone dump file.
  • adcendian:
  • A/D input file byte-ordering.
  • adcext:
  • A/D input file extension.
  • adchdr:
  • No. bytes of header in A/D input file.
  • adcin:
  • Input file contains A/D samples or cepstra (TRUE/FALSE).
  • agcemax:
  • Compute AGC (max C0 normalized to 0; estimated, live mode).
  • agcmax:
  • Compute AGC (max C0 normalized to 0 based on current utterance).
  • argfile:
  • Arguments file.
  • backtrace:
  • Provide detailed backtrace in log file.
  • beam:
  • Main pruning beamwidth.
  • bestpath:
  • Run global best path algorithm on word lattice.
  • cbdir:
  • Codebooks directory.
  • compallsen:
  • Compute all senones.
  • compress:
  • Remove silence frames (based on C0 statistics).
  • compressprior:
  • Remove silence frames (based on C0 statistics from prior history).
  • ctlcount:
  • No. of utterances to decode in batch mode.
  • ctlfn:
  • Control file listing utterances to decode in batch mode.
  • ctloffset:
  • No. of initial utterances to be skipped from control file.
  • datadir:
  • Directory prefix for control file entries.
  • dictfn:
  • Main pronunciation dictionary.
  • dumplatdir:
  • Directory for dumping word lattices.
  • fillpen:
  • Noise word penalty (probability).
  • fwdflat:
  • Run flat-lexical Viterbi search.
  • fwdflatbeam:
  • Main beam width for flat search.
  • fwdflatlw:
  • Language weight for flat search.
  • fwdflatnwbeam:
  • Word-exit beam width for flat search.
  • hmmdir:
  • Directory containing acoustic models.
  • hmmdirlist:
  • Directory containing acoustic models.
  • inspen:
  • Word insertion penalty (probability).
  • kbdumpdir:
  • Directory containing LM dump files.
  • langwt:
  • Language weight for lexical tree search.
  • latsize:
  • Size of word lattice to be allocated.
  • live:
  • Live mode.
  • lmctlfn:
  • Control file listing named language model files to be loaded at initialization.
  • lmfn:
  • Unnamed language model file to load at initialization.
  • logfn:
  • Output log file.
  • lpbeam:
  • Transition to last phone beam width.
  • lponlybeam:
  • Last phone internal beam width.
  • mapfn:
  • Senone mapping file.
  • matchfn:
  • Output match file.
  • matchsegfn:
  • Output match file with word segmentation.
  • mfclogdir:
  • Directory for logging cepstrum data for each utterance.
  • nbest:
  • No. of N-best hypotheses to be produced/utterance.
  • nbestdir:
  • Directory for writing N-best hypotheses files.
  • ndictfn:
  • Noise words dictionary.
  • nmprior:
  • Cepstral mean normalization based on statistics from prior utterances.
  • normmean:
  • Cepstral mean normalization.
  • npbeam:
  • Next phone beam width for tree search.
  • nwbeam:
  • Word-exit beam width for tree search.
  • oovdictfn:
  • Out-of-vocabulary words pronunciation dictionary.
  • oovugprob:
  • Unigram probability for OOV words.
  • phnfn:
  • Phone file (senone mapping information).
  • rawlogdir:
  • Directory for logging A/D data for each utterance.
  • reportpron:
  • Show actual word pronunciation in output match files.
  • rescorelw:
  • Language weight for best path search.
  • samp:
  • Input audio sampling rate (16000/8000).
  • sendumpfn:
  • (8-bit) Senone dump file.
  • silpen:
  • Silence word penalty (probability).
  • tactlfn:
  • Forced alignment transcript file.
  • taphone:
  • Whether phone-level alignment information should be output.
  • tastate:
  • Whether state-level alignment information should be output.
  • taword:
  • Whether word-level alignment information should be output.
  • top:
  • No. of top codewords to evaluate in each frame.
  • topsenfrm:
  • No. of frames to look ahead to determine active base phones.
  • topsenthresh:
  • Pruning threshold applied to determine active base phones.
  • ugwt:
  • Unigram weight for interpolating unigram probability with uniform probability.


    Frequently Asked Questions

    Speeding up Decoding

    There are several ways to speed up decoding:


    Building LM Dump Files

    LM files are usually ASCII files. If they are large, it is time consuming to read them into the decoder. A binary "dump" file is much faster to read and more compact.

    LM dump files can be created either by a standalone program (examples/lm3g2dmp.c) or by the decoder itself. The standalone version can be compiled from the examples directory. The program takes two arguments: the LM source file and a directory in which the dump file is to be created. It reads the header from the original LM file to determine the size of the LM, then forms the binary dump file name by appending a .DMP extension to the LM file name. This file is written to the second (directory) argument. (NOTE: the dump file must not already exist.)
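
    For example (with illustrative file names), running:

        lm3g2dmp mylm.arpa /my/dumpdir

    would create the dump file /my/dumpdir/mylm.arpa.DMP.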

    Any version of the decoder can also automatically create binary "dump" files similar to the standalone version described above. It first looks for the dump file in the directory given by the -kbdumpdir argument. If the dump file is present it reads it and ignores the rest of the original LM file. Otherwise, it reads the LM file and creates a dump file in the -kbdumpdir directory so that it can be used in subsequent decoder runs.

    The decoder does not create dump files for small LMs that have fewer than an internally defined number of bigrams and trigrams.


    Building 8-Bit Senone Dump Files

    The Sphinx-II senonic acoustic model files contain 32-bit data. (These are in the directory specified by the -hmmdir argument.) However, they can be clustered down to 8-bits for memory efficiency, without loss of recognition accuracy. The clustering is carried out by an offline process as follows:
    1. Create a temporary 32-bit senone dump file by running the decoder with the -sendumpfn flag set to the temporary file name, the -8bsen flag set to FALSE, and omitting the -lmfn argument. The decoder can be killed after it creates the 32-bit senone dump file, which happens during the initialization and is announced in the log output.
    2. Run: /afs/cs/project/plus-2/s2/Sphinx2/bin/alpha/pdf32to8b 32bit-file 8bit-file
      to create the 8-bit senone dump file. That is, the first argument to pdf32to8b is the temporary 32-bit dump file created above, and the second argument is the 8-bit output file.
    3. Delete the temporary 32-bit file.
    The 8-bit senone dump file can now be used as the -sendumpfn argument to the decoder with the -8bsen argument set to TRUE.