Sphinx-3 (S3) is the successor to the Sphinx-II speech recognition system from Carnegie Mellon University. It includes both an acoustic trainer (link) and various decoders for text recognition, phoneme recognition, N-best list generation, etc.
The s3.2 decoder is a recent implementation for speech-to-text recognition, its main goal being speed improvements over the original Sphinx-3 decoder. It runs about 10 times faster than the latter on large vocabulary tasks. Briefly, its main features and limitations are the following:

- time-synchronous Viterbi search over a lexical-tree structure
- continuous-density acoustic models only, with 3-state or 5-state HMM topologies
- bigram and trigram backoff language models, accessed via a disk-based strategy
- stand-alone, batch-mode operation on pre-recorded cepstrum files; no API for application development
This package contains the following programs:
decode: The Sphinx-3 s3.2 decoder
gausubvq: A utility for creating a compact, quantized form of continuous density acoustic models; the compact form is used to speed up acoustic likelihood computation in the decoder
lmtest: A language model test utility

This distribution (specifically the Makefiles) has been prepared for Unix platforms.
However, it should be easy to create an equivalent setup on the
Windows platform, and compile the software there as well. Note that the s3.2 decoder
depends on the libutil module (link), which is also a part of the
Sphinx distribution.
This document is a brief user's manual for the above programs. It is not meant to be a detailed description of the decoding algorithm, or an in-depth tutorial on speech recognition technology. However, a set of (Microsoft) PowerPoint slides is available that gives additional information about the decoder.
The initial part of this document provides an overview of the decoder. It is followed by descriptions of the main input and output databases, i.e., the lexicon, language model, acoustic model, etc.
The s3.2 decoder is based on the conventional Viterbi search algorithm and beam search heuristics [cite]. It uses a lexical-tree search structure somewhat like that of the Sphinx-II decoder [citeRavi] (link), but with improvements for greater accuracy. However, unlike the Sphinx-II system, it does not yet provide a convenient API for developing speech-recognition applications. Instead, it only offers a stand-alone, batch-mode recognition application (i.e., the decoder). It takes its input from pre-recorded speech that has already been processed into cepstrum data files (cite,link), and writes its recognition results to output files.
We first give a brief outline of the input and output characteristics of the decoder. More detailed information is available in later sections. The decoder needs the following inputs:

- A model definition file and the acoustic model files
- The main and filler lexicons
- A language model (in binary dump form) and, optionally, filler word probabilities
- A control file listing the utterances to be processed, and the corresponding cepstrum data files

Note that the decoder does not process raw audio; the input speech must first be converted into cepstrum data by an off-line front-end module (similar to libfe in Sphinx-II). Also note that the decoder cannot handle arbitrary lengths of speech input. Each separate piece (or utterance) to be processed by the decoder must be no more than 300 sec. long. Typically, one uses a segmenter (link to CMUSeg) to chop up a cepstrum stream into manageable segments of up to 20 or 30 sec. duration.
The decoder can produce two types of recognition output:

- A single best recognition hypothesis for each utterance processed
- Optionally, a word lattice for each utterance, recording all the candidate words recognized during decoding
In addition, the decoder also produces a detailed log to stdout/stderr that can be useful in debugging, gathering statistics, etc.
The Makefiles provided are set up for Unix platforms. The following steps are needed to compile the decoder and other utilities:

1. Compile the libutil library module (link) wherever it is convenient.
2. setenv MACHINE linux (or alpha, sun, etc. as appropriate).
3. setenv S3DIR to your installation of the Sphinx-3 (s3.2) root directory.
4. Set the UTILDIR variable in $S3DIR/Makefile.defines to point to your installation of the libutil module.
5. cd $S3DIR/src
6. make clean; this should remove any old object files.
7. make; this first compiles libutil if it is out of date, then compiles the decoder and other utilities.

The executable files are placed in machine-specific directories under $S3DIR/bin.
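For example, a complete build on a Linux machine might proceed as follows (after setting UTILDIR in $S3DIR/Makefile.defines; the installation path shown is hypothetical):

setenv MACHINE linux
setenv S3DIR /usr/local/s3.2
cd $S3DIR/src
make clean
make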
Note that the Makefiles aren't foolproof; they do not eliminate the need for sometimes manually determining dependencies, especially upon updates to header files. When in doubt, first clean out the compilation directories entirely.
Running the decoder is simply a matter of invoking the binary
(i.e., decode), with a number
of command line arguments specifying the various input and output files, as
well as decoding
configuration parameters.
Invoking the binary without any argument produces a help message with short descriptions of all the command line arguments.
This section gives a brief overview of the main command line arguments. They are broken down into separate groups, based on whether they are the primary flags specifying input and output data, arguments for optional configuration, or for performance tuning.
Note that not all the available command line arguments are covered below. There are a few additional and undocumented flags, intended mainly for debugging purposes.
Many of the flags have reasonable defaults. The ones that a user minimally needs to provide are the input and output databases or files, which have been discussed above:
-mdef: Model definition input file
-mean, -var, -mixw, -tmat: Acoustic model files
-dict, -fdict: Main and filler lexicons
-lm: Language model binary dump file
-fillpen: Filler word probabilities
-hyp: Output hypotheses file
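For example, a minimal invocation might look like the following, written on one line (all file names are placeholders; -ctl names the input control file, which is discussed later in this document):

decode -mdef my.mdef -mean means -var variances -mixw mixture_weights -tmat transition_matrices -dict main.dict -fdict filler.dict -lm my.lm.DMP -fillpen my.fillpen -ctl my.ctl -hyp my.hyp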
It may often be necessary to provide additional parameters to obtain the right decoder configuration:
-feat: Feature type configuration
-cepdir: Directory prefix for cepstrum files specified in the control file
-ctloffset, -ctlcount: Selecting a portion of the control file to be processed
-outlatdir, -latext: Directory and file-extension for word lattice output
In yet other cases, it may be necessary to tune the following parameters to obtain the optimal computational efficiency or recognition accuracy:
-beam, -pbeam, -wbeam, -subvqbeam: Beam pruning parameters
-maxhmmpf, -maxwpf, -maxhistpf: Absolute pruning parameters
-lw, -wip: Language weight, word insertion penalty
-Nlextree: Number of lexical tree instances (link)
This section is a bit of a mish-mash; its contents probably belong in an FAQ section. But, hopefully, it gives a newcomer to Sphinx an idea of the structure, capabilities, and limitations of the s3.2 decoder.
The decoder is configured during the initialization step, and the configuration holds for the entire batch run. This means, for example, that the decoder does not dynamically reconfigure the acoustic models to adapt to the input. To choose another example, there is no mechanism in this decoder to switch language models from utterance to utterance, unlike in Sphinx-II (link). The main initialization steps are outlined below.
Log-Base Initialization. Sphinx performs all likelihood computations in
the log-domain. Furthermore, for computational efficiency, the base
of the logarithm is chosen such
that the likelihoods can be maintained as 32-bit integer values. Thus,
all the scores reported by the decoder are log-likelihood values in
this peculiar log-base. The default base
is typically 1.0003, and can be changed using the -logbase command line
argument. The main reason for modifying the log-base would be to control the length
(duration) of an input utterance before the accumulated log-likelihood values overflow
the 32-bit representation, causing the decoder to fail catastrophically. The log-base
can be changed over a wide range without affecting the recognition.
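For illustration, here is a minimal sketch of the conversion from a likelihood to an integer score in this log-base (the actual routines live in src/logs3.c; the function name here is made up):

#include <math.h>

/* Convert a likelihood (0 < lik <= 1) into an integer score in log-base B.
   With the default B = 1.0003, a likelihood of 1e-6 maps to
   log(1e-6) / log(1.0003), i.e. approximately -46,000. */
static int likelihood_to_score(double lik, double B)
{
    return (int)(log(lik) / log(B));
}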
Models Initialization. The lexical, acoustic, and language models specified via the command line arguments are loaded during initialization. This set of models is used to decode all the utterances in the input. (The language model is actually only partly loaded, since s3.2 uses a disk-based LM strategy.)
The Effective Vocabulary. After the models are loaded, the effective vocabulary is determined. It is the set of words that the decoder is capable of recognizing. Recall that the decoder is initialized with three sources of words: the main and filler lexicon files, and the language model. The effective vocabulary is determined from them as follows:
It consists of every word in the main lexicon that is also present in the language model, together with all the alternative pronunciations of such words, plus all the words in the filler lexicon, but excluding the distinguished beginning-of-sentence and end-of-sentence tokens <s> and </s>. The effective vocabulary remains in effect throughout the batch run. It is not possible to add to or remove from this vocabulary dynamically, unlike in the Sphinx-II system.
Lexical Tree Construction. The decoder constructs lexical trees
from the effective vocabulary described above. Separate trees are constructed
for words in the main and filler lexicons.
Furthermore, several copies may be instantiated for the two, depending on the
-Nlextree command line argument. Further details of the lexical
tree construction are available on the PowerPoint slides.
Following initialization, the decoder processes the entries in the control file
sequentially, one at a
time, and independent of each other. It is possible to process a contiguous
subset of the control file, using the -ctloffset and -ctlcount
flags, as mentioned earlier. The independent processing of each entry implies
that there is no learning or adaptation capability as decoding progresses.
But it also implies that rearranging the order of the entries in the control file
has no effect on the individual results, which is a useful attribute in
experimentation and debugging.
Each entry in the control file, or utterance, is processed using the given input models, and using the Viterbi search algorithm. In order to constrain the active search space to computationally manageable limits, pruning is employed, which means that the less promising hypotheses are continually discarded during the recognition process. There are two kinds of pruning in s3.2: beam pruning and absolute pruning.
Beam Pruning. Each utterance is processed in a time-synchronous manner, one frame at a time. At each frame the decoder has a number of currently active HMMs to match with the next frame of input speech. But it first discards or deactivates those whose state likelihoods are below some threshold, relative to the best HMM state likelihood at that time. The threshold value is obtained by multiplying the best state likelihood by a fixed beamwidth. The beamwidth is a value between 0 and 1, the former permitting all HMMs to survive, and the latter permitting only the best scoring HMMs to survive.
Similar beam pruning is also used in a number of other situations in the decoder, e.g., to determine the candidate words recognized at any time, or to determine the component densities in a mixture Gaussian that are closest to a given speech feature vector. The various beamwidths have to be determined empirically and are set using command line arguments.
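As a rough illustration of the mechanism, frame-level HMM pruning amounts to the following (a sketch with hypothetical names, not the actual code; see src/beam.c and src/lextree.c). Scores are integer log-likelihoods, so multiplying a likelihood by the beamwidth becomes an addition:

typedef struct {
    int bestscore;   /* best state score within this HMM in the current frame */
    int active;      /* whether this HMM is evaluated in the next frame */
} hmm_t;

/* beam is the beamwidth converted to the log domain, e.g. the score
   corresponding to 1e-60; it is a large negative number */
void prune_hmms(hmm_t *hmm, int n_hmm, int best_frame_score, int beam)
{
    int thresh = best_frame_score + beam;
    for (int i = 0; i < n_hmm; i++)
        if (hmm[i].bestscore < thresh)
            hmm[i].active = 0;   /* below the threshold; deactivated */
}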
Absolute Pruning. Even with beam pruning, the number of active entities can sometimes become computationally overwhelming. If there are a large number of HMMs that fall within the pruning threshold, the decoder will keep all of them active. However, when the number of active HMMs grows beyond certain limits, the chances of detecting the correct word among the many candidates are considerably reduced. Such situations can occur, for example, if the input speech is noisy or quite mismatched to the acoustic models. In such cases, there is no point in allowing the active search space to grow to arbitrary extents. It can be contained using pruning parameters that limit the absolute number of active entities at any instant. These parameters are also determined empirically, and set using command line arguments.
During recognition, the decoder builds an internal backpointer table data structure, from which the final outputs are generated. This table records all the candidate words recognized during decoding, and their attributes such as their time segmentation, acoustic and LM likelihoods, as well as their predecessor entries in the table. When an utterance has been fully processed, the best recognition hypothesis is extracted from this table. Optionally, the table is also converted into a word-lattice and written out to a file.
More information on the backpointer table is available in the PowerPoint slides.
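The table entries can be pictured roughly as follows (a hypothetical sketch; the actual data structure is defined in src/vithist.c):

typedef struct {
    int wid;     /* word ID of the candidate word */
    int sf, ef;  /* time segmentation: start and end frames */
    int ascr;    /* acoustic likelihood of this segment */
    int lscr;    /* LM score (including language weight and insertion penalty) */
    int score;   /* total path score for the best path ending in this entry */
    int pred;    /* index of the predecessor entry in the table */
} bp_t;

The best hypothesis is extracted by locating the best-scoring final entry and following the pred indices back to the beginning of the utterance.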
Role of <s> and </s>.
The distinguished beginning-of-sentence and
end-of-sentence tokens <s> and
</s> are not in the effective vocabulary, and no part of the
input speech is decoded into either of them. They are merely
anchors at the ends of each utterance, and provide context for the LM.
This is in contrast to earlier versions of Sphinx, which required some
silence at either end of each speech utterance, to be decoded into these
tokens.
To obtain the best recognition performance, it is necessary to select the appropriate front-end and feature type computation, train the various models, and tune the decoder configuration parameters. This section deals with the last issue. There are mainly two groups of parameters to be tuned: pruning parameters and LM-related parameters. Unfortunately, there are no automatic methods for determining the values of these parameters; it is necessary to derive them by trial and error. Additionally, the following points should be kept in mind with regard to the pruning parameters:

- Narrowing a beam reduces the computational load, but increases the chance that the correct hypothesis is pruned away, causing recognition errors.
- Widening a beam beyond a certain point yields no further accuracy improvement; it merely lets more of the less promising hypotheses survive, increasing the computational load.
The pruning parameters are the following:
-beam: Determines which HMMs remain active at any given point (frame) during recognition. (Based on the best state score within each HMM.)

-pbeam: Determines which active HMM can transition to its successor in the lexical tree at any point. (Based on the exit state score of the source HMM.)

-wbeam: Determines which words are recognized at any frame during decoding. (Based on the exit state scores of leaf HMMs in the lexical trees.)

-maxhmmpf: Determines the number of HMMs (approx.) that can remain active at any frame.

-maxwpf: Controls the number of distinct words recognized at any given frame.

-maxhistpf: Controls the number of distinct word histories recorded in the backpointer table at any given frame.

-subvqbeam: For each senone and its underlying acoustic model, determines its active mixture components at any frame.

In order to determine the pruning parameter values empirically, it is first necessary to obtain a test set, i.e., a collection of test sentences not used in any training data. The test set should be sufficiently large to ensure statistically reliable results. For example, a large-vocabulary task might require a test set that includes a half-hour of speech, or more.
It is difficult to tune a handful of parameters simultaneously, especially when the input models are completely new. The following steps may be followed to deal with this complex problem.
1. Initially, set -beam and -pbeam to 1e-60, and -wbeam to 1e-30. Set -subvqbeam to a small value (e.g., the same as -beam). Run the decoder on the chosen test set and obtain accuracy results. (Use default values for the LM related parameters when tuning the pruning parameters for the first time.)

2. Vary -beam up and down, until the setting for best accuracy is identified. (Keep -pbeam the same as -beam every time.)

3. Vary -wbeam up and down and identify its best possible setting (keeping -beam and -pbeam fixed at their most recently obtained values).

4. Repeat the above two steps, alternately optimizing -beam and -wbeam, until convergence. Note that during these iterations -pbeam should always be the same as -beam. (This step can be omitted if the accuracy attained after the first iteration is acceptable.)

5. Now gradually narrow the -subvqbeam setting (i.e., move it towards 1.0), stopping when recognition accuracy begins to drop noticeably. Values near the default are reasonable. (This step is needed only if a sub-vector quantized model is available for speeding up acoustic model evaluation.)

6. Similarly, narrow -pbeam (i.e., towards 1.0), stopping when recognition accuracy begins to drop noticeably. (This step is optional; it mainly optimizes the computational effort a little more.)

7. Finally, reduce -maxhmmpf gradually until accuracy begins to be affected. Repeat the process with -maxwpf, and then with -maxhistpf. (However, in some situations, especially when the vocabulary size is small, it may not be necessary to tune these absolute pruning parameters.)
In practice, it may not always be possible to follow the above steps strictly. For
example, considerations of computational cost might dictate that the absolute pruning
parameters or the -subvqbeam parameter be tuned earlier in the sequence.
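For instance, the initial run in step 1 above might be invoked with settings like these (in addition to the usual input and output arguments):

decode ... -beam 1e-60 -pbeam 1e-60 -wbeam 1e-30 -subvqbeam 1e-60 ...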
The LM-related parameters that need to be tuned are the following:
-lw: The language weight.
-wip: The word insertion penalty.
Like the pruning parameters, the above two are tuned on a test set. Since the decoder is much more sensitive to the language weight, that is typically tuned first, using the default word insertion penalty. The latter is then tuned. It is usually not necessary to repeat the process.
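For example, one might first sweep the language weight with the word insertion penalty left at its default, then fix the best language weight and sweep the penalty (the values shown are hypothetical; typical ranges are discussed later in this document):

decode ... -lw 9.0 ...            (sweep -lw with the default -wip)
decode ... -lw 9.0 -wip 0.5 ...   (then sweep -wip with the best -lw)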
A pronunciation lexicon (or dictionary) file specifies word pronunciations. In Sphinx, pronunciations are specified as a linear sequence of phonemes. Each line in the file contains one pronunciation specification, except that any line that begins with a "#" character in the first column is treated as a comment and is ignored. Example dictionary for digits:
ZERO     Z IH R OW
ONE      W AH N
TWO      T UW
THREE    TH R IY
FOUR     F AO R
FIVE     F AY V
SIX      S IH K S
SEVEN    S EH V AX N
EIGHT    EY TD
NINE     N AY N
The lexicon is completely case-insensitive (unfortunately). For example,
it's not possible to have two different entries Brown and
brown in the dictionary.
A word may have more than one pronunciation, each one on a separate line. They are distinguished by a unique parenthesized suffix for the word string. For example:
ACTUALLY        AE K CH AX W AX L IY
ACTUALLY(2nd)   AE K SH AX L IY
ACTUALLY(3rd)   AE K SH L IY
If a word has more than one pronunciation, its first appearance must be the unparenthesized form. For the rest, the parenthesized suffix may be any string, as long as it is unique for that word. There is no other significance to the order of the alternatives; each one is considered to be equally likely.
In Sphinx-3, the lexicon may also contain compound words. A compound word is usually a short phrase whose pronunciation happens to differ significantly from the mere concatenation of the pronunciations of its constituent words. Compound word tokens are formed by concatenating the component word strings with an underscore character; e.g.:
WANT_TO W AA N AX
(The s3.2 decoder, however, treats a compound word as just another word in the language, and does not do anything special with it.)
The Sphinx-3 decoders actually need two separate lexicons: a "regular" one
containing the words in the language of interest, and also a filler or
noise lexicon.
The latter defines "words" not in the language. More specifically, it defines legal
"words" that do not appear in the language model (link) used by the decoder, but are
nevertheless encountered in normal speech.
This lexicon must include the silence word <sil>,
as well as the special beginning-of-sentence and end-of-sentence tokens
<s>, and
</s>, respectively. All of them usually
have the silence-phone SIL as their pronunciation. In addition,
this lexicon may also contain "pronunciations" for other noise event words such as
breath noise, "UM" and "UH" sounds made during spontaneous speech, etc.
Sphinx-3 is based on subphonetic acoustic models [citeMeiYuh].
First, the basic
sounds in the language are classified into phonemes or phones.
There are roughly 50 phones in the English language. For example, here is a
pronunciation for the word LANDSAT:
L AE N D S AE TD
Phones are then further refined into context-dependent triphones,
i.e., phones occurring in given left and right phonetic contexts. The reason is
that the same phone within different contexts can have widely different acoustic
manifestations, requiring separate acoustic models. For example, the two occurrences
of the AE phone above have different contexts, only the first of which
is nasal.
In contrast to triphones, a phone considered without any specific context is referred to as a context-independent phone or basephone. Note also that context-dependency gives rise to the notion of cross-word triphones. That is, the left context for the leftmost basephone of a word depends on the previous word spoken.
Phones are also distinguished according to their position
within the word: beginning, end, internal, or single
(abbreviated b, e, i and s,
respectively).
For example, in the word MINIMUM with the following pronunciation:
M IH N AX M AX M
the three occurrences of the phone M have three different position
attributes. The s attribute applies if a word has just a single phone as
its pronunciation.
For most applications, one builds acoustic models for triphones, qualified by the four position attributes. (This provides far greater modelling detail and accuracy than if one relies on just basephone models.) Each triphone is modelled by a hidden Markov model or HMM [citeRabiner]. Typically, 3 or 5 state HMMs are used, where each state has a statistical model for its underlying acoustics. But if we have 50 basephones, with 4 position qualifiers and 3-state HMMs, we end up with a total of 50^3 * 4 * 3 = 1,500,000 distinct HMM states! Such a model would be too large and impractical to train. To keep things manageable, HMM states are clustered into a much smaller number of groups. Each such group is called a senone (in Sphinx terminology), and all the states mapped into one senone share the same underlying statistical model. (The clustering of HMM states into senones is described in [citeMeiYuh].)
Each triphone also has a state transition probability matrix that defines the topology of its HMM. Once again, to conserve resources, there is a considerable amount of sharing. Typically, there is one such matrix per basephone, and all triphones derived from the same parent basephone share its state transition matrix.
The information regarding triphones and mapping from triphone states to senones and transition matrices is captured in a model definition, or mdef input file.
For various reasons, it is undesirable to build acoustic models directly in terms of the raw audio samples. Instead, the audio is processed to extract a vector of relevant features. All acoustic modelling is carried out in terms of such feature vectors.
In Sphinx, feature vector computation is a two-stage process. An off-line front-end module (link) is first responsible for processing the raw audio sample stream into a cepstral stream [cite], which can then be input to the Sphinx software. The input audio stream consists of 16-bit samples, at a rate of 8 or 16 KHz depending on whether the input is narrow or wide-band speech. The output is a stream of 13-dimensional real-valued cepstrum vectors, at a rate of 100 vectors/sec. The 10msec step between successive vectors is usually called a frame.
In the second stage, the Sphinx software (both trainer and decoder) internally converts the stream of cepstrum vectors into a feature stream. This process consists of the following steps:

- Cepstral mean normalization (and, optionally, variance normalization and automatic gain control on the signal energy)
- Computation of dynamic features, i.e., first-order (delta) and second-order (delta-delta) difference coefficients, from the cepstral stream
- Assembly of these components into a single feature vector for each frame
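As an illustration of the dynamic-feature step, one common scheme computes the differences over a window of neighboring frames, roughly as follows (a sketch; the exact windows and feature layout are defined in src/feat.c):

#define NCEP 13   /* cepstrum vector dimension */

/* Compute delta and delta-delta coefficients from a cepstrum stream of
   nfr frames (frames near the utterance boundaries are omitted here). */
void dynamic_features(float cep[][NCEP], int nfr,
                      float del[][NCEP], float ddel[][NCEP])
{
    for (int t = 2; t < nfr - 2; t++)
        for (int i = 0; i < NCEP; i++)
            del[t][i] = cep[t+2][i] - cep[t-2][i];      /* delta */
    for (int t = 3; t < nfr - 3; t++)
        for (int i = 0; i < NCEP; i++)
            ddel[t][i] = del[t+1][i] - del[t-1][i];     /* delta-delta */
}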
This refers to the computation of a (statistical) model for each senone. As a very rough approximation, this process can be described by the following conceptual steps:

- Convert the transcript of each utterance in the training corpus into a sequence of triphone HMMs, and hence into a sequence of senones, using the lexicon and the model definition
- Compute the feature vector sequence for each training utterance
- Find the best possible state alignment of each feature vector sequence with the corresponding senone sequence
- For each senone, collect all the feature vectors mapped to it by these alignments, and estimate its statistical model from that collection
Note that there is a circularity in the above description. We wish to train the senone models, but in the penultimate step, we need the senone models to compute the best possible state alignment. This circularity is resolved by using the iterative Baum-Welch or forward-backward training algorithm [citeRabiner]. The algorithm begins with some initial set of models, which could be completely flat, for the senones. It then repeats the last two steps several times. Each iteration uses the model computed at the end of the previous iteration.
Although not mentioned above, the HMM state-transition probability matrices are also trained from the state alignments. Acoustic modelling is described in greater detail in the Sphinx-3 trainer module (link) and [citeRabiner].
The acoustic models trained as described above can be of different degrees of sophistication. Two forms are commonly used:
In a continuous model, each senone has its own, private mixture-Gaussian distribution that describes the statistics of its underlying speech feature space. In a semi-continuous model, all the senones share a single codebook of Gaussian distributions, but each senone has its own set of mixture weights applied to the codebook components. Sphinx-3 supports both models, and other, intermediate degrees of state-tying as well. (The s3.2 decoder, however, can only handle continuous density acoustic models.)
Similarly, Sphinx-3 in general supports "arbitrary" HMM topologies, unlike Sphinx-II, which is restricted to a specific 5-state topology. However, for efficiency's sake, the s3.2 decoder is hardwired to deal with only two types of HMM topologies: 3-state and 5-state, described briefly in src/hmm.h.
Continuous density acoustic models are computationally expensive to deal with, since they can contain hundreds of thousands of Gaussian densities that must be evaluated in each frame. To reduce this cost, one can use an approximate model that efficiently identifies the top scoring candidate densities in each Gaussian mixture in any given frame. The remaining densities can be ignored during that frame.
In Sphinx-3, such an approximate model is built by sub-vector quantizing the
acoustic model densities [citeRavi]. The utility that performs this conversion is
included in this distribution and is called gausubvq,
which stands for Gaussian Sub-Vector Quantization.
Note that if the original model consists of mixture Gaussians that only contain a few component densities (say, 4 or fewer per mixture), a sub-vector quantized model may not be effective in reducing the computational load.
An acoustic model is represented by the following collection of files:

- A model definition (mdef) file
- A Gaussian means (mean) file
- A Gaussian variances (var) file
- A mixture weights (mixw) file
- A transition matrices (tmat) file
- Optionally, a sub-vector quantized model file (produced by gausubvq) for speeding up acoustic model evaluation
The mean, var, mixw, and tmat files are produced by the Sphinx-3 trainer, and their file formats should be documented there.
The main language model (LM) used by the Sphinx decoder is a conventional bigram or
trigram backoff language model. The CMU-Cambridge SLM toolkit (link) is
capable of generating such a model from LM training data. Its output is an ASCII
text file. But a large text LM file can be very slow to load into memory. To
speed up this process, the LM must be compiled into a
binary form. This distribution also includes an
lmtest utility program, which reads in a binary LM file and prints
out language model scores (log-likelihoods) for sentences typed in by the user.
A trigram LM primarily consists of the following:

- Unigrams: the individual words covered by the LM, with their probabilities and backoff weights
- Bigrams: probabilities (and backoff weights) for a subset of the possible word pairs
- Trigrams: probabilities for a subset of the possible word triples
- The distinguished beginning-of-sentence and end-of-sentence tokens, <s> and </s>, respectively

The vocabulary of the LM is the set of words covered by the unigrams.
The LM probability of an entire sentence is the product of the individual word
probabilities. For example, the LM probability of the sentence
"HOW ARE YOU" is:
P(HOW | <s>) * P(ARE | <s>, HOW) * P(YOU | HOW, ARE) * P(</s> | ARE, YOU)
In Sphinx, the LM cannot distinguish between different pronunciations of the
same word. For example, even though the lexicon might contain two different
pronunciation entries for the word READ (present and past tense forms),
the language model cannot distinguish between the two. Both pronunciations would
inherit the same probability from the language model.
Secondly, the LM is case-insensitive. For example, it cannot contain
two different tokens READ and read.
The reasons for the above restrictions are historical. Precise pronunciation and case information has rarely been present in LM training data. It would certainly be desirable to do away with the restrictions at some time in the future.
The binary LM file (also referred to as the LM dump file) is more or less a disk image of the LM data structure constructed in memory. This data structure was originally designed during the Sphinx-II days, when efficient memory usage was the focus. In Sphinx-3, however, memory usage is no longer an issue since the binary file enables the decoder to use a disk-based LM strategy. That is, the LM binary file is no longer read entirely into memory. Rather, the portions required during decoding are read in on demand, and cached. For large vocabulary recognition, the memory resident portion is typically about 10-20% of the bigrams, and 5-10% of the trigrams.
Since the decoder uses a disk-based LM, it is necessary to have efficient access to the binary LM file. Thus, network access to an LM file at a remote location is not recommended. It is desirable to have the LM file be resident on the local machine.
The binary dump file can be created from the ASCII form using
the lm3g2dmp utility (link), which is part of the Sphinx distribution.
(The header of the dump file itself contains a brief description of the
file format.)
Language models typically do not cover acoustically significant events such as silence, breath-noise, UM or UH sounds made by a person hunting for the right phrase, etc. These are known generally as filler words, and are excluded from the LM vocabulary. The reason is that a language model training corpus, which is simply a lot of text, usually does not include such information.
Since the main trigram LM ignores silence and filler words, their "language model probability" has to be specified in a separate file, called the filler penalty file. The format of this file is very straightforward; each line contains one word and its probability, as in the following example:
++UH++      0.10792
++UM++      0.00866
++BREATH++  0.00147
The filler penalty file is not required. If it is present, it does not
have to contain entries for every filler word. The decoder allows a default value
to be specified for filler word probabilities (through the -fillprob
command line argument), and a default silence word probability (through
the -silprob argument).
Like the main trigram LM, filler and silence word probabilities are obtained from appropriate training data. However, training them is considerably easier since they are merely unigram probabilities.
Filler words are invisible or transparent to the trigram language model.
For example, the LM probability of the sentence
"HAVE CAR <sil> WILL TRAVEL" is:
P(HAVE | <s>) * P(CAR | <s>, HAVE) * P(<sil>) * P(WILL | HAVE, CAR) * P(TRAVEL | CAR, WILL) * P(</s> | WILL, TRAVEL)
During recognition the decoder combines both acoustic likelihoods and language model probabilities into a single score in order to compare various hypotheses. This combination of the two is not just a straightforward product. In order to obtain optimal recognition accuracy, it is usually necessary to exponentiate the language model probability using a language weight before combining the result with the acoustic likelihood. (Since likelihood computations are actually carried out in the log-domain in the Sphinx decoder, the LM weight becomes a multiplicative factor applied to LM log-probabilities.)
The language weight parameter is typically obtained through trial and error. In the case of Sphinx, the optimum value for this parameter has usually ranged between 6 and 13, depending on the task at hand.
Similarly, though with lesser impact, it has also been found useful to include a word insertion penalty parameter which is a fixed penalty for each new word hypothesized by the decoder. It is effectively another multiplicative factor in the language model probability computation (before the application of the language weight). This parameter has usually ranged between 0.2 and 0.7, depending on the task.
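Putting these together, the combined score that the decoder compares across hypotheses is, in the log domain, roughly of the following form (a sketch of the combination described above, not a literal excerpt from the code):

score(hyp) = sum over words w of [ AScr(w) + lw * ( log P(w | history) + log wip ) ]

where AScr(w) is the acoustic log-likelihood of the word segment, lw is the language weight, and wip is the word insertion penalty.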
The Sphinx-3 decoder processes entries listed in a control file. Each line in the control file identifies a separate utterance. A line has the following format (the brackets indicate a group of fields that is optional):
CepFile [ StartFrame EndFrame UttID ]
CepFile is the speech input file containing cepstrum data. (The filename extension should be omitted from the specification.) If this is the only field in the line, the entire file is processed as one utterance. In this case, an utterance ID string is automatically derived from the cepstrum filename, by stripping any leading directory name components from it. E.g.: if the control file contains the following entries:
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0201
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0202
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0203
three utterances are processed, with IDs 4t0c0201, 4t0c0202,
and 4t0c0203, respectively.
If, on the other hand, a control file entry includes the StartFrame and EndFrame fields, only that portion of the cepstrum file is processed. This form of the control file is frequently used if the speech input can be arbitrarily long, such as an entire TV news show. There is one big cepstrum file, but it is processed in smaller chunks or segments. In this case, the final UttID field is the utterance ID string for the entry.
The utterance ID associated with a control file entry is used to identify all the output from the decoder for that utterance. For example, if the decoder is used to generate word lattice files, they are named using the utterance ID. Hence, each ID, whether automatically derived or explicitly specified, should be unique over the entire control file.
Any line in the control file beginning with a # character is a comment line,
and is ignored.
The Sphinx-3 decoder produces a single recognition hypothesis for each utterance it processes. The hypotheses for all the utterances processed in a single run are written to a single output file, one line per utterance. The line format is as follows:
u S s T t A a L l sf wa wl wd sf wa wl wd ... nf

The S, T, A, and L fields are keywords and appear in the output as shown. The remaining fields are briefly described below:

u: the utterance ID, from the control file
s: the total scaling factor applied to the acoustic scores in the utterance (see the note on scaled scores below)
t: the total score for the hypothesis (t = a + l)
a: the total acoustic score for the hypothesis
l: the total language model score for the hypothesis

The l (LM score) field is followed by groups of four fields, one group for each successive word in the output hypothesis. The four fields are:

sf: the start frame for the word
wa: the acoustic score for the word segment
wl: the LM score for the word
wd: the word string itself
The final field, nf, in each hypothesis line is the total number of frames in the utterance.
Note that all scores are log-likelihood values in the peculiar logbase used by the decoder. Secondly, the acoustic scores are scaled values; in each frame, the acoustic scores of all active senones are scaled such that the best senone has a log-likelihood of 0. Finally, the language model scores reported include the language weight and word-insertion penalty parameters.
Here is an example hypothesis file for three utterances.
During recognition the decoder maintains not just the single best hypothesis, but also
a number of alternatives or candidates. For example, REED is a perfectly
reasonable alternative to READ. The alternatives are useful in many
ways: for instance, in N-best list generation. To facilitate such
post-processing, the decoder can optionally produce a word lattice
output for each input utterance. This output records all the candidate words
recognized by the decoder at any point in time, and their main attributes such as
time segmentation and acoustic likelihood scores.
The term "lattice" is used somewhat loosely. The word-lattice is really a
directed acyclic graph or DAG.
Each node of the DAG denotes a word instance
that begins at a particular frame within the utterance. That is, it is a unique
<word,start-time> pair. (However, there could be a number of
end-times for this word instance. One of the features of a time-synchronous Viterbi
search using beam pruning is that word candidates hypothesized by the decoder have a
well-defined start-time, but a fuzzy range of end-times. This is because the start-time
is primarily determined by Viterbi pruning, while the possible end-times are
determined by beam pruning.)
There is a directed edge between two nodes in the DAG if the start-time of the destination node immediately follows one of the end times of the source node. That is, the two nodes can be adjacent in time. Thus, the edge determines one possible segmentation for the source node: beginning at the source's start-time and ending one frame before the destination's start-time. The edge also contains an acoustic likelihood for this particular segmentation of the source node.
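The node and edge information described above can be pictured roughly as the following structures (hypothetical names; a sketch rather than the decoder's actual definitions):

typedef struct dagedge dagedge_t;

typedef struct dagnode {
    int wid;               /* word ID */
    int sf;                /* start frame; the <word,start-time> pair is unique */
    int fef, lef;          /* earliest and latest end frames */
    dagedge_t *succ;       /* list of outgoing edges */
} dagnode_t;

struct dagedge {
    dagnode_t *dst;        /* destination node; its start frame immediately
                              follows one of the source node's end times */
    int ascr;              /* acoustic score for the source word segment
                              ending at dst->sf - 1 */
    dagedge_t *next;
};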
Note: The beginning and end of sentence tokens, <s> and
</s>, are not decoded as part of an utterance by the s3.2 decoder.
However, they have to be included in the word lattice file, for compatibility with
the older Sphinx-3 decoder software. They are assigned 1-frame
segmentations, with log-likelihood scores of 0. To accommodate them, the
segmentations of adjacent nodes have to be "fudged" by 1 frame.
A word lattice file essentially contains the above information regarding the nodes and edges in the DAG. It is structured in several sections, as follows:

- A Frames section, specifying the number of frames in the utterance
- A Nodes section, listing the nodes in the DAG
- The Initial and Final nodes (for <s> and </s>, respectively)
- A BestSegAscr section, a historical remnant that is now essentially empty
- An Edges section, listing the edges in the DAG

The file is formatted as follows. Note that any line in the file that begins with the "#" character in the first column is considered to be a comment.
# getcwd: <current-working-directory>
# -logbase <logbase-in-effect>
# -dict <main lexicon>
# -fdict <filler lexicon>
# ... (other arguments, written out as comment lines)
#
Frames <number-of-frames-in-utterance>
#
Nodes <number-of-nodes-in-DAG> (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
... (for all nodes in DAG)
#
Initial <Initial-Node-ID>
Final <Final-Node-ID>
#
BestSegAscr 0 (NODEID ENDFRAME ASCORE)
#
Edges (FROM-NODEID TO-NODEID ASCORE)
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
... (for all edges in DAG)
End
Note that the node-ID values for DAG nodes are assigned sequentially, starting from 0. Furthermore, they are sorted in descending order of their earliest-end-time attribute.
Here is an example word lattice file.
In addition to the s3.2 decoder, this distribution also provides a number of other utility programs, some of which are simply intended for debugging.
In alphabetical order:
src/agc.c: Automatic gain control (on signal energy)
src/ascr.c: Senone acoustic scores
src/beam.c: Pruning beam widths
src/bio.c: Binary file I/O support
src/cmn.c: Cepstral mean normalization and variance normalization
src/corpus.c: Control file processing
src/cont_mgau.c: Mixture Gaussians (acoustic model)
src/dict.c: Pronunciation lexicon
src/dict2pid.c: Generation of triphones for the pronunciation dictionary
src/feat.c: Feature vector computation
src/fillpen.c: Filler word probabilities
src/gausubvq.c: Standalone acoustic model sub-vector quantizer
src/hmm.c: HMM evaluation
src/hyp.h: Recognition hypotheses data type
src/kb.h: All knowledge bases and search structures used by the decoder
src/kbcore.c: Collection of core knowledge bases
src/lextree.c: Lexical search tree
src/lm.c: Trigram language model
src/lmtest.c: Standalone LM test program
src/logs3.c: Support for log-likelihood operations
src/main.c: Main decoder driver
src/mdef.c: Acoustic model definition
src/s3types.h: Various data types, for ease of modification
src/subvq.c: Sub-vector quantized acoustic model
src/tmat.c: HMM transition matrices (topology definition)
src/vector.c: Vector operations, quantization, etc.
src/vithist.c: Backpointer table (Viterbi history)
src/wid.c: Mapping between LM and lexicon word IDs