Sphinx-3 to 3.2
Mosur Ravishankar
School of Computer Science, CMU
Nov 19, 1999
Outline
Recognition problem
Search for the most likely word sequence matching the input speech, given the various models
Illustrated using Sphinx-3 (original)
Lextree search (Sphinx-3.2)
Search organization
Pruning
Experiments
Conclusion
Recognition Problem
Search organization
Continuous speech recognition
Cross-word triphone modeling
Language model integration
Pruning for efficiency
Search Organization in Sphinx-3
Flat lexical structure
Cross-word triphone modeling
Multiplexing at word beginning
Replication at word end
Single-phone words: combination of both
LM score applied upon transition into word
Trigram language model
However, only single best history maintained
Beam-based pruning
Long-tailed distribution of active HMMs/frame
Sphinx-3 Lexical Structure
Flat lexicon; every word treated independently.
Evaluating an HMM w.r.t. input speech: Viterbi search
Score the best state-sequence through HMM, given the input
Viterbi Search
Viterbi Search Summary
Instantaneous score: how well a given HMM state matches the speech input at a given time frame
Path: A sequence of HMM states traversed during a given segment of input speech
Path-score: Product of instantaneous scores and state transition probabilities corresponding to a given path
Viterbi search: An efficient lattice structure and algorithm for computing the best path score for a given segment of input speech
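To make the recursion concrete, here is a minimal C sketch of one frame of the within-HMM Viterbi update. The state count, array layout, and use of raw probabilities (a real decoder would use log-scores to avoid underflow) are illustrative assumptions, not Sphinx-3's actual data structures.

#define NSTATES 5

typedef struct {
    double score[NSTATES];            /* best path-score ending in each state */
    double trans[NSTATES][NSTATES];   /* state-transition probabilities */
} hmm_t;

/* obs[s]: instantaneous score of state s against the current frame */
static void hmm_viterbi_update(hmm_t *h, const double obs[NSTATES])
{
    double newscore[NSTATES];

    for (int s = 0; s < NSTATES; s++) {
        /* Viterbi criterion: keep only the best incoming path */
        double best = 0.0;
        for (int p = 0; p < NSTATES; p++) {
            double sc = h->score[p] * h->trans[p][s];
            if (sc > best)
                best = sc;
        }
        /* path-score: product of transition and instantaneous scores */
        newscore[s] = best * obs[s];
    }
    for (int s = 0; s < NSTATES; s++)
        h->score[s] = newscore[s];
}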
Single Word Recognition
Search all words in parallel
Initialize start state of every word with path-score 1.0
For each frame of input speech:
Update path scores within each HMM
Propagate exit state score from one HMM to initial state of its successor (using Viterbi criterion)
Select word with best exit state path-score
Continuous Speech Recognition
Add null transitions from word ends to beginnings.
Apply Viterbi search algorithm to the modified network
Q: How to recover the recognized word sequence?
The Backpointer Table
Each word exit recorded in the BP table:
Upon transitioning from an exited word A to another word B:
Inject pointer to BP table entry for A into start state of B.  (This identifies the predecessor of B.)
Propagate these pointers along with path-scores during Viterbi search
At end of utterance, identify best exited word and trace back using predecessor pointers
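A sketch of what a BP table entry might hold, and of the final traceback; the field names and the word_str lookup are hypothetical, not the actual Sphinx-3 structures.

#include <stdio.h>

/* Hypothetical BP table entry */
typedef struct {
    int    wid;      /* word that exited                        */
    int    frame;    /* frame at which it exited                */
    double score;    /* path-score at the word's exit state     */
    int    pred;     /* index of predecessor entry; -1 for <s>  */
} bp_t;

const char *word_str(int wid);   /* assumed wid-to-string lookup */

/* Trace back from the best final entry, printing the hypothesis
   in reverse order (last word first) */
static void backtrace(const bp_t *bptbl, int best_final)
{
    for (int b = best_final; b != -1; b = bptbl[b].pred)
        printf("%s (ends at frame %d)\n",
               word_str(bptbl[b].wid), bptbl[b].frame);
}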
The Backpointer Table (contd.)
Some additional information available from BP table:
All candidate words recognized during recognition
Word segmentations
Word segment acoustic scores
“Lattice density”: No. of competing word hypotheses at any instant
Useful for postprocessing steps:
Lattice rescoring
N-best list generation
Confidence measures
Beam Search (Pruning)
Exhaustive search over large vocabulary too expensive, and unnecessary
Use a “beam” to “prune” the set of active HMMs:
At start of each frame, find best available path-score S
Use a scale-factor f  (< 1.0) to set a pruning threshold T = S*f
Deactivate an HMM if no state in it has path score >= T (see the sketch after this list)
Effect: No. of active HMMs larger if no clear frontrunner
Two kinds of beams:
To control active set of HMMs
No. of active HMMs per frame typically 10-20% of total space
To control word exits taken (and recorded in BP table)
No. of words exited typically 10-20 per frame
Recognition accuracy essentially unaffected
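A sketch of the per-frame beam test; hmm_best_score and deactivate are assumed helpers, and scores are kept as probabilities to match the description above (a log-domain decoder would subtract a fixed beam width instead).

double hmm_best_score(const hmm_t *h);   /* assumed: best state score    */
void   deactivate(hmm_t *h);             /* assumed: drop from active set */

/* Beam pruning for one frame: threshold T = S * f, with f < 1.0 */
static void prune_hmms(hmm_t **active, int n_active, double f)
{
    double best = 0.0;
    for (int i = 0; i < n_active; i++)
        if (hmm_best_score(active[i]) > best)
            best = hmm_best_score(active[i]);

    double thresh = best * f;
    for (int i = 0; i < n_active; i++)
        if (hmm_best_score(active[i]) < thresh)   /* no state >= T */
            deactivate(active[i]);
}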
Incorporating a Language Model
Language models essential for recognition accuracy
Reduce word error rate by an order of magnitude
Reduce active search space significantly
Implementation: associate LM probabilities with transitions between words.
Bigram Backoff Language Model
Two issues with large vocabulary bigram LMs:
With vocabulary size V and N word exits per frame, NxV cross-word transitions per frame
Bigram probabilities very sparse; mostly “backoff” to unigrams
Optimize cross-word transitions using “backoff node”:
Viterbi decision at backoff node selects single-best predecessor
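A sketch of the backoff-node optimization: the N exited words first compete at a single shared node (the Viterbi decision), the V unigram transitions are then taken once from that node, and only word pairs with explicit bigram probabilities get direct transitions. All names here (backoff_wt, enter_word_max, the successor-list layout) are assumptions.

/* Assumed LM helpers and a sparse bigram successor list */
typedef struct bg_s { int wid; double prob; struct bg_s *next; } bg_t;
double backoff_wt(int wid);
double unigram_prob(int wid);
bg_t  *bigram_list(int wid);
void   enter_word_max(int wid, double score);   /* Viterbi max on entry */

static void cross_word_transitions(const int *exit_wid,
                                   const double *exit_score,
                                   int n_exits, int vocab_size)
{
    /* Viterbi decision at the backoff node: single best predecessor */
    double backoff_score = 0.0;
    for (int i = 0; i < n_exits; i++) {
        double sc = exit_score[i] * backoff_wt(exit_wid[i]);
        if (sc > backoff_score)
            backoff_score = sc;
    }

    /* V unigram transitions, taken once from the backoff node */
    for (int w = 0; w < vocab_size; w++)
        enter_word_max(w, backoff_score * unigram_prob(w));

    /* Sparse explicit bigrams may still beat the backoff path */
    for (int i = 0; i < n_exits; i++)
        for (bg_t *bg = bigram_list(exit_wid[i]); bg; bg = bg->next)
            enter_word_max(bg->wid, exit_score[i] * bg->prob);
}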
Cross-Word Triphone Modeling
Sphinx uses “triphone” or “phoneme-in-context” HMMs
Cross-word transitions use appropriate exit-model, and inject left-context into entry state
Sphinx-3 Search Algorithm
initialize start state of <S> with path-score = 1;
for each frame of input speech {
        evaluate all active HMMs; find best path-score and pruning thresholds;
        for each active HMM {
                if above pruning threshold {
                        activate HMM for next frame;
                        transition to and activate successor HMM within word, if any;
                        if word-final HMM and above word-pruning threshold
                                record word-exit in BP table;
                }
        }
        transition from words exited into initial state of entire lexicon
                (using the LM), and activate HMMs entered;
}
find final </S> BP table entry and back-trace through table to retrieve result;
Lextree Search: Motivation
Most active HMMs are word-initial models; the active count decays rapidly at later positions within words
On 60K-word Hub-4 task, 55% of active HMMs are word-initial
(Same reason for handling left/right contexts differently.)
But the no. of distinct word-initial model types is much smaller:
Use a “prefix-tree” structure to maximize sharing among words
Lextree Structure in Sphinx-3.2
Nodes shared if triphone “State-Sequence ID” (SSID) identical
Leaf (word-final) nodes not shared
In 60K-word BN task, word-initial models reduced ~50x
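A sketch of a lextree node keyed by SSID (hmm_t as in the earlier sketch; the layout and names are illustrative). Two words share a node only if the triphone state-sequence ID at that position is identical, and leaf nodes are never shared, so only they carry a word id.

typedef struct lextree_node_s {
    int    ssid;                      /* triphone state-sequence ID     */
    hmm_t  hmm;                       /* HMM instance for this node     */
    int    wid;                       /* word id at leaf nodes; -1 else */
    struct lextree_node_s **child;    /* successors, shared by SSID     */
    int    n_child;
} lextree_node_t;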
Cross-Word Triphones (left context)
Root nodes replicated for left context
Again, nodes shared if SSIDs identical
During search, only a few distinct incoming left contexts occur at any time, so only a few copies are activated
Cross-Word Triphones (right context)
Leaf nodes use “composite” SSID models
Simplifies lextree and backpointer table implementation
Simplifies cross-word transitions implementation
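One plausible reading of the composite model, sketched here with assumed names and structure: a composite state's score is the best over the context-specific senones it stands in for, so the leaf need not be replicated per right context.

/* Composite state score = best over member senones (one member per
   possible right context); senone_score holds this frame's
   instantaneous scores.  Assumed structure, not the Sphinx-3.2 API. */
static double composite_state_score(const int *member, int n_members,
                                    const double *senone_score)
{
    double best = senone_score[member[0]];
    for (int i = 1; i < n_members; i++)
        if (senone_score[member[i]] > best)
            best = senone_score[member[i]];
    return best;
}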
Lextree Search: LM Integration
Problem: LM probabilities cannot be determined upon transition to lextree root nodes
Root nodes shared among several unrelated words
Several solutions possible:
Incremental evaluation, using composite LM scores
Lextree replication (Ney, Antoniol)
Rescoring at every node (BBN)
Post-tree evaluation (Sphinx-II)
LM Integration: Lextree Replication
Incremental LM score accumulation, e.g. with a bigram LM.
Large computation and memory requirements
Overhead for dynamic lextree creation/destruction
Lextree Copies With Explicit Backoff
Again, incremental LM score accumulation.
Still, large computation/memory requirements and overhead for dynamic lextree maintenance
Multiple LM transitions between some word pairs
Post-Lextree LM Evaluation (Sphinx-II)
Single lextree
Null transitions from leaf nodes back to root nodes
No LM score upon transition into root or non-leaf node of lextree
If reached a leaf node for word W:
Find all possible LM histories of W (from BP table)
Find LM scores for W w.r.t. each LM history
Choose best resulting path-score for W
Drawbacks:
Inexact acoustic scores
Root node evaluated w.r.t. a single left context, but resulting score used w.r.t. all histories (with possibly different left contexts)
Impoverished word segmentations
Word Segmentation Problem
Q: Which transition wins?
Flat lexicon (separate model per word):
At A: word ninety entered with LM score P(ninety | ninety)
At B: word ninety entered with P(ninety | nineteen)
Since the latter is much better, it prevails over the former
Result: correct recognition, and segmentation for ninety
Word Segmentation Problem (contd.)
Tree lexicon:
At A: root node for ninety entered without any LM score
At B: Attempt to enter root node for ninety again
Transition may or may not succeed (no LM score used)
At C: obtain LM score for ninety w.r.t. all predecessors
If transition at B failed, the only candidate predecessor is ninety; result: incorrect segmentation for ninety(2), incorrect recognition
Lextree-LM Integration in Sphinx-3.2
Post-lextree LM scoring (as above); however:
Limited, static lextree replication
Limits memory requirements
No dynamic lextree management overhead
Transitions into lextrees staggered across time:
At any time, only one lextree entered
“-epl” (entries per lextree) parameter: no. of consecutive frames for which one lextree is entered, before switching to the next
More word segmentations (start times) survive
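A minimal sketch of the staggering schedule, assuming n_lextrees static copies and the hypothetical names below: with epl frames per entry, cross-word transitions in any given frame enter exactly one copy.

/* Which lextree copy receives cross-word transitions this frame;
   epl = "-epl", the no. of frames one lextree stays entered */
static int lextree_to_enter(int frame, int epl, int n_lextrees)
{
    return (frame / epl) % n_lextrees;
}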
Full LM histories; if reached a leaf node for word W:
Find all possible LM histories of W (from BP table)
Include LM scores for W w.r.t. each LM history
Create a separate BP table entry for each resulting history
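A sketch of the leaf-node evaluation with full LM histories: each candidate predecessor from the BP table contributes its own LM score and its own new BP entry, rather than only the single best surviving. The bp_* and lm_prob helpers are assumptions.

int    bp_first(int frame);           /* first BP entry exiting at frame */
int    bp_next(int b);                /* next such entry; -1 at end      */
int    bp_hist(int b);                /* LM history represented by b     */
double lm_prob(int wid, int hist);    /* P(wid | history)                */
void   bp_table_add(int wid, int frame, double score, int pred);

/* On reaching the leaf node for word w (names assumed); candidate
   LM histories are the BP entries that exited where w started */
static void leaf_exit(int w, int w_start_frame, int cur_frame,
                      double leaf_score /* acoustic path-score */)
{
    for (int b = bp_first(w_start_frame); b != -1; b = bp_next(b)) {
        double score = leaf_score * lm_prob(w, bp_hist(b));
        /* Sphinx-3.2: a separate BP entry per resulting history,
           instead of only the single best predecessor */
        bp_table_add(w, cur_frame, score, /* pred = */ b);
    }
}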
Pruning in Sphinx-3.2
Pure beam-pruning has long-tailed distribution of active HMMs/frame
Absolute pruning to control worst-case performance:
Max. active HMMs per frame
Implemented approximately, using “histogram pruning” (avoids an expensive sorting step; see the sketch below)
Max. unique words exiting per frame
Max. LM histories saved in BP table per frame
Word error rate unaffected
Additional beam for lextree-internal, cross-HMM transitions
Unigram “lookahead” scores used in lextree for yet more pruning
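A sketch of histogram pruning, assuming integer log-scores and illustrative names: bucket the active HMMs' best scores, then sweep from the top bin down until the max-active budget is covered; the bin boundary becomes the approximate threshold, avoiding a full sort.

/* Returns an approximate threshold such that roughly max_active HMMs
   have best-score >= threshold (integer log-scores assumed) */
static int histogram_threshold(const int *best_score, int n_active,
                               int max_active, int score_min,
                               int bin_width)
{
    enum { NBINS = 256 };
    int bin[NBINS] = { 0 };

    for (int i = 0; i < n_active; i++) {
        int b = (best_score[i] - score_min) / bin_width;
        if (b < 0) b = 0;
        if (b >= NBINS) b = NBINS - 1;
        bin[b]++;
    }

    /* Sweep from the best (highest-score) bin downward */
    int count = 0, b;
    for (b = NBINS - 1; b > 0; b--) {
        count += bin[b];
        if (count >= max_active)
            break;
    }
    return score_min + b * bin_width;
}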
Sphinx-3.2 Performance
1997 BN eval set, excluding F2 (telephone speech)
6K tied-state CHMM, 20 density/state model (1997)
60K vocab, trigram LM
Sphinx-3.2 Performance (contd.)
1998 BN Eval set
5K tied-state CHMM, 32 density/state model (1998)
60K vocab, trigram LM
Sphinx-3.2 Performance (contd.)
Effect of absolute pruning parameters (1998 BN):
(Per frame) computation stats for each utterance
Distribution of stats over entire test set (375 utts)
Absolute pruning highly effective in controlling variance in computational cost
Conclusions
10x CHMM system available (thanks to the P-III!)
With lextree implementation, only about 1-2% of total HMMs active per frame
Order of magnitude fewer compared to flat lexicon search
Lextree replication improves WER noticeably
Absolute pruning parameters improve worst-case behavior significantly, without penalizing accuracy
When active search space grows beyond some threshold, no hope of correct recognition anyway
What Next?
Tree lexicon still not as accurate as flat lexicon baseline
Residual word segmentation problems?
Try lextree replication?
Use of composite SSID model at leaf nodes?
Parameters not close to optimal?
HMM state acoustic score computation now dominant
Back to efficient Gaussian selection/computation