Carnegie Mellon University

School of Computer Science
Department of Electrical Engineering & Computer Science

Learning to use the CMU SPHINX-III Automatic Speech Recognition system

Introduction
Setting up your system and getting familiar with it through a preliminary training run
- How to perform a preliminary training run
- How to perform a preliminary decode
Miscellaneous tools
What is expected in this lab
How to train, and key training issues
How to decode, and key decoding issues
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4

Introduction

In this tutorial, you will learn to handle a complete state-of-the-art HMM-based speech recognition system. The system you will use is the SPHINX-III system, designed at Carnegie Mellon University. SPHINX is one of the best and most versatile recognition systems in the world today. At the end of this lab, you will be in a position to train and use this system for your own recognition tasks. More importantly, through your exposure to this system, you will have learned about several important issues involved in implementing a real HMM-based ASR system, and some of the latest engineering solutions to the problems related to these issues.

Unless you are using the Sphinx group's machines (a.k.a. the cartoon network), you may want to check the Open Source Tutorial instead.

What will be given to you

An HMM-based system, like all other speech recognition systems, functions by first learning the characteristics (or parameters) of a set of sound units, and then using what it has learned about the units to find the most probable sequence of sound units for a given speech signal. The process of learning about the sound units is called training. The process of using the knowledge acquired to deduce the most probable sequence of units in a given signal is called decoding, or simply recognition.

Accordingly, you will be given those components of the SPHINX system that you can use for training and for recognition. In other words, you will be given the SPHINX trainer and the SPHINX decoder.

Components provided for training

The SPHINX trainer that will be provided to you consists of a set of programs which have been compiled for two operating systems: LINUX and ALPHA . You will be provided the set of all executables which constitute the trainer. The source code will be provided too, in case you are curious about the software aspects of SPHINX or want to implement any small changes to the code based on your own ideas. Working with the source code is, however, not expected in this lab.

The trainer learns the parameters of the models of the sound units using a set of sample speech signals. This is called a "training database". A training database comprised of 1600 speech signals will also be provided to you. The trainer also needs to be told which sound units you want it to learn the parameters of, and at least the sequence in which they occur in every speech signal in your training database. This information is provided to the trainer through a file called the "transcript file", in which the sequence of words and non-speech sounds are written exactly as they occurred in a speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal. The trainer then looks into a "dictionary" which maps every word to a sequence of sound units, to derive the sequence of sound units associated with each signal. Thus, in addition to the speech signals, you will also be given a set of transcripts for the database (in a single file) and two dictionaries, one in which legitimate words in the language are mapped sequences of sound units (or sub-word units), and another in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units. We will refer to the former as the "language dictionary" and the latter as the "filler dictionary".

In summary, the components provided to you for training will be:

The trainer executables
The acoustic signals
The corresponding transcript file
A language dictionary
A filler dictionary

Components provided for decoding

The decoder also consists of a set of programs, which have been compiled to give a single executable that will perform the recognition task, given the right inputs. The inputs that need to be given are: the trained acoustic models, a model index file, a language model, a language dictionary, a filler dictionary, and the set of acoustic signals that need to be recognized. The data to be recognized are commonly referred to as "test data".

In summary, the components provided to you for decoding will be:

The decoder executable
The language dictionary
The filler dictionary
The language model
The test data

In addition to these components, you will need the acoustic models that you have trained for recognition. You will have to provide these to the decoder. While you train the acoustic models, the trainer will generate appropriately named model-index files. A model-index file simply contains numerical identifiers for each state of each HMM, which are used by the trainer and the decoder to access the correct sets of parameters for those HMM states. With any given set of acoustic models, the corresponding model-index file must be used for decoding. If you would like to know more about the structure of the model-index file, you will find a description at the following URL: http://www.speech.cs.cmu.edu/sphinxman/fr4.html under the link Creating the CI model definition file.

Setting up your system and getting familiar with it through a preliminary training run

First make sure that you have accounts on all the speech queue machines. Currently, these are the "cartoon character" machines (astro, scooby, etc), and the queue we use is TORQUE. Check the list of the so called Cartoon Network machines. Additionally, if you are using the TORQUE queue in the "cartoon network" at CMU, you will find more information on how to use it, including some troubleshooting tips, in our internal user guide.

Log into one of those machines and create a directory from which you will do all this lab work. This directory should preferably be a local directory (physically located in one of the /usrs partitions), so that you do not need to worry about Kerberos authentication to AFS. Additionally, make sure that /usr/bin is in your path, and include the following lines in the .login file in your home directory. If you are using the default SCS .login file, add these just after the line that says "Add environment variables immediately below", rather than at the end of the file.

if ($?PBS_ENVIRONMENT) then
set _no_kinit = 'If set, dont kinit if unauthenticated'
cd $PBS_O_WORKDIR
endif

Save the tarball (last updated on Tue Oct 23 18:57:45 EDT 2007) to your working directory, which will be designated as your base directory, and then unzip it and untar it using:

gunzip -c tutorial.tar.gz | tar xf -

If you are familiar with CVS, you can also retrieve the code by doing:

cvs -d:ext:scrappy.speech.cs.cmu.edu:/usr0/robust/cvsroot co tutorial

Alternatively, if you intend to use a single machine, you can download the single machine tarball to your working directory, and unzip and untar it using:

gunzip -c tutorial_single_machine.tar.gz | tar xf -

In any way, this will create a directory named TUTORIAL/ in the base directory and it will install all the files necessary for you to train and test SPHINX for this lab. Be aware that each version has its own idiosyncrasies. In the single-machine version, the scripts make use of launching jobs in the background, therefore make sure that the jobs are finished before you go to the next step (check if any job is still running by doing ps -auxg | grep username and check if the log files are not being modified anymore). For the batch machines, you can use both LSF and TORQUE queues. You'll have to modify the file TUTORIAL/SPHINX3/c_scripts/variables.def to make sure the job submission and deletion commands are correct. The default works with the TORQUE queue, since this is the one currently installed in our queue machines. Also, you may want to add the queue's binaries to your path. For the TORQUE queue, add /usr/bin, if it is not already there, to your path in the .login file located in your home directory.

Within the TUTORIAL/ directory, you will find a directory called SPHINX3/ . This directory contains the following subdirectories:

s3trainer/ : The SPHINX trainer source code and executables.
lists/: The set of all user-supplied training files. You will be training the SPHINX system using a database called the Resource Management (RM) database. You can find more information on the Resource Management database at LDC. In the lists/ directory you will find a list of training utterances (train.wavlist) from this database. If you wish to listen to any of the files listed in train.wavlist, give the command waveform_play [waveform filename]. In addition to train.wavlist you will also find a transcript file (RM.1600.trans) containing transcriptions corresponding to the files listed in train.wavlist , a language dictionary (RM.dictionary), a filler dictionary (filler.dict), and a phone list (RM.phonelist) containing all the sub-word units used in the dictionary as well as an additional symbol SIL for silence.
c_scripts/ : The set of scripts that you will use to train continuous density HMMs for the RM database. This directory also contains a script that you will use for computing features (MFCCs) for your training and test utterances. Remember that the recognition system will be trained on these features and not on the waveforms directly.
decoding/ : This contains a list of test utterances (test.wavlist), scripts for performing recognition using the models you have trained, and a language model (RM.bigram.Z.DMP). The language model is a binary file that has been encoded for optimal use by the decoder. You are not expected to modify this file. In this tutorial, the dictionary that you will use for decoding is the same as the one you will use for training. Note that this is not usually the case. For tasks where the vocabulary of the test data is expected to be different from that of the training data, the dictionary that is used for decoding is different from that used for training.

How to perform a preliminary training run

The scripts should work "from the box", since the base directory will be automatically detected. If for any reason this doesn't work, you must enter the directory SPHINX3/c_scripts/ and edit the file called variables.def. All you have to do is to enter the full path to your SPHINX3 directory (including the SPHINX3 directory name) in the dotted space (delete the dots, of course!).

The system does not directly work with acoustic signals. The signals are first transformed into a sequence of feature vectors, which are used in place of the actual acoustic signals. To perform this transformation (or parameterization) from within the directory SPHINX3/c_scripts, type the following command on the commandline

./compute_mfcc.csh ../lists/train.wavlist

This script will compute, for each training utterance, a sequence of 13-dimensional vectors (feature vectors) consisting of the Mel-frequency cepstral coefficients (MFCCs). Note that the list of wave files contains a list with the full paths to the audio files. These paths are site-specific. Currently, they point to the location /net/kermit/usr6/data/resource-management, accessible from any of the Sphinx group machines at CMU. You may have to change this, as well as the test.wavlist file, if the location of data is different. This step takes approximately 10 minutes to complete on a fast network, but may take up to 40 minutes if the network is slow. As it is running, you might want to continuing reading. The MFCCs will be placed automatically in a directory called SPHINX3/feature_files/. Note that the type of features vectors you compute from the speech signals for training and recognition, outside of this tutorial, is not restricted to MFCCs. You could use any reasonable parameterization technique instead, and compute features other than MFCCs. The SPHINX can use features of any type or dimensionality. In this tutorial, however, you will use MFCCs for two reasons: a) they are currently known to result in the best recognition performance in HMM-based systems under most acoustic conditions, and b) this tutorial is not intended to cover the signal processing aspects of speech parameterization and only aims for a standard usable platform in this respect. Now you can begin to train the system.

In your current directory (SPHINX3/c_scripts), there are five directories numbered sequentially from 01* through 05*. Enter each directory starting from 01 and run the script called slave*.csh within that directory. If the slave*.csh script requires a y/n response from you, enter y for yes. These scripts will launch jobs on the CMU Speech network, and the jobs will take a few minutes each to run through. Before you run any script, note the directory contents of SPHINX3. After you run each slave*.csh note the contents again. Several new directories will have been created. These directories contain files which are being generated in the course of your training. At this point you need not know about the contents of these directories, though some of the directory names may be self explanatory and you may explore them if you are curious. After you run each slave*.csh script, keep a tab on the jobs by typing "qstat" from the command line. Wait till the response to this command is "No Unfinished Job Found" or till the command prompt returns (in case it has disappeared). Only then enter the next directory in the specified sequence and launch the slave*.csh in that directory. Repeat this process until you have run the slave*.csh in all five directories.

Note that in the process of going through the scripts in 01* through 05*, you will have generated several sets of acoustic models, each of which could be used for recognition. Once the jobs launched from 01* have run to completion, you will have trained the Context-Independent (CI) models for the sub-word units in'your dictionary. When the jobs launched from the 02* directory run to completion, you' will have trained the models for Context-Dependent sub-word units (triphones) with untied states. These are called CD-untied models and are necessary for buiding decision trees in order to tie states. The jobs in 03* will build decision trees for each state of each sub-word unit. The jobs in 04* will prune the decision trees and tie the states. Following this, the jobs in 05* will train the final models for the triphones in your training corpus. These are called CD-tied models. The CD-tied models are trained in many stages. We begin with 1 Gaussian per state HMMs, following which we train 2 Gaussian per state HMMs and so on till the desired number of Gaussians per State have been trained. The jobs in 05* will automatically train all these intermediate CD-tied models. At the end of any stage you may use the models for recognition. Remember that you may decode even while the training is in progress, provided you are certain that you have crossed the stage which generates the models you want to decode with. Before you decode, however, read the section called How to decode, and key decoding issues to learn a little more about decoding. This section also provides all the commands needed for decoding with each of these models.

You have now completed your training. You will find the parameters of the final 8 Gaussian/state 3-state CD-tied acoustic models (HMMs) with 1000 tied states in a directory called SPHINX3/model_parameters/SPHINX3.cd_continuous_8gau/ . You will also find a model-index file for these models called SPHINX3.1000.mdef in SPHINX3/model_architecture/ . This file, as mentioned before, is used by the system to associate the appropriate set of HMM parameters with the HMM for each sound unit you are modeling. The training process will be explained in greater detail later in this document.

How to perform a preliminary decode

Decoding is relatively simple to perform. First, compute MFCC features for all of the test utterances in the test set. To do this, enter the directory SPHINX3/c_scripts and, from the command line, type

./compute_mfcc.csh ../lists/test.wavlist (This will take approximately 10 minutes to run)

You are now ready to decode. First type

cd ../decoding

Then type the command

./launch_decode.cd.8gaumodels

This uses all of the components provided to you for decoding, including the acoustic models and model-index file that you have generated in your preliminary training run, to perform recognition on your test data. When the recognition job is complete, you can compute the recognition Word Error Rate (WER%) by running the following command:

./compute_acc.cd.8.csh

This will generate a detailed analysis file for your recognition hypotheses called result/cd.8.match.align . It will also print out the recognition accuracy and word error rate on the screen which will look like this:

WORD ACCURACY=  92.713% ( 5267/ 5681)  ERRORS=  9.470% (  538/ 5681)

The second percentage number (9.470%) is the WER% and has been obtained using the 8 Gaussians per state HMMs that you have just trained in the preliminary training run. Other numbers in the above output will be explained later in this document. If your WER% is not within +/- 1% of the number you see above, there may be some error in your training. If this happens, report the problem.

Miscellaneous tools

Three tools are provided to you in the form of three executables. You will find these executables in the directory SPHINX3/s3trainer/bin.linux/. The tools are described below. Instructions on how to use these tools are in toolname_instructions files which you will find alongside the executables.

Phone and triphone frequency analysis tool: This is the executable mk_mdef_gen . You can use this to count the relative frequencies of occurrence of your basic sound units (phones and triphones) in the training database. Since HMMs are statistical models, what you are aiming for is to design your basic units such that they occur frequently enough for their models to be well estimated, while maintaining enough information to minimize confusions between words. This issue is explained in greater detail in Appendix 1.
Tool for viewing the model parameters being estimated: This is the executable printp .
Tool for viewing the MFCC files: This is the executable cepview.

What is expected in this lab

You are expected to train the SPHINX system using all the components provided for training. The trainer will generate a set of acoustic models. You are expected to use these acoustic models and the rest of the decoder components to recognize what has been said in the test data set. You are expected to compare your recognition output to the "correct" sequence of words that have been spoken in the test data set (these will also be given to you), and find out the percentage of errors you made (the word error rate or WER%).

In the course of training the system, you are expected to use what you have learned about HMM-based ASR systems to manipulate the training process or the training parameters in order to achieve the lowest WER% on the test data. You may also adjust the decoder parameters for this and study the recognition outputs to re-decode with adjusted parameters, if you wish. At the end of this lab, you are expected to report what you did, and the reasons why you chose to do what you did. The only question you will answer at the end of this lab is:

Q. What is your word error rate, what did you do to achieve it, and why?

A satisfactory answer to this question would comprise of any well thought out and justified manipulation of any training file(s) or parameter(s). Remember that speech recognition is a complex engineering problem and that you are not expected to be able to manipulate all aspects of the system in this single lab session. The answer is expected to be written up in one or two pages.

How to train, and key training issues

You are now ready to begin your lab exercise. For every training and decoding run, you will need to first give it a name. We will refer to the experiment name of your choice by [experimentname]. For example, you can name the first experiment exp1, and the second experiment exp2, etc. Your choice of [experimentname] will be appended automatically to all the files for that training and recognition run for easy identification. All directories and files needed for this experiment will reside in a directory named TUTORIAL/[experimentname]. To begin a new experiment enter the directory TUTORIAL/ and give the command

SPHINX3/c_scripts/create_newexpt.csh [experimentname]

This will create a setup similar to the preliminary training and test setup in a directory called experimentname/ in your base directory. After this you must work entirely within this [experimentname] directory.

Now that you have been through training by manually launching each step, and hopefully have an idea of what happens in each step, you can edit the file c_scripts/variables.def and change the variable named "run_all" to 1. It has a default value of 0. Any value different from 0 will cause the script to launch the next step automatically.

Your tutorial exercise begins with training the system using the MFCC feature files that you have already computed during your preliminary run. However, when you train this time, you will be required to take certain decisions based on what you have learned so far in your coursework and the information that is provided to you in this document. The decisions that you take will affect the quality of the models that you train, and thereby the recognition performance of the system.

You must now go through the following steps in sequence:

Parameterize the training database. You have already done this for every training utterance during your preliminary run. At this point you don't have to do anything further except to note that in the speech recognition field it is common practice to call each file in a database an "utterance". The signal in an "utterance" may not necessarily be a full sentence. You can view the cepstra in any file by using the tool cepview and following the instructions in its corresponding instruction file.
Decide what sound units you are going to ask the system to train. To do this, look at the language dictionary [experimentname]/lists/RM.dictionary and the filler dictionary [experimentname]/lists/filler.dict, and note the sound units in these. A list of all sound units in these dictionaries is also written in the file RM.phonelist . Study the dictionaries and decide if the sound units are adequate for recognition. In order to be able to perform good recognition, sound units must not be confusable, and must be consistently used in the dictionary. Look at Appendix 1 for an explanation.
Also check whether these units, and the triphones they can form (for which you will be building models ultimately), are well represented in the training data. It is important that the sound units being modeled be well represented in the training data in order to estimate the statistical parameters of their HMMs reliably. To study their occurrence frequencies in the data, you may use the tool mk_mdef_gen . Instructions for the use of this tool are given in its corresponding instruction file. Based on your study, see if you can come up with a better set of sound units to train.
You can restructure the set of sound units given in the dictionaries by merging or splitting existing sound units in them. By merging of sounds units we mean the clustering of two different sound units into a single entity. For example, you may want to model the sounds "Z" and "S" as a single unit (instead of maintaining them as separate units). To merge these units, which are represented by the symbols Z and S in the language dictionary given, simply replace all instances of Z and S in the dictionary by a common symbol (which could be Z_S, or an entirely new symbol). By splitting of sound units we mean the introduction of multiple new sound units in place of a single sound unit. This is the inverse process of merging. For example, if you found a language dictionary where all instances of the sounds Z and S were represented by the same symbol, you might want to replace this symbol by Z for some words and S for others. Sound units can also be restructured by grouping specific sequences of sound into a single sound. For example, you could change all instances of the sequence "IX D" into a single sound IX_D. This would introduce a new symbol in the dictionary while maintaining all previously existing ones. The number of sound units is effectively increased by one in this case. There are other techniques used for redefining sound units for a given task. If you can think of any other way of redefining dictionaries or sound units that you can properly justify, we encourage you to try it.
Once you re-design your units, alter the file RM.phonelist accordingly. Make sure you do not have spurious empty spaces or lines in this file.
Alternatively, you may bypass this design procedure and use the phonelist and dictionaries as they have been provided to you. You will have occasion to change other things in the training later.
Once you have fixed your dictionaries and the phonelist file, edit the file variables.def in [experimentname]/c_scripts/ to enter the following training parameters:
- set dictionary = your training dictionary with full path (do not change if you have decided not to change the dictionary)
- set fillerdict = your filler dictionary with full path (do not change if you have decided not to change the dictionary)
- set phonefile = your phonelist with full path (do not change if you have decided not to change the dictionary)
- set statesperhmm = set this to either 3 or 5 for this tutorial. The number of states in an HMMs is related to the time-varying characteristics of the sound units. Sound units which are highly time-varying need more states to represent them. The time-varying nature of the sounds is also partly captured by the "skipstate" variable that is described below.
- set skipstate = set this to "no" or "yes" without the double quotes. This variable controls the topology of your HMMs. When set to "yes", it allows the HMMs to skip states. However, note that the HMM topology used in this system is a strict left-to-right Bakis topology. If you set this variable to "no", any given state can only transition to the next state. In all cases, self transitions are allowed. See the figures in >Appendix 2 for further reference. You will find the HMM toplogy file in the directory called model_architecture/ in your current base directory (experimentname).
- set gaussiansperstate = set this to any number from 4 to 8. Going beyond 8 is not advised because of the small training data set you have been provided with. The distribution of each state of each HMM is modeled by a mixture of Gaussians. This variable determines the number of Gaussians in this mixture. The number of HMM parameters to be estimated increases as the number of Gaussians in the mixture increases. Therefore, increasing the value of this variable may result in less data being available to estimate the parameters of every Gaussian. However, increasing its value also results in finer models, which can lead to better recognition. Therefore, it is necessary at this point to think judiciously about the value of this variable, keeping both these issues in mind. Remember that it is possible to overcome data insufficiency problems by sharing the Gaussian mixtures amongst many HMM states. When multiple HMM states share the same Gaussian mixture, they are said to be shared or tied. These shared states are called tied states (also referred to as senones). The number of mixtures you train will ultimately be exactly equal to the number of tied states you specify, which in turn can be controlled by the "n_tied_states" parameter described below.
- set n_tied_states = set this number to any value between 500 and 2500. This variable allows you to specify the total number of shared state distributions in your final set of trained HMMs (your acoustic models). States are shared to overcome problems of data insufficiency for any state of any HMM. The sharing is done in such a way as to preserve the "individuality" of each HMM, in that only the states with the most similar distributions are "tied". The n_tied_states parameter controls the degree of tying. If it is small, a larger number of possibly dissimilar states may be tied, causing reduction in recognition performance. On the other hand, if this parameter is too large, there may be insufficient data to learn the parameters of the Gaussian mixtures for all tied states. (An explanation of state tying is provided in >Appendix 3). If you are curious, you can see which states the system has tied for you by looking at the ascii file SPHINX3/model_architecture/[experimentname].n_tied_states.mdef . and comparing it with the file SPHINX3/model_architecture/[experimentname].untied.mdef. These files list the phones and triphones for which you are training models, and assign numerical identifiers to each state of their HMMs.
- set convergence_ratio = set this to a number between 0.1 to 0.001. This number is the ratio of the difference in likelihood between the current and the previous iteration of Baum-Welch to the total likelihood in the previous iteration. Note here that the rate of convergence is dependent on several factors such as initialization, the total number of parameters being estimated, the total amount of training data, and the inherent variability in the characteristics of the training data. The more iterations of Baum-Welch you run, the better you will learn the distributions of your data. However, the minor changes that are obtained at higher iterations of the Baum-Welch algorithm may not affect the performance of the system. Keeping this in mind, decide on how many iterations you want your Baum-Welch training to run in each stage. This is a subjective decision which has to be made based on the first convergence ratio which you will find written at the end of the logfile for the second iteration of your Baum-Welch training (SPHINX3/logdir/0*/[experimentname].*.2.norm.log. Usually, 5-15 iterations are enough, depending on the amount of data you have. Do not train beyond 15 iterations.
- set maxiter = set this to an integer number between 5 to 15. This limits the number of iterations of Baum-Welch to the value of maxiter.
Once you have made all the changes desired, you must train a new set of models. You can accomplish this by re-running all the slave*.csh scripts from the directories [experimentname]/c_scripts/01* through [experimentname]/c_scripts/05*.

How to decode, and key decoding issues

The first step in decoding is to compute the mfcc features for your test utterances. Since you have already done this in the preliminary run, you do not have to repeat the process here.
To decode, run one of the following commands from the directory [experimentname]/decoding:
- launch_decode.ci.1gaumodels
  This will use the CI models to decode your test set
- launch_decode.cd.untiedmodels
  This will use the CD-untied models to decode your test set
- launch_decode.cd.1gaumodels
  This will use the CD 1 Gaussian/state models to decode your test set
- launch_decode.cd.2gaumodels
  This will use the CD 2 Gaussian/state models to decode your test set
- launch_decode.cd.4gaumodels
  This will use the CD 4 Gaussian/state models to decode your test set
- launch_decode.cd.8gaumodels
  This will use the CD 8 Gaussian/state models to decode your test set
Before you run the script(s), you may want to adjust the language weight. You can modify the variable langwt in the file named [experimentname]/c_scripts/variables.def. A value between 6 and 13 is recommended, and by default it is 9.5. The language model and the language weight are described in Appendix 4. Remember that the language weight decides how much relative importance you will give to the actual acoustic probabilities of the words in the hypothesis. A low language weight gives more leeway for words with high acoustic probabilities to be hypothesized, at the risk of hypothesizing spurious words.

You may decode several times with different language weights, without re-training the acoustic models, to decide what is best for you.

To find the word error rate (scoring the hypotheses), run one of the following commands:

compute_acc.ci.csh
This will score the CI models
compute_acc.cduntied.1.csh
This will score the CD-untied models
compute_acc.cd.1.csh
This will score the CD 1 Gaussian/state models
compute_acc.cd.2.csh
This will score the CD 2 Gaussian/state models
compute_acc.cd.4.csh
This will score the CD 4 Gaussian/state models
compute_acc.cd.8.csh
This will score the CD 8 Gaussian/state models

It will give a screen output of the form:

WORD ACCURACY= 91.274% (29161/31949) ERRORS= 13.600% ( 4345/31949)

In this line the first percentage indicates the percentage of words in the test set that were correctly recognized. However, this is not a sufficient metric - it is possible to correctly hypothesize all the words in the test utterances merely by hypothesizing a large number of words for each word in the test set. The spurious words, called insertions, must also be penalized when measuring the performance of the system. The second percentage indicates the number of hypothesized words that were erroneous as a percentage of the actual number of words in the test set. This includes both words that were wrongly hypothesized (or deleted) and words that were spuriously inserted. Since the recognizer can, in principle, hypothesize many more spurious words than there are words in the test set, the percentage of errors can actually be greater than 100.

In the example above, of the 31949 words in the reference test transcripts 29161 words (91.27%) were correctly hypothesized. In the process the recognizer hypothesized 4345 spurious words (these include insertions, deletions and substitutions). You will find your recognition hypotheses in a file called *.match in the directory [experimentname]/decoding/result/.

compute_acc.*.csh script will also generate a file called [experimentname]/decoding/result/*.match.align in which your hypotheses are aligned against the reference sentences. You can study this file to examine the errors that were made. The list of confusions at the end of this file allows you to subjectively determine why particular errors were made by the recognizer. For example, if the word "FOR" has been hypothesized as the word "FOUR" almost all the time, perhaps you need to correct the pronunciation for the word FOR in your decoding dictionary and include a pronunciation that maps the word FOR to the units used in the mapping of the word FOUR. Once you make these corrections, you must re-decode.

APPENDIX 1

If your transcript file has the following entries:

THIS CAR THAT CAT (file1)
CAT THAT RAT (file2)
THESE STARS (file3)

and your language dictionary has the following entries for these words:

CAT K AE T

CAR K AA R

RAT R AE T

STARS S T AA R S

THIS DH I S

THAT DH AE T

THESE DH IY Z

then the occurrence frequencies for each of the phones are as follows (in a real scenario where you are training triphone models, you will have to count the triphones too):

K 3 S 3

AE 5 IY 1

T 6 I 1

AA 2 DH 4

R 3 Z 1

Since there are only single instances of the sound units IY and I, and they represent very similar sounds, we can merge them into a single unit that we will represent by I_IY. We can also think of merging the sound units S and Z which represent very similar sounds, since there is only one instance of the unit Z. However, if we merge I and IY, and we also merge S and Z, the words THESE and THIS will not be distinguishable. They will have the same pronunciation as you can see in the following dictionary with merged units:

CAT K AE T

CAR K AA R

RAT R AE T

STARS S_Z T AA R S_Z

THIS DH I_IY S_Z

THAT DH AE T

THESE DH I_IY S_Z

If it is important in your task to be able to distinguish between THIS and THESE, at least one of these two merges should not be performed.

APPENDIX 2

APPENDIX 3

Consider the following sentence.

CAT THESE RAT THAT

Using the first dictionary given in Appendix 1, this sentence can be expanded to the following sequence of sound units:

Silences (denoted as <sil> have been appended to the beginning and the end of the sequence to indicate that the sentence is preceded and followed by silence. This sequence of sound units has the following sequence of triphones

K(sil,AE) AE(K,T) T(AE,DH) DH(T,IY) IY(DH,Z) Z(IY,R) R(Z,AE) AE(R,T) T(AE,DH) DH(T,AE) AE(DH,T) T(AE,sil)

where A(B,C) represents an instance of the sound A when the preceding sound is B and the following sound is C. If each of these triphones were to be modeled by a separate HMM, the system would need 33 unique states, which we number as follows:

K(sil,AE) 0 1 2

AE(K,T) 3 4 5

T(AE,DH) 6 7 8

DH(T,IY) 9 10 11

IY(DH,Z) 12 13 14

Z(IY,R) 15 16 17

R(Z,AE) 18 19 20

AE(R,T) 21 22 23

DH(T,AE) 24 25 26

AE(DH,T) 27 28 29

T(AE,sil) 30 31 32

Here the numbers following any triphone represent the global indices of the HMM states for that triphone. We note here that except for the triphone T(AE,DH), all other triphones occur only once in the utterance. Thus, if we were to model all triphones independently, all 33 HMM states must be trained. We note here that when DH is preceded by the phone T, the realization of the initial portion of DH would be very similar, irrespective of the phone following DH. Thus, the initial state of the triphones DH(T,IY) and DH(T,AE) can be tied. Using similar logic, the final states of AE(DH,T) and AE(R,T) can be tied. Other such pairs also occur in this example. Tying states using this logic would change the above table to:

K(sil,AE) 0 1 2

AE(K,T) 3 4 5

T(AE,DH) 6 7 8

DH(T,IY) 9 10 11

IY(DH,Z) 12 13 14

Z(IY,R) 15 16 17

R(Z,AE) 18 19 20

AE(R,T) 21 22 5

DH(T,AE) 9 23 24

AE(DH,T) 25 26 5

T(AE,sil) 6 27 28

This reduces the total number of HMM states for which distributions must be learned, to 29. But further reductions can be achieved. We might note that the initial portion of realizations of the phone AE when the preceding phone is R is somewhat similar to the initial portions of the same phone when the preceding phone is DH (due to, say, spectral considerations). We could therefore tie the first states of the triphones AE(DH,T) and AE(R,T). Using similar logic other states may be tied to change the above table to:

K(sil,AE) 0 1 2

AE(K,T) 3 4 5

T(AE,DH) 6 7 8

DH(T,IY) 9 10 11

IY(DH,Z) 12 13 14

Z(IY,R) 15 16 17

R(Z,AE) 18 19 20

AE(R,T) 21 22 5

DH(T,AE) 9 23 11

AE(DH,T) 21 24 5

T(AE,sil) 6 25 26

We now have only 27 HMM states, instead of the 33 we began with. In larger data sets with many more triphones, the reduction in the total number of triphones can be very dramatic. The state tying can reduce the total number of HMM states by one or two orders of magnitude.

In the examples above, state-tying has been performed based purely on acoustic-phonetic criteria. However, in a typical HMM-based recognition system such as SPHINX, state tying is performed not based on acoustic-phonetic rules, but on other data driven and statistical criteria. These methods are known to result in much better recognition performance.

APPENDIX 4

Language Model: Speech recognition systems treat the recognition process as one of maximum a-posteriori estimation, where the most likely sequence of words is estimated, given the sequence of feature vectors for the speech signal. Mathematically, this can be represented as

Word1 Word2 Word3 ... =
argmax_{Wd1 Wd2 ...}{P(feature vectors|Wd1 Wd2 ...) P(Wd1 Wd2 ...)} (1)

where Word1.Word2... is the recognized sequence of words and Wd1.Wd2... is any sequence of words. The argument on the right hand side of Equation 1 has two components: the probability of the feature vectors, given a sequence of words P(feature vectors| Wd1 Wd2 ...), and the probability of the sequence of words itself, P(Wd1 Wd2 ...) . The first component is provided by the HMMs. The second component, also called the language component, is provided by a language model.

The most commonly used language models are N-gram language models. These models assume that the probability of any word in a sequence of words depends only on the previous N words in the sequence. Thus, a 2-gram or bigram language model would compute P(Wd1 Wd2 ...) as

P(Wd1 Wd2 Wd3 Wd4 ...) = P(Wd1)P(Wd2|Wd1)P(Wd3|Wd2)P(Wd4|Wd3)... (2)

Similarly, a 3-gram or trigram model would compute it as

P(Wd1 Wd2 Wd3 ...) = P(Wd1)P(Wd2|Wd1)P(Wd3|Wd2,Wd1)P(Wd4|Wd3,Wd2) ... (3)

The language model provided for this tutorial is a bigram language model.

Language Weight: Although strict maximum a posteriori estimation would follow Equation (1), in practice the language probability is raised to an exponent for recognition. Although there is no clear statistical justification for this, it is frequently explained as "balancing" of language and acoustic probability components during recognition and is known to be very important for good recognition. The recognition equation thus becomes

Word1 Word2 Word3 ... =
argmax_{Wd1 Wd2 ...}{P(feature vectors|Wd1 Wd2 ...)P(Wd1 Wd2 ...)^alpha} (4)

Here alpha is the language weight. Optimal values of alpha typically lie between 6 and 11.

This page was created by Rita Singh. For comments, suggestions, or questions, contact Evandro Gouvêa.
Last update: 13 May 2002.

CAT	K	AE	T
CAR	K	AA	R
RAT	R	AE	T
STARS	S_Z	T	AA	R	S_Z
THIS	DH	I_IY	S_Z
THAT	DH	AE	T
THESE	DH	I_IY	S_Z

K(sil,AE)	0	1	2
AE(K,T)	3	4	5
T(AE,DH)	6	7	8
DH(T,IY)	9	10	11
IY(DH,Z)	12	13	14
Z(IY,R)	15	16	17
R(Z,AE)	18	19	20
AE(R,T)	21	22	23
DH(T,AE)	24	25	26
AE(DH,T)	27	28	29
T(AE,sil)	30	31	32