Programmers' Manual

	        The Lotec Speech Recognition Package


======= File Formats =======

Sun's .au file format is documented on the "audio" man page.  

For a file used as a template, the word it represents is stored in the
info_string field of the file header.  This string must have the
format "word-xxxx", where the string before the dash is the word name,
and the information after the dash is anything you like (for example,
the context in which the word was recorded).

If you generate templates using chopper, the info_string is set for
you automatically.  Otherwise, you may need to load the .au file into
Sun's soundtool program and use the "describe" button to set the
info_string field yourself.

(The info_string is used as follows.  Feat copies this string to the
"template_name" field when it writes an .fe file, and match reads this
field.  The word name (before the dash) is used by showmatch and judge
when determining whether the word spotting succeeded, and the
information after the dash is displayed but not otherwise used.)
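The "word-xxxx" convention above can be sketched as a few lines of code.  This is a hypothetical illustration, not code from the Lotec source; the function name is made up.

```python
def parse_info_string(info_string):
    """Split an info_string of the form "word-xxxx" into (word, extra).
    The word name is everything before the first dash; the rest is
    free-form context information (e.g. where the word was recorded)."""
    word, _dash, extra = info_string.partition("-")
    return word, extra

# e.g. parse_info_string("hello-quiet_room_take3")
# gives word "hello" and extra "quiet_room_take3"
```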


It helps to include useful information in the filename too.  Again,
template files created with chopper will have sensible names.

Note that times in .la files are in terms of milliseconds, but in .wh
files they are in terms of frames (10 ms per frame).
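The conversion between the two time units is simple but easy to get wrong; a sketch (function names are illustrative, not from the Lotec source):

```python
FRAME_MS = 10  # .wh frame duration in milliseconds, per the note above

def la_ms_to_wh_frames(t_ms):
    """Convert a .la time (milliseconds) to a .wh frame index."""
    return t_ms // FRAME_MS

def wh_frames_to_la_ms(frame):
    """Convert a .wh frame index back to milliseconds."""
    return frame * FRAME_MS
```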


======= Grab =======

The detection of silence vs signal is quite ad hoc.

It would probably be a good idea to normalize volume somehow at this
step.

Grab outputs a click to prompt the user to begin speaking.  Since it
then starts recording immediately, it would seem possible for the
echoes from the click to corrupt the recording, but in practice this
has not been a problem.


======= Feat =======

This computes features using a bank of 8 overlapping spectral filters.
It is cobbled together from code received via ICSI and Cambridge ---
thanks, guys.

Feat -e disables taking the log, instead using the raw energy.  This
is faster, but gives worse match results.

The standard alternative would be to compute features using the
mel-cepstrum.  In my (very limited) experience, the mel-cepstrum gives
results which are only slightly better, although it does run faster.


======= Match =======

For each template, match finds the time span in the input where that
template matches best.  This place, and the match score there, are then
output as a word hypothesis.
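The idea of finding, for each template, the best-matching time span can be sketched as a sliding frame-by-frame distance comparison.  This is only an illustration of the concept; the actual distance measure and search used by match may differ (a real matcher would typically also use dynamic time warping, which, as noted below, match does not).

```python
import math

def best_match_span(template, signal):
    """Slide one template across the input feature frames and return
    (start_frame, score) for the position with the lowest total
    frame-by-frame Euclidean distance.  Lower score = better match.
    Both arguments are sequences of feature vectors (tuples of floats)."""
    t_len = len(template)
    best_start, best_score = 0, float("inf")
    for start in range(len(signal) - t_len + 1):
        score = sum(
            math.dist(t_frame, s_frame)
            for t_frame, s_frame in zip(template, signal[start:start + t_len])
        )
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score
```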

Match does not compensate for the fact that different utterances of
the same word will differ.  In particular, it doesn't model variations
in speaking rate, nor changes in pronunciation due to context.  If
enough templates for each word are recorded, however, this is not a
problem.  Of course, the meaning of "enough" depends on the needs of
the downstream process; for my purposes it has meant five templates
per word.

The scores output by match are simple distance scores, not proper
word-presence likelihoods.  In particular, comparisons of scores
across templates are unreliable --- some templates are just
better/more-useful/cleaner than others.

There is probably no reason to use the -u or -s options except when
debugging.

Match complains and exits if the input and templates were not all
computed in the same way (eg, some were computed with feat and some
with feat -e).


======= Labeler =======

The labeler does not allow overlapping labels.  

The interface is sloppily built; minor uglinesses accumulate unless
the user requests a redraw every so often.

It's also annoyingly slow, unless the X server is reasonably zippy.

It is of course traditional to include a spectrogram for tasks like
this, but, for a word-based system, it suffices to use the ear rather
than the eye.

Labeler provides no way to scroll the data; this makes it inconvenient
to use unless the whole sound sample fits on the screen (at 5
milliseconds per pixel, this means about 5 seconds of sound).


======= Judge =======

The purpose of the judge program is to support experimentation with
different configurations, for example, using different microphones.
The idea is that you take the same input, process it in different
ways, and then run judge on the results of each.  The one with the
better (higher) score is the better configuration.

Note that it is not possible to compare judge scores for two runs
unless the inputs and templates for both were derived from the same
acoustic content.

Judge is rather unusual, compared to other evaluation methods for
speech systems, in that it evaluates the quality of an entire lattice
of hypotheses.  Its result is the score decorated with >> <<; the
higher the better.

Here follow the details of the computation, and of what all the
numbers mean:

First, judge shows the scores for each hypothesis (although these are
not shown when judge is given more than one .wh file).

- "match_sc" is the match score, as output by the word spotter.  The
following number, in parentheses, is the expected value for this, as
might be expected from a completely random word spotter.

- "overlap_sc" is the overlap score; a measure of the degree to which
the hypothesis is located at correct time span.  Again, the number in
parentheses is the expected value (approximately).

- "product" is the product of the above two scores: this is a measure
of the overall quality of the hypothesis.  The idea behind
multiplication is that, if the spotter scores a word highly, it is
rewarded more if that hypothesis is accurate.  Again, the number in
parentheses is the expected value.

- "al_sc" is the "alignment score", an alternative measure of the
degree to which the hypothesis is temporally correct.

- "al_gd" is the "alignment-based-goodness, the average of match_sc
and al_sc.  For my purposes, this number is less meaningful than
"product", but the two seem to correlate well anyway.


At the bottom judge shows some summary statistics.  The ones whose
meanings are not obvious are:

The "match score correctness ratio" is the percent of evidence that is
for words actually present.

The "normalized average product" (decorated with >>>> <<<<) is the
result of scaling the average product to the range 0 to 1, where 0 is
performance at chance and 1 is perfection.


======= Showmatch =======

The reason showmatch has the -l option (to take the label files from
another directory) is to let you do feat+match in various ways,
producing corresponding .wh files in several directories, and then
bring up two showmatch windows to see qualitatively how the results
compare.


======= Real =======

This is just like grab | feat | match, except that all the
initialization is done first and all the post-processing is done last.
This means that the inner loop does nothing superfluous, and runs in
real time (on a lightly-loaded and fast (SS10) machine).

The window display (with the -r option) is for amusement purposes
only.  It's slow and buggy.  Note that the bar height indicates the
likelihood for a word *starting* at that point in time.

The -a and -f options are for verifying that real does in fact behave
the same as grab | feat | match.

To decide whether or not to use log energy, real looks at whether the
templates were computed with log energy.  This means that real will
run faster, though less accurately, if the templates were computed
without using log energy (ie, with feat -e).

I haven't actually used this code for anything, so it's even worse
than the rest.


======= for more information =======

Read the source code.  For general background, read 
  Fundamentals of Speech Recognition. Lawrence Rabiner and Biing-Hwang 
  Juang.  1993, Prentice Hall, ISBN 0-13-015157-2

---
Nigel Ward
nigel@sanpo.t.u-tokyo.ac.jp
University of Tokyo
May 1994
---
