Newsgroups: comp.speech
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!EU.net!Germany.EU.net!news.dfn.de!zeus.rbi.informatik.uni-frankfurt.de!terra.wiwi.uni-frankfurt.de!news.th-darmstadt.de!fauern!rrze.uni-erlangen.de!hub-n.franken.de!ark.franken.de!ralf
From: ralf@ark.franken.de (Ralf W. Stephan)
Subject: Call for comments:  Speech recognition HOWTO
Message-ID: <1994Sep25.172115.280@ark.franken.de>
Organization: his desk writing an article
Date: Sun, 25 Sep 1994 17:21:15 GMT
X-Newsreader: TIN [version 1.2 PL2]
Lines: 84

  Hi,

I would be glad if some of the gurus could comment on this text.

Many thanks in advance,
ralf

-----------------------------------------------------------------
In my effort to gather information on how to build a simple
speech recognizer (the EARS project), I learned the following.
This text is posted with a request for comments and is intended
to help people less familiar with the methods build and play
with their own implementation (and it also helps me clear my
own head).  Note that the following is *not* from experience
but from reading papers, sources, etc.

By 'simple' speech recognition I mean an implementation that
doesn't deal with word subunits (e.g. phonemes, fenones, etc.)
but with whole words.  Of course, you don't get speaker
independence or big vocabularies this way.  But an advantage
could be that you are not tied to a specific language.

- First, find a way to sample your utterances.  I believe that
  a sampling rate of 8 kHz should suffice.  It may be better
  to have the data in audio format than in raw format, but this
  depends on what is available on your system.
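  For concreteness, here's a sketch of what handling 'audio
  format' data could involve: Sun-style audio is typically
  8-bit G.711 mu-law at 8 kHz, and decoding one such byte to
  a linear sample looks like this (Python here and below is my
  choice for illustration, nothing the tools require):

```python
def ulaw_to_linear(byte):
    """Decode one 8-bit G.711 mu-law sample to 16-bit linear PCM."""
    byte = ~byte & 0xFF                  # mu-law bytes are stored inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample
```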

- The data must then be run through a feature-extraction stage.
  Rasta-PLP seems to be the state of the art.  You usually get
  a vector of 20-odd coefficients for each frame (10 or 20 ms),
  so a whole word of 0.5 to 1 sec length consists of roughly
  1000-2000 floating-point numbers.
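  Full Rasta-PLP is beyond this text, but the framing step and
  the resulting data sizes can be sketched like this (the 10 ms
  non-overlapping frames and the Hamming window are assumptions
  for illustration; real front ends usually overlap frames):

```python
import math

def split_frames(samples, rate=8000, frame_ms=10):
    """Cut a sampled utterance into non-overlapping, Hamming-windowed
    frames; a real front end then reduces each frame to ~20 coefficients."""
    n = rate * frame_ms // 1000          # samples per frame (80 here)
    return [[samples[i + j] * (0.54 - 0.46 * math.cos(2 * math.pi * j / (n - 1)))
             for j in range(n)]
            for i in range(0, len(samples) - n + 1, n)]

# 1 sec of speech -> 100 frames; at ~20 coefficients per frame that
# gives the ~2000 floating-point numbers per word mentioned above.
```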

- Now comes the part where there seem to be many possible
  paths to take:

  On the one hand, recognizing a word can be seen as a pattern
  recognition task, with the time axis of the data interpreted
  as the x axis of an image consisting of a rectangle of more
  or less bright points (look at the image of a PLP file with
  the help of the OGI tools).  Thus you could apply techniques
  used in optical recognition tasks, most notably feedforward
  (non-recurrent) neural nets.

  The problems with this approach:
  - most feedforward nets expect their input to be of constant size,
    so you have to scale the data.  This probably leads to the 
    unwanted behaviour that details within long words are lost.
  - you need big horsepower: even with only 10 hidden neurons you
    have to train tens of thousands of connections.
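  A crude way around the constant-size problem is linear time
  normalization; here is a sketch (the target length of 50
  frames and the layer sizes below are made-up numbers):

```python
def time_normalize(frames, target=50):
    """Resample a variable-length frame sequence to a fixed length by
    nearest-neighbour picking; fine detail in long words gets dropped."""
    n = len(frames)
    return [frames[i * n // target] for i in range(target)]

# Weight count for such a net, with assumed sizes:
# 50 frames x 20 coefficients = 1000 inputs, 10 hidden, 10 output words.
weights = 1000 * 10 + 10 * 10
print(weights)                           # 10100 connections to train
```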

  But fortunately, the data has properties that allow other
  methods to work: it stays nearly constant over some period
  of time, then changes very fast, only to stay constant over
  another period.  So it can be seen as a concatenation of
  states with transitions between them.

  There are two kinds of models that can classify this sort of
  data: recurrent neural nets and Hidden Markov Models (HMMs).
  Which recurrent net to use generally depends on the short-term
  memory (STM) one wants to have.  Try Backpropagation Through
  Time (BPTT) or Time Delay Neural Nets (TDNNs).  Some papers
  suggest that recurrent nets and HMMs share several
  similarities, so I guess it should be possible to get good
  results with the nets, too.
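  To illustrate what short-term memory means here, one time
  step of a simple recurrent (Elman-style) layer could look
  like this (the weight layout is my own invention, and BPTT
  training is not shown):

```python
import math

def recurrent_step(x, h, Wxh, Whh):
    """One time step: the new hidden state depends on the current input
    frame x *and* the previous hidden state h -- that feedback loop is
    the net's short-term memory."""
    return [math.tanh(sum(Wxh[i][k] * x[k] for k in range(len(x))) +
                      sum(Whh[i][j] * h[j] for j in range(len(h))))
            for i in range(len(h))]
```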

  HMMs are the first choice in speech recognition.  There are
  HMMs that take discrete input and ones that take continuous
  input.  If you want to use the former, you have to quantize
  the data first.  This is done with unsupervised techniques,
  usually Vector Quantization or (I think) Kohonen SOMs.
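  To make the discrete-HMM path concrete, here is a sketch of
  both steps (all sizes and names invented for illustration;
  building the codebook itself, e.g. with k-means, is not
  shown): nearest-codeword quantization, then the scaled
  forward algorithm to score the symbol sequence against one
  word's model.  The word whose model scores highest wins.

```python
import math

def quantize(frames, codebook):
    """Vector Quantization: map each feature vector to the index of
    the nearest codebook entry (Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist2(f, codebook[k]))
            for f in frames]

def forward_logprob(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM
    with initial probs pi, transitions A[i][j], emissions B[i][symbol]."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    logp = 0.0
    for t in range(1, len(obs)):
        s = sum(alpha)                   # rescale to avoid underflow
        logp += math.log(s)
        alpha = [a / s for a in alpha]
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return logp + math.log(sum(alpha))
```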

Availability of code:
=====================
  At this time, I know of no free package with code for continuous HMMs.
  Look into the comp.speech FAQ for feature extraction and discrete
  HMM code.  Two packages that weren't mentioned there:  the Rasta
  package on icsi-ftp.berkeley.edu, and Tony Robinson's cookbook on
  svr-ftp.eng.cam.ac.uk.  Look into the comp.ai.neural-nets FAQ for
  net code.

--------------------------------------------------------------------

ralf
--
You are in a different maze of little articles, all alike.
