Newsgroups: comp.speech
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!torn!alf.uwaterloo.ca!watserv2.uwaterloo.ca!mspundsa
From: mspundsa@coulomb.uwaterloo.ca (Mark Stephen Pundsack)
Subject: Continuous phonetic speech recognition using HMM's?
Message-ID: <D3LB8r.9IC@watserv2.uwaterloo.ca>
Sender: news@watserv2.uwaterloo.ca
Nntp-Posting-Host: babbage.uwaterloo.ca
Organization: University of Waterloo
Date: Mon, 6 Feb 1995 17:58:03 GMT
Lines: 45

I'd like to use HMMs for continuous speech recognition, but I'm having
trouble figuring out how to run the HMM continuously instead of doing
some kind of break detection.  I want a phone (or triphone) based
approach, so I can't just wait for a drop in energy to segment the
frames.  I've read that you can model a complete grammar with a single
HMM just by stringing everything together.  I suppose that if you keep
track of the Viterbi path, you can decode the phone sequence and thus
the word sequence.
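To make sure I have the idea straight, here is a toy sketch of what I mean by stringing phones together and backtracking the path.  The three one-state "phones", the discrete observation alphabet, and all the probabilities are invented for illustration; real systems would use multi-state phone models and continuous densities.

```python
# Toy sketch: Viterbi decoding over a "phone loop" HMM in the log domain.
# Three invented one-state phones, discrete observations {0, 1, 2};
# all probabilities are made up for illustration.
import math

phones = ["a", "b", "c"]
# From any phone: loop on itself with prob 0.7, or jump to either other phone.
trans = {p: {q: 0.7 if p == q else 0.15 for q in phones} for p in phones}
# Each phone strongly prefers one observation symbol.
emit = {"a": [0.8, 0.1, 0.1], "b": [0.1, 0.8, 0.1], "c": [0.1, 0.1, 0.8]}

def viterbi(obs):
    """Frame-synchronous Viterbi; returns the best phone path."""
    delta = {p: math.log(1.0 / len(phones)) + math.log(emit[p][obs[0]])
             for p in phones}
    backptr = [dict()]                     # no predecessor at frame 0
    for o in obs[1:]:
        new_delta, bp = {}, {}
        for q in phones:
            # Best predecessor state for q at this frame.
            best_p = max(phones,
                         key=lambda p: delta[p] + math.log(trans[p][q]))
            new_delta[q] = (delta[best_p] + math.log(trans[best_p][q])
                            + math.log(emit[q][o]))
            bp[q] = best_p
        delta = new_delta
        backptr.append(bp)
    # Backtrack from the best final state to recover the phone sequence.
    state = max(phones, key=lambda p: delta[p])
    path = [state]
    for bp in reversed(backptr[1:]):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([0, 0, 1, 1, 2]))   # ['a', 'a', 'b', 'b', 'c']
```

The backpointer table is what lets you read the phone (and hence word) sequence back out of the composed model.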

My problem is that if you model dead time in your HMM, the path
probability keeps decaying with every frame.  Can the model still
have meaning after long enough that the probabilities get really
skewed?  If the HMM took input continuously for several sentences,
I'd expect that by the time the next sentence starts, there would be
a bunch of probabilities that are really skewed and the dynamic range
would get really large.  I don't know whether this is an actual
problem in practice, but it seems to have a large potential for being one.
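Here is a quick numerical illustration of the dynamic-range worry.  The per-frame probability and frame count are invented, but the point stands: multiplying raw probabilities underflows double precision long before a sentence ends, whereas summing log probabilities stays well-behaved.

```python
# Why raw path probabilities underflow over long utterances, and why
# decoders work in the log domain.  The numbers are made up.
import math

per_frame_prob = 1e-3    # a typical transition*emission product per frame
n_frames = 200           # a couple of seconds at 100 frames/sec

raw = 1.0
for _ in range(n_frames):
    raw *= per_frame_prob        # underflows to 0.0 well before the end
print(raw)                       # 0.0 -- a double can't hold 1e-600

log_score = 0.0
for _ in range(n_frames):
    log_score += math.log(per_frame_prob)   # stays a modest negative number
print(log_score)                 # about -1381.6; fully representable
```

So the decay itself isn't fatal as long as everything is kept in logs; what matters is the *relative* spread between competing paths.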

Perhaps the hypotheses at the bottom of the dynamic range can simply
be dropped, since they have very little chance of being correct.  But
I can think of several situations where a hypothesis with a very small
initial probability eventually turns out to be the correct sentence.
Perhaps it is simply that full-sentence recognition requires a much
larger dynamic range than a simple word recognizer does.  Is this a
limitation?
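What I have in mind by "dropping" hypotheses is something like beam pruning: at each frame, discard any path whose log score falls more than a fixed width below the current best.  A rough sketch (the scores and beam width are invented):

```python
# Beam pruning sketch: keep only hypotheses within `beam` of the best
# log score at the current frame.  Scores and beam width are invented.
BEAM = 50.0   # log-domain beam width; an arbitrary tuning choice

def prune(scores, beam=BEAM):
    """scores: dict mapping hypothesis -> log score.
    Drop hypotheses far below the current best."""
    best = max(scores.values())
    return {h: v for h, v in scores.items() if v >= best - beam}

hyps = {"path_a": -120.0, "path_b": -135.0, "path_c": -200.0}
print(prune(hyps))   # path_c falls outside the beam and is discarded
```

The risk is exactly the one I described: a path that starts badly but would have recovered gets pruned before it can.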

One method I wanted to try was starting several HMMs in parallel at
different time offsets, in the hope of finding the correct boundaries.
The big problem there is that I don't think the Viterbi algorithm
produces probabilities that are comparable when there is a time offset,
i.e. the hypotheses aren't time-synchronous.  Is there a correction
that makes the Viterbi results comparable?  I've heard of applying
some arbitrary compensation, but I don't know of any principled one.
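To spell out the comparability problem: a path that starts later has accumulated fewer (negative) log terms, so its raw score is systematically higher, and comparing raw scores always favors the later start.  One ad-hoc compensation I've seen mentioned is comparing per-frame averages instead; a toy illustration with invented per-frame scores:

```python
# Why Viterbi scores from different start offsets aren't comparable:
# a later start accumulates fewer log terms.  Per-frame scores invented.
import math

frame_logprobs = [math.log(0.3)] * 100   # 100 frames, constant toy score

def path_score(start):
    """Raw log score of a path beginning at frame `start`."""
    return sum(frame_logprobs[start:])

raw_early, raw_late = path_score(0), path_score(50)
print(raw_early < raw_late)   # True: the later start always "wins" raw

# Per-frame average removes the length bias (an arbitrary compensation).
avg_early = raw_early / 100
avg_late = raw_late / 50
print(math.isclose(avg_early, avg_late))   # True: now they tie
```

That makes paths of different lengths comparable in a crude sense, but it's exactly the kind of arbitrary compensation I'd like to avoid.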

Has anyone built a continuous phone-based speech recognizer using just
an HMM for the grammar?  I don't really want to try a stack-based
decoder or other methods yet.

If this makes some sense to anyone but it isn't clear what I'm getting
at, please email me and I'll try to be a little clearer.

Thanks in advance,
Mark Pundsack
