Newsgroups: comp.speech
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.mathworks.com!uunet!inews.intel.com!itnews.sc.intel.com!chnews!ennews!trcsun3!tucker
From: tucker@trcsun3.eas.asu.edu (Gregory B. Tucker)
Subject: Re: Continuous phonetic speech recognition using HMM's?
Keywords: HMM, Continuous speech recognition
Message-ID: <D8qoLD.IrC@ennews.eas.asu.edu>
Nntp-Posting-Host: enws125.eas.asu.edu
Sender: news@ennews.eas.asu.edu (USENET News System)
Organization: Arizona State University
Date: Wed, 17 May 1995 20:22:24 GMT
Lines: 68

On 8 Feb 1995 lowerre@madrone.ece.ucdavis.edu (Bruce Lowerre) said:

>> You can ask for a most likely path at any time.  This is useful if you
>> want to print out the words as they are spoken (looks impressive).
>> Strictly speaking, you need a probability of ending in any one of your
>> states.  Another useful thing to do is to trace back all states until
>> the paths meet, then print that out.  As the paths can never diverge
>> further back in time, this gives you a way of running continuously and
>> only printing out the maximum likelihood word string.

> Here's a teaser from the person who brought you the "beam search."
> If you're doing continuous word recognition, you don't have to back
> trace, ever, either during or after the forward search.  There's
> a better technique for outputting words in real time as they are
> recognized in the forward search.

> Bruce Lowerre

I give up!  This `teaser' has me stumped.

I've written a frame-synchronous Viterbi employing the token-passing
scheme [1] to do continuous word recognition but I don't see how to
output the most likely model or state sequence without back tracing
through my linked lists of state maximum output probabilities.

In the token-passing scheme, choosing the max log probability in the
Viterbi equation:

   S_j(t) = max_i { S_i(t-1) + log a_ij }  +  log Pr(j | O)

is thought of as passing tokens from state to state and from model to
model.  Each token contains the maximum log probability (of being in
state j given the observation sequence O) and a link to its history
of propagation.  This conception makes connected word recognition easy
to implement.
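For concreteness, here is roughly what one frame of my update loop
looks like (a minimal sketch in Python over a single flat state list;
the class and function names are my own, and my real code propagates
tokens between connected models as well as between states):

```python
class Token:
    """Carries the best log probability of reaching a state, plus a
    link to the word-boundary history (a simple linked list)."""
    def __init__(self, logp, history=None):
        self.logp = logp
        self.history = history  # link record inherited from the past

def propagate(states, log_trans, log_emit):
    """One frame of the token-passing Viterbi update.

    states    : list of Tokens, one per state, at time t-1
    log_trans : log_trans[i][j] = log a_ij
    log_emit  : log_emit[j] = log emission probability at time t
    Returns the list of Tokens for time t.
    """
    n = len(states)
    new = []
    for j in range(n):
        # S_j(t) = max_i { S_i(t-1) + log a_ij } + emission term
        best_i = max(range(n),
                     key=lambda i: states[i].logp + log_trans[i][j])
        best = states[best_i].logp + log_trans[best_i][j] + log_emit[j]
        # The winning token is copied; its history link comes along.
        new.append(Token(best, states[best_i].history))
    return new
```

At model boundaries the winning token would additionally grow a new
history link recording the word just exited, which is where my linked
lists come from.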

It seems to me that back-tracing to get the most likely word string is
inherent in the token-passing scheme.  Otherwise, this new technique
must have some other way to update an array of likely paths in a
frame-synchronous manner.  Also, I see that when running continuously,
you can only be sure that the most likely state sequence up to time T
will not change once all partial paths have merged.
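To make that concrete, this is the bookkeeping I had in mind for
finding the word string that can no longer change (a sketch; `Link` is
my own record type, one per word boundary in a token's history, and
identity of link objects marks where paths have merged):

```python
class Link:
    """One word-boundary record in a token's history."""
    def __init__(self, word, prev=None):
        self.word = word
        self.prev = prev

def common_prefix(histories):
    """Walk each active token's history back into a list of links,
    oldest word first, and return the longest common prefix: the word
    string that is already final no matter which path wins later."""
    def chain(node):
        links = []
        while node is not None:
            links.append(node)
            node = node.prev
        return links[::-1]          # oldest word first

    chains = [chain(h) for h in histories]
    prefix = []
    # zip stops at the shortest chain, which is what we want.
    for column in zip(*chains):
        # Identical link objects mean every path passes through here.
        if all(link is column[0] for link in column):
            prefix.append(column[0].word)
        else:
            break
    return prefix
```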

Is this better technique applicable under the token passing scheme?

Should I wait until all paths meet to `recognize' the ML word string?

Does anybody have any hints to a better way?  Right now my fallback is
to stop every so many frames, trace back, and collect garbage.
Otherwise, if I don't throw away the impossible or unlikely paths, the
number of links keeps growing, and I can't guarantee any bound on it.
There should be a better way.
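That stop-and-collect step amounts to a little mark-and-sweep pass
over the link records.  Sketched below under the same assumptions
(`Link` is my own record type); this is the workaround I have now, not
the better technique I'm asking about:

```python
class Link:
    """One word-boundary record in a token's history."""
    def __init__(self, word, prev=None):
        self.word = word
        self.prev = prev

def live_links(histories):
    """Mark phase: collect the id of every link still reachable from
    an active token.  Anything outside this set can be reclaimed."""
    live = set()
    for node in histories:
        while node is not None and id(node) not in live:
            live.add(id(node))
            node = node.prev
    return live

def sweep(pool, histories):
    """Sweep phase: keep only the reachable links, so the link count
    is bounded by (active tokens) x (unmerged history depth)."""
    live = live_links(histories)
    return [link for link in pool if id(link) in live]
```

The catch, of course, is that this still has to stop and walk the
histories periodically, which is exactly the back-tracing I was hoping
to avoid.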

I will appreciate any ideas.

--Greg

==========================================================================
 Gregory B. Tucker                     Telecommunications Research Center
 tucker@dspsun.eas.asu.edu             Dept. of Electrical Engineering
 http://yamuna.eas.asu.edu/~tucker/    Arizona State University
 (602) 965-0396                        Tempe, AZ 85287-7206
==========================================================================


[1]  S.J. Young, N.H. Russell, and J.H.S. Thornton, "Token Passing: a
Simple Conceptual Model for Connected Speech Recognition Systems",
Cambridge University Engineering Department, Technical Report
CUED/F-INFENG/TR.38, July 1989.
