Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!daresbury!keele!uknet!pipex!howland.reston.ans.net!cs.utexas.edu!asuvax!ennews!trcsun3!deisher
From: deisher@trcsun3.eas.asu.edu (Michael E. Deisher)
Subject: Re: Training HMM's
Message-ID: <CJu1IB.DJF@ennews.eas.asu.edu>
Summary: independent observations
Sender: news@ennews.eas.asu.edu (USENET News System)
Nntp-Posting-Host: enws125.eas.asu.edu
Organization: Arizona State University
References: <1994Jan17.231950.9326@ccu1.aukuni.ac.nz>
Date: Tue, 18 Jan 1994 15:54:09 GMT
Lines: 64

In article <1994Jan17.231950.9326@ccu1.aukuni.ac.nz>, mmt@ccu1.aukuni.ac.nz (Mark Thomson) writes:
> I have a question about the use of the Baum-Welch algorithm for training
> hidden Markov models. The explanation of the algorithm given in a number
> of references involves a step that I am having difficulty following.
> Since it appears in more than one place it seems unlikely to be
> incorrect. I would, therefore, be grateful if someone could point out to
> me where the flaw in my understanding lies. The situation is as follows:
> 
> In updating the elements, a_ij, of the state transition probability
> matrix, one divides the expected number of transitions from i to j by
> the expected number of transitions from i. The expected number of
> transitions from i is the sum over t of the probabilities that the model
> will be in state i at time t, given the particular sequence of feature
> vector observations on which the model is being trained.
> 
> The probability of the model being in state i at time t given the
> observation sequence is 
> 
>              p(in state i at t, and training sequence observed)
>              -------------------------------------------------
>                       p(training sequence observed)
> 
> Now all the references I've looked at (e.g. the book "HMM's for Speech
> Recognition" by Huang, Ariki and Jack) state that this is equal to
> 
>                           alpha_t(i)*beta_t(i)
>                      -----------------------------
>                      p(training sequence observed)
> 
> where alpha_t(i) is the forward variable =
>         p(training vectors O_1...O_t occur and in state i at t)
> 
> and beta_t(i) is the backward variable =
>         p(training vectors O_t+1...O_T occur | in state i at t)
> 
> My problem is that this implies that 
> 
>  alpha_t(i)*beta_t(i) = p(in state i at t and training sequence observed)
> 
> However, by my reckoning, the rhs of the last line is equal to
> 
> p(O_1...O_t occur and in state i at t) * 
>               p(O_t+1...O_T occur | (O_1...O_t occur _and_ in state i at t))
> 
> The first term is clearly alpha_t(i), but it seems to me that the second
> term is _not_ beta_t(i) as required, because it is conditioned on _both_
> the state at t and the observation sequence up to t.

The second term is equal to beta_t(i) because of the conditional
independence assumptions built into the HMM: given the state at time t,
the future observations are independent of the past observations.  That is,

p(O_t+1...O_T occur | O_1...O_t occur _and_ in state i at t)
                     = p(O_t+1...O_T occur | in state i at t)

and the right-hand side is beta_t(i) by definition.  Note that the
observations are not assumed independent outright; they are independent
only when you condition on the state, which is exactly what the second
term does.  I think most authors make this assumption, often implicitly
(e.g., see Rabiner, Feb. 89 Proc. IEEE, p. 262, text surrounding eqn 13).
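For what it's worth, the resulting identity alpha_t(i)*beta_t(i) =
p(in state i at t, and training sequence observed) is easy to check
numerically.  Below is a minimal sketch in Python/NumPy using a made-up
2-state discrete HMM; all parameter values and the observation sequence
are arbitrary, chosen only for illustration:

```python
# Numerical check that sum_i alpha_t(i)*beta_t(i) = p(O) at every t,
# for a small, arbitrary 2-state discrete HMM.
import numpy as np

A = np.array([[0.7, 0.3],      # a_ij: state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],      # b_i(k): p(symbol k | state i)
              [0.1, 0.9]])
pi = np.array([0.6, 0.4])      # initial state distribution
O = [0, 1, 1, 0]               # an arbitrary observation sequence
T, N = len(O), len(pi)

# Forward pass: alpha_t(i) = p(O_1..O_t, in state i at t)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, O[0]]
for t in range(1, T):
    alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]

# Backward pass: beta_t(i) = p(O_t+1..O_T | in state i at t)
beta = np.zeros((T, N))
beta[T-1] = 1.0
for t in range(T-2, -1, -1):
    beta[t] = A @ (B[:, O[t+1]] * beta[t+1])

# sum_i alpha_t(i)*beta_t(i) should equal p(O) for every t.
pO = alpha[T-1].sum()
for t in range(T):
    assert np.isclose((alpha[t] * beta[t]).sum(), pO)

# The posterior gamma_t(i) = alpha_t(i)*beta_t(i)/p(O) sums to 1 at each t.
gamma = alpha * beta / pO
print(np.allclose(gamma.sum(axis=1), 1.0))  # True
```

If the second term really did depend on the past observations, the sum
over i at intermediate t would not reduce to p(O); the fact that it does
is the conditional independence at work.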

--Mike

 ==============================================================================
  |  Mike Deisher                                  Arizona State University  |
  |  deisher@dspsun.eas.asu.edu          Telecommunications Research Center  |
  |  voice:  (602) 965-0396                    Signal Processing Laboratory  |
  |  fax:    (602) 965-8325                           Tempe, AZ  85287-7206  |
 ==============================================================================
