Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!uknet!pipex!howland.reston.ans.net!wupost!waikato!aukuni.ac.nz!mmt
From: mmt@ccu1.aukuni.ac.nz (Mark Thomson)
Subject: Training HMM's
Message-ID: <1994Jan17.231950.9326@ccu1.aukuni.ac.nz>
Organization: University of Auckland, New Zealand.
Date: Mon, 17 Jan 1994 23:19:50 GMT
Lines: 59

Hi,

I have a question about the use of the Baum-Welch algorithm for training
hidden Markov models. The explanation of the algorithm given in a number
of references involves a step that I am having difficulty following.
Since it appears in more than one place, it seems unlikely to be
incorrect. I would therefore be grateful if someone could point out
where the flaw in my understanding lies. The situation is as follows:

In updating the elements, a_ij, of the state transition probability
matrix, one divides the expected number of transitions from i to j by
the expected number of transitions from i. The expected number of
transitions from i is the sum over t of the probabilities that the model
will be in state i at time t, given the particular sequence of feature
vector observations on which the model is being trained.
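To make the update concrete, here is a rough sketch of that re-estimation
step on a toy two-state, two-symbol model (all the parameter values and
the short observation sequence are invented for illustration; the
forward/backward recursions are the standard ones):

```python
# Toy Baum-Welch transition update: expected number of i->j transitions
# divided by expected number of transitions out of i.  All numbers below
# are made up for illustration.

pi = [0.6, 0.4]                      # initial state probabilities
a  = [[0.7, 0.3], [0.4, 0.6]]        # state transition matrix a_ij
b  = [[0.5, 0.5], [0.1, 0.9]]        # observation probabilities b_i(k)
obs = [0, 1, 1, 0]                   # a short training sequence
N, T = 2, len(obs)

# Forward pass: alpha[t][i] = p(O_1..O_t, in state i at t)
alpha = [[pi[i] * b[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * a[i][j] for i in range(N)) * b[j][obs[t]]
                  for j in range(N)])

# Backward pass: beta[t][i] = p(O_{t+1}..O_T | in state i at t)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j]
                         for j in range(N))

p_obs = sum(alpha[T-1][i] for i in range(N))   # p(training sequence observed)

# Expected i->j transition counts, then the re-estimated a_ij
xi = [[sum(alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j]
           for t in range(T - 1)) / p_obs
       for j in range(N)] for i in range(N)]
a_new = [[xi[i][j] / sum(xi[i]) for j in range(N)] for i in range(N)]
```

Each row of a_new sums to one, as a transition matrix should.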

The probability of the model being in state i at time t given the
observation sequence is 

             p(in state i at t, and training sequence observed)
             -------------------------------------------------
                      p(training sequence observed)

Now all the references I've looked at (e.g. the book "Hidden Markov
Models for Speech Recognition" by Huang, Ariki and Jack) state that this
is equal to

                          alpha_t(i)*beta_t(i)
                     -----------------------------
                     p(training sequence observed)

where alpha_t(i) is the forward variable =
        p(training vectors O_1...O_t occur and in state i at t)

and beta_t(i) is the backward variable =
        p(training vectors O_t+1...O_T occur | in state i at t)

My problem is that this implies that 

 alpha_t(i)*beta_t(i) = p(in state i at t and training sequence observed)

However, by my reckoning, the rhs of the last line is equal to

p(O_1...O_t occur and in state i at t) * 
              p(O_t+1...O_T occur | (O_1...O_t occur _and_ in state i at t))

The first term is clearly alpha_t(i), but it seems to me that the second
term is _not_ beta_t(i) as required, because it is conditioned on _both_
the state at t and the observation sequence up to t.
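For what it's worth, I coded up a brute-force check on a toy two-state
model (parameters invented), summing the joint probability over every
complete state path that passes through state i at time t, and the
identity does appear to hold numerically, which only deepens my
confusion about where the conditioning on O_1...O_t goes:

```python
# Brute-force check of the disputed identity on a toy model: does
# alpha_t(i) * beta_t(i) equal p(in state i at t, whole sequence observed)?
from itertools import product

pi = [0.6, 0.4]                      # toy parameters, invented
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0]
N, T = 2, len(obs)

# Forward and backward variables via the standard recursions
alpha = [[pi[i] * b[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * a[i][j] for i in range(N)) * b[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j]
                         for j in range(N))

def path_prob(states):
    """Joint probability of one complete state path and the observations."""
    p = pi[states[0]] * b[states[0]][obs[0]]
    for t in range(1, T):
        p *= a[states[t-1]][states[t]] * b[states[t]][obs[t]]
    return p

# Sum over all state paths that pass through state i at time t, and
# compare with alpha_t(i) * beta_t(i)
for t in range(T):
    for i in range(N):
        joint = sum(path_prob(s) for s in product(range(N), repeat=T)
                    if s[t] == i)
        assert abs(joint - alpha[t][i] * beta[t][i]) < 1e-12
```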

I would greatly appreciate it if someone could let me know whether there
is an error in my reasoning.




Mark Thomson
Electrical and Electronic Engineering
The University of Auckland

