The Sphinx3 trainer can be used to train both semi-continuous and continuous HMMs.

THE DIFFERENCE BETWEEN DISCRETE, SEMI-CONTINUOUS AND CONTINUOUS MODELS

Discrete models

The vector space is partitioned into N regions and every vector is replaced by the numerical identifier (id) of the region it belongs to. In the discrete HMM for a phone, each state distribution is a histogram (with N bins) of the occurrence frequency of each region. Here we are essentially assuming that all data within a region are equally probable.

The process of partitioning a vector space into regions and replacing each vector by a representative from the region it belongs to is referred to as "vector quantization".
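As a rough illustration of these two ideas, here is a minimal sketch in numpy. It is not the Sphinx3 code; the function names, the toy codebook and the toy vectors are all made up for the example:

    import numpy as np

    def quantize(vectors, codebook):
        """Replace each vector by the id of the nearest codebook entry (vector quantization)."""
        # vectors: (T, d) observations; codebook: (N, d) region representatives
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1)                      # (T,) region ids

    def discrete_state_histogram(region_ids, n_regions):
        """Histogram of region occurrence frequencies for one HMM state."""
        counts = np.bincount(region_ids, minlength=n_regions).astype(float)
        return counts / counts.sum()                     # P(region | state)

    # Toy example: a 4-region codebook in 2-D and a few vectors assigned to one state
    codebook = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
    vectors  = np.array([[0.1, 0.2], [0.9, 1.1], [0.8, 0.9], [0.2, 0.1]])
    ids = quantize(vectors, codebook)
    print(discrete_state_histogram(ids, len(codebook)))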

Semi-continuous models

The essential conceptual difference between discrete and semi-continuous models is that in semi-continuous models the data within a bin are not assumed to be equally probable. Instead, we assign a probability density to the data within each bin.

In semi-continuous models the vector space is partitioned into N regions, as in the case of discrete models. However, there are several differences. The partitions are not rigid and do not have clear boundaries. Instead, the data vectors are used to compute a mixture of N parametric distributions, usually Gaussians, and each of these N distributions is visualized as representing the distribution of data within a particular partition. It is important to note that the partitioning itself is not explicitly done. It is assumed that the data belonging to any state of any HMM come from these various partitions in proportions that are characteristic of that state. However, within any partition, the distribution of data depends only on the partition and not on the state. The state distribution of any state is thus simply a sum of the N distributions (one representing each partition), weighted by the true fraction of data points from that state which came from each partition.
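The state distribution described in this paragraph can be written out concretely. The following is only an illustrative sketch with made-up names, assuming diagonal-covariance Gaussians; it is not the Sphinx3 implementation:

    import numpy as np

    def gaussian_density(x, mean, var):
        """Diagonal-covariance Gaussian density (the usual parametric choice)."""
        norm = np.prod(2.0 * np.pi * var) ** -0.5
        return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

    def semicontinuous_state_likelihood(x, means, variances, state_weights):
        """b_s(x) = sum_k w[s, k] * N(x; mu_k, Sigma_k); the N Gaussians (the
        codebook) are shared by all states, only the weights are state-specific."""
        densities = np.array([gaussian_density(x, m, v)
                              for m, v in zip(means, variances)])
        return float(np.dot(state_weights, densities))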

Here, the underlying "truth" about the distribution associated with any state of an HMM is this: had one had all possible examples of the realization of that state, and explicit knowledge of which partition each of these vectors came from, then the true distribution of the state could be approximated as a sum of the N partition distributions, weighted by the fraction of vectors from that state belonging to each partition. This "true" fraction is what we are trying to estimate from the limited data that we see, using the limited knowledge that we have of the data belonging to that state.

From the point of view of each vector, we do not associate a unique partition with any vector. Instead, we assign to the vector a probability of belonging to any partition. Thus every vector has a probability distribution associated with it, composed of the probabilities of belonging to each of the N partitions. During the training process these probability distributions associated with the vectors belonging to any state of any HMM are aggregated to give a single histogram, which is normalized to give the "mixture weight" distribution for that state. The probability distributions are also used simultaneously to update the parameters of the distribution associated with each partition of the vector space.
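A sketch of this aggregation step might look as follows. It reuses gaussian_density from the previous sketch, assumes that the degree to which each vector belongs to the state (state_posteriors) has already been computed by the forward-backward pass, and all names are again made up for the example:

    import numpy as np

    def codeword_posteriors(x, means, variances, weights):
        """Probability of x 'belonging to' each of the N partitions, i.e. the
        distribution over partitions associated with this single vector."""
        densities = np.array([gaussian_density(x, m, v)   # from the sketch above
                              for m, v in zip(means, variances)])
        joint = weights * densities
        return joint / joint.sum()

    def reestimate_mixture_weights(vectors, state_posteriors, means, variances, weights):
        """Aggregate the per-vector partition distributions, weighted by how much
        each vector is believed to belong to this state, then normalize to give
        the state's new mixture-weight distribution."""
        acc = np.zeros(len(means))
        for x, gamma in zip(vectors, state_posteriors):
            acc += gamma * codeword_posteriors(x, means, variances, weights)
        return acc / acc.sum()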

For a vector, the notion of "belonging to a state" is not absolute. It could only be absolute if the process that generated the vectors were known at every stage to an observer. This is further compounded by the fact that the assumed model itself may be inaccurate. In other words, we cannot assign any vector exclusively to a particular state of an HMM. Instead, we can associate with it a probability of having been generated by the distribution of any given state. A simple example clarifies this: if we took a mixture of 3 Gaussians and used one of them to generate a vector, we could assign an absolute probability to that vector of belonging to its generating Gaussian. This assignment comes merely from the fact that we know which Gaussian generated the vector. Once the vector has been generated and that underlying knowledge is obscured from an observer, the observer can at best assign probabilities of the vector having been generated by each of the 3 Gaussians. The observer's assignments could never be more specific than this in the absence of any other knowledge. The fact remains, however, that the vector did indeed come from one of the 3 Gaussians; we simply cannot exploit that fact. We have to make guesses.

Here is a good practical approximation to semi-continuous HMMs. It is not theoretically completely accurate (and not what the Sphinx does), but if you implemented this it would work too (though not as well as the theoretically correct version discussed above). In this approximation the vector space is explicitly partitioned into N parts and a histogram is constructed for every state of every HMM. This histogram is based on the counts of vectors that correspond to each partition. The vectors themselves are not replaced. Once all histograms are computed, the HMMs are used to re-estimate the boundaries of the N partitions of the vector space. In order to do this, each partition is associated with a parametric distribution whose parameters are computed from the vectors in that partition. The final models are in the form of the parameters of the final distributions of the partitions (which form the final codebook) and the histograms or "mixture weights" corresponding to each model. Also associated with each "base" acoustic unit modeled is a transition probability matrix. All "higher-level" acoustic units which may have been formed by some combination of the "base" acoustic units use the transition matrix of the base acoustic unit they are chiefly associated with. Triphones, quinphones, diphones, etc. fall into the "higher-level" units category. A triphone, for example, is chiefly associated with its central phone.
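A rough sketch of this approximation is given below. It is not the Sphinx3 trainer; the names are invented, the assignment of frames to states (frames_per_state) is assumed to be given, and only the region means are re-estimated here:

    import numpy as np

    def approximate_semicontinuous_training(frames_per_state, n_regions, n_iter=10):
        """Hard-partition approximation: explicitly partition the space, build
        per-state count histograms, and re-estimate the region parameters."""
        all_frames = np.vstack(list(frames_per_state.values()))
        rng = np.random.default_rng(0)
        # Initialize the N region representatives from randomly chosen frames
        means = all_frames[rng.choice(len(all_frames), n_regions, replace=False)].copy()
        for _ in range(n_iter):
            # Assign every frame to its nearest region (an explicit, hard partition)
            ids = np.linalg.norm(all_frames[:, None] - means[None], axis=-1).argmin(1)
            # Re-estimate each region's parametric distribution (here just its mean)
            for k in range(n_regions):
                if np.any(ids == k):
                    means[k] = all_frames[ids == k].mean(axis=0)
        # Per-state histograms over the final regions (the "mixture weights")
        mixture_weights = {}
        for state, frames in frames_per_state.items():
            ids = np.linalg.norm(frames[:, None] - means[None], axis=-1).argmin(1)
            counts = np.bincount(ids, minlength=n_regions).astype(float)
            mixture_weights[state] = counts / counts.sum()
        return means, mixture_weights   # the final codebook and per-state histograms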

Continuous models

In continuous HMMs the partitioning of the vector space is itself state-specific, unlike semi-continuous models where the partitioning is unique and shared by all states. More about this later...
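To make the contrast concrete, a hypothetical sketch of a continuous state distribution is given below; unlike the semi-continuous sketch above, the Gaussians themselves (not just the weights) now belong to the state. It reuses gaussian_density from the earlier sketch:

    import numpy as np

    def continuous_state_likelihood(x, state):
        """Continuous HMM: each state owns its Gaussians as well as its mixture
        weights, i.e. the 'partitioning' itself is state-specific."""
        densities = np.array([gaussian_density(x, m, v)   # from the sketch above
                              for m, v in zip(state["means"], state["variances"])])
        return float(np.dot(state["weights"], densities))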

THE ALGORITHM USED FOR BUILDING TREES
In the tree below, each node is labelled with the splits applied to it. The number in parentheses gives the sequence in which that split is done: simple split = ssplit(n), compound split = csplit(n).

                           
                           -----
                          |ROOT |
                           -----                        
                             |
                    ssplit(1),csplit(1)
                             |
                             |     
                  ----------------------
                 |                      |
               -----                  -----
              |     |                |     |
               -----                  ----- 
                 |                      |
        ssplit(2),csplit(4)     ssplit(3),csplit(2)
                 |                      |
         ------------------       -------------------
        |                  |     |                   |
      -----              -----   -----             -----
     |     |            |     | |     |           |     |
      -----              -----   -----             -----
        |                   |      |                 |
   ssplit(8)            ssplit(9)  |                 |
                                   |                 |
                                   |                 |
                          ssplit(4),csplit(3) ssplit(5),csplit(5)
                                   |                 |
                            ---------------     --------------
                           |               |   |              |
                         -----         -----   -----      -----
                        |     |       |     | |     |    |     |
                         -----         -----   -----      -----
                        ssplit(6)    ssplit(7) ssplit(10) ssplit(11)
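Purely as an illustration, the tree in the figure can be written down as a nested structure that records the order in which the simple and compound splits were applied to each node. This encodes only what the figure shows, not the splitting criterion itself:

    # Each node records the order of the splits applied to it, as in the figure.
    # Nodes whose children are not drawn in the figure carry only the order of
    # the simple split evaluated for them.
    tree = {
        "ssplit": 1, "csplit": 1,
        "left": {
            "ssplit": 2, "csplit": 4,
            "left":  {"ssplit": 8},
            "right": {"ssplit": 9},
        },
        "right": {
            "ssplit": 3, "csplit": 2,
            "left": {
                "ssplit": 4, "csplit": 3,
                "left":  {"ssplit": 6},
                "right": {"ssplit": 7},
            },
            "right": {
                "ssplit": 5, "csplit": 5,
                "left":  {"ssplit": 10},
                "right": {"ssplit": 11},
            },
        },
    }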