RANDOM OBSERVATIONS

On implementing Baum-Welch

  1. Sphinx-3 has code like the following in bw/baum_welch.c, around line 219:
        /* Calculate log[ p( O | \lambda ) ] */
        log_fp = log(active_alpha[n_obs-1][i]);
        for (t = 0; t < n_obs; t++) {
            log_fp -= log(scale[t]);
            for (j = 0; j < inv->gauden->n_feat; j++) {
                log_fp += dscale[t][j];
            }
        }
    
    Here's what the scaling means:
    dscale: density scaling factor. At each frame we compute several Gaussians for several Gaussian mixtures (one mixture per tied state). When all the Gaussian mixtures are poorly trained (early in training), or when a data vector is an outlier, all Gaussians often underflow and the scores for all tied states become 0. The forward algorithm then cannot proceed, because all state scores are 0. To avoid this, for each vector we identify the Gaussian (among all the Gaussians in all the Gaussian mixtures) with the highest density value for that vector, and normalize all Gaussians of all mixtures by that value. This ensures that
    1. at least one Gaussian, and therefore at least one Gaussian mixture, has a nonzero value (since the max Gaussian is normalized to 1.0)
    2. other Gaussian mixtures get a good chance of being nonzero: although their raw scores may be poor, they may be only a small distance from the best Gaussian, and the normalization pulls their scores up.
    The actual computation is done on log scores: we identify the Gaussian with the highest log score, subtract that log score from the log scores of all the Gaussians, and then exponentiate everything. dscale is the raw log score of the best Gaussian. To get the real value of any Gaussian density (or mixture), this term must be added to the normalized log score of that density. Normalization by dscale does not affect the computation of the gammas. However, it does affect the total likelihood, so it must be added back into the final log-likelihood.

    scale: normalization term. This normalization term is different from dscale. At each frame, we divide all alpha terms by the sum of all alpha terms. This is ALSO done to prevent underflow: even after the density values are normalized, the accumulation of transition probabilities (which are all less than 1) can drive all alpha terms for a frame into underflow. This scale factor likewise does not affect the final gammas, but it does affect the total likelihood and must be factored back in.

    The actual gamma computation ignores the scale terms, since they affect the numerator and the denominator of the gamma computation equally. The gamma itself is a normalized version of the product of alpha and beta at any state. The normalization term is the sum of (alpha*beta) over all states for that frame. This comes from the actual equations.
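
    The cancellation can be seen in a short sketch of the per-frame gamma computation. The function name compute_gamma and its signature are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Sphinx code.
 * For one frame: gamma[s] = alpha[s] * beta[s] / sum_k (alpha[k] * beta[k]).
 * Multiplying every alpha (or every beta) of the frame by a common factor
 * scales numerator and denominator alike, so gamma is unchanged. */
void compute_gamma(const double *alpha, const double *beta,
                   double *gamma, size_t n_state)
{
    double norm = 0.0;
    size_t s;

    assert(n_state > 0);

    /* Normalization term: sum of alpha * beta over all states. */
    for (s = 0; s < n_state; s++)
        norm += alpha[s] * beta[s];
    assert(norm > 0.0);

    for (s = 0; s < n_state; s++)
        gamma[s] = alpha[s] * beta[s] / norm;
}
```

    For alpha = {0.3, 0.1} and beta = {0.2, 0.4} this gives gamma = {0.6, 0.4}, and the result stays the same if all alphas are multiplied by 1e-5 and all betas by 1e3.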


Next generation speech recognition

This scratchpad just got extended: The endless, seamless list of unsolved problems in speech recognition