Sphinx-3 has code that looks like this, in bw/baum_welch.c,
around line 219:
    /* Calculate log[ p( O | \lambda ) ] */
    log_fp = log(active_alpha[n_obs-1][i]);
    for (t = 0; t < n_obs; t++) {
        log_fp -= log(scale[t]);
        for (j = 0; j < inv->gauden->n_feat; j++) {
            log_fp += dscale[t][j];
        }
    }
Here's what the scaling means:
dscale: density scaling factor.
At each frame we compute several Gaussians, for several
Gaussian mixtures (one for each tied state). When all
Gaussian mixtures are poorly trained (early in training)
or when a data vector is an outlier, all Gaussians
often underflow and the scores for all tied states
become 0. The forward algorithm then cannot proceed,
because every state score is 0.
To avoid this, for each vector we identify the Gaussian
(from all the Gaussians in all the Gaussian mixtures)
with the highest density value for that vector, and
normalize all Gaussians of all mixtures with respect
to this Gaussian value. This ensures that
- at least one Gaussian, and therefore at least one
Gaussian mixture has a non-zero value (since the
max Gaussian is normalized to 1.0)
- other Gaussian mixtures get a good chance of becoming
non-zero since, although individually they may have poor
scores, they may be only a small distance away from
the best Gaussian, and the normalization pulls these
scores up.
The actual computation is done with the log scores (i.e.
we identify the Gaussian with the highest logscore, and
subtract the log score of this Gaussian out of the log
scores of all other Gaussians, and then exponentiate
everything). dscale is the raw log score of the best
Gaussian. To get the real value of any Gaussian density
(or mixture), this term must be added to the normalized
logscore of that density.
Normalization by dscale does not affect the computation
of gammas. However, it does affect the computation of
the total likelihood, so dscale must be added back into
the final log likelihood.
scale: normalization term. This normalization term is different from
dscale. At each frame, we divide all alpha terms by the sum of all alpha
terms. This is ALSO performed to prevent underflow, since even after
normalizing density values, the accumulation of transition probabilities
(which are all less than 1) can cause all alpha terms for a frame to go
into underflow. This scale factor also does not affect the final gammas,
but does affect total likelihood and must be factored back in.
The actual gamma computation ignores the scale terms, since they affect
both the numerator and denominator terms of the gamma computation
equally and therefore cancel. The gamma itself is a normalized version of
the product of alpha and beta at any state. The normalization term is the
sum of (alpha*beta) over all states for that frame. This follows directly
from the Baum-Welch equations.
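To illustrate why a common scale factor drops out of the gammas, here is a small sketch of the gamma computation for one frame. The function name compute_gamma and the flat per-state arrays are assumptions made for the example, not the Sphinx data layout.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: gamma[s] = alpha[s]*beta[s] / sum over all
 * states of alpha*beta, for a single frame.
 */
void compute_gamma(const double *alpha, const double *beta,
                   double *gamma, size_t n_state)
{
    size_t s;
    double norm = 0.0;

    assert(n_state > 0);

    /* Normalization term: sum of alpha*beta over all states. */
    for (s = 0; s < n_state; s++)
        norm += alpha[s] * beta[s];
    assert(norm > 0.0);

    /* Any constant multiplied into every alpha (or every beta)
     * appears in both numerator and norm, so it cancels here. */
    for (s = 0; s < n_state; s++)
        gamma[s] = alpha[s] * beta[s] / norm;
}
```

With alpha = {0.2, 0.6} and beta = {0.5, 0.1} the products are 0.1 and 0.06, so the gammas are 0.625 and 0.375. Multiplying both alphas by the same constant leaves these values unchanged up to rounding.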