SPEAKER ADAPTATION IN CONTINUOUS SPEECH RECOGNITION
VIA ESTIMATION OF CORRELATED MEAN VECTORS
The present study addressed the problem of speaker adaptation in both
feature-based and stochastic model-based continuous speech recognition
systems. Effective speaker adaptation procedures must be able to adapt
to the characteristics of a new speaker given speaker-specific training
data in quantities well below those required for training
speaker-dependent systems. The adaptation algorithm must be computationally
efficient to allow for a short enrollment process. Since the basic
recognition unit in continuous speech recognition systems is at the
sub-word level, user feedback of unit labels is impractical. The adaptation
algorithm should therefore operate in an unsupervised mode.
The approach taken in this thesis was to use multivariate parameter
estimation procedures to update the mean values of the component densities
which comprise a feature-based system's classifiers, or a stochastic
model-based system's codebook. Emphasis was placed on obtaining low initial
estimation error with a computationally efficient algorithm. Adaptive
filtering techniques were exploited to derive an estimator which met these
conditions. The Bayesian optimal (EMAP) estimator was first shown to be
equivalent to a minimum mean-square error (MMSE) adaptive filter with
time-varying data statistics. A stochastic gradient approximation of the
MMSE formulation resulted in a least mean-square estimator, called LMS-C,
which with proper initialization produced a faster rate of convergence
than the Bayesian estimator. Computational requirements of the LMS-C
estimate are approximately one-third of those of the EMAP estimate.
Unlike the EMAP estimate, however, the LMS-C estimate is asymptotically
biased. This misadjustment is negligible in the context of the speaker
adaptation problem.
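The correlated-mean update at the heart of an LMS-style estimator can be sketched as follows. This is a minimal illustration, not the thesis's LMS-C algorithm itself: the correlation matrix R, the step size, and the synthetic data are all made-up stand-ins. The key idea it shows is that an innovation observed for one class also moves the mean estimates of correlated classes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, dim = 4, 2
# Hypothetical prior correlation between class means; in practice this
# would be estimated from a population of speaker-dependent systems.
R = 0.6 * np.ones((n_classes, n_classes)) + 0.4 * np.eye(n_classes)

true_means = rng.normal(size=(n_classes, dim))   # the new speaker's means
mu = np.zeros((n_classes, dim))                  # initial (prior) means
eta = 0.1                                        # LMS step size

for _ in range(2000):
    k = rng.integers(n_classes)                  # class of the incoming sample
    x = true_means[k] + 0.3 * rng.normal(size=dim)
    err = x - mu[k]
    # Correlated update: the error for class k updates every class's mean,
    # weighted by that class's prior correlation with class k.
    mu += eta * np.outer(R[:, k], err)
```

Because each sample improves all correlated means at once, convergence from the prior is faster than updating each class independently, which is the property the text attributes to LMS-C.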
Expressions which define the LMS-C algorithm and its mean-square estimation
error were derived and analyzed assuming correlated, jointly gaussian data
distributions. Compared with maximum likelihood (ML) estimation, the
additional expense required for LMS-C (or EMAP) estimation was shown to be
justified when the dogmatism of the data is neither very large nor very
small, and training data is limited. Relative gains of LMS-C and EMAP
estimates over ML estimates were shown to increase with increasing
correlation between the data means and with increasing skew in the classes'
prior probabilities.
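The trade-off between ML estimation and prior-dependent (Bayesian) estimation under limited data can be illustrated with a toy one-dimensional experiment. The variances, sample size, and trial count below are arbitrary choices for illustration, not figures from the thesis: when adaptation data is scarce and the dogmatism (here, the ratio of within-class to prior variance) is moderate, shrinking the ML estimate toward the prior mean lowers mean-square error.

```python
import numpy as np

rng = np.random.default_rng(1)

prior_mean, prior_var = 0.0, 1.0
noise_var = 4.0          # within-class (observation) variance
n = 3                    # very limited adaptation data per trial
trials = 20000

mse_ml = mse_bayes = 0.0
for _ in range(trials):
    m = rng.normal(prior_mean, np.sqrt(prior_var))      # true mean, drawn from prior
    x = rng.normal(m, np.sqrt(noise_var), size=n)       # adaptation samples
    ml = x.mean()                                       # ML estimate: sample mean
    # Bayesian estimate shrinks the sample mean toward the prior mean;
    # the weight depends on n and the dogmatism noise_var / prior_var.
    w = n / (n + noise_var / prior_var)
    bayes = w * ml + (1 - w) * prior_mean
    mse_ml += (ml - m) ** 2
    mse_bayes += (bayes - m) ** 2
```

With very large n (or very small dogmatism) the shrinkage weight approaches one and the two estimates coincide, which is why the extra expense is only justified in the limited-data regime described above.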
The general limitations of LMS-C, EMAP, and ML adaptation procedures were
assessed in the context of unsupervised speaker adaptation in the Carnegie
Mellon ANGEL system, a novel feature-based system called PROPHET, and a
semi-continuous version of the CMU SPHINX system. Comparisons between the
ANGEL and PROPHET systems indicated the necessity for the adaptation data
to obey the gaussian assumptions made in derivation of the estimation
algorithms. When these assumptions were met (using computer-generated data),
adaptation using the LMS-C or EMAP algorithms reduced front vowel
classification error rates by 28% after the presentation of 10 unlabeled
training samples. Five iterations through the training data were shown to
reduce the error rate by an additional 10% over the one-iteration rate.
Unsupervised adaptation experiments with a synthetic HMM indicated that
the EMAP and LMS-C estimates were able to produce an estimation error
lower than the ML estimate only when the dogmatism of the data was low.
It was shown that the unsupervised ML estimate, as specified by the HMM
reestimation procedure, produced an estimation error which was initially
much larger than the supervised form of this estimate. Because the
EMAP and LMS-C estimates depend on the ML estimate, the performance of
these two algorithms was also reduced. Repeated iteration of the
forward-backward algorithm eventually reduced the unsupervised level of
error to that of the supervised estimate. It was also shown that the
unsupervised form of the ML estimate implicitly models the correlation
of the data means which serves to reduce estimation error as the data
means become more correlated.
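The gap between supervised and unsupervised ML estimation can be seen in a deliberately simple mixture example. This is a hand-built illustration with made-up numbers, not the HMM forward-backward reestimation procedure itself: with heavily overlapped initial means, the soft (posterior-weighted) assignments are nearly uniform, so a single unsupervised reestimation step lands far from the true means, while the label-weighted supervised estimate does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical 1-D classes; labels are known only to the supervised estimator.
means, sigma = np.array([-1.0, 1.0]), 1.0
labels = rng.integers(2, size=500)
x = rng.normal(means[labels], sigma)

# Supervised ML: average the samples that truly belong to each class.
sup = np.array([x[labels == k].mean() for k in range(2)])

# Unsupervised ML: one EM-style step from a poor initial guess,
# weighting each sample by its posterior class probability.
mu0 = np.array([-0.1, 0.1])
logp = -0.5 * (x[:, None] - mu0[None, :]) ** 2     # equal priors, unit variance
post = np.exp(logp - logp.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)
unsup = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)
```

Iterating the unsupervised step would sharpen the posteriors and close the gap, mirroring the behavior reported above for repeated forward-backward passes.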
Mean vector adaptation in SPHINX was less successful than in the
feature-based systems because the dogmatism of the data in SPHINX was
more than twice as large. The SPHINX system's
performance using LMS-C, EMAP, and ML codebook mean vector adaptation
methods was compared with the system using no adaptation. Results showed
an overall reduction of 2.0 to 3.4% in word error rate due to adaptation
for a set of 11 speakers from the DARPA resource management task. Using
a distance metric applied to the adapted codebooks, word error rates were
reduced on average by 15% for those speakers automatically identified as
good candidates for adaptation.