Compensation Schemes for SPINE


Codeword Dependent Cepstral Normalization

Assumes that the training data is clean
Assumes that the test data has been corrupted by linear filtering and additive noise.
  • y = x + h + IDCT(log(1 + exp(DCT( n - x - h))))
  • y = cepstrum of noisy speech
  • x = cepstrum of clean speech that was corrupted to give noisy speech
  • h = cepstrum of impulse response of linear filter
  • n = cepstrum of noise
    Estimates the value of h (cepstrum of impulse response of the linear filter) and n (cepstrum of the additive noise) and compensates for them to estimate x from y
    The recipe:
    • Estimate a Gaussian mixture distribution from the cepstra of clean training speech
    • For each test utterance, obtain ML estimate of linear filter and additive noise parameters h and n, based on this distribution and the test utterance itself
    • Use a Minimum Mean Squared Error estimator to compensate for the effect of the linear filter and the additive noise and estimate x from y
    Models the effect of linear filtering and additive noise as a shift of the means of the Gaussians in the Gaussian mixture distribution
    Variances of the Gaussians are assumed to be invariant with increasing noise and filtering
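
The compensation step of the recipe can be sketched as follows. This is only an illustration under stated assumptions, not the original CDCN code: h and n are taken as already estimated (the ML estimation is not shown), the cepstrum and log spectrum are related by a square orthonormal DCT matrix, and covariances are diagonal. The names dct_matrix, correction and cdcn_mmse are invented for this example.

```python
import math

def dct_matrix(size):
    """Orthonormal DCT-II matrix C (assumed convention: cepstrum = C * log_spectrum)."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / size)
             * math.cos(math.pi * k * (2 * j + 1) / (2 * size))
             for j in range(size)] for k in range(size)]

def apply(matrix, vec):
    """Matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def correction(mean, h, n, C):
    """Nonlinear term of y = x + h + ... evaluated at a Gaussian mean."""
    Ct = [list(col) for col in zip(*C)]  # transpose = inverse of orthonormal C
    log_spec = apply(Ct, [ni - mi - hi for ni, mi, hi in zip(n, mean, h)])
    return apply(C, [softplus(v) for v in log_spec])

def cdcn_mmse(y, weights, means, variances, h, n):
    """MMSE estimate of the clean cepstrum x from the noisy cepstrum y."""
    C = dct_matrix(len(y))
    corrections = [correction(m, h, n, C) for m in means]
    # Posterior of each Gaussian given y: means are shifted, variances unchanged.
    log_post = []
    for w, m, var, r in zip(weights, means, variances, corrections):
        ll = math.log(w)
        for yi, mi, hi, ri, vi in zip(y, m, h, r, var):
            ll -= 0.5 * (math.log(2 * math.pi * vi) + (yi - mi - hi - ri) ** 2 / vi)
        log_post.append(ll)
    top = max(log_post)
    post = [math.exp(lp - top) for lp in log_post]
    total = sum(post)
    post = [p / total for p in post]
    # x_hat = y - h - sum_k P(k | y) * r_k
    return [yi - hi - sum(p * r[d] for p, r in zip(post, corrections))
            for d, (yi, hi) in enumerate(zip(y, h))]
```

When the noise is negligible the correction terms vanish and the estimate reduces to y - h, which is a convenient sanity check.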

    Vector Taylor Series

    Assumes that the training data is clean
    Assumes that the test data has been corrupted by linear filtering and additive noise.
  • y = x + h + log(1 + exp(n - x - h)) (Note: these are log-spectral relations now. No DCT is involved)
    Estimates the log-spectral values of the impulse response of the linear filter and the additive noise and compensates for them.
    The recipe:
    • Estimate a Gaussian mixture distribution from the log spectra of clean speech
    • For each test utterance, obtain ML estimates of h (the log spectrum of the impulse response of the linear filter) and of the mean and variance of n (the log spectrum of the additive noise), based on this distribution and the test utterance itself
    • Use a Minimum Mean Squared Error estimator to compensate for the effect of the linear filter and the additive noise
    Models the effect of linear filtering and additive noise as a shift of the means and a scaling of the variances of the Gaussians in the Gaussian mixture distribution
    Unlike CDCN, variances are updated. Processing is done in the log-spectral domain, rather than the cepstral domain
    More effective than CDCN, but also more unstable (blows up if the linear filter/additive noise model is incorrect)
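
The mean shift and variance scaling can be illustrated with a first-order Taylor expansion of y = x + h + log(1 + exp(n - x - h)) around the Gaussian mean and the noise mean. The sketch below is an assumption-laden illustration, not the original implementation: covariances are diagonal, h is treated as a fixed vector, and the function name vts_compensate is invented here.

```python
import math

def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def vts_compensate(mean_x, var_x, h, mean_n, var_n):
    """Map one clean log-spectral Gaussian to its noisy counterpart.

    First-order expansion at (mean_x, mean_n), one dimension at a time:
      dy/dx = 1 / (1 + exp(n - x - h)) = a,   dy/dn = 1 - a
    """
    mean_y, var_y = [], []
    for mx, vx, hh, mn, vn in zip(mean_x, var_x, h, mean_n, var_n):
        g = mn - mx - hh
        a = math.exp(-softplus(g))          # equals 1 / (1 + exp(g)), overflow-safe
        mean_y.append(mx + hh + softplus(g))            # shifted mean
        var_y.append(a * a * vx + (1.0 - a) ** 2 * vn)  # scaled variance
    return mean_y, var_y
```

At high SNR the result approaches the clean Gaussian shifted by h; at low SNR it approaches the noise distribution, which is one way to see why the variances cannot stay fixed.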

    VTS based HMM compensation

    Assumes that the training data is clean
    Assumes that the test data has been corrupted by linear filtering and additive noise.
    Estimates the log-spectral values of the impulse response of the linear filter and the additive noise and modifies HMMs to account for them
    The recipe:
    • For each test utterance, obtain ML estimate of linear filter and the mean and variance of the additive noise parameters based on clean speech HMM and the test utterance itself
    • Modify the means and variances of the Gaussians in the recognizer to account for the channel and the noise
    • Decode using modified recognizer
    Models the effect of linear filtering and additive noise as a shift of the means and a scaling of the variances of the Gaussians in the Gaussian mixture distribution
    Computationally and implementationally far more complex than VTS
    Requires two passes of decoding: the first obtains a hypothesis, from which the noise and channel are estimated; the second decodes with the compensated models. This can be iterated
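
The control flow of the recipe can be sketched as below. The three callables (decode, estimate_environment, compensate) are hypothetical placeholders for the recognizer, the ML estimator and the model update; they are not part of any real toolkit, and only the ordering of the steps is meant to be illustrated.

```python
def decode_with_vts_compensation(utterance, clean_hmm, decode,
                                 estimate_environment, compensate,
                                 n_iterations=1):
    """Two-pass (optionally iterated) decoding with compensated HMMs.

    decode(utterance, hmm) -> hypothesis
    estimate_environment(utterance, clean_hmm, hypothesis) -> environment
    compensate(clean_hmm, environment) -> adapted hmm
    All three are supplied by the caller (hypothetical interfaces).
    """
    hypothesis = decode(utterance, clean_hmm)   # first pass, clean models
    adapted = clean_hmm
    for _ in range(n_iterations):
        # Estimate channel and noise from the utterance and current hypothesis.
        env = estimate_environment(utterance, clean_hmm, hypothesis)
        # Shift the means and scale the variances of the clean models.
        adapted = compensate(clean_hmm, env)
        # Decode again with the compensated models.
        hypothesis = decode(utterance, adapted)
    return hypothesis, adapted
```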

    VTS based environment normalization

    Assumes that the training data has been corrupted by linear filtering and additive noise
    Assumes that the test data has also been corrupted by linear filtering and additive noise.
    Estimates linear filter and additive noise cepstral values both during training and decoding
    The recipe:
    • For each training utterance, estimate linear filter and additive noise parameters using current HMM parameters. Compensate utterance for linear filter and noise before adding to training buffers
    • For each test utterance, obtain ML estimate of linear filter and the mean and variance of the additive noise parameters based on clean speech HMM and the test utterance itself
    • Modify the means and variances of the Gaussians in the recognizer to account for the channel and the noise
    • Decode using modified recognizer
    Models the effect of linear filtering and additive noise as a shift of the means and a scaling of the variances of the Gaussians in the Gaussian mixture distribution
    Computationally and implementationally very complex, especially during training

    Maximum Likelihood Linear Regression

    Assume the means of the Gaussians have been transformed by an affine transform
    Estimate parameters of this transform and update the means
    Recognize using updated means
    Principal component MLLR or inter-class MLLR is more effective for short utterances (ref. Sam Joo Doh)
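
The mean update itself is simple once the transform is known. The sketch below only applies an affine transform (A, b) to every Gaussian mean; estimating A and b to maximize the likelihood of the adaptation data is the hard part and is not shown. The function name is invented for this example.

```python
def mllr_update_means(means, A, b):
    """Return the adapted means mu' = A * mu + b for every Gaussian mean."""
    return [[sum(a * m for a, m in zip(row, mean)) + bi
             for row, bi in zip(A, b)]
            for mean in means]
```

For example, with A = [[2, 0], [0, 1]] and b = [1, -1], the mean [1, 1] maps to [3, 0].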

    Interpolating Models

    Train separate models for separate noise conditions
    Model optimal HMM for test data as an interpolation of these models
    Estimate the interpolation factor from the test data (e.g. from the estimated SNR)
    Interpolate between the various models to obtain HMMs for decoding
    For more details, refer to Juan Huerta
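
A toy version of the idea, for two models only: the interpolation weight is taken to vary linearly with the estimated SNR between the SNRs of the two training conditions. That mapping is an assumption made for this sketch (the notes only say the factor may come from the estimated SNR), and only Gaussian means are interpolated here.

```python
def interpolation_weight(snr_db, low_snr_db, high_snr_db):
    """Weight of the high-SNR model, linear in SNR and clipped to [0, 1]."""
    w = (snr_db - low_snr_db) / (high_snr_db - low_snr_db)
    return min(1.0, max(0.0, w))

def interpolate_means(means_low, means_high, weight_high):
    """Interpolate corresponding Gaussian means of the two models."""
    return [[(1.0 - weight_high) * lo + weight_high * hi
             for lo, hi in zip(mean_lo, mean_hi)]
            for mean_lo, mean_hi in zip(means_low, means_high)]
```

With models trained at 0 dB and 20 dB, a test utterance estimated at 10 dB would get a weight of 0.5, i.e. means halfway between the two models.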

    Spectral subtraction

    The simplest noise reduction scheme
    The recipe:
    • Obtain running estimate of noise spectrum based on silence regions
    • Subtract noise spectrum from the power spectrum of noisy speech
    • Compute cepstra from power subtracted spectrum
    Can be performed on both training and test speech
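
The first two steps of the recipe might look like this. It is a minimal sketch with several assumptions: the silence flags come from some external detector (not shown), the running noise estimate is an exponential average with a made-up smoothing constant, and the result is floored at a small fraction of the input power so that a logarithm can be taken afterwards. The cepstral computation is omitted.

```python
def spectral_subtraction(power_frames, is_silence, smoothing=0.9, floor=0.01):
    """Subtract a running noise estimate from each power-spectrum frame.

    power_frames: list of frames, each a list of power values per frequency bin
    is_silence:   one boolean per frame, True where the frame is non-speech
    """
    noise = None
    cleaned = []
    for frame, silent in zip(power_frames, is_silence):
        if silent:
            # Update the running noise estimate from silence regions only.
            if noise is None:
                noise = list(frame)
            else:
                noise = [smoothing * nz + (1.0 - smoothing) * p
                         for nz, p in zip(noise, frame)]
        if noise is None:
            cleaned.append(list(frame))  # no noise estimate available yet
        else:
            # Subtract, but never go below a fraction of the original power.
            cleaned.append([max(p - nz, floor * p)
                            for p, nz in zip(frame, noise)])
    return cleaned
```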

    Multi-style training

    Pool training data from all available noise conditions and train a single set of models
    No additional processing during decoding