Speaker Adaptive Training for Deep Neural Network Acoustic Models

Yajie Miao, Hao Zhang, Florian Metze

Carnegie Mellon University

Introduction

Speaker adaptive training (SAT) is a standard technique for GMM acoustic models. We apply the concept of SAT to deep neural network (DNN) acoustic models. Training of SAT-DNN models starts from a fully-trained DNN model. We then train a smaller neural network, iVecNN, which takes speaker i-vectors as inputs and outputs linear feature shifts. These feature shifts are added to the original DNN inputs, resulting in more speaker-normalized features. Finally, the canonical DNN is updated in this newly-estimated feature space.

[Figure: architectures of the baseline DNN and the SAT-DNN]

Training

Suppose that we have a fully-trained DNN model and the i-vector for each speaker s. The idea of SAT-DNN can be formulated by the
following equation:

    a_t = o_t + iVecNN(i_s)

where i_s is the i-vector for speaker s and o_t is the original DNN input feature vector (e.g., fbanks) at frame t. iVecNN is a separate
neural network, depicted with green circles in the figure. For each speaker, iVecNN converts the speaker's i-vector into a linear
feature shift. Adding this shift to o_t makes the resulting DNN input vector a_t more speaker-normalized.
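
The equation above can be illustrated with a small NumPy sketch. It is not taken from the PDNN code: the layer sizes, the single sigmoid hidden layer, the random weights, and the names ivecnn and speaker_normalize are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real recipes define their own dimensions.
IVEC_DIM, HIDDEN_DIM, FEAT_DIM = 100, 64, 40

# Hypothetical iVecNN parameters: one sigmoid hidden layer, linear output layer.
W1 = rng.normal(scale=0.1, size=(IVEC_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, FEAT_DIM))
b2 = np.zeros(FEAT_DIM)

def ivecnn(i_s):
    """Map a speaker i-vector to a shift with the same dimension as o_t."""
    hidden = 1.0 / (1.0 + np.exp(-(i_s @ W1 + b1)))
    return hidden @ W2 + b2              # linear output layer

def speaker_normalize(frames, i_s):
    """a_t = o_t + iVecNN(i_s), applied to every frame of one speaker."""
    return frames + ivecnn(i_s)          # the shift broadcasts over the time axis

i_s = rng.normal(size=IVEC_DIM)              # i-vector of speaker s
frames = rng.normal(size=(300, FEAT_DIM))    # 300 frames o_t from speaker s
a = speaker_normalize(frames, i_s)
```

Note that the shift depends only on the speaker, so every frame of that speaker receives the same offset.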

The steps to train SAT-DNN models can be summarized as follows:

1) Train the baseline DNN over the training data, as we normally do
2) Extract i-vectors for training speakers
3) Learn the iVecNN network, by keeping the DNN fixed
4) Update the DNN model in the new feature space a_t, by keeping the iVecNN network fixed

Both the learning of iVecNN and the updating of the DNN model can be performed with standard error backpropagation and
stochastic gradient descent. The implementation can be found in our Kaldi+PDNN scripts and the PDNN toolkit.
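
The alternating schedule of steps 1, 3 and 4 can be sketched on toy data as follows. This is a minimal illustration under strong assumptions, not the PDNN implementation: the "DNN" is a single softmax layer, iVecNN is a single linear map V, the data and i-vectors are synthetic, and full-batch gradient descent stands in for stochastic gradient descent to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)
N, FEAT_DIM, IVEC_DIM, N_CLASSES, N_SPK = 512, 20, 8, 5, 4   # toy sizes

# Synthetic data: every frame has a class label and a speaker-dependent offset.
spk = rng.integers(0, N_SPK, size=N)
labels = rng.integers(0, N_CLASSES, size=N)
class_means = rng.normal(size=(N_CLASSES, FEAT_DIM))
spk_offsets = rng.normal(size=(N_SPK, FEAT_DIM))
feats = class_means[labels] + spk_offsets[spk] + 0.3 * rng.normal(size=(N, FEAT_DIM))
ivecs = rng.normal(size=(N_SPK, IVEC_DIM))[spk]   # stand-in for step 2, one row per frame
targets = np.eye(N_CLASSES)[labels]

def forward(W, b, V):
    """Return the shifted inputs, the class posteriors and the cross-entropy loss."""
    A = feats + ivecs @ V                          # a_t = o_t + iVecNN(i_s)
    logits = A @ W + b
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(N), labels]))
    return A, P, loss

def train(W, b, V, update_dnn, update_ivecnn, epochs=300, lr=0.02):
    """Gradient descent that updates only the selected network."""
    for _ in range(epochs):
        A, P, _ = forward(W, b, V)
        G = (P - targets) / N                      # gradient w.r.t. the logits
        dA = G @ W.T                               # error backpropagated to the DNN input
        if update_dnn:
            W = W - lr * (A.T @ G)
            b = b - lr * G.sum(axis=0)
        if update_ivecnn:
            V = V - lr * (ivecs.T @ dA)            # gradient flows through the frozen DNN
    return W, b, V

W = np.zeros((FEAT_DIM, N_CLASSES))
b = np.zeros(N_CLASSES)
V = np.zeros((IVEC_DIM, FEAT_DIM))                 # zero shift, i.e. the plain DNN

W, b, V = train(W, b, V, update_dnn=True, update_ivecnn=False)    # step 1
loss_baseline = forward(W, b, V)[2]
W, b, V = train(W, b, V, update_dnn=False, update_ivecnn=True)    # step 3
loss_ivecnn = forward(W, b, V)[2]
W, b, V = train(W, b, V, update_dnn=True, update_ivecnn=False)    # step 4
loss_sat = forward(W, b, V)[2]
print(loss_baseline, loss_ivecnn, loss_sat)
```

The key point of the sketch is that step 3 uses the gradient with respect to the DNN input (dA) to update iVecNN while the DNN weights stay untouched, and step 4 reverses the roles.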

Testing (Decoding)

During decoding, we simply need to extract the i-vector for each testing speaker. Feeding the i-vector to the SAT-DNN architecture will
automatically adapt SAT-DNN to this testing speaker. No initial decoding pass and no DNN fine-tuning are needed on the
adaptation data.
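
As a rough sketch of why decoding needs no extra pass, the adaptation amounts to computing one shift per test speaker and adding it to that speaker's utterances. The function name adapt_test_features, the dictionary layout, and the random linear stand-in for a trained iVecNN below are hypothetical; the actual recipes do this with Kaldi feature pipelines.

```python
import numpy as np

def adapt_test_features(utt_feats, utt2spk, spk_ivectors, ivecnn):
    """Return speaker-normalized features for every test utterance.

    utt_feats:    dict, utterance id -> (num_frames, feat_dim) array of o_t
    utt2spk:      dict, utterance id -> speaker id
    spk_ivectors: dict, speaker id -> i-vector
    ivecnn:       callable mapping an i-vector to a feat_dim shift (already trained)
    """
    # One forward pass of iVecNN per speaker; no decoding pass, no fine-tuning.
    shifts = {s: ivecnn(ivec) for s, ivec in spk_ivectors.items()}
    return {u: feats + shifts[utt2spk[u]] for u, feats in utt_feats.items()}

rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(100, 40))      # hypothetical stand-in for a trained iVecNN
spk_ivectors = {"spkA": rng.normal(size=100), "spkB": rng.normal(size=100)}
utt2spk = {"utt1": "spkA", "utt2": "spkA", "utt3": "spkB"}
utt_feats = {u: rng.normal(size=(50, 40)) for u in utt2spk}

adapted = adapt_test_features(utt_feats, utt2spk, spk_ivectors, lambda ivec: ivec @ M)
```

The adapted features would then be fed to the updated DNN exactly as unadapted features are fed to the baseline DNN.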

Implementation

SAT-DNN has been integrated into our Kaldi+PDNN recipes. You can check out the latest version from the repository and find the
following 4 recipes:

run_swbd_110h/run-dnn-fbank-sat.sh    --   Hybrid systems with filterbanks as input features

run_swbd_110h/run-dnn-sat.sh    --   Hybrid systems with fMLLRs as input features

run_swbd_110h/run-bnf-fbank-tandem-sat.sh    --   BNF tandem systems with filterbanks as input features

run_swbd_110h/run-bnf-tandem-sat.sh   --   BNF tandem systems with fMLLRs as input features

Before running each of the recipes, complete the following steps:
1) Generate the i-vectors by running run_swbd_110h/run-ivec-extract.sh.
2) Run the corresponding DNN recipe, for example run-dnn-fbank.sh before run-dnn-fbank-sat.sh.
3) Download the following two source files to src/featbin and compile them:

http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/get-spkvec-feat.cc
http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/add-feats.cc

Training of the SAT-DNN models is performed by the run_DNN_SAT.py command from the PDNN toolkit.

Results

These results use the Switchboard 110-hour setup. We report the word error rate (WER, in %) on the Switchboard part of the Hub'00
evaluation set. The DNN input features can be speaker-independent (SI) filterbanks or speaker-adapted (SA) fMLLRs.

Hybrid Models

Model          Filterbank features (recipe, WER)    fMLLR features (recipe, WER)
DNN Baseline   run-dnn-fbank.sh, 21.7%              run-dnn.sh, 19.2%
SAT-DNN        run-dnn-fbank-sat.sh, 19.3%          run-dnn-sat.sh, 17.9%

Bottleneck Feature Tandem Systems

Model          Filterbank features (recipe, WER)    fMLLR features (recipe, WER)
BNF Baseline   run-bnf-fbank-tandem.sh, 19.6%       run-bnf-tandem.sh, 18.0%
SAT-BNF        run-bnf-fbank-tandem-sat.sh, 18.0%   run-bnf-tandem-sat.sh, 17.5%
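
To compare the gains across the four setups, the absolute WERs above can be converted into relative WER reductions. The short script below only re-uses the numbers from the two tables; the helper name relative_reduction is ours.

```python
# (baseline WER %, SAT WER %) taken from the two tables above.
wer = {
    "Hybrid, filterbank": (21.7, 19.3),
    "Hybrid, fMLLR": (19.2, 17.9),
    "BNF tandem, filterbank": (19.6, 18.0),
    "BNF tandem, fMLLR": (18.0, 17.5),
}

def relative_reduction(baseline, adapted):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - adapted) / baseline

for name, (baseline, adapted) in wer.items():
    print(f"{name}: {relative_reduction(baseline, adapted):.1f}% relative")
```

This gives relative reductions of 11.1% and 6.8% for the hybrid systems and 8.2% and 2.8% for the BNF tandem systems, so the gains are larger on SI filterbanks than on fMLLR features that are already speaker-adapted.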

References

Yajie Miao, Hao Zhang, Florian Metze. Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models.
INTERSPEECH 2014.

Yajie Miao, Lu Jiang, Hao Zhang, Florian Metze. Improvements to Speaker Adaptive Training of Deep Neural Networks.
SLT 2014.