Speaker Adaptive Training for Deep Neural Network Acoustic Models

    Yajie Miao,  Hao Zhang,  Florian Metze
    Carnegie Mellon University
Introduction
   
Speaker adaptive training (SAT) is a standard technique for GMM acoustic models. We apply the concept of SAT to deep neural network (DNN) acoustic models. Training of a SAT-DNN model starts from a fully-trained DNN. We then train a smaller neural network, iVecNN, which takes speaker i-vectors as inputs and outputs linear feature shifts. These shifts are added to the original DNN inputs, producing more speaker-normalized features. The canonical DNN is finally updated in this newly-estimated feature space.
 
[Figure: the baseline DNN (left) and the SAT-DNN architecture with the iVecNN adaptation network (right)]
Training
    
Suppose that we have a fully-trained DNN model and an i-vector for each speaker s. The idea of SAT-DNN can be formulated by the following equation:


                                                                             a_t = o_t + iVecNN(i_s)

where i_s is the i-vector for speaker s and o_t is the original DNN input feature vector (e.g., filterbanks). iVecNN is a separate neural network, depicted with green circles in the figure, that converts each speaker's i-vector into a linear feature shift. With this shift added, the resulting DNN input vector a_t becomes more speaker-normalized.
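To make the equation concrete, below is a minimal NumPy sketch of the feature-shift computation. The dimensions, the random stand-in weights, and the one-hidden-layer shape of iVecNN are illustrative assumptions, not the actual PDNN configuration:

    import numpy as np

    # Hypothetical dimensions: 100-dim i-vectors, 40-dim filterbank frames.
    IVEC_DIM, FEAT_DIM, HIDDEN = 100, 40, 512

    rng = np.random.default_rng(0)
    # Stand-in iVecNN weights; in practice these are learned (see Training).
    W1 = 0.01 * rng.standard_normal((IVEC_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
    W2 = 0.01 * rng.standard_normal((HIDDEN, FEAT_DIM)); b2 = np.zeros(FEAT_DIM)

    def ivec_nn(i_s):
        """Map a speaker i-vector to a linear feature shift."""
        return np.tanh(i_s @ W1 + b1) @ W2 + b2

    i_s = rng.standard_normal(IVEC_DIM)          # i-vector for speaker s
    o_t = rng.standard_normal((300, FEAT_DIM))   # 300 frames from speaker s
    a_t = o_t + ivec_nn(i_s)                     # one shift added to all frames

Note that the shift depends only on the speaker, not on the frame, so it is computed once per speaker and added to every frame of that speaker's data.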

The steps to train SAT-DNN models can be summarized as follows:
  
1) Train the baseline DNN over the training data, as we normally do.
2) Extract i-vectors for the training speakers.
3) Learn the iVecNN network, keeping the DNN fixed.
4) Update the DNN model in the new feature space a_t, keeping the iVecNN network fixed.
 
Both the learning of iVecNN (step 3) and the updating of the DNN model (step 4) can be performed with standard error backpropagation and stochastic gradient descent; a sketch of the alternating updates is given below. The implementation can be found in our Kaldi+PDNN scripts and the PDNN toolkit.
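As a concrete illustration of this alternating optimization, here is a minimal PyTorch sketch of steps 3 and 4. PDNN itself is Theano-based, so the layer sizes, learning rates, and variable names below are assumptions made for the example, not the actual PDNN code:

    import torch
    import torch.nn as nn

    feat_dim, ivec_dim, num_pdfs = 40, 100, 3000   # illustrative dimensions

    # The canonical DNN (fully-trained baseline) and the iVecNN network.
    dnn = nn.Sequential(nn.Linear(feat_dim, 1024), nn.Sigmoid(),
                        nn.Linear(1024, num_pdfs))
    ivec_nn = nn.Sequential(nn.Linear(ivec_dim, 512), nn.Tanh(),
                            nn.Linear(512, feat_dim))
    loss_fn = nn.CrossEntropyLoss()

    def sgd_step(trainable, frozen, opt, o_t, i_s, labels):
        """One SGD step on `trainable` while `frozen` stays fixed."""
        for p in frozen.parameters():
            p.requires_grad_(False)
        for p in trainable.parameters():
            p.requires_grad_(True)
        a_t = o_t + ivec_nn(i_s)           # speaker-normalized features
        loss = loss_fn(dnn(a_t), labels)   # cross-entropy over senone labels
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    o_t = torch.randn(256, feat_dim)           # a minibatch of input frames
    i_s = torch.randn(ivec_dim)                # i-vector of the current speaker
    labels = torch.randint(num_pdfs, (256,))   # frame-level senone targets

    # Step 3: learn iVecNN with the DNN fixed.
    opt_ivec = torch.optim.SGD(ivec_nn.parameters(), lr=0.01)
    sgd_step(ivec_nn, dnn, opt_ivec, o_t, i_s, labels)

    # Step 4: update the DNN with iVecNN fixed.
    opt_dnn = torch.optim.SGD(dnn.parameters(), lr=0.01)
    sgd_step(dnn, ivec_nn, opt_dnn, o_t, i_s, labels)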


Testing (Decoding)

    
During decoding, we simply need to extract an i-vector for each testing speaker. Feeding this i-vector to the SAT-DNN architecture automatically adapts the model to that speaker. No initial decoding pass and no DNN fine-tuning are needed on the adaptation data.
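Continuing the NumPy sketch from the Training section (it reuses the rng, ivec_nn, IVEC_DIM, and FEAT_DIM defined there, with random stand-in test data), decode-time adaptation reduces to computing one shift per test speaker:

    i_test = rng.standard_normal(IVEC_DIM)              # test speaker's i-vector
    test_feats = rng.standard_normal((500, FEAT_DIM))   # that speaker's frames
    adapted = test_feats + ivec_nn(i_test)              # decode with the SAT-DNN

No parameters are re-estimated here; adaptation is a single forward pass through iVecNN.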

    

Implementation
    
SAT-DNN has been integrated into our Kaldi+PDNN recipes. You can check out the latest version from the repository, where you will find the following 4 recipes:


run_swbd_110h/run-dnn-fbank-sat.sh  --  Hybrid systems with filterbanks as input features
run_swbd_110h/run-dnn-sat.sh  --  Hybrid systems with fMLLRs as input features
run_swbd_110h/run-bnf-fbank-tandem-sat.sh  --  BNF tandem systems with filterbanks as input features
run_swbd_110h/run-bnf-tandem-sat.sh  --  BNF tandem systems with fMLLRs as input features
Before running each of the recipes, make sure that:

1) I-vectors have been generated by running run_swbd_110h/run-ivec-extract.sh.
2) The corresponding DNN recipe has been run beforehand. For example, run-dnn-fbank.sh for run-dnn-fbank-sat.sh.
3) The following two source files have been downloaded to src/featbin and compiled:

http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/get-spkvec-feat.cc
http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/add-feats.cc

 

Training of the SAT-DNN models is performed by the run_DNN_SAT.py command from the PDNN toolkit.
 
Results
 
This is the Switchboard 110-hour setup. We show WER% on the Switchboard part of the Hub5'00 evaluation set. The DNN input features are either speaker-independent (SI) filterbanks or speaker-adapted (SA) fMLLR features.


Hybrid Models

                 filterbank features              fMLLR features
  DNN Baseline   run-dnn-fbank.sh        21.7%    run-dnn.sh        19.2%
  SAT-DNN        run-dnn-fbank-sat.sh    19.3%    run-dnn-sat.sh    17.9%

Bottleneck Feature Tandem Systems

                 filterbank features                  fMLLR features
  BNF Baseline   run-bnf-fbank-tandem.sh      19.6%   run-bnf-tandem.sh       18.0%
  SAT-BNF        run-bnf-fbank-tandem-sat.sh  18.0%   run-bnf-tandem-sat.sh   17.5%

References

Yajie Miao, Hao Zhang, Florian Metze, "Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models," in Proc. INTERSPEECH, 2014.

Yajie Miao, Lu Jiang, Hao Zhang, Florian Metze, "Improvements to Speaker Adaptive Training of Deep Neural Networks," in Proc. SLT, 2014.