Continuous Broadcast News Acoustic Models

HUB96-HUB97 (or HUB97-HUB98, depending on whom you ask) [1]

 

Note: This document is maintained by Arthur Chan and Evandro Gouvea.

 

HUB96-HUB97 – ~200hrs - 8000 senones

About 200 hours of English training data from the HUB96 and HUB97 training sets were used.

 

Forced alignment: Acoustic models from a previous HUB97 run were used for this step. Of the 71,666 utterances in the training set, 70,405 were force-aligned. Note that there was a small number of out-of-vocabulary (OOV) words. Almost all of the OOV words were words that I did not know how to pronounce. Rather than insert potentially incorrect pronunciations, which would have made the acoustic models less accurate, I left the unknown words out.
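As an illustration of the OOV check described above, here is a minimal sketch in Python. It is not the script that was used for this training run. The dictionary layout (one word followed by its phones per line, alternates written as WORD(2)) is an assumption, and load_vocabulary and find_oov_words are hypothetical helper names. The last lines only restate the alignment counts given above as a percentage.

```python
def load_vocabulary(dictionary_lines):
    """Collect the words that have at least one pronunciation."""
    vocab = set()
    for line in dictionary_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without phones
        word = fields[0].upper()
        # Assumed convention: alternate pronunciations are WORD(2), WORD(3), ...
        if word.endswith(")") and "(" in word:
            word = word[: word.rindex("(")]
        vocab.add(word)
    return vocab


def find_oov_words(transcript, vocab):
    """Return the words of one transcript that the dictionary lacks."""
    return [w for w in transcript.upper().split() if w not in vocab]


# Toy dictionary and transcript, invented for this example.
dictionary = ["HELLO HH AH L OW", "WORLD W ER L D", "WORLD(2) W AO R L D"]
vocab = load_vocabulary(dictionary)
oov = find_oov_words("hello brave world", vocab)

# Utterance counts quoted in the text above.
aligned, total = 70405, 71666
coverage = 100.0 * aligned / total
print(oov, f"{coverage:.2f}% of utterances aligned")
```

With the counts above, 1,261 utterances (71,666 minus 70,405) did not make it through forced alignment.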

 

 

3 state vs. 5 state: Two training runs were performed, one with 3-state and one with 5-state models. All other variables were held constant except skipstate, which was set to ‘no’ for the 3-state models and ‘yes’ for the 5-state models. (Forced alignment was not redone, so the same utterances were used to train both sets of models.)
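To make the difference between the two configurations concrete, the following Python sketch builds the allowed-transition pattern of a generic left-to-right HMM with and without skip transitions. It illustrates the general idea only. The extra non-emitting exit column and the function name left_to_right_topology are assumptions of this sketch, not a description of the exact topology file the trainer writes.

```python
def left_to_right_topology(n_states, skip):
    """Return an n_states x (n_states + 1) 0/1 matrix of allowed transitions.

    Each row is an emitting state. The extra last column stands for a
    non-emitting exit state (an assumption of this sketch).
    """
    rows = []
    for i in range(n_states):
        row = [0] * (n_states + 1)
        row[i] = 1          # self loop
        row[i + 1] = 1      # advance to the next state
        if skip and i + 2 <= n_states:
            row[i + 2] = 1  # jump over one state
        rows.append(row)
    return rows


three_state = left_to_right_topology(3, skip=False)
five_state = left_to_right_topology(5, skip=True)

for name, topology in (("3 state", three_state), ("5 state", five_state)):
    print(name)
    for row in topology:
        print(" ", " ".join(str(x) for x in row))
```

Without skips, every path through a model visits all of its states. With skips, a 5-state model can be traversed in fewer frames than it has states, which is one common reason to pair longer topologies with skip transitions.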

 

Training Variables:

 

Gaussians: Context-dependent models were built with 1, 2, 4, 8, 16 and 32 Gaussians. This was done to allow for speed/accuracy testing.
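For a rough sense of how model size grows across these settings, the Python sketch below counts Gaussian parameters. Several things here are assumptions rather than facts from the training run: that the Gaussian count is per senone, that covariances are diagonal, and that the feature vector has 39 dimensions (inferred from indices 0-38 in the gausubvq subvector specification given in this document). gaussian_parameter_count is a hypothetical helper.

```python
N_SENONES = 8000   # senone count stated for these models
FEATURE_DIM = 39   # assumption: indices 0-38 in the subvector specification
MIXTURE_SIZES = [1, 2, 4, 8, 16, 32]


def gaussian_parameter_count(n_gaussians):
    """Parameters for all senones, assuming diagonal covariances."""
    # One mean and one variance per dimension, plus one mixture weight.
    per_gaussian = 2 * FEATURE_DIM + 1
    return N_SENONES * n_gaussians * per_gaussian


sizes = {n: gaussian_parameter_count(n) for n in MIXTURE_SIZES}
for n, count in sizes.items():
    print(f"{n:2d} Gaussians per senone: {count:,} parameters")
```

Under these assumptions the parameter count, and roughly the cost of evaluating every Gaussian, doubles at each step, which is the speed side of the speed/accuracy trade-off mentioned above.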

 

Gausubvq: After the acoustic models were built, gausubvq was run on each set of Gaussian models to produce the sub-vector quantized form of the acoustic models. gausubvq was called with the following command-line arguments:

 Means variances 24,0-11/25,12-23/26,27-38 <num_cluster> 0.0001 1 <filename>

Several runs were performed to iterate over <num_cluster>, again for testing purposes. Quantized versions should exist for 512, 1024, 2048 and 4096 clusters.
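The third argument in the command line above can be unpacked with a short Python sketch. It reads the string as slash-separated subvectors, each a comma-separated list of feature indices or inclusive ranges. That reading is an assumption on my part, and parse_subvector_spec is a hypothetical helper, not part of gausubvq.

```python
def parse_subvector_spec(spec):
    """Expand a spec such as '24,0-11/25,12-23' into lists of indices."""
    subvectors = []
    for part in spec.split("/"):
        dims = []
        for item in part.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                dims.extend(range(int(lo), int(hi) + 1))  # inclusive range
            else:
                dims.append(int(item))
        subvectors.append(dims)
    return subvectors


SPEC = "24,0-11/25,12-23/26,27-38"   # copied from the arguments above
CLUSTER_SIZES = [512, 1024, 2048, 4096]

subvectors = parse_subvector_spec(SPEC)
for number, dims in enumerate(subvectors):
    print(f"subvector {number}: {len(dims)} dimensions -> {dims}")
```

Read this way, the specification splits the feature vector into three subvectors of 13 dimensions each, which together cover indices 0 through 38 exactly once.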



[1] I will refer to HUB96 and HUB97 because this is what Rita Singh (rsingh@cs.cmu.edu) uses.