Continuous
Broadcast News Acoustic Models
HUB96-HUB97 (or HUB97-HUB98, depending on whom you ask) [1]
Note: This document is maintained by Arthur Chan and Evandro Gouvea.
Approximately 200 hours of English training data from the HUB96 and HUB97 training sets were used.
Forced alignment: Acoustic models from a previous HUB97 run were used for this step. Of the 71,666 utterances in the training set, 70,405 (about 98.2%) were force-aligned. Note that there were a small number of OOV (out-of-vocabulary) words. Almost all of them were words that I did not know how to pronounce. Rather than insert potentially incorrect pronunciations, which would have made the acoustic models less accurate, I left the unknown words out.
3-state vs. 5-state: Two sets of models were trained, 3-state and 5-state. All other variables were held constant except skipstate, which was set to ‘no’ for the 3-state models and ‘yes’ for the 5-state models. (Forced alignment was not redone, so the same utterances that were used to train the 3-state models were used for the 5-state models.)
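As a rough illustration of what the skipstate setting changes, the sketch below builds the table of allowed transitions for a left-to-right HMM with and without skip transitions. The function name `topology` and the matrix layout (one row per emitting state, plus a final non-emitting exit column) are assumptions made for this example; they are not taken from the training scripts used here.

```python
def topology(n_states, skip):
    """Return allowed transitions for a left-to-right HMM.

    Each row is an emitting state; the extra last column stands for a
    non-emitting exit state. A 1 marks an allowed transition.
    (Illustrative layout only, assumed for this sketch.)
    """
    allowed = []
    for i in range(n_states):
        row = [0] * (n_states + 1)
        row[i] = 1              # self-loop
        row[i + 1] = 1          # advance to the next state
        if skip and i + 2 <= n_states:
            row[i + 2] = 1      # skip over one state
        allowed.append(row)
    return allowed


three_state = topology(3, skip=False)   # skipstate = 'no'
five_state = topology(5, skip=True)     # skipstate = 'yes'

for name, matrix in (("3-state", three_state), ("5-state", five_state)):
    print(name)
    for row in matrix:
        print(" ", row)
```

With skips enabled, a 5-state model can be traversed in as few as three frames, which is one common reason to pair the longer topology with skip transitions.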
Training Variables:
Gaussians: Context-dependent models were built with 1, 2, 4, 8, 16, and 32 Gaussians. This was done to allow for speed/accuracy testing.
Gausubvq: After the acoustic models were built, gausubvq was run on each set of Gaussian models to produce the sub-vector quantized form of the acoustic models. gausubvq was called with the following command line arguments:
Means variances
24,0-11/25,12-23/26,27-38 <num_cluster> 0.0001 1 <filename>
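The first argument on the line above reads as a sub-vector specification. A minimal sketch of how such a string can be expanded into feature-dimension indices follows, assuming that `/` separates sub-vectors, `,` separates entries, and `a-b` is an inclusive range. The helper name `parse_svspec` is hypothetical and is not part of gausubvq.

```python
def parse_svspec(spec):
    """Expand a sub-vector spec string into lists of dimension indices.

    Assumed syntax: '/' separates sub-vectors, ',' separates entries,
    and 'a-b' denotes the inclusive range a..b.
    """
    subvectors = []
    for part in spec.split("/"):
        dims = []
        for item in part.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                dims.extend(range(int(lo), int(hi) + 1))
            else:
                dims.append(int(item))
        subvectors.append(dims)
    return subvectors


subvectors = parse_svspec("24,0-11/25,12-23/26,27-38")
for index, dims in enumerate(subvectors):
    print(f"sub-vector {index}: {len(dims)} dimensions -> {dims}")
```

Under that reading, the specification splits a 39-dimensional feature vector into three sub-vectors of 13 dimensions each, with every dimension from 0 to 38 used exactly once.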
Several runs were performed to iterate over <num_cluster>, again for testing purposes. Quantized versions should exist for cluster counts of 512, 1024, 2048, and 4096.
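To give a sense of how many quantized models this implies, the sketch below enumerates every combination of Gaussian count and cluster count for a single state topology and formats the argument tail in the order shown above. The output file naming scheme is hypothetical, and the values 0.0001 and 1 are copied from the arguments as given, since their meaning is not stated here.

```python
from itertools import product

GAUSSIANS = [1, 2, 4, 8, 16, 32]      # Gaussian counts listed above
CLUSTERS = [512, 1024, 2048, 4096]    # cluster counts listed above
SVSPEC = "24,0-11/25,12-23/26,27-38"  # sub-vector spec from the arguments


def quantization_runs():
    """List (gaussians, clusters, argument tail) for each run."""
    runs = []
    for gaussians, clusters in product(GAUSSIANS, CLUSTERS):
        # Hypothetical output name, invented for this example only.
        filename = f"{gaussians}gau_{clusters}clusters.subvq"
        # 0.0001 and 1 are copied verbatim from the documented arguments.
        tail = f"{SVSPEC} {clusters} 0.0001 1 {filename}"
        runs.append((gaussians, clusters, tail))
    return runs


runs = quantization_runs()
print(len(runs), "runs per topology")
for gaussians, clusters, tail in runs[:4]:
    print(gaussians, clusters, tail)
```

Six Gaussian counts times four cluster counts gives 24 quantized models per topology; whether both the 3-state and 5-state sets were quantized is not specified here.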