Exercise 7
Linear Discriminant Analysis (LDA) and K-Means

During this session we will change the signal preprocessing by adding Linear Discriminant Analysis, and we will explore another way to initialize the acoustic model, using K-Means (see the picture from session 5).

LDA is a linear transform that we mainly use to reduce the dimensionality of our input features. LDA assumes that we have classes (labels) and that the features of each class follow a Gaussian distribution with the same covariance matrix but a different mean (http://en.wikipedia.org/wiki/Linear_discriminant_analysis). A very good explanation of the LDA used in Janus can be found in chapter 10 of "Introduction to Statistical Pattern Recognition", Second Edition, by Keinosuke Fukunaga. LDA in speech recognition is also described in chapter 9.3.4 of "Spoken Language Processing" by Huang, Acero, and Hon.
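
For reference, the quantities behind the transform can be written down compactly (standard textbook notation, not the variable names used inside Janus). With class means \mu_k, class sample counts N_k and the global mean \mu, LDA builds the within-class, between-class and total scatter matrices

    S_W = \sum_k \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T
    S_B = \sum_k N_k (\mu_k - \mu)(\mu_k - \mu)^T
    S_T = S_W + S_B

and looks for directions w that maximize the class separation relative to the within-class spread,

    J(w) = (w^T S_B w) / (w^T S_W w).

The columns of the LDA matrix are the leading solutions of this criterion; the within and total scatter matrices appear again below as lda.matrixW and lda.matrixT in the simultaneous diagonalisation step.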

The main difference between LDA and Principal Component Analysis (PCA) is that PCA tries to keep as much of the structure (variance) of the features as possible. PCA decorrelates the feature space and orders the dimensions by decreasing variance. If we want to reduce the dimensionality and choose the first N dimensions, we keep most of the structure of the data. LDA, in contrast, focuses on dimensions that separate the classes and orders the dimensions according to class separability (http://en.wikipedia.org/wiki/Principal_components_analysis).

The following picture illustrates the difference between PCA and LDA.

picture here.

In this example the highest variance in the data set is along a direction that does not provide useful information for classification. Therefore, if we chose only this direction, we would get features that are not useful. This is a worst-case situation; in practice PCA often works well too.

Question 7-1: Assume that you have 10 classes. What is the maximal rank of the LDA transform?

If the preconditions for LDA hold, the transform creates independent features that are optimal for classification. However, we need to know the number of classes and we need labeled data. The assumptions do not really hold for speech, but LDA is still a powerful transform.

To use an LDA transform to reduce the dimensionality in Janus, we only need to multiply the feature (space) that was used to train the LDA transform by the LDA matrix and cut the resulting feature down to the target number of dimensions.
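
A minimal sketch of this pattern is given below. It is not a complete featDesc.lda (that is Task 7-1), and the method names adjacent and matmul, the -cut option, the objects command, and the variable $fes (the FeatureSet object inside the description file) are assumptions here; check the feature description files of the earlier sessions for the exact syntax used in your setup.

# sketch only -- verify the method names against your existing featDesc
# stack the +/-7 adjacent MFCC frames into FEAT (13 * 15 = 195 dimensions)
$fes adjacent FEAT MFCC -delta 7
# apply the transform only if the matrix ldaMatrix has been created already
if {[llength [objects FMatrix ldaMatrix]]} {
  # multiply FEAT with the LDA matrix and keep only the first 32 dimensions
  $fes matmul LDA FEAT ldaMatrix -cut 32
}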

Task 7-1: Create a new feature description file featDesc.lda. Apply an LDA transform that maps the feature FEAT to the 32-dimensional feature LDA. Do this only if the matrix ldaMatrix exists. Also change the feature FEAT to be the +/-7 adjacent frames of the MFCC feature before applying the LDA.

In the following step we use the Viterbi labels from homework5 to compute a mapping from feature vector to class label. We will use the context-dependent models from homework6 to define to which class (GMM model) a feature vector belongs. Experience in speech recognition shows that context-dependent classes usually result in a better transform.

#
# This procedure opens (un)compressed files. It can be used instead of open.
#
proc fileOpen {fileName {mode r}} {
  if {[file extension $fileName] == ".gz"} {
    # mode "a" is not allowed for .gz files
    if {$mode == "r"} {
      set fd [open "|gunzip -c $fileName" $mode]
    } else {
      set fd [open "|gzip -c > $fileName" $mode]
    }
  } else {
    set fd [open $fileName $mode]
  }
  return $fd
}
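
#
# Example use of fileOpen (a small sketch, using a file name that appears
# later in this script): a gzipped file is read line by line just like a
# plain text file.
#
#   set fd [fileOpen weights/CD-dss-pruned.desc.gz]
#   while {[gets $fd LINE] >= 0} { puts $LINE }
#   close $fd
#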

#
# This procedure reads a codebook description file and alters entries on the fly
# 
proc readModifyCbsDesc {args} {
  set feat ""
  set refN -1
  set dimN -1
  set type ""

  itfParseArgv readModifyCbsDesc $args [list [
    list "<cbs>"                    object  {} cbs     {CodebookSet}   {object of codebook set} ] [
    list "<cbsDescFile>"            string  {} cbsDescFile         {}   {cbs description file} ] [
    list "-feat"                    string  {} feat   {}   {name of feature to use} ] [
    list "-refN"                    int     {} refN   {}   {number of Gaussians} ] [
    list "-dimN"                    int     {} dimN   {}   {dimension of feature} ] [
    list "-covarType"               string  {} type   {}   {type of covariance NONE,RADIAL,DIAGONAL,FULL} ] ]

  set idx 0
  set fd [fileOpen $cbsDescFile]
  while {[gets $fd LINE] >= 0} {
    # comment line?
    if {[regexp -- {^;} $LINE]} continue
    set name [lindex $LINE 0]
    if {$feat == ""} {set feat [lindex $LINE 1]}
    if {$refN == -1} {set refN [lindex $LINE 2]}
    if {$dimN == -1} {set dimN [lindex $LINE 3]}
    if {$type == ""} {set type [lindex $LINE 4]}
    $cbs add $name $feat $refN $dimN $type
    incr idx
  }
  close $fd
  return $idx
}

#
# We register the procedure with the class.
# All instances of the class CodebookSet can then call it like a built-in method.
#
CodebookSet method readModify readModifyCbsDesc -text "Read a CodebookSet description file and modify the settings (feat,refN,dimN,type)"


#
# For the following steps, we do not need to load the codebook and distribution parameters.
# We should also not load them, because we will alter the description of the codebooks.
# 
[Tags tags] read ./tags.desc
[PhonesSet phonesSet] read ./phones_set.desc
Dictionary dict phonesSet:PHONES tags
dict read ./dictionary.desc

# Load the new feature description file that 
# computes a 32-dimensional feature LDA 
# and a 195-dimensional feature FEAT
[FeatureSet fs] setDesc @./featDesc.lda

# We create the objects for the GMMs and alter the 
# description so that the new GMMs will be trained 
# with the feature LDA, 32 dimensions, and 4 Gaussians per codebook.
CodebookSet cbs fs
# from homework6
cbs readModify weights/CD-cbs-pruned.desc.gz -dimN 32 -feat LDA -refN 4

DistribSet dss cbs
# from homework6
dss read weights/CD-dss-pruned.desc.gz

Tree dssTree phonesSet:PHONES phonesSet tags dss -padPhone [phonesSet:PHONES index @]
# from homework6
dssTree read ./weights/CD-dssTree-pruned.desc.gz

DistribStream dsStream dss dssTree
SenoneSet sns {dsStream}
[TmSet tmSet] read ./tmSet.desc
[TopoSet topoSet sns tmSet] read ./topoSet.desc
[Tree topoTree phonesSet:PHONES phonesSet tags topoSet -padPhone [phonesSet:PHONES index @]] read ./topoTree.desc
AModelSet ams topoTree ROOT
HMM hmm dict ams
Path path

# open database
set uttDB utterance
DBase db
db open ${uttDB}.dat ${uttDB}.idx    -mode "r"

# readFile
source trainLib.tcl
set trainKeyFile "./train_utt.lst"
#set keyList [lrange [readFile $trainKeyFile] 0 19]
set keyList [readFile $trainKeyFile]

The dependencies for the LDA object are shown in the picture below.

picture here



#
# Create a LDA object
#
LDA lda fs FEAT 195

# add one class for each codebook and map the codebook index to the class index in the lda object
foreach cb [cbs] {
  lda add $cb
  lda map [cbs index $cb] -class $cb
}

set labelPath ./labels

foreach key $keyList {
  puts "LDA $key"
  set uttInfo [db get $key]
  makeArray uttArray $uttInfo

  fs eval $uttInfo

  hmm make $uttArray(TEXT) -opt $
  if {![catch {path bload $labelPath/$uttArray(SPKID)/$uttArray(UTTID).lbl -hmm hmm} msg]} {
    # map senone index to codebook index of stream 0,
    # i.e. the senone indices in the path object are replaced by codebook indices
    path map hmm -senoneSet hmm.stateGraph.senoneSet -stream 0 -codebookX 1
    lda accu path	
  } else {
    puts "WARNING: could not load label of key $key. Skip" 
  }
}

# save the collected x_i and x^2 statistics as FMatrix objects
lda saveMeans weights/lda.mean
lda saveScatter weights/lda.scat
# Question: What are the dimensions of the lda.mean and lda.scat matrices?

# compute within scatter matrix and total scatter matrix from the collected statistics
lda update

# compute the lda matrix by simultaneous diagonalisation (see Fukunaga above)
DMatrix ldaA
DMatrix ldaK
ldaA simDiag ldaK lda.matrixT lda.matrixW
FMatrix   ldaMatrix
ldaMatrix DMatrix ldaA
ldaMatrix bsave weights/lda.bmat
ldaK destroy
ldaA destroy

We use the sample extraction to subsample our training data and extract up to 1000 feature vectors for each GMM.
SampleSet smp fs LDA 32

set maxSampleCount 1000 ;# maximal number of samples we want to extract for each codebook

# add one class for each codebook and map the codebook index to the class index in the sample set object
foreach cb [cbs] {
  smp add $cb
  smp map [cbs index $cb] -class $cb
}

# do not make a mess, store the files in a directory
set dataPath ./tmp-data
foreach class [smp:] {
    smp:${class} configure -maxCount $maxSampleCount
    smp:${class} configure -modulus  10

    # the samples of this class will be stored in the following file
    # the file will have the format of an FMatrix object
    regsub -all "/" $class "_" normClass
    smp:${class} configure -fileName [file join $dataPath ${normClass}.smp]
}

# We want to collect 1000 samples drawn uniformly from 
# the training data. Therefore we need a better modulus.
# We can get this information from the previous step,
# by asking the lda object how many samples are available
# in total for each class.
# This is also very valuable diagnostic output.
# A small worked example follows the loop below.
foreach class [smp:] {
    ;# use the total number of samples to find the modulus
    smp:${class} configure -modulus  [expr 1+int([lda:${class}.mean configure -count]/$maxSampleCount)]
}
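
# Worked example with hypothetical numbers: if a class occurred 37412 times
# in the labels, the modulus becomes 1 + int(37412/1000) = 38, i.e.
# (presumably) roughly every 38th frame of that class is kept, which
# yields about 984 <= 1000 samples.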


#
# collect samples 
# The steps are very similar to the statistics collection for the LDA
#
foreach key $keyList {
  puts "Sample $key"
  set uttInfo [db get $key]
  makeArray uttArray $uttInfo

  fs eval $uttInfo

  hmm make $uttArray(TEXT) -opt $
  if {![catch {path bload $labelPath/$uttArray(SPKID)/$uttArray(UTTID).lbl -hmm hmm} msg]} {
    # map senone index to codebook index of stream 0,
    # i.e. the senone indices in the path object are replaced by codebook indices
    path map hmm -senoneSet hmm.stateGraph.senoneSet -stream 0 -codebookX 1

    # during the sample extraction from the path, each sample is extended by the gamma value stored in the path item 
    smp accu path	
  } else {
    puts "WARNING: could not load label of key $key. Skip"
  }
}
# make sure that all samples are on disk
smp flush

# Question: 

The FMatrix object has a method neuralGas that can be used to do a K-means clustering (http://en.wikipedia.org/wiki/K-means_algorithm). However, neuralGas has a parameter called "temperature" that controls how vectors are assigned to class centroids. In K-means a feature vector is assigned to only one centroid, which happens if the temperature is zero. Otherwise the vector is assigned with an exponentially decaying weight that depends on the rank of the centroid (the paper by Martinetz, Berkovich and Schulten can be found on IEEE Xplore).
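
In the notation of that paper (textbook symbols, not necessarily the Janus parameter names), each centroid w_i is moved towards a sample x with a weight that decays exponentially with the distance rank k_i(x) of the centroid (k_i = 0 for the closest one):

    \Delta w_i = \epsilon \, e^{-k_i(x)/\lambda} (x - w_i)

For \lambda -> 0 only the closest centroid receives a non-zero update, which is exactly the K-means behaviour; the "temperature" presumably plays the role of \lambda.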

To find the closest centroids the Euclidean distance is used, which corresponds to a Mahalanobis distance with an identity covariance matrix.
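
For reference, with a centroid \mu and covariance matrix \Sigma the two distances are

    d_E^2(x, \mu) = (x - \mu)^T (x - \mu)                   (Euclidean)
    d_M^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu)       (Mahalanobis)

and they coincide for \Sigma = I.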

Question: What problems could arise from using the Euclidean distance?

We use the extracted samples from the previous step to initialize each GMM we have data for. It is handy that we can access the means of a codebook as an FMatrix object. Usually we use a temperature of zero (K-means) because it is much faster, and the difference in initialisation often does not have a big impact on the final system.


#
# Do K-Means initialisation
#

FMatrix smpMat
FVector smpC   ;# counts of samples assigned to centroids
set maxIter 10
set tempF   0.05 ;# temperature factor. New_temperature = Current_temperature * Temperature_factor
foreach cb [cbs:] {
  puts "k-means $cb"
  set smpFile [file join $dataPath ${cb}.smp]
  if {[file exists $smpFile]} {
    puts "load $smpFile"
    smpMat bload $smpFile
    # remove the gamma of the path from the samples
    # use resize to remove the highest dimension (= gamma)
    smpMat resize [smpMat configure -m] [expr [smpMat configure -n]-1]

    # init = 0   -> no initialisation (keep what is in the matrix)
    # init = 1   -> pseudo random (deterministic) initialisation
    # init = >1  -> initialize with the first N elements (of smpMat)
    cbs:${cb}.mat  neuralGas smpMat -maxIter $maxIter -tempF $tempF -counts smpC

    # compute initial mixture weights from the k-means counts
    set sum 0
    set vec ""
    foreach x [smpC puts] { set sum     [expr $sum + $x]   }
    foreach x [smpC puts] { lappend vec [expr $x   / $sum] }
    dss:${cb} configure -count $sum -val $vec
  }
}

#
# Write the Gaussian and mixture weights.
# From this point another EM-training can be done
#
cbs write weights/CD-i0.cbs.desc.gz
cbs save  weights/CD-i0.cbs.param.gz
dss write weights/CD-i0.dss.desc.gz
dss save  weights/CD-i0.dss.param.gz

# starting with these models we can do forward-backward/Viterbi/label training
# with or without split and merge

# Task 7-2: Do label training for 12 iterations with split and merge, followed by two iterations of Viterbi training.
Last modified: Wed Feb 01 11:56:49 Eastern Standard Time 2006
Maintainer: tschaaf@cs.cmu.edu.