During this session we will prepare several N-Gram language models and compute the test set perplexity for them. We will learn how to create language model objects in Janus, which we will need when we start decoding speech.

# SRILM toolkit
setenv MACHINE_TYPE `uname -m`
setenv SRILM /project/Class-11-753/tools/srilm-1.4.6
setenv PATH ${SRILM}/bin:${SRILM}/bin/${MACHINE_TYPE}:${PATH}
setenv MANPATH ${SRILM}/man:${MANPATH}
To display utf-8 encoded text, the UNIX environment variable LANG should be
"UTF-8" or "en_US.UTF-8" (the default on spoon). If you log in from a Windows machine
and use "putty", you should configure the translation to "utf-8". However, displaying
Mandarin characters also requires that language support for East Asian languages
is installed. This can be done through the control panel; a description can be
found at the following web page: http://www.umiacs.umd.edu/~aelkiss/xml/displaywin.html .
To view a file that is encoded in utf-8 under Unix, "less" does not seem to work correctly
(it adds a line at the beginning), but "more" does.
Question 8-1: Why do we need a language model in speech recognition?
Question 8-2: How is perplexity defined? What is the test set perplexity?
Question 8-3: What word sequence should have a zero probability in continuous speech recognition?
ngram-count -order 2 -lm train.2.arpabo.gz -text data/CH/trl.utf8.set/trl.utf8.train -unk -map-unk "<UNK>" -kndiscount -interpolate
Measure the quality of the created language model.
ngram -order 2 -lm train.2.arpabo.gz -ppl data/CH/trl.utf8.set/trl.utf8.dev
The default order is 3. However, the order should always be set explicitly! See the man page of ngram for why.
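For interpreting the ppl output: with logprob the total log10 probability that ngram assigns to the test set and N the number of scored tokens, the perplexity is ppl = 10^(-logprob / N); the exact token-counting conventions are described in the ngram man page.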

The LingKS class is an abstract class that defines the interface for language models inside Janus. To create a language model it is necessary to specify the language model type.
The encoding we use for the Mandarin corpus is utf-8. Therefore we want to use the same encoding in Tcl/Tk. With the Tcl command encoding system utf-8 the system encoding is set to utf-8, so all string operations should work properly. E.g. string length "马里 南部" should return the value 5 (four characters plus one space), while string bytelength "马里 南部" should return 13 (each Mandarin character occupies three bytes in utf-8).
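These checks can be run directly in the Janus/Tcl shell:
encoding system utf-8
string length "马里 南部"
# 5  (four characters plus one space)
string bytelength "马里 南部"
# 13 (bytes of the utf-8 representation)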
LingKS lm1 -help
# Options of 'lm1' are:
# <name> name of the linguistic knowledge source (string:"lm1")
# <type> Kind of LingKS: NGramLM|PhraseLM|MetaLM|CFG|CFGSet (string:"NULL")
LingKS lm1 NGramLM
lm1 load train.2.arpabo.gz
lm1 score "马里 南部"
# computes the -log10 probability (the language model score) for the given word sequence
lm1 score "马里 南部" -idx 0
# is the same as above, because the first word has index 0
# with -idx it is possible to specify the start position within the word sequence
lm1 score "马里 南部" -idx 1
# in this case only one probability is queried;
# because this word sequence is also a bi-gram in the loaded
# language model, the value is the same as in the ARPA file
lm1.NGramLM configure
# {-order 2} {-history 1} {-segSize 12} {-log0 -99.000000} {-log0Val -5.000000} {-hashLCT 0} {-itemN 7042} {-blkSize 1000}
# 7042 different vocabulary words in the language model
# including <UNK> (the class of unknown words), <s>, and </s> (the sentence start/end words)
# order 2, history 1 = bi-gram
# other entries are for memory management and default values for log(0)
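Since evaluating the NGramLM object without a method returns its word list (this is used by the foreach loops below), the vocabulary size can be cross-checked; the count should match the -itemN value above:
llength [lm1]
# 7042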

LingKS lm2 NGramLM
lm2 load train.1.arpabo.gz
LingKS lm3 MetaLM
lm3.MetaLM -help
# DESCRIPTION
# Meta language model: flexible LM using sub-LMs.
# METHODS
# puts display the contents of a MetaLM
# add add an item (using atomic LMs)
# get get the parameters for one item
# list list the currently available LMs
# LMadd add a language model for usage with metaLM
# LMindex return the internal index of an atomic LM
# LMname return the name of an element (atomic LM)
# cover cover an element (read all words from it)
# scoreFct change the score function
lm3.MetaLM LMadd lm1
lm3.MetaLM LMadd lm2
# now add words to the MetaLM
lm3.MetaLM add -help
# Options of 'add' are:
# <LM word> LM word in this model (string:"NULL")
# -lksA Language Model A (int:0)
# -lksB Language Model B (int:1)
# -nameA corresponding word in LM A (string:"NULL")
# -nameB corresponding word in LM B (string:"NULL")
# -prob probability (float:0.000000)
# add word to MetaLM
lm3.MetaLM add <UNK> -prob 0.5
# interpolate the uni-gram and bi-gram language models with the same weight for all words
foreach w [lm1] {
    lm3.MetaLM add $w -prob 0.5
}
lm3.MetaLM.item(100) configure
# {-name 不但} {-idxA 100} {-idxB 100} {-lmA 0} {-lmB 1} {-prob 0.500000}
# get a lm score from the interpolated lm
lm3 score "马里 南部"
lm1 score "马里 南部" -idx 1
lm2 score "马里 南部" -idx 1
lm3 score "马里 南部" -idx 1
Task 8-1: Verify that the score from lm3 (the -idx 1 example above) is the average probability of lm1 and lm2.
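A minimal Tcl sketch of this check (it assumes, as described above, that score returns the -log10 probability): convert the scores to probabilities, average them with the 0.5/0.5 weights, and convert back:
set s1 [lm1 score "马里 南部" -idx 1]
set s2 [lm2 score "马里 南部" -idx 1]
set s3 [lm3 score "马里 南部" -idx 1]
# average in the probability domain, then back to a -log10 score
set p [expr {0.5 * pow(10, -$s1) + 0.5 * pow(10, -$s2)}]
puts "expected [expr {-log($p) / log(10)}], got $s3"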

LingKS lm4 PhraseLM
lm4.PhraseLM -help
# DESCRIPTION
# This module takes a LM and adds phrases (aka. multi-words) to it.
# METHODS
# base define the base LingKS
# puts display the contents of a PhraseLM
# add add a mapping for a phrase
# readSubs read map-table from 'NGramLM' object
# As base language model any type of LingKS object can be used; often it is a plain NGramLM type.
lm4.PhraseLM base lm3
lm4.PhraseLM add -help
# Options of 'add' are:
# <search word> search vocabulary word (string:"NULL")
# <LM word string> language-model word(s) (string:"NULL")
# -prob probability (float:0.000000)
# -v verbose (int:1)
# The PhraseLM tests if one of the words is in its list and performs a mapping.
# Therefore, if we do not add entries we get exactly the same scores as from
# the underlying language model.
lm4 score "马里 南部" -idx 1
# 0.916000
lm3 score "马里 南部" -idx 1
# 0.916000
# What is a PhraseLM good for?
# Assume that we have a "multi-word" in the decoder dictionary.
# E.g. in English "gonna" could be mapped to "going to".
# Assume that we have pronunciation variants with different observation probabilities,
# like "and" and a sloppy version "and(2)" which occur 80%/20% of the time.
# Then we could map "and(2)" -> "and" with -prob -log(0.2)/log(10)
# and "and" -> "and" with -prob -log(0.8)/log(10).
# However, there is another Janus object, SVMap, that does vocab mapping and also has this capability.
lm4.PhraseLM add "马里_南部" "马里 南部"
# two single words
lm4 score "马里 南部"
# 5.686000
#
# one multi-word
lm4 score "马里_南部"
# 5.686000
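Written out as actual commands, the "and" example from the comments would look as follows (a sketch only; it assumes an English base LM that contains the word "and"):
lm4.PhraseLM add "and(2)" "and" -prob [expr {-log(0.2) / log(10)}]
lm4.PhraseLM add "and"    "and" -prob [expr {-log(0.8) / log(10)}]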
Question 8-4: How can the PhraseLM be used for Mandarin speech recognition?

# Create a dummy dictionary
Phones ps
ps add X
Tags tags
Dictionary dict ps tags
dict add "(" {X} ;# utterance start ~ <s>
dict add ")" {X} ;# utterance end ~ </s>
dict add "$" {X} ;# optional silence
foreach w [lm1] {
    dict add $w {X}
}
dict.item(100)
# 不但 { X}
# create a search vocab object SVocab
# this is the vocabulary used by the decoder and
# defines the words/units that can be recognized
SVocab svocab dict
SVMap svmap svocab lm4
# add all words of the dictionary
foreach w [dict] {
    svocab add $w
}
#
svmap map base ;# map all pronunciation variants of svocab to the base-form
# "and(2)" is a pronunciation variant of the base-form "and".
# All pronunciation variants are indicated by an opening parenthesis,
# which is the separator between the base-form and an identifier.
# Also "and(2" and "and(an)" are possible names for pronunciation variants of the base-form "and".
svmap map id ;# do not map to the base-form
svmap get 马里
# 马里 马里 0.000000
# 马里 is mapped to the same word without an additional score
# The SVMap object is a more efficient mapping compared to the PhraseLM.
# Below we change the "probability" of a mapping.
# However, "-prob" expects a language model score ( -log(prob)/log(10) ).
svmap add "马里" "马里" -prob 0.5
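Since -prob expects a score rather than a probability, a small helper (the name prob2score is hypothetical) makes the conversion explicit; note that the -prob 0.5 above is therefore the score of a probability of 10^-0.5, about 0.316, not 0.5:
proc prob2score {p} { expr {-log($p) / log(10)} }
svmap add "马里" "马里" -prob [prob2score 0.5]
# equals -prob 0.301030, i.e. a mapping probability of 0.5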
Question 8-5: How can SVMap be used for a class-based language model of the type p(z|class(z)) * p(class(z)|class(x) class(y))?
A more detailed description of the language model objects can be found in the JRTk documentation /project/Class-11-753/tools/janus/doc/janus/doc/janus/janus-doku.pdf.

Task 8-3: Build character based language models (1-6-Gram) for Mandarin given the training data and measure the perplexity on training and development set.
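The training text is word-segmented (as in the examples above), so for character-based models it first has to be split into single characters. A minimal Tcl sketch, assuming utf-8 input and a (hypothetical) output file train.char.txt:
set in  [open data/CH/trl.utf8.set/trl.utf8.train r]
set out [open train.char.txt w]
fconfigure $in  -encoding utf-8
fconfigure $out -encoding utf-8
while {[gets $in line] >= 0} {
    # put a space between all characters; ngram-count then treats each
    # character as a word (extra whitespace around existing spaces is harmless)
    puts $out [join [split $line ""] " "]
}
close $in
close $out
# afterwards run ngram-count/ngram as above with -order 1 ... 6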
Task 8-4: Collect more language model data from the web and add them to the training data. Build language models and measure the perplexity.
Last modified: Thu Feb 09 17:18:37 Eastern Standard Time 2006
Maintainer: tschaaf@cs.cmu.edu.