During this session we will prepare several N-Gram language models and compute the test set perplexity for them. We will learn how to create language model objects in Janus, which we will need when we start decoding speech.

# SRILM toolkit
setenv MACHINE_TYPE `uname -m`
setenv SRILM /project/Class-11-753/tools/srilm-1.4.6
setenv PATH ${SRILM}/bin:${SRILM}/bin/${MACHINE_TYPE}:${PATH}
setenv MANPATH ${SRILM}/man:${MANPATH}
To display utf-8 encoded text, the UNIX environment variable LANG should be
"UTF-8" or "en_US.UTF-8" (the default on spoon). If you log in from a Windows machine
and use "putty", you should configure the translation to "utf-8". However, displaying
Mandarin characters also requires that language support for East Asian languages
is installed. This can be done through the control panel; a description can be
found at the following web page: http://www.umiacs.umd.edu/~aelkiss/xml/displaywin.html .
To view a file that is encoded in utf-8 under Unix, "less" does not seem to work correctly
(it adds a line at the beginning), but "more" does.
Question 8-1: Why do we need a language model in speech recognition?
Question 8-2: How is perplexity defined? What is the test set perplexity?
Question 8-3: What word sequence should have a zero probability in continuous speech recognition?
ngram-count -order 2 -lm train.2.arpabo.gz -text data/CH/trl.utf8.set/trl.utf8.train -unk -map-unk "<UNK>" -kndiscount -interpolate
Measure the quality of the created language model.
ngram -order 2 -lm train.2.arpabo.gz -ppl data/CH/trl.utf8.set/trl.utf8.dev
The default order is 3. However, the order should always be set explicitly! See the man page of ngram for why.
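For interpreting the ppl output: with logprob the total log10 probability that ngram assigns to the test set and N the number of scored tokens, the perplexity is ppl = 10^(-logprob / N); the exact token-counting conventions are described in the ngram man page.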

The LingKS class is an abstract class that defines the interface for language models inside Janus. To create a language model it is necessary to specify the language model type.
The encoding we use for the Mandarin corpus is utf-8. Therefore we want to use the same encoding in Tcl/Tk. With the Tcl command encoding system utf-8 the system encoding is set to utf-8, so all string operations should work properly. E.g. string length "马里 南部" should return the value 5 (four characters plus one space), while string bytelength "马里 南部" should return 13 (each Mandarin character occupies three bytes in utf-8).
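These checks can be run directly in the Janus/Tcl shell:
encoding system utf-8
string length "马里 南部"
# 5  (four characters plus one space)
string bytelength "马里 南部"
# 13 (bytes of the utf-8 representation)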
LingKS lm1 -help
# Options of 'lm1' are:
# <name> name of the linguistic knowledge source (string:"lm1")
# <type> Kind of LingKS: NGramLM|PhraseLM|MetaLM|CFG|CFGSet (string:"NULL")
LingKS lm1 NGramLM
lm1 load train.2.arpabo.gz
lm1 score "马里 南部"
# computes the -log10 probability (the language model score) for the given word sequence
lm1 score "马里 南部" -idx 0
# is the same as above, because the first word has index 0
# with -idx it is possible to specify the start position within the word sequence
lm1 score "马里 南部" -idx 1
# in this case only one probability is queried;
# because this word sequence is also a bi-gram in the loaded
# language model, the value is the same as in the ARPA file
lm1.NGramLM configure
# {-order 2} {-history 1} {-segSize 12} {-log0 -99.000000} {-log0Val -5.000000} {-hashLCT 0} {-itemN 7042} {-blkSize 1000}
# 7042 different vocabulary words in the language model
# including <UNK> (the class of unknown words), <s>, and </s> (the sentence start/end words)
# order 2, history 1 = bi-gram
# other entries are for memory management and default values for log(0)
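Since evaluating the NGramLM object without a method returns its word list (this is used by the foreach loops below), the vocabulary size can be cross-checked; the count should match the -itemN value above:
llength [lm1]
# 7042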

LingKS lm2 NGramLM
lm2 load train.1.arpabo.gz
LingKS lm3 MetaLM
lm3.MetaLM -help
# DESCRIPTION
# Meta language model: flexible LM using sub-LMs.
# METHODS
# puts display the contents of a MetaLM
# add add an item (using atomic LMs)
# get get the parameters for one item
# list list the currently available LMs
# LMadd add a language model for usage with metaLM
# LMindex return the internal index of an atomic LM
# LMname return the name of an element (atomic LM)
# cover cover an element (read all words from it)
# scoreFct change the score function
lm3.MetaLM LMadd lm1
lm3.MetaLM LMadd lm2
# now add words to the MetaLM
lm3.MetaLM add -help
# Options of 'add' are:
# <LM word> LM word in this model (string:"NULL")
# -lksA Language Model A (int:0)
# -lksB Language Model B (int:1)
# -nameA corresponding word in LM A (string:"NULL")
# -nameB corresponding word in LM B (string:"NULL")
# -prob probability (float:0.000000)
# add word to MetaLM
lm3.MetaLM add <UNK> -prob 0.5
# interpolate the uni-gram and bi-gram language models with the same weight for all words
foreach w [lm1] {
    lm3.MetaLM add $w -prob 0.5
}
lm3.MetaLM.item(100) configure
# {-name 不但} {-idxA 100} {-idxB 100} {-lmA 0} {-lmB 1} {-prob 0.500000}
# get a lm score from the interpolated lm
lm3 score "马里 南部"
lm1 score "马里 南部" -idx 1
lm2 score "马里 南部" -idx 1
lm3 score "马里 南部" -idx 1
Task 8-1: Verify that the score from lm3 (the -idx 1 example above) is the average probability of lm1 and lm2.
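A minimal Tcl sketch of this check (it assumes, as described above, that score returns the -log10 probability): convert the scores to probabilities, average them with the 0.5/0.5 weights, and convert back:
set s1 [lm1 score "马里 南部" -idx 1]
set s2 [lm2 score "马里 南部" -idx 1]
set s3 [lm3 score "马里 南部" -idx 1]
# average in the probability domain, then back to a -log10 score
set p [expr {0.5 * pow(10, -$s1) + 0.5 * pow(10, -$s2)}]
puts "expected [expr {-log($p) / log(10)}], got $s3"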

LingKS lm4 PhraseLM
lm4.PhraseLM -help
# DESCRIPTION
# This module takes a LM and adds phrases (aka. multi-words) to it.
# METHODS
# base define the base LingKS
# puts display the contents of a PhraseLM
# add add a mapping for a phrase
# readSubs read map-table from 'NGramLM' object
# As base language model any type of LingKS object can be used; often it is a plain NGramLM type.
lm4.PhraseLM base lm3
lm4.PhraseLM add -help
# Options of 'add' are:
# <search word> search vocabulary word (string:"NULL")
# <LM word string> language-model word(s) (string:"NULL")
# -prob probability (float:0.000000)
# -v verbose (int:1)
# The PhraseLM tests if one of the words is in its list and performs a mapping.
# Therefore, if we do not add entries we get exactly the same scores as from
# the underlying language model.
lm4 score "马里 南部" -idx 1
# 0.916000
lm3 score "马里 南部" -idx 1
# 0.916000
# What is a PhraseLM good for?
# Assume that we have a "multi-word" in the decoder dictionary.
# E.g. in English "gonna" could be mapped to "going to".
# Assume that we have pronunciation variants with different observation probabilities,
# like "and" and a sloppy version "and(2)" which occur 80%/20% of the time.
# Then we could map "and(2)" -> "and" with -prob -log(0.2)/log(10)
# and "and" -> "and" with -prob -log(0.8)/log(10).
# However, there is another Janus object, SVMap, that does vocab mapping and also has this capability.
lm4.PhraseLM add "马里_南部" "马里 南部"
# two single words
lm4 score "马里 南部"
# 5.686000
#
# one multi-word
lm4 score "马里_南部"
# 5.686000
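Written out as actual commands, the "and" example from the comments would look as follows (a sketch only; it assumes an English base LM that contains the word "and"):
lm4.PhraseLM add "and(2)" "and" -prob [expr {-log(0.2) / log(10)}]
lm4.PhraseLM add "and"    "and" -prob [expr {-log(0.8) / log(10)}]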
Question 8-4: How can the PhraseLM be used for Mandarin speech recognition?

# Create a dummy dictionary
Phones ps
ps add X
Tags tags
Dictionary dict ps tags
dict add "(" {X} ;# utterance start ~ <s>
dict add ")" {X} ;# utterance end ~ </s>
dict add "$" {X} ;# optional silence
foreach w [lm1] {
    dict add $w {X}
}
dict.item(100)
# 不但 { X}
# create a search vocab object SVocab
# this is the vocabulary used by the decoder and
# defines the words/units that can be recognized
SVocab svocab dict
SVMap svmap svocab lm4
# add all words of the dictionary
foreach w [dict] {
    svocab add $w
}
#
svmap map base ;# map all pronunciation variants of svocab to the base-form
# "and(2)" is a pronunciation variant of the base-form "and".
# All pronunciation variants are indicated by an opening parenthesis,
# which is the separator between the base-form and an identifier.
# Also "and(2" and "and(an)" are possible names for pronunciation variants of the base-form "and".
svmap map id ;# do not map to the base-form
svmap get 马里
# 马里 马里 0.000000
# 马里 is mapped to the same word without an additional score
# The SVMap object is a more efficient mapping compared to the PhraseLM.
# Below we change the "probability" of a mapping.
# However, "-prob" expects a language model score ( -log(prob)/log(10) ).
svmap add "马里" "马里" -prob 0.5
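Since -prob expects a score rather than a probability, a small helper (the name prob2score is hypothetical) makes the conversion explicit; note that the -prob 0.5 above is therefore the score of a probability of 10^-0.5, about 0.316, not 0.5:
proc prob2score {p} { expr {-log($p) / log(10)} }
svmap add "马里" "马里" -prob [prob2score 0.5]
# equals -prob 0.301030, i.e. a mapping probability of 0.5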
Question 8-5: How can SVMap be used for a class-based language model of the type p(z|class(z)) * p(class(z)|class(x) class(y))?
A more detailed description of the language model objects can be found in the JRTk documentation /project/Class-11-753/tools/janus/doc/janus/doc/janus/janus-doku.pdf.

Task 8-3: Build character based language models (1-6-Gram) for Mandarin given the training data and measure the perplexity on training and development set.
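The training text is word-segmented (as in the examples above), so for character-based models it first has to be split into single characters. A minimal Tcl sketch, assuming utf-8 input and a (hypothetical) output file train.char.txt:
set in  [open data/CH/trl.utf8.set/trl.utf8.train r]
set out [open train.char.txt w]
fconfigure $in  -encoding utf-8
fconfigure $out -encoding utf-8
while {[gets $in line] >= 0} {
    # put a space between all characters; ngram-count then treats each
    # character as a word (extra whitespace around existing spaces is harmless)
    puts $out [join [split $line ""] " "]
}
close $in
close $out
# afterwards run ngram-count/ngram as above with -order 1 ... 6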
Task 8-4: Collect more language model data from the web and add them to the training data. Build language models and measure the perplexity.
Last modified: Thu Feb 09 17:18:37 Eastern Standard Time 2006
Maintainer: tschaaf@cs.cmu.edu.