The CMU-Cambridge Statistical Language Modeling Toolkit v2
Preface
This document is not complete. It represents the current state of the
documentation for the toolkit, and should give a fairly good impression of
what the toolkit will contain when it is publicly available.
The toolkit is currently undergoing beta-testing, and the final version
should be available in February.
Any comments or questions regarding the contents of this page would be
welcomed, so please feel free to mail me.
Contents
* Introduction
* Changes from Version 1
* Downloading the Toolkit
* Installing the Toolkit
* Terminology and File Formats
* The Tools
o text2wfreq
o wfreq2vocab
o text2wngram
o text2idngram
o ngram2mgram
o wngram2idngram
o idngram2stats
o mergeidngram
o idngram2lm
o binlm2arpa
o evallm
* Typical Usage
* Discounting Strategies
* Feedback
If you want to get started making language models as quickly as possible,
you should download and install the toolkit (assuming that you have not
already done so), and then read the Typical Usage section.
----------------------------------------------------------------------------
Introduction
Version 1 of the Carnegie Mellon University Statistical Language Modeling
toolkit was written by Roni Rosenfeld, and released in 1994. It is available
by ftp. Here is an excerpt from its read-me file:
Overview of the CMU SLM Toolkit, Rev 1.0
========================================
The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit
is a set of unix software tools designed to facilitate language
modeling work in the research community.
Some of the tools are used to process general textual data into:
- word frequency lists and vocabularies
- word bigram and trigram counts
- vocabulary-specific word bigram and trigram counts
- bigram- and trigram-related statistics
- various Backoff bigram and trigram language models
Others use the resulted language models to compute:
- perplexity
- Out-Of-Vocabulary (OOV) rate
- bigram- and trigram-hit ratios
- distribution of Backoff cases
- annotation of test data with language scores
Version 2 of the toolkit seeks to maintain the structure of version 1, to
include all (or very nearly all) of its functionality, and to provide useful
improvements in both functionality and efficiency. The key differences
between the two versions are described in the next section.
----------------------------------------------------------------------------
Changes from Version 1
Efficient pre-processing tools
The tools which pre-process the training text (converting a text stream into
the id n-gram file which can be read by idngram2lm, and generating a
vocabulary) have been completely re-written in order to increase their
efficiency.
All of the tools have been written in C, so there is no longer any reliance
on shell scripts and UNIX utilities such as sort and awk. The tools now run
much faster, since they require much less disk I/O, although they do require
more RAM than the tools of version 1.
Multiple discounting strategies
Version 1 of the toolkit allowed only Good-Turing discounting to be used in
the construction of the models. Version 2 allows any of the following
discounting strategies:
* Good Turing discounting
* Witten Bell discounting
* Absolute discounting
* Linear discounting
Use of n-grams with arbitrary n
The tools in the toolkit are no longer limited to the construction and
testing of bigram and trigram language models. As larger corpora and faster
machines with more memory become available, it is becoming more interesting
to examine 4-grams, 5-grams, etc. The tools in version 2 of this toolkit
enable such models to be constructed and evaluated.
Interactive language model evaluation
The program evallm is used to test the language models produced by the
toolkit. Commands to this program are read in from the standard input after
the language model has been read, so the user can issue commands
interactively, rather than simply from the shell command line. This means
that if the user wants to calculate the perplexity of a particular language
model with respect to several different texts, the language model only needs
to be read once.
Evaluation of ARPA format language models
Version 2 of the toolkit includes the ability to calculate perplexities of
ARPA format language models.
Handling of context cues
In version 1, the tags <s> and <p> (the begin sentence and begin paragraph
tags) were automatically treated as context cues, and each was required to
be in the vocabulary. In version 2, one may have any number of context cues
(or none at all), and they may be represented by any symbols one chooses.
The context cues are a subset of the vocabulary, and are specified in a
context cues file.
Compact data storage
The data structures used to store the n-grams are more compact than those of
version 1, with the result that language model construction is a less memory
intensive task.
Support for gzip compression
As well as the compress data compression utility used in version 1 of the
toolkit, there is now also support for gzip.
Confidence interval capping
Confidence interval capping has been omitted from version 2 of the toolkit.
Forced back-off
There are situations (for example, when there is an unknown word in the
context) in which it might be preferable not to use the n-gram probability
produced by the model, but to use less of the context, so that the
prediction is not based on the unknown word.
For example, if we have a trigram model, and are seeking to predict
P(A | C <s>), where <s> represents a sentence boundary, then we might
postulate that the information before the sentence boundary is not
particularly useful, and that a better estimate would be P(A | <s>), or even
P(A). The former case shall be referred to as inclusive forced back-off, and
the latter as exclusive forced back-off.
The evallm program allows the user to specify either inclusive or exclusive
forced back-off, as well as a list of words from which to enforce back-off.
----------------------------------------------------------------------------
Downloading the Toolkit
When the toolkit is finished, instructions for downloading it will appear
here.
----------------------------------------------------------------------------
Installing the Toolkit
Note that this information is new, and applies to versions BETA.2 (31st Jan
1997) onwards.
In order to install the toolkit, the downloaded file should be uncompressed
and expanded from the tar file, using the commands
   gunzip CMU-Camb_Toolkit_v2.tar.gz
   tar -xvf CMU-Camb_Toolkit_v2.tar
This should create a directory CMU-Camb_Toolkit_v2, containing the
subdirectories src, bin, lib and doc. In order to create the executables,
one should change into the src directory and type make install. (It may be
necessary to edit the Makefile first, in order to set the LINKFLAGS and
CFLAGS parameters correctly.) The executables will then be created in the
bin directory.
Before building the executables, it might be worth adjusting the value of
STD_MEM in the file CMU-Camb_Toolkit_v2/src/toolkit.h. This value controls
how much memory (in MB) the programs will attempt to assign for the large
buffers used by some of the programs; the final process sizes will be a few
MB bigger than this value. The more memory that can be grabbed, the faster
the programs will run. At present the value is set to 100, but if the
machines on which the tools will be run contain much more memory than this,
then it should be adjusted to reflect this.
----------------------------------------------------------------------------
Terminology and File Formats
Text stream (.text)
   An ASCII file containing text. It may or may not have markers to
   indicate context cues, and white space can be used freely.
Word frequency file (.wfreq)
   An ASCII file containing a list of words, and the number of times each
   occurred. The list is not sorted; it will generally be used as the input
   to wfreq2vocab, which does not require sorted input.
Word n-gram file (.w3gram, .w4gram, etc.)
   An ASCII file containing an alphabetically sorted list of n-tuples of
   words, along with the number of occurrences of each.
Vocabulary file (.vocab.20K, .vocab.60K, etc., depending on the size of the
vocabulary)
   An ASCII file containing a list of vocabulary words. Comments may also
   be included; any line beginning ## is considered a comment. The
   vocabulary is limited in size to 65,535 words.
Context cues file (.ccs)
   An ASCII file containing the list of words which are to be considered
   "context cues". These are words which provide useful context information
   for the n-grams, but which are not to be predicted by the language
   model. Typical examples would be <s> and <p>, the begin sentence and
   begin paragraph tags.
Id n-gram file (.id3gram.bin, .id4gram.ascii, etc.)
   An ASCII or binary (by default) file containing a numerically sorted
   list of n-tuples of numbers, corresponding to the mapping of the word
   n-grams relative to the vocabulary. Out-of-vocabulary (OOV) words are
   mapped to the number 0.
Binary language model file (.binlm)
   A binary file containing all the n-gram counts, together with
   discounting information and back-off weights. It can be read by evallm
   and used to generate word probabilities quickly.
ARPA language model file (.arpa)
   An ASCII file containing the language model probabilities in
   ARPA-standard format.
Forced back-off file (.fblist)
   An ASCII file containing a list of vocabulary words from which to
   enforce back-off, together with either an 'i' or an 'e' to indicate
   inclusive or exclusive forced back-off respectively.
All of these files may be written and read by the tools in either compressed
or uncompressed form. Specifically, if a filename is given a .Z extension,
then it will be read from the specified file via a zcat pipe, or written via
a compress pipe. If a filename is given a .gz extension, it will be read
from the specified file via a gunzip pipe, or written via a gzip pipe. If
either of these compression schemes is to be used, then the relevant tools
(i.e. zcat, and compress or gzip) must be available on the system, and
pointed to by the user's path.
If a filename argument is given as - then it is taken to represent either
the standard input or the standard output (according to context). Any file
read from the standard input is assumed to be uncompressed, so all desired
compression and decompression should take place in a pipe:
   zcat < abc.Z | abc2xyz | compress > xyz.Z
----------------------------------------------------------------------------
The Tools
text2wfreq
Input : Text stream
Output : List of every word which occurred in the text, along with its
number of occurrences.
Notes : Uses a hash-table to provide an efficient method of counting word
occurrences. The output list is not sorted (due to the "randomness" of the
hash-table), but can easily be sorted into the user's desired order by the
UNIX sort command. In any case, the output does not need to be sorted in
order to serve as input for wfreq2vocab.
Command Line Syntax:
text2wfreq [ -hash 1000000 ]
[ -verbosity 2 ]
< .text > .wfreq
Higher values for the -hash parameter require more memory, but can reduce
computation time.
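For example, one might count the word frequencies of a large corpus with an
enlarged hash table (here, and in the examples below, the filenames are
illustrative):
   text2wfreq -hash 2000000 < a.text > a.wfreq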
wfreq2vocab
Input : A word unigram file, as produced by text2wfreq
Output : A vocabulary file.
Command Line Syntax:
wfreq2vocab [ -top 20000 | -gt 10]
[ -records 1000000 ]
[ -verbosity 2]
< .wfreq > .vocab
The -top parameter allows the user to specify the size of the vocabulary; if
the program is called with the option -top 20000, then the vocabulary will
consist of the 20,000 most common words.
The -gt parameter allows the user to specify the number of times that a word
must occur to be included in the vocabulary; if the program is called with
the option -gt 10, then the vocabulary will consist of all the words which
occurred more than 10 times.
If neither the -gt, nor the -top parameters are specified, then the program
runs with the default setting of taking the top 20,000 words.
The -records parameter allows the user to specify how many of the word and
count records to allocate memory for. If the number of words in the input
exceeds this number, then the program will fail, but a high number will
obviously result in a higher memory requirement.
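For example, to build a vocabulary of every word which occurred more than
five times (filenames illustrative):
   wfreq2vocab -gt 5 < a.wfreq > a.vocab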
text2wngram
Input : Text stream
Output : List of every word n-gram which occurred in the text, along with
its number of occurrences.
Notes : Uses a "suffix-array" in order to efficiently count the n-grams.
Command Line Syntax:
text2wngram [ -n 3 ]
[ -temp ./ ]
[ -chars n ]
[ -words m ]
[ -verbosity 2 ]
< .text > .wngram
The default numbers of characters and words are chosen so that the memory
requirement of the program is approximately that of STD_MEM, with the number
of characters seven times greater than the number of words.
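For example, to list the word 4-grams of a corpus (filenames illustrative):
   text2wngram -n 4 < a.text > a.w4gram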
text2idngram
Input : Text stream, plus a vocabulary file.
Output : List of every id n-gram which occurred in the text, along with its
number of occurrences.
Notes : Maps each word in the text stream to a short integer as soon as it
has been read, thus enabling more n-grams to be stored and sorted in memory.
Command Line Syntax:
text2idngram -vocab .vocab
[ -buffer 100 ]
[ -temp ./ ]
[ -files 25 ]
[ -n 3 ]
[ -write_ascii ]
[ -verbosity 2 ]
< .text > .idngram
By default, the id n-gram file is written out as a binary file, unless the
-write_ascii switch is used.
The size of the buffer which is used to store the n-grams can be specified
using the -buffer parameter. This value is in megabytes, and the default
value can be changed from 100 by changing the value of STD_MEM in the file
CMU-Camb_Toolkit_v2/src/toolkit.h before compiling the toolkit.
For very large quantities of data, more temporary files may be generated
than the filing system can have open at one time. In this case, the
temporary files will be merged in chunks, and the -files parameter can be
used to specify how many files are allowed to be open at one time.
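For example, to produce an ASCII id 4-gram file using a 200 megabyte buffer
(filenames illustrative):
   text2idngram -vocab a.vocab -n 4 -buffer 200 -write_ascii < a.text > a.id4gram.ascii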
ngram2mgram
Input : Either a word n-gram file, or an id n-gram file.
Output : Either a word m-gram file, or an id m-gram file, where m < n.
Command Line Syntax:
ngram2mgram -n N -m M
[ -binary | -ascii | -words ]
< .ngram > .mgram
The -binary, -ascii and -words switches specify the format of the input and
output (note that the output file will be in the same format as the input
file). -ascii and -binary denote id n-gram files, in ASCII and binary
formats respectively, and -words denotes a word n-gram file.
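For example, to derive bigram counts from a binary id 3-gram file (filenames
illustrative):
   ngram2mgram -n 3 -m 2 -binary < a.id3gram.bin > a.id2gram.bin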
wngram2idngram
Input : Word n-gram file, plus a vocabulary file.
Output : List of every id n-gram which occurred in the text, along with its
number of occurrences, in either ASCII or binary format
Note : For this program to be successful, it is important that the
vocabulary file is in alphabetical order. If you are using vocabularies
generated by the wfreq2vocab tool then this should not be an issue, as they
will already be alphabetically sorted.
Command Line Syntax:
wngram2idngram -vocab .vocab
[ -buffer 100 ]
[ -hash 200000 ]
[ -temp ./ ]
[ -files 30 ]
[ -verbosity 2 ]
[ -n 3 ]
[ -write_ascii ]
                 < .wngram > .idngram
The size of the buffer which is used to store the n-grams can be specified
using the -buffer parameter. This value is in megabytes, and the default
value can be changed from 100 by changing the value of STD_MEM in the file
CMU-Camb_Toolkit_v2/src/toolkit.h before compiling the toolkit.
Higher values for the -hash parameter require more memory, but can reduce
computation time.
The -files parameter is used to specify the number of files which can be
open at one time.
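For example, to convert an alphabetically sorted word 3-gram file into a
binary id 3-gram file (filenames illustrative):
   wngram2idngram -vocab a.vocab -n 3 < a.w3gram > a.id3gram.bin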
idngram2stats
Input : An id n-gram file (in either binary (by default) or ASCII (if
specified) mode).
Output : A list of the frequency-of-frequencies for each of the 2-grams, ...
, n-grams, which can enable the user to choose appropriate cut-offs, and to
specify appropriate memory requirements with the -spec_num option in
idngram2lm.
Command Line Syntax:
idngram2stats [ -n 3 ]
[ -fof_size 50 ]
[ -verbosity 2 ]
[ -ascii_input ]
                 < .idngram > .stats
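For example, to examine the frequency-of-frequencies of a binary id 3-gram
file (filenames illustrative):
   idngram2stats -n 3 < a.id3gram.bin > a.stats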
mergeidngram
Input : A set of id n-gram files (in either binary (by default) or ASCII (if
specified) format - note that they should all be in the same format,
however).
Output : One id n-gram file (in either binary (by default) or ASCII (if
specified) format), containing the merged id n-grams from the input files.
Notes : This utility can also be used to convert id n-gram files between
ascii and binary formats.
Command Line Syntax:
mergeidngram [ -n 3 ]
[ -ascii_input ]
[ -ascii_output ]
.idngram_1 .idngram_2 ... .idngram_N > .idngram
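For example, to merge two binary id 3-gram files, or to convert one of them
to ASCII format (filenames illustrative):
   mergeidngram a1.idngram a2.idngram > merged.idngram
   mergeidngram -ascii_output a1.idngram > a1.idngram.ascii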
idngram2lm
Input : An id n-gram file (in either binary (by default) or ASCII (if
specified) format), a vocabulary file, and (optionally) a context cues file.
Additional command line parameters will specify the cutoffs, the discounting
strategy and parameters, etc.
Output : A language model, in either binary format (to be read by evallm),
or in ARPA format.
Command Line Syntax:
idngram2lm -idngram .idngram[ .bin | .ascii ]
-vocab .vocab
-arpa .arpa | -binary .binlm
[ -context .ccs]
[ -calc_mem | -guess | -spec_num y ... z]
[ -vocab_type 1 ]
[ -oov_fraction 0.5 ]
[ -two_byte_alphas [ [ -min_alpha -2 ] [ -max_alpha 4 ] ] ]
[ -out_of_range_alphas ]
[ -linear | -absolute | -good_turing | -witten_bell ]
[ -cutoffs 0 ... 0 ]
[ -ascii_input | -bin_input]
[ -n 3 ]
[ -verbosity 2 ]
The -context parameter allows the user to specify a file containing a list
of words within the vocabulary which will serve as context cues (for
example, markers which indicate the beginnings of sentences and paragraphs).
The -calc_mem, -guess and -spec_num y ... z options dictate how the amount
of memory to allocate for the n-gram count data structures is decided.
-calc_mem requires that the id n-gram file be read twice, so that the
required amount of memory can be calculated exactly. -guess results in a
(hopefully generous) estimate based on the size of the id n-gram file. And
-spec_num allows the user to specify exactly how many 2-grams, 3-grams,
... , and n-grams will need to be stored.
The -vocab_type flag allows the user to specify one of three vocabulary
options: closed vocabulary (-vocab_type 0), open vocabulary type 1
(-vocab_type 1), or open vocabulary type 2 (-vocab_type 2). In the case of
open vocabulary type 2, an extra parameter, -oov_fraction, should be
specified (if it isn't, then it will take the default value of 0.5).
The floating point values of the back-off weights (the alphas) may be stored
as two-byte integers by using the -two_byte_alphas switch. This will
introduce slight rounding errors, and so should only be used if memory is
short. The -min_alpha, -max_alpha and -out_of_range_alphas parameters are
used by the functions which handle two-byte alphas; their values should only
be altered if the program instructs the user to do so. For further details,
see the comments in the source file two_byte_alphas.c.
The discounting strategy and its parameters are specified by the -linear,
-absolute, -good_turing and -witten_bell options.
The user can specify the cutoffs for the 2-grams, 3-grams, ..., n-grams by
using the -cutoffs parameter. If the parameter is omitted, then all the
cutoffs are set to zero.
Default value of n (the n in n-gram) = 3.
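For example, one might build a Witten Bell discounted trigram model in
binary format, with bigram and trigram cutoffs of 1 (filenames
illustrative):
   idngram2lm -idngram a.id3gram.bin -vocab a.vocab -binary a.binlm -witten_bell -cutoffs 1 1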
binlm2arpa
Input : A binary format language model, as generated by idngram2lm.
Output : An ARPA format language model.
Command Line Syntax:
binlm2arpa -binary .binlm
[ -arpa .arpa ]
[ -verbosity 2 ]
If no ARPA language model filename is specified, then the language model
will be written to standard output.
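For example (filenames illustrative):
   binlm2arpa -binary a.binlm -arpa a.arpa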
evallm
Input : A binary or ARPA format language model, as generated by idngram2lm.
In addition, one may also specify a text stream to be used to compute the
perplexity of the language model. An ARPA format language model does not
contain information as to which words are context cues, so if an ARPA format
language model is used, then a context cues file may be specified as well.
Output : The program can run in one of two modes.
* compute-PP - Output is the perplexity of the language model with
respect to the input text stream.
* validate - Output is confirmation or denial that the probabilities of
all the words in the vocabulary, given the context supplied by the user,
sum to one.
Notes: evallm can receive and process commands interactively. When it is
run, it loads the language model specified at the command line, and waits
for instructions from the user. The user may specify one of the following
commands:
* perplexity
Computes the perplexity of a given text. May optionally specify words
from which to force back-off.
Syntax:
perplexity -text .text
[ -probs .fprobs ]
[ -oovs .oov_file ]
[ -annotate .annotation_file ]
[ -backoff_from_unk_inc | -backoff_from_unk_exc ]
[ -backoff_from_ccs_inc | -backoff_from_ccs_exc ]
[ -backoff_from_list .fblist ]
If the -probs parameter is specified, then each individual word
probability will be written out to the specified file.
If the -oovs parameter is specified, then any out-of-vocabulary (OOV)
words which are encountered in the test set will be written out to the
specified file.
If the -annotate parameter is used, then an annotation file will be
created, containing information on the probability of each word in the
test set according to the language model, as well as the back-off class
for each event.
To force back-off from all unknown words, use the -backoff_from_unk_inc or
-backoff_from_unk_exc flag; to force back-off from all context cues, use
the -backoff_from_ccs_inc or -backoff_from_ccs_exc flag. In each case the
_inc and _exc suffixes select inclusive and exclusive forced back-off
respectively. One can also specify a list of words from which to back off,
by storing this list in a forced back-off list file and using the
-backoff_from_list switch; the file itself marks each word with an 'i' or
an 'e' for inclusive or exclusive forced back-off.
* validate
Calculate the sum of the probabilities of all the words in the
vocabulary given the context specified by the user.
Syntax:
validate [ -backoff_from_unk -backoff_from_ccs |
-backoff_from_list .fblist ]
[ -forced_backoff_inc | -forced_back_off_exc ]
word1 word2 ... word_(n-1)
Where n is the n in n-gram.
* help
Displays a help message.
Syntax:
help
* quit
Exits the program.
Syntax:
quit
Since the commands are read from standard input, a command file can be piped
into it directly, thus removing the need for the program to run
interactively.
Command Line Syntax:
evallm [ -binary .binlm |
-arpa .arpa [ -context .ccs ] ]
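For example, one might load an ARPA format model together with its context
cues file, and compute an annotated perplexity (filenames illustrative):
   evallm -arpa a.arpa -context a.ccs
   evallm : perplexity -text b.text -annotate b.annotation
   evallm : quit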
----------------------------------------------------------------------------
Typical Usage
Given a large corpus of text in a file a.text, but no specified vocabulary:
* Compute the word unigram counts
cat a.text | text2wfreq > a.wfreq
* Convert the word unigram counts into a vocabulary consisting of the
20,000 most common words
cat a.wfreq | wfreq2vocab -top 20000 > a.vocab
* Generate a binary id 3-gram file of the training text, based on this
vocabulary
cat a.text | text2idngram -vocab a.vocab > a.idngram.bin
* Convert the idngram into a binary format language model
idngram2lm -idngram a.idngram.bin -vocab a.vocab -binary a.binlm
* Compute the perplexity of the language model, with respect to some test
text b.text
evallm -binary a.binlm
evallm : perplexity -text b.text
evallm : quit
Alternatively, some of these processes can be piped together:
cat a.text | text2wfreq | wfreq2vocab -top 20000 > a.vocab
cat a.text | text2idngram -vocab a.vocab | idngram2lm -idngram - -binary a.binlm
echo "perplexity -text b.text \n quit \n" | evallm -binary a.binlm
----------------------------------------------------------------------------
Discounting Strategies
Discounting is the process of replacing the original counts with modified
counts so as to redistribute the probability mass from the more commonly
observed events to the less frequent and unseen events. If the actual number
of occurrences of an event E (such as a bigram or trigram occurrence) is
c(E), then the modified count is d(c(E))c(E), where d(c(E)) is known as the
discount ratio.
Good Turing discounting
Good Turing discounting defines d(r) = ((r+1) n(r+1)) / (r n(r)), where n(r)
is the number of events which occur exactly r times.
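For example, with purely illustrative figures: if n(1) = 1000 trigrams occur
exactly once in the training text, and n(2) = 400 occur exactly twice, then
d(1) = (2 x 400) / (1 x 1000) = 0.8, so each count of 1 is replaced by a
modified count of 0.8.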
The discounting is only applied to counts which occur fewer than K times,
where typically K is chosen to be around 7. This is the "discounting range",
which is specified using the -disc_range parameter of the idngram2lm
program.
For further details see "Estimation of Probabilities from Sparse Data for
the Language Model Component of a Speech Recognizer", Slava M. Katz, in
"IEEE Transactions on Acoustics, Speech and Signal Processing", volume
ASSP-35, pages 400-401, March 1987.
Witten Bell discounting
The discounting scheme which we refer to here as "Witten Bell discounting"
is that which is referred to as type C in "The Zero-Frequency Problem:
Estimating the Probabilities of Novel Events in Adaptive Text Compression",
Ian H. Witten and Timothy C. Bell, in "IEEE Transactions on Information
Theory, Vol 37, No. 4, July 1991".
The discounting ratio is not solely dependent on the event's count, but also
on t, the number of word types which followed the particular context. It
defines d(r,t) = n/(n + t), where n is the total number of words which were
observed following that context. This is equivalent to setting
P(w | h) = c / (n + t) for events that have been seen (where w is a word, h
is the history and c is the number of occurrences of w in the context h),
and reserving a total probability mass of t / (n + t) for unseen events.
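For example, with purely illustrative figures: if a particular context was
observed n = 1400 times, followed by t = 600 distinct word types, then each
seen count is scaled by d = 1400 / (1400 + 600) = 0.7, and a total
probability mass of 600 / 2000 = 0.3 is reserved for words never seen in
that context.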
Absolute discounting
Absolute discounting defines d(r) = (r-b)/r. Typically b=n(1)/(n(1)+2n(2)).
The discounting is applied to all counts.
This is, of course, equivalent to simply subtracting the constant b from
each count.
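For example, with purely illustrative figures: if n(1) = 1000 and
n(2) = 400, then b = 1000 / (1000 + 800) = 0.56 (to two decimal places), so
a count of 1 is replaced by 0.44, and a count of 10 by 9.44.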
Linear discounting
Linear discounting defines d(r) = 1 - (n(1)/C), where C is the total number
of events. The discounting is applied to all counts.
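For example, with purely illustrative figures: if the training text contains
C = 100,000 bigram events, of which n(1) = 20,000 occur exactly once, then
d = 1 - (20000 / 100000) = 0.8, and every count is scaled by 0.8.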
For further details of both linear and absolute discounting, see "On
structuring probabilistic dependencies in stochastic language modeling", H.
Ney, U. Essen and R. Kneser in "Computer Speech and Language", volume 8(1),
pages 1-28, 1994.
----------------------------------------------------------------------------
Feedback
Any comments, questions or bug reports concerning the toolkit should be
addressed to Philip Clarkson, either by e-mail to prc14@eng.cam.ac.uk or by
other means of communication.
----------------------------------------------------------------------------
Philip Clarkson - prc14@eng.cam.ac.uk