anti-ngram
 NAME 
anti-ngram - count posterior-weighted N-grams in N-best lists
 SYNOPSIS 
anti-ngram [ -help ] option ...
 DESCRIPTION 
 anti-ngram 
counts the N-grams in a set of N-best hypothesis lists.
The N-gram counts are weighted by the posterior probabilities of the
hypotheses they occur in.
Thus, 
 anti-ngram 
can be used to construct language models of word sequences
that are acoustically confusable with correct hypotheses.
The output counts should be processed with
 ngram-count -float-counts 
to estimate a language model.
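
The counting step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual implementation: the function names, the in-memory representation of hypotheses as (posterior, words) pairs, and the choice to count N-grams of every length up to the maximal order are assumptions made for the example. Sentence-boundary tokens are not added here.

```python
from collections import defaultdict

def ngrams(words, max_order):
    """Yield every N-gram (as a tuple) of length 1..max_order in words."""
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def weighted_ngram_counts(hyps, reference, max_order=3, all_ngrams=False):
    """Accumulate posterior-weighted N-gram counts over one N-best list.

    hyps: list of (posterior, words) pairs (hypothetical representation).
    reference: list of reference transcript words.
    """
    ref_ngrams = set(ngrams(reference, max_order))
    counts = defaultdict(float)
    for posterior, words in hyps:
        for gram in ngrams(words, max_order):
            # Skip N-grams that also occur in the reference,
            # unless the caller asked for all N-grams.
            if not all_ngrams and gram in ref_ngrams:
                continue
            counts[gram] += posterior
    return dict(counts)

hyps = [(0.6, ["the", "cat", "sat"]), (0.4, ["the", "cat", "sad"])]
counts = weighted_ngram_counts(hyps, ["the", "cat", "sat"], max_order=2)
for gram, count in sorted(counts.items()):
    print(" ".join(gram), count)
```

In this toy example only the N-grams involving the misrecognized word survive, each carrying the posterior of the hypothesis it came from; passing all_ngrams=True would mimic the effect of counting reference N-grams as well.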
 OPTIONS 
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
-  -help 
- 
Print option summary.
-  -version 
- 
Print version information.
- -refs file
- 
Read the reference transcripts from 
file.
Each line should contain an utterance ID followed by the transcript words.
- -nbest-files file
- 
Read the list of N-best files from
file.
The base components of filenames must correspond to the utterance IDs found
in the reference file.
- -max-nbest n
- 
Limits the number of hypotheses read from each N-best list to the first
n.
- -order n
- 
Set the maximal order (length) of N-grams to count.
The default order is 3.
- -lm file
- 
Reads an ARPA language model from 
 file 
and rescores the N-best lists with it prior to counting N-grams.
- -classes file
- 
Interpret the LM as a class-based N-gram and read class definitions
in 
classes-format(5)
from
file.
-  -tolower 
- 
Map vocabulary to lowercase, eliminating case distinctions.
-  -multiwords 
- 
Split multiwords (words joined by '_') into their components when
reading N-best lists.
- -multi-char C
- 
Character used to delimit component words in multiwords
(an underscore character by default).
- -rescore-lmw lmw
- 
Sets the language model weight used in combining the language model log
probabilities with acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
- -rescore-wtw wtw
- 
Sets the word transition weight used to weight the number of words relative to
the acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
- -posterior-scale scale
- 
Divide the total weighted log score by 
 scale 
when computing normalized posterior probabilities.
This controls the peakedness of the posterior distribution. 
The default is the value chosen for 
-rescore-lmw,
so that language model scores are scaled to have weight 1,
and acoustic scores have weight 1/lmw.
-  -all-ngrams 
- 
Count all N-grams, including those that also occur in the reference string.
By default, N-grams from the N-best lists that also occur in the corresponding
reference are excluded.
- -min-count C
- 
Prune from the output all N-grams whose counts are less than
C.
- -read-counts countsfile
- 
Read N-gram counts from a file.
Each line contains an N-gram of 
words, followed by an integer or fractional count, all separated by whitespace.
Repeated counts for the same N-gram are added.
N-grams from N-best lists are added to those read with this option.
- -write-counts countsfile
- 
Writes total N-gram counts to
countsfile.
The default is to write to stdout.
-  -sort 
- 
Output counts in lexicographic order, as required for
ngram-merge(1).
- -debug level
- 
Set debugging output level.
Level 0 means no debugging.
Debugging messages are written to stderr.
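
How -rescore-lmw, -rescore-wtw and -posterior-scale interact can be illustrated with a small sketch. It assumes that each hypothesis supplies an acoustic log score, a language model log score and a word count, that the combined score is acoustic + lmw * lm + wtw * nwords, and that posteriors are obtained by exponentiating the scaled scores and normalizing over the list. The function name and the weight values used (lmw=8.0, wtw=0.0) are placeholders for illustration, not the tool's documented defaults, and scores are treated as natural logs here, which may differ from the base used in actual N-best files.

```python
import math

def posteriors(hyps, lmw=8.0, wtw=0.0, scale=None):
    """Return normalized posterior probabilities for one N-best list.

    hyps: list of (acoustic_logprob, lm_logprob, num_words) triples
          (hypothetical representation).
    scale: defaults to lmw, so that LM scores get weight 1 and
           acoustic scores get weight 1/lmw.
    """
    if scale is None:
        scale = lmw
    scores = [(ac + lmw * lm + wtw * n) / scale for ac, lm, n in hyps]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

nbest = [(-120.0, -10.0, 3), (-118.0, -12.0, 3), (-125.0, -9.5, 4)]
sharp = posteriors(nbest, scale=1.0)   # small scale: peaked distribution
flat = posteriors(nbest, scale=50.0)   # large scale: flatter distribution
print(sharp)
print(flat)
```

A larger scale flattens the posterior distribution, spreading count mass over more hypotheses, while a smaller scale concentrates it on the top-scoring ones.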
 SEE ALSO 
ngram(1), ngram-merge(1), ngram-count(1), nbest-scripts(1),
classes-format(5),
A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech
Transcription System",
Proc. NIST Speech Transcription Workshop, College Park, MD, 2000.
 BUGS 
There is no
 -vocab 
option to limit the vocabulary.
 AUTHOR 
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 2000-2008 SRI International