segment
 NAME 
segment - segment text using N-gram language model
 SYNOPSIS 
segment [ -help ] option ...
 DESCRIPTION 
 segment 
infers the most likely segmentation (the locations of segment boundaries)
of a text, based on a segment language model.
The language model is a standard backoff N-gram model in ARPA
ngram-format(5),
modeling segmentation using the boundary tags <s> and </s>.
The program reads in a word sequence, finds the most likely locations 
of segment boundaries according to the language model, and 
outputs the word sequence with segment boundaries marked by <s> tags.
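
The search described above can be sketched as a two-state Viterbi pass.
This is a minimal illustration, not the program's actual implementation.
It assumes a hypothetical callback logp(word, context) that returns the
log probability of a token given up to two preceding tokens, and it
assumes that the first word of a sequence always starts a segment.
The names viterbi_segment and toy_logp are made up for this example.

```python
import math

BOS, EOS = "<s>", "</s>"

def viterbi_segment(words, logp, stag=BOS):
    """Return words with stag inserted before each hypothesized segment start.

    logp(word, context) is a hypothetical callback: the log probability of
    word given a tuple of up to two preceding tokens (a trigram context).
    """
    if not words:
        return []
    B, N = 0, 1  # B: boundary just before this word, N: no boundary

    def ctx(j, state):
        # Context seen by the token that follows words[j].
        return (BOS, words[j]) if state == B else (words[j - 1], words[j])

    # Assumption: the first word always starts a segment.
    score = [logp(words[0], (BOS,)), -math.inf]
    back = []
    for i in range(1, len(words)):
        new, ptr = [-math.inf, -math.inf], [B, B]
        for prev in (B, N):
            if score[prev] == -math.inf:
                continue
            c = ctx(i - 1, prev)
            stay = score[prev] + logp(words[i], c)
            cut = score[prev] + logp(EOS, c) + logp(words[i], (BOS,))
            if stay > new[N]:
                new[N], ptr[N] = stay, prev
            if cut > new[B]:
                new[B], ptr[B] = cut, prev
        back.append(ptr)
        score = new

    # Close the final segment with </s>, then trace back the best path.
    last = len(words) - 1
    final = [score[s] + logp(EOS, ctx(last, s)) if score[s] > -math.inf
             else -math.inf for s in (B, N)]
    state = B if final[B] >= final[N] else N
    states = [state]
    for ptr in reversed(back):
        state = ptr[state]
        states.append(state)
    states.reverse()

    out = []
    for w, s in zip(words, states):
        if s == B:
            out.append(stag)
        out.append(w)
    return out

def toy_logp(word, context):
    """Toy stand-in for a language model; it looks at the last context token only."""
    probs = {("<s>", "hello"): 0.5, ("hello", "world"): 0.6,
             ("world", "</s>"): 0.7, ("<s>", "good"): 0.4,
             ("good", "morning"): 0.6, ("morning", "</s>"): 0.7}
    return math.log(probs.get((context[-1], word), 0.01))

print(" ".join(viterbi_segment("hello world good morning".split(), toy_logp)))
# prints: <s> hello world <s> good morning
```

Only two states per word are needed because a trigram context reaches
back at most two tokens, so knowing whether a boundary directly precedes
the current word is enough to determine the context of the next token.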
 OPTIONS 
Each filename argument can be an ASCII file, or a 
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
-  -help 
- 
Print option summary.
-  -version 
- 
Print version information.
- -order n
- 
Set the maximal N-gram order to be used (default 3).
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
- -debug level
- 
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr.
- -lm file
- 
Read the N-gram model from
file.
- -text file
- 
Read the text to be segmented from
file.
Default input is stdin.
-  -continuous 
- 
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a word sequence.
-  -posteriors 
- 
Use a forward-backward algorithm to compute the posterior probabilities
of a segment boundary at each word transition, and hypothesize a boundary
whenever the probability exceeds 0.5.
By default a Viterbi algorithm is used that computes
the globally most likely segmentation.
 If
 -continuous 
is specified as well,
then this option will produce one line of output per word, containing,
respectively, the <s> tag (if appropriate), the word itself, and the 
posterior probability for a boundary preceding the word.
-  -unk 
- 
Output the unknown word token <unk> for each input word not in the 
language model vocabulary.
The default is to output the input word unchanged.
- -stag string
- 
Use
 string 
to mark segment boundaries in the output.
Default is the start-of-sentence symbol defined in the language model (<s>).
- -bias b
- 
Make a segment boundary a priori more likely by a factor of
b.
This allows balancing of false detection/rejection errors.
The default is 1.
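
How several of the options above act on the input and on the boundary
decision can be sketched in a few helper functions. This is an
illustration under simplifying assumptions, not the program's code: all
function names are hypothetical, and the bias is modeled as scaling the
odds of a boundary at a single word transition, which matches the
description of -bias only when transitions are treated independently.

```python
def read_sequences(lines, continuous=False):
    """Group input words: one sequence per line, or one sequence overall
    when continuous is true (compare -continuous)."""
    per_line = [line.split() for line in lines]
    if continuous:
        return [[w for words in per_line for w in words]]
    return per_line

def map_unknown(words, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words by the unknown token (compare -unk)."""
    return [w if w in vocab else unk for w in words]

def apply_bias(p, b=1.0):
    """Scale the odds of a boundary by b > 0 (simplified view of -bias).

    With b = 1 the probability is unchanged.
    """
    return b * p / (b * p + (1.0 - p))

def mark_boundaries(words, posteriors, stag="<s>", bias=1.0):
    """Insert stag before each word whose (biased) boundary posterior
    exceeds 0.5 (compare -posteriors and -stag)."""
    out = []
    for w, p in zip(words, posteriors):
        if apply_bias(p, bias) > 0.5:
            out.append(stag)
        out.append(w)
    return out

# Hypothetical posteriors for a boundary preceding each word.
words = ["yes", "okay", "thanks"]
posteriors = [0.9, 0.4, 0.2]
print(mark_boundaries(words, posteriors))            # bias 1: one boundary
print(mark_boundaries(words, posteriors, bias=2.0))  # bias 2: two boundaries
```

In this sketch a bias above 1 trades more false detections for fewer
missed boundaries, and a bias below 1 does the opposite.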
 SEE ALSO 
ngram-count(1), ngram-format(5).
A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of
Spontaneous Speech,'' Proc. ICSLP, 1005-1008, 1996.
 BUGS 
Only N-gram models up to trigram order are handled accurately.
For higher-order models, use the more general 
hidden-ngram(1).
 AUTHOR 
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1997-2004 SRI International