MALLET Classification from the Command Line (DRAFT)

MALLET is a library of Java code for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text. In addition to supporting classification tasks through the MALLET library, MALLET supports classification through the command line programs described here.

For more information about obtaining the source and citing its use, see the MALLET home page.

This documentation is intended as a brief tutorial for the command line classification facilities in MALLET. These facilities are available in MALLET version 0.3 or later, and in CVS versions after March 1, 2004.

The examples on this page assume that you have compiled MALLET and that <mallet-install-directory>/mallet/bin is on your path. Several of the examples also assume that you have downloaded the 20_newsgroups data set and unpacked it in your home directory, so that its files are available in the directory ~/20_newsgroups.

1. Introduction

A typical usage of MALLET for classification involves two steps:
1) Read documents or other objects to be classified into MALLET, and convert them to a list of instances, where each instance is a feature vector.
2) Classify the feature vectors. MALLET can also compute diagnostic information from an instance list, such as information gain, or print the label associated with each instance.

You can obtain on-line documentation for each MALLET command-line program by specifying the --help option. The --help option is useful for checking the latest details of particular options, but it does not provide a tutorial or an overview of MALLET's use.

2. Reading documents, building a list of feature vectors

Before performing classification or diagnostics with MALLET, you must first convert your data into a list of feature vectors. The data may be kept in a single list that is split into training and testing portions at classification time, or you may create the two lists manually.

In the most basic setting, the text data should be in plain text files, one file per document. No special tags are needed at the beginning or end of documents. Thus, for example, you should be able to index a directory of UseNet articles or MH mailboxes without any preprocessing.

The files should be organized in directories, such that all documents with the same class label are contained within a directory. (MALLET does not directly support classification tasks in which individual documents have multiple class labels. We recommend handling this as a series of binary classification tasks.)

To build a list of feature vectors from documents, use the text2vectors command. The --input option specifies a list of directory names, one directory per class. The --output option specifies the name of the file to put the feature vectors into. The command generates one feature vector per document, where each word is a dimension in the vector and the value of the vector at each position is the count of that word in the document. For example, to build a model that distinguishes among the three talk.politics classes of 20_newsgroups (and store the feature vectors in the file news2.vectors), invoke text2vectors like this:

   text2vectors --input ~/20_newsgroups/talk.politics.* 
       --skip-header --output news2.vectors

where ~/20_newsgroups/talk.politics.* would be expanded by the shell like this:
   ~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast
          ~/20_newsgroups/talk.politics.misc
and --skip-header specifies that only text occurring after two blank lines will be accepted from each document.

To build a list containing feature vectors from all 20 newsgroups, type:

   text2vectors --input ~/20_newsgroups/* 
        --skip-header --output all20news.vectors
The label for each instance is derived from the file path of this instance by removing the common prefix of all instances.
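The prefix-stripping behavior can be sketched as follows. This is an illustrative Python sketch of the idea, not MALLET's actual code; the example paths are hypothetical.

```python
import os

def derive_labels(paths):
    """Derive a label for each instance from its file path by stripping the
    longest common directory prefix, mirroring how text2vectors labels
    directory-per-class input. Illustrative sketch, not MALLET's code."""
    prefix = os.path.commonpath(paths)
    labels = []
    for p in paths:
        rest = os.path.relpath(p, prefix)
        # the first remaining path component is the class directory
        labels.append(rest.split(os.sep)[0])
    return labels

paths = [
    "/home/user/20_newsgroups/talk.politics.guns/54321",
    "/home/user/20_newsgroups/talk.politics.mideast/12345",
    "/home/user/20_newsgroups/talk.politics.misc/67890",
]
print(derive_labels(paths))
# -> ['talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc']
```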

2.1. Document Tokenizing Options

When indexing a file, text2vectors turns the file's stream of characters into tokens by a process called tokenization.

By default, text2vectors treats every maximal sequence of alphabetic characters (that is, characters in A-Z and a-z) as a token and changes each token to lowercase. Tokens on the "stoplist", a list of common words such as "the", "of", and "is", can be discarded with the --remove-stopwords option described below.
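The default tokenization can be sketched in a few lines. This is an illustrative approximation, not MALLET's pipe implementation, and the tiny stoplist here is a stand-in for the real one:

```python
import re

# Tiny stand-in for the real stoplist of common English words.
STOPLIST = {"the", "of", "is", "a", "and", "to"}

def tokenize(text, remove_stopwords=False, preserve_case=False):
    """Approximate text2vectors' default tokenization: keep maximal runs
    of A-Z/a-z, lowercase them, and optionally drop stoplist words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not preserve_case:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPLIST]
    return tokens

print(tokenize("The price of MALLET is $0: it's free!"))
# -> ['the', 'price', 'of', 'mallet', 'is', 'it', 's', 'free']
```

Note how non-alphabetic characters ($, digits, apostrophes) simply break tokens apart rather than appearing in them.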

text2vectors supports several options for tokenizing text. For example, the previously introduced --skip-header option causes MALLET to skip newsgroup or email headers before beginning tokenization. We used it for the 20_newsgroups dataset, since the headers include the name of the correct newsgroup!

The classification tool (see section 3.1) normally takes a single set of vectors and splits it into training and testing sets at classification time. You may instead wish to write the training and testing sets to separate files, perhaps because they come from different sources or because you wish to reuse the sets. Because the training and testing vectors used at classification time must share a common processing pipe and dictionaries, producing separate testing and training files requires that the same pipe and dictionaries appear in both files. This is accomplished by specifying the --use-pipe-from option when producing the second file. For example,

   text2vectors --input ~/20_newsgroups/talk* --skip-header --output train.vectors
produces a set of vectors from the USENET talk hierarchy. Following this with
   text2vectors --use-pipe-from train.vectors --input ~/20_newsgroups/alt* --skip-header --output test.vectors
produces a set of test vectors with the same processing pipe and dictionaries as the training vectors, so the two files may be used in the same classification task. Note that the --use-pipe-from option rewrites the specified file with the state of the dictionary after the new text has been processed into vectors, so that the --use-pipe-from vectors and the newly created vectors have exactly the same dictionary.
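The reason a shared dictionary matters can be sketched abstractly: a feature vector only makes sense relative to the word-to-index mapping used to build it. The following is a conceptual Python sketch (not MALLET's pipe mechanism) showing how reusing a dictionary keeps test-set indices aligned with training-set indices:

```python
def vectors_from_docs(docs, dictionary=None):
    """Build count vectors (as index -> count dicts), growing a shared
    word -> index dictionary. Conceptual sketch of why --use-pipe-from
    is needed, not MALLET's actual pipe code."""
    if dictionary is None:
        dictionary = {}
    vectors = []
    for doc in docs:
        counts = {}
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)  # grow the shared dictionary
            idx = dictionary[word]
            counts[idx] = counts.get(idx, 0) + 1
        vectors.append(counts)
    return vectors, dictionary

train_vecs, dictionary = vectors_from_docs(["guns control debate", "mideast peace talks"])
# Reusing the dictionary plays the role of --use-pipe-from:
test_vecs, dictionary = vectors_from_docs(["peace debate"], dictionary)
print(test_vecs)  # 'peace' and 'debate' get the same indices as in training
```

Had the test file been built with a fresh dictionary, "peace" and "debate" would get different indices and the classifier's parameters would not line up with the test vectors.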

Some other examples of handy tokenizing options are:

--preserve-case Do not force all strings to lowercase. (The default is to force lowercase.)
--remove-stopwords Do not include stopwords in the feature vectors. The default is to include them. The stoplist is the SMART system's list of 524 common words, like "the" and "of".
--skip-html Skip all characters between "<" and ">". Useful for tokenizing HTML files. The default is to include text between "<" and ">".
--gram-sizes n1,n2,n3,... Include among the features all n-grams of the specified sizes. The default is not to generate additional n-gram features.
--string-pipe "Pipe constructor" Specify the construction of a MALLET pipe that will be run immediately after the input has become a CharSequence and before the input is tokenized.
--token-pipe "Pipe constructor" Specify the construction of a MALLET pipe that will be run immediately after the input has become a Token but before case-folding, stopword removal, or n-gram feature expansion has occurred.
--fv-pipe "Pipe constructor" Specify the construction of a MALLET pipe that will be run immediately after features have been created from tokens. This is currently the last step in processing.

For a complete list of text2vectors tokenizing options, see the output of text2vectors --help.

2.2. Specifying Feature Vectors directly

In addition to generating feature vectors from documents, MALLET users can specify feature vectors directly in a text form. This text form is converted into an internal list of feature vectors and written to a file with the csv2vectors command.

For example, the command

    csv2vectors --input datafile --output data.vectors
 
will read a file containing "comma-separated-values-style" feature vectors and save it as a MALLET InstanceList of feature vectors. The input datafile is assumed to contain one instance per line, where each line has the following format:
    instance_id_name class_label datafeature1 datafeature2 datafeature3 ...
 
The items on each line may be delimited with whitespace or commas, and may not themselves contain whitespace characters or commas.

Data in other formats can also be accommodated by csv2vectors. The data on each line is split into fields using the following regular expression: ^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$ . Text matching the first parenthesized group on the line is taken as the name, text matching the second as the label, and the remainder of the line as the data features. Group numbers start at 0; the group number for the name in this expression is 0, for the label is 1, and for the data is 2. You may specify your own regular expression for splitting each line into fields with the --lineRegex option, and specify the group numbers to be taken as the name, label, and data features with the --name n, --label n, and --data n options, respectively.
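The line-splitting regular expression above can be exercised directly. The example line below is hypothetical, and note that Python's re module numbers the parenthesized groups 1, 2, and 3 (group 0 being the whole match), whereas the MALLET options use their own numbering convention described above:

```python
import re

# The default line-splitting regex described in the text.
LINE_REGEX = re.compile(r"^(\S*)[\s,]*(\S*)[\s,]*(.*)$")

line = "doc42 talk.politics.guns firearm 3 senate 1 vote 2"
m = LINE_REGEX.match(line)
# In Python's numbering: group 1 = name, group 2 = label, group 3 = data.
name, label, data = m.group(1), m.group(2), m.group(3)
print(name, "|", label, "|", data)
# -> doc42 | talk.politics.guns | firearm 3 senate 1 vote 2
```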

3. Classification

Once a list of feature vectors has been created, MALLET can perform classification. In a typical usage, we split the feature vectors into a training set and a test set. A classifier collects statistics from the training set and derives internal parameters from those statistics. The classifier then applies those parameters to classify the test set and outputs the classifications.

The --num-trials option performs a specified number of trials and prints the classifications of the documents in each trial's test set to standard output. For example,

   vectors2classify --input data.vectors --training-portion 0.6 --num-trials 3
will output the results of three trials, each with a randomized train-test split in which 60 percent of the documents are used for training, and 40 percent for testing. Details of the --training-portion option are described in section 3.1.

The normal output of vectors2classify includes accuracies, standard deviations, and standard errors for the training and test data, and a confusion matrix.

For example, the command

 vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2
will, for an instance list built from the three talk.politics classes, print something like the following:

-------------------- Trial 0  --------------------
 
Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted  accuracy=0.8941666666666667
      label   0   1   2  |total
  0    guns 382   .  18  |400
  1 mideast  12 365  30  |407
  2    misc  47  20 326  |393
 
Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8941666666666667
 
-------------------- Trial 1  --------------------
 
Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9505555555555556
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted  accuracy=0.9025
      label   0   1   2  |total
  0    guns 372   .  17  |389
  1 mideast   5 376  19  |400
  2    misc  58  18 335  |411
 
Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.9025
 
NaiveBayesTrainer
Summary. train accuracy mean = 0.9519444444444445 stddev = 0.001388888888888884 stderr = 9.82092751647979E-4
Summary. test accuracy mean = 0.8983333333333333 stddev = 0.004166666666666652 stderr = 0.002946278254943937
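The summary statistics can be reproduced from the per-trial test accuracies. The printed values match if the standard deviation is the population standard deviation and the standard error is stddev divided by the square root of the number of trials; this is an inference from the printed output above, not a reading of MALLET's source:

```python
from math import sqrt

def summarize(accuracies):
    """Compute mean, population standard deviation, and standard error
    (stddev / sqrt(n)) over a list of per-trial accuracies. Inferred from
    the printed output, not taken from MALLET's source."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    stddev = sqrt(sum((a - mean) ** 2 for a in accuracies) / n)
    stderr = stddev / sqrt(n)
    return mean, stddev, stderr

# The two test-set accuracies from the trials above.
mean, stddev, stderr = summarize([0.8941666666666667, 0.9025])
print(mean, stddev, stderr)
```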

The selection of reported statistics can be specified with the --report option. The default report is --report train:confusion train:accuracy test:accuracy, which prints a confusion matrix and accuracy for each trial, and the mean, standard deviation, and standard error of the accuracy over all trials.

The general form of a report argument is a dataset:statistic pair, where the dataset is the list of instances to be reported on (one of train, validation, or test) and the statistic is the information desired about that instance list (one of confusion, accuracy, f1, or raw). On each trial, the input instance list is partitioned into training, testing, and validation sets. For each of these sets, any combination of statistics can be output. The f1 statistic is reported against a label, which is specified following an "=" as follows:

 vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --report train:f1=mideast
will generate a line like the following in the output:
Trial 0 Trainer NaiveBayesTrainer test data F1(mideast) = 0.9626373626373625

Raw classification results, selected with the raw statistic, are printed as a series of text lines that look something like this:

   /home/mccallum/20_newsgroups/talk.politics.misc/178939 misc misc:0.98 mideast:0.015 guns:0.005

That is, one test file per line, consisting of the following fields:

   directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ...

We have already seen that the (default) output from

 vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2
is the same as:
 vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2 
     --report train:confusion train:accuracy test:accuracy
Note that multiple dataset:statistic pairs can be specified in a single --report option. Multiple --report options can also be given, with the same effect. The same example written with multiple --report options is
 vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2 --report train:confusion --report train:accuracy --report test:accuracy

3.1. Specifying the Training and Testing Sets

In cases in which the test documents have been included with the training documents as part of the list of feature vectors, the training set is specified with the --training-portion option. For example,
   vectors2classify --input data.vectors --training-portion 0.6 --num-trials 1
will use a pseudo-random number generator to select 60 percent of the documents in the feature vector list for the training set, then place the remaining documents in the test set.

The default value for --training-portion is 1.0, indicating that no documents are placed in the test set.

A portion of the feature vectors can be reserved for validation. The validation portion is specified with the --validation-portion option. For example,

   vectors2classify --input data.vectors --training-portion 0.6 --validation-portion 0.1
will use .6 of the feature vectors for training, .3 for testing, and .1 for validation.
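The splitting behavior can be sketched as follows. This is a conceptual Python sketch of a randomized portion-based split, not MALLET's implementation:

```python
import random

def split_portions(instances, training_portion, validation_portion=0.0, seed=0):
    """Shuffle the instances, carve off the training and validation
    portions, and leave the remainder as the test set. Conceptual sketch
    of --training-portion / --validation-portion, not MALLET's code."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_train = int(round(training_portion * len(shuffled)))
    n_valid = int(round(validation_portion * len(shuffled)))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_portions(list(range(10)), 0.6, 0.1)
print(len(train), len(valid), len(test))  # -> 6 1 3
```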

Although the validation set is available to the MALLET classifiers, currently none of the MALLET classifiers are sophisticated enough to use it.

3.1.1. Generating and Classifying vectors from separate files
You can classify files that were converted to vectors at a different time than the training data. Section 2.1 describes one way to generate such files so that they have the same processing pipe and dictionaries and are thus compatible for use in the same classification task. To use separate files for training and testing (and validation), use the --training-file, --testing-file, and --validation-file options instead of the --input option. For example,
   vectors2classify --training-file train.vectors --testing-file test.vectors
will train the default classifier with the vectors in train.vectors and classify the vectors in test.vectors.

Another method of generating separate vector files is splitting an existing vector file. The vectors2vectors program supports the same splitting mechanism found in vectors2classify. For example,

   vectors2vectors --input news2.vectors --training-portion .6 
           --training-file train.vectors --testing-file test.vectors
will randomly split the vectors in news2.vectors into a training file and a testing file in the specified proportion; the two files are compatible (using the same pipe and dictionaries) for classification.

3.2. Selecting the Classification Method

MALLET supports several different classification methods, and the code makes it easy to add more. The default is Naive Bayes, but Maximum Entropy, Decision Tree, and Winnow are also available. These are specified with the --trainer option followed by one of the trainer names: NaiveBayes, MaxEnt, DecisionTree, Winnow. For example,
   vectors2classify --input news2.vectors --trainer MaxEnt --training-portion 0.7
will use Maximum Entropy for classification. More than one trainer can be specified at the same time, which will cause all specified trainers to be trained on the same data. For example,
   vectors2classify --input news2.vectors --trainer NaiveBayes --trainer MaxEnt --training-portion 0.7
will classify the same train/test split using both the Naive Bayes and Maximum Entropy classifiers. Internal to MALLET, trainers and classifiers are separate classes; a classifier is generated from its corresponding trainer class after training. The --trainer option actually specifies a trainer constructor to be run. When no parentheses appear in the argument to --trainer, a "new" is prepended to the argument and "Trainer()" is appended (if not already present) to generate a constructor call. For example, specifying --trainer NaiveBayes is the same as specifying --trainer "new NaiveBayesTrainer()". (Note the quotation marks, which are needed to protect the parentheses from the shell.) You may specify any constructor call directly as the argument to --trainer. Explicitly specifying a constructor (with a preceding "new" and trailing "Trainer") is most often used when passing arguments to a trainer. For example,
   vectors2classify --input news2.vectors --trainer "new MaxEntTrainer(0.01)" --training-portion 0.6
will train using the Maximum Entropy classifier initialized with a Gaussian prior variance of 0.01.
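The constructor-rewriting rule described above can be expressed as a small function. This is a Python paraphrase of the rule as stated in the text, not MALLET's Java implementation:

```python
def expand_trainer_arg(arg):
    """Rewrite a --trainer argument as described in the text: if no
    parentheses appear, prepend "new " and append "Trainer()" (unless
    the name already ends in "Trainer"). Sketch of the stated rule,
    not MALLET's actual code."""
    if "(" in arg:
        return arg  # already an explicit constructor call
    name = arg if arg.endswith("Trainer") else arg + "Trainer"
    return "new " + name + "()"

print(expand_trainer_arg("NaiveBayes"))               # -> new NaiveBayesTrainer()
print(expand_trainer_arg("new MaxEntTrainer(0.01)"))  # unchanged
```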

4. Diagnostics

In addition to using a list of feature vectors for document classification, you can also print various information about them.

4.1. Words by Mutual Information with the Class

To see a list of the words that have the highest average mutual information with the class variable (sorted by mutual information), use the --print-infogain option. For example,

   vectors2info --input all20news.vectors --print-infogain 10

When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard out:

0 windows
1 god
2 dod
3 government
4 writes
5 he
6 team
7 game
8 people
9 x

4.2. Class Labels

To print the labels of all of the classes found in the feature vector list, use the --print-labels option. For example,
   vectors2info --input all20news.vectors --print-labels

When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard out:

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

4.3. Printing Entire Word/Document Matrix

You can print the entire word/document matrix to standard output using the --print-matrix option. Documents are printed one per line. The first (whitespace-separated) field is the document name; this is followed by entries for the words.

There are several alternatives for the format in which the words are printed; all of them are amenable to processing by perl or awk, and are somewhat human-readable. The alternatives are specified by an optional "formatting" argument to the --print-matrix option.

The format is specified as a string of three characters, one chosen from each of the following three groups.

Whether to print entries for all words in the vocabulary, or just the words that actually occur in the document:
   a   all
   s   sparse (default)

Whether to print word counts as integers or as binary presence/absence indicators:
   b   binary
   i   integer (default)

How to indicate the word itself:
   n   integer word index
   w   word string
   c   combination of integer word index and word string (default)
   e   empty; print nothing to indicate the identity of the word

For example, to print a sparse matrix in which the word string and the word counts for each document are listed, use the format string "siw". The command

   vectors2info --input testdata.vectors --print-matrix=siw 

generates a large output, the first part of the first few lines of which are shown here:

file:20news-18828/alt.atheism/49960 alt.atheism  from 13  mathew 3  mantis
file:20news-18828/alt.atheism/51139 alt.atheism  from 2  subject 1  to 6  message
file:20news-18828/alt.atheism/51140 alt.atheism  from 1  subject 1  to 1  message 1  id 1 
file:20news-18828/alt.atheism/51123 alt.atheism  subject 1  atheism 1  to 2  message 1  id 1 
file:20news-18828/alt.atheism/51125 alt.atheism  subject 1  anything 1  to 11  message 1  id 1 
file:20news-18828/alt.atheism/51126 alt.atheism  subject 1  message 1  id 1  date 1  mar 1  gmt 1 
file:20news-18828/alt.atheism/51127 alt.atheism  subject 1  to 1  message 1  id 1  date 1  mar 1  
file:20news-18828/alt.atheism/51130 alt.atheism  from 1  subject 1 

To print a non-sparse matrix indicating the binary presence/absence of all words in the vocabulary for each document, use the format string "abe". The command

   vectors2info --input testdata.vectors --print-matrix=abe

generates a large output, the first part of the first few lines of which are shown here:

file:20news-18828/alt.atheism/53366 alt.atheism  1  1  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/53367 alt.atheism  0  0  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51247 alt.atheism  1  0  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51248 alt.atheism  0  0  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51249 alt.atheism  0  0  1  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51250 alt.atheism  1  1  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51251 alt.atheism  0  0  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51252 alt.atheism  0  1  0  0  0  0  0  0  0  0  
file:20news-18828/alt.atheism/51253 alt.atheism  1  0  0  1  1  0  0  0  0  0  
file:20news-18828/alt.atheism/51254 alt.atheism  0  1  0  0  1  0  0  0  0  0  

For a summary of all the diagnostic options, see the output of vectors2info --help.

5. General options

5.1. Verbosity of Progress Messages

MALLET prints messages about its progress to standard error as it runs. You can change the verbosity of these progress messages with the --verbosity=LEVEL option. The argument LEVEL should be an integer from 0 to 8, with 0 being silent (no progress messages printed to standard error) and 8 being the most verbose. Levels 0-8 correspond to the java.util.logging predefined levels off, severe, warning, info, config, fine, finer, finest, and all. The default verbosity level is taken from the MALLET logging.properties file, which currently defaults to the info level (3).
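The verbosity-to-level mapping above amounts to a simple lookup table, sketched here for reference (the level names come from the text above; the function itself is illustrative, not part of MALLET):

```python
# Verbosity levels 0 (silent) through 8 (most verbose), mapped to the
# predefined java.util.logging level names listed in the text.
LEVELS = ["off", "severe", "warning", "info", "config",
          "fine", "finer", "finest", "all"]

def logging_level(verbosity):
    """Return the logging level name for a --verbosity value."""
    if not 0 <= verbosity <= 8:
        raise ValueError("verbosity must be between 0 and 8")
    return LEVELS[verbosity]

print(logging_level(0), logging_level(3), logging_level(8))
# -> off info all
```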

For example, the following command will print no progress or log messages.

   vectors2classify  --verbosity 0 --input news2.vectors 

Progress messages are messages that are typically very repetitive and for which only the last one is generally of interest. By default, messages that MALLET deems to be progress messages are written on top of each other, with no intervening newline. This is implemented by a custom message formatter installed in the logging hierarchy.

If all messages are to be seen on separate lines, the special progress-message formatting can be turned off with the --noOverwriteProgressMessages option. For example, the MaxEnt trainer prints the log likelihood at each step during training as a progress message. Normally, each of these messages overwrites the previous one on the user's terminal. To suppress this behavior and see each log likelihood on its own line, specify the option as in the following example:

   vectors2classify  --input news2.vectors --trainer MaxEnt --noOverwriteProgressMessages 

5.2. Initializing the Pseudo-Random Seed

MALLET uses a pseudo-random number generator to create the randomized test-train splits described in section 3.1. You can specify the seed for this random number generator using the --random-seed option. For example

   vectors2classify --input news2.vectors --training-portion 0.7 --random-seed=2

If this option is not given, then the seed is set using the computer's real-time clock.


Last updated: 11 March 2004, mccallum@cs.cmu.edu