For more information about obtaining the source and citing its use, see the MALLET home page.
This documentation is intended as a brief tutorial for using the command-line classification facilities in MALLET. These facilities are available in MALLET version 0.3 or later, and in CVS versions after March 1, 2004.
The examples on this page assume that you have compiled MALLET and that <mallet-install-directory>/mallet/bin is on your path. Several of the examples also assume that you have downloaded the 20_newsgroups data set and unpacked it in your home directory, so that its files are available in the directory ~/20_newsgroups.
You can obtain on-line documentation of each MALLET command-line program by specifying the --help option. The --help option is useful for checking the latest details of particular options, but does not provide a tutorial or an overview of MALLET's use.
Before performing classification or diagnostics with MALLET, you must first convert your data into a list of feature vectors. Data may be in a single list that is split at classification time into training and testing portions, or the user may manually create two lists.
In the most basic setting, the text data should be in plain text files, one file per document. No special tags are needed at the beginning or end of documents. Thus, for example, you should be able to index a directory of UseNet articles or MH mailboxes without any preprocessing.
The files should be organized in directories, such that all documents with the same class label are contained within a directory. (MALLET does not directly support classification tasks in which individual documents have multiple class labels. We recommend handling this as a series of binary classification tasks.)
To build a list of feature vectors from documents, use the text2vectors command. The --input option specifies a list of directory names, one directory per class. The --output option specifies the name of the file to put the feature vectors into. The command generates one feature vector per document, where each word is a dimension in the vector and the value of the vector at each position is the count of that word in the document. For example, to build a model that distinguishes among the three talk.politics classes of 20_newsgroups (and store the feature vectors in the file news2.vectors), invoke text2vectors like this:
text2vectors --input ~/20_newsgroups/talk.politics.* --skip-header --output news2.vectors
where ~/20_newsgroups/talk.politics.* would be expanded by the shell like this:
~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc
and --skip-header specifies that only text occurring after two blank lines will be accepted from each document.
To build a list containing feature vectors from all 20 newsgroups, type:
text2vectors --input ~/20_newsgroups/* --skip-header --output news2.vectors
The label for each instance is derived from the file path of that instance by removing the common prefix of all instances.
When indexing a file, text2vectors turns the file's stream of characters into tokens by a process called tokenization.
By default, text2vectors tokenizes all alphabetic sequences of characters (that is, characters in A-Z and a-z), changing each sequence to lowercase. Tokens appearing on the "stoplist", a list of common words such as "the", "of", and "is", can be discarded with the --remove-stopwords option described below.
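The default tokenization can be illustrated with a small Python sketch. This is not MALLET's actual implementation (MALLET is written in Java, and its real stoplist is much larger than the toy one here); it just shows the behavior described above: extract alphabetic sequences, lowercase them, and optionally drop stoplisted tokens.

```python
import re

# Tiny illustrative stoplist; MALLET's real stoplist is much larger.
STOPLIST = {"the", "of", "is", "a", "an", "to"}

def tokenize(text, remove_stopwords=False):
    """Extract alphabetic character sequences, lowercase them, and
    optionally drop any token that appears on the stoplist."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPLIST]
    return tokens

print(tokenize("The charter of the NRA, amended in 1871",
               remove_stopwords=True))
# -> ['charter', 'nra', 'amended', 'in']
```

Note that the number "1871" and the punctuation disappear entirely, since only alphabetic sequences become tokens.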
text2vectors supports several options for tokenizing text. For example, the previously introduced --skip-header option causes MALLET to skip newsgroup or email headers before beginning tokenization. We used it for the 20_newsgroups dataset, since the headers include the name of the correct newsgroup!
The classification tool (see section 3.1) normally takes a single set of vectors and splits them into training and testing sets at classification time. The user may instead wish to write training and testing sets to separate files, perhaps because they come from different sources, or because the user wishes to reuse the sets. Because the training and testing vectors used at classification time must share a common processing pipe and dictionaries, producing separate testing and training files requires that the same pipe and dictionaries appear in both files. This is accomplished by specifying the --use-pipe-from parameter when producing the second file. For example,
text2vectors --input ~/20_newsgroups/talk* --skip-header --output train.vectors
produces a set of vectors from the USENET talk hierarchy. Following this with
text2vectors --use-pipe-from train.vectors --input ~/20_newsgroups/alt* --skip-header --output test.vectors
produces a set of test vectors with the same processing pipe and dictionaries as the training vectors, and thus one that may be used in the same classification task. Note that the --use-pipe-from option rewrites the specified file with the state of the dictionary after the new text has been processed into vectors, so that the --use-pipe-from vectors and the newly created vectors have exactly the same dictionary.
Some other examples of handy tokenizing options are:
--preserve-case | Do not force all strings to lowercase. (The default is to force lowercase.) |
--remove-stopwords | Do not include stopwords in the feature vectors. The default is to include them. The stoplist is the SMART system's list of 524 common words, like "the" and "of". |
--skip-html | Skip all characters between "<" and ">". Useful for tokenizing HTML files. The default is to include text between "<" and ">". |
--gram-sizes n1,n2,n3.. | Include among features all n-grams of the sizes n1,n2,n3.. specified. The default is not to generate additional n-gram features. |
--string-pipe "Pipe constructor" | Specify the construction of a MALLET pipe that will be run immediately after the input has become a CharSequence and before the input is tokenized. |
--token-pipe "Pipe constructor" | Specify the construction of a MALLET pipe that will be run immediately after the input has become a Token but before case-folding, stopword removal, or n-gram feature expansion has occurred. |
--fv-pipe "Pipe constructor" | Specify the construction of a MALLET pipe that will be run immediately after features have been created from tokens. This is currently the last step in processing. |
For a complete list of text2vectors tokenizing options, see the output of text2vectors --help.
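The --gram-sizes option described above can be sketched in Python. This is an illustration of word n-gram feature generation under the assumption that adjacent tokens are joined into a single feature string; the exact joining convention used by MALLET's n-gram pipe may differ.

```python
def ngrams(tokens, sizes):
    """Return contiguous n-gram features, joined with '_',
    for each requested size."""
    feats = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            feats.append("_".join(tokens[i:i + n]))
    return feats

print(ngrams(["gun", "control", "law"], [2, 3]))
# -> ['gun_control', 'control_law', 'gun_control_law']
```

With --gram-sizes 2,3, features like these would be added to the document's feature vector alongside the ordinary unigram features.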
For example, the command
csv2vectors --input datafile --output data.vectors
will read a file containing "comma-separated-values-style" feature vectors and save it as a MALLET InstanceList of feature vectors. The input datafile is assumed to contain one instance per line, where each line has the following format:
instance_id_name class_label datafeature1 datafeature2 datafeature3 ...
The items on each line may be delimited with whitespace or commas and may not contain whitespace characters or commas themselves.
Data in other formats can also be accommodated by csv2vectors. The data on each line is split into fields using the following regular expression: ^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$ . Data that matches the first parenthesized group on the line is taken as the name, data from the second as the label, and the remainder of the line as the datafeatures. Group numbers start at 0; the group number for the name in this expression is 0, for the label 1, and for the data 2. Users may specify their own regular expression to split each line into fields with the --lineRegex option, and may specify the group numbers to be taken as the name, label, and datafeatures with the --name n, --label n, and --data n options, respectively.
Once a list of feature vectors has been created, MALLET can perform classification. In a typical usage, we split the feature vectors into a training set and a test set. A classifier collects statistics from the training set and derives internal parameters from those statistics. The classifier then applies those parameters to classify the test set and outputs the classifications.
The --num-trials option performs a specified number of trials and prints the classifications of the documents in each trial's test set to standard output. For example,
vectors2classify --input data.vectors --training-portion 0.6 --num-trials 3
will output the results of three trials, each with a randomized train-test split in which 60 percent of the documents are used for training, and 40 percent for testing. Details of the --training-portion option are described in section 3.1.
The normal output of vectors2classify includes accuracies, standard deviations, and standard errors for the training and test data, and a confusion matrix.
For example, the command
vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2
will, for an instance list built from the three talk.politics classes, print something like the following:
-------------------- Trial 0 --------------------
Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted  accuracy=0.8941666666666667
 label       0    1    2  |total
 0 guns    382    .   18  |400
 1 mideast  12  365   30  |407
 2 misc     47   20  326  |393
Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8941666666666667
-------------------- Trial 1 --------------------
Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9505555555555556
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted  accuracy=0.9025
 label       0    1    2  |total
 0 guns    372    .   17  |389
 1 mideast   5  376   19  |400
 2 misc     58   18  335  |411
Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.9025
NaiveBayesTrainer
Summary. train accuracy mean = 0.9519444444444445 stddev = 0.001388888888888884 stderr = 9.82092751647979E-4
Summary. test accuracy mean = 0.8983333333333333 stddev = 0.004166666666666652 stderr = 0.002946278254943937
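The reported test accuracy can be recomputed directly from the confusion matrix: it is the sum of the diagonal (correctly classified instances) divided by the total number of test instances. A quick Python check against the Trial 0 matrix above:

```python
# Trial 0 test confusion matrix from the output above; rows are true
# classes, columns are predicted classes, and "." in the printout means 0.
confusion = [
    [382,   0,  18],  # 0 guns
    [ 12, 365,  30],  # 1 mideast
    [ 47,  20, 326],  # 2 misc
]

correct = sum(confusion[i][i] for i in range(3))   # 1073
total = sum(sum(row) for row in confusion)          # 1200
accuracy = correct / total
print(accuracy)  # 0.8941666666666667, matching the reported value
```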
The selection of training statistics can be specified with the --report option. The default report is --report train:confusion train:accuracy test:accuracy, which prints out a confusion matrix and accuracy for each training trial and the mean, standard deviation, and standard error of the accuracy over all trials.
The general form of a report option is a dataset:statistic pair, where the dataset is the list of instances to be reported on and is one of train, validation, or test, and the statistic is the information desired about that instance list and is one of confusion, accuracy, f1, or raw. On each trial, the input instance list is partitioned into a training, testing, and validation set. For each of these sets, any combination of statistics can be output. The f1 statistic is reported against a label, which is specified following an "=" as follows:
vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --report train:f1=mideast
will generate a line like the following in the output:
Trial 0 Trainer NaiveBayesTrainer test data F1(mideast) = 0.9626373626373625
Raw classification results, selected with the raw statistic, are printed as a series of text lines that look something like this:
/home/mccallum/20_newsgroups/talk.politics.misc/178939 misc misc:0.98 mideast:0.015 guns:0.005
That is, one test file per line, consisting of the following fields:
directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ...
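These raw lines are easy to post-process with a script. The following Python sketch (a hypothetical helper, not part of MALLET) splits one raw result line into its path, true class, and ranked class:score predictions:

```python
def parse_raw_line(line):
    """Split one 'raw' result line into (path, true_class,
    [(predicted_class, score), ...]) with scores as floats."""
    fields = line.split()
    path, true_class = fields[0], fields[1]
    preds = []
    for field in fields[2:]:
        cls, score = field.rsplit(":", 1)
        preds.append((cls, float(score)))
    return path, true_class, preds

path, true_cls, preds = parse_raw_line(
    "/home/mccallum/20_newsgroups/talk.politics.misc/178939 "
    "misc misc:0.98 mideast:0.015 guns:0.005")
print(true_cls, preds[0])  # misc ('misc', 0.98)
```

An instance is classified correctly exactly when true_cls equals preds[0][0], which makes computing per-class accuracy from raw output a few lines of additional code.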
We have already seen that the (default) output from
vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2
is the same as:
vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2 --report train:confusion train:accuracy test:accuracy
Note that multiple dataset:statistic pairs can be specified in a single --report option. Multiple --report options can also be specified, with the same effect. The same example with multiple --report options is
vectors2classify --input news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 2 --report train:confusion --report train:accuracy
vectors2classify --input data.vectors --training-portion 0.6 --num-trials 1
will use a pseudo-random number generator to select 0.6 of the documents in the feature vector list and place them into the training set, then place the remaining documents in the test set.
The default value for --training-portion is 1.0, indicating that no documents are placed in the test set.
A portion of the feature vectors can be reserved for validation. The validation portion is specified with the --validation-portion option. For example,
vectors2classify --input data.vectors --training-portion 0.6 --validation-portion 0.1
will use 0.6 of the feature vectors for training, 0.3 for testing, and 0.1 for validation.
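The three-way partition can be sketched as follows. This is an illustration of the general scheme (shuffle with a seeded generator, carve off the training and validation portions, leave the remainder as the test set), not MALLET's actual splitting code, whose shuffle order will differ.

```python
import random

def split(instances, training_portion, validation_portion, seed=0):
    """Shuffle with a fixed seed, then carve off the training and
    validation portions; the remainder becomes the test set."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_train = int(round(training_portion * len(shuffled)))
    n_valid = int(round(validation_portion * len(shuffled)))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split(list(range(100)), 0.6, 0.1, seed=2)
print(len(train), len(valid), len(test))  # 60 10 30
```

Because the shuffle is seeded, rerunning with the same seed reproduces the same partition, which is the point of the --random-seed option described later.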
Although the validation set is available to the MALLET classifiers, currently none of the MALLET classifiers are sophisticated enough to use it.
vectors2classify --training-file train.vectors --testing-file test.vectors
will train the default classifier with the vectors in train.vectors and classify the vectors in test.vectors.
Another method of generating separate vector files is splitting an existing vector file. The vectors2vectors program supports the same splitting mechanism found in vectors2classify. For example,
vectors2vectors --input news2.vectors --training-portion 0.6 --training-file train.vectors --testing-file test.vectors
will randomly split the vectors in news2.vectors, in the specified proportion, into a training file and a testing file that are compatible (using the same pipe and dictionaries) for classification.
vectors2classify --input news2.vectors --trainer MaxEnt --training-portion 0.7
will use Maximum Entropy for classification. More than one trainer can be specified at the same time, which will cause all specified trainers to be trained on the same data. For example,
vectors2classify --input news2.vectors --trainer NaiveBayes --trainer MaxEnt --training-portion 0.7
will classify the same split of train/test data using both the Naive Bayes and Maximum Entropy classifiers. Internal to MALLET, trainers and classifiers are separate classes; a classifier is generated from its corresponding trainer class after training. The --trainer option actually specifies a trainer constructor to be run. When no parentheses appear in the argument to --trainer, a "new" is prepended to the argument and "Trainer()" is appended to it (if not already present) to generate a constructor call. For example, specifying --trainer NaiveBayes is the same as specifying --trainer "new NaiveBayesTrainer()". (Note that the quotation marks are needed to protect the parentheses from the shell.) Users may specify any constructor call directly as the argument to --trainer. Explicitly specifying a constructor (with a preceding "new" and trailing "Trainer") is most often used when specifying arguments to a trainer. For example,
vectors2classify --input news2.vectors --trainer "new MaxEntTrainer(0.01)" --training-portion 0.6
will train using the Maximum Entropy classifier initialized with a Gaussian prior variance of 0.01.
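The rewriting rule for --trainer arguments can be sketched in Python. This is only an illustration of the rule stated above (prepend "new" and append "Trainer()" when no parentheses are present), not MALLET's actual argument-parsing code, and its handling of edge cases is an assumption.

```python
def trainer_constructor(arg):
    """Rewrite a --trainer argument into a constructor call: when no
    parentheses appear, prepend 'new' and append 'Trainer()' (the
    'Trainer' suffix only if not already present)."""
    if "(" in arg:
        # Assumed: explicit constructor calls are passed through.
        return arg if arg.startswith("new ") else "new " + arg
    name = arg if arg.endswith("Trainer") else arg + "Trainer"
    return "new " + name + "()"

print(trainer_constructor("NaiveBayes"))              # new NaiveBayesTrainer()
print(trainer_constructor("new MaxEntTrainer(0.01)")) # new MaxEntTrainer(0.01)
```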
In addition to using a list of feature vectors for document classification, you can also print various information about them.
To see a list of the words that have highest average mutual information with the class variable (sorted by mutual information), use the --print-infogain option. For example
vectors2info --input all20news.vectors --print-infogain 10
When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard out:
0 windows
1 god
2 dod
3 government
4 writes
5 he
6 team
7 game
8 people
9 x
To see the list of class labels, use the --print-labels option:
vectors2info --input all20news.vectors --print-labels
When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard out:
alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc
You can print the entire word/document matrix to standard output using the --print-matrix option. Documents are printed one to a line. The first (white-space separated) field is the document name; this is followed by entries for the words.
There are several different alternatives for the format in which the words are printed, and all of them are amenable to processing by perl or awk, and somewhat human-readable. The alternatives are specified by an optional "formatting" argument to the --print-matrix option.
The format is specified as a string of three characters, consisting of one selection from each of the following three groups:
Print entries for all words in the vocabulary, or just the words that actually occur in the document:
a | all |
s | sparse (default) |
Print word counts as integers or as binary presence/absence indicators:
b | binary |
i | integer (default) |
How to indicate the word itself:
n | integer word index |
w | word string |
c | combination of integer word index and word string (default) |
e | empty; don't print anything to indicate the identity of the word |
For example, to print a sparse matrix, in which the word string and the word counts for each document are listed, use the format string "siw". The command
vectors2info --input testdata.vectors --print-matrix=siw
generates a large output, the first part of the first few lines of which are shown here:
file:20news-18828/alt.atheism/49960 alt.atheism from 13 mathew 3 mantis
file:20news-18828/alt.atheism/51139 alt.atheism from 2 subject 1 to 6 message
file:20news-18828/alt.atheism/51140 alt.atheism from 1 subject 1 to 1 message 1 id 1
file:20news-18828/alt.atheism/51123 alt.atheism subject 1 atheism 1 to 2 message 1 id 1
file:20news-18828/alt.atheism/51125 alt.atheism subject 1 anything 1 to 11 message 1 id 1
file:20news-18828/alt.atheism/51126 alt.atheism subject 1 message 1 id 1 date 1 mar 1 gmt 1
file:20news-18828/alt.atheism/51127 alt.atheism subject 1 to 1 message 1 id 1 date 1 mar 1
file:20news-18828/alt.atheism/51130 alt.atheism from 1 subject 1
To print a non-sparse matrix, indicating the binary presence/absence of all words in the vocabulary for each document, use the format string "abe". The command
vectors2info --input testdata.vectors --print-matrix=abe
generates a large output, the first part of the first few lines of which are shown here:
file:20news-18828/alt.atheism/53366 alt.atheism 1 1 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/53367 alt.atheism 0 0 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51247 alt.atheism 1 0 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51248 alt.atheism 0 0 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51249 alt.atheism 0 0 1 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51250 alt.atheism 1 1 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51251 alt.atheism 0 0 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51252 alt.atheism 0 1 0 0 0 0 0 0 0 0
file:20news-18828/alt.atheism/51253 alt.atheism 1 0 0 1 1 0 0 0 0 0
file:20news-18828/alt.atheism/51254 alt.atheism 0 1 0 0 1 0 0 0 0 0
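The three-character format string can be sketched as a small Python function. This is an illustration of the selection logic described above, not MALLET's printing code; in particular, the exact rendering of the "c" (combination) form is an assumption.

```python
def format_vector(counts, vocab, fmt="sic"):
    """Render one document's word counts according to a three-character
    format string: {a,s} all vs. sparse, {b,i} binary vs. integer,
    {n,w,c,e} how to identify each word."""
    sparse = fmt[0] == "s"
    binary = fmt[1] == "b"
    out = []
    for idx, word in enumerate(vocab):
        count = counts.get(word, 0)
        if sparse and count == 0:
            continue                      # sparse: skip absent words
        value = (1 if count else 0) if binary else count
        if fmt[2] == "n":
            out.append(f"{idx} {value}")
        elif fmt[2] == "w":
            out.append(f"{word} {value}")
        elif fmt[2] == "c":
            # Assumed rendering of the index/word combination.
            out.append(f"{idx}:{word} {value}")
        else:                              # "e": value only
            out.append(str(value))
    return " ".join(out)

vocab = ["atheism", "from", "subject", "to"]
counts = {"from": 13, "to": 2}
print(format_vector(counts, vocab, "siw"))  # from 13 to 2
print(format_vector(counts, vocab, "abe"))  # 0 1 0 1
```

The two example calls mirror the "siw" and "abe" outputs shown above: the first lists only the words present with their counts, the second lists a binary indicator for every vocabulary position.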
For a summary of all the diagnostic options, see the output of vectors2info --help.
MALLET prints messages about its progress to standard error as it runs. You can change the verbosity of these progress messages with the --verbosity=LEVEL option. The argument LEVEL should be an integer from 0 to 8, 0 being silent (no progress messages printed to standard error), and 8 being the most verbose. Levels 0-8 correspond to the java.util.logging predefined levels off, severe, warning, info, config, fine, finer, finest, and all. The default verbosity level is taken from the MALLET logging.properties file, which currently defaults to the info level (3).
For example, the following command will print no progress or log messages.
vectors2classify --verbosity 0 --input news2.vectors
Progress messages are messages that are typically very repetitive, of which only the last one is generally of interest. By default, messages that MALLET deems to be progress messages are written on top of each other, with no intervening newline. This is implemented by a custom message formatter installed in the logging hierarchy.
If all messages are to be seen on separate lines, the special progress message formatting can be turned off by specifying the --noOverwriteProgressMessages option. For example, the MaxEnt trainer prints the log likelihood at each step during training as a progress message. Normally, each of these messages overwrites the previous one on the user's terminal. To suppress this behavior and see each log likelihood on its own line, the user would specify this option as shown in the following example:
vectors2classify --input news2.vectors --trainer MaxEnt --noOverwriteProgressMessages
MALLET uses a pseudo-random number generator to create the randomized test-train splits described in section 3.1. You can specify the seed for this random number generator using the --random-seed option. For example
vectors2classify --training-portion 0.7 --random-seed=2
If this option is not given, then the seed is set using the computer's real-time clock.