Rainbow

Rainbow is a program that performs statistical text classification. It is based on the Bow library. For more information about obtaining the source and citing its use, see the Bow home page.

This documentation is intended as a brief tutorial for using rainbow, version 0.9 or later. It is not complete documentation. It is not a tutorial on the source code.

The examples on this page assume that you have compiled libbow and rainbow, and that rainbow is in your path. Several of the examples also assume that you have downloaded the 20_newsgroups data set, unpacked it in your home directory, and therefore that its files are available in the directory ~/20_newsgroups.

1. Introduction

Rainbow is generally used in two steps: (1) have rainbow read your documents and write to disk a "model" containing their statistics; (2) have rainbow use the model to perform classification or diagnostics.

You can obtain on-line documentation of each rainbow command-line option by typing

 rainbow --help | more 
The --help option is useful for checking the latest details of particular options, but does not provide a tutorial or an overview of rainbow's use.

Command-line options in rainbow and all the Bow library frontends are handled by the libargp library from the FSF. Many command-line options have both long and short forms. For example, to set the verbosity level to 4 (to make rainbow give more runtime diagnostic messages than usual), you can type "--verbosity=4", or "--verbosity 4", or "-v 4". For more detail about the verbosity option, see section 5.1.

2. Reading the documents, building a model

Before performing classification or diagnostics with rainbow, you must first have rainbow index your data--that is, read your documents and archive a "model" containing their statistics. The text indexed for the model must contain all the training data. The testing data may also be read as part of the model, or it can be left out and read later.

The model is placed in the file system location indicated by the -d option. If no -d option is given, the name ~/.rainbow is used by default. (The model name is actually a file system directory containing separate files for different aspects of the model. If the model directory location does not exist when rainbow is invoked, rainbow will create it automatically.)

In the most basic setting, the text data should be in plain text files, one file per document. No special tags are needed at the beginning or end of documents. Thus, for example, you should be able to index a directory of UseNet articles or MH mailboxes without any preprocessing. The files should be organized in directories, such that all documents with the same class label are contained within a directory. (Rainbow does not directly support classification tasks in which individual documents have multiple class labels. I recommend handling this as a series of binary classification tasks.)

To build a model, call rainbow with the --index (or -i) option, followed by one directory name for each class. For example, to build a model that distinguishes among the three talk.politics classes of 20_newsgroups (and store that model in the directory ~/model), invoke rainbow like this:

   rainbow -d ~/model --index ~/20_newsgroups/talk.politics.*
where ~/20_newsgroups/talk.politics.* would be expanded by the shell like this:
   ~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc

To build a model containing all 20 newsgroups, type:

   rainbow -d ~/model --index ~/20_newsgroups/*

2.1. Tokenizing Options

When indexing a file, rainbow turns the file's stream of characters into tokens by a process called tokenization or "lexing".

By default, rainbow tokenizes all alphabetic sequences of characters (that is, characters in A-Z and a-z), changing each sequence to lowercase and tossing out any token that is on the "stoplist", a list of common words such as "the", "of", "is", etc.
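
The default behaviour described above can be approximated in a few lines of Python. This is only an illustrative sketch, not rainbow's lexer: the name default_tokens and the five-word STOPLIST are invented for the example, and rainbow's real stoplist is much longer.

```python
import re

# Tiny stand-in stoplist for illustration only; rainbow's real list is far longer.
STOPLIST = {"the", "of", "is", "a", "and"}

def default_tokens(text):
    """Approximate the default lexing: alphabetic runs, lowercased, stopwords dropped."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words if w.lower() not in STOPLIST]
```

Note that digits and punctuation act as separators here, so a string such as "don't" yields the two tokens "don" and "t".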

Rainbow supports several options for tokenizing text. For example, the --skip-headers (or -h) option causes rainbow to skip newsgroup or email headers before beginning tokenization; it does this by scanning forward until it finds two newlines in a row. (This option should be used for the 20_newsgroups dataset, since the headers include the name of the correct newsgroup!)

   rainbow -d ~/model -h --index ~/20_newsgroups/talk.politics.*
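
The header-skipping rule (scan forward to the first pair of consecutive newlines) can be pictured with the sketch below. The function name skip_headers is hypothetical, and the choice to return an empty string when no blank line exists is an assumption of this sketch; the tutorial does not say what rainbow does in that case.

```python
def skip_headers(text):
    """Return the text that follows the first blank line (two newlines in a row).

    If no blank line is found, this sketch returns an empty string; that
    fallback is an assumption, not documented rainbow behaviour.
    """
    head, separator, body = text.partition("\n\n")
    return body if separator else ""
```

For a UseNet article, everything up to and including the blank line after the last header is dropped, so the "Newsgroups:" line never reaches the tokenizer.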

Some other examples of handy tokenizing options are:

--use-stemming Pass all words through the Porter stemmer before counting them. (The default is not to stem.)
--no-stoplist Include words in the stoplist among the statistics. (The default is to skip them. The stoplist is the SMART system's list of 524 common words, like "the" and "of".)
--istext-avoid-uuencode Attempt to detect when a file mostly consists of a uuencoded block, and if so, skip it. This option is useful for tokenizing UseNet articles, because word statistics can be thrown off by repetitive tokens found in uuencoded images.
--skip-html Skip all characters between "<" and ">". Useful for lexing HTML files.
--lex-pipe-command SHELLCMD Rather than tokenizing the file directly, pass the file as standard input into this shell command, and tokenize the standard output of the shell command. For example, to index only the first 20 lines of each file, use:
   rainbow --lex-pipe-command "head -n 20" -d ~/model --index ~/20_newsgroups/talk.politics.*
--lex-white Rather than tokenizing the file with the default rules (skipping non-alphabetics, downcasing, etc), instead simply grab space-delimited strings, and make no further changes. This option is useful if you want to take complete control of tokenization with your own script, as specified by --lex-pipe-command, and don't want rainbow to make any further changes.

For a complete list of rainbow tokenizing options, see the "Lexing options" section in the output of rainbow --help.
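
As an illustration of taking over tokenization with --lex-pipe-command and --lex-white, here is a minimal sketch of a filter that keeps digits and internal hyphens, which the default rules would discard. The names custom_tokens and run_filter, and the token pattern itself, are choices made for this example only.

```python
import re

# Letters and digits, optionally joined by internal hyphens (example rule only).
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def custom_tokens(text):
    """Keep letters, digits and internal hyphens; lowercase everything."""
    return [t.lower() for t in TOKEN.findall(text)]

def run_filter(stream_in, stream_out):
    """Read one document from stream_in and write its tokens space-delimited."""
    stream_out.write(" ".join(custom_tokens(stream_in.read())) + "\n")
```

To use it as a pipe command, the code would be saved as a script (say ~/mytok.py, a hypothetical path) whose last line calls run_filter(sys.stdin, sys.stdout) after importing sys, and rainbow would be given --lex-pipe-command "python3 ~/mytok.py" together with --lex-white.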

3. Classifying Documents

Once indexing is performed and a model has been archived to disk, rainbow can perform document classification. Statistics from a set of training documents will determine the parameters of the classifier; classification of a set of testing documents will be output.

The --test (or -t) option performs a specified number of trials and prints the classifications of the documents in each trial's test-set to standard output. For example,

   rainbow -d ~/model --test-set=0.4 --test=3
will output the results of three trials, each with a randomized test-train split in which 60 percent of the documents are used for training, and 40 percent for testing. Details of the --test-set option are described in section 3.1.

Classification results are printed as a series of text lines that look something like this:

   /home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005

That is, one test file per line, consisting of the following fields:

   directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ...

The Perl script rainbow-stats, which is provided in the Bow source distribution, reads lines like this and outputs average accuracy, standard error, and a confusion matrix.
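
If you would rather process these lines yourself, the format is easy to parse. The following is a minimal sketch, not a replacement for rainbow-stats: the function names are invented here, and it assumes that every non-blank line it receives is a result line in exactly the format shown above, so any other lines in the output would have to be filtered out first.

```python
from collections import Counter

def parse_result_line(line):
    """Split one result line into (filename, true_class, [(class, score), ...])."""
    fields = line.split()
    filename, true_class = fields[0], fields[1]
    predictions = []
    for item in fields[2:]:
        name, _, score = item.rpartition(":")
        predictions.append((name, float(score)))
    return filename, true_class, predictions

def accuracy_and_confusion(lines):
    """Return (accuracy, Counter keyed by (actual, predicted)) for result lines."""
    confusion = Counter()
    for line in lines:
        if not line.strip():
            continue
        _, actual, predictions = parse_result_line(line)
        confusion[(actual, predictions[0][0])] += 1
    total = sum(confusion.values())
    if total == 0:
        raise ValueError("no result lines")
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / total, confusion
```

The top prediction is taken to be the first class:score pair on the line, following the field description above.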

For example, the command

   rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats
will, for a model built from the three talk.politics classes, print something like the following:

Trial 0

Correct: 1079 out of 1201 (89.84 percent accuracy)

 - Confusion details, row is actual, column is predicted
               classname   0   1   2  :total
 0    talk.politics.guns 372   2  27  :401  92.77%
 1 talk.politics.mideast   6 371  23  :400  92.75%
 2    talk.politics.misc  44  20 336  :400  84.00%

Trial 1

Correct: 1086 out of 1201 (90.42 percent accuracy)

 - Confusion details, row is actual, column is predicted
               classname   0   1   2  :total
 0    talk.politics.guns 377   2  22  :401  94.01%
 1 talk.politics.mideast   6 371  23  :400  92.75%
 2    talk.politics.misc  40  22 338  :400  84.50%

Percent_Accuracy  average 90.13 stderr 0.21

(To give you some idea of the speed of rainbow: On a 200 MHz Pentium, the above rainbow command finishes in 14 seconds. The command reads the model from disk, and performs two trials--each building a model from about 1800 documents and testing on about 1200. The rainbow-stats command finishes in 2 seconds.)
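
The summary line of the sample output can be checked by hand. The sketch below recomputes the per-trial accuracies from the two "Correct: X out of Y" lines and takes the standard error as the population standard deviation divided by the square root of the number of trials. That formula is an assumption that happens to reproduce the printed figures; it was not taken from the rainbow-stats source.

```python
import math

def mean_and_stderr(values):
    """Mean and standard error, using the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance / n)

# Per-trial accuracies recomputed from the "Correct: X out of Y" lines above.
accuracies = [100 * 1079 / 1201, 100 * 1086 / 1201]
mean, stderr = mean_and_stderr(accuracies)
print(f"average {mean:.2f} stderr {stderr:.2f}")  # average 90.13 stderr 0.21
```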

The Perl script rainbow-be, also provided in the Bow source distribution, reads lines like this and outputs precision-recall breakeven points.

You can vary the precision with which classification scores are printed using the --score-precision=NUM option, where NUM is the number of digits to print after the decimal point. Note, however, that several internal variables are of type float (which has only about 7 significant digits of precision) and the classification scores are calculated as doubles (which have only about 15 to 17 significant digits), so precision is inherently limited. The default printed score precision is 10. This option works only with the naive Bayes classifier.

3.1. Specifying the Training and Testing Sets

In cases in which the test documents have been tokenized as part of the model, the test set is specified with the --test-set option. For example,
   rainbow -d ~/model --test-set=0.5 --test=1
will use a pseudo-random number generator to select one-half of the documents in the model and place them into the test set, then place the remaining documents in the training set.

When the argument to --test-set contains no decimal point, the number is interpreted as an exact number of documents. For example,

   rainbow -d ~/model --test-set=30 --test=1
will place 30 documents in the test set, attempting to select a number of documents from each class such that the class proportions in the test set roughly match those in the entire model.

If the number argument is followed by "pc", then the argument indicates a number of documents per class. Thus

   rainbow -d ~/model --test-set=200pc --test=1
will place into the test set 200 randomly-selected documents from each of the classes in the model, for a total of 600 test documents, if the model was built using three classes.

You can also specify exactly which files should be in the test set, listing them by name. If the argument to --test-set contains non-numeric characters, it is interpreted as a filename, which in turn should contain a list of white-space-separated filenames of documents indexed in the model. For example,

   rainbow -d ~/model --test-set=~/filelist1 --test=1
will open the file ~/filelist1 and take from there the list of names of files to be placed in the test set. Note that the class labels of these documents are already known from when the model file was built.

The filenames in the list should be written exactly as they were when the model was built. A list of all the filenames of documents contained in a rainbow model can be obtained with the following command:

 
   rainbow -d ~/model --print-doc-names

See section 4.4 for more details on the --print-doc-names option.

The default value for --test-set is 0, indicating that no documents are placed in the test set. Thus, when using the --test option, you must use the --test-set option in order to give rainbow some documents to classify.
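
The argument forms described in this section can be summarized by the sketch below. It only restates the rules given in the text; how rainbow itself parses the argument is not shown in this tutorial, and the function name and the returned labels are invented for illustration. The keyword remaining is covered in section 3.1.1.

```python
import re

def describe_split_argument(arg):
    """Classify a --test-set style argument according to the rules in this section."""
    if arg == "remaining":
        return ("remaining", None)           # keyword described in section 3.1.1
    if re.fullmatch(r"\d+pc", arg):
        return ("per_class", int(arg[:-2]))  # e.g. 200pc: documents per class
    if re.fullmatch(r"\d+", arg):
        return ("count", int(arg))           # no decimal point: exact number
    if re.fullmatch(r"\d*\.\d+|\d+\.\d*", arg):
        return ("fraction", float(arg))      # decimal point: fraction of documents
    return ("file", arg)                     # anything else: file of document names
```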

3.1.1. Training Set

The training set can be specified using the --train-set option with the same types of arguments described above. For example,

   rainbow -d ~/model --test-set=~/filelist1 --train-set=~/filelist2 --test=1
will take all test documents from the list in ~/filelist1, all training documents from ~/filelist2, and ignore all documents that don't appear in either list. It is an error for a document to be listed in both the test set and the train set.
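
Because the two lists must not overlap, it is convenient to generate both from a single shuffled list of document names. The helper names below are hypothetical; the sketch assumes you already have the names as Python strings, for example the first white-space-separated field of each line printed by --print-doc-names.

```python
import random

def split_names(names, test_fraction, seed=0):
    """Shuffle document names and return disjoint (test, train) lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:n_test], shuffled[n_test:]

def write_list(path, names):
    """Write the names one per line; a newline counts as white space."""
    with open(path, "w") as handle:
        handle.write("\n".join(names) + "\n")
```

The two files written with write_list can then be passed to --test-set and --train-set as in the command above.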

The default value for the --train-set is the keyword remaining, which specifies that all documents not placed in the test set should be placed in the training set.

The keyword remaining can also be used for the test set. For example,

   rainbow -d ~/model --train-set=1pc --test-set=remaining --test=1
will put one document from each class into the training set, and put all the rest of the documents in the testing set.

3.1.2. Classifying Files not in the Model

You can classify files that were not indexed into the model by replacing the --test option with the --test-files option. For example,

   rainbow -d ~/model --test-files ~/more-talk.politics/*
will use all the files in the model as the training set, and output classifications for all files contained in the subdirectories of ~/more-talk.politics/. Note that the number and basenames of the directories listed must match those given to --index when the model was built.
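
A quick way to check the directory requirement before running the command is to compare basenames. This is a sketch with hypothetical helper names; it ignores the order of the directories, and whether rainbow also requires the same order is not stated here.

```python
import os

def class_names(directories):
    """Return the sorted basenames of the given class directories."""
    return sorted(os.path.basename(os.path.normpath(d)) for d in directories)

def same_classes(index_dirs, test_dirs):
    """True when both directory lists have the same number and basenames."""
    return class_names(index_dirs) == class_names(test_dirs)
```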

You can classify a single file (read from standard input or from a specified filename) using the --query option.

3.1.3. Rainbow Classification as a Server

Rainbow can also efficiently classify individual documents not in the model by running as a server. In this mode, rainbow starts, reads the model from disk, then waits for query documents by listening on a network socket.

To do this, run rainbow with the command line option --query-server=PORT (where PORT is some port number larger than 1000). For example

   rainbow -d ~/model --query-server=1821

In order to test the server, telnet to whatever port you specified (e.g. "telnet localhost 1821"), type in a document you want to classify, then type '.' alone on a line, followed by Return. Rainbow will then print back to the socket (and thus to your screen) a list of classes and their scores. If you write your own program to connect to a rainbow server (to replace telnet in this example), make sure to use the sequence "\r\n" to send a newline. Thus, to indicate the end of a query document, you should send the sequence "\r\n.\r\n".
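
A client along these lines could replace telnet. It is a minimal sketch under several assumptions that the text above does not settle: the document is sent as UTF-8, the reply is read until the server closes the connection or a timeout expires, and the function names and default port are illustrative only.

```python
import socket

def frame_query(document):
    """Encode a document for the server: CRLF line endings, then '.' alone on a line."""
    lines = document.splitlines()
    return ("\r\n".join(lines) + "\r\n.\r\n").encode("utf-8")

def query_server(document, host="localhost", port=1821, timeout=10.0):
    """Send one document and return whatever the server writes back, as text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(frame_query(document))
        try:
            while True:
                data = conn.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # no more data arrived within the timeout
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a server started as shown above, query_server("text of the document to classify") would return the list of classes and scores as a string. A document that itself contains a line consisting only of "." would end the query early; this sketch does not attempt to handle that case.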

3.2. Feature Selection

Feature set or "vocabulary" size may be reduced by occurrence counts or by average mutual information with the class variable, which we also call "information gain" (see Cover & Thomas, "Elements of Information Theory", Wiley & Sons, 1991).

--prune-vocab-by-infogain=N (or -T) Remove all but the top N words by selecting words with highest average mutual information with the class variable. Default is N=0, which is a special case that removes no words.
--prune-vocab-by-doc-count=N (or -D) Remove words that occur in N or fewer documents.
--prune-vocab-by-occur-count=N (or -O) Remove words that occur fewer than N times.
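
Note that the two count-based thresholds are defined slightly differently: one drops words at N or below, the other drops words strictly below N. The sketch below spells this out on documents represented as lists of tokens; the function names are invented for the example, and this is not rainbow's code.

```python
from collections import Counter

def prune_by_doc_count(docs, n):
    """Keep words that occur in more than n documents (drop those in n or fewer)."""
    doc_counts = Counter(word for doc in docs for word in set(doc))
    return {w for w, c in doc_counts.items() if c > n}

def prune_by_occur_count(docs, n):
    """Keep words that occur at least n times in total (drop those with fewer than n)."""
    occur_counts = Counter(word for doc in docs for word in doc)
    return {w for w, c in occur_counts.items() if c >= n}
```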

For example, to classify using only the 50 words that have the highest mutual information with the class variable, type:

   rainbow -d ~/model --prune-vocab-by-infogain=50 --test=1

If you want to see what these 50 words are, type:

   rainbow -d ~/model -I 50
There is more information about -I and other diagnostic-printing command-line options in section 4.
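
For readers unfamiliar with the measure, here is a sketch of one textbook definition of mutual information between the class and a word: the word is treated as a binary variable (present or absent in a document) and the result is measured in bits. Rainbow's exact calculation (its event definition, logarithm base, and any smoothing) is not specified in this tutorial, so the numbers rainbow prints need not match this sketch. The function names are hypothetical.

```python
import math
from collections import Counter

def information_gain(docs, labels, word):
    """Mutual information (bits) between the class and presence of `word` in a document."""
    n = len(docs)
    joint = Counter((label, word in doc) for doc, label in zip(docs, labels))
    class_totals = Counter(labels)
    present_totals = Counter(word in doc for doc in docs)
    gain = 0.0
    for (label, present), count in joint.items():
        p_joint = count / n
        p_indep = (class_totals[label] / n) * (present_totals[present] / n)
        gain += p_joint * math.log2(p_joint / p_indep)
    return gain

def top_words(docs, labels, k):
    """Return the k vocabulary words with the highest information gain."""
    vocab = {w for doc in docs for w in doc}
    return sorted(vocab, key=lambda w: (-information_gain(docs, labels, w), w))[:k]
```

A word that appears in every document of one class and in no document of another is maximally informative, while a word spread evenly across classes scores zero.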

3.3. Selecting the Classification Method

Rainbow supports several different classification methods (and the code makes it easy to add more). The default is Naive Bayes, but k-nearest neighbor, TFIDF, and probabilistic indexing are all available. These are specified with the --method (or -m) option, followed by one of the following keywords: naivebayes, knn, tfidf, prind. For example,
   rainbow -d ~/model --method=tfidf --test=1
will use TFIDF/Rocchio for classification.

3.4. Naive Bayes Options

The following options change parameters of Naive Bayes.

--smoothing-method=METHOD Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell. The default is laplace, which is a uniform Dirichlet prior with alpha=2.
--event-model=EVENTNAME Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word (i.e. multinomial, unigram), document (i.e. multi-variate Bernoulli, bit vector), or document-then-word (i.e. document-length-normalized multinomial). For more details on these methods, see A Comparison of Event Models for Naive Bayes Text Classification. The default is word.
--uniform-class-priors When classifying and calculating mutual information, use equal prior probabilities on classes, instead of using the distribution determined from the training data.

4. Diagnostics

In addition to using a model for document classification, you can also print various information about the model.

4.1. Words by Mutual Information with the Class

To see a list of the words that have highest average mutual information with the class variable (sorted by mutual information), use the --print-word-infogain (or -I) option. For example

   rainbow -d ~/model -I 10

When invoked on a model containing all 20 classes of the 20_newsgroups dataset, the following is printed to standard output:

  0.09381 windows
  0.09003 god
  0.07900 dod
  0.07700 government
  0.06609 team
  0.06570 game
  0.06448 people
  0.06323 car
  0.06171 bike
  0.05609 hockey
The above is calculated using all the training data. To restrict the calculation to a subset of the data, use any of the methods for defining the training set described in section 3.1. For example, to calculate mutual information based just on the documents listed in ~/docs1, type:
   rainbow -d ~/model --train-set=~/docs1 -I 10

4.2. Words by Probability

To print the probability of all the words use the --print-word-probabilities option. For example, the following command will print the word probabilities in the talk.politics.mideast class, after pruning the vocabulary to the ten words that have highest mutual information with the class.
   rainbow -d ~/model -T 10 --print-word-probabilities=talk.politics.mideast

Here is the output of this command. Notice that the word probabilities correctly sum to one.

   god                             0.05026782
   people                          0.64977338
   government                      0.24062629
   car                             0.03502266
   game                            0.00412031
   team                            0.01030078
   bike                            0.00041203
   dod                             0.00041203
   hockey                          0.00123609
   windows                         0.00782859
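
These figures are consistent with the default laplace smoothing over the pruned ten-word vocabulary, that is, P(w|c) = (1 + count) / (10 + total). The sketch below demonstrates this. Only the count of 24 for team is given in this tutorial (see the talk.politics.mideast line in section 4.3); the other counts are back-calculated here so that the formula reproduces the printed probabilities, and should be read as an inference rather than as rainbow output.

```python
def laplace_probabilities(counts):
    """P(w|c) = (1 + count(w,c)) / (|V| + total words in c), over the given vocabulary."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (1 + c) / (vocab_size + total) for w, c in counts.items()}

# Counts for talk.politics.mideast. Only team=24 appears in this tutorial;
# the rest are back-calculated so the formula matches the printout above.
counts = {"god": 121, "people": 1576, "government": 583, "car": 84, "game": 9,
          "team": 24, "bike": 0, "dod": 0, "hockey": 2, "windows": 18}

for word, p in laplace_probabilities(counts).items():
    print(f"{word:<30} {p:.8f}")
```

Words that never occur in the class (bike and dod in this reconstruction) still receive the nonzero probability 1/2427, which is the point of smoothing.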

4.3. Word Counts and Probabilities

To print the number of times a word occurs in each class (as well as the total number of words in the class, and the word's probability in each class), use the --print-word-counts option. For example, the following command prints diagnostics about the word team.

   rainbow -d ~/model --print-word-counts=team

Here is the output of the above command, on a model built from 20_newsgroups. Note that the word probabilities (in parentheses) may not simply be equal to the ratio of the two preceding counts because of smoothing.

        2 /    125039  (  0.00002) alt.atheism
        6 /    119511  (  0.00005) comp.graphics
        5 /     91147  (  0.00005) comp.os.ms-windows.misc
        1 /     71002  (  0.00001) comp.sys.mac.hardware
       12 /    131120  (  0.00009) comp.windows.x
       15 /     62130  (  0.00024) misc.forsale
        2 /     83942  (  0.00002) rec.autos
       10 /     78685  (  0.00013) rec.motorcycles
      543 /     88623  (  0.00613) rec.sport.baseball
      970 /    115109  (  0.00843) rec.sport.hockey
        9 /    136655  (  0.00007) sci.crypt
        1 /     81206  (  0.00001) sci.electronics
        8 /    125235  (  0.00006) sci.med
       71 /    128754  (  0.00055) sci.space
        2 /    141389  (  0.00001) soc.religion.christian
       13 /    135054  (  0.00010) talk.politics.guns
       24 /    208367  (  0.00012) talk.politics.mideast
       14 /    164266  (  0.00009) talk.politics.misc
        9 /    130013  (  0.00007) talk.religion.misc

(Note: the probability of the word team is not equal to the probability of team from the --print-word-probabilities command above, because we did not reduce vocabulary size to 10 in this example.)

4.4. Document Names

To print a list of the filenames of all documents, use the --print-doc-names option. Document filenames are printed in the order in which they were indexed. Thus all documents of the same class appear contiguously.

This command is often useful for generating lists of document names to be used with the --test-set and --train-set options.

For example, the following command prints 10 randomly selected documents that were indexed. In order to obtain a random selection, gawk, the GNU version of awk, is used to generate random numbers, and sort is used to permute the list. The command head is then used to select the first 10 from the permuted list.

   rainbow -d ~/model --print-doc-names \
   | gawk '{print rand(), $1}' | sort -n | gawk '{print $2}' | head -n 10

Example output of this command on the 20_newsgroups data set is:

   ~/20_newsgroups/rec.motorcycles/104735
   ~/20_newsgroups/comp.windows.x/67345
   ~/20_newsgroups/sci.med/59555
   ~/20_newsgroups/talk.politics.misc/178418
   ~/20_newsgroups/misc.forsale/76867
   ~/20_newsgroups/rec.sport.hockey/52601
   ~/20_newsgroups/talk.politics.mideast/77394
   ~/20_newsgroups/comp.os.ms-windows.misc/9661
   ~/20_newsgroups/talk.politics.mideast/75947
   ~/20_newsgroups/talk.politics.misc/179105

You can also print the names of just those documents that fall into one of the sets of the test/train split. For example

   rainbow -d ~/model --train-set=3pc --print-doc-names=train
will select three documents from each class to be in the training set, and print just those documents. The output of this command might be:
   ~/20_newsgroups/talk.politics.guns/53329
   ~/20_newsgroups/talk.politics.guns/54704
   ~/20_newsgroups/talk.politics.guns/54656
   ~/20_newsgroups/talk.politics.mideast/76420
   ~/20_newsgroups/talk.politics.mideast/76523
   ~/20_newsgroups/talk.politics.mideast/77392
   ~/20_newsgroups/talk.politics.misc/179005
   ~/20_newsgroups/talk.politics.misc/176939
   ~/20_newsgroups/talk.politics.misc/179083

4.5. Printing Entire Word/Document Matrix

You can print the entire word/document matrix to standard output using the --print-matrix option. Documents are printed one to a line. The first (white-space separated) field is the document name and, as the examples below show, the second is its class name; these are followed by entries for the words.

There are several different alternatives for the format in which the words are printed, and all of them are amenable to processing by perl or awk, and somewhat human-readable. The alternatives are specified by an optional "formatting" argument to the --print-matrix option.

The format is specified as a string of three characters, consisting of one selection from each of the following three groups:

Print entries for all words in the vocabulary, or just print the words that actually occur in the document.
   a   all
   s   sparse (default)
Print word counts as integers or as binary presence/absence indicators.
   b   binary
   i   integer (default)
How to indicate the word itself.
   n   integer word index
   w   word string
   c   combination of integer word index and word string (default)
   e   empty; don't print anything to indicate the identity of the word

For example, to print a sparse matrix, in which the word string and the word counts for each document are listed, use the format string ``siw''. The command

   rainbow -d ~/model -T 100 --print-matrix=siw | head -n 10

reduces the vocabulary to only 100 words, then prints

   ~/20_newsgroups/alt.atheism/53366 alt.atheism  god 2  jesus 1  nasa 2  people 2  
   ~/20_newsgroups/alt.atheism/53367 alt.atheism  jesus 2  jewish 1  christian 1  
   ~/20_newsgroups/alt.atheism/51247 alt.atheism  god 4  evidence 2  
   ~/20_newsgroups/alt.atheism/51248 alt.atheism  
   ~/20_newsgroups/alt.atheism/51249 alt.atheism  nasa 1  country 2  files 1  law 3  system 1  government 1  
   ~/20_newsgroups/alt.atheism/51250 alt.atheism  god 3  people 2  evidence 1  law 1  system 1  public 5  rights 1  fact 1  religious 1  
   ~/20_newsgroups/alt.atheism/51251 alt.atheism  
   ~/20_newsgroups/alt.atheism/51252 alt.atheism  people 4  evidence 2  system 2  religion 1  
   ~/20_newsgroups/alt.atheism/51253 alt.atheism  god 19  christian 1  evidence 1  faith 5  car 2  space 1  game 1  
   ~/20_newsgroups/alt.atheism/51254 alt.atheism  people 1  jewish 3  game 1  bible 7  
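
Output in the siw format is straightforward to load into another program. The sketch below follows the layout of the sample lines above (document name, class name, then alternating word and count); the function name is hypothetical, and the other format strings would need a different parser.

```python
def parse_siw_line(line):
    """Parse one line of siw output into (doc_name, class_name, {word: count})."""
    fields = line.split()
    doc_name, class_name = fields[0], fields[1]
    rest = fields[2:]
    # Remaining fields alternate: word, count, word, count, ...
    counts = {word: int(count) for word, count in zip(rest[0::2], rest[1::2])}
    return doc_name, class_name, counts
```

A document that contains none of the retained vocabulary words, such as 51248 above, simply yields an empty dictionary.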

To print a non-sparse matrix, indicating the binary presence/absence of all words in the vocabulary for each document, use the format string ``abe''. The command

   rainbow -d ~/model -T 10 --print-matrix=abe | head -n 10

reduces the vocabulary to only 10 words, then prints

   ~/20_newsgroups/alt.atheism/53366 alt.atheism  1  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/53367 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51247 alt.atheism  1  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51248 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51249 alt.atheism  0  0  1  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51250 alt.atheism  1  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51251 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51252 alt.atheism  0  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51253 alt.atheism  1  0  0  1  1  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51254 alt.atheism  0  1  0  0  1  0  0  0  0  0  

For a summary of all the diagnostic options, see the "Diagnostics" section of the rainbow --help output.

5. General options

5.1. Verbosity of Progress Messages

Rainbow prints messages about its progress to standard error as it runs. You can change the verbosity of these progress messages with the --verbosity=LEVEL (or -v) option. The argument LEVEL should be an integer from 0 to 5, 0 being silent (no progress messages printed to standard error), and 5 being most verbose. The default is 2.

For example, the following command will print no progress messages.

   rainbow -v 0 -d ~/model -I 10

Some of the progress messages print backspace characters in order to show running counters. When running rainbow with GDB inside an Emacs buffer, however, the backspace character is printed as a character escape sequence and fills the buffer. You can avoid printing progress messages that contain backspace characters by using the --no-backspaces (or -b) option.

5.2. Initializing the Pseudo-Random Seed

Rainbow may use a pseudo-random number generator for several tasks, including the randomized test-train splits described in section 3.1. You can specify the seed for this random number generator using the --random-seed option. For example

   rainbow -d ~/model -t 1 --test-set=0.3 --random-seed=2

You can verify that use of the same random seed results in identical test/train splits by using the --print-doc-names option. For example

   rainbow -d ~/model --random-seed=1 --train-set=4pc --print-doc-names=train
will perform the specified test/train split, then print only the training documents. The above command will produce the same output each time it is called. However, the above command with the --random-seed=1 option removed will print different document names each time.

If this option is not given, then the seed is set using the computer's real-time clock.
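
The principle is the same as with any seeded pseudo-random generator. The sketch below uses Python's generator, not rainbow's, and a hypothetical function name, purely to illustrate why a fixed seed makes a per-class selection repeatable.

```python
import random

def pick_training_docs(docs_by_class, per_class, seed):
    """Pick `per_class` documents from each class using a seeded generator."""
    rng = random.Random(seed)
    return {name: rng.sample(docs, per_class)
            for name, docs in sorted(docs_by_class.items())}
```

Calling the function twice with the same seed returns the same documents; seeding from the clock instead would give a different selection on most runs.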


Last updated: 30 September 1998, mccallum@cs.cmu.edu