This tutorial is intended for people with some familiarity with machine learning who are interested in using Minorthird. It also assumes that you have a working copy of Java, some knowledge of Java, and a copy of Minorthird on your classpath. You will probably want to run this tutorial in a freshly-created directory.
I tested these commands under Windows, but they should work on any platform with Java, as long as you adjust the shell quoting conventions for your environment. A JAR file of the exact version of Minorthird I used in creating this tutorial can be found here (7Mb or so). You can use it for the tutorial if you replace every "java" command with "java -cp minorthird-20070122.jar".
Minorthird is primarily designed for learning how to extract named entities from text.
We'll start out by experimenting with a small toy example, derived from seven short sentences (which William typed in late one night in a whimsical mood while writing some unit tests.)
The substrings in bold are person names, and the task we will consider is learning to detect these names. A fairly common way to use learning methods for this sort of named entity recognition (NER) problem is to reduce the task of identifying substrings to the task of classifying tokens. For example, the first sentence might lead to these training examples for a learner, where positive examples are tokens that are part of a name and negative examples are tokens that are not:
Token | Class Label | Features |
---|---|---|
Hello | NEG | ... |
there | NEG | ... |
, | NEG | ... |
William | POS | ... |
Cohen | POS | ... |
, | NEG | ... |
and | NEG | ... |
welcome | NEG | ... |
to | NEG | ... |
CMU | NEG | ... |
. | NEG | ... |
The table above doesn't give any features, but some natural features of an example would be the value of the token and whether or not it is capitalized. Similar features might also be used for words in a window a few tokens to the left or right of the example:
Token | Class Label | Features |
---|---|---|
Hello | NEG | tokenValue=hello tokenCapitalized=true tokenToRightValue=there tokenToRightCapitalized=false... |
there | NEG | tokenValue=there tokenToRightValue=comma tokenToLeftValue=hello tokenToRightCapitalized=false... |
, | NEG | ... |
Since most learning methods expect numeric features, these symbolic features would typically be encoded as binary (0/1) features:

Token | Class Label | Features |
---|---|---|
Hello | NEG | tokenValueIsHello=1 tokenCapitalized=1 tokenToRightValueIsThere=1 tokenToRightCapitalized=0... |
there | NEG | tokenValueIsThere=1 tokenToRightValueIsComma=1 tokenToLeftValueIsHello=1 tokenToRightCapitalized=0... |
, | NEG | ... |
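The reduction from tokens to feature vectors is simple to implement. As a sketch (this is generic Python, not Minorthird code; all feature names here are illustrative):

```python
def token_features(tokens, i):
    """Binary features for token i, with a one-token window (feature names are made up)."""
    feats = {
        "tokenValueIs" + tokens[i].lower(): 1,
        "tokenCapitalized": int(tokens[i][0].isupper()),
    }
    if i + 1 < len(tokens):
        feats["tokenToRightValueIs" + tokens[i + 1].lower()] = 1
        feats["tokenToRightCapitalized"] = int(tokens[i + 1][0].isupper())
    if i - 1 >= 0:
        feats["tokenToLeftValueIs" + tokens[i - 1].lower()] = 1
    return feats

tokens = ["Hello", "there", ",", "William", "Cohen"]
labels = ["NEG", "NEG", "NEG", "POS", "POS"]  # is the token part of a name?
examples = [(labels[i], token_features(tokens, i)) for i in range(len(tokens))]
```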
Ok - so now we know, in principle, how to reduce the task of learning to do information extraction to the task of learning to do classification. Later on, we'll come back to this, and discuss how to automatically reduce NER to classification using Minorthird. For now, let's get concrete in another way, and look at how to use Minorthird to learn a classifier.
Part of Minorthird is a package of standard machine-learning
methods. The name of this package is
edu.cmu.minorthird.classify
. (Aside: Minorthird uses a
somewhat cumbersome Java convention for naming packages, designed to
make sure that class names don't clash; every Minorthird class starts
with edu.cmu.minorthird
.)
Minorthird's classify package contains lots of Java code that is
called when you learn extractors. However, the classify package can
also be used as a stand-alone package. The package uses a fairly
simple format for examples. To continue, download this sample file, which
approximately follows the example above, and save it to the file
sample1-tokens-binary.data
. If you open it in a text
editor, you will see something like this:
k trainStrings[0] NEG tokens.cap.t=1.0 tokens.eq.charTypePattern.X+x+=1.0 tokens.eq.lc.hello=1.0 right.token_0.eq.charTypePattern.x+=1.0 right.token_0.eq.lc.there=1.0
k trainStrings[0] NEG left.tokenNeg_1.cap.t=1.0 tokens.eq.charTypePattern.x+=1.0 tokens.eq.lc.there=1.0 left.tokenNeg_1.eq.charTypePattern.X+x+=1.0 left.tokenNeg_1.eq.lc.hello=1.0 right.token_0.eq.charTypePattern.,=1.0 right.token_0.eq.lc.,=1.0
k trainStrings[0] NEG right.token_0.cap.t=1.0 tokens.eq.charTypePattern.,=1.0 tokens.eq.lc.,=1.0 left.tokenNeg_1.eq.charTypePattern.x+=1.0 left.tokenNeg_1.eq.lc.there=1.0 right.token_0.eq.charTypePattern.X+x+=1.0 right.token_0.eq.lc.william=1.0
k trainStrings[0] POS tokens.cap.t=1.0 right.token_0.cap.t=1.0 tokens.eq.charTypePattern.X+x+=1.0 tokens.eq.lc.william=1.0 left.tokenNeg_1.eq.charTypePattern.,=1.0 left.tokenNeg_1.eq.lc.,=1.0 right.token_0.eq.charTypePattern.X+x+=1.0 right.token_0.eq.lc.cohen=1.0
k trainStrings[0] POS tokens.cap.t=1.0 left.tokenNeg_1.cap.t=1.0 tokens.eq.charTypePattern.X+x+=1.0 tokens.eq.lc.cohen=1.0 left.tokenNeg_1.eq.charTypePattern.X+x+=1.0 left.tokenNeg_1.eq.lc.william=1.0 right.token_0.eq.charTypePattern.,=1.0 right.token_0.eq.lc.,=1.0
...
(I added the bold for clarity). The data file has one example on each line, and each example corresponds to a token from the example sentences above. The different parts of each example are separated by spaces. Let's step through the first example, left to right, and explain what each part is.
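In case it helps to see the format programmatically, here is a small sketch of a parser for lines like the ones above (the format is inferred from the sample - an optional leading "k" plus a subpopulation id, then the class label, then feature=value pairs - and this is generic Python, not Minorthird code):

```python
def parse_example(line):
    """Parse one line of the dataset format shown above (format inferred from the sample)."""
    parts = line.split()
    subpop = None
    if parts[0] == "k":          # 'k' appears to flag a subpopulation id
        subpop = parts[1]
        parts = parts[2:]
    label = parts[0]
    feats = {}
    for p in parts[1:]:
        name, val = p.rsplit("=", 1)   # rsplit: feature names may contain odd characters
        feats[name] = float(val)
    return subpop, label, feats

line = "k trainStrings[0] NEG tokens.cap.t=1.0 tokens.eq.lc.hello=1.0"
subpop, label, feats = parse_example(line)
```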
Another way of looking at this data is to type in the command

% java edu.cmu.minorthird.classify.Train -data sample1-tokens-binary.data -gui

A window should pop up. Click the button labeled "Show train data" in the middle, and another window will pop up. Click on the fourth line in the top pane - the one labeled "[Class: POS 1.0] sample1-tokens-binary.data:4" - and you should see a new window pop up. The lower pane of the window shows the features for the fourth token in the dataset - the one for the token 'william'. The tabs in the lower pane let you see the subpopulation of the example or its "source". Again, we'll discuss "source" more later - in this case, the example "source" is just the name of the file the example was loaded from, plus its line number, but in other cases the "source" will indicate the original object from which the example was constructed.
Ok - now let's do something with the data. Go back to the first window and hit the button labeled "Start task". Wait for the button labeled "View Results" to become active - when it's no longer grey (this won't take long), press it. You'll see another new window pop up, which shows you the classifier that was learned.
An important hint: not all the error output from Minorthird gets sent to the GUI window. If things don't seem to be working for you, always look at your console window to see if there are any errors.
Many of Minorthird's classifiers are linear combinations of features (or are based on linear combinations). The right column of the window shows the weights given to each feature. The radio buttons on the bottom let you pick among different schemes for sorting the feature values: by feature name, by feature value, or by absolute value. There's also a "tree view", which hierarchically structures the features. Initially the features are unsorted.
Minorthird contains a number of different learners, and it evidently provided some default one here. So far, we don't even know what learner was used, let alone how to change it or modify its parameters.
At the top of the screen is a subpane labeled "Parameter modification", which contains a pulldown list (containing only a single item, "DataClassificationTask") and next to that a button labeled "Edit". Hit the edit button: a similar-looking subwindow will pop up, labeled "Property Editor". Hit the edit button a second time, and you'll see a window with the promising label "Property Editor for trainParams". We want to change the learner, so hit the Edit button for learnerParameters. You will now, finally, see a "Property Editor" with a pull-down list containing the string "NaiveBayes".
Just to get the hang of things, do the following:
At this point, I recommend you mess around a little to see what sort of learners are available in the GUI, and what sort of classifiers are output. Notice that some of these classifiers (e.g. "AdaBoost", "BatchVersion") are meta-learning algorithms, which include other learners that can be configured.
One other useful option to investigate in the "trainParams" Property Editor window is the "saveAsFilename". Filling in something here will force the final classifier to be saved. (Bug: the showTestDetails option doesn't do anything and shouldn't be there.)
What's going on with all these windows? What you're looking at is really just an interactive way of creating and editing some Java objects that will be used to guide an experiment. It's not much of a GUI - but if you're trying to understand what sorts of things are possible in Minorthird, it's perhaps a helpful way of exploring the space. And it has the advantage that it is created automatically, using Java reflection - so as soon as we write a new learning method, you can immediately see it in the GUI.
You can also use Minorthird from the command line, by just leaving off the -gui option. The command line above isn't very useful, because it doesn't save a classifier after it learns one. You can learn a classifier and save it with commands like these:
#use default learner
% java edu.cmu.minorthird.classify.Train -data sample1-tokens-binary.data -saveAs naiveBayes.ser
#specify a learner
% java edu.cmu.minorthird.classify.Train -data sample1-tokens-binary.data -saveAs decisionTree.ser -learner "new DecisionTreeLearner()"
(Aside: I'm using the extension ".ser"
for a "serialized
Java object". Minorthird can construct many different types of
classifiers, all of which are stored using Java's
serialization tools.) The argument to the -learner option
is simply passed in to Java (actually an interpreted version called
BeanShell) and used to construct a Minorthird ClassifierLearner
object. You can look at the Minorthird
Javadocs for the classify package if you want to construct your
own special learner. A warning: Minorthird will give you an
unfriendly-looking stack dump if Java doesn't understand your learner.
(This actually sometimes contains some useful information about what
went wrong.)
If you prefer shorter command lines, there is a trick: you can prepare
a "config file" like this one, the complete
contents of which I will reproduce below:
#specify the data to learn from
data=sample1-tokens-binary.data
#specify the learner
learner=new AdaBoost(new DecisionTreeLearner(2,2),20)
You can then use -config myConfig.txt as a shortcut: e.g., the following two commands are equivalent, except for the name of the file that's created:
% java edu.cmu.minorthird.classify.Train -data sample1-tokens-binary.data -learner "new AdaBoost(new DecisionTreeLearner(2,2),20)" -saveAs ab1.ser
% java edu.cmu.minorthird.classify.Train -config myConfig.txt -saveAs ab2.ser
If you download this test data, you can also test the classifiers you learn using the Test program, with these commands:
% java edu.cmu.minorthird.classify.Test -test sample1-test-tokens-binary.data -loadFrom naiveBayes.ser
% java edu.cmu.minorthird.classify.Test -test sample1-test-tokens-binary.data -loadFrom decisionTree.ser
This should print out some summary statistics on the screen, something like this:
java edu.cmu.minorthird.classify.Test -test sample1-test-tokens-binary.data -loadFrom naiveBayes.ser -saveAs nbTest.eval
*** Minorthird: Version 11.6.11.22 ***
option: test=sample1-test-tokens-binary.data
option: loadFrom=naiveBayes.ser
Error Rate: 0.06976744186046512
. std. deviation error rate: 0.0
Balanced Error Rate: 0.04285714285714286
. error Rate on POS: 0.0
. std. deviation on POS: 0.0
. error Rate on NEG: 0.08571428571428572
. std. deviation on NEG: 0.0
Average Precision: 0.9704861111111112
Maximium F1: 0.9411764705882353
Average Log Loss: 10.783390577458855
Recall: 1.0
Precision: 0.7272727272727273
F1: 0.8421052631578948
Kappa: 0.798751950078003
Another way to look at the results is graphically. When Minorthird
tests a classifier, it creates a Java object (called an Evaluation)
which you can save to a file, or visually inspect. The command

% java edu.cmu.minorthird.classify.Test -test sample1-test-tokens-binary.data -loadFrom naiveBayes.ser -saveAs nbTest.eval -showResult

will save the result to nbTest.eval, and also pop up a swing window that lets you look at the
results in a number of different ways (e.g., to view an ROC curve
or a precision-recall curve). A warning: for this tiny sample set,
there are few errors, so the ROC curves and precision-recall curves
are sometimes invisible.
If you want to really see how a classifier is working, use the option -showTestDetails, for instance with

% java edu.cmu.minorthird.classify.Test -test sample1-test-tokens-binary.data -loadFrom naiveBayes.ser -showResult -showTestDetails

This gives you another visual inspector. One of the top-level tabs is the evaluation that results from the test. The second is called "Details". After selecting this (and resizing the bottom two panes to be larger) you'll see a window like this one.
The upper-right pane shows the classifier being tested. The
upper-left pane shows the predicted value made by the classifier for
each test example. (The string describing a test example starts out
with a string describing its actual class - e.g., [example: [Class:
NEG 1.0]...]
is a negative example.) Below this are some
filters. Check the radio button "incorrect" and hit the "Apply"
button to see a smaller list of three incorrect predictions - all of
which are false positive errors. Click on the first of these
examples, and you will see something
like this. The lower-left pane shows a view of the example: the token
appears to be a capitalized version of the word "who", with no tokens
to the left, and the token "was" to the right.
The lower-right pane shows an explanation of the prediction that was made. If you open up the subnodes of the explanation you can see how each feature contributed to the classification.
A final demo of this tool. Go to the upper-right pane, sort the
features by weight, and click on the second feature:
tokens.eq.lc.clinton
. The message window on the bottom
will now contain a summary of how often that feature appeared in the
test data, and the feature will have been moved to where the
filters appear below the upper-left pane, with a check-box next
to it. Check that box, move the radio button to "all", and hit
"Apply". Finally select the single positive example that appears in
the list in the upper-left pane. You should see something like this window. This shows you
the example that contains that feature, what its features are, and how
(and why) it was classified by the classifier.
A warning: these viewers are great for small test sets, but
take a lot of memory. Use all the viewers, but particularly the
-showTestDetails
option, with care on larger datasets.
You can view a saved evaluation with the command

% java edu.cmu.minorthird.classify.experiments.Evaluation nbTest.eval
If you want to compare two or more evaluations, you can use the main
program for the class EvaluationGroup:
% java edu.cmu.minorthird.classify.experiments.EvaluationGroup nbTest.eval dtTest.eval
The EvaluationGroup viewer also lets you export the results in a format suitable for gnuplot.
After viewing an evaluation with a command like

% java edu.cmu.minorthird.classify.experiments.Evaluation nbTest.eval
you can navigate to the "Properties" tab, type in "learner" and
"naiveBayes" in the two text boxes at the bottom of the screen, and
then hit the "Insert Property" button. You should see something like this. You can then save the
modified evaluation using the "File -> saveAs" menu.
An evaluation file is simply gzipped text, which you can inspect with a command like

% gunzip -c dtTest.eval

The output looks like this:
[POS,NEG]
NEG 2.9766216671438923 NEG
NEG 4.340420741472285 NEG
POS 3.322545484752822 POS
POS 3.4662239457862545 POS
NEG 4.340420741472285 NEG
...
The first line lists the classes associated with the data. (Lines
below this containing an equal sign are where property values are
stored, if there are any). Every other line contains the predicted
class, confidence, and actual class for a specific test example.
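Given this format, a few lines of Python suffice to recompute, say, the error rate from such a dump (a sketch, assuming the three-column layout shown above; not Minorthird code):

```python
def error_rate(lines):
    """Each data line: predicted class, confidence, actual class (as in the dump above).
    The class-list line (starting with '[') and any property lines (containing '=')
    are skipped."""
    rows = [ln.split() for ln in lines
            if ln and "=" not in ln and not ln.startswith("[")]
    wrong = sum(1 for pred, _conf, actual in rows if pred != actual)
    return wrong / len(rows)

dump = ["[POS,NEG]",
        "NEG 2.97 NEG",
        "POS 3.32 POS",
        "POS 1.10 NEG"]
rate = error_rate(dump)
```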
When you want to do a cross-validation experiment, training and testing with separate programs
is cumbersome. Minorthird lets you do this in one go with the TrainTest program. Try the following
command:

% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens-binary.data -learner "new DecisionTreeLearner()" -splitter l

The argument -splitter l means to do a
leave-one-out-experiment. In Minorthird, this means that for
each subpopulation S, a classifier
C(S) will be trained and tested: C(S) will be tested on
S, and trained on all the data from subpopulations other than
S. For this data, Minorthird will train seven classifiers, one
for each sentence.
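The splitting scheme is easy to picture in code. A generic sketch (not Minorthird code), where each example carries a subpopulation id as in the data file:

```python
def leave_one_subpopulation_out(examples):
    """Yield (train, test) splits: each test set is one subpopulation,
    and the train set is everything else. examples: (subpop, label, features)."""
    subpops = []
    for ex in examples:           # collect subpopulation ids, preserving order
        if ex[0] not in subpops:
            subpops.append(ex[0])
    for s in subpops:
        test = [ex for ex in examples if ex[0] == s]
        train = [ex for ex in examples if ex[0] != s]
        yield train, test

data = [("sent1", "NEG", {}), ("sent1", "POS", {}), ("sent2", "NEG", {})]
splits = list(leave_one_subpopulation_out(data))
```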
The result of a TrainTest experiment is an evaluation, not a
classifier. If you use the -saveAs
option to the
TrainTest program, you can save this evaluation - exactly as you can
save the evaluation for Test. You can also use the
-showResult
and -showTestDetails
options,
exactly as with a regular Test, but there are a few differences. The
viewer will have seven new subpanes, each associated with one of the
seven test partitions, and each subpane containing a detailed,
four-pane inspector of one classifier's performance on one test
sentence. There is also an overall evaluation that pools results
across the different test partitions.
Other splitter options are:

k5: a five-fold cross-validation
s10: a stratified ten-fold cross-validation
r70: one test with 70% of the data used for training and the remainder for testing.
Sometimes there is a tradeoff between making learning methods easy to use, and making them easy to implement. Minorthird is primarily an SDK for writing and using learning algorithms, and secondarily a toolkit for doing experiments with learning - so when this tradeoff comes up, we often go with a framework that leads to easy-to-implement learners.
An example of this is the way binary classification is treated. When you're implementing, binary classifiers are an (easy) special case: the classifier only needs to return a single real number y to indicate the prediction (positive or negative, the sign of y) and confidence (the magnitude of y). In Minorthird you can implement a special BinaryClassifierLearner which learns a special BinaryClassifier; or you can implement a more general learner, which learns a multi-class classifier.
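As a sketch of this convention (generic Python, not the Minorthird API; the weights are made up): a linear binary classifier can recover both the prediction and the confidence from a single real number y:

```python
def predict(weights, feats):
    """A linear binary classifier returns one real number y:
    sign(y) gives the predicted class, |y| the confidence."""
    y = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    label = "POS" if y >= 0 else "NEG"
    return label, abs(y)

weights = {"tokens.cap.t": 1.5, "tokens.eq.lc.hello": -2.0}  # illustrative weights
label, confidence = predict(weights, {"tokens.cap.t": 1.0, "tokens.eq.lc.hello": 1.0})
```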
To prepare a dataset for a BinaryClassifierLearner, you must use the
two special classes POS and NEG. To prepare a dataset for a general
learner, you can use any class names you like. This sample file contains a non-binary
dataset, with the classes trueNameBegin
(for the first
token in a "trueName"), trueNameContinue
(for subsequent
tokens in a "trueName") and NEG
(for all other tokens).
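This begin/continue encoding is easy to generate from labeled spans. A generic sketch (not Minorthird code; the span indices below are illustrative):

```python
def begin_continue_labels(n_tokens, spans, type_name="trueName"):
    """Label each token: <type>Begin for the first token of a span,
    <type>Continue for subsequent tokens in it, NEG for all others.
    spans: list of (start, end) token-index pairs, end exclusive."""
    labels = ["NEG"] * n_tokens
    for start, end in spans:
        labels[start] = type_name + "Begin"
        for i in range(start + 1, end):
            labels[i] = type_name + "Continue"
    return labels

# "Hello there , William Cohen , and welcome to CMU ." with the name at tokens 3-4
labels = begin_continue_labels(11, [(3, 5)])
```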
If you download the sample1-tokens.data
file above, and try to run Minorthird on it with the default learner, it will fail with an error message like this:
This error comes from the classification learner that was used - the default learner, Naive Bayes, which only handles binary classification tasks. (Bug: the POS class shouldn't be there; it's not present in the data.)
java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data
*** Minorthird: Version 11.6.11.22 ***
option: data=sample1-tokens.data
java.lang.IllegalStateException: can only learn binary example data: requested schema is [ExampleSchema: [POS, NEG, trueNameBegin, trueNameContinue]]
at edu.cmu.minorthird.classify.OnlineBinaryClassifierLearner.setSchema(OnlineBinaryClassifierLearner.java:20)
at edu.cmu.minorthird.classify.ClassifierTeacher.train(ClassifierTeacher.java:22)
at edu.cmu.minorthird.classify.experiments.Tester.evaluate(Tester.java:36)
at edu.cmu.minorthird.classify.TrainTest$DataClassificationTask.doMain(TrainTest.java:158)
at edu.cmu.minorthird.classify.TrainTest$DataClassificationTask.callMain(TrainTest.java:205)
at edu.cmu.minorthird.classify.TrainTest.main(TrainTest.java:316)
There are a number of learners which can handle multi-class data,
however. On the GUI menu, the available multi-class learners are
MaxEntLearner
and KnnLearner
. You can also
convert your favorite binary learner (say, NaiveBayes) to a
multi-class learner: for instance, you can construct a
OneVsAllLearner
that uses your favorite binary learner in an inner loop to
distinguish each class from all the others, and then combines the
results to construct a multi-class classifier. Here are some
examples:
% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data -splitter l -learner "new MaxEntLearner()"
% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data -splitter l -learner "new MultiClassSVMLearner()"
% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data -splitter l -learner "new KnnLearner()"
% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data -splitter l -learner "new OneVsAllLearner(\"new NaiveBayes()\")"
% java edu.cmu.minorthird.classify.TrainTest -data sample1-tokens.data -splitter l -learner "new OneVsAllLearner(\"new VitorBalancedWinnow()\")"
Important: Notice that the argument to
OneVsAllLearner
is a
string, not a Java object. The OneVsAllLearner
will pass this string to BeanShell several times - once for each
class, in fact - in order to create several copies of the learner
specified.
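The one-vs-all reduction itself is straightforward. Here is a generic sketch (not the Minorthird API), with a deliberately trivial binary "learner" standing in for something like NaiveBayes:

```python
def one_vs_all_train(examples, train_binary):
    """Train one binary classifier per class. examples: (label, feats) pairs;
    train_binary is called once per class with POS/NEG-relabeled data."""
    classes = sorted({label for label, _ in examples})
    return {c: train_binary([("POS" if label == c else "NEG", f)
                             for label, f in examples])
            for c in classes}

def one_vs_all_predict(classifiers, feats):
    """Predict the class whose binary classifier gives the highest score."""
    return max(classifiers, key=lambda c: classifiers[c](feats))

def toy_learner(data):
    """A toy binary 'learner': score = number of features seen with POS in training."""
    pos_feats = {f for label, feats in data if label == "POS" for f in feats}
    return lambda feats: sum(1 for f in feats if f in pos_feats)

examples = [("name", {"cap": 1}), ("other", {"lc": 1})]
clfs = one_vs_all_train(examples, toy_learner)
```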
Minorthird has several algorithms for learning sequential classifiers. The dataset for a sequential classifier looks much like an ordinary dataset, with two important differences: examples are separated by lines containing only an asterisk ("*"), and the examples between two "*" lines (or between a "*" line and the beginning/end of the file) are considered to be a sequence of labeled examples.
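The sequence-splitting convention can be sketched in a few lines (generic Python, not Minorthird code; the example lines below are illustrative):

```python
def read_sequences(lines):
    """Split a sequential dataset into sequences at lines containing only '*';
    each sequence is a list of example lines."""
    seqs, current = [], []
    for ln in lines:
        if ln.strip() == "*":
            if current:
                seqs.append(current)
            current = []
        else:
            current.append(ln)
    if current:                 # sequence ending at end-of-file
        seqs.append(current)
    return seqs

lines = ["NEG f=1.0", "POS f=1.0", "*", "NEG g=1.0"]
seqs = read_sequences(lines)
```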
Here is an example of a sequential
dataset. Save this file to sample1-tokens.seqdata
.
You can learn and apply classifiers and perform experiments with
sequential learners just as you can with ordinary learners: the only
difference is that you need to include the extra argument -type
seq
in your command lines. The following commands should all
work.
% java edu.cmu.minorthird.classify.TrainTest -type seq -data sample1-tokens.seqdata -splitter l -learner "new CRFLearner()"
% java edu.cmu.minorthird.classify.TrainTest -type seq -data sample1-tokens.seqdata -splitter l -learner "new CollinsPerceptronLearner()"
% java edu.cmu.minorthird.classify.TrainTest -type seq -data sample1-tokens.seqdata -splitter l -learner "new CMMLearner(new MaxEntLearner(),1)"

The final command builds a conditional Markov model learner that uses MaxEnt as its inner learner - aka an MEMM. You can also learn a CMM with other inner learners, for instance, an SVM:

% java edu.cmu.minorthird.classify.TrainTest -type seq -data sample1-tokens.seqdata -splitter l -learner "new CMMLearner(new MultiClassSVMLearner(),1)"

The final numeric argument to the CMMLearner constructor
is the number of history features used by the inner learner. If you
want to use an ordinary non-sequential learner on a sequential
dataset, one way to do that is to set the history to zero:
% java edu.cmu.minorthird.classify.TrainTest -type seq -data sample1-tokens.seqdata -splitter l -learner "new CMMLearner(new MultiClassSVMLearner(),0)"

(Bug: the classify.Test -type seq command doesn't seem to work when used on the command line. It does work in the GUI.)
It is awkward to use the minorthird.classify
routines to
perform "named entity recognition by token classification", as
outlined above, for several reasons.
First, NER performance is usually measured by statistics like F1 at
the entity level, which doesn't map directly to any measurement at all
at the level of individual tokens. Second, it's hard to understand
what's going on, since the examples are no longer directly associated
with the text they came from. This is important for debugging, and
feature engineering.
However, Minorthird has additional facilities for working with extraction tasks directly, in the package minorthird.ui. We'll explore some of them below.
To get started, use the command:
% java edu.cmu.minorthird.ui.ViewLabels -labels sample1.train
The labeled dataset sample1.train is built-in
to Minorthird (for debugging and unit tests, mostly) and contains the seven example sentences shown above.
In the right-hand window, click on the "SpanTypes" tab, pull down
"blue" and "trueName", and then click "Apply". Finally click on the
topmost sentence ("William Clinton is a former US president.") and you
should see something like this.
What you're looking at is a small labeled corpus. The top-left window is an overview of the corpus. The top-right window controls what labeled "spans" are highlighted. The bottom window zooms into a particular document that's been highlighted - you can also see how the document has been tokenized, or how it's been named.
Minorthird supports several types of labelings. You can assign a "type" to a span (a sequence of adjacent tokens), or you can assign a "property" to a span. You can also assign properties to tokens (these are visible if you zoom into a document and select the token view). Normally, when you're learning an NER system, span types (or properties) are used to generate the labels that will be predicted by the learned NER, and token properties are used to extend the features that are used by the NER to do the prediction.
Warning: Minorthird allows spans that overlap - for instance, a span type "bigram" might contain every bigram in a document. The viewer does not display overlapping spans properly.
Now try the command:

% java edu.cmu.minorthird.ui.TrainExtractor -labels sample1.train -gui

A window should pop up. If you hit the "Show labels" button, you'll see the same viewer we just explored above. Instead hit the "Edit" button, and you'll see something like this.
To go ahead and train something, you'll need to specify four things (top to bottom).
Now, close the editor windows out with "Oks" and hit "Start Task". When the task finishes, view the result. You should see something like this. This is a viewer for a sequentially-learned voted perceptron classifier. Use the "File" menu on the viewer to save this as "vp1.serialized" (the "serialized" extension is tacked on automatically when you use the File menu).
To test a previously-trained extractor, use this command:

% java edu.cmu.minorthird.ui.TestExtractor -labels sample1.test -spanType trueName -loadFrom vp1.serialized -gui

After you "Start task", the screen will fill with a bunch of numbers, the most important of which is the span F1, which is 0.5.

(Bug: the -showTestDetails command-line option doesn't work for TestExtractor or TrainTestExtractor.)
You can also do this in one go from the command line:

% java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanType trueName -showResult

From the display, select the "Full Test Set" (or equivalently "Test Set 1" - there's only one test set, since we didn't do a cross-validation) from the top tab in the result window. Then, using the right-hand side control tools, select the span type "_prediction" to be highlighted in grey. You should see something like this. If you want something more colorful, you could try generating this display instead. The color key for this display is: true positive tokens are green, false positive tokens are red, and false negative tokens are blue.
The default learner is a little like naive Bayes - fast, but not
optimal. You may want to customize it. Use the command

% java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanType trueName -gui

and edit the "training parameters" for the default selection, "VPHMM". You'll see a number of options to set:
annotationType is the label that is added to a labeled corpus when an extractor is run. By default this is "_prediction", but you can change this if you like.

sequenceClassifierLearner is the learning method that will be used on this dataset. There are a number of possibilities, which should be familiar to you from the sequential learning section above.

spanFeatureExtractor controls how a span is converted to a classification instance. (In this case, a span will always contain a single token, but that's not always true.)

taggingReduction determines how labels are put on these instances.

Set featureWindowSize=1 for the spanFeatureExtractor, and set the taggingReduction to BeginContinueOutsideReduction.
Close out the editor windows, and hit "Start Task". Before the task
finishes, a window will pop up displaying a dataset. This is much
like the one we saw before, but has a
couple of differences. First, it includes some features named stuff
like "previousLabel.1.NEG" which explicitly describe the nearby labels
in a sequence - this is because we're viewing a
SequentialDataset
instead of a Dataset
.
Second, it includes backpointers from the examples being viewed to the
spans used to create those examples. To see this, select the
"William" token from the top pane, and the "Source" tab on the bottom
pane. You should see something like this.
There are a number of ways to load labeled text data into Minorthird.
One simple way is illustrated by this sample tarball, which you can download and untar into the directory seminar-subset:

% tar -xvzf sample-text.tgz

The files in this directory contain stuff like this:
<paragraph>Art Department Lecture Series Continues with Baltimore Artist <speaker>Joyce Scott</speaker>
<sentence><speaker>Joyce Scott</speaker> has studied traditional Navajo rug weaving, traditional
and contemporary Japanese theatre techniques, and traditional Native
American beadwork</sentence>. <sentence>Her work centers around political and social issues,
particularly issues of race and gender</sentence>. <sentence>Small sculptures of beads and
thread, which at first look toy-like, are actually social commentaries
about the psychological and physical dangers of being an
African-American and a woman in our society</sentence>.</paragraph>
<paragraph> <sentence><speaker>Ms. Scott</speaker> will be speaking Tuesday, November 29, at <stime>5:00 p.m</stime>. in
<location>2315 Doherty Hall</location></sentence>. <sentence>Admission is free</sentence>. </paragraph>

The XML-y tags are pre-processed by Minorthird into a labeled corpus with the spanTypes paragraph, speaker, sentence, etc.
Sometimes, of course, you don't want Minorthird to preprocess
XML tags. You can control this by setting a Java "system property",
like this:

% java -Dedu.cmu.minorthird.sgmlPattern=xyzzy edu.cmu.minorthird.ui.ViewLabels -labels seminar-subset

Here, XML tags are only preprocessed for filenames that match the regular expression defined by the "sgmlPattern" property (i.e., names matching xyzzy).
In this case, you need another way to load in labeled spans.
Minorthird also allows (in fact, encourages, and uses heavily!)
stand-off annotation---externally-stored labels for a set of
documents. Stand-off annotation from the file foo.labels
is automatically loaded when you load in files from directory
foo
. As an example, if you download this file and name it
seminar-subset.labels
, two words in the document named
cmu.andrew.official.career-center-1495_0
will be
annotated as buzzwords when you ui.ViewLabels
the
seminar-subset
corpus. The file, by the way, is:

addToType cmu.andrew.official.career-center-1495_0 203 10 buzzword
addToType cmu.andrew.official.career-center-1495_0 246 13 buzzword
closeAllTypes cmu.andrew.official.career-center-1495_0

The arguments to the addToType "command" are the byte offset at which the span starts, the length of the span in bytes, and the type to be associated with the span. The closeAllTypes line indicates that this document has been
completely labeled, and is necessary when you're using stand-off
annotation to define a learning task.
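The offset arithmetic is easy to check with a sketch (generic Python, not Minorthird code; for ASCII text, byte offsets and character offsets coincide, and the document text, id, and span type below are made up):

```python
def apply_standoff(text, commands):
    """Apply addToType commands (doc-id, byte offset, length, type) to one
    document, returning {type: [substring, ...]} for the labeled spans."""
    spans = {}
    for cmd in commands:
        parts = cmd.split()
        if parts[0] != "addToType":     # e.g. skip closeAllTypes lines
            continue
        _, _doc, offset, length, span_type = parts
        start = int(offset)
        spans.setdefault(span_type, []).append(text[start:start + int(length)])
    return spans

doc = "We seek a self-starter with synergy and leadership."
cmds = ["addToType doc1 10 12 buzzword", "closeAllTypes doc1"]
spans = apply_standoff(doc, cmds)
```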
For programs in the ui package, the -help option will often print a listing of the valid command-line options.
One other useful ui program is PreprocessTextForExtractor, which converts a labeled corpus into a sequential classification dataset:

% java edu.cmu.minorthird.ui.PreprocessTextForExtractor -labels foo -spanType trueName -saveAs foo.seqData
This is useful if you want to loosely integrate the front end of
Minorthird with the backend of some other learning system.
Finally, you can visually inspect a saved classifier or extractor with the ViewerFrame utility:

% java edu.cmu.minorthird.util.gui.ViewerFrame vp1.serialized