info.ephyra.questionanalysis.atype.extractor
Class FeatureExtractor

java.lang.Object
  extended by info.ephyra.questionanalysis.atype.extractor.FeatureExtractor
Direct Known Subclasses:
EnglishFeatureExtractor

public abstract class FeatureExtractor
extends java.lang.Object

A feature extractor for question classification. The most important functionality is provided by the createInstance method (a couple of convenience versions are also provided), which creates an edu.cmu.minorthird.classify.Instance object from basic question data (the original question and its syntactic parse tree). createInstance can be used to extract features for run-time classification. It is also used when loading edu.cmu.minorthird.classify.Example objects from a dataset file at training time (see loadFile and createExample). Thus, feature extraction is accomplished by the same code for both training and run-time classification. An important thing for subclasses to note is that the Instance returned by a createInstance(...) method must have the original question, as a String, as its source.

Version:
2008-02-10
Author:
Justin Betteridge
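
The class description makes two points that a small sketch may help illustrate: training and run-time classification share one feature extraction path, and the resulting Instance must carry the original question as its source. The sketch below uses hypothetical stand-in types (SketchInstance, SketchExample) and a toy "first word" feature. It does not use the MinorThird Instance or Example classes, and it is not the feature set computed by any actual subclass.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative stand-ins only; these are NOT the MinorThird classes.
class SharedExtractionSketch {

    /** Hypothetical stand-in for an Instance: a source plus named features. */
    static final class SketchInstance {
        final String source;                 // must be the original question
        final Map<String, Double> features;

        SketchInstance(String source, Map<String, Double> features) {
            this.source = source;
            this.features = features;
        }
    }

    /** Hypothetical stand-in for an Example: an Instance plus a label. */
    static final class SketchExample {
        final SketchInstance instance;
        final String label;

        SketchExample(SketchInstance instance, String label) {
            this.instance = instance;
            this.label = label;
        }
    }

    /** The single feature extraction path, used at run time. */
    static SketchInstance createInstance(String question) {
        Map<String, Double> features = new LinkedHashMap<>();
        String[] words = question.trim().split("\\s+");
        // Toy feature (an assumption for illustration): the first word.
        features.put("FIRST_WORD=" + words[0].toLowerCase(Locale.ROOT), 1.0);
        // The original question string is kept as the source.
        return new SketchInstance(question, features);
    }

    /** The training-time path reuses createInstance and only adds the label. */
    static SketchExample createExample(String label, String question) {
        return new SketchExample(createInstance(question), label);
    }

    public static void main(String[] args) {
        SketchExample ex = createExample("PERSON", "Who wrote Hamlet ?");
        System.out.println(ex.label + " <- " + ex.instance.source
                + " " + ex.instance.features.keySet());
    }
}
```

Because createExample delegates to createInstance, a labeled training example and an unlabeled run-time question with the same text yield identical features.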

Field Summary
protected  int classLevels
           
protected  java.util.regex.Pattern datasetExamplePattern
          Regular expression describing the format of a line in a question classification dataset.
protected  boolean isInitialized
           
protected  int labelPosition
          The captured group index of the answer type label in the dataset line Pattern.
private static org.apache.log4j.Logger log
           
protected  int numLoaded
           
protected  int parsePosition
          The captured group index of the syntactic parse tree in the dataset line Pattern.
protected  int questionPosition
          The captured group index of the question in the dataset line Pattern.
protected static java.lang.String SPACE_PTRN
           
protected  boolean useClassLevels
           
 
Constructor Summary
FeatureExtractor()
           
 
Method Summary
 edu.cmu.minorthird.classify.Example[] createExample(java.lang.String datasetLine)
          Creates an edu.cmu.minorthird.classify.Example object from one line of a dataset file using createInstance(String, String).
abstract  edu.cmu.minorthird.classify.Instance createInstance(java.util.List<edu.cmu.lti.javelin.qa.Term> terms, java.lang.String parseTree)
          Given a question as a list of Terms and its syntactic parse tree, creates an Instance for question classification by extracting the appropriate features.
abstract  edu.cmu.minorthird.classify.Instance createInstance(java.lang.String question)
          Creates an Instance for question classification when nothing but the original question is available for feature extraction.
 edu.cmu.minorthird.classify.Instance createInstance(java.lang.String question, java.lang.String parseTree)
          Convenience method that tokenizes the given question by whitespace, creates Terms, and calls createInstance(List, String).
 int getClassLevels()
           
 java.util.regex.Pattern getDatasetExamplePattern()
           
 int getLabelPosition()
           
 int getNumLoaded()
           
 int getParsePosition()
           
 int getQuestionPosition()
           
 void initialize()
          Reads in properties from this class's properties file and sets class data members.
 boolean isInitialized()
           
 boolean isUsingClassLevels()
           
 edu.cmu.minorthird.classify.Example[] loadFile(java.lang.String fileName)
          Loads an array of edu.cmu.minorthird.classify.Example objects from the file at the given location, using datasetExamplePattern and createExample.
 void printFeatures(java.lang.String dataSetFileName, java.util.List<java.lang.String> features)
          Prints the features generated for each example in an input file.
 void printFeaturesFromQuestions(java.lang.String questionSetFileName, java.util.List<java.lang.String> features)
          Prints the features generated for each example in an input file.
 void setClassLevels(int classLevels)
           
 void setDatasetExamplePattern(java.util.regex.Pattern datasetExamplePattern)
           
 void setInitialized(boolean isInitialized)
           
 void setLabelPosition(int labelPosition)
           
 void setParsePosition(int parsePosition)
           
 void setQuestionPosition(int questionPosition)
           
 void setUseClassLevels(boolean useClassLevels)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

private static final org.apache.log4j.Logger log

SPACE_PTRN

protected static java.lang.String SPACE_PTRN

datasetExamplePattern

protected java.util.regex.Pattern datasetExamplePattern
Regular expression describing the format of a line in a question classification dataset. The answer type label, the actual question, and the syntactic parse tree are the fields that must be captured by groups, with the group indices specified by labelPosition, questionPosition, and parsePosition, respectively.
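
As a minimal sketch of how the pattern and the three group indices work together, the following assumes a hypothetical tab-separated line format (label, question, parse tree). The actual format and indices are set through the class's properties and are not documented here. The class name DatasetLineSketch, the constants, and the sample line are all illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the real dataset line format comes from configuration.
class DatasetLineSketch {

    // Hypothetical format: label <TAB> question <TAB> parse tree
    static final Pattern DATASET_EXAMPLE_PATTERN =
            Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.+)$");

    // Group indices playing the role of labelPosition, questionPosition
    // and parsePosition for the hypothetical format above.
    static final int LABEL_POSITION = 1;
    static final int QUESTION_POSITION = 2;
    static final int PARSE_POSITION = 3;

    /** Returns {label, question, parseTree}, or null if the line does not match. */
    static String[] parseLine(String line) {
        Matcher m = DATASET_EXAMPLE_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(LABEL_POSITION),
            m.group(QUESTION_POSITION),
            m.group(PARSE_POSITION)
        };
    }

    public static void main(String[] args) {
        String line = "LOCATION\tWhere is Pittsburgh ?\t"
                + "(SBARQ (WHADVP (WRB Where)) (SQ (VBZ is) (NP (NNP Pittsburgh))) (. ?))";
        String[] fields = parseLine(line);
        System.out.println("label     = " + fields[0]);
        System.out.println("question  = " + fields[1]);
        System.out.println("parseTree = " + fields[2]);
    }
}
```

Keeping the indices separate from the pattern means a dataset with a different column order only needs a new pattern and new positions, not new parsing code.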


labelPosition

protected int labelPosition
The captured group index of the answer type label in the dataset line Pattern.


questionPosition

protected int questionPosition
The captured group index of the question in the dataset line Pattern.


parsePosition

protected int parsePosition
The captured group index of the syntactic parse tree in the dataset line Pattern.


classLevels

protected int classLevels

useClassLevels

protected boolean useClassLevels

numLoaded

protected int numLoaded

isInitialized

protected boolean isInitialized
Constructor Detail

FeatureExtractor

public FeatureExtractor()
Method Detail

initialize

public void initialize()
                throws java.lang.Exception
Reads in properties from this class's properties file and sets class data members.

Throws:
java.lang.Exception

createInstance

public abstract edu.cmu.minorthird.classify.Instance createInstance(java.util.List<edu.cmu.lti.javelin.qa.Term> terms,
                                                                    java.lang.String parseTree)
Given a question as a list of Terms and its syntactic parse tree, creates an Instance for question classification by extracting the appropriate features.

Parameters:
terms - the Terms of the question
parseTree - the syntactic parse tree of the question
Returns:
an Instance which can be used for question classification

createInstance

public edu.cmu.minorthird.classify.Instance createInstance(java.lang.String question,
                                                           java.lang.String parseTree)
Convenience method that tokenizes the given question by whitespace, creates Terms, and calls createInstance(List, String).

Parameters:
question - the question to create an Instance from
parseTree - the syntactic parse tree of the question
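
The whitespace tokenization step might look like the sketch below. It assumes that SPACE_PTRN is a regular expression such as "\\s+" (its actual value is not documented here), and it collects plain strings rather than Term objects, since the Term constructor is not shown in this documentation. The class name WhitespaceTokenizerSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows whitespace tokenization, not the real Term creation.
class WhitespaceTokenizerSketch {

    // Assumed value; the real SPACE_PTRN may differ.
    static final String SPACE_PTRN = "\\s+";

    /** Splits a question on runs of whitespace, dropping empty tokens. */
    static List<String> tokenize(String question) {
        List<String> tokens = new ArrayList<>();
        for (String token : question.trim().split(SPACE_PTRN)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation is only split off if it is already space-separated.
        System.out.println(tokenize("Who  wrote Hamlet ?"));
    }
}
```

Under this scheme the input is expected to be pre-tokenized, with punctuation separated from words by spaces.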

createInstance

public abstract edu.cmu.minorthird.classify.Instance createInstance(java.lang.String question)
Creates an Instance for question classification when nothing but the original question is available for feature extraction. Assumes words in the input question are separated by whitespace.

Parameters:
question - the input question
Returns:
the Instance object

createExample

public edu.cmu.minorthird.classify.Example[] createExample(java.lang.String datasetLine)
                                                    throws java.lang.Exception
Creates an edu.cmu.minorthird.classify.Example object from one line of a dataset file using createInstance(String, String).

Parameters:
datasetLine - the line from the dataset file from which to create the Example
Returns:
the Example created
Throws:
java.lang.Exception

loadFile

public edu.cmu.minorthird.classify.Example[] loadFile(java.lang.String fileName)
Loads an array of edu.cmu.minorthird.classify.Example objects from the file at the given location, using datasetExamplePattern and createExample.

Parameters:
fileName - the name of the dataset file

printFeatures

public void printFeatures(java.lang.String dataSetFileName,
                          java.util.List<java.lang.String> features)
Prints the features generated for each example in an input file. If feature types are included as command-line arguments, only those types are printed. Otherwise, all features are printed.

Parameters:
dataSetFileName - the name of the file containing the dataset to load
features - a List of the features to print

printFeaturesFromQuestions

public void printFeaturesFromQuestions(java.lang.String questionSetFileName,
                                       java.util.List<java.lang.String> features)
Prints the features generated for each example in an input file. If feature types are included as command-line arguments, only those types are printed. Otherwise, all features are printed.

Parameters:
questionSetFileName - the name of the file containing the dataset to load
features - a List of the features to print

isInitialized

public boolean isInitialized()
Returns:
the isInitialized

setInitialized

public void setInitialized(boolean isInitialized)
Parameters:
isInitialized - the isInitialized to set

getNumLoaded

public int getNumLoaded()
Returns:
the number of examples loaded

setClassLevels

public void setClassLevels(int classLevels)

getClassLevels

public int getClassLevels()

setUseClassLevels

public void setUseClassLevels(boolean useClassLevels)

isUsingClassLevels

public boolean isUsingClassLevels()

getDatasetExamplePattern

public java.util.regex.Pattern getDatasetExamplePattern()
Returns:
the datasetExamplePattern

setDatasetExamplePattern

public void setDatasetExamplePattern(java.util.regex.Pattern datasetExamplePattern)
Parameters:
datasetExamplePattern - the datasetExamplePattern to set

getLabelPosition

public int getLabelPosition()
Returns:
the labelPosition

setLabelPosition

public void setLabelPosition(int labelPosition)
Parameters:
labelPosition - the labelPosition to set

getParsePosition

public int getParsePosition()
Returns:
the parsePosition

setParsePosition

public void setParsePosition(int parsePosition)
Parameters:
parsePosition - the parsePosition to set

getQuestionPosition

public int getQuestionPosition()
Returns:
the questionPosition

setQuestionPosition

public void setQuestionPosition(int questionPosition)
Parameters:
questionPosition - the questionPosition to set