info.ephyra.nlp
Class NETagger

java.lang.Object
  extended by info.ephyra.nlp.NETagger

public class NETagger
extends java.lang.Object

This class combines model-based, pattern-based and list-based named entity taggers.

The pattern-based taggers are optimized for the tokenizer provided in this class. Do not use other tokenizers.

Version:
2007-07-24
Author:
Nico Schlaefer, Guido Sautter

Field Summary
private static java.lang.String[] allPatternNames
          Collection of all NE types extracted with regular expressions.
private static java.lang.String[] finderNames
          NE types that are recognized by the OpenNLP name finders.
private static opennlp.tools.lang.english.NameFinder[] finders
          Name finders from the OpenNLP project, created from different models.
private static int fuzzyListLookupThreshold
          Edit distance threshold for fuzzy-lookups in dictionaries.
private static java.lang.String[] listNames
          NE types of the entries in the lists.
private static java.lang.String[] lists
          File names of lists that match different types of NEs.
private static java.lang.String[] MODEL_TYPES
          NE types with model-based taggers.
private static int[] patternMaxTokens
          Maximum number of tokens per instance for the different types of NEs.
private static java.lang.String[] patternNames
          NE types that are matched by the regular expressions.
private static java.util.regex.Pattern[] patterns
          Regular expression patterns that match different types of NEs.
private static java.lang.String[] quantityPatternNames
          NE types that are matched by the quantity regular expressions.
private static java.util.regex.Pattern[] quantityPatterns
          Regular expression patterns that match different types of quantity NEs (number + unit).
private static int[] quantityUnitPatternMaxTokens
          Maximum number of tokens per instance for the different types of quantity units.
private static java.util.regex.Pattern[] quantityUnitPatterns
          Regular expression patterns that match different measurement units.
private static java.lang.String[] stanfordNames
          NE types that are recognized by the Stanford NE tagger.
 
Constructor Summary
NETagger()
           
 
Method Summary
private static void addNames(java.lang.String tag, java.util.List names, opennlp.tools.parser.Parse[] tokens)
          Adds named entity information to parses.
static boolean allModelType(java.lang.String[] neTypes)
          Checks if there is a model-based tagger for each of the given NE types.
static java.lang.String[][] extractNes(opennlp.tools.parser.Parse parse)
          THIS METHOD IS NOT USED. Extracts NEs from a parse tree that has been augmented with NE tags.
static java.lang.String[][][] extractNes(java.lang.String[][] sentences)
          Extracts NEs from an array of tokenized sentences.
static java.lang.String[][] extractNes(java.lang.String[][] sentences, int neId)
          Extracts NEs of a particular type from an array of tokenized sentences.
private static void extractNesRec(opennlp.tools.parser.Parse parse, java.util.ArrayList<java.lang.String>[] nes)
          Recursive method called by extractNes(Parse) to extract NEs from a parse tree augmented with NE tags.
static int getFuzzyMatchingThreshold()
          Gets the current value of the edit distance threshold for fuzzy-lookups in dictionaries.
static int[] getNeIds(java.lang.String neType)
          Returns the IDs of the taggers for the given NE type (there may be more than one).
static java.lang.String getNeType(int neId)
          Returns the NE type that is recognized by the tagger with the given ID.
static int getNumberOfTaggers()
          Returns the number of NE taggers.
static boolean hasModelType(java.lang.String[] neTypes)
          Checks if there is a model-based tagger for one of the given NE types.
static boolean isModelType(java.lang.String neType)
          Checks if there is a model-based tagger for the given NE type.
static void loadListTaggers(java.lang.String listDirectory)
          Initializes the list-based NE taggers.
static boolean loadNameFinders(java.lang.String dir)
          Creates the OpenNLP name finders and sets the named entity types that are recognized by the finders.
static void loadRegExTaggers(java.lang.String regExListFileName)
          Initializes the regular expression based NE taggers.
static void setFuzzyMatchingThreshold(int threshold)
          Sets the threshold for fuzzy-lookups in gazetteer lists (aka dictionaries).
static void tagNes(opennlp.tools.parser.Parse[] parses)
          Performs named entity tagging on an array of full parses of sentences.
static java.lang.String[] tagNes(java.lang.String[] sentences)
          THIS METHOD IS NOT USED. Performs named entity tagging on an array of (not tokenized) sentences.
static java.lang.String[] tokenize(java.lang.String text)
          A rule-based tokenizer used to prepare a sentence for NE extraction.
static java.lang.String tokenizeWithSpaces(java.lang.String text)
          Applies the rule-based tokenizer and concatenates the tokens with spaces.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MODEL_TYPES

private static java.lang.String[] MODEL_TYPES
NE types with model-based taggers.


finders

private static opennlp.tools.lang.english.NameFinder[] finders
Name finders from the OpenNLP project, created from different models.


finderNames

private static java.lang.String[] finderNames
NE types that are recognized by the OpenNLP name finders. There may be multiple taggers for the same NE type. IMPORTANT: NE types must be prefix-free.
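
The prefix-free requirement can be checked mechanically. The following is a minimal sketch and not part of NETagger: the class name PrefixFreeCheck and the sample type names in the usage note are hypothetical. It assumes that duplicate names are allowed (there may be multiple taggers for the same NE type) and reports a violation only when one name is a proper prefix of a different name.

```java
// Hypothetical helper, not part of info.ephyra.nlp.NETagger.
final class PrefixFreeCheck {

    /** Returns true iff no name is a proper prefix of a different name. */
    static boolean isPrefixFree(String[] names) {
        for (String a : names) {
            for (String b : names) {
                // Identical names are fine; only a shorter name that
                // starts a longer, different name is a violation.
                if (!a.equals(b) && b.startsWith(a)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With these made-up names, {"NEperson", "NElocation", "NEperson"} passes the check, while {"NEdate", "NEdateRange"} fails because "NEdate" is a prefix of "NEdateRange".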


stanfordNames

private static java.lang.String[] stanfordNames
NE types that are recognized by the Stanford NE tagger. There may be multiple taggers for the same NE type. IMPORTANT: NE types must be prefix-free.


lists

private static java.lang.String[] lists
File names of lists that match different types of NEs.


listNames

private static java.lang.String[] listNames
NE types of the entries in the lists. There may be multiple taggers for the same NE type. IMPORTANT: NE types must be prefix-free.


fuzzyListLookupThreshold

private static int fuzzyListLookupThreshold
Edit distance threshold for fuzzy-lookups in dictionaries.


patterns

private static java.util.regex.Pattern[] patterns
Regular expression patterns that match different types of NEs.


patternMaxTokens

private static int[] patternMaxTokens
Maximum number of tokens per instance for the different types of NEs.


patternNames

private static java.lang.String[] patternNames
NE types that are matched by the regular expressions. There may be multiple taggers for the same NE type. IMPORTANT: NE types must be prefix-free.


quantityPatterns

private static java.util.regex.Pattern[] quantityPatterns
Regular expression patterns that match different types of quantity NEs (number + unit).
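
As an illustration of a "number + unit" pattern, the sketch below builds a quantity pattern from a unit pattern using java.util.regex. The class name, the unit alternatives and the way the two patterns are combined are assumptions made for this example; the actual patterns used by NETagger are loaded at run time and are not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the patterns shipped with NETagger.
final class QuantityPatternSketch {

    // Assumed unit pattern for lengths.
    static final Pattern LENGTH_UNIT =
            Pattern.compile("(?:km|kilometers?|m|meters?)");

    // A quantity is a (possibly decimal) number, whitespace, then a unit.
    static final Pattern LENGTH = Pattern.compile(
            "\\b\\d+(?:\\.\\d+)?\\s+" + LENGTH_UNIT.pattern() + "\\b");

    /** Returns all length quantities found in a space-delimited token string. */
    static List<String> find(String tokenized) {
        List<String> hits = new ArrayList<>();
        Matcher m = LENGTH.matcher(tokenized);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

The trailing word boundary keeps the short alternative "m" from matching the first letter of a longer word such as "meters".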


quantityUnitPatterns

private static java.util.regex.Pattern[] quantityUnitPatterns
Regular expression patterns that match different measurement units.


quantityUnitPatternMaxTokens

private static int[] quantityUnitPatternMaxTokens
Maximum number of tokens per instance for the different types of quantity units.


quantityPatternNames

private static java.lang.String[] quantityPatternNames
NE types that are matched by the quantity regular expressions. There may be multiple taggers for the same NE type. IMPORTANT: NE types must be prefix-free.


allPatternNames

private static java.lang.String[] allPatternNames
Collection of all NE types extracted with regular expressions.

Constructor Detail

NETagger

public NETagger()

Method Detail

loadNameFinders

public static boolean loadNameFinders(java.lang.String dir)
Creates the OpenNLP name finders and sets the named entity types that are recognized by the finders.

Parameters:
dir - directory containing the models for the name finders
Returns:
true iff the name finders were created successfully

loadListTaggers

public static void loadListTaggers(java.lang.String listDirectory)
Initializes the list-based NE taggers.

Parameters:
listDirectory - path of the directory containing the list files

loadRegExTaggers

public static void loadRegExTaggers(java.lang.String regExListFileName)
Initializes the regular expression based NE taggers.

Parameters:
regExListFileName - path and name of the file that lists the names of the patterns in use

getNumberOfTaggers

public static int getNumberOfTaggers()
Returns the number of NE taggers.

Returns:
number of name finders and regular expressions

getNeType

public static java.lang.String getNeType(int neId)
Returns the NE type that is recognized by the tagger with the given ID.

Parameters:
neId - ID of an NE tagger
Returns:
corresponding NE type, or null if the ID is invalid

getNeIds

public static int[] getNeIds(java.lang.String neType)
Returns the IDs of the taggers for the given NE type (there may be more than one).

Parameters:
neType - NE type
Returns:
IDs of the NE taggers
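
The relationship between tagger IDs and NE types described for getNeType and getNeIds can be pictured as a lookup in a table of type names. This is a sketch under stated assumptions, not the actual implementation: the class TaggerIdSketch, the TYPES table, the use of array indices as IDs, and the empty result for an unknown type are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of info.ephyra.nlp.NETagger.
final class TaggerIdSketch {

    // Assumed tagger table: index = tagger ID, value = NE type.
    // "NEperson" appears twice because several taggers may share a type.
    static final String[] TYPES = {"NEperson", "NElocation", "NEperson", "NEdate"};

    /** Returns the NE type for the given ID, or null if the ID is invalid. */
    static String getNeType(int neId) {
        return (neId >= 0 && neId < TYPES.length) ? TYPES[neId] : null;
    }

    /** Returns the IDs of all taggers for the given type (assumed empty if none). */
    static int[] getNeIds(String neType) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i].equals(neType)) {
                ids.add(i);
            }
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

In this table, the type "NEperson" maps to the IDs 0 and 2, and an out-of-range ID such as 7 yields null.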

isModelType

public static boolean isModelType(java.lang.String neType)
Checks if there is a model-based tagger for the given NE type.

Parameters:
neType - NE type
Returns:
true iff there is a model-based tagger for this type

hasModelType

public static boolean hasModelType(java.lang.String[] neTypes)
Checks if there is a model-based tagger for one of the given NE types.

Parameters:
neTypes - NE types
Returns:
true iff there is a model-based tagger for one of these types

allModelType

public static boolean allModelType(java.lang.String[] neTypes)
Checks if there is a model-based tagger for each of the given NE types.

Parameters:
neTypes - NE types
Returns:
true iff there is a model-based tagger for each of these types
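
The three checks differ only in quantification: isModelType tests one type, hasModelType asks whether any of the given types qualifies, and allModelType asks whether every one does. The sketch below illustrates that difference. The class name and the contents of MODEL_TYPES are hypothetical, and the results for an empty array (false for "any", true for "all") follow from the stream semantics used here, not from anything documented for NETagger.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of info.ephyra.nlp.NETagger.
final class ModelTypeSketch {

    // Assumed stand-in for the private MODEL_TYPES field.
    static final Set<String> MODEL_TYPES =
            new HashSet<>(Arrays.asList("NEperson", "NElocation", "NEorganization"));

    static boolean isModelType(String neType) {
        return MODEL_TYPES.contains(neType);
    }

    /** True iff at least one of the types has a model-based tagger. */
    static boolean hasModelType(String[] neTypes) {
        return Arrays.stream(neTypes).anyMatch(ModelTypeSketch::isModelType);
    }

    /** True iff every one of the types has a model-based tagger. */
    static boolean allModelType(String[] neTypes) {
        return Arrays.stream(neTypes).allMatch(ModelTypeSketch::isModelType);
    }
}
```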

getFuzzyMatchingThreshold

public static int getFuzzyMatchingThreshold()
Gets the current value of the edit distance threshold for fuzzy-lookups in dictionaries.

Returns:
the current value of the fuzzy-lookups threshold

setFuzzyMatchingThreshold

public static void setFuzzyMatchingThreshold(int threshold)
Sets the threshold for fuzzy-lookups in gazetteer lists (aka dictionaries). Setting the threshold to zero (the initial value) disables fuzzy lookups, so the extractNes() and tagNes() methods fall back to their original behavior. A higher threshold causes more strings to be extracted, which adds tolerance for typos in the documents. As a side effect, the processing time of extractNes() and tagNes() grows, especially for large dictionaries.

Parameters:
threshold - the new value for the edit distance threshold for fuzzy-lookups in dictionaries
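
To make the effect of the threshold concrete, the sketch below performs a dictionary lookup that accepts any entry within a given edit distance of the candidate string. It assumes the Levenshtein distance (insertions, deletions and substitutions each cost 1) and a linear scan over the dictionary; the documentation does not say which edit distance or lookup structure NETagger uses, and the class FuzzyLookupSketch is hypothetical.

```java
// Hypothetical sketch, not part of info.ephyra.nlp.NETagger.
final class FuzzyLookupSketch {

    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True iff some dictionary entry is within the threshold of the candidate. */
    static boolean contains(String[] dictionary, String candidate, int threshold) {
        for (String entry : dictionary) {
            if (editDistance(entry, candidate) <= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With a threshold of 0 only exact matches are accepted, so the misspelling "Pittsburg" is not found in a dictionary containing "Pittsburgh"; with a threshold of 1 it is. The scan compares the candidate with every entry, which is consistent with the note above that processing time grows for large dictionaries.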

addNames

private static void addNames(java.lang.String tag,
                             java.util.List names,
                             opennlp.tools.parser.Parse[] tokens)
Adds named entity information to parses.

Parameters:
tag - named entity type
names - spans of tokens that are named entities
tokens - parses for the tokens

extractNesRec

private static void extractNesRec(opennlp.tools.parser.Parse parse,
                                  java.util.ArrayList<java.lang.String>[] nes)
Recursive method called by extractNes(Parse) to extract NEs from a parse tree augmented with NE tags.

Parameters:
parse - a node of a parse tree
nes - NEs found so far

tokenize

public static java.lang.String[] tokenize(java.lang.String text)
A rule-based tokenizer used to prepare a sentence for NE extraction.

Parameters:
text - text to tokenize
Returns:
array of tokens

tokenizeWithSpaces

public static java.lang.String tokenizeWithSpaces(java.lang.String text)
Applies the rule-based tokenizer and concatenates the tokens with spaces.

Parameters:
text - text to tokenize
Returns:
string of space-delimited tokens
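
The relationship between tokenize and tokenizeWithSpaces can be sketched as "tokenize, then join with single spaces". The tokenizer below is a deliberately simple stand-in that splits off punctuation; it is an assumption for illustration only and does not reproduce the rules of the tokenizer in this class, which the pattern-based taggers depend on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the tokenizer implemented in NETagger.
final class TokenizeSketch {

    // A token is either a run of word characters or one punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    /** Tokenizes the text and joins the tokens with single spaces. */
    static String tokenizeWithSpaces(String text) {
        return String.join(" ", tokenize(text));
    }
}
```

For the input "Hello, world!" this stand-in produces the four tokens "Hello", ",", "world" and "!", which are joined to "Hello , world !".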

tagNes

public static java.lang.String[] tagNes(java.lang.String[] sentences)
THIS METHOD IS NOT USED. Performs named entity tagging on an array of (not tokenized) sentences.

Parameters:
sentences - array of sentences
Returns:
array of tagged sentences

tagNes

public static void tagNes(opennlp.tools.parser.Parse[] parses)
Performs named entity tagging on an array of full parses of sentences.

Parameters:
parses - array of full parses of sentences

extractNes

public static java.lang.String[][][] extractNes(java.lang.String[][] sentences)
Extracts NEs from an array of tokenized sentences.

Parameters:
sentences - array of tokenized sentences
Returns:
NEs per sentence and NE type
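
The three-dimensional result can be read as a nested table. The sketch below assumes the layout result[sentence][tagger ID][instance], which is an interpretation of "NEs per sentence and NE type" and is not confirmed by this documentation. The class name and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of info.ephyra.nlp.NETagger.
final class ExtractNesLayoutSketch {

    /** Collects the NEs of one tagger across all sentences. */
    static List<String> collect(String[][][] nes, int neId) {
        List<String> result = new ArrayList<>();
        for (String[][] sentenceNes : nes) {        // assumed: one entry per sentence
            if (neId < 0 || neId >= sentenceNes.length) {
                continue;                           // no such tagger for this sentence
            }
            for (String ne : sentenceNes[neId]) {   // assumed: one entry per instance
                result.add(ne);
            }
        }
        return result;
    }
}
```

Under this assumed layout, collecting the entries for tagger ID 0 from a two-sentence result gathers the instances that tagger found in both sentences, in sentence order.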

extractNes

public static java.lang.String[][] extractNes(java.lang.String[][] sentences,
                                              int neId)
Extracts NEs of a particular type from an array of tokenized sentences.

Parameters:
sentences - array of tokenized sentences
neId - ID of a name finder or regular expression
Returns:
NEs of the particular type per sentence, or null if the ID is invalid

extractNes

public static java.lang.String[][] extractNes(opennlp.tools.parser.Parse parse)
THIS METHOD IS NOT USED. Extracts NEs from a parse tree that has been augmented with NE tags.

Parameters:
parse - a parse tree augmented with NE tags
Returns:
NEs per NE type