info.ephyra.nlp
Class OpenNLP

java.lang.Object
  extended by info.ephyra.nlp.OpenNLP

public class OpenNLP
extends java.lang.Object

This class provides a common interface to the OpenNLP toolkit.

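It supports the following natural language processing tools: sentence detection, tokenization, part-of-speech (POS) tagging, chunking, full parsing, and coreference linking.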

Version:
2006-05-20
Author:
Nico Schlaefer

Field Summary
private static java.util.regex.Pattern ABUNDANT_BLANKS
          Pattern for abundant blanks.
private static opennlp.tools.lang.english.TreebankChunker chunker
          Chunker from the OpenNLP project.
private static opennlp.tools.lang.english.TreebankLinker linker
          Linker from the OpenNLP project.
private static opennlp.tools.parser.ParserME parser
          Full parser from the OpenNLP project.
private static opennlp.tools.lang.english.SentenceDetector sentenceDetector
          Sentence detector from the OpenNLP project.
private static opennlp.tools.lang.english.PosTagger tagger
          Part of speech tagger from the OpenNLP project.
private static opennlp.tools.lang.english.Tokenizer tokenizer
          Tokenizer from the OpenNLP project.
private static java.util.HashSet<java.lang.String> unJoinablePrepositions
           
 
Constructor Summary
OpenNLP()
           
 
Method Summary
static boolean createChunker(java.lang.String model)
          Creates the chunker from a model file.
static boolean createLinker(java.lang.String dir)
          Creates the linker from a directory containing models.
static boolean createParser(java.lang.String dir)
          Creates the parser from a directory containing models.
static boolean createPosTagger(java.lang.String model, java.lang.String tagdict)
          Creates the part-of-speech tagger from a model file and a case-sensitive tag dictionary.
static boolean createSentenceDetector(java.lang.String model)
          Creates the sentence detector from a model file.
static boolean createTokenizer(java.lang.String model)
          Creates the tokenizer from a model file.
static java.lang.String[] joinNounPhrases(java.lang.String[] tokens, java.lang.String[] chunkTags)
           
static void link(opennlp.tools.parser.Parse[] parses)
          Identifies coreferences in an array of full parses of sentences.
static opennlp.tools.parser.Parse parse(java.lang.String sentence)
          Performs a full parse of a sentence of space-delimited tokens.
static java.lang.String[] sentDetect(java.lang.String text)
          Splits a text into sentences.
static java.lang.String[] tagChunks(java.lang.String[] tokens, java.lang.String[] pos)
          Assigns chunk tags to an array of tokens and POS tags.
static java.lang.String tagPos(java.lang.String sentence)
          Assigns POS tags to a sentence of space-delimited tokens.
static java.lang.String[] tagPos(java.lang.String[] sentence)
          Assigns POS tags to an array of tokens that form a sentence.
static java.lang.String[] tokenize(java.lang.String text)
          A model-based tokenizer used to prepare a sentence for POS tagging.
static java.lang.String tokenizeWithSpaces(java.lang.String text)
          Applies the model-based tokenizer and concatenates the tokens with spaces.
static java.lang.String untokenize(java.lang.String text)
          Untokenizes a text by removing abundant blanks.
static java.lang.String untokenize(java.lang.String text, java.lang.String original)
          Untokenizes a text by mapping it to a string that contains the original text as a subsequence.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ABUNDANT_BLANKS

private static final java.util.regex.Pattern ABUNDANT_BLANKS
Pattern for abundant blanks. More specific rules come first. T.b.c.


sentenceDetector

private static opennlp.tools.lang.english.SentenceDetector sentenceDetector
Sentence detector from the OpenNLP project.


tokenizer

private static opennlp.tools.lang.english.Tokenizer tokenizer
Tokenizer from the OpenNLP project.


tagger

private static opennlp.tools.lang.english.PosTagger tagger
Part of speech tagger from the OpenNLP project.


chunker

private static opennlp.tools.lang.english.TreebankChunker chunker
Chunker from the OpenNLP project.


parser

private static opennlp.tools.parser.ParserME parser
Full parser from the OpenNLP project.


linker

private static opennlp.tools.lang.english.TreebankLinker linker
Linker from the OpenNLP project.


unJoinablePrepositions

private static java.util.HashSet<java.lang.String> unJoinablePrepositions

Constructor Detail

OpenNLP

public OpenNLP()

Method Detail

createSentenceDetector

public static boolean createSentenceDetector(java.lang.String model)
Creates the sentence detector from a model file.

Parameters:
model - model file
Returns:
true, iff the sentence detector was created successfully

createTokenizer

public static boolean createTokenizer(java.lang.String model)
Creates the tokenizer from a model file.

Parameters:
model - model file
Returns:
true, iff the tokenizer was created successfully

createPosTagger

public static boolean createPosTagger(java.lang.String model,
                                      java.lang.String tagdict)
Creates the part-of-speech tagger from a model file and a case-sensitive tag dictionary.

Parameters:
model - model file
tagdict - case-sensitive tag dictionary
Returns:
true, iff the POS tagger was created successfully

createChunker

public static boolean createChunker(java.lang.String model)
Creates the chunker from a model file.

Parameters:
model - model file
Returns:
true, iff the chunker was created successfully

createParser

public static boolean createParser(java.lang.String dir)
Creates the parser from a directory containing models.

Parameters:
dir - model directory
Returns:
true, iff the parser was created successfully

createLinker

public static boolean createLinker(java.lang.String dir)
Creates the linker from a directory containing models.

Parameters:
dir - model directory
Returns:
true, iff the linker was created successfully

sentDetect

public static java.lang.String[] sentDetect(java.lang.String text)
Splits a text into sentences.

Parameters:
text - sequence of sentences
Returns:
array of sentences in the text or null, if the sentence detector is not initialized

tokenize

public static java.lang.String[] tokenize(java.lang.String text)
A model-based tokenizer used to prepare a sentence for POS tagging.

Parameters:
text - text to tokenize
Returns:
array of tokens or null, if the tokenizer is not initialized
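
Several methods of this class signal an uninitialized tool by returning null rather than by throwing. The following is a minimal caller-side sketch of one way to turn that null into an explicit error. The class NlpResults and the method requireInitialized are hypothetical names introduced here for illustration; they are not part of info.ephyra.nlp.OpenNLP.

```java
// Hypothetical caller-side helper; NOT part of info.ephyra.nlp.OpenNLP.
final class NlpResults {
    private NlpResults() {}

    /**
     * Returns the result unchanged, or throws if it is null, which the
     * documented methods use to report that a tool is not initialized.
     */
    static <T> T requireInitialized(T result, String toolName) {
        if (result == null) {
            throw new IllegalStateException(
                toolName + " returned null; has the matching create method succeeded?");
        }
        return result;
    }
}
```

A call such as NlpResults.requireInitialized(OpenNLP.tokenize(text), "tokenizer") would then fail fast instead of passing null further down the pipeline. Note that parse() also returns null for an empty sentence, so the error message may be misleading in that case.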

tokenizeWithSpaces

public static java.lang.String tokenizeWithSpaces(java.lang.String text)
Applies the model-based tokenizer and concatenates the tokens with spaces.

Parameters:
text - text to tokenize
Returns:
string of space-delimited tokens or null, if the tokenizer is not initialized
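
The relationship between tokenize() and tokenizeWithSpaces() can be sketched as follows, assuming the second method simply joins the token array with single spaces and passes a null result through. TokenJoiner and joinWithSpaces are hypothetical names; the actual implementation is not shown in this documentation.

```java
// Hypothetical stand-in illustrating the documented contract only.
final class TokenJoiner {
    private TokenJoiner() {}

    /** Joins tokens with single spaces; null (tool not initialized) stays null. */
    static String joinWithSpaces(String[] tokens) {
        if (tokens == null) {
            return null;
        }
        return String.join(" ", tokens);
    }
}
```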

untokenize

public static java.lang.String untokenize(java.lang.String text)

Untokenizes a text by removing abundant blanks.

Note that it is not guaranteed that this method exactly reverts the effect of tokenize().

Parameters:
text - text to untokenize
Returns:
text without abundant blanks
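
As a rough illustration of pattern-based untokenization, the sketch below removes blanks before common punctuation marks and closing parentheses, and after opening parentheses. The pattern is an assumption made for this example; the actual value of ABUNDANT_BLANKS is not given in this documentation and probably covers more cases.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the rules below are NOT the real ABUNDANT_BLANKS pattern.
final class BlankRemover {
    private BlankRemover() {}

    // A blank followed by punctuation or ')', or a blank preceded by '('.
    private static final Pattern BLANKS =
        Pattern.compile(" (?=[,.;:!?)])|(?<=\\() ");

    /** Removes the blanks matched by the illustrative pattern above. */
    static String untokenize(String text) {
        return BLANKS.matcher(text).replaceAll("");
    }
}
```

For the input "Hello , world ! ( see note )" this sketch yields "Hello, world! (see note)". As with the documented method, the original text is not guaranteed to be restored exactly.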

untokenize

public static java.lang.String untokenize(java.lang.String text,
                                          java.lang.String original)

Untokenizes a text by mapping it to a string that contains the original text as a subsequence.

Note that it is not guaranteed that this method exactly reverts the effect of tokenize().

Parameters:
text - text to untokenize
original - string that contains the original text as a subsequence
Returns:
subsequence of the original string, or the input text if there is no such subsequence
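
One possible reading of this mapping is sketched below: the tokens of the text are searched for in the original string, allowing arbitrary whitespace (including none) between consecutive tokens, and the matching region of the original is returned. This treats "subsequence" as a contiguous region, which is an assumption; OriginalMapper is a hypothetical name and the real algorithm may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of mapping tokenized text back onto an original string.
final class OriginalMapper {
    private OriginalMapper() {}

    static String untokenize(String text, String original) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return text;
        }
        // Build a regex of quoted tokens separated by optional whitespace.
        StringBuilder regex = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(token));
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(original);
        // Fall back to the input text when the original has no matching region.
        return m.find() ? m.group() : text;
    }
}
```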

tagPos

public static java.lang.String tagPos(java.lang.String sentence)
Assigns POS tags to a sentence of space-delimited tokens.

Parameters:
sentence - sentence to be annotated with POS tags
Returns:
tagged sentence or null, if the tagger is not initialized
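
The format of the tagged sentence is not specified here. Assuming it follows the token/TAG convention with pairs separated by blanks, as produced by the underlying OpenNLP tagger, the result could be split back into tokens and tags roughly as follows. TaggedSentence is a hypothetical helper introduced for this sketch.

```java
// Hypothetical helper; assumes a "token/TAG token/TAG ..." output format.
final class TaggedSentence {
    private TaggedSentence() {}

    /** Returns { tokens, tags }; a pair without '/' gets an empty tag. */
    static String[][] split(String tagged) {
        String[] pairs = tagged.trim().split("\\s+");
        String[] tokens = new String[pairs.length];
        String[] tags = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            // Use the last slash so that tokens containing '/' stay intact.
            int slash = pairs[i].lastIndexOf('/');
            if (slash < 0) {
                tokens[i] = pairs[i];
                tags[i] = "";
            } else {
                tokens[i] = pairs[i].substring(0, slash);
                tags[i] = pairs[i].substring(slash + 1);
            }
        }
        return new String[][] {tokens, tags};
    }
}
```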

tagPos

public static java.lang.String[] tagPos(java.lang.String[] sentence)
Assigns POS tags to an array of tokens that form a sentence.

Parameters:
sentence - array of tokens to be annotated with POS tags
Returns:
array of POS tags or null, if the tagger is not initialized

tagChunks

public static java.lang.String[] tagChunks(java.lang.String[] tokens,
                                           java.lang.String[] pos)
Assigns chunk tags to an array of tokens and POS tags.

Parameters:
tokens - array of tokens
pos - array of corresponding POS tags
Returns:
array of chunk tags or null, if the chunker is not initialized
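
The chunk tags are parallel to the token array. Assuming the chunker uses the usual B-/I-/O scheme (for example B-NP for the first token of a noun phrase and I-NP for its continuation), noun phrases could be collected from the two arrays as sketched below. ChunkGrouper is a hypothetical helper and does not describe how joinNounPhrases() is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes B-NP / I-NP chunk tags.
final class ChunkGrouper {
    private ChunkGrouper() {}

    /** Collects each run of B-NP followed by I-NP tokens as one phrase. */
    static List<String> nounPhrases(String[] tokens, String[] chunkTags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.equals("B-NP")) {
                // A new noun phrase starts; close the previous one, if any.
                if (current != null) {
                    phrases.add(current.toString());
                }
                current = new StringBuilder(tokens[i]);
            } else if (tag.equals("I-NP") && current != null) {
                current.append(' ').append(tokens[i]);
            } else {
                // Any other tag ends the current noun phrase.
                if (current != null) {
                    phrases.add(current.toString());
                }
                current = null;
            }
        }
        if (current != null) {
            phrases.add(current.toString());
        }
        return phrases;
    }
}
```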

parse

public static opennlp.tools.parser.Parse parse(java.lang.String sentence)
Performs a full parse of a sentence of space-delimited tokens.

Parameters:
sentence - the sentence
Returns:
parse of the sentence or null, if the parser is not initialized or the sentence is empty

link

public static void link(opennlp.tools.parser.Parse[] parses)
Identifies coreferences in an array of full parses of sentences.

Parameters:
parses - array of full parses of sentences

joinNounPhrases

public static java.lang.String[] joinNounPhrases(java.lang.String[] tokens,
                                                 java.lang.String[] chunkTags)