info.ephyra.nlp
Class SentenceExtractor

java.lang.Object
  extended by info.ephyra.nlp.SentenceExtractor

public class SentenceExtractor
extends java.lang.Object

Extracts sentences and text fragments from an HTML document.

Version:
2005-09-12
Author:
Nico Schlaefer

Field Summary
private static java.lang.String NON_STRUC_TAGS
          Regular expression that describes non-structuring tags, i.e. tags that appear within a sentence and that are not sentence delimiters.
 
Constructor Summary
SentenceExtractor()
           
 
Method Summary
static java.lang.String[] getSentencesFromHtml(java.lang.String html)
          Extracts sentences from an HTML document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NON_STRUC_TAGS

private static final java.lang.String NON_STRUC_TAGS
Regular expression that describes non-structuring tags, i.e. tags that appear within a sentence and that are not sentence delimiters. All other tags are assumed to be sentence delimiters.

See Also:
Constant Field Values

Constructor Detail

SentenceExtractor

public SentenceExtractor()

Method Detail

getSentencesFromHtml

public static java.lang.String[] getSentencesFromHtml(java.lang.String html)
Extracts sentences from an HTML document.

Parameters:
html - the HTML document
Returns:
sentences extracted from the document
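
Example:
The following is a minimal, self-contained sketch of the approach described for NON_STRUC_TAGS: tags that occur within a sentence are removed, and every other tag is treated as a sentence delimiter. It is not the actual Ephyra implementation. The class name SentenceExtractorSketch, the tag list in the regular expression, the ANY_TAG helper and the punctuation-based splitting are all assumptions made for illustration, since the real value of NON_STRUC_TAGS is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for info.ephyra.nlp.SentenceExtractor (illustration only).
class SentenceExtractorSketch {
    // Assumed pattern for inline (non-structuring) tags; the real constant may differ.
    private static final String NON_STRUC_TAGS =
        "(?i)</?(a|b|i|u|em|strong|span|font|small|big|sub|sup)(\\s[^>]*)?>";

    // Assumed helper: any tag that is still present acts as a sentence delimiter.
    private static final String ANY_TAG = "<[^>]*>";

    static String[] getSentencesFromHtml(String html) {
        // Step 1: remove inline tags so the surrounding text stays in one sentence.
        String text = html.replaceAll(NON_STRUC_TAGS, "");

        List<String> sentences = new ArrayList<>();
        // Step 2: split the text at every remaining (structuring) tag.
        for (String fragment : text.split(ANY_TAG)) {
            // Step 3: split each fragment after '.', '!' or '?' followed by whitespace.
            for (String s : fragment.split("(?<=[.!?])\\s+")) {
                String trimmed = s.replaceAll("\\s+", " ").trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>First <b>bold</b> sentence. Second one.</p>";
        // Prints three lines: "Title", "First bold sentence." and "Second one."
        for (String s : getSentencesFromHtml(html)) {
            System.out.println(s);
        }
    }
}
```

In this sketch, <b> is removed because it matches the assumed inline-tag pattern, while <h1> and <p> split the text. Character entities such as &amp; are not decoded, and abbreviations ending in a period would be split incorrectly.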