info.ephyra.util
Class HTMLConverter

java.lang.Object
  extended by info.ephyra.util.HTMLConverter

public class HTMLConverter
extends java.lang.Object

The HTMLConverter can be used to convert an HTML document to plain text.

Version:
2007-06-19
Author:
Nico Schlaefer

Field Summary
private static int TIMEOUT
          Timeout for HTTP connections in milliseconds.
 
Constructor Summary
HTMLConverter()
           
 
Method Summary
static java.lang.String file2text(java.lang.String filename)
          Reads an HTML document from a file and converts it into plain text.
static java.lang.String html2text(java.lang.String html)
          Converts an HTML document into plain text.
static java.lang.String htmlsnippet2text(java.lang.String snippet)
          Converts a snippet with HTML tags and special characters into plain text.
static boolean isUrl(java.lang.String s)
          Checks if the given string is a URL.
static java.lang.String replaceSpecialCharacters(java.lang.String html)
          Handles special characters in HTML documents by replacing sequences of the form &...; with the corresponding characters.
static java.lang.String url2text(java.lang.String url)
          Fetches an HTML document from a URL and converts it into plain text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TIMEOUT

private static final int TIMEOUT
Timeout for HTTP connections in milliseconds.

See Also:
Constant Field Values
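
This page does not state the value of TIMEOUT or how it is applied. As a rough illustration of how a millisecond timeout is commonly set on an HTTP connection in Java, the sketch below uses the standard java.net.URLConnection setters. The class name TimeoutSketch, the method openWithTimeout and the value 10000 are assumptions made for this example, not taken from Ephyra.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

class TimeoutSketch {
    // Hypothetical value in milliseconds; the real constant is not shown on this page.
    static final int TIMEOUT_MS = 10_000;

    // Prepares a connection with connect and read timeouts; no network traffic happens yet.
    static URLConnection openWithTimeout(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(TIMEOUT_MS); // limit for establishing the connection
        conn.setReadTimeout(TIMEOUT_MS);    // limit for each blocking read
        return conn;
    }
}
```

With timeouts set this way, a later connect or read that exceeds the limit raises java.net.SocketTimeoutException.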

Constructor Detail

HTMLConverter

public HTMLConverter()

Method Detail

isUrl

public static boolean isUrl(java.lang.String s)
Checks if the given string is a URL.

Parameters:
s - a string
Returns:
true iff the string is a URL
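
The documentation does not say which rule isUrl applies. One plausible check, sketched here under the assumption that a URL means an absolute address with both a scheme and a host, relies on java.net.URI. The names UrlCheckSketch and looksLikeUrl are hypothetical, and the real method may accept or reject different strings.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlCheckSketch {
    // Hypothetical stand-in for isUrl: true only for syntactically valid
    // URIs that carry both a scheme (e.g. "http") and a host.
    static boolean looksLikeUrl(String s) {
        if (s == null) {
            return false;
        }
        try {
            URI uri = new URI(s.trim());
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // e.g. the string contains spaces
        }
    }
}
```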

replaceSpecialCharacters

public static java.lang.String replaceSpecialCharacters(java.lang.String html)
Handles special characters in HTML documents by replacing sequences of the form &...; with the corresponding characters.

Parameters:
html - HTML document
Returns:
transformed HTML document
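
To make the idea of replacing &...; sequences concrete, here is a minimal sketch that decodes numeric character references and a handful of named ones in a single pass. The class EntitySketch, its small lookup table and the decision to leave unknown names untouched are assumptions for illustration. The set of sequences that replaceSpecialCharacters actually handles is not listed on this page.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntitySketch {
    // Deliberately tiny table; a real converter would cover many more names.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    // Matches decimal (&#65;), hexadecimal (&#x41;) and named (&amp;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String replaceEntities(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character(s), or the original sequence if it cannot be decoded.
    private static String decode(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return original; // number out of range or not a valid code point
        }
        return NAMED.getOrDefault(body, original);
    }
}
```

Because the input is scanned once, a doubly escaped sequence such as &amp;lt; is decoded only one level, to &lt;.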

htmlsnippet2text

public static java.lang.String htmlsnippet2text(java.lang.String snippet)
Converts a snippet with HTML tags and special characters into plain text.

Parameters:
snippet - HTML snippet
Returns:
plain text

html2text

public static java.lang.String html2text(java.lang.String html)
Converts an HTML document into plain text.

Parameters:
html - HTML document
Returns:
plain text, or null if the conversion failed
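
How html2text performs the conversion is not described here. The following is a deliberately crude sketch of the general technique: it drops script and style blocks, comments and tags with regular expressions, then collapses whitespace. TagStripSketch and stripTags are hypothetical names. The real method may well use an HTML parser and will differ in its handling of malformed markup.

```java
class TagStripSketch {
    // Rough approximation of HTML-to-text conversion; not robust against malformed markup.
    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        String text = html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ") // drop script/style bodies
                .replaceAll("(?s)<!--.*?-->", " ")                       // drop comments
                .replaceAll("(?s)<[^>]*>", " ");                         // drop remaining tags
        return text.replaceAll("\\s+", " ").trim();                      // normalize whitespace
    }
}
```

This sketch leaves sequences such as &amp; in place, so a full conversion would still need a separate character replacement step.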

file2text

public static java.lang.String file2text(java.lang.String filename)
Reads an HTML document from a file and converts it into plain text.

Parameters:
filename - name of the file containing the HTML document
Returns:
plain text, or null if the reading or conversion failed

url2text

public static java.lang.String url2text(java.lang.String url)
                                 throws java.net.SocketTimeoutException
Fetches an HTML document from a URL and converts it into plain text.

Parameters:
url - URL of HTML document
Returns:
plain text, or null if the fetching or conversion failed
Throws:
java.net.SocketTimeoutException - if the connection to the URL times out
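
A caller of url2text has two failure signals to handle: a null return value and a SocketTimeoutException. The sketch below shows one way to fold both into a fallback value. To stay self-contained it does not call Ephyra directly; the interface TextFetcher and the helper fetchOrFallback are hypothetical and only mirror the signature documented above.

```java
import java.net.SocketTimeoutException;

class Url2TextCallerSketch {
    // Hypothetical interface with the same shape as url2text(String).
    interface TextFetcher {
        String fetch(String url) throws SocketTimeoutException;
    }

    // Returns the fetched text, or the fallback on a null result or a timeout.
    static String fetchOrFallback(TextFetcher fetcher, String url, String fallback) {
        try {
            String text = fetcher.fetch(url);
            return text != null ? text : fallback;
        } catch (SocketTimeoutException e) {
            return fallback;
        }
    }
}
```

With Ephyra on the classpath, the method reference HTMLConverter::url2text should fit the TextFetcher parameter, since its documented signature has the same shape.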