info.ephyra.util
Class HashDictionary

java.lang.Object
  extended by info.ephyra.util.HashDictionary
All Implemented Interfaces:
Dictionary

public class HashDictionary
extends java.lang.Object
implements Dictionary

A Dictionary that is based on a hash set and allows lookups in constant time.

All words are converted to lower case, tokenized and stemmed. E.g. there is no distinction between "Internet" and "internets".

This class implements the interface Dictionary.

Version:
2007-02-06
Author:
Nico Schlaefer

Field Summary
private  int maxTokens
          Maximum number of tokens of a word in the dictionary.
private  java.util.HashSet<java.lang.String> tokens
          HashSet used to store the tokens of words.
private  java.util.HashSet<java.lang.String> words
          HashSet used to store the words.
 
Constructor Summary
HashDictionary()
          Creates an empty HashDictionary.
HashDictionary(java.lang.String fileName)
          Creates a HashDictionary from a list of words in a file.
 
Method Summary
 void add(java.lang.String word)
          Adds a word to the dictionary.
 boolean contains(java.lang.String word)
          Looks up a word.
 boolean containsToken(java.lang.String token)
          Looks up a word token.
 boolean fuzzyContains(java.lang.String word, int maxDistance)
          Does a fuzzy lookup for a word.
 boolean fuzzyContainsToken(java.lang.String token, int maxDistance)
          Does a fuzzy lookup for a token.
static int getCost(char char1, char char2, int substCost, boolean caseSensitive)
          compute edit cost for two chars
 java.util.Iterator<java.lang.String> getIterator()
          Returns an iterator over the dictionary entries.
static int getLevenshteinDistance(java.lang.String string1, java.lang.String string2, int threshold, boolean caseSensitive, int insertCost, int deleteCost)
          compute the Levenshtein distance of two Strings
 int getMaxTokens()
          Returns the maximum number of tokens of a word in the dictionary.
static int min3(int x, int y, int z)
          compute the minimum of three int variables (helper for Levenshtein)
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

words

private java.util.HashSet<java.lang.String> words
HashSet used to store the words.


tokens

private java.util.HashSet<java.lang.String> tokens
HashSet used to store the tokens of words.


maxTokens

private int maxTokens
Maximum number of tokens of a word in the dictionary.

Constructor Detail

HashDictionary

public HashDictionary()
Creates an empty HashDictionary.


HashDictionary

public HashDictionary(java.lang.String fileName)
               throws java.io.IOException
Creates a HashDictionary from a list of words in a file.

Parameters:
fileName - file containing a list of words
Throws:
java.io.IOException - if the list could not be read from the file
Method Detail

add

public void add(java.lang.String word)
Adds a word to the dictionary.

Parameters:
word - the word to add

contains

public boolean contains(java.lang.String word)
Looks up a word.

Specified by:
contains in interface Dictionary
Parameters:
word - the word to look up
Returns:
true iff the word was found

containsToken

public boolean containsToken(java.lang.String token)
Looks up a word token.

Parameters:
token - the word token to look up
Returns:
true iff a word in the dictionary contains the token

fuzzyContains

public boolean fuzzyContains(java.lang.String word,
                             int maxDistance)
Does a fuzzy lookup for a word. The specified word w is considered as contained in the dictionary is there is a word W in the dictionary such that LevenshteinDistance(w, W) <= maxDistance

Parameters:
word - the word to look up
maxDistance - the maximum Levenshtein edit distance for fuzzy comparison
Returns:
true iff the word was found

fuzzyContainsToken

public boolean fuzzyContainsToken(java.lang.String token,
                                  int maxDistance)
Does a fuzzy lookup for a token. The specified token t is considered as contained in the dictionary is there is a token T in the dictionary such that LevenshteinDistance(t, T) <= maxDistance

Parameters:
token - the token to look up
maxDistance - the maximum Levenshtein edit distance for fuzzy comparison
Returns:
true iff a word in the dictionary contains the token

getLevenshteinDistance

public static int getLevenshteinDistance(java.lang.String string1,
                                         java.lang.String string2,
                                         int threshold,
                                         boolean caseSensitive,
                                         int insertCost,
                                         int deleteCost)
compute the Levenshtein distance of two Strings

Parameters:
string1 - the first String
string2 - the second String
threshold - the maximum distance (computation will stop if specified value reached)
caseSensitive - use case sensitive or case insensitive comparison
insertCost - the cost for inserting a Character
deleteCost - the cost for deleting a Character
Returns:
the Levenshtein distance of the specified Strings, maximum the specified threshold plus one, soon as the minimum possible distance exceeds the threshold Note: a threshold of 0 will compute the entire editing distance, regardless of its value

getCost

public static int getCost(char char1,
                          char char2,
                          int substCost,
                          boolean caseSensitive)
compute edit cost for two chars

Parameters:
char1 - the first char
char2 - the second char
substCost - the cost for the substitution of one char with another one
caseSensitive - use case sensitive or case insensitive comparison for the Token's values
Returns:
the edit cost for the two Tokens

min3

public static int min3(int x,
                       int y,
                       int z)
compute the minimum of three int variables (helper for Levenshtein)

Parameters:
x -
y -
z -
Returns:
the minimum of x, y and z

getIterator

public java.util.Iterator<java.lang.String> getIterator()
Returns an iterator over the dictionary entries.

Returns:
iterator

getMaxTokens

public int getMaxTokens()
Returns the maximum number of tokens of a word in the dictionary.

Returns:
maximum number of tokens