java.lang.Object
  info.ephyra.util.HashDictionary
public class HashDictionary
A Dictionary that is based on a hash set and allows lookups
in constant time.
All words are converted to lower case, tokenized, and stemmed, so there is no distinction between, for example, "Internet" and "internets".
This class implements the interface Dictionary.
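The normalization described above can be illustrated with a minimal sketch. It is not the Ephyra implementation: the class name `NormalizingDictionarySketch` is hypothetical, tokenization is left out, and stripping a trailing "s" is only a crude stand-in for whatever stemmer the real class uses.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Simplified stand-in for a hash-based dictionary (hypothetical, not the Ephyra class). */
class NormalizingDictionarySketch {
    private final Set<String> words = new HashSet<>();

    /** Lower-cases the word and strips a trailing "s" as a crude substitute for stemming. */
    static String normalize(String word) {
        String w = word.trim().toLowerCase(Locale.ROOT);
        // Assumption: a real stemmer would be applied here instead of this rule.
        if (w.length() > 3 && w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Stores the normalized form of the word. */
    void add(String word) {
        words.add(normalize(word));
    }

    /** Looks up the normalized form; a hash lookup takes expected constant time. */
    boolean contains(String word) {
        return words.contains(normalize(word));
    }

    public static void main(String[] args) {
        NormalizingDictionarySketch dict = new NormalizingDictionarySketch();
        dict.add("Internet");
        System.out.println(dict.contains("internets")); // true
        System.out.println(dict.contains("intranet"));  // false
    }
}
```

Because both `add` and `contains` normalize their argument the same way, the stored form and the query form always match for variants of the same word.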
| Field Summary | |
|---|---|
| private int | maxTokens - Maximum number of tokens of a word in the dictionary. |
| private java.util.HashSet<java.lang.String> | tokens - HashSet used to store the tokens of words. |
| private java.util.HashSet<java.lang.String> | words - HashSet used to store the words. |
| Constructor Summary | |
|---|---|
| HashDictionary() | Creates an empty HashDictionary. |
| HashDictionary(java.lang.String fileName) | Creates a HashDictionary from a list of words in a file. |
| Method Summary | |
|---|---|
| void | add(java.lang.String word) - Adds a word to the dictionary. |
| boolean | contains(java.lang.String word) - Looks up a word. |
| boolean | containsToken(java.lang.String token) - Looks up a word token. |
| boolean | fuzzyContains(java.lang.String word, int maxDistance) - Does a fuzzy lookup for a word. |
| boolean | fuzzyContainsToken(java.lang.String token, int maxDistance) - Does a fuzzy lookup for a token. |
| static int | getCost(char char1, char char2, int substCost, boolean caseSensitive) - Computes the edit cost for two chars. |
| java.util.Iterator<java.lang.String> | getIterator() - Returns an iterator over the dictionary entries. |
| static int | getLevenshteinDistance(java.lang.String string1, java.lang.String string2, int threshold, boolean caseSensitive, int insertCost, int deleteCost) - Computes the Levenshtein distance of two Strings. |
| int | getMaxTokens() - Returns the maximum number of tokens of a word in the dictionary. |
| static int | min3(int x, int y, int z) - Computes the minimum of three int variables (helper for Levenshtein). |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
private java.util.HashSet<java.lang.String> words
HashSet used to store the words.
private java.util.HashSet<java.lang.String> tokens
HashSet used to store the tokens of words.
private int maxTokens
Maximum number of tokens of a word in the dictionary.
| Constructor Detail |
|---|
public HashDictionary()
Creates an empty HashDictionary.
public HashDictionary(java.lang.String fileName)
throws java.io.IOException
Creates a HashDictionary from a list of words in a file.
fileName - file containing a list of words
java.io.IOException - if the list could not be read from the file
| Method Detail |
|---|
public void add(java.lang.String word)
Adds a word to the dictionary.
word - the word to add
public boolean contains(java.lang.String word)
Looks up a word.
contains in interface Dictionary
word - the word to look up
true iff the word was found
public boolean containsToken(java.lang.String token)
Looks up a word token.
token - the word token to look up
true iff a word in the dictionary contains the token
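The relationship between the `words` set, the `tokens` set, and `maxTokens` can be sketched as follows. This is an illustration under stated assumptions, not the Ephyra code: the class name `TokenIndexSketch` is hypothetical, tokens are assumed to be separated by whitespace, and stemming is omitted.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a dictionary that indexes whole words and their tokens. */
class TokenIndexSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<String> tokens = new HashSet<>();
    private int maxTokens = 0;

    /** Lower-cases the input and splits it on whitespace (assumed tokenization rule). */
    private static String[] tokenize(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        return normalized.isEmpty() ? new String[0] : normalized.split("\\s+");
    }

    /** Adds the word, indexes each of its tokens, and updates the token maximum. */
    void add(String word) {
        String[] parts = tokenize(word);
        if (parts.length == 0) {
            return; // nothing to store for blank input
        }
        words.add(String.join(" ", parts));
        for (String part : parts) {
            tokens.add(part);
        }
        maxTokens = Math.max(maxTokens, parts.length);
    }

    /** True iff the whole (normalized) word was added before. */
    boolean contains(String word) {
        return words.contains(String.join(" ", tokenize(word)));
    }

    /** True iff some stored word contains the given single token. */
    boolean containsToken(String token) {
        return tokens.contains(token.trim().toLowerCase(Locale.ROOT));
    }

    int getMaxTokens() {
        return maxTokens;
    }
}
```

A caller matching phrases in running text could use `getMaxTokens()` to bound the length of the token windows it tries, and `containsToken` to discard windows early.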
public boolean fuzzyContains(java.lang.String word,
int maxDistance)
Does a fuzzy lookup for a word. A word w counts as found if the dictionary contains a word W such that LevenshteinDistance(w, W) <= maxDistance.
word - the word to look up
maxDistance - the maximum Levenshtein edit distance for fuzzy comparison
true iff the word was found
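A fuzzy lookup of this kind can be sketched as a scan over the stored entries. The class `FuzzyLookupSketch` and its method signatures are hypothetical, and unit edit costs are assumed; whether the original scans linearly is not stated in this page. Unlike an exact lookup, such a scan takes time proportional to the number of entries.

```java
import java.util.Set;

/** Hypothetical sketch of a fuzzy set lookup based on Levenshtein distance. */
class FuzzyLookupSketch {
    /** Plain Levenshtein distance with unit costs, computed row by row. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True iff some entry W satisfies levenshtein(word, W) <= maxDistance. */
    static boolean fuzzyContains(Set<String> words, String word, int maxDistance) {
        if (maxDistance >= 0 && words.contains(word)) {
            return true; // exact hit, no scan needed
        }
        for (String candidate : words) {
            // The length difference is a lower bound on the edit distance.
            if (Math.abs(candidate.length() - word.length()) > maxDistance) {
                continue;
            }
            if (levenshtein(word, candidate) <= maxDistance) {
                return true;
            }
        }
        return false;
    }
}
```

The same loop applies to token lookups if the set of tokens is passed in place of the set of words.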
public boolean fuzzyContainsToken(java.lang.String token,
int maxDistance)
Does a fuzzy lookup for a token. A token t counts as found if a word in the dictionary contains a token T such that LevenshteinDistance(t, T) <= maxDistance.
token - the token to look up
maxDistance - the maximum Levenshtein edit distance for fuzzy comparison
true iff a word in the dictionary contains the token
public static int getLevenshteinDistance(java.lang.String string1,
java.lang.String string2,
int threshold,
boolean caseSensitive,
int insertCost,
int deleteCost)
Computes the Levenshtein distance of two Strings.
string1 - the first String
string2 - the second String
threshold - the maximum distance (computation will stop if the specified value is reached)
caseSensitive - use case sensitive or case insensitive comparison
insertCost - the cost for inserting a character
deleteCost - the cost for deleting a character
public static int getCost(char char1,
char char2,
int substCost,
boolean caseSensitive)
Computes the edit cost for two chars.
char1 - the first char
char2 - the second char
substCost - the cost for the substitution of one char with another one
caseSensitive - use case sensitive or case insensitive comparison for the Token's values
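public static int min3(int x,
int y,
int z)
Computes the minimum of three int variables (helper for Levenshtein).
x - the first value
y - the second value
z - the third value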
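One way the three static helpers above could fit together is sketched below. The class `WeightedLevenshteinSketch` is hypothetical and several details are assumptions rather than documented behaviour: substitutions cost a fixed 1, all costs are non-negative, and the result is capped at `threshold` when the computation stops early.

```java
/** Hypothetical sketch of a weighted, thresholded Levenshtein distance. */
class WeightedLevenshteinSketch {
    /** Returns the smallest of three ints. */
    static int min3(int x, int y, int z) {
        return Math.min(x, Math.min(y, z));
    }

    /** Returns 0 if the chars match (optionally ignoring case), otherwise substCost. */
    static int getCost(char char1, char char2, int substCost, boolean caseSensitive) {
        if (!caseSensitive) {
            char1 = Character.toLowerCase(char1);
            char2 = Character.toLowerCase(char2);
        }
        return char1 == char2 ? 0 : substCost;
    }

    /** Cost of turning string1 into string2, capped at threshold (assumed semantics). */
    static int getLevenshteinDistance(String string1, String string2, int threshold,
                                      boolean caseSensitive, int insertCost, int deleteCost) {
        final int substCost = 1; // assumption: not a parameter of the documented method
        int n = string2.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 1; j <= n; j++) {
            prev[j] = prev[j - 1] + insertCost; // build string2 prefix from nothing
        }
        for (int i = 1; i <= string1.length(); i++) {
            curr[0] = prev[0] + deleteCost; // delete all of string1's prefix
            int rowMin = curr[0];
            for (int j = 1; j <= n; j++) {
                int subst = prev[j - 1]
                        + getCost(string1.charAt(i - 1), string2.charAt(j - 1), substCost, caseSensitive);
                curr[j] = min3(subst, prev[j] + deleteCost, curr[j - 1] + insertCost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // With non-negative costs the final value cannot drop below the row minimum.
            if (rowMin >= threshold) {
                return threshold;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return Math.min(prev[n], threshold);
    }
}
```

Separate insert and delete costs make the distance asymmetric: with `insertCost` 2 and `deleteCost` 1, turning "abc" into "abcd" costs 2, while turning "abcd" into "abc" costs 1.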
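public java.util.Iterator<java.lang.String> getIterator()
Returns an iterator over the dictionary entries.
public int getMaxTokens()
Returns the maximum number of tokens of a word in the dictionary.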