info.ephyra.nlp.indices
Class WordFrequencies

java.lang.Object
  extended by info.ephyra.nlp.indices.WordFrequencies

public class WordFrequencies
extends java.lang.Object

Counts the frequencies of words in an arbitrary text corpus and represents them in a dictionary.

Internally, the index is stored in a hash table, so looking up a word takes constant time on average.
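
The counting approach described above can be sketched roughly as follows. This is a minimal illustration, not the Ephyra implementation: the class name WordCountSketch, the method count, and the whitespace tokenization are assumptions made for this example, since the tokenizer used by WordFrequencies is not documented here.

```java
import java.util.Hashtable;
import java.util.Locale;

// Hypothetical sketch; names and tokenization are assumptions, not Ephyra code.
class WordCountSketch {
    // Counts whitespace-separated tokens, optionally lower-casing them first.
    static Hashtable<String, Integer> count(String text, boolean lowerCase) {
        Hashtable<String, Integer> index = new Hashtable<String, Integer>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            String word = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            Integer old = index.get(word); // hash lookup, constant time on average
            index.put(word, old == null ? 1 : old + 1);
        }
        return index;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> index = count("The cat saw the other cat", true);
        System.out.println("the=" + index.get("the") + ", distinct=" + index.size());
    }
}
```

The lowerCase parameter mirrors the role the LOWER_CASE constant appears to play: with it enabled, "The" and "the" are counted as the same word.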

Version:
2008-01-23
Author:
Nico Schlaefer

Field Summary
private static int distinct
          Number of distinct words in the index.
private static java.util.Hashtable<java.lang.String,java.lang.Integer> index
          Hashtable used to store (word, frequency) pairs.
private static boolean LOWER_CASE
          Whether words are converted to lower case.
private static int MAX_WORDS
          Maximum number of words to be parsed (0 = no limit).
private static int MIN_FREQUENCY
          Minimum frequency of a word to remain in the index.
private static boolean SORT_BY_FREQUENCY
          Whether words are saved in the order of their frequencies.
private static int total
          Total number of words that have been parsed.
 
Constructor Summary
WordFrequencies()
           
 
Method Summary
static boolean createIndexFromDir(java.lang.String dirname)
          Creates an index of word frequencies from a folder containing text files.
static boolean createIndexFromFile(java.lang.String filename)
          Creates an index of word frequencies from an arbitrary text file.
static void dropRareWords()
          Drops rare words from the index.
static int getDistinct()
          Returns the number of distinct words in the index.
static java.lang.String[] getSortedWords()
          Sorts the words in the index by their frequencies in descending order.
static int getTotal()
          Returns the total number of words that have been parsed.
static boolean loadIndex(java.lang.String filename)
          Loads an index of word frequencies from an input file.
static int lookup(java.lang.String word)
          Looks up a word in the index and returns its frequency.
static double lookupRel(java.lang.String word)
          Looks up a word in the index and returns its relative frequency.
static void main(java.lang.String[] args)
          Entry point.
static boolean saveIndex(java.lang.String filename)
          Saves the index of word frequencies to an output file.
static boolean updateIndexFromDir(java.lang.String dir)
          Updates the index by adding the words contained in the files in the given folder.
static boolean updateIndexFromFile(java.lang.String filename)
          Updates the index with the words in an arbitrary text file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_WORDS

private static final int MAX_WORDS
Maximum number of words to be parsed (0 = no limit).

See Also:
Constant Field Values

LOWER_CASE

private static final boolean LOWER_CASE
Whether words are converted to lower case.

See Also:
Constant Field Values

MIN_FREQUENCY

private static final int MIN_FREQUENCY
Minimum frequency of a word to remain in the index.

See Also:
Constant Field Values

SORT_BY_FREQUENCY

private static final boolean SORT_BY_FREQUENCY
Whether words are saved in the order of their frequencies.

See Also:
Constant Field Values

total

private static int total
Total number of words that have been parsed.


distinct

private static int distinct
Number of distinct words in the index.


index

private static java.util.Hashtable<java.lang.String,java.lang.Integer> index
Hashtable used to store (word, frequency) pairs.

Constructor Detail

WordFrequencies

public WordFrequencies()
Method Detail

createIndexFromFile

public static boolean createIndexFromFile(java.lang.String filename)
Creates an index of word frequencies from an arbitrary text file.

Parameters:
filename - name of the text file to parse
Returns:
true, iff the index was created successfully

updateIndexFromFile

public static boolean updateIndexFromFile(java.lang.String filename)
Updates the index with the words in an arbitrary text file.

Parameters:
filename - name of the text file to parse
Returns:
true, iff the index was updated successfully

createIndexFromDir

public static boolean createIndexFromDir(java.lang.String dirname)
Creates an index of word frequencies from a folder containing text files.

Parameters:
dirname - name of the folder to parse
Returns:
true, iff the index was created successfully

updateIndexFromDir

public static boolean updateIndexFromDir(java.lang.String dir)
Updates the index by adding the words contained in the files in the given folder.

Parameters:
dir - name of the folder to parse
Returns:
true, iff the index was updated successfully

dropRareWords

public static void dropRareWords()
Drops rare words from the index.
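
One plausible reading of "rare" is a frequency below the MIN_FREQUENCY constant. The sketch below illustrates that reading only; the class DropRareSketch and its method dropRare are hypothetical, and whether the real method also adjusts the total and distinct counters is not stated in this documentation.

```java
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch; assumes "rare" means frequency < minFrequency.
class DropRareSketch {
    // Removes every entry whose frequency is below the given threshold.
    static void dropRare(Hashtable<String, Integer> index, int minFrequency) {
        Iterator<Map.Entry<String, Integer>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < minFrequency) {
                it.remove(); // removing through the iterator is safe during traversal
            }
        }
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> index = new Hashtable<String, Integer>();
        index.put("the", 5);
        index.put("zygote", 1);
        dropRare(index, 2);
        System.out.println(index.keySet());
    }
}
```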


getSortedWords

public static java.lang.String[] getSortedWords()
Sorts the words in the index by their frequencies in descending order.

Returns:
words sorted by their frequencies
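
Sorting by descending frequency can be sketched as below. The class SortByFrequencySketch and its method sortedWords are hypothetical names. The documentation does not say how ties are ordered, so the alphabetical tie-break used here is purely an assumption of this example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Hashtable;

// Hypothetical sketch; tie-breaking order is an assumption.
class SortByFrequencySketch {
    // Returns the words of the index, most frequent first.
    static String[] sortedWords(final Hashtable<String, Integer> index) {
        String[] words = index.keySet().toArray(new String[0]);
        Arrays.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                // Compare b to a so that higher frequencies come first.
                int byFrequency = index.get(b).compareTo(index.get(a));
                return byFrequency != 0 ? byFrequency : a.compareTo(b);
            }
        });
        return words;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> index = new Hashtable<String, Integer>();
        index.put("cat", 2);
        index.put("the", 5);
        index.put("saw", 1);
        System.out.println(Arrays.toString(sortedWords(index)));
    }
}
```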

saveIndex

public static boolean saveIndex(java.lang.String filename)
Saves the index of word frequencies to an output file.

Parameters:
filename - name of the output file to write to
Returns:
true, iff the index was saved successfully

loadIndex

public static boolean loadIndex(java.lang.String filename)
Loads an index of word frequencies from an input file.

Parameters:
filename - name of the input file containing the index
Returns:
true, iff the index was loaded successfully

getTotal

public static int getTotal()
Returns the total number of words that have been parsed.

Returns:
total number of words

getDistinct

public static int getDistinct()
Returns the number of distinct words in the index.

Returns:
total number of distinct words

lookup

public static int lookup(java.lang.String word)
Looks up a word in the index and returns its frequency. If the word is not in the index, the frequency is 0.

Parameters:
word - word to look up
Returns:
frequency of the word

lookupRel

public static double lookupRel(java.lang.String word)
Looks up a word in the index and returns its relative frequency. If the word is not in the index, the relative frequency is 0.

Parameters:
word - word to look up
Returns:
relative frequency of the word
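
The documentation does not define the denominator of the relative frequency. A natural guess is the total number of parsed words, as returned by getTotal(). The sketch below assumes exactly that; RelativeFrequencySketch and relativeFrequency are hypothetical names, and the index and total are passed in explicitly rather than read from static fields.

```java
import java.util.Hashtable;

// Hypothetical sketch; assumes relative frequency = frequency / total parsed words.
class RelativeFrequencySketch {
    // Returns the share of all parsed words accounted for by the given word.
    static double relativeFrequency(Hashtable<String, Integer> index, int total, String word) {
        Integer frequency = index.get(word);
        if (frequency == null || total == 0) {
            return 0.0; // unknown word or empty corpus
        }
        return frequency.doubleValue() / total;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> index = new Hashtable<String, Integer>();
        index.put("the", 2);
        index.put("cat", 1);
        System.out.println(relativeFrequency(index, 4, "the"));
    }
}
```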

main

public static void main(java.lang.String[] args)
Entry point. Creates the index from the text files in a given folder, drops rare words and saves the index.

Parameters:
args - argument 1: folder containing text files; argument 2: output file