info.ephyra.questionanalysis
Class KeywordExtractor

java.lang.Object
  extended by info.ephyra.questionanalysis.KeywordExtractor

public class KeywordExtractor
extends java.lang.Object

Extracts keywords from a question.

The method getKeywords() tokenizes the question string, then drops single characters, bad keywords that frequently appear in questions, words listed in the FunctionWords dictionary, and duplicates.

The method getInfrequentKeywords() additionally drops the most frequent keywords if the number of words exceeds the threshold specified in MAX_WORDS.
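
The filtering steps described above can be sketched roughly as follows. This is an illustrative sketch, not the Ephyra source: the class name, the simplified whitespace tokenizer, and the two word lists (stand-ins for the IGNORE field and the FunctionWords dictionary) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the keyword filtering pipeline; not the Ephyra code.
class KeywordPipelineSketch {
    // Assumed stand-in for the IGNORE field (bad keywords common in questions).
    static final Set<String> BAD_KEYWORDS =
            new HashSet<>(Arrays.asList("name", "give", "tell", "list"));
    // Assumed stand-in for the FunctionWords dictionary.
    static final Set<String> FUNCTION_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "an", "is", "was", "in", "who", "what"));

    static String[] getKeywords(String question) {
        // Simplified tokenization: strip trailing question marks, split on whitespace.
        String[] tokens = question.trim().replaceAll("\\?+$", "").split("\\s+");
        // A LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> keywords = new LinkedHashSet<>();
        for (String token : tokens) {
            String lower = token.toLowerCase();
            if (token.length() < 2) continue;              // single characters (and empty tokens)
            if (BAD_KEYWORDS.contains(lower)) continue;    // bad keywords
            if (FUNCTION_WORDS.contains(lower)) continue;  // function words
            keywords.add(token);
        }
        return keywords.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints [first, president, United, States]
        System.out.println(Arrays.toString(
                getKeywords("Who was the first president of the United States?")));
    }
}
```

In this sketch duplicates are detected case-sensitively and the order of the filters does not affect the result; whether the real class behaves the same way is not stated on this page.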

Version:
2007-02-09
Author:
Nico Schlaefer

Field Summary
private static java.util.regex.Pattern DELIMS1
          Tokens that are always separated with blanks.
private static java.util.regex.Pattern DELIMS2
          Tokens that are separated with blanks unless they occur between numbers.
private static java.util.regex.Pattern DELIMS3
          Tokens that are separated with blanks only if they are the final token.
private static java.lang.String IGNORE
          Words that should not be part of a query string.
private static int MAX_WORDS
          Maximum number of keywords that are extracted.
 
Constructor Summary
KeywordExtractor()
           
 
Method Summary
static boolean containsKeyword(java.lang.String text, java.lang.String[] kws)
          Checks if the text contains at least one of the keywords.
private static java.lang.String[] dropBadKeywords(java.lang.String[] words)
          Drops keywords that should not be part of a query string.
private static java.lang.String[] dropDuplicates(java.lang.String[] words)
          Drops duplicates.
private static java.lang.String[] dropFrequentWords(java.lang.String[] words)
          Removes the most frequent words if the number of words exceeds the threshold specified in the MAX_WORDS field.
private static java.lang.String[] dropFunctionWords(java.lang.String[] words)
          Drops function words.
private static java.lang.String[] dropSingleChars(java.lang.String[] words)
          Drops single characters.
static java.lang.String[] getInfrequentKeywords(java.lang.String verbMod)
          Extracts up to MAX_WORDS of the least frequent keywords from a question.
static java.lang.String[] getKeywords(java.lang.String verbMod)
          Extracts keywords from a question.
static java.lang.String[] getKeywords(java.lang.String verbMod, java.lang.String context)
          Extracts keywords from a question and a context string.
static java.lang.String[] tokenize(java.lang.String text)
          Applies the rule-based tokenizer and splits the resulting string on whitespace.
static java.lang.String tokenizeWithSpaces(java.lang.String text)
          A rule-based tokenizer used to extract keywords for a query.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DELIMS1

private static final java.util.regex.Pattern DELIMS1
Tokens that are always separated with blanks.


DELIMS2

private static final java.util.regex.Pattern DELIMS2
Tokens that are separated with blanks unless they occur between numbers.


DELIMS3

private static final java.util.regex.Pattern DELIMS3
Tokens that are separated with blanks only if they are the final token.


IGNORE

private static final java.lang.String IGNORE
Words that should not be part of a query string.

See Also:
Constant Field Values

MAX_WORDS

private static final int MAX_WORDS
Maximum number of keywords that are extracted.

See Also:
Constant Field Values

Constructor Detail

KeywordExtractor

public KeywordExtractor()

Method Detail

dropSingleChars

private static java.lang.String[] dropSingleChars(java.lang.String[] words)
Drops single characters.

Parameters:
words - array of words
Returns:
array without single characters

dropBadKeywords

private static java.lang.String[] dropBadKeywords(java.lang.String[] words)
Drops keywords that should not be part of a query string.

Parameters:
words - array of words
Returns:
array without bad keywords

dropFunctionWords

private static java.lang.String[] dropFunctionWords(java.lang.String[] words)
Drops function words.

Parameters:
words - array of words
Returns:
array without function words

dropDuplicates

private static java.lang.String[] dropDuplicates(java.lang.String[] words)
Drops duplicates.

Parameters:
words - array of words
Returns:
array without duplicates

dropFrequentWords

private static java.lang.String[] dropFrequentWords(java.lang.String[] words)
Removes the most frequent words if the number of words exceeds the threshold specified in the MAX_WORDS field.

Parameters:
words - array of words
Returns:
array of at most MAX_WORDS words
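
One way such a filter could work is sketched below. The page does not say how word frequency is measured or what value MAX_WORDS has, so the frequency table, the threshold of 3, and the class name are all hypothetical. The sketch also returns the surviving words ordered by ascending frequency, which the real method may not do.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; frequency data and threshold are invented for illustration.
class FrequentWordFilterSketch {
    static final int MAX_WORDS = 3; // assumed value, the real constant is not shown here

    // Assumed corpus frequencies; words missing from the table count as frequency 0.
    static final Map<String, Integer> FREQ = new HashMap<>();
    static {
        FREQ.put("city", 900);
        FREQ.put("largest", 300);
        FREQ.put("burkina", 5);
        FREQ.put("faso", 4);
    }

    static String[] dropFrequentWords(String[] words) {
        if (words.length <= MAX_WORDS) {
            return words; // below the threshold nothing is dropped
        }
        // Sort a copy from least to most frequent, then keep the first MAX_WORDS.
        String[] sorted = words.clone();
        Arrays.sort(sorted, Comparator.comparingInt(
                (String w) -> FREQ.getOrDefault(w.toLowerCase(), 0)));
        return Arrays.copyOf(sorted, MAX_WORDS);
    }

    public static void main(String[] args) {
        // prints [faso, burkina, largest]
        System.out.println(Arrays.toString(
                dropFrequentWords(new String[] {"largest", "city", "burkina", "faso"})));
    }
}
```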

tokenizeWithSpaces

public static java.lang.String tokenizeWithSpaces(java.lang.String text)
A rule-based tokenizer used to extract keywords for a query. The tokenizer is conservative: for example, it does not split "F16" or "1,000.00".

Parameters:
text - text to tokenize
Returns:
string of space-delimited tokens
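
A minimal sketch of a conservative tokenizer in this style is shown below. The three patterns mirror the roles documented for DELIMS1, DELIMS2 and DELIMS3, but their contents are assumptions: the actual regular expressions are private and are not shown on this page.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are guesses at the documented roles only.
class ConservativeTokenizerSketch {
    // Role of DELIMS1: always separated with blanks (assumed character set).
    private static final Pattern ALWAYS = Pattern.compile("[!?;:\"()\\[\\]]");
    // Role of DELIMS2: separated unless between numbers (assumed: commas only).
    private static final Pattern UNLESS_BETWEEN_DIGITS = Pattern.compile("(?<!\\d),|,(?!\\d)");
    // Role of DELIMS3: separated only as the final token (assumed: a trailing period).
    private static final Pattern FINAL_ONLY = Pattern.compile("\\.\\s*$");

    static String tokenizeWithSpaces(String text) {
        String s = FINAL_ONLY.matcher(text).replaceFirst(" .");
        s = ALWAYS.matcher(s).replaceAll(" $0 ");
        s = UNLESS_BETWEEN_DIGITS.matcher(s).replaceAll(" $0 ");
        // Collapse the extra blanks introduced above.
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // prints: How much is 1,000.00 dollars , roughly ?
        System.out.println(tokenizeWithSpaces("How much is 1,000.00 dollars, roughly?"));
        // prints: Is it an F16 .
        System.out.println(tokenizeWithSpaces("Is it an F16."));
    }
}
```

Because letters and digits are never separated and punctuation between digits is left alone, "F16" and "1,000.00" survive as single tokens in this sketch.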

tokenize

public static java.lang.String[] tokenize(java.lang.String text)
Applies the rule-based tokenizer and splits the resulting string on whitespace.

Parameters:
text - text to tokenize
Returns:
array of tokens

getKeywords

public static java.lang.String[] getKeywords(java.lang.String verbMod)
Extracts keywords from a question.

Parameters:
verbMod - question string with modified verbs
Returns:
keywords

getKeywords

public static java.lang.String[] getKeywords(java.lang.String verbMod,
                                             java.lang.String context)
Extracts keywords from a question and a context string.

Parameters:
verbMod - question string with modified verbs
context - context string
Returns:
keywords

getInfrequentKeywords

public static java.lang.String[] getInfrequentKeywords(java.lang.String verbMod)
Extracts up to MAX_WORDS of the least frequent keywords from a question.

Parameters:
verbMod - question string with modified verbs
Returns:
keywords

containsKeyword

public static boolean containsKeyword(java.lang.String text,
                                      java.lang.String[] kws)
Checks if the text contains at least one of the keywords.

Parameters:
text - a text
kws - keywords
Returns:
true iff the text contains at least one of the keywords
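
A check of this kind might look like the sketch below. The page does not say whether matching is case-sensitive or whether it works on whole tokens or substrings. This sketch assumes case-insensitive matching of whole whitespace-delimited tokens, and the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the matching rules are assumptions, not Ephyra's.
class ContainsKeywordSketch {
    static boolean containsKeyword(String text, String[] kws) {
        // Collect the lower-cased tokens of the text.
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            tokens.add(token);
        }
        // Report a hit as soon as any keyword matches a whole token.
        for (String kw : kws) {
            if (tokens.contains(kw.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Ouagadougou is the capital of Burkina Faso";
        System.out.println(containsKeyword(text, new String[] {"river", "Capital"})); // prints true
        System.out.println(containsKeyword(text, new String[] {"river"}));            // prints false
    }
}
```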