java.lang.Object
  info.ephyra.questionanalysis.KeywordExtractor
public class KeywordExtractor
Extracts keywords from a question.
The method getKeywords() tokenizes the question string and
drops single characters and bad keywords that frequently appear in questions.
Furthermore, duplicates and all words that appear in the FunctionWords
dictionary are dropped.
The method getInfrequentKeywords() additionally drops the
most frequent keywords if the number of words exceeds the threshold specified
in MAX_WORDS.
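
The filtering steps described above can be sketched as follows. This is a minimal illustration, not the Ephyra implementation: the class name `KeywordPipelineSketch`, the contents of `FUNCTION_WORDS` and `BAD_KEYWORDS`, the naive tokenization, and the `frequencies` map are all assumptions, since this page does not show the FunctionWords dictionary, the value of IGNORE, or how word frequency is measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordPipelineSketch {
    // Hypothetical stand-in for the FunctionWords dictionary.
    private static final Set<String> FUNCTION_WORDS = Set.of(
            "the", "a", "an", "of", "in", "is", "was", "who", "what", "did");

    // Hypothetical stand-in for the IGNORE field; its real value is not shown here.
    private static final Set<String> BAD_KEYWORDS = Set.of("name", "give");

    /** Tokenizes the question and drops single characters, bad keywords,
     *  function words and duplicates, keeping the original word order. */
    static String[] getKeywords(String question) {
        // Naive tokenization (assumption): split on anything that is not a letter or digit.
        String[] tokens = question.split("[^\\p{L}\\p{N}]+");
        Set<String> kept = new LinkedHashSet<>(); // insertion-ordered, drops duplicates
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (token.length() <= 1) continue;          // single characters and empty strings
            if (BAD_KEYWORDS.contains(lower)) continue;
            if (FUNCTION_WORDS.contains(lower)) continue;
            kept.add(token);
        }
        return kept.toArray(new String[0]);
    }

    /** Additionally removes the most frequent words until at most maxWords remain.
     *  The frequency map is a hypothetical input; unknown words count as frequency 0. */
    static String[] getInfrequentKeywords(String question,
                                          Map<String, Integer> frequencies,
                                          int maxWords) {
        List<String> words = new ArrayList<>(Arrays.asList(getKeywords(question)));
        while (!words.isEmpty() && words.size() > maxWords) {
            String mostFrequent = words.get(0);
            for (String word : words) {
                if (frequency(word, frequencies) > frequency(mostFrequent, frequencies)) {
                    mostFrequent = word;
                }
            }
            words.remove(mostFrequent);
        }
        return words.toArray(new String[0]);
    }

    private static int frequency(String word, Map<String, Integer> frequencies) {
        return frequencies.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }

    public static void main(String[] args) {
        String question = "Who was the first president of the United States?";
        // prints [first, president, United, States]
        System.out.println(Arrays.toString(getKeywords(question)));
    }
}
```

Passing the threshold as a parameter, rather than reading a constant such as MAX_WORDS, is only a convenience for experimenting with the sketch.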
| Field Summary | |
|---|---|
| private static java.util.regex.Pattern | DELIMS1 - Tokens that are always separated with blanks. |
| private static java.util.regex.Pattern | DELIMS2 - Tokens that are only separated with blanks if not in between numbers. |
| private static java.util.regex.Pattern | DELIMS3 - Tokens that are only separated with blanks if final token. |
| private static java.lang.String | IGNORE - Words that should not be part of a query string. |
| private static int | MAX_WORDS - Maximum number of keywords that are extracted. |
| Constructor Summary | |
|---|---|
| KeywordExtractor() | |
| Method Summary | |
|---|---|
| static boolean | containsKeyword(java.lang.String text, java.lang.String[] kws) - Checks if the text contains one of the keywords. |
| private static java.lang.String[] | dropBadKeywords(java.lang.String[] words) - Drops keywords that should not be part of a query string. |
| private static java.lang.String[] | dropDuplicates(java.lang.String[] words) - Drops duplicates. |
| private static java.lang.String[] | dropFrequentWords(java.lang.String[] words) - Removes the most frequent words if the number of words exceeds the threshold specified in the MAX_WORDS field. |
| private static java.lang.String[] | dropFunctionWords(java.lang.String[] words) - Drops function words. |
| private static java.lang.String[] | dropSingleChars(java.lang.String[] words) - Drops single characters. |
| static java.lang.String[] | getInfrequentKeywords(java.lang.String verbMod) - Extracts the up to MAX_WORDS least frequent keywords from a question. |
| static java.lang.String[] | getKeywords(java.lang.String verbMod) - Extracts keywords from a question. |
| static java.lang.String[] | getKeywords(java.lang.String verbMod, java.lang.String context) - Extracts keywords from a question and a context string. |
| static java.lang.String[] | tokenize(java.lang.String text) - Applies the rule-based tokenizer and splits the resulting string along whitespaces. |
| static java.lang.String | tokenizeWithSpaces(java.lang.String text) - A rule-based tokenizer used to extract keywords for a query. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
private static final java.util.regex.Pattern DELIMS1
private static final java.util.regex.Pattern DELIMS2
private static final java.util.regex.Pattern DELIMS3
private static final java.lang.String IGNORE
private static final int MAX_WORDS
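
The three delimiter patterns correspond to the three separation rules given in the field summary. The sketch below shows one way such a rule-based tokenizer could be written. The actual regular expressions are not included in this page, so the patterns `ALWAYS`, `NOT_IN_NUMBER` and `FINAL_ONLY`, the class name `TokenizerSketch`, and the reading of "final token" as "end of the text" are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class TokenizerSketch {
    // Hypothetical counterpart of DELIMS1: always surrounded with blanks.
    private static final Pattern ALWAYS = Pattern.compile("[()\\[\\]{}\"!?;]");

    // Hypothetical counterpart of DELIMS2: surrounded with blanks unless the
    // character sits directly between two digits, as in "1,000" or "10:30".
    private static final Pattern NOT_IN_NUMBER =
            Pattern.compile("(?<!\\d)[,:/]|[,:/](?!\\d)");

    // Hypothetical counterpart of DELIMS3: a period is separated only when
    // nothing but whitespace follows it, so "3.5" stays intact.
    private static final Pattern FINAL_ONLY = Pattern.compile("\\.(?=\\s*$)");

    /** Inserts blanks around delimiters and normalizes whitespace. */
    static String tokenizeWithSpaces(String text) {
        String result = ALWAYS.matcher(text).replaceAll(" $0 ");
        result = NOT_IN_NUMBER.matcher(result).replaceAll(" $0 ");
        result = FINAL_ONLY.matcher(result).replaceAll(" $0 ");
        return result.replaceAll("\\s+", " ").trim();
    }

    /** Applies the tokenizer and splits the result along whitespace. */
    static String[] tokenize(String text) {
        String spaced = tokenizeWithSpaces(text);
        return spaced.isEmpty() ? new String[0] : spaced.split(" ");
    }

    public static void main(String[] args) {
        // prints: What is 3.5% of 1,000 ( roughly ) ?
        System.out.println(tokenizeWithSpaces("What is 3.5% of 1,000 (roughly)?"));
        // prints: Paris , France .
        System.out.println(tokenizeWithSpaces("Paris, France."));
    }
}
```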
| Constructor Detail |
|---|
public KeywordExtractor()
| Method Detail |
|---|
private static java.lang.String[] dropSingleChars(java.lang.String[] words)
words - array of words
private static java.lang.String[] dropBadKeywords(java.lang.String[] words)
words - array of words
private static java.lang.String[] dropFunctionWords(java.lang.String[] words)
words - array of words
private static java.lang.String[] dropDuplicates(java.lang.String[] words)
words - array of words
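private static java.lang.String[] dropFrequentWords(java.lang.String[] words)
Removes the most frequent words if the number of words exceeds the threshold specified in the MAX_WORDS field.
words - array of words
array of at most MAX_WORDS words
public static java.lang.String tokenizeWithSpaces(java.lang.String text)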
text - text to tokenize
public static java.lang.String[] tokenize(java.lang.String text)
text - text to tokenize
public static java.lang.String[] getKeywords(java.lang.String verbMod)
verbMod - question string with modified verbs
public static java.lang.String[] getKeywords(java.lang.String verbMod,
java.lang.String context)
verbMod - question string with modified verbs
context - context string
public static java.lang.String[] getInfrequentKeywords(java.lang.String verbMod)
Extracts the up to MAX_WORDS least frequent keywords from a question.
verbMod - question string with modified verbs
public static boolean containsKeyword(java.lang.String text,
java.lang.String[] kws)
text - a text
kws - keywords
true iff the text contains one of the keywords
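
As an illustration of the containsKeyword contract, the sketch below returns true iff at least one keyword occurs in the text. How the real method compares words is not documented on this page; the case-insensitive whole-word matching and the class name `ContainsKeywordSketch` are assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ContainsKeywordSketch {
    /** Returns true iff the text contains at least one of the keywords.
     *  Assumption: matching is case-insensitive and on whole words only. */
    static boolean containsKeyword(String text, String[] kws) {
        // Collect the lower-cased words of the text.
        Set<String> words = new HashSet<>();
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        }
        // Look up each keyword in the set of words.
        for (String kw : kws) {
            if (words.contains(kw.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The first President was Washington.";
        // prints true
        System.out.println(containsKeyword(text, new String[] {"president", "senator"}));
        // prints false
        System.out.println(containsKeyword(text, new String[] {"senator"}));
    }
}
```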