Parsing documents
Step 3: Connecting other parser elements

So far we've talked about how to tokenize documents into terms, but there are other ways you might want to process the document before committing the terms into an index. Lemur provides support for the most commonly used methods: recognizing acronyms, ignoring stop words, and stemming terms.

The support for recognizing acronyms, ignoring common words, and stemming terms to their roots is compatible with every English and Arabic Lemur parser. (How is this possible? See "Understanding the TextHandler.")

Acronyms
You can provide the parser with a list of acronyms to recognize. An acronym on this list is not converted to lower case like other terms. An acronym that is not on the list might still be tokenized as a term, but it will be indexed in lower case. Preserving case matters when an acronym also happens to be a word (AIDS, for example): if you want to keep such acronyms from being stemmed, you can have your stemmer ignore words in upper case. The list should be a simple text file containing each acronym on its own line, without any special separating character; the acronyms do not need to be in any particular order. Here's a short example of the contents of a file called "myacronyms.txt":

CMU
UMASS
AIDS
PA
MA
LTI
CIIR
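
To make the case-handling rule concrete, here is a minimal Python sketch of the idea. It is not Lemur's implementation or API: the function names are hypothetical, and exact, case-sensitive matching against the list is an assumption made for illustration.

```python
import io

def load_word_list(lines):
    """Read one entry per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}

def normalize_case(tokens, acronyms):
    """Lowercase every token except those found on the acronym list.

    Assumption: a token must match a list entry exactly (same case).
    """
    return [tok if tok in acronyms else tok.lower() for tok in tokens]

# Stand-in for open("myacronyms.txt"), so the sketch runs without a file.
acronym_file = io.StringIO("CMU\nUMASS\nAIDS\nPA\nMA\nLTI\nCIIR\n")
acronyms = load_word_list(acronym_file)

tokens = ["Researchers", "at", "CMU", "and", "UMASS", "study", "IR"]
print(normalize_case(tokens, acronyms))
# ['researchers', 'at', 'CMU', 'and', 'UMASS', 'study', 'ir']
```

Note that "IR" is not on the list, so it is still kept as a term but ends up in lower case, as described above.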

Stop words
There might be certain words that you do not want indexed, for whatever reason. Researchers often ignore very common words that contribute little meaning to a document, such as "the", "it", "or", and "and". These words generally occur so frequently that ignoring them greatly reduces the size of the index.

You can provide a list of words that the Lemur indexing applications should ignore while indexing. The list should be a simple text file containing each stop word on its own line, without any special separating character; the stop words do not need to be in any particular order. Here's a short example of the contents of a file called "stopwords.txt":

the
it
or
and
is
that
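
The effect of a stop word list can be sketched in a few lines of Python. This is only an illustration of the filtering step, not Lemur code: the function names are hypothetical, and comparing in lower case is an assumption.

```python
import io

def load_stop_words(lines):
    """Read one stop word per line, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not on the stop word list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Stand-in for open("stopwords.txt"), so the sketch runs without a file.
stop_file = io.StringIO("the\nit\nor\nand\nis\nthat\n")
stop_words = load_stop_words(stop_file)

tokens = "the index is smaller and that helps".split()
kept = remove_stop_words(tokens, stop_words)
print(kept)                         # ['index', 'smaller', 'helps']
print(len(tokens), "->", len(kept)) # 7 -> 3
```

Even in this tiny example, more than half of the tokens never reach the index.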

Stemming
Lemur provides two stemmers for English: the Porter stemmer and the Krovetz stemmer. A stemmer collapses multiple forms of a term into a single term, which reduces the size of the vocabulary in the index. The Krovetz stemmer stems words in such a way that the result is still a valid English word, whereas the Porter stemmer might give you a word fragment. The Krovetz stemmer requires additional files for its processing, such as a file that specifies proper nouns. Lemur provides default files for you in the lemur/data/kstem_dir directory. These files are fully functional, but you can modify them or use other files to suit your needs. To use the Lemur indexing applications with the Krovetz stemmer, all supporting Krovetz files must be in the same directory.
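
To see how stemming shrinks the vocabulary, consider the toy suffix stripper below. It is deliberately crude and is neither the Porter nor the Krovetz algorithm; the function name and the suffix rules are made up for illustration only.

```python
def toy_stem(term):
    """Strip a few common English suffixes.

    A toy for illustration, NOT the Porter or Krovetz algorithm.
    The stem must keep at least three characters.
    """
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

terms = ["index", "indexed", "indexing", "parser", "parsers", "parsed"]
stems = [toy_stem(t) for t in terms]

vocab_before = set(terms)
vocab_after = set(stems)

print(stems)
# ['index', 'index', 'index', 'parser', 'parser', 'pars']
print(len(vocab_before), "->", len(vocab_after))  # 6 -> 3
```

Six distinct terms collapse into three. The stem "pars" also shows how simple suffix stripping can leave a word fragment rather than a valid English word.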


