Parsing documents | Step 3: Connecting other parser elements
So far we've talked about how to tokenize documents into terms, but there are other ways you might want to process the document before committing the terms into an index. Lemur provides support for the most commonly used methods: recognizing acronyms, ignoring stop words, and stemming terms.
The support for recognizing acronyms, ignoring common words, and stemming terms to their roots is compatible with every English and Arabic Lemur parser. (How is this possible? See "Understanding the TextHandler.")
Acronyms

CMU

Stop words

There might be certain words that you do not want indexed. Researchers often ignore very common words that contribute little meaning to a document, such as "the", "it", "or", and "and". These words occur so frequently that ignoring them greatly reduces the size of the index.

You can provide a list of words that the Lemur indexing applications should ignore while indexing. This list should be a plain text file with each stop word on its own line and no other separating characters. The stop words do not need to be in any particular order. Here's a short example of the contents of a file called "stopwords.txt":

the

Stemming

Lemur provides two stemmers for English: the Porter stemmer and the Krovetz stemmer. Using a stemmer collapses multiple forms of a term into the same term, which reduces the size of the vocabulary in the index. The Krovetz stemmer stems words so that the stemmed word is still a valid English word, whereas the Porter stemmer might give you a word fragment.

The Krovetz stemmer requires additional files for its processing, such as a file that specifies proper nouns. Lemur provides default files in the lemur/data/kstem_dir directory. These files are fully functional, but you can use other files or modify them to suit your needs. To use the Lemur indexing applications with the Krovetz stemmer, all supporting Krovetz files must be in the same directory.
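The stop word file format described earlier in this section (one word per line, in any order) can be illustrated with a short Python sketch. This is not Lemur code and does not use the Lemur API. The function names `load_stopwords`, `remove_stopwords`, and `demo` are hypothetical, and the case-insensitive matching is an assumption made for the example rather than documented Lemur behavior.

```python
import os
import tempfile


def load_stopwords(path):
    """Read a stop word file: one word per line, order irrelevant."""
    with open(path, encoding="utf-8") as handle:
        # Blank lines are skipped; lowercasing is an assumption of this sketch.
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stopwords(terms, stopwords):
    """Return the terms that do not appear in the stop word set."""
    return [term for term in terms if term.lower() not in stopwords]


def demo():
    # Write a small stop word file into a temporary directory, then load it.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stopwords.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("the\nit\nor\nand\n")
        stopwords = load_stopwords(path)

    # Tokenized terms as they might arrive from a parser.
    terms = ["The", "parser", "and", "the", "index"]
    return remove_stopwords(terms, stopwords)


if __name__ == "__main__":
    print(demo())
```

Only the terms that survive the filter would go on to be stemmed and indexed, which is why a stop word list reduces the size of the index.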