Parsing documents | Step 2: Choosing the right parser
You'll probably want to use Lemur's WebParser. This parser handles regular text documents, which may or may not contain HTML. It removes HTML tags, text within SCRIPT tags, and HTML comments; it also strips contractions and possessives and converts all terms to lower case. Document boundaries are marked with NIST-style <DOC> tags. These tags distinguish one document from another, so you can put multiple documents in the same file. They also specify a document ID, which the index uses to identify each document, so document IDs should be unique. Most of the parsers in Lemur expect similar, if not identical, document separator tags.
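As a rough illustration, a file containing two tagged documents might look like the sketch below. It assumes the document ID is carried in a <DOCNO> tag; the exact tag names can vary by parser, so check the Lemur documentation for the parser you plan to use.

    <DOC>
    <DOCNO> doc001 </DOCNO>
    This is the text of the first document.
    </DOC>
    <DOC>
    <DOCNO> doc002 </DOCNO>
    This is the text of the second document.
    </DOC>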
Even if your documents are not HTML, you can still use the WebParser, since it does not rely on HTML being present: it ignores most HTML tags, and it can just as easily "ignore" tags that aren't there.
Adding these tags around your documents should be trivial. Here's an example Perl script that adds the tags to the given documents, using each file name as the document ID, and puts them all into one file called "collection.dat".
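The sketch below shows one way such a script might look. It assumes the document ID goes in a <DOCNO> tag and that the input files are listed on the command line; adjust the tags to whatever your parser expects.

    #!/usr/bin/perl
    # Wrap each input file in NIST-style <DOC> tags, using the file
    # name as the document ID, and write everything to collection.dat.
    use strict;
    use warnings;

    open(my $out, '>', 'collection.dat')
        or die "Cannot open collection.dat: $!";

    foreach my $file (@ARGV) {
        open(my $in, '<', $file) or die "Cannot open $file: $!";
        print $out "<DOC>\n<DOCNO> $file </DOCNO>\n";
        # Copy the document text unchanged.
        while (my $line = <$in>) {
            print $out $line;
        }
        print $out "</DOC>\n";
        close($in);
    }

    close($out);

You would run it as, for example, "perl wrap_docs.pl *.txt" (the script name here is just an example).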
Other parsers in Lemur for tokenizing documents include one that can handle part-of-speech tags from Brill's tagger and one that can handle named entities from the Identifinder tagger. See the thread "Indexing with term properties" if you want to use those.
Lemur also includes additional parsers, including support for text documents in Arabic and Chinese!
<< Step 1: Preparing your documents | [tutorial menu]
>> Step 3: Connecting other parser elements
>> Advanced: Customizing a parser