Parsing documents
Step 1: Preparing your documents

Lemur is a toolkit for indexing text documents. It currently supports text in English, Chinese, and Arabic. This tutorial focuses on English documents, but most of it applies to every supported language.

Lemur can index any type of text document, as long as you have a parser that tokenizes the text into terms and passes them to the indexer.

Lemur is primarily a research system, so the included parsers were designed to make it easy to index many documents stored in a single file. For the indexer to know where the document boundaries fall within a file, each document must be enclosed in begin-document and end-document tags. These tags resemble HTML or XML tags; the format is the one used for NIST's Text REtrieval Conference (TREC) documents.
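
To make the idea of document boundaries concrete, here is a minimal Python sketch of a file holding two documents and a function that splits it at those boundaries. The tag names DOC, DOCNO, and TEXT follow the common TREC convention and are an assumption here, as are the sample ids and the function name split_documents; check the parser you choose for the exact tags it expects. This is only an illustration, not how the Lemur parsers are implemented.

```python
import re

# Assumed TREC-style layout; tag names and ids are illustrative only.
SAMPLE = """<DOC>
<DOCNO> doc-001 </DOCNO>
<TEXT>
The first document.
</TEXT>
</DOC>
<DOC>
<DOCNO> doc-002 </DOCNO>
<TEXT>
The second document.
</TEXT>
</DOC>
"""

# Non-greedy matches so each pattern stops at the nearest closing tag.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_documents(data):
    """Return a list of (docno, text) pairs, one per <DOC>...</DOC> block."""
    docs = []
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno is None or text is None:
            continue  # skip blocks that lack an id or a text section
        docs.append((docno.group(1).strip(), text.group(1).strip()))
    return docs
```

Calling split_documents(SAMPLE) yields one (id, text) pair per document, which is the unit an indexer would then work with.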

If you don't have a TREC text collection, then to use the Lemur indexes you must either write your own parser or write a simple script that converts your documents to the format expected by one of the Lemur parsers. The next section provides a sample Perl script for this. The last section of the Parsing tutorial explains how to write a parser that works with Lemur.
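
As a rough idea of what such a conversion script involves, the Python sketch below wraps each plain-text file in a directory with begin-document and end-document tags and writes them all to a single output file. It is not the sample script from the tutorial. The tag names (DOC, DOCNO, TEXT), the function names to_trec and convert_directory, and the choice of the file name as the document id are all assumptions made for illustration.

```python
from pathlib import Path


def to_trec(docno, text):
    """Wrap one document in assumed TREC-style begin/end tags."""
    return (
        "<DOC>\n"
        f"<DOCNO> {docno} </DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def convert_directory(src_dir, out_path):
    """Write every .txt file in src_dir to out_path as tagged documents.

    Returns the number of documents written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # The file name without its extension serves as the document id.
            out.write(to_trec(path.stem, text))
            count += 1
    return count
```

For example, convert_directory("my_docs", "collection.trec") would combine the .txt files under a hypothetical my_docs directory into one file. The sketch does not handle text that itself contains these tag strings; a real script would need to decide how to treat that case.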




