Lemur Beginner's Guide to Indexing

Contents

What is an index?
What kind of data/documents can Lemur index?
Do the parsers add all words into the index?
What type of indexes does Lemur have?

1. What is an index?
An index, or database, is basically a collection of information that can be quickly accessed, using some piece of information as a point of reference or key (what it's indexed by). In our case, we index information about the terms in a collection of documents, which you can access later using either a term or a document as the reference.
Specifically, we can collect term frequency, term position, and document length statistics because those are most commonly needed for information retrieval. For example, from the index, you can find out how many times a certain term occurred in the collection of documents, or how many times it occurred in one specific document. Retrieval algorithms that decide which documents to return for a given query use the information collected in the index in their scoring calculations.
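To make the idea concrete, here is a minimal sketch of such an index in Python. It is not Lemur code and does not use the Lemur API; the function names (build_index, term_count_in_doc, term_count_in_collection) and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index from a {doc_id: text} mapping."""
    postings = defaultdict(dict)  # term -> {doc_id: [positions]}
    doc_lengths = {}              # doc_id -> number of terms
    for doc_id, text in docs.items():
        terms = text.lower().split()  # naive whitespace tokenization
        doc_lengths[doc_id] = len(terms)
        for position, term in enumerate(terms):
            postings[term].setdefault(doc_id, []).append(position)
    return postings, doc_lengths


def term_count_in_doc(postings, term, doc_id):
    """How many times a term occurred in one specific document."""
    return len(postings.get(term, {}).get(doc_id, []))


def term_count_in_collection(postings, term):
    """How many times a term occurred in the whole collection."""
    return sum(len(positions) for positions in postings.get(term, {}).values())


docs = {
    "d1": "index this document text",
    "d2": "this text mentions text twice",
}
postings, doc_lengths = build_index(docs)
print(term_count_in_collection(postings, "text"))  # 3
print(term_count_in_doc(postings, "text", "d2"))   # 2
print(doc_lengths["d1"])                           # 4
```

The same structure can be read by term (which documents contain it, and where) or by document (how long it is), which is the kind of information a scoring function typically needs.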

2. What kind of data/documents can Lemur index?
You can create your own parser for whatever text documents you have, as long as the parser takes whatever it wants to recognize as a term and "pushes" it into the index. (See the Parsing Tutorial.) However, we do provide several parsers with the toolkit.
Lemur is primarily a research system, so the included parsers were designed to facilitate indexing many documents that are in the same file. In order for the index to know where the document boundaries are within a file, each document must have begin-document and end-document tags. These tags are similar to HTML or XML tags and follow the format used for NIST's Text REtrieval Conference (TREC) documents.
The two most frequently used parsers are TrecParser and WebParser.
TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:

<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, and text in HTML comments. Document boundaries are specified in the NIST-style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
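As a rough illustration of how document boundaries can be recovered from a multi-document file in this format, the following Python sketch splits a string on the DOC tags and pulls out each DOCNO. It is not how TrecParser or WebParser is implemented; the regular expressions and the split_documents name are assumptions for this example, and it does no HTML stripping or field selection.

```python
import re

# Non-greedy matches so each pattern stops at the first closing tag.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def split_documents(data):
    """Yield (docno, body) for each <DOC>...</DOC> block in data."""
    for match in DOC_RE.finditer(data):
        block = match.group(1)
        docno = DOCNO_RE.search(block)
        if docno is None:
            continue  # skip blocks that carry no document number
        body = block[docno.end():].strip()
        yield docno.group(1), body


sample = """<DOC>
<DOCNO> doc-1 </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
<DOC>
<DOCNO> doc-2 </DOCNO>
Document text here could be in HTML.
</DOC>
"""
for docno, body in split_documents(sample):
    print(docno, "->", body)
```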
If your documents are not from NIST, you can take one of the following approaches to parse and index them:

Write a script to add the NIST style tags around your documents. Then use one of the parsers provided by Lemur with either your own or one of Lemur's applications.

Write your own parser and feed the terms into an index by using the PushIndex API in your own application.

Implement your own TextHandler class (a parser to handle your document formats), which you can then use in pipeline fashion with other TextHandlers already in Lemur to further pre-process terms (e.g. stopping, stemming), and use it with InvFPTextHandler to build an index. (See "Parsing in Lemur" for more information.)

You might also find the Parsing Tutorial helpful.
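For the first approach, a wrapping script can be very small. The sketch below is one possible version in Python; the function names and the sample document numbers are made up for illustration, and it assumes your text does not itself contain DOC, DOCNO, or TEXT tags.

```python
def wrap_as_trec(docno, text):
    """Return text wrapped in NIST-style DOC, DOCNO, and TEXT tags."""
    return (
        "<DOC>\n"
        f"<DOCNO> {docno} </DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def build_collection(docs):
    """Concatenate (docno, text) pairs into one multi-document string."""
    return "".join(wrap_as_trec(docno, text) for docno, text in docs)


collection = build_collection([
    ("note-1", "First plain-text document."),
    ("note-2", "Second plain-text document."),
])
print(collection)
```

Writing the resulting string to a file gives you a single multi-document file in the layout shown in the TrecParser example.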
3. Do the parsers add all words into the index?
No, not necessarily. After a document is first parsed into terms, there may be further decisions to make before a term is added to the index: whether the word is important enough to add, whether to add the word as is or to index its stem instead, and whether to recognize certain words as acronyms. Lemur supports an acronym list, ignoring stopwords (very common words, like "the", "and", "it"), and indexing word stems (so "stem", "stemming", and "stems" would all become the same term). All of these features are available in the provided application, BuildInvertedIndex.
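The sketch below shows, in Python, what such a filtering step might look like between the parser and the index. It is not Lemur's implementation: the stopword and acronym lists are tiny made-up samples, toy_stem is a deliberately crude stand-in for a real stemmer, and keeping acronyms unchanged is an assumption about what "recognizing" them means.

```python
STOPWORDS = {"the", "and", "it"}  # tiny illustrative stopword list
ACRONYMS = {"TREC", "NIST"}       # hypothetical acronym list


def toy_stem(term):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    if term.endswith("ing") and len(term) > 5:
        term = term[:-3]
        if term[-1] == term[-2]:  # undo consonant doubling: "stemm" -> "stem"
            term = term[:-1]
    elif term.endswith("s") and len(term) > 3:
        term = term[:-1]
    return term


def index_terms(tokens):
    """Return the terms that would be added to the index, in order."""
    terms = []
    for token in tokens:
        if token in ACRONYMS:
            terms.append(token)  # assumption: acronyms are kept as is
            continue
        term = token.lower()
        if term in STOPWORDS:
            continue             # stopwords are dropped entirely
        terms.append(toy_stem(term))
    return terms


tokens = "The TREC stemming stems and it stem".split()
print(index_terms(tokens))  # ['TREC', 'stem', 'stem', 'stem']
```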
4. What type of indexes does Lemur have?
Lemur currently has the following indexes: InvIndex, InvFPIndex, KeyfileIncIndex, and BasicIndex. The indexes differ in that they may index different data or represent the data differently on disk. Each index has a "table of contents" file, which holds summary statistics on what's in the index and lists the files needed to load it. When you want to use an index, you will need its table of contents file to load it. Each index in Lemur has its own unique extension for its table of contents file.

Index Name	Extension	File Limit	Stores positions	Loads fast	Disk space usage	Application	Add* documents to Index
InvIndex	.inv	no	no	no	less	BuildInvertedIndex	no
InvFPIndex	.ifp	no	yes	no	more	BuildInvertedIndex	yes, use IncIndexer
KeyfileIncIndex	.key	no	yes	yes	even more	BuildKeyfileIncIndex	yes, use BuildKeyfileIncIndex
BasicIndex**	.bsc	yes	no	no		BuildBasicIndex**	no

* Supports adding new documents to index, not updating existing documents.
** Will be deprecated.
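Because each index type has its own table of contents extension, a script can tell which kind of index a file belongs to from its name alone. The following Python sketch shows one way to do that using the extensions from the table above; the index_type_for name and the file name myindex.key are hypothetical, and this is not part of Lemur.

```python
import os

# Table of contents extension -> index type, taken from the table above.
TOC_EXTENSIONS = {
    ".inv": "InvIndex",
    ".ifp": "InvFPIndex",
    ".key": "KeyfileIncIndex",
    ".bsc": "BasicIndex",
}


def index_type_for(toc_path):
    """Return the index type implied by a table of contents file name."""
    ext = os.path.splitext(toc_path)[1].lower()
    if ext not in TOC_EXTENSIONS:
        raise ValueError(f"unrecognized table of contents extension: {ext!r}")
    return TOC_EXTENSIONS[ext]


print(index_type_for("myindex.key"))  # KeyfileIncIndex
```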

The Lemur Project
Last modified: Wed Jun 16 12:15:22 EDT 2004
