Contents
- What is an index?
- What kind of data/documents can Lemur index?
- Do the parsers add all words into the index?
- What type of indexes does Lemur have?
1. What is an index?
An index, or database, is a collection of information that can be accessed quickly using some piece of information as a point of reference, or key (what it is indexed by). In our case, we index information about the terms in a collection of documents, which you can access later using either a term or a document as the reference. Specifically, we can collect term frequency, term position, and document length statistics because those are most commonly needed for information retrieval. For example, from the index you can find out how many times a certain term occurred in the collection of documents, or how many times it occurred in one specific document. Retrieval algorithms, which decide which documents to return for a given query, use the information collected in the index in their scoring calculations.
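To make the idea concrete, here is a minimal sketch of an in-memory index that records the three statistics mentioned above: term frequency, term position, and document length. It is an illustration only and is not Lemur's implementation or API; the class name `ToyIndex`, its method names, and the whitespace tokenization are all assumptions made for this example.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy index (not Lemur code): term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: [positions]}
        self.doc_lengths = {}              # doc_id -> number of terms

    def add_document(self, doc_id, text):
        # Naive tokenization: lowercase and split on whitespace.
        terms = text.lower().split()
        self.doc_lengths[doc_id] = len(terms)
        for position, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(position)

    def term_count_in_doc(self, term, doc_id):
        # How many times the term occurred in one specific document.
        return len(self.postings.get(term, {}).get(doc_id, []))

    def term_count_in_collection(self, term):
        # How many times the term occurred across all documents.
        return sum(len(p) for p in self.postings.get(term, {}).values())

    def docs_containing(self, term):
        # Access by term: which documents contain it.
        return sorted(self.postings.get(term, {}))


index = ToyIndex()
index.add_document("d1", "the cat sat on the mat")
index.add_document("d2", "the dog")
print(index.term_count_in_collection("the"))
print(index.term_count_in_doc("the", "d1"))
print(index.doc_lengths["d1"])
```

A retrieval algorithm would combine counts like these (for example, term frequency normalized by document length) when scoring documents against a query.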
2. What kind of data/documents can Lemur index?
You can create your own parsers for whatever text documents you have, as long as your parser takes whatever it wants to recognize as a term and "pushes" it into the index. (See the Parsing Tutorial.) However, we do provide several parsers with the toolkit. Lemur is primarily a research system, so the included parsers were designed to facilitate indexing many documents that are in the same file. In order for the index to know where the document boundaries are within files, each document must have begin-document and end-document tags. These tags are similar to HTML or XML tags and are the format used for NIST's Text REtrieval Conference (TREC) documents.
The two most frequently used parsers are TrecParser and WebParser.
TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, as well as text in HTML comments. Document boundaries are specified with NIST style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
If your documents are not from NIST, these are the methods you can take to parse and index your documents (you might also find the Parsing Tutorial helpful):
- Write a script to add the NIST style tags around your documents. Then use one of the parsers provided by Lemur with either your own or one of Lemur's applications.
- Write your own parser and feed the terms into an index by using the PushIndex API in your own application.
- Implement your own TextHandler class (a parser to handle your document formats), which you can then use in pipeline fashion with other TextHandlers already in Lemur to further pre-process terms (e.g. stopping, stemming), and use with InvFPTextHandler to build an index. (See "Parsing in Lemur" for more information.)
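The first approach in the list above, adding NIST style tags around your own documents, can be sketched as follows. This is only an illustration under assumptions: the function names `wrap_as_trec` and `build_trec_file` are hypothetical, the document text is placed in a TEXT field (one of the fields TrecParser recognizes), and the sketch does not escape any angle brackets that may already appear in your text.

```python
def wrap_as_trec(doc_id, text):
    """Wrap one plain-text document in <DOC>, <DOCNO>, and <TEXT> tags."""
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def build_trec_file(documents):
    """Concatenate (doc_id, text) pairs into the body of one tagged file."""
    return "".join(wrap_as_trec(doc_id, text) for doc_id, text in documents)


# Example: two small documents combined into one tagged file body.
docs = [
    ("doc-1", "Index this document text."),
    ("doc-2", "A second document in the same file."),
]
print(build_trec_file(docs), end="")
```

The resulting text can be written to a single file and passed to one of the provided parsers, since each document now carries its own begin and end tags.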
3. Do the parsers add all words into the index?
After the initial parsing of a document into terms, there might be other considerations to be made before adding the term into the index, such as whether or not that word is important enough to add, whether to add the word as is or to index its stem form instead, and whether to recognize certain words as acronyms. Having an acronyms list, ignoring stopwords (very common words, like "the", "and", "it"), and indexing word stems (so "stem", "stemming", and "stems" would all become the same term) are features supported by Lemur. These features are all supported by the provided application, BuildInvertedIndex.
4. What type of indexes does Lemur have?
Lemur currently has the following indexes: InvIndex, InvFPIndex, KeyfileIncIndex, and BasicIndex. The indexes differ in that they might index different data or represent the data differently on disk. Each index has a "table of contents" file, which holds some summary statistics on what is in the index as well as the list of files needed to load the index. When you want to use an index, you will need its table of contents file to load it. Each index in Lemur has its own unique extension for its table of contents file.
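| Index Name      | Extension | File Limit | Stores positions | Loads fast | Disk space usage | Application          | Add* documents to Index       |
|-----------------|-----------|------------|------------------|------------|------------------|----------------------|-------------------------------|
| InvIndex        | .inv      | no         | no               | no         | less             | BuildInvertedIndex   | no                            |
| InvFPIndex      | .ifp      | no         | yes              | no         | more             | BuildInvertedIndex   | yes, use IncIndexer           |
| KeyfileIncIndex | .key      | no         | yes              | yes        | even more        | BuildKeyfileIncIndex | yes, use BuildKeyfileIncIndex |
| BasicIndex**    | .bsc      | yes        | no               | no         |                  | BuildBasicIndex**    | no                            |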
* Supports adding new documents to index, not updating existing documents.
** Will be deprecated.
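Because each index type has its own table of contents extension, the extension alone is enough to tell which kind of index a file belongs to. The sketch below illustrates that lookup using the extensions listed in the table above. It is not part of Lemur; the function name `index_type_for` and the example file names are assumptions made for illustration.

```python
import os

# Table of contents extension -> index name, taken from the table above.
INDEX_TYPES = {
    ".inv": "InvIndex",
    ".ifp": "InvFPIndex",
    ".key": "KeyfileIncIndex",
    ".bsc": "BasicIndex",
}


def index_type_for(toc_path):
    """Return the index name implied by a table of contents file's extension."""
    extension = os.path.splitext(toc_path)[1].lower()
    if extension not in INDEX_TYPES:
        raise ValueError(f"unrecognized table of contents extension: {extension!r}")
    return INDEX_TYPES[extension]


print(index_type_for("myindex.key"))
print(index_type_for("collection.ifp"))
```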
The Lemur Project Last modified: Wed Jun 16 12:15:22 EDT 2004