Lemur Indexing Applications

Contents

BuildInvertedIndex
BuildKeyfileIncIndex
BuildDocMgr
BuildPropIndex
BuildBasicIndex
PassageIndexer
IncIndexer
IncPassageIndexer

1. BuildInvertedIndex (previously called PushIndexer)

This application builds an Inv(FP) index for a collection of documents.
To use it, follow the general steps of running a Lemur application.
The parameters are:

index: name of the index table-of-content file without the .ifp extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
position: store position information (def = 1).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  trec for standard TREC formatted documents
  web for web TREC formatted documents
  chinese for segmented Chinese text (TREC format, GB encoding)
  chinesechar for unsegmented Chinese text (TREC format, GB encoding)
  arabic for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

  porter Porter stemmer.
  krovetz Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  arabic Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.
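
The parameters listed above can be collected into a parameter file. The following Python sketch renders such a file, assuming the "name = value;" layout shown in the BuildBasicIndex example on this page also applies here. The helper render_params and all paths are hypothetical; only the parameter names come from the list above.

```python
def render_params(params):
    """Render a dict as 'name = value;' lines, one parameter per line.

    Assumption: this is the same parameter-file layout as the
    BuildBasicIndex example on this page.
    """
    lines = [f"{name} = {value};" for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Hypothetical paths; the parameter names are the documented ones.
params = {
    "index": "/lemur/indexes/myindex",   # full path, no .ifp extension
    "memory": 96000000,                  # documented default, in bytes
    "position": 1,                       # store position information
    "stopwords": "/lemur/data/stopwords.txt",
    "docFormat": "trec",
    "stemmer": "porter",
    "dataFiles": "/lemur/data/filelist.txt",
}

text = render_params(params)
print(text)
```

The printed text could be saved to a file and passed to the application as its parameter file.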

2. BuildKeyfileIncIndex

This application builds or adds to a Keyfile positional index for a collection of documents.
To use it, follow the general steps of running a Lemur application.
The parameters are:

index: name of the index table-of-content file without the .ifp extension.
memory: memory (in bytes) for index cache (def = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  "trec" for standard TREC formatted documents
  "web" for web TREC formatted documents
  "chinese" for segmented Chinese text (TREC format, GB encoding)
  "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
  "arabic" for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

  "porter" Porter stemmer.
  "krovetz" Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  "arabic" Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.
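
The effect of the countStopWords parameter on document length can be illustrated with a small Python sketch. This is one plausible reading of the parameter description above, not Lemur's actual code; the function name and the sample stopword list are made up for illustration.

```python
def document_length(tokens, stopwords, count_stop_words):
    """Length of a document in terms.

    Assumption: with count_stop_words false, stopwords simply do not
    contribute to the length; with it true, every token counts.
    """
    if count_stop_words:
        return len(tokens)
    return sum(1 for token in tokens if token not in stopwords)


tokens = "the cat sat on the mat".split()
stopwords = {"the", "on"}  # hypothetical stopword list

print(document_length(tokens, stopwords, True))   # 6
print(document_length(tokens, stopwords, False))  # 3
```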

3. BuildDocMgr

BuildDocMgr builds a document manager. A DocumentManager is needed to later retrieve the original documents in an index. If an index name is provided, an inverted index is built at the same time.
Summary of required parameters:

manager: required. Name of the document manager (without extension).
managerType: required. Type of the document manager, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr).
docFormat:

  trec for standard TREC formatted documents
  web for web TREC formatted documents
  chinese for segmented Chinese text (TREC format, GB encoding)
  chinesechar for unsegmented Chinese text (TREC format, GB encoding)
  arabic for Arabic text (TREC format, Windows CP1256 encoding)

dataFiles: name of file containing the list of datafile names (one datafile name per line, use full paths).
The following optional parameters are used when building an index:

index: name of the index table-of-content file without any extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
position: store position information (def = 1).
stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
stemmer:

  porter Porter stemmer.
  krovetz Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  arabic Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words
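
The dataFiles parameter of this application expects a list file with one datafile name per line, using full paths. A minimal Python sketch for producing the text of such a list file follows; the function name and the sample file names are hypothetical.

```python
import os


def datafile_list(paths):
    """Return list-file text: one absolute datafile path per line."""
    return "".join(os.path.abspath(path) + "\n" for path in paths)


# Hypothetical datafile names; relative names are made absolute
# against the current working directory.
listing = datafile_list(["collection/part1.trec", "collection/part2.trec"])
print(listing, end="")
```

Writing the returned text to a file gives a list file whose name can then be used as the value of dataFiles.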

4. BuildPropIndex

This application builds an InvFPIndex for a collection of documents with properties associated with terms.

Usage: BuildPropIndex paramfile [datafile1]* [datafile2]* ...
* Data files can be specified either on the command line or in a metafile named by the dataFiles parameter.
The parameters are:

index: name of the index to create (don't include extension)
indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  "brill" for documents with Brill's part-of-speech tags. Documents still need DOC separators between them, similar to Lemur's WebParser. This is the default.
  "identifinder" for documents with Identifinder's named entity tags. Documents still need DOC separators between them, similar to Lemur's WebParser.

stemmer:

  "porter" Porter stemmer.
  "krovetz" Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  "arabic" Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.

5. BuildBasicIndex

This application builds a basic index for a collection of documents.
To use it, follow the general steps of running a Lemur application and set the following variables in the parameter file:

inputFile: the path to the source file.
outputPrefix: a prefix name for your index.

maxDocuments: maximum number of documents to index (default: 1000000)
maxMemory: maximum amount of memory to use for indexing (default: 0x8000000, or 128 MB)

In general, the outputPrefix should be an absolute path, unless you always open the index from the directory in which it is stored. A "table-of-content" (TOC) file named outputPrefix.bsc will be written to the directory where the index is stored. The following is an example of use:
 

 % cat buildparam
   
 inputFile    = /usr0/mydata/source;
 outputPrefix    = /usr0/mydata/index;
 maxDocuments = 200000;
 maxMemory    = 0x10000000;

 % BuildBasicIndex buildparam
 
 The TOC file is /usr0/mydata/index.bsc.
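
The maxMemory values in this section are written in hexadecimal. As a quick check of the arithmetic, this Python sketch converts the documented default and the value used in the example into bytes and megabytes (taking 1 MB as 1024 * 1024 bytes, which matches the "128MB" stated for the default). The function and variable names are illustrative only.

```python
def to_megabytes(num_bytes):
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


default_max_memory = 0x8000000    # documented default
example_max_memory = 0x10000000   # value used in the example above

print(default_max_memory, to_megabytes(default_max_memory))  # 134217728 128.0
print(example_max_memory, to_megabytes(example_max_memory))  # 268435456 256.0
```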
 
 
See also the testing scripts in test_basic_index.sh and the parameter file build_param in the directory data/basicparam.

6. PassageIndexer

This application builds an FP passage index for a collection of documents. Documents are segmented into passages of passageSize terms, with consecutive passages overlapping by passageSize/2 terms.
To use it, follow the general steps of running a Lemur application.
The parameters are:

index: name of the index table-of-content file without the .ifp extension.
memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  trec for standard TREC formatted documents
  web for web TREC formatted documents
  chinese for segmented Chinese text (TREC format, GB encoding)
  chinesechar for unsegmented Chinese text (TREC format, GB encoding)
  arabic for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

  porter Porter stemmer.
  krovetz Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  arabic Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.
passageSize: Number of terms per passage.
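
The passage segmentation described for this application (passages of passageSize terms with an overlap of passageSize/2) can be sketched in Python as a sliding window. This is an illustration only, not the PassageIndexer code: the handling of odd passage sizes and of the final, possibly shorter, passage is an assumption.

```python
def passages(terms, passage_size):
    """Split terms into windows of passage_size terms.

    Each window starts passage_size // 2 terms after the previous one,
    so consecutive windows overlap by about half a passage.
    Assumption: the last window may be shorter than passage_size.
    """
    step = max(1, passage_size // 2)
    result = []
    start = 0
    while start < len(terms):
        result.append(terms[start:start + passage_size])
        if start + passage_size >= len(terms):
            break  # this window already reaches the end of the document
        start += step
    return result


# Ten terms, passageSize = 4: windows start at terms 0, 2, 4 and 6.
for window in passages(list(range(10)), 4):
    print(window)
```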

7. IncIndexer

This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index; otherwise a new index is created.
To use it, follow the general steps of running a Lemur application.
The parameters are:

index: name of the index table-of-content file without the .ifp extension.
memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  trec for standard TREC formatted documents
  web for web TREC formatted documents
  chinese for segmented Chinese text (TREC format, GB encoding)
  chinesechar for unsegmented Chinese text (TREC format, GB encoding)
  arabic for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

  porter Porter stemmer.
  krovetz Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  arabic Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.
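
The idea of adding documents to an existing positional index, rather than rebuilding it, can be illustrated with a toy in-memory index in Python. This is a conceptual sketch only: the class ToyIndex, its zero-based document ids, and its whitespace tokenization are assumptions for illustration and say nothing about how IncIndexer stores its data.

```python
from collections import defaultdict


class ToyIndex:
    """Toy positional inverted index that supports incremental additions."""

    def __init__(self):
        self.doc_names = []                # doc id -> document name
        self.postings = defaultdict(list)  # term -> [(doc id, [positions])]

    def add_documents(self, docs):
        """Add (name, text) pairs; ids continue after existing documents."""
        for name, text in docs:
            doc_id = len(self.doc_names)
            self.doc_names.append(name)
            positions = defaultdict(list)
            for pos, term in enumerate(text.lower().split()):
                positions[term].append(pos)
            for term, pos_list in positions.items():
                self.postings[term].append((doc_id, pos_list))


index = ToyIndex()
index.add_documents([("d1", "lemur builds an index")])  # first run
index.add_documents([("d2", "index the index")])        # later addition
print(index.postings["index"])  # [(0, [3]), (1, [0, 2])]
```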

8. IncPassageIndexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index; otherwise a new index is created. Documents are segmented into passages of passageSize terms, with consecutive passages overlapping by passageSize/2 terms.
To use it, follow the general steps of running a Lemur application.
The parameters are:

index: name of the index table-of-content file without the .ifp extension.
memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: If true, count stopwords in document length.
docFormat:

  trec for standard TREC formatted documents
  web for web TREC formatted documents
  chinese for segmented Chinese text (TREC format, GB encoding)
  chinesechar for unsegmented Chinese text (TREC format, GB encoding)
  arabic for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

  porter Porter stemmer.
  krovetz Krovetz stemmer, requires additional parameters:

    KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

  arabic Arabic stemmer, requires additional parameters:

    arabicStemDir: Path to directory of data files used by the Arabic stemmers.
    arabicStemFunc: Which stemming algorithm to apply, one of:

      arabic_stop : arabic_stop
      arabic_norm2 : table normalization
      arabic_norm2_stop : table normalization with stopping
      arabic_light10 : light9 plus ll prefix
      arabic_light10_stop : light10 and remove stop words

dataFiles: name of file containing list of datafiles to index.
passageSize: Number of terms per passage.

The Lemur Project
Last modified: Fri Feb 13 18:29:36 EST 2004