Lemur Indexing Applications


Contents

  1. BuildInvertedIndex
  2. BuildKeyfileIncIndex
  3. BuildDocMgr
  4. BuildPropIndex
  5. BuildBasicIndex
  6. PassageIndexer
  7. IncIndexer
  8. IncPassageIndexer


1. BuildInvertedIndex (previously called PushIndexer)

This application builds an Inv(FP) index for a collection of documents.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension. Use a full path so that the index can be opened later from other directories, e.g. /lemur/indexes/myindex
  2. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  3. position: store position information (def = 1).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.
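
As a rough illustration of how these parameters might be assembled, the sketch below renders a parameter file for BuildInvertedIndex. It assumes the `name = value;` syntax used in the BuildBasicIndex example in this document. The helper `render_params` and every path and value shown are hypothetical, not part of Lemur.

```python
def render_params(params):
    """Render a dict as parameter-file text, one `name = value;` line per entry."""
    return "".join(f"{name} = {value};\n" for name, value in params.items())


# Hypothetical paths and values, for illustration only.
inverted_index_params = {
    "index": "/lemur/indexes/myindex",       # TOC file name without the .ifp extension
    "memory": 96000000,                      # the documented default, in bytes
    "position": 1,                           # store position information
    "stopwords": "/lemur/data/stopwords.txt",
    "docFormat": "trec",
    "stemmer": "porter",
    "dataFiles": "/lemur/data/filelist.txt",
}

param_text = render_params(inverted_index_params)
print(param_text)
```

The printed text could be saved to a file and passed to the application in the same way as the buildparam file in the BuildBasicIndex section.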

2. BuildKeyfileIncIndex

This application builds or adds to a Keyfile positional index for a collection of documents.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) for index cache (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • "trec" for standard TREC formatted documents
    • "web" for web TREC formatted documents
    • "chinese" for segmented Chinese text (TREC format, GB encoding)
    • "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
    • "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • "arabic" Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.

3. BuildDocMgr

BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. BuildDocMgr also builds an inverted index at the same time if an index name is provided.

Summary of required parameters:

  1. manager: required name of the document manager (without extension)
  2. managerType: required name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
  3. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  4. dataFiles: name of file containing the list of datafile names (one line per datafile name, use full path)

The following parameters are optional and are used when building an index:

  1. index: name of the index table-of-content file without any extension. Use a full path so that the index can be opened later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
  3. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  4. position: store position information (def = 1).
  5. stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
  6. acronyms: name of file containing the acronym list.
  7. countStopWords: If true, count stopwords in document length.
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
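
The dataFiles parameter above names a list file with one datafile per line, given as a full path. A minimal sketch of producing such a list follows. The helper `datafile_list_text` and the example file names are hypothetical, and the sketch assumes the list file needs nothing beyond one path per line.

```python
import os


def datafile_list_text(paths):
    """Return list-file text with one absolute datafile path per line."""
    return "".join(os.path.abspath(path) + "\n" for path in paths)


# Hypothetical datafile names; a relative name is resolved against the
# current working directory so that the list always holds full paths.
list_text = datafile_list_text(["/usr0/mydata/a.trec", "b.trec"])
print(list_text)
```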

4. BuildPropIndex

This application builds an InvFPIndex for a collection of documents with properties associated with terms.

Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...

* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter

The parameters are:

  1. index: name of the index to create (don't include extension)
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
  3. memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • "brill" for documents with Brill's part-of-speech tags. Documents must still be separated by DOC separators, as with Lemur's WebParser. This is the default.
    • "identifinder" for documents with Identifinder's named-entity tags. Documents must still be separated by DOC separators, as with Lemur's WebParser.
  8. stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • "arabic" Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.

5. BuildBasicIndex

This application builds a basic index for a collection of documents.

To use it, follow the general steps of running a Lemur application and set the following variables in the parameter file:

  1. inputFile: the path to the source file.
  2. outputPrefix: a prefix name for your index.
  3. maxDocuments: maximum number of documents to index (default: 1000000)
  4. maxMemory: maximum amount of memory to use for indexing (default: 0x8000000, or 128MB)

In general, outputPrefix should be an absolute path, unless you always open the index from the directory in which it is stored. A "table-of-content" (TOC) file named outputPrefix.bsc will be written in the directory where the index is stored. The following is an example of use:

 

 % cat buildparam

 inputFile    = /usr0/mydata/source;
 outputPrefix = /usr0/mydata/index;
 maxDocuments = 200000;
 maxMemory    = 0x10000000;

 % BuildBasicIndex buildparam

 The TOC file is /usr0/mydata/index.bsc.

See also the testing scripts in test_basic_index.sh and the parameter file build_param in the directory data/basicparam.
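
The memory values in this document are given in bytes, sometimes in hexadecimal and sometimes in decimal. The following sketch only checks the arithmetic behind the figures quoted for maxMemory and converts the 96000000-byte default used by the other applications. The name `to_mib` is a hypothetical helper, and the sketch takes "MB" here to mean 1024 * 1024 bytes, which is what makes 0x8000000 equal 128MB.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


default_max_memory = 0x8000000    # BuildBasicIndex default
example_max_memory = 0x10000000   # value used in the buildparam example
other_default = 96000000          # decimal default quoted for the other applications

print(to_mib(default_max_memory))  # 128.0
print(to_mib(example_max_memory))  # 256.0
print(round(to_mib(other_default), 2))
```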

6. PassageIndexer

This application builds an FP passage index for a collection of documents. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.
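
The passage segmentation described at the start of this section can be sketched as a sliding window that advances by half a passage at a time. This is an illustration of the idea only, not PassageIndexer's actual code. The function name `segment_passages`, the use of integer division for odd sizes, and the handling of the final short passage are all assumptions.

```python
def segment_passages(terms, passage_size):
    """Split terms into windows of passage_size that overlap by passage_size // 2."""
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    if not terms:
        return []
    step = passage_size // 2  # assumed: consecutive passages share half their terms
    passages = []
    start = 0
    while True:
        passages.append(terms[start:start + passage_size])
        # Assumed stopping rule: finish once a window reaches the end of the document.
        if start + passage_size >= len(terms):
            break
        start += step
    return passages


doc_terms = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
for passage in segment_passages(doc_terms, 4):
    print(" ".join(passage))
```

With ten terms and a passageSize of 4, this sketch yields four passages, each sharing two terms with the one before it.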

7. IncIndexer

This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index; otherwise, a new index is created.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
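
A wrapper script may want to know in advance whether a run of IncIndexer will create a new index or add to an existing one. The sketch below makes that guess by looking for the table-of-content file. It assumes the TOC file is the index parameter value with .ifp appended, as the description of the index parameter suggests. The function `index_mode` is hypothetical, and how IncIndexer itself makes this decision is not specified in this document.

```python
import os


def index_mode(index_name):
    """Guess whether an index would be created or appended to.

    Assumes the TOC file is index_name + ".ifp"; this is an illustration,
    not IncIndexer's own logic.
    """
    toc_path = index_name + ".ifp"
    return "append" if os.path.exists(toc_path) else "create"


# Hypothetical index name, for illustration only.
print(index_mode("/lemur/indexes/myindex"))
```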

8. IncPassageIndexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index; otherwise, a new index is created. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

The Lemur Project
Last modified: Fri Feb 13 18:29:36 EST 2004