
Indri Repository Builder

This application builds an Indri Repository for a collection of documents. Parameter formats for all Indri applications are also described in IndriParameters.html.

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor, given as a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
index
path to where to place the Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
corpus
a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are:
path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • trecweb -- TREC web format, e.g. terabyte track.
  • trectext -- TREC format, e.g. TREC-3 onward.
  • doc -- Microsoft Word format (Windows platform only).
  • ppt -- Microsoft PowerPoint format (Windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
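Because the corpus element can be specified multiple times, a parameter file that indexes two collections of different classes might look like the sketch below. Both paths are placeholders invented for illustration, and the XML comment is an annotation for the reader that can be dropped.

```xml
<!-- Illustrative only: both paths are hypothetical placeholders. -->
<corpus>
  <path>/path/to/web_collection</path>
  <class>trecweb</class>
</corpus>
<corpus>
  <path>/path/to/pdf_documents</path>
  <class>pdf</class>
</corpus>
```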
metadata
a complex element containing one or more field entries specifying the metadata fields to index, e.g. DOCNO. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
field
a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times. The subelements are:
name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
integer value of 1 if the field contains numeric data, otherwise 0, specified as <field><numeric>0</numeric></field> in the parameter file and as -field.numeric=0 on the command line. This is an optional parameter, defaulting to 0.
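Combining the two subelements, a parameter file that indexes one text field and one numeric field might contain the following sketch. TITLE is the example field name used above; YEAR is a hypothetical field name chosen only for illustration. The XML comment is an annotation for the reader and can be dropped.

```xml
<field>
  <name>TITLE</name>
  <numeric>0</numeric>
</field>
<!-- YEAR is a hypothetical field name, used here to show a numeric field. -->
<field>
  <name>YEAR</name>
  <numeric>1</numeric>
</field>
```

Since numeric is optional and defaults to 0, the first entry could equally omit it.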
stemmer
a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
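Putting the parameters above together, a complete parameter file might look like the following sketch. The enclosing <parameters> root element is an assumption not stated in this section (IndriParameters.html is the authoritative description of the file format), and the paths, field names, and stopwords are placeholder values.

```xml
<!-- Sketch only: the <parameters> root element is assumed, not taken from this page. -->
<parameters>
  <memory>100M</memory>
  <index>/path/to/repository</index>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
  </corpus>
  <metadata>
    <field>DOCNO</field>
  </metadata>
  <field>
    <name>TITLE</name>
    <numeric>0</numeric>
  </field>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <!-- Placeholder stopwords; one word subelement per stopword. -->
  <stopper>
    <word>the</word>
    <word>of</word>
  </stopper>
</parameters>
```

Removing the stemmer and stopper elements would fall back to the defaults of no stemming and no stopping.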


Generated on Wed Nov 3 13:00:02 2004 for Lemur Toolkit by doxygen1.2.18