This application builds an Indri Repository for a collection of documents. Parameter formats for all Indri applications are also described in IndriParameters.html.
Repository construction parameters
- memory
- an integer value specifying the number of bytes to use for the indexing process. The value may carry a scaling suffix (case insensitive): K = 1000, M = 1000000, G = 1000000000. For example, 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as
-memory=100M
on the command line.
- index
- path where the Indri Repository is to be placed. Specified as <index>/path/to/repository</index> in the parameter file and as
-index=/path/to/repository
on the command line.
- corpus
- a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are:
- path
- The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as
-corpus.path=/path/to/file_or_directory
on the command line.
- class
- The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as
-corpus.class=trecweb
on the command line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, e.g., terabyte track.
- trectext -- TREC format, e.g., TREC-3 onward.
- doc -- Microsoft Word format (Windows platform only).
- ppt -- Microsoft PowerPoint format (Windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
<path>/path/to/file_or_directory</path>
<class>trecweb</class>
</corpus>
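Because the corpus element can be specified multiple times, a parameter file for several collections can be generated rather than written by hand. The following is a minimal sketch in Python, not part of the Indri distribution. The helper name build_corpus_parameters and the two example paths are hypothetical, and the enclosing <parameters> root element is an assumption that is not stated in this page.

```python
import xml.etree.ElementTree as ET


def build_corpus_parameters(corpora):
    """Return parameter-file XML with one <corpus> element per (path, class) pair.

    Hypothetical helper for illustration only.
    """
    # Assumption: the parameter file uses <parameters> as its root element.
    root = ET.Element("parameters")
    for path, file_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = file_class
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Example paths are placeholders, not real locations.
xml_text = build_corpus_parameters([
    ("/path/to/web_collection", "trecweb"),
    ("/path/to/news_collection", "trectext"),
])
print(xml_text)
```

Each (path, class) pair yields its own corpus element, mirroring the repeated -corpus.path and -corpus.class options on the command line.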
- metadata
- a complex element containing one or more field entries specifying the metadata fields to index, e.g., DOCNO. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as
-metadata.field=fieldname
on the command line.
- field
- a complex element specifying the fields to index as data, e.g., TITLE. This parameter can appear multiple times. The subelements are:
- name
- the field name, specified as <field><name>fieldname</name></field> in the parameter file and as
-field.name=fieldname
on the command line.
- numeric
- integer value of 1 if the field contains numeric data, otherwise 0, specified as <field><numeric>0</numeric></field> in the parameter file and as
-field.numeric=0
on the command line. This is an optional parameter, defaulting to 0.
- stemmer
- a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as
-stemmer.name=stemmername
on the command line. This is an optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as
-stopper.word=stopword
on the command line. This is an optional parameter with the default of no stopping.
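The suffix rule described for the memory parameter (decimal digits plus an optional case-insensitive K, M, or G) can be illustrated with a short Python sketch. This is not Indri's own parsing code. The function name parse_memory is hypothetical and only mirrors the rule as stated above.

```python
import re

# Decimal scaling factors as listed for the memory parameter.
_SCALE = {"": 1, "K": 1000, "M": 1000000, "G": 1000000000}

# Decimal digits followed by an optional single suffix letter.
_PATTERN = re.compile(r"([0-9]+)([KMG]?)", re.IGNORECASE)


def parse_memory(value):
    """Convert a string such as '100M' to a byte count (hypothetical helper)."""
    match = _PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError("expected decimal digits with optional K, M or G: %r" % value)
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.upper()]


print(parse_memory("100M"))  # 100000000, matching the example in the text
```

Values such as 1.5G or 100MB are rejected by this sketch because the rule allows only decimal digits and a single optional suffix.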
Generated on Wed Nov 3 13:00:02 2004 for Lemur Toolkit by Doxygen 1.2.18