Lemur Applications User's Guide (Version 2.0)


Contents

  1. Where are the Lemur Applications

  2. Running a Lemur Application

  3. Documentation of Individual Applications



1. Where are the Lemur Applications

The source code for all the Lemur applications is in the directory app/src. After running "gmake", you will see the executables for the applications generated in the directory app/obj. After running "gmake install", all the applications will have been copied to the directory "LEMUR_INSTALL_PATH/lemur/bin".

2. Running a Lemur Application

The usage for different applications may vary, but most applications tend to have the following general usage.
  1. Create a parameter file with value definitions for all the input variables of an application. Terminate each line with a semicolon. For example,
        index = /usr0/mydata/index.bsc;
    
    specifies a particular path to the "table-of-contents" file of a basic index for the input variable index. (The special suffix ".bsc" means basic index; it is ".ifp" for the position index. An index manager will be able to recognize the suffix and open the index with the corresponding index class.)

    In general, all the file paths must be absolute paths. Lemur does not have the capability of searching for files along different paths.

    Most applications will display a list of the major required input variables if you run them with the "--help" option. But, generally, you should read this documentation or the doxygen documentation to find out the exact parameters that an application recognizes.



  2. Run the application program with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line. Most applications only recognize parameters defined in the parameter file, but there are a few exceptions. PushIndexer allows optional arguments in addition to the parameter file, which specify the source file(s) to use for building the index.
  3. For newer versions of gcc, the applications require a shared library at run time, and you will need to set the environment variable "LD_LIBRARY_PATH" to include the path to the corresponding shared library. See the gcc documentation for more details.
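Putting the steps above together, a minimal session might look like the following sketch. The library path and parameter values here are illustrative assumptions; substitute the paths from your own installation.

```shell
# Assumed install location -- adjust to wherever Lemur's shared library lives.
export LD_LIBRARY_PATH=/usr/local/lemur/lib:$LD_LIBRARY_PATH

# Step 1: create a parameter file; every definition ends with a semicolon.
cat > myparam <<'EOF'
index = /usr0/mydata/index.bsc;
EOF

# Step 2: run the application with the parameter file as the first argument.
# (Uncomment once the Lemur binaries are installed and on your PATH.)
# RetEval myparam
```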

3. Documentation of Individual Applications

In this section, we provide detailed documentation of each individual application. There are many good examples of using all these applications in the directory data and the subdirectories there. In particular, test_basic_index.sh and test_pos_index.sh have commented example commands for building an index and running different kinds of retrieval experiments, using the basic indexer and the position indexer respectively. Subdirectories basicparam and posparam have many example parameter files that you can modify to run your own experiments.

The applications are grouped into the following categories: Pre-processing; Building/Adding to an index; Retrieval and Evaluation; Summarization; Query-based Sampling; Distributed; Structured Query Language.

ParseQuery

ParseQuery parses queries using either the TrecParser or WebParser class and an Index.

Usage: ParseQuery paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

  1. queryOutFile The name of the file to write the parsed queries to.

  2. index Name of the index (with the .ifp or .bsc extension).

  3. stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are left in the query. 

  4. acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g., USA, U.S.A., USAs, USA's) are left uppercase as USA if USA is in the acronym list. If no acronym list is specified, acronyms will not be recognized.

  5. docFormat Specify trec for standard TREC formatted documents or web for web TREC formatted documents. The default is trec.

  6. stemmer Specify porter to use Porter's stemmer. If no stemmer is specified, no stemmer will be used.
For more information on parsing, please see Parsing in Lemur.
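As a sketch, a ParseQuery parameter file built from the parameters above might look like this (all file paths are hypothetical; adjust them to your data):

```shell
cat > parsequery.param <<'EOF'
index        = /usr0/mydata/index.bsc;
queryOutFile = /usr0/mydata/queries.parsed;
stopwords    = /usr0/mydata/stopwords;
docFormat    = trec;
stemmer      = porter;
EOF
# ParseQuery parsequery.param query.file1 query.file2
```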

ParseToFile

ParseToFile parses documents and writes output compatible with BuildBasicIndex. The program uses either the TrecParser class or WebParser class to parse.

Usage: ParseToFile paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

  1. outputFile Name of file to output parsed documents to.

  2. stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are output to the file.


  3. acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g., USA, U.S.A., USAs, USA's) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.

  4. docFormat Specify trec for standard TREC formatted documents or web for web TREC formatted documents. The default is trec.

  5. stemmer Specify porter to use Porter's stemmer. If no stemmer is specified, no stemmer will be used.
For more information on parsing, please see Parsing in Lemur.

ParseInQueryOp

This application ( ParseInqueryOp.cpp ) parses a file containing structured queries into BasicDocStream format. The parameters are:

  1. stopwords: name of file containing the stopword list.
  2. acronyms: name of file containing the acronym list.
  3. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  4. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  5. outputFile: name of the output file.
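For example, a ParseInQueryOp parameter file using the parameters above could be sketched as follows (the paths, including the Krovetz stemmer data directory, are assumptions):

```shell
cat > parseinq.param <<'EOF'
stopwords   = /usr0/mydata/stopwords;
docFormat   = trec;
stemmer     = krovetz;
KstemmerDir = /usr0/lemur/data/kstem;
outputFile  = /usr0/mydata/queries.inq;
EOF
# ParseInQueryOp parseinq.param structquery.file
```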

The structured query operators are:

   Sum Operator:   #sum (T1 ... Tn )

     The terms or nodes contained in the sum operator are treated as
     having equal influence on the final result.  The belief values
     provided by the arguments of the sum are averaged to produce the
     belief value of the #sum node.

   Weighted Sum Operator:  #wsum (W1 T1 ... Wn Tn)

     The terms or nodes contained in the wsum operator contribute
     unequally to the final result according to the weight associated
     with each (Wx).  Note that this is a change from the InQuery
     operator, as there is no initial weight, Ws, for scaling the belief
     value of the sum.

   Ordered Distance Operator:  #N (T1 ... Tn)  or #odN (T1 ... Tn)

     The terms within an ODN operator must be found within N words of
     each other in the text in order to contribute to the document's
     belief value.  The "#N" version is an abbreviation of #ODN, thus
     #3(health care) is equivalent to #od3(health care).

   Un-ordered Window Operator:  #uwN(T1 ... Tn)

     The terms contained in a UWN operator must be found in any order
     within a window of N words in order for this operator to contribute
     to the belief value of the document.

   Phrase Operator:  #phrase(T1 ... Tn)

     The operator is treated as an ordered distance operator of 3
     (#od3).

   Passage Operator:  #passageN(T1 ... Tn)

     The passage operator looks for the terms or nodes within the
     operator to be found in a passage window of N words.  The document
     is rated based upon the score of its best passage.

   Synonym Operator:  #syn(T1 ... Tn)

     The terms of the operator are treated as instances of the same
     term.

   And Operator:  #and(T1 ... Tn)

     The more terms contained in the AND operator which are found in a
     document, the higher the belief value of that document.

   Boolean And Operator:  #band(T1 ... Tn)

     All of the terms within a BAND operator must be found in a document
     in order for this operator to contribute to the belief value of
     that document.

   Boolean And Not Operator:  #bandnot (T N)

     Search for documents matching the first argument but not the second.
     
   Or Operator:  #or(T1 ... Tn)

     One of the terms within the OR operator must be found in a document for
     that document to get credit for this operator.


   Maximum Operator:  #max(T1 ... Tn)

     The maximum belief value of all the terms or nodes contained in the
     MAX operator is taken to be the belief value of this operator.

   Filter Require Operator: #filreq(arg1 arg2)

     Use the documents returned (belief list) of the first argument if
     and only if the second argument would return documents.  The value
     of the second argument does not affect the belief values of the
     first argument; only whether they will be returned or not.

   Filter Reject Operator: #filrej(arg1 arg2)

     Use the documents returned by the first argument if and only if
     there were no documents returned by the second argument.  The value
     of the second argument does not affect the belief values of the
     first argument; only whether they will be returned or not.

   Negation Operator:  #not(T1)

     The term or node contained in this operator is negated so that
     documents which do not contain it are rewarded.  

The input query file is of the form:

#qN = queryNode ;
where N is the query id and queryNode is one of the aforementioned query operators. The query may span multiple lines and must be terminated with the semicolon. The body of the query must not contain a semicolon, as that will prematurely terminate the query.

An example query:

#q18=#wsum(1 #sum(Languages and compilers for #1(parallel processors))
 2 #sum(highly horizontal microcoded machines)
 1 code 1 compaction
);
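As a further hedged sketch, here is a query file with two structured queries, each terminated by a semicolon; the query ids and query text are invented for illustration, using only the operators documented above:

```shell
cat > struct.queries <<'EOF'
#q1 = #band( #od2(query expansion) retrieval );
#q2 = #filreq( #sum(language model) #uw5(smoothing dirichlet) );
EOF
```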

PushIndexer / BuildInvertedIndex

This application builds an Inv(FP) index for a collection of documents.

To use it, follow the general steps of running a lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension. Use a full path here so the index can later be used from other directories, e.g., /lemur/indexes/myindex.
  2. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  3. position: store position information (def = 1).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.
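The parameters above can be sketched as a PushIndexer parameter file plus a datafile list; every path below is an assumption to be replaced with your own:

```shell
# List of source files to index, one per line (paths are hypothetical).
cat > filelist <<'EOF'
/usr0/mydata/source1
/usr0/mydata/source2
EOF

cat > build.param <<'EOF'
index     = /usr0/mydata/myindex;
memory    = 96000000;
position  = 1;
docFormat = trec;
stemmer   = porter;
dataFiles = filelist;
EOF
# PushIndexer build.param
```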

BuildBasicIndex

This application builds a basic index for a collection of documents.

To use it, follow the general steps of running a Lemur application and set the following variables in the parameter file:

  1. inputFile: the path to the source file.
  2. outputPrefix: a prefix name for your index.
  3. maxDocuments: maximum number of documents to index (default: 1000000)
  4. maxMemory: maximum amount of memory to use for indexing (default:0x8000000, or 128MB)

In general, the outputPrefix should be an absolute path, unless you always open the index from the same directory as where the index is. A "table-of-content" (TOC) file with a name of the format outputPrefix.bsc will be written in the directory where the index is stored. The following is an example of use:

 

 % cat buildparam
   
 inputFile    = /usr0/mydata/source;
 outputPrefix    = /usr0/mydata/index;
 maxDocuments = 200000;
 maxMemory    = 0x10000000;

 % BuildBasicIndex buildparam
 
 The TOC file is /usr0/mydata/index.bsc.
 
 
See also the testing scripts in test_basic_index.sh and the parameter file build_param in the directory data/basicparam.

PassageIndexer

This application builds an FP passage index for a collection of documents. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps of running a lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

IncIndexer

This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created.

To use it, follow the general steps of running a lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.

IncPassageIndexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps of running a lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.


GenerateSmoothSupport

This application generates two support files for retrieval using the language modeling approach. Both files contain some pre-computed quantities that are needed to speed up the retrieval process.

One file (name given by the parameter smoothSupportFile, see below) is needed by retrieval using smoothed unigram language model. Each entry in this support file corresponds to one document and records two pieces of information: (a) the count of unique terms in the document; (b) the sum of collection language model probabilities for the words in the document.

The other file (with an extra suffix ".mc") is needed if you run feedback based on the Markov chain query model. Each line in this file contains a term and the sum of the probabilities of the word given all documents in the collection (i.e., a sum of p(w|d) over all possible d's).

To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:

  1. index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex or the .ifp file created by PushIndexer. )
  2. smoothSupportFile: file path for the support file (e.g., /usr0/mydata/index.supp)

This application is also a good example of using the doc index (i.e., doc->term index).
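A GenerateSmoothSupport parameter file therefore needs only the two variables above; as a sketch (paths are illustrative):

```shell
cat > smooth.param <<'EOF'
index             = /usr0/mydata/index.bsc;
smoothSupportFile = /usr0/mydata/index.supp;
EOF
# GenerateSmoothSupport smooth.param
```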

RetEval

This application runs retrieval experiments (with/without feedback) to evaluate different retrieval models as well as different parameter settings for those models.

Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-TREC format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.
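The three-column working-set format described above can be sketched as follows; the query ids, document ids, and values are invented for illustration:

```shell
cat > workset <<'EOF'
Q51 WSJ870101-0001 2.31
Q51 WSJ870101-0042 1.87
Q52 AP880212-0017 0.95
EOF
```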

It currently supports three different models:

  1. The popular TFIDF retrieval model
  2. The Okapi BM25 retrieval function
  3. The KL-divergence language model based retrieval method

The parameter to select the model is retModel (with value 0 for TFIDF, 1 for Okapi, and 2 for KL). It is suspected that there is a bug in the implementation of feedback for the Okapi BM25 retrieval function, because the performance is not as expected.

Other common parameters (for all retrieval methods) are:

  1. index: The complete name of the index table-of-content file for the database index.
  2. textQuery: the query text stream
  3. resultFile: the result file
  4. TRECResultFormat: whether the result format is of the TREC format (i.e., six-column) or just a simple three-column format. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
  5. resultCount: the number of documents to return as result for each query
  6. feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
  7. feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-div. approach, the actual number of terms is also affected by two other parameters. (See below.)
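The common parameters above can be collected into a RetEval parameter file like the following sketch (paths are hypothetical; retModel = 2 selects the KL-divergence model):

```shell
cat > reteval.param <<'EOF'
index             = /usr0/mydata/index.bsc;
retModel          = 2;
textQuery         = /usr0/mydata/queries.parsed;
resultFile        = /usr0/mydata/result;
TRECResultFormat  = 1;
resultCount       = 1000;
feedbackDocCount  = 0;
feedbackTermCount = 20;
EOF
# RetEval reteval.param
```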

Model-specific parameters are:



RelFBEval

This application (RelFBEval.cpp) runs retrieval experiments with relevance feedback. Different retrieval models can be used with different settings for the corresponding parameters. Although this program is designed for relevance feedback, it can be easily used for pseudo feedback -- you just need to set the parameter feedbackDocuments to a result file, i.e., interpreting a result file as if all the entries represent relevant documents.

Two important notes:

Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-TREC format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.

It currently supports three different models:

  1. The popular TFIDF retrieval model
  2. The Okapi BM25 retrieval function
  3. The KL-divergence language model based retrieval method

The parameter to select the model is retModel (with value 0 for TFIDF, 1 for Okapi, and 2 for KL). It is suspected that there is a bug in the implementation of feedback for the Okapi BM25 retrieval function, because the performance is not as expected.

Other common parameters (for all retrieval methods) are:

  1. index: The complete name of the index table-of-content file for the database index.
  2. textQuerySet: the query text stream
  3. resultFile: the result file
  4. resultCount: the number of documents to return as result for each query
  5. feedbackDocuments : the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
  6. feedbackDocCount: the number of docs to use for feedback (negative value means using all judged documents for feedback). The documents in the feedbackDocuments are sorted in decreasing order according to the numerical value in the third column, and then the top documents are used for feedback.
  7. feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-div. approach, the actual number of terms is also affected by two other parameters. (See below.)

Model-specific parameters are:



GenerateQueryModel

This application (GenerateQueryModel.cpp) computes an expanded query model based on feedback documents and the original query model for the KL-divergence retrieval method. It can be regarded as performing a feedback in the language modeling approach to retrieval. The original query model can be computed based on the original query text (when the parameter "initQuery" is not set, or set to a null string), or based on a previously saved query model (the model is given by the parameter "initQuery"). Expanding a saved query model makes it possible to do iterative feedback. Feedback can be based on true relevance judgments or any previously returned retrieval results.

Two important notes:

Parameters:

  1. index: The complete name of the index table-of-content file for the database index.
  2. smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
  3. textQuery: the original query text stream
  4. initQuery: the file with a saved initial query model. When this parameter is set to a non-empty string, the model stored in this file will be used for expansion; otherwise, the original query text is used as the initial query model for expansion.
  5. feedbackDocuments: the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
  6. TRECResultFormat: whether the feedback document file (given by feedbackDocuments) is of the TREC format (i.e., six-column) or just a simple three-column format. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format). VERY IMPORTANT: For relevance feedback, TRECResultFormat should always be set to 0, since the judgment file is always in the simple format.
  7. expandedQuery: the file to store the expanded query model
  8. feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
  9. queryUpdateMethod: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
  10. Method-specific feedback parameters:

    For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:

    1. feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
    2. feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
    3. feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
    4. feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.

    Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.

    All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

    • For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
    • For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
    • For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1- alpha, where alpha is the stopping probability while walking through the chain.

    In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details. )
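A GenerateQueryModel parameter file combining the parameters above might be sketched as follows; the paths are assumptions, and queryUpdateMethod = 0 selects the collection mixture model:

```shell
cat > qmodel.param <<'EOF'
index               = /usr0/mydata/index.bsc;
smoothSupportFile   = /usr0/mydata/index.supp;
textQuery           = /usr0/mydata/queries.parsed;
feedbackDocuments   = /usr0/mydata/result;
TRECResultFormat    = 1;
expandedQuery       = /usr0/mydata/expquery;
feedbackDocCount    = 10;
queryUpdateMethod   = 0;
feedbackCoefficient = 0.5;
feedbackTermCount   = 50;
EOF
# GenerateQueryModel qmodel.param
```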



QueryModelEval

This application loads an expanded query model (e.g., one computed by GenerateQueryModel), and evaluates it with the KL-divergence retrieval model.

Parameters:

  1. index: The complete name of the index table-of-content file for the database index.
  2. smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
  3. queryModel: the file of the query model to be evaluated
  4. resultFile: the result file
  5. TRECResultFormat: whether the result format should be of the TREC format (i.e., six-column) or just a simple three-column format <queryID, docID, score>. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
  6. resultCount: the number of documents to return as result for each query

    The following are document model smoothing parameters:

  7. smoothMethod: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
  8. smoothStrategy: Either interpolate (0) or backoff (1)
  9. JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
  10. DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
  11. discountDelta: The delta (discounting constant) in the absolute discounting method. Default 0.7.
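A QueryModelEval parameter file using the parameters above could be sketched as (paths are hypothetical; smoothMethod = 1 selects Dirichlet prior smoothing):

```shell
cat > qeval.param <<'EOF'
index             = /usr0/mydata/index.bsc;
smoothSupportFile = /usr0/mydata/index.supp;
queryModel        = /usr0/mydata/expquery;
resultFile        = /usr0/mydata/result.exp;
resultCount       = 1000;
smoothMethod      = 1;
DirichletPrior    = 1000;
EOF
# QueryModelEval qeval.param
```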




EstimateDirPrior

This application (EstimateDirPrior.cpp) uses the leave-one-out method to estimate an optimal setting for the Dirichlet prior smoothing parameter (i.e., the "prior sample size").

To run the application, follow the general steps of running a lemur application and set the following variables in the parameter file:

  1. index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex)
  2. initValue: the initial value for the parameter in the Newton method. The default value is 1. In general, you do not need to set this parameter.

    After completion, it will print out the estimated parameter value to the standard output.

TwoStageRetEval

This application (TwoStageRetEval.cpp) runs retrieval experiments (with or without feedback) in exactly the same way as the application RetEval.cpp, except that it always uses the two-stage smoothing method for the initial retrieval and the KL-divergence model for feedback. It thus ignores the parameter retModel.

It recognizes all the parameters relevant to the KL-divergence retrieval model, except for the smoothing method parameter SmoothMethod, which is forced to "Two-stage Smoothing" (a value of 3), and JelinekMercerLambda, which is ignored because the application automatically estimates its value using a mixture model. For details on all the parameters, see the documentation for RetEval.

To achieve the effect of the completely automatic two-stage smoothing method, the parameter DirichletPrior should be set to the value estimated by the application EstimateDirPrior, which computes a maximum likelihood estimate of DirichletPrior based on "leave-one-out".
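For instance, if EstimateDirPrior printed an estimate of 583, a TwoStageRetEval parameter file might include lines such as the following (all paths and values are illustrative; see the RetEval documentation for the full parameter list):

```
index = /usr0/mydata/index.bsc;
textQuery = /usr0/mydata/queries.txt;
resultFile = /usr0/mydata/result.twostage;
DirichletPrior = 583;
```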

GenL2Norm

This application (GenL2Norm.cpp) generates a support file for retrieval using the cosine similarity. The file contains the L2 norm for each document, which is used to speed up the retrieval process. To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:

  1. index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex or the .ifp file created by PushIndexer)
  2. L2File: file path for the support file (e.g., /usr0/mydata/index.L2)

This application is also a good example of using the doc index (i.e., the doc->term index).
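Using the example paths above, a complete parameter file for GenL2Norm could look like this:

```
index = /usr0/mydata/index.bsc;
L2File = /usr0/mydata/index.L2;
```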

QueryClarity

This application (QueryClarity.cpp) computes clarity scores for a query model using the KL-divergence retrieval method. The query model can be an expanded model based on feedback documents or the original query model. The original query model is computed from the original query text (when the parameter "initQuery" is not set, or is set to a null string), or loaded from a previously saved query model (given by the parameter "initQuery"). If feedbackDocCount is 0, the application computes the clarity score only for the original or given query model. Clarity scores for each entire query, and for each individual term within each query, are written to the file specified by the parameter "expandedQuery". Feedback can be based on true relevance judgments or on any previously returned retrieval results.

Two important notes:

• All the feedback algorithms currently in Lemur assume that all entries in a judgment file are relevant documents, so you must remove all the entries of judged non-relevant documents. However, the judgment status is recorded in the internal representation of judgments, so it is possible to distinguish judged relevant documents from judged non-relevant documents in a feedback algorithm.
• The format of the judgment file, when used for feedback, must have three columns, i.e., with the second column removed so that each line has a query id, a document id, and a judgment value. This keeps it consistent with the format of a result file. An alternative would be to use the original four-column format directly, but then we would need a parameter to distinguish that format from the three-column format of a result file.
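As a sketch of the required conversion, the following awk one-liner drops the second column of a four-column TREC-style judgment file and also discards judged non-relevant entries, assuming the fourth column holds the judgment value with 0 meaning non-relevant (file names are illustrative):

```shell
# Keep query id, document id, and judgment value (columns 1, 3, 4),
# dropping judged non-relevant lines (judgment value 0).
awk '$4 != 0 { print $1, $3, $4 }' judgments.trec > judgments.3col
```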

Parameters:

1. index: the complete name of the index table-of-content file for the database index.
2. smoothSupportFile: the name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
3. textQuery: the original query text stream
4. initQuery: the file with a saved initial query model. When this parameter is set to a non-empty string, the model stored in this file will be used for expansion; otherwise, the original query text is used as the initial query model for expansion.
5. feedbackDocuments: the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three columns.
6. TRECResultFormat: whether the feedback document file (given by feedbackDocuments) is in the TREC format (i.e., six-column) or a simple three-column format. Integer value: zero for non-TREC format, non-zero for TREC format. Default: 1 (i.e., TREC format). VERY IMPORTANT: For relevance feedback, TRECResultFormat should always be set to 0, since the judgment file is always in the simple format.
7. expandedQuery: the file to store the query clarity scores.
8. feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no feedback)
9. queryUpdateMethod: feedback method (0, 1, 2, 3, 4 for mixture model, divergence minimization, Markov chain, relevance model 1, and relevance model 2, respectively).
10. Method-specific feedback parameters:

   For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed from the feedback documents), the following four parameters apply:

   1. feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning use only the original model (thus no updating/feedback) and 1 meaning use only the feedback model (thus ignoring the original model).
   2. feedbackTermCount: truncate the feedback model to no more than a given number of words/terms.
   3. feedbackProbThresh: truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
   4. feedbackProbSumThresh: truncate the feedback model once the sum of the probabilities of the included words reaches this threshold. Default value: 1.

   Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.

   The mixture model, divergence minimization, and Markov chain methods all recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations:

   • For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model when a feedback document is "generated".
   • For the divergence minimization method, feedbackMixtureNoise is the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
   • For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.

   In addition, the collection mixture model also recognizes the parameter emIterations, the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by a hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
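Putting the parameters together, a pseudo-feedback run of QueryClarity with the mixture model might use a parameter file like the following (all paths and values are illustrative):

```
index = /usr0/mydata/index.bsc;
smoothSupportFile = /usr0/mydata/index.supp;
textQuery = /usr0/mydata/queries.txt;
feedbackDocuments = /usr0/mydata/result.trec;
TRECResultFormat = 1;
expandedQuery = /usr0/mydata/clarity.out;
feedbackDocCount = 10;
queryUpdateMethod = 0;
feedbackCoefficient = 0.5;
feedbackTermCount = 20;
```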

ireval.pl

This is a Perl script that does TREC-style retrieval evaluation. The usage is

    ireval.pl -j judgmentfile < resultfile

if the resultfile is in the simple three-column format (i.e., queryid, docid, score), or

    ireval.pl -j judgmentfile -trec < resultfile

if the resultfile is in the six-column TREC format.

BasicSummApp

This application demonstrates the simplest summarizer one can create with the provided API. The sentence selection algorithm is a quick scoring algorithm that scores all passages that make up a document. Passages are then "pulled" back out of the summarizer.

NOTE: This summarizer will attempt to locate end-of-sentence markers in the document vector for a particular document. This is currently done by looking for the token "*eos". If no such tokens are located, it chops the document into sequential passages of a fixed length and scores those. The file webparser_extended.l is provided if you wish to use it; replacing the parser with one generated by this lex file will translate <s> tokens in a source document into *eos tokens for you. It will also identify titles in HTML documents by inserting the special token *title prior to terms in the document vectors that appear inside the HTML <title> tag.

MMRSummApp

This application demonstrates a more complex summarizer that does comparisons between passages. The algorithm requires a query, as it is query-based by nature, although it will auto-generate a query appropriate for the document if one is not provided. The application itself, however, is also a simple program; the complexity is encapsulated in the class MMRSumm.

NOTE: See the note above regarding *eos and *title markers. This implementation also uses pronoun identification in the same way, if available; the algorithm will work without it. The previously mentioned webparser_extended.l will recognize a <pronoun> tag in a source document, assuming it appears just prior to a pronoun in the text, as used by this application.

CollSelIndex

CollSelIndex builds a collection selection database using either document frequency or collection term frequency for the database's term frequency counts.

Usage: CollSelIndex paramfile [datfile1]* [datfile2] ...

Summary of parameters in paramfile:

1. dfIndex: name of the index to build using document frequency counts (without the .ifp extension)
2. ctfIndex: name of the index to build using collection term frequency (without the index extension)
3. dfCounts: name of the file to write out counts (needed for ranking)
4. dfDocs
5. countStopWords
6. memory: memory (in bytes) for PushIndex (default: 96000000)
7. stopwords: name of the file containing the stopword list, one word per line. If this parameter is not specified, all words are indexed.
8. acronyms: name of the file containing the acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.
9. docFormat: specify "trec" for standard TREC formatted documents or "web" for web TREC formatted documents. The default is "trec".
10. stemmer: specify "porter" to use Porter's stemmer. If no stemmer is specified, no stemmer will be used.
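A CollSelIndex parameter file might look like the following sketch (all paths are hypothetical):

```
dfIndex = /usr0/mydata/collsel_df;
ctfIndex = /usr0/mydata/collsel_ctf;
dfCounts = /usr0/mydata/collsel.cnt;
memory = 96000000;
stopwords = /usr0/mydata/stopwords.txt;
docFormat = trec;
stemmer = porter;
```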

DistRetEval

This is a sample application that does distributed retrieval, using a resource selection index and individual indexes. Resource selection is done using CORI_CS (the only resource selection method implemented thus far). Results merging uses either CORIMergeMethod or SingleRegrMergeMethod (in either case, the retrieval method of each individual database should be CORI_DOC).

1. index: the collection selection database
2. collCounts: collection counts file for the collection selection index (needed by CORI)
3. ranksFile: name of the file to write ranking results (optional)
4. resultFile: file to write final results
5. resultCount: maximum number of results to output for each query (default: 1000)
6. textQuery: file of text queries in docstream format
7. cutoff: maximum number of databases to search (default: 10)
8. "dbids" = "db's param file": required for each database in the collection selection index. The key should be the database character id string and the value should be the name of the file that holds the parameters for that database:
     index = the individual database
     retModel = the retrieval model to use
     "modelvals" - whatever parameters are required for that retModel
9. CSTF_factor: the TFfactor parameter in the CORI_CS resource selection method
10. CSTF_baseline: the TFbaseline parameter in the CORI_CS resource selection method
11. mergeMethod: resource merging method (0 for the CORI results merging method, 1 for the single regression results merging method)
12. Merging method-specific parameters:

     For the CORI merging method: none.

     For the single regression merging method:
     1. csDbDataBaseIndex: the centralized sampling database index
     2. DOCTF_factor: the TFfactor parameter in the CORI_DOC retrieval method for the centralized sampling database
     3. DOCTF_baseline: the TFbaseline parameter in the CORI_DOC retrieval method for the centralized sampling database
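As an illustration, a DistRetEval parameter file for two databases with CORI results merging might look like this (all ids, paths, and values are hypothetical; db1 and db2 stand for the database character id strings described in parameter 8):

```
index = /usr0/mydata/collsel;
collCounts = /usr0/mydata/collsel.cnt;
resultFile = /usr0/mydata/result.dist;
resultCount = 1000;
textQuery = /usr0/mydata/queries.txt;
cutoff = 10;
mergeMethod = 0;
db1 = /usr0/mydata/db1.param;
db2 = /usr0/mydata/db2.param;
```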

StructQueryEval

This application (StructQueryEval.cpp) runs retrieval experiments to evaluate the performance of the structured query model using the InQuery retrieval method. Feedback is implemented as a WSUM of the original query combined with terms selected using the Rocchio implementation of the TFIDF retrieval method. The expanded query has the form:

    #wsum( (1-a) <original query>
          a*w1  t1
          a*w2  t2
          ...
          a*wN  tN
          )

where a is the value of the parameter feedbackPosCoeff.

Scoring is done either over a working set of documents (essentially re-ranking) or over the whole collection, as indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in the file given by "workSetFile". The file should have three columns: the first is the query id; the second, the document id; and the last, a numerical value, which is ignored. By default, scoring is on the whole collection.

The parameters are:

1. index: the complete name of the index table-of-content file for the database index.
2. QuerySet: the query text stream parsed by ParseInQuery
3. resultFile: the result file
4. resultCount: the number of documents to return as result for each query
5. DefaultBelief: the default belief for a document. Default: 0.4
6. feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no feedback)
7. feedbackTermCount: the number of terms to add to a query when doing feedback.
8. feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback, as implemented for TFIDF.
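A parameter file for a StructQueryEval run with pseudo-feedback might look like the following (paths and values are illustrative):

```
index = /usr0/mydata/index.bsc;
QuerySet = /usr0/mydata/structqueries.txt;
resultFile = /usr0/mydata/result.struct;
resultCount = 1000;
feedbackDocCount = 10;
feedbackTermCount = 20;
feedbackPosCoeff = 0.5;
```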


The Lemur Project
Last modified: Mon Apr 14 19:52:31 EDT 2003