Contents
- Where are the Lemur Applications
- Running a Lemur Application
- Documentation of Individual Applications
1. Where are the Lemur Applications
The source code for all the Lemur applications is in the directory app/src. After running "gmake", the executables for the applications are generated in the directory app/obj. After running "gmake install", all the applications are copied to the directory "LEMUR_INSTALL_PATH/lemur/bin".
2. Running a Lemur Application
The usage for different applications may vary, but most applications tend to have the following general usage.
- Create a parameter file with value definitions for all the input variables of an application. Terminate each line with a semicolon. For example,
index = /usr0/mydata/index.bsc;
specifies a particular path to the "table-of-contents" file of a basic index for the input variable index. (The special suffix ".bsc" means basic index; it is ".ifp" for the position index. An index manager will be able to recognize the suffix and open the index with the corresponding index class.)
In general, all the file paths must be absolute paths. Version 1.0 of Lemur does not have the capability of searching for files along different paths.
Most applications will display a list of the major required input variables if you run them with the "--help" option. In general, however, you should read this documentation or the doxygen documentation to find out the exact parameters that an application recognizes.
- Run the application program with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line. Most applications only recognize parameters defined in the parameter file, but there are a few exceptions. PushIndexer requires at least one extra argument following the parameter file, which specifies the source file(s) to use for building the index. The advantage of taking source files in this way is that it allows multiple source files: you just need to put the names of all the source files as the extra arguments following the parameter file.
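The two steps above can be sketched in a few lines of Python. This is only an illustration of the file layout and argument order described here, not part of Lemur: the helper names format_param_file and build_command, the parameter file name pushparam, and the source paths are all hypothetical.

```python
from typing import Mapping, Sequence


def format_param_file(params: Mapping[str, object]) -> str:
    """Render one 'name = value;' definition per line."""
    return "".join(f"{name} = {value};\n" for name, value in params.items())


def build_command(app: str, param_file: str,
                  extra_args: Sequence[str] = ()) -> list:
    """Put the parameter file first, then any extra command-line arguments."""
    return [app, param_file, *extra_args]


# Contents of a parameter file with a single input variable.
print(format_param_file({"index": "/usr0/mydata/index.bsc"}), end="")

# PushIndexer takes its source files after the parameter file
# (the file names here are made up for the example).
print(build_command("PushIndexer", "pushparam",
                    ["/usr0/mydata/source1", "/usr0/mydata/source2"]))
```

The command is returned as a list so that it could be handed to a process launcher without shell quoting issues.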
3. Documentation of Individual Applications
In this section, we provide detailed documentation of each individual application. There are many good examples of using all these applications in the directory data and the subdirectories there. In particular, test_basic_index.sh and test_pos_index.sh have commented example commands for building an index and running different kinds of retrieval experiments, using the basic indexer and the position indexer respectively. Subdirectories basicparam and posparam have many example parameter files that you can modify to run your own experiments.
- BuildBasicIndex
- GenerateSmoothSupport
- RetEval
- GenerateQueryModel
- QueryModelEval
- PushIndexer, ParseQuery, ParseToFile
- ireval.pl
- BuildBasicIndex
This application builds a basic index for a collection of documents.
To use it, follow the general steps of running a Lemur application and set the following variables in the parameter file:
- inputFile: the path to the source file.
- outputPrefix: a prefix name for your index.
- maxDocuments: maximum number of documents to index (default: 1000000)
- maxMemory: maximum amount of memory to use for indexing (default: 0x8000000, or 128MB)
In general, the outputPrefix should be an absolute path, unless you always open the index from the directory in which it is stored. A "table-of-content" (TOC) file with a name of the format outputPrefix.bsc will be written in the directory where the index is stored. The following is an example of use:
% cat buildparam
inputFile = /usr0/mydata/source;
outputPrefix = /usr0/mydata/index;
maxDocuments = 200000;
maxMemory = 0x10000000;
% BuildBasicIndex buildparam
The TOC file is /usr0/mydata/index.bsc.
See also the testing scripts in test_basic_index.sh and the parameter file build_param in the directory data/basicparam.
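The maxMemory values are written in hexadecimal, which can be hard to read at a glance. As a small arithmetic check, assuming the value is a byte count (as the "128MB" in the default suggests), the following sketch converts the default and the example value to megabytes. The helper name to_megabytes is made up for the example.

```python
def to_megabytes(num_bytes: int) -> float:
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


default_max_memory = int("0x8000000", 16)    # default quoted above
example_max_memory = int("0x10000000", 16)   # value used in the example

print(default_max_memory, to_megabytes(default_max_memory))   # 134217728 128.0
print(example_max_memory, to_megabytes(example_max_memory))   # 268435456 256.0
```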
- GenerateSmoothSupport
This application generates two support files for retrieval using the language modeling approach. Both files contain some pre-computed quantities that are needed to speed up the retrieval process.
One file (name given by the parameter smoothSupportFile, see below) is needed for retrieval using a smoothed unigram language model. Each entry in this support file corresponds to one document and records two pieces of information: (a) the count of unique terms in the document; (b) the sum of collection language model probabilities for the words in the document.
The other file (with an extra suffix ".mc") is needed if you run feedback based on the Markov chain query model. Each line in this file contains a term and the sum of the probability of that term given each document in the collection (i.e., a sum of p(w|d) over all possible d's).
To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:
- index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex or the .ifp file created by PushIndexer.)
- smoothSupportFile: file path for the support file (e.g., /usr0/mydata/index.supp)
This application is also a good example of using the doc index (i.e., doc->term index).
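To make the two kinds of pre-computed quantities concrete, here is a minimal Python sketch over a toy collection. It is not the GenerateSmoothSupport code and says nothing about the actual file layout. It assumes that the collection model is a maximum-likelihood estimate, that quantity (b) is summed over the distinct terms of a document, and that p(w|d) is the relative frequency of w in d; all function names are hypothetical.

```python
from collections import Counter, defaultdict


def collection_model(docs):
    """Maximum-likelihood collection language model p(w|C)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smooth_support(docs, coll):
    """Per document: (unique term count, sum of p(w|C) over distinct terms)."""
    rows = []
    for doc in docs:
        distinct = set(doc)
        rows.append((len(distinct), sum(coll[w] for w in distinct)))
    return rows


def markov_chain_support(docs):
    """Per term: sum over documents of p(w|d), with p(w|d) = c(w,d) / |d|."""
    sums = defaultdict(float)
    for doc in docs:
        for w, c in Counter(doc).items():
            sums[w] += c / len(doc)
    return dict(sums)


docs = [["a", "b", "a"], ["b", "c"]]   # toy collection of two documents
coll = collection_model(docs)
print(smooth_support(docs, coll))
print(markov_chain_support(docs))
```

Both quantities depend only on the index, not on the query, which is why they can be computed once and reused across retrieval runs.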
- RetEval
This application runs retrieval experiments (with/without feedback) to evaluate different retrieval models as well as different parameter settings for those models.
Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-TREC format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.
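The working-set file format described above can be read with a few lines of Python. This is a sketch for inspecting or preparing such a file, not Lemur's own reader; it assumes the columns are separated by whitespace, and the function name parse_working_set is hypothetical.

```python
from collections import defaultdict


def parse_working_set(lines):
    """Map each query id to its list of document ids, in file order.

    Each non-empty line must have three columns: query id, document id,
    and a numerical value that is ignored here.
    """
    working_set = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        query_id, doc_id, _value = fields
        working_set[query_id].append(doc_id)
    return dict(working_set)


example = ["q1 d3 0.5", "q1 d7 0.2", "", "q2 d3 1.0"]
print(parse_working_set(example))
```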
It currently supports three different models:
- The popular TFIDF retrieval model
- The Okapi BM25 retrieval function
- The KL-divergence language model based retrieval method
The parameter to select the model is retModel (with value 0 for TFIDF, 1 for Okapi, and 2 for KL). The feedback implementation for the Okapi BM25 retrieval function is suspected to contain a bug, because it does not perform as expected.
Other common parameters (for all retrieval methods) are:
- index: The complete name of the index table-of-content file for the database index.
- textQuery: the query text stream
- resultFile: the result file
- TRECResultFormat: whether the result format is of the TREC format (i.e., six-column) or just a simple three-column format <queryID, docID, score>. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
- resultCount: the number of documents to return as result for each query
- feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
- feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters. (See below.)
Model-specific parameters are:
- For TFIDF:
- feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback. We only implemented the positive part and non-relevant documents are ignored.
- doc.tfMethod: document term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
- doc.bm25K1: BM25 k1 for doc term TF
- doc.bm25B: BM25 b for doc term TF
- query.tfMethod: query term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
- query.bm25K1: BM25 k1 for query term TF. bm25B is set to zero for query terms
- For Okapi:
- BM25K1: BM25 K1
- BM25B: BM25 B
- BM25K3: BM25 K3
- BM25QTF: The TF for expanded terms in feedback (the original paper about the Okapi system is not clear about how this is set, so it's implemented as a parameter.)
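For readers unfamiliar with the roles of K1, B and K3, the following Python sketch shows the commonly published form of the BM25 weight for a single matching term. It is a textbook formulation, not a transcription of Lemur's implementation, which may differ in details such as the IDF variant. The function name and argument names are hypothetical, and no Lemur default values are implied.

```python
import math


def bm25_term_score(tf, qtf, doc_len, avg_doc_len, num_docs, doc_freq,
                    *, k1, b, k3):
    """Textbook BM25 weight of one term for one document.

    tf: term frequency in the document; qtf: term frequency in the query
    (for expanded terms, this is where a value like BM25QTF would be used).
    """
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # k1 controls TF saturation; b controls document length normalization.
    doc_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    # k3 controls the saturation of the query term frequency.
    query_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf * doc_part * query_part


# A document of average length, one occurrence of the term, one query occurrence.
print(bm25_term_score(1, 1, 100, 100, 1000, 10, k1=1.2, b=0.75, k3=7.0))
```

With b set to 0 the document length has no effect, and with b set to 1 the term frequency is fully normalized by the length ratio.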
- For KL-divergence:
Document model smoothing parameters:
- smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
- smoothMethod: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
- smoothStrategy: Either interpolate (0) or backoff (1)
- JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
- DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
- discountDelta: The delta (discounting constant) in the absolute discounting method. Default: 0.7.
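The three smoothing methods listed above can be sketched as follows, in their usual textbook form and for the interpolation strategy only (backoff is not shown). This is an illustration of what the parameters mean, not Lemur's code; the function names are hypothetical, and the default argument values are the parameter defaults quoted in the list.

```python
def jelinek_mercer(count, doc_len, p_coll, lam=0.5):
    """p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|C)."""
    return (1 - lam) * count / doc_len + lam * p_coll


def dirichlet_prior(count, doc_len, p_coll, mu=1000.0):
    """p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (count + mu * p_coll) / (doc_len + mu)


def absolute_discount(count, doc_len, unique_terms, p_coll, delta=0.7):
    """p(w|d) = max(c(w,d) - delta, 0)/|d| + delta * |d|_u/|d| * p(w|C)."""
    return (max(count - delta, 0.0) / doc_len
            + delta * unique_terms / doc_len * p_coll)


# Toy document "a a b" and a three-word collection model.
counts = {"a": 2, "b": 1, "c": 0}
coll = {"a": 0.5, "b": 0.3, "c": 0.2}
doc_len, unique_terms = 3, 2

for w in counts:
    print(w,
          jelinek_mercer(counts[w], doc_len, coll[w]),
          dirichlet_prior(counts[w], doc_len, coll[w]),
          absolute_discount(counts[w], doc_len, unique_terms, coll[w]))
```

Absolute discounting needs the number of unique terms in the document, which is one of the quantities stored in the smoothing support file.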
Query model updating method (i.e., pseudo feedback):
- queryUpdateMethod: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
- Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:
- feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
- feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
- feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
- feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.
Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
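One way to read the truncation and interpolation parameters is sketched below in Python. The exact order of checks and the renormalization after truncation are assumptions made for the example; see the Lemur source for the actual behavior. The function names are hypothetical, and the default argument values are the parameter defaults quoted above.

```python
def truncate_feedback_model(model, term_count, prob_thresh=0.001,
                            prob_sum_thresh=1.0):
    """Keep the most probable words while all three constraints hold."""
    kept = {}
    prob_sum = 0.0
    ranked = sorted(model.items(), key=lambda item: item[1], reverse=True)
    for word, prob in ranked:
        if (len(kept) >= term_count          # feedbackTermCount
                or prob <= prob_thresh       # feedbackProbThresh
                or prob_sum >= prob_sum_thresh):  # feedbackProbSumThresh
            break
        kept[word] = prob
        prob_sum += prob
    # Renormalizing the kept words is an assumption of this sketch.
    return {w: p / prob_sum for w, p in kept.items()}


def interpolate(original, feedback, coefficient):
    """(1 - coefficient) * original model + coefficient * feedback model."""
    words = set(original) | set(feedback)
    return {w: (1 - coefficient) * original.get(w, 0.0)
               + coefficient * feedback.get(w, 0.0)
            for w in words}


feedback = {"a": 0.5, "b": 0.3, "c": 0.1999, "d": 0.0001}
truncated = truncate_feedback_model(feedback, term_count=2)
print(truncated)
print(interpolate({"a": 1.0}, truncated, coefficient=0.5))
```

With a coefficient of 0 the result equals the original query model, and with 1 it equals the truncated feedback model, matching the description of feedbackCoefficient.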
All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but interpret it differently.
- For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
- For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
- For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.
In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
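As an illustration of how feedbackMixtureNoise and emIterations interact in the collection mixture model, here is a minimal EM sketch in Python. It follows the usual two-component mixture formulation (each word in the feedback documents comes from the collection model with probability equal to the noise value, otherwise from the feedback model being estimated). It is not the code in SimpleKLRetMethod.cpp: it always runs the full number of iterations and has no convergence test. The function name and the toy data are hypothetical.

```python
def mixture_feedback_model(counts, coll, noise=0.5, em_iterations=50):
    """Estimate a feedback model from word counts in the feedback documents.

    counts: word -> count in the feedback documents
    coll:   word -> collection language model probability (must be > 0)
    """
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # initial estimate
    for _ in range(em_iterations):
        # E-step: expected count of each word attributed to the feedback model.
        expected = {}
        for w, c in counts.items():
            topic = (1 - noise) * theta[w]
            background = noise * coll[w]
            expected[w] = c * topic / (topic + background)
        # M-step: renormalize the expected counts.
        norm = sum(expected.values())
        theta = {w: e / norm for w, e in expected.items()}
    return theta


counts = {"the": 10, "lemur": 10}
coll = {"the": 0.2, "lemur": 0.001}
print(mixture_feedback_model(counts, coll))
```

Words that the collection model already explains well (here "the") end up with less probability in the feedback model than equally frequent but rarer words.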
- GenerateQueryModel
This application (GenerateQueryModel.cpp) computes an expanded query model based on feedback documents and the original query model for the KL-divergence retrieval method. It can be regarded as performing feedback in the language modeling approach to retrieval.
Parameters:
- index: The complete name of the index table-of-content file for the database index.
- smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
- textQuery: the original query text stream
- resultFile: the result file to be used for feedback
- TRECResultFormat: whether the result format is of the TREC format (i.e., six-column) or just a simple three-column format <queryID, docID, score>. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
- expandedQuery: the file to store the expanded query model
- feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
- queryUpdateMethod: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
- Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:
- feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
- feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
- feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
- feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.
Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but interpret it differently.
- For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
- For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
- For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.
In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
- QueryModelEval
This application loads an expanded query model (e.g., one computed by GenerateQueryModel), and evaluates it with the KL-divergence retrieval model.
Parameters:
- index: The complete name of the index table-of-content file for the database index.
- smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
- queryModel: the file of the query model to be evaluated
- resultFile: the result file
- TRECResultFormat: whether the result format should be of the TREC format (i.e., six-column) or just a simple three-column format <queryID, docID, score>. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
- resultCount: the number of documents to return as result for each query
The following are document model smoothing parameters:
- smoothMethod: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
- smoothStrategy: Either interpolate (0) or backoff (1)
- JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
- DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
- discountDelta: The delta (discounting constant) in the absolute discounting method. Default: 0.7.
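RetEval, GenerateQueryModel and QueryModelEval share several parameters, so one plausible way to chain them is: an initial KL-divergence run without feedback, then query model expansion from that run's result file, then evaluation of the expanded model. The Python sketch below only prints three parameter files for such a chain. The parameter names are the ones documented here; every path and file name, the chosen values, and the helper name render are hypothetical.

```python
def render(params):
    """Render one 'name = value;' definition per line."""
    return "".join(f"{name} = {value};\n" for name, value in params.items())


BASE = "/usr0/mydata"  # hypothetical data directory
common = {
    "index": f"{BASE}/index.bsc",
    "smoothSupportFile": f"{BASE}/index.supp",
}

# Step 1: RetEval, KL-divergence model (retModel = 2), no feedback.
initial_run = {**common, "retModel": 2,
               "textQuery": f"{BASE}/query",
               "resultFile": f"{BASE}/initial.res",
               "feedbackDocCount": 0}

# Step 2: GenerateQueryModel, feedback from the initial result file.
expand = {**common,
          "textQuery": f"{BASE}/query",
          "resultFile": initial_run["resultFile"],
          "expandedQuery": f"{BASE}/query.expanded",
          "feedbackDocCount": 10,
          "queryUpdateMethod": 0}

# Step 3: QueryModelEval, evaluating the expanded query model.
evaluate = {**common,
            "queryModel": expand["expandedQuery"],
            "resultFile": f"{BASE}/expanded.res",
            "resultCount": 1000}

for name, params in [("RetEval", initial_run),
                     ("GenerateQueryModel", expand),
                     ("QueryModelEval", evaluate)]:
    print(f"# parameter file for {name}")
    print(render(params))
```

The only links between the steps are file names: the resultFile of step 1 is the resultFile read in step 2, and the expandedQuery of step 2 is the queryModel of step 3.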
- PushIndexer, ParseQuery, ParseToFile
Please see Parsing in Lemur.
- ireval.pl
This is a Perl script that does TREC-style retrieval evaluation. The usage is
ireval.pl -j judgmentfile < resultfile
if the resultfile is of the simple three-column format (i.e., queryid, docid, score), or
ireval.pl -j judgmentfile -trec < resultfile
if the resultfile is of the six-column TREC format.
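The relationship between the two result formats can be shown with a short Python sketch that reduces six-column lines to three-column lines. It assumes the conventional TREC run layout (query id, the literal Q0, document id, rank, score, run tag) and whitespace-separated columns; it is not part of ireval.pl, and the function name and sample line are hypothetical.

```python
def trec_to_simple(lines):
    """Convert six-column TREC result lines to 'queryid docid score' lines."""
    simple = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        query_id, _q0, doc_id, _rank, score, _run_tag = fields
        simple.append(f"{query_id} {doc_id} {score}")
    return simple


print(trec_to_simple(["301 Q0 FBIS3-10082 1 -4.52 lemur"]))
```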
The Lemur Project Last modified: Tue Dec 18 20:47:32 EST 2001