
Retrieval Evaluation Application

This application (RetEval.cpp) runs retrieval experiments (with/without feedback) to evaluate different retrieval models as well as different parameter settings for those models.

Scoring is done either over a working set of documents (essentially re-ranking) or over the whole collection; the latter is the default. The choice is controlled by the parameter "useWorkingSet": when it has a non-zero (integer) value or the value true, scoring is restricted to the working set specified in the file named by "workSetFile". That file should have three columns: the first is the query id, the second is the document id, and the last is a numerical value, which is ignored by default. The third column exists so that any retrieval result in Lemur's simple (i.e., non-TREC) format can be used directly as a "workSetFile" for re-ranking, which is convenient. The third column can also supply a prior probability for each document, which may be useful for some algorithms.
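For concreteness, a hypothetical "workSetFile" might look like the following (the query and document ids are made up; the third column is ignored by default):

    q1 doc0034 0
    q1 doc0125 0
    q2 doc0034 0.7
    q2 doc0518 0.3

As an illustration of the format only (this code is not part of RetEval, and the file name is a placeholder), a minimal C++ sketch that reads such a file could be:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
      // Each line: <queryID> <docID> <value>. The value is ignored by
      // default, but some algorithms may treat it as a document prior.
      std::ifstream in("workset.txt"); // placeholder file name
      std::string queryID, docID;
      double value;
      while (in >> queryID >> docID >> value) {
        std::cout << "query " << queryID << " -> doc " << docID
                  << " (prior " << value << ")\n";
      }
      return 0;
    }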

It currently supports five different models:

  1. The popular TFIDF retrieval model

  2. The Okapi BM25 retrieval function

  3. The KL-divergence language model based retrieval method

  4. The InQuery (CORI) retrieval model

  5. The cosine similarity model

The retrieval model is selected with the parameter retModel; its valid values correspond to the five models listed above.

It is suspected that there is a bug in the implementation of feedback for the Okapi BM25 retrieval function, because its performance is not as expected.

Other common parameters (shared by all retrieval methods) are listed below; examples of the result formats and a sample parameter file follow the list:

  1. index: The complete name of the index table-of-content file for the database index.

  2. textQuerySet: The query text stream.

  3. resultFile: The file to which retrieval results are written.

  4. resultFormat: Whether results are written in the TREC format (i.e., six-column) or in a simple three-column format <queryID, docID, score>. The value is a string: trec for the TREC format or 3col for the three-column format. The integer values used in previous versions of Lemur (zero for non-TREC, non-zero for TREC) are also accepted. Default: TREC format. (Examples of both formats are shown after this list.)

  5. resultCount: The number of documents to return as the result for each query.

  6. feedbackDocCount: The number of documents to use for pseudo-feedback (0 means no feedback).

  7. feedbackTermCount: The number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters. (See below.)
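
As an illustration of the two result formats (the ids, rank, score, and run name are made up; the six-column layout follows the standard TREC run format of queryID, Q0, docID, rank, score, runID):

    TREC format:   q1 Q0 doc0034 1 2.35 RetEval
    3col format:   q1 doc0034 2.35

A hypothetical parameter file tying these common parameters together might look like the following. This is only a sketch, assuming the name = value; syntax of Lemur parameter files; the paths are placeholders, and the retModel value shown is merely illustrative:

    index             = /path/to/index.toc;
    textQuerySet      = queries.txt;
    resultFile        = results.txt;
    resultFormat      = trec;
    resultCount       = 1000;
    retModel          = kl;
    feedbackDocCount  = 0;
    feedbackTermCount = 20;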

Model-specific parameters are:

