This application (RelFBEval.cpp) runs retrieval experiments with relevance feedback. Different retrieval models can be used with different settings for the corresponding parameters. Although this program is designed for relevance feedback, it can also be used for pseudo feedback: set the parameter feedbackDocuments to a result file, so that every entry in the result file is interpreted as a relevant document.
Two important notes:
Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-trec format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.
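As an illustration of the three-column working set format described above, here is a minimal C++ sketch that groups document ids by query id and discards the third column. The function name parseWorkingSet is hypothetical and not part of Lemur, and the sketch assumes whitespace-separated columns with a numeric third column.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a Lemur API): reads "queryID docID value" lines
// and returns, for each query id, the document ids in file order.
std::map<std::string, std::vector<std::string>>
parseWorkingSet(std::istream& in) {
    std::map<std::string, std::vector<std::string>> workingSet;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string queryId, docId;
        double ignoredValue;  // third column: parsed, but not used for scoring
        if (fields >> queryId >> docId >> ignoredValue) {
            workingSet[queryId].push_back(docId);
        }
        // Lines without three well-formed columns are skipped.
    }
    return workingSet;
}
```

Because the third column is only parsed and dropped, a simple-format result file from an earlier run can be passed through unchanged.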
It currently supports three different retrieval models: TFIDF, Okapi, and KL.
The parameter to select the model is retModel (with value 0 for TFIDF, 1 for Okapi, and 2 for KL). The feedback implementation for the Okapi BM25 retrieval function is suspected to contain a bug, because its performance is not as expected.
Other common parameters (for all retrieval methods) are:
index
: The complete name of the index table-of-content file for the database index.
textQuerySet
: the query text stream
resultFile
: the result file
resultCount
: the number of documents to return as result for each query
feedbackDocuments
: the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
feedbackDocCount
: the number of documents to use for feedback (a negative value means using all judged documents for feedback). The documents in the feedbackDocuments file are sorted in decreasing order according to the numerical value in the third column, and then the top documents are used for feedback.
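The handling of the feedback file can be sketched as follows. This is an illustration only, not Lemur's code: the names FeedbackEntry, toThreeColumn, and selectFeedbackDocs are hypothetical, and the four-column judgment layout (query id, a second column to drop, document id, judgment) is an assumption about the TREC-style input.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one line of the three-column feedback file.
struct FeedbackEntry {
    std::string queryId;
    std::string docId;
    double value;
};

// Drops the second column of a four-column judgment line.
// Returns an empty string when the line has fewer than four columns.
std::string toThreeColumn(const std::string& trecLine) {
    std::istringstream fields(trecLine);
    std::string queryId, secondColumn, docId, judgment;
    if (!(fields >> queryId >> secondColumn >> docId >> judgment)) return "";
    return queryId + " " + docId + " " + judgment;
}

// Sorts one query's entries by decreasing third-column value and keeps the
// top feedbackDocCount of them. A negative count keeps every entry.
std::vector<FeedbackEntry> selectFeedbackDocs(std::vector<FeedbackEntry> entries,
                                              int feedbackDocCount) {
    std::stable_sort(entries.begin(), entries.end(),
                     [](const FeedbackEntry& a, const FeedbackEntry& b) {
                         return a.value > b.value;
                     });
    if (feedbackDocCount >= 0 &&
        static_cast<std::size_t>(feedbackDocCount) < entries.size()) {
        entries.resize(static_cast<std::size_t>(feedbackDocCount));
    }
    return entries;
}
```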
feedbackTermCount
: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters (see below).
Model-specific parameters are:
For TFIDF (retModel 0):
feedbackPosCoeff
: the coefficient for positive terms in (positive) Rocchio feedback. Only the positive part is implemented; non-relevant documents are ignored.
doc.tfMethod
: document term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
doc.bm25K1
: BM25 k1 for doc term TF
doc.bm25B
: BM25 b for doc term TF
query.tfMethod
: query term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
query.bm25K1
: BM25 k1 for query term TF. bm25B is set to zero for query terms.
For Okapi (retModel 1):
BM25K1
: BM25 K1
BM25B
: BM25 B
BM25K3
: BM25 K3
BM25QTF
: The TF for expanded terms in feedback (the original paper about the Okapi system is not clear about how this is set, so it's implemented as a parameter.)
For KL (retModel 2):
Document model smoothing parameters:
smoothSupportFile
: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
smoothMethod
: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
smoothStrategy
: Either interpolate (0) or backoff (1)
JelinekMercerLambda
: The collection model weight in the JM interpolation method. Default: 0.5
DirichletPrior
: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
discountDelta
: The delta (discounting constant) in the absolute discounting method. Default 0.7.
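To make the roles of JelinekMercerLambda, DirichletPrior, and discountDelta concrete, the following C++ sketch computes a smoothed document model probability using the standard textbook formulas for the three methods under the interpolate strategy. It is not taken from the Lemur source; the struct TermStats and the function names are hypothetical, and the backoff strategy is not shown.

```cpp
#include <algorithm>

// Hypothetical container for the statistics each formula needs.
struct TermStats {
    double termCount;       // c(w; d): count of the term in the document
    double docLength;       // |d|: total number of term occurrences in the document
    double uniqueTerms;     // |d|_u: number of distinct terms in the document
    double collectionProb;  // p(w | C): collection language model probability
};

// Jelinek-Mercer: lambda is the weight of the collection model.
double jelinekMercer(const TermStats& s, double lambda) {
    return (1.0 - lambda) * (s.termCount / s.docLength) + lambda * s.collectionProb;
}

// Dirichlet prior: mu is the prior parameter.
double dirichletPrior(const TermStats& s, double mu) {
    return (s.termCount + mu * s.collectionProb) / (s.docLength + mu);
}

// Absolute discounting: delta is subtracted from every seen term's count, and
// the freed probability mass is redistributed through the collection model.
double absoluteDiscount(const TermStats& s, double delta) {
    double discounted = std::max(s.termCount - delta, 0.0) / s.docLength;
    double collectionWeight = delta * s.uniqueTerms / s.docLength;
    return discounted + collectionWeight * s.collectionProb;
}
```

For example, a term that occurs twice in a 10-term document with p(w | C) = 0.01 gets probability 0.5 * 0.2 + 0.5 * 0.01 = 0.105 under Jelinek-Mercer with the default lambda of 0.5.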
Query model updating method (i.e., pseudo feedback):
queryUpdateMethod
: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:
feedbackCoefficient
: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
feedbackTermCount
: Truncate the feedback model to no more than a given number of words/terms.
feedbackProbThresh
: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
feedbackProbSumThresh
: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.
feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
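The truncation and interpolation described above can be sketched in C++ as follows. This is a rough illustration under stated assumptions, not the Lemur implementation: the names Model, truncateModel, and interpolate are hypothetical, terms are taken in decreasing order of probability, the sum threshold is read as "stop adding terms once the accumulated probability has reached it", and no renormalization is applied after truncation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Model = std::map<std::string, double>;  // term -> probability

// Keeps the most probable terms while all three constraints still hold.
Model truncateModel(const Model& feedback, std::size_t termCount,
                    double probThresh, double probSumThresh) {
    std::vector<std::pair<std::string, double>> sorted(feedback.begin(), feedback.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const std::pair<std::string, double>& a,
                        const std::pair<std::string, double>& b) {
                         return a.second > b.second;
                     });
    Model kept;
    double sum = 0.0;
    for (const auto& entry : sorted) {
        if (kept.size() >= termCount) break;      // feedbackTermCount
        if (entry.second <= probThresh) break;    // feedbackProbThresh
        if (sum >= probSumThresh) break;          // feedbackProbSumThresh
        kept[entry.first] = entry.second;
        sum += entry.second;
    }
    return kept;
}

// New query model = (1 - coeff) * original + coeff * feedback.
Model interpolate(const Model& original, const Model& feedback, double coeff) {
    Model updated;
    for (const auto& e : original) updated[e.first] += (1.0 - coeff) * e.second;
    for (const auto& e : feedback) updated[e.first] += coeff * e.second;
    return updated;
}
```

With coeff set to 0 the result equals the original model, and with coeff set to 1 it equals the feedback model, matching the two extremes of feedbackCoefficient.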
All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations:
For the mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model when a feedback document is "generated".
For the divergence minimization method, feedbackMixtureNoise is the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.
In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by a hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
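To show how feedbackMixtureNoise and emIterations interact in the mixture model, here is a minimal C++ sketch of the standard EM updates for a two-component mixture of a feedback model and a fixed collection model. It is an illustration only and is not copied from SimpleKLRetMethod.cpp: the function name estimateFeedbackModel, the uniform starting point, and the tolerance parameter are all assumptions, and the actual convergence criterion is the hard-coded one in the source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// counts[i]: count of term i in the feedback documents.
// collectionProb[i]: p(term i | C). noise: collection selection probability.
// Returns the estimated feedback model p(term i | theta_F).
std::vector<double> estimateFeedbackModel(const std::vector<double>& counts,
                                          const std::vector<double>& collectionProb,
                                          double noise, int maxIterations,
                                          double tolerance = 1e-6) {
    const std::size_t n = counts.size();
    if (n == 0) return std::vector<double>();
    std::vector<double> theta(n, 1.0 / static_cast<double>(n));  // uniform start (assumption)
    double previousLogLik = 0.0;
    for (int it = 0; it < maxIterations; ++it) {
        std::vector<double> expected(n, 0.0);
        double total = 0.0;
        double logLik = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double topic = (1.0 - noise) * theta[i];
            double mix = topic + noise * collectionProb[i];
            if (mix <= 0.0) continue;
            logLik += counts[i] * std::log(mix);
            // E-step: expected count of term i attributed to the feedback model.
            expected[i] = counts[i] * topic / mix;
            total += expected[i];
        }
        if (total <= 0.0) break;  // nothing attributed to the feedback model
        // M-step: renormalize the expected counts.
        for (std::size_t i = 0; i < n; ++i) theta[i] = expected[i] / total;
        // Stop early once the log-likelihood stops changing (assumed criterion).
        if (it > 0 && std::fabs(logLik - previousLogLik) < tolerance) break;
        previousLogLik = logLik;
    }
    return theta;
}
```

With a higher noise value, more of each common term's count is attributed to the collection model, so the estimated feedback model concentrates on terms that are frequent in the feedback documents but rare in the collection.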