1. Overview
The Lemur structured query language is a reimplementation of the structured query language of InQuery, which was developed at the CIIR. Among other things, this query language enables the use of proximity operators (ordered and unordered windows) in queries. Feedback is implemented as a WSUM of the original query combined with terms selected using the Rocchio implementation of the Lemur TFIDF retrieval method. The expanded query has the form:
#wsum( (1-a) <original query> a*w1 t1 a*w2 t2 ... a*wN tN )
where a is the value of the parameter feedbackPosCoeff.
2. Applications
ParseInQueryOp
This application (ParseInqueryOp.cpp) parses a file containing structured queries into BasicDocStream format. The parameters are:
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- outputFile: name of the output file.
StructQueryEval
This application (StructQueryEval.cpp) runs retrieval experiments to evaluate the performance of the structured query model using the InQuery retrieval method. Feedback is implemented as a WSUM of the original query combined with terms selected using the Rocchio implementation of the TFIDF retrieval method. The expanded query has the form:
#wsum( (1-a) <original query> a*w1 t1 a*w2 t2 ... a*wN tN )
where a is the value of the parameter feedbackPosCoeff.
Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. By default, scoring is on the whole collection.
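The three-column working set file described above can be read with a few lines of Python. This is only a sketch: it assumes the columns are separated by whitespace, which the text does not state, and the function name read_working_set and the sample ids are hypothetical.

```python
from collections import defaultdict
import io


def read_working_set(lines):
    """Group document ids by query id from a three-column working set file."""
    working_set = defaultdict(list)
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        query_id, doc_id, _ignored = fields  # the third column is ignored
        working_set[query_id].append(doc_id)
    return dict(working_set)


# Hypothetical sample data standing in for the file named by workSetFile.
sample = io.StringIO("18 DOC-001 0.5\n18 DOC-007 0.3\n19 DOC-002 1.0\n")
working_set = read_working_set(sample)
print(working_set)
```

Each query id maps to the documents that would be re-ranked for it; the numerical value in the last column is read but never used, matching the description above.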
The parameters are:
- index: The complete name of the index table-of-content file for the database index.
- QuerySet: the query text stream parsed by ParseInQueryOp
- resultFile: the result file
- resultCount: the number of documents to return as result for each query
- DefaultBelief: The default belief for a document: Default=0.4
- feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
- feedbackTermCount: the number of terms to add to a query when doing feedback.
- feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback, as implemented for TFIDF.
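To make the role of feedbackPosCoeff concrete, the expanded query form #wsum( (1-a) <original query> a*w1 t1 ... a*wN tN ) can be assembled as in the following Python sketch. The function name expand_query, the feedback terms, and their weights are all hypothetical; in the application the terms and weights come from the Rocchio term selection, which is not reproduced here.

```python
def expand_query(original_query, term_weights, a):
    """Build '#wsum( (1-a) <original query> a*w1 t1 ... a*wN tN )'.

    term_weights is a list of (term, weight) pairs; a plays the role of
    the feedbackPosCoeff parameter.
    """
    parts = [f"{1 - a:g}", original_query]
    for term, weight in term_weights:
        parts.append(f"{a * weight:g}")  # each feedback weight is scaled by a
        parts.append(term)
    return "#wsum( " + " ".join(parts) + " )"


# Hypothetical feedback terms and weights, for illustration only.
expanded = expand_query(
    "#sum(parallel processors)",
    [("compiler", 2.0), ("microcode", 1.0)],
    a=0.5,
)
print(expanded)
```

With a = 0.5 the original query and the feedback terms share the weight evenly; a value closer to 0 keeps the expanded query closer to the original. The number of (term, weight) pairs corresponds to feedbackTermCount.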
3. Structured Query Language
The structured query operators are:
- Sum Operator: #sum (T1 ... Tn)
  The terms or nodes contained in the sum operator are treated as having equal influence on the final result. The belief values provided by the arguments of the sum are averaged to produce the belief value of the #sum node.
- Weighted Sum Operator: #wsum (W1 T1 ... Wn Tn)
  The terms or nodes contained in the wsum operator contribute unequally to the final result according to the weight associated with each (Wx). Note that this is a change from the InQuery operator, as there is no initial weight, Ws, for scaling the belief value of the sum.
- Ordered Distance Operator: #N (T1 ... Tn) or #odN (T1 ... Tn)
  The terms within an ODN operator must be found within N words of each other in the text in order to contribute to the document's belief value. The "#N" version is an abbreviation of #odN, thus #3(health care) is equivalent to #od3(health care).
- Un-ordered Window Operator: #uwN(T1 ... Tn)
  The terms contained in a UWN operator must be found in any order within a window of N words in order for this operator to contribute to the belief value of the document.
- Phrase Operator: #phrase(T1 ... Tn)
  The operator is treated as an ordered distance operator of 3 (#od3). Note that this is a simplification of the more complicated heuristic used by InQuery.
- Passage Operator: #passageN(T1 ... Tn)
  The passage operator looks for the terms or nodes within the operator to be found in a passage window of N words. The document is rated based upon the score of its best passage.
- Synonym Operator: #syn(T1 ... Tn)
  The terms of the operator are treated as instances of the same term.
- And Operator: #and(T1 ... Tn)
  The more terms contained in the AND operator which are found in a document, the higher the belief value of that document.
- Boolean And Operator: #band(T1 ... Tn)
  All of the terms within a BAND operator must be found in a document in order for this operator to contribute to the belief value of that document.
- Boolean And Not Operator: #bandnot (arg1 arg2)
  Search for documents matching the first argument but not the second.
- Or Operator: #or(T1 ... Tn)
  At least one of the terms within the OR operator must be found in a document for that document to get credit for this operator.
- Maximum Operator: #max(T1 ... Tn)
  The maximum belief value of all the terms or nodes contained in the MAX operator is taken to be the belief value of this operator.
- Filter Require Operator: #filreq(arg1 arg2)
  Use the documents returned (belief list) of the first argument if and only if the second argument would return documents. The value of the second argument does not affect the belief values of the first argument; only whether they will be returned or not.
- Filter Reject Operator: #filrej(arg1 arg2)
  Use the documents returned by the first argument if and only if there were no documents returned by the second argument. The value of the second argument does not affect the belief values of the first argument; only whether they will be returned or not.
- Negation Operator: #not(arg1)
  The term or node contained in this operator is negated so that documents which do not contain it are rewarded.
- Property Operator: #prop(arg1 arg2)
  Return documents where arg1 is a property of arg2.
The input query file is of the form:
#qN = queryNode ;
where N is the query id and queryNode is one of the aforementioned query operators. The query may span multiple lines and must be terminated with a semicolon. The body of the query must not contain a semicolon, as that will prematurely terminate the query.
An example query:
#q18=#wsum(1 #sum(Languages and compilers for #1(parallel processors)) 2 #sum(highly horizontal microcoded machines) 1 code 1 compaction );
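The "#qN = queryNode ;" file layout can be illustrated with a small Python sketch that splits a query file on semicolons and extracts each query id. It is not the ParseInQueryOp parser and does not validate the operators; the function name parse_query_file and the second sample query (#q19) are hypothetical.

```python
import re

# '#q', the query id, '=', then the query node (which may span lines).
QUERY_RE = re.compile(r"#q(\w+)\s*=\s*(.*)", re.DOTALL)


def parse_query_file(text):
    """Return {query id: query node text} for '#qN = queryNode ;' entries."""
    queries = {}
    for chunk in text.split(";"):  # a semicolon always terminates a query
        chunk = chunk.strip()
        if not chunk:
            continue  # whitespace after the final semicolon
        match = QUERY_RE.fullmatch(chunk)
        if match is None:
            raise ValueError(f"not a '#qN = queryNode' entry: {chunk!r}")
        query_id, node = match.groups()
        queries[query_id] = " ".join(node.split())  # collapse line breaks
    return queries


text = """#q18=#wsum(1 #sum(Languages and compilers for #1(parallel processors))
  2 #sum(highly horizontal microcoded machines) 1 code 1 compaction );
#q19 = #uw5(health care);
"""
queries = parse_query_file(text)
for query_id, node in queries.items():
    print(query_id, node)
```

Splitting on every semicolon mirrors the rule that a query body must not contain one: any semicolon inside a query would end it early and the remainder would fail to match.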
The Lemur Project Last modified: Fri Feb 13 18:27:56 EST 2004