Lemur 2.2 release notes

We have tested using gcc 3.2.2, gcc 3.2.3, VC++ 6.0, and VC++ .NET.
New Applications:
- BuildKeyfileIncIndex - Builds a KeyfileIncIndex. This is a fast-loading index that uses b-trees as the underlying data structures for its dictionaries. KeyfileIncIndex stores position information and can add new documents to an existing index.
- Lemur CGI interface - available in a separate download package. This package allows you to put a web front end on top of existing positional Lemur indexes. It uses the InQueryRetMethod. To get original documents back, the index should have a DocumentManager associated with the documents. To display results using the document title or headline (instead of the document id), use the new ElemDocMgr, which can retrieve document elements (if they were processed at build time). Build an Index and a DocumentManager using BuildDocMgr.
- Lemur Retrieval GUI - available in a separate download package. This is a stand-alone GUI built with Java Swing. It makes use of a very simple and limited JNI layer on top of the Lemur library. It is compatible with any Lemur index and currently has options to use InQueryRetMethod, OkapiRetMethod, or SimpleKLRetMethod.
Additions, Enhancements, and other changes:
- New KeyfileDocMgr DocumentManager. It provides the raw document text in the same fashion as FlattextDocMgr and additionally provides the start and end byte offsets for each token within each document, relative to the start of that document, without having to reparse the raw document. It uses b-trees to store its data, allowing any instance of the document manager to be opened in constant time. It is faster than the FlattextDocMgr for loading large collections. It has been integrated into the BuildDocMgr application.
- Support for getting (raw) elements back from a document, such as the document title:
  1. New ElemDocMgr DocumentManager. It provides everything KeyfileDocMgr does. In addition, it can retrieve any document element by handling begin and end tags. The parser must send the corresponding tag events. It has been integrated into the BuildDocMgr application.
  2. Modified TrecParser and WebParser to send begin and end element property tags to be used by ElemDocMgr. WebParser sends <TITLE> elements with name "TITLE". TrecParser sends "TITLE" (<TTL>) and "HEADLINE" (<HEADLINE>, <HEAD>, <HL>).
- New MatchInfo class that identifies the information in the document that matched the query. It provides a list of match offsets (token based). It also provides byte offsets in the source text when a DocumentManager is available to provide the raw document text. If the DocumentManager is a KeyfileDocMgr or ElemDocMgr, the byte offsets are retrieved directly instead of being recovered by reparsing the document.
- New QueryDocument class to facilitate the creation of interactive queries (not read from a batch file). QueryDocument inherits from both TextHandler and Document so it can be used at the end of a query processing TextHandler chain, then passed to a TextQuery or StructQuery. It can also be used independently from other TextHandler objects to create a query that's compatible with current retrieval methods.
- New MultiRegrMergeMethod, a multi-regression merge method for distributed IR. This merge method can merge results from individual databases even when each database used a different retrieval method. It has been integrated into the DistRetEval application.
- Standardized Lemur application parameters through the use of object manager classes. Retrieval models can now be specified as a string (tfidf, okapi, kl, inquery, cori_cs, cos) rather than as a number.
- New modular makefiles on unix. These new makefiles make compiling the distributed information retrieval (distrib) and summarization (summarization) modules of Lemur optional, based on the configure script.
- Modified SimpleKLRetMethod and SimpleKLQueryModel for better performance when using pseudo-feedback relevance models. Added new methods colQueryLikelihood and setScoreMethod, and changed the background model estimator from Laplace to maximum likelihood.
- Modified InQueryRetMethod to accept any Index. However, the index must still contain position information: the method checks that the underlying inverted list data structure has positions (InvFPDocList).
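The background model change noted above for SimpleKLRetMethod can be illustrated with a small, self-contained sketch. This is not Lemur code: the function names mleProb and laplaceProb are hypothetical, and the sketch assumes that "Laplace" refers to add-one smoothing over the vocabulary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not part of the Lemur API.

// Maximum likelihood estimate of a collection (background) model:
//   P(w|C) = count(w) / totalTokens
double mleProb(std::size_t termCount, std::size_t totalTokens) {
    assert(totalTokens > 0);
    return static_cast<double>(termCount) / static_cast<double>(totalTokens);
}

// Laplace (add-one) estimate, assumed here to be:
//   P(w|C) = (count(w) + 1) / (totalTokens + vocabSize)
double laplaceProb(std::size_t termCount, std::size_t totalTokens,
                   std::size_t vocabSize) {
    assert(totalTokens + vocabSize > 0);
    return static_cast<double>(termCount + 1) /
           static_cast<double>(totalTokens + vocabSize);
}
```

Under these assumptions, a term that never occurs in the collection gets probability zero from the maximum likelihood estimate but a small non-zero probability from the Laplace estimate, so the two estimators can yield different scores for the same query.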
Bugs Fixed:
1. Problem: DocumentManagers do not retrieve documents correctly when files are in Windows text format with \r\n line endings
  Solution: modified Parsers and DocumentManagers to handle text files in binary mode
2. Problem: FlattextDocMgr::getMyID can cause memory corruption
  Solution: modified method so it no longer returns a value from a temporary variable that exists only within the method's scope
3. Problem: PropIndexTH fails with NULL original term
  Solution: add check for NULL original term in handleWord method
4. Problem: PropIndexTH fails to add tag for end of named entity if last term in entity is a stopword
  Solution: modified method to add end tag
5. Problem: conflict occurs in Parsers using "WORD" as a pre-defined symbol
  Solution: change enumeration in TextHandler to use WORDSTR and SYMBOLSTR as TokenTypes
6. Problem: QryBasedSample stops 1 short of query list
  Solution: fixed loop check in QryBasedSampler
7. Problem: Parsers do not always report correct position when using parseBuffer
  Solution: fixed parseBuffer method to reset yyloc
8. Problem: non-string properties are not copied correctly in Property class
  Solution: changed use of strncpy to memcpy
9. Problem: InQueryRetMethod crashes when passed regular query beginning with OOV term
  Solution: modified InQueryRetMethod to wrap regular queries in #SUM operator
10. Problem: ParamGet.. returns junk or crashes after repeated push and pop of same parameter file
  Solution: modified param_pop_file to clear stack when necessary but not remove parameters table that is being cached for use by another file
11. Problem: Summarizer and subclasses use array of abstract Passage class
  Solution: fixed methods to use pointer to class
12. Problem: linking errors occur with QryBasedSample when using Visual Studio .NET
  Solution: changed inclusion of afx.h to time.h
13. Problem: SimpleKLRetMethod does not perform as expected for pseudo-feedback relevance models
  Solution: fixed SimpleKLRetMethod::interpolateWith to start summing with the first term of the query model
  Solution: add computation of collection query likelihood for the new model in interpolateWith
  Solution: add computation of collection query likelihood for the new model in initial load
  Solution: change clarity computation to use log base 2
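
The reasoning behind fix 8 (strncpy replaced by memcpy for non-string properties) can be shown with a minimal sketch. The helper names copyAsString and copyAsBytes are hypothetical and are not taken from the Property class; they only contrast the two standard library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helpers for illustration only; not Lemur code.

// Copy with strncpy: copying stops at the first zero byte in src,
// and the remainder of dst (up to n bytes) is padded with zeros.
void copyAsString(char* dst, const char* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::strncpy(dst, src, n);
}

// Copy with memcpy: all n bytes are copied, whatever their values.
void copyAsBytes(void* dst, const void* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, n);
}
```

A binary value such as an integer or a struct can contain zero bytes anywhere, so a string copy silently drops everything after the first one, while a byte copy preserves the whole value.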

Last modified: Mon Feb 9 17:25:23 EST 2004