Lemur 2.1 release notes
- Tested with gcc 3.2.3 and VC++ 6.0.
- New Applications:
- BuildDocMgr - Builds a DocumentManager (FlattextDocMgr) for a set of files. Can optionally build an index at the same time, using the same parse of the documents.
- BuildPropIndex - Builds an index with the capability to add properties to each term. Used with BrillPOSParser to parse output from Brill's part of speech tagger, or with IdentifinderParser to parse named entity tags output by Identifinder. This builds an InvFPIndex, keeping properties at the same position as the term.
- Additions, Enhancements, and other changes:
- Support for associating properties with terms in a position index (InvFPIndex):
- New TextHandler API that passes along the original token produced by the parser, unmodified by stopword removal or stemming. The new API also supports passing on more document information through Property objects, such as structure found in XML, a part of speech, or a named entity tag. Backwards compatibility is maintained: no change is required to any existing TextHandler subclasses. See the Lemur Parsing documentation for more information.
- New Property class to hold a property of any type.
- New PropertyList API to keep a list of Property objects. An example class, LinkedPropertyList, is included; it uses the STL list object.
- Addition of PropIndexTH, a destination TextHandler that pushes terms and properties into InvFPPushIndex. Use it with the other new TextHandler classes: IdentifinderParser, BrillPOSParser, and BrillPOSTokenizer.
- New InvFPTermPropList class to make iterating through a sequential InvFPTermList with property information more convenient. For example, the InvFPTermPropList::nextTerm method skips term properties and returns the next entry with actual term information.
- Addition of #PROP operator to structured query language. This supports getting terms associated with a property for indexes that contain that information. For example, one can query for #prop(place carnegie) or #prop(person carnegie). This operator can be nested with other existing operators. See the Structured Query Language documentation for more information.
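The idea behind a nextTerm-style method, skipping property entries that share a position with the term they annotate, can be sketched as follows. This is a minimal illustration only: Entry and TermPropCursor are hypothetical names invented here and are not the InvFPTermPropList declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical entry in a position-ordered term list: either an actual term
// or a property stored at the same position as the term it annotates.
struct Entry {
    std::string text;
    int position;
    bool isProperty;
};

// Hypothetical cursor over such a list (not Lemur's real class).
// The vector passed in must outlive the cursor.
class TermPropCursor {
public:
    explicit TermPropCursor(const std::vector<Entry>& entries)
        : entries_(entries), next_(0) {}

    // Return the next entry that is an actual term, skipping properties.
    // Returns nullptr once the list is exhausted.
    const Entry* nextTerm() {
        assert(next_ <= entries_.size());  // cursor never runs past the end
        while (next_ < entries_.size()) {
            const Entry& e = entries_[next_++];
            if (!e.isProperty) {
                return &e;
            }
        }
        return nullptr;
    }

private:
    const std::vector<Entry>& entries_;
    std::size_t next_;
};
```

With a list holding the term "carnegie", a "person" property at the same position, and the term "mellon", successive calls would yield "carnegie", then "mellon", then nullptr.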
- New TextHandlerManager class to facilitate the creation of Parser, Stemmer, and Stopper objects. A new TextHandler class only needs to be added to the TextHandlerManager to become available to all existing applications that use the TextHandlerManager. It accepts the type to create as a parameter, but will check the parameter stack if nothing is specified. The parameter names for the objects are consistent with current Lemur applications.
- New DocMgrManager class to facilitate the creation and opening of DocumentManager objects. Use and benefits are similar to TextHandlerManager and IndexManager.
- Modified RetMethodManager to use the RetModel parameter whenever it is specified, rather than using it as the fallback default for what's on the parameter stack. This change makes its use consistent with other Manager classes.
- Changed Index::docManager method to return a pointer to the actual DocumentManager object instead of its ID. Note: this is the only change that is not backwards compatible.
- Changed TextQueryRep to ignore the weighting of documents (based on doc scores) in pseudo feedback. Such weighting can be a problem for some language modeling approaches when the scores are negative. Feedback performance with language models is very slightly better after this change.
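The precedence rule shared by the Manager classes above (an explicitly specified type wins, and the parameter stack is consulted only when nothing is specified) can be sketched roughly as below. ParamStack and resolveType are hypothetical names used for illustration; the real managers read Lemur's own parameter stack, and the values "porter" and "krovetz" in the usage note are just sample strings.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the parameter stack: a flat name -> value map.
using ParamStack = std::map<std::string, std::string>;

// Decide which object type a manager should create.
// An explicitly requested type always wins; the parameter stack is consulted
// only when the caller passes an empty string. Returns an empty string when
// neither source names a type.
std::string resolveType(const std::string& requested,
                        const ParamStack& params,
                        const std::string& paramName) {
    assert(!paramName.empty());          // a lookup key is always required
    if (!requested.empty()) {
        return requested;                // explicit argument takes precedence
    }
    ParamStack::const_iterator it = params.find(paramName);
    return it != params.end() ? it->second : std::string();
}
```

For example, with "stemmer" set to "porter" on the stack, resolveType("krovetz", params, "stemmer") yields "krovetz", while resolveType("", params, "stemmer") falls back to "porter".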
- Bugs Fixed:
- Problem: StructQueryEval crashes if not using InvFPIndex
Solution: modified to throw exception instead of just crashing
- Problem: Error with InvFPTermList when getting InvFPIndex::termInfoList for empty document
Solution: add a test for listlen of 0 in InvFPTermList::countTerms
- Problem: cannot simultaneously use multiple instances of BasicDocStream
Solution: remove use of static variable
- Problem: retrieval using structured queries crashes when all terms are out of vocabulary (OOV)
Solution: fix QueryNode to not dereference a NULL object. Add test for empty child list in QueryNode::unionDocList.
- Problem: InvFPDocList::termCTF does not return correct value
Solution: Make InvDocList::termCTF virtual and implement overriding InvFPDocList::termCTF
- Problem: qfilef in util.c tries to close empty pointer
Solution: add check for NULL
- Problem: InvFPIndex::docLengthCounted returns length in bytes instead of count
Solution: fix implementation to return the correct value of the counted document length (length without omitted tokens, i.e. stopwords)
- Problem: Memory leaks in various classes
Solution: implemented destructor for DocScoreVector class
Solution: fixed pop_ddinf in parameters.c
Solution: fixed loading (not building) part of FlattextDocMgr
Solution: free centralized sample index in DistRetEval
Solution: fix QBEGIN case for InQueryOpParser and InqArabicParser
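The termCTF fix listed above relies on virtual dispatch. The sketch below illustrates that mechanism with two hypothetical miniature classes (DocList and PositionDocList, which are not Lemur's declarations). It assumes, without confirmation from these notes, that the wrong value came from calls made through the base type reaching the base implementation.

```cpp
#include <cassert>

// Hypothetical base class standing in for a document list.
class DocList {
public:
    virtual ~DocList() {}
    // Declared virtual so that a call through a base reference or pointer
    // reaches the derived override instead of this default.
    virtual int termCTF() const { return 0; }
};

// Hypothetical derived class that tracks its own collection term frequency.
class PositionDocList : public DocList {
public:
    explicit PositionDocList(int ctf) : ctf_(ctf) { assert(ctf >= 0); }
    int termCTF() const override { return ctf_; }

private:
    int ctf_;
};

// Callers often hold only the base type; dispatch picks the right method.
int collectionFrequency(const DocList& list) {
    return list.termCTF();
}
```

If termCTF were not virtual in the base class, collectionFrequency would return the base value for every list, whatever its dynamic type.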
Last modified: Tue Nov 25 17:44:45 EST 2003