Distributed Information Retrieval in Lemur


Contents

  1. Overview
  2. Applications
  3. Distributed Search and Merge API

1. Overview

Distributed retrieval in Lemur is built around the RetrievalMethod API. The DistSearchMethod class searches multiple indexes with the same query and stores the results. These results are then passed to a DistMergeMethod, which merges the scores based on each index's ranking score and each individual document score. DistMergeMethod is an abstract API that supports the implementation of different merging techniques.

2. Applications

CollSelIndex - builds a collection selection index

DistRetEval - distributed retrieval (rank, search, and merge) using a collection selection index and individual indexes

3. Distributed Search and Merge API

RetrievalMethod

Collection selection, or database ranking, uses the TextRetrievalMethod API, as implemented by CORIRetMethod. It treats a collection selection index like a regular index in which each "document" is actually a database.
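
The idea of ranking databases as if they were documents can be illustrated with a minimal, self-contained sketch. It does not use the Lemur classes and does not implement CORI: DbSummary, scoreDatabase, and rankDatabases are hypothetical names, and the scoring formula (summed relative term frequency) is an assumption chosen only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: one database summarized as a single "document",
// i.e. a bag of terms aggregated over everything the database contains.
struct DbSummary {
    std::string name;
    std::map<std::string, int> termCounts;
    int totalTerms = 0;
};

// Toy scoring (an assumption, NOT the CORI formula): sum of the relative
// frequencies of the query terms within the database summary.
double scoreDatabase(const DbSummary& db, const std::vector<std::string>& query) {
    assert(db.totalTerms >= 0);
    if (db.totalTerms == 0) return 0.0;
    double s = 0.0;
    for (const std::string& term : query) {
        std::map<std::string, int>::const_iterator it = db.termCounts.find(term);
        if (it != db.termCounts.end())
            s += static_cast<double>(it->second) / db.totalTerms;
    }
    return s;
}

typedef std::pair<std::string, double> RankedDb;

// Rank the databases exactly as a retrieval method would rank documents:
// score each one against the query, then sort by descending score.
std::vector<RankedDb> rankDatabases(const std::vector<DbSummary>& dbs,
                                    const std::vector<std::string>& query) {
    std::vector<RankedDb> ranked;
    for (const DbSummary& db : dbs)
        ranked.push_back(RankedDb(db.name, scoreDatabase(db, query)));
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const RankedDb& a, const RankedDb& b) { return a.second > b.second; });
    return ranked;
}
```

The ranked list produced here plays the role that the database ranking plays in the rest of this section: it identifies which databases to search and supplies a per-database score for the merge step.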

DistSearchMethod

The main method in this class is scoreIndexSet(Query &qry, IndexedRealVector &indexset, DocScoreVector** results). The indexes in indexset should correspond to the indexes in the collection selection database passed into the constructor or set by setIndex. This method loads each individual database's parameter file and scores its documents against the given query. It uses whichever RetrievalMethod is specified in the parameter file; if none is specified, it uses a default, which can be set with setDefaultRetMethod(RetMethodManager::RetModel rt). The method does not actually use the ranking scores in indexset. It accepts this data structure for convenience, because it is the same one returned by the ranking method. A second scoreIndexSet method accepts a vector of database id strings instead.

A DocScoreVector is allocated for each index in indexset and stored in results. The caller is responsible for freeing this memory. Unlike a RetrievalMethod, which returns scores keyed by the index's internal document ids, DistSearchMethod converts the internal document ids to external document character ids. This ensures there are no id conflicts when the scores are later merged into one list.
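
The reason for the id conversion can be shown with a small, self-contained sketch. It does not use the Lemur classes: ToyIndex, InternalScore, ExternalScore, and toExternal are hypothetical stand-ins, and the assumption that internal ids are small integers local to each index is made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins; these are not the Lemur types.
struct InternalScore { int docid; double score; };          // per-index result, internal id
struct ExternalScore { std::string docid; double score; };  // result after id conversion

struct ToyIndex {
    // externalIds[i] is the external id of the document whose internal id is i.
    std::vector<std::string> externalIds;
};

// Convert one index's result list from internal ids to external ids.
// Internal ids are only meaningful inside their own index, so two indexes
// can both have a document 1; the external ids are assumed to be unique
// across all indexes.
std::vector<ExternalScore> toExternal(const ToyIndex& index,
                                      const std::vector<InternalScore>& scores) {
    std::vector<ExternalScore> out;
    out.reserve(scores.size());
    for (const InternalScore& s : scores) {
        assert(s.docid >= 0 &&
               static_cast<std::size_t>(s.docid) < index.externalIds.size());
        out.push_back({index.externalIds[s.docid], s.score});
    }
    return out;
}
```

In this sketch, the same internal id in two different indexes maps to two different external ids, so the converted lists can be concatenated without one document masking another.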

DistMergeMethod

This is an abstract interface for merging scores from individual databases. These databases should have ranking scores. Applications should call mergeScoreSet(IndexedRealVector &indexset, DocScoreVector** scoreset, DocScoreVector &results), where indexset is the same one used for DistSearchMethod and scoreset holds the results from DistSearchMethod. Implementing classes should override the score(double dbscore, double docscore) method. For each document in each index, mergeScoreSet creates a merged score using the score method and stores it in the results vector. The returned results are not sorted, but they can be sorted with the Sort method in DocScoreVector. CORIMergeMethod is one implementation of this interface. SingleRegrMergeMethod is another.
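
The division of labor between mergeScoreSet and score can be sketched with stand-in types. This is a self-contained illustration, not the Lemur source: ToyMergeMethod, ProductMerge, DbScore, DocScore, and sortDescending are hypothetical names, and the product formula in ProductMerge is an assumption, not the formula used by CORIMergeMethod or SingleRegrMergeMethod.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the Lemur classes.
struct DbScore  { int dbid; double score; };           // one ranked database
struct DocScore { std::string docid; double score; };  // one scored document (external id)

class ToyMergeMethod {
public:
    virtual ~ToyMergeMethod() {}

    // For each document in each database, compute a merged score and append
    // it to results. scoreset[i] holds the documents retrieved from the
    // database described by indexset[i]. The output is left unsorted.
    void mergeScoreSet(const std::vector<DbScore>& indexset,
                       const std::vector<std::vector<DocScore> >& scoreset,
                       std::vector<DocScore>& results) const {
        assert(indexset.size() == scoreset.size());
        for (std::size_t i = 0; i < indexset.size(); ++i)
            for (const DocScore& d : scoreset[i])
                results.push_back({d.docid, score(indexset[i].score, d.score)});
    }

protected:
    // The only piece a concrete merging technique has to supply.
    virtual double score(double dbscore, double docscore) const = 0;
};

// Assumed formula, for illustration only: weight each document score by
// the ranking score of the database it came from.
class ProductMerge : public ToyMergeMethod {
protected:
    double score(double dbscore, double docscore) const override {
        return dbscore * docscore;
    }
};

// Sorting is a separate step performed by the caller after merging.
void sortDescending(std::vector<DocScore>& v) {
    std::sort(v.begin(), v.end(),
              [](const DocScore& a, const DocScore& b) { return a.score > b.score; });
}
```

Keeping the loop in the base class and the formula in a single virtual method means a new merging technique only needs to define how one database score and one document score combine.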