Query-Based Sampling in Lemur


Contents

  1. Overview
  2. QryBasedSample Application
  3. Sampling API
  4. Adding New Database Systems

1. Overview

The query-based sampling application and utility classes provide an extensible tool for creating descriptions of text databases. The QryBasedSample application samples documents from text databases. The QryBasedSampler utility class provides an API for building other applications that require a query-based sampling component. This document first describes usage of the application and then describes the sampling API. Finally, it describes how to extend the query-based sampling tools to sample non-Lemur databases.

2. QryBasedSample Application

The application QryBasedSample performs query-based sampling on text databases. It outputs the sampled documents and the database profiles. QryBasedSample takes a single command line argument, which is a parameter file. As with other Lemur applications, lines in the parameter file have the form:
 parameter = value; /* comment */ 
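
As an illustration of this line format, the following is a minimal sketch of splitting one such line into a name and a value. It is not Lemur's own parameter reader: the function names trim and parseParamLine are hypothetical, and the sketch assumes one parameter per line with the comment, if any, after the semicolon.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strips leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses one line of the form `parameter = value; /* comment */`.
// Returns false when the line has no '=', no terminating ';' after it,
// or an empty parameter name. Anything after the ';' is ignored.
bool parseParamLine(const std::string& line,
                    std::string& name, std::string& value) {
    const std::string::size_type eq = line.find('=');
    if (eq == std::string::npos) return false;
    const std::string::size_type semi = line.find(';', eq + 1);
    if (semi == std::string::npos) return false;
    assert(semi > eq);  // holds because the search started after '='
    const std::string parsedName = trim(line.substr(0, eq));
    if (parsedName.empty()) return false;
    name = parsedName;
    value = trim(line.substr(eq + 1, semi - eq - 1));
    return true;
}
```

Calling parseParamLine on the example line above yields the name "parameter" and the value "value".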
Summary of parameters:

3. Sampling API

The class central to the sampling API is QryBasedSampler. It performs query-based sampling on a database and writes the profile and documents to disk. Other important classes include FreqCounter, which builds the database profile, and DBManager, which provides an API for simple text database access. This section gives a brief description of these classes. For more detailed descriptions of functions, refer to the source code documentation.

QryBasedSampler

This class uses a DBManager and a FreqCounter to sample documents from a database and build a profile of the database's vocabulary. The probe function does this; its single argument is an initial query. If the initial query does not retrieve any documents, probe returns false. Before probe can be called, the application must create and set the sampler's database manager and frequency counter.

After the initial query, the sampler selects random query terms from the frequency counter. The means for selecting words is determined by the frequency counter's random mode. See the FreqCounter class for more information.
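
The sampling loop described above can be sketched as follows. This is an illustration under stated assumptions, not the QryBasedSampler implementation: ToyDatabase, Sample, and this free-standing probe function (with its extra parameters) are hypothetical stand-ins, documents are plain whitespace-separated strings, and the next query term is always chosen uniformly rather than according to a frequency counter's random mode.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a searchable database (replaces DBManager here).
struct ToyDatabase {
    std::vector<std::string> docs;

    // Returns indices of up to maxDocs documents containing `term`
    // as a whitespace-delimited word.
    std::vector<std::size_t> query(const std::string& term,
                                   std::size_t maxDocs) const {
        std::vector<std::size_t> hits;
        for (std::size_t i = 0; i < docs.size() && hits.size() < maxDocs; ++i) {
            std::istringstream words(docs[i]);
            std::string w;
            while (words >> w) {
                if (w == term) { hits.push_back(i); break; }
            }
        }
        return hits;
    }
};

// What a sampling run learns: the sampled documents and term counts.
struct Sample {
    std::set<std::size_t> docIds;
    std::map<std::string, long> ctf;  // collection term frequency
};

// Runs the initial query, then keeps querying with randomly chosen,
// previously unused terms from the learned vocabulary until targetDocs
// documents are sampled or no unused terms remain.
// Returns false when the initial query retrieves nothing.
bool probe(const ToyDatabase& db, const std::string& initialQuery,
           std::size_t docsPerQuery, std::size_t targetDocs,
           std::mt19937& rng, Sample& out) {
    assert(docsPerQuery > 0 && targetDocs > 0);
    std::set<std::string> usedTerms;
    std::string term = initialQuery;
    bool first = true;
    while (out.docIds.size() < targetDocs) {
        usedTerms.insert(term);
        const std::vector<std::size_t> hits = db.query(term, docsPerQuery);
        if (first && hits.empty()) return false;
        first = false;
        for (std::size_t id : hits) {
            if (!out.docIds.insert(id).second) continue;  // already sampled
            std::istringstream words(db.docs[id]);
            std::string w;
            while (words >> w) ++out.ctf[w];
        }
        // Choose the next query term uniformly among unused vocabulary terms.
        std::vector<std::string> candidates;
        for (const auto& entry : out.ctf)
            if (!usedTerms.count(entry.first)) candidates.push_back(entry.first);
        if (candidates.empty()) break;
        std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
        term = candidates[pick(rng)];
    }
    return true;
}
```

Because every query after the first uses a term learned from already sampled documents, the sample can only reach documents that share vocabulary with earlier ones; a document with entirely unrelated terms is never retrieved in this sketch.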

FreqCounter

This class builds a profile or simple language model from a TextHandler stream. In order to have the model updated properly when sampling, the application must build a TextHandler chain with the database manager's parser as the source and the frequency counter as the destination. See Parsing in the Lemur Toolkit for more details. The use of the TextHandler class here allows easy inclusion of a stemmer or indexing components. That is, a sampling application could easily build a collection selection database or normal retrieval database while sampling from a database.
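
The chain arrangement described above (parser as source, optional processing in the middle, counter as destination) can be illustrated with a toy version. The classes below are hypothetical and deliberately simplified; they are not Lemur's TextHandler, Parser, or FreqCounter, and the lowercasing step merely stands in for a stemmer.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Toy chain element. Each handler may transform a word and then
// passes it on to the next handler in the chain, if there is one.
class Handler {
public:
    virtual ~Handler() {}
    void setNext(Handler* next) {
        assert(next != this);  // a handler must not be chained to itself
        next_ = next;
    }
    void foundWord(const std::string& word) {
        const std::string out = handleWord(word);
        if (!out.empty() && next_) next_->foundWord(out);
    }
protected:
    // Default behaviour: pass the word through unchanged.
    virtual std::string handleWord(const std::string& word) { return word; }
private:
    Handler* next_ = nullptr;
};

// Source of the chain: splits a document on whitespace and emits each word.
class WhitespaceParser : public Handler {
public:
    void parse(const std::string& doc) {
        std::istringstream in(doc);
        std::string w;
        while (in >> w) foundWord(w);
    }
};

// Middle element: lowercases words (a stand-in for a stemmer).
class Lowercaser : public Handler {
protected:
    std::string handleWord(const std::string& word) override {
        std::string out = word;
        for (char& c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    }
};

// Destination: counts occurrences of each word it receives.
class CountingSink : public Handler {
public:
    const std::map<std::string, long>& counts() const { return counts_; }
protected:
    std::string handleWord(const std::string& word) override {
        ++counts_[word];
        return word;
    }
private:
    std::map<std::string, long> counts_;
};
```

A chain is assembled by linking the parser to the lowercaser and the lowercaser to the sink with setNext, then calling parse on each sampled document. Adding an indexing component would mean appending one more handler after the sink.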

A frequency counter can use an internal stopword list (Stopper class) specified in the constructor to filter out stopwords. A frequency counter can load its frequencies from a file using input and write frequencies to a file using output.

Frequency counters can also return random words. The randomWord function returns a word that is guaranteed not to have been returned since the last call to clear. The method used for selecting the random word is one of the following: R_CTF, R_DF, R_AVETF, or R_UNIF. R_CTF selects words with probability proportional to the terms' collection term frequency. R_DF selects words with probability proportional to the terms' document frequency. R_AVETF selects words with probability proportional to the terms' average term frequency (ctf/df). R_UNIF selects words with equal probability. The selection mode is set using the setRandomMode function.
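
The four selection modes can be sketched as a weighted random draw. This is a minimal illustration, not FreqCounter's code: the TermStats struct, the enum definition, the free-standing randomWord signature, and the use of a caller-owned `used` set to model "unique since the last call to clear" are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-term statistics kept by a frequency counter.
struct TermStats {
    long ctf;  // collection term frequency: total occurrences
    long df;   // document frequency: documents containing the term
};

// Mode names follow the text; the enum itself is an assumption.
enum RandomMode { R_CTF, R_DF, R_AVETF, R_UNIF };

// Weight of a term under a mode; selection probability is
// proportional to this weight.
double termWeight(const TermStats& s, RandomMode mode) {
    assert(s.ctf >= 0 && s.df >= 0);
    switch (mode) {
        case R_CTF:   return static_cast<double>(s.ctf);
        case R_DF:    return static_cast<double>(s.df);
        case R_AVETF: return s.df > 0 ? static_cast<double>(s.ctf) / s.df : 0.0;
        case R_UNIF:  return 1.0;
    }
    return 0.0;
}

// Draws one term that is not in `used`, then records it in `used` so it
// is not returned again until the caller empties the set.
// Returns an empty string when no unused term with positive weight remains.
std::string randomWord(const std::map<std::string, TermStats>& stats,
                       RandomMode mode, std::set<std::string>& used,
                       std::mt19937& rng) {
    std::vector<std::string> terms;
    std::vector<double> weights;
    for (const auto& entry : stats) {
        if (used.count(entry.first)) continue;
        const double w = termWeight(entry.second, mode);
        if (w <= 0.0) continue;
        terms.push_back(entry.first);
        weights.push_back(w);
    }
    if (terms.empty()) return "";
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    const std::string word = terms[pick(rng)];
    used.insert(word);
    return word;
}
```

For a term with ctf 9 and df 3, the weights are 9 under R_CTF, 3 under R_DF and R_AVETF, and 1 under R_UNIF. R_AVETF therefore favours terms that occur many times within the documents that contain them, whereas R_DF favours terms that occur in many documents.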

DBManager

The DBManager provides a simplified API for querying a database and retrieving documents. The goal in providing this class is to supply only the functionality needed for query-based sampling in a simple, contained class.

The query function takes a string (char *) and a number of documents to retrieve. The function must return a results_t structure, which has two fields: num, the number of results in the list, and docs, an array of docid_t (char *) containing the results. The caller is responsible for freeing the structure and its contents.

The getDoc function takes a docid_t and returns a doc_t structure. The doc_t structure consists of a docid_t called docid, a char * called doc, and an integer len, which indicates the number of characters in the document. The caller should make no assumptions about the format of the data in the doc field. The caller is responsible for freeing the structure and its contents.
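
The ownership rules for results_t and doc_t can be illustrated with the sketch below. The field names follow the descriptions above, but everything else is an assumption: the structures are shown as heap objects returned by pointer, storage is allocated with new and new[], and the helpers copyString, makeResults, makeDoc, freeResults, and freeDoc are hypothetical. Consult the Lemur headers for the actual definitions and allocation scheme before freeing anything. The sketch lives in its own namespace to avoid clashing with the real types.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

typedef char* docid_t;

struct results_t {
    int num;        // number of results in docs
    docid_t* docs;  // array of num document ids
};

struct doc_t {
    docid_t docid;  // id of the document
    char* doc;      // raw document data; format is opaque to the caller
    int len;        // number of characters in doc
};

// Copies a C string into freshly allocated storage owned by the caller.
char* copyString(const char* s) {
    assert(s != nullptr);
    char* out = new char[std::strlen(s) + 1];
    std::strcpy(out, s);
    return out;
}

// Hypothetical producer: builds a results_t that owns copies of the ids.
results_t* makeResults(const char* const* ids, int n) {
    assert(n >= 0);
    results_t* r = new results_t;
    r->num = n;
    r->docs = new docid_t[n];
    for (int i = 0; i < n; ++i) r->docs[i] = copyString(ids[i]);
    return r;
}

// Hypothetical producer: builds a doc_t that owns copies of id and text.
doc_t* makeDoc(const char* id, const char* text) {
    doc_t* d = new doc_t;
    d->docid = copyString(id);
    d->doc = copyString(text);
    d->len = static_cast<int>(std::strlen(text));
    return d;
}

// Caller-side cleanup: releases each id, the id array, then the structure.
void freeResults(results_t* r) {
    if (!r) return;
    for (int i = 0; i < r->num; ++i) delete[] r->docs[i];
    delete[] r->docs;
    delete r;
}

// Caller-side cleanup: releases the id, the document data, then the structure.
void freeDoc(doc_t* d) {
    if (!d) return;
    delete[] d->docid;
    delete[] d->doc;
    delete d;
}

}  // namespace sketch
```

The point of the sketch is the order of cleanup: contents first, then the containing structure, with the deallocation call matching however the database manager allocated the memory.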

The getParser function returns a MemParser that is capable of parsing the contents of the doc field of a doc_t.

The output function writes a document to the output file, which is specified using setOutputFile.

There are currently two implementations of the DBManager interface: LemurDBManager and MindDbManager. The LemurDBManager provides an example of communicating with a local database using an API, while the MindDbManager uses XML to communicate with a remote database.

MemParser

The MemParser class extends the Parser class. It adds a parse function that takes a doc_t. You are not required to override the existing functions of Parser. What matters most is that a MemParser is a TextHandler. See Parsing in the Lemur Toolkit for more details on TextHandlers.

4. Adding New Database Systems

Adding a new database system requires that you: