Query-Based Sampling in Lemur

Contents

Overview
QryBasedSample Application
Sampling API
Adding New Database Systems

1. Overview
The query-based sampling application and utility classes provide an extensible tool for creating descriptions of text databases. The QryBasedSample application samples text databases. The QryBasedSampler utility class provides an API for building other applications that require a query-based sampling component. This document first describes usage of the application and then describes the sampling API. Finally, it describes how to extend the query-based sampling tools to sample non-Lemur databases.
2. QryBasedSample Application
The application QryBasedSample performs query-based sampling on text databases. It outputs the sampled documents and database profiles. QryBasedSample takes a single command line argument, which is a parameter file. As with other Lemur applications, lines in the parameter file have the form:
 parameter = value; /* comment */ 
Summary of parameters:
dbManager Specifies which database manager to use. Specify lemur to sample Lemur databases and mind to sample MIND databases.

numDocs Terminate probe when the specified number of unique docs from the database have been seen.

numWords Terminate probe when the specified number of unique words from the database have been seen.

numQueries Terminate probe when the specified number of unique queries have been run.

docsPerQuery Use the specified number of documents per query to build the database description.

queryMode Selects the mode for query selection:

unif Words are chosen with equal probability from the documents seen so far.
avetf Words are chosen with probability proportional to their average term frequency.
ctf Words are chosen with probability proportional to their collection term frequency.
df Words are chosen with probability proportional to their document frequency.
listFile Specifies the file containing the list of databases to probe and their output prefixes. The file format is:
 db      prefix      dbname
where the items are separated by tabs and there is one tuple per line. For Lemur databases, the db field contains a parameter file specifying the retrieval parameters as in RetEval. The output prefix is used by the query-based sampler to create the filenames for outputting documents and database profiles. Documents are written to "prefixdocs" and profiles are written to "prefixmodel", that is, the prefix followed directly by "docs" or "model". For a MIND database, the db field contains a list of semicolon-separated items. The items are the XML URN of the proxy, the URL of the proxy, the XML URN of the proxy's interface, the XML URN of the proxy's construction text component, and the number of documents in the database. Example:
urn:proxy.Google;http://mind.proxy.url;urn:proxy-interface.Google;urn:proxy-construction-text.Google;2073418204
When sampling MIND databases, the documents are not stored locally, but a model built from the document sample field is stored.
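The list file format described above can be illustrated with a small parsing sketch. This is not the toolkit's own reader: the names ListEntry, splitOn, parseListLine, docsFileName and modelFileName are hypothetical, and the sketch assumes exactly three tab-separated fields per line.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One tuple from the list file: db <TAB> prefix <TAB> dbname.
// (Hypothetical type, introduced only for this sketch.)
struct ListEntry {
    std::string db;      // parameter file (Lemur) or semicolon-separated items (MIND)
    std::string prefix;  // output prefix
    std::string dbname;  // database name
};

// Split text on a single-character delimiter, keeping empty fields.
std::vector<std::string> splitOn(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = text.find(delim, start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// Parse one list-file line. Returns false unless the line has exactly
// three tab-separated fields (an assumption of this sketch).
bool parseListLine(const std::string& line, ListEntry& entry) {
    std::vector<std::string> fields = splitOn(line, '\t');
    if (fields.size() != 3) return false;
    entry.db = fields[0];
    entry.prefix = fields[1];
    entry.dbname = fields[2];
    return true;
}

// Output names are formed by appending "docs" and "model" to the prefix.
std::string docsFileName(const ListEntry& e) { return e.prefix + "docs"; }
std::string modelFileName(const ListEntry& e) { return e.prefix + "model"; }
```

For a MIND entry, calling splitOn(entry.db, ';') on the example line above yields the five items in the order listed in the text.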
initModel A language model to use for initial query selection. Words are selected using the specified query mode. The initial model has the same format as models generated by the sampler:
 word      ctf      df
where ctf is the collection term frequency of the word and df is the document frequency.
MindRegistry If sampling MIND databases, this parameter is required. It should contain the URL of the MIND Registry.
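As an illustration, a parameter file for sampling Lemur databases might look like the following. The parameter names come from the summary above; the values and file names are placeholders chosen for this example, not recommended settings.

```
dbManager = lemur;           /* sample Lemur databases */
listFile = dblist.txt;       /* hypothetical list file name */
initModel = initial.model;   /* hypothetical initial model file */
queryMode = unif;            /* choose query words with equal probability */
docsPerQuery = 4;            /* placeholder value */
numDocs = 300;               /* placeholder value */
```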
3. Sampling API
The class central to the sampling API is QryBasedSampler. It performs query-based sampling on a database and outputs the profile and documents to disk. Other important classes include FreqCounter, which builds the database profile, and DBManager, which provides an API for simple text database access. This section gives a brief description of these classes. For more detailed descriptions of functions, refer to the source code documentation.
QryBasedSampler
This class uses a DBManager and a FreqCounter to sample documents from a database and build a profile of the database's vocabulary. The probe function does this; its single argument is an initial query. If the initial query does not retrieve any documents, probe returns false. Before probe can be called, the application must create and set the sampler's database manager and frequency counter.
After the initial query, the sampler selects random query terms from the frequency counter. The means for selecting words is determined by the frequency counter's random mode. See the FreqCounter class for more information.
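The overall shape of a probe can be sketched with a toy in-memory database. This is only an illustration under stated assumptions, not the QryBasedSampler implementation: Limits, SampleState, toySearch, pickUnusedTerm and probeSketch are hypothetical names, the next query term is picked deterministically rather than at random, and the sketch assumes that the probe stops as soon as any one of the numDocs, numWords or numQueries limits is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Stopping limits, named after the application's parameters.
struct Limits {
    std::size_t numDocs;
    std::size_t numWords;
    std::size_t numQueries;
};

// What the probe has seen so far (hypothetical bookkeeping type).
struct SampleState {
    std::set<std::size_t> docsSeen;
    std::set<std::string> wordsSeen;
    std::set<std::string> queriesRun;
};

// Toy search: indices of up to maxDocs documents that contain the term
// as a whitespace-delimited word.
std::vector<std::size_t> toySearch(const std::vector<std::string>& db,
                                   const std::string& term,
                                   std::size_t maxDocs) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < db.size() && hits.size() < maxDocs; ++i) {
        std::istringstream words(db[i]);
        std::string w;
        while (words >> w) {
            if (w == term) { hits.push_back(i); break; }
        }
    }
    return hits;
}

// Deterministic stand-in for random term selection: the first seen word
// that has not been used as a query yet, or "" when none is left.
std::string pickUnusedTerm(const SampleState& s) {
    for (const std::string& w : s.wordsSeen) {
        if (s.queriesRun.count(w) == 0) return w;
    }
    return "";
}

// Returns false if the initial query retrieves nothing, true otherwise.
bool probeSketch(const std::vector<std::string>& db,
                 const std::string& initialQuery,
                 const Limits& limits,
                 std::size_t docsPerQuery,
                 SampleState& state) {
    assert(docsPerQuery > 0);
    std::string query = initialQuery;
    bool firstQuery = true;
    while (!query.empty() &&
           state.docsSeen.size() < limits.numDocs &&
           state.wordsSeen.size() < limits.numWords &&
           state.queriesRun.size() < limits.numQueries) {
        state.queriesRun.insert(query);
        std::vector<std::size_t> hits = toySearch(db, query, docsPerQuery);
        if (firstQuery && hits.empty()) return false;
        firstQuery = false;
        for (std::size_t id : hits) {
            if (!state.docsSeen.insert(id).second) continue;  // already sampled
            std::istringstream words(db[id]);
            std::string w;
            while (words >> w) state.wordsSeen.insert(w);
        }
        query = pickUnusedTerm(state);
    }
    return true;
}
```

In the real sampler the vocabulary bookkeeping is done by the frequency counter and the searching by the database manager; the sketch folds both into plain containers to keep the control flow visible.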
FreqCounter
This class builds a profile or simple language model from a TextHandler stream. In order to have the model updated properly when sampling, the application must build a TextHandler chain with the database manager's parser as the source and the frequency counter as the destination. See Parsing in the Lemur Toolkit for more details. The use of the TextHandler class here allows easy inclusion of a stemmer or indexing components. That is, a sampling application could easily build a collection selection database or normal retrieval database while sampling from a database.
A frequency counter can use an internal stopword list (Stopper class) specified in the constructor to filter out stopwords. A frequency counter can load its frequencies from a file using input and write frequencies to a file using output.
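The word, ctf, df model format described for the initModel parameter can be read with a few lines of standard C++. The sketch below is not FreqCounter's input function: TermStats and readModel are hypothetical names, and the sketch assumes whitespace-separated fields with one record per line, skipping any line that does not match.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Frequencies stored for one word (hypothetical type for this sketch).
struct TermStats {
    long ctf;  // collection term frequency
    long df;   // document frequency
};

// Read "word ctf df" records, one per line. Lines that do not contain
// a word followed by two integers are skipped.
std::map<std::string, TermStats> readModel(std::istream& in) {
    std::map<std::string, TermStats> model;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string word;
        TermStats stats;
        if (fields >> word >> stats.ctf >> stats.df) {
            model[word] = stats;
        }
    }
    return model;
}
```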
Frequency counters can also return random words. The randomWord function returns a word guaranteed to be unique since the last call to clear. The selection method is one of the following: R_CTF, R_DF, R_AVETF, or R_UNIF. R_CTF selects words with probability proportional to the term's collection term frequency. R_DF selects words with probability proportional to the term's document frequency. R_AVETF selects words with probability proportional to the term's average term frequency (ctf/df). R_UNIF selects words with equal probability. The selection mode is set using the setRandomMode function.
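The four selection modes differ only in the weight given to each word. The following sketch shows one way to realize them with the standard library's std::discrete_distribution. It is an assumption-laden illustration, not FreqCounter's code: the RandomMode enum reuses the mode names from the text, while TermEntry, selectionWeight and drawWord are hypothetical, and uniqueness is modeled with a simple "used" flag per word.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Mode names taken from the text; the enum itself is part of the sketch.
enum RandomMode { R_CTF, R_DF, R_AVETF, R_UNIF };

struct TermEntry {
    std::string word;
    double ctf;  // collection term frequency
    double df;   // document frequency
};

// Unnormalized selection weight of a word under the given mode.
double selectionWeight(const TermEntry& t, RandomMode mode) {
    switch (mode) {
        case R_CTF:   return t.ctf;
        case R_DF:    return t.df;
        case R_AVETF: return t.df > 0 ? t.ctf / t.df : 0.0;
        case R_UNIF:  return 1.0;
    }
    return 0.0;
}

// Draw the index of one word that has not been drawn before and mark it
// as used. Returns terms.size() when no word with positive weight is left.
std::size_t drawWord(const std::vector<TermEntry>& terms,
                     std::vector<bool>& used,
                     RandomMode mode,
                     std::mt19937& rng) {
    assert(used.size() == terms.size());
    std::vector<double> weights(terms.size(), 0.0);
    double total = 0.0;
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (!used[i]) {
            weights[i] = selectionWeight(terms[i], mode);
            total += weights[i];
        }
    }
    if (total <= 0.0) return terms.size();
    // Words already used keep weight 0 and therefore cannot be drawn.
    std::discrete_distribution<> pick(weights.begin(), weights.end());
    std::size_t chosen = static_cast<std::size_t>(pick(rng));
    used[chosen] = true;
    return chosen;
}
```

Resetting every entry of used to false plays the role that clear plays in the description above.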
DBManager
The DBManager provides a simplified API for querying a database and retrieving documents. The goal in providing this class is to supply only the functionality needed for query-based sampling in a simple, contained class.
The query function takes a string (char *) and a number of documents to retrieve. The function must return a results_t structure, which has two fields: num, the number of results in the list, and docs, an array of docid_t (char *) containing the results. The caller is responsible for freeing the structure and its contents.
The getDoc function takes a docid_t and returns a doc_t structure. The doc_t structure consists of a docid_t called docid, a char * called doc, and an integer len which indicates the number of characters in the document. The caller should make no assumptions about the format of the data in the doc field. The caller is responsible for freeing the structure and the contents of the structure.
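The ownership rules for results_t and doc_t can be made concrete with a small sketch. The field names follow the description above, but the struct definitions here are reconstructions rather than copies of the toolkit headers, and the helper functions (copyString, makeResults, freeResults, makeDoc, freeDoc) are hypothetical. In particular, the sketch assumes allocation with new and new[]; how the real database managers allocate these structures is not stated here, so check the source before choosing how to free them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef char* docid_t;

// Reconstructed from the description; not copied from the toolkit.
struct results_t {
    int num;        // number of results in the list
    docid_t* docs;  // array of num document ids
};

struct doc_t {
    docid_t docid;  // id of the document
    char* doc;      // document contents, format unspecified
    int len;        // number of characters in doc
};

// Copy a C string into memory owned by the caller (allocated with new[]).
char* copyString(const char* s) {
    std::size_t n = std::strlen(s);
    char* out = new char[n + 1];
    std::memcpy(out, s, n + 1);
    return out;
}

// Build a results_t the way a query function might return it.
results_t* makeResults(const char* const* ids, int count) {
    assert(count >= 0);
    results_t* r = new results_t;
    r->num = count;
    r->docs = new docid_t[count];
    for (int i = 0; i < count; ++i) r->docs[i] = copyString(ids[i]);
    return r;
}

// Caller side: free every id, then the array, then the structure.
void freeResults(results_t* r) {
    if (r == 0) return;
    for (int i = 0; i < r->num; ++i) delete[] r->docs[i];
    delete[] r->docs;
    delete r;
}

// Build a doc_t the way a getDoc function might return it.
doc_t* makeDoc(const char* id, const char* text) {
    doc_t* d = new doc_t;
    d->docid = copyString(id);
    d->doc = copyString(text);
    d->len = static_cast<int>(std::strlen(text));
    return d;
}

// Caller side: free the contents, then the structure.
void freeDoc(doc_t* d) {
    if (d == 0) return;
    delete[] d->docid;
    delete[] d->doc;
    delete d;
}
```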
The getParser function returns a MemParser that is capable of parsing the contents of the doc field of a doc_t.
The output function writes a document to file, which is specified using setOutputFile.
There are currently two implementations of the DBManager interface: LemurDBManager and MindDbManager. LemurDBManager provides an example of communicating with a local database using an API, while MindDbManager uses XML to communicate with a remote database.
MemParser
The MemParser class extends the Parser class. It adds a parse function that takes a doc_t. It is not required that you override the existing functions of Parser. Most important is that it is a TextHandler. See Parsing in the Lemur Toolkit for more details on TextHandlers.
4. Adding New Database Systems
Adding a new database system requires that you:

Wrap the database in a class that inherits from DBManager.
Provide a parser which inherits from MemParser.
Integrate the database wrapper into the QryBasedSample application:

Use the dbManager parameter to check which database manager the application should use.
Modify AppMain so that the QryBasedSampler object is passed the DBManager you created.
Modify the program so that when the program is terminating, it will free any memory you may have allocated.
Update the usage function in QryBasedSample to reflect the changes you've made.
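The steps above can be sketched with a simplified stand-in for the real interface. Everything in this example is hypothetical: SketchDBManager is not the toolkit's DBManager (it uses std::string and std::vector instead of results_t and doc_t, and omits the parser), InMemoryDBManager wraps a vector of strings instead of a real database, and the value "memory" is an invented dbManager setting. The point is only the shape: a wrapper class, a factory keyed on the dbManager parameter, and ownership that releases the manager when the program ends.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the database manager interface.
class SketchDBManager {
public:
    virtual ~SketchDBManager() {}
    // Ids of up to numDocs documents matching the query term.
    virtual std::vector<std::string> query(const std::string& term,
                                           std::size_t numDocs) = 0;
    // Text of the document with the given id, or "" if unknown.
    virtual std::string getDoc(const std::string& docid) = 0;
};

// Wrapper for a new "database system": here just a vector of strings.
class InMemoryDBManager : public SketchDBManager {
public:
    explicit InMemoryDBManager(const std::vector<std::string>& docs)
        : docs_(docs) {}

    std::vector<std::string> query(const std::string& term,
                                   std::size_t numDocs) override {
        std::vector<std::string> ids;
        for (std::size_t i = 0; i < docs_.size() && ids.size() < numDocs; ++i) {
            // Substring match keeps the toy search short.
            if (docs_[i].find(term) != std::string::npos) ids.push_back(idOf(i));
        }
        return ids;
    }

    std::string getDoc(const std::string& docid) override {
        for (std::size_t i = 0; i < docs_.size(); ++i) {
            if (idOf(i) == docid) return docs_[i];
        }
        return "";
    }

private:
    static std::string idOf(std::size_t i) { return "doc" + std::to_string(i); }
    std::vector<std::string> docs_;
};

// Choose the manager from the dbManager parameter value. Returns a null
// pointer for values this sketch does not know about.
std::unique_ptr<SketchDBManager> makeManager(
        const std::string& dbManagerParam,
        const std::vector<std::string>& docs) {
    if (dbManagerParam == "memory") {
        return std::unique_ptr<SketchDBManager>(new InMemoryDBManager(docs));
    }
    return nullptr;
}
```

Holding the manager in a std::unique_ptr is one way to satisfy the cleanup step, since the object is destroyed automatically when the owning scope exits.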