Parsing in Lemur

Contents

Overview

The Parser Architecture

The Parser Applications

1. Overview
This document discusses the parsing utilities provided by the Lemur toolkit. They have been designed with flexibility and extendibility in mind. If the functionality required is not currently implemented by the toolkit, it should be easy to add the functionality and plug it into the parser framework. The first section describes the parser applications and their options. The other section describes the parser architecture or API for developers.

2. The Parser Architecture for Lemur

The Lemur parser architecture revolves around one class, TextHandler, that allows for the chaining or pipelining of common parser components. A TextHandler may be a stop-word list, stemmer, indexer, or parser. Information is passed from a source, through TextHandlers that modify information and pass it on, to a destination TextHandler. An example of a source TextHandler would be a parser. A stemmer would modify text and pass the information on to other TextHandlers. A destination TextHandler might write parsed data to a file or push build an index. The diagram below is an example of how TextHandlers might be chained.

The TextHandler class enforces chaining through its interface. A diagram summarizing the functions of the TextHandler class is given below.The next TextHandler in a chain is set using the setTextHandler function. For example calling the Parser's setTextHandler function with an argument of the Stop-word list would cause information to be passed from the Parser to the Stop-word list. TextHandlers may modify the information it receives before passing the information on to the next TextHandler.To do this, provide implementations to either the handleDoc, handleSymbol or handleWord functions. For example, a stemmer would stem the word in the handleWord function. An Indexer would need to implement both handleDoc and handleWord functions. Inside those functions, the Indexer would push the words and documents into an index.

The foundDoc, foundSymbol, and foundWord functions enforce the chaining of the calls.When either is called, the corresponding handleDoc/handleWord/handleSymbol function is called with an argument of the document number or word. The foundDoc, foundSymbol, or foundWord function of the textHandler of the object is then called using the return value of the handleDoc/handleWord/handleSymbol as the argument. Base implementations of all functions are provided by the TextHandler class, a subclass will only need to override the functions that it needs. In general, subclasses should only override handleDoc and handleWord functions. Classes that provide sources for information should call the foundWord and foundDoc functions of their textHandler.

The TextHandler class provides the basis for most of the classes used by Lemur for parsing. The hope is that this class will provide a flexible base for extending parser functionality. The following subsections discuss classes used by the parser applications. The only of the following classes that does not extend the TextHandler class is the WordSet class.
WordSet

The WordSet class is a simple wrapper to a set. It is useful for stop-word lists or acronym lists. It can load a list from a file. The file format is one word per line. WordSet does NOT remove white space on either side of the word be careful when editing these files. The contains function is used to check the presence of a word in the set.

Parser

The Parser class is a generic interface for the parsers in the toolkit. It assumes subclasses implement a parse function, which takes a filename. The acronym list is WordSet, and some of the toolkit parsers check uppercase words and recognized acronyms against this list. If the word is in the acronym list, it is left uppercase. Otherwise, the word is converted to lowercase. If you do not wish to support the acronym list when you design your parser, that is fine. You can simply ignore the acronym list.

Both the TrecParser and the WebParser remove contractions and possessives, have a simple acronym recognizer, and convert words to lowercase.
The parsers assume that there is some SGML style markup seperating documents and specifying document number. The format for web documents is

<DOC>
<DOCNO> document_number </DOCNO>
document text
</DOC>

and the format for trec formatted documents is

<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
document text
</TEXT>
</DOC>
These document formats allow the inclusion of multiple documents in the same text file.
TrecParser

The TrecParser provides a simple but effective parser for NIST's TREC document format. It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
WebParser

The WebParser behaves very similarly to the TrecParser. It parses HTML documents in the NIST TREC format used for the Web Tracks.The parser removes HTML tags. Text within SCRIPT tags is removed, as is text in HTML comments.

ReutersParser

The ReutersParser extracts the TEXT, HEADLINE, and TITLE fields and removes other tags.

InQueryOpParser

The ArabicParser provides parsing for the InQuery structured query language structured queries.
ArabicParser

The ArabicParser provides a simple but effective parser for NIST's TREC document format for Arabic documents encoded in Windows CodePage 1256 encoding (CP1256). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
InqArabicParser

The InqArabicParser provides parsing for the InQuery structured query language structured queries in Arabic encoded in Windows CodePage 1256 encoding (CP1256).
ChineseParser

The ChineseParser provides a simple but effective parser for NIST's TREC document format for Chinese documents encoded in GB encoding (GB2312). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. This parser is suitable for parsing segmented (tokenized) documents.
ChineseCharParser

The ChineseCharParser provides a simple but effective parser for NIST's TREC document format for Chinese documents encoded in GB encoding (GB2312). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. This parser is suitable for parsing unsegmented documents, producing one token per Chinese character.
Stemmer

The Stemmer class provides an interface for stemmers. All that is required of a subclass is that it implement the stemWord function. The stemWord function may overwrite the current word, but should return the stem as its return value. Currently, the toolkit provides three subclasses; PorterStemmer, KStemmer, and ArabicStemmer.

PorterStemmer

PorterStemmer uses Porter's official stemmer (in c) to stem words. The PorterStemmer class does not stem words beginning with an uppercase letter. This is to prevent stemming of acronyms or names.

KStemmer

KStemmer uses Krovetz' stemmer (in c) to stem words. This is a less aggressive stemmer than the Porter stemmer.

ArabicStemmer

ArabicStemmer uses one of Larkey's Arabic stemmers (in c) to stem Arabic words. It provides five different stemming functions:

arabic_stop : arabic_stop
arabic_norm2 : table normalization
arabic_norm2_stop : table normalization with stopping
arabic_light10 : light stemming
arabic_light10_stop : light stemming with stopping

Stopper

The Stopper class is a subclass of the WordSet class and the TextHandler class. It replaces words in the stop-word list with a NULL pointer.

QueryTextHandler

The QueryTextHandler checks to see if a word in the query occurs more often in uppercase than original form in an Index. If the uppercase form is more common than the original form, the word is added to the query. This is to handle cases where acronyms are not capitalized in the query,

WriterTextHandler

The WriterTextHandler class writes information from a TextHandler chain to a file. This file is in a format compatible with BuildBasicIndex.

WriterInQueryHandler

The WriterInQueryHandler class writes information from a TextHandler chain processing the InQuery structured query language to a file. This file is in a format compatible with BuildBasicIndex.

InvFPTextHander

The InvFPTextHandler takes information from a TextHandler chain and uses InvFPPushIndex to build an InvFPIndex. Stop-words are not counted in the document length.

3. The Parser Applications
There are four parser applications provided in the toolkit. PushIndexer (and the incremental and passage versions of PushIndexer; IncIndexer, PassageIndexer, and IncPassageIndexer) builds a database, ParseToFile writes parsed text to a file, ParseQuery parses queries and writes output to file, and ParseInQueryOp parses InQuery structured query language queries and writes output to file. All applications use a parameter file for specifying parameters for parsing. The format of the file is:
 parameter = value; /* comment */
The first command line argument must be the parameter file. The other command line arguments specify the data files for applications to parse.
3.1 PushIndexer
PushIndexer builds a database using one of the toolkit Parser classes and InvFPPushIndex.

Usage: PushIndexer paramfile [datfile1]* [datfile2]* ...
* data files can be specified on the command line OR in a file specified as the dataFiles parameter
Summary of parameters in paramfile:

index Name of the index (without the .ifp extension).

memory Memory (in bytes) of InvFPPushIndex (def = 96000000).

stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.

acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.

docFormat:

"trec" for standard TREC formatted documents
"web" for web TREC formatted documents
"chinese" for segmented Chinese text (TREC format, GB encoding)
"chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
"arabic" for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

"porter" Porter stemmer.
"krovetz" Krovetz stemmer, requires additional parameters

KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

"arabic" arabic stemmer, requires additional parameters

arabicStemDir: Path to directory of data files used by the Arabic stemmers.
arabicStemFunc: Which stemming algorithm to apply, one of:

arabic_stop : arabic_stop
arabic_norm2 : table normalization
arabic_norm2_stop : table normalization with stopping
arabic_light10 : light9 plus ll prefix
arabic_light10_stop : light10 and remove stop words

dataFiles Name of file containing list of datafiles (one line per datafile name). If not specified, must enter datafiles on command line.

PassageIndexer and IncPassageIndexer each take an additional paramete:

passageSize which specifies the number of terms per passage (default 50).
3.2 ParseToFile
ParseToFile parses documents and writes output compatible with BuildBasicIndex. The program uses one of the toolkit's Parser classes to parse.
Usage: ParseToFile paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

outputFile Name of file to output parsed documents to.

stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are output to the file.

acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.

docFormat:

"trec" for standard TREC formatted documents
"web" for web TREC formatted documents
"chinese" for segmented Chinese text (TREC format, GB encoding)
"chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
"arabic" for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

"porter" Porter stemmer.
"krovetz" Krovetz stemmer, requires additional parameters

KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

"arabic" arabic stemmer, requires additional parameters

arabicStemDir: Path to directory of data files used by the Arabic stemmers.
arabicStemFunc: Which stemming algorithm to apply, one of:

arabic_stop : arabic_stop
arabic_norm2 : table normalization
arabic_norm2_stop : table normalization with stopping
arabic_light10 : light9 plus ll prefix
arabic_light10_stop : light10 and remove stop words

3.3 ParseQuery
ParseQuery parses queries using one of the toolkit's Parser classes and an Index.
Usage: ParseQuery paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

queryOutFile The name of the file to write the parsed queries to.

index Name of the index (with the .ifp or .bsc extension).

stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are left in the query.�

acronyms Name� of file containing acronym list (one word per line). Uppercase words recognized as acronyms (eg USA U.S.A. USAs USA's U.S.A.) are left uppercase as USA if USA is in the acronym list.� If no acronym list is specified, acronyms will not be recognized.

docFormat:

"trec" for standard TREC formatted documents
"web" for web TREC formatted documents
"chinese" for segmented Chinese text (TREC format, GB encoding)
"chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
"arabic" for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

"porter" Porter stemmer.
"krovetz" Krovetz stemmer, requires additional parameters

KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

"arabic" arabic stemmer, requires additional parameters

arabicStemDir: Path to directory of data files used by the Arabic stemmers.
arabicStemFunc: Which stemming algorithm to apply, one of:

arabic_stop : arabic_stop
arabic_norm2 : table normalization
arabic_norm2_stop : table normalization with stopping
arabic_light10 : light9 plus ll prefix
arabic_light10_stop : light10 and remove stop words

3.4 ParseInQueryOp
ParseInQueryOp parses queries using the InQueryOpParser class.
Usage: ParseQuery paramfile datfile1 datfile2 ...
The parameters are:

stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
docFormat:

"trec" for standard TREC formatted documents
"web" for web TREC formatted documents
"chinese" for segmented Chinese text (TREC format, GB encoding)
"chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
"arabic" for Arabic text (TREC format, Windows CP1256 encoding)

stemmer:

"porter" Porter stemmer.
"krovetz" Krovetz stemmer, requires additional parameters

KstemmerDir: Path to directory of data files used by Krovetz's stemmer.

"arabic" arabic stemmer, requires additional parameters

arabicStemDir: Path to directory of data files used by the Arabic stemmers.
arabicStemFunc: Which stemming algorithm to apply, one of:

arabic_stop : arabic_stop
arabic_norm2 : table normalization
arabic_norm2_stop : table normalization with stopping
arabic_light10 : light9 plus ll prefix
arabic_light10_stop : light10 and remove stop words

outputFile: name of the output file.

The Lemur Project
Last modified: Thu Dec 13 16:32:42 EST 2001