Changes to the TextHandler class


Contents

  1. TextHandler
  2. TextHandler::TokenType
  3. PropertyList
  4. Property

1. TextHandler

The TextHandler class was originally designed to facilitate passing document tokenization information along a chain of components from a tokenizer/parser (classes inheriting from the Parser class) through components like stopword lists and stemmers to an indexer, a query builder, or a tool that writes the parsed documents to file. While it has served these goals well for many of the components in use in Lemur, it was not an adequate API for all tasks. In particular, we sometimes wished to have the original token produced by the parser, unmodified by stopword list removal or stemming. We also wished to pass on more document information, such as structure found in XML, part-of-speech tags, or named-entity tags.

To facilitate these goals, we have modified the API of the TextHandler class. Information along the TextHandler chain is now passed using the foundToken function. The foundToken function takes the place of the foundWord and foundDoc functions. It takes two extra arguments: one for indicating the token type and another for passing on the unmodified token from the source of the TextHandler chain. Here's its prototype:

     void foundToken(TextHandler::TokenType type, char * token,
                     char * orig, PropertyList * properties);

In addition, we would like to add another method to the TextHandler API to enable getting the next TextHandler in the chain, if there is one:

     virtual TextHandler * getTextHandler();

Because of the TextHandler base class, all of these changes are backwards compatible and do not require changing any of the current TextHandler implementations.
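
As a rough illustration of the chain described above, the following sketch uses simplified stand-in classes rather than the actual Lemur headers. The foundToken and getTextHandler signatures follow the prototypes above. The enumerator names, the setTextHandler method, the empty PropertyList struct, and the LowercaseHandler and CollectingHandler classes are assumptions made for illustration only.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Simplified stand-ins, NOT the Lemur declarations.
struct PropertyList {};  // placeholder; the real class holds (name, value) pairs

class TextHandler {
public:
  // Hypothetical enumerator names; the real enumeration may differ.
  enum TokenType { BEGINDOC, ENDDOC, WORDTOK };

  TextHandler() : next(NULL) {}
  virtual ~TextHandler() {}

  // Assumed helper for wiring one handler to the next.
  virtual void setTextHandler(TextHandler * th) { next = th; }

  // Returns the next handler in the chain, or NULL if there is none.
  virtual TextHandler * getTextHandler() { return next; }

  // Default behavior in this sketch: pass the token along unchanged.
  virtual void foundToken(TokenType type, char * token,
                          char * orig, PropertyList * properties) {
    if (next != NULL) next->foundToken(type, token, orig, properties);
  }

protected:
  TextHandler * next;
};

// Lowercases a private copy of each word token; orig is passed on untouched.
class LowercaseHandler : public TextHandler {
public:
  virtual void foundToken(TokenType type, char * token,
                          char * orig, PropertyList * properties) {
    if (type != WORDTOK || token == NULL) {
      TextHandler::foundToken(type, token, orig, properties);
      return;
    }
    // Copy the token including its terminating '\0'.
    std::vector<char> buf(token, token + std::strlen(token) + 1);
    for (std::size_t i = 0; i + 1 < buf.size(); ++i)
      buf[i] = static_cast<char>(
          std::tolower(static_cast<unsigned char>(buf[i])));
    TextHandler::foundToken(type, &buf[0], orig, properties);
  }
};

// End of the chain: records both the processed and the original token.
class CollectingHandler : public TextHandler {
public:
  virtual void foundToken(TokenType type, char * token,
                          char * orig, PropertyList * properties) {
    if (type == WORDTOK && token != NULL && orig != NULL) {
      tokens.push_back(token);     // copied into std::string
      originals.push_back(orig);   // copied into std::string
    }
    TextHandler::foundToken(type, token, orig, properties);
  }
  std::vector<std::string> tokens;
  std::vector<std::string> originals;
};
```

In this sketch, a LowercaseHandler wired to a CollectingHandler delivers both the lowercased token and the parser's original token to the end of the chain, which is the information the extra orig argument is meant to preserve.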

2. TextHandler::TokenType

TokenType is an enumeration including words, tags, and document boundary markers. You may add to this list of types for your own tools. For example, you may wish to use a parser that identifies sentence boundaries. An appropriate way to pass this information along the TextHandler chain would be to add types for beginning of sentence and end of sentence boundaries. Here's a list of the current types:
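
The sentence-boundary example can be sketched as follows. The enumerator names in this block are hypothetical and do not reproduce the actual list of current types; BEGINSENT and ENDSENT stand for the added types, and countSentences is an illustrative helper, not part of Lemur.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical enumeration; the names are illustrative, not Lemur's.
enum TokenType {
  BEGINDOC, ENDDOC,     // document boundary markers
  WORDTOK,              // words
  BEGINTAG, ENDTAG,     // tags
  BEGINSENT, ENDSENT    // added types for sentence boundaries
};

// Counts sentences that are both opened and closed in a stream of token types.
std::size_t countSentences(const std::vector<TokenType> & types) {
  std::size_t count = 0;
  bool open = false;
  for (std::size_t i = 0; i < types.size(); ++i) {
    if (types[i] == BEGINSENT) {
      open = true;
    } else if (types[i] == ENDSENT && open) {
      ++count;
      open = false;
    }
  }
  return count;
}
```

Components that do not care about sentence boundaries can ignore the added types and simply pass them along the chain.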

3. PropertyList

A PropertyList is a container for properties of tokens. Example properties may be the byte offset of the token in the file, attributes associated with a tag, document properties, and so on. Items in the property list are (name, value) pairs.

Property and PropertyList objects have some unconventional behaviors. While unconventional, they are very simple. They are designed to facilitate object reuse and fast processing in the expected use of these objects (processing and tokenizing documents). This behavior is similar to that of InvPushIndexer, which does not assume that the document properties object is stable from call to call.

A PropertyList object is owned by its creator. That is, you should not assume that the properties in it will be the same in subsequent calls to TextHandler::foundToken. The creator is also responsible for freeing the memory associated with the list.
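
The ownership rule above might look like this in practice. This is a minimal sketch under stated assumptions: a std::map typedef stands in for the real PropertyList class (whose interface differs), the "offset" property name is hypothetical, and OffsetRecorder and tokenize are illustrative names. The consumer copies the values it needs during the call instead of keeping a pointer to the list.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in only: the real PropertyList has its own interface.
typedef std::map<std::string, std::string> PropertyList;

// A consumer that copies what it needs while the call is in progress.
class OffsetRecorder {
public:
  void foundToken(const char * token, const PropertyList * properties) {
    if (token == NULL || properties == NULL) return;
    PropertyList::const_iterator it = properties->find("offset");
    if (it != properties->end()) {
      tokens.push_back(token);        // copy the token text
      offsets.push_back(it->second);  // copy the value, do not keep a pointer
    }
  }
  std::vector<std::string> tokens;
  std::vector<std::string> offsets;
};

// A producer that creates one list, reuses it for every token,
// and releases it when done (here, when props goes out of scope).
void tokenize(OffsetRecorder & consumer) {
  PropertyList props;
  props["offset"] = "0";
  consumer.foundToken("hello", &props);
  props["offset"] = "6";              // same object, new contents
  consumer.foundToken("world", &props);
}
```

Had the consumer stored the PropertyList pointer instead, both recorded entries would have referred to the same reused object, and the pointer would dangle once the producer returned.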

The PropertyList interface has the following function prototypes:

These calls support iteration over all properties in the list:

These calls support addition/removal of properties to/from the list:

4. Property

Note that the name and the value returned in calls to a Property object are owned by the Property object. That is, you should not free the pointers returned from getName or getValue.
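
A short sketch of what this means for callers follows. The Property class here is a stand-in, not the Lemur declaration: it assumes getName and getValue return C strings, and setValue and copyValue are hypothetical names added for the illustration. A caller that needs the value later copies it rather than freeing or holding the returned pointer.

```cpp
#include <string>

// Stand-in only; the real Property class may use different types.
class Property {
public:
  Property(const std::string & n, const std::string & v)
      : name(n), value(v) {}
  // The returned pointers are owned by this object; callers must not free them.
  const char * getName() const { return name.c_str(); }
  const char * getValue() const { return value.c_str(); }
  // Hypothetical setter, used here to model reuse of the object.
  void setValue(const std::string & v) { value = v; }
private:
  std::string name;
  std::string value;
};

// Copies the value so it stays valid after the Property is reused or destroyed.
std::string copyValue(const Property & p) {
  return std::string(p.getValue());
}
```

The copy costs an allocation, so it is only worth making for values that must outlive the current call.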


The Lemur Project
Last modified: Mon Mar 17 15:32:42 EST 2003