Parsing documents
Advanced: Understanding the TextHandler

Most of the parsing elements in Lemur inherit from the TextHandler class. In doing so, they inherit an API that lets them be "chained" together. Each TextHandler has a method that does its own processing on a given token; it then passes both the original token and its modified version to the next TextHandler in the chain. It's up to you to chain them together in the order that you want.

The base TextHandler class provides an implementation of every method, so a subclass only needs to override the handling methods it cares about. Commonly this is just one method: the one that decides what to do when a certain type of token is received.

Tokenizing
A TextHandler at the beginning of a chain normally does not need to override any methods, unless it wants to add a new TokenType. The WebParser is an example of a tokenizer that sits at the beginning of a chain. When it finds a token, it should call the following method on the next TextHandler in the chain:

void foundToken(TextHandler::TokenType type,
                char * token,
                char * original,
                PropertyList * properties)

type is an enumerated value that classifies the token, e.g. as a WORDSTR. token is the current token to be processed, which may have been modified by a previous TextHandler. original is the token exactly as it was first tokenized, without any modification. properties is a list of properties to associate with this token, e.g. a part-of-speech or named-entity tag.

foundToken calls the handling method that matches the TokenType. A subclass of TextHandler in the middle of the chain overrides the handling methods for the TokenTypes it cares about. This is most commonly just one method:

char* handleWord(char * token,
                 char * original,
                 PropertyList * properties)
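
The dispatch described above can be sketched roughly as follows. This is a simplified stand-in written for illustration, not Lemur's actual TextHandler: the class name SketchHandler, the TokenType values, and the WordRecorder subclass are all hypothetical, std::string replaces char *, and the PropertyList argument is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for illustration only; NOT Lemur's real TextHandler.
// std::string replaces char*, and the PropertyList argument is omitted.
class SketchHandler {
public:
    enum TokenType { BEGINDOC, ENDDOC, WORD };  // hypothetical names

    virtual ~SketchHandler() = default;

    // Dispatch on the token type, then forward the (possibly modified)
    // token together with the untouched original to the next handler.
    void foundToken(TokenType type, const std::string& token,
                    const std::string& original) {
        std::string result = token;
        switch (type) {
            case BEGINDOC: result = handleBeginDoc(token, original); break;
            case ENDDOC:   result = handleEndDoc(token, original);   break;
            case WORD:     result = handleWord(token, original);     break;
        }
        if (next_ != nullptr) next_->foundToken(type, result, original);
    }

    // Default implementations pass the token through unchanged, so a
    // subclass overrides only the methods it cares about.
    virtual std::string handleBeginDoc(const std::string& token,
                                       const std::string&) { return token; }
    virtual std::string handleEndDoc(const std::string& token,
                                     const std::string&) { return token; }
    virtual std::string handleWord(const std::string& token,
                                   const std::string&) { return token; }

    void setTextHandler(SketchHandler* th) {
        assert(th != this);  // a handler must not be chained to itself
        next_ = th;
    }
    SketchHandler* getTextHandler() const { return next_; }

private:
    SketchHandler* next_ = nullptr;
};

// Overrides a single method: remembers every word token it receives.
class WordRecorder : public SketchHandler {
public:
    std::string handleWord(const std::string& token,
                           const std::string&) override {
        words.push_back(token);
        return token;
    }
    std::vector<std::string> words;
};
```

Feeding a WordRecorder a begin-document token, a word, and an end-document token through foundToken leaves exactly one entry in words, because only the word token reaches the overridden method.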

Modifying
A stemmer is an example of a TextHandler in the middle of the chain. A stemmer's handleWord method would, for example, modify the word "tables" and return the token "table". "table" would get passed on as the token and "tables" as the original. The code for passing the token down the chain is in the base class so you get that in the subclass without doing anything extra. Other common methods to implement are handleBeginDoc and handleEndDoc.
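
As a rough illustration of the "tables" example, the sketch below uses a toy stemmer that only strips a trailing "s". It assumes a minimal stand-in base class (MiniHandler, with a hypothetical foundWord method) rather than Lemur's real TextHandler, and NaiveStemmer and Capture are invented names; a real stemmer does considerably more than this.

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for a chainable handler; not Lemur's real class.
class MiniHandler {
public:
    virtual ~MiniHandler() = default;

    // Forwarding lives in the base class: the subclass result becomes the
    // new token, while the original is passed along untouched.
    void foundWord(const std::string& token, const std::string& original) {
        const std::string modified = handleWord(token, original);
        if (next_ != nullptr) next_->foundWord(modified, original);
    }

    virtual std::string handleWord(const std::string& token,
                                   const std::string&) { return token; }

    void setTextHandler(MiniHandler* th) {
        assert(th != this);  // a handler must not be chained to itself
        next_ = th;
    }

private:
    MiniHandler* next_ = nullptr;
};

// Toy "stemmer": strips one trailing 's' from words longer than 3 letters.
class NaiveStemmer : public MiniHandler {
public:
    std::string handleWord(const std::string& token,
                           const std::string&) override {
        if (token.size() > 3 && token.back() == 's')
            return token.substr(0, token.size() - 1);
        return token;
    }
};

// Records the last token and original that reach the end of the chain.
class Capture : public MiniHandler {
public:
    std::string handleWord(const std::string& token,
                           const std::string& original) override {
        lastToken = token;
        lastOriginal = original;
        return token;
    }
    std::string lastToken;
    std::string lastOriginal;
};
```

With a NaiveStemmer chained to a Capture, calling foundWord("tables", "tables") on the stemmer delivers "table" as the token and "tables" as the original, and the subclass never has to write any forwarding code itself.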

Using
A TextHandler at the end of the chain should usually do something other than just pass the tokens along, such as write them to the screen or to a file. Lemur has some end-of-chain TextHandlers that push the documents and tokens into an index, such as InvFPTextHandler.

As long as a parsing element inherits the TextHandler API and implements the appropriate overriding methods, you can re-use the same elements with each other to create the parsing chain that you want.

Two methods set and retrieve the next TextHandler in the chain:

void setTextHandler(TextHandler * th)
TextHandler * getTextHandler()
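
Putting the pieces together, a three-element chain might be assembled as in the sketch below. Only the names setTextHandler and getTextHandler come from the text above; the ChainHandler base class, its foundWord method, and the WhitespaceTokenizer, Lowercaser, and StreamWriter classes are hypothetical stand-ins, not Lemur classes.

```cpp
#include <cassert>
#include <cctype>
#include <ostream>
#include <sstream>
#include <string>

// Minimal stand-in for a chainable handler; not Lemur's real TextHandler.
class ChainHandler {
public:
    virtual ~ChainHandler() = default;

    void foundWord(const std::string& token, const std::string& original) {
        const std::string modified = handleWord(token, original);
        if (next_ != nullptr) next_->foundWord(modified, original);
    }

    virtual std::string handleWord(const std::string& token,
                                   const std::string&) { return token; }

    void setTextHandler(ChainHandler* th) {
        assert(th != this);  // a handler must not be chained to itself
        next_ = th;
    }
    ChainHandler* getTextHandler() const { return next_; }

private:
    ChainHandler* next_ = nullptr;
};

// Start of chain: splits text on whitespace and hands each word to the
// next handler, with token and original identical at this point.
class WhitespaceTokenizer : public ChainHandler {
public:
    void parse(const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            if (getTextHandler() != nullptr)
                getTextHandler()->foundWord(word, word);
        }
    }
};

// Middle of chain: lowercases the token and leaves the original alone.
class Lowercaser : public ChainHandler {
public:
    std::string handleWord(const std::string& token,
                           const std::string&) override {
        std::string out = token;
        for (char& c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    }
};

// End of chain: writes "token (original)" lines to an output stream.
class StreamWriter : public ChainHandler {
public:
    explicit StreamWriter(std::ostream& out) : out_(out) {}
    std::string handleWord(const std::string& token,
                           const std::string& original) override {
        out_ << token << " (" << original << ")\n";
        return token;
    }
private:
    std::ostream& out_;
};
```

To use it, construct the three objects, call setTextHandler on the tokenizer with the address of the Lowercaser and on the Lowercaser with the address of the StreamWriter, then call parse on the tokenizer. The handlers hold plain non-owning pointers in this sketch, so every element must outlive the parse call.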


