The Lemur Toolkit Tutorial Pages

Parsing documents
Advanced: Understanding the `TextHandler`
Most of the parsing elements in Lemur inherit from the `TextHandler` class. In doing so, they inherit an API that enables them to be "chained" together. The idea is that each `TextHandler` has a method that handles what it wants to do with a given token. Then it passes that original token, along with the token it has modified, to the next `TextHandler` in the chain. It's up to you to chain them together in the order that you want.
	The base `TextHandler` class provides default implementations for all of these methods, so a subclass only needs to override the handling methods it cares about. This is commonly just one method, the one that decides what to do when a certain kind of token is received.
Tokenizing
A `TextHandler` at the beginning of a chain normally does not need to override any methods, unless it wants to add a new `TokenType`. The `WebParser` is an example of a tokenizer at the beginning of a chain. When it finds a token, it should call the following method on the next `TextHandler` in the chain:

`void foundToken(TextHandler::TokenType type, char * token, char * original, PropertyList * properties)`

`type` is an enumerated type that classifies the token, e.g. as a `WORDSTR`.
`token` is the current token to be processed; it might have been modified by the previous `TextHandler`.
`original` is the token as it was originally tokenized, without any modification.
`properties` is a list of properties to be associated with this token, e.g. a part of speech or named entity tag.

`foundToken` calls the handling method that matches the `TokenType`. A subclass of `TextHandler` in the middle of the chain overrides the methods for the `TokenType`s it wants to receive. This is most commonly just one method:

`char* handleWord(char * token, char * original, PropertyList * properties)`

Modifying
A stemmer is an example of a `TextHandler` in the middle of the chain. A stemmer's `handleWord` method would, for example, modify the word "tables" and return the token "table". "table" is then passed on as the token and "tables" as the original. The code for passing the token down the chain lives in the base class, so a subclass gets it without doing anything extra. Other common methods to implement are `handleBeginDoc` and `handleEndDoc`.

Using
A `TextHandler` at the end of the chain should usually do something other than just pass the tokens along, such as writing them to the screen or to a file. Lemur has some end-of-chain `TextHandler`s that push the documents and tokens into an index, such as `InvFPTextHandler`.
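The flow described above (dispatch in `foundToken`, modification in `handleWord`, forwarding by the base class) can be sketched with a small stand-in. This is not Lemur's code: `MiniTextHandler`, `ToyStemmer`, `Collector` and `stemThroughChain` are hypothetical names made up for illustration, `std::string` is used in place of `char *`, and the `PropertyList` argument is left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the TextHandler idea; NOT Lemur's real class.
// std::string replaces char *, and the PropertyList argument is omitted.
class MiniTextHandler {
public:
  enum TokenType { BEGINDOC, ENDDOC, WORDSTR };

  virtual ~MiniTextHandler() = default;

  // Dispatch on the token type, then forward the result down the chain.
  void foundToken(TokenType type, const std::string &token,
                  const std::string &original) {
    std::string out = token;
    if (type == WORDSTR) out = handleWord(token, original);
    if (next_ != nullptr) next_->foundToken(type, out, original);
  }

  // Default behaviour: pass the token through unchanged.
  virtual std::string handleWord(const std::string &token,
                                 const std::string & /*original*/) {
    return token;
  }

  void setTextHandler(MiniTextHandler *th) { next_ = th; }
  MiniTextHandler *getTextHandler() { return next_; }

private:
  MiniTextHandler *next_ = nullptr;
};

// Middle of the chain: a toy "stemmer" that only strips one trailing 's'.
class ToyStemmer : public MiniTextHandler {
public:
  std::string handleWord(const std::string &token,
                         const std::string &) override {
    if (token.size() > 1 && token.back() == 's')
      return token.substr(0, token.size() - 1);
    return token;
  }
};

// End of the chain: records (token, original) pairs instead of indexing them.
class Collector : public MiniTextHandler {
public:
  std::vector<std::pair<std::string, std::string>> seen;

  std::string handleWord(const std::string &token,
                         const std::string &original) override {
    seen.emplace_back(token, original);
    return token;
  }
};

// Plays the tokenizer's role: feed one word into the head of the chain and
// return the (token, original) pair that reached the end.
std::pair<std::string, std::string> stemThroughChain(const std::string &word) {
  ToyStemmer stemmer;
  Collector sink;
  stemmer.setTextHandler(&sink);
  stemmer.foundToken(MiniTextHandler::WORDSTR, word, word);
  assert(sink.seen.size() == 1);  // exactly one word went through
  return sink.seen.at(0);
}
```

Calling `stemThroughChain("tables")` in this sketch yields the pair ("table", "tables"): the modified token travels on while the original is preserved alongside it, which is the behaviour the stemmer example describes.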
	As long as a parsing element inherits the `TextHandler` API and implements the appropriate overriding methods, you can re-use the same elements with each other to create the parsing chain that you want.
There are two methods to set or get the next `TextHandler` in the chain:

`void setTextHandler(TextHandler * th)`
`TextHandler * getTextHandler()`
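As a rough illustration of how these two calls link handlers together, the sketch below models only the linking part of the API. `LinkOnlyHandler`, `describeChain` and `buildExampleChain` are hypothetical names, and the handler labels are placeholders rather than Lemur classes.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical node that models only the set/get linking described above.
class LinkOnlyHandler {
public:
  explicit LinkOnlyHandler(std::string name) : name_(std::move(name)) {}

  void setTextHandler(LinkOnlyHandler *th) { next_ = th; }
  LinkOnlyHandler *getTextHandler() { return next_; }
  const std::string &name() const { return name_; }

private:
  std::string name_;
  LinkOnlyHandler *next_ = nullptr;
};

// Walk the chain from its head and join the handler names with " -> ".
std::string describeChain(LinkOnlyHandler *head) {
  std::string out;
  for (LinkOnlyHandler *h = head; h != nullptr; h = h->getTextHandler()) {
    if (!out.empty()) out += " -> ";
    out += h->name();
  }
  return out;
}

// Link three placeholder handlers in order and report the resulting chain.
std::string buildExampleChain() {
  LinkOnlyHandler tokenizer("tokenizer");
  LinkOnlyHandler stemmer("stemmer");
  LinkOnlyHandler indexer("indexer");

  tokenizer.setTextHandler(&stemmer);
  stemmer.setTextHandler(&indexer);

  assert(indexer.getTextHandler() == nullptr);  // last handler has no successor
  return describeChain(&tokenizer);
}
```

Each handler only knows its immediate successor, so the order of the `setTextHandler` calls is what fixes the order in which tokens are processed.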

