Contents
1. TextHandler
The TextHandler class was originally designed to facilitate passing document tokenization information along a chain of components from a tokenizer/parser (classes inheriting from the Parser class) through components like stopword lists and stemmers to an indexer, a query builder, or a tool that writes the parsed documents to file. While it has served these goals well for many of the components in use in Lemur, it was not an adequate API for all tasks. In particular, we sometimes wished to have the original token produced by the parser, unmodified by stopword list removal or stemming. Also, we wished to pass on more document information, such as structure found in XML, a part of speech, or a named entity tag.
To facilitate these goals, we have modified the API of the TextHandler class. Information along the TextHandler chain is now passed using the foundToken function. The foundToken function takes the place of the foundWord and foundDoc functions. It takes two extra arguments: one for indicating the token type and another for passing on the unmodified token from the source of the TextHandler chain. Here's its prototype:
void foundToken(TextHandler::TokenType type, char * token, char * orig, PropertyList * properties);
In addition, we would like to add another method to the TextHandler API to enable getting the next TextHandler in the chain, if there is one:
virtual TextHandler * getTextHandler();
Because of the TextHandler base class, all of these changes are backwards compatible and do not require changing any of the current TextHandler implementations.
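As a rough illustration of how a chain built on foundToken might behave, here is a minimal sketch. The classes below are stand-ins written for this example, not the Lemur headers: the default forwarding behavior, the setTextHandler method, and the Lowercaser, Recorder, and chainLength names are all assumptions. The point is only that a handler can rewrite token while passing orig through untouched, and that getTextHandler lets a caller walk the chain.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

namespace sketch {

struct PropertyList;  // opaque in this sketch; stands in for the real PropertyList

// Stand-in for the TextHandler base class described above (not Lemur's header).
class TextHandler {
public:
  enum TokenType { BEGINDOC, ENDDOC, WORD, BEGINTAG, ENDTAG };

  virtual ~TextHandler() {}

  // Assumed default: pass the token unchanged to the next handler, if any.
  virtual void foundToken(TokenType type, char* token, char* orig,
                          PropertyList* properties) {
    if (next_ != nullptr) next_->foundToken(type, token, orig, properties);
  }

  // setTextHandler is an assumption; the text above only names getTextHandler.
  virtual void setTextHandler(TextHandler* next) { next_ = next; }
  virtual TextHandler* getTextHandler() { return next_; }

private:
  TextHandler* next_ = nullptr;
};

// Hypothetical handler: lowercases WORD tokens but leaves orig untouched.
class Lowercaser : public TextHandler {
public:
  void foundToken(TokenType type, char* token, char* orig,
                  PropertyList* properties) override {
    assert(token != nullptr);
    std::string lowered(token);
    if (type == WORD) {
      for (char& c : lowered) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
      }
    }
    // The buffer behind &lowered[0] is valid only for the duration of this call.
    TextHandler::foundToken(type, &lowered[0], orig, properties);
  }
};

// Hypothetical end of chain: records each WORD token with its original form.
class Recorder : public TextHandler {
public:
  void foundToken(TokenType type, char* token, char* orig,
                  PropertyList* properties) override {
    if (type == WORD) {
      tokens.push_back(token);
      originals.push_back(orig);
    }
    TextHandler::foundToken(type, token, orig, properties);
  }

  std::vector<std::string> tokens;
  std::vector<std::string> originals;
};

// Counts the handlers reachable from head by following getTextHandler.
inline int chainLength(TextHandler* head) {
  int n = 0;
  for (TextHandler* h = head; h != nullptr; h = h->getTextHandler()) ++n;
  return n;
}

}  // namespace sketch
```

With a Lowercaser chained to a Recorder, a WORD token "Lemur" would reach the Recorder as token "lemur" with orig still "Lemur".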
2. TextHandler::TokenType
TokenType is an enumeration including words, tags, and document boundary markers. You may add to this list of types for your own tools. For example, you may wish to use a parser that identifies sentence boundaries. An appropriate way to pass this information along the TextHandler chain would be to add types for the beginning and end of a sentence. Here's a list of the current types:
- WORD
Calling foundToken with TextHandler::WORD as the token type is equivalent to the foundWord call of the old TextHandler class.
- BEGINDOC
The BEGINDOC type is reserved for signaling the beginning of a document. The token and orig arguments to foundToken should contain the document number. This call is equivalent to the old foundDoc function.
- ENDDOC
The ENDDOC type is used to signal the end of a document. This is a new call; there was no equivalent in the previous TextHandler class. Classes using the TextHandler class will now expect this call; make sure your parsers produce it. Prior to the inclusion of this call, some classes that needed to know the end of document boundaries used an inelegant hack.
- BEGINTAG
This type has been added for support of XML. It could also be used for HTML or SGML parsers or, more generally, to represent hierarchical structure boundaries.
The token and orig arguments should contain only the type of the tag. If the tag is <h3 align="center">, then token and orig should contain "h3". The properties argument to the foundToken call should contain the alignment information.
- ENDTAG
This type has also been added for support of XML. The token argument should contain just the type of the tag (e.g. "h3").
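To make the token types concrete, the sketch below shows the sequence of foundToken calls a parser might produce for a document "doc1" containing <h3 align="center">Title</h3>. Everything here is a simplified stand-in written for this example: TraceHandler and emitExample are hypothetical, the PropertyList is reduced to a string map, and passing the document number with ENDDOC is an assumption, since the text above does not say what that call carries.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

namespace tokentypes_sketch {

enum TokenType { BEGINDOC, ENDDOC, WORD, BEGINTAG, ENDTAG };

// Simplified stand-in: the real list holds Property objects, not strings.
typedef std::map<std::string, std::string> PropertyList;

// Hypothetical handler that logs one line per token it receives.
class TraceHandler {
public:
  void foundToken(TokenType type, char* token, char* orig,
                  PropertyList* properties) {
    assert(token != nullptr && orig != nullptr);
    switch (type) {
      case BEGINDOC:
        log.push_back(std::string("BEGINDOC ") + token);
        break;
      case ENDDOC:
        log.push_back("ENDDOC");
        break;
      case WORD:
        log.push_back(std::string("WORD ") + token);
        break;
      case BEGINTAG: {
        std::string entry = std::string("BEGINTAG ") + token;
        if (properties != nullptr) {
          // Tag attributes travel in the property list, not in the token.
          for (PropertyList::const_iterator it = properties->begin();
               it != properties->end(); ++it) {
            entry += " " + it->first + "=" + it->second;
          }
        }
        log.push_back(entry);
        break;
      }
      case ENDTAG:
        log.push_back(std::string("ENDTAG ") + token);
        break;
    }
  }

  std::vector<std::string> log;
};

// Emits the calls for: document "doc1" containing <h3 align="center">Title</h3>
inline void emitExample(TraceHandler& handler) {
  char docno[] = "doc1";
  char tag[] = "h3";
  char word[] = "Title";
  PropertyList attrs;
  attrs["align"] = "center";

  handler.foundToken(BEGINDOC, docno, docno, nullptr);
  handler.foundToken(BEGINTAG, tag, tag, &attrs);
  handler.foundToken(WORD, word, word, nullptr);
  handler.foundToken(ENDTAG, tag, tag, nullptr);
  // Assumption: ENDDOC repeats the document number.
  handler.foundToken(ENDDOC, docno, docno, nullptr);
}

}  // namespace tokentypes_sketch
```

A tool that needs sentence boundaries could extend the enumeration in the same way and add two more cases to the switch.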
3. PropertyList
A PropertyList is a container for properties of tokens. Example properties may be the byte offset of the token in the file, attributes associated with a tag, document properties, and so on. Items in the property list are (name, value) pairs.
Property and PropertyList objects have some unconventional behaviors. While unconventional, they are very simple. They are designed to facilitate object reuse and fast processing in the expected use of these objects (processing and tokenizing documents). This behavior is similar to that of InvPushIndexer, which does not assume that the document properties object is stable from call to call.
A PropertyList object is owned by its creator. That is, you should not assume that the properties in it will be the same in subsequent calls to TextHandler::foundToken. The creator is also responsible for freeing the memory associated with the list.
The PropertyList interface has the following function prototypes:
Property * getProperty(char * name);
This call supports looking up a property by name. NULL is returned if there is no property in the list of that name.
These calls support iteration over all properties in the list:
void startIteration();
Resets the iterator location to the beginning of the list.
Property * nextElement();
Returns the next Property in the list. The Property returned is owned by the PropertyList. That is, you must not free the memory, and you should not expect that the Property object will remain the same after making additional calls to the PropertyList or during subsequent calls to foundToken. If you need a persistent copy of the Property, create a copy using the Property(Property * propertyToCopy) constructor.
bool hasNextElement();
These calls support addition/removal of properties to/from the list:
void addProperty(Property * property);
Inserts a copy of the property into the list. If a property with the same name already exists in the list, it is overwritten. The PropertyList does not assume that the property passed in will remain valid beyond the extent of the call, so it creates a copy. If you want to change the value of a Property in the list, you must call the addProperty function again.
void removeProperty(char * name);
Removes the property with the given name, if it exists. If the property does not exist in the list, this function fails silently.
void clear();
Removes all properties from the list.
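The copy and overwrite rules above can be sketched with a small stand-in container. This is not the Lemur implementation: the Property here is reduced to a pair of strings, the storage is a std::map chosen for brevity, and the name parameters are const char * rather than char * so that string literals can be passed. It is meant only to show that addProperty copies its argument, that re-adding a name overwrites the stored value, and that removeProperty on a missing name does nothing.

```cpp
#include <cassert>
#include <map>
#include <string>

namespace proplist_sketch {

// Simplified stand-in: the Property described in the text carries typed data.
struct Property {
  std::string name;
  std::string value;
};

class PropertyList {
public:
  PropertyList() : pos_(items_.end()) {}

  // Returns NULL when no property has the given name.
  Property* getProperty(const char* name) {
    std::map<std::string, Property>::iterator it = items_.find(name);
    return it == items_.end() ? nullptr : &it->second;
  }

  void startIteration() { pos_ = items_.begin(); }
  bool hasNextElement() { return pos_ != items_.end(); }

  // The returned Property stays owned by the list; callers must not free it.
  Property* nextElement() {
    if (pos_ == items_.end()) return nullptr;
    return &(pos_++)->second;
  }

  // Stores a copy, so the caller's object may change or go away afterwards.
  // An existing property with the same name is overwritten.
  void addProperty(Property* property) {
    assert(property != nullptr);
    items_[property->name] = *property;
  }

  // Fails silently when the name is not present.
  void removeProperty(const char* name) {
    std::map<std::string, Property>::iterator it = items_.find(name);
    if (it == items_.end()) return;
    if (it == pos_) ++pos_;  // keep the iteration cursor valid
    items_.erase(it);
  }

  void clear() {
    items_.clear();
    pos_ = items_.end();
  }

private:
  std::map<std::string, Property> items_;
  std::map<std::string, Property>::iterator pos_;
};

}  // namespace proplist_sketch
```

In this sketch, changing the caller's Property after addProperty has no effect on the list until addProperty is called again, which mirrors the rule stated above.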
4. Property
Note that the name and the value returned in calls to a Property object are owned by the Property object. That is, you should not free the pointers returned from getName or getValue.
Property(Property * propertyToCopy);
Creates a property, copying the name and value from propertyToCopy.
void setName(char * name);
Sets the name of the property. A Property object keeps its own internal copy of the name and does not assume that the text at the pointer passed as the name argument is stable.
void setValue(void * value);
Sets the value of the property. A Property object keeps its own internal copy of the value and does not assume that the data at the pointer passed as the value argument are stable.
char * getName();
Returns the name of the property.
void * getValue();
Returns a reference to the value of the property.
Property::DataType getType();
Returns the type of the value. Initial property types: INT (32 bit) and STRING (char *, not assumed to be NULL terminated).
int getSize();
Returns the size of the value in 8-bit bytes. For example, if the value is of type INT, getSize would return 4.
int getLength();
Returns the number of items in the value. For example, if the value is of type INT, getLength would return 1. If the value is of type STRING, getLength would return the number of characters in the array.
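The relationship between getType, getSize, and getLength can be illustrated with a small stand-in class. This is a sketch under assumptions, not the Lemur Property: the text above does not show how setValue(void *) learns the type and length of its argument, so the typed setters setIntValue and setStringValue below are hypothetical, and the internal byte buffer is simply one way to hold an owned copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

namespace property_sketch {

class Property {
public:
  enum DataType { INT, STRING };

  Property() : type_(INT), length_(0) {}

  // Copies name and value from another property, giving a persistent copy.
  explicit Property(const Property* other)
      : name_(other->name_), data_(other->data_),
        type_(other->type_), length_(other->length_) {}

  // The name is copied, so the caller's buffer need not stay valid.
  void setName(const char* name) {
    assert(name != nullptr);
    name_ = name;
  }

  // Hypothetical typed setter: stores a copy of one 32-bit integer.
  void setIntValue(std::int32_t v) {
    data_.resize(sizeof v);
    std::memcpy(data_.data(), &v, sizeof v);
    type_ = INT;
    length_ = 1;
  }

  // Hypothetical typed setter: copies count characters; no terminator needed.
  void setStringValue(const char* chars, int count) {
    assert(chars != nullptr && count >= 0);
    data_.assign(chars, chars + count);
    type_ = STRING;
    length_ = count;
  }

  // Both pointers stay owned by the Property; callers must not free them.
  const char* getName() const { return name_.c_str(); }
  const void* getValue() const { return data_.data(); }

  DataType getType() const { return type_; }
  int getSize() const { return static_cast<int>(data_.size()); }  // bytes
  int getLength() const { return length_; }                       // items

private:
  std::string name_;
  std::vector<char> data_;
  DataType type_;
  int length_;
};

}  // namespace property_sketch
```

Because STRING values are not assumed to be NULL terminated, a reader of such a value should use getLength to bound the read rather than scanning for a terminator.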
The Lemur Project Last modified: Mon Mar 17 15:32:42 EST 2003