Which class you choose to sub-class should depend on whether you plan to use the Lemur applications. Some applications, such as BuildInvertedIndex, rely on methods from the Parser API. The Parser API has more methods to implement, but it's probably worth the extra work to be able to use the Lemur applications. In either case, reading "Understanding the TextHandler" first will probably make the rest of this easier to follow.
Sub-classing TextHandler:
Sub-classing your new parser class from TextHandler will enable you to use it with other Lemur parsing elements, but not with existing Lemur applications that rely on Parser classes. You can chain another TextHandler to it just as you would any other TextHandler by calling your parser's setTextHandler method. Your parser would pass tokens down the chain by calling the foundToken method of the TextHandler you've just chained.
For example, let's write a sample class called SimpleFileTH that reads from a file, uses the file's name as the document ID, and recognizes words as anything delimited by any kind of white space. Your header file would look something like this:
#include "TextHandler.hpp"
class SimpleFileTH : public TextHandler {
public:
// default constructor that sets a default maximum world length value
SimpleFileTH() { maxlength=70; }
SimpleFileTH(int max) { maxlength=max; }
~SimpleFileTH() {}
// process this file as one document
// returns whether we were successful
bool processDoc(char* filename);
protected:
int maxlength; // the maximum length we'll allow as a word
}
Now to implement processDoc:
#include <fstream>
#include <iomanip>
#include "SimpleFileTH.hpp"
using namespace std;

bool SimpleFileTH::processDoc(char* filename){
  // if there's nothing chained, there's no need to do anything
  // textHandler is an inherited member for the next TH in chain
  // chain by calling SimpleFileTH::setTextHandler(TextHandler*)
  if (!textHandler)
    return false;
  // open the file
  ifstream readstream(filename);
  if (!readstream.is_open())
    return false;
  char* word = new char[maxlength + 1]; // room for the terminating null
  // start the document
  textHandler->foundToken(BEGINDOC, filename, filename);
  // tokenize terms, reading at most maxlength characters per word
  while (readstream >> setw(maxlength + 1) >> word)
    textHandler->foundToken(WORDSTR, word, word);
  // end the document
  readstream.close();
  textHandler->foundToken(ENDDOC);
  delete[] word;
  return true;
}
Now that's an extremely simple parser that doesn't even remove punctuation marks, but by chaining existing Lemur TextHandler classes to it, you can send your tokenized terms through a stopper and a stemmer and push them into an index.
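For example, here is a minimal driver sketch that chains a stopword remover and a stemmer behind SimpleFileTH. The Stopper and PorterStemmer class names, headers, and constructor arguments here are assumptions about your Lemur version; substitute whatever your installation actually provides.

#include "SimpleFileTH.hpp"
#include "Stopper.hpp"        // assumed Lemur stopword TextHandler
#include "PorterStemmer.hpp"  // assumed Lemur stemming TextHandler

int main(int argc, char* argv[]) {
  SimpleFileTH parser;
  Stopper stopper("stoplist.dat"); // assumed constructor taking a stopword file
  PorterStemmer stemmer;
  // build the chain: parser -> stopper -> stemmer
  parser.setTextHandler(&stopper);
  stopper.setTextHandler(&stemmer);
  // treat each command-line argument as one document
  for (int i = 1; i < argc; i++)
    parser.processDoc(argv[i]);
  return 0;
}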
Sub-classing Parser:
You sub-class Parser and use your custom class exactly the same way you would a custom TextHandler, as described above. However, you have to implement a few additional methods from the Parser API. These methods report where in the file the current document started and where you currently are in the file, in bytes. While it's a bit more work, sub-classing from Parser also gets you the use of an acronym list from the base class.
Let's see how we would create a class called SimpleFileParser that behaves like SimpleFileTH, using the filename as the document ID and tokenizing words on white space. First there are the parse methods, parseFile and parseBuffer, which do the actual processing of documents. fileTell and getDocBytePos keep track of file position. getDocBytePos is already implemented in the base class to return a member variable called docpos, so all we have to do is keep docpos up to date. We can employ the same tactic for fileTell, having it return a variable that we keep up to date. Our header file for SimpleFileParser would look like this:
#include "Parser.hpp"
class SimpleFileParser : public Parser {
public:
// default constructor that sets a default maximum world length value
SimpleFileParser() { maxlength=70; }
SimpleFileParser(int max) { maxlength=max; }
~SimpleFileParser() {}
// process this file as one document
void parseFile(char* filename);
// the API requires this, but we actually won't support it
void parseBuffer(char* buf, int len) {}
// return the current file position
long fileTell() { return filepos; }
protected:
int maxlength; // the maximum length we'll allow as a word
long filepos; // byte position of where we currently are in the file
}
Now we just need to implement parseFile, making sure to keep filepos and docpos updated. It will be very similar to our SimpleFileTH::processDoc method; we'll highlight the main differences.
#include <fstream>
#include <iomanip>
#include "SimpleFileParser.hpp"
using namespace std;

void SimpleFileParser::parseFile(char* filename){
  // if there's nothing chained, there's no need to do anything
  // textHandler is an inherited member for the next TH in chain
  // chain by calling SimpleFileParser::setTextHandler(TextHandler*)
  if (!textHandler)
    return;
  // the whole file is one document, so it begins at byte 0
  docpos = 0;
  filepos = 0;
  // open the file
  ifstream readstream(filename);
  if (!readstream.is_open())
    return;
  char* word = new char[maxlength + 1]; // room for the terminating null
  // start the document
  textHandler->foundToken(BEGINDOC, filename, filename);
  // tokenize terms, reading at most maxlength characters per word
  while (readstream >> setw(maxlength + 1) >> word) {
    filepos = readstream.tellg(); // current byte position after this token
    textHandler->foundToken(WORDSTR, word, word);
  }
  // end the document
  readstream.close();
  textHandler->foundToken(ENDDOC);
  delete[] word;
}
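To see the byte-position bookkeeping in action, here is a small driver sketch. It reuses the assumed Stopper class from the earlier example as a sink for the tokens; getDocBytePos comes from the Parser base class, as noted above.

#include <iostream>
#include "SimpleFileParser.hpp"
#include "Stopper.hpp" // assumed Lemur stopword TextHandler, as in the earlier sketch

int main(int argc, char* argv[]) {
  SimpleFileParser parser;
  Stopper stopper("stoplist.dat"); // assumed constructor taking a stopword file
  parser.setTextHandler(&stopper);
  for (int i = 1; i < argc; i++) {
    parser.parseFile(argv[i]);
    // getDocBytePos() reports where the document began (always 0 here,
    // since each file is one document); fileTell() reports where the
    // last token ended
    std::cout << argv[i] << ": document at byte " << parser.getDocBytePos()
              << ", parsing ended at byte " << parser.fileTell() << std::endl;
  }
  return 0;
}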