Which class you choose to sub-class should depend on whether you plan to use the Lemur applications. Some applications, such as BuildInvertedIndex, rely on methods from the Parser API. The Parser API has more methods to implement, but it's probably worth the extra work to be able to use the Lemur applications. In either case, reading "Understanding the TextHandler" first will probably make the rest of this easier to follow.
Sub-classing TextHandler:
Sub-classing your new parser class from TextHandler will enable you to use it with other Lemur parsing elements, but not with existing Lemur applications that rely on Parser classes. You can chain another TextHandler to it just as you would any other TextHandler by calling your parser's setTextHandler method. Your parser would pass tokens down the chain by calling the foundToken method of the TextHandler you've just chained.
For example, let's write a sample class called SimpleFileTH that reads from a file, uses the file's name as the document ID, and recognizes words as anything delimited by any kind of white space. Your header file would look something like this:
#include "TextHandler.hpp"
class SimpleFileTH : public TextHandler {
public:
// default constructor that sets a default maximum world length value
SimpleFileTH() { maxlength=70; }
SimpleFileTH(int max) { maxlength=max; }
~SimpleFileTH() {}
// process this file as one document
// returns whether we were successful
bool processDoc(char* filename);
protected:
int maxlength; // the maximum length we'll allow as a word
}
Now to implement processDoc:
#include <fstream>
#include <iomanip>
#include "SimpleFileTH.hpp"
using namespace std;

bool SimpleFileTH::processDoc(char* filename){
  // if there's nothing chained, there's no need to do anything
  // textHandler is an inherited member for the next TH in chain
  // chain by calling SimpleFileTH::setTextHandler(TextHandler*)
  if (!textHandler)
    return false;
  // open the file
  ifstream readstream(filename);
  if (!readstream.is_open())
    return false;
  char* word = new char[maxlength + 1]; // room for the terminating null
  // start the document
  textHandler->foundToken(BEGINDOC, filename, filename);
  // tokenize terms, reading at most maxlength characters per word
  while (readstream >> setw(maxlength + 1) >> word)
    textHandler->foundToken(WORDSTR, word, word);
  // end the document
  readstream.close();
  textHandler->foundToken(ENDDOC);
  delete[] word;
  return true;
}
Now that's an extremely simple parser that doesn't even remove punctuation marks, but by chaining existing Lemur TextHandler classes to it, you can send your tokenized terms through a stopper and a stemmer and push them into an index.
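For example, here is a minimal driver sketch that chains a stopword remover and a stemmer behind SimpleFileTH. The Stopper and PorterStemmer class names, headers, and constructor arguments here are assumptions about your Lemur version; substitute whatever your installation actually provides.

#include "SimpleFileTH.hpp"
#include "Stopper.hpp"        // assumed Lemur stopword TextHandler
#include "PorterStemmer.hpp"  // assumed Lemur stemming TextHandler

int main(int argc, char* argv[]) {
  SimpleFileTH parser;
  Stopper stopper("stoplist.dat"); // assumed constructor taking a stopword file
  PorterStemmer stemmer;
  // build the chain: parser -> stopper -> stemmer
  parser.setTextHandler(&stopper);
  stopper.setTextHandler(&stemmer);
  // treat each command-line argument as one document
  for (int i = 1; i < argc; i++)
    parser.processDoc(argv[i]);
  return 0;
}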
Sub-classing Parser:
You sub-class Parser and use your custom class exactly the same way you would a custom TextHandler, as described above. However, you have to implement a few additional methods from the Parser API. These methods report where in the file the current document started and where you currently are in the file, in bytes. While it's a bit more work, sub-classing from Parser also gets you the use of an acronym list from the base class.
Let's see how we would create a class called SimpleFileParser that behaves like SimpleFileTH, using the filename as the document ID and tokenizing words on white space. First there are the parse methods, parseFile and parseBuffer, which do the actual processing of documents. fileTell and getDocBytePos keep track of file position. getDocBytePos is already implemented in the base class to return a member variable called docpos, so all we have to do is keep docpos up to date. We can employ the same tactic for fileTell, having it return a variable that we keep up to date. Our header file for SimpleFileParser would look like this:
#include "Parser.hpp"
class SimpleFileParser : public Parser {
public:
// default constructor that sets a default maximum world length value
SimpleFileParser() { maxlength=70; }
SimpleFileParser(int max) { maxlength=max; }
~SimpleFileParser() {}
// process this file as one document
void parseFile(char* filename);
// the API requires this, but we actually won't support it
void parseBuffer(char* buf, int len) {}
// return the current file position
long fileTell() { return filepos; }
protected:
int maxlength; // the maximum length we'll allow as a word
long filepos; // byte position of where we currently are in the file
}
Now we just need to implement parseFile, making sure to keep filepos and docpos updated. It will be very similar to our SimpleFileTH::processDoc method; we'll highlight the main differences.
#include <fstream>
#include <iomanip>
#include "SimpleFileParser.hpp"
using namespace std;

void SimpleFileParser::parseFile(char* filename){
  // if there's nothing chained, there's no need to do anything
  // textHandler is an inherited member for the next TH in chain
  // chain by calling SimpleFileParser::setTextHandler(TextHandler*)
  if (!textHandler)
    return;
  // the whole file is one document, so it begins at byte 0
  docpos = 0;
  filepos = 0;
  // open the file
  ifstream readstream(filename);
  if (!readstream.is_open())
    return;
  char* word = new char[maxlength + 1]; // room for the terminating null
  // start the document
  textHandler->foundToken(BEGINDOC, filename, filename);
  // tokenize terms, reading at most maxlength characters per word
  while (readstream >> setw(maxlength + 1) >> word) {
    filepos = readstream.tellg(); // current byte position after this token
    textHandler->foundToken(WORDSTR, word, word);
  }
  // end the document
  readstream.close();
  textHandler->foundToken(ENDDOC);
  delete[] word;
}
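To see the byte-position bookkeeping in action, here is a small driver sketch. It reuses the assumed Stopper class from the earlier example as a sink for the tokens; getDocBytePos comes from the Parser base class, as noted above.

#include <iostream>
#include "SimpleFileParser.hpp"
#include "Stopper.hpp" // assumed Lemur stopword TextHandler, as in the earlier sketch

int main(int argc, char* argv[]) {
  SimpleFileParser parser;
  Stopper stopper("stoplist.dat"); // assumed constructor taking a stopword file
  parser.setTextHandler(&stopper);
  for (int i = 1; i < argc; i++) {
    parser.parseFile(argv[i]);
    // getDocBytePos() reports where the document began (always 0 here,
    // since each file is one document); fileTell() reports where the
    // last token ended
    std::cout << argv[i] << ": document at byte " << parser.getDocBytePos()
              << ", parsing ended at byte " << parser.fileTell() << std::endl;
  }
  return 0;
}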