Parsing documents | Advanced: Customizing a parser
Most of the parsers in Lemur are created using Flex. Flex is a tool that generates a scanner to tokenize the text; methods written in C++ then handle each token as the scanner finds and returns it. To customize a parser in Lemur, edit its lexer input file, which has a ".l" extension, not the generated .cpp file. These files are located in the lemur/utility/src directory.
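To make the division of labor concrete, here is a minimal sketch of a scanner that returns token codes and C++ code that handles them. It is a hand-written stand-in, not Flex output and not Lemur's code: the Scanner struct, the Counts struct, the parse function, and the numeric token values are all hypothetical names chosen for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical token codes for this sketch; a real parser defines its own.
enum Token { END_OF_INPUT = 0, B_DOC = 1, E_DOC = 2, WORD = 3, NUMBER = 4 };

// ASCII-only character tests, matching the classes [a-zA-Z] and [0-9].
static bool isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
static bool isDigit(char c) { return c >= '0' && c <= '9'; }

// Hand-written stand-in for a generated scanner (assumption: not Flex output).
struct Scanner {
    std::string text;     // input being scanned
    std::size_t pos = 0;  // number of characters consumed so far
    std::string lexeme;   // text of the most recently returned token

    // Return the next token code, or END_OF_INPUT when the text is exhausted.
    Token next() {
        while (pos < text.size()) {
            if (text.compare(pos, 5, "<DOC>") == 0) {
                lexeme = "<DOC>"; pos += 5; return B_DOC;
            }
            if (text.compare(pos, 6, "</DOC>") == 0) {
                lexeme = "</DOC>"; pos += 6; return E_DOC;
            }
            char c = text[pos];
            if (isLetter(c) || isDigit(c)) {
                // Consume the longest run of the same character class.
                bool letters = isLetter(c);
                std::size_t start = pos;
                while (pos < text.size() &&
                       (letters ? isLetter(text[pos]) : isDigit(text[pos]))) {
                    ++pos;
                }
                lexeme = text.substr(start, pos - start);
                return letters ? WORD : NUMBER;
            }
            ++pos;  // skip any character that no rule matches
        }
        return END_OF_INPUT;
    }
};

// What the handling code counted while walking the token stream.
struct Counts { int docs = 0; int words = 0; int numbers = 0; };

// Token-handling loop: ask the scanner for tokens and act on each code.
Counts parse(const std::string& text) {
    Scanner scanner;
    scanner.text = text;
    Counts counts;
    Token tok;
    while ((tok = scanner.next()) != END_OF_INPUT) {
        switch (tok) {
            case B_DOC:  ++counts.docs;    break;  // a document begins
            case E_DOC:                    break;  // a document ends
            case WORD:   ++counts.words;   break;  // handle a word
            case NUMBER: ++counts.numbers; break;  // handle a number
            default:                       break;
        }
    }
    return counts;
}
```

Calling parse on the string "<DOC> Lemur 4 has 2 parsers </DOC>" walks one document containing three words and two numbers. With Flex, the Scanner part is generated from the rules in the .l file, so only the rules and the handling code are written by hand.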
Many Lemur parsers are written using GNU Flex. To download Flex or read its full documentation, visit its website: Flex - a scanner generator
In the .l file, you specify rules, each consisting of a pattern to match and an action to take when the scanner finds text that matches that pattern. The patterns are written as regular expressions, the syntax used by Perl and some Unix commands. If you don't know how to use regular expressions, here is a good explanation: Regular Expression Basics

You can customize the Lemur parsers by editing the rules that say which patterns to match. For example, you can change what a parser recognizes as a document separator, or as a term. Here are some sample rules from the WebParser.l file, where the left column holds the patterns and the right column holds the actions:

    "(PATTERN)"     {ACTION;}
    "<DOC>"         {webloc += webleng; return B_DOC; }
    "</DOC>"        {webloc += webleng; return E_DOC; }
    [a-zA-Z0-9]+    {webloc += webleng; return WORD; }

You can think of the WebParser as a state machine that behaves appropriately when certain states are invoked, such as B_DOC (begin doc), E_DOC (end doc), or WORD (found a word). When states are returned, they are handled in a method called doParse. If you don't want to change that behavior, you don't have to worry about the method and can just focus on the Flex rules near the top of the file.

According to the Flex rules above, a word is a run of one or more uppercase letters, lowercase letters, or digits. Let's say that you don't want to include numbers as words. You would replace the above pattern with [a-zA-Z]+. And let's say you want your documents to be separated by [BEGIN] and [END] markers instead of <DOC> and </DOC>. Your new Flex rules would look like this:

    "\[BEGIN\]"     {webloc += webleng; return B_DOC; }
    "\[END\]"       {webloc += webleng; return E_DOC; }
    [a-zA-Z]+       {webloc += webleng; return WORD; }

Now let's say that you don't want to ignore numbers completely. You can add a new state to the parser by defining it as a unique number at the top of the file, above the Flex rules. For example,

    #define NUMBER 100

Now you'll need to add a rule to recognize the numbers:

    [0-9]+          {webloc += webleng; return NUMBER; }

(It is important that you include webloc += webleng; in the action. This statement keeps count of the number of characters that the scanner has scanned so far, and the value gets used elsewhere.)

Next you have to specify what to do with the new state in the doParse method. This method is defined near the end of the file. To handle a NUMBER, add a case NUMBER: segment inside the switch (tok) { statement that's already there. For example, inside doParse:

    while (tok = weblex()) {
      switch (tok) {
      case B_DOC:
        // handle begin doc
        state = DOC;
        ...
        break;
      case E_DOC:
        // handle end doc
        ...
        break;
      case WORD:
        // handle word
        ...
        break;
      case NUMBER:
        // handle number
        // whatever it is you want to do with this number
        ...
        break;
      }
    }

This same part of the file is also what you should modify if you want to change the behavior of the parser for already existing states, like WORD. In that case, you would edit the code inside the case WORD: segment.

After you make changes to the .l file, you will need to generate a new .cpp file using Flex and recompile Lemur. If you are using Lemur's Unix makefiles, this should happen automatically when you run make. If it doesn't, or if you're using Windows, run the Flex command before recompiling. From the Lemur root directory, the command would be