Parsing documents
Step 2: Choosing the right parser

You'll probably want to use Lemur's WebParser. This parser handles regular text documents that may, but need not, contain HTML. It removes HTML tags, any text inside SCRIPT tags, and HTML comments. It also removes contractions and possessives and converts all terms to lower case. Document boundaries are marked in NIST-style format:

<DOC>
<DOCNO> document_ID </DOCNO>
Document text here could be in HTML.
</DOC>
The document tags separate one document from the next, so a single file can hold multiple documents. They also specify a document ID, which the index uses to identify each document, so document IDs should be unique. Most of the parsers in Lemur require similar, if not identical, document separator tags.
Even if your documents are not HTML, you can still use the WebParser, since it does not rely on HTML being present. It ignores most HTML tags and can just as easily "ignore" tags you don't have.
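To make the clean-up described above more concrete, here is a rough Python approximation of it. This is only a sketch and not Lemur's actual code: the function name rough_web_clean is hypothetical, the regular expressions are far simpler than a real HTML tokenizer, and the possessive rule (dropping a trailing 's) is an assumption about what "removes possessives" means. Contractions are not handled here at all. It is meant for the document body, since it would strip the DOC and DOCNO tags too.

```python
import re

def rough_web_clean(text):
    """Approximate the WebParser clean-up described in the text (illustration only)."""
    # Drop SCRIPT blocks and HTML comments first so their contents disappear too.
    text = re.sub(r"(?is)<script\b.*?</script\s*>", " ", text)
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Drop any remaining tags but keep the text between them.
    text = re.sub(r"<[^>]+>", " ", text)
    # Assumption: "removing possessives" means dropping a trailing 's.
    text = re.sub(r"'s\b", "", text)
    # Lower-case everything and split on whitespace.
    return text.lower().split()
```

Plain text passes through the same function unchanged apart from lower-casing, which mirrors the point that the parser does not depend on HTML being present.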

Adding these tags around your documents should be trivial. Here's an example Perl script that wraps each given document in the tags, using the file name as the document ID, and writes them all into one file called "collection.dat".

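use strict;
use warnings;

# Wrap each file named on the command line in <DOC> tags and write
# them all to collection.dat, using the file name as the document ID.
open(my $out, '>', 'collection.dat') or die "Cannot write collection.dat: $!";
foreach my $file (@ARGV) {
  open(my $in, '<', $file) or die "Cannot read $file: $!";
  print $out "<DOC>\n";
  print $out "<DOCNO> $file </DOCNO>\n";
  while (my $line = <$in>) {
    print $out $line;
  }
  # Leading newline keeps </DOC> on its own line even if the file
  # does not end with one.
  print $out "\n</DOC>\n";
  close($in);
}
close($out) or die "Cannot close collection.dat: $!";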
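
Since the index identifies documents by their ID, it can be worth checking a finished collection file for repeated DOCNO values before indexing. The following Python sketch does that; the function name duplicate_docnos is hypothetical, and it assumes the tags are written in upper case exactly as in the format shown above.

```python
import re
from collections import Counter

# Matches the ID between <DOCNO> and </DOCNO>, ignoring surrounding whitespace.
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.S)

def duplicate_docnos(collection_text):
    """Return a sorted list of document IDs that occur more than once."""
    counts = Counter(DOCNO_RE.findall(collection_text))
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)
```

To use it on a real file, read "collection.dat" into a string (for example with open("collection.dat", encoding="latin-1").read()) and pass that string to the function; an empty list means every ID is unique.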
Other parsers in Lemur for tokenizing documents include one that handles part-of-speech tags from Brill's tagger and one that handles named entities from the Identifinder tagger. See the thread "Indexing with term properties" if you want to use those.

Lemur also has the following parsers:

TrecParser
The TrecParser provides a simple but effective parser for NIST's TREC document format. It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. It ignores other fields.
ReutersParser
The ReutersParser extracts the TEXT, HEADLINE, and TITLE fields and removes other tags.
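
As an illustration of what "recognizing" only certain fields means, the sketch below pulls the text out of the six fields the TrecParser is described as reading and skips everything else. It is an approximation in Python, not Lemur's parser: the name recognized_text is hypothetical, tags are assumed to be upper case, and nested fields are not handled.

```python
import re

# The fields the TrecParser is described as recognizing.
FIELDS = ("TEXT", "HL", "HEAD", "HEADLINE", "TTL", "LP")
# \1 forces the closing tag to match the opening tag.
FIELD_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(FIELDS), re.S)

def recognized_text(doc):
    """Join the contents of the recognized fields; other fields are skipped."""
    return " ".join(m.group(2).strip() for m in FIELD_RE.finditer(doc))
```

Given a document with HEADLINE, DATE, and TEXT fields, only the HEADLINE and TEXT contents would come back; the DATE and DOCNO contents are dropped.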
Lemur also supports text documents in Arabic and Chinese:

ArabicParser
The ArabicParser provides a simple but effective parser for NIST's TREC document format for Arabic documents encoded in Windows code page 1256 (CP1256). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
ChineseParser
The ChineseParser provides a simple but effective parser for NIST's TREC document format for Chinese documents in GB encoding (GB2312). It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. This parser is suitable for parsing segmented (tokenized) documents.
ChineseCharParser
Like the ChineseParser, the ChineseCharParser is for documents in GB encoding (GB2312) and recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. However, it is meant for unsegmented documents and produces one token per Chinese character.
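
Because these parsers expect specific encodings, documents stored in another encoding would need converting first. The Python sketch below shows one way this might be done, along with a small illustration of "one token per Chinese character". The function names are hypothetical, the source encoding is assumed to be UTF-8, and restricting tokens to the basic CJK Unicode range is an assumption made for the illustration, not a statement of how the ChineseCharParser treats other characters.

```python
def reencode_bytes(data, target, source="utf-8"):
    """Convert raw bytes from the source encoding to the target encoding.

    Raises UnicodeEncodeError if a character has no representation in the
    target encoding (for example, a character outside GB2312).
    """
    return data.decode(source).encode(target)

def char_tokens(data, encoding="gb2312"):
    """Split encoded text into one token per Chinese character.

    Assumption: only characters in the basic CJK range count as tokens here.
    """
    text = data.decode(encoding)
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
```

For example, reencode_bytes(raw, "gb2312") would prepare UTF-8 Chinese text for the Chinese parsers, and reencode_bytes(raw, "cp1256") would do the same for Arabic text and the ArabicParser.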


