
IdentifinderParser.hpp File Reference

#include "Parser.hpp"
#include "TextHandler.hpp"
#include "LinkedPropertyList.hpp"


Compounds

class  IdentifinderParser

Defines

#define BEGIN_PREFIX   "B_"
#define END_PREFIX   "E_"
#define PREFIX_LEN   2


Define Documentation

#define BEGIN_PREFIX   "B_"
 

Parses documents whose separation tags are similar to NIST's Web format: <DOC></DOC> around documents and <DOCNO></DOCNO> around docids. This parser recognizes named entity tags from the Identifinder tagger and passes them along as properties. For each tag X, it also adds b_X and e_X to the first and last token of each entity. For example, if "Carnegie Mellon University" was identified as a place, it would be parsed with the following properties:

Carnegie [b_place] [place]
Mellon [place]
University [e_place] [place]

A single-token entity, like Madonna, would be:

Madonna [b_person] [person] [e_person]

Does case folding for words that are not in the acronym list. Contraction suffixes and possessive suffixes are stripped.
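The begin/end property scheme described above can be illustrated with a minimal sketch. This is not the parser's implementation: TaggedToken and tagEntity are hypothetical names introduced only for this example. The prefixes are copied from the BEGIN_PREFIX and END_PREFIX defines on this page, so the names come out as B_place and E_place; the description writes them in lower case, and which casing the parser actually emits is not confirmed here. The order of properties within one token carries no meaning in this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Prefix values copied from the defines documented on this page.
#define BEGIN_PREFIX "B_"
#define END_PREFIX "E_"

// Hypothetical container: one token plus the property names attached to it.
struct TaggedToken {
  std::string text;
  std::vector<std::string> properties;
};

// Hypothetical helper: attach properties to the tokens of a single entity.
// Every token gets the tag itself; the first token also gets the begin
// property and the last token also gets the end property.
std::vector<TaggedToken> tagEntity(const std::vector<std::string>& tokens,
                                   const std::string& tag) {
  std::vector<TaggedToken> out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    TaggedToken t;
    t.text = tokens[i];
    if (i == 0) t.properties.push_back(BEGIN_PREFIX + tag);
    t.properties.push_back(tag);
    if (i + 1 == tokens.size()) t.properties.push_back(END_PREFIX + tag);
    out.push_back(t);
  }
  return out;
}
```

Calling tagEntity({"Madonna"}, "person") yields one token that carries all three properties, because a single token is both the first and the last token of its entity.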

U.S.A., USA's, and USAs are converted to USA. Does not recognize acronyms with numbers.
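The acronym conversions listed above could be approximated as follows. This is a sketch under stated assumptions, not the parser's code: normalizeAcronym and allUpperLetters are hypothetical helpers, and the "all uppercase letters" test stands in for the parser's acronym list, whose contents are not documented on this page. Tokens containing digits are returned unchanged, mirroring the note that acronyms with numbers are not recognized.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if s is non-empty and consists only of
// uppercase letters (digits and lowercase letters fail the test).
static bool allUpperLetters(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!std::isupper(c)) return false;
  }
  return true;
}

// Hypothetical helper: reduce an acronym variant to its bare form, or
// return the token unchanged if it does not look like an acronym.
std::string normalizeAcronym(const std::string& token) {
  std::string s = token;
  if (s.size() > 2 && s.compare(s.size() - 2, 2, "'s") == 0) {
    s.erase(s.size() - 2);  // possessive: USA's -> USA
  } else if (s.size() > 1 && s.back() == 's') {
    s.pop_back();           // plural: USAs -> USA
  }
  std::string stem;
  for (char c : s) {
    if (c != '.') stem += c;  // periods: U.S.A. -> USA
  }
  // Assumption: only an all-uppercase stem counts as an acronym here.
  return allUpperLetters(stem) ? stem : token;
}
```

Under these assumptions the three variants from the text all reduce to USA, while an ordinary word such as "cats" and a token with a digit such as "B2B" pass through unchanged.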

#define END_PREFIX   "E_"
 

#define PREFIX_LEN   2
 


Generated on Wed Nov 3 12:59:12 2004 for Lemur Toolkit by doxygen 1.2.18