Parses documents in NIST's Web TREC format. Words that are not in the acronym list are case folded. Contraction suffixes and possessive suffixes are stripped. U.S.A., USA's, and USAs are all converted to USA. Acronyms that contain numbers are not recognized. The DOCHDR section, text inside <script> tags, and text inside HTML comments are ignored.
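
The normalization rules above can be illustrated with a small, self-contained Python sketch. This is not the parser's own implementation, only one plausible reading of the description. The names `ACRONYMS`, `normalize_token`, and `strip_ignored_markup` are hypothetical, the acronym list and the set of apostrophe suffixes are assumptions, and DOCHDR handling is not covered.

```python
import re

# Hypothetical acronym list; the real parser is configured with its own.
ACRONYMS = {"USA", "NIST"}

# Apostrophe suffixes treated as contractions or possessives (an assumed set).
_SUFFIX = re.compile(r"'(s|d|m|t|ll|re|ve)?$", re.IGNORECASE)

# Regions whose text is ignored: <script> elements and HTML comments.
_SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_ignored_markup(text):
    """Replace <script> elements and HTML comments with a single space."""
    text = _SCRIPT.sub(" ", text)
    return _COMMENT.sub(" ", text)


def normalize_token(token, acronyms=ACRONYMS):
    """Apply the described normalization rules to a single token."""
    # Strip a trailing contraction or possessive suffix: USA's -> USA.
    stem = _SUFFIX.sub("", token)
    # Remove periods so dotted forms collapse: U.S.A. -> USA.
    candidate = stem.replace(".", "")
    # Acronyms containing digits are not recognized, so require letters only.
    if candidate.isalpha():
        if candidate in acronyms:
            return candidate
        # Plural form of a known acronym: USAs -> USA.
        if candidate.endswith("s") and candidate[:-1] in acronyms:
            return candidate[:-1]
    # Everything else is case folded.
    return stem.lower()
```

Under these assumptions, `normalize_token` maps `U.S.A.`, `USA's`, and `USAs` to `USA`, while a word such as `Parser's` becomes `parser`. A token like `MP3` is case folded even if it appears in the acronym list, matching the statement that acronyms with numbers are not recognized.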