Document Prologs

While frequency checks provide one way to ignore common substrings, other techniques can also be useful. In particular, most of the problematic strings, such as addresses and funding-agency acknowledgements, appear at the start of a document. Also, when a PostScript file is converted to text, the first words of the output often come from the preamble of the PostScript file and indicate the tools used to generate the file; they have nothing to do with the actual text of the document.

One simple approach is to ignore the first part of a document. In Section 7 we show that ignoring the first 1000 characters of a document usefully reduces false positives (spurious matches caused by shared prolog text) without significantly affecting other matches. Moreover, it works well in tandem with the technique described in the previous subsection.
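
As a rough illustration of this idea, the following Python sketch drops a fixed-length prolog before any substrings are extracted. Only the 1000-character cutoff comes from the text; the function names, the substring length of 20, and the use of plain overlapping substrings are assumptions made for the example, not a description of the actual system.

```python
PROLOG_CHARS = 1000  # cutoff quoted in the text


def strip_prolog(text: str, skip: int = PROLOG_CHARS) -> str:
    """Return the document text with its first `skip` characters removed.

    Documents shorter than `skip` yield an empty string.
    """
    return text[skip:]


def candidate_substrings(text: str, length: int = 20,
                         skip: int = PROLOG_CHARS) -> list[str]:
    """List every overlapping substring of `length` characters that
    starts after the prolog (hypothetical helper; `length` is an
    illustrative value, not taken from the source)."""
    body = strip_prolog(text, skip)
    return [body[i:i + length] for i in range(len(body) - length + 1)]
```

Because the prolog is removed before substring extraction, addresses, acknowledgements, and PostScript preamble text near the top of the file never become candidates for matching. A fixed cutoff is crude: in a short document it may discard real content, and a long prolog may survive it.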

