Document Noise


To generate fingerprints and perform document matching, we must first obtain text versions of documents. Unfortunately, this process is unreliable and introduces many errors. One of the main formats we wish to support is Postscript. Postscript interpreters can be adapted to produce text, but they are typically slow and often produce poor results. Alternatively, for Postscript output by a specific tool, we can exploit the format of the Postscript that tool generates to recover the text quickly and fairly accurately (this is the case, for example, with Postscript generated by TeX/dvips). The problem is that these formats change as the tools evolve, and we need a different program for each Postscript-generating tool.

For Postscript conversion, the main errors introduced involve punctuation, non-alphabetic characters and spacing. In particular, word boundaries are often distorted. There are some secondary problems with vowels and uppercase/lowercase distinctions. We factor out these problems by discarding everything except non-vowel characters and converting everything to lower case. This allows us to use fast Postscript-to-text converters based on string extraction (the translator we use is a modified version of Jason Black's ps2txt program, which in turn is based on a program by Iqbal Qazi). By focusing on non-vowel characters and converting to lower case, we have obtained very reliable results for Postscript generated from TeX, PageMaker, Microsoft Word and FrameMaker. Note that because we consider only consonants, our approach is not actually based on substrings of the document, but rather on character subsequences of the original document. We use subsequences of length 20; given the typical distribution of consonants, this corresponds to spans of about 30-45 characters in the original document.
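
The normalization described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the restriction to ASCII letters, and the treatment of "y" as a consonant are all assumptions made for illustration. The sketch lower-cases the text, keeps only consonants, and enumerates candidate subsequences of 20 consonants together with the number of original characters each one spans. How subsequences are then selected and hashed into fingerprints is outside its scope.

```python
import string

# Assumption: "non-vowel characters" means the ASCII consonants, with "y" included.
CONSONANTS = frozenset(string.ascii_lowercase) - frozenset("aeiou")


def consonant_stream(text):
    """Return the lower-cased consonants of `text` and the index each came from.

    Vowels, digits, punctuation and whitespace are all dropped, so distorted
    word boundaries and case differences no longer affect the result.
    """
    kept, positions = [], []
    for i, ch in enumerate(text):
        c = ch.lower()
        if c in CONSONANTS:
            kept.append(c)
            positions.append(i)
    return "".join(kept), positions


def candidate_subsequences(text, length=20):
    """Yield (subsequence, span) for each run of `length` consecutive consonants.

    `span` is the number of characters of the original text covered by the
    run, from its first consonant to its last.
    """
    stream, positions = consonant_stream(text)
    for start in range(len(stream) - length + 1):
        end = start + length
        span = positions[end - 1] - positions[start] + 1
        yield stream[start:end], span


if __name__ == "__main__":
    clean = "word boundaries are often distorted"
    noisy = "Word-Bound aries areoften dis torted"
    # Both versions reduce to the same consonant stream.
    print(consonant_stream(clean)[0] == consonant_stream(noisy)[0])
    for subsequence, span in candidate_subsequences(clean, length=10):
        print(subsequence, span)
```

Because each subsequence is defined over the consonant stream rather than the raw text, two conversions of the same document that differ only in spacing, punctuation, vowels or case yield identical subsequences, while the recorded span shows how much of the original text each one covers.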

Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996