
Textual Relationships

We first address the question of what kinds of textual relationships should be considered significant. Checking for exact matches is easy, but not satisfactory. For example, it would miss matches where one document is the result of minor edits to the other. It would also miss identical documents that differ only because of noise introduced by document translation processes (PostScript-to-text conversion, OCR, etc.). Moreover, it would ignore most of the interesting textual relationships between documents. In this paper, we consider the following general kinds of relationships to be significant:

Identical documents.
Documents that are the result of small edits/corrections to other documents.
Documents that are reorganizations of other documents.
Documents that are revisions of other documents.
Documents that are condensed/expanded versions of other documents (e.g., journal versus conference versions of papers).
Documents that include portions (say, several hundred words) of other documents.

We require that the first five classes of relationships be identified with very high probability; for the remaining class we will tolerate a small number of false positives and false negatives.
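
To illustrate why exact matching alone is not satisfactory, the following minimal Python sketch compares two texts that differ by a single edited word. It is not the fingerprinting scheme described in this paper. The function names `exact_match`, `word_ngrams` and `overlap`, the use of SHA-256, and the word-trigram overlap measure are all assumptions chosen purely for illustration.

```python
import hashlib


def exact_match(a: str, b: str) -> bool:
    """Return True only if the two texts are byte-for-byte identical."""
    digest_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def word_ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams (hypothetical helper, not the paper's method)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: str, b: str, n: int = 3) -> float:
    """Fraction of the n-grams of `a` that also occur in `b`."""
    grams_a = word_ngrams(a, n)
    if not grams_a:
        return 0.0
    return len(grams_a & word_ngrams(b, n)) / len(grams_a)


original = "Checking for exact matches is easy but not satisfactory"
edited = "Checking for exact matches is easy but rarely satisfactory"

print(exact_match(original, edited))  # a one-word edit defeats exact matching
print(overlap(original, edited))      # most word trigrams are still shared
```

Under these assumptions, the exact comparison reports no match for the edited text, while the trigram overlap remains high. A measure of this second kind is what the first five classes of relationships call for. How to compute such a measure at scale is a separate question that this sketch does not address.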

Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996