We first address the question of what kinds of textual relationships should be considered significant. Checking for exact matches is easy but not satisfactory. For example, it would miss matches where one document is the result of minor edits to the other. It would also miss documents that are otherwise identical but differ because of noise introduced by document translation processes (PostScript-to-text conversion, OCR, etc.). Moreover, it would ignore most of the interesting textual relationships between documents. In this paper, we consider the following general kinds of relationships to be significant:
We require that the first five classes of relationships be identified with very high probability; for the remaining class we will tolerate a small number of false positives and false negatives.