Consider the following simple fingerprinting scheme: given a document, let the
fingerprint of the document consist of the set of all possible document
substrings of length . There are such substrings, where
*l* is the length of the document. Comparing two documents under this scheme is
simply a matter of counting the number of substrings common to both
fingerprints: if we compare a document *A* of size *|A|* against a document *B*,
and if *n* is the number of substrings common to both documents then is
the measure of how much of *A* is contained in *B*. If is chosen
appropriately, this simple fingerprint gives reliable document matching results.
We refer to this scheme as *full fingerprinting*. Although it is not
practical (for space reasons), it is a very useful measure of document
similarity, and we shall use it for the evaluation of our system in Section
7. (We remark that it would not be a good idea to construct
fingerprints by chopping up the document into
substrings by making a cut at every character, because insertion
of a character at the start of the document would shift the substrings by 1 and
the resulting fingerprint would be a poor match to the original, even though the
two documents are almost identical).

The choice of in this fingerprinting scheme is particularly important, and is subject to two conflicting constraints. If is too small, then there will be many false matches (e.g. if is the size of a word, then the scheme reduces to little more than comparing lists of words in documents, a poor similarity metric). If is too large, then there will be many false negatives because one character change can affect substrings in the fingerprint (e.g. if is the size of a paragraph, then a single character change in a paragraph would effectively prevent matching for the entire paragraph). We remark that there is no ``right'' value for : we cannot quantitatively or empirically calculate a definitive value for . In essence, the choice of defines the notion of document similarity for the system. Different values of will be useful for different kinds of searches. The value of used in this paper is effectively around 30-45 (more precisely, our strings consist of 20 character sequences of consonants; we discuss this in detail in Section 5). We investigate the effect of different values of in Section 7.

Thu Oct 3 20:48:58 EDT 1996