Consider the following simple fingerprinting scheme: given a document, let the fingerprint of the document consist of the set of all possible document substrings of length . There are such substrings, where l is the length of the document. Comparing two documents under this scheme is simply a matter of counting the number of substrings common to both fingerprints: if we compare a document A of size |A| against a document B, and if n is the number of substrings common to both documents then is the measure of how much of A is contained in B. If is chosen appropriately, this simple fingerprint gives reliable document matching results. We refer to this scheme as full fingerprinting. Although it is not practical (for space reasons), it is a very useful measure of document similarity, and we shall use it for the evaluation of our system in Section 7. (We remark that it would not be a good idea to construct fingerprints by chopping up the document into substrings by making a cut at every character, because insertion of a character at the start of the document would shift the substrings by 1 and the resulting fingerprint would be a poor match to the original, even though the two documents are almost identical).
The choice of in this fingerprinting scheme is particularly important, and is subject to two conflicting constraints. If is too small, then there will be many false matches (e.g. if is the size of a word, then the scheme reduces to little more than comparing lists of words in documents, a poor similarity metric). If is too large, then there will be many false negatives because one character change can affect substrings in the fingerprint (e.g. if is the size of a paragraph, then a single character change in a paragraph would effectively prevent matching for the entire paragraph). We remark that there is no ``right'' value for : we cannot quantitatively or empirically calculate a definitive value for . In essence, the choice of defines the notion of document similarity for the system. Different values of will be useful for different kinds of searches. The value of used in this paper is effectively around 30-45 (more precisely, our strings consist of 20 character sequences of consonants; we discuss this in detail in Section 5). We investigate the effect of different values of in Section 7.