The selection strategy used to choose the substrings to include in a fingerprint can have a significant impact on the number of false positives. The issue is that some substrings occur significantly more frequently than others - sometimes by many orders of magnitude. Moreover, these frequent substrings are more likely to be selected in a fingerprint if fingerprint construction does not consider substring frequency. To illustrate the potential impact of this, first suppose that each substring is equally likely to appear in a document, and consider comparing two unrelated documents. If the hash function used is well behaved, then the search fingerprint contains 1000 randomly selected elements from a space of elements and the storage fingerprint contain 100 randomly selected elements from the same space. Now, the probability that at least one element from the space will appear in both fingerprints is given by the probability that a particular element appears multiplied by the number of possible elements. This can be approximated by:

which is about . Hence, for a database of *1M* documents, we can
expect about *370* random noise hits for each search.

Now suppose that a fingerprint consists of substrings that
are 10 times more frequent than the average. Then the probability of a one
substring match between two documents increases by a factor of to
, or near *40K* random noise hits for a search against *1M* documents.

We can significantly reduce false positives by using a substring selection strategy based on frequency measures. One way to do this is to compute the set of all substrings for the document and then pick the least frequently occurring substrings. However this is computationally expensive and does not yield good results because the space of substrings in one document is not a useful indication of the overall frequency of substrings.

Instead, we use a frequency measure based on the first five letters of a substring. This is not only cheaper to compute, but gives useful results. The intuition is that the distribution of five letter sequences in a specific document is a useful approximation to the general distribution of five letter sequences. Hence if we pick substrings whose first five letters occur infrequently in a document, it is likely that the first five letters of these substrings will occur infrequently in general. Substrings whose first five letters occur infrequently are likely to be such that the entire substring occurs relatively infrequently. In Section 7 we give experimental evidence that indicates this technique can reduce false positives by more than a factor of two.

Underlying this approach is the assumption that the substring frequency distribution of (significant) overlapping text segments does not vary substantially from the frequency distribution of other text segments (which would imply that match ratios for related documents are not affected by focusing on infrequent substrings). This assumption appears to be valid in practice.

One problem with the use of five letter frequency distributions for substring
selection is that different documents may have slightly different five letter
frequency distributions. The use of different size fingerprints for storage and
search provides an effective way to address this. To illustrate why, consider a
substring *s* that is common to two documents *A* and *B*. If *s* is selected
in *B*'s stored fingerprint (that is, *s* is ``very infrequent'' according to
*B*'s frequency measure), then although it may not appear in *A*'s stored
fingerprint (because the frequency measures for *A* and *B* differ), it most
probably will appear in *A*'s (much larger) search fingerprint because it is
likely to be ``moderately infrequently'' according to *A*'s frequency measure.

We remark that another way to identify infrequent substrings is to use the fingerprint database to provide frequency measures. One disadvantage of this approach is that fingerprint generation can no longer be performed as a stand-alone operation. However it has the potential to very reliably identify infrequent substrings and deserves further investigation.

Thu Oct 3 20:48:58 EDT 1996