The selection strategy used to choose the substrings to include in a fingerprint can have a significant impact on the number of false positives. The issue is that some substrings occur significantly more frequently than others - sometimes by many orders of magnitude. Moreover, these frequent substrings are more likely to be selected in a fingerprint if fingerprint construction does not consider substring frequency. To illustrate the potential impact of this, first suppose that each substring is equally likely to appear in a document, and consider comparing two unrelated documents. If the hash function used is well behaved, then the search fingerprint contains 1000 randomly selected elements from a space of elements and the storage fingerprint contain 100 randomly selected elements from the same space. Now, the probability that at least one element from the space will appear in both fingerprints is given by the probability that a particular element appears multiplied by the number of possible elements. This can be approximated by:
which is about . Hence, for a database of 1M documents, we can expect about 370 random noise hits for each search.
Now suppose that a fingerprint consists of substrings that are 10 times more frequent than the average. Then the probability of a one substring match between two documents increases by a factor of to , or near 40K random noise hits for a search against 1M documents.
We can significantly reduce false positives by using a substring selection strategy based on frequency measures. One way to do this is to compute the set of all substrings for the document and then pick the least frequently occurring substrings. However this is computationally expensive and does not yield good results because the space of substrings in one document is not a useful indication of the overall frequency of substrings.
Instead, we use a frequency measure based on the first five letters of a substring. This is not only cheaper to compute, but gives useful results. The intuition is that the distribution of five letter sequences in a specific document is a useful approximation to the general distribution of five letter sequences. Hence if we pick substrings whose first five letters occur infrequently in a document, it is likely that the first five letters of these substrings will occur infrequently in general. Substrings whose first five letters occur infrequently are likely to be such that the entire substring occurs relatively infrequently. In Section 7 we give experimental evidence that indicates this technique can reduce false positives by more than a factor of two.
Underlying this approach is the assumption that the substring frequency distribution of (significant) overlapping text segments does not vary substantially from the frequency distribution of other text segments (which would imply that match ratios for related documents are not affected by focusing on infrequent substrings). This assumption appears to be valid in practice.
One problem with the use of five letter frequency distributions for substring selection is that different documents may have slightly different five letter frequency distributions. The use of different size fingerprints for storage and search provides an effective way to address this. To illustrate why, consider a substring s that is common to two documents A and B. If s is selected in B's stored fingerprint (that is, s is ``very infrequent'' according to B's frequency measure), then although it may not appear in A's stored fingerprint (because the frequency measures for A and B differ), it most probably will appear in A's (much larger) search fingerprint because it is likely to be ``moderately infrequently'' according to A's frequency measure.
We remark that another way to identify infrequent substrings is to use the fingerprint database to provide frequency measures. One disadvantage of this approach is that fingerprint generation can no longer be performed as a stand-alone operation. However it has the potential to very reliably identify infrequent substrings and deserves further investigation.