next up previous
Next: Selective Fingerprinting Up: Fingerprinting Previous: Fingerprinting

Full Fingerprinting

Consider the following simple fingerprinting scheme: given a document, let the fingerprint of the document consist of the set of all possible document substrings of length tex2html_wrap_inline419 . There are tex2html_wrap_inline421 such substrings, where l is the length of the document. Comparing two documents under this scheme is simply a matter of counting the number of substrings common to both fingerprints: if we compare a document A of size |A| against a document B, and if n is the number of substrings common to both documents then tex2html_wrap_inline433 is the measure of how much of A is contained in B. If tex2html_wrap_inline439 is chosen appropriately, this simple fingerprint gives reliable document matching results. We refer to this scheme as full fingerprinting. Although it is not practical (for space reasons), it is a very useful measure of document similarity, and we shall use it for the evaluation of our system in Section 7. (We remark that it would not be a good idea to construct fingerprints by chopping up the document into tex2html_wrap_inline441 substrings by making a cut at every tex2html_wrap_inline443 character, because insertion of a character at the start of the document would shift the substrings by 1 and the resulting fingerprint would be a poor match to the original, even though the two documents are almost identical).

The choice of tex2html_wrap_inline445 in this fingerprinting scheme is particularly important, and is subject to two conflicting constraints. If tex2html_wrap_inline447 is too small, then there will be many false matches (e.g. if tex2html_wrap_inline449 is the size of a word, then the scheme reduces to little more than comparing lists of words in documents, a poor similarity metric). If tex2html_wrap_inline451 is too large, then there will be many false negatives because one character change can affect tex2html_wrap_inline453 substrings in the fingerprint (e.g. if tex2html_wrap_inline455 is the size of a paragraph, then a single character change in a paragraph would effectively prevent matching for the entire paragraph). We remark that there is no ``right'' value for tex2html_wrap_inline457 : we cannot quantitatively or empirically calculate a definitive value for tex2html_wrap_inline459 . In essence, the choice of tex2html_wrap_inline461 defines the notion of document similarity for the system. Different values of tex2html_wrap_inline463 will be useful for different kinds of searches. The value of tex2html_wrap_inline465 used in this paper is effectively around 30-45 (more precisely, our strings consist of 20 character sequences of consonants; we discuss this in detail in Section 5). We investigate the effect of different values of tex2html_wrap_inline467 in Section 7.



next up previous
Next: Selective Fingerprinting Up: Fingerprinting Previous: Fingerprinting



Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996