Most closely related to our work is the Stanford Digital Library work on SCAM [7, 8, 9]. Their approach uses relative word frequencies. Similarity between documents is based on a modified cosine similarity measure. The storage requirements for this approach are approximately 30%-65% of the size of the original documents ( reports data storage requirements of between 37MB and 79MB for a document set of 120MB, depending on the chunking approach used). The use of words as the basis for document analysis requires textual representations of documents that faithfully preserve word boundaries.
Our approach is based on selecting a set of subsequences of characters from a document and generating a fingerprint based on the hash values of these subsequences. Similarity between two documents is measured by counting the number of common subsequences in fingerprints (the reliability of this measure is critically dependent on how the subsequences are selected from a document). One major difference from SCAM is that we store less than 500 bytes per document (400 bytes for the actual fingerprint, about 20 bytes for document identification such as URL and email information, and some overhead for indexing), which typically means that our storage requirements are about 0.5-1% of the size of the original documents, almost two orders of magnitude less than the requirements for SCAM. Another significant difference is that we accept documents in a variety of different formats (including Postscipt generated from TeX, PageMaker, Microsoft Word and FrameMaker). To perform textual comparison on such documents, we must first convert them to text. The problem is that such conversions introduce many errors. In particular, punctuation and word spacing are very unreliable. An early version of our work was word based, but did not give satisfactory results. Instead, we have adopted techniques that are largely insensitive to word boundaries and other common errors introduced by document conversion.
The idea of selecting a set of pieces of a document and then hashing these to obtain a document fingerprint is used by Manber in sif, a tool for finding similar files in a large file system. However, the motivations for sifand our work are very different. First, siffocuses on similarities of 25% and higher, whereas we strive to provide reliable information for matches of 3-5% and lower (the key is to reduce the number of false positives while retaining important matches). Second, our work addresses the problem of tolerance to noise. Third, our work addresses special issues related to plagiarism applications. As a result, the sif and Koala system designs differ quite substantially.