We have presented a system for document comparison based on textual similarity. Target applications include related document searches and copyright/plagiarism protection. Our system uses fixed-size selective fingerprints based on document substrings, and supports reliable and accurate document comparison with very small fingerprints (about 400 bytes per document). The main novelties of our work are (a) very low storage requirements (almost two orders of magnitude less than competing systems), (b) resilience to noise in documents (such as that introduced by conversion from Postscript to text), (c) security measures to improve the dependability of plagiarism searches in the context of an active adversary, and (d) significant reduction of false positives.
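To make the idea of a fixed-size selective fingerprint concrete, the following is a minimal sketch and not the system's actual implementation. It assumes a simple selection rule (keep the numerically smallest substring hashes), which stands in for whatever selection strategy the system uses. The names `normalize`, `fingerprint`, and `similarity`, the substring length `k=20`, and the choice of 100 four-byte hashes (which would occupy 400 bytes) are all illustrative assumptions.

```python
import hashlib
import re


def normalize(text):
    # Lowercase and keep only letters and digits, so that whitespace and
    # punctuation noise does not change the substrings (an assumed
    # normalization, chosen for illustration).
    return re.sub(r"[^a-z0-9]", "", text.lower())


def fingerprint(text, k=20, size=100):
    # Hash every length-k substring of the normalized text to 32 bits,
    # then keep at most `size` of the smallest distinct hash values.
    # Keeping the smallest hashes is an assumed selection rule.
    s = normalize(text)
    hashes = set()
    for i in range(max(len(s) - k + 1, 0)):
        digest = hashlib.sha1(s[i:i + k].encode("ascii")).digest()
        hashes.add(int.from_bytes(digest[:4], "big"))
    return frozenset(sorted(hashes)[:size])


def similarity(fp_a, fp_b):
    # Fraction of the smaller fingerprint that also appears in the other.
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
```

Because both documents are reduced by the same deterministic rule, two copies of the same text yield the same fingerprint regardless of document length, and the storage cost per document is bounded by `size` hash values.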

Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996