Scalable Document Fingerprinting (Extended Abstract)

Abstract:

As more information becomes available electronically, document search based on textual similarity is becoming increasingly important, not only for locating documents online, but also for addressing internet variants of old problems such as plagiarism and copyright violation.

This paper presents an online system that provides reliable search results using modest resources and scales up to data sets on the order of a million documents. Our system offers a practical compromise among storage requirements, immunity to noise introduced by document conversion, and the security needs of plagiarism applications. We present both quantitative analysis and empirical results to argue that our design is feasible and effective. A web-based prototype system is accessible via the URL http://www.cs.cmu.edu/afs/cs/user/nch/www/koala.html.

Introduction
Textual Relationships
Fingerprinting
Fingerprint Storage
Document Noise
Reducing False Positives
Results
Conclusion
References

Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996