As more information becomes available electronically, document search based on
textual similarity is becoming increasingly important, not only for locating
documents online, but also for addressing internet variants
of old problems such as plagiarism and copyright violation.
This paper presents an online system that provides reliable search results using
modest resources and scales up to data sets on the order of a million
documents. Our system provides a practical compromise between storage
requirements, immunity to noise introduced by document conversion, and security
needs for plagiarism-detection applications. We present both quantitative analysis and
empirical results to argue that our design is feasible and
effective. A web-based prototype system is accessible via the URL
http://www.cs.cmu.edu/afs/cs/user/nch/www/koala.html.