As more information becomes available on the internet, searching for documents based on textual similarity is becoming increasingly useful. For example, suppose that I have an early version of a research article, and that I want to determine where it was eventually published (so that I can cite it appropriately), find whether there is a more up-to-date version that fixes previous errors, or perhaps locate an old technical report version that contains more complete proofs or implementation details. Given that the various versions of an article are likely to have a significant amount of text in common, a textually related document search is very likely to locate the earlier/later versions of the article (assuming they are accessible to the search engine).
Another application of this kind of search is the detection of copyright violations and plagiarism on the internet. It is now all too easy to obtain a paper over the internet, modify the cover page to insert your name in place of the original authors', perhaps change the title and abstract, and submit the paper as if it were your own. Given the large number of conferences and journals, and the imperfections of the refereeing process, the chances of such a paper slipping through undetected are significant. This kind of plagiarism has been successfully carried out in the past and was the subject of a recent editorial in the Communications of the ACM. Arguably, it was inevitable that the individual involved would be discovered. What is surprising is the scope of his activities, the time it took before he was discovered, and that he was able to continue with some success even after his case was well known.
More generally, the internet raises major plagiarism and copyright problems because of the ease with which documents may be copied and modified. Resources such as newswire services, newspaper articles, netnews articles, online books and so forth are increasingly at risk. As a result, many important sources of information are not made available online because the individuals and organizations that own the information find these risks unacceptable. On another front, there is increasing concern about plagiarism of coursework papers at the undergraduate and graduate levels.
There are two approaches to this plagiarism/copyright problem. In the first, digital signatures or watermarks are included in a document [3, 10]. These signatures may involve the use of particular word spacings or checksums of components of a document. Unfortunately, these signatures can often be deleted (particularly if the document is translated from one format to another). Moreover, these approaches are not well suited for detecting partial matches involving modified documents. The second approach involves the registration and storage of documents and subsequent textual matching with new documents to track copies and modified versions [1, 6, 11]. This approach has been used in a number of implemented systems [2, 7, 8, 9].
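The second, registration-based approach can be illustrated with a deliberately simplified sketch: each registered document is reduced to a set of hashed word n-grams ("fingerprints"), and a new document is compared against registered ones by fingerprint overlap. This is not the algorithm of any of the cited systems; the n-gram length and hash choice below are illustrative assumptions.

```python
# Simplified sketch of registration-based textual matching (illustrative
# parameters only; not the algorithm of any cited system).
import hashlib

def fingerprints(text, n=8):
    """Return the set of hashes of all n-word substrings of text."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(g.encode()).hexdigest()[:8] for g in grams}

def overlap(doc_a, doc_b, n=8):
    """Fraction of doc_a's fingerprints that also occur in doc_b.

    A high overlap suggests doc_b contains (a version of) doc_a,
    even if doc_b has been retitled or extended.
    """
    fa, fb = fingerprints(doc_a, n), fingerprints(doc_b, n)
    return len(fa & fb) / len(fa) if fa else 0.0
```

Because matching is done on many overlapping substrings rather than on a single checksum, cosmetic edits such as a new cover page or title leave most fingerprints intact, which is why this approach tolerates partial modification better than embedded signatures.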
In this paper we consider the general problem of textual matching for document search and plagiarism/copyright applications. We focus on the question of how to build a practical online system that provides reliable search results for a data set on the order of a million documents, while using modest resources (1/2 GB of disk, i.e., about 500 bytes per document). Three central issues arise: first, how can we meet the space constraints; second, how can we provide reliable search in the presence of noise (to perform textual matching, we must convert documents into text, and this is typically a very noisy and unreliable process); and third, how can we meet the security needs of plagiarism applications (if it is easy to circumvent the matching of related documents by making a handful of selective text changes, then the system will be of little value for plagiarism detection). We present both quantitative and empirical analyses of our system. A web-based prototype system called Koala is accessible via the URL http://www.cs.cmu.edu/afs/cs/user/nch/www/koala.html.
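To make the 500-bytes-per-document budget concrete, here is a hypothetical back-of-the-envelope sketch: if each stored fingerprint costs roughly 8 bytes, a document can keep at most about 60 fingerprints, so some selection scheme is needed to choose which n-gram hashes to store. The "keep hashes equal to 0 mod p" rule shown is one well-known selection heuristic from the fingerprinting literature, not necessarily the one used by Koala; all figures are illustrative.

```python
# Back-of-the-envelope sketch of the per-document space budget.
# All constants are illustrative assumptions, not Koala's actual parameters.
import hashlib

BYTES_PER_FINGERPRINT = 8           # assumed storage cost per stored hash
TARGET_BYTES_PER_DOC = 500          # from the stated budget
MAX_FINGERPRINTS = TARGET_BYTES_PER_DOC // BYTES_PER_FINGERPRINT  # ~62

def select_fingerprints(text, n=8, p=4):
    """Keep only n-gram hashes whose value is 0 mod p, capped at the budget."""
    words = text.lower().split()
    kept = []
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        if h % p == 0:              # deterministic subsampling of fingerprints
            kept.append(h)
    return kept[:MAX_FINGERPRINTS]

doc = "some example document text " * 100
stored = select_fingerprints(doc)
print(len(stored), "fingerprints, ~", len(stored) * BYTES_PER_FINGERPRINT, "bytes")
```

Selection must be deterministic (a property of the hash value itself, not the position) so that the same substrings are chosen from both an original and a modified copy; otherwise the stored fingerprints of two nearly identical documents might not intersect.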