Koala Document Fingerprinting (KDF)

KDF is an experimental system for identifying textually related documents. The current system focuses on computer science research documents such as technical reports, conference papers and journal articles, but there are many other applications.

Purpose

Most pieces of research writing evolve over a period of time. Different versions of a document are published or otherwise made available at various stages of this process. The purpose of the KDF system is to trace the history of a document by identifying segments of text that it shares with other documents. Typical uses of the system include:

finding citations: you have a draft version of a document and want to know where it was eventually published.
tracing references: you have a conference abstract and want to know if there is a revised or journal version of the article.
tracing plagiarism: you have a paper and want to check whether it has been previously published, in whole or in part,
- by the same authors.
- by different authors.

What happens during searching?

Document searches proceed in a number of steps. First the document (URL) is loaded by the KDF server. If necessary, the document is converted to a textual representation. Then, using this text, a fingerprint of the document is generated. Finally, this fingerprint is matched against the current document fingerprint database to find related documents.
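The last two steps (fingerprint generation and matching) can be sketched as follows. This page does not describe how KDF computes fingerprints, so the sketch assumes one common approach: hash every fixed-length character substring of the normalized text and keep only the numerically smallest hashes. The names normalize, fingerprint and search, and the parameters width and keep, are hypothetical and not taken from KDF. Loading the URL and converting the document to text are omitted.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits,
    so that spacing, punctuation and line breaks do not affect matching."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def fingerprint(text: str, width: int = 30, keep: int = 100) -> set[int]:
    """Hash every `width`-character substring of the normalized text and
    keep the `keep` smallest hash values (assumed selection rule)."""
    chars = normalize(text)
    hashes = set()
    for i in range(max(len(chars) - width + 1, 0)):
        digest = hashlib.sha1(chars[i:i + width].encode("ascii")).digest()
        hashes.add(int.from_bytes(digest[:4], "big"))  # 32-bit hash value
    return set(sorted(hashes)[:keep])


def search(query: set[int], database: dict[str, set[int]]) -> list[tuple[str, int]]:
    """Rank database entries by the number of hash values shared with the
    query fingerprint, omitting entries that share none."""
    scores = [(url, len(query & fp)) for url, fp in database.items()]
    return sorted([s for s in scores if s[1] > 0], key=lambda s: -s[1])
```

In this sketch a database is simply a dictionary from URL to fingerprint, and search(fingerprint(text), database) returns (URL, shared hash count) pairs with the most closely related documents first.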

Limitations

Fingerprint Database: The scope of searches in the KDF system is defined by what is currently in the fingerprint database. We are actively seeking out papers available online; this is a lengthy process. If you know of papers that are not currently in the database, you may add them yourself (see the KDF home page). Alternatively, if you have a large collection of URLs, please email them to nch@cs.cmu.edu and we will gladly add them. The KDF system is designed for a large database of documents. Searching is scalable, and storage requirements are low (about 500 bytes per document). Our aim is for a database of about one million documents, including computer science conference proceedings and journals.
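As a rough check of what these figures imply, the arithmetic below multiplies the quoted per-document storage by the target database size. Both inputs are the approximate values stated above; the constant names are illustrative only.

```python
BYTES_PER_DOCUMENT = 500       # approximate storage per document, as quoted above
TARGET_DOCUMENTS = 1_000_000   # stated goal for the size of the database

total_bytes = BYTES_PER_DOCUMENT * TARGET_DOCUMENTS
total_megabytes = total_bytes / 1_000_000  # decimal megabytes

print(f"{total_bytes} bytes = {total_megabytes} MB")  # 500000000 bytes = 500.0 MB
```

At these figures the full target database would need roughly 500 MB of fingerprint storage.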

Textual Conversion: To give reliable searches for documents in a variety of representations, the basic fingerprinting techniques operate on textual representations. Documents in non-textual representations must be converted to text. The KDF system has limited support for PostScript; documents in other formats must be converted to text before they can be used.

PostScript Support: Reliable conversion from PostScript to text is difficult, error-prone and expensive. Motivated mainly by resource constraints, we have incorporated a limited string-based PostScript-to-text conversion process into KDF. This is satisfactory for PostScript generated using TeX and FrameMaker, but it is not effective on PostScript generated by Microsoft Word. The limited PostScript support is included only for convenience: it is not fundamental to the fingerprinting process. We hope to incorporate better conversion processes as they become available (pointers welcome!).
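To illustrate what a string-based conversion might look like, the sketch below collects the contents of parenthesized string literals in a PostScript file. This is only one reading of "string-based" and is not KDF's actual converter; the function name extract_strings and the ESCAPES table are hypothetical. It handles nested parentheses, comments and the simple backslash escapes, but not octal escapes, hexadecimal strings or font re-encodings, which is one reason such an approach can fail on output from some applications.

```python
# Simple backslash escapes inside PostScript string literals.
# Octal escapes such as \050 are NOT handled in this sketch.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}


def extract_strings(postscript: str) -> list[str]:
    """Return the contents of (...) string literals in order of appearance."""
    strings, current, depth, i = [], [], 0, 0
    while i < len(postscript):
        ch = postscript[i]
        if depth == 0:
            if ch == "%":
                # Comment outside a string: skip to the end of the line.
                while i < len(postscript) and postscript[i] != "\n":
                    i += 1
                continue
            if ch == "(":
                depth = 1  # start of a string literal
        elif ch == "\\" and i + 1 < len(postscript):
            # Escaped character: translate it, or keep it literally.
            nxt = postscript[i + 1]
            current.append(ESCAPES.get(nxt, nxt))
            i += 1
        elif ch == "(":
            depth += 1  # balanced nested parenthesis is part of the string
            current.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                strings.append("".join(current))
                current = []
            else:
                current.append(ch)
        else:
            current.append(ch)
        i += 1
    return strings
```

Joining the returned strings gives a crude text rendering that could then be passed to the fingerprinting step.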

Other Applications

The fingerprinting techniques used in KDF are independent of the underlying textual objects used. While this particular implementation has focussed on computer science research documents, the same system could be used for searching web pages, magazine articles or speeches.

For further details...

See Scalable Document Fingerprinting, Second USENIX Electronic Commerce Workshop, pp. 191-200, 1996 (PostScript version).

The Koala Document Fingerprinting system was developed by Nevin Heintze while at Carnegie Mellon University, August 1995.
"Koala Document Fingerprinting" copyright © 1995 Nevin Heintze. All Rights Reserved.