We employ fingerprints of different sizes for storage and search: the fingerprints we store in the database have size 100, but the fingerprints used for searching have size 1000. Importantly, the search fingerprint for a document is a strict superset of the fingerprint used for storage. There are two reasons for this choice. The first is reliability, which is closely tied to the design decisions discussed in Subsection 6.1. The second is security: we want our system to be resilient to attack by would-be plagiarists. To illustrate the issue, first suppose that we use fixed-size fingerprints of 100 substrings for both storage and search, and that the selection strategy is publicly known. In this case, a plagiarist could easily determine which 100 substrings make up the fingerprint and make 100 changes at the appropriate places in the plagiarized version so that it no longer matches the original. If, instead of making the selection strategy public, we keep it secret (for example, by using a secret seed value to guide the selection), it is still possible to find an appropriate set of 100 changes by trial and error (for example, one could chop the original document into pieces and search on each piece separately to identify the selected substrings).
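The superset relationship between the two fingerprints can be illustrated with a minimal sketch. It assumes a selection strategy that keeps the substrings with the smallest seeded hash values; this strategy, the substring length of 20, and the names `substrings`, `seeded_hash`, `fingerprint`, and `overlap` are illustrative assumptions, not the selection strategy of the system described here.

```python
import hashlib


def substrings(text, length=20):
    """All overlapping substrings of a fixed length (the length is an assumed value)."""
    return {text[i:i + length] for i in range(len(text) - length + 1)}


def seeded_hash(s, seed):
    """64-bit hash of a substring, mixed with a (possibly secret) seed."""
    digest = hashlib.sha256(seed + s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text, size, seed=b"", length=20):
    """Assumed strategy: keep the `size` smallest distinct hash values."""
    hashes = sorted({seeded_hash(s, seed) for s in substrings(text, length)})
    return set(hashes[:size])


def overlap(stored_fp, search_fp):
    """Number of stored fingerprint elements found in a search fingerprint."""
    return len(stored_fp & search_fp)


if __name__ == "__main__":
    doc = " ".join(f"word{i}" for i in range(500))
    stored = fingerprint(doc, 100, seed=b"secret")
    search = fingerprint(doc, 1000, seed=b"secret")
    print(stored < search, overlap(stored, search))
```

Under this assumed strategy, the 100 smallest hash values are necessarily among the 1000 smallest, so the stored fingerprint is a subset of the search fingerprint whenever both are computed with the same seed.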
We provide better security by periodically changing the stored fingerprint of a document. The use of two fingerprints provides a particularly convenient way to achieve this: we obtain a new stored fingerprint by simply choosing a different subset of the search fingerprint (since the ratio of sizes involved is 100:1000, this still gives considerable scope for change). The advantage of this approach is that we do not need to change the search engine (i.e., we still generate the same search fingerprint) to search against the modified stored fingerprint. This is important because it allows us to change the database incrementally: we can update the stored fingerprints of a few documents at a time in a transparent manner. To support this process, we maintain a list of URLs for each fingerprinted document so that we can retrieve the document and recompute its fingerprint as desired. We also maintain a contact email address for each document to help resolve stale URLs. We envision updating fingerprints on a regular basis (perhaps once every six months or once a year), with additional unscheduled updates if there is suspicious activity relating to a document (such as an unusually large number of searches for it).
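One way to picture the rotation step is the following sketch, which picks a new 100-element stored fingerprint from an unchanged 1000-element search fingerprint. Ranking elements by an HMAC keyed on a secret and an update counter is an assumption made for illustration, as are the names `choose_stored`, `secret`, and `epoch`; the text above does not specify how the new subset is chosen.

```python
import hashlib
import hmac


def choose_stored(search_fp, secret, epoch, size=100):
    """Pick `size` elements of the search fingerprint for a given update epoch.

    Assumed scheme: rank each element by HMAC-SHA256 over (epoch, element)
    under a secret key, then keep the lowest-ranked elements.
    """
    def rank(element):
        message = f"{epoch}:{element}".encode("utf-8")
        return hmac.new(secret, message, hashlib.sha256).digest()

    return set(sorted(search_fp, key=rank)[:size])


if __name__ == "__main__":
    # Stand-in for a real 1000-element search fingerprint.
    search_fp = set(range(1000))
    before = choose_stored(search_fp, b"site-secret", epoch=1)
    after = choose_stored(search_fp, b"site-secret", epoch=2)
    # Both selections are subsets of the same search fingerprint,
    # so the search side needs no change after the update.
    print(before <= search_fp, after <= search_fp, len(before & after))
```

Because every choice is a subset of the same search fingerprint, a search computed before the update still matches a document whose stored fingerprint was updated afterwards. The number of possible 100-element subsets of a 1000-element fingerprint is `math.comb(1000, 100)`, on the order of 10^139, which is what gives the scheme its scope for change.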