As mentioned earlier, full fingerprinting is conceptually useful but impractical because of the size of the fingerprints it generates. To reduce the size of a fingerprint, we select a subset of the substrings in the full fingerprint. Because our goal is to handle documents ranging in size from several thousand to several hundred thousand words under very tight space constraints, we select a fixed number of substrings, independent of the size of the document. We call this fixed size selective fingerprinting. (An alternative is to select a fixed proportion of the substrings, so that the size of the selective fingerprint is proportional to the size of the document. The main drawback of this alternative is space consumption: accurate fingerprinting of documents with several thousand words requires a fingerprint containing 50-100 substrings, which implies fingerprints of 5000-10000 substrings for documents containing several hundred thousand words.)
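The space argument for preferring a fixed number of substrings can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the concrete word counts of 5,000 and 500,000 are assumptions standing in for "several thousand" and "several hundred thousand" words. Only the 50-100 substring range comes from the text.

```python
def fixed_size(n_words: int, k: int) -> int:
    """Fixed size selection: the fingerprint always holds k substrings."""
    return k


def proportional_size(n_words: int, ref_words: int, ref_size: int) -> int:
    """Proportional selection: scale the fingerprint with document length.

    The proportion is set so that a document of ref_words words gets
    ref_size substrings; the result is rounded up to a whole substring.
    """
    return -(-n_words * ref_size // ref_words)  # ceiling division


# Assumed document sizes (not given numerically in the text).
SMALL_DOC = 5_000      # "several thousand words"
LARGE_DOC = 500_000    # "several hundred thousand words"

for k in (50, 100):    # the 50-100 substrings needed for a small document
    print(
        f"k={k}: fixed -> {fixed_size(LARGE_DOC, k)} substrings, "
        f"proportional -> {proportional_size(LARGE_DOC, SMALL_DOC, k)} substrings"
    )
```

Under these assumed sizes, the proportional scheme grows the fingerprint of the large document by a factor of 100, which reproduces the 5000-10000 range above, while the fixed size scheme stays at 50-100 substrings regardless of document length.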
The design of a fixed size selective fingerprinting system revolves around two choices: fingerprint size and selection strategy (that is, which substrings to select from the full fingerprint). We discuss these in the next two subsections.