CONCISE REPORT ON THE ALGORITHMS IN SIM 910619

INTRODUCTION

The general outline of the similarity checker is as follows:
1. the files are read in (pass 1)
2. a forward-reference table is prepared
3. the set of interesting runs is determined
4. the exact fseek positions of the runs are determined (pass 2)
5. the contents of the runs are printed in order (pass 3)

To keep the memory requirements (relatively) small, the exact positions
of the tokens are not recorded; this necessitates pass 2.  See, however,
the pertinent chapter.

READING THE FILES

Each file is tokenized using a lex-generated scanner appropriate for the
input.  Each token fits in one byte, possibly using all 8 bits.  The
tokens are stored in the array tk_buff[], which is extended by
reallocation if it overflows; see buff.c.

Also, to optimize away pass 2, an attempt is made to remember the fseek
positions of all beginnings of lines.  The token-position/fseek-position
pairs at each beginning of line are stored in the array nl_buff[], which
is also extended by reallocation, if needed.  If the attempt fails due
to lack of memory, nl_buff[] is abandoned.

PREPARING THE FORWARD-REFERENCE TABLE

Text is compared by comparing every substring to all substrings to the
right of it; this process is in essence quadratic.  However, only
substrings of length at least 'min_run_size' are of interest, which
gives us the possibility to speed up this process by using a hash table.

Once the entire text has been read in, a forward-reference table
forw_ref[] is made (see hash.c).  For every position in the text, we
construct an index which gives the next position in the text where a
run of min_run_size tokens starts that has the same hash code.  If
there is no such run, the index is 0.  These forward references are
kept in the array forw_ref[].  To fill in this array, we use a hash
table hash[], such that hash[i] is the index of the latest token with
hash_code i, or 0 if there is none.
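The filling of forw_ref[] by means of hash[] might look like the
following sketch.  This is illustrative only: hash1() here is a toy
stand-in for the real hash function in hash.c, MIN_RUN_SIZE and
HASH_SIZE are example values, and position 0 serves as the "no
reference" marker.

```c
#include <stddef.h>

#define MIN_RUN_SIZE 4      /* example value; configurable in sim */
#define HASH_SIZE    1024   /* example hash table size */

/* Toy hash over the MIN_RUN_SIZE tokens starting at position i;
   the real hash1() in hash.c differs. */
static size_t hash1(const unsigned char *tk_buff, size_t i)
{
    size_t h = 0, j;
    for (j = 0; j < MIN_RUN_SIZE; j++)
        h = h * 31 + tk_buff[i + j];
    return h % HASH_SIZE;
}

/* Fill forw_ref[]: forw_ref[i] becomes the next position whose
   MIN_RUN_SIZE-token sequence has the same hash code, or 0 if
   there is none.  The caller passes forw_ref[] zero-initialized;
   positions start at 1, position 0 being the sentinel. */
void make_forw_ref(const unsigned char *tk_buff, size_t n_tokens,
                   size_t *forw_ref)
{
    size_t hash[HASH_SIZE] = {0};  /* latest position per hash code */
    size_t i;

    for (i = 1; i + MIN_RUN_SIZE <= n_tokens; i++) {
        size_t h = hash1(tk_buff, i);
        forw_ref[i] = 0;
        if (hash[h] != 0)
            forw_ref[hash[h]] = i; /* earlier position now points here */
        hash[h] = i;
    }
}
```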
If at a given position p, we find that the text ahead of us has hash
code i, hash[i] tells us which position in forw_ref[] will have to be
updated to p.  See make_forw_ref().

For long text sequences (say hundreds of thousands of tokens), the
hashing is not really efficient any more, since too many spurious
matches occur.  Therefore, the forward-reference table is scanned a
second time, eliminating from any chain all references to runs that do
not end in the same token (actually, this is a second hash code).  For
the UNIX manuals this reduced the number of matches from 91.9% to 1.9%
(of which 0.06% was genuine).

DETERMINING THE SET OF INTERESTING RUNS

The overall structure of the routine compare() (see compare.c) is:

    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever-increasing sizes
                        try to match and keep the best

If for a given position in the new file a good run (i.e. one of at
least minimum length) has been found, the run is registered using a
call of add_run(), the run is skipped in the new file, and searching
continues at the position after it.  This prevents duplicate reports
of runs.

Add_run() (see add_run.c) allocates a struct run for the run (see
sim.h), which contains two struct chunks and a quality description.
It fills in the two chunks with the pertinent info, one for the first
file and one for the second (which may be the same, if the run relates
two chunks in the same file).  The run is then entered into the
arbitrary-in-sorted-out store aiso (see aiso.spc and aiso.bdy, a
genuine abstract data type in C!), in which it is inserted according
to its quality.  Both positions (struct position) in both chunks in
the run (so four in total) are each entered in a linked list starting
at the tx_pos field in the struct text of the appropriate file.

When this is finished, the forward-reference table can be deleted.
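The "try to match and keep the best" step can be sketched as follows.
This is a sketch under assumptions: match_len() is a hypothetical
helper, and the real compare.c additionally respects file boundaries
and run quality when measuring matches.

```c
#include <stddef.h>

/* Count how many tokens match between positions i and j (i < j).
   Hypothetical helper; compare.c also honours file boundaries. */
static size_t match_len(const unsigned char *tk_buff, size_t n_tokens,
                        size_t i, size_t j)
{
    size_t len = 0;
    while (j + len < n_tokens && tk_buff[i + len] == tk_buff[j + len])
        len++;
    return len;
}

/* For position i, walk the forward-reference chain and return the
   position of the longest match of at least min_run_size tokens
   (0 if none); *best_len receives its length. */
size_t best_match(const unsigned char *tk_buff, size_t n_tokens,
                  const size_t *forw_ref, size_t i,
                  size_t min_run_size, size_t *best_len)
{
    size_t j, best_pos = 0;

    *best_len = 0;
    for (j = forw_ref[i]; j != 0; j = forw_ref[j]) {
        size_t len = match_len(tk_buff, n_tokens, i, j);
        if (len >= min_run_size && len > *best_len) {
            *best_len = len;
            best_pos = j;
        }
    }
    return best_pos;
}
```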
So the final results of this phase are visible both through the tx_pos
fields and through the aiso interface.

DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2)

The purpose of this pass is to find, for each chunk, which up to now is
known by token position only, its starting and ending line numbers and
the fseek position of the beginning of the starting line (which need
not, and generally will not, be the fseek position of the first token
in the run).

For each file that has a non-zero tx_pos field, i.e. that has some
interesting chunks, the positions in the tx_pos list are sorted on
ascending line number (they have been found in essentially arbitrary
order) by sort_pos() in pass2.c.  Next we scan the pos list and the
file in parallel, updating the info in a position when we meet it.  A
position carries an indication of whether it is a starting or an
ending position, since slightly different calculations have to be done
in each case.

Actually, if the nl_buff[] data structure still exists, the file is
not accessed at all and the data from nl_buff[] is used instead.  This
is done transparently in stream.c.

PRINTING THE CONTENTS OF THE RUNS (PASS 3)

Since each struct run has now been completely filled in, this is
simple; the hard work is calculating the page layout.  Pass3()
accesses the aiso store and retrieves from it the runs in descending
order of importance.  Show_run() opens both files, positions them
using fseek(), and prints the runs, relying on the line count.
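The printing of one chunk can be sketched roughly as follows.  This is
an illustrative sketch, not the actual show_run(): struct chunk_info
and its field names are simplified assumptions standing in for the
data gathered in pass 2.

```c
#include <stdio.h>

/* Simplified stand-in for the information gathered in pass 2: the
   fseek position of the chunk's first line and its line span.
   These names are assumptions, not the fields in sim.h. */
struct chunk_info {
    long first_line_pos;        /* fseek position of the starting line */
    int first_line, last_line;  /* line numbers of the chunk */
};

/* Print the lines of one chunk to 'out', relying on the line count,
   as show_run() does.  Returns 0 on success, -1 on error. */
int print_chunk(const char *fname, const struct chunk_info *ch, FILE *out)
{
    FILE *fp = fopen(fname, "r");
    int line, c;

    if (fp == NULL)
        return -1;
    if (fseek(fp, ch->first_line_pos, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }
    for (line = ch->first_line; line <= ch->last_line; ) {
        if ((c = getc(fp)) == EOF)
            break;
        fputc(c, out);
        if (c == '\n')
            line++;
    }
    fclose(fp);
    return 0;
}
```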
================================================================
CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (901214)

sim:
    get command line options
    check the options
    init buffer
    init language
    pass1, read the files
        | there is an array tk_buff[] that holds all input tokens
    make forward reference table
        | there is an array forw_ref[], with one entry for each token
        | in the input; forw_ref[i] gives the token number where a
        | token sequence starts with the same hash value as the one
        | starting at i
    compare various files
    delete forward reference table
    pass2, find positions of found similarities
    pass3, print the similarities

pass1, read the files:
    for each file
        divide the text into tokens
        store all tokens except newlines in tk_buff

make forward reference table:
    | there are two independent hash functions, hash1() and hash2();
    | hash1(i) gives the hash value of the token sequence starting
    | at i, and likewise for hash2(i)
    set up the forward references using the hash table:
        | there is an array hash[], with one entry for each possible
        | hash value; hash[i] gives the position in forw_ref[]
        | where i was most recently entered as a hash value
        for each file
            for all positions in file except the last min_run_size
                set forw_ref[] and update hash[]
    use hash2() to clean out matches:
        for all tokens
            find first token in chain with same hash2 code
            short-circuit forward reference to it

compare:
    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever-increasing sizes
                        try to match and keep the best

try to match and keep the best:
    | using forw_ref[], we find a list of positions at which a
    | matching token sequence will start; scanning this list, we
    | measure the maximum length of the match and add the longest
    | match to the run collection

pass2, find positions of found runs:
    for all files
        sort the positions in the runs
        | we scan the pos list and the file in parallel
        for all positions inside this file
            if it matches a token position in a run
                record character position and line number

pass3, print the similarities:
    for all runs
        | a run consists of two chunks
        open the files that hold the chunks
        position them at the beginning of the chunk
        display the chunks
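The "use hash2() to clean out matches" step can be sketched like this.
A sketch under assumptions: hash2() is reduced here to the last token
of the run, which is what the report says the second hash code amounts
to; the real code in hash.c differs in detail.

```c
#include <stddef.h>

/* Second hash: in essence the last token of the min_run_size run
   starting at i (runs must "end in the same token"). */
static unsigned char hash2(const unsigned char *tk_buff, size_t i,
                           size_t min_run_size)
{
    return tk_buff[i + min_run_size - 1];
}

/* Short-circuit each forward reference to the first position in its
   chain with the same hash2 code, eliminating spurious matches.
   Positions are processed in ascending order, so each chain is still
   intact downstream of the position being cleaned. */
void clean_forw_ref(const unsigned char *tk_buff, size_t n_tokens,
                    size_t *forw_ref, size_t min_run_size)
{
    size_t i;

    for (i = 1; i + min_run_size <= n_tokens; i++) {
        size_t j = forw_ref[i];
        while (j != 0 && hash2(tk_buff, j, min_run_size)
                         != hash2(tk_buff, i, min_run_size))
            j = forw_ref[j];
        forw_ref[i] = j;    /* first same-hash2 position, or 0 */
    }
}
```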