CONCISE REPORT ON THE ALGORITHMS IN SIM 910619

INTRODUCTION

The general outline of the similarity checker is as follows:
1. the files are read in (pass 1)
2. a forward-reference table is prepared
3. the set of interesting runs is determined
4. the exact fseek positions of the runs are determined (pass 2)
5. the contents of the runs are printed in order (pass 3)

To keep the memory requirements (relatively) small, the exact positions
of the tokens are not recorded; this necessitates pass 2.  See, however,
the pertinent chapter.

READING THE FILES

Each file is tokenized using a lex-generated scanner appropriate for the
input.  Each token fits in one byte, possibly using all 8 bits.  The
tokens are stored in the array tk_buff[], which is extended by
reallocation if it overflows; see buff.c.

Also, to optimize away pass 2, an attempt is made to remember the fseek
positions of all beginnings of lines.  The token-position/fseek-position
pairs at each beginning of line are stored in the array nl_buff[], which
is also extended by reallocation, if needed.  If the attempt fails due
to lack of memory, nl_buff[] is abandoned.

PREPARING THE FORWARD-REFERENCE TABLE

Text is compared by comparing every substring to all substrings to the
right of it; this process is in essence quadratic.  However, only
substrings of length at least 'min_run_size' are of interest, which
gives us the possibility to speed up this process by using a hash table.

Once the entire text has been read in, a forward-reference table
forw_ref[] is made (see hash.c).  For every position in the text, we
construct an index which gives the next position in the text where a
run of min_run_size tokens starts that has the same hash code.  If
there is no such run, the index is 0.  These forward references are
kept in the array forw_ref[].  To fill in this array, we use a hash
table hash[], such that hash[i] is the index of the latest token with
hash_code i, or 0 if there is none.
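The filling of forw_ref[] by means of hash[] might look like the
following sketch.  This is illustrative only: hash1() here is a toy
stand-in for the real hash function in hash.c, MIN_RUN_SIZE and
HASH_SIZE are example values, and position 0 serves as the "no
reference" marker.

```c
#include <stddef.h>

#define MIN_RUN_SIZE 4      /* example value; configurable in sim */
#define HASH_SIZE    1024   /* example hash table size */

/* Toy hash over the MIN_RUN_SIZE tokens starting at position i;
   the real hash1() in hash.c differs. */
static size_t hash1(const unsigned char *tk_buff, size_t i)
{
    size_t h = 0, j;
    for (j = 0; j < MIN_RUN_SIZE; j++)
        h = h * 31 + tk_buff[i + j];
    return h % HASH_SIZE;
}

/* Fill forw_ref[]: forw_ref[i] becomes the next position whose
   MIN_RUN_SIZE-token sequence has the same hash code, or 0 if
   there is none.  The caller passes forw_ref[] zero-initialized;
   positions start at 1, position 0 being the sentinel. */
void make_forw_ref(const unsigned char *tk_buff, size_t n_tokens,
                   size_t *forw_ref)
{
    size_t hash[HASH_SIZE] = {0};  /* latest position per hash code */
    size_t i;

    for (i = 1; i + MIN_RUN_SIZE <= n_tokens; i++) {
        size_t h = hash1(tk_buff, i);
        forw_ref[i] = 0;
        if (hash[h] != 0)
            forw_ref[hash[h]] = i; /* earlier position now points here */
        hash[h] = i;
    }
}
```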
If at a given position p, we find that the text ahead of us has hash
code i, hash[i] tells us which position in forw_ref[] will have to be
updated to p.  See make_forw_ref().

For long text sequences (say hundreds of thousands of tokens), the
hashing is not really efficient any more, since too many spurious
matches occur.  Therefore, the forward-reference table is scanned a
second time, eliminating from any chain all references to runs that do
not end in the same token (actually, this is a second hash code).  For
the UNIX manuals this reduced the number of matches from 91.9% to 1.9%
(of which 0.06% was genuine).

DETERMINING THE SET OF INTERESTING RUNS

The overall structure of the routine compare() (see compare.c) is:

    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever-increasing sizes
                        try to match and keep the best

If for a given position in the new file a good run (i.e. one of at
least minimum length) has been found, the run is registered using a
call of add_run(), the run is skipped in the new file, and searching
continues at the position after it.  This prevents duplicate reports
of runs.

Add_run() (see add_run.c) allocates a struct run for the run (see
sim.h), which contains two struct chunks and a quality description.
It fills in the two chunks with the pertinent info, one for the first
file and one for the second (which may be the same, if the run relates
two chunks in the same file).  The run is then entered into the
arbitrary-in-sorted-out store aiso (see aiso.spc and aiso.bdy, a
genuine abstract data type in C!), in which it is inserted according
to its quality.  Both positions (struct position) in both chunks in
the run (so four in total) are each entered in a linked list starting
at the tx_pos field in the struct text of the appropriate file.

When this is finished, the forward-reference table can be deleted.
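The "try to match and keep the best" step can be sketched as follows.
This is a sketch under assumptions: match_len() is a hypothetical
helper, and the real compare.c additionally respects file boundaries
and run quality when measuring matches.

```c
#include <stddef.h>

/* Count how many tokens match between positions i and j (i < j).
   Hypothetical helper; compare.c also honours file boundaries. */
static size_t match_len(const unsigned char *tk_buff, size_t n_tokens,
                        size_t i, size_t j)
{
    size_t len = 0;
    while (j + len < n_tokens && tk_buff[i + len] == tk_buff[j + len])
        len++;
    return len;
}

/* For position i, walk the forward-reference chain and return the
   position of the longest match of at least min_run_size tokens
   (0 if none); *best_len receives its length. */
size_t best_match(const unsigned char *tk_buff, size_t n_tokens,
                  const size_t *forw_ref, size_t i,
                  size_t min_run_size, size_t *best_len)
{
    size_t j, best_pos = 0;

    *best_len = 0;
    for (j = forw_ref[i]; j != 0; j = forw_ref[j]) {
        size_t len = match_len(tk_buff, n_tokens, i, j);
        if (len >= min_run_size && len > *best_len) {
            *best_len = len;
            best_pos = j;
        }
    }
    return best_pos;
}
```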
So the final results of this phase are visible both through the tx_pos
fields and through the aiso interface.

DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2)

The purpose of this pass is to find, for each chunk, which up to now is
known by token position only, its starting and ending line numbers and
the fseek position of the beginning of the starting line (which need
not, and generally will not, be the fseek position of the first token
in the run).

For each file that has a non-zero tx_pos field, i.e. that has some
interesting chunks, the positions in the tx_pos list are sorted on
ascending line number (they have been found in essentially arbitrary
order) by sort_pos() in pass2.c.  Next we scan the pos list and the
file in parallel, updating the info in a position when we meet it.  A
position carries an indication of whether it is a starting or an
ending position, since slightly different calculations have to be done
in each case.

Actually, if the nl_buff[] data structure still exists, the file is
not accessed at all and the data from nl_buff[] is used instead.  This
is done transparently in stream.c.

PRINTING THE CONTENTS OF THE RUNS (PASS 3)

Since each struct run has now been completely filled in, this is
simple; the hard work is calculating the page layout.  Pass3()
accesses the aiso store and retrieves from it the runs in descending
order of importance.  Show_run() opens both files, positions them
using fseek(), and prints the runs, relying on the line count.
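The printing of one chunk can be sketched roughly as follows.  This is
an illustrative sketch, not the actual show_run(): struct chunk_info
and its field names are simplified assumptions standing in for the
data gathered in pass 2.

```c
#include <stdio.h>

/* Simplified stand-in for the information gathered in pass 2: the
   fseek position of the chunk's first line and its line span.
   These names are assumptions, not the fields in sim.h. */
struct chunk_info {
    long first_line_pos;        /* fseek position of the starting line */
    int first_line, last_line;  /* line numbers of the chunk */
};

/* Print the lines of one chunk to 'out', relying on the line count,
   as show_run() does.  Returns 0 on success, -1 on error. */
int print_chunk(const char *fname, const struct chunk_info *ch, FILE *out)
{
    FILE *fp = fopen(fname, "r");
    int line, c;

    if (fp == NULL)
        return -1;
    if (fseek(fp, ch->first_line_pos, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }
    for (line = ch->first_line; line <= ch->last_line; ) {
        if ((c = getc(fp)) == EOF)
            break;
        fputc(c, out);
        if (c == '\n')
            line++;
    }
    fclose(fp);
    return 0;
}
```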
================================================================
CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (901214)

sim:
    get command line options
    check the options
    init buffer
    init language
    pass1, read the files
        | there is an array tk_buff[] that holds all input tokens
    make forward reference table
        | there is an array forw_ref[], with one entry for each token
        | in the input; forw_ref[i] gives the token number where a
        | token sequence starts with the same hash value as the one
        | starting at i
    compare various files
    delete forward reference table
    pass2, find positions of found similarities
    pass3, print the similarities

pass1, read the files:
    for each file
        divide the text into tokens
        store all tokens except newlines in tk_buff

make forward reference table:
    | there are two independent hash functions, hash1() and hash2();
    | hash1(i) gives the hash value of the token sequence starting
    | at i, and likewise for hash2(i)
    set up the forward references using the hash table:
        | there is an array hash[], with one entry for each possible
        | hash value; hash[i] gives the position in forw_ref[]
        | where i was most recently entered as a hash value
        for each file
            for all positions in file except the last min_run_size
                set forw_ref[] and update hash[]
    use hash2() to clean out matches:
        for all tokens
            find first token in chain with same hash2 code
            short-circuit forward reference to it

compare:
    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever-increasing sizes
                        try to match and keep the best

try to match and keep the best:
    | using forw_ref[], we find a list of positions at which a
    | matching token sequence will start; scanning this list, we
    | measure the maximum length of the match and add the longest
    | match to the run collection

pass2, find positions of found runs:
    for all files
        sort the positions in the runs
        | we scan the pos list and the file in parallel
        for all positions inside this file
            if it matches a token position in a run
                record character position and line number

pass3, print the similarities:
    for all runs
        | a run consists of two chunks
        open the files that hold the chunks
        position them at the beginning of the chunk
        display the chunks
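The "use hash2() to clean out matches" step can be sketched like this.
A sketch under assumptions: hash2() is reduced here to the last token
of the run, which is what the report says the second hash code amounts
to; the real code in hash.c differs in detail.

```c
#include <stddef.h>

/* Second hash: in essence the last token of the min_run_size run
   starting at i (runs must "end in the same token"). */
static unsigned char hash2(const unsigned char *tk_buff, size_t i,
                           size_t min_run_size)
{
    return tk_buff[i + min_run_size - 1];
}

/* Short-circuit each forward reference to the first position in its
   chain with the same hash2 code, eliminating spurious matches.
   Positions are processed in ascending order, so each chain is still
   intact downstream of the position being cleaned. */
void clean_forw_ref(const unsigned char *tk_buff, size_t n_tokens,
                    size_t *forw_ref, size_t min_run_size)
{
    size_t i;

    for (i = 1; i + min_run_size <= n_tokens; i++) {
        size_t j = forw_ref[i];
        while (j != 0 && hash2(tk_buff, j, min_run_size)
                         != hash2(tk_buff, i, min_run_size))
            j = forw_ref[j];
        forw_ref[i] = j;    /* first same-hash2 position, or 0 */
    }
}
```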