
Some simple notes about how it all works...
@(#) $Revision: 1.1 $

(1) Overview

lq-Text keeps a number of separate files.

Files ending in .dir or .pag are ndbm files.  I have used ndbm for the
index because it's so fast at accessing things.
** If you compiled with bsdhash, the files won't end with .dir or .pag

$(LQTEXTDIR)/wordlist.{dir,pag}
    Every distinct word has an entry containing the word number.
    In order to reduce space, sWriteNumber is used (this is described below,
    but basically writes a variable number of bytes depending on the
    size of the number).
    This number is an index into:

$(LQTEXTDIR)/WIDIndex
    This has -- for each distinct word -- a block of WIDBLOCKSIZE bytes
    which contains:
	*The length of the word (1 byte)
	*The word itself (no trailing \0, hence Min...MaxWordLength bytes)
	Offset into the physical database "$(LQTEXTDIR)/data", written
	    in one to five bytes as necessary
	    [actually the format is more like
	     {FID No_of_times { WID No_of_times { block word flags dist }* }* }*
	     see the comments in liblqtext/{r,w}pblock.c for details]
	Number of occurrences all told (again 1-5 bytes)
	*At least one byte, and generally more, of data (see under "data")
	[* - these are doomed, may already be gone]
    Given a word, one can find the WID with Word2WID().
    Given a WID, one can find the corresponding string with WID2Word().
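    A hedged sketch of unpacking one such WIDBLOCKSIZE record follows:
    a length byte, the word itself (no trailing \0), then two
    variable-byte numbers (the offset into the data file and the total
    occurrence count).  The struct and function names are illustrative,
    not liblqtext's own, and the trailing in-block data bytes are
    ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXWORDLENGTH 64   /* assumption; see Min...MaxWordLength above */

typedef struct {
    char Word[MAXWORDLENGTH + 1];
    unsigned long DataOffset;          /* offset into $(LQTEXTDIR)/data */
    unsigned long NumberOfOccurrences;
} t_WordRecord;

/* Decode one variable-byte number (low 7 bits per byte, top bit set
 * while more bytes follow); byte order within the number is a guess. */
static unsigned long ReadNumber(const unsigned char **p)
{
    unsigned long n = 0;
    int shift = 0;
    unsigned char b;

    do {
        b = *(*p)++;
        n |= (unsigned long)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    return n;
}

static void UnpackWordRecord(const unsigned char *block, t_WordRecord *r)
{
    const unsigned char *p = block;
    size_t len = *p++;                 /* 1 byte: length of the word */

    memcpy(r->Word, p, len);
    r->Word[len] = '\0';
    p += len;
    r->DataOffset = ReadNumber(&p);
    r->NumberOfOccurrences = ReadNumber(&p);
}
```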

$(LQTEXTDIR)/FileList.{dir,pag}
    There are two entries here for every file that lq-Text has seen.
    (a) to allow one to get a filename from a FID
    (b) to get a (t_FileInfo *) from a filename; this contains
	FID (like a WID...)
	Pathname
	Date last indexed (seconds since Jan 1st 1970)
	Filter type (see below)
	There is room in the returned structure for a (FILE *), which
	can be used if the file is opened.
    When support for file structure (SGML) is added, a FIDIndex file
    will appear just like WIDIndex, and this file will shrink accordingly.

$(LQTEXTDIR)/data
    This is the biggie, folks!
    The data file is divided into blocks of BLOCKSIZE bytes -- currently
    64 bytes each... subject to change.
    There's an average of BLOCKSIZE/2 bytes wasted for every word... 
    Each block is part of some linked list or other...and has a NextOffset
    field.  This is a block offset, multiplied by two to leave a single
    bit free at the bottom.
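    That shift-by-one trick looks like this (a sketch only; what
    lq-text actually keeps in the spare bottom bit isn't stated here,
    so it's treated as an opaque flag):

```c
#include <assert.h>

/* NextOffset = block offset * 2, leaving the bottom bit free. */
static unsigned long MakeNextOffset(unsigned long block, int flag)
{
    return (block << 1) | (flag ? 1UL : 0UL);
}

static unsigned long NextBlock(unsigned long NextOffset)
{
    return NextOffset >> 1;          /* recover the block offset */
}

static int NextFlag(unsigned long NextOffset)
{
    return (int)(NextOffset & 1UL);  /* recover the spare bit */
}
```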

    Block 0 is unused.

    Block 1 is the first data block.

    Data blocks contain:
	a pointer to the next block in this Chain (0 if none) (4 bytes)
	and then the actual matches.

    The first block in each chain is pointed to by a WordInfo entry
    stored in the Wordlist (see above), and the first few bytes of data
    are also there.  In some cases, all of the data for a word fits in the
    WIDIndex file, saving more space.

    The layout of the data is described in detail in comments in pblock.c;
    here is a summary:
    For each word in the database we have to keep a list of
	every occurrence of that word in every file.
    Hence the data is the list of occurrences.  In memory, each occurrence
    is held in a t_WordPlace struct, which has
	the file number
	the block in the file
	the word within the file
	the distance since the last word (bytes)
	a byte of flags.
    In order to write these economically, we store
    + A FID * 2 (1..5 bytes), with the bottom bit set if there is more
      than one match of this word in this file
    + The number of times that FID occurs (if appropriate)
    + For each of those occurrences:
      . the block number, stored as the difference since the last block
	(reset to zero for each file, of course) (1..5 bytes)
      . the word in the block, plus 1 bit if there are flags (1 byte)
      . flags (1 byte)
      . the distance from the last word, if the appropriate flag bit is
	set -- otherwise it's assumed to be 1 byte away (1 byte).
    Hence some typical sizes of entries are:
		max...	small...	min...
	FID	 5	1	5 -- shared among up to 255
	n	 -	1	1 -- entries
	block	 5	1	1
	word	 1	1	1
	flags	 1	0	0
	distance 1	0	0
	========================================
		13	4	2 + share of 2..6
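    The packing scheme above can be sketched as follows.  This is a
    toy, not lqtext's wpblock.c: the bit positions (which bit of the
    word byte means "flags follow", which flag bit means "a distance
    byte follows") are my assumptions, and WriteNumber is a stand-in
    for the real variable-byte writer.

```c
#include <assert.h>
#include <stddef.h>

#define WORD_HASFLAGS 0x80   /* assumed: top bit of the word byte     */
#define FLAG_HASDIST  0x01   /* assumed: a distance byte follows      */

typedef struct {
    unsigned long BlockInFile;
    unsigned char WordInBlock;   /* assumed < 128, leaving a spare bit */
    unsigned char Flags;
    unsigned char Distance;
} t_WordPlace;

/* Variable-byte number: 7 bits per byte, top bit set while more follow. */
static size_t WriteNumber(unsigned long n, unsigned char *buf)
{
    size_t i = 0;

    while (n >= 0x80) {
        buf[i++] = (unsigned char)(0x80 | (n & 0x7F));
        n >>= 7;
    }
    buf[i++] = (unsigned char)n;
    return i;
}

/* Pack one FID's occurrences of a word; returns the bytes used. */
static size_t PackFileEntry(unsigned long FID,
                            const t_WordPlace *places, size_t n,
                            unsigned char *out)
{
    size_t len = 0, i;
    unsigned long LastBlock = 0;

    /* FID * 2, bottom bit set if there is more than one match */
    len += WriteNumber((FID << 1) | (n > 1), out + len);
    if (n > 1)
        len += WriteNumber(n, out + len);    /* the repeat count */

    for (i = 0; i < n; i++) {
        /* block stored as the difference since the last block */
        len += WriteNumber(places[i].BlockInFile - LastBlock, out + len);
        LastBlock = places[i].BlockInFile;

        if (places[i].Flags) {
            out[len++] = places[i].WordInBlock | WORD_HASFLAGS;
            out[len++] = places[i].Flags;
            if (places[i].Flags & FLAG_HASDIST)
                out[len++] = places[i].Distance;
        } else {
            out[len++] = places[i].WordInBlock;
        }
    }
    return len;
}
```

    Two occurrences with small deltas pack into eight bytes here, which
    matches the "small" column of the table above.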

    It's important that it doesn't take too long to unpack all of
    these things, of course.
    There are _literally_ millions of these entries in even a few megabytes'
    worth of database, so the compression turns out to be good value.

    The compressed numbers are written a byte at a time, with the top bit
    set if there is another byte to follow.  Hence, a small number will
    take up only one byte.  Most numbers are small.
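    A minimal sketch of that encoding (the real writer is sWriteNumber;
    these names, and the byte order within a number, are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Encode n into buf; returns the number of bytes written (1..5 for
 * 32-bit values).  Each byte carries 7 bits of the number, with the
 * top bit set if another byte follows. */
static size_t EncodeNumber(unsigned long n, unsigned char *buf)
{
    size_t i = 0;

    while (n >= 0x80) {
        buf[i++] = (unsigned char)(0x80 | (n & 0x7F));
        n >>= 7;
    }
    buf[i++] = (unsigned char)n;     /* top bit clear: last byte */
    return i;
}

static unsigned long DecodeNumber(const unsigned char *buf, size_t *used)
{
    unsigned long n = 0;
    int shift = 0;
    size_t i = 0;
    unsigned char b;

    do {
        b = buf[i++];
        n |= (unsigned long)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *used = i;
    return n;
}
```

    So 5 takes one byte, 300 takes two, and a full 32-bit value takes
    five -- and since most numbers in the index are small, most take
    one byte.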

$(LQTEXTDIR)/{F,W}idFile
    These contain the list of available FIDs and WIDs.
    These are only written out every 500 or so changes, to reduce the
    number of open...lockf...read...seek...write...unlock...close
    sequences required!

    The data is stored in ASCII.

    TODO: use the first block of "data" for this -- not only less clutter,
    but also more useful, as this is currently wasted space apart from the
    first 4 bytes marking the start of the free list.

Algorithms
==========

1.  Adding files (addfile)
    
* check that file is not already indexed (this uses stat(file) and
  FileInfo->DateLastIndexed to check for changes)
* determine file type
* run the appropriate filter, if any
* read the output
    + for each indexable word,
      . add the word and its offset to a symbol table
* at the end, or when the symbol table fills,
    + for each word we've seen
      . get the old entry
      . add all new entries
      . sort the resulting list
      . write the results back to file
      Note: this could be sped up if only the changes were written,
	    as in the common case of adding a new file, the new entries
	    will all have the new MaxFID, and hence will sort to the
	    end of the list!  For a long list, this could save a lot
	    of work.
      . delete the word from the symbol table, to save space
	(not at the very end, of course!)
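    The update step can be sketched like this.  t_Occurrence is a
    stand-in for the real per-word entry (the real code works on
    packed pblocks, not structs):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long FID;
    unsigned long Offset;
} t_Occurrence;

static int CompareOccurrences(const void *a, const void *b)
{
    const t_Occurrence *x = a, *y = b;

    if (x->FID != y->FID)
        return x->FID < y->FID ? -1 : 1;
    if (x->Offset != y->Offset)
        return x->Offset < y->Offset ? -1 : 1;
    return 0;
}

/* Append New to Old and re-sort; returns a freshly malloc'd array of
 * *ResultCount entries, which the caller frees. */
static t_Occurrence *MergeEntries(const t_Occurrence *Old, size_t nOld,
                                  const t_Occurrence *New, size_t nNew,
                                  size_t *ResultCount)
{
    t_Occurrence *r = malloc((nOld + nNew) * sizeof *r);

    if (!r)
        return NULL;
    memcpy(r, Old, nOld * sizeof *r);
    memcpy(r + nOld, New, nNew * sizeof *r);
    qsort(r, nOld + nNew, sizeof *r, CompareOccurrences);
    *ResultCount = nOld + nNew;
    return r;
}
```

    The speed-up mentioned in the Note falls out of this: when every
    new entry carries the new MaxFID, the sorted result is just the old
    list with the new entries appended, so the qsort could be skipped.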

2. Retrieving Words (wordinfo, wordindex.c)

* for each word
    . find the WordInfo entry from the database (WID2WordInfo(Word2WID()))
    . get the pblock [see below] with Getpblock() (pblock.c)
    . for each pair, get the FileInfo and print the name and offset.
      Could be cleverer by caching FileInfo entries.


3. Retrieving a Phrase (tryphrase.c, Phrase.c)
* For each phrase
    . Make a new string, containing only the words we index.
    . Build up a t_Phrase structure, containing a WordInfo and a pblock
      for each indexed word.  If a word isn't in the index, clearly the
      phrase won't be.
    . For each FID/Offset pair in the first word's pblock
      + for each word in the phrase
	. see if the word has a pair with a matching FID, and an Offset
	  reasonably close;
	  no --> go on to the next pair
	  yes --> maybe a partial match
      In other words, we grow a list of matches until they all either
      reach the end of the phrase or come to a sticky end, wither and die
      (like my houseplants!)
  There are oodles and oodles of comments in Phrase.c which might help.
  The simplest example of using String2Phrase() and MakeMatches() is in
  lqphrase.c, although that part of lqtext.c is quite simple too.
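  For a two-word phrase, the inner test above amounts to something
  like this sketch.  "Reasonably close" is a parameter here; the real
  test in Phrase.c also uses the flags and distance information, and
  the names below are mine, not its:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long FID;
    unsigned long WordInFile;
} t_Match;

/* Does the second word occur in the same file as this occurrence of
 * the first word, within MaxGap words after it? */
static int PhraseMatchAt(const t_Match *first,
                         const t_Match *second, size_t nSecond,
                         unsigned long MaxGap)
{
    size_t i;

    for (i = 0; i < nSecond; i++) {
        if (second[i].FID == first->FID &&
            second[i].WordInFile > first->WordInFile &&
            second[i].WordInFile - first->WordInFile <= MaxGap)
            return 1;   /* maybe a partial match */
    }
    return 0;           /* go on to the next pair */
}
```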
