Newsgroups: comp.theory.info-retrieval
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!swrinde!elroy.jpl.nasa.gov!ames!news.hawaii.edu!news
From: ameyer@ix.netcom.com (Alan Meyer)
Subject: Re: Dynamic Document Preservation
X-Nntp-Posting-Host: uhunix.uhcc.hawaii.edu
Message-ID: <3of034$91u@ixnews3.ix.netcom.com>
Lines: 96
Sender: news@news.hawaii.edu
Organization: Netcom
References: <D7MG7p.DAL@news.hawaii.edu>
Date: Sat, 6 May 1995 05:05:40 GMT
Approved: Pollarda@Uhunix.uhcc.hawaii.edu

In <D7MG7p.DAL@news.hawaii.edu> pollarda@uhunix3.uhcc.Hawaii.Edu (Art Pollard)
writes: 
>
>
>I am wondering whether anybody out there has done any work on the
>preservation of form of dynamic indexed documents that are indexed on the
>paragraph (or other suitable breakpoint) level.  This is (of course) much
>more complex than indexing a dynamic document where the whole document is
>considered to be one record.  Here are some of the issues that make this
>particularly interesting and challenging.  (And of course, I would love 
>to hear from anybody else that has experience with such matters.)
>
>1) It would be nice if under a boolean system the search results could be 
>   returned in a relative order (in relation to one another).  If new
>   document/record number is assigned to a modified paragraph, it would
>   mess up the standard sequential ordering of the index (as viewed by 
>   the user).
>
>2) Of course, you would like to scan the documents sequentially.
>
>3) You would want to be able to perform unlimited inserts and deletions 
>   at any point in the text.  (Between any two records / paragraphs.)
>
>#1 can be accomplished by a large float associated with each paragraph
>   with a relative ordering value.  (i.e., 1, 1.5, 1.7, 2, 3, ...)
>
>#2 can be accomplished by forward and next pointers.  
>
>#3 seems to point to problems with using the float solution for #1 
>   without periodic reorganization otherwise the float would overflow.
>
>The solutions for #1, & #2 would increase the index overhead by:
>
>  Forward Pointer      (4 bytes)
>  Backwards Pointer    (4 bytes)
>  Relative Order Float (8 bytes)
>                      ---------- 
>                 Total  16 bytes
>
I once worked on a project in which we were required to keep a set of documents
in a particular sorted order, even though new documents could be inserted
anywhere in the set.  This seems similar to your problem of indexing on
paragraphs, where the paragraphs must be kept in order even though a paragraph
can be inserted at any point, throwing off all subsequent paragraph numbers in
the postings list.

The solution we adopted was to assign integer identifying numbers to each
record, but with gaps between the integers.  These numbers were used as record
identifiers in the postings cells.  The gap size was chosen based on our
estimate of the volatility of the database vs. the cost of using larger gaps. 
A very large gap allowed more inserts between records, but required more bits
in the postings cells - which cost both space and retrieval speed.
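
As a rough illustration of the initial numbering (a sketch in Python, not the
code we actually used; the function name and the choice to start numbering at
one full gap are assumptions made for the example):

```python
def initial_ids(count, gap_bits):
    """Assign gapped integer ids to `count` records already in sorted order.

    Hypothetical helper: a gap of `gap_bits` bits means consecutive records
    are numbered 2**gap_bits apart, leaving the low bits free for inserts.
    """
    gap = 1 << gap_bits
    # Start at one full gap so there is also room before the first record.
    return [(i + 1) * gap for i in range(count)]
```

With gap_bits=2 the first three records are numbered 4, 8 and 12.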

A 2 bit gap (original records numbered 4 apart) allows a minimum of 2 and a
maximum of 3 inserts between any pair of adjacent original numbers.  A 3 bit
gap (records numbered 8 apart) allows a minimum of 3 and a maximum of 7
inserts.  And so on.  Whether the minimum or the maximum number of inserts is
achieved depends on the order of inserts.
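
The two bounds can be checked with a few lines of Python.  This is only a
sketch: it assumes each new record takes the midpoint of the free range it
falls in, and the function names are made up for the illustration.

```python
def worst_case_inserts(gap_bits):
    """Count inserts when every new record sorts just after the lower record.

    Assumes midpoint assignment; each insert then halves the only range
    that is still being used, so the room runs out after gap_bits inserts.
    """
    lo, hi = 0, 1 << gap_bits
    count = 0
    while hi - lo > 1:          # at least one free integer between lo and hi
        hi = (lo + hi) // 2     # new record takes the midpoint
        count += 1              # the next insert lands between lo and it
    return count


def best_case_inserts(gap_bits):
    """Every free integer strictly between the two original ids gets used."""
    return (1 << gap_bits) - 1
```

For gaps of 2, 3 and 4 bits the worst case comes out as 2, 3 and 4 inserts,
and the best case as 3, 7 and 15.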

For example, assuming a 2 bit gap, the first insert between records 4 and 8
(which start out as adjacent records) would get number 6.  The second
insert would get 5 or 7, depending upon whether the record should sort before
or after the first insert (6).  A third insert is possible only if its proper
sorted position is on the other side of the 6 from the second insert.
If not, no free number remains at that position and the insert fails.
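
The same example as a small Python sketch.  The function name is hypothetical,
and keeping the ids in a sorted in-memory list is an assumption made only to
keep the example short:

```python
import bisect


def insert_between(ids, lower, upper):
    """Give a new record the midpoint id between two adjacent existing ids.

    `ids` is a sorted list of the ids in use.  Returns the new id, or None
    when no free integer remains between `lower` and `upper`.
    """
    if upper - lower < 2:
        return None                 # the gap is used up at this position
    new_id = (lower + upper) // 2
    bisect.insort(ids, new_id)      # keep the list sorted
    return new_id


ids = [4, 8]
first = insert_between(ids, 4, 8)    # takes 6
second = insert_between(ids, 6, 8)   # sorts after the 6, takes 7
third = insert_between(ids, 7, 8)    # same side again: no room, None
other = insert_between(ids, 4, 6)    # other side of the 6: takes 5
```

Here the third insert fails because it falls on the same side of 6 as the
second, while an insert on the other side still succeeds.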

This scheme can be made to work if any of the following conditions can be
satisfied:

1.  The gap is large enough to accommodate the expected number of
inserts, or

2.  Periodic re-organizations and complete re-indexing are tolerated, or

3.  Software is written to re-organize a local area of the number space
to overcome a crunch in that one area, or

4.  Out of order inserts can be tolerated as long as they occupy only a
small percentage of the database and appear relatively close to the
proper order.
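
For condition 3, one possible shape of a local re-organization is sketched
below in Python.  This is not what we implemented; the function name, the
widening strategy and the min_gap and end_gap parameters are all assumptions
made for the illustration, and ids are assumed to be positive integers.

```python
def renumber_locally(ids, pos, min_gap=2, end_gap=4):
    """Respace ids around a crunch between ids[pos] and ids[pos + 1].

    Widens a window of records until they can be spread evenly, at least
    min_gap apart, between the untouched neighbours on either side.
    Updates `ids` in place and returns an {old_id: new_id} mapping, which
    would also have to be applied to the postings cells.
    """
    lo, hi = pos, pos + 1
    while True:
        left = ids[lo - 1] if lo > 0 else 0      # id just below the window
        at_end = hi == len(ids) - 1              # nothing above the window
        count = hi - lo + 1
        step = end_gap if at_end else (ids[hi + 1] - left) // (count + 1)
        if step >= min_gap:
            break
        if lo > 0:                               # widen and try again
            lo -= 1
        hi += 1
    new_ids = [left + (k + 1) * step for k in range(count)]
    mapping = dict(zip(ids[lo:hi + 1], new_ids))
    ids[lo:hi + 1] = new_ids
    return mapping
```

Only the records inside the window get new numbers, so the cost depends on how
crowded the neighbourhood is rather than on the size of the database.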

If the database is not too volatile, the cost of this solution can be
kept quite low.  An extra 2, 3, or 4 bits in a postings cell will
accommodate a minimum of 2, 3, or 4 insertions and a maximum of 3, 7,
or 15 insertions respectively, and so on for larger gaps.

In effect, this solution is the same as your floating point solution,
but doesn't commit to a full 32 or 64 bit representation of an
identifier, and it may make it easier to design software to translate
between record identifiers and database positions.

Hope this helps.
-- 
  Alan Meyer
  AM Systems, Inc.
  Randallstown, Maryland
  ameyer@ix.netcom.com



