Newsgroups: comp.theory.info-retrieval
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!gatech!swrinde!cs.utexas.edu!uwm.edu!lll-winken.llnl.gov!ames!news.hawaii.edu!uhunix3.uhcc.Hawaii.Edu!pollarda
From: pollarda@uhunix3.uhcc.Hawaii.Edu (Art Pollard)
Subject: Dynamic Document Preservation
X-Nntp-Posting-Host: uhunix3.uhcc.hawaii.edu
Message-ID: <D7MG7p.DAL@news.hawaii.edu>
Sender: news@news.hawaii.edu
Organization: University of Hawaii
Date: Wed, 26 Apr 1995 02:57:25 GMT
Approved:  Pollarda@Uhunix.uhcc.hawaii.edu
Lines: 51


I am wondering whether anybody out there has done any work on preserving
the form of dynamic documents that are indexed at the paragraph (or other
suitable breakpoint) level.  This is (of course) much more complex than
indexing a dynamic document where the whole document is considered to be
one record.  Here are some of the issues that make this particularly
interesting and challenging.  (And of course, I would love to hear from
anybody else who has experience with such matters.)

1) It would be nice if, under a boolean system, the search results could
   be returned in their relative order (that is, in the order in which
   they appear in the document).  If a new document/record number were
   assigned to a modified paragraph, it would mess up the standard
   sequential ordering of the index (as viewed by the user).

2) Of course, you would like to scan the documents sequentially.

3) You would want to be able to perform unlimited inserts and deletions 
   at any point in the text.  (Between any two records / paragraphs.)

#1 can be accomplished by associating a large (8-byte) float with each
   paragraph as a relative ordering value.  (e.g., 1, 1.5, 1.7, 2, 3, ...)

#2 can be accomplished by forward and backward pointers.

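#3 seems to point to problems with using the float solution for #1:
   without periodic reorganization, the float would eventually run out
   of precision, since repeated inserts at the same spot keep halving
   the gap between neighboring values.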
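
To make the concern in #3 concrete, here is a minimal sketch (in Python,
purely for illustration) of assigning the new paragraph the midpoint of
its neighbors' ordering values.  The function names are hypothetical, and
the midpoint rule is only one assumed way of picking a value between two
neighbors.

```python
def midpoint_key(lo, hi):
    """Return an order value strictly between lo and hi, or None if none fits."""
    mid = (lo + hi) / 2.0
    if lo < mid < hi:
        return mid
    return None  # no representable value left in the gap: time to reorganize


def inserts_until_exhausted(lo=1.0, hi=2.0):
    """Count how many paragraphs can be inserted, one after another,
    immediately after the paragraph whose order value is `lo`."""
    count = 0
    while True:
        mid = midpoint_key(lo, hi)
        if mid is None:
            return count
        hi = mid  # the next insert lands between lo and the new paragraph
        count += 1


def renumber(keys):
    """Periodic reorganization: reassign evenly spaced values 1.0, 2.0, ..."""
    return [float(i) for i in range(1, len(keys) + 1)]


if __name__ == "__main__":
    print(midpoint_key(1.0, 2.0))
    print(inserts_until_exhausted())
```

With 8-byte IEEE 754 floats, inserting repeatedly at the same spot between
1.0 and 2.0 works only 52 times before no value is left in the gap, so
the worst case arrives far sooner than the size of the float suggests.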

The solutions for #1 and #2 would increase the index overhead by:

  Forward Pointer      (4 bytes)
  Backwards Pointer    (4 bytes)
  Relative Order Float (8 bytes)
                      ---------- 
                 Total  16 bytes

Per paragraph / record.  In addition, all the (relevant) paragraph pointers 
(after a search) would have to be read in order to decide the ordering of 
the paragraphs returned from the search.  This could take a while if there 
were many paragraphs that satisfied the search.
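
The 16-byte entry and the post-search ordering step can be sketched as
follows.  This is an illustration only: the layout, the NIL sentinel, the
in-memory table, and all names are assumptions, not an existing index
format.

```python
import struct

# Assumed layout: backward pointer, forward pointer (4-byte record numbers)
# followed by the 8-byte relative order float.  "<" disables padding.
ENTRY = struct.Struct("<IId")
NIL = 0xFFFFFFFF  # assumed sentinel meaning "no neighbor"


def pack_entry(prev_id, next_id, order):
    """Pack one per-paragraph entry into ENTRY.size (16) bytes."""
    return ENTRY.pack(prev_id, next_id, order)


def sort_hits(hit_ids, table):
    """Put search hits into document order.

    `table` maps record number -> packed entry.  Every hit's entry has to
    be read, which is the cost mentioned above.
    """
    return sorted(hit_ids, key=lambda rid: ENTRY.unpack(table[rid])[2])


def scan(first_id, table):
    """Walk the document sequentially by following forward pointers."""
    rid = first_id
    while rid != NIL:
        yield rid
        rid = ENTRY.unpack(table[rid])[1]


if __name__ == "__main__":
    # Record 2 was inserted between records 0 and 1 after the fact.
    table = {
        0: pack_entry(NIL, 2, 1.0),
        2: pack_entry(0, 1, 1.5),
        1: pack_entry(2, NIL, 2.0),
    }
    print(ENTRY.size)
    print(sort_hits([1, 2, 0], table))
    print(list(scan(0, table)))
```

Sorting n hits this way costs n entry reads plus an O(n log n) sort, so
the record numbers themselves never need to reflect document order.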

Is there a better way to do this?  Perhaps with less overhead?  I would 
assume that a fair amount of work has been done on dynamic documents such 
as this somewhere.

Any input, pointers, references, etc. would be greatly appreciated.

Thanks,

-Art

