KeyfileIncIndex Class Reference

#include <KeyfileIncIndex.hpp>

Inherits PushIndex and Index.

Public Methods

 KeyfileIncIndex (const char *indexName=0)
 Instantiate with an existing index name, including extension.

 KeyfileIncIndex (char *prefix, int cachesize=128000000, DOCID_T startdocid=1)
 ~KeyfileIncIndex ()
 Clean up.

void setName (char *prefix)
 sets the name for this index

bool beginDoc (DocumentProps *dp)
 the beginning of a new document

bool addTerm (Term &t)
 adding a term to the current document

void endDoc (DocumentProps *dp)
 signify the end of current document

virtual void endDoc (DocumentProps *dp, const char *mgr)
 signify the end of current document

void endCollection (CollectionProps *cp)
 signify the end of this collection.

void setDocManager (const char *mgrID)
 set the document manager to use for succeeding documents

void setMesgStream (ostream *lemStream)
 set the message stream

void addKnownTerm (int termID, int position)
 update data for an already seen term

int addUnknownTerm (InvFPTerm *term)
 initialize data for a previously unseen term.

int addUncachedTerm (InvFPTerm *term)
 update data for a term that is not cached in the term cache.

Open index
bool open (const char *indexName)
 Open previously created Index with given prefix.

Spelling and index conversion
int term (const char *word)
 Convert a term spelling to a termID.

const char * term (int termID)
 Convert a termID to its spelling.

int document (const char *docIDStr)
 Convert a spelling to docID.

const char * document (int docID)
 Convert a docID to its spelling.

DocumentManager * docManager (int docID)
 The document manager for this document.

Summary counts
int docCount ()
 Total count (i.e., number) of documents in collection.

int termCountUnique ()
 Total count of unique terms in collection.

int termCount (int termID) const
 Total count of occurrences of a term in the collection.

int termCount () const
 Total count of all term occurrences in the collection.

float docLengthAvg ()
 Average document length.

int docCount (int termID)
 Total count of documents containing a given term.

int docLength (DOCID_T docID) const
 Total count of terms in a document, possibly including stop words.

virtual int totaldocLength (int docID) const
 Total count of terms in a document, always including stop words.

int docLengthCounted (int docID)
 Total count of terms in given document, not including stop words.

Index entry access
DocInfoList * docInfoList (int termID)
 doc entries in a term index,
See also:
DocList , InvFPDocList


TermInfoList * termInfoList (int docID)
 word entries in a document index (bag of words),
See also:
TermList


TermInfoList * termInfoListSeq (int docID)
 word entries in a document index (sequence of words),
See also:
TermList



Protected Methods

bool tryOpen ()
 try to open an existing index

void writeTOC ()
 write out the table of contents file.

void writeCache (bool lastRun=false)
 write out the cache

void lastWriteCache ()
 final run write out of cache

void mergeCacheSegments ()
 out-of-tree cache management: combine segments into a single segment

void writeCacheSegment ()
 write out segments

void writeDocMgrIDs ()
 write out document manager ids

int docMgrID (const char *mgr)
virtual void doendDoc (DocumentProps *dp, int mgrid)
 handle end of document token.

void openDBs ()
 open the database files

void openSegments ()
 open the segment files

void createDBs ()
 create the database files

void fullToc ()
 read in the entire table of contents

bool docMgrIDs ()
 read in the map between internal and external document manager IDs

record fetchDocumentRecord (int key) const
 retrieve a document record.

void addDocumentLookup (int documentKey, const char *documentName)
 store a document record

void addTermLookup (int termKey, const char *termSpelling)
 store a term record

void addGeneralLookup (Keyfile &numberNameIndex, Keyfile &nameNumberIndex, int number, const char *name)
 store a record

InvFPDocList * internalDocInfoList (int termID)
 retrieve and construct the DocInfoList for a term.

void _updateTermlist (InvFPDocList *curlist, int position)
 add a position to a DocInfoList

int _cacheSize ()
 total memory used by cache

void _computeMemoryBounds (int memorySize)
 cache size limits based on cachesize parameter to constructor

void _resetEstimatePoint ()
 Approximate how many updates to collect before flushing the cache.


Protected Attributes

int listlengths
 how long all the lists are

int * counts
 array to hold all the overall count stats of this db

std::vector< std::string > names
 array to hold all the names for files we need for this db

float aveDocLen
 the average document length in this index

vector< std::string > docmgrs
 list of document managers

ostream * msgstream
 Lemur code messages stream.

Keyfile invlookup
 termID -> TermData (term statistics and inverted list segment offsets)

Keyfile dIDs
 documentName -> documentID

Keyfile dSTRs
 documentID -> documentName

Keyfile tIDs
 termName -> termID

Keyfile tSTRs
 termID -> termName

File dtlookup
 document statistics (document length, etc.)

ReadBuffer * dtlookupReadBuffer
 read buffer for dtlookup

File writetlist
 filestream for writing the list of located terms

char termKey [MAX_TERM_LENGTH]
 buffers for term() lookup functions

char docKey [MAX_DOCID_LENGTH]
 buffers for document() lookup functions

int _listsSize
 memory for use by inverted list buffers

int _memorySize
 upper bound for memory use

std::string name
 the prefix name

vector< InvFPDocList * > invertlists
 array of pointers to doclists

vector< LocatedTerm > termlist
 list of terms and their locations in this document

int curdocmgr
 the current docmanager to use

vector< DocumentManager * > docMgrs
 list of document manager objects

TermCache _cache
 cache of term entries

std::vector< File * > _segments
 out-of-tree segments for data

int _largestFlushedTermID
 highest term id flushed to disk.

int _estimatePoint
 invertlists point where we should next check on the cache size


Detailed Description

KeyfileIncIndex builds an index by assigning term IDs and document IDs, tracking the locations of terms within documents, and tracking which terms occur in each document. It expects each DocumentProps to carry the total number of terms that were in the document, and it expects all stopping and stemming (if any) to occur before a term is passed in. If used with an existing index, new documents are added incrementally. Records are stored in keyfile B-trees. KeyfileIncIndex also provides the Index API for using the index.
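
The call sequence implied by this description (beginDoc, addTerm for each token, endDoc) can be illustrated with a toy in-memory stand-in. This is a minimal sketch, not Lemur code: `ToyPushIndex`, `buildExample`, the 1-based IDs, and the use of 0 for an unknown term are all assumptions made for illustration, and plain strings replace the real DocumentProps and Term types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; this is NOT the Lemur class.
// It mimics the push protocol: beginDoc, addTerm for each token, endDoc.
class ToyPushIndex {
public:
  // Open a new document. Fails if the previous one was never closed.
  bool beginDoc(const std::string& docName) {
    if (open_) return false;
    open_ = true;
    position_ = 0;
    docNames_.push_back(docName);  // toy docIDs are 1-based
    return true;
  }

  // Add one term (assumed already stopped and stemmed) to the open document.
  bool addTerm(const std::string& spelling) {
    if (!open_) return false;
    int& id = termIDs_[spelling];  // inserts 0 for an unseen spelling
    if (id == 0) id = static_cast<int>(termIDs_.size());  // next free termID
    const int docID = static_cast<int>(docNames_.size());
    postings_[id].push_back(std::make_pair(docID, position_++));
    return true;
  }

  // Close the current document.
  void endDoc() {
    assert(open_);
    open_ = false;
  }

  int docCount() const { return static_cast<int>(docNames_.size()); }
  int termCountUnique() const { return static_cast<int>(termIDs_.size()); }

  // Spelling -> termID; 0 means "unknown" in this toy.
  int term(const std::string& spelling) const {
    auto it = termIDs_.find(spelling);
    return it == termIDs_.end() ? 0 : it->second;
  }

  // Number of occurrences of a term across all documents.
  int termCount(int termID) const {
    auto it = postings_.find(termID);
    return it == postings_.end() ? 0 : static_cast<int>(it->second.size());
  }

private:
  bool open_ = false;
  int position_ = 0;
  std::vector<std::string> docNames_;
  std::map<std::string, int> termIDs_;
  // termID -> list of (docID, position) entries
  std::map<int, std::vector<std::pair<int, int> > > postings_;
};

// Index two tiny documents using the push protocol.
ToyPushIndex buildExample() {
  ToyPushIndex index;
  index.beginDoc("d1");
  index.addTerm("index");
  index.addTerm("term");
  index.addTerm("index");
  index.endDoc();
  index.beginDoc("d2");
  index.addTerm("term");
  index.endDoc();
  return index;
}
```

The toy keeps everything in memory; the real class stores its records in keyfiles and flushes a term cache, which this sketch does not attempt to model.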


Constructor & Destructor Documentation

KeyfileIncIndex::KeyfileIncIndex (const char * indexName = 0)
 

Instantiate with an existing index name, including extension.

KeyfileIncIndex::KeyfileIncIndex (char * prefix, int cachesize = 128000000, DOCID_T startdocid = 1)
 

Instantiate with index name without extension. Optionally pass in cachesize and starting document id number.

KeyfileIncIndex::~KeyfileIncIndex ()
 

Clean up.


Member Function Documentation

int KeyfileIncIndex::_cacheSize () [protected]
 

total memory used by cache

void KeyfileIncIndex::_computeMemoryBounds (int memorySize) [protected]
 

cache size limits based on cachesize parameter to constructor

void KeyfileIncIndex::_resetEstimatePoint () [protected]
 

Approximate how many updates to collect before flushing the cache.

void KeyfileIncIndex::_updateTermlist (InvFPDocList * curlist, int position) [protected]
 

add a position to a DocInfoList

void KeyfileIncIndex::addDocumentLookup (int documentKey, const char * documentName) [protected]
 

store a document record

void KeyfileIncIndex::addGeneralLookup (Keyfile & numberNameIndex, Keyfile & nameNumberIndex, int number, const char * name) [protected]
 

store a record
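
The paired number-to-name and name-to-number arguments suggest that each record is stored in both directions, so that either lookup is a single probe. The following is a minimal sketch of that idea only: `std::map` stands in for the Keyfile B-tree, and `addLookup`, `lookupNumber`, `lookupName`, and the "0 or empty string means unknown" convention are assumptions, not the Lemur implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the two keyfiles; std::map replaces the B-tree.
typedef std::map<int, std::string> NumberToName;
typedef std::map<std::string, int> NameToNumber;

// Store one record in both directions.
void addLookup(NumberToName& numberName, NameToNumber& nameNumber,
               int number, const std::string& name) {
  assert(number > 0);  // this sketch reserves 0 for "unknown"
  numberName[number] = name;
  nameNumber[name] = number;
}

// Name -> number; returns 0 when the name is unknown (sketch convention).
int lookupNumber(const NameToNumber& nameNumber, const std::string& name) {
  NameToNumber::const_iterator it = nameNumber.find(name);
  return it == nameNumber.end() ? 0 : it->second;
}

// Number -> name; returns an empty string when the number is unknown.
std::string lookupName(const NumberToName& numberName, int number) {
  NumberToName::const_iterator it = numberName.find(number);
  return it == numberName.end() ? std::string() : it->second;
}
```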

void KeyfileIncIndex::addKnownTerm (int termID, int position)
 

update data for an already seen term

bool KeyfileIncIndex::addTerm (Term & t) [virtual]
 

adding a term to the current document

Implements PushIndex.

void KeyfileIncIndex::addTermLookup (int termKey, const char * termSpelling) [protected]
 

store a term record

int KeyfileIncIndex::addUncachedTerm (InvFPTerm * term)
 

update data for a term that is not cached in the term cache.

int KeyfileIncIndex::addUnknownTerm (InvFPTerm * term)
 

initialize data for a previously unseen term.

bool KeyfileIncIndex::beginDoc (DocumentProps * dp) [virtual]
 

the beginning of a new document

Implements PushIndex.

void KeyfileIncIndex::createDBs () [protected]
 

create the database files

int KeyfileIncIndex::docCount (int termID) [virtual]


Total count of documents containing a given term.

Implements Index.
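
The two per-term counts are easy to confuse: termCount(termID) counts every occurrence of the term, while docCount(termID) counts each document once. A small sketch of the distinction, assuming a hypothetical layout in which a term's occurrences are given as one docID per occurrence (the function names are illustrative and not part of the API):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical layout: one docID per occurrence of a single term.

// Collection frequency: every occurrence counts.
int collectionFrequency(const std::vector<int>& occurrenceDocIDs) {
  return static_cast<int>(occurrenceDocIDs.size());
}

// Document frequency: each document counts once, however often
// the term occurs in it.
int documentFrequency(const std::vector<int>& occurrenceDocIDs) {
  std::set<int> distinct(occurrenceDocIDs.begin(), occurrenceDocIDs.end());
  return static_cast<int>(distinct.size());
}

// Usage: a term occurring twice in doc 1, once in doc 2, three times in doc 5.
void example() {
  const std::vector<int> occurrences = {1, 1, 2, 5, 5, 5};
  assert(collectionFrequency(occurrences) == 6);
  assert(documentFrequency(occurrences) == 3);
}
```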

int KeyfileIncIndex::docCount () [inline, virtual]
 

Total count (i.e., number) of documents in collection.

Implements Index.

DocInfoList * KeyfileIncIndex::docInfoList (int termID) [virtual]
 

doc entries in a term index,

See also:
DocList , InvFPDocList

Implements Index.

int KeyfileIncIndex::docLength (DOCID_T docID) const


Total count of terms in a document, possibly including stop words.

float KeyfileIncIndex::docLengthAvg () [virtual]
 

Average document length.

Implements Index.

int KeyfileIncIndex::docLengthCounted (int docID)
 

Total count of terms in given document, not including stop words.

DocumentManager * KeyfileIncIndex::docManager (int docID) [virtual]
 

The document manager for this document.

Reimplemented from Index.

int KeyfileIncIndex::docMgrID (const char * mgr) [protected]


Returns the internal ID of the given document manager. If the manager is not already registered, it is added.
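
The find-or-register behaviour described here can be sketched over a plain vector of manager names. This is an illustration under assumptions: `findOrRegister` is a hypothetical helper, and using the 0-based vector position as the internal ID is a choice made for the sketch, not something this page states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the position of `mgr` in `managers`,
// appending it first if it has not been registered yet.
int findOrRegister(std::vector<std::string>& managers, const std::string& mgr) {
  for (std::size_t i = 0; i < managers.size(); ++i) {
    if (managers[i] == mgr) return static_cast<int>(i);  // already known
  }
  managers.push_back(mgr);                               // register new manager
  return static_cast<int>(managers.size() - 1);
}

// Usage: looking up the same name twice yields the same ID.
void example() {
  std::vector<std::string> managers;
  const int first = findOrRegister(managers, "mgrA");
  assert(findOrRegister(managers, "mgrA") == first);
  assert(managers.size() == 1);
}
```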

bool KeyfileIncIndex::docMgrIDs () [protected]


read in the map between internal and external document manager IDs

const char * KeyfileIncIndex::document (int docID) [virtual]
 

Convert a docID to its spelling.

Implements Index.

int KeyfileIncIndex::document (const char * docIDStr) [virtual]
 

Convert a spelling to docID.

Implements Index.

void KeyfileIncIndex::doendDoc (DocumentProps * dp, int mgrid) [protected, virtual]
 

handle end of document token.

void KeyfileIncIndex::endCollection (CollectionProps * cp) [virtual]
 

signify the end of this collection.

Implements PushIndex.

void KeyfileIncIndex::endDoc (DocumentProps * dp, const char * mgr) [virtual]
 

signify the end of current document

void KeyfileIncIndex::endDoc (DocumentProps * dp) [virtual]
 

signify the end of current document

Implements PushIndex.

KeyfileIncIndex::record KeyfileIncIndex::fetchDocumentRecord (int key) const [protected]
 

retrieve a document record.

void KeyfileIncIndex::fullToc () [protected]


read in the entire table of contents

InvFPDocList * KeyfileIncIndex::internalDocInfoList (int termID) [protected]
 

retrieve and construct the DocInfoList for a term.

void KeyfileIncIndex::lastWriteCache () [protected]
 

final run write out of cache

void KeyfileIncIndex::mergeCacheSegments () [protected]


out-of-tree cache management: combine segments into a single segment
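
Combining segments can be pictured as a merge keyed by term ID. The sketch below illustrates the general technique only and makes no claim about the Lemur segment format: `Segment` and `mergeSegments` are hypothetical, each segment is assumed to hold (termID, docID list) entries, and an in-memory `std::map` stands in for whatever streaming merge a disk-based implementation would use.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical segment: (termID, docIDs) entries, sorted by termID.
typedef std::vector<std::pair<int, std::vector<int> > > Segment;

// Merge several segments into one. Entries for the same termID are
// concatenated in segment order, so earlier segments come first.
Segment mergeSegments(const std::vector<Segment>& segments) {
  std::map<int, std::vector<int> > merged;  // keeps termIDs ordered
  for (const Segment& segment : segments) {
    for (const auto& entry : segment) {
      std::vector<int>& out = merged[entry.first];
      out.insert(out.end(), entry.second.begin(), entry.second.end());
    }
  }
  return Segment(merged.begin(), merged.end());
}

// Usage: term 1 appears in both segments and ends up with one combined list.
void example() {
  const Segment older = {{1, {1, 2}}, {3, {2}}};
  const Segment newer = {{1, {5}}, {2, {4}}};
  const Segment all = mergeSegments({older, newer});
  assert(all.size() == 3);
  assert(all[0].second.size() == 3);
}
```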

bool KeyfileIncIndex::open (const char * indexName) [virtual]
 

Open previously created Index with given prefix.

Implements Index.

void KeyfileIncIndex::openDBs () [protected]
 

open the database files

void KeyfileIncIndex::openSegments () [protected]
 

open the segment files

void KeyfileIncIndex::setDocManager (const char * mgrID) [virtual]
 

set the document manager to use for succeeding documents

Implements PushIndex.

void KeyfileIncIndex::setMesgStream (ostream * lemStream)


set the message stream

void KeyfileIncIndex::setName (char * prefix)
 

sets the name for this index

const char * KeyfileIncIndex::term (int termID) [virtual]
 

Convert a termID to its spelling.

Implements Index.

int KeyfileIncIndex::term (const char * word) [virtual]
 

Convert a term spelling to a termID.

Implements Index.

int KeyfileIncIndex::termCount () const [inline, virtual]


Total count of all term occurrences in the collection.

Implements Index.

int KeyfileIncIndex::termCount (int termID) const [virtual]


Total count of occurrences of a term in the collection.

Implements Index.

int KeyfileIncIndex::termCountUnique () [inline, virtual]
 

Total count of unique terms in collection.

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoList (int docID) [virtual]
 

word entries in a document index (bag of words),

See also:
TermList

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoListSeq (int docID)
 

word entries in a document index (sequence of words),

See also:
TermList
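
The difference between the bag-of-words view (termInfoList) and the sequence view (termInfoListSeq) can be shown on a document given as a list of term IDs. This is a sketch under assumptions: `sequenceView` and `bagView` are hypothetical helpers, positions are taken to start at 0, and the real methods return TermInfoList objects rather than standard containers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Sequence view: one (position, termID) pair per token, in document order.
std::vector<std::pair<int, int> > sequenceView(const std::vector<int>& tokens) {
  std::vector<std::pair<int, int> > out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    out.push_back(std::make_pair(static_cast<int>(i), tokens[i]));
  }
  return out;
}

// Bag-of-words view: termID -> number of occurrences; order is discarded.
std::map<int, int> bagView(const std::vector<int>& tokens) {
  std::map<int, int> counts;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    ++counts[tokens[i]];
  }
  return counts;
}

// Usage: term 3 occurs twice, so the bag has two entries but the
// sequence keeps all three tokens.
void example() {
  const std::vector<int> tokens = {3, 1, 3};
  assert(sequenceView(tokens).size() == 3);
  assert(bagView(tokens).size() == 2);
}
```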

int KeyfileIncIndex::totaldocLength (int docID) const [virtual]


Total count of terms in a document, always including stop words.

bool KeyfileIncIndex::tryOpen () [protected]
 

try to open an existing index

void KeyfileIncIndex::writeCache (bool lastRun = false) [protected]
 

write out the cache

void KeyfileIncIndex::writeCacheSegment () [protected]
 

write out segments

void KeyfileIncIndex::writeDocMgrIDs () [protected]
 

write out document manager ids

void KeyfileIncIndex::writeTOC () [protected]
 

write out the table of contents file.


Member Data Documentation

TermCache KeyfileIncIndex::_cache [protected]
 

cache of term entries

int KeyfileIncIndex::_estimatePoint [protected]
 

invertlists point where we should next check on the cache size

int KeyfileIncIndex::_largestFlushedTermID [protected]
 

highest term id flushed to disk.

int KeyfileIncIndex::_listsSize [protected]
 

memory for use by inverted list buffers

int KeyfileIncIndex::_memorySize [protected]
 

upper bound for memory use

std::vector<File*> KeyfileIncIndex::_segments [protected]
 

out-of-tree segments for data

float KeyfileIncIndex::aveDocLen [protected]
 

the average document length in this index

int* KeyfileIncIndex::counts [protected]
 

array to hold all the overall count stats of this db

int KeyfileIncIndex::curdocmgr [protected]
 

the current docmanager to use

Keyfile KeyfileIncIndex::dIDs [protected]
 

documentName -> documentID

char KeyfileIncIndex::docKey[MAX_DOCID_LENGTH] [protected]
 

buffers for document() lookup functions

vector<DocumentManager*> KeyfileIncIndex::docMgrs [protected]
 

list of document manager objects

vector<std::string> KeyfileIncIndex::docmgrs [protected]
 

list of document managers

Keyfile KeyfileIncIndex::dSTRs [protected]
 

documentID -> documentName

File KeyfileIncIndex::dtlookup [protected]
 

document statistics (document length, etc.)

ReadBuffer* KeyfileIncIndex::dtlookupReadBuffer [protected]
 

read buffer for dtlookup

vector<InvFPDocList*> KeyfileIncIndex::invertlists [protected]
 

array of pointers to doclists

Keyfile KeyfileIncIndex::invlookup [protected]
 

termID -> TermData (term statistics and inverted list segment offsets)

int KeyfileIncIndex::listlengths [protected]
 

how long all the lists are

ostream* KeyfileIncIndex::msgstream [protected]
 

Lemur code messages stream.

std::string KeyfileIncIndex::name [protected]
 

the prefix name

std::vector<std::string> KeyfileIncIndex::names [protected]
 

array to hold all the names for files we need for this db

char KeyfileIncIndex::termKey[MAX_TERM_LENGTH] [protected]
 

buffers for term() lookup functions

vector<LocatedTerm> KeyfileIncIndex::termlist [protected]
 

list of terms and their locations in this document

Keyfile KeyfileIncIndex::tIDs [protected]
 

termName -> termID

Keyfile KeyfileIncIndex::tSTRs [protected]
 

termID -> termName

File KeyfileIncIndex::writetlist [protected]
 

filestream for writing the list of located terms


The documentation for this class was generated from the following files:
Generated on Fri Feb 6 07:12:03 2004 for LEMUR by doxygen 1.2.16