
KeyfileIncIndex Class Reference

#include <KeyfileIncIndex.hpp>

Inherits PushIndex and Index.

Public Methods

 KeyfileIncIndex (const string &prefix, int cachesize=128000000, DOCID_T startdocid=1)
 KeyfileIncIndex ()
 Create a new, empty index for the index manager to use.

 ~KeyfileIncIndex ()
 Clean up.

void setName (const string &prefix)
 sets the name for this index

bool beginDoc (const DocumentProps *dp)
 the beginning of a new document

bool addTerm (const Term &t)
 adding a term to the current document

void endDoc (const DocumentProps *dp)
 signify the end of current document

virtual void endDoc (const DocumentProps *dp, const string &mgr)
 signify the end of current document

void endCollection (const CollectionProps *cp)
 signify the end of this collection.

void setDocManager (const string &mgrID)
 set the document manager to use for succeeding documents

void setMesgStream (ostream *lemStream)
 set the message stream

void addKnownTerm (int termID, int position)
 update data for an already seen term

int addUnknownTerm (const InvFPTerm *term)
 initialize data for a previously unseen term.

int addUncachedTerm (const InvFPTerm *term)
 update data for a term that is not cached in the term cache.

Open index
bool open (const string &indexName)
 Open previously created Index with given prefix.

Spelling and index conversion
int term (const string &word) const
 Convert a term spelling to a termID.

const string term (int termID) const
 Convert a termID to its spelling.

int document (const string &docIDStr) const
 Convert a spelling to docID.

const string document (int docID) const
 Convert a docID to its spelling.

const DocumentManager * docManager (int docID) const
 The document manager for this document.

Summary counts
int docCount () const
 Total count (i.e., number) of documents in collection.

int termCountUnique () const
 Total count of unique terms in collection.

int termCount (int termID) const
 Total count of occurrences of a term in the collection.

int termCount () const
 Total count of all term occurrences in the collection.

float docLengthAvg () const
 Average document length.

int docCount (int termID) const
 Total count of documents that contain a given term.

int docLength (DOCID_T docID) const
 Total count of terms in a document, possibly including stop words.

virtual int totaldocLength (int docID) const
 Total count of terms in a document, always including stop words.

int docLengthCounted (int docID) const
 Total count of terms in given document, not including stop words.

Index entry access
DocInfoList * docInfoList (int termID) const
 doc entries in a term index.
See also:
DocList, InvFPDocList


TermInfoList * termInfoList (int docID) const
 word entries in a document index (bag of words).
See also:
TermList


TermInfoList * termInfoListSeq (int docID) const
 word entries in a document index (sequence of words).
See also:
TermList



Protected Methods

bool tryOpen ()
 try to open an existing index

void writeTOC ()
 write out the table of contents file.

void writeCache (bool lastRun=false)
 write out the cache

void lastWriteCache ()
 write out the cache on the final run

void mergeCacheSegments ()
 out-of-tree cache management: combine segments into a single segment

void writeCacheSegment ()
 write out segments

void writeDocMgrIDs ()
 write out document manager ids

int docMgrID (const string &mgr)
 returns the internal id of the given docmgr; if not already registered, mgr will be added

virtual void doendDoc (const DocumentProps *dp, int mgrid)
 handle end of document token.

void openDBs ()
 open the database files

void openSegments ()
 open the segment files

void createDBs ()
 create the database files

void fullToc ()
 read in the entire table of contents

bool docMgrIDs ()
 read in document manager internal and external ids map

record fetchDocumentRecord (int key) const
 retrieve a document record.

void addDocumentLookup (int documentKey, const char *documentName)
 store a document record

void addTermLookup (int termKey, const char *termSpelling)
 store a term record

void addGeneralLookup (Keyfile &numberNameIndex, Keyfile &nameNumberIndex, int number, const char *name)
 store a record

InvFPDocList * internalDocInfoList (int termID) const
 retrieve and construct the DocInfoList for a term.

void _updateTermlist (InvFPDocList *curlist, int position)
 add a position to a DocInfoList

int _cacheSize ()
 total memory used by cache

void _computeMemoryBounds (int memorySize)
 cache size limits based on cachesize parameter to constructor

void _resetEstimatePoint ()
 Approximate how many updates to collect before flushing the cache.


Protected Attributes

int listlengths
 how long all the lists are

int * counts
 array to hold all the overall count stats of this db

std::vector< std::string > names
 array to hold all the names for files we need for this db

float aveDocLen
 the average document length in this index

vector< std::string > docmgrs
 list of document managers

ostream * msgstream
 Lemur code messages stream.

Keyfile invlookup
 termID -> TermData (term statistics and inverted list segment offsets)

Keyfile dIDs
 documentName -> documentID

Keyfile dSTRs
 documentID -> documentName

Keyfile tIDs
 termName -> termID

Keyfile tSTRs
 termID -> termName

File dtlookup
 document statistics (document length, etc.)

ReadBuffer * dtlookupReadBuffer
 read buffer for dtlookup

File writetlist
 filestream for writing the list of located terms

char termKey [MAX_TERM_LENGTH]
 buffers for term() lookup functions

char docKey [MAX_DOCID_LENGTH]
 buffers for document() lookup functions

int _listsSize
 memory for use by inverted list buffers

int _memorySize
 upper bound for memory use

std::string name
 the prefix name

vector< InvFPDocList * > invertlists
 array of pointers to doclists

vector< LocatedTerm > termlist
 list of terms and their locations in this document

int curdocmgr
 the current docmanager to use

vector< DocumentManager * > docMgrs
 list of document manager objects

TermCache _cache
 cache of term entries

std::vector< File * > _segments
 out-of-tree segments for data

int _largestFlushedTermID
 highest term id flushed to disk.

int _estimatePoint
 invertlists point where we should next check on the cache size

bool ignoreDoc
 are we in a bad document state?


Detailed Description

KeyfileIncIndex builds an index by assigning termIDs and docIDs, tracking the locations of each term within documents, and tracking the terms within each document. It expects each DocumentProps to carry the total number of terms that were in the document, and it expects all stopping and stemming (if any) to occur before a term is passed in. If used with an existing index, new documents are added incrementally. Records are stored in keyfile B-trees. KeyfileIncIndex also provides the Index API for using the index.
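
The build protocol described above can be illustrated with a toy, in-memory model. The sketch below is not Lemur code: ToyPushIndex, Posting, and every member name are hypothetical, std::map stands in for the keyfile B-trees, and returning 0 for an unknown term is an assumption of the sketch. It shows only the call order (beginDoc, addTerm for each already stopped and stemmed term, endDoc) and one way termIDs, docIDs, and term positions could be assigned.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of a push-style index build. NOT the Lemur API: every name here
// is hypothetical, and std::map stands in for the keyfile B-trees.
struct Posting {
    int docID;                   // document that contains the term
    std::vector<int> positions;  // 0-based positions of the term in that document
};

class ToyPushIndex {
public:
    explicit ToyPushIndex(int startDocID = 1) : nextDocID_(startDocID) {}

    // Start a new document and assign it the next docID.
    void beginDoc(const std::string& name) {
        assert(!inDoc_);  // the previous document must have been ended
        curDocID_ = nextDocID_++;
        docNames_[curDocID_] = name;
        position_ = 0;
        inDoc_ = true;
    }

    // Add one term (already stopped and stemmed) to the current document.
    void addTerm(const std::string& spelling) {
        assert(inDoc_);
        int& termID = termIDs_[spelling];  // a new entry starts at 0 ("unseen")
        if (termID == 0) termID = static_cast<int>(termIDs_.size());
        std::vector<Posting>& list = postings_[termID];
        if (list.empty() || list.back().docID != curDocID_) {
            Posting p;
            p.docID = curDocID_;
            list.push_back(p);
        }
        list.back().positions.push_back(position_++);
    }

    // Close the current document and record its length.
    void endDoc() {
        assert(inDoc_);
        docLengths_[curDocID_] = position_;
        inDoc_ = false;
    }

    // Spelling -> termID; 0 for an unknown term (assumption of this sketch).
    int term(const std::string& spelling) const {
        auto it = termIDs_.find(spelling);
        return it == termIDs_.end() ? 0 : it->second;
    }
    // Number of documents indexed so far.
    int docCount() const { return static_cast<int>(docNames_.size()); }
    // Number of documents that contain the term.
    int docCount(int termID) const {
        auto it = postings_.find(termID);
        return it == postings_.end() ? 0 : static_cast<int>(it->second.size());
    }
    // Number of occurrences of the term across all documents.
    int termCount(int termID) const {
        auto it = postings_.find(termID);
        if (it == postings_.end()) return 0;
        int total = 0;
        for (const Posting& p : it->second)
            total += static_cast<int>(p.positions.size());
        return total;
    }
    // Number of terms added to the document.
    int docLength(int docID) const {
        auto it = docLengths_.find(docID);
        return it == docLengths_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, int> termIDs_;             // spelling -> termID
    std::map<int, std::vector<Posting> > postings_;  // termID -> postings
    std::map<int, std::string> docNames_;            // docID -> name
    std::map<int, int> docLengths_;                  // docID -> length
    int nextDocID_;
    int curDocID_ = 0;
    int position_ = 0;
    bool inDoc_ = false;
};
```

Because documents are pushed one at a time and docIDs only grow, each posting list in this toy stays ordered by docID without any sorting step.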


Constructor & Destructor Documentation

KeyfileIncIndex::KeyfileIncIndex (const string &prefix, int cachesize=128000000, DOCID_T startdocid=1)
 

Instantiate with index name without extension. Optionally pass in cachesize and starting document id number.

KeyfileIncIndex::KeyfileIncIndex ()
 

Create a new, empty index for the index manager to use.

KeyfileIncIndex::~KeyfileIncIndex ()
 

Clean up.


Member Function Documentation

int KeyfileIncIndex::_cacheSize () [protected]
 

total memory used by cache

void KeyfileIncIndex::_computeMemoryBounds (int memorySize) [protected]
 

cache size limits based on cachesize parameter to constructor

void KeyfileIncIndex::_resetEstimatePoint () [protected]
 

Approximate how many updates to collect before flushing the cache.
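
The idea of an estimate point can be sketched as follows. This is an illustration only, not the KeyfileIncIndex implementation: BoundedCache, kBytesPerUpdate, and the fixed per-update cost are assumptions made for the example. Rather than measuring the cache on every update, the sketch computes how many more updates fit under the memory bound and checks the cache size again only when that count is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical memory-bounded posting cache; not the KeyfileIncIndex code.
// Assumption: every update costs a fixed kBytesPerUpdate bytes.
const std::size_t kBytesPerUpdate = sizeof(int);

class BoundedCache {
public:
    explicit BoundedCache(std::size_t maxBytes) : maxBytes_(maxBytes) {
        assert(maxBytes_ >= kBytesPerUpdate);
        resetEstimatePoint();
    }

    // Record one update; inspect the cache size only at the estimate point.
    void add(int termID, int position) {
        lists_[termID].push_back(position);
        usedBytes_ += kBytesPerUpdate;
        ++updates_;
        if (updates_ >= estimatePoint_) {
            // Flush when there is no room left for another update.
            if (usedBytes_ + kBytesPerUpdate > maxBytes_) flush();
            resetEstimatePoint();
        }
    }

    // Move the in-memory lists into a new segment and empty the cache.
    void flush() {
        if (lists_.empty()) return;
        segments_.push_back(std::move(lists_));
        lists_.clear();  // a moved-from map is valid; clear() makes it empty
        usedBytes_ = 0;
    }

    std::size_t segmentCount() const { return segments_.size(); }
    std::size_t cachedBytes() const { return usedBytes_; }

private:
    // Estimate how many more updates fit before the next size check.
    void resetEstimatePoint() {
        std::size_t room = (maxBytes_ - usedBytes_) / kBytesPerUpdate;
        estimatePoint_ = updates_ + (room > 0 ? room : 1);
    }

    std::size_t maxBytes_;
    std::size_t usedBytes_ = 0;
    std::size_t updates_ = 0;
    std::size_t estimatePoint_ = 0;
    std::map<int, std::vector<int> > lists_;                   // termID -> positions
    std::vector<std::map<int, std::vector<int> > > segments_;  // flushed segments
};
```

In this toy the estimate is exact because every update has the same cost. With variable-size entries the estimate would only be approximate, which is why the size is checked again at each estimate point.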

void KeyfileIncIndex::_updateTermlist (InvFPDocList *curlist, int position) [protected]
 

add a position to a DocInfoList

void KeyfileIncIndex::addDocumentLookup (int documentKey, const char *documentName) [protected]
 

store a document record

void KeyfileIncIndex::addGeneralLookup (Keyfile &numberNameIndex, Keyfile &nameNumberIndex, int number, const char *name) [protected]
 

store a record
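
The signature suggests that one record is written to two indexes, one keyed by number and one keyed by name, so that both term() and document() lookups can run in either direction. A minimal sketch of that pattern, under stated assumptions: TwoWayLookup and its members are hypothetical, std::map stands in for Keyfile, and the 0 and empty-string results for missing keys are choices made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a pair of keyfile B-trees; std::map is used only
// to keep the sketch self-contained.
struct TwoWayLookup {
    std::map<int, std::string> numberToName;
    std::map<std::string, int> nameToNumber;

    // Store one record in both directions so either key can be resolved.
    void add(int number, const std::string& name) {
        assert(number > 0);  // 0 is reserved for "unknown" in this sketch
        numberToName[number] = name;
        nameToNumber[name] = number;
    }

    // Name -> number; returns 0 when the name is unknown.
    int lookupNumber(const std::string& name) const {
        auto it = nameToNumber.find(name);
        return it == nameToNumber.end() ? 0 : it->second;
    }

    // Number -> name; returns an empty string when the number is unknown.
    std::string lookupName(int number) const {
        auto it = numberToName.find(number);
        return it == numberToName.end() ? std::string() : it->second;
    }
};
```

Keeping two maps doubles the storage for each record, but it avoids scanning one index to answer the reverse question.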

void KeyfileIncIndex::addKnownTerm (int termID, int position)
 

update data for an already seen term

bool KeyfileIncIndex::addTerm (const Term &t) [virtual]
 

adding a term to the current document

Implements PushIndex.

void KeyfileIncIndex::addTermLookup (int termKey, const char *termSpelling) [protected]
 

store a term record

int KeyfileIncIndex::addUncachedTerm (const InvFPTerm *term)
 

update data for a term that is not cached in the term cache.

int KeyfileIncIndex::addUnknownTerm (const InvFPTerm *term)
 

initialize data for a previously unseen term.

bool KeyfileIncIndex::beginDoc (const DocumentProps *dp) [virtual]
 

the beginning of a new document

Implements PushIndex.

void KeyfileIncIndex::createDBs () [protected]
 

create the database files

int KeyfileIncIndex::docCount (int termID) const [virtual]
 

Total count of documents that contain a given term.

Implements Index.

int KeyfileIncIndex::docCount () const [inline, virtual]
 

Total count (i.e., number) of documents in collection.

Implements Index.

DocInfoList * KeyfileIncIndex::docInfoList (int termID) const [virtual]
 

doc entries in a term index.

See also:
DocList, InvFPDocList

Implements Index.

int KeyfileIncIndex::docLength (DOCID_T docID) const
 

Total count of terms in a document, possibly including stop words.

float KeyfileIncIndex::docLengthAvg () [virtual]
 

Average document length.

Implements Index.

int KeyfileIncIndex::docLengthCounted (int docID) const
 

Total count of terms in given document, not including stop words.

const DocumentManager * KeyfileIncIndex::docManager (int docID) const [virtual]
 

The document manager for this document.

Reimplemented from Index.

int KeyfileIncIndex::docMgrID (const string &mgr) [protected]
 

Returns the internal id of the given docmgr. If it is not already registered, mgr will be added.

bool KeyfileIncIndex::docMgrIDs () [protected]
 

read in document manager internal and external ids map

const string KeyfileIncIndex::document (int docID) const [virtual]
 

Convert a docID to its spelling.

Implements Index.

int KeyfileIncIndex::document (const string &docIDStr) const [virtual]
 

Convert a spelling to docID.

Implements Index.

void KeyfileIncIndex::doendDoc (const DocumentProps *dp, int mgrid) [protected, virtual]
 

handle end of document token.

void KeyfileIncIndex::endCollection (const CollectionProps *cp) [virtual]
 

signify the end of this collection.

Implements PushIndex.

void KeyfileIncIndex::endDoc (const DocumentProps *dp, const string &mgr) [virtual]
 

signify the end of current document

void KeyfileIncIndex::endDoc (const DocumentProps *dp) [virtual]
 

signify the end of current document

Implements PushIndex.

KeyfileIncIndex::record KeyfileIncIndex::fetchDocumentRecord (int key) const [protected]
 

retrieve a document record.

void KeyfileIncIndex::fullToc () [protected]
 

read in the entire table of contents

InvFPDocList * KeyfileIncIndex::internalDocInfoList (int termID) const [protected]
 

retrieve and construct the DocInfoList for a term.

void KeyfileIncIndex::lastWriteCache () [protected]
 

write out the cache on the final run

void KeyfileIncIndex::mergeCacheSegments () [protected]
 

out-of-tree cache management: combine segments into a single segment
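
Combining segments can be pictured as concatenating, term by term, the partial lists that each flush produced. The sketch below is a simplified illustration and not the on-disk merge used by KeyfileIncIndex: Segment and mergeSegments are hypothetical names, segments are held in memory, and each list is assumed to hold docIDs in increasing order with earlier segments covering earlier documents.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical in-memory segment: termID -> docIDs in increasing order.
using Segment = std::map<int, std::vector<int> >;

// Combine segments into a single segment. Segments must be passed in the
// order they were written, so appending keeps every list sorted.
Segment mergeSegments(const std::vector<Segment>& segments) {
    Segment merged;
    for (const Segment& seg : segments) {
        for (const auto& entry : seg) {
            std::vector<int>& out = merged[entry.first];
            // Later segments must not hold earlier documents than previous ones.
            assert(out.empty() || entry.second.empty() ||
                   out.back() <= entry.second.front());
            out.insert(out.end(), entry.second.begin(), entry.second.end());
        }
    }
    return merged;
}
```

Since no list needs re-sorting, the merge is linear in the total number of entries. That property depends on the ordering assumption stated above.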

bool KeyfileIncIndex::open (const string &indexName) [virtual]
 

Open previously created Index with given prefix.

Implements Index.

void KeyfileIncIndex::openDBs () [protected]
 

open the database files

void KeyfileIncIndex::openSegments () [protected]
 

open the segment files

void KeyfileIncIndex::setDocManager (const string &mgrID) [virtual]
 

set the document manager to use for succeeding documents

Implements PushIndex.

void KeyfileIncIndex::setMesgStream (ostream *lemStream)
 

set the message stream

void KeyfileIncIndex::setName (const string &prefix)
 

sets the name for this index

const string KeyfileIncIndex::term (int termID) const [virtual]
 

Convert a termID to its spelling.

Implements Index.

int KeyfileIncIndex::term (const string &word) const [virtual]
 

Convert a term spelling to a termID.

Implements Index.

int KeyfileIncIndex::termCount () const [inline, virtual]
 

Total count of all term occurrences in the collection.

Implements Index.

int KeyfileIncIndex::termCount (int termID) const [virtual]
 

Total count of occurrences of a term in the collection.

Implements Index.

int KeyfileIncIndex::termCountUnique () const [inline, virtual]
 

Total count of unique terms in collection.

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoList (int docID) const [virtual]
 

word entries in a document index (bag of words).

See also:
TermList

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoListSeq (int docID) const [virtual]
 

word entries in a document index (sequence of words).

See also:
TermList

Reimplemented from Index.

int KeyfileIncIndex::totaldocLength (int docID) const [virtual]
 

Total count of terms in a document, always including stop words.

bool KeyfileIncIndex::tryOpen () [protected]
 

try to open an existing index

void KeyfileIncIndex::writeCache (bool lastRun=false) [protected]
 

write out the cache

void KeyfileIncIndex::writeCacheSegment () [protected]
 

write out segments

void KeyfileIncIndex::writeDocMgrIDs () [protected]
 

write out document manager ids

void KeyfileIncIndex::writeTOC () [protected]
 

write out the table of contents file.


Member Data Documentation

TermCache KeyfileIncIndex::_cache [protected]
 

cache of term entries

int KeyfileIncIndex::_estimatePoint [protected]
 

invertlists point where we should next check on the cache size

int KeyfileIncIndex::_largestFlushedTermID [protected]
 

highest term id flushed to disk.

int KeyfileIncIndex::_listsSize [protected]
 

memory for use by inverted list buffers

int KeyfileIncIndex::_memorySize [protected]
 

upper bound for memory use

std::vector<File*> KeyfileIncIndex::_segments [protected]
 

out-of-tree segments for data

float KeyfileIncIndex::aveDocLen [protected]
 

the average document length in this index

int* KeyfileIncIndex::counts [protected]
 

array to hold all the overall count stats of this db

int KeyfileIncIndex::curdocmgr [protected]
 

the current docmanager to use

Keyfile KeyfileIncIndex::dIDs [protected]
 

documentName -> documentID

char KeyfileIncIndex::docKey[MAX_DOCID_LENGTH] [protected]
 

buffers for document() lookup functions

vector<DocumentManager*> KeyfileIncIndex::docMgrs [protected]
 

list of document manager objects

vector<std::string> KeyfileIncIndex::docmgrs [protected]
 

list of document managers

Keyfile KeyfileIncIndex::dSTRs [protected]
 

documentID -> documentName

File KeyfileIncIndex::dtlookup [protected]
 

document statistics (document length, etc.)

ReadBuffer* KeyfileIncIndex::dtlookupReadBuffer [protected]
 

read buffer for dtlookup

bool KeyfileIncIndex::ignoreDoc [protected]
 

are we in a bad document state?

vector<InvFPDocList*> KeyfileIncIndex::invertlists [protected]
 

array of pointers to doclists

Keyfile KeyfileIncIndex::invlookup [protected]
 

termID -> TermData (term statistics and inverted list segment offsets)

int KeyfileIncIndex::listlengths [protected]
 

how long all the lists are

ostream* KeyfileIncIndex::msgstream [protected]
 

Lemur code messages stream.

std::string KeyfileIncIndex::name [protected]
 

the prefix name

std::vector<std::string> KeyfileIncIndex::names [protected]
 

array to hold all the names for files we need for this db

char KeyfileIncIndex::termKey[MAX_TERM_LENGTH] [protected]
 

buffers for term() lookup functions

vector<LocatedTerm> KeyfileIncIndex::termlist [protected]
 

list of terms and their locations in this document

Keyfile KeyfileIncIndex::tIDs [protected]
 

termName -> termID

Keyfile KeyfileIncIndex::tSTRs [protected]
 

termID -> termName

File KeyfileIncIndex::writetlist [protected]
 

filestream for writing the list of located terms; mutable for the index access mode of the Index API (not PushIndex)


The documentation for this class was generated from the following files:
Generated on Fri Jul 2 16:25:43 2004 for Lemur Toolkit by doxygen 1.2.18