
BasicIndex Class Reference

Basic Indexer (with arbitrary compressor).

#include <BasicIndex.hpp>

Inherits Index.

Public Methods

 BasicIndex ()
 constructor (used when opening an index)

 BasicIndex (Compress *pc)
 constructor (used when building an index)

virtual ~BasicIndex ()
virtual bool open (const char *indexName)
 Open previously created Index, return true if opened successfully.

void build (DocStream *collectionStream, const char *file, const char *outputPrefix, int totalDocs=0x1000000, int maxMemory=0x4000000, int minimumCount=1, int maxVocSize=2000000)

Spelling and index conversion
virtual int term (const char *word)
 Convert a term spelling to a termID.

virtual const char * term (int termID)
 Convert a termID to its spelling.

virtual int document (const char *docIDStr)
 Convert a document ID spelling to its docID.

virtual const char * document (int docID)
 Convert a docID to its spelling.

virtual const char * termLexiconID ()
 Return the term lexicon ID.

Summary counts
virtual int docCount ()
 Total count (i.e., number) of documents in collection.

virtual int termCountUnique ()
 Total count of unique terms in collection.

virtual int termCount (int termID) const
 Total number of occurrences of a term in the collection.

virtual int termCount () const
 Total number of occurrences of all terms in the collection.

virtual float docLengthAvg ()
 Average document length.

virtual int docCount (int termID)
 Number of documents that contain a given term.

virtual int docLength (int docID) const
 Total number of term occurrences in a document (the document length).
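
To make the difference between the summary counts above concrete, here is a minimal self-contained sketch that computes each of them over a tiny in-memory collection. ToyCollection and the toy* functions are hypothetical names invented for this illustration, and the zero-based docIDs are a convention of the sketch only. It does not call BasicIndex, and the source does not state how BasicIndex itself computes or stores these values.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical toy collection: each document is a sequence of termIDs.
// docIDs are simply positions in the outer vector (zero-based in this sketch).
typedef std::vector<std::vector<int>> ToyCollection;

// Number of documents in the collection (cf. docCount()).
int toyDocCount(const ToyCollection &c) {
    return static_cast<int>(c.size());
}

// Number of term occurrences in one document (cf. docLength(docID)).
int toyDocLength(const ToyCollection &c, int docID) {
    return static_cast<int>(c.at(docID).size());
}

// Occurrences of all terms in the whole collection (cf. termCount()).
int toyTermCount(const ToyCollection &c) {
    int total = 0;
    for (const std::vector<int> &doc : c) total += static_cast<int>(doc.size());
    return total;
}

// Occurrences of a single term in the whole collection (cf. termCount(termID)).
int toyTermCount(const ToyCollection &c, int termID) {
    int total = 0;
    for (const std::vector<int> &doc : c)
        total += static_cast<int>(std::count(doc.begin(), doc.end(), termID));
    return total;
}

// Number of documents containing the term at least once (cf. docCount(termID)).
int toyDocCountWithTerm(const ToyCollection &c, int termID) {
    int n = 0;
    for (const std::vector<int> &doc : c)
        if (std::find(doc.begin(), doc.end(), termID) != doc.end()) ++n;
    return n;
}

// Number of distinct terms in the collection (cf. termCountUnique()).
int toyTermCountUnique(const ToyCollection &c) {
    std::set<int> seen;
    for (const std::vector<int> &doc : c) seen.insert(doc.begin(), doc.end());
    return static_cast<int>(seen.size());
}

// Average document length, taken here as total occurrences / number of documents.
float toyDocLengthAvg(const ToyCollection &c) {
    assert(!c.empty());  // the average is undefined for an empty collection
    return static_cast<float>(toyTermCount(c)) / static_cast<float>(toyDocCount(c));
}
```

For a collection with the two documents {1, 2, 1} and {2, 3}, term 1 occurs twice but appears in only one document, which is the distinction between termCount(termID) and docCount(termID).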

Index entry access
virtual DocInfoList * docInfoList (int termID)
 Document entries for a term in the term index; the caller should release the memory.
See also:
DocList


virtual TermInfoList * termInfoList (int docID)
 Term entries for a document in the document index; the caller should release the memory.
See also:
TermList



Private Methods

void buildVocabulary (int maxVocSize, int minimumCount)
void writeWordIndex (int indexNum, FastList< IndexCount > *dlw)
int indexCollection ()
int headDocIndex ()
int headWordIndex ()
void createKeys ()
void mergeIndexFiles ()
void createKey (const char *inName, const char *outName, Terms &voc, int *byteOffset)
int mergePair (const char *fn1, const char *fn2, const char *fn3)
void writeIndexFile ()

Private Attributes

ifstream textStream
String prefix
String textFile
String wordVocabulary
String documentVocabulary
String wordIndexFile
String documentIndexFile
String wordKeyFile
String documentKeyFile
Terms terms
Terms docids
int numDocuments
int numWords
int numBytes
int maxDocumentLength
float avgDocumentLength
int totalDocuments
int memorySegment
int maxSegmentsPerIndex
time_t timeToIndex
int maximumMemory
MemList * pMemList
Compress * pCompressor
bool deleteCompressor
DocStream * pDocStream
ifstream wordIndexStream
ifstream documentIndexStream
int * woffset
int * doffset
int * tmpdarr
int * tmpwarr
int * countOfTerm
int * countOfDoc

Detailed Description

Basic Indexer (with arbitrary compressor).

BasicIndex is a basic implementation of Index. It creates and manages two indices (term->doc and doc->term) as well as a term lexicon and a document ID lexicon. The application can supply any compressor by passing it to the constructor used when building an index. See Index for an example of use.
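
As a rough illustration of the layout described above, the following self-contained sketch keeps a term lexicon, a document lexicon, and the two mappings (term->doc and doc->term) in memory. ToyLexicon, ToyIndex, addDocument and the convention that ID 0 means "unknown" are hypothetical choices made for this sketch. It does not use or reproduce the BasicIndex implementation, its compression, or its on-disk format.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy lexicon: maps spellings to dense integer IDs and back.
// IDs start at 1; 0 is used in this sketch to mean "unknown spelling".
class ToyLexicon {
public:
    // Return the existing ID for a spelling, or assign the next free one.
    int add(const std::string &s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        spellings_.push_back(s);
        int id = static_cast<int>(spellings_.size());
        ids_[s] = id;
        return id;
    }
    // spelling -> ID (0 if the spelling was never added)
    int id(const std::string &s) const {
        auto it = ids_.find(s);
        return it == ids_.end() ? 0 : it->second;
    }
    // ID -> spelling (throws std::out_of_range for an unknown ID)
    const std::string &spelling(int id) const {
        return spellings_.at(static_cast<std::size_t>(id - 1));
    }
private:
    std::map<std::string, int> ids_;
    std::vector<std::string> spellings_;
};

// Toy two-way index: both directions are updated for every token.
struct ToyIndex {
    ToyLexicon terms;  // term lexicon
    ToyLexicon docs;   // document ID lexicon
    std::map<int, std::map<int, int>> termToDoc;  // termID -> (docID -> count)
    std::map<int, std::map<int, int>> docToTerm;  // docID -> (termID -> count)

    // Index one whitespace-tokenized document under the given external name.
    void addDocument(const std::string &docName, const std::string &text) {
        assert(!docName.empty());
        int d = docs.add(docName);
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            int t = terms.add(word);
            ++termToDoc[t][d];
            ++docToTerm[d][t];
        }
    }
};
```

After adding two documents, termToDoc answers "which documents contain this term, and how often", while docToTerm answers "which terms occur in this document", which mirrors the roles of docInfoList and termInfoList.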


Constructor & Destructor Documentation

BasicIndex::BasicIndex ()
 

constructor (used when opening an index)

BasicIndex::BasicIndex (Compress * pc)
 

constructor (used when building an index)

BasicIndex::~BasicIndex () [virtual]
 


Member Function Documentation

void BasicIndex::build (DocStream * collectionStream, const char * file, const char * outputPrefix, int totalDocs = 0x1000000, int maxMemory = 0x4000000, int minimumCount = 1, int maxVocSize = 2000000)
 

void BasicIndex::buildVocabulary (int maxVocSize, int minimumCount) [private]
 

void BasicIndex::createKey (const char * inName, const char * outName, Terms & voc, int * byteOffset) [private]
 

void BasicIndex::createKeys () [private]
 

int BasicIndex::docCount (int termID) [virtual]
 

Number of documents that contain a given term.

Implements Index.

virtual int BasicIndex::docCount () [inline, virtual]
 

Total count (i.e., number) of documents in collection.

Implements Index.

DocInfoList * BasicIndex::docInfoList (int termID) [virtual]
 

Document entries for a term in the term index; the caller should release the memory.

See also:
DocList

Implements Index.
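
The ownership note above (the caller releases the returned list) can be handled by taking ownership of the pointer as soon as it is returned. The sketch below shows that pattern with hypothetical stand-ins: ToyDocInfoList, toyDocInfoList and countEntries are invented for this example and are not Lemur types or functions. It also assumes the list is allocated with new and released with delete, which the source does not state explicitly.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a heap-allocated list of document entries.
// The static counter exists only so the sketch can show that nothing leaks.
struct ToyDocInfoList {
    static int live;           // number of lists currently allocated
    std::vector<int> docIDs;   // placeholder payload
    ToyDocInfoList() { ++live; }
    ~ToyDocInfoList() { --live; }
};
int ToyDocInfoList::live = 0;

// Hypothetical accessor with the same contract as described above:
// it returns a newly allocated object that the caller must release.
ToyDocInfoList *toyDocInfoList(int termID) {
    ToyDocInfoList *list = new ToyDocInfoList();
    list->docIDs.push_back(termID);  // placeholder content, one entry
    return list;
}

// Wrap the raw pointer in std::unique_ptr immediately, so the list is
// deleted on every exit path, including early returns and exceptions.
int countEntries(int termID) {
    std::unique_ptr<ToyDocInfoList> list(toyDocInfoList(termID));
    assert(list);
    return static_cast<int>(list->docIDs.size());
}
```

The same wrapping would apply to a list returned by termInfoList, under the same assumption about how the list is allocated.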

virtual int BasicIndex::docLength (int docID) const [inline, virtual]
 

Total number of term occurrences in a document (the document length).

Implements Index.

virtual float BasicIndex::docLengthAvg () [inline, virtual]
 

Average document length.

Implements Index.

virtual const char * BasicIndex::document (int docID) [inline, virtual]
 

Convert a docID to its spelling.

Implements Index.

virtual int BasicIndex::document (const char * docIDStr) [inline, virtual]
 

Convert a document ID spelling to its docID.

Implements Index.

int BasicIndex::headDocIndex () [private]
 

int BasicIndex::headWordIndex () [private]
 

int BasicIndex::indexCollection () [private]
 

void BasicIndex::mergeIndexFiles () [private]
 

int BasicIndex::mergePair (const char * fn1, const char * fn2, const char * fn3) [private]
 

bool BasicIndex::open (const char * indexName) [virtual]
 

Open previously created Index, return true if opened successfully.

Implements Index.

virtual const char * BasicIndex::term (int termID) [inline, virtual]
 

Convert a termID to its spelling.

Implements Index.

virtual int BasicIndex::term (const char * word) [inline, virtual]
 

Convert a term spelling to a termID.

Implements Index.

virtual int BasicIndex::termCount () const [inline, virtual]
 

Total number of occurrences of all terms in the collection.

Implements Index.

virtual int BasicIndex::termCount (int termID) const [inline, virtual]
 

Total number of occurrences of a term in the collection.

Implements Index.

virtual int BasicIndex::termCountUnique () [inline, virtual]
 

Total count of unique terms in collection.

Implements Index.

TermInfoList * BasicIndex::termInfoList (int docID) [virtual]
 

Term entries for a document in the document index; the caller should release the memory.

See also:
TermList

Implements Index.

virtual const char * BasicIndex::termLexiconID () [inline, virtual]
 

Return the term lexicon ID.

Reimplemented from Index.

void BasicIndex::writeIndexFile () [private]
 

void BasicIndex::writeWordIndex (int indexNum, FastList< IndexCount > * dlw) [private]
 


Member Data Documentation

float BasicIndex::avgDocumentLength [private]
 

int* BasicIndex::countOfDoc [private]
 

int* BasicIndex::countOfTerm [private]
 

bool BasicIndex::deleteCompressor [private]
 

Terms BasicIndex::docids [private]
 

String BasicIndex::documentIndexFile [private]
 

ifstream BasicIndex::documentIndexStream [private]
 

String BasicIndex::documentKeyFile [private]
 

String BasicIndex::documentVocabulary [private]
 

int * BasicIndex::doffset [private]
 

int BasicIndex::maxDocumentLength [private]
 

int BasicIndex::maximumMemory [private]
 

int BasicIndex::maxSegmentsPerIndex [private]
 

int BasicIndex::memorySegment [private]
 

int BasicIndex::numBytes [private]
 

int BasicIndex::numDocuments [private]
 

int BasicIndex::numWords [private]
 

Compress* BasicIndex::pCompressor [private]
 

DocStream* BasicIndex::pDocStream [private]
 

MemList* BasicIndex::pMemList [private]
 

String BasicIndex::prefix [private]
 

Terms BasicIndex::terms [private]
 

String BasicIndex::textFile [private]
 

ifstream BasicIndex::textStream [private]
 

time_t BasicIndex::timeToIndex [private]
 

int* BasicIndex::tmpdarr [private]
 

int * BasicIndex::tmpwarr [private]
 

int BasicIndex::totalDocuments [private]
 

int* BasicIndex::woffset [private]
 

String BasicIndex::wordIndexFile [private]
 

ifstream BasicIndex::wordIndexStream [private]
 

String BasicIndex::wordKeyFile [private]
 

String BasicIndex::wordVocabulary [private]
 


The documentation for this class was generated from the following files:
Generated on Fri Feb 6 07:11:58 2004 for LEMUR by doxygen 1.2.16