Abstract
Unigram language modeling is a successful probabilistic framework for
Information Retrieval (IR) that uses the multinomial distribution to
model documents and queries. A key feature of this approach is
the use of cross-entropy between the query model and the document models
as the document ranking function. The Naive Bayes model for text
classification uses the same multinomial distribution to model documents
but, in contrast, employs document log-likelihood as its scoring function.
Curiously, the cross-entropy function roughly corresponds to
query log-likelihood w.r.t. the document models, in some sense an
inverse of the scoring function used in the Naive Bayes model. It has
been empirically demonstrated that cross-entropy is a better performer
than document likelihood, but this interesting phenomenon remains
largely unexplained. In this work we investigate the cross-entropy
ranking function in IR. In particular, we show that the cross-entropy
ranking function corresponds to the log-likelihood of documents w.r.t.
the approximated Smoothed Dirichlet (SD) distribution, a novel variant
of the Dirichlet distribution. We also empirically demonstrate that this
new distribution captures term occurrence patterns in documents much
better than the multinomial, thus offering a reason for the superior
performance of the cross-entropy ranking function compared to the
multinomial document likelihood.
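
For concreteness, the cross-entropy ranking function discussed above takes the following standard form in the language modeling framework (the notation here is ours, not taken from the abstract): for a query model $\theta_q$ and a smoothed document model $\theta_d$ over vocabulary $V$,

$$\mathrm{score}(q, d) \;=\; -H(\theta_q, \theta_d) \;=\; \sum_{w \in V} p(w \mid \theta_q) \, \log p(w \mid \theta_d),$$

which, when $\theta_q$ is the empirical query distribution, is rank-equivalent to the query log-likelihood $\sum_{w} c(w, q) \log p(w \mid \theta_d)$. By contrast, the Naive Bayes scorer uses the document log-likelihood $\sum_{w} c(w, d) \log p(w \mid \theta_c)$ w.r.t. a class model $\theta_c$, with the roles of the two distributions effectively swapped.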
Our experiments in text classification show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial-based Naive Bayes model and on par with SVMs, confirming our reasoning.

We also construct a well-motivated classifier for IR based on the SD distribution that uses the EM algorithm to learn from pseudo-feedback, and show that its performance is equivalent to that of the Relevance Model (RM), a state-of-the-art model for IR in the language modeling framework that also uses cross-entropy as its ranking function. In addition, the SD-based classifier provides more flexibility than RM in modeling queries of varying lengths, owing to a consistent generative framework. We demonstrate that this flexibility translates into superior performance compared to RM on the task of topic tracking, an online classification task.
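The cross-entropy (query log-likelihood) ranking function that both RM and the SD-based classifier share can be sketched as follows. This is a minimal illustration in our own notation, not code from the thesis; the Dirichlet-smoothing form of p(w|d) and the value of mu are standard choices in the language modeling literature, assumed here for the example.

```python
# Illustrative sketch: cross-entropy / query-log-likelihood ranking
# with Dirichlet-smoothed document models. All names and the toy data
# are our own, chosen only to demonstrate the ranking function.
import math
from collections import Counter

def smoothed_prob(word, doc_counts, doc_len, coll_prob, mu=2000.0):
    """Dirichlet-smoothed p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (doc_counts.get(word, 0) + mu * coll_prob[word]) / (doc_len + mu)

def score(query, doc, coll_prob, mu=2000.0):
    """Query log-likelihood of doc: sum over query terms of log p(w|d).
    Rank-equivalent to the negative cross-entropy ranking function."""
    counts = Counter(doc)
    return sum(math.log(smoothed_prob(w, counts, len(doc), coll_prob, mu))
               for w in query)

# Toy collection: two three-word "documents".
docs = [["information", "retrieval", "model"],
        ["cooking", "recipes", "pasta"]]
coll_counts = Counter(w for d in docs for w in d)
total = sum(coll_counts.values())
coll_prob = {w: c / total for w, c in coll_counts.items()}

query = ["information", "retrieval"]
# Rank documents by score; the on-topic document should come first.
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i], coll_prob),
                reverse=True)
```

Note that the document model is smoothed while the query model is used as-is, the asymmetry the abstract points to when contrasting this scorer with the Naive Bayes document log-likelihood.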

Pradeep Ravikumar Last modified: Wed May 3 11:24:40 EDT 2006