Unigram language modeling is a successful probabilistic framework for
Information Retrieval (IR) that uses the multinomial distribution to
model documents and queries. An important feature of this approach is
the use of cross-entropy between the query model and document models
as the document ranking function. The Naive Bayes model for text
classification uses the same multinomial distribution to model documents
but, in contrast, employs the document log-likelihood as its scoring function.
Curiously, the cross-entropy function corresponds roughly to the
query log-likelihood w.r.t. the document models, in some sense an
inverse of the scoring function used in the Naive Bayes model. It has
been empirically demonstrated that cross-entropy outperforms
document-likelihood, but this interesting phenomenon remains
largely unexplained. In this work we investigate the cross-entropy
ranking function in IR. In particular, we show that the cross-entropy
ranking function corresponds to the log-likelihood of documents w.r.t.
the approximated Smoothed-Dirichlet (SD) distribution, a novel variant
of the Dirichlet distribution. We also empirically demonstrate that this
new distribution captures term occurrence patterns in documents much
better than the multinomial, thus offering a reason behind the superior
performance of the cross-entropy ranking function compared to the
multinomial document-likelihood.

Our experiments in text classification show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial-based Naive Bayes model and on par with SVMs, confirming our reasoning. We also construct a well-motivated classifier for IR based on the SD distribution that uses the EM algorithm to learn from pseudo-feedback, and show that its performance is equivalent to that of the Relevance model (RM), a state-of-the-art model for IR in the language modeling framework that also uses cross-entropy as its ranking function. In addition, the SD-based classifier provides more flexibility than RM in modeling queries of varying lengths, owing to a consistent generative framework. We demonstrate that this flexibility translates into superior performance compared to RM on the task of topic tracking, an on-line classification task.
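The contrast the abstract draws between the two scoring functions can be made concrete. Below is a minimal sketch, with toy term probabilities and hypothetical variable names (none taken from the paper): the cross-entropy ranking score used in language-modeling IR weights log document-model probabilities by the *query* model, while the multinomial Naive Bayes score weights log class-model probabilities by raw *document* counts. In practice the document model would be smoothed against a background collection; that step is omitted here for brevity.

```python
import math

# Toy models over a tiny vocabulary (illustrative values only).
query_model = {"retrieval": 0.5, "model": 0.5}               # P(w | theta_Q)
doc_model = {"retrieval": 0.3, "model": 0.2, "the": 0.5}     # P(w | theta_D)
doc_counts = {"retrieval": 3, "model": 2, "the": 5}          # counts of w in d

# Cross-entropy ranking: rank d by
#   -H(theta_Q, theta_D) = sum_w P(w | theta_Q) * log P(w | theta_D),
# i.e. (up to scaling) the query log-likelihood under the document model.
ce_score = sum(p * math.log(doc_model[w]) for w, p in query_model.items())

# Naive Bayes scoring: document log-likelihood under a class multinomial,
#   sum_w count(w, d) * log P(w | class).
class_model = doc_model  # stand-in class multinomial for illustration
nb_score = sum(c * math.log(class_model[w]) for w, c in doc_counts.items())

print(ce_score, nb_score)
```

Note that the two scores are not interchangeable: the Naive Bayes score grows with document length (it sums over document counts), whereas the cross-entropy score is normalized by the query model, which is part of what makes the empirical gap between them worth explaining.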
Pradeep Ravikumar Last modified: Wed May 3 11:24:40 EDT 2006