Abstract
Unigram language modeling is a successful probabilistic framework for
Information Retrieval (IR) that uses the multinomial distribution to
model documents and queries. A key feature of this approach is
the use of cross-entropy between the query model and the document models
as the document ranking function. The Naive Bayes model for text
classification uses the same multinomial distribution to model documents
but, in contrast, employs document log-likelihood as its scoring function.
Curiously, the cross-entropy function roughly corresponds to
query log-likelihood w.r.t. the document models, in some sense an
inverse of the scoring function used in the Naive Bayes model. It has
been empirically demonstrated that cross-entropy is a better performer
than document likelihood, but this interesting phenomenon remains
largely unexplained. In this work we investigate the cross-entropy
ranking function in IR. In particular, we show that the cross-entropy
ranking function corresponds to the log-likelihood of documents w.r.t.
the approximated Smoothed Dirichlet (SD) distribution, a novel variant
of the Dirichlet distribution. We also empirically demonstrate that this
new distribution captures term occurrence patterns in documents much
better than the multinomial, thus offering a reason for the superior
performance of the cross-entropy ranking function compared to the
multinomial document likelihood.
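
For concreteness, the cross-entropy ranking function discussed above takes the following standard form in the language modeling framework (the notation here is ours, not taken from the abstract): for a query model $\theta_q$ and a smoothed document model $\theta_d$ over vocabulary $V$,

$$\mathrm{score}(q, d) \;=\; -H(\theta_q, \theta_d) \;=\; \sum_{w \in V} p(w \mid \theta_q) \, \log p(w \mid \theta_d),$$

which, when $\theta_q$ is the empirical query distribution, is rank-equivalent to the query log-likelihood $\sum_{w} c(w, q) \log p(w \mid \theta_d)$. By contrast, the Naive Bayes scorer uses the document log-likelihood $\sum_{w} c(w, d) \log p(w \mid \theta_c)$ w.r.t. a class model $\theta_c$, with the roles of the two distributions effectively swapped.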
Our experiments in text classification show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial-based Naive Bayes model and on par with SVMs, confirming our reasoning.

We also construct a well-motivated classifier for IR based on the SD distribution that uses the EM algorithm to learn from pseudo-feedback, and show that its performance is equivalent to that of the Relevance Model (RM), a state-of-the-art model for IR in the language modeling framework that also uses cross-entropy as its ranking function. In addition, the SD-based classifier provides more flexibility than RM in modeling queries of varying lengths, owing to a consistent generative framework. We demonstrate that this flexibility translates into superior performance compared to RM on the task of topic tracking, an online classification task.
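The cross-entropy (query log-likelihood) ranking function that both RM and the SD-based classifier share can be sketched as follows. This is a minimal illustration in our own notation, not code from the thesis; the Dirichlet-smoothing form of p(w|d) and the value of mu are standard choices in the language modeling literature, assumed here for the example.

```python
# Illustrative sketch: cross-entropy / query-log-likelihood ranking
# with Dirichlet-smoothed document models. All names and the toy data
# are our own, chosen only to demonstrate the ranking function.
import math
from collections import Counter

def smoothed_prob(word, doc_counts, doc_len, coll_prob, mu=2000.0):
    """Dirichlet-smoothed p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (doc_counts.get(word, 0) + mu * coll_prob[word]) / (doc_len + mu)

def score(query, doc, coll_prob, mu=2000.0):
    """Query log-likelihood of doc: sum over query terms of log p(w|d).
    Rank-equivalent to the negative cross-entropy ranking function."""
    counts = Counter(doc)
    return sum(math.log(smoothed_prob(w, counts, len(doc), coll_prob, mu))
               for w in query)

# Toy collection: two three-word "documents".
docs = [["information", "retrieval", "model"],
        ["cooking", "recipes", "pasta"]]
coll_counts = Counter(w for d in docs for w in d)
total = sum(coll_counts.values())
coll_prob = {w: c / total for w, c in coll_counts.items()}

query = ["information", "retrieval"]
# Rank documents by score; the on-topic document should come first.
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i], coll_prob),
                reverse=True)
```

Note that the document model is smoothed while the query model is used as-is, the asymmetry the abstract points to when contrasting this scorer with the Naive Bayes document log-likelihood.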

Pradeep Ravikumar Last modified: Wed May 3 11:24:40 EDT 2006