Bayesian Methods for Frequent Terms in Text: Models of Contagion and the Delta-Square Statistic

Edoardo M. Airoldi


  Most statistical approaches to modeling text implicitly assume that informative words are rare. This assumption is usually appropriate for topical retrieval and classification tasks; however, in non-topical classification and soft-clustering problems, where classes and latent variables relate to sentiment or authorship, informative words can be frequent. In this paper we present a comprehensive set of statistical learning tools that treat frequently occurring words in a sensible manner. We introduce probabilistic models of contagion for classification and soft-clustering based on the Poisson and Negative-Binomial distributions, which share with the Multinomial the desirable properties of simplicity and analytic tractability. We then introduce the Delta-Square statistic to select features and avoid over-fitting. As an example, we demonstrate the Dirichlet-Poisson model for classification and soft-clustering. On a technical level, this model leverages: (a) the "reference length" parameter, in order to implicitly normalize word counts in a probabilistic fashion and ultimately correct parameter estimates for the differing word lengths of documents, and (b) the "sum/ratio" parameterization, in order to promote the tractability of variational inference, the interpretability of parameters and priors, and geometric intuition. This is joint work with William Cohen and Stephen Fienberg.
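To make the "reference length" idea concrete, here is a minimal sketch (not the talk's actual model, and all function names are hypothetical): a Poisson model of one word's counts in which each document's rate is scaled by its length relative to a reference length, so the estimated rate is implicitly normalized for document length.

```python
import math

def poisson_rate_mle(counts, lengths, ref_length):
    """MLE of the per-reference-length rate lambda for one word.

    Assumed model: count_d ~ Poisson(lambda * length_d / ref_length),
    so lambda is the expected count in a document of the reference length.
    The MLE is sum(counts) / sum(length_d / ref_length).
    """
    exposure = sum(l / ref_length for l in lengths)
    return sum(counts) / exposure

def poisson_log_lik(counts, lengths, ref_length, lam):
    """Log-likelihood of the length-scaled Poisson model above."""
    ll = 0.0
    for c, l in zip(counts, lengths):
        rate = lam * l / ref_length
        ll += c * math.log(rate) - rate - math.lgamma(c + 1)
    return ll
```

For example, counts of 3 and 5 in documents of 100 and 200 words with a reference length of 100 give a rate estimate of 8/3, i.e. about 2.67 occurrences per 100 words, rather than a raw per-document average that would be inflated by the longer document.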


Pradeep Ravikumar
Last modified: Sat Nov 5 09:08:53 EST 2005