Abstract
Most statistical approaches to modeling text implicitly assume that
informative words are rare. This assumption is usually appropriate for
topical retrieval and classification tasks; however, in non-topical
classification and soft-clustering problems, where classes and latent
variables relate to sentiment or authorship, informative words can be
frequent. In this paper we present a comprehensive set of statistical
learning tools that treat words with higher frequencies of occurrence
in a sensible manner. We introduce probabilistic models of contagion
for classification and soft clustering based on the Poisson and
Negative Binomial distributions, which share with the Multinomial the
desirable properties of simplicity and analytic tractability. We then
introduce the Delta-Square statistic to select features and avoid
overfitting.
As an example, we demonstrate the Dirichlet-Poisson model for
classification and soft clustering. On a technical level, this model
leverages: (a) the "reference length" parameter, which implicitly
normalizes word counts in a probabilistic fashion and ultimately
corrects parameter estimates for the differing word lengths of
documents, and (b) the "sum/ratio" parameterization, which promotes
the tractability of variational inference, the interpretability of
parameters and priors, and geometric intuition.
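To make the length-normalization idea concrete, here is a minimal sketch of a Poisson class-conditional classifier in which per-word rates are estimated at a common "reference length" and rescaled to each document's actual length at scoring time. This is an illustration of the general technique only, not the authors' Dirichlet-Poisson model; the names (`ref_length`, `alpha`) and the add-`alpha` smoothing are assumptions introduced for this sketch.

```python
import math

def fit_poisson_rates(docs, labels, ref_length=100.0, alpha=0.5):
    # docs: list of {word: count} dicts; labels: parallel list of class labels.
    # For each class, estimate the expected count of every vocabulary word
    # in a hypothetical document of ref_length words (with add-alpha
    # smoothing so no rate is zero). Expressing rates at one reference
    # length makes parameters comparable across documents of any length.
    vocab = {w for d in docs for w in d}
    rates = {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        word_tot = {w: sum(d.get(w, 0) for d in class_docs) for w in vocab}
        n_tot = sum(word_tot.values())
        denom = n_tot + alpha * len(vocab)
        rates[c] = {w: ref_length * (word_tot[w] + alpha) / denom
                    for w in vocab}
    return rates

def poisson_score(doc, lam_ref, ref_length=100.0):
    # Poisson log-likelihood of the document under one class's rates,
    # after rescaling the reference-length rates to this document's
    # actual length n (lambda_doc = lambda_ref * n / ref_length).
    n = sum(doc.values())
    scale = n / ref_length
    score = 0.0
    for w, lam in lam_ref.items():
        lam_doc = lam * scale
        x = doc.get(w, 0)
        score += x * math.log(lam_doc) - lam_doc - math.lgamma(x + 1)
    return score

def classify(doc, rates, ref_length=100.0):
    # Pick the class whose rescaled Poisson model best explains the counts.
    return max(rates, key=lambda c: poisson_score(doc, rates[c], ref_length))
```

Because the rates for each class sum to `ref_length` by construction, the rescaled rates of every class sum to the document's own length, so the comparison between classes reduces to how well each class's word proportions match the document, independent of how long it is.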
This is joint work with William Cohen and Stephen Fienberg.

Pradeep Ravikumar Last modified: Sat Nov 5 09:08:53 EST 2005