Abstract
For Hidden Markov Models (HMMs) with fully connected transition models, the
three fundamental problems of evaluating the likelihood of an observation
sequence, estimating an optimal state sequence for the observations, and
learning the model parameters, all have quadratic time complexity in the
number of states. We introduce a novel class of nonsparse Markov transition
matrices called DenseMostlyConstant (DMC) transition matrices that allow
us to derive new algorithms for solving the basic HMM problems in
subquadratic time. We describe the DMC HMM model and algorithms and attempt
to convey some intuition for their usage. Empirical results for these
algorithms show dramatic speedups for all three problems. In terms of
accuracy, the DMC model yields strong results and outperforms the
baseline algorithms even in domains known to violate the DMC assumption.
Most statistical approaches to modeling text implicitly assume that
informative words are rare. This assumption is usually appropriate for
topical retrieval and classification tasks; however, in non topical
classification and softclustering problems where classes and latent
variables relate to sentiment or author, informative words can be
frequent. In this paper we present a comprehensive set of statistical
learning tools which treat words with higher frequencies of occurrence
in a sensible manner. We introduce probabilistic models of contagion
for classification and softclustering based on the Poisson and
NegativeBinomial distributions, which share with the Multinomial the
desirable properties of simplicity and analytic tractability. We then
introduce the DeltaSquare statistic to select features and avoid
overfitting.

Pradeep Ravikumar Last modified: Thu Nov 10 11:41:07 EST 2005