|
My primary research is in the area of machine learning and
computational statistics, with basic research on theory, methods, and
algorithms. I am generally interested in statistical and
probabilistic methods in computer science and AI. Areas of focus
include nonparametric methods, sparsity, the analysis of
high-dimensional and large data sets, graphical models, information
theory, and applications to language processing, text analysis, and
information retrieval. Some recent projects include:
- Rodeo: Sparse Nonparametric Regression in High
Dimensions
with Larry Wasserman
Submitted for publication (short version to appear in NIPS)
postscript PDF (revised 10/12)
Modern data sets requiring statistical analysis are often very
high dimensional. However, estimating a high-dimensional regression
function is notoriously difficult, due to the curse of
dimensionality, which can be precisely characterized using minimax
theory. We've been working on a new method for simultaneously
performing bandwidth selection and variable selection in
nonparametric regression that can beat the curse of dimensionality
when the underlying function is sparse. The method starts with a
local linear estimator with large bandwidths, and incrementally
decreases the bandwidth in directions where the gradient of the
estimator with respect to bandwidth is large. The method, called
"rodeo" (regularization of derivative expectation operator),
conducts a sequence of hypothesis tests and is easy to implement.
A modified version that replaces testing with soft thresholding can
be viewed as solving a sequence of lasso problems. Under certain
assumptions, the method achieves the optimal minimax rate of
convergence, up to logarithmic factors, as if the true relevant
variables were known in advance. When applied in one dimension, the
rodeo yields a simple adaptive estimator that chooses the locally
optimal bandwidth.
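As a rough illustration of the procedure described above, here is a minimal sketch of a rodeo-style loop at a single query point. It is not the implementation from the paper. The function name `rodeo_at_point`, the product Gaussian kernel, the known noise level `sigma`, the shrink factor `beta`, the starting bandwidth `h0`, and the threshold `s_j * sqrt(2 log n)` are all assumptions made for this example. The derivative of the local linear estimate with respect to each bandwidth is linear in the responses, which gives both the test statistic and its standard deviation.

```python
import numpy as np


def rodeo_at_point(X, Y, x, sigma, h0=1.0, beta=0.9, max_steps=200):
    """Simplified rodeo-style bandwidth selection at a single query point x.

    Illustrative assumptions: product Gaussian kernel, known noise level
    sigma, and test threshold s_j * sqrt(2 log n).
    Returns the local linear estimate at x and the selected bandwidths.
    """
    n, d = X.shape
    diff = X - x                                   # covariates centered at x
    design = np.hstack([np.ones((n, 1)), diff])    # local linear design matrix
    h = np.full(d, float(h0))                      # start with large bandwidths
    active = set(range(d))
    threshold_scale = np.sqrt(2.0 * np.log(n))

    def solve_weighted(h):
        # Kernel weights w and A^{-1} X^T, where A = X^T W X.
        w = np.exp(-0.5 * np.sum((diff / h) ** 2, axis=1))
        A = design.T @ (w[:, None] * design)
        return w, np.linalg.solve(A, design.T)

    for _ in range(max_steps):
        if not active:
            break
        w, A_inv_Xt = solve_weighted(h)
        # Residual operator I - X (X^T W X)^{-1} X^T W of the local linear fit.
        resid_op = np.eye(n) - design @ (A_inv_Xt * w)
        for j in sorted(active):
            dw = w * diff[:, j] ** 2 / h[j] ** 3   # derivative of weights w.r.t. h_j
            G_j = (A_inv_Xt[0] * dw) @ resid_op    # Z_j = G_j @ Y is linear in Y
            Z_j = G_j @ Y                          # derivative of the estimate w.r.t. h_j
            s_j = sigma * np.linalg.norm(G_j)      # standard deviation of Z_j
            if abs(Z_j) > s_j * threshold_scale:
                h[j] *= beta                       # significant derivative: shrink h_j
            else:
                active.discard(j)                  # otherwise freeze this bandwidth

    w, A_inv_Xt = solve_weighted(h)
    m_hat = (A_inv_Xt[0] * w) @ Y                  # intercept of the local linear fit
    return m_hat, h
```

Calling `rodeo_at_point(X, Y, x, sigma)` with an `(n, d)` array `X` returns the fitted value at `x` together with one bandwidth per coordinate. Coordinates whose bandwidth stays near `h0` are the ones the tests treated as locally irrelevant.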
- Correlated Topic Models
with Dave Blei
NIPS paper: PDF
Topic models, such as latent Dirichlet allocation (LDA), are effective
tools for the statistical analysis of document collections and other
discrete data. The LDA model assumes that the words of each document
arise from a mixture of topics, each of which is a distribution over
the vocabulary. A limitation of LDA is the inability to model topic
correlation even though, for example, a document about sports is more
likely to also be about health than international finance. This
limitation stems from the use of the Dirichlet distribution to model
the variability among the topic proportions. We have been developing
the correlated topic model (CTM), where the topic proportions
exhibit correlation via the logistic normal distribution (Aitchison,
1982). Approximate posterior inference in this model using mean-field
variational methods is complicated by the fact that the logistic
normal is not conjugate to the multinomial. The CTM provides a
natural way of visualizing and exploring unstructured data sets.
Visit
www.cs.cmu.edu/~lemur/science for an example browser for the model
fit on a collection of OCR articles from the journal Science. We're currently
working on time series versions of these models to capture the time
evolution of the underlying topics.
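To make the contrast between the two priors concrete, the sketch below draws topic proportions from a logistic normal and from a Dirichlet and compares the sample correlations. It only illustrates the prior on topic proportions, not the CTM fitting procedure. The three-topic setup, the covariance matrix `Sigma`, and the helper name `sample_logistic_normal` are assumptions chosen for this example (think of topics 0 and 1 as "sports" and "health", and topic 2 as "finance").

```python
import numpy as np


def sample_logistic_normal(mu, Sigma, size, rng):
    """Draw topic proportions: eta ~ N(mu, Sigma), theta = softmax(eta)."""
    eta = rng.multivariate_normal(mu, Sigma, size=size)
    eta = eta - eta.max(axis=1, keepdims=True)   # stabilize the exponentials
    theta = np.exp(eta)
    return theta / theta.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical covariance: topics 0 and 1 tend to rise and fall together.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

theta_ln = sample_logistic_normal(mu, Sigma, 5000, rng)
theta_dir = rng.dirichlet(np.ones(3), size=5000)

# Sample correlation matrices of the topic proportions under each prior.
corr_ln = np.corrcoef(theta_ln, rowvar=False)
corr_dir = np.corrcoef(theta_dir, rowvar=False)

print("logistic normal corr(topic 0, topic 1):", corr_ln[0, 1])
print("Dirichlet       corr(topic 0, topic 1):", corr_dir[0, 1])
```

Under a Dirichlet, every pair of proportions has negative covariance, so it cannot express that two topics tend to co-occur. The covariance matrix of the logistic normal removes that restriction.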
- Preconditioner Approximations for Probabilistic
Graphical Models
with Pradeep Ravikumar
NIPS paper: postscript PDF
Slides from talk: PDF
We're investigating a new family of approximation techniques for probabilistic
graphical models, based on the use of graphical preconditioners
developed in the scientific computing literature. The new framework
yields upper and lower bounds on event probabilities and the
log partition function of undirected graphical models, using
non-iterative procedures that have low time complexity. As in mean
field approaches, the approximations are built upon tractable
subgraphs, but we recast the problem of optimizing the tractable
distribution parameters and approximate inference in terms of the
well-studied linear systems problem of obtaining a good matrix
preconditioner. Preliminary experiments are encouraging: the new
approximation schemes are competitive with basic variational
methods.
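For readers unfamiliar with the linear-systems notion mentioned above, the following sketch shows what a "good" subgraph preconditioner looks like in the scientific computing sense. It is not the inference procedure from the paper. The cycle graph, the spanning path used as the tractable subgraph, the small diagonal shift `eps`, and the helper names are assumptions for this example. Quality is measured by the generalized condition number of the pair (A, B), the ratio of the largest to the smallest eigenvalue of B^{-1} A.

```python
import numpy as np


def laplacian(n, edges):
    """Graph Laplacian of an unweighted graph on n nodes."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L


def generalized_condition_number(A, B):
    """Ratio of extreme eigenvalues of B^{-1} A for symmetric positive definite A, B."""
    C = np.linalg.cholesky(B)                            # B = C C^T
    M = np.linalg.solve(C, np.linalg.solve(C, A).T).T    # C^{-1} A C^{-T}
    eig = np.linalg.eigvalsh((M + M.T) / 2.0)            # ascending eigenvalues
    return eig[-1] / eig[0]


n, eps = 10, 0.01
cycle_edges = [(i, (i + 1) % n) for i in range(n)]       # full graph: a cycle
path_edges = [(i, i + 1) for i in range(n - 1)]          # spanning tree: a path

A = laplacian(n, cycle_edges) + eps * np.eye(n)          # shifted to be positive definite
B = laplacian(n, path_edges) + eps * np.eye(n)           # tree-structured preconditioner

cond_plain = generalized_condition_number(A, np.eye(n))  # no preconditioning
cond_tree = generalized_condition_number(A, B)           # spanning-tree preconditioner

print("condition number without preconditioner:", cond_plain)
print("condition number with tree preconditioner:", cond_tree)
```

Here the tree differs from the cycle by a single edge, so B^{-1} A has one eigenvalue away from 1 and the generalized condition number is at most n, far smaller than the condition number of A alone.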
|