Fast and flexible clustering: nonparametric Bayesian methods in machine learning

David Blei

Abstract

  Probabilistic clustering methods are widely used in machine learning for many application areas such as vision, information retrieval, and bioinformatics. Such methods can be used both for dimensionality reduction and, more qualitatively, for automatically organizing large datasets into groups of similar elements. Lurking behind every clustering application, however, is the nagging question: how many clusters should there be?

The classical approach to answering this question is to use model-selection techniques to choose the best number of clusters for the data. However, realistic datasets grow and change over time; one cannot always expect a fixed number of clusters to capture all of the variation in the observed and future data. The Dirichlet process (Ferguson, 1973), the cornerstone of Bayesian nonparametric statistics, provides a modeling technique with which new data can exhibit new clusters. It can be used as a flexible clustering tool that chooses the number of clusters, but does not wholly commit to that choice.
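
To make this concrete: the Dirichlet process's predictive distribution over cluster assignments is the Chinese restaurant process, in which each new point joins an existing cluster with probability proportional to that cluster's current size, or starts a new cluster with probability proportional to a concentration parameter alpha. The following is a minimal sketch (the function name, the alpha value, and the seed are illustrative, not part of the talk's material):

    import random

    def crp_assignments(n, alpha, seed=0):
        # Sample cluster assignments for n points from a Chinese
        # restaurant process with concentration parameter alpha.
        rng = random.Random(seed)
        counts = []        # counts[k] = size of cluster k so far
        assignments = []
        for _ in range(n):
            # Join existing cluster k with probability proportional
            # to counts[k]; start a new cluster with probability
            # proportional to alpha.
            weights = counts + [alpha]
            r = rng.random() * sum(weights)
            k, acc = 0, weights[0]
            while r > acc:
                k += 1
                acc += weights[k]
            if k == len(counts):
                counts.append(1)   # a brand-new cluster appears
            else:
                counts[k] += 1
            assignments.append(k)
        return assignments

    print(crp_assignments(20, alpha=1.0))

Because the probability of a new cluster never vanishes, additional data can always reveal additional structure, which is exactly the sense in which the model does not commit to a fixed number of clusters.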

While conceptually attractive, Dirichlet process methods are limited in practice because they rely on Markov chain Monte Carlo (MCMC) sampling techniques for approximate inference (West and MacEachern, 1994). Such methods work well in univariate settings, but are often too slow for the complicated multivariate problems that are the focus of machine learning. In this talk, I will review Dirichlet process clustering techniques and describe a fast, deterministic method for approximate inference based on variational methods. This method is a practical alternative to MCMC for large-scale problems.
(Joint work with Michael I. Jordan)
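
One ingredient commonly used in such variational approximations is a truncated stick-breaking representation of the Dirichlet process, in which the infinite mixture is cut off at a fixed number of components. A minimal sketch of drawing mixture weights under such a truncation (the function name and truncation level are illustrative assumptions, not the talk's algorithm):

    import random

    def stick_breaking_weights(alpha, truncation, seed=0):
        # Draw mixture weights from a stick-breaking construction
        # of the Dirichlet process, truncated at a fixed number of
        # components.
        rng = random.Random(seed)
        remaining = 1.0            # length of the unbroken stick
        weights = []
        for _ in range(truncation - 1):
            v = rng.betavariate(1.0, alpha)  # Beta(1, alpha) break
            weights.append(v * remaining)
            remaining *= 1.0 - v
        weights.append(remaining)  # last component takes the rest
        return weights             # sums to 1

    print(stick_breaking_weights(alpha=1.0, truncation=10))

Fixing the truncation level yields a finite-dimensional family over which deterministic optimization can run quickly, in contrast to the long simulation runs MCMC requires.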

