CMU - IR Discussion Series

Wednesday, December 1, 2004 - 3:00, WeH 4625
Title: Probabilistic Models of Text and Images
Speaker: David Blei

Abstract:

Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, sounds, and genetic information have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. In this talk, I will describe a suite of probabilistic models of information collections, for which the above problems can be cast as statistical queries.
I will describe the use of graphical models as a flexible, modular framework for the representation of modeling assumptions. Fast approximate posterior inference algorithms based on variational methods allow us to specify complex Bayesian models, even in the face of large datasets.
With this framework in hand, I will describe latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits a finite index of hidden topics that describe the underlying documents. New documents are then situated within the collection via approximate posterior inference of their associated index terms. Extensions to LDA can index a set of images, or multimedia collections of interrelated text and images.
Finally, I will describe nonparametric Bayesian methods for relaxing the assumption of a fixed number of topics, and develop models based on the natural assumption that the size of the index can grow with the collection. This idea is extended to trees, and to models that represent the hidden structure and content of a topic hierarchy underlying a collection.
Joint work with Michael Jordan, Andrew Ng, Thomas Griffiths, and Josh Tenenbaum.