Wednesday, March 8, 2006 - 12:00, NSH 3002
Title: Structured and Dynamic Topic Models
Speaker: John Lafferty

Abstract:

A surge of recent research in machine learning and statistics has developed new techniques for automatically finding patterns of words in document collections using hierarchical probabilistic models. These models are called "topic models" because the word patterns often reflect the underlying topics that are combined to form the documents; however, topic models also apply naturally to data such as images and biological sequences. While previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. We describe new work on probabilistic models designed to capture the dynamics of the topics as they evolve over time. Traditional time series modeling has focused on continuous data, but topic models are designed for categorical data. Our approach is to use state space models on the natural parameter space of multinomial and logistic normal distributions that represent topic models as points on a high-dimensional probability simplex over the word vocabulary. Due to the nonconjugacy of the Gaussian and multinomial models, posterior inference is intractable, and we develop variational approximations based on Kalman filters and nonparametric wavelet regression to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the contents of a large document collection, allowing a user to explore the structure of the corpus in a topic-guided fashion. We demonstrate the capabilities of these new models on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on the noisy text produced by an optical character recognition engine run over the original bound journals by JSTOR, the online scholarly journal archive.

This is joint work with David Blei.