Probability and Structure in Natural Language Processing



This is a crash course covering recent developments at the junction of NLP and machine learning that most NLP researchers did not learn in graduate school. The goal is to make it easier for NLP researchers to follow relevant research in machine learning, and to contribute to the growing body of research that uses advanced statistical modeling techniques to solve hard language processing problems.


The tutorial will be broken into three parts, with a couple of exercises and open discussion sessions interspersed. The outline below is ambitious and subject to change; some topics may be referenced only in brief.

Probabilistic Graphical Models

Probabilistic graphical models are a major topic in machine learning. They provide a foundation for statistical modeling of complex data, and starting points (if not full-blown solutions) for inference and learning algorithms. They generalize many familiar methods in NLP. We'll cover:
  1. Bayesian networks: representations (graph vs. probability distribution, independence assumptions), reading off independence assumptions (d-separation, etc.)
  2. Markov networks: representations, factors, independence assumptions
  3. Relationships between BNs and MNs; review of special cases from NLP (maxent, HMMs, etc.)
  4. What is inference? Formulating the main inference problems (MAP, marginal) and approaches, at a high level (variable elimination, loopy belief propagation, mean-field variational inference, Markov chain Monte Carlo)
  5. Probabilistic learning/estimation as a kind of inference: MLE and MAP
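To make topics 1 and 4 concrete, here is a minimal sketch of a Bayesian network and brute-force marginal inference. The network, its variable names, and all probabilities are illustrative, not taken from the tutorial materials; on a toy problem, exhaustive enumeration plays the role that variable elimination plays at scale.

```python
# A toy Bayesian network: Rain -> WetGrass <- Sprinkler (names and numbers
# are illustrative). The joint factorizes as P(R) * P(S) * P(W | R, S).
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}
p_wet = {  # P(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(r, s, w):
    """Joint probability read off the network's factorization."""
    pw = p_wet[(r, s)]
    return p_rain[r] * p_sprinkler[s] * (pw if w else 1.0 - pw)

# Marginal inference by enumeration: P(Rain=True | WetGrass=True).
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)
```

Note that Rain and Sprinkler are marginally independent in this graph but become dependent once WetGrass is observed, the "explaining away" pattern that d-separation captures.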

Linear Structure Models

Most problems in linguistic structure prediction are currently solved by applying discrete optimization techniques (dynamic programming, search, and others) to identify a structure that maximizes some score, given an input. We describe a few ways to think about the problem of prediction itself, and review key approaches to learning structured prediction models.
  1. What we mean by "structure": trees, sequences, alignments
  2. HMMs as dynamic Bayesian networks; inference algorithms under the graphical models view
  3. The features/linear model view of decoding
  4. Five views of decoding
  5. Empirical risk minimization: connecting generative models (e.g., HMMs and PCFGs) to structured perceptron, CRFs, and structured max margin
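As a small illustration of decoding as score maximization (topics 3 and 4), here is a Viterbi sketch for a two-tag HMM-style sequence labeler. The tag set, vocabulary, and parameter values are invented for illustration; in log space, finding the highest-probability tag sequence is exactly maximizing a linear score that decomposes over transition and emission features.

```python
import math

# Toy parameters (illustrative, not from the tutorial): two tags, a tiny
# vocabulary, and transition/emission probabilities.
tags = ["N", "V"]
trans = {("<s>", "N"): 0.9, ("<s>", "V"): 0.1,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "time"): 0.6, ("V", "time"): 0.1,
        ("N", "flies"): 0.2, ("V", "flies"): 0.6}

def viterbi(words):
    """Return the tag sequence maximizing the summed log score."""
    # delta[i] maps tag -> (best log score of any path ending there, backpointer)
    delta = [{t: (math.log(trans[("<s>", t)]) + math.log(emit[(t, words[0])]),
                  None)
              for t in tags}]
    for w in words[1:]:
        delta.append({t: max((delta[-1][p][0] + math.log(trans[(p, t)])
                              + math.log(emit[(t, w)]), p)
                             for p in tags)
                      for t in tags})
    # Follow backpointers from the best final tag.
    best = max(tags, key=lambda t: delta[-1][t][0])
    path = [best]
    for layer in reversed(delta[1:]):
        path.append(layer[path[-1]][1])
    return path[::-1]

print(viterbi(["time", "flies"]))  # -> ['N', 'V']
```

Replacing the log probabilities with arbitrary feature weights leaves the algorithm unchanged, which is the bridge from generative HMMs to the discriminative structured models in topic 5.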

Incomplete Data

Since we will never have as much annotated linguistic data as we would like for all the languages, domains, and genres in which we want to do NLP, semisupervised and unsupervised learning have become hugely important. We show how the foundations from the first two parts can be extended to provide a framework for learning with incomplete data.
  1. Motivation of hidden structure modeling in NLP: examples of incomplete data
  2. Review of EM, a kind of inference
  3. What does "Bayesian" mean? Priors over parameters; Bayesian learning is still inference
  4. From EM to variational EM, and empirical Bayesian learning
  5. Sampling in the Bayesian, structured setting
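Topic 2's view of EM can be sketched with the standard two-coin textbook example (the data and initialization below are illustrative, not from the tutorial): each batch of ten flips comes from one of two coins with unknown biases, and the identity of the coin is the hidden variable.

```python
# EM for a two-coin mixture (illustrative numbers). Each entry of `data` is
# the number of heads observed in a batch of n flips from an unknown coin.
data = [5, 9, 8, 4, 7]
n = 10
theta = [0.4, 0.6]  # initial guesses for the two coins' biases

def likelihood(h, p):
    # Binomial likelihood up to a constant; the binomial coefficient
    # cancels when normalizing the posterior over coins.
    return p ** h * (1 - p) ** (n - h)

for _ in range(50):
    # E-step: posterior responsibility of each coin for each batch,
    # assuming a uniform prior over which coin was used.
    resp = []
    for h in data:
        l0, l1 = likelihood(h, theta[0]), likelihood(h, theta[1])
        resp.append((l0 / (l0 + l1), l1 / (l0 + l1)))
    # M-step: re-estimate each bias from expected head and flip counts.
    for k in range(2):
        heads = sum(r[k] * h for r, h in zip(resp, data))
        flips = sum(r[k] * n for r in resp)
        theta[k] = heads / flips

print(theta)
```

The E-step is itself an inference problem (a posterior over the hidden coin), which is the sense in which EM is "a kind of inference"; variational EM and sampling replace that exact E-step when the posterior is intractable.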


The exercise is released Monday afternoon (midway through lecture 2) and is due at 5pm on Tuesday (after lecture 4). Download the exercise here.


Koller, Daphne and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Smith, Noah A. 2011. Linguistic Structure Prediction. Morgan & Claypool.