This course seeks to cover statistical modeling techniques for
discrete, structured data such as text. It brings together content
previously covered in Language and Statistics 2 (11-762) and
Information Extraction (10-707 and 11-748), and aims to define a
canonical set of models and techniques applicable to problems in
natural language processing, information extraction, and other
application areas. Upon completion, students will have a broad
understanding of machine learning techniques for structured outputs,
will be able to develop appropriate algorithms for use in new
research, and will be able to critically read related literature. The
course is organized around methods, with example tasks introduced
throughout.
We expect that the course will be of interest not only to LTI and MLD students, but also to students in the Lane Center, RI, and CSD.
Format
Synthesis (45%): Students will be expected to make
substantial contributions to a public wiki that synthesizes material
covered in the course. The wiki environment allows students to
leverage each other's contributions and illustrate connections across
different research ideas. It allows the instructors to track each
contributions. It also provides a resource to students and
researchers at CMU and elsewhere; each year the wiki will evolve to
cover more recent literature. Most of the synthesis requirement will be due before the middle of the semester.
Practicum (45%): Students will work in
small teams to complete projects on relevant topics of their choosing.
Projects are expected to directly explore ideas covered in the course,
and to have an experimental component. The use and extension of
existing libraries and tools is encouraged. Project proposals must
include milestones, and regular graded reports will discourage
procrastination. The project will be a greater focus in the second half of the course.
Oral presentation (10%): At the end of the course,
each team will either give a brief oral presentation or present at a
poster session, depending on course size.
No midterm or final exam is planned.
Topics
Subject to change. Parenthesized numbers are approximate numbers of lectures.
Sequence models: HMMs and MEMMs for part-of-speech tagging, BIO tagging/chunking, and segmentation; CRFs; cyclic models and pseudolikelihood (3)
Large margin models: structured and ranking perceptrons; structured SVMs (3)
Inference: dynamic programming; search; integer linear programming; stacking and Searn (3)
Tree models: PCFGs and phrase-structure parsing; spanning trees and dependency parsing (3)
Kernels: kernels for inputs and relation extraction; kernels for outputs, reranking, and non-local features (2)
Alignment models: edit distances for text or genomics; weighted FSTs; non-monotonic alignment and machine translation (3)
Incomplete data: EM; latent-variable CRFs and SVMs; regularizers from unlabeled data; graph-based semisupervised learning; associative Markov networks; Bayesian grammars (6)
Readings
Many lectures will be accompanied by a reading from
recent literature (typically machine learning or natural language
processing publications from the past decade). Supplementary readings
will be suggested from Prof. Smith's book, Linguistic Structure
Prediction.