Latent-Variable Models for Natural Language Processing

Dan Klein
University of California, Berkeley

Abstract: Language is complex, but our labeled data sets generally aren't. For example, treebanks specify coarse categories like noun phrases, but they say nothing about richer phenomena like agreement, case, definiteness, and so on. One solution is to use latent-variable methods to learn these underlying complexities automatically. In this talk, I will present several latent-variable models for natural language processing which take such an approach.

In the domain of syntactic parsing, I will describe a state-splitting approach which begins with an X-bar grammar and learns to iteratively refine grammar symbols. For example, noun phrases are split into subjects and objects, singular and plural, and so on. This splitting process in turn admits an efficient coarse-to-fine inference scheme, which reduces parsing times by orders of magnitude. Our method currently produces the best parsing accuracies in a variety of languages, in a fully language-general fashion. The same techniques can also be applied to acoustic modeling, where they induce latent phonological patterns.

In the domain of machine translation, we must often analyze sentences and their translations at the same time. In principle, analyzing two languages should be easier than analyzing one: it is well known that two predictors can work better when they must agree. However, "agreement" across languages is itself a complex, parameterized relation. I will show that, for both parsing and entity recognition, bilingual models can be built from monolingual ones using latent-variable methods -- here, the latent variables are bilingual correspondences. The resulting bilingual models are substantially better than their decoupled monolingual versions, giving both error rate reductions in labeling tasks and BLEU score increases in machine translation.

Bio: Dan Klein is an assistant professor of computer science at the University of California, Berkeley (PhD Stanford, MSt Oxford, BA Cornell). His research focuses on statistical natural language processing, including unsupervised learning methods, syntactic parsing, information extraction, and machine translation. Academic honors include a Marshall Fellowship, a Microsoft New Faculty Fellowship, the ACM Grace Murray Hopper Award, and best paper awards at the ACL, NAACL, and EMNLP conferences.
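
As a rough illustration of the state-splitting idea described in the abstract, the sketch below splits every nonterminal of a toy PCFG into latent subsymbols and divides each rule's probability among the refined variants, with small random noise so a later EM pass could break the symmetry and specialize the subsymbols (e.g., NP-0 toward subjects, NP-1 toward objects). The grammar format, function names, and noise scheme here are hypothetical placeholders, not the talk's actual implementation; the EM refitting and the coarse-to-fine step (parsing with the coarse grammar first and pruning refined chart cells with low coarse posteriors) are omitted.

    import itertools
    import random
    from collections import defaultdict

    def split_grammar(rules, n_splits=2, noise=0.01, seed=0):
        """Split every PCFG nonterminal into n_splits latent subsymbols.

        `rules` maps (parent, children_tuple) -> probability. Any symbol
        that ever appears as a parent is treated as a nonterminal;
        everything else is a terminal and is left unsplit.
        """
        rng = random.Random(seed)
        nonterminals = {parent for parent, _ in rules}

        def variants(symbol):
            if symbol in nonterminals:
                return [f"{symbol}-{i}" for i in range(n_splits)]
            return [symbol]

        refined = {}
        for (parent, children), prob in rules.items():
            child_tuples = list(itertools.product(*(variants(c) for c in children)))
            for sub_parent in variants(parent):
                for kids in child_tuples:
                    # Divide the coarse rule's mass among its refined
                    # variants, with a little noise so EM can break symmetry.
                    weight = (prob / len(child_tuples)) * (1 + noise * rng.uniform(-1, 1))
                    refined[(sub_parent, kids)] = weight

        # Renormalize so each refined parent's rules again sum to one.
        totals = defaultdict(float)
        for (parent, _), weight in refined.items():
            totals[parent] += weight
        return {rule: weight / totals[rule[0]] for rule, weight in refined.items()}

    # Toy X-bar-style grammar: one split round refines NP into NP-0 and
    # NP-1, which training could then specialize in different directions.
    rules = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("she",)): 0.5,
        ("NP", ("apples",)): 0.5,
        ("VP", ("eats", "NP")): 1.0,
    }
    for (parent, kids), weight in sorted(split_grammar(rules).items()):
        print(f"{parent} -> {' '.join(kids)}  {weight:.3f}")

Because each refined symbol remembers which coarse symbol it came from (its name prefix here), the refined grammar projects back onto the coarse one, which is exactly what makes coarse-to-fine pruning possible.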
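
The bilingual-agreement idea can likewise be sketched in miniature: two monolingual entity taggers are coupled through word alignments, and joint labelings where aligned tokens agree are rewarded. Everything below (the per-token probability tables, the fixed alignment set, the multiplicative agreement bonus, and the brute-force search) is an assumed simplification for illustration; the models in the talk treat the correspondences as latent variables and use structured inference rather than enumeration.

    import itertools
    import math

    def joint_decode(probs_a, probs_b, alignments, agreement_bonus=2.0):
        """Jointly label an aligned sentence pair in two languages.

        probs_a / probs_b: per-token dicts mapping label -> probability
            from each monolingual tagger.
        alignments: set of (i, j) index pairs linking token i in sentence A
            to token j in sentence B (the "bilingual correspondences").
        Scores a joint labeling by the product of monolingual probabilities,
        multiplied by agreement_bonus whenever aligned tokens agree.
        Brute force over all labelings, so only suitable for tiny examples.
        """
        labels = sorted(probs_a[0])
        best, best_score = None, -math.inf
        for la in itertools.product(labels, repeat=len(probs_a)):
            for lb in itertools.product(labels, repeat=len(probs_b)):
                score = sum(math.log(probs_a[i][l]) for i, l in enumerate(la))
                score += sum(math.log(probs_b[j][l]) for j, l in enumerate(lb))
                score += sum(math.log(agreement_bonus)
                             for i, j in alignments if la[i] == lb[j])
                if score > best_score:
                    best, best_score = (la, lb), score
        return best

    # Toy example: the second language's tagger is confident the aligned
    # token is a location, which pulls the uncertain first side along.
    probs_a = [{"LOC": 0.45, "O": 0.55}]
    probs_b = [{"LOC": 0.90, "O": 0.10}]
    print(joint_decode(probs_a, probs_b, {(0, 0)}))  # (('LOC',), ('LOC',))

The design point the sketch tries to capture is the one in the abstract: each monolingual model is kept intact, and the bilingual gain comes entirely from the parameterized agreement term over the latent correspondences.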