Internet search, speech recognition, machine translation, question answering, information retrieval, biological sequence analysis -- are all at the forefront of this century’s information revolution. In addition to their use of machine learning, these technologies rely heavily on classic statistical estimation techniques. Yet most CS and engineering undergraduate programs do not prepare students in this area beyond an introductory probability & statistics course. This course is designed to address this gap.
The goal of "Language and Statistics" is to ground the data-driven techniques used in language technologies in sound statistical methodology. We start by formulating various language technology problems in both an information theoretic framework (the source-channel paradigm) and a Bayesian framework (the Bayes classifier). We then discuss the statistical properties of words, sentences, documents and whole languages, and the various computational formalisms used to represent language. These discussions naturally lead to specific concepts in statistical estimation.
Topics include: Zipf's distribution and type-token curves; point estimators, Maximum Likelihood estimation, bias and variance, sparseness, smoothing and clustering; interpolation, shrinkage, and backoff; entropy, cross entropy and mutual information; decision tree models applied to language; latent variable models and the EM algorithm; hidden Markov models; exponential models and the maximum entropy principle; semantic modeling and dimensionality reduction; probabilistic context-free grammars and syntactic language models.
“Language and Statistics” is designed for LTI and other SCS graduate students, but others are also welcome. CS undergraduate upperclassmen who have taken it have done well, although they found it challenging.
The Master-level version of this course (11-661) does not require the course project.
Prerequisites: Strong quantitative aptitude. Familiarity and comfort with basic undergraduate-level probability. Some programming skill.
Last
modified: Aug 27, 2017