This course introduces some of the central themes and techniques that have emerged in statistical methods for language technologies and natural language processing. While many early NLP systems relied heavily on hand-crafted rules, during the past ten years a great deal of progress has been made using probabilistic methods that automatically and implicitly learn about language by extracting statistics from large quantities of text, thus reducing the knowledge acquisition bottleneck. As computational power increases and more natural language data becomes available online, statistical methods will become increasingly attractive and powerful.
Topics include the source-channel paradigm from information theory, predictive language models, hidden Markov models, the EM algorithm in its many guises, maximum entropy methods, and classification and regression techniques. Selected case studies involving technologies such as word and document clustering, sense disambiguation, parsing, text classification, and machine translation are presented. The material draws upon machine learning, statistics, and information theory, but only an elementary knowledge of probability is a prerequisite for the course.
Last modified: Fri Jan 12 16:41:40 EST 2007