Advanced Statistical Language Processing: Reading the Web (10-709)

Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Fall 2009


Instructor: Tom Mitchell, GHC 8211, x8-2611

Course administrative assistant: Sharon Cavlovich, GHC 8215, x8-5196

Class lectures: Thursdays 3:00pm-4:50pm, Gates-Hillman Center, 4211

This is an advanced, research-oriented course on statistical natural language processing. Students and the instructor will work together to understand and extend state-of-the-art machine learning algorithms for information extraction, named entity extraction, co-reference resolution, and related natural language processing tasks. The course involves two primary activities: reading and discussing current research papers in this area, and developing a novel approach to continuous learning for natural language processing. More specifically, as a class we will build components of a computer system that runs for many days on a large computer cluster holding 200 million web pages and performs two tasks: (1) extracting factual content from unstructured and semi-structured web pages, and (2) continuously learning to improve its competence at information extraction. We will begin the course with a running prototype system, described at http://rtw.ml.cmu.edu/readtheweb.html. During the course, students will help extend and populate this system with additional statistical learning methods that enable it to extract additional kinds of information from the web and to continuously improve its capabilities.

Class Wiki:


Data and Software:

Class slides: