Read The Web

course 10-709, Spring 2006

(alternative title: Advanced Statistical Language Processing)

Machine Learning Department
School of Computer Science
Carnegie Mellon University

This is an advanced, research-oriented course on statistical natural language processing.   Students and the instructors will work together to understand, implement, and extend state-of-the-art machine learning algorithms for information extraction, named entity extraction, co-reference resolution, and related natural language processing tasks.   The course will involve two primary activities: reading and discussing current research papers in this area, and developing a novel approach to continuous learning for natural language processing.   More specifically, as a class we will work together toward designing and building a computer system that runs 24 hours/day, 7 days/week, performing two tasks: (1) extracting factual content from unstructured and semi-structured web pages, and (2) continuously learning to improve its competence at information extraction.  

Instructors: Tom Mitchell, William Cohen, Scott Fahlman, Eric Nyberg
Course secretary: Sharon Cavlovich, Wean Hall 5315, x8-5196
Class meetings: Thursdays, 3:00-5:00pm, Wean 5409
Course mailing list:
Office hours: Tom Mitchell by appointment with

KIVA Discussion Site: we're using this site for discussions about specific projects, meeting scheduling, etc.

Round 2 Project plans:
Software: for collecting web pages, training text classifiers, storing facts in a knowledge base, and more...

Data sets: including 10^5 web pages collected from CS and Biology departments

Reading list: on bootstrap learning for natural language processing

Project Reading lists: links to reading lists for individual project groups

Resources: a list of candidate resources (e.g., search software, WordNet) that may be useful in our system.

Task List: To succeed as a group, we'll have to accomplish a number of important tasks beyond individually learning the material and developing our individual components of the system. These tasks range from serving as a class consultant who helps others use the Minorthird system to helping manage the course website. Everybody is expected to sign up for something. Please sign up now.

Round 1 Projects: look here for advice on round 1 project proposals, due January 26

Tentative course schedule:
During most class meetings we will spend part of the class studying one or more approaches to semi-supervised learning, and part of the class on design and design reviews of the ReadTheWeb system we're building. 

The following is a partial outline of topics/assignments/handouts for upcoming class sessions.  To be updated as we go.

Here is some of the research we may cover:
-Large scale web information extraction [Etzioni, et al. 05]
-Bootstrap learning from the web [Brin, 1999]
-Cotraining for web classification [Blum&Mitchell 98]
-Bootstrapping for natural language learning [Eisner&Karakos, 05]
-Semi-supervised learning for named entity extraction [Collins&Singer 99; Jones 05]
-Automatic learning of hypernyms [Ng, 05]
-Extracting information about people and publications from the web [McCallum, 05]
-Wrapper induction for extraction from structured web pages [Muslea et al., 01; Mohapatra et al. 04]
-Learning to disambiguate word senses [Yarowsky 96]
-Discovering new word senses [Pantel&Lin 02]
-Synonym and ontology discovery [Lin et al., 03]
-Relation extraction [Yangarber et al. 00]
-Statistical parsing [Collins, et al. 05]
-Graphical models for information extraction [Rosario, 05]
-Latent Dirichlet Allocation [Blei, 02]

more coming soon...
This file is located at /afs/cs/project/theo-21/www/index.html. 
It was created using NVU.
Tom Mitchell, January 20, 2006.