Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration

Notice: this page is for the course as taught in spring 2004. An updated syllabus for 2007 is available.

Instructor and Venue

Instructor: William Cohen, CALD
When/where: MW 10:30-11:50, Wean Hall 4601
Course Number: 10-707, cross-listed in LTI as 11-748

Announcements: On Wed April 28, all project groups will give a 10-minute presentation of their work.

Supplemental material: Instructions for installing minorthird; Vitor's also maintaining an FAQ for minorthird.


Information extraction is the task of finding names of entities in unstructured or partially structured text and determining the relationships that hold between these entities. Information integration is reasoning with data taken from multiple sources. Together these techniques let one automatically perform the tremendously challenging task of deriving structured information from text and relating it to previously known facts.

The course will discuss many of the sub-problems involved in information extraction and integration, and the techniques required to solve them. We will consider the problems of text segmentation, relational learning, classification of text segments, finding and clustering of similar records, and reasoning with objects whose identity is uncertain. We will survey a variety of learning techniques that have been used on these problems, including rule-learning, boosting, semi-supervised learning, finite-state sequential classification methods (such as conditional Markov models and conditional random fields), character-based edit distances and adaptive generative models for modifying them, and other topics as time allows.

Readings will be based on research papers. Grades will be based on class participation, paper presentations, and a project.

More specifically, students will be expected to:

Prerequisites: a machine learning course (e.g., 15-781 or 15-681) or consent of the instructor.



Lecture: (Jan 12) Overview of IE. Some longer overview slides are available on my web page, from researcher tutorials given at NIPS-2002 and KDD-2003. (Jan 14) Overviews of some of my own older work on information integration, and of some of my more recent work comparing different string distance metrics. Don't miss the example of TFIDF matching that didn't fit in my old PDF presentation.


Information Extraction as Classifying Text Segments

Lecture: (Jan 19) A discussion of key points from Jansche and Abney, and Cohen et al. (Jan 21) Overviews of the Califf and Mooney paper and the Cohen et al paper. Pradeep will also present a summary of the Collins and Singer paper.


IE as Boundary Detection

Lecture: (Jan 27) A discussion of Kushmerick's AIJ 2000 journal paper and Kushmerick and Freitag's BWI paper; I'll also try again to get to a presentation of the Cohen et al wrapper-learning paper.


IE as Sequential Token Classification: HMMs

Lecture: (2/4) A guest lecture by Sunita Sarawagi, focusing on the Borkar et al paper. (Notice that I've added this to the readings for this week, and made the Leek paper "optional".)
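To make the HMM view of extraction concrete, here is a minimal sketch of Viterbi decoding, which labels each token with the state on the most probable path through the model. The state names, the toy address tokens, and all probabilities below are hypothetical and chosen only for illustration; they are not taken from the readings, and the fixed "unseen" emission probability is a crude stand-in for real smoothing.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence for tokens under an HMM.

    Scores are kept in log space to avoid underflow. Tokens missing from a
    state's emission table get the small probability `unseen` (an assumption).
    """
    if not tokens:
        return []
    # scores[t][s] = best log-probability of any path ending in state s at time t.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], unseen))
               for s in states}]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for tok in tokens[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            col[s] = (prev[best] + math.log(trans_p[best][s])
                      + math.log(emit_p[s].get(tok, unseen)))
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy model for segmenting an address-like string.
STATES = ["NUM", "STREET", "CITY"]
START_P = {"NUM": 0.8, "STREET": 0.1, "CITY": 0.1}
TRANS_P = {
    "NUM":    {"NUM": 0.10, "STREET": 0.80, "CITY": 0.10},
    "STREET": {"NUM": 0.05, "STREET": 0.55, "CITY": 0.40},
    "CITY":   {"NUM": 0.05, "STREET": 0.05, "CITY": 0.90},
}
EMIT_P = {
    "NUM":    {"5000": 0.5, "4601": 0.5},
    "STREET": {"forbes": 0.4, "avenue": 0.4, "wean": 0.2},
    "CITY":   {"pittsburgh": 0.9, "forbes": 0.1},
}

if __name__ == "__main__":
    tokens = "5000 forbes avenue pittsburgh".split()
    print(list(zip(tokens, viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P))))
```

In a real system the transition and emission tables would be estimated from labeled data rather than written by hand.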


IE as Sequential Token Classification: Other Directed Graphical Models

Lecture: (2/9) Comments on the Ratnaparkhi paper and the Freitag et al paper; presentation from Tal Blum.


Lecture: (2/11) Comments on the Borthwick et al paper and the Mikheev et al papers; presentation from Bing Zhao.


IE as Sequential Token Classification: "Undirected" Graphical Models

Lecture: (2/18) Comments on the Lafferty et al paper and the Sha and Pereira paper; presentation from Luca.

Lecture: (2/23) Comments on the Klein and Manning et al paper and the Toutanova paper.


IE as Sequential Token Classification: Margin-based Methods

Lecture: (2/25) Finishing up Klein and Manning et al, and some background on max-margin learning.

Lecture: (3/1) Guest lecture from Russ Greiner (U Alberta) on Web-IC.

Lecture: (3/3) Comments on the Collins paper and the Altun et al paper.


Information Integration: Distance Metrics for Text

Lecture: (3/15) An overview of edit-distance computations and comments on the Monge-Elkan paper.
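As a point of reference for the edit-distance material, the following is a minimal sketch of the classic unit-cost (Levenshtein) edit distance computed by dynamic programming. It is the textbook baseline, not the Monge-Elkan method; the function name and the unit costs are choices made for this illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (all operations cost 1)."""
    # prev[j] = distance between the first i-1 characters of s
    # and the first j characters of t; only two rows are kept in memory.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(levenshtein("William Cohen", "Wiliam Cohen"))
```

Variants discussed in the literature typically change the cost model (for example, cheaper gaps or character-specific substitution costs) while keeping the same dynamic-programming structure.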

Lecture: (3/17) More on edit-distance computations; TFIDF distances for data integration.

Lecture: (3/22) Review of various distance metrics and comparative experiments with different metrics.
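For readers new to token-based matching, here is a minimal sketch of TFIDF cosine similarity between short records such as names. The weighting used, log(1 + tf) * log(1 + N/df), is one common variant chosen for this illustration and is an assumption, as are the whitespace tokenizer and the function names.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    """Return one unit-length TFIDF vector (dict: token -> weight) per record."""
    token_lists = [r.lower().split() for r in records]
    n = len(token_lists)
    # Document frequency: the number of records that contain each token.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Rare tokens (small df) receive larger weights than common ones.
        vec = {tok: math.log(1 + tf[tok]) * math.log(1 + n / df[tok]) for tok in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

if __name__ == "__main__":
    records = ["William Cohen", "William W. Cohen", "Carlos Guestrin"]
    vecs = tfidf_vectors(records)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Unlike edit distance, this measure ignores token order and character-level typos, but it downweights tokens that appear in many records.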


IE with Large Dictionaries

Lecture: (3/25) Guest lecture from Carlos Guestrin on Max Margin Markov networks.

Lecture: (3/29) Review of previous remarks, and comments on the Krauthammer et al paper.

Lecture: (3/31) Comments on the Bunescu et al and Cohen and Sarawagi papers.

Lecture: (4/5) Additional comments on the Cohen and Sarawagi paper.


Information Integration: Learning Distance Metrics

Lecture: (4/7 and 4/12) Learning Edit Distances with Pair HMMs


Information Integration: Reasoning with Uncertain Objects and/or Extracting Facts

Last modified: Wed Oct 26 12:51:13 Eastern Daylight Time 2011