Spring 2007
Information Extraction:
Machine Learning Approaches to Extracting Structured Information from Text

Instructor and Venue

Instructor: William Cohen, Machine Learning Dept and LTI
TA: Vitor Carvalho (Office Hours: Tuesdays/Thursdays, email to schedule)
When/where: Tues/Thurs 12-1:20, Wean Hall 4615a
Course Number: 10-707, cross-listed in LTI as 11-748
Syllabus: below

Announcements:

Description

Information extraction is finding names of entities in unstructured or partially structured text, and determining the relationships that hold between these entities. More succinctly, information extraction is the problem of deriving structured factual information from text.

This course considers the problem of information extraction from a machine-learning perspective. We will survey a variety of learning methods that have been used for information extraction, including rule-learning, boosting, and sequential classification methods such as hidden Markov models, conditional random fields, and structured support vector machines. We will also look at experimental results from a number of specific information extraction domains, such as biomedical text, and discuss semi-supervised "bootstrapping" learning methods for information extraction.

A syllabus is below. You can also find a complete syllabus with slides for the course as last taught (Spring 2004). The Spring 2007 course will concentrate less on information integration and will cover more topics in information extraction. There will also be a focus on techniques for structured learning.

Readings will be based on research papers. Grades will be based on class participation, paper presentations, and a project. More specifically, students will be expected to:

Prerequisites: a machine learning course (e.g., 15-781 or 15-681) or consent of the instructor.

Syllabus

Overview/Survey of Information Extraction

Lectures: (Slides will be posted after each class).

Readings:

NER by Classifying Candidate Text Segments or Tokens

Lectures:

Readings:

NER as Sequential Token Classification with Graphical Models - 1 (HMMs and CMMs)

Lectures:

Readings:

NER as Sequential Token Classification with Graphical Models - 2 (CRFs)

Lectures:

Readings:

CRFs, CMMs, and Dependency networks

Lectures:

Readings:

Long-range dependencies in NER/Margin Methods

Lectures:

Readings:

Sequential Classification with Margin-based Methods

Lectures:

Readings:

Spring Break!


From Entities to Facts

Lectures:

Readings:

Bootstrapping

Lectures:

Readings:

Similarity and Information Extraction

Readings (none required):

Project presentations


Notice: Classroom activities may be taped or recorded by a student for the personal use of that student or for all students presently enrolled in the class only, but may not be further copied, distributed, published or otherwise used for any other purpose without the express written consent of Dr. Cohen. Do not leave small children unattended. This syllabus should not be used as a flotation device.
Last modified: Mon Mar 26 09:48:12 EDT 2007