Advanced Statistical Language Processing: Reading the Web (10-709)

Homework 1

Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Fall 2009

This assignment is intended to (1) introduce you to some of the large-scale data we have available to build on, and (2) give you a chance to do something interesting with it.

The task: Download the data describing the co-occurrence counts for noun phrases and contexts, and do something interesting with it. For example, you might train a classifier to determine which noun phrases refer to cities, emotions, or academic disciplines, based on the "bag of contexts" with which each noun phrase co-occurs. Or you might try unsupervised clustering of some kind. Choose something that you find interesting, that is not overly ambitious for a one-week task, and that lets you explore working with the data.
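To make the "bag of contexts" idea concrete, here is a minimal sketch of one possible approach, not a prescribed solution. It assumes the co-occurrence data can be read as tab-separated lines of the form noun phrase, context, count; that layout, the toy counts, the seed lists, and all function names below are hypothetical and chosen only for illustration. Each noun phrase becomes a sparse vector of context counts, each category is summarized by the centroid of a few seed noun phrases, and a new noun phrase is assigned to the category whose centroid is most similar under cosine similarity.

```python
import math
from collections import defaultdict


def load_cooccurrences(lines):
    """Parse 'noun_phrase<TAB>context<TAB>count' lines (an assumed format)
    into a mapping {noun_phrase: {context: count}}."""
    vectors = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        noun_phrase, context, count = line.split("\t")
        row = vectors[noun_phrase]
        row[context] = row.get(context, 0.0) + float(count)
    return dict(vectors)


def norm(vec):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    return dot / (nu * nv)


def centroid(vecs):
    """Sum of length-normalized vectors, so frequent noun phrases do not dominate."""
    total = defaultdict(float)
    for vec in vecs:
        length = norm(vec) or 1.0
        for c, w in vec.items():
            total[c] += w / length
    return dict(total)


def train(vectors, seeds):
    """Build one centroid per category from seed noun phrases.
    seeds: {label: [noun_phrase, ...]} (hypothetical hand-picked examples)."""
    return {
        label: centroid([vectors[p] for p in phrases if p in vectors])
        for label, phrases in seeds.items()
    }


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy data, invented for illustration; not taken from the course data set.
toy_lines = [
    "Pittsburgh\tmayor of _\t12",
    "Pittsburgh\tflights to _\t30",
    "Boston\tmayor of _\t9",
    "Boston\tflights to _\t41",
    "joy\tfeelings of _\t17",
    "joy\ttears of _\t25",
    "anger\tfeelings of _\t22",
    "anger\toutburst of _\t8",
    "Chicago\tflights to _\t50",
    "Chicago\tdowntown _\t14",
    "sadness\tfeelings of _\t19",
    "sadness\ttears of _\t6",
]

vectors = load_cooccurrences(toy_lines)
centroids = train(vectors, {"city": ["Pittsburgh", "Boston"],
                            "emotion": ["joy", "anger"]})
for phrase in ("Chicago", "sadness"):
    print(phrase, classify(vectors[phrase], centroids))
```

On the real data you would likely want to reweight raw counts (for example with TF-IDF or pointwise mutual information) and hold out labeled noun phrases to measure accuracy. The same sparse vectors can also serve as input to a clustering method if you prefer the unsupervised route.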

What to turn in:

Hints:


On working alone versus in pairs: In general, it's fine to work in pairs or alone on projects for this class.  However, for this first assignment I'd like everybody to become familiar with the data sets.  So feel free to brainstorm with others in the class, but please do your own work for this assignment to be sure you learn about the data.