Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University
Fall 2009
This assignment is intended to (1) introduce
you to some of the large-scale data we have available to build on, and
(2) give you a chance to do something interesting with it.
The task:
Download the data
describing the co-occurrence counts for noun phrases and contexts.
Do something interesting with it. For example, you
might want to train a classifier to determine which noun phrases refer
to cities, or emotions, or academic disciplines, based on the "bag of
contexts" with which the noun phrase co-occurs. You might
want to try unsupervised clustering of some kind. Choose
something you find interesting, that's not overly ambitious for a one-week task, and that will allow you to explore
working with the data.
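As one possible starting point, the "bag of contexts" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the required approach: it assumes each line of the data has the hypothetical form noun phrase, tab, context, tab, count (check the actual file format before reusing it), and it uses a simple nearest-centroid rule with cosine similarity in place of any particular classifier. The function names and the toy counts are invented for the example.

```python
import math
from collections import Counter, defaultdict


def load_bags(lines):
    """Build {noun_phrase: Counter({context: count})}.

    Assumes a hypothetical tab-separated layout: noun_phrase, context, count.
    """
    bags = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        noun_phrase, context, count = line.split("\t")
        bags[noun_phrase][context] += float(count)
    return bags


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(bags, seeds):
    """Sum the context counts of the labeled seed noun phrases."""
    total = Counter()
    for noun_phrase in seeds:
        total.update(bags.get(noun_phrase, {}))
    return total


def classify(bags, seeds_by_label, noun_phrase):
    """Return the label whose seed centroid is most similar to the noun phrase."""
    bag = bags.get(noun_phrase, Counter())
    centroids = {label: centroid(bags, seeds)
                 for label, seeds in seeds_by_label.items()}
    return max(centroids, key=lambda label: cosine(bag, centroids[label]))


# Toy counts, invented for illustration only.
toy_lines = [
    "Pittsburgh\tmayor of _\t12",
    "Pittsburgh\tflights to _\t7",
    "Boston\tmayor of _\t9",
    "Boston\tflights to _\t15",
    "joy\tfeelings of _\t8",
    "joy\tfilled with _\t5",
    "anger\tfeelings of _\t11",
    "Chicago\tflights to _\t6",
    "Chicago\tfilled with _\t1",
    "sadness\tfeelings of _\t4",
]
bags = load_bags(toy_lines)
seeds = {"city": ["Pittsburgh", "Boston"], "emotion": ["joy", "anger"]}
print(classify(bags, seeds, "Chicago"))
print(classify(bags, seeds, "sadness"))
```

With real data you would probably want to reweight the raw counts (for example with TF-IDF or pointwise mutual information) and hold out some labeled noun phrases to measure accuracy.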
What to turn in:
Two PowerPoint slides that you will present in 3 minutes to
the class on Thursday, Sept 17. Email these to
tom.mitchell@cs.cmu.edu by 2pm the day of class, so he can include
your slides in the set to be presented. (Note that 3 minutes is a
very short time to speak, so consider in advance how you'll fit your
content into this short time window.)
A brief writeup of 1-3 pages. Please
email this to tom.mitchell@cs.cmu.edu by the end of the day on Friday,
Sept 18. This will give you a little time to address any
ideas that come up during your presentation, or connections to what
others reported.
Hints:
Please note that you can
find labeled examples of noun phrases and contexts associated with
categories such as city and emotion at the WSDM
supplemental materials page (see the link to "Instances
Promoted by Meta-Bootstrap Learner"). Alternatively, it is
easy to copy and paste lists of companies, animals, etc., by browsing
the MBL
knowledge base mentioned on that same page.
Here are two papers that might help stimulate your thinking.
On working alone versus
in pairs: In general, it's fine to work in pairs or alone on
projects for this class. However, for this first assignment
I'd like everybody to become familiar with the data sets. So
feel free to brainstorm with others in the class, but please do your
own work for this assignment to be sure you learn about the data.