Your Course Project
 
Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a real-world data set. Below, you will find some project ideas, but the best idea would be to combine machine learning with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.
Projects can be done by you as an individual, or in teams of two students. Each project will also be assigned a 701 instructor as a project consultant/mentor. They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 20% of your final class grade, and will have two final deliverables:
- 
    a writeup in the format of a NIPS paper (8 pages maximum in NIPS format, including references; this page limit is strict), due December 14th by 2pm (emailed to the instructors list, NO LATE SUBMISSION ACCEPTED since we need to get your grades in), worth 60% of the project grade, and 
- 
    a poster presenting your work for a special ML class poster session on November 30, 2-5pm in the NSH Atrium, worth 20% of the project grade. 
In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by November 12, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
You must
turn in a brief project proposal (1-page maximum) by October 17th.  
Read the list of available data sets and potential
project ideas below.  You are encouraged to use one of these
data sets, because we know that they have been successfully used for
machine learning in the past.   If you prefer to use a
different data set, we will consider your proposal, but you must have
access to this data already, and present a clear proposal for what you
would do with it.  
 
Project proposal format: Proposals should be one page maximum. Include the following information:
- 
    Project title 
- 
    Data set 
- 
    Project idea. This should be approximately two paragraphs. 
- 
    Software you will need to write. 
- 
    Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal. 
- 
    Teammate: will you have a teammate? If so, whom? Maximum team size is two students. 
- 
    November 12 milestone: What will you complete by November 12? Experimental results of some kind are expected here. 
Here are some details on the poster format.
- The two most common ways to make your poster are:
  - You can create a huge slide (in powerpoint, using beamer, etc) and print it out at one of the poster printers available (for SCS, details below). You'd want the shorter dimension to be less than 36'' since the special poster printers use paper rolls of width 36''.
- You can create a bunch of "normal" presentation slides, print out each one on a piece of (letter-sized) paper, and put them all together on a poster board. This is somewhat messier (and not as pretty) but you don't need to wait for the special poster printer to print.
 
- We will provide you with foam boards onto which you can tack on your poster (or your slides for the poster). The foam boards are 32'' by 40''.
- Printing: Unfortunately, we don't have a budget to pay for printing.
  If you are an SCS student,
SCS has a poster printer you can use which prints on a 36" wide
roll of paper.  You have to call Operations and request to be added to
the queue and they will call you back and tell you when you are able
to print.  It takes a while to print however especially if there are
people ahead of you already printing so you can't wait until the last
hour (or last day). But they are open 24/7.  The printer sometimes runs
out of paper, and you should really start early!!!  The number of
operations is 8-2608 or you can call the help desk (8-4231) to get the
number. 
 If you are a student outside SCS, you will need to check with your department to see if there are printing facilities for big posters (we're not sure what is offered outside SCS), or print a set of regular sized pages.
Datasets and project suggestions:
  Below
are descriptions of several data sets, and some suggested
projects.  The first few are spelled out in greater
detail.  You are encouraged to select and flesh out one of
these projects, or make up you own well-specified project using these
datasets.  If you have other data sets you would like to work
on, we would consider that as well, provided you already have access to
this data and a good idea of what to do with it.
 
Data Set Categories:
- fMRI Brain Imaging Data
- Image Data
- Sensor Network Data
- Text Data
- Datasets of Datasets
- Additional Data Sets
fMRI Brain Imaging Data
Brain imaging data (fMRI)
This data
set contains a time series of images of brain activation, measured
using fMRI, with one image every 500 msec. During this time, human
subjects performed 40 trials of a sentence-picture comparison task
(reading a sentence, observing a picture, and determining whether the
sentence correctly described the picture). Each of the 40 trials lasts
approximately 30 seconds. Each image contains approximately 5,000
voxels (3D pixels), across a large portion of the brain. Data is
available for 12 different human subjects.  
Available
software: we can provide
Matlab software for reading the data, manipulating and visualizing it,
and for training some types of classifiers (Gassian Naive Bayes, SVM). 
Project
A1:
Bayes network classifiers for fMRI
Project
idea: Gaussian Naïve
Bayes classifiers and SVMs have been used with this data to predict
when the subject was reading a sentence versus perceiving a picture.
Both of these classify 8-second windows of data into these two classes,
achieving around 85% classification accuracy [Mitchell et al, 2004].
This project will explore going beyond the Gaussian Naïve
Bayes classifier (which assumes voxel activities are conditionally
independent), by training a Bayes network in particular a TAN tree
[Friedman, et al., 1997]. Issues youll need to confront include which
features to include (5000 voxels times 8 seconds of images is a lot of
features) for classifier input, whether to train brain-specific or
brain-independent classifiers, and a number of issues about efficient
computation with this fairly large data set. Midpoint milestone: By
April 12 you should have run at least one classification algorithm on
this data and measured its accuracy using a cross validation test. This
will put you in a good position to explore refinements of the
algorithm, alternative feature encodings for the data, or competing
algorithms, by the end of the semester. Project: Reducing
dimensionality and classification accuracy.
Papers
to read: "Learning to Decode Cognitive States
from Brain Images," Mitchell et
al., 2004, "Bayesian Network Classifiers"
Friedman et al., 1997. 
Project
A2: Dimensionality reduction
for fMRI data
Project
idea:  Explore the
use of dimensionality-reduction methods to improve classification
accuracy with this data.  Given the extremely high dimension
of the input (5000 voxels times 8 images) to the classifier, it is
sensible to explore methods for reducing this to a small number of
dimension. For example, consider PCA, hidden layers of neural nets, or
other relevant dimensionality reducing methods.  PCA is an
example of a method that finds lower dimension representations that minimize error in
reconstructing the data. 
In contract, neural network hidden layes are lower dimensional
representations of the inputs that minimize
classification
error (but only find a local
minimum).  Does one of these work better?  Does it
depend on parameters such as the number of training examples?
Papers
to read: "Learning to Decode Cognitive States
from Brain Images," Mitchell et
al., 2004, papers and textbook on PCA, neural nets, or whatever you
propose to try.
Project
A3: Feature
selection/feature invention for fMRI classification.
Project
idea: 
As
in many high dimensional data sets, automatic selection of a subset of
features can have a strong positive impact on classifier
accuracy.  We have found that selecting features by the
difference in their activity when the subject performs the task,
relative to their activity while the subject is resting, is one useful
strategy [Mitchell et al., 2004].  In this project you could
suggest, implement, and test alternative feature selection strategies
(eg., consider  the incremental value of adding a new feature
to the current feature set, instead of scoring each feature independent
of other features that are being selected), and see whether you can
obtain higher classification accuracies.  
Alternatively, you could consider methods for synthesizing new features
(e.g., define the 'smoothed value' of a voxel in terms of a spatial
Gaussian kernel function applied to it and its neighbors, or define
features by averaging voxels whose time series are highly correlated). 
Papers
to read: "Learning to Decode Cognitive States
from Brain Images," Mitchell et
al., 2004, papers on feature selection
 
Image Data
Character recognition (digits) data
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have three datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
The second dataset is the now "classic" digit recognition task for outgoing mail zip codes:
http://yann.lecun.com/exdb/mnist/
The third (and most challenging) data set consists of scrambled text known as Captchas that were designed by Luis Von Ahn to be difficult to automatically recognize. For more about Captchas go to Wikipedia Article or Captcha.net where you will find several papers.
Project suggestions:
- 
    Learn a classifier to recognize the letter/digit 
- 
    Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.) 
- 
    Apply a clustering/dimensionality reduction algorithm on this data, see if you get better classification on this lower dimensional space. 
- Learn a classifier to decipher Captchas. You may want to begin by reading the following: You may want to begin by building a classifier to segment the image into seperate letters.
Image Segmentation Dataset
The goal is to segment images in a meaningful way.  Berkeley
collected three hundred images and paid students to hand-segment each
one (usually each image has multiple
hand-segmentations).   Two-hundred of these images
are training images, and the remaining 100 are test images. 
The dataset includes code for reading the images and ground-truth
labels, computing the benchmark scores, and some other utility
functions.  It also includes code for a segmentation
example.  This dataset is new and the problem unsolved, so
there is a chance that you could come up with the leading algorithm for
your project.
http://www.cs.berkeley.edu
Project
ideas:
Project B1:
Region-Based Segmentation 
Most segmentation algorithms have focused on segmentation based on
edges or based on discontinuity of color and texture.  The
ground-truth in this dataset, however, allows supervised learning
algorithms to segment the images based on statistics calculated over
regions.  One way to do this is to "oversegment" the image
into superpixels (Felzenszwalb 2004, code available) and merge the
superpixels into larger segments.  Come up with a set of
features to represent the superpixels (probably based on color and
texture), a classifier/regression algorithm (suggestion: boosted
decision trees) that allows you to estimate the likelihood that two
superpixels are in the same segment, and an algorithm for segmentation
based on those pairwise likelihoods. Since this project idea is fairly
time-consuming focusing on a specific part of the project may also be
acceptable.  
Milestone:
By April 12, you should be able to estimate the likelihood that two
superpixels are in the same segment and have a quantitative measure of
how good your estimator is.  You should also have an outline
of how to use the likelihood estimates to form the final
segmentation.  The rest of the project will involve improving
your likelihood estimation and your grouping algorithm, and in
generating final results.  
Papers
to read: Some segmentation
papers from Berkeley are available here 
Project B2:
Supervised vs. Unsupervised Segmentation Methods 
Write two segmentation algorithms (these may be simpler than the one
above): a supervised method (such as logistic regression) and an
unsupervised method (such as K-means). Compare the results of the two
algorithms. For your write-up, describe the two classification methods
that you plan to use.
Milestone:
By April 12, you should have completed at least one of your
segmentation algorithms and have results for that algorithm.  
Papers
to read: Some segmentation
papers from Berkeley are available here 
 
Object Recognition
The Caltech 256 dataset contains images of 256 object categories
taken at varying orientations, varying lighting conditions, and with
different backgrounds. 
http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Project
ideas: 
 
- 
    You can try to create an object recognition system which can identify which object category is the best match for a given test image. 
- 
    Apply clustering to learn object categories without supervision 
Face recognition data
There are two data sets for this problem. The first dataset contains 640 images of faces. The faces themselves are images of 20 former Machine Learning students and instructors, with about 32 images of each person. Images vary by the pose (direction the person is looking), expression (happy/sad), face jewelry (sun glasses or not), etc. This gives you a chance to consider a variety of classification problems ranging from person identification to sunglass detection. The data, documentation, and associated code are available here:
Available Software: The same website provides an implementation of a neural network classifier for this image data. The code is quite robust, and pretty well documented in an associated homework assignment.
The second data set consists of 2253 female and 1745 male rectified frontal face images scraped from the hotornot.com website by Ryan White along with user ratings of attractiveness. The data set can be found here:
Project ideas:
- 
    Try SVM's on this data, and compare their performance to that of the provided neural networks 
- 
    Apply a clustering algorithm to find "similar" faces 
- Learn a facial attractiveness classifier. A recent NIPS paper on the topic of predicting facial attractiveness can be found here.
Sensor Network Data
Sensor network data

Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th 2004.
http://www-2.cs.cmu.edu/~guestrin/Research/Data/
This is a "real" dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low.
Project ideas:
- 
    Compare regression algorithms 
- 
    Detect failed sensors 
- 
    Learn graphical models representing the correlations between measurements at different nodes 
- 
    Develop new distributed algorithms for solving a learning task on this data 
Papers:
- 
    http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf 
- 
    http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf 
Precipitation data
This dataset has includes 45 years of daily precipitation data from the Northwest of the US:
http://www.jisao.washington
Project ideas:
- 
    Weather prediction: Learn a probabilistic model to predict rain levels 
- 
    Sensor selection: Where should you place sensor to best predict rain 
Text Data
Netflix Prize Dataset
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05".  The data is available here: Netflix Prize
Project ideas:
Can you predict the rating a user will give on a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
Can you discover clusters of similar movies or users?
Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website.
Enron E-mail Dataset
The Enron E-mail data set contains about 500,000 e-mails from about 150 users.
The data set is available here: Enron Data
Project ideas:
Can you classify the text of an e-mail message to decide who sent it?
Twenty Newgroups text data
This data
set contains 1000 text articles posted to each of 20 online newgroups,
for a total of 20,000 articles.  For documentation and
download, see this website. 
This data is useful for a variety of text classification and/or
clustering projects.  The "label" of each article is which of
the 20 newsgroups it belongs to.  The newsgroups (labels) are
hierarchically organized (e.g., "sports", "hockey").
Available
software: The same website
provides an implementation of a Naive Bayes classifier for this text
data.  The code is quite robust, and some documentation is
available, but it is difficult code to modify.
Project
ideas: 
 
- 
    EM for text classification in the case where you have labels for some documents, but not for others (see McCallum et al, and come up with your own suggestions) 
- 
    Make up your own text learning problem/approach 
Papers:
 
WebKB Data Set
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
 
Project ideas:
- 
    Can you learn classifiers to predict the type of a webpage from the text? 
- 
  Can you improve accuracy by exploiting correlations between pages that point to each other? 
Papers:
Datasets of Datasets
Data 2.0
Context: There is a great deal of excitement in the research and business communities surrounding the idea of community-managed data analysis and visualization -- i.e. "Web 2.0 for Data". Example commercial sites include swivel.com, data360.com, and IBM's Many Eyes. Research to date has been mostly in the HCI community, eg. the work of Heer, et al in CHI 2007.
We have access to some 7,000 uploaded datasets from 
Swivel.com, which
are available under a Creative Commons license.  Given this
background, many interesting Machine Learning projects are possible. 
Note: the data will be posted here soon.
Project ideas:
- 
    Graph Selection: After a user uploads a dataset (e.g. a relational table) of many columns, the website should automatically present a set of interesting graphs to the user. From this "starting" set of graphs, the user may pick one and modify it to generate their graph of choice. So the initial choice of graphs should (a) cover a wide variety of graphs that could be "close" to the a user's desired graph, and (b) provide interesting or surprising graphs that may not have occurred to the user. One way to approach this problem is to treat the set of possible graphs as a vector space, and choose suggested graphs by picking clusterheads in this space. Challenges include defining a distance metric in this space, and tuning a clustering algorithm to pick good graphs in the space. 
- 
  Suggested Graph Comparisons: Suppose you upload a spreadsheet to a Data 2.0 site, and choose a graph of interest. The site should suggest other graphs from the database that might combine with yours in an interesting way. For example, http://www.swivel.com/graphs/show/1001967 is an example that shows an apparent anti-correlation between wine- drinking and crime rates. This example compares graphs based mostly on correlation in the data, but other features for comparison may include textual tags from column and graph names, as well as user comments on the graphs and other data. The definition of a good utility function for suggested graphs is an important challenge in this project. 
- 
  (Semi-) Automatic relational transformation: All of the websites listed above require data to be uploaded in a strict relational format. However, many, many spreadsheets and websites do not adhere to this format -- instead they use a variety of "pivot" or "hierarchical matrix" representations. Swivel provides a "toolbar" for Excel to help users convert to the appropriate format using a set of simple bulk data transformations (http://www.swivel.com/pantry/ excel_swivelbar). The challenge in this project is to automatically decide on the transformations to perform on a spreadsheet in order to convert it to a relational format that the various websites can use to easily generate graphs. 
Additional Data Sets
Activity Recognition Dataset
This dataset contains multimodal recordings (audio, video, accelerometers and gyroscopes) of a human executing five recipes (brownies, salad, pizza, sandwich, eggs) in a model kitchen. Here is a more detailed documentation and instructions for downloading the data: [pdf].
Project ideas:
- Activity recognition from weakly label data: - Activity recognition across view points. Given a video sequence of a particular recipe (e.g. brownies), learn a set of statistics invariant to the camera viewpoint (there are 7 views of the scene). 
- Learn temporal features that are useful to discriminate among different activities. 
 
- Automatic synchronization of two or more multimodal datasets (e.g. audio/video). Before starting the recipe, the user presses a device that produces sound and light. This signal can be used to synchronize audio, video and accelerometer data. Some references for audio-video synchronization: [1], [2]. 
- Use semi-supervised algorithms to fully label the database. For instance, mark a few points in the body of the subject in a few frames, and use manifold learning techniques to label/track those points in the rest of the frames. Reference: [1] 
Election Contributions Dataset
This dataset represents federal electoral compaign donations in the United States for the election years 1980 through 2006. The data, fully built, will form a tripartite, directed graph. Donors (individuals and corporations) make contributions to Committtees, who then in turn make contributions to Candidates. There is a many-to-many relationship between Donors and Committees, and also a many-to-many relationship between Committees and Candidates. For documentation and download, see this website.
Project ideas:
- 
    Can you predict a committee's contribution rate, or preferred candidates, based on its past contribution rate? Which features best indicate who donates to it? 
- 
  Can you predict how much a donor will contribute based on zip code, or whether an occupation is listed (or, if you can analyze the text, what occupation is listed)? 
- 
  Can you predict how much money a candidate will receive based on party, state, or whether s/he is an incumbent/challenger/open seat? 
- 
  Can you discover clusters of donors/committees/candidates? 
There are many other datasets out there. UC Irvine has a repository that could be useful for you project:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has a link to several datasets out there:
http://www.cs.toronto.edu/~roweis/data.html
 
Tools
-  Text->BOW: Here's a tool for converting text documents into bag-of-words vectors. You need to supply it with the vocabulary file with a list of words you want in the feature vector. Then the first feature in the resulting BOW vector will be the first word in the list, etc. An example vocabulary.txt file is also included. Original author: Sophie Wang.