Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two to three students. Each project will also be assigned an instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will have three deliverables:

  1. Proposal: 1 page
  2. Midway Report: 3-4 pages
  3. Final Report: 8 pages

Note that all write-ups must be in the form of a NIPS paper. The page limits are strict! Papers over the limit will not be considered.

Project Proposal:

You must turn in a brief project proposal (1-page maximum). Read the list of available data sets and potential project ideas below. We strongly recommend that you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on data that has not yet been collected, so you have to work on existing data sets. It is also possible to propose a project on some theoretical aspect of machine learning. If you want to do this, please discuss it with us. Note that even though you can use data sets you have used before, you cannot submit as your class project something that you started before the class.

Project proposal format:  Proposals should be one page maximum.  Include the following information:

Midway Report:

This should be a short report of 3-4 pages that serves as a checkpoint. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections "under construction". Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as placeholders for the results you plan or hope to obtain.

Final Report:

Your final report is expected to be an 8-page report. You should submit both an electronic and a hardcopy version of your final report. It should roughly have the following format:

Project Suggestions:

Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.

Project A: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects. 
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).

Project ideas:


Project B: Image Segmentation Dataset

The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.

Project ideas:


Project C: Twenty Newsgroups text data

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is the newsgroup it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software

Project ideas:

Project D: Handwriting Recognition

(Lisa Anthony

A general overview of our data: we have approximately 16,000 labeled character samples from 39 middle and high school students, consisting of x-coord, y-coord, and time per point in each stroke. They are grouped into sets of 45 equations that each student copied. The symbols in our dataset are: 0-9, x, y, a, b, c, +, -, _ (fraction bar), =, (, ).

Project ideas:

  • HOW MUCH DATA: All our data is currently hand-labeled, and we have lots of it. One question might be, if the data wasn't labeled, what would be the added value of additional data? That is, what would be the optimal or minimal dataset? This could be defined along several axes: the number of users, the number of samples per character, or the number of samples per symbol per user. We have done a few preliminary experiments where it is clear that there is a leveling-off point for test accuracy -- likely caused by the increase in variability from adding new samples (especially by new users with differing handwriting styles), which harms the classification algorithm (see the MULTIPLE CLASSIFIERS idea below). For future studies and domains it might be useful to get a general sense of "data saturation" -- a recommended canonical corpus size -- for researchers who aren't as experienced in ML and data mining or in domains where data collection is costly.
  • HOW MUCH LABELED DATA AND/OR AUTOMATIC LABELING: Hand-labeling all our data took quite a bit of time. What possibilities exist for an automated, semi-supervised labeling algorithm that could tell us how much data we need to label in advance and how much human verification is needed on the automatically labeled stuff? A side note is that the collection of this data (for the sake of the users) was in the form of one equation at a time rather than one character at a time, so the characters needed to be segmented at the time of labeling since the strokes all ran together in the logs. An automated segmenting approach would be very helpful to us in the future!
  • MULTIPLE CLASSIFIERS: Finally, there is quite a bit of variance between users in that their handwriting styles differ and the particular means of executing a style differs across users. We hypothesize that multiple classifiers trained per user would have higher walk-up-and-use accuracy on a set of independent users than one classifier that has to generalize across all user styles. So this could also be an interesting area to explore.

Project E: Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Project ideas:

  • Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
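
As a minimal sketch of the HMM idea above, the following Viterbi decoder combines per-letter classifier scores with letter-to-letter transition probabilities. The function name `viterbi`, the toy two-letter alphabet, and every probability value are illustrative assumptions; they do not come from the course datasets or the provided software.

```python
import math

def viterbi(emission_probs, transition, prior):
    """Return the most likely state (letter) sequence.

    emission_probs: list of dicts, one per position, state -> P(image | state)
    transition: dict mapping (prev_state, state) -> P(state | prev_state)
    prior: dict mapping state -> P(state) at the first position
    """
    states = list(prior)
    # Log-probability of the best path ending in each state at position 0.
    best = {s: math.log(prior[s]) + math.log(emission_probs[0][s]) for s in states}
    backpointers = []
    for emit in emission_probs[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: best[p] + math.log(transition[(p, s)]))
            new_best[s] = best[prev] + math.log(transition[(prev, s)]) + math.log(emit[s])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy example (all numbers hypothetical): the second letter looks slightly
# more like "q" in isolation, but "q" is almost always followed by "u".
prior = {"q": 0.5, "u": 0.5}
transition = {("q", "q"): 0.05, ("q", "u"): 0.95, ("u", "q"): 0.5, ("u", "u"): 0.5}
emissions = [{"q": 0.9, "u": 0.1}, {"q": 0.55, "u": 0.45}]
decoded = viterbi(emissions, transition, prior)
```

In this toy case a per-letter classifier would pick "q" twice, while the decoder returns "q" followed by "u" because the transition model outweighs the weak emission evidence.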

Project F: NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records

Currently all of the regular season

Project ideas:

  • outlier detection on the players; find out who are the outstanding players.
  • predict the game outcome.
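
One simple baseline for the outlier-detection idea is a z-score rule on a single statistic. The sketch below is only a starting point under stated assumptions: the function name `zscore_outliers`, the threshold of 2.0, and the points-per-game values are all made up for illustration and are not taken from the download.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical points-per-game figures for eight players.
points_per_game = [8.1, 9.4, 10.2, 7.7, 9.0, 8.8, 30.5, 9.9]
outliers = zscore_outliers(points_per_game)
```

A project would likely go further, for example by using several statistics jointly, since a single extreme value also inflates the standard deviation it is compared against.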

Project G: Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

Project ideas:

  • Weather prediction: Learn a probabilistic model to predict rain levels
  • Sensor selection: Where should you place sensors to best predict rain?

Project H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

Project ideas:

  • Learning classifiers to predict the type of webpage from the text
  • Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?


Project I: Deduplication

The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of Deduplication, Identity Uncertainty and Record Linkage.

Project ideas:
  • One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique".
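
The pairwise formulation can be sketched as follows: enumerate record pairs, compute a similarity feature for each pair, and decide whether the two records refer to the same entity. Here a fixed threshold on token-level Jaccard similarity stands in for a trained classifier; the helper names, the threshold of 0.5, and the example records are assumptions made for illustration only.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-level Jaccard similarity between two record strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def duplicate_pairs(records, threshold=0.5):
    """Return index pairs (i, j) whose records are judged to be the same entity."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        # A learned classifier over several pair features would replace this rule.
        if jaccard(a, b) >= threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical records: the first two describe the same person.
records = [
    "John Smith 12 Main Street Pittsburgh",
    "Smith John 12 Main St Pittsburgh",
    "Mary Jones 4 Oak Avenue Boston",
]
matches = duplicate_pairs(records)
```

Note that comparing all pairs is quadratic in the number of records, so larger datasets usually need some way of restricting which pairs are compared.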


Project J: Email Annotation
The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

Project ideas:

  •  Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".


Project K: Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.

Project ideas:

  • You can try to create an object recognition system which can identify which object category is the best match for a given test image.
  • Apply clustering to learn object categories without supervision
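
For the clustering idea, a plain k-means loop is one possible starting point once each image has been turned into a feature vector. The sketch below assumes such vectors already exist; the two-dimensional toy features, the function names, and the naive "first k points" initialization are illustrative assumptions, not part of the Caltech 256 distribution.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Cluster points (tuples of floats); return (assignments, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids

# Hypothetical 2-D image features forming two well-separated groups.
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
assignments, centroids = kmeans(features, k=2)
```

Real image features would be much higher-dimensional, and the result depends on the initialization, so several restarts are usually worth comparing.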

Project L: Learning POMDP structure so as to maximize utility

Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.

Project M: Learning partially observed MRFs: the Langevin algorithm

In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm whose gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help to get the learning process out of local optima. In this project you will implement the Langevin learning algorithm for Xing's dual wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
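
To make the "gradient plus random perturbation" idea concrete, here is a generic Langevin update on a toy one-parameter problem. This is not the harmonium model or the algorithm from the cited papers: the target (a unit-variance Gaussian centred at 3), the step size, and all names are assumptions chosen only to show the shape of the update, where each step follows half a step of the gradient and adds Gaussian noise whose variance equals the step size.

```python
import random

def langevin_step(theta, grad_log_p, step_size, rng):
    """One Langevin update: half a gradient step plus Gaussian noise."""
    noise = rng.gauss(0.0, step_size ** 0.5)  # standard deviation sqrt(step_size)
    return theta + 0.5 * step_size * grad_log_p(theta) + noise

def grad_log_p(theta):
    # Gradient of the log-density of a unit-variance Gaussian centred at 3 (toy target).
    return -(theta - 3.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
theta = 0.0
samples = []
for t in range(20000):
    theta = langevin_step(theta, grad_log_p, 0.1, rng)
    if t >= 1000:  # discard early iterations as burn-in
        samples.append(theta)
sample_mean = sum(samples) / len(samples)
```

In the project itself, the exact gradient used here would be replaced by the sample-based gradient estimate, and the scalar parameter by the model's parameter vector.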

Project N: Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data

Project ideas:

  • Can you classify the text of an e-mail message to decide who sent it? 
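
One possible baseline for sender identification is a multinomial Naive Bayes classifier over word counts. The sketch below is a toy illustration under stated assumptions: the function names, the whitespace tokenization, and the four example messages are invented here and are not drawn from the Enron data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages):
    """messages: list of (sender, text) pairs. Returns the fitted counts."""
    word_counts = defaultdict(Counter)
    sender_counts = Counter()
    vocabulary = set()
    for sender, text in messages:
        words = text.lower().split()
        word_counts[sender].update(words)
        sender_counts[sender] += 1
        vocabulary.update(words)
    return word_counts, sender_counts, vocabulary

def predict_sender(text, model):
    """Return the sender with the highest posterior log-score for the text."""
    word_counts, sender_counts, vocabulary = model
    total_messages = sum(sender_counts.values())
    scores = {}
    for sender in sender_counts:
        total_words = sum(word_counts[sender].values())
        score = math.log(sender_counts[sender] / total_messages)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[sender][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[sender] = score
    return max(scores, key=scores.get)

# Hypothetical training messages from two senders.
training = [
    ("alice", "please review the trading report before the meeting"),
    ("alice", "the trading desk needs the report today"),
    ("bob", "lunch at noon then golf this weekend"),
    ("bob", "golf was great see you at lunch"),
]
model = train_naive_bayes(training)
guess = predict_sender("send me the trading report", model)
```

With about 150 users and very uneven message counts per user, evaluation on held-out messages and some attention to class imbalance would be needed before drawing conclusions.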

Project R: More data

There are many other datasets out there. UC Irvine has a repository that could be useful for your project:

Sam Roweis also has a link to several datasets out there:

© 2009 Eric Xing @ School of Computer Science, Carnegie Mellon University