Final Writeup Grading Scheme

The grades for the final project will be loosely based on the following criteria:

Background: [5 Points]

  1. Does the final report describe the setting and reference the related research?
  2. Does the final report describe the data that you are working with and how it was derived?

Design: [10 Points]

  1. Does the final report have a concrete, well-defined experimental design describing the learning task?
  2. Does the final report describe which features and models are used and why they were chosen?

Implementation: [10 Points]

  1. Did you implement appropriate techniques presented in 10-601: an aspect of a learning algorithm, feature selection, training, inference?
  2. Does the final report describe what was implemented and what tools were used?

Results: [10 Points]

  1. Does the final report have quantitative results from learning or experimenting with your data?
  2. Does the final report have results evaluating learning (e.g., learning curves, precision-recall, training/testing errors)?
  3. Does the final report make effective use of graphs which are appropriately labeled and properly described in the document?

Interpretation: [10 Points]

  1. Does the final report attempt to interpret the results?
  2. Does the interpretation correctly use concepts from class to justify the results?

Overall Presentation: [5 Points]

  1. Is the overall presentation coherent and well organized?
  2. Does the final report contain grammatical or formatting errors that make the document difficult to read or understand?

Your Course Project

The course project accounts for 25% of the final grade. The deliverables listed below contribute to the project grade.

Your class project is an opportunity for you to explore an interesting machine learning problem of your choice. All projects must have an implementation component, though theoretical aspects may also be explored. You should also evaluate your approach, preferably on real-world data, though for some projects simulated data may also be appropriate. Below, you will find some project ideas, but the best idea would be to combine machine learning with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters. If you are uncertain about this requirement, please email the instructors.

Projects should be done in teams of 2-3 students. Each project will also be assigned a 10-601 instructor as a project consultant/mentor. They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 25% of your final class grade, and will have the following deliverables:

  1. A project proposal (1 page maximum). This should include a project title, a project idea, the data set that you will use, the software that you need to write, any additional software that you plan to use, 1-3 relevant papers that you need to read, and who your teammates are.

  2. First project milestone report. This writeup should present substantial initial results (4 pages). Include placeholders for experiments you plan to conduct. Worth 10% of project grade.

  3. Second project milestone report. Should be a mostly complete draft with results of your first experiments, although possibly missing the final experiments; again include placeholders (5 pages). Worth 15% of project grade.

  4. Two-minute presentation of your second milestone results (4 slides maximum, not including the title slide; PDF format, no animations).

  5. Final writeup in the format of a conference paper (NIPS format) (6 pages maximum, not including references). Worth 40% of project grade.

  6. Poster presenting your work for a special class poster session. This is in lieu of a final exam and all students are required to attend. Worth 35% of the project grade.

Writeup and presentation format

The two-minute presentation should be in a PDF format without animations. The project milestones and final writeup should be written in NIPS format. The margins and spacing should not be altered. Finally, the bibliography does not count against the page limit.

Poster format

Project suggestions

Below you can find some project ideas. Remember, these are only suggestions. You are encouraged to come up with your own project ideas and discuss them with the instructors. Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, information retrieval, sensor networks, etc., and formulate your problem using machine learning techniques.

Project A1: Cognitive State Classification with Magnetoencephalography Data (MEG)


A zip file containing some example preprocessing of the data into features along with some text file descriptions:
The raw time data (12 GB) for two subjects (DP/RG_mats) and the FFT data (DP/RG_avgPSD) is located at:
You should access this directly through AFS space

This data set contains a time series of images of brain activation, measured using MEG. Human subjects viewed 60 different objects divided into 12 categories (tools, foods, animals, etc.). There are 8 presentations of each object, and each presentation lasts 3-4 seconds. Each second has hundreds of measurements from 300 sensors. The data is currently available for 2 different human subjects.

Project A: Building a cognitive state classifier
Project idea: We would like to build classifiers to distinguish between the different categories of objects (e.g. tools vs. foods) or even the objects themselves if possible (e.g. bear vs. cat). The exciting thing is that no one really knows how well this will work (or if it's even possible). This is because the data was only gathered a few weeks ago (Aug-Sept 08). One of the main challenges is figuring out how to make good features from the raw data. Should the raw data just be used? Or maybe it should first be passed through a low-pass filter? Perhaps an FFT should convert the time series to the frequency domain first? Should the features represent absolute sensor values or should they represent changes from some baseline? If so, what baseline? Another challenge is discovering what features are useful for what tasks. For example, the features that distinguish foods from animals may be different from those that distinguish tools from buildings. What are good ways to discover these features?
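
To make the feature questions above concrete, here is a minimal Python sketch (not part of the provided preprocessing code) that baseline-corrects a single sensor's time series and sums its spectral power inside a frequency band. The function names, the 100 Hz sampling rate, the band edges, the 10-sample baseline window, and the synthetic signal are all assumptions made for illustration; the real MEG recordings will need their own choices, and a library FFT would replace the naive DFT used here.

```python
import cmath
import math

def baseline_correct(signal, n_baseline):
    """Subtract the mean of the first n_baseline samples (an assumed baseline window)."""
    base = sum(signal[:n_baseline]) / n_baseline
    return [x - base for x in signal]

def power_spectrum(signal):
    """Naive O(n^2) DFT power for frequency bins 0..n//2 (fine for short toy signals)."""
    n = len(signal)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        powers.append(abs(coeff) ** 2 / n)
    return powers

def band_power(signal, sample_rate, low_hz, high_hz):
    """Sum spectral power over bins whose frequency lies in [low_hz, high_hz)."""
    n = len(signal)
    total = 0.0
    for k, p in enumerate(power_spectrum(signal)):
        freq = k * sample_rate / n
        if low_hz <= freq < high_hz:
            total += p
    return total

# Synthetic one-second signal: a 10 Hz oscillation riding on a constant offset.
sample_rate = 100  # Hz, hypothetical
signal = [5.0 + math.sin(2 * math.pi * 10 * t / sample_rate) for t in range(100)]

corrected = baseline_correct(signal, n_baseline=10)
features = {
    "power_8_13_hz": band_power(corrected, sample_rate, 8, 13),
    "power_13_30_hz": band_power(corrected, sample_rate, 13, 30),
}
print(features)
```

On this toy signal nearly all of the power lands in the 8-13 Hz band, which is the kind of contrast a band-power feature is meant to expose. Whether such features help on the real data is exactly the open question of the project.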

This project is more challenging and risky than the others because it is not known what the results will be. But this is also good because no one else knows either, meaning that a good result could lead to a possible publication.
Papers to read:
Relevant but in the fMRI domain:
Learning to Decode Cognitive States from Brain Images, Mitchell et al., 2004
Predicting Human Brain Activity Associated with the Meanings of Nouns, Mitchell et al., 2008
MEG paper:
Predicting the recognition of natural scenes from single trial MEG recordings of brain activity, Rieger et al. 2008 (access from CMU domain)

Project A2: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.

Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).

Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent), by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include (5000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images", Mitchell et al., 2004; "Bayesian Network Classifiers", Friedman et al., 1997.
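
For reference, the Gaussian Naive Bayes baseline that this project aims to go beyond can be sketched in a few lines of Python. This is an illustrative sketch only, not the provided Matlab software: the function names, the variance floor, and the two-feature toy data are assumptions, and real inputs would be voxel activations rather than invented numbers. Note how each feature contributes an independent Gaussian term to the score, which is the conditional independence assumption a TAN tree would relax.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, var_floor=1e-6):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for x, m, v in zip(row, means, variances):
            # Independent Gaussian log-likelihood per feature.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data: two features per example, two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
     [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["sentence"] * 3 + ["picture"] * 3

gnb = fit_gnb(X, y)
print(predict_gnb(gnb, [1.1, 2.0]), predict_gnb(gnb, [3.1, 0.5]))
```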

Project AD: Anomaly-detection task

The typing anomaly-detection task is to discriminate between the typing of a genuine user trying to gain legitimate access to his or her account, and the typing of an impostor trying to gain access illegitimately to that same account. This webpage is a benchmark data set for keystroke dynamics. The data consist of keystroke-timing information from 51 subjects (typists), each typing a password 400 times. The project would be to use the data on this page to learn a classifier which determines reliably the identity of a given typist.
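
One simple way to frame the task, sketched below in Python, is to build a per-user template from genuine typing samples and score a new attempt by its average absolute z-score against that template. This is only one possible detector and is not prescribed by the benchmark; the function names, the threshold, and the three-feature timing vectors are invented for illustration.

```python
import statistics

def fit_template(samples):
    """Per-feature mean and standard deviation of a user's genuine timing vectors."""
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    # Floor the deviation so constant features do not cause division by zero.
    stds = [max(statistics.pstdev(col), 1e-6) for col in columns]
    return means, stds

def anomaly_score(template, sample):
    """Average absolute z-score of a sample; larger means less like the genuine user."""
    means, stds = template
    return sum(abs(x - m) / s for x, m, s in zip(sample, means, stds)) / len(sample)

# Invented keystroke timings in seconds (e.g., hold times for three keys).
genuine_samples = [
    [0.10, 0.20, 0.15],
    [0.12, 0.22, 0.14],
    [0.11, 0.19, 0.16],
    [0.09, 0.21, 0.15],
]
template = fit_template(genuine_samples)

genuine_attempt = [0.11, 0.20, 0.15]
impostor_attempt = [0.25, 0.10, 0.30]
threshold = 2.0  # hypothetical; in practice tuned on held-out data

for name, attempt in [("genuine", genuine_attempt), ("impostor", impostor_attempt)]:
    score = anomaly_score(template, attempt)
    print(name, round(score, 2), "reject" if score > threshold else "accept")
```

A detector like this is trained on one user's data only, so it can be evaluated by treating the other 50 subjects' typing of the same password as impostor attempts.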

Project B: Image Segmentation Dataset

The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.

Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from Berkeley are available here

Project C: Twenty Newsgroups text data

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software: The same website provides an implementation of a Naive Bayes classifier for this text data.  The code is quite robust, and some documentation is available, but it is difficult code to modify.

Project ideas:
EM text classification in the case where you have labels for some documents, but not for others (see McCallum et al., and come up with your own suggestions)
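
As a rough illustration of the semi-supervised idea, the Python sketch below alternates between estimating class posteriors for unlabeled documents (E-step) and refitting a multinomial Naive Bayes model from the labeled documents plus those soft labels (M-step). It is a simplified sketch under assumptions, not the referenced authors' implementation and not the Naive Bayes code from the website: the function names, the smoothing constants, the fixed iteration count, and the tiny tokenized documents are all invented.

```python
import math
from collections import Counter

def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: fit multinomial Naive Bayes from per-document soft class weights."""
    prior = {c: 1.0 for c in classes}  # add-one smoothing on class counts
    word = {c: Counter() for c in classes}
    for doc, w in zip(docs, weights):
        for c in classes:
            prior[c] += w[c]
            for token in doc:
                word[c][token] += w[c]
    total_prior = sum(prior.values())
    model = {}
    for c in classes:
        denom = sum(word[c].values()) + alpha * len(vocab)
        log_words = {t: math.log((word[c][t] + alpha) / denom) for t in vocab}
        model[c] = (math.log(prior[c] / total_prior), log_words)
    return model

def posterior(model, doc):
    """E-step for one document: P(class | doc); out-of-vocabulary tokens are ignored."""
    scores = {c: lp + sum(lw[t] for t in doc if t in lw)
              for c, (lp, lw) in model.items()}
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: e / z for c, e in expd.items()}

def em_nb(labeled, unlabeled, classes, n_iter=10):
    """Initialize from labeled docs, then alternate E- and M-steps."""
    docs_l = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    vocab = {t for d in docs_l + unlabeled for t in d}
    model = train_nb(docs_l, hard, classes, vocab)
    for _ in range(n_iter):
        soft = [posterior(model, d) for d in unlabeled]
        model = train_nb(docs_l + unlabeled, hard + soft, classes, vocab)
    return model

# Invented toy corpus: one labeled document per class, four unlabeled documents.
labeled = [(["puck", "ice", "goal"], "hockey"),
           (["orbit", "rocket", "launch"], "space")]
unlabeled = [["puck", "stick", "ice"], ["stick", "goal", "skate"],
             ["rocket", "shuttle", "orbit"], ["shuttle", "launch", "nasa"]]

model = em_nb(labeled, unlabeled, ["hockey", "space"])
print(posterior(model, ["stick", "skate"]))
```

The query words "stick" and "skate" never appear in a labeled document, so any preference the model shows for one class comes from the unlabeled data, which is the point of the exercise.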

Project D: Optimizing computerized tutoring software

We have access to a large amount of data from students using computerized tutoring software to learn concepts like algebra, geometry, and language: available here. The data is of the following form: students interact with the tutor on items, which correspond to individual steps of problems. The goal is to predict, based on a student's recent performance, whether the student will be able to get a particular new item right on the first try. Based on these predictions, the computerized tutor could select problems which were just at the boundary of a student's current knowledge, thereby optimizing the student's learning speed. Similarly, the tutor could look at the details of the learned model to try to extract relevant information and optimize the learning experience. To help in this prediction problem, the tutor records whether the student got past items right on the first try, or whether s/he needed help or required several tries. Also, experts have labeled each item with skills or knowledge components, with the idea being that items that require similar skills will have similar patterns of right/wrong answers.

Project E: Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Project suggestion:

  • Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
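
The following Python sketch shows how Viterbi decoding can combine per-letter classifier scores with letter-to-letter transition probabilities. Everything concrete in it is an assumption for illustration: the three-letter alphabet, the transition table, and the per-position scores are invented, and in a real project the transitions would be estimated from the word vocabulary and the scores would come from your character classifier.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely letter sequence given per-position log-likelihoods.

    obs_loglik[t][s] is log P(image at position t | letter s).
    """
    states = list(log_init)
    best = [{s: log_init[s] + obs_loglik[0][s] for s in states}]
    back = []
    for t in range(1, len(obs_loglik)):
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + obs_loglik[t][s]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(probs):
    return {k: math.log(v) for k, v in probs.items()}

letters = ["q", "u", "v"]
log_init = logs({s: 1 / 3 for s in letters})
log_trans = {
    "q": logs({"q": 0.05, "u": 0.90, "v": 0.05}),  # "q" is almost always followed by "u"
    "u": logs({s: 1 / 3 for s in letters}),
    "v": logs({s: 1 / 3 for s in letters}),
}
# Invented classifier outputs: first image is clearly "q", second is ambiguous.
obs_loglik = [logs({"q": 0.90, "u": 0.05, "v": 0.05}),
              logs({"q": 0.10, "u": 0.40, "v": 0.50})]

greedy = [max(letters, key=lambda s: row[s]) for row in obs_loglik]
decoded = viterbi(obs_loglik, log_trans, log_init)
print("per-letter:", greedy, "HMM:", decoded)
```

Here the per-letter decision picks "v" for the second image, while the HMM prefers "u" because of the strong "q" to "u" transition.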

Project F: NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records

Currently all of the regular season

Project idea:

  • Outlier detection on the players: find out who the outstanding players are.
  • Predict the game outcome.
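
As a first baseline for the outlier-detection idea, the Python sketch below flags players whose value on a single statistic lies more than a chosen number of standard deviations from the mean. The player names, the numbers, and the threshold are made up for illustration and do not come from the data set; a real project would likely use several statistics and a more robust method.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the names whose value is more than `threshold` std devs from the mean."""
    numbers = list(values.values())
    mean = statistics.fmean(numbers)
    std = statistics.pstdev(numbers)
    if std == 0:
        return []
    return sorted(name for name, v in values.items()
                  if abs(v - mean) / std > threshold)

# Made-up points-per-game figures for hypothetical players.
points_per_game = {
    "player_a": 8.1, "player_b": 9.4, "player_c": 7.7, "player_d": 10.2,
    "player_e": 8.8, "player_f": 9.0, "player_g": 30.5, "player_h": 8.3,
}

outliers = zscore_outliers(points_per_game)
print(outliers)
```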

Project G: Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels

Sensor selection: Where should you place sensors to best predict rain?

Project H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

Project ideas:

  • Learning classifiers to predict the type of webpage from the text
  • Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?


Project I: Deduplication

The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of Deduplication, Identity Uncertainty, and Record Linkage.

Project Ideas:
  • One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique".
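
A minimal Python sketch of the pairwise formulation follows: each pair of records is turned into a small feature vector of string similarities, and a decision rule labels the pair as referring to the same entity or not. The two features, the averaging rule, the 0.5 threshold, and the example records are assumptions for illustration; in a project the hand-set threshold would be replaced by a classifier trained on labeled pairs.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pair_features(a, b):
    """Similarity features for a record pair: character-level ratio and token Jaccard."""
    a, b = a.lower(), b.lower()
    char_sim = SequenceMatcher(None, a, b).ratio()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    jaccard = len(tokens_a & tokens_b) / len(union) if union else 0.0
    return [char_sim, jaccard]

def same_entity(a, b, threshold=0.5):
    """Stand-in decision rule: average the features and compare to a hand-set threshold."""
    features = pair_features(a, b)
    return sum(features) / len(features) >= threshold

# Invented records; the first two are meant to describe the same person.
records = [
    "John A. Smith, Pittsburgh PA",
    "john a smith pittsburgh pa",
    "Mary Jones, Boston MA",
]

matches = [(i, j)
           for (i, a), (j, b) in combinations(enumerate(records), 2)
           if same_entity(a, b)]
print(matches)
```

Comparing all pairs is quadratic in the number of records, so larger data sets usually need some way of restricting which pairs are compared.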


Project J: Email Annotation

The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

Project Ideas:
  •  Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".


Project K: Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize

Project idea:

  • Can you predict the rating a user will give to a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?

  • Can you discover clusters of similar movies or users?

  • Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website
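
For the rating-prediction idea, one common starting point is user-based collaborative filtering, sketched below in Python. It predicts a rating as the user's mean plus a similarity-weighted average of how far similar users deviated from their own means on that movie. The function names, the choice of similarity, the decision to ignore negatively correlated users, and the tiny ratings table are assumptions for illustration, and this brute-force version would not scale to 100 million records as written.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def similarity(ra, rb):
    """Cosine similarity of mean-centered ratings over the movies both users rated."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    ma, mb = mean(ra.values()), mean(rb.values())
    num = sum((ra[m] - ma) * (rb[m] - mb) for m in common)
    den = (math.sqrt(sum((ra[m] - ma) ** 2 for m in common))
           * math.sqrt(sum((rb[m] - mb) ** 2 for m in common)))
    return num / den if den else 0.0

def predict(ratings, user, movie):
    """User mean plus similarity-weighted deviation of positively similar users."""
    user_mean = mean(ratings[user].values())
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or movie not in r:
            continue
        sim = similarity(ratings[user], r)
        if sim <= 0:
            continue  # ignore dissimilar users in this simple sketch
        num += sim * (r[movie] - mean(r.values()))
        den += sim
    return user_mean + num / den if den else user_mean

# Invented ratings: u1 agrees with u2 and disagrees with u3.
ratings = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "u3": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

prediction = predict(ratings, "u1", "m4")
print(round(prediction, 2))
```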

Project L: Physiological Data Modeling (bodymedia)

Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.

1. Which sensors correspond to each column?

characteristic1 age
characteristic2 handedness
sensor1 gsr_low_average
sensor2 heat_flux_high_average
sensor3 near_body_temp_average
sensor4 pedometer
sensor5 skin_temp_average
sensor6 longitudinal_accelerometer_SAD
sensor7 longitudinal_accelerometer_average
sensor8 transverse_accelerometer_SAD
sensor9 transverse_accelerometer_average

2. What are the activities behind each annotation?

The annotations for the contest were:
5102 = sleep
3104 = watching TV

Datasets can be downloaded from


Project idea:

  • Behavior classification: classify the person's activity based on the sensor measurements.

Project M: Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.

Project ideas:

  • You can try to create an object recognition system which can identify which object category is the best match for a given test image.
  • Apply clustering to learn object categories without supervision

Project N: Learning POMDP structure so as to maximize utility

Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.

Project O: Learning partially observed MRFs: the Langevin algorithm

In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help the learning process escape local optima. In this project you will implement the Langevin learning algorithm for Xing's dual wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
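
To show the shape of a Langevin-style update, here is a small Python sketch on a one-dimensional toy objective. It is not the harmonium model and not the algorithm from the cited papers: the objective, the step size, the noise scaling (Gaussian noise with variance equal to the step size, paired with half a step along the gradient), and all names are assumptions chosen for illustration. In the actual project the exact gradient used here would be replaced by a sample-based estimate.

```python
import math
import random

def grad_objective(theta):
    """Gradient of the toy objective f(theta) = -(theta^2 - 1)^2 + 0.3 * theta.

    f has a local maximum near theta = -0.96 and a higher one near theta = 1.04.
    """
    return -4.0 * theta * (theta ** 2 - 1.0) + 0.3

def langevin_ascent(grad, theta, step=0.02, n_steps=5000, noise_scale=1.0, seed=0):
    """Gradient ascent with injected Gaussian noise.

    noise_scale=0.0 reduces this to plain gradient ascent.
    """
    rng = random.Random(seed)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0) * math.sqrt(step) * noise_scale
        theta += 0.5 * step * grad(theta) + noise
    return theta

# Start both runs in the basin of the lower (local) maximum.
plain = langevin_ascent(grad_objective, theta=-1.0, noise_scale=0.0)
noisy = langevin_ascent(grad_objective, theta=-1.0, noise_scale=1.0)
print("plain gradient ascent ends at", round(plain, 3))
print("noisy run ends at", round(noisy, 3))
```

Without noise the iterate settles at the nearby local maximum. With noise it can cross between basins, so where a single run ends depends on the random seed; comparing many runs, or annealing the noise, is part of what the project would investigate.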

Project P: Context-specific independence

We learned in class that CSI can speed up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?

Project Q: Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data

Project ideas:

  • Can you classify the text of an e-mail message to decide who sent it? 

Project R: More data

There are many other datasets out there. UC Irvine has a repository that could be useful for your project:

Sam Roweis also has a link to several datasets out there: