SCS Computing Facilities has instituted a new procedure for printing posters. The new procedure is intended to make poster printing faster and easier for the SCS community. There is no longer any need to call Operations in order to print a poster. You can now submit posters via email, to firstname.lastname@example.org. Simply follow the printing procedures that are documented on the SCS Help pages at http://www.cs.cmu.edu/~help/printing/platinum_printing.html and Operations will print the poster and notify you when it is ready for pickup. Please contact SCS Operations at x8-2608 or send mail to email@example.com with any questions or concerns. Also, the poster boards we use are 32" x 40". Non-SCS students will need to contact their departments about resources for printing posters.
The class project is an opportunity for you to apply the machine learning techniques learnt in class to interesting real-world multivariate analysis problems. Successful projects in the past have gone on to become full-fledged research papers! The final project should be carried out by teams of 1-2 students (no more than 2). Students in this class come from many different departments and schools within CMU. We highly encourage students to form interdisciplinary groups involving students from multiple departments. The project is worth 30% of the final class grade, with 4 deliverables:
Note that all write-ups must be in the form of a NIPS paper (mirrored here). The page limits are strict! Papers over the limit will not be considered.
You are highly recommended to use one of these data sets, because they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on uncollected data, so you have to work on existing data sets. You may also propose a project on some theoretical aspects of machine learning. If you want to do this, please discuss it with us. Note that even though you can use data sets you have used before, you cannot use as class projects something that you started doing prior to the class.
You must turn in a brief project proposal (1-page maximum). Read the list of available data sets and potential project ideas below.
Project proposal format: Proposals should be one page maximum. Include the following information:
This should be a short 2-page report, and it serves as a checkpoint. It should have fairly complete introduction and related work sections, and should discuss the methods, code and experiments you plan to perform. Also include (very briefly) whatever results you have obtained so far.
Your final report is expected to be an 8-page report. You should email your final report to the instructor and all three TAs. A hard copy submission is not required. The final report should roughly have the following format:
All projects will present a poster at the project poster session: Wednesday, December 1, time TBA, in Newell-Simon Hall. At least one project member should be present during the session hours. The session will be open
Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below. (courtesy: Eric Xing's Spring 2008 Machine Learning class).
Data: A zip file containing some example preprocessing of the data into features along with some text file descriptions: LanguageFiles.zip
This data set contains a time series of images of brain activation, measured using MEG. Human subjects viewed 60 different objects divided into 12 categories (tools, foods, animals, etc.). There are 8 presentations of each object, and each presentation lasts 3-4 seconds. Each second has hundreds of measurements from 300 sensors. The data is currently available for 2 different human subjects.
Project A: Building a cognitive state
Project idea: We would like to build classifiers to distinguish between the different categories of objects (e.g. tools vs. foods) or even the objects themselves if possible (e.g. bear vs. cat). The exciting thing is that no one really knows how well this will work (or if it's even possible). This is because the data was only gathered a few weeks ago (Aug-Sept 08). One of the main challenges is figuring out how to make good features from the raw data. Should the raw data just be used? Or maybe it should first be passed through a low-pass filter? Perhaps an FFT should convert the time series to the frequency domain first? Should the features represent absolute sensor values or should they represent changes from some baseline? If so, what baseline? Another challenge is discovering which features are useful for which tasks. For example, the features that distinguish foods from animals may be different from those that distinguish tools from buildings. What are good ways to discover these features?
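One way to start exploring the feature questions above is sketched below. It is only an illustration under stated assumptions: the array layout (sensors by time samples), the sampling rate, the choice of a pre-stimulus window as the baseline, and the frequency bands are all hypothetical and are not taken from the data set documentation.

```python
import numpy as np

def meg_features(trial, sample_rate, baseline_samples, bands):
    """Band-power features for one presentation.

    trial: array of shape (n_sensors, n_samples) (assumed layout).
    baseline_samples: number of leading samples treated as the baseline.
    bands: list of (low_hz, high_hz) pairs (hypothetical choices).
    Returns a 1-D vector: all sensors for band 0, then band 1, and so on.
    """
    # Express each sensor as a change from its own baseline mean.
    baseline = trial[:, :baseline_samples].mean(axis=1, keepdims=True)
    centered = trial - baseline

    # Move to the frequency domain with a real FFT along the time axis.
    spectrum = np.fft.rfft(centered, axis=1)
    freqs = np.fft.rfftfreq(centered.shape[1], d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2

    # Average the power inside each band, one value per sensor.
    feats = []
    for low_hz, high_hz in bands:
        mask = (freqs >= low_hz) & (freqs < high_hz)
        feats.append(power[:, mask].mean(axis=1))
    return np.concatenate(feats)
```

Keeping only low-frequency bands acts as a crude low-pass filter, so the same function lets you compare several of the options raised above by changing the arguments.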
This project is more challenging and risky than the others because it is not known what the results will be. But this is also good because no one else knows either, meaning that a good result could lead to a
Papers to read:
Relevant but in the fMRI domain:
Learning to Decode Cognitive States from Brain Images, Mitchell et al., 2004,
Predicting Human Brain Activity Associated with the Meanings of Nouns, Mitchell et al., 2008
Predicting the recognition of natural scenes from single trial MEG recordings of brain activity, Rieger et al. 2008 (access from CMU domain)
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent given the class) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include for classifier input (5000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images", Mitchell et al., 2004; "Bayesian Network Classifiers", Friedman et al., 1997.
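As a point of reference for the baseline this project goes beyond, here is a minimal sketch of a Gaussian Naive Bayes classifier in NumPy. It is not the Matlab software mentioned above; the class name, the variance floor, and the input layout (one row per example, one column per feature) are assumptions made for illustration.

```python
import numpy as np

class GaussianNaiveBayes:
    """Each feature is modeled as an independent Gaussian given the class."""

    def fit(self, X, y, var_floor=1e-6):
        self.classes_ = np.unique(y)
        # Class priors estimated from label frequencies.
        self.log_prior_ = np.log(
            np.array([(y == c).mean() for c in self.classes_])
        )
        # Per-class, per-feature mean and variance.
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.var_ = self.var_ + var_floor  # avoid division by zero
        return self

    def predict(self, X):
        # diff has shape (n_examples, n_classes, n_features).
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_density = -0.5 * (
            np.log(2 * np.pi * self.var_)[None, :, :]
            + diff ** 2 / self.var_[None, :, :]
        )
        # Conditional independence: sum the per-feature log densities.
        log_joint = log_density.sum(axis=2) + self.log_prior_[None, :]
        return self.classes_[np.argmax(log_joint, axis=1)]
```

A TAN classifier relaxes the independence assumption in `predict` by letting each feature depend on one other feature in addition to the class.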
The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
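The "oversegment, then merge" idea can be sketched as follows. This is a deliberately simple stand-in, not the method of any cited paper: it merges adjacent superpixels whose mean colors are close, using a fixed threshold and a union-find structure. The function name, the threshold rule, and the input layout (an integer label map plus a float image) are assumptions; a learned merging criterion would replace the distance test.

```python
import numpy as np

def merge_superpixels(labels, image, threshold):
    """labels: (H, W) int array of superpixel ids starting at 0.
    image: (H, W, C) float array. Returns a merged (H, W) label map."""
    n = int(labels.max()) + 1
    flat = labels.ravel()

    # Mean color of each superpixel, one bincount per channel.
    counts = np.maximum(np.bincount(flat, minlength=n), 1)
    means = np.stack(
        [np.bincount(flat, weights=image[..., c].ravel(), minlength=n) / counts
         for c in range(image.shape[2])],
        axis=1,
    )

    # Superpixels are adjacent if they touch horizontally or vertically.
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        boundary = a != b
        pairs.update(zip(a[boundary].tolist(), b[boundary].tolist()))

    # Union-find with path halving.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge neighbors whose original mean colors are within the threshold.
    for a, b in pairs:
        if np.linalg.norm(means[a] - means[b]) < threshold:
            parent[find(a)] = find(b)

    roots = np.array([find(i) for i in range(n)])
    return roots[labels]
```

Because the means are not recomputed after each merge, chains of similar neighbors can be joined into one segment; whether that is desirable is one of the things a project could study.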
Papers to read: Some segmentation papers from
This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
EM text classification in the case where you
have labels for some documents, but not for others (see
and come up with your own suggestions)
Optical character recognition, and the simpler digit recognition task, have been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
This download contains 2004-2005 NBA and ABA stats for:
-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records
Currently all of the regular season
This dataset includes 45 years of daily precipitation data from the Northwest of the US:
Weather prediction: Learn a probabilistic model to predict rain levels
Sensor selection: Where should you place sensors to best predict rain
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize
Can you predict the rating a user will give to a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
Can you discover clusters of similar movies or users?
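For the rating prediction question, a simple user-based collaborative filtering baseline can be sketched as below. It is an illustration only: the dense ratings matrix with NaN for missing entries is an assumption (the real data is far too large and sparse for this layout), and the function name and similarity choice are hypothetical.

```python
import numpy as np

def predict_rating(R, user, movie):
    """R: (n_users, n_movies) float array, np.nan where no rating exists.
    Assumes every user has rated at least one movie."""
    rated = ~np.isnan(R)
    user_means = np.nanmean(R, axis=1)

    # Center each user's ratings on that user's mean; missing entries become 0.
    centered = np.where(rated, R - user_means[:, None], 0.0)

    # Cosine similarity between the target user and every other user.
    norms = np.linalg.norm(centered, axis=1)
    sims = centered @ centered[user] / np.maximum(norms * norms[user], 1e-12)
    sims[user] = 0.0

    # Use only positively correlated users who actually rated this movie.
    neighbors = rated[:, movie] & (sims > 0)
    if not neighbors.any():
        return user_means[user]

    # Similarity-weighted average of the neighbors' deviations from their means.
    offset = (sims[neighbors] * centered[neighbors, movie]).sum() / sims[neighbors].sum()
    return user_means[user] + offset
```

Treating missing ratings as zero deviation keeps the code short but biases the similarities; restricting the similarity to co-rated movies is a common refinement.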
Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.
1. Which sensors correspond to each column?
Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/
The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.
In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help the learning process escape local optima. In this project you will implement the Langevin learning algorithm for Xing's dual-wing harmonium model, and test your algorithm on the data in the UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
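The Langevin update itself is short, and a toy version may help before tackling the harmonium. The sketch below is not the algorithm from the cited papers: it runs the standard unadjusted Langevin step on a one-dimensional target whose gradient is known exactly, whereas in the project the gradient would be the sample-based CD estimate. The function name and the step size are assumptions.

```python
import numpy as np

def langevin_samples(grad_log_p, theta0, step, n_steps, rng):
    """Unadjusted Langevin dynamics for a scalar parameter.

    Each update is half a gradient step plus Gaussian noise whose
    variance equals the step size.
    """
    theta = float(theta0)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        noise = rng.standard_normal()
        theta = theta + 0.5 * step * grad_log_p(theta) + np.sqrt(step) * noise
        samples[t] = theta
    return samples

# Toy target: a standard normal, whose log-density gradient is -theta.
rng = np.random.default_rng(0)
samples = langevin_samples(lambda theta: -theta, 3.0, 0.1, 50000, rng)
```

With the noise term removed this reduces to plain gradient ascent on the log density, which is the sense in which the perturbation is "added to the gradient".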
The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data
Can you classify the text of an e-mail message to decide who sent it?
There are many other datasets out there. UC Irvine has a repository that could be useful for your project:
Sam Roweis also has a link to several datasets: http://www.cs.toronto.edu/~roweis/data.html