Course Project
Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two to three students. Each project will also be assigned a 701 instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 20% of your final class grade, and will have 4 deliverables:
Note that all write-ups must be in the form of a NIPS paper. The page limits are strict! Papers over the limit will not be considered.
Project Proposal:
You must turn in a brief project proposal (1-page maximum). Read the list of available data sets and potential project ideas below. We strongly recommend that you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on data that has not yet been collected, so you have to work on existing data sets. It is also possible to propose a project on some theoretical aspect of machine learning. If you want to do this, please discuss it with us. Note that even though you can use data sets you have used before, you cannot use as your class project something that you started before the class.
Project proposal format: Proposals should be one page maximum. Include the following information:
This should be a short report of 3-4 pages, and it serves as a check-point. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections "under construction". Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as "place-holders" for the results you plan/hope to obtain.
Grading scheme for the project report:
Your final report is expected to be an 8-page report. You should submit both an electronic and a hardcopy version of your final report. It should roughly have the following format:
All projects will present a poster at the project poster session: Thu, May 1st, 3:00-6:00pm in Newell-Simon Hall 3305. At least one project member should be present during the poster hours. The session will be open to everybody.
Project Suggestions:
Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent given the class) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include as classifier input (5,000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
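As a reference point for the Gaussian Naive Bayes baseline in Project A, here is a minimal sketch in plain Python. It is not the course's Matlab software: the function names (`fit_gnb`, `predict_gnb`), the variance floor, and the synthetic 20-feature data standing in for windowed voxel features are all assumptions made for illustration.

```python
import math
import random

def fit_gnb(X, y, var_floor=1e-6):
    """Estimate a log prior, per-feature means, and per-feature variances per class."""
    model = {}
    n = len(y)
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + var_floor
            for j in range(d)
        ]
        model[c] = (math.log(len(rows) / n), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(c):
        log_prior, means, variances = model[c]
        # Features are treated as independent Gaussians given the class,
        # so the log likelihood is a sum over features.
        return log_prior + sum(
            -0.5 * (math.log(2 * math.pi * v) + (xj - m) ** 2 / v)
            for xj, m, v in zip(x, means, variances)
        )
    return max(model, key=log_posterior)

# Synthetic stand-in for the real features (hypothetical data).
rng = random.Random(0)

def sample(mean, n, d=20):
    return [[rng.gauss(mean, 1.0) for _ in range(d)] for _ in range(n)]

X_train = sample(0.0, 50) + sample(1.0, 50)
y_train = ["sentence"] * 50 + ["picture"] * 50
X_test = sample(0.0, 25) + sample(1.0, 25)
y_test = ["sentence"] * 25 + ["picture"] * 25

model = fit_gnb(X_train, y_train)
correct = sum(predict_gnb(model, x) == t for x, t in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

A TAN classifier would relax the independence assumption in `log_posterior` by letting each feature depend on one other feature in addition to the class.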
The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/
Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from
This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website.
This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
Project ideas:
EM text classification in the case where you have labels for some documents, but not for others (see McCallum et al., and come up with your own suggestions)
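To illustrate the idea of combining labeled and unlabeled documents with EM, here is a minimal sketch built on a multinomial Naive Bayes model. It is written in the spirit of the approach referenced above, not as a reproduction of it: the function names, the smoothing constant `alpha`, the fixed iteration count, and the six toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: estimate log priors and log word probabilities from soft labels.

    docs is a list of token lists; weights is a list of {class: P(class | doc)}.
    """
    class_mass = {c: alpha for c in classes}
    word_mass = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            class_mass[c] += w[c]
            for t in tokens:
                word_mass[c][t] += w[c]
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_word[c] = {t: math.log((word_mass[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_word

def posterior(tokens, model, classes):
    """E-step for one document: P(class | tokens) under the current model."""
    log_prior, log_word = model
    scores = {
        c: log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def em_naive_bayes(labeled, unlabeled, n_iter=10):
    """Start from the labeled documents, then alternate E- and M-steps."""
    classes = sorted({y for _, y in labeled})
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    lab_docs = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, hard, classes, vocab)
    for _ in range(n_iter):
        soft = [posterior(d, model, classes) for d in unlabeled]
        model = train_nb(lab_docs + unlabeled, hard + soft, classes, vocab)
    return model, classes

# Toy corpus (hypothetical): "skate" and "goalie" never appear in a labeled document.
labeled = [("puck ice goal".split(), "hockey"), ("engine car wheel".split(), "autos")]
unlabeled = [d.split() for d in (
    "puck skate ice", "skate goalie goal", "engine brake car", "brake tire wheel")]

model, classes = em_naive_bayes(labeled, unlabeled)
probs = posterior("skate goalie".split(), model, classes)
print(probs)
```

In this toy corpus the labeled documents alone say nothing about "skate" or "goalie"; the unlabeled documents link those words to labeled vocabulary, which is the effect the project would measure on the real newsgroup data.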
A general overview of our data: we have approximately 16,000 labeled
character samples from 39 middle and high school students, consisting
of x-coord, y-coord, and time per point in each stroke. They are
grouped into sets of 45 equations that each student copied. The symbols
in our dataset are: 0-9, x, y, a, b, c, +, -, _ (fraction bar), =, (, ).
There are 3 main ideas for projects:
1. HOW MUCH DATA: All our data is currently hand-labeled, and we
have lots of it. One question might be, if the data wasn't labeled,
what would be the added value of additional data? That is, what would
be the optimal or minimal dataset? This could be defined along several
axes: the number of users, the number of samples per character, or the
number of samples per symbol per user. We have done a few preliminary
experiments where it is clear that there is a leveling off point for
test accuracy -- likely caused by the increase in variability of
adding new samples (especially by new users with differing
handwriting styles), which harms the classification algorithm (see #3).
For future studies and domains it might be useful to get a general
sense of "data saturation" -- a recommended canonical corpus size --
for researchers who aren't as experienced in ML and data mining or in
domains where data collection is costly.
2. HOW MUCH LABELED DATA AND/OR AUTOMATIC LABELING: Hand-labeling all our data took quite a bit
of time. What possibilities exist for an
automated, semi-supervised labeling algorithm that could tell us how
much data we need to label in advance and how much human verification
is needed on the automatically labeled data? A side note is that the
collection of this data (for the sake of the users) was in the form of
one equation at a time rather than one character at a time, so the
characters needed to be segmented at the time of labeling since the
strokes all ran together in the logs. An automated segmenting approach
would be very helpful to us in the future!
3. MULTIPLE CLASSIFIERS:
Finally, there is quite a bit of variance between users in that their
handwriting styles differ and the particular means of executing a style
differs across users. We hypothesize that multiple classifiers trained
per user would have higher walk-up-and-use accuracy on a set of
independent users than one classifier that has to generalize across all
user styles. So this could also be an interesting area to explore.
Optical character recognition, and the simpler digit recognition task, have been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
Project suggestion:
Project F: NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:
-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records
Currently all of the regular season
Project idea:
This dataset includes 45 years of daily precipitation data from the Northwest of the US:
http://www.jisao.washington.edu/data_sets/widmann/
Project ideas:
Weather prediction: Learn a probabilistic model to predict rain levels
Sensor selection: Where should you place sensors to best predict rain?
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
Project ideas:
Papers:
Project J: Email Annotation
The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person's name. This task is an example of the general problem area of Information Extraction.
http://www.cs.cmu.edu/~einat/datasets.html
Project Ideas:
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize
Project idea:
Can you predict the rating a user will give to a movie from the movies that user has rated in the past, as well as the ratings similar users have given to similar movies?
Can you discover clusters of similar movies or users?
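One simple way to approach the rating-prediction question is user-based collaborative filtering. The sketch below is a minimal illustration, not a method prescribed for the Netflix data: the function names, the choice of mean-centered cosine similarity, and the three-user toy ratings dictionary are assumptions.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def similarity(ratings, u, v):
    """Cosine similarity of mean-centered ratings on movies both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mu_u, mu_v = mean(ratings[u].values()), mean(ratings[v].values())
    du = [ratings[u][m] - mu_u for m in common]
    dv = [ratings[v][m] - mu_v for m in common]
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(du, dv)) / norm

def predict_rating(ratings, user, movie):
    """User's mean rating plus a similarity-weighted average of other users' deviations."""
    mu = mean(ratings[user].values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        s = similarity(ratings, user, other)
        num += s * (theirs[movie] - mean(theirs.values()))
        den += abs(s)
    # Fall back to the user's own mean when nobody comparable rated the movie.
    return mu if den == 0.0 else mu + num / den

# Toy ratings (hypothetical): ann agrees with bob and disagrees with cat.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Big": 1},
    "bob": {"Alien": 5, "Heat": 5, "Big": 1, "Up": 2},
    "cat": {"Alien": 1, "Heat": 2, "Big": 5, "Up": 5},
}

pred = predict_rating(ratings, "ann", "Up")
print(f"predicted rating for ann on Up: {pred:.2f}")
```

Because the full data set has 100 million records, computing all pairwise similarities this way would be costly; a project would need a scalable variant, for example restricting to a neighborhood of similar users.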
Project L: Physiological Data Modeling (bodymedia)
Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.
1. Which sensors correspond to each column?
| characteristic1 | age |
| characteristic2 | handedness |
| sensor1 | gsr_low_average |
| sensor2 | heat_flux_high_average |
| sensor3 | near_body_temp_average |
| sensor4 | pedometer |
| sensor5 | skin_temp_average |
| sensor6 | longitudinal_accelerometer_SAD |
| sensor7 | longitudinal_accelerometer_average |
| sensor8 | transverse_accelerometer_SAD |
| sensor9 | transverse_accelerometer_average |
Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/
Project idea:
The Caltech 256 dataset contains images
of 256 object categories taken at varying orientations, varying lighting
conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Project ideas:
Project N: Learning POMDP structure so as to maximize utility
Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.
Project O: Learning partially observed MRFs: the Langevin algorithm
In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help to get the learning process out of local optima. In this project you will implement the Langevin learning algorithm for Xing's dual wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
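The core of the Langevin update, a gradient step plus a random perturbation, can be sketched on a toy one-dimensional objective. This is not a harmonium learner: the function name `langevin_ascent`, the step size, the Gaussian toy objective, and the burn-in length are all assumptions chosen only to show the form of the update.

```python
import math
import random

def langevin_ascent(grad, theta0, step=0.1, n_steps=20000, seed=0):
    """Gradient ascent with Gaussian noise added to every update.

    theta <- theta + (step / 2) * grad(theta) + sqrt(step) * N(0, 1)
    """
    rng = random.Random(seed)
    theta = theta0
    trace = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        theta = theta + 0.5 * step * grad(theta) + math.sqrt(step) * noise
        trace.append(theta)
    return trace

# Toy objective (an assumption): log p(theta) = -(theta - 3)^2 / 2, maximized at 3.
def grad_log_p(theta):
    return -(theta - 3.0)

trace = langevin_ascent(grad_log_p, theta0=-5.0)
kept = trace[1000:]  # discard early iterations as burn-in
avg = sum(kept) / len(kept)
print(f"average of iterates after burn-in: {avg:.2f}")
```

In the project setting, `grad` would be replaced by the sample-based gradient estimate for the model parameters, and the noise scale becomes a tuning choice that trades exploration against stability.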
Project P: Context-specific independence
We learned in class that CSI can speed up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?
The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data
Project ideas:
Can you classify the text of an e-mail message to decide who sent it?
Project R: More data
There are many other datasets out there. UC Irvine has a repository that could be useful for your project:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has a link to several datasets out there:
http://www.cs.toronto.edu/~roweis/data.html