Probabilistic Graphical Models
10-708, Fall 2009
Eric Xing, School of Computer Science, Carnegie Mellon University
Course Project
Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two to three students. Each project will also be assigned a 708 instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 35% of your final class grade, and will have 4 deliverables:
- Proposal: 1 page (10%). Due: 7th Oct
- Midway Report: 3-4 pages (20%). Due: 4th Nov
- Final Report: 8 pages (40%). Due: 2nd Dec
- Poster Presentation: (30%)
Note that all write-ups should be in the form of a NIPS paper. The page limits are strict! Papers over the limit will not be considered.
Project Proposal
You must turn in a brief project proposal (1-page maximum). Read the list of available data sets and potential project ideas below. We strongly recommend that you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on data that has not yet been collected, so you have to work on existing data sets. It is also possible to propose a project on some theoretical aspect of machine learning. If you want to do this, please discuss it with us. Note that although you may use data sets you have used before, you may not submit as your class project something that you started before the class.
Project proposal format: Proposals should be one page maximum. Include the following information:
- Project title
- Data set
- Project idea. This should be approximately two paragraphs.
- Software you will need to write.
- Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal
- Teammate: will you have a teammate? If so, whom? Maximum team size is two students. We expect projects done in a group to be more substantial than projects done individually.
- Midterm milestone: What will you complete by the midterm? Experimental results of some kind are expected here. You should also describe what portion of the project each partner will be doing.
Midway Report
This should be a short report of 3-4 pages, and it serves as a checkpoint. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections "under construction". Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as "placeholders" for the results you plan or hope to obtain.
Grading scheme for the project report:
- 70% for proposed method (should be almost finished)
- 25% for the design of upcoming experiments
- 5% for plan of activities (in an appendix, please show the old one and the revised one, along with the activities of each group member)
Final Report
Your final report is expected to be an 8-page report. You should submit both an electronic and a hardcopy version of your final report. It should roughly have the following format:
- Introduction
  - Motivation
- Problem definition
- Proposed method
  - Intuition: why should it be better than the state of the art?
  - Description of its algorithms
- Experiments
  - Description of your testbed; list of questions your experiments are designed to answer
  - Details of the experiments; observations
- Conclusions
Poster Presentation
All projects will present a poster at the project poster session on 30th November, 2009, from 2:30-5:30pm in the NSH atrium. At least one project member should be present during the poster hours. The session will be open to everybody.
Project Suggestions:
Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.
Project A: Brain imaging data (fMRI)
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Available software: Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naïve Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naïve Bayes classifier (which assumes voxel activities are conditionally independent), by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include for classifier input (5000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
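As a point of reference for the baseline that this project tries to improve on, here is a minimal pure-Python sketch of a Gaussian Naive Bayes classifier. It is not the course's Matlab software. The function names, the variance floor, and the two-feature toy windows are assumptions made for illustration. The first lines only work out the feature count implied by the figures quoted in the data description.

```python
import math

# Feature count for one window, using the figures quoted in the text.
VOXELS = 5000
IMAGES_PER_WINDOW = int(8 / 0.5)  # 8-second window, one image every 500 msec
N_FEATURES = VOXELS * IMAGES_PER_WINDOW  # 80,000 inputs before feature selection


def fit_gnb(X, y):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    model = {}
    n = len(y)
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        # A small variance floor (an assumption) keeps the log-density finite.
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-6
            for j in range(d)
        ]
        model[c] = (math.log(len(rows) / n), means, variances)
    return model


def predict_gnb(model, x):
    """Pick the class with the highest log-joint score.

    Features are treated as conditionally independent given the class,
    so the log-likelihood is a plain sum over features.
    """
    def log_joint(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
            for xj, m, v in zip(x, means, variances)
        )
    return max(model, key=log_joint)


# Hypothetical toy data: two features per window instead of tens of thousands.
X_train = [[0.0, 0.1], [0.2, -0.1], [1.0, 1.1], [0.9, 1.0]]
y_train = ["sentence", "sentence", "picture", "picture"]
gnb = fit_gnb(X_train, y_train)
```

A TAN classifier would replace the per-feature sum in `log_joint` with terms that also condition each feature on one parent feature. The feature count shows why feature selection comes first.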
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
Project B: Image Segmentation Dataset
The goal is to segment images in a meaningful way.
http://www.cs.berkeley.edu
Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from
Project C: Twenty Newsgroups text data
This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website.
This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
Project ideas:
· EM text classification in the case where you have labels for some documents, but not for others (see McCallum et al., and come up with your own suggestions)
Project D: Sensor network data
A 54-node sensor network collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th 2004.
http://www-2.cs.cmu.edu/~guestrin/Research/Data/
This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low.
Project ideas:
· Learn graphical models representing the correlations between measurements at different nodes
· Develop new distributed algorithms for solving a learning task on this data
Papers:
· http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf
· http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf
· Efficient Structure Learning of Markov Networks using L1-Regularization
Project E: Character recognition (digits) data
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
Project suggestion:
· Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
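To make the HMM suggestion concrete, the sketch below shows Viterbi decoding over per-character scores. It is a toy illustration under stated assumptions and not code tied to the OCR dataset. The three-letter alphabet, the emission scores (standing in for a per-character image classifier), and the transition probabilities are all made up for the example.

```python
import math


def viterbi(emission_logp, transition_logp, initial_logp):
    """Most likely letter sequence under a first-order HMM.

    emission_logp: one dict per position, letter -> log-score of the image.
    transition_logp: dict (previous, current) -> log P(current | previous).
    initial_logp: dict letter -> log P(first letter).
    """
    letters = list(initial_logp)
    best = {s: initial_logp[s] + emission_logp[0][s] for s in letters}
    backpointers = []
    for obs in emission_logp[1:]:
        new_best, pointer = {}, {}
        for cur in letters:
            prev = max(letters, key=lambda p: best[p] + transition_logp[(p, cur)])
            new_best[cur] = best[prev] + transition_logp[(prev, cur)] + obs[cur]
            pointer[cur] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace the best path backwards from the best final letter.
    path = [max(letters, key=lambda s: best[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path))


# Hypothetical toy problem: the second character image looks slightly more
# like "v" than "u", but "u" is far more likely to follow "q".
alphabet = "quv"
initial = {s: math.log(1 / 3) for s in alphabet}
transitions = {(p, c): math.log(1 / 3) for p in alphabet for c in alphabet}
transitions.update({("q", "q"): math.log(0.05),
                    ("q", "u"): math.log(0.90),
                    ("q", "v"): math.log(0.05)})
emissions = [
    {"q": math.log(0.90), "u": math.log(0.05), "v": math.log(0.05)},
    {"q": math.log(0.10), "u": math.log(0.40), "v": math.log(0.50)},
]
decoded = viterbi(emissions, transitions, initial)
independent = "".join(max(e, key=e.get) for e in emissions)
```

Here `independent` classifies each character on its own, while `decoded` also uses the neighboring-letter statistics. Comparing the two on real data is one way to measure what the HMM adds.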
Project F: Precipitation data
This dataset includes 45 years of daily precipitation data from the Northwest of the
http://www.jisao.washington
Project ideas:
· Weather prediction: Learn a probabilistic model to predict rain levels
Project G: WebKB
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
Project ideas:
· Assign labels to the documents using both content and link information. You could use a CRF-like model where the hidden variables are the class labels of the web pages and the observed variables are the words in each web page. The undirected edges between the labels are given by the hyperlink structure with direction ignored.
Papers:
· http://www-2.cs.cmu.edu/~webkb/
· http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps
Project H: Electoral Campaign Contribution data
The dataset provided below is compiled from the Federal Election Commission (http://www.fec.gov/finance/disclosure/ftpdet.shtml) and contains information about federal electoral campaign contributions for elections from 1980 to 2006. There are 3 types of entities: Donors, Committees, and Candidates. Donors contribute money to committees, and committees then give money to candidates. Donors are individuals, like Harry Q. Bovik or Ben Roethlisberger. Committees are organizations, and may be devoted to a single candidate or several candidates. For instance, a committee might be CMU Students for Ron Paul, or the Machine Learning Researchers for Political Action. Candidates are registered candidates for any federal election: Senate, House, or Presidential.
http://www.cs.cmu.edu/~mmcgloho/local/data/fec_data.html
The indices for all three entities list name and address data, with several additional fields. Donors also have a listed occupation. Committees have data pertaining to each committee's interest. The index for candidates also includes information on party and election status. Full lists of features may be found in the readme.
Project ideas:
- Temporal Models such as HMMs or DBNs, modeling financial transactions over time.
- Relational Models, predicting links between donors/committees, and committees/candidates. One could also create entities/links for features (Donor Harry Bovik ResidesIn Zip15213).
- Learning causal relationships in the data.
Project I: Deduplication
The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of Deduplication, Identity Uncertainty and Record Linkage.
http://www.cs.utexas.edu/users/ml/riddle/data.html
Project Ideas:
- One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique".
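The pairwise formulation can be sketched as follows. This is a minimal illustration, not a recommended method. The token-overlap feature, the fixed threshold (standing in for a trained classifier), and the toy records are assumptions. So is the reading of "not-unique" as "the two records refer to the same entity".

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap between two record strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def classify_pairs(records, threshold=0.5):
    """Label every record pair; the threshold stands in for a learned classifier."""
    labels = {}
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        similar = jaccard(a, b) >= threshold
        labels[(i, j)] = "not-unique" if similar else "unique"
    return labels


# Hypothetical records, not taken from the datasets linked above.
records = [
    "harry q bovik pittsburgh pa",
    "harry bovik pittsburgh pa",
    "jane doe seattle wa",
]
pair_labels = classify_pairs(records)
```

Note that the number of pairs grows quadratically with the number of records, so a real project would likely need some way of pruning candidate pairs before classification.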
Papers:
Project J: Email Annotation
The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.
http://www.cs.cmu.edu/~einat/datasets.html
Project Ideas:
- Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".
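The sequential-labeling representation can be illustrated with a short sketch. The example sentence and its labels are made up, and the helper name `extract_names` is hypothetical. A real system would predict the labels with a sequence model instead of taking them as given.

```python
def extract_names(tokens, labels):
    """Group consecutive "person-name" tokens into name spans."""
    names, current = [], []
    for token, label in zip(tokens, labels):
        if label == "person-name":
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:  # a name that runs to the end of the email
        names.append(" ".join(current))
    return names


# Hypothetical email fragment with one label per whitespace-separated token.
tokens = "Please forward the draft to Harry Bovik before Friday".split()
labels = (["not-a-person-name"] * 5
          + ["person-name"] * 2
          + ["not-a-person-name"] * 2)
names = extract_names(tokens, labels)
```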
Papers: http://www.cs.cmu.edu/~einat/email-2004.pdf
Project K: Inference
Comparing approximate inference for Ising models:
Ising models are discrete-state 2D grid-structured MRFs with pairwise potentials. Many models (Bayes nets, Markov nets, factor graphs) can be converted into this form. Exact inference is intractable, so people have tried various approximations, such as mean field, loopy belief propagation (BP), generalized belief propagation, Gibbs sampling, Rao-Blackwellised MCMC, Swendsen-Wang, graph cuts, etc.
The goal of this project is to empirically compare these methods on some MRF models (using other people's code), and to make a uniform Matlab interface to all the functions (so they can be interchanged in a plug-n-play fashion).
To test, you can use an MRF with random parameters, but it would be better to team up with someone who is trying to learn MRF parameters from real data (see below).
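As a small-scale illustration of such a comparison, the sketch below runs naive mean field on a 3x3 Ising grid with random parameters and compares its marginal means against brute-force enumeration. This is pure Python and not the C++/Matlab code referred to in this project. The parameterization p(x) ∝ exp(Σ h_i x_i + Σ J_ij x_i x_j) with spins in {-1, +1}, the parameter ranges, and the number of sweeps are assumptions chosen for the example.

```python
import itertools
import math
import random


def grid_edges(rows, cols):
    """Edges between horizontally and vertically adjacent grid sites."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))
            if r + 1 < rows:
                edges.append((i, i + cols))
    return edges


def exact_means(h, J):
    """Brute-force E[x_i] over all 2^n spin states; only feasible for tiny grids."""
    n = len(h)
    totals, z = [0.0] * n, 0.0
    for x in itertools.product((-1, 1), repeat=n):
        score = sum(h[i] * x[i] for i in range(n))
        score += sum(w * x[i] * x[j] for (i, j), w in J.items())
        p = math.exp(score)  # unnormalized probability of this state
        z += p
        for i in range(n):
            totals[i] += p * x[i]
    return [t / z for t in totals]


def mean_field_means(h, J, sweeps=200):
    """Naive mean field: repeatedly set m_i = tanh(h_i + sum_j J_ij m_j)."""
    n = len(h)
    neighbors = {i: [] for i in range(n)}
    for (i, j), w in J.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    m = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            m[i] = math.tanh(h[i] + sum(w * m[j] for j, w in neighbors[i]))
    return m


rng = random.Random(0)  # fixed seed so the random MRF is reproducible
edges = grid_edges(3, 3)
h = [rng.uniform(-1, 1) for _ in range(9)]
J = {e: rng.uniform(-0.5, 0.5) for e in edges}
exact = exact_means(h, J)
approx = mean_field_means(h, J)
max_error = max(abs(a - b) for a, b in zip(exact, approx))
```

Giving every method the same signature, parameters in and marginal means out as in `exact_means` and `mean_field_means`, is one way to get the plug-n-play interchangeability the project asks for.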
C++ code (with a Matlab wrapper) for mean field, loopy BP, generalized BP, Gibbs sampling and Swendsen-Wang is available from here. Code for RB-MCMC can be obtained from Firas Hamze or Nando de Freitas. C++ graphcuts code is available (without a Matlab interface) here.
Some related papers you should read first:
- Comparing the mean field method and belief propagation for approximate inference in MRFs, Yair Weiss, 2001.
- Comparison of Graph Cuts with Belief Propagation for Stereo, using Identical MRF Parameters, ICCV 2003. (He has C code available.)
- Tutorial on approximate inference, Frey and Jojic, PAMI 2004.
Comparing message-passing schedules for Belief Propagation:
The goal of this project is to compare the effects of the choice of the schedule of messages on the results of Loopy Belief Propagation. One of the goals would be to recreate the results of the paper
Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing.
Comparing variational learning, MCMC learning and IPF of Ising models on binary images:
Simple images, such as handwritten digits, can be represented by a grid of binary numbers, on which an Ising model can be defined. An IPF algorithm makes use of the junction tree algorithm to learn the model. In this project you are asked to plug in a mean field or generalized mean field method for inference in the learning process, and compare the outcome with that of IPF. See Yee Whye Teh's paper for the IPF methods and a description of the data and the problem. Since variational methods optimize a lower bound of the likelihood instead of the true likelihood, your results will reveal the consequences of this approximation for learning and may yield interesting theoretical insights.
Project L: MRF and vision:
2D CRFs for visual texture classification
Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:
- labeled training data.
- C++ graphcuts code for approximate inference
- Kevin Murphy's Matlab CRF code
- Carl Rasmussen's Matlab conjugate gradient minimizer (better than using netlab or the Matlab optimization toolbox)
- Intro to CRFs by Hanna Wallach
- Maxent page, includes code
- Steerable pyramid Matlab code, possibly useful set of image features
- Matlab wavelet toolbox, possibly useful set of image features
- Paper on CRFs for sign detection, J. Weinman, 2004
- Markov Random Field Modeling in Computer Vision, S. Z. Li, 1995.
- G. Winkler, "Image Analysis, Random Fields, and MCMC Methods", 2nd edition, 2003.
- Markov random fields and images, P. Perez. CWI Quarterly, 11(4):413-437, 1998. Review article.
2D CRFs for satellite image classification
The goal of this project is to classify pixels in satellite image data into classes like field vs road vs forest, using MRFs/CRFs (see above), or some other technique. Some possibly useful links:
- Fully Bayesian Image Segmentation -- an Engineering Perspective, Morris et al., 1996.
- A binary tree-structured MRF model for multispectral satellite image segmentation, 2003
Project M: Unsupervised Parts of Speech tagging
Project ideas:
- Assume a chain graphical model and learn the parameters and parts of speech labels.
Project N: Video tracking
Object tracking and trajectory modeling using a non-linear dynamic model based on an HMM or state-space model (e.g., input-output HMM, factorial HMM, switching SSM)
The goal of this project is to reproduce the results in the following paper: Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences (CVPR 2000). Note that Brendan Frey has Matlab code for transformation invariant EM on his home page. See also Real-time On-line Learning of Transformed Hidden Markov Models from Video, Nemanja Petrovic, Nebojsa Jojic, Brendan J. Frey, Thomas S. Huang, AISTATS 2003, which is 10,000 times faster!
Project O: Context-specific independence
Project P: More data
There are many other datasets out there. UC Irvine has a repository that could be useful for your project:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has a link to several datasets out there:
