Probabilistic Graphical Models 10-708, Fall 2005
Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done individually or in teams of two to three students. Each project will also be assigned a 708 instructor as a project consultant/mentor, who will consult with you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project is worth 30% of your final class grade and has two final deliverables:
1. a writeup in the form of a NIPS paper (8 pages maximum in NIPS format, including references), due Dec 5, worth 60% of the project grade, and
2. a poster presenting your work for a special ML class poster session at the end of the semester, due Dec 2, worth 20% of the project grade.
In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by Nov 14, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
Project Proposal:
You must turn in a brief project proposal (1 page maximum) by Oct 19th.
You are encouraged to come up with a topic directly related to your current research, or a topic of your own interest related to graphical models, that bears a non-trivial technical component (either theoretical or application-oriented). However, the proposed work must be new and may not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project.
Alternatively, you may pick a less adventurous project from the list of potential project ideas below, using the datasets provided. These data sets have been successfully used for machine learning in the past, so you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided datasets.
Project proposal format: Proposals should be one page maximum. Include the following information:
· Project title
· Project idea. This should be approximately two paragraphs.
· Software you will need to write.
· Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
· Teammate(s): will you have teammate(s)? If so, whom? Maximum team size is three students.
· Nov 14 milestone: What will you complete by Nov 14? Experimental results of some kind are expected here.
Project suggestions:
· Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using graphical models. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.
You can also find some project ideas below.
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include (5,000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
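As a starting point before moving to a TAN structure, the Gaussian Naive Bayes baseline is easy to sketch. The code below is a minimal illustration, not the course-provided Matlab software; the array shapes and the `fit_gnb`/`predict_gnb` names are invented for the example (rows are 8-second windows after feature selection, labels are 0 = picture, 1 = sentence).

```python
import numpy as np

def fit_gnb(X, y):
    """Fit per-class means, variances, and priors for Gaussian Naive Bayes."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Small variance floor avoids division by zero for constant features.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6, len(Xc) / len(X))
    return params

def predict_gnb(X, params):
    """Pick the class with highest log posterior, assuming feature independence."""
    classes = sorted(params.keys())
    scores = []
    for c in classes:
        mu, var, prior = params[c]
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(ll + np.log(prior))
    best = np.argmax(np.array(scores), axis=0)
    return np.array([classes[i] for i in best])
```

The independence assumption in `predict_gnb` is exactly what the TAN extension relaxes, by allowing each voxel feature to also depend on one other feature.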
The goal is to segment images in a meaningful way.
http://www.cs.berkeley.edu
Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or on discontinuities of color and texture. The ground truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large treewidth.
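One standard smoothness potential of this kind is the Potts model, which penalizes 4-connected neighbors that carry different labels. A minimal sketch (the `potts_energy` name and the scalar `beta` weight are illustrative; in the project you would learn such weights rather than fix them):

```python
import numpy as np

def potts_energy(labels, beta=1.0):
    """Potts smoothness energy: beta times the number of label disagreements
    between 4-connected neighbors in a 2D integer label array."""
    horiz = (labels[:, :-1] != labels[:, 1:]).sum()   # left-right neighbor pairs
    vert = (labels[:-1, :] != labels[1:, :]).sum()    # up-down neighbor pairs
    return beta * (horiz + vert)
```

In an MRF formulation this term is added to per-pixel (or per-superpixel) data terms, and inference trades data fit against boundary length.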
Papers to read: Some segmentation papers from
This data set contains 1,000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
Project ideas:
· EM text classification in the case where you have labels for some documents but not for others (see McCallum et al., and come up with your own suggestions)
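The EM idea above can be sketched for multinomial Naive Bayes along the lines of Nigam/McCallum et al.: treat missing labels as hidden variables, alternating between estimating class posteriors for unlabeled documents and refitting the model. All names below, and the convention that `-1` marks an unlabeled document, are assumptions for the sketch.

```python
import numpy as np

def em_naive_bayes(X, y, n_classes, n_iters=20):
    """Semi-supervised multinomial Naive Bayes via EM.
    X: (n_docs, n_words) word-count matrix; y: labels, -1 = unlabeled."""
    n, v = X.shape
    # Responsibilities: one-hot for labeled docs, uniform for unlabeled ones.
    R = np.full((n, n_classes), 1.0 / n_classes)
    R[y >= 0] = np.eye(n_classes)[y[y >= 0]]
    for _ in range(n_iters):
        # M-step: class priors and Laplace-smoothed word probabilities.
        prior = R.sum(axis=0) / n
        word_p = R.T @ X + 1.0
        word_p /= word_p.sum(axis=1, keepdims=True)
        # E-step: recompute posteriors for the unlabeled documents only.
        log_post = X @ np.log(word_p.T) + np.log(prior)
        P = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        R[y < 0] = P[y < 0]
    return prior, word_p, R
```

Keeping the labeled responsibilities clamped while updating only the unlabeled rows is the simplest variant; weighting unlabeled documents differently is one of the extensions discussed in the literature.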
Using this 54node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th 2004.
http://www2.cs.cmu.edu/~guestrin/Research/Data/
This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low.
Project ideas:
· Learn graphical models representing the correlations between measurements at different nodes
· Develop new distributed algorithms for solving a learning task on this data
Papers:
· http://www2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf
· http://www2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf
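For the first idea, if the sensor measurements are treated as jointly Gaussian, candidate edges can be read off the inverse covariance (precision) matrix, since zero partial correlation corresponds to conditional independence. The sketch below assumes a complete (samples x sensors) data matrix and a hand-picked threshold; dealing with this dataset's missing readings and outliers is part of the project.

```python
import numpy as np

def correlation_graph(X, threshold=0.1):
    """Return a boolean adjacency matrix linking sensor pairs whose
    partial correlation magnitude exceeds the threshold."""
    cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])  # regularize
    prec = np.linalg.inv(cov)
    d = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(d, d)   # normalized off-diagonal precision
    np.fill_diagonal(partial, 0.0)
    return np.abs(partial) > threshold
```

On chain-structured data (sensor 2 depends on sensor 1 only through sensor 1's neighbor), this recovers the chain rather than the dense marginal-correlation graph.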
Optical character recognition, and the simpler digit recognition task, have been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
Project suggestion:
· Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
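The HMM suggestion amounts to Viterbi decoding over letters, with a per-image classifier supplying emission scores and letter-bigram statistics supplying transitions. A minimal sketch (all input arrays are assumed given; nothing here is tied to the actual dataset format):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence. log_init: (S,) initial log probs;
    log_trans: (S, S) log transition probs; log_emit: (T, S) per-position
    log emission scores. Returns a list of T state indices."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace back the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For the OCR task, S would be 26 letters and `log_emit[t]` the log scores of a per-character classifier on the t-th letter image.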
This dataset includes 45 years of daily precipitation data from the Northwest of the United States:
http://www.jisao.washington
Project ideas:
· Weather prediction: learn a probabilistic model to predict rain levels
· Sensor selection: where should you place sensors to best predict rain?
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www2.cs.cmu.edu/~webkb/
Project ideas:
· Learning classifiers to predict the type of webpage from the text
· Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?
Papers:
· http://www2.cs.cmu.edu/~webkb/
· http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps
Project K: Inference
VIBES (Variational Inference for Bayesian networks) is an alternative to BUGS that uses a deterministic mean field approximation. VIBES is open-source Java; there is also a Matlab interface. The goal of this project is to compare the speed vs. accuracy of the mean field and Gibbs sampling methods on various problems in Bayesian estimation. (For discrete random variables, e.g. Ising models, mean field is usually much faster, and loopy belief propagation is even better, but for continuous (non-Gaussian) random variables, it's not so clear.) See also Matt Beal's page for variational Bayes material.
Comparing approximate inference for Ising models:
Ising models are discrete-state 2D grid-structured MRFs with pairwise potentials. Many models (Bayes nets, Markov nets, factor graphs) can be converted into this form. Exact inference is intractable, so people have tried various approximations, such as mean field, loopy belief propagation (BP), generalized belief propagation, Gibbs sampling, Rao-Blackwellised MCMC, Swendsen-Wang, graph cuts, etc. The goal of this project is to empirically compare these methods on some MRF models (using other people's code), and to make a uniform Matlab interface to all the functions (so they can be interchanged in a plug-n-play fashion).
To test, you can use an MRF with random parameters, but it would be better to team up with someone who is trying to learn MRF parameters from real data (see below).
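For a quick synthetic baseline in such a comparison, naive mean field on a grid Ising model is only a few lines. The sketch below uses a scalar coupling J and field h for simplicity, whereas a real experiment would draw random per-edge and per-node parameters; `m[i, j]` approximates the expectation of spin (i, j) in {-1, +1}.

```python
import numpy as np

def mean_field_ising(J, h, shape=(10, 10), n_sweeps=50):
    """Naive mean field for a 2D Ising model: iterate m_i = tanh(J * sum of
    neighbor means + h) until (approximate) convergence."""
    m = np.zeros(shape)
    for _ in range(n_sweeps):
        # Sum of the four neighbors' means, with zeros past the boundary.
        nb = np.zeros(shape)
        nb[1:, :] += m[:-1, :]
        nb[:-1, :] += m[1:, :]
        nb[:, 1:] += m[:, :-1]
        nb[:, :-1] += m[:, 1:]
        m = np.tanh(J * nb + h)
    return m
```

The same harness (fixed random parameters, report per-node marginals) can then be reused for Gibbs, loopy BP, etc., which is exactly the plug-n-play interface the project asks for.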
C++ code (with a Matlab wrapper) for mean field, loopy BP, generalized BP, Gibbs sampling, and Swendsen-Wang is available here. Code for RB-MCMC can be obtained from Firas Hamze or Nando de Freitas. C++ graph-cuts code is available (without a Matlab interface) here.
Some related papers you should read first:
· Comparing the mean field method and belief propagation for approximate inference in MRFs, Yair Weiss, 2001.
· Comparison of Graph Cuts with Belief Propagation for Stereo, using Identical MRF Parameters, ICCV 2003. (He has C code available.)
· Tutorial on approximate inference, Frey and Jojic, PAMI 2004.
Simple images, such as handwritten digits, can be represented by a grid of binary numbers, on which an Ising model can be defined. An IPF algorithm makes use of the junction tree algorithm to learn the model. In this project you are asked to plug in mean field or generalized mean field methods for inference in the learning process, and compare the outcome with that of IPF. See Yee Whye Teh's paper for the IPF methods and a description of the data and the problem. Since variational methods optimize a lower bound on the likelihood instead of the true likelihood, your results will reveal the consequences of such approximation on learning, and may yield interesting theoretical insights.
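The core IPF operation, rescaling a table to match target marginals, can be illustrated on a toy 2D contingency table. In the junction-tree setting the same multiplicative update fits clique potentials to clique marginals; the `ipf_2d` name and the fixed iteration count are choices made for this sketch.

```python
import numpy as np

def ipf_2d(table, row_marg, col_marg, n_iters=100):
    """Iterative proportional fitting: alternately rescale rows and columns
    of a nonnegative table until it matches the target marginals."""
    t = table.astype(float).copy()
    for _ in range(n_iters):
        t *= (row_marg / t.sum(axis=1))[:, None]   # match row marginals
        t *= (col_marg / t.sum(axis=0))[None, :]   # match column marginals
    return t
```

The project's question is what happens when the exact marginals fed into these updates are replaced by mean field estimates.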
Project L: MRFs and vision:
Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) to classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:
The goal of this project is to classify pixels in satellite image data into classes like field vs. road vs. forest, using MRFs/CRFs (see above), or some other technique. Some possibly useful links:
Project M: Object tracking and trajectory modeling using a nonlinear dynamic model based on an HMM or state-space model (e.g., input-output HMM, factorial HMM, switching SSM)
Video tracking:
The goal of this project is to reproduce the results in the following paper: Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences (CVPR 2000). Note that Brendan Frey has Matlab code for transformation-invariant EM on his home page. See also Real-time Online Learning of Transformed Hidden Markov Models from Video, Nemanja Petrovic, Nebojsa Jojic, Brendan J. Frey, Thomas S. Huang, AIStats 2003, which is 10,000 times faster!
Genetic instability (this is an open research project; if you are interested, see Eric Xing to discuss details):
Array CGH data are sequences of fluorescence measurements reflecting the DNA copy numbers along the chromosome. The measurements are continuous and can be highly distorted by noise in a complex, non-uniform fashion. Jane Fridlyand proposed a Hidden Markov Models Approach to the Analysis of Array CGH Data, in which she implements an HMM model for estimating the CGH copy number. But this model is very restricted.
A switching Hidden Process Model assumes that the hybridization process on each chromosomal region with uniform copy number would ideally follow a standard copy-number-specific linear dynamic model (LDM) [West and Harrison, 1999]. To accommodate outliers and alternative hybridization and signaling dynamics, a mixture of LDMs can be used to model a hidden process that generates fluorescence signals from a chromosomal region with a specific copy number. For a chromosome with stochastic regional amplifications and deletions, a switching HPM assumes that another discrete hidden process is responsible for selecting the corresponding copy-number-specific HPM at each region to generate the signals. The switching HPM model is essentially a special dynamic Bayesian network that allows one to infer the temporally and spatially specific hidden dynamics underlying an observation stream, and the ensuing segmentation of the stream. It is a generalization of Ghahramani's SSSM, which can be understood as modeling each hidden process using a plain Kalman filter (KF). In this project you are asked to formulate this model and implement a variational algorithm for inference with it.
In the dataset (log2.ratio.ex), there are two columns of numbers, corresponding to two sample sources. Please read the original paper to get a more detailed understanding of the data. You can choose the appropriate number of states you feel necessary after inspecting the plots of the points.
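As a warm-up building block for the copy-number-specific LDMs, a one-dimensional Kalman filter with random-walk dynamics is a reasonable place to start; the state is the underlying log-ratio level. The noise variances `q` and `r` below are placeholder assumptions, not values fitted to the CGH data.

```python
import numpy as np

def kalman_1d(ys, mu0=0.0, v0=1.0, q=0.01, r=0.1):
    """Filtered means for a 1D random-walk state observed with Gaussian noise.
    q: process noise variance; r: observation noise variance."""
    mu, v, means = mu0, v0, []
    for y in ys:
        v_pred = v + q                  # predict: state drifts by N(0, q)
        k = v_pred / (v_pred + r)       # Kalman gain
        mu = mu + k * (y - mu)          # correct with observation y
        v = (1 - k) * v_pred
        means.append(mu)
    return np.array(means)
```

The switching HPM replaces this single filter with a bank of copy-number-specific LDMs plus a discrete process choosing among them.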
Project N: Learning POMDP structure so as to maximize utility
Hoey & Little (CVPR 04) show how to learn the state space and parameters of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.
Project O: Learning partially observed MRFs: the Langevin algorithm
In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help to get the learning process out of local optima. In this project you will implement the Langevin learning algorithm for Xing's dual-wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
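The Langevin modification itself is small: take the (sample-approximated) gradient step and add Gaussian noise whose scale is tied to the step size. A sketch, with `grad_fn` standing in for whatever CD-style gradient estimate you use:

```python
import numpy as np

def langevin_step(theta, grad_fn, eps, rng):
    """One Langevin update: half-step gradient ascent plus sqrt(eps)-scaled
    Gaussian noise. With grad_fn the gradient of a log density, iterating
    this approximately samples from that density."""
    noise = rng.normal(size=theta.shape)
    return theta + 0.5 * eps * grad_fn(theta) + np.sqrt(eps) * noise
```

With the gradient of a standard normal log density (grad = -theta), iterating this chain produces samples whose mean and variance approach 0 and 1, which is a simple sanity check before plugging in the harmonium gradient.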
Project P: Context-specific independence
Project Q: More data
There are many other datasets out there. UC Irvine has a repository that could be useful for your project:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has links to several datasets:
http://www.cs.toronto.edu/~roweis/data.html