Probabilistic Graphical Models
10-708, Fall 2006
Your class project is an opportunity for you to explore an
interesting multivariate analysis problem of your choice in the
context of a real-world data set. Projects can be done by you
as an individual, or in a team of two students (no team of more
than two students is permitted). Each project will also be
assigned a 708 instructor as a project consultant/mentor.
They will consult with you on your ideas, but the final
responsibility to define and execute an interesting piece of work
is yours. Your project will be worth 30% of your final class grade,
and will have two final deliverables:
a writeup in the form of a NIPS paper (8 pages maximum in
NIPS format, including references),
due Dec 8, worth 60% of the project grade, and
a poster presenting your work for a special
ML class poster session at the end of the semester, due Dec
1, worth 20% of the project grade.
In addition, you must turn in a midway progress report (5
pages maximum in NIPS
format, including references) describing the results of
your first experiments by Nov 1, worth 20% of the project
grade. Note that, as with any conference, the page limits are
strict! Papers over the limit will not be considered.
You must turn in a brief project proposal (1-page maximum) by
You are encouraged to come up with a topic directly related to your
own current research project, or a research topic related to
graphical models of your own interest, that has a non-trivial
technical component (either theoretical or application-oriented),
but the proposed work must be new and should not be copied from
your previous published or unpublished work. For example, research
on graphical models that you did this summer does not count as a
class project. If the topic of your project overlaps
with any previous (or current) classwork or research, you must
explain what parts of this project are new work.
Since this is a class on probabilistic graphical models, it is
imperative that such models are central to your project.
You may use the list of available datasets provided below and
pick a “less adventurous” project from the following
list of potential project ideas. These data sets have been
successfully used for machine learning in the past, and you can
compare your results with those reported in the literature. Of
course, you can also choose to work on a new problem beyond our list
using the provided datasets.
Project proposal format: Proposals should be one page
maximum. Include the following information:
Project idea. This should be approximately two
Software you will need to write.
Papers to read. Include 1-3 relevant papers. You will
probably want to read at least one of them before submitting your
Teammate: will you have a teammate? If so, whom?
Maximum team size is two students.
Nov 1 milestone: What will you complete by Nov 1?
Experimental results of some kind are expected here.
Ideally, you will want to pick a problem in a domain of your
interest, e.g., natural language parsing, DNA sequence analysis,
text information retrieval, network mining, reinforcement learning,
sensor networks, etc., and formulate your problem using graphical
models. You can then, for example, adapt and tailor standard
inference/learning algorithms to your problem, and do a thorough
You can also find some project ideas below.
For each of the topics we provide some suggested readings. If
you're interested in the problem, these are the references to start
with. Do not consider these references exhaustive; you will be
expected to review the literature in greater depth for your
project. While you are not forced to choose one of these topics, it
is advised that you talk to the instructor if you want to deviate
significantly from the topics below.
Topic A: Structure
This area refers to finding the qualitative (graph) structure of
a set of variables in either a directed or undirected graphical
model. Potential projects include
- Comparing structure learning algorithms for Bayesian networks
(e.g., hill climbing, PDAGs, optimal reinsertion) in terms of quality
of density estimation, sensitivity to the size of the data set,
classification performance, etc.
- Structure search given a fixed ordering -- If you are given a
total ordering of the variables x_1, ..., x_n where
the parents of x_i are a subset of
x_1, ..., x_{i-1}, structure learning becomes simpler
than search over the space of directed acyclic graphs (K&F
- Learning the structure of an undirected graphical model (Abbeel
et al. 2006, Parise and Welling 2006)
- Learning compact representations for conditional
probability distributions -- In discrete Bayesian networks having a
large number of parents means a node's CPD is large. It is possible
that given a particular assignment to a few of the parents, the
rest of the parents do not matter (context-specific independence),
which can lead to a compact representation of a CPD (K&F
- Bayesian model averaging -- instead of finding the single
best structure for a Bayesian network, compute a posterior
distribution over structures (K&F 15.5)
- Optimal structure learning -- the naive algorithms are
super-exponential in the number of variables, but both the optimal MAP
(Singh & Moore 2005) and optimal BMA (Koivisto & Sood 2004)
structures can be computed in exponential time at the cost of
Koller & Friedman Chapter 15
Pieter Abbeel, Daphne Koller and Andrew Y. Ng.
Learning Factor Graphs in Polynomial Time & Sample
Journal of Machine Learning Research, 7(Aug):1743--1788, 2006.
Inferring Graphical Model Structure using L1-Regularized
Pseudo-Likelihood. Martin Wainwright, Pradeep Ravikumar, John
Lafferty. NIPS 2006 (to appear)
D. Margaritis. Distribution-Free Learning of Bayesian Network
Structure in Continuous Domains. Proceedings of The Twentieth
National Conference on Artificial Intelligence (AAAI), Pittsburgh,
PA, July 2005.
Yuhong Guo and Russ Greiner (2005), ``Discriminative Model
Selection for Belief Net Structures". In Proceedings of the
Twentieth National Conference on Artificial Intelligence
Ajit Singh and Andrew Moore (2005), Finding Optimal Bayesian Networks by Dynamic Programming. Tech Report CMU-CALD-05-106
Mikko Koivisto and Kismat Sood (2004), Exact Bayesian Structure Discovery in Bayesian Networks. JMLR 5.
Sridevi Parise and Max Welling (2006) Structure Learning in Markov
Random Fields, NIPS 2006
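The fixed-ordering project idea listed under Topic A can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes fully observed discrete data, a BIC family score, and a small bound on the number of parents, all of which are choices made for this example (the function names `bic_family_score` and `learn_given_ordering` are hypothetical).

```python
import itertools
import math
from collections import Counter


def bic_family_score(data, child, parents, arity):
    """BIC score of one node given a candidate parent set.

    data  : list of tuples of small non-negative ints (one tuple per sample)
    arity : arity[v] is the number of values variable v can take
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Log-likelihood of the child's CPD at its maximum-likelihood parameters.
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    # Number of free parameters in the CPD table.
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return loglik - 0.5 * math.log(n) * n_params


def learn_given_ordering(data, ordering, arity, max_parents=2):
    """For each node, pick the best-scoring parent set among its predecessors."""
    parents_of = {}
    for i, child in enumerate(ordering):
        predecessors = ordering[:i]
        best = (bic_family_score(data, child, (), arity), ())
        for k in range(1, min(max_parents, len(predecessors)) + 1):
            for cand in itertools.combinations(predecessors, k):
                best = max(best, (bic_family_score(data, child, cand, arity), cand))
        parents_of[child] = best[1]
    return parents_of
```

Because the ordering rules out cycles, each node's parent set can be chosen independently of the others, which is what makes this case simpler than a search over all directed acyclic graphs.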
Topic B: Inference
The most common use of a probabilistic graphical model is
computing queries, the conditional distribution of a set of
variables given an assignment to a set of evidence variables. In
general, this problem is NP-hard, which has led to a number of
algorithms (both exact and approximate). Potential topics
- Comparing approximate inference algorithms in terms of
accuracy, computational complexity, sensitivity to parameters. Some
exact algorithms include Junction trees and Bucket elimination. On
larger networks one typically resorts to algorithms that produce
approximate solutions, such as sampling (Monte Carlo methods),
variational inference, and generalized belief propagation.
- Adaptive Generalized Belief Propagation (Welling 2004) & Expectation
Propagation (K&F 11) -- Compare these methods to each other and Gibbs sampling.
- Convex Procedures -- Methods that perform approximate
inference by convex relaxation (Wainwright 2002)
- Linear programming methods for approximating the MAP assignment (Wainwright et al. 2005b, Yanover et al. 2006)
- Recursive conditioning -- An any-space inference algorithm that
recursively decomposes an inference on a general Bayesian network
into inferences on a smaller subnetwork. (Darwiche 2001).
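To make the notion of a query concrete, here is a minimal sketch of exact inference by brute-force enumeration on a three-variable network. The network (Rain -> WetGrass <- Sprinkler) and all of its probabilities are made up for illustration; they do not come from the course materials.

```python
import itertools

# Toy network: Rain -> WetGrass <- Sprinkler. All numbers are illustrative.
P_RAIN = {1: 0.2, 0: 0.8}
P_SPRINKLER = {1: 0.1, 0: 0.9}
P_WET = {  # P(WetGrass = 1 | Rain, Sprinkler)
    (0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99,
}


def joint(rain, sprinkler, wet):
    """Joint probability as the product of the three CPDs."""
    p_wet = P_WET[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)


def query(target, evidence):
    """Return P(target | evidence) by summing the joint over all assignments."""
    names = ("rain", "sprinkler", "wet")
    totals = {0: 0.0, 1: 0.0}
    for values in itertools.product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            totals[assignment[target]] += joint(**assignment)
    z = sum(totals.values())
    return {value: p / z for value, p in totals.items()}
```

For example, `query("rain", {"wet": 1})` gives the posterior over rain given wet grass. The loop visits every joint assignment, so the cost grows exponentially with the number of variables; this is the baseline that the exact and approximate algorithms above improve on.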
Koller & Friedman Chapters 7-11
In Artificial Intelligence Journal. Vol 125,
No 1-2, pages 5-41. 2001.
Tutorial on variational approximation methods.
In Advanced mean field methods: theory and practice. MIT Press,
An Introduction to Variational Methods for Graphical Models M. I.
Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. In M. I.
Jordan (Ed.), Learning in Graphical Models, Cambridge: MIT Press,
Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Generalized Belief
Propagation", Advances in Neural Information Processing Systems,
Vol. 13, pp. 689-695, December 2000
Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Constructing Free-Energy
Approximations and Generalized Belief Propagation Algorithms", IEEE
Transactions on Information Theory, ISSN: 0018-9448, Vol. 51, Issue
7, pp. 2282-2312, July 2005
M. J. Wainwright, T. Jaakkola and A. S. Willsky. A new class of
upper bounds on the log partition function. IEEE Trans. on
Information Theory, vol. 51, page 2313--2335, July 2005
M. J. Wainwright, "Stochastic
Processes on Graphs: Geometric and Variational Approaches", Ph.D.
Thesis, Department of EECS, Massachusetts Institute of Technology,
M. J. Wainwright, T. S. Jaakkola and A. S. Willsky,
MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming
approaches. IEEE Transactions on Information Theory, Vol. 51(11), pages 3697--3717. November 2005.
Linear Programming Relaxations and Belief Propagation - an Empirical Study
Chen Yanover, Talya Meltzer, Yair Weiss
JMLR Special Issue on Machine Learning and Large Scale Optimization, Sep 2006
On the Choice of Regions for Generalized Belief Propagation
Max Welling, Tom Minka and Yee Whye Teh (2005) Structured Region
Graphs: Morphing EP into GBP. UAI 2005
Topic C: Temporal
There are many applications where we want to explicitly model
time (control, forecasting, online learning). Hidden Markov Models
are one of the simplest discrete-time models, but there are many
others: Kalman filters for continuous state spaces, factorial
Hidden Markov Models, which allow efficient variational inference in
problems with many hidden variables, and dynamic Bayesian
networks, which allow arbitrarily complex relationships between
hidden and observed variables. Projects include,
- Comparing the performance of factorial Hidden Markov Models
(Ghahramani & Jordan 1997) to dynamic Bayesian networks
- Experimental evaluation of approximate inference algorithms for
DBNs, such as Boyen-Koller, Particle Filtering, and Thin Junction
Trees (Paskin 2003). Kevin Murphy's thesis provides a good overview of inference in DBNs.
- Comparing Kalman filters against more general DBN models.
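As a baseline for the comparisons listed above, exact filtering in a plain Hidden Markov Model takes only a few lines. The following is a minimal sketch of the standard forward algorithm with per-step normalization; the argument layout (nested lists indexed by state and observation symbol) is an assumption made for this example.

```python
import math


def forward_filter(initial, transition, emission, observations):
    """Return the filtered beliefs P(state_t | obs_1..t) and the log-likelihood.

    initial[i]       : P(state_1 = i)
    transition[i][j] : P(state_{t+1} = j | state_t = i)
    emission[i][o]   : P(obs = o | state = i)
    """
    n_states = len(initial)
    belief = list(initial)
    filtered, log_lik = [], 0.0
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous belief through the transition model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n_states))
                      for j in range(n_states)]
        # Update: weight by the likelihood of the new observation, then normalize.
        belief = [belief[j] * emission[j][obs] for j in range(n_states)]
        norm = sum(belief)
        log_lik += math.log(norm)
        belief = [b / norm for b in belief]
        filtered.append(belief)
    return filtered, log_lik
```

Each step costs time quadratic in the number of states. In a factorial HMM or a general DBN the joint state space is exponential in the number of hidden variables, which is why the approximate methods above are needed.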
K&F Chapter 18
Ghahramani, Z. and Jordan, M.I. (1997). Factorial Hidden
Markov Models. Machine Learning 29: 245-273
Kevin Murphy's PhD Thesis.
Kevin Murphy's book chapter on DBNs:
Xavier Boyen and Daphne Koller, Tractable Inference for Complex
Stochastic Processes, in Uncertainty in Artificial Intelligence UAI
Xavier Boyen and Daphne Koller, Exploiting the Architecture of
Dynamic Systems, in National Conference on Artificial Intelligence
AAAI '99, 1999.
Mark A. Paskin (2003). Thin Junction Tree Filters for Simultaneous
Localization and Mapping. In G. Gottlob and T. Walsh eds.,
Proceedings of the Eighteenth International Joint Conference on
Artificial Intelligence ( IJCAI-03), pp. 1157–1164. San
Francisco, CA: Morgan Kaufmann.
Topic D: Hierarchical
In text classification one can view a corpus as generated by a
hierarchical process. For example, select an author. Given an
author there is a distribution over topics he is interested in.
Select a topic according to this distribution. Given a topic there
is a distribution over words used in a document. Finally, generate
a bag of words from this distribution. One approach to modelling
this process is hierarchical Bayes (often nonparametric
Another application of nonparametric hierarchical Bayes is
clustering, where instead of selecting the number of clusters
a priori, the model
averages over the number of clusters to produce a posterior over
clusterings of data points.
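One way to see how the number of clusters can be left open is to sample clusterings from a Chinese restaurant process prior. This is a minimal sketch: the function name `sample_crp` and the concentration parameter `alpha` (assumed positive) are illustrative, and the code draws from the prior only, with no posterior inference over data.

```python
import random


def sample_crp(n_points, alpha, seed=0):
    """Draw one clustering of n_points items from a Chinese restaurant process."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] is the number of items at table k
    assignments = []   # assignments[i] is the table index of item i
    for i in range(n_points):
        # An existing table k is chosen with probability size_k / (i + alpha),
        # a new table with probability alpha / (i + alpha).
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes
```

Repeated calls with different seeds return different numbers of tables, and larger values of `alpha` tend to produce more of them.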
Potential projects include
- Implementing Latent Dirichlet Allocation for document
- Compare nonparametric Bayesian clustering methods such as
Dirichlet Processes and Chinese Restaurant Processes.
- Implement a hierarchical clustering model, either for topic
modelling or clustering.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal
of Machine Learning Research, 3:993–1022, January 2003.
Dirichlet process, Chinese restaurant processes and all that. M. I.
Jordan. Tutorial presentation at the NIPS Conference, 2005.
D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical
topic models and the nested Chinese restaurant process. In Neural
Information Processing Systems (NIPS) 16, 2003.
D. Blei and M. Jordan. Variational inference for Dirichlet process
mixtures. Journal of Bayesian Analysis, 1(1):121–144,
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet
processes. Journal of the American Statistical Association,
Ian Porteous, Alex Ihler, Padhriac Smyth and Max Welling (2006)
Gibbs Sampling for (Coupled) Infinite Mixture Models in the
Stick-Breaking Representation UAI 2006
Mark Steyvers and Tom Griffiths
Matlab Topic Modelling Toolbox.
Topic E: Relational
Almost all of the machine learning / statistics methods you have
studied assume that the data is independent or exchangeable. In many
cases this is not true. For example, knowing the topic of a
web page tells you something about the likely topics of pages
linked to it. The independence assumption fails on most
graph-structured data sets (relational databases, social networks,
Potential projects include
- Implementing a restricted case of Probabilistic Relational
Models (e.g., no existence uncertainty) and comparing the performance
against some baseline non-relational models.
- Implementing Relational Markov Networks and comparing the
performance against some baseline non-relational models
Relational Models, L. Getoor, N. Friedman, D. Koller, A.
Pfeffer. Invited contribution to the book Relational Data Mining,
S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001
Discriminative Probabilistic Models for Relational Data, B.
Taskar, P. Abbeel and D. Koller. Eighteenth Conference on
Uncertainty in Artificial Intelligence (UAI02), Edmonton, Canada,
L. Liao, D. Fox, and H. Kautz. Location-Based Activity Recognition.
in Proceedings of the Neural Information Processing Systems (NIPS),
Topic F: Hybrid Bayesian
Many real systems contain a combination of discrete and
continuous variables, which can be modeled as a hybrid BN.
Potential projects include
- Compare inference algorithms for hybrid DBNs against those that
first discretize all the continuous variables, and then just use
the standard algorithms (variable elimination, junction
K&F Chapter 12
Hybrid Bayesian Networks for Reasoning about Complex Systems, Uri
N. Lerner. Ph.D. Thesis, Stanford University, October 2002.
Topic G: Influence
A Bayesian network models a part of the world, but not decisions
taken by agents nor the effect that these decisions can have upon
the world. Influence diagrams extend Bayesian networks with nodes
that represent actions an agent can take, the costs and utilities
of actions, and most importantly the relationships between
In the multiagent setting, finding the Nash equilibrium is hard, but
graphical models provide a framework for recursively decomposing
the problem (opening up the possibility of a dynamic programming
approach). Dynamic programming algorithms like NashProp (Kearns and
Ortiz, 2002) are closely related to belief propagation.
- Implementing algorithms for selecting a good or optimal
strategy in the single-agent case (K&F 21)
- Finding Nash equilibria in multiplayer games (Koller &
K&F Chapter 21
D. Koller and B. Milch (2003). "Multi-Agent Influence Diagrams for
Representing and Solving Games." Games and Economic Behavior,
45(1), 181-221. Full version of paper in IJCAI '03.
Nash Propagation for Loopy Graphical Games. M. Kearns and L. Ortiz.
Proceedings of NIPS 2002.
Multiagent Planning with Factored MDPs;
Carlos Guestrin, Daphne Koller and Ronald Parr;
In Advances in Neural Information Processing Systems (NIPS 2001),
pp. 1523 - 1530, Vancouver, Canada, December 2001.
Planning Under Uncertainty in Complex Structured
Ph.D. Dissertation, Computer Science Department, Stanford
University, August 2003.
Topic H: Max-margin Graphical
Typically the parameters of a graphical model are learned by
maximum likelihood or maximum a posterori. An alternative criteria
for parameter estimation is to maximize the margin between classes,
which can be thought of as a combination of graphical models (to
represent structured relationships between inputs and outputs) with
kernel methods.
An example of a domain where this approach works well is
handwriting recognition, where the structure encodes the fact that
knowing what the previous letter was tells you something about what
the next letter is likely to be.
Projects include,
- Compare max-margin to likelihood-based methods (e.g., character recognition, part-of-speech tagging)
Max-Margin Markov Networks, B. Taskar, C. Guestrin and D.
Koller. Neural Information Processing Systems Conference (NIPS03),
Vancouver, Canada, December 2003.
Topic I: Active Learning / Value of Information
Active learning refers to algorithms where the learner has some
influence on what samples he sees. For example, say you can perform
5 tests on a patient, out of a panel of 60 tests. Given an existing
model of patients, which ones do you pick? What about the
sequential case where you consider the result of each test before
choosing another one? Possible projects include,
- Apply active learning to activity modelling or sensor networks (which sensor should you sample from).
- Compare optimization criteria (eg, experimental design criteria) [CITE]
- Active learning that models parameter uncertainty.
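The patient-testing example can be made concrete with a myopic value-of-information rule: choose the test whose result is expected to leave the least uncertainty about the diagnosis. This is a minimal sketch under strong assumptions chosen for illustration only: a single binary disease variable, binary tests, and each test described by a hypothetical (sensitivity, false positive rate) pair.

```python
import math


def entropy(p):
    """Entropy in bits of a binary variable with P(true) = p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)


def expected_posterior_entropy(prior, sensitivity, false_positive_rate):
    """Expected entropy of P(disease | test result), averaged over the result."""
    p_pos = prior * sensitivity + (1.0 - prior) * false_positive_rate
    total = 0.0
    # Iterate over the two outcomes: (P(result), P(result | disease)).
    for p_result, p_result_given_disease in ((p_pos, sensitivity),
                                             (1.0 - p_pos, 1.0 - sensitivity)):
        if p_result > 0.0:
            posterior = prior * p_result_given_disease / p_result
            total += p_result * entropy(posterior)
    return total


def pick_next_test(prior, tests):
    """tests maps a test name to (sensitivity, false_positive_rate)."""
    return min(tests, key=lambda name: expected_posterior_entropy(prior, *tests[name]))
```

In the sequential case one would update `prior` with the observed result and call `pick_next_test` again. Choosing a whole subset of tests up front, or planning several steps ahead, is the harder nonmyopic problem.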
A. Krause, C. Guestrin. "Near-optimal Nonmyopic Value of
Information in Graphical Models". Proc. of Uncertainty in
Artificial Intelligence (UAI), 2005
A. Krause, C. Guestrin. "Optimal Value of Information in Graphical
Models - Efficient Algorithms and Theoretical Limits". Proc. of the
International Joint Conference on Artificial Intelligence (IJCAI), 2005
Anderson, B. and Moore, A.
Fast Information Value for Graphical Models
In Neural Information Processing Systems, 2005.
Active Learning: Theory and Applications. Simon Tong. Stanford
Topic J: Modeling Text and Images
Images are often annotated with text, such as captions or tags,
which can be viewed as an additional source of information when
clustering images or building topic models. For example, a green patch
might indicate that there is a plant in the image, until one reads the
caption "man in a green shirt". A related problem (Carbonetto et
al. 2004) is data association, linking words to segmented objects in an
image. For example, if the caption contains the words boat and sea we would like to be able to associate these words with the segment(s) of the image corresponding to boat and sea.
D. Blei and M. Jordan. Modeling annotated data. In Proceedings of
the 26th annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 127–134
Peter Carbonetto, Nando de Freitas and Kobus Barnard.
A Statistical Model for General Contextual Object Recognition. ECCV 2004
Topic K: 2D CRFs for Visual Texture
Discriminative Fields for
Modeling Spatial Dependencies in Natural Images is about
applying 2D conditional random fields (CRFs) for classifying image regions as containing
"man-made building" or not, on the basis of texture. The goal of
this project is to reproduce the results in the NIPS 2003 paper.
labeled training data.
graphcuts code for approximate
- Kevin Murphy's Matlab CRF code
- Carl Rasmussen's Matlab conjugate
gradient minimizer (better than
using Netlab or the Matlab optimization toolbox)
to CRFs by Hanna Wallach
- Maxent page, includes code
- Steerable pyramid matlab code, possibly useful set of image
- Matlab wavelet toolbox, possibly useful set of
image features .
Paper of CRFs for sign detection,
J. Weinman, 2004
Random Field Modeling in Computer Vision, S. Z. Li, 1995. (I
have a hardcopy of the 2001 edition.)
- G. Winkler, "Image Analysis, Random
Fields, and MCMC Methods", 2nd edition, 2003.
Markov random fields and images, P. Perez. CWI Quarterly,
11(4):413-437, 1998. Review article.
CRFs for satellite image
The goal of this
project is to classify pixels in satellite image data into classes
like field vs road vs forest, using MRFs/CRFs (see above), or some other technique.
Some possibly useful links:
Below are a
number of data sets that could be used for your project. If you
want to use a data set that is not on the list it is strongly
advised that you talk to either a TA or the instructor before
submitting your initial proposal.
Thanks to Dieter Fox, Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data
Data A: Functional MRI
Functional MRI (fMRI) measures brain activation over time, which allows
one to measure changes as an activity is performed (e.g., looking at a
picture of a cat vs. looking at a picture of a chair). Tasks using this
data are typically of the form "predict cognitive state given fMRI
data". fMRI data is both temporal and spatial: each voxel contains a
time series, and each voxel is correlated with the voxels near it.
Corel Image Data
Images featurized by color histogram, color histogram layout,
color moments, and co-occurrence texture. Useful for projects on
image segmentation, especially since there is a large benchmark
Most segmentation algorithms have focused on segmentation based
on edges or based on discontinuity of color and texture. The
ground-truth in this dataset, however, allows supervised learning
algorithms to segment the images based on statistics calculated
over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the
superpixels into larger segments.
Graphical models can be used to represent smoothness in clusters,
by adding appropriate potentials between neighboring pixels. In
this project, you can address, for example, learning of such
potentials, and inference in models with very large tree-width.
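The idea of encoding smoothness through potentials between neighboring pixels can be illustrated with a Potts model on a 4-connected grid. This is a minimal sketch that uses iterated conditional modes (ICM), a simple local search chosen here for brevity and not a method prescribed by the project description; the per-pixel cost table `unary` and the penalty `beta` are hypothetical inputs.

```python
def icm_potts(unary, beta, n_sweeps=5):
    """Approximate MAP labeling of a 4-connected grid with Potts smoothing.

    unary[r][c][k] : cost of giving pixel (r, c) label k (lower is better)
    beta           : penalty paid for each pair of neighbors with different labels
    """
    rows, cols, n_labels = len(unary), len(unary[0]), len(unary[0][0])
    # Start from the labeling that ignores the neighbors.
    labels = [[min(range(n_labels), key=lambda k: unary[r][c][k]) for c in range(cols)]
              for r in range(rows)]
    for _ in range(n_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbors = [labels[rr][cc]
                             for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                             if 0 <= rr < rows and 0 <= cc < cols]

                def cost(k):
                    # Local energy: unary cost plus one penalty per disagreeing neighbor.
                    return unary[r][c][k] + beta * sum(k != n for n in neighbors)

                best = min(range(n_labels), key=cost)
                if best != labels[r][c]:
                    labels[r][c], changed = best, True
        if not changed:
            break
    return labels
```

ICM only finds a local optimum of the energy. Graph cuts or belief propagation generally give better labelings on grids like this, and learning `beta` from labeled segmentations is one way to approach the potential-learning question raised above.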
This data set contains 1000 text articles posted to each of 20
online newsgroups, for a total of 20,000
articles. This data is useful for a variety of text
classification and/or clustering projects. The "label" of
each article is which of the 20 newsgroups it belongs to. The
newsgroups (labels) are hierarchically organized (e.g., "sports",
Data D: Sensor
Using this 54-node sensor network deployment, we collected
temperature, humidity, and light data, along with the voltage level
of the batteries at each node. The data was collected every 30
seconds, starting around 1am on February 28th 2004.
This is a real dataset, with lots of missing data, noise, and
failed sensors giving outlier values, especially when battery
levels are low. Additional data for an intelligent lighting
network, which includes link quality information between pairs of
sensors, is available at
Ideas for projects include
Learn graphical models representing the correlations between
measurements at different nodes
Develop new distributed algorithms for solving a learning task on
A collection of preprints in the field of high-energy physics.
Includes the raw LaTeX source of each paper (so you can extract
either structured sentences or a bag-of-words) along with the graph
of citations between papers.
A competition for multimedia information retrieval. They keep a fairly
large archive of video data sets, along with featurizations of the data.
Activity modelling is the task of inferring what the user is
doing from observations (eg, motion sensors, microphones). This
data set consists of GPS motion data for two subjects tagged with
labels like car
An example of a DBN model for this problem is
A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors
This dataset contains webpages from
4 universities, labeled with whether they are professor, student,
project, or other pages.
Ideas for projects: learning classifiers to predict the type of
webpage from the text, using web structure to improve page
Data I: Record
The datasets provided below consist of lists of
records, and the goal is to identify, for any dataset, the set of
records which refer to unique entities. This problem is known by
the varied names of deduplication, identity uncertainty and record
One common approach is to cast the deduplication problem as a
classification problem. Consider the set of record-pairs, and
classify them as either "unique" or "not-unique". Some papers on
record deduplication include
Data J: Enron
Consists of ~500K e-mails collected from Enron employees. It has
been used for research into information extraction, social network
analysis, and topic modeling.
Internet Movie Database
The Internet Movie Database makes its data publicly available,
with certain usage restrictions. It contains tables and links relating
movies, actors, directors, box office grosses, and much more. Various
slices of the data have been used extensively in research on relational
Data L: Netflix
Netflix is running a competition for movie recommendation
algorithms. They've released a dataset of 100M ratings from 480K
randomly selected users over 17K titles. The data set, and contest
details, are available at
A much smaller (but more widely used) movie rating data set is Movielens
Data M: NIPS Corpus
A data set based on papers from a machine learning
conference (NIPS volumes 1-12). The data can be viewed as a
tripartite graph on authors, papers, and words. Links represent
authorship and the words used in a paper. Additionally, papers are
tagged with topics and we know which year each paper was written.
Potential projects include authorship prediction, document
clustering, and topic tracking.
Character recognition (digits)
Optical character recognition, and the simpler digit recognition
task, has been the focus of much ML research. We have two datasets
on this topic. The first tackles the more general OCR task, on a
small vocabulary of words: (Note that the first letter of each word
was removed, since these were capital letters that would make the
task harder for you.)
This dataset includes 45 years of daily precipitation data
from the Northwestern US. Ideas for projects include predicting
rain levels, deciding where to place sensors to best predict
rainfall, or active learning in fixed sensor networks.
UC Irvine has a repository that could be useful for your
project. Many of these data sets have been used extensively in graphical models research.
Sam Roweis also has a link to
several datasets (most ready for use in Matlab):