Probabilistic Graphical Models

10-708, Fall 2006

School of Computer Science, Carnegie Mellon University

Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in a team of two students (no team of more than two students is permitted). Each project will also be assigned a 708 instructor as a project consultant/mentor. They will consult with you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:

1. a writeup in the form of a NIPS paper (8 pages maximum in NIPS format, including references), due Dec 8, worth 60% of the project grade, and

2. a poster presenting your work for a special ML class poster session at the end of the semester, due Dec 1, worth 20% of the project grade.

In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by Nov 1, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.

Project Proposal:

You must turn in a brief project proposal (1-page maximum) by Oct 11th.

You are encouraged to come up with a topic directly related to your own current research project, or to a research topic in graphical models that interests you, provided it has a non-trivial technical component (either theoretical or application-oriented). The proposed work must be new and should not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project. If the topic of your project overlaps with any previous (or current) classwork or research, you must explain what parts of this project are new work.

Since this is a class on probabilistic graphical models, it is imperative that such models are central to your project.

You may use the list of available datasets provided below and pick a “less adventurous” project from the following list of potential project ideas. These data sets have been successfully used for machine learning in the past, and you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided datasets.

Project proposal format: Proposals should be one page maximum. Include the following information:

· Project title

· Project idea. This should be approximately two paragraphs.

· Software you will need to write.

· Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.

· Teammate: will you have a teammate? If so, whom? Maximum team size is two students.

· Nov 1 milestone: What will you complete by Nov 1? Experimental results of some kind are expected here.

Project suggestions:

· Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using graphical models. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.

You can also find some project ideas below.

Topics

For each of the topics we provide some suggested readings. If you're interested in the problem, these are the references to start with. Do not consider these references exhaustive; you will be expected to review the literature in greater depth for your project. While you are not forced to choose one of these topics, it is strongly advised that you talk to the instructor if you want to deviate significantly from the topics below.

Topic A: Structure Learning

This area refers to finding the qualitative (graph) structure of a set of variables in either a directed or undirected graphical model. Potential projects include

Comparing structure learning algorithms for Bayesian networks (e.g., hill climbing, PDAGs, optimal reinsertion) in terms of quality of density estimation, sensitivity to the size of the data set, classification performance, etc.
Structure search given a fixed ordering -- If you are given a total ordering of the variables x₁...x_n where the parents of x_i are a subset of x₁...x_{i-1}, structure learning becomes simpler than search over the space of directed acyclic graphs (K&F 15.4.2)
Learning the structure of an undirected graphical model (Abbeel et al. 2006, Parise and Welling 2006)
Learning compact representations for conditional probability distributions -- In discrete Bayesian networks, having a large number of parents means a node's CPD is large. It is possible that given a particular assignment to a few of the parents, the rest of the parents do not matter (context-specific independence), which can lead to a compact representation of a CPD (K&F 15.6).
Bayesian model averaging -- instead of finding the single best structure for a Bayesian network, compute a posterior distribution over structures (K&F 15.5)
Optimal structure learning -- the naive algorithms are super-exponential in the number of variables, but both the optimal MAP (Singh & Moore 2005) and optimal BMA (Koivisto & Sood 2004) structures can be computed in exponential time at the cost of exponential memory.
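To make the fixed-ordering idea concrete, here is a minimal sketch in Python. It assumes fully observed discrete data, a BIC family score, and a greedy K2-style parent search with a cap on the number of parents. The function names, the score, and the cap are illustrative choices, not something prescribed by K&F 15.4.2. Because parents may only come from earlier positions in the ordering, each variable's parent set can be chosen independently and the result is always acyclic.

```python
import math
from collections import Counter

def bic_family_score(data, child, parents, arity):
    """BIC score of one family (child given its parents) on discrete data."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log-likelihood of the child given its parents.
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    # Free parameters: one multinomial per parent configuration.
    n_params = math.prod(arity[p] for p in parents) * (arity[child] - 1)
    return loglik - 0.5 * math.log(n) * n_params

def learn_with_ordering(data, ordering, arity, max_parents=2):
    """Greedy parent selection; candidates are restricted to predecessors."""
    parents_of = {}
    for i, child in enumerate(ordering):
        candidates = list(ordering[:i])
        parents = []
        best = bic_family_score(data, child, parents, arity)
        while candidates and len(parents) < max_parents:
            scored = [(bic_family_score(data, child, parents + [c], arity), c)
                      for c in candidates]
            score, pick = max(scored)
            if score <= best:
                break  # no candidate improves the score
            best = score
            parents.append(pick)
            candidates.remove(pick)
        parents_of[child] = parents
    return parents_of

# Toy data (assumed for illustration): variable 1 copies variable 0,
# variable 2 is independent of both.
rows = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 25
structure = learn_with_ordering(rows, [0, 1, 2], {0: 2, 1: 2, 2: 2})
print(structure)
```

A project could replace the greedy loop with an exhaustive search over bounded-size parent sets and compare the two.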

References

Koller & Friedman Chapter 15

Pieter Abbeel, Daphne Koller and Andrew Y. Ng.
Learning Factor Graphs in Polynomial Time & Sample Complexity.
Journal of Machine Learning Research, 7(Aug):1743--1788, 2006.
http://ai.stanford.edu/~pabbeel//pubs/abbeel06a.pdf

Inferring Graphical Model Structure using L1-Regularized Pseudo-Likelihood. Martin Wainwright, Pradeep Ravikumar, John Lafferty. NIPS 2006 (to appear)

D. Margaritis. Distribution-Free Learning of Bayesian Network Structure in Continuous Domains. Proceedings of The Twentieth National Conference on Artificial Intelligence (AAAI), Pittsburgh, PA, July 2005.
http://www.cs.iastate.edu/~dmarg/Papers/Margaritis-AAAI05.pdf

Yuhong Guo and Russ Greiner (2005), "Discriminative Model Selection for Belief Net Structures". In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05).
http://www.cs.ualberta.ca/~yuhong/research/papers/bnmodelgg.pdf

Ajit Singh and Andrew Moore (2005), Finding Optimal Bayesian Networks by Dynamic Programming. Tech Report CMU-CALD-05-106
http://reports-archive.adm.cs.cmu.edu/anon/cald/CMU-CALD-05-106.pdf

Mikko Koivisto and Kismat Sood (2004), Exact Bayesian Structure Discovery in Bayesian Networks. JMLR 5.
http://jmlr.csail.mit.edu/papers/volume5/koivisto04a/koivisto04a.pdf

Sridevi Parise and Max Welling (2006) Structure Learning in Markov Random Fields, NIPS 2006
http://www.ics.uci.edu/~welling/publications/papers/StructLearnMRF-submit.pdf

Topic B: Inference

The most common use of a probabilistic graphical model is answering queries: computing the conditional distribution of a set of variables given an assignment to a set of evidence variables. In general, this problem is NP-hard, which has led to a number of algorithms (both exact and approximate). Potential topics include

Comparing approximate inference algorithms in terms of accuracy, computational complexity, sensitivity to parameters. Some exact algorithms include Junction trees and Bucket elimination. On larger networks one typically resorts to algorithms that produce approximate solutions, such as sampling (Monte Carlo methods), variational inference, and generalized belief propagation.
Adaptive Generalized Belief Propagation (Welling 2004) & Expectation Propagation (K&F 11) -- Compare these methods to each other and Gibbs sampling.
Convex Procedures -- Methods that perform approximate inference by convex relaxation (Wainwright 2002)
Linear programming methods for approximating the MAP assignment (Wainwright et al. 2005b, Yanover et al. 2006)
Recursive conditioning -- An any-space inference algorithm that recursively decomposes inference on a general Bayesian network into inferences on smaller subnetworks (Darwiche 2001).
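As a baseline for any of the comparisons above, a query can be answered exactly by brute-force enumeration of the joint distribution. The sketch below assumes a hypothetical three-variable binary network (rain -> wet <- sprinkler) with made-up CPD values. It is exponential in the number of hidden variables, so it only serves to check approximate algorithms on very small networks.

```python
from itertools import product

# Hypothetical network for illustration: rain -> wet <- sprinkler.
VARIABLES = ["rain", "sprinkler", "wet"]

# P(wet = 1 | rain, sprinkler); the numbers are assumptions.
P_WET = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99}

def joint(a):
    """Joint probability of a full assignment, as a product of CPD entries."""
    p_rain = 0.2 if a["rain"] else 0.8
    p_sprinkler = 0.1 if a["sprinkler"] else 0.9
    p_wet1 = P_WET[(a["rain"], a["sprinkler"])]
    p_wet = p_wet1 if a["wet"] else 1.0 - p_wet1
    return p_rain * p_sprinkler * p_wet

def query(target, evidence):
    """P(target | evidence) by summing the joint over the hidden variables."""
    hidden = [v for v in VARIABLES if v != target and v not in evidence]
    unnormalized = {0: 0.0, 1: 0.0}
    for t in (0, 1):
        for values in product((0, 1), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[target] = t
            assignment.update(zip(hidden, values))
            unnormalized[t] += joint(assignment)
    z = sum(unnormalized.values())  # equals P(evidence)
    return {t: p / z for t, p in unnormalized.items()}

posterior = query("rain", {"wet": 1})
print(posterior)
```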

References

Koller & Friedman Chapters 7-11

Adnan Darwiche
Recursive Conditioning
In Artificial Intelligence Journal. Vol 125, No 1-2, pages 5-41. 2001.
http://reasoning.cs.ucla.edu/fetch.php?id=18&type=ps

T. Jaakkola.
Tutorial on variational approximation methods.
In Advanced mean field methods: theory and practice. MIT Press, 2000.
http://people.csail.mit.edu/tommi/papers/Jaa-var-tutorial.ps

An Introduction to Variational Methods for Graphical Models M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. In M. I. Jordan (Ed.), Learning in Graphical Models, Cambridge: MIT Press, 1999.
http://www.cs.berkeley.edu/~jordan/papers/variational-intro.pdf

Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Generalized Belief Propagation", Advances in Neural Information Processing Systems (NIPS), Vol 13, pp. 689-695, December 2000
http://www.merl.com/reports/docs/TR2000-26.pdf

Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms", IEEE Transactions on Information Theory, ISSN: 0018-9448, Vol. 51, Issue 7, pp. 2282-2312, July 2005
http://www.merl.com/reports/docs/TR2004-040.pdf

M. J. Wainwright, T. Jaakkola and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, vol. 51, page 2313--2335, July 2005
http://www.eecs.berkeley.edu/~wainwrig/Papers/WaiJaaWil05_Upper.pdf

M. J. Wainwright, "Stochastic Processes on Graphs: Geometric and Variational Approaches", Ph.D. Thesis, Department of EECS, Massachusetts Institute of Technology, 2002.
http://www.eecs.berkeley.edu/~wainwrig/Papers/Final2_Phd_May30.pdf

M. J. Wainwright, T. S. Jaakkola and A. S. Willsky,
MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming
approaches. IEEE Transactions on Information Theory, Vol. 51(11), pages 3697--3717. November 2005.
http://people.csail.mit.edu/tommi/papers/WaiJaaWil_TRMAP_arxiv.pdf

Linear Programming Relaxations and Belief Propagation - an Empirical Study
Chen Yanover, Talya Meltzer, Yair Weiss
JMLR Special Issue on Machine Learning and Large Scale Optimization, Sep 2006
http://www.jmlr.org/papers/volume7/yanover06a/yanover06a.pdf

Max Welling
On the Choice of Regions for Generalized Belief Propagation
UAI 2004
http://www.ics.uci.edu/~welling/publications/papers/ClusterChoice.pdf

Max Welling, Tom Minka and Yee Whye Teh (2005) Structured Region Graphs: Morphing EP into GBP. UAI 2005
http://www.ics.uci.edu/~welling/publications/papers/full.pdf

Topic C: Temporal Models

There are many applications where we want to model time explicitly (control, forecasting, online learning). Hidden Markov Models are among the simplest discrete-time models, but there are many others: Kalman filters for continuous state spaces, factorial Hidden Markov models for problems with many hidden variables, which allow for efficient variational inference, and dynamic Bayesian networks, which allow arbitrarily complex relationships between hidden and observed variables. Projects include

Comparing the performance of factorial Hidden Markov Models (Ghahramani & Jordan 1997) to dynamic Bayesian networks (K&F 18).
Experimental evaluation of approximate inference algorithms for DBNs, such as Boyen-Koller, Particle Filtering, and Thin Junction Trees (Paskin 2003). Kevin Murphy's thesis provides a good overview of inference in DBNs.
Comparing Kalman filters against more general DBN models.
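For reference, exact filtering in a plain discrete HMM (the simplest of the models above) is the forward recursion: predict with the transition model, then reweight by the observation likelihood and renormalize. The sketch below is a minimal version that assumes the parameters are given as nested lists. The two-state numbers are illustrative only.

```python
def forward_filter(prior, transition, emission, observations):
    """Return P(state_T | obs_1..T) for a discrete HMM.

    prior[s]         : P(state_1 = s)
    transition[r][s] : P(state_t+1 = s | state_t = r)
    emission[s][o]   : P(obs = o | state = s)
    """
    n = len(prior)

    def normalize(weights):
        total = sum(weights)
        return [w / total for w in weights]

    # Condition the prior on the first observation.
    belief = normalize([prior[s] * emission[s][observations[0]] for s in range(n)])
    for obs in observations[1:]:
        # Predict: push the belief through the transition model.
        predicted = [sum(belief[r] * transition[r][s] for r in range(n))
                     for s in range(n)]
        # Update: reweight by the likelihood of the new observation.
        belief = normalize([predicted[s] * emission[s][obs] for s in range(n)])
    return belief

# Illustrative two-state model (assumed numbers).
belief = forward_filter(
    prior=[0.5, 0.5],
    transition=[[0.7, 0.3], [0.3, 0.7]],
    emission=[[0.9, 0.1], [0.2, 0.8]],
    observations=[0, 0],
)
print(belief)
```

In a factorial HMM or a general DBN the belief state is a joint distribution over many hidden variables, so this table grows exponentially. That blow-up is what the approximate methods listed above are designed to avoid.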

References

K&F Chapter 18

Ghahramani, Z. and Jordan, M.I. (1997). Factorial Hidden Markov Models. Machine Learning 29: 245-273
http://www.gatsby.ucl.ac.uk/~zoubin/papers/fhmmML.ps.gz

Kevin Murphy's PhD Thesis.
http://www.cs.ubc.ca/~murphyk/Thesis/thesis.pdf

Kevin Murphy's book chapter on DBNs:
http://www.cs.ubc.ca/~murphyk/Papers/dbnchapter.pdf

Xavier Boyen and Daphne Koller, Tractable Inference for Complex Stochastic Processes, in Uncertainty in Artificial Intelligence UAI '98, 1998.
http://ai.stanford.edu/~xb/uai98/index.html

Xavier Boyen and Daphne Koller, Exploiting the Architecture of Dynamic Systems, in National Conference on Artificial Intelligence AAAI '99, 1999.
http://ai.stanford.edu/~xb/aaai99/index.html

Mark A. Paskin (2003). Thin Junction Tree Filters for Simultaneous Localization and Mapping. In G. Gottlob and T. Walsh eds., Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pp. 1157–1164. San Francisco, CA: Morgan Kaufmann.
http://ai.stanford.edu/~paskin/pubs/Paskin2003a.pdf

Topic D: Hierarchical Bayes

In text classification one can view a corpus as generated by a hierarchical process. For example, select an author. Given an author there is a distribution over topics he is interested in. Select a topic according to this distribution. Given a topic there is a distribution over words used in a document. Finally, generate a bag of words from this distribution. One approach to modelling this process is hierarchical Bayes (often nonparametric hierarchical Bayes).
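The generative story in the paragraph above can be written down directly as a sampler. The sketch below is only an illustration: the authors, topics, vocabularies, and probabilities are invented, and the distributions are fixed tables rather than draws from priors as they would be in a full hierarchical Bayes model.

```python
import random

# Hypothetical parameters, for illustration only.
AUTHOR_TOPICS = {
    "alice": {"sports": 0.8, "politics": 0.2},
    "bob": {"sports": 0.1, "politics": 0.9},
}
TOPIC_WORDS = {
    "sports": {"goal": 0.5, "team": 0.3, "season": 0.2},
    "politics": {"vote": 0.6, "senate": 0.3, "bill": 0.1},
}

def sample_from(dist, rng, k=1):
    """Draw k items from a dict mapping item -> probability."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items], k=k)

def sample_document(n_words, rng):
    """Author -> topic given author -> bag of words given topic."""
    author = rng.choice(list(AUTHOR_TOPICS))
    topic = sample_from(AUTHOR_TOPICS[author], rng)[0]
    bag_of_words = sample_from(TOPIC_WORDS[topic], rng, k=n_words)
    return author, topic, bag_of_words

rng = random.Random(0)
print(sample_document(5, rng))
```

Learning reverses this process: given only the bags of words (and possibly the authors), infer the topic assignments and the distributions.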

Another application of nonparametric hierarchical Bayes is clustering, where instead of selecting the number of clusters a priori, the model averages over the number of clusters to produce a posterior over clusterings of data points.

Potential projects include

Implement Latent Dirichlet Allocation for document clustering.
Compare nonparametric Bayesian clustering methods such as Dirichlet Processes and Chinese Restaurant Processes.
Implement a hierarchical clustering model, either for topic modelling or clustering.

References

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

Dirichlet process, Chinese restaurant processes and all that. M. I. Jordan. Tutorial presentation at the NIPS Conference, 2005.
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps

D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems (NIPS) 16, 2003.
http://www.cs.princeton.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf

D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.
http://www.cs.princeton.edu/~blei/papers/BleiJordan2004.pdf

Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2005.
http://www.cs.princeton.edu/~blei/papers/TehJordanBealBlei2004.pdf

Ian Porteous, Alex Ihler, Padhraic Smyth and Max Welling (2006) Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick-Breaking Representation. UAI 2006
http://www.ics.uci.edu/~welling/publications/papers/ddp_uai06_v8.pdf

Mark Steyvers and Tom Griffiths
Matlab Topic Modelling Toolbox.
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Topic E: Relational Models

Almost all of the machine learning / statistics methods you have studied assume that the data is independent or exchangeable. In many cases this is not true. For example, knowing the topic of a web page tells you something about the likely topics of pages linked to it. The independence assumption fails on most graph-structured data sets (relational databases, social networks, web pages).

Potential projects include

Implementing a restricted case of Probabilistic Relational Models (e.g., no existence uncertainty) and comparing the performance against some baseline non-relational models.
Implementing Relational Markov Networks and comparing the performance against some baseline non-relational models.

References

Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer. Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001
http://www.cs.umd.edu/~getoor/Publications/lprm-ch.ps

Discriminative Probabilistic Models for Relational Data, B. Taskar, P. Abbeel and D. Koller. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), Edmonton, Canada, August 2002.
http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps

L. Liao, D. Fox, and H. Kautz. Location-Based Activity Recognition. In Proceedings of the Neural Information Processing Systems (NIPS), 2005.
http://www.cs.washington.edu/homes/liaolin/Research/nips2005.pdf

Topic F: Hybrid Bayesian Networks

Many real systems contain a combination of discrete and continuous variables, which can be modeled as a hybrid BN. Potential projects include

Compare inference algorithms for hybrid DBNs against those that first discretize all the continuous variables, and then just use the standard algorithms (variable elimination, junction trees).

References

K&F Chapter 12

Hybrid Bayesian Networks for Reasoning about Complex Systems, Uri N. Lerner. Ph.D. Thesis, Stanford University, October 2002.
http://ai.stanford.edu/~uri/Papers/thesis.ps.gz

Topic G: Influence Diagrams

A Bayesian network models a part of the world, but not decisions taken by agents nor the effect that these decisions can have upon the world. Influence diagrams extend Bayesian networks with nodes that represent actions an agent can take, the costs and utilities of actions, and most importantly the relationships between them.

In a multiagent setting, finding the Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.

Projects include

Implementing algorithms for selecting a good or optimal strategy in the single-agent case (K&F 21)
Finding Nash equilibria in multiplayer games (Koller & Milch, 2003)

References

K&F Chapter 21

D. Koller and B. Milch (2003). "Multi-Agent Influence Diagrams for Representing and Solving Games." Games and Economic Behavior, 45(1), 181-221. Full version of paper in IJCAI '03.
http://ai.stanford.edu/~koller/Papers/Koller+Milch:GEB03.pdf

Nash Propagation for Loopy Graphical Games. M. Kearns and L. Ortiz. Proceedings of NIPS 2002.
http://www.cis.upenn.edu/~mkearns/papers/nashprop.pdf

Multiagent Planning with Factored MDPs;
Carlos Guestrin, Daphne Koller and Ronald Parr;
In Advances in Neural Information Processing Systems (NIPS 2001), pp. 1523 - 1530, Vancouver, Canada, December 2001.
http://www.cs.cmu.edu/~guestrin/Publications/NIPS2001MultiAgents/nips01-multiagents.ps.gz

Planning Under Uncertainty in Complex Structured Environments;
Carlos Guestrin;
Ph.D. Dissertation, Computer Science Department, Stanford University, August 2003.
http://www.cs.cmu.edu/~guestrin/Publications/Thesis/thesis.pdf

Topic H: Max-margin Graphical Models

Typically the parameters of a graphical model are learned by maximum likelihood or maximum a posteriori estimation. An alternative criterion for parameter estimation is to maximize the margin between classes, which can be thought of as combining graphical models (to represent structured relationships between inputs and outputs) with kernel methods.

An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing what the previous letter was tells you something about what the next letter is likely to be.

Projects include

Compare max-margin to likelihood-based methods (e.g., on character recognition or part-of-speech tagging)

References

Max-Margin Markov Networks, B. Taskar, C. Guestrin and D. Koller. Neural Information Processing Systems Conference (NIPS03), Vancouver, Canada, December 2003.
http://www.cs.berkeley.edu/~taskar/pubs/mmmn.ps

Taskar's thesis:
http://www.cs.berkeley.edu/~taskar/pubs/thesis.pdf

Topic I: Active Learning / Value of Information

Active learning refers to algorithms where the learner has some influence on what samples it sees. For example, say you can perform 5 tests on a patient, out of a panel of 60 tests. Given an existing model of patients, which ones do you pick? What about the sequential case, where you consider the result of each test before choosing another one? Possible projects include

Apply active learning to activity modelling or sensor networks (which sensor should you sample from).
Compare optimization criteria (eg, experimental design criteria) [CITE]
Active learning that models parameter uncertainty.
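One simple selection criterion for the sequential medical-test example is myopic value of information: pick the test whose outcome is expected to leave the least uncertainty about the quantity of interest. The sketch below assumes a single binary disease variable and binary tests described only by a sensitivity and a false-positive rate. The test names and numbers are hypothetical, and this greedy one-step rule is not one of the nonmyopic algorithms in the references.

```python
import math

def entropy(p):
    """Entropy in bits of a binary variable with P(true) = p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def expected_posterior_entropy(prior, sensitivity, false_positive_rate):
    """Expected entropy of the disease variable after seeing one test result.

    sensitivity         : P(test positive | disease)
    false_positive_rate : P(test positive | no disease)
    """
    p_positive = prior * sensitivity + (1.0 - prior) * false_positive_rate
    outcomes = (
        (p_positive, sensitivity),              # positive result
        (1.0 - p_positive, 1.0 - sensitivity),  # negative result
    )
    total = 0.0
    for p_outcome, likelihood_given_disease in outcomes:
        if p_outcome > 0:
            posterior = prior * likelihood_given_disease / p_outcome  # Bayes' rule
            total += p_outcome * entropy(posterior)
    return total

def pick_next_test(prior, tests):
    """Choose the test that minimizes expected posterior entropy."""
    return min(tests, key=lambda name: expected_posterior_entropy(prior, *tests[name]))

# Hypothetical panel: name -> (sensitivity, false positive rate).
panel = {"coin_flip": (0.5, 0.5), "blood_test": (0.95, 0.05), "questionnaire": (0.7, 0.4)}
print(pick_next_test(0.3, panel))
```

In the sequential case the prior would be replaced by the posterior after each observed result before calling `pick_next_test` again.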

References

A. Krause, C. Guestrin. "Near-optimal Nonmyopic Value of Information in Graphical Models". Proc. of Uncertainty in Artificial Intelligence (UAI), 2005
http://www.cs.cmu.edu/~krausea/files/05nearoptimal.pdf

A. Krause, C. Guestrin. "Optimal Value of Information in Graphical Models - Efficient Algorithms and Theoretical Limits". Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2005
http://www.cs.cmu.edu/~krausea/files/05optimal.pdf

Anderson, B. and Moore, A.
Fast Information Value for Graphical Models
In Neural Information Processing Systems, 2005.
http://www.cs.cmu.edu/~brigham/papers/nips2005.pdf

Active Learning: Theory and Applications. Simon Tong. Stanford University 2001.
http://www.robotics.stanford.edu/~stong/papers/tong_thesis.pdf

Topic J: Modeling Text and Images

Images are often annotated with text, such as captions or tags, which can be viewed as an additional source of information when clustering images or building topic models. For example, a green patch might indicate that there is a plant in the image, until one reads the caption "man in a green shirt". A related problem (Carbonetto et al. 2004) is data association, linking words to segmented objects in an image. For example, if the caption contains the words boat and sea, we would like to be able to associate these words with the segment(s) of the image corresponding to boat and sea.

References

D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134
http://www.cs.princeton.edu/~blei/papers/BleiJordan2003.pdf

Peter Carbonetto, Nando de Freitas and Kobus Barnard.
A Statistical Model for General Contextual Object Recognition. ECCV 2004
http://www.cs.ubc.ca/~nando/papers/mrftranstwo.pdf

Topic K: 2D CRFs for Visual Texture Classification

Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:

labeled training data.
C++ graphcuts code for approximate inference
Kevin Murphy's Matlab CRF code
Carl Rasmussen's Matlab conjugate gradient minimizer (better than using netlab or the Matlab optimization toolbox)
Intro to CRFs by Hanna Wallach
Maxent page, includes code
Steerable pyramid matlab code, possibly useful set of image features
Matlab wavelet toolbox, possibly useful set of image features.
Paper of CRFs for sign detection, J. Weinman, 2004
Markov Random Field Modeling in Computer Vision, S. Z. Li, 1995. (I have a hardcopy of the 2001 edition.)
G. Winkler, "Image Analysis, Random Fields, and MCMC Methods", 2nd edition, 2003.
Markov random fields and images, P. Perez. CWI Quarterly, 11(4):413-437, 1998. Review article.

2D CRFs for satellite image classification

The goal of this project is to classify pixels in satellite image data into classes like field vs road vs forest, using MRFs/CRFs (see above), or some other technique. Some possibly useful links:

Fully Bayesian Image Segmentation -- an Engineering Perspective, Morris et al, 1996.
A binary tree-structured MRF model for multispectral satellite image segmentation, 2003

Data Sets

Below are a number of data sets that could be used for your project. If you want to use a data set that is not on the list, it is strongly advised that you talk to either a TA or the instructor before submitting your initial proposal.

Thanks to Dieter Fox, Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data sets.

Data A: Functional MRI

Functional MRI (fMRI) measures brain activation over time, which allows one to measure changes as an activity is performed (e.g., looking at a picture of a cat vs. looking at a picture of a chair). Tasks using this data are typically of the form "predict cognitive state given fMRI data". fMRI data is both temporal and spatial: each voxel contains a time series, and each voxel is correlated with the voxels near it.

http://multivac.ml.cmu.edu/10708

Data B: Corel Image Data

Images featurized by color histogram, color histogram layout, color moments, and co-occurrence texture. Useful for projects on image segmentation, especially since there is a large benchmark repository available.

Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.

http://kdd.ics.uci.edu//databases/CorelFeatures/CorelFeatures.html
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/

Data C: Twenty Newsgroups

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Data D: Sensor Networks

Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th, 2004.

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which includes link quality information between pairs of sensors, is available at

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/lightsensor.zip

Ideas for projects include

· Learn graphical models representing the correlations between measurements at different nodes

· Develop new distributed algorithms for solving a learning task on this data

References:

http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf
http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf

Data E: arXiv Preprints

A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.

http://www.cs.cornell.edu/projects/kddcup/datasets.html

Data F: TRECVID

A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.

http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html

Data G: Activity Modelling

Activity modelling is the task of inferring what the user is doing from observations (e.g., motion sensors, microphones). This data set consists of GPS motion data for two subjects tagged with labels like car, working, athome, shopping.

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/gps-labels.zip

An example of a DBN model for this problem is

A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors (UAI-2006)
http://www.cs.washington.edu/homes/fox/abstracts/gps-msb-uai-06.abstract.html

Data H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/

Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.

Data I: Record Deduplication

The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the set of records that refer to unique entities. This problem is known by the varied names of deduplication, identity uncertainty, and record linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique". Some papers on record deduplication include

http://www.isi.edu/info-agents/papers/tejada01-is.pdf
http://www.cs.cmu.edu/~pradeepr/papers/kdd03.pdf
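The pairwise-classification view of deduplication can be sketched as follows. This is only a baseline under stated assumptions: records are plain strings, the similarity is Jaccard overlap of lowercased tokens, and the "classifier" is a fixed threshold. The example records and the threshold value are invented. The output labels each pair as a match (same entity) or a non-match, which corresponds to the "not-unique" / "unique" labels above.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the lowercased token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def classify_pairs(records, threshold=0.5):
    """Label every record pair (i, j) as a match if similarity >= threshold."""
    return [
        (i, j, jaccard(records[i], records[j]) >= threshold)
        for i, j in combinations(range(len(records)), 2)
    ]

# Invented example records.
records = [
    "Cafe Luna 12 Main St",
    "cafe luna 12 main street",
    "Blue Door Diner 9 Oak Ave",
]
for i, j, is_match in classify_pairs(records):
    print(i, j, is_match)
```

Independent pairwise decisions can be inconsistent (A matches B, B matches C, but A does not match C). A graphical model over the pair variables is one way to enforce transitivity, which is where this data set connects to the course.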

Data J: Enron e-mail

Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.

http://www.cs.cmu.edu/~enron/
http://www.cs.cmu.edu/~einat/datasets.html

Data K: Internet Movie Database

The Internet Movie Database makes its data publicly available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models.

http://www.imdb.com/interface

Data L: Netflix

Netflix is running a competition for movie recommendation algorithms. They've released a dataset of 100M ratings from 480K randomly selected users over 17K titles. The data set, and contest details, are available at

http://www.netflixprize.com

A much smaller (but more widely used) movie rating data set is Movielens

http://www.grouplens.org/

Data M: NIPS Corpus

A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.

http://www.cs.toronto.edu/~roweis/data.html

Data N: Character recognition (digits)

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

http://ai.stanford.edu/~btaskar/ocr/

Data O: Precipitation Data

This dataset includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.

Other sources of data

UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets (most ready for use in Matlab):

http://www.cs.toronto.edu/~roweis/data.html