Your Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. All projects must have an implementation component, though theoretical aspects may also be explored. You should also evaluate your approach, preferably on real-world data. Below, you will find some project ideas, but the best idea would be to combine optimization with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters. If you are uncertain about this requirement, please email the instructors.

Projects can be done by you as an individual, or in teams of two students.   Each project will also be assigned a 708 instructor as a project consultant/mentor.   They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours.  Your project will be worth 30% of your final class grade, and will have two final deliverables:

  1. a writeup in the format of a NIPS paper (8 pages maximum in NIPS format, including references; this page limit is strict), due Dec 3rd by 3pm by email to the instructors list, worth 60% of the project grade, and

  2. a poster presenting your work for a special class poster session on Dec 1st, 3-6pm in the NSH Atrium, worth 20% of the project grade. 

In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by Nov 6th at 5pm (the start of recitation), either by email or submitted to Michelle. The report is worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.

Project Proposal

You must turn in a brief project proposal (1-page maximum) by Oct 8th in class. Read the list of potential project ideas below (once posted). You are encouraged to use one of the ideas. If you prefer to do a different project with your own data set, you must already have access to the data and must present a clear proposal for what you would do with it.

Project proposal format: Proposals should be one page maximum. Include the following information:

Project suggestions:

Ideally, you will want to pick a problem in a domain of your interest, e.g., computer vision, natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using graphical models. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.

You can also find some project ideas below.


For each of the topics we provide some suggested readings. If you're interested in the problem, these are the references to start with. Do not consider these references exhaustive; you will be expected to review the literature in greater depth for your project. While you are not forced to choose one of these topics, it is strongly advised that you talk to the instructor if you want to deviate significantly from the topics below. 

Topic A: Structure Learning

This area refers to finding the qualitative (graph) structure of a set of variables in either a directed or undirected graphical model. Potential projects include


Koller & Friedman Chapter 17

Pieter Abbeel, Daphne Koller and Andrew Y. Ng.
Learning Factor Graphs in Polynomial Time & Sample Complexity.
Journal of Machine Learning Research, 7(Aug):1743--1788, 2006.

High dimensional graphical model selection using L1-regularized logistic regression. Martin Wainwright, Pradeep Ravikumar, John Lafferty. NIPS 2006

S.-I. Lee, V. Ganapathi, and D. Koller (2007). "Efficient Structure Learning of Markov Networks using L1-Regularization." Advances in Neural Information Processing Systems (NIPS 2006).

Sridevi Parise and Max Welling (2006) Structure Learning in Markov Random Fields, NIPS 2006

D. Margaritis. Distribution-Free Learning of Bayesian Network Structure in Continuous Domains. Proceedings of The Twentieth National Conference on Artificial Intelligence (AAAI), Pittsburgh, PA, July 2005.

Yuhong Guo and Russ Greiner (2005), "Discriminative Model Selection for Belief Net Structures". In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05).

Ajit Singh and Andrew Moore (2005), Finding Optimal Bayesian Networks by Dynamic Programming. Tech Report CMU-CALD-05-106

Mikko Koivisto and Kismat Sood (2004), Exact Bayesian Structure Discovery in Bayesian Networks. JMLR 5.

Topic B: Inference

The most common use of a probabilistic graphical model is computing queries, the conditional distribution of a set of variables given an assignment to a set of evidence variables. In general, this problem is NP-hard, which has led to a number of algorithms (both exact and approximate). Potential topics include


Koller & Friedman Chapters 8-12

Adnan Darwiche
Recursive Conditioning
In Artificial Intelligence Journal. Vol 125, No 1-2, pages 5-41. 2001.

T. Jaakkola.
Tutorial on variational approximation methods.
In Advanced mean field methods: theory and practice. MIT Press, 2000.

An Introduction to Variational Methods for Graphical Models M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. In M. I. Jordan (Ed.), Learning in Graphical Models, Cambridge: MIT Press, 1999.

Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Generalized Belief Propagation", Advances in Neural Information Processing Systems (NIPS), Vol 13, pps 689-695, December 2000

Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms", IEEE Transactions on Information Theory, ISSN 0018-9448, Vol. 51, Issue 7, pp. 2282-2312, July 2005

M. J. Wainwright, T. Jaakkola and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, vol. 51, page 2313--2335, July 2005

M. J. Wainwright, "Stochastic Processes on Graphs: Geometric and Variational Approaches", Ph.D. Thesis, Department of EECS, Massachusetts Institute of Technology, 2002.

Pawan Mudigonda, Vladimir Kolmogorov, and Philip Torr. An Analysis of Convex Relaxations for MAP Estimation.

M. J. Wainwright, T. S. Jaakkola and A. S. Willsky, 
MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming
approaches.  IEEE Transactions on Information Theory, Vol. 51(11), pages 3697--3717. November 2005.

Linear Programming Relaxations and Belief Propagation - an Empirical Study
Chen Yanover, Talya Meltzer, Yair Weiss
JMLR Special Issue on Machine Learning and Large Scale Optimization, Sep 2006

D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, T. Jaakkola. "Tightening LP Relaxations for MAP using Message Passing". Uncertainty in Artificial Intelligence UAI 2008

Max Welling. On the Choice of Regions for Generalized Belief Propagation
UAI 2004

Max Welling, Tom Minka and Yee Whye Teh (2005) Structured Region Graphs: Morphing EP into GBP. UAI 2005

Topic C: Temporal Models

There are many applications where we want to explicitly model time (control, forecasting, online learning). Hidden Markov models are one of the simplest discrete-time models, but there are many others: Kalman filters for continuous state spaces, factorial hidden Markov models for problems with many hidden variables, which allow for efficient variational inference, and dynamic Bayesian networks, which allow arbitrarily complex relationships between hidden and observed variables. Projects include


K&F Chapters 7.2 and 13

Ghahramani, Z. and Jordan, M.I. (1997).  Factorial Hidden Markov Models.  Machine Learning 29: 245-273

Kevin Murphy's PhD Thesis.

Kevin Murphy's book chapter on DBNs:

Xavier Boyen and Daphne Koller, Tractable Inference for Complex Stochastic Processes, in Uncertainty in Artificial Intelligence UAI '98, 1998.

Xavier Boyen and Daphne Koller, Exploiting the Architecture of Dynamic Systems, in National Conference on Artificial Intelligence AAAI '99, 1999.

Mark A. Paskin (2003). Thin Junction Tree Filters for Simultaneous Localization and Mapping. In G. Gottlob and T. Walsh eds., Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pp. 1157–1164. San Francisco, CA: Morgan Kaufmann.

Y. Shi, F. Guo, W. Wu and E. P. Xing, GIMscan: A New Statistical Method for Analyzing Whole-Genome Array CGH Data, The Eleventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2007).

A. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy Dynamic Bayesian Networks for Audio-Visual Speech Recognition EURASIP, Journal of Applied Signal Processing, 11:1-15, 2002

Topic D: Hierarchical Bayes Topic Models

Statistical topic models have recently gained much popularity for managing large collections of text documents. These models make the fundamental assumption that a document is a mixture of topics, where the mixture proportions are document-specific and signify how important each topic is to the document. Moreover, each topic is a multinomial distribution over a given vocabulary, which in turn dictates how important each word is for a topic. The document-specific mixture proportions provide a low-dimensional representation of the document in the topic space. This representation captures the latent semantics of the collection and can then be used for tasks like classification and clustering, or merely as a tool to structurally browse the otherwise unstructured collection. The best known of these models is Latent Dirichlet Allocation (LDA) (Blei et al. 2003). LDA has been the basis for many extensions in text, vision, bioinformatics, and social networks. These extensions incorporate more dependency structure in the generative process, such as modeling author-topic dependencies, or implement more sophisticated ways of representing inter-topic relationships.
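To make the generative assumption concrete, here is a minimal sketch of sampling a toy corpus from an LDA-style model using only the Python standard library. The function names, the symmetric hyperparameters `alpha` and `beta`, and the corpus sizes are illustrative assumptions, not part of any toolbox referenced on this page; the sketch only draws data and performs no inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample topics, per-document topic proportions, and word ids.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Each topic is a multinomial distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * vocab_size, rng)
              for _ in range(num_topics)]
    proportions, docs = [], []
    for _ in range(num_docs):
        # Document-specific mixture proportions over topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)                # pick a topic
            words.append(sample_categorical(topics[z], rng))  # pick a word
        proportions.append(theta)
        docs.append(words)
    return topics, proportions, docs
```

Each entry of `proportions` is the low-dimensional representation of a document described above; a project would typically go the other way and infer these proportions from observed words.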

Potential projects include


D. Blei. Probabilistic Models of Text and Images. PhD thesis, U.C. Berkeley, Division of Computer Science, 2004.


D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

Griffiths, T, Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235 2004.

Y.W. Teh, D. Newman and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS 2006.


Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. The Author-Topic Model for authors and documents. In UAI 2004.

D. Blei, J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems 21, 2007

J. Boyd-Graber, D. Blei, and X. Zhu. A topic model for word sense disambiguation. In Empirical Methods in Natural Language Processing, 2007.

Wei Li and Andrew McCallum. Pachinko Allocation: Scalable Mixture Models of Topic Correlations. Submitted to the Journal of Machine Learning Research, (JMLR), 2008

Application in Vision:

L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. 2005.

L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. IEEE Intern. Conf. in Computer Vision (ICCV). 2007

Application in Social Networks:

Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004.

E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, Mixed Membership Model for Relational Data. JMLR 2008.


Mark Steyvers and Tom Griffiths
Matlab Topic Modelling Toolbox.

David Blei
Latent Dirichlet allocation (LDA) in C .

Topic D: Non-parametric Hierarchical Bayes and Dirichlet processes

Clustering is an important problem in machine learning in which the goal is to learn the latent groups (clusters) in the data. While parametric approaches to clustering require the number of clusters to be specified, non-parametric approaches, like Dirichlet process mixture models (DPM), can model a potentially countably infinite number of clusters. The Dirichlet process (DP) provides a distribution over partitions of the data (i.e., clusters) and can be used as a prior over the number of clusters. Posterior inference (MAP) can then be used to do automatic model selection, or a fully Bayesian approach can be used to integrate over all possible clusterings, weighted by their posterior probability, in future predictions. The DP has been widely used not only in simple clustering settings, but also to model (and learn from data) general structures like trees, grammars, and hierarchies, with interesting applications in information retrieval, natural language processing, vision, and biology.
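One common way to picture the distribution over partitions induced by a DP is the Chinese restaurant process, covered in the tutorials listed below. The following is a minimal sketch of sampling a partition from it with the Python standard library; the function name and the `concentration` parameter name are assumptions made for illustration.

```python
import random


def sample_crp(num_customers, concentration, seed=0):
    """Sample a partition of num_customers items from a Chinese restaurant process.

    Returns the table index of each customer and the final table sizes.
    """
    rng = random.Random(seed)
    assignments = []
    table_sizes = []
    for _ in range(num_customers):
        # With n customers already seated, existing table k is chosen with
        # probability size_k / (n + concentration), and a new table with
        # probability concentration / (n + concentration).
        weights = table_sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table (new cluster)
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes
```

The number of occupied tables is not fixed in advance and tends to grow with the number of customers, which is the sense in which the number of clusters is left open.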

Potential projects include



Dirichlet process, Chinese restaurant processes and all that. M. I. Jordan. Tutorial presentation at the NIPS Conference, 2005.

Zoubin Ghahramani's UAI tutorial slides

Yee Whye Teh. Dirichlet process, Tutorial and Practical Course. MLSS 2007

Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2005.


D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.

Neal, R. M. (1998) "Markov chain sampling methods for Dirichlet process mixture models", Technical Report No. 9815, Dept. of Statistics, University of Toronto

Hal Daume III. Fast search for Dirichlet process mixture models. Conference on AI and Statistics (2007)

Vikash K. Mansinghka, Daniel M. Roy, Ryan Rifkin, Josh Tenenbaum. A-Class: A simple, online, parallelizable algorithm for probabilistic classification

Ian Porteous, Alex Ihler, Padhraic Smyth and Max Welling (2006) Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick-Breaking Representation. UAI 2006


Modeling Documents and IR: D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems (NIPS) 16, 2003.

NLP: P. Liang, S. Petrov, D. Klein, and M. Jordan. The infinite PCFG using Hierarchical Dirichlet processes. In Empirical Methods in Natural Language Processing, 2007

Haghighi, A. and Klein, D. (2007). Unsupervised co-reference resolution in a nonparametric Bayesian model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics

Vision: Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2005). Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems 18

J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised Discovery of Visual Object Class Hierarchies. CVPR 2008

Information Integration: Robert Hall, Charles Sutton, Andrew McCallum. Unsupervised Deduplication using Cross-field Dependencies. KDD 2008


Y.W. Teh. Nonparametric Bayesian Mixture Models - release 2.1.

Hal Daume III. Fast search for Dirichlet process mixture models

Kenichi Kurihara. Variational Dirichlet Process Gaussian Mixture Model

Topic E: Relational Models

Almost all of the machine learning / statistics methods you have studied assume that the data is independent or exchangeable. In many cases this is not true. For example, knowing the topic of a web page tells you something about the likely topics of pages linked to it. The independence assumption fails on most graph-structured data sets (relational databases, social networks, web pages).

Potential projects include


Learning Probabilistic Relational Models,  L. Getoor, N. Friedman, D. Koller, A. Pfeffer. Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001

Discriminative Probabilistic Models for Relational Data,  B. Taskar, P. Abbeel and D. Koller. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), Edmonton, Canada, August 2002.

L. Liao, D. Fox, and H. Kautz. Location-Based Activity Recognition. in Proceedings of the Neural Information Processing Systems (NIPS), 2005.

Razvan Bunescu and Raymond J. Mooney. Statistical Relational Learning for Natural Language Information Extraction. In Introduction to Statistical Relational Learning, Getoor, L. and Taskar, B. (Eds.), pp. 535-552, MIT Press, Cambridge, MA, 2007.

Hoifung Poon and Pedro Domingos. Joint Unsupervised Coreference Resolution with Markov Logic. EMNLP 2008.

Topic F: Hybrid Bayesian Networks

Many real systems contain a combination of discrete and continuous variables, which can be modeled as a hybrid BN. Potential projects include


K&F Chapter 14

Hybrid Bayesian Networks for Reasoning about Complex Systems, Uri N. Lerner. Ph.D. Thesis, Stanford University, October 2002.

Topic G: Influence Diagrams

A Bayesian network models a part of the world, but not decisions taken by agents nor the effect that these decisions can have upon the world. Influence diagrams extend Bayesian networks with nodes that represent actions an agent can take, the costs and utilities of actions, and most importantly the relationships between them. 
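As a minimal sketch of the idea, consider the smallest possible influence diagram: one chance node, one decision node, and one utility node. The code below computes the expected utility of each action and picks the best one. The weather scenario, the probabilities, and the utility values are all hypothetical numbers chosen for illustration, and the decision is assumed to be made without observing the chance node.

```python
def expected_utilities(chance_probs, utility):
    """Map each action to its expected utility under the chance distribution."""
    return {
        action: sum(chance_probs[state] * value
                    for state, value in by_state.items())
        for action, by_state in utility.items()
    }


def best_action(chance_probs, utility):
    """Return the action with maximum expected utility, plus all scores."""
    scores = expected_utilities(chance_probs, utility)
    return max(scores, key=scores.get), scores


# Hypothetical chance node: distribution over the weather.
weather = {"rain": 0.3, "dry": 0.7}

# Hypothetical utility node: utility[action][state].
utility = {
    "take_umbrella": {"rain": 70, "dry": 80},
    "leave_umbrella": {"rain": 0, "dry": 100},
}

action, scores = best_action(weather, utility)
```

With these numbers the expected utilities are 77 for taking the umbrella and 70 for leaving it, so the sketch selects `take_umbrella`. Realistic influence diagrams have many chance and decision nodes, which is where graphical-model structure becomes useful.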

In a multiagent setting, finding a Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.

Projects include


K&F Chapter 22

D. Koller and B. Milch (2003). "Multi-Agent Influence Diagrams for Representing and Solving Games." Games and Economic Behavior, 45(1), 181-221. Full version of paper in IJCAI '03.

Nash Propagation for Loopy Graphical Games. M. Kearns and L. Ortiz. Proceedings of NIPS 2002.

Multiagent Planning with Factored MDPs; 
Carlos Guestrin, Daphne Koller and Ronald Parr;
In Advances in Neural Information Processing Systems (NIPS 2001), pp. 1523 - 1530, Vancouver, Canada, December 2001.

Planning Under Uncertainty in Complex Structured Environments; 
Carlos Guestrin;
Ph.D. Dissertation, Computer Science Department, Stanford University, August 2003.

Topic H: Max-margin Graphical Models

Typically the parameters of a graphical model are learned by maximum likelihood or maximum a posteriori estimation. An alternative criterion for parameter estimation is to maximize the margin between classes, which can be thought of as a combination of graphical models (to represent structured relationships between inputs and outputs) with kernel methods. Projects include

An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing what the previous letter was tells you something about what the next letter is likely to be.


Max-Margin Markov Networks,  B. Taskar, C. Guestrin and D. Koller. Neural Information Processing Systems Conference (NIPS03), Vancouver, Canada, December 2003.

Taskar's thesis:

Topic I: Active Learning / Value of Information

Active learning refers to algorithms where the learner has some influence on which samples it sees. For example, say you can perform 5 tests on a patient, out of a panel of 60 tests. Given an existing model of patients, which ones do you pick? What about the sequential case, where you consider the result of each test before choosing another one? Possible projects include


A. Krause, C. Guestrin. "Near-optimal Nonmyopic Value of Information in Graphical Models".  Proc. of Uncertainty in Artificial Intelligence (UAI), 2005

A. Krause, C. Guestrin. "Optimal Value of Information in Graphical Models - Efficient Algorithms and Theoretical Limits". Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2005

Anderson, B. and Moore, A.
Fast Information Value for Graphical Models
In Neural Information Processing Systems, 2005.

Active Learning: Theory and Applications. Simon Tong. Stanford University 2001.

Topic J: Modeling Text and Images

Images are often annotated with text, such as captions or tags, which can be viewed as an additional source of information when clustering images or building topic models. For example, a green patch might indicate that there is a plant in the image, until one reads the caption "man in a green shirt". A related problem (Carbonetto et al. 2004) is data association, linking words to segmented objects in an image. For example, if the caption contains the words boat and sea, we would like to be able to associate these words with the segment(s) of the image corresponding to boat and sea.


D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134

Peter Carbonetto, Nando de Freitas and Kobus Barnard. 
A Statistical Model for General Contextual Object Recognition. ECCV 2004

Vidit Jain, Erik Learned-Miller, Andrew McCallum. People-LDA: Anchoring Topics to People using Face Recognition. International Conference on Computer Vision (ICCV), 2007

Topic K: 2D CRFs for Visual Texture Classification

Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:

2D CRFs for satellite image classification

The goal of this project is to classify pixels in satellite image data into classes like field vs road vs forest, using MRFs/CRFs (see above), or some other technique. Some possibly useful links:

Topic L: MAP-MRF Inference via Graph Cuts

Recent works have shown that for a particular class of pairwise potentials (loosely related to submodular functions), MAP inference in MRFs can be achieved via Graph Cuts for which max-flow based polynomial-time algorithms exist. Possible goals for this project include:

Fast Approximate Energy Minimization via Graph Cuts, Yuri Boykov, Olga Veksler and Ramin Zabih. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11), November 2001.

What Energy Functions can be Minimized via Graph Cuts?, Vladimir Kolmogorov and Ramin Zabih. In:  IEEE Transactions on Pattern Analysis and Machine Intelligence, February 2004.

A Comparative Study of Energy Minimization Methods for Markov Random Fields, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother.


Below are a number of data sets that could be used for your project. If you want to use a data set that is not on the list, it is strongly advised that you talk to either a TA or the instructor before submitting your initial proposal.

Thanks to Dieter Fox, Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data sets.

Data A: Functional MRI

Functional MRI (fMRI) measures brain activation over time, which allows one to measure changes as an activity is performed (e.g., looking at a picture of a cat vs. looking at a picture of a chair). Tasks using this data are typically of the form "predict cognitive state given fMRI data". fMRI data is both temporal and spatial: each voxel contains a time series, and each voxel is correlated with the voxels near it.

Data B: Corel Image Data

Images featurized by color histogram, color histogram layout, color moments, and co-occurrence texture. Useful for projects on image segmentation, especially since there is a large benchmark repository available.

Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.

Data C: Twenty Newsgroups

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is which of the 20 newsgroups it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Data D: Sensor Networks

Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th 2004.

This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which includes link quality information between pairs of sensors, is available at

Ideas for projects include 

· Learn graphical models representing the correlations between measurements at different nodes

· Develop new distributed algorithms for solving a learning task on this data


Data E: arXiv Preprints

A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.


A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.

Data G: Activity Modelling

Activity modelling is the task of inferring what the user is doing from observations (e.g., motion sensors, microphones). This data set consists of GPS motion data for two subjects, tagged with labels like car, working, athome, shopping.

An example of a DBN model for this problem is

A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors (UAI-2006)

Data H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.  

Data I: Record Deduplication

The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, which records refer to the same entity. This problem is known variously as deduplication, identity uncertainty, and record linkage.

One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique". Some papers on record deduplication include

Data J: Enron e-mail

Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.

Data K: Internet Movie Database

The Internet Movie Database makes their data publicly available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models.

Data L: Netflix

Netflix is running a competition for movie recommendation algorithms. They've released a dataset of 100M ratings from 480K randomly selected users over 17K titles. The data set, and contest details, are available at

A much smaller (but more widely used) movie rating data set is Movielens

Data M: NIPS Corpus

A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.

Data N: Character recognition (digits)

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Data O: Precipitation Data

This dataset includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.

Other sources of data

UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.

Sam Roweis also has a link to several datasets (most ready for use in Matlab):