Probabilistic Graphical Models
10-708, Spring 2012
Eric Xing, School of Computer Science, Carnegie Mellon University
Course Project
Your class project is an opportunity for you to explore an interesting problem in the context of a real-world data set. Projects should be done in teams of three students. Each project will be assigned a 708 instructor as a project consultant/mentor; instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 40% of your final class grade, and will have 4 deliverables:
- Proposal: 1 page (10%). Due: TBA
- Midway Report: 7-8 pages (20%). Due: TBA
- Final Report: At least 12 pages (40%). Due: TBA
- Presentation: (30%)
All write-ups should use the NIPS style.
Project Proposal
You must turn in a brief project proposal (1-page maximum). Read the list of potential project ideas below. We strongly suggest using one of these ideas, though you may discuss other project ideas with us, whether applied or theoretical. Note that even though you can use data sets you have used before, you cannot use work that you started prior to this class as your project.
Project proposal format: Proposals should be one page maximum. Include the following information:
- Project title
- Data set (this can include crawling websites for real-world data)
- Project idea. This should be approximately two paragraphs.
- Software you will need to write.
- Papers to read. Include at least 3 relevant papers, which you should read before submitting your proposal.
- Teammates: Who are your teammates? Your team should consist of three students.
- Midterm milestone: What will you complete by the midterm? Experimental results of some kind are expected here. You should also describe what portion of the project each teammate will be doing.
Midway Report
This should be a 7-8 page report, and it serves as a check-point. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections "under construction". Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as "place-holders" for the results you plan/hope to obtain.
Grading scheme for the project report:
- 70% for proposed method (should be almost finished)
- 25% for the design of upcoming experiments
- 5% for plan of activities (in an appendix, please show the old one and the revised one, along with the activities of each group member)
Final Report
Your final report is expected to be at least 12 pages. You should submit both an electronic and a hardcopy version for your final report. It should have roughly the following format:
- Introduction
  - Motivation
- Problem definition
- Proposed method
  - Intuition: why should it be better than the state of the art?
  - Description of its algorithms
- Experiments
  - Description of your testbed; list of questions your experiments are designed to answer
  - Details of the experiments; observations
- Conclusions
Presentation
All project teams are to present their work at the end of the semester. Each team will be given a timeslot to present their work to the whole class (using PowerPoint or equivalent software). Live demonstrations of your software are highly encouraged. More details will be announced later.
Project Suggestions:
Each of the following project suggestions has an associated instructor. If you are interested in a particular project, we highly recommend that you contact the instructor to get further ideas or details.
Contact details:
- Sinead: sinead AT cs DOT cmu DOT edu
- Junming: junmingy AT cs DOT cmu DOT edu
Influence in social media (Qirong)
In social media, the spread of themes and ideas along network links is called a cascade. These links can be explicit (as in the case of Facebook friends), or implicit (Twitter retweets/followers and up/down-voting of posts in forums). Non-statistical algorithms like NetInf (http://snap.stanford.edu/netinf/) have been developed to study cascades, but there have been few graphical model approaches thus far.
Your task is to (1) obtain a network by crawling a website of your choice, and (2) develop a graphical model to predict when a cascade will occur on this network, given knowledge of events such as "X people have retweeted Y", or "X people have up-voted post Y". As for the exact definition of a cascade, you may refer to the NetInf paper, or come up with your own. Also, you will need to decide how you will go about validating your model. We suggest either collecting real cascade data from your chosen website, or developing a simulator to generate artificial cascades.
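If you choose the simulator route, one possible starting point is the independent cascade model, in which each newly activated node gets a single chance to activate each of its neighbours. The sketch below is only an illustration under that assumption: the toy graph, the uniform activation probability `p`, and the function name `simulate_cascade` are hypothetical and are not taken from the NetInf paper.

```python
import random
from collections import deque

def simulate_cascade(graph, seeds, p, rng):
    """Run one independent-cascade simulation.

    graph: dict mapping node -> list of neighbours it can influence
    seeds: initially active nodes
    p:     probability that an active node activates each inactive neighbour
    Returns a dict mapping each activated node to its activation step.
    """
    activated = {s: 0 for s in seeds}
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, []):
            # Each active node gets exactly one attempt per outgoing edge.
            if v not in activated and rng.random() < p:
                activated[v] = activated[u] + 1
                frontier.append(v)
    return activated

# Hypothetical toy follower graph (not real data).
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
rng = random.Random(0)
sizes = [len(simulate_cascade(toy_graph, ["a"], 0.5, rng)) for _ in range(1000)]
mean_size = sum(sizes) / len(sizes)  # average cascade size over 1000 runs
```

Artificial cascades generated this way come with known ground truth, so you can check whether your model recovers the activation probabilities or predicts cascade size before moving to crawled data.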
Intervention in social media (Qirong)
This project is related to the previous one; we suggest reading that first.
When a social media cascade occurs, we might want to help or hinder its progress. For example, if we were in charge of a viral marketing campaign, we might want to directly contact celebrities in the hope that they spread our product. Conversely, if we were trying to prevent malicious apps from spreading on a social network, we might want to contain the problem by disabling particular accounts from the network. While this problem has been studied by epidemiologists in the context of disease spread, few of the methods in the literature take advantage of graphical models.
Your task is to (1) obtain a network by crawling a site of your choice, and (2) develop a graphical model to predict what nodes should be seeded/removed to help/hinder an ongoing cascade. As with the previous project, we leave the definition of a cascade up to you (though you should definitely read the NetInf paper and related literature to get some ideas). In order to validate your approach, you will need to generate cascades on your network (thus you will need to write a simulator of some kind).
Multifaceted Visualization of social media (Qirong)
Many algorithms developed to study social media are limited, in that they only consider the network or textual aspects of the medium. Yet, most social media are not limited to these information types or modalities: as an example, Facebook has pictures, videos, and various kinds of structural data such as tags and profile information. No summary or visualization of a social medium is really complete unless all available information has been considered. The challenge, then, is developing methods that can make use of diverse types of information.
Your task is to (1) obtain as much user data as you can by crawling some website of your choosing, and (2) develop a graphical model that "summarizes" the information you have collected, in the spirit of latent space models like Nallapati et al. (2008). Your model's output should (1) reveal interesting aspects of the social medium, and (2) must involve nontrivial probabilistic inference. Also, writing software to visualize your output will be a definite plus for this project.
Note that your data and graphical model must involve at least 2 modalities (e.g. text and network, or text and pictures, etc.), though we would really like to see 3 or more. In particular, your graphical model should be designed in a way that makes it easy to extend to new modalities.
Suggested Reading:
Ramesh Nallapati, Amr Ahmed, Eric Xing, William Cohen. Joint Topic Models for Text and Citations. KDD 2008.
High Dimensional Structure Learning of Markov Random Fields with Correlated Variables (Ankur)
Learning the structure of a Markov Random Field (a pairwise undirected graphical model) has many applications such as in regulatory genomics. However, in many of these cases, the number of variables (nodes) is very large but the sample size is small. As discussed briefly in class, recent methods have aimed to use the LASSO and other sparse regression methods to estimate MRF structure with strong theoretical guarantees under certain conditions. However, one of these conditions is that the variables are largely uncorrelated with one another. One solution to this problem is to simply preprocess the data and cluster the correlated variables. Recently, however, the high dimensional statistics community has proposed the Trace Lasso (Grave et al. 2011) which automatically takes into account correlations among variables. The goal of this project is to adapt this method for structure learning of a single MRF or a sequence of MRFs over time and explore the results empirically on a real dataset compared with other methods.
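To make the setup concrete, here is a sketch of the node-wise estimator for a binary (Ising) MRF with variables in $\{-1,+1\}$, followed by one possible way to swap in the Trace Lasso penalty. The notation ($x_s^{(i)}$, $X_{\setminus s}$, $\lambda$) is chosen here for illustration, and the second objective is an assumed adaptation rather than an established method.

```latex
% Neighborhood selection for node s: l1-regularized logistic regression of
% x_s on the remaining variables x_{\setminus s}, using n samples.
\hat{\theta}_s = \arg\min_{\theta \in \mathbb{R}^{p-1}}
  \frac{1}{n} \sum_{i=1}^{n}
  \log\Bigl(1 + \exp\bigl(-2\, x_s^{(i)}\, \theta^{\top} x_{\setminus s}^{(i)}\bigr)\Bigr)
  + \lambda \lVert \theta \rVert_1 ,
\qquad
\hat{N}(s) = \{\, t : \hat{\theta}_{s,t} \neq 0 \,\}.

% Assumed adaptation: replace the l1 norm with the Trace Lasso penalty, where
% X_{\setminus s} is the n x (p-1) design matrix with unit-norm columns and
% \lVert \cdot \rVert_* is the trace (nuclear) norm.
\hat{\theta}_s^{\mathrm{TL}} = \arg\min_{\theta \in \mathbb{R}^{p-1}}
  \frac{1}{n} \sum_{i=1}^{n}
  \log\Bigl(1 + \exp\bigl(-2\, x_s^{(i)}\, \theta^{\top} x_{\setminus s}^{(i)}\bigr)\Bigr)
  + \lambda \bigl\lVert X_{\setminus s}\, \mathrm{Diag}(\theta) \bigr\rVert_{*} .
```

With unit-norm columns, the Trace Lasso penalty reduces to the $\ell_1$ norm when the columns of the design matrix are orthogonal and to the $\ell_2$ norm when they are all identical, which is why it adapts to correlated variables.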
Suggested Reading:
Ravikumar P., Wainwright M.J., Lafferty J. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2010.
Edouard Grave, Guillaume R. Obozinski, Francis Bach. Trace Lasso: a trace norm regularization for correlated designs, NIPS 2011.
Graphical Models for Reciprocal Recommendations (Ankur)
Recommender systems, such as those used by Netflix or Amazon, recommend products to a user given the history of that user and of other users, and have become very popular recently. However, a variant of this problem called "reciprocal recommendation" is not well explored.
One example of a "reciprocal recommendation" system is online dating. Here the system wants to recommend to user A a possible match (user B), and then recommend A to B as well. For the match to be a success, A must like B and B must like A. (Note how this differs from the traditional scenario, where A simply must like object O.)
The goal of this project is to find a dataset and to develop and implement a model that can perform this task.
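As a simple baseline to compare a graphical model against, the two directed preferences can be combined into one symmetric score. The sketch below uses the harmonic mean, which is low unless both directions are high. The choice of harmonic mean, the function names, and the `prefs` dictionary of predicted like-probabilities are all assumptions made for illustration.

```python
def reciprocal_score(p_a_likes_b, p_b_likes_a):
    """Harmonic mean of the two directed preference scores (each in [0, 1])."""
    if p_a_likes_b <= 0 or p_b_likes_a <= 0:
        return 0.0
    return 2 * p_a_likes_b * p_b_likes_a / (p_a_likes_b + p_b_likes_a)

def rank_matches(user, candidates, prefs):
    """Rank candidates for `user` by reciprocal score, best first.

    prefs: dict mapping (x, y) -> predicted probability that x likes y
    """
    scored = [(reciprocal_score(prefs[(user, c)], prefs[(c, user)]), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

# Hypothetical predictions: A strongly likes B, but B does not like A back.
prefs = {("A", "B"): 0.9, ("B", "A"): 0.1, ("A", "C"): 0.6, ("C", "A"): 0.6}
ranking = rank_matches("A", ["B", "C"], prefs)
```

Here C is ranked above B for user A, even though A prefers B in isolation, because the preference for C is mutual.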
Suggested Reading:
L. Mackey, D. Weiss, M.I. Jordan, Mixed Membership Matrix Factorization. ICML 2011
Applications of Nonparametric Graphical Models (Ankur)
In many real-world domains, such as bioinformatics, computer vision, and fMRI analysis, variables are related by rich dependencies. However, these relationships are often nonlinear and not easily expressed with Gaussian distributions. Standard inference and learning techniques may therefore not be suitable, since most existing techniques only work for discrete or Gaussian graphical models. Several approaches have been proposed to address this problem:
(1) Gaussian copulas (i.e. the Nonparanormal)
(2) Kernel Density Estimation for Forest Graphical Models
(3) Hilbert Space Embedding of Distributions (i.e. Kernel Graphical Models)
The goal of this project is to find a suitable application where the data exhibit both rich structure as well as significant non-Gaussianity and to apply one or more of these methods to perform an interesting task and possibly compare the approaches.
Suggested Reading:
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert Space Embedding of Distributions, 2007.
L. Song, A.P. Parikh, E.P. Xing. Kernel Embeddings of Latent Tree Graphical Models. NIPS 2011.
H. Liu, J. Lafferty, L. Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. JMLR 2009.
H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest Density Estimation. Journal of Machine Learning Research (JMLR), Volume 12, 907-951, 2011.
Stochastic blockmodels for message frequencies (Sinead)
Stochastic blockmodels are a class of models for network data that assume each node belongs to a latent cluster. Each pair of clusters is associated with an edge probability, and the link between two nodes is sampled according to the edge probability associated with their groups. Extensions include allowing membership of multiple groups (mixed membership stochastic blockmodels), allowing infinitely many groups, and introducing temporal variation.
Often, rather than observing binary links between nodes, we observe edges indirectly in the form of events -- emails, messages, packets of data, etc. The Poisson process is a standard prior for modeling event times. We propose adapting the stochastic block model framework so that each latent cluster, rather than being associated with a link probability, is associated with the rate of a Poisson process. Extensions could include time-varying or mixed membership variations.
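As a rough sketch of the proposed likelihood, assume each ordered pair of nodes exchanges events according to a homogeneous Poisson process whose rate depends only on the two nodes' clusters. The number of events in a window of length `T` is then Poisson with mean `rate * T`. The function name, the argument layout, and the restriction to hard (non-mixed) memberships are assumptions made for illustration.

```python
import math

def poisson_sbm_loglik(counts, z, rates, T):
    """Log-likelihood of directed event counts under a Poisson blockmodel.

    counts: dict (i, j) -> number of events sent from node i to node j in [0, T]
    z:      list where z[i] is the cluster of node i
    rates:  rates[k][l] is the (positive) event rate from cluster k to cluster l
    T:      length of the observation window
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mean = rates[z[i]][z[j]] * T
            c = counts.get((i, j), 0)
            # log Poisson(c; mean) = c * log(mean) - mean - log(c!)
            total += c * math.log(mean) - mean - math.lgamma(c + 1)
    return total

# Hypothetical two-node example: node 0 sent two messages to node 1.
example = poisson_sbm_loglik({(0, 1): 2}, z=[0, 1], rates=[[1.0, 2.0], [3.0, 1.0]], T=1.0)
```

A function like this can be used to score candidate cluster assignments, for example inside a Gibbs sampler or an EM loop over `z` and `rates`.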
Modeling music using the sequence memoizer (Sinead)
N-gram models allow us to predict future terms in sequences, based on patterns we have seen before. The sequence memoizer is a hierarchical model that provides a nonparametric version of n-gram models, and has been successfully employed in modelling and compressing text data.
In this project, we propose using the sequence memoizer to model music sequences, either for prediction or for music compression. You will obtain and preprocess appropriate music datasets, implement the sequence memoizer, and create a demo for predicting held-out segments of music.
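Before implementing the sequence memoizer itself, it may help to have a fixed-order n-gram baseline to compare against. The sketch below is that baseline, not the sequence memoizer: it counts next-symbol frequencies for contexts up to a fixed length and backs off to shorter contexts when a context is unseen. The note-name encoding and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_ngram(sequence, order):
    """Count next-symbol frequencies for every context of length 0..order."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        for k in range(order + 1):
            if i - k >= 0:
                counts[tuple(sequence[i - k:i])][symbol] += 1
    return counts

def predict_next(counts, history, order):
    """Predict the next symbol, backing off from long contexts to short ones."""
    for k in range(min(order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # no training data at all

# Hypothetical toy melody encoded as note names.
melody = ["C", "D", "E", "C", "D", "E", "C", "D"]
model = train_ngram(melody, order=2)
seen = predict_next(model, ["C", "D"], order=2)    # context seen in training
unseen = predict_next(model, ["G", "C"], order=2)  # backs off to context ("C",)
```

Held-out prediction accuracy of this baseline gives a reference point for judging how much the nonparametric model helps.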
Inferring haplotypes using Dependent Dirichlet processes (Sinead)
An individual's genotype consists of two haplotypes, which can be thought of as being sampled from some mixture model. Since the total number of haplotypes in a population is unknown, the Dirichlet process has been used to model genotypes using an infinite mixture model. The hierarchical Dirichlet process extends such a model to allow different, but related, mixture models for different populations.
Such models do not account for the fact that we might have information about the similarity of populations -- for example, based on geographical proximity. The dependent Dirichlet process is a class of models that extend the Dirichlet process to model multiple distributions associated with times and locations, such that distributions that are close tend to be more similar. We propose using such a model to infer haplotypes in related populations.
This project will consist of reviewing a number of existing dependent Dirichlet process models, selecting a model appropriate for the task, implementing the chosen model and evaluating against existing models.
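For intuition about the basic (non-dependent) Dirichlet process prior that these models extend, the sketch below samples cluster assignments from a Chinese restaurant process, where each cluster could stand for one ancestral haplotype. It covers only the prior over partitions: there is no genotype likelihood and no dependence across populations. The function name and the concentration parameter `alpha` are chosen here for illustration.

```python
import random

def crp_assignments(n, alpha, rng):
    """Sample cluster labels for n items from a Chinese restaurant process."""
    labels, sizes = [], []
    for i in range(n):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        r = rng.random() * (i + alpha)
        acc = 0.0
        choice = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            acc += size
            if r < acc:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(0)
        sizes[choice] += 1
        labels.append(choice)
    return labels, sizes

labels, sizes = crp_assignments(100, alpha=1.0, rng=random.Random(0))
```

The number of clusters is not fixed in advance and grows slowly with `n`, which is the property that makes this prior attractive when the number of haplotypes is unknown.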
Supervised LDA (or MMSB) (Junming)
There are two versions of supervised LDA that incorporate response information in order to find features (topic vectors) that discriminate between classes. The first approach is based on a GLM that uses the empirical topic proportions as covariates; the second employs class-dependent transformation parameters.
The goal of the project is an empirical comparison of these two approaches in terms of the predictive power of the learned features. A related project is to adapt these methods to the MMSB setting.
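The two approaches can be summarised roughly as follows. This is a sketch in notation chosen here, showing the Gaussian-response case of the GLM for concreteness; check the details against the two papers before relying on it.

```latex
% Supervised LDA: the response of document d depends on its empirical topic
% proportions \bar{z}_d, where z_{dn} is the one-hot topic indicator of word n.
\bar{z}_d = \frac{1}{N_d} \sum_{n=1}^{N_d} z_{dn},
\qquad
y_d \mid z_{d,1:N_d}, \eta, \sigma^2 \sim \mathcal{N}\bigl(\eta^{\top} \bar{z}_d,\; \sigma^2\bigr).

% DiscLDA: the class label y_d selects a linear transformation T^{y_d} that
% maps the document-level topic z_{dn} to the topic u_{dn} that emits the word.
z_{dn} \sim \mathrm{Mult}(\theta_d),
\qquad
u_{dn} \mid z_{dn}, y_d \sim \mathrm{Mult}\bigl(T^{y_d}_{\cdot,\, z_{dn}}\bigr),
\qquad
w_{dn} \mid u_{dn} \sim \mathrm{Mult}\bigl(\Phi_{\cdot,\, u_{dn}}\bigr).
```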
Suggested Reading:
David M. Blei, Jon D. McAuliffe, Supervised Topic Models
Simon Lacoste-Julien, Fei Sha, Michael I. Jordan, DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Hybrid inference for MMSB (Junming)
Hybrid variational/Gibbs-sampling methods have been proposed for inference in topic models such as LDA, and have been shown to combine the merits of both approaches. In this project we propose to extend this idea to MMSB, a popular mixed-membership model for network interactions.
Suggested Reading:
Max Welling, Hybrid Variational/Gibbs Collapsed Inference in Topic Models
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels
Tightness of LP relaxation (Junming)
There is a restricted class of discrete graphical models called 'attractive graphical models', which is widely used in computer vision. For this family of models, a polynomial-time algorithm exists for finding the exact MAP estimate. A loopy version of the max-product algorithm can also be applied, but whether it will find the exact answer is an open problem. Put differently, it is unknown whether the LP relaxation of such models is tight.
This project consists of implementing the max-product algorithm and investigating its theoretical properties in this restricted class of graphical models.
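A minimal starting point is sketched below: synchronous max-product in the log domain (max-sum) on a binary pairwise model, checked against brute-force enumeration. The parameterisation is an assumption made for illustration: each edge adds a non-negative reward `w` when its endpoints agree, which is one simple way to write an attractive model. On a tree the decoded assignment is exact when the MAP is unique; on graphs with cycles the comparison against brute force is the empirical question of the project.

```python
import itertools

def score(x, unary, edges, w):
    """Log-potential of assignment x: unary terms plus w[e] for each agreeing edge."""
    s = sum(unary[i][xi] for i, xi in enumerate(x))
    return s + sum(w[e] for e in edges if x[e[0]] == x[e[1]])

def brute_force_map(unary, edges, w):
    """Exact MAP by enumeration (only feasible for a handful of nodes)."""
    n = len(unary)
    return max(itertools.product([0, 1], repeat=n),
               key=lambda x: score(x, unary, edges, w))

def max_product(unary, edges, w, iters=50):
    """Synchronous (loopy) max-product with log-domain messages."""
    n = len(unary)
    nbrs = {i: [] for i in range(n)}
    coupling = {}
    for (i, j) in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        coupling[(i, j)] = coupling[(j, i)] = w[(i, j)]
    msg = {(i, j): [0.0, 0.0] for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = []
            for xj in (0, 1):
                out.append(max(
                    unary[i][xi]
                    + (coupling[(i, j)] if xi == xj else 0.0)
                    + sum(msg[(k, i)][xi] for k in nbrs[i] if k != j)
                    for xi in (0, 1)))
            shift = max(out)  # normalise so messages stay bounded
            new[(i, j)] = [v - shift for v in out]
        msg = new
    beliefs = [[unary[i][x] + sum(msg[(k, i)][x] for k in nbrs[i]) for x in (0, 1)]
               for i in range(n)]
    return tuple(0 if b[0] >= b[1] else 1 for b in beliefs)

# Hypothetical 4-node chain with agreement reward 1.0 on every edge.
unary = [[2.0, 0.0], [0.0, 0.5], [0.0, 0.3], [0.0, 1.5]]
edges = [(0, 1), (1, 2), (2, 3)]
w = {e: 1.0 for e in edges}
mp_result = max_product(unary, edges, w)
exact = brute_force_map(unary, edges, w)
```

Adding an edge such as `(0, 3)` turns the chain into a cycle, after which agreement between `mp_result` and `exact` is no longer guaranteed by the tree argument and can be tested empirically.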
Suggested Reading:
M. J. Wainwright, T. S. Jaakkola and A. S. Willsky (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory
Hiroshi Ishikawa, Exact optimization for Markov random fields with convex priors
