Course Project

Your class project is an opportunity for you to explore an interesting problem in the context of a real-world data set. Projects should be done in teams of three students. Each project will be assigned a 708 instructor as a project consultant/mentor; instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 40% of your final class grade, and will have 4 deliverables:

  1. Proposal : 1 page (10%).
    Due : Feb 14
  2. Midway Report : 7-8 pages (20%).
    Due : March 18
  3. Final Report : At least 12 pages (40%).
    Due : May 8
  4. Presentation : (30%)
    NSH 1305, May 6, 9 AM - 1 PM

All write-ups should use the NIPS style.

Project Proposal

You must turn in a brief project proposal (1-page maximum). Read the list of potential project ideas below. We strongly suggest using one of these ideas, though you may discuss other project ideas with us, whether applied or theoretical. Note that even though you can use data sets you have used before, you cannot use work that you started prior to this class as your project.

Project proposal format: Proposals should be one page maximum. Include the following information:

Midway Report

This should be a 7-8 page report, and it serves as a checkpoint. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections 'under construction'. Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; and the sections on the experiments and conclusions should contain whatever results you have obtained so far, along with placeholders for the results you plan or hope to obtain.

Grading scheme for the project report:

Final Report

Your final report is expected to be at least 12 pages. You should submit both an electronic and a hardcopy version of your final report. It should have roughly the following format:

Presentation

All project teams are to present their work at the end of the semester. Each team will be given a timeslot to present their work to the whole class (using PowerPoint or equivalent software). Live demonstrations of your software are highly encouraged.



Project Suggestions:

If you are interested in a particular project, we highly recommend that you contact one of the instructors to get further ideas or details.

Contact details:

  • Gunhee: gunhee AT cs DOT cmu DOT edu
  • Seunghak: seunghak AT cs DOT cmu DOT edu
  • Kriti: kpuniyan AT cs DOT cmu DOT edu
  • Chong: chongw AT cs DOT cmu DOT edu
  • Qirong: qho+ AT cs DOT cmu DOT edu
  • Ankur: apparikh AT cs DOT cmu DOT edu
  • Sinead: sinead AT cs DOT cmu DOT edu
  • Junming: junmingy AT cs DOT cmu DOT edu

1) Personalized topic modelling

Topic modelling is a useful way of organizing textual documents. Imagine the following scenario: we have a large set of documents, each user owns a portion of the entire dataset, and the users' portions typically overlap. How can we build a model that is tailored to each individual user yet shares statistical strength across all users? Such a model could provide a better experience for each individual user.

Key points: model design, efficient inference and evaluation.
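As a pre-personalization baseline, here is a minimal sketch that fits plain (non-personalized) LDA with scikit-learn; the toy corpus and topic count are illustrative assumptions, not part of the project spec:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["topic models organize large document collections",
        "users share documents in overlapping libraries",
        "gibbs sampling for latent dirichlet allocation",
        "collaborative filtering recommends scientific articles"]  # stand-in corpus
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-document topic proportions

From here, the modeling question is how to couple per-user topic proportions so that users share statistical strength.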

Potential data set: http://www.cs.cmu.edu/~chongw/citeulike/

Suggested reading:

[1] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. KDD 2011.
[2] J. Eisenstein, A. Ahmed, and E. P. Xing. Sparse Additive Generative Models of Text. ICML 2011.


2) A thorough empirical study of approximate inference algorithms on one or two popular models

Take Bayesian logistic regression as an example: systematically study the advantages and disadvantages of different (approximate) inference algorithms, including
* MAP estimation or the Laplace approximation
* Gibbs sampling
* Variational inference using the Jaakkola & Jordan bound
* The multivariate delta method

A good report on this will be a nice reference for many people working in approximate inference.
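As a concrete starting point for the first bullet, here is a minimal sketch of the Laplace approximation for Bayesian logistic regression, assuming a spherical Gaussian prior (the prior variance and the BFGS optimizer are our choices, not requirements):

import numpy as np
from scipy.optimize import minimize

def laplace_logreg(X, y, prior_var=1.0):
    # Laplace approximation N(w_map, H^-1) to p(w | X, y), with y in {0, 1}
    # and prior w ~ N(0, prior_var * I).
    def neg_log_post(w):
        z = X @ w
        ll = y @ z - np.logaddexp(0, z).sum()     # Bernoulli log-likelihood
        return -(ll - w @ w / (2 * prior_var))
    w_map = minimize(neg_log_post, np.zeros(X.shape[1]), method="BFGS").x
    p = 1 / (1 + np.exp(-X @ w_map))
    H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(X.shape[1]) / prior_var
    return w_map, np.linalg.inv(H)                # posterior mean, covariance

The same model can then be fit with Gibbs sampling and the variational bounds, and the resulting posteriors compared, e.g., on held-out likelihood.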

Suggested reading:

[1] Jaakkola, T., and M. Jordan. "A variational approach to Bayesian logistic regression models and their extensions." Sixth International Workshop on Artificial Intelligence and Statistics. 1997.
[2] Chong Wang and David M. Blei. Variational inference in nonconjugate models. 2012
[3] M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489), 2010.


3) Detection of disease-related genetic variants

In biology, it is known that some genetic variants in human genomes (different individuals carry different variants) are responsible for disease susceptibility. For example, some genetic variants may lead to cancer with a certain probability, so individuals carrying those variants are more likely to develop cancer. In this project, we want to find genetic variants that are associated with a particular disease. As a starting point, you are encouraged to implement a baseline algorithm such as logistic regression with L1 regularization to detect disease-related genetic variants. Then, your task is to improve on the baseline (in this case, L1-regularized logistic regression) with a more sophisticated model that exploits the structure of the data or biological information (e.g., a pathway database). It is recommended to reduce the scope of the project by choosing a specific dataset [4] and a specific method [2,3].
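A minimal baseline sketch, assuming scikit-learn; the genotype matrix below is a random stand-in for a real dataset such as [4]:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)   # minor-allele counts
y = rng.integers(0, 2, size=200)                         # disease status (synthetic)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_.ravel())             # candidate variants

The structured methods of [2,3] would replace the plain L1 penalty with one that respects pathway or group information.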

Suggested reading:

[1] Wu et al. "Genome-wide association analysis by lasso penalized logistic regression." Bioinformatics 25.6 (2009): 714-721.
[2] Jenatton et al. "Proximal methods for sparse hierarchical dictionary learning." Proceedings of the International Conference on Machine Learning (ICML). 2010.
[3] Meier et al. "The group lasso for logistic regression." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.1 (2008): 53-71.
[4] Breast cancer dataset: http://sage.fhcrc.org/downloads/downloads.php


4) Detection of gene expression-related genetic variants

This project is related to the previous one (see its references). The main difference is that this project aims to detect genetic variants that are related to gene expression levels (continuous values) rather than disease status (discrete values; disease/healthy status is denoted by 0/1). As a baseline, you are encouraged to implement a simple method such as linear regression with L1 regularization. Then your task is to improve on the baseline (in this case, L1-regularized linear regression) by taking advantage of genome structure (e.g., linkage disequilibrium) or genome annotations for each genetic variant (e.g., gene locations). It is recommended to reduce the scope of this project by choosing a specific dataset [3] and a specific method [1,2].
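The analogous baseline for a continuous trait, again assuming scikit-learn and synthetic stand-in data:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)   # synthetic genotypes
y = 0.8 * X[:, 7] + rng.normal(size=200)                 # synthetic expression trait
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)                   # variants with nonzero effect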

Suggested reading:

[1] Grave, Edouard, Guillaume Obozinski, and Francis Bach. "Trace Lasso: a trace norm regularization for correlated designs." arXiv preprint arXiv:1109.1990 (2011).
[2] Rank regression, http://www.math.wustl.edu/~sawyer/handouts/RankRegress.pdf
[3] Yeast eQTL dataset (ask the TAs)


5) Identifying genetic interactions in genome data

One of the open research problems in computational biology is to identify interactions among genes. A popular definition of an interaction between gene A and gene B is as follows: if the overall effect of A and B deviates from the sum of their individual effects, there is an interaction between A and B. As a starting point, you can try a popular approach such as the graphical lasso [1] for interaction detection on yeast data [3]. To improve on the graphical lasso, you may want to add constraints to your model based on biological information.
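A starting-point sketch for the graphical lasso [1], assuming scikit-learn; the expression matrix is a random stand-in and the regularization level is illustrative:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 30))                        # samples x genes
prec = GraphicalLasso(alpha=0.2).fit(expr).precision_
edges = np.argwhere(np.triu(np.abs(prec) > 1e-6, k=1))   # candidate interacting pairs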

Suggested reading:

[1] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Sparse inverse covariance estimation with the graphical lasso." Biostatistics 9.3 (2008): 432-441.
[2] Wu, Xintao, Yong Ye, and Liying Zhang. "Graphical modeling based gene interaction analysis for microarray data." ACM SIGKDD Explorations Newsletter 5.2 (2003): 91-100.
[3] Yeast eQTL data (ask the TAs)


6) Parallel algorithms for network inference

With large-scale datasets such as human genome data, it is desirable to infer networks in parallel. For those of you who are interested in systems, it is recommended to focus on one network inference method (e.g., the graphical lasso [1]) and make its algorithm fast and/or parallel.
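One simple decomposition worth knowing (our suggestion, not the only route): neighborhood selection reduces network inference to one lasso regression per node, and those regressions parallelize trivially. A sketch assuming scikit-learn and joblib:

import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import Lasso

def neighborhood(X, j, alpha=0.1):
    # Lasso-regress node j on all other nodes; nonzero coefficients
    # are j's estimated neighbors.
    mask = np.arange(X.shape[1]) != j
    coef = Lasso(alpha=alpha).fit(X[:, mask], X[:, j]).coef_
    return np.flatnonzero(mask)[coef != 0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                           # samples x nodes
nbrs = Parallel(n_jobs=4)(delayed(neighborhood)(X, j) for j in range(X.shape[1]))

Parallelizing the graphical lasso itself, which couples all nodes through a shared precision matrix, is the harder and more interesting systems problem.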

Suggested reading:

[1] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Sparse inverse covariance estimation with the graphical lasso." Biostatistics 9.3 (2008): 432-441.
[2] Witten, Daniela M., Jerome H. Friedman, and Noah Simon. "New insights and faster computations for the graphical lasso." Journal of Computational and Graphical Statistics 20.4 (2011): 892-900.
[3] Bradley, Joseph K., et al. "Parallel coordinate descent for l1-regularized loss minimization." ICML 2011.
[4] Yeast eQTL data (ask the TAs)


7) Human pose/action recognition in natural images

Suppose that we are given an image from Facebook, Flickr, or Twitter. Can we classify whether the image contains a human? If so, can we localize the humans in the image, detect the pose of the human body parts, and eventually infer what activity the person is performing? In this project, you are encouraged to implement your own human pose/action recognition system. Since this area of research is vast, we do not recommend tackling the fully general problem; a more specific task is better (e.g., upper-body detection, or face detection plus human segmentation). Here are some references as a starting point.

[1] CVPR 2011 Tutorial on Human Activity Recognition (http://cvrc.ece.utexas.edu/mryoo/cvpr2011tutorial/)
[2] Human Activity Recognition summer course (http://www.cs.sfu.ca/~mori/courses/cmpt888/summer10/)
[3] Stanford Vision Lab (http://vision.stanford.edu/discrim_rf/) (http://ai.stanford.edu/~bangpeng/ppmi.html)
[4] Poselets (http://www.cs.berkeley.edu/~lbourdev/poselets/)
[5] 2D articulated human pose estimation (http://groups.inf.ed.ac.uk/calvin/articulated_human_pose_estimation_code/)


8) Detecting Regions of Interest

The goal of this project is to quickly find regions of interest that are likely to contain a single coherent topic or object. The details of the problem statement can be found in [1]. The Matlab code for cosegmentation in [2-3] may also be helpful for this project. In addition, the following two lines of work are closely related to this topic.

(1) Objectness detection. This task identifies rectangular regions in an image that are likely to contain an object of any class. You can first try the two popular objectness detectors in [4-5].

(2) Saliency detection. The goal of saliency detection is to detect the regions that best represent the scene. You can start your project from one of the Matlab implementations available on the Web, such as [6-7].

As with the other projects, keep the scope reasonable.

[1] G. Kim and A. Torralba. Unsupervised Detection of Regions of Interest using Iterative Link Analysis. NIPS 2009. (Project homepage: http://www.cs.cmu.edu/~gunhee/r_roi.html)
[2] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade. Distributed Cosegmentation via Submodular Optimization on Anisotropic Diffusion. ICCV 2011. (Project homepage: http://www.cs.cmu.edu/~gunhee/r_seg_submod.html)
[3] G. Kim and E. P. Xing. On Multiple Foreground Cosegmentation. CVPR 2012. (Project homepage: http://www.cs.cmu.edu/~gunhee/r_mfc.html)
[4] Objectness measure V1.5 (http://groups.inf.ed.ac.uk/calvin/objectness/).
[5] Category Independent Object Proposals (http://vision.cs.uiuc.edu/proposals/).
[6] Graph-Based Visual Saliency (GBVS) Matlab code (http://www.klab.caltech.edu/~harel/share/gbvs.php)
[7] Context-Aware Saliency Detection (http://webee.technion.ac.il/labs/cgm/Computer-Graphics-Multimedia/Software/Saliency/Saliency.html)


9) Image Labeling using Conditional Random Fields

In this project, you can implement your own algorithm for classifying and labeling regions in an image. Arguably the two most popular examples are Discriminative Random Fields [1] and TextonBoost [2]. As a project, you may implement one of these algorithms and test it on real-world images, or you can propose a novel image-labeling algorithm. You may start the project with the code of [3].

[1] S. Kumar and M. Hebert. Discriminative Random Fields. IJCV 2006.
[2] J. Shotton, C. Rother, and A. Criminisi. TextonBoost. ECCV 2006. (http://jamie.shotton.org/work/code.html)
[3] Justin's GM/CRF Toolbox (http://phd.gccis.rit.edu/justindomke/JGMT/).


10) Generalized linear model with regularization

Regularization of the regression coefficients is a common technique for robust modeling in generalized linear models [1]. In this project, your goal is to implement and test various types of regularization (e.g., L1, L1/L2, elastic net) for generalized linear models such as Poisson regression and the Cox model. (You may focus on one type of regression so that the project is achievable in one semester.) For example, if you are interested in Poisson regression, you can start from the implementations in [2-3]. As another interesting variation, you can introduce the reduced-rank regression idea into your model [4].
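For the Poisson case specifically, here is a minimal numpy sketch of L1-regularized Poisson regression via proximal gradient descent (ISTA); the fixed step size and penalty level are assumptions, and a line search would be safer in practice:

import numpy as np

def l1_poisson(X, y, lam=0.1, step=1e-3, iters=2000):
    # Model: y_i ~ Poisson(exp(x_i @ w)); minimize -loglik + lam * ||w||_1.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (np.exp(X @ w) - y)                       # gradient of -loglik
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0) # soft-threshold
    return w

rng = np.random.default_rng(0)
X = rng.normal(scale=0.5, size=(100, 10))
y = rng.poisson(np.exp(X @ np.r_[1.0, 1.0, np.zeros(8)]))
w_hat = l1_poisson(X, y)                                       # sparse estimate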

[1] glmnet: Lasso and elastic-net regularized generalized linear models (http://cran.r-project.org/web/packages/glmnet/index.html).
[2] Generalized Linear Model (GLM) for spike trains (http://pillowlab.cps.utexas.edu/code_GLM.html).
[3] Sparse GLM (http://www4.stat.ncsu.edu/~hzhou3/softwares/sparsereg/html/demo_glm.html).
[4] L. Chen, J. Z. Huang. Sparse Reduced-Rank Regression for Simultaneous Dimension Reduction and Variable Selection. JASA 2012.


11) Network Inference

Networks are ubiquitous: computer networks, social networks, the World Wide Web, and gene regulation networks are all examples. In many cases, the underlying network structure is hidden, and only state changes of the nodes are observable (e.g., viral marketers can track when customers buy products, but typically cannot observe who influenced those decisions). Here we introduce two recent threads of work in this area. We encourage you to extend these frameworks with novel applications in your own research context.

(1) Network inference of diffusion and influence. The papers, codes, and data are available at [1].

(2) The inference of time-varying networks, including directed networks [2], undirected networks [3], and networks with jumps [4].

[1] Network Inference (http://www.stanford.edu/~manuelgr/software.html).
[2] L. Song, M. Kolar, E. P. Xing. Time-Varying Dynamic Bayesian Networks. Advances in Neural Information Processing Systems 23, 2009.
[3] M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Annals of Applied Statistics, 2010.
[4] M. Kolar, E.P. Xing. Estimating Networks with Jumps. Electronic Journal of Statistics, 2012.


12) Influence in social media

In social media, the spread of themes and ideas along network links is called a cascade. These links can be explicit (as in the case of Facebook friends), or implicit (Twitter retweets/followers and up/down-voting of posts in forums). Non-statistical algorithms like NetInf (http://snap.stanford.edu/netinf/) have been developed to study cascades, but there have been few graphical model approaches thus far.

Your task is to (1) obtain a network by crawling a website of your choice, and (2) develop a graphical model to predict when a cascade will occur on this network, given knowledge of events such as "X people have retweeted Y", or "X people have up-voted post Y". As for the exact definition of a cascade, you may refer to the NetInf paper, or come up with your own. Also, you will need to decide how you will go about validating your model. We suggest either collecting real cascade data from your chosen website, or developing a simulator to generate artificial cascades.
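If you take the simulator route, one standard choice (our assumption; NetInf does not prescribe it) is the independent cascade model, in which each newly activated node gets a single chance to activate each of its neighbors:

import random

def independent_cascade(graph, seeds, p=0.1):
    # graph: dict mapping node -> list of neighbors; returns the activated set.
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

g = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
print(independent_cascade(g, seeds=[0]))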


13) Intervention in social media

This project is related to the previous one; we suggest reading that first.

When a social media cascade occurs, we might want to help or hinder its progress. For example, if we were in charge of a viral marketing campaign, we might want to directly contact celebrities in the hope that they spread our product. Conversely, if we were trying to prevent malicious apps from spreading on a social network, we might want to contain the problem by disabling particular accounts from the network. While this problem has been studied by epidemiologists in the context of disease spread, few of the methods in the literature take advantage of graphical models.

Your task is to (1) obtain a network by crawling a site of your choice, and (2) develop a graphical model to predict what nodes should be seeded/removed to help/hinder an ongoing cascade. As with the previous project, we leave the definition of a cascade up to you (though you should definitely read the NetInf paper and related literature to get some ideas). In order to validate your approach, you will need to generate cascades on your network (thus you will need to write a simulator of some kind).


14) Multifaceted Visualization of social media

Many algorithms developed to study social media are limited, in that they only consider the network or textual aspects of the medium. Yet, most social media are not limited to these information types or modalities: as an example, Facebook has pictures, videos, and various kinds of structural data such as tags and profile information. No summary or visualization of a social medium is really complete unless all available information has been considered. The challenge, then, is developing methods that can make use of diverse types of information.

Your task is to (1) obtain as much user data as you can by crawling a website of your choosing, and (2) develop a graphical model that "summarizes" the information you have collected, in the spirit of latent space models like Nallapati et al. (2008). Your model's output should (1) reveal interesting aspects of the social medium and (2) involve nontrivial probabilistic inference. Writing software to visualize your output will be a definite plus for this project.

Note that your data and graphical model must involve at least 2 modalities (e.g. text and network, or text and pictures, etc.), though we would really like to see 3 or more. In particular, your graphical model should be designed in a way that makes it easy to extend to new modalities.

Suggested Reading:
Ramesh Nallapati, Amr Ahmed, Eric Xing, and William Cohen. Joint Latent Topic Models for Text and Citations. KDD 2008.


15) Applications of Nonparametric Graphical Models

In many real-world domains, such as bioinformatics, computer vision, and fMRI analysis, variables are related by rich dependencies. However, in many cases these relationships are nonlinear and not easily expressed with Gaussian distributions, so standard inference/learning techniques may not be suitable, since most existing techniques only work for discrete or Gaussian graphical models. There have been several approaches to this problem:
(1) Gaussian copulas (i.e. Nonparanormal)
(2) Kernel Density Estimation for Forest Graphical Models
(3) Hilbert Space Embedding of Distributions (i.e. Kernel Graphical Models)
The goal of this project is to find a suitable application where the data exhibit both rich structure and significant non-Gaussianity, apply one or more of these methods to perform an interesting task, and possibly compare the approaches.
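As a concrete illustration of approach (1), the nonparanormal fits a Gaussian graphical model after a rank-based Gaussianizing transform of each variable. A simplified sketch assuming scipy and scikit-learn (the full estimator of Liu et al. truncates the empirical CDF):

import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLasso

def npn_transform(X):
    # Map each column to normal scores via its empirical ranks.
    U = np.apply_along_axis(rankdata, 0, X) / (X.shape[0] + 1)
    return norm.ppf(U)

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(200, 20)))                   # non-Gaussian data
prec = GraphicalLasso(alpha=0.1).fit(npn_transform(X)).precision_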

Suggested Reading:
A. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert Space Embedding of Distributions. 2007.
L. Song, A. P. Parikh, and E. P. Xing. Kernel Embeddings of Latent Tree Graphical Models. NIPS 2011.
H. Liu, J. Lafferty, and L. Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. JMLR 2009.
H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest Density Estimation. JMLR 12:907-951, 2011.


16) Stochastic blockmodels for message frequencies

Stochastic blockmodels are a class of models for network data that assume each node belongs to a latent cluster. Each pair of clusters is associated with an edge probability, and the link between two nodes is sampled according to the edge probability associated with their groups. Extensions include allowing membership of multiple groups (mixed membership stochastic blockmodels), allowing infinitely many groups, and introducing temporal variation.

Often, rather than observing binary links between nodes, we observe edges indirectly in the form of events -- emails, messages, packets of data, etc. The Poisson process is a standard model for event times. We propose adapting the stochastic blockmodel framework so that each pair of latent clusters, rather than being associated with a link probability, is associated with the rate of a Poisson process. Extensions could include time-varying or mixed membership variations.
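A purely illustrative generative sketch of the proposed model; the gamma prior on rates, the cluster count, and the fixed observation window are all our assumptions:

import numpy as np

rng = np.random.default_rng(0)
N, K, T = 30, 3, 10.0
z = rng.integers(0, K, size=N)                # latent cluster of each node
rates = rng.gamma(1.0, 1.0, size=(K, K))      # Poisson rate per cluster pair
counts = np.zeros((N, N), int)
for i in range(N):
    for j in range(N):
        if i != j:                            # messages sent from i to j
            counts[i, j] = rng.poisson(rates[z[i], z[j]] * T)

Inference then inverts this process: given the message counts (or timestamps), recover the cluster assignments and rates.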


17) Modeling music using the sequence memoizer

N-gram models allow us to predict future terms in a sequence based on patterns we have seen before. The sequence memoizer is a hierarchical model that provides a nonparametric version of the n-gram model, and it has been successfully employed in modelling and compressing text data.

In this project, we propose using the sequence memoizer to model music sequences, either for prediction or for music compression. You will obtain and preprocess appropriate music datasets, implement the sequence memoizer, and create a demo for predicting held-out segments of music.
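A fixed-order n-gram counting baseline is worth building first for comparison; here is a minimal sketch over symbolic note sequences (the event encoding is an assumption, and yours may differ):

from collections import Counter, defaultdict

def train_ngram(seq, n=3):
    # Map each (n-1)-symbol context to a Counter over next symbols.
    model = defaultdict(Counter)
    for i in range(len(seq) - n + 1):
        model[tuple(seq[i:i + n - 1])][seq[i + n - 1]] += 1
    return model

def predict(model, ctx):
    counts = model.get(tuple(ctx))
    return counts.most_common(1)[0][0] if counts else None

notes = ["C", "E", "G", "C", "E", "G", "A", "C", "E"]
m = train_ngram(notes, n=3)
print(predict(m, ["C", "E"]))   # most likely next note after C, E

The sequence memoizer effectively replaces these fixed-order counts with a hierarchy over all context lengths.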


18) Inferring haplotypes using dependent Dirichlet processes

An individual's genotype consists of two haplotypes, which can be thought of as samples from some mixture model. Since the total number of haplotypes in a population is unknown, the Dirichlet process has been used to model genotypes with an infinite mixture model. The hierarchical Dirichlet process extends such a model to allow different, but related, mixture models for different populations.

Such models do not account for the fact that we may have information about the similarity of populations -- for example, based on geographical proximity. The dependent Dirichlet process is a class of models that extends the Dirichlet process to model multiple distributions associated with times or locations, such that distributions that are close tend to be more similar. We propose using such a model to infer haplotypes in related populations.

This project will consist of reviewing a number of existing dependent Dirichlet process models, selecting a model appropriate for the task, implementing the chosen model, and evaluating it against existing models.
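To build intuition for the Dirichlet process component, here is a minimal Chinese restaurant process sampler (purely illustrative; in a real model the concentration parameter alpha would typically be inferred):

import numpy as np

def crp(n, alpha=1.0, seed=0):
    # Sample a partition of n items from a CRP(alpha) prior.
    rng = np.random.default_rng(seed)
    assign, counts = [0], [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)                  # open a new cluster
        else:
            counts[k] += 1
        assign.append(k)
    return assign

print(crp(20))   # e.g. cluster labels for 20 haplotype draws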


19) Supervised LDA (or MMSB)

There exist two versions of supervised LDA that try to find features (topic vectors) that discriminate between classes by incorporating response information. The first approach is based on a GLM that uses the empirical topic proportions as covariates; the second employs class-dependent transformation parameters.

The goal of the project is an empirical comparison of these two approaches in terms of the predictive power of the learned features. A related project is to adapt these methods to the MMSB setting.

Suggested Reading:
David M. Blei, Jon D. McAuliffe, Supervised Topic Models
Simon Lacoste-Julien, Fei Sha, Michael I. Jordan, DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification


20) Hybrid inference for MMSB

Hybrid variational/Gibbs-sampling methods have been proposed for inference in topic models such as LDA, showing the advantage of combining the merits of both approaches. In this project we propose to extend this idea to MMSB, another popular latent-variable model, this one for network interactions.

Suggested Reading:
Max Welling, Hybrid Variational/Gibbs Collapsed Inference in Topic Models
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels


21) Tightness of LP relaxation

There is a restricted class of discrete graphical models called 'attractive graphical models', which is widely used in computer vision. For this family of models, a polynomial-time algorithm exists for finding the exact MAP estimate. A loopy version of the max-product algorithm can also be applied, but whether it finds the exact answer is an open problem. Put differently, it is unknown whether the LP relaxation of such models is tight.

This project consists of implementing the max-product algorithm and investigating its theoretical properties for this restricted class of graphical models.
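Here is a minimal sketch of max-product on a chain-structured MRF, where the algorithm is exact; the loopy variant the project studies reuses the same message updates on graphs with cycles. Log-potentials are assumed:

import numpy as np

def chain_map(unary, pairwise):
    # unary: list of length-K arrays of log node potentials;
    # pairwise: (K, K) array of log edge potentials. Returns MAP states.
    T, K = len(unary), pairwise.shape[0]
    msg, back = np.zeros((T, K)), np.zeros((T, K), int)
    msg[0] = unary[0]
    for t in range(1, T):
        scores = msg[t - 1][:, None] + pairwise   # scores[s_prev, s]
        back[t] = scores.argmax(axis=0)
        msg[t] = scores.max(axis=0) + unary[t]
    states = np.empty(T, int)
    states[-1] = msg[-1].argmax()
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

For attractive models, you can then cross-check loopy max-product against the exact polynomial-time MAP solution mentioned above.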

Suggested Reading:
M. J. Wainwright, T. S. Jaakkola and A. S. Willsky (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory
Hiroshi Ishikawa, Exact optimization for Markov random fields with convex priors


22) Nonparametric sparse models

Nonparametric sparse models generalize parametric sparse models (such as the lasso) in that they do not assume any parametric form for the underlying statistical model, such as a linear dependence of the response on the features. The idea of this project is to develop a parallelized version of structured sparse additive models.

Suggested Reading:
J. Yin, X. Chen and E. P. Xing. Group sparse additive models.
Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models.
Markus Hegland, Ian McIntosh, and Berwin A. Turlach. A parallel solver for generalized additive models.


23) Scalable Deep Learning Algorithms

Deep belief nets are probabilistic generative models composed of multiple layers of non-linear operations. This area of machine learning research has recently shown great success in computer vision and speech recognition applications. The goal of this project is to explore scalable deep learning algorithms. A good project would implement the algorithms in the suggested reading below and test them on different datasets. More ambitiously, one may propose an improved algorithm.
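A minimal single-layer sketch using scikit-learn's BernoulliRBM, the building block of a deep belief net; the random input (scaled to [0, 1]) and hyperparameters are stand-ins:

import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.random.rand(500, 64)                   # stand-in for binary image patches
rbm = BernoulliRBM(n_components=100, learning_rate=0.05, batch_size=32, n_iter=20)
H = rbm.fit_transform(X)                      # hidden representation

Greedily stacking several such layers, each trained on the previous layer's hidden representation, yields a deep belief net; the scalability question is how to make that training fast on large datasets.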

Suggested Reading:
H. Lee, R. Grosse, R. Ranganath, A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. ICML 2009.
Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A.Y. Ng. On optimization methods for deep learning. ICML 2011.


© 2009 Eric Xing @ School of Computer Science, Carnegie Mellon University