Introduction to Machine Learning

10-701, Fall 2016
School of Computer Science
Carnegie Mellon University

Problem Sets

There will be 5 problem sets during the semester, in addition to a final project. Problem sets will consist of both theoretical and programming problems.


Overview: The class project is an opportunity for you to explore an interesting problem of your choice in the context of a real-world data set. You can either choose one of the suggested projects we provided, or pick your own topic. Do not hesitate to discuss your project with TAs/instructors to get feedback on your ideas.

Team: Projects can be done by a team of three. Feel free to post on Piazza if you need teammates.

Milestones: There are 4 deliverables in total:

  • Proposal: A short description of the proposed project (about a page), including a title and the Andrew IDs of the team members. Due on Oct. 3, 2016.
  • Midway report: A more detailed introduction, a review of related work, details of the proposed method, and preliminary results if available, in 4-5 pages. Due on Nov. 7, 2016.
  • Poster session: Present your work to your peers, instructors, and other community members who stop by. Scheduled for Dec. 2, 2016. (SCS poster printing)
  • Final report: A full academic paper, including: problem definition and motivation, background and related work, details of the proposed method, details of experiments and results, conclusion and future work. 8 pages excluding references and appendix. Due on Dec. 9, 2016.

Poster Sessions: There will be 2 poster sessions:

  • Session 1: Friday, December 2nd, GHC 7th Floor Atrium, 8 - 11:30 AM
  • Session 2: Friday, December 2nd, NSH 3305, 2 - 6 PM

All reports should be in NIPS format.

Project Ideas

Predicting Excitement at DonorsChoose.org [KDD Cup 2014 Task]

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, the organization ships the materials to the school. The 2014 KDD Cup asks participants to help identify projects that are exceptionally exciting to the business at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, the organization can improve funding outcomes, better the user experience, and help more students receive the materials they need to learn. Successful predictions may require a broad range of analytical skills, from natural language processing on the need statements to data mining and classical supervised learning on the descriptive factors around each project.

Contact Person: Hemank





Predicting Crime in Pittsburgh

Crime forecasting has been an active area of machine learning research. The problem, besides being highly relevant, is also one of the most challenging. Crime prediction requires forecasting in space and time (where and when?). Criminological research has shown that crime can spread through local environments via a contagion-like process. For example, burglars will repeatedly attack clusters of nearby targets because local vulnerabilities are well known to the offenders. A gang shooting may incite waves of retaliatory violence in the local set space (territory) of the rival gang. The local, contagious spread of crime leads to the formation of crime clusters in space and time. It has also been shown that crime follows a periodic pattern; in Chicago, for instance, crime increases sharply during the summer and drops in the winter. Many interesting methods have been applied to this problem. The goal of this project is to build on the extensive literature and apply a novel method to crime prediction in Pittsburgh. Novel methods could involve applying Gaussian processes or modelling crime as a Hawkes process.
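To make the contagion idea concrete, here is a minimal sketch of the conditional intensity of a purely temporal Hawkes process with an exponential kernel. It is only an illustration: the project itself is spatio-temporal, and the function name, the event times, and the parameter values (mu, alpha, beta) are hypothetical choices, not values taken from the references.

```python
import math

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity of a temporal Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    mu    : background rate (events per unit time)
    alpha : jump in intensity caused by each past event
    beta  : decay rate of that excitation
    All three defaults are illustrative assumptions.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + excitation

# Hypothetical event times (in days) for a single neighborhood.
events = [1.0, 1.5, 4.0]
for t in (0.5, 1.6, 3.0, 4.1):
    print(f"t={t:.1f}  intensity={hawkes_intensity(t, events):.3f}")
```

The intensity sits at the background rate before any event, jumps right after each event, and decays back toward the background rate, which is one simple way to express the clustering of crimes in time.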


Contact Person: Hemank


[1] Pittsburgh Crime Dataset -

[2] Flaxman, Seth R. A general approach to prediction and forecasting crime rates with gaussian processes. In CMU Data Analysis Paper, 2014.

[3] G. O. Mohler, M. B. Short, P. J. Brantingham, F. P. Schoenberg, and G. E. Tita. Self-exciting point process modeling of crime. In Journal of the American Statistical Association, Volume 106, No. 493, 2011.

[4] G. O. Mohler, M. B. Short, Sean Malinowski, Mark Johnson, G. E. Tita, Andrea L. Bertozzi, and P. J. Brantingham. Randomized controlled field trials of predictive policing, 2015.

[5] Gorr, W. and Harries, R. Introduction to crime forecasting. In International Journal of Forecasting, 2003.

[6] J. Cohen, W. Gorr, and Olligschlaeger, Andreas M. Leading indicators and spatial interactions: A crime-forecasting model for proactive police deployment. In Geographical Analysis, Volume 39, Issue 1, 2007.

[7] L. Anselin, J. Cohen, D. Cook, W. Gorr, and Tita, G. Spatial analyses of crime. In Measurement and Analysis of Crime and Justice, 2000.

[8] S. Aldor-Noiman, L.D. Brown, E.B. Fox, and Stine, R.A. Spatio-temporal low count processes with application to violent crime events. In arXiv, Statistics - Applications, 1304.5642, 2013.

[9] Taddy, M. A. Autoregressive mixture models for dynamic spatial Poisson processes: Application to tracking intensity of violent crime. In Journal of the American Statistical Association 105 (492), 2010.

Computational Education


Project1: Multi-view Information Extraction from Textbooks. This project is about targeted information extraction from textbooks: given a set of textbooks, we may want to extract structured knowledge such as all the math theorems and axioms they contain. Theorems are often accompanied by images that help explain them. You will have to use the context, typographical information, and other cues that can help you extract such information. The extracted knowledge can then be used for downstream applications such as summarizing the textbook or answering questions.


Project2: Recognizing difficult-to-comprehend portions of textbooks and fixing them. This project is about building a model of how hard it is for students to understand portions of textbooks. This might depend on many factors; your job is to identify these factors, annotate a textbook for comprehension difficulty (you may crowd-source this task), and then build a model. You can extend this project by mining the web for images that can help students understand the text better.


Contact Person: Devendra Chaplot


[1] R. Agrawal, S. Chakraborty, S. Gollapudi, A. Kannan, K. Kenthapadi: Empowering Authors to Diagnose Comprehension Burden in Textbooks. KDD 2012.

[2] R. Agrawal, S. Gollapudi, A. Kannan, K. Kenthapadi: Enriching Textbooks with Images. CIKM 2011.

[3] R. Agrawal, S. Gollapudi, A. Kannan, K. Kenthapadi: Identifying Enrichment Candidates in Textbooks. WWW 2011.


Approximate Gaussian Process (GP) regression


GPs are flexible non-parametric models that have applications in robotics, reinforcement learning and optimization. However, the application of GPs is limited by their cubic complexity in both the learning (optimization) and inference phases. To scale GPs, inducing point methods have been proposed [1]. These sparse approximation methods scale GPs by using a subset of the original dataset (either implicitly or explicitly). An alternative approach to scaling GPs was proposed recently by Deisenroth et al. [2]. They proposed a distributed GP (DGP) framework that exploits current distributed and parallel computer systems by appropriately approximating the marginal likelihood function (for the learning phase) and by suitable weighting in the inference phase. The main idea is to divide the dataset into several chunks and use an exact GP separately for each chunk. In this project, the first task is to combine sparse GP approximation with the distributed GP approach by replacing each individual exact GP with a sparse GP. The performance of this method should be evaluated empirically on a few standard datasets and compared with the original DGP approach. The next task is to replace the sparse approximation on each chunk with structured kernel interpolation (SKI, a recent approach by Wilson et al. [3]), which reduces the time complexity of GPs from cubic to linear. Note that this is more of a systems- and application-oriented project; interested students can extend it to a GPU-based implementation.
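As a rough illustration of the "weighting in the inference phase" step, the sketch below fuses the Gaussian predictions of several per-chunk GP experts at a single test input using a plain product-of-experts rule (precision-weighted averaging). This is an assumption made for illustration: it is the simplest such rule, not necessarily the weighting recommended in [2], and the function name and numbers are hypothetical.

```python
def combine_experts(means, variances):
    """Product-of-experts fusion of Gaussian predictions at one test input.

    Each expert i reports a predictive mean m_i and variance v_i.
    The fused precision is the sum of the expert precisions (1 / v_i),
    and the fused mean is the precision-weighted average of the means.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one (mean, variance) pair per expert")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return fused_mean, 1.0 / total_precision

# Hypothetical predictions from three GP experts, one per data chunk.
mean, var = combine_experts([1.0, 1.2, 0.9], [0.5, 0.25, 1.0])
print(mean, var)
```

Experts that are more certain (smaller variance) pull the fused mean toward their own prediction, and the fused variance is never larger than that of the most confident expert.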


Contact Person: Siddharth Goyal


[1] Quiñonero-Candela, Joaquin, and Carl Edward Rasmussen. "A unifying view of sparse approximate Gaussian process regression." Journal of Machine Learning Research 6, no. Dec (2005).

[2] Deisenroth, Marc Peter, and Jun Wei Ng. "Distributed gaussian processes." In International Conference on Machine Learning (ICML), vol. 2, 2015.

[3] Wilson, Andrew Gordon, and Hannes Nickisch. "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)." arXiv preprint arXiv:1503.01057 (2015).

Bayesian optimization using a recent active learning approach

Bayesian optimization (BO) deals with the task of finding the maximum of an expensive black box function [1]. The unknown function is usually modeled as a GP. Subspace identification Bayesian optimization (SI-BO) is a recent approach that employs low-rank matrix recovery techniques for BO on a reasonable number of input dimensions [2]. In a recent work, Garnett et al [3] have proposed an active learning based method for finding low-dimensional structure in a high dimensional setting where a GP is employed. The lower dimensional subspace is treated as a set of hyperparameters and is obtained by active learning. In this project, the main task is to replace the subspace learning method of SI-BO with the active learning method in [3]. The implementation should be followed by empirical evaluation on standard BO benchmarks.
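For readers new to BO, the sketch below shows how a GP posterior at a candidate point is typically turned into a score for choosing the next evaluation. It uses expected improvement (EI) for maximization, which is one common acquisition function and an assumption here; the project description does not prescribe a particular acquisition function, and the function name is hypothetical.

```python
import math

def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement of a candidate point when maximizing.

    mu, sigma   : GP posterior mean and standard deviation at the candidate
    best_so_far : largest function value observed so far
    EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma.
    """
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_so_far) * cdf + sigma * pdf

# Two hypothetical candidates with the same mean but different uncertainty.
print(expected_improvement(0.0, 1.0, 0.0))
print(expected_improvement(0.0, 2.0, 0.0))
```

With equal posterior means, the more uncertain candidate scores higher, which is how the acquisition function trades off exploration against exploitation.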


Contact Person: Siddharth Goyal


[1] Brochu, Eric, Vlad M. Cora, and Nando De Freitas. "A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning." arXiv preprint arXiv:1012.2599 (2010).

[2] Djolonga, Josip, Andreas Krause, and Volkan Cevher. "High-dimensional gaussian process bandits." In Advances in Neural Information Processing Systems, pp. 1025-1033. 2013.

[3] Garnett, Roman, Michael A. Osborne, and Philipp Hennig. "Active learning of linear embeddings for Gaussian processes." arXiv preprint arXiv:1310.6740 (2013).


Breast Cancer Recurrence Prediction

Background: According to the World Health Organization, breast cancer is one of the leading causes of death in developed countries. There is a large amount of data that shows how a patient was monitored, diagnosed, and treated for this disease.  


Baseline Task: Use supervised and semi-supervised classification to determine the likelihood of breast cancer recurrence. Compare the accuracy of different supervised and semi-supervised methods and determine if training on unlabeled data could improve classification accuracy of breast cancer recurrence.

Dataset: Breast Cancer Wisconsin Data Sets (Prognostic and Diagnostic)

Further Work:

Suggestion 1: Determine if time to recurrence can be predicted in patients who indicate recurrence.

Suggestion 2: Student choice

Contact Person: Brynn Edmunds


[1] Diana Dumitru. "Prediction of recurrent events in breast cancer using the Naive Bayesian classification" Annals of University of Craiova (2009), Vol 36(2).

Climate analysis

In this work, you will be working on a climate dataset available at: This dataset includes time series of climate data (precipitation, temperature, dew point, etc.) for various locations in the US. Here are some possible topics you could try on this dataset: 1) Weather prediction: Apply several machine learning algorithms to learn the patterns in the dataset, and use the learned patterns to predict future weather. 2) Anomaly detection: Are there any anomalous events in the past data? Can you detect and explain possible causes of the anomalies? 3) Regional correlation: Explore the dataset to see if there is any correlation between regions. Perform multi-resolution analysis.
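As a starting point for the anomaly detection topic, the following sketch flags values in a single time series that lie far from the series mean, measured in standard deviations. It is a crude baseline under stated assumptions: the readings and the threshold are made up, and a real analysis would need to account for seasonality and trends before applying anything like this.

```python
import statistics

def zscore_anomalies(series, threshold=2.0):
    """Return the indices whose values deviate from the series mean
    by more than `threshold` population standard deviations."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0.0:
        return []  # a constant series has no outliers under this rule
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]

# Hypothetical daily temperature readings with one suspicious spike.
temps = [20.1, 19.8, 20.3, 20.0, 35.0, 19.9, 20.2, 20.1]
print(zscore_anomalies(temps))
```

Note that in a short series a single extreme value inflates the standard deviation, so the threshold has to be chosen with the series length in mind.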

Contact Person: Hyun-Ah Song

[1] Widmann, Martin, and Christopher S. Bretherton. "Validation of mesoscale precipitation in the NCEP reanalysis using a new gridcell dataset for the northwestern United States." Journal of Climate 13.11 (2000): 1936-1950.


Physiological data analysis

In this work, you will explore a physiological dataset available at: This dataset is from the MIT Affective Computing Group. It contains physiological data from four sensors - electromyogram (EMG), blood volume pressure, skin conductance, and respiration - measured at eight emotional states for a single subject. Here are some possible topics you could try for the project: 1) Feature construction: can you come up with creative/effective features for better classification of the emotions? 2) Classification: can you apply well-known classifiers, or design a classifier of your own, that classifies better than the method proposed in the reference paper (81% accuracy)?

Contact Person: Hyun-Ah Song

[1] Healey, Jennifer. "Wearable and automotive systems for the recognition of affect from physiology." Unpublished doctoral dissertation, Massachusetts Institute of Technology (2000).

[2] Picard, Rosalind W., Elias Vyzas, and Jennifer Healey. "Toward machine emotional intelligence: Analysis of affective physiological state." IEEE transactions on pattern analysis and machine intelligence 23.10 (2001): 1175-1191.

[3] Vyzas, Elias. Recognition of emotional and cognitive states using physiological data. Diss. Massachusetts Institute of Technology, 1999.


Multi-task and transfer learning (ML with small datasets)

In a lot of real-world situations there are datasets that are too small for traditional ML techniques to be applied without overfitting. Transfer learning is a subfield of machine learning that investigates techniques that try to make use of related datasets in order to increase the predictive power. Transfer learning can be roughly divided into three subtypes: 1) multitask learning, 2) domain adaptation, 3) transfer deep learning.


Project1: Multitask Learning: Multitask learning [1][2] is supervised transfer learning in which many tasks are considered together and trained jointly, in the hope that they can transfer knowledge to each other and achieve better generalization. The covariates in each task are assumed to have the same distribution, the labels have different distributions, and all data comes from the same domain. These assumptions are restrictive, but they apply to many scenarios, and this area is heavily studied both in multi-linear learning [3] and online learning [4]. A lot of research in this subfield revolves around improving optimization techniques and avoiding negative transfer - a situation where sharing information between tasks is counter-productive. There has also been some theoretical analysis, as well as approaches using sparse coding [11]. One possible direction would be to build on the most recent work in either area, or to try to combine the two.


Project2: Domain Adaptation: Domain adaptation [5][6] usually assumes two tasks: a source task and a target task. In this scenario, the domains, the marginal distributions, and the conditional distributions of both the covariates and the labels can change depending on the assumptions. Furthermore, in some cases very few or no labeled data points are available in the target task. This field usually relies on kernel methods to reweight the dataset so that the joint distributions of the source and target tasks are matched as closely as possible. Some deep learning techniques have also been developed in this area very recently [7]. One could build on the theory of matching distributions using kernel methods, or try to develop better deep learning architectures. Both directions are currently very active.

Contact Person: Petar


[1] Evgeniou, Theodoros, and Massimiliano Pontil. "Regularized multi--task learning." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.

[2] Kumar, Abhishek, and Hal Daume III. "Learning task grouping and overlap in multi-task learning." arXiv preprint arXiv:1206.6417 (2012).

[3] Romera-Paredes, Bernardino, et al. "Multilinear multitask learning." Proceedings of the 30th International Conference on Machine Learning. 2013.

[4] Keerthiram Murugesan* et al, “Adaptive Smoothed Online Multi-Task Learning”. NIPS preprint 2016

[5] Gong, Mingming, et al. "Domain adaptation with conditional transferable components." Proceedings of The 33rd International Conference on Machine Learning. 2016.

[6] Zhang, Kun, et al. "Domain Adaptation under Target and Conditional Shift." ICML (3). 2013.

[7] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation networks." CoRR, abs/1502.02791 1 (2015): 2.


From Fall 2015 Edition

Image Question Answering

This project is about free-form and open-ended Visual Question Answering (VQA) [1,2]. Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. There are two versions of this task: an answer-selection version, where candidate answers are given and the task is to pick the correct one, and a generation version, where candidate answers are not given and the answer has to be generated by the algorithm. Various previous works have attempted to solve this problem, but with only limited success. Your job is to do better at this task. One idea is to classify questions into categories and build a model for each category, or, better, to learn a multi-task model. You can also consider various deep learning methods using CNNs and LSTMs here.


[2] Antol et al. Visual Question Answering. ArXiv 2015.

[3] Gao et al. Are You Talking to a Machine?: Dataset and Methods for Multilingual Image Question Answering. NIPS 2015

[4] Ren et al. Exploring Models and Data for Image Question Answering. NIPS 2015

[5] Ferraro et al. A Survey of Current Datasets for Vision and Language Research. ArXiv 2015

Event Structure Learning

Scripts have been proposed to model the stereotypical event sequences found in narratives. Scripts encode knowledge of stereotypical events, including information about their typical ordered sequences of sub-events and corresponding arguments (temporal, causal, subevents, etc.) [1]. The existence of such structures is based on the assumption that natural language documents are written with a model representation in mind that describes specific courses of action performed by individuals in real-world scenarios. The goal of this project is to capture the semantics of the event scripts that are encoded in documents (such as a terrorist attack, or something like the event structure of chopping an onion).

There is a small body of preliminary research on automatically learning models of scripts from large corpora of raw text [2-7]. However, all these works use an impoverished representation of events. While they learn interesting event structure, these works make many assumptions - e.g. structures are restricted to be chains, structures are limited to frequent topics in a large corpus or redundant documents about specific events are required, sometimes the relations are binary, and often only slots with named entities are learned.

In this work, you could explore supervised (or, better, semi-supervised or unsupervised) learning approaches for discovering events as well as the temporal relations involving events (and possibly time expressions). Alternatively, you can look at this as a structure learning problem and use techniques similar to those we learned for graphical model structure learning.


[1] R. Schank and R. Abelson, Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum and Associates, Hillsdale, NJ, 1977.

[2] N. Balasubramanian, S. Soderland, Mausam, and O. Etzioni. Generating coherent event schemas at scale. EMNLP 2013.

[3] C. A. Bejan. 2008. Unsupervised discovery of event scenarios from texts. FLAIRS 2008.

[4] N. Chambers and Daniel Jurafsky. Unsupervised learning of narrative event chains. ACL 2008.

[5] N. Chambers and D. Jurafsky. Unsupervised learning of narrative schemas and their participants. ACL-IJCNLP 2009.

[6] N. Chambers. 2013. Event schema induction with a probabilistic entity-driven model. EMNLP 2013.

[7] J. Cheung, H. Poon, and L. Vanderwende. Probabilistic frame induction. NAACL 2013.

Neural Networks for Multi-view Learning across Images and Text

The problem of image/scene understanding is an important and challenging one. Images are often accompanied by text that describes them, which is important in image search. Many multi-view learning approaches have been proposed that extract features for both sentences and images and map them to the same semantic embedding space. These methods are used to address multiple tasks, such as retrieving sentences given a query image, retrieving images given query sentences, and generating captions that describe image scenes.

Problem 1: The first proposed problem is to link objects in the images to appropriate mentions in the captions. We reason about which particular object each noun/pronoun in the captions is referring to in the image. This could potentially allow us to jointly model the textual and visual information to disambiguate the coreference resolution problem within and across images and texts. Towards this goal, one could explore deep-learning or structure prediction models that exploit features computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects.

[1] C. Kong et al. What are you talking about? Text-to-Image Coreference. CVPR 2014.


Problem 2: The second proposed problem is to recognize what appears in images while incorporating knowledge of spatial relationships and interactions between objects, along with some background knowledge (knowledge of how the world works - e.g. books are usually placed on a table, not under it). Another challenge here is generating a description that is not only relevant but also grammatically correct, which requires a model for language. In this project, one could explore integrating recursive deep learning methods for image understanding either with existing language models or with other neural networks that learn a language model.


[1] H. Fang et al. From Captions to Visual Concepts and Back. ArXiv

[2] R. Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. ArXiv.

[3] A. Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. ArXiv


[5] J. Donahue et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. ArXiv

[6] J. Mao et al. Explain Images with Multimodal Recurrent Neural Networks



Unsupervised Methods for Joint Entity and Event Coreference Resolution

Coreference resolution in text is the process of determining when two mentions (named, nominal or pronominal entity mentions, event mentions, etc.) refer to the same identity in the real world. Coreference is a fundamental problem in NLP: it is an important step in achieving a deeper understanding of the text and is potentially useful for many downstream applications such as paraphrase detection, textual entailment, summarization, and question answering. Various structured prediction approaches and non-parametric Bayesian approaches have been proposed for entity coreference resolution. However, there is a well-known duality between entities and events. We could benefit from jointly modeling entity and event coreference, using the fact that coreferentiality among events implies coreferentiality among their participant entities. The project will involve building a structured prediction model that can jointly reason over entity and event coreference structure. There is a large body of work on coreference resolution, but you could look at these example previous works [1-4] to understand the task and the literature.

[1] C. Bejan et al. Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution. NIPS 2009.

[2] A. Haghighi and D. Klein. Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. ACL 2009.

[3] G. Durrett and D. Klein. A Joint Model for Entity Analysis: Coreference, Typing, and Linking. TACL 2014.

[4] H. Lee et al. Joint Entity and Event Coreference Resolution across Documents. EMNLP 2012.

Bayesian Learning for Neural Networks

The neural networks that are popular nowadays have a close relationship with graphical models. Instead of black-box back-propagation, can we use Bayesian methods in neural networks? Can we make them more scalable?

[1] David Mackay's papers:

[2] Radford Neal's thesis:

[3] Nando de Freitas's thesis:

Dropout Training for Graphical Models

Dropout training has been proposed to remedy the overfitting problem in deep neural networks. Some recent works discussed the interpretation of this method as adaptive regularization or augmenting noisy training data. Recalling the close relationship between neural networks and graphical models, can we apply the same technique to graphical models?

[1] Srivastava et al. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.

[2] van der Maaten et al. Learning with marginalized corrupted features. ICML, 2013.

[3] Chen et al. Dropout training for support vector machines. AAAI, 2014.

Fast sampling for mixture models

MCMC algorithms can be made fast by borrowing ideas from traditional computer science. Can you build constant time samplers for mixture models? Can you make the algorithm run in an online fashion?
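One classical building block for constant-time sampling is the alias method [2]. The sketch below builds an alias table for a fixed discrete distribution in O(n) and then draws each sample in O(1). It is only a sketch under assumptions: the function names are hypothetical, the table construction follows a common stack-based variant of the method, and the distribution is assumed not to change between draws (which is exactly what a mixture-model sampler would have to work around).

```python
import random

def build_alias_table(probs):
    """Build (prob, alias) tables for a discrete distribution that sums to 1."""
    n = len(probs)
    scaled = [p * n for p in probs]   # rescale so the average bucket mass is 1
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]           # bucket s keeps its own mass ...
        alias[s] = l                  # ... and is topped up by outcome l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:           # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, rng=random):
    """Draw one outcome in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.5, 0.3, 0.2])
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_alias(prob, alias, rng)] += 1
print(counts)
```

In a Gibbs sampler for a mixture model the conditional distribution changes as assignments change, so one open design question for the project is how to reuse or cheaply update such tables.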

[1] Luc Devroye. Non-uniform random variate generation. Springer-Verlag, 1986.

[2] Alastair J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software, vol. 3 (1977), pp. 253-256.

[3] Peter M Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, vol. 24, no. 3 (1994), pp. 327-336.

Dirichlet Process Distance Metric Learning

Distance Metric Learning (DML) [1] takes data pairs labeled either as similar or dissimilar and learns a Mahalanobis distance matrix M such that, under M, similar pairs are placed close to each other and dissimilar pairs are pushed apart. The learned distance metrics are essential for many tasks such as retrieval, clustering and classification. In real-world problems, because the data are inherently organized into an unknown number of groups, a single Mahalanobis matrix is insufficient to properly measure distances for data from all groups. In this project, we are going to study the problem of infinite distance metric learning, which aims to learn an unbounded number of Mahalanobis distance matrices, where each matrix is responsible for measuring the distance of data in one specific group. Using Bayesian nonparametric techniques, the number of distance matrices can be decided automatically from data, rather than set in an ad-hoc way. To achieve this, we are going to place a Dirichlet Process [2] prior over the Mahalanobis distance matrices. The inference and learning technique could be variational inference [3] or MCMC sampling [2].
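For reference, the Mahalanobis distance under a matrix M is d_M(x, y) = sqrt((x - y)^T M (x - y)). The sketch below computes it and shows the "one matrix per group" idea with a lookup table. The matrices and group labels are hypothetical and fixed by hand here; in the project they would be learned, with the group structure inferred under the Dirichlet Process prior.

```python
import math

def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    M_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    return math.sqrt(sum(d_i * md_i for d_i, md_i in zip(diff, M_diff)))

# Hypothetical per-group metrics: group "A" uses the Euclidean metric,
# group "B" stretches the first coordinate.
group_metrics = {
    "A": [[1.0, 0.0], [0.0, 1.0]],
    "B": [[4.0, 0.0], [0.0, 1.0]],
}

def group_distance(x, y, group):
    """Measure distance with the matrix assigned to the given group."""
    return mahalanobis(x, y, group_metrics[group])

print(group_distance([0.0, 0.0], [3.0, 4.0], "A"))
print(group_distance([0.0, 0.0], [3.0, 4.0], "B"))
```

The same pair of points is closer or farther depending on which group's matrix is used, which is why a single global matrix can be too restrictive.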

[1] Xing, E. P., Jordan, M. I., Russell, S., and Ng, A. Y. (2002). Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems (pp. 505-512).

[2] Yee Whye Teh. Dirichlet Process.

[3] Blei, D. M., and Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian analysis, 1(1), 121-143.

Indian Buffet Process Distance Metric Learning

In the previous problem, we considered learning an infinite number of distance matrices to accommodate the complexity of data, where each matrix is of finite dimension. In this problem, we will study infinite distance metric learning from another perspective: we learn a single distance matrix, but the matrix is of infinite dimension. Interpreted from a latent space modeling view, DML aims to learn a linear projection matrix that projects the data from the original feature space to a latent space. After projection into the latent space, data labeled as similar are placed close to each other and those labeled as dissimilar are pushed apart. The choice of the dimension of the latent space has a critical influence on performance, and setting it to a fixed value limits the power of the distance metric. In this project, we study the problem of learning a distance matrix with unbounded dimension. The dimensionality of the latent space grows with the data and is automatically inferred from data. To do this, we place an Indian Buffet Process [1] prior over the distance matrix to enable an infinite dimensionality.

[1] Griffiths, T., & Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process.

Feature Enriched Collective Matrix Factorization

Collective Matrix Factorization (CMF) [1] aims to model the inter-relations between multiple parties of data. For example, in a biology domain with genes, diseases, proteins, there are rich relations between these data: genes decide proteins, proteins decide diseases, genes interact with each other, etc. CMF can flexibly model these relations. However, it is unable to model the features associated with data, such as the chromatin features of genes, the types of diseases, etc. In this project, we are going to develop a feature enriched collective matrix factorization model to simultaneously model the features of data and the relations between data.

[1] Singh, A. P., & Gordon, G. J. (2008, August). Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 650-658). ACM.

Heterogeneous network embedding

Representation learning has been a hot topic in machine learning research. The learned representations (a.k.a. embeddings) of words [1] and images are useful in various NLP/CV tasks. A few recent works [2] extended the algorithms to network data and significantly improved a variety of tasks. However, these works have mainly focused on homogeneous networks, which contain only one or two types of vertices (e.g., persons in a friend network), while in the real world heterogeneous networks (which allow more than one vertex type) are ubiquitous. For example, a social network can contain users, posts, interest groups, and so on. A general representation learning framework that takes these diverse features into account is desirable in order to learn better network embeddings and to facilitate a wide range of applications such as recommender systems. In this project, your job is to develop such a framework. One idea is to extend the popular skip-gram [1] algorithm from the NLP literature.
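To show what skip-gram consumes, the sketch below turns a sequence of tokens into (center, context) training pairs within a fixed window. Treating a walk over the network as a "sentence" of typed vertices is an assumption made here for illustration, one possible way to extend skip-gram to networks rather than the method the project requires; the vertex names and function name are hypothetical.

```python
def skipgram_pairs(sequence, window=2):
    """Return all (center, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

# A hypothetical walk over a heterogeneous social network; the prefix
# before the colon records the vertex type.
walk = ["user:alice", "post:42", "group:ml", "user:bob"]
for center, context in skipgram_pairs(walk, window=1):
    print(center, "->", context)
```

A heterogeneous extension would have to decide, for example, whether pairs that cross vertex types should be weighted or modeled differently from same-type pairs.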

[1] T. Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS13

[2] J. Tang et al. LINE: Large-scale Information Network Embedding. WWW15

Personalized topic models

In an increasing number of real-world applications, such as recommender systems for news or scientific articles, we want to estimate (probabilistic) models for each user. For example, to create the best user experience in online applications, we want to build a personalized topic model [1] for each user, and there could be millions of such users. Each user has a subset of the entire dataset, e.g., she/he has only accessed a subset of all the news articles. This problem differs from previous work in two ways: 1) compared to traditional hierarchical models, here users' datasets are usually not disjoint (i.e., the overlapping-user setting); 2) compared to traditional personalized methods, which train a topic model for each user separately and thus suffer from huge computational complexity and difficulty in topic alignment, we want to share statistical strength across different users. In this project, your job is to develop such a model (we have a basic model that you can improve on) and apply it to real applications such as personalized recommender systems.

[1] D. Blei et al. Latent Dirichlet Allocation. JMLR03

Large-scale Distributed Convolutional Neural Network

Large deep neural network models have recently demonstrated state-of-the-art accuracy on hard visual recognition tasks. Unfortunately, such models are extremely time-consuming to train and require a large number of compute cycles. Complex tasks require deep models with a large number of parameters that have to be trained. Such large models require a significant amount of data for successful training; otherwise they over-fit the training data, which leads to poor generalization performance on unseen test data. Increasing model size and training data, which is necessary for good prediction accuracy on complex tasks, requires a significant number of compute cycles, proportional to the product of model size and training data volume. Due to the computational requirements of deep learning, almost all deep models are trained on GPUs. While this works well when the model fits within 2-4 GPU cards attached to a single server, it limits the size of models that can be trained. A possible solution for training extremely large models on real-world big data is to build a large-scale distributed system composed of commodity servers. In this project, you are expected to come up with potential solutions for data parallelism and model parallelism when training large-scale convolutional neural networks in a distributed setting (e.g., GPU/CPU clusters).

[1] Petuum.

[2] Petuum: A New Platform for Distributed Machine Learning on Big Data. KDD 2015

[3] On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. NIPS 2014

[4] More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. NIPS 2014

[5] ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012
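
To make the data-parallelism half of the project concrete, here is a minimal single-process sketch of synchronous data-parallel gradient descent. It uses a tiny linear model in place of a CNN, and the "workers" are simulated by a plain loop; the function names and the equal-shard assumption are illustrative choices, not part of any particular framework. Model parallelism, where the parameters themselves are split across machines, is not shown.

```python
def shard(data, num_workers):
    """Split `data` into `num_workers` contiguous shards of equal size."""
    if len(data) % num_workers != 0:
        raise ValueError("this sketch assumes equally sized shards")
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def gradient(weights, batch):
    """Gradient of the mean squared error of a linear model over one batch.

    Each example is (features, target); the prediction is dot(weights, features).
    """
    grad = [0.0] * len(weights)
    for features, target in batch:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for k, x in enumerate(features):
            grad[k] += 2.0 * error * x / len(batch)
    return grad

def average(grads):
    """Element-wise average of the per-worker gradients (the aggregation step)."""
    return [sum(column) / len(grads) for column in zip(*grads)]

def data_parallel_step(weights, data, num_workers, lr):
    """One synchronous step: every simulated worker sees the same weights,
    computes a gradient on its own shard, and the averaged gradient is applied."""
    worker_grads = [gradient(weights, s) for s in shard(data, num_workers)]
    avg = average(worker_grads)
    return [w - lr * g for w, g in zip(weights, avg)]
```

With equally sized shards, the averaged per-worker gradient equals the full-batch gradient, so this synchronous scheme follows the same trajectory as single-machine training. In a real cluster the aggregation step is where communication cost and staleness trade-offs arise.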

Unsupervised Learning of Visual Representation from Videos

Understanding temporal sequences is important for solving many problems in the AI-set. Videos, a typical kind of temporal sequence, are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in, showing us examples of what constitutes objects, how objects move against backgrounds, what happens when cameras move, and how things get occluded. Being able to learn a representation that disentangles these factors would help in making intelligent machines that can understand and act in their environment. Additionally, learning good video representations is essential for a number of useful tasks, such as recognizing actions and gestures. Supervised learning has been extremely successful in learning good visual representations that not only produce good results on the task they are trained for, but also transfer well to other tasks and datasets. Therefore, it is natural to extend the same approach to learning video representations. However, videos are much higher-dimensional entities than single images. It therefore becomes increasingly difficult to do credit assignment and learn long-range structure, unless we collect much more labelled data or do a lot of feature engineering (for example, computing the right kinds of flow features) to keep the dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for using unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities), which makes them particularly well suited as a domain for building unsupervised learning models.
In this project, we expect you to explore possible machine learning solutions (e.g., CNNs, sparse coding) for unsupervised learning on video sequences and to evaluate the learned visual representations on different computer vision tasks.

[1] Unsupervised Learning of Video Representations using LSTMs. ICML 2015

[2] Unsupervised Visual Representation Learning by Context Prediction. ICCV 2015

[3] Sparse Output Coding for Scalable Visual Recognition. IJCV 2015


Semantic Segmentation for Images

Semantic segmentation assigns one of a set of pre-defined class labels to each pixel of an image. The input image is divided into regions that correspond to the objects or stuff in the scene. To perform semantic segmentation of an image is to infer the semantic label of every pixel. With simple semantic labels, every pixel in the image is explained, each one treated as generated by some unknown model for its category label. If such a segmentation can be achieved, then the image can be catalogued for image search, used for navigation, or used for any number of other tasks that require basic semantic understanding of arbitrary scenes. A wide range of machine learning techniques, including convolutional neural networks, graphical models, and spectral methods, have been extensively employed for this task. In this project, you need to investigate existing methods/models, evaluation metrics, and public datasets for supervised semantic segmentation, then propose your own solution for image semantic segmentation and evaluate it on standard datasets.

[1] Fully Convolutional Networks for Semantic Segmentation. CVPR 2015

[2] Semantic Segmentation using Regions and Parts. CVPR 2012

[3] Recurrent Convolutional Neural Networks for Scene Labeling. ICML 2014
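
Since the project asks you to look into evaluation metrics, the sketch below computes two metrics commonly reported for this task, pixel accuracy and mean intersection-over-union (IoU), from a per-pixel confusion matrix. Label maps are represented as plain 2-D lists of integer class ids, and the function names are illustrative assumptions; standard benchmarks may define details such as ignored labels differently, so check the protocol of the dataset you use.

```python
def confusion_matrix(truth, pred, num_classes):
    """Count pixels: entry [t][p] is the number of pixels whose true label is t
    and whose predicted label is p. Inputs are 2-D lists of integer labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            matrix[t][p] += 1
    return matrix

def pixel_accuracy(matrix):
    """Fraction of pixels whose predicted label matches the true label."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    return correct / total

def mean_iou(matrix):
    """Average intersection-over-union over the classes that appear in either
    the ground truth or the prediction."""
    ious = []
    for c in range(len(matrix)):
        intersection = matrix[c][c]
        truth_count = sum(matrix[c])
        pred_count = sum(row[c] for row in matrix)
        union = truth_count + pred_count - intersection
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Toy 2x3 label maps with two classes (made up for illustration).
truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]
matrix = confusion_matrix(truth, pred, num_classes=2)
```

Mean IoU penalizes both missed pixels and false positives for every class, so it is less dominated by large background regions than pixel accuracy.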

Cool Additional Datasets [copied from Barnabas’ 10-715 website]

  1. Yahoo webscope datasets. There are plenty of them free for download. However, you need to sign up individually since the datasets typically come with noncommercial restrictions.
  2. IMDB data
  3. Twitter gardenhose
  4. AOL query log
  5. GigaDB bioinformatics database. Try e.g. searching for homo sapiens.
  6. TREC datasets (text retrieval).
  7. Linguistic Data Consortium homepage
  8. Stanford Social Networks datasets
  9. Frequent itemset mining data
  10. Wikipedia dump
  11. Amazon AWS public datasets