Machine Learning

10-701/15-781, Spring 2014

Barnabas PoczosAarti Singh

Home People Lectures Recitations Homeworks Project Previous material Table of algorithms
Course Project Guidelines

Deadline is extended till Thursday noon. (Before the Thursday class)

The project proposal should be one or two pages. It should contain the title, group members names, email addresses, and andrew ids. Each group should submit a single write up. This write up must include a clear description of your project in its relationship to machine learning. It is ok to build on prior work, but you must clearly state what portion of the project is novel and has not yet been completed. Your write up should include an outline of all that is to be done and a rough timeline for completion. Your outline should be optimistic in the sense that if completed you would consider your project to be perfect. Separately, you should describe the minimum you could complete and still consider your project successful. The write up should also include what resources are required, such as data sets and external software packages, and how they will be attained. We will provide feedback on your proposals and assign each group a TA mentor. Keep this in mind when constructing your write up so that we can provide you with the most valuable feedback possible.

Project Suggestions

Here are some project ideas and datasets for this year:

Vector Space Models for Natural Language Processing (Provided by Jayant Krishnamurthy)

Many tasks in natural language processing can be performed with only very shallow understanding of text. The vector space model is one example of a useful, but shallow, data representation that has been successfully used for many tasks, including detecting synonyms, finding analogies, and learning properties of noun phrases. The vector space model represents the meaning of a noun phrase(NP) (e.g. "the New York Yankees" or "house") as a vector of co-occurrence counts with contexts. A context is a short snippet of text like "alex rodriguez plays for _" or "_ on the street". The model is essentially a (very) large matrix A, whose rows represent noun phrases and whose columns represent contexts. The value of entry A_{i,j} is the number of times noun phrase i occurred with context j in a large corpus of documents (e.g., the Web). Intuitively, this model contains useful information because some contexts only occur with certain types of noun phrases; for example, the context "athletes, such as _" only occurs with athletes.

Download Dataset and Related Tools

Project ideas:
In this project, we will provide you with an NP-context co-occurrence matrix and ask you to do something interesting with it. Possible applications include finding synonyms, finding members of categories (i.e., "is this noun phrase an athlete?"), or clustering noun phrases to automatically induce categories. For more ideas, we recommend reading (1).

Another interesting set of projects could use this data as a case study for learning in high-dimensions. These projects could examine how dimensionality reduction or regularization affect performance on one of the above tasks.

Vector space models are also used for many other tasks, including document classification and topic modeling. The 20 Newsgroups data is a classic data set for these tasks. You can obtain this dataset from here

(1) Peter D. Turney and Patrick Pantel (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, pp. 141-188.

Semi-Supervised Learning (Provided by Jayant Krishnamurthy)

In many applications, it is easy to obtain a large amount of unlabeled data, but difficult or costly to label this data. Semi-supervised learning studies algorithms which learn from a small amount of labeled data and a large pool of unlabeled data. Interestingly, semi-supervised learning is not always successful, and unlabeled data points do not always improve performance. Semi-supervised learning algorithms typically make an assumption about the data distribution which enables learning -- for example, several algorithms assume that the decision boundary should not pass through regions with high data density. When this assumption is satisfied, the algorithms perform better than supervised learning.

The goal of this project is to experiment with semi-supervised learning algorithms on a data set of your choice. Some algorithms you can consider using are: co-training, self-training, transductive SVMS (S3VMs), or one of the many graph-based algorithms. (We recommend reading (1) for a survey of the many approaches to semi-supervised learning.) You may compare several semi-supervised and supervised algorithms on your data set, and perhaps draw some general conclusions about semi-supervised learning.

This project can use essentially any data set. For some ideas, we recommend consulting the UC Irvine Machine Learning Repository.

(1) Xiaojin Zhu. Semi-supervised Learning Literature Survey. Available Here.

Collective Classification (Role discovery) in Heterogenous Networks (Provided by Leman Akoglu)

In machine learning, the typical approach of classification is to classify each object independently without considering the network structure that connects these objects, if such a network structure exists. This setting, however, occurs naturally in data for a variety of applications such as Web graphs, social networks, bibliographic data, email networks, etc. For example, in the web page classification problem where web pages are interconnected with hyperlinks, the task is to classify each web page, that is to assign a label to each web page that would indicate its topic the best. You can think of this problem as given a collection of web pages, how to group them into sports, news, blogs, etc. with respect to their text data. In such a setting, there exists another significant source of information: the underlying network of hyper-linked web pages. In this network, it is reasonable to assume that the linked web pages are correlated (sports pages link to other sports pages, news pages can also link to sports pages, etc.) (See here for a related recent (best) paper)

The problem of node classification where nodes have several attributes as well as form a network structure among themselves is often referred to as collective classification. (See here for a short survey) In this project, you can use similar ideas from this field of research and/or extend existing ideas to do role discovery in social networks or communication networks. It is of great interest in the data mining community to detect the labels of nodes in, say, email networks to know who is a manager, regular employee, the executive assistant, etc. One can also use similar ideas to detect fraud in customer-product networks such as in Amazon, Ebay, etc. (See here)

For this project, Leman can provide the Enron email dataset which consists of enron employee and non-employee email interactions. The nodes are labeled as Employee, Director, Vice President, Trader, Manager, etc. which she think is a great real data resource for this type of problem.

Also, she also suggests the Reality mining dataset (Click Here) could be of use which shows interactions between faculty and students at MIT over 9 months --the classification task would be classifying nodes into business school student/faculty vs CS student faculty here.

Classification and Event Detection in Time Series of Graphs (Provided by Leman Akoglu)

Graph classification has been widely studied in both the data mining (e.g. here) and the machine learning community (e.g. here). The task here is to assign a given graph a class label using a collection of graphs arriving in a sequence over time as training data. Such a setting arises in several applications: The benchmark run of software programs with different input parameters constitutes a request-flow-graph (RFG). In a RFG the nodes represent the function calls and the edges represent the dependency between functions (e.g. function A calls function B).

Given a collection of RFGs of many different runs of a software program the task is to classify each run as successful or unsuccessful to detect software bugs or program crashes.

One can also think of other scenarios where the data is collected over a (maybe distributed) system by submitting a benchmark of requests. Each request generates a graph with nodes representing articulation points in the system and the edges representing the interactions between them. Once these graphs can be classified into a specific type of requests, one can study the normal/abnormal functioning of the system. Using machine learning, this process can be automated within the system eliminating human interaction with the system.

In this project, the goal is to study the performance of different machine learning algorithms on the graph classification task. Another important direction is to come up with ideas about what features to use and analyze which features extracted from the graphs are more useful and distinctive in the classification process. (See on feature selection)

For this project,unfortunatelycurrently there is no dataset available but you can:

1 - simulate several (a benchmark of) runs of a software program and generate many graphs with nodes representing the functions and edges representing the calls. You can manually inject bugs into the software to get malicious runs and try to do classification accordingly.

2 - If you have access to a distributed environment such as Hadoop, etc., they can submit read, write, download, etc. system requests to generate request flow graphs and classify them into read, write, etc.

Leman believes data collection for those above cases should not take too long.

Robust low-rank matrix factorization with missing data by minimizing the l1 loss (Provided by Tzu-Kuo Huang)

Low-rank matrix factorization has a variety of applications. Two of them are collaborative filtering and image completion, where the goal is to fill in the missing entries of a data matrix. Most low-rank matrix factorization methods minimize the l2 loss, i.e, the sum of squared residuals between the known entries and the reconstruction given by a product of two low rank matrices. It is known, however, that the l2 loss is prone to outliers, and the resulting factorization can over-fit the known entries in the data matrix. One common remedy for this issue is penalizing the matrix norms of the two low-rank factors, while another less explored way is replacing the l2 loss by the l1 loss, which is known to be more robust against outliers. Two papers ([1] and [2]) in the field of computer vision have proposed algorithms for low-rank matrix factorization under the l1 loss, and have demonstrated interesting results in computer vision applications.

The goals of this project include, but are not limited to, the following:
  1. Implement the algorithms proposed in [1] and [2], try them on other applications of low-rank matrix factorization, such as Collaborative Filtering, for which two widely-used benchmark data sets are Netflix and MovieLens (Click here). Compare results with those obtained by minimizing the l2 loss, and/or any state-of-the-art methods you know of.
  2. The algorithms proposed in [1] and [2] are for minimizing the l1 loss without regularizing the factors. Try extending these algorithms to handle regularization terms, and see how the regularized versions perform compared with the un-regularized ones.

[1] Efficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L1 Norm, CVPR 2010

[2] Robust L1 Norm Factorization in the Presence of Outliers and Missing Data by Alternative Convex Programming, CVPR 2005

Spectral algorithms for learning Hidden Markov Models (Provided by Tzu-Kuo Huang)

Hidden Markov Models (HMMs) are a common tool for analyzing dynamic data. The most widely-used algorithm for learning HMMs is the Baum-Welch algorithm, which is essentially doing expectation maximization. Like all EM procedures, the Baum-Welch algorithm suffers from local optima, and in addition the trouble of determining the number of hidden states. Recently some researchers proposed alternative learning algorithms ([1] and [2]) that are free from local optima, and demonstrated their performance in a number of applications, including robot vision and audio classification. The goal of this project is to implement these algorithms and apply them to other dynamic data, such as brain wave signals (e.g., Click here), video streams (e.g., CMU motion capture database, which contains both video data and biomarker data), and/or any dynamic data in your field of interest. Compare results with those obtained by either standard HMMs or domain specific extensions/enhancements.

For those who are looking for a specific application, the UCI machine learning archive has a section on time series data, which may be a good starting point.


[1] A spectral algorithm for learning hidden Markov models, COLT 2009.

[2] Hilbert Space Embeddings of Hidden Markov Models, ICML 2010.

Privacy-Related Project Ideas (Provided by Rob Hall)

A recent paper shows how decision trees may be approximated in the "secure multiparty" setting, by using tools of cryptography. This allows several parties to compute a decision tree on the union of their data without requiring them to share the data itself with each other. See here for the algorithm and here for a general discussion.

An alternative notion of privacy which has received much attention lately is so-called "Differential Privacy" (see e.g, here. So far approaches have been studied for many machine learning methods under this model of privacy (such as logistic regression: see here and svm: see here. This notion of privacy is fundamentally different to the usual cryptographic one.

The datasets for this project can be found at the UCI machine learning archive (Please consult Rob Hall for more details about the datasets.).

Project Idea 1: Differentially Private Decision Trees
See whether it is possible to implement a decision tree learner in a differentially-private way. This would entail creating a randomized algorithm which outputs a decision tree. Furthermore, when one of the elements of the input data is changed, the distribution over the outputs should not change by much (cf, the definition of differential privacy). Analyze under what conditions the approach will work and analyze the error rate relative to that of the non-private decision tree.

Project Idea 2: Secure Multiparty EM
Implement EM for some kind of mixture model in a secure multiparty way (i.e., making use of cryptography to protect the intermediate quantities). This has been studied already in the context of imputation (see here). Although in principle, existing cryptographic primitives may be used to compute any funciton, in order to get a reasonably efficient algorithm you will have to come up with approximations which are less expensive to compute. Analyze the effects of any such approximations you make.

Project Idea 3: Differentially Private Sparse Covariance Estimation
The estimation of gaussian covariance matrices has received ample attention lately (see e.g, here). When the covariance matrix is sparse (i.e., the off-diagonal elements are mostly 0) then there is a nice correspondence between the multivariate gaussian, and a certain type of undirected graphical model. See whether it is possible to estimate sparse covariance matrices in a way which satisfies the definition of differential privacy as shown above. Analyze how much error will be in the algorithm compared with the non-private one.

KDD cup 2010 (Provided by Datashop in Pittsburgh Science Learning Center, suggested by Min Chi)

The KDD Cup is the annual Data Mining and Knowledge Discovery competition in which some of the best data mining teams in the world compete to solve an practical data mining problem of some importance. This year (2010)'s challenge is about how generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks? This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.

[Download the datasets:] Click here to download the data.

More information about the datasets can be found by Click here.

You will apply the various machine learning techniques to the KDD datasets to improve the accuracy on predicting "Correct First Attempt values" for the test portion of the data. Please report the Root Mean Squared Error (RMSE).

30-40 Datasets for Education Data Mining (Provided by Min Chi))

[Dataset Webpage:] Go to the webpage (You may need log in through WebISO) and click "Public Datasets".

There are about 30-40 public datasets available from the webpage (Only select the datasets whose status is labeled "complete"). If you clicking each datasets, you will find a general description and the related publications. To look at the datasets, click "Export" link.

Project Idea 1:
For each dataset, you can compare various machine learning techniques (at least five to seven different ML methods) on predicting "Correct First Attempt values" (Generally listed in the column "Outcome"). Please report the Root Mean Squared Error (RMSE).

Project Idea 2:
Across datasets, you can compare several machine learning techniques (at least two to three different ML methods) on predicting "Correct First Attempt values"(Generally listed in the column "Outcome"). Please report the Root Mean Squared Error (RMSE) on the test data. The hypothesis here is that there may not be an absolute winner, different machine learning techniques may be effective on different task domains. For example, you can split the datasets into science (physics & math) vs. second language learning (Chinese, French).

Some project ideas from previous years:

Image Segmentation Dataset
The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two-hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.

Download Dataset

Project idea: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.

Character recognition (digits) data
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Download dataset.

Project suggestion:
* Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)

NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records
Currently all of the regular season

Project idea:
* outlier detection on the players; find out who are the outstanding players.
* predict the game outcome.

Precipitation data
This dataset has includes 45 years of daily precipitation data from the Northwest of the US:

Download Dataset

Project ideas:
Weather prediction: Learn a probabilistic model to predict rain levels.
Sensor selection: Where should you place sensor to best predict rain.

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
Download Dataset.

Project ideas:
* Learning classifiers to predict the type of webpage from the text.
* Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?


Email Annotation
The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

Download Dataset

Project Ideas:
* Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".


Netflix Prize Dataset
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize.

Project idea:
  • Can you predict the rating a user will give on a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
  • Can you discover clusters of similar movies or users?
  • Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website.

Physiological Data Modeling (bodymedia)
Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.

1. Which sensors correspond to each column?
characteristic1 age
characteristic2 handedness
sensor1 gsr_low_average
sensor2 heat_flux_high_average
sensor3 near_body_temp_average
sensor4 pedometer
sensor5 skin_temp_average
sensor6 longitudinal_accelerometer_SAD
sensor7 longitudinal_accelerometer_average
sensor8 transverse_accelerometer_SAD
sensor9 transverse_accelerometer_average

2. What are the activities behind each annotation?
The annotations for the contest were:
5102 = sleep
3104 = watching TV

Datasets can be downloaded from here.

Project idea:
* behavior classification; to classify the person based on the sensor measurements.

Object Recognition
The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.

Download Dataset.

Project ideas:
* You can try to create an object recognition system which can identify which object category is the best match for a given test image.
* Apply clustering to learn object categories without supervision.

Enron E-mail Dataset
The Enron E-mail data set contains about 500,000 e-mails from about 150 users.

The data set is available here

Project ideas:
* Can you classify the text of an e-mail message to decide who sent it?

More data
There are many other datasets out there. UC Irvine has a repository that could be useful for you project:

Sam Roweis also has a link to several datasets out there:

Dr. Jan Wiebe MPQA opinion annotated corpus: