Introduction to Machine Learning

10-401, Spring 2018

Carnegie Mellon University

Maria-Florina Balcan




Course Project Guidelines



One of the course requirements is to do a project in a group of 2 or 3. Your class project is an opportunity for you to explore an interesting machine learning problem of your choice, whether empirically or theoretically. This can take one of the following forms.
  • Conduct a small experiment (pick a dataset, and apply an appropriate machine learning algorithm)
  • Read a couple of machine learning papers and present the main ideas (these should be outside the topics covered in class)
  • Work on a small theoretical question in machine learning
If you are having trouble coming up with an idea, feel free to consult a TA or the instructor. We provide several ideas below. Your project counts for 15% of your final grade and has three deliverables:
  • Proposal, March 21: 1 page
  • In-class Presentation, April 23 & 25 (30%)
  • Final Report, May 14: 4-6 pages (70%)
Each group should submit one proposal and final report. Your project will be evaluated based on a few criteria:
  • If the project is experimental, it will be graded on the extensiveness of the study and experiments. Projects with well-designed experiments and a thorough analysis of the results are scored higher. If the project is to look at a few papers, it will be graded on the clarity of the ideas. Projects that cleanly express the main ideas and concepts of the papers are scored higher.
  • The writing style and the clarity of the written paper.

Project Proposal (Due Date: March 21)

A list of suggested papers, projects, and data sets is posted below. Read the list carefully. You are encouraged to use one of the suggested data sets, because we know that they have been successfully used for machine learning in the past. If you prefer to use a different data set, we will consider your proposal, but (1) you must discuss your idea with an instructor or a TA before submitting the proposal and (2) you must have access to this data already, and present a clear proposal for what you would do with it.

Page limit: Proposals should be one page maximum.

Include the following information:
  • Project title and teammates
  • Project idea. This should be approximately two paragraphs.
  • If it is an experimental project, give a dataset and the code you plan to write. If your project is to read a few papers, list the papers.
  • Project goals: what do you expect to complete before the project presentations?

Presentation (April 23,25)

You should prepare slides for a 10-minute presentation of your project, followed by 2 minutes for questions. Your slides should contain a full summary of your project. Each group member should present part of the slides.

Final Report (Due Date: May 14)

This should be a 4-6 page report on your project. Include the project idea, any background or related work, the main ideas behind the project, results (if applicable), and conclusions.

Project Suggestions, grouped by type of project


Conduct a small experiment

Semi-Supervised Learning

In many applications, it is easy to obtain a large amount of unlabeled data, but difficult or costly to label this data. Semi-supervised learning studies algorithms which learn from a small amount of labeled data and a large pool of unlabeled data. Interestingly, semi-supervised learning is not always successful, and unlabeled data points do not always improve performance. Semi-supervised learning algorithms typically make an assumption about the data distribution which enables learning -- for example, several algorithms assume that the decision boundary should not pass through regions with high data density. When this assumption is satisfied, the algorithms perform better than supervised learning.

The goal of this project is to experiment with semi-supervised learning algorithms on a data set of your choice. Some algorithms you can consider using are: co-training, self-training, transductive SVMs (S3VMs), or one of the many graph-based algorithms. (We recommend reading (1) for a survey of the many approaches to semi-supervised learning.) You may compare several semi-supervised and supervised algorithms on your data set, and perhaps draw some general conclusions about semi-supervised learning.

This project can use essentially any data set. For some ideas, we recommend consulting the UC Irvine Machine Learning Repository.
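To illustrate the self-training idea mentioned above (this is a generic sketch, not any specific paper's algorithm), here is a minimal NumPy implementation that uses a nearest-centroid base classifier and a made-up confidence threshold; the toy two-blob data is purely for demonstration:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(X, classes, centroids):
    """Return predicted labels and a crude confidence score
    (margin between the two smallest centroid distances)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[d.argmin(axis=1)]
    sorted_d = np.sort(d, axis=1)
    margin = sorted_d[:, 1] - sorted_d[:, 0]  # larger margin = more confident
    return pred, margin

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, rounds=5):
    """Self-training: repeatedly pseudo-label the most confident
    unlabeled points and refit on the enlarged labeled set."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        classes, centroids = nearest_centroid_fit(X, y)
        pred, margin = nearest_centroid_predict(X_unlab, classes, centroids)
        confident = margin > threshold
        if not confident.any():
            break
        X = np.vstack([X, X_unlab[confident]])
        y = np.concatenate([y, pred[confident]])
        X_unlab = X_unlab[~confident]
    return nearest_centroid_fit(X, y)

# Two well-separated Gaussian blobs, with only 2 labeled points per class.
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 0.5, size=(50, 2))
X1 = rng.normal([4, 4], 0.5, size=(50, 2))
X_lab = np.vstack([X0[:2], X1[:2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([X0[2:], X1[2:]])
classes, centroids = self_train(X_lab, y_lab, X_unlab)
pred, _ = nearest_centroid_predict(np.array([[0.1, 0.2], [3.9, 4.1]]), classes, centroids)
print(pred)
```

Note how the assumption discussed above appears here: self-training only helps because the two classes form dense, well-separated clusters, so the decision boundary falls in a low-density region.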

(1) Xiaojin Zhu. Semi-supervised Learning Literature Survey. Available Here.

Computer Vision

Project Idea 1: Scene Classification
The Places dataset contains 365 different scene categories with 50 images each, available for download here. Create a scene classifier using Histogram of Oriented Gradients (HOG) features and a number of different classifiers. First, download the Places dataset and choose, say, 10 classes for your project (otherwise the dataset is too big). Implement (or find an implementation of) HOG and compute the HOG features on your mini-dataset. How can you improve the feature representation of your images? Is there some way to compute more features using HOG? With your new dataset of HOG-processed images and class labels, run a few different classifiers, such as SVMs, neural networks, and random forests. Compare and analyze the results. Why do some work better than others?
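To get a feel for what HOG computes, here is a simplified sketch in NumPy: per-cell histograms of gradient orientations weighted by gradient magnitude. (A full HOG descriptor also does overlapping block normalization, which is omitted here; the cell size and bin count are typical defaults, not requirements.)

```python
import numpy as np

def grad_orientation_histogram(img, n_bins=9, cell=8):
    """Simplified HOG-style features: for each cell, a histogram of
    unsigned gradient orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation in [0, 180)
    H, W = img.shape
    feats = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 180), weights=m)
            feats.append(hist)
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-9)  # global normalization

# A 32x32 image with a vertical edge: 4x4 cells of 9 bins each = 144 features.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
f = grad_orientation_histogram(img)
print(f.shape)
```

The resulting feature vectors can be fed directly to the classifiers mentioned above (SVMs, neural networks, random forests).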

Project Idea 2: Object Recognition
The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, under varying lighting conditions, and with different backgrounds, available for download here. For object recognition, you can try to create a system which identifies the object category that best matches a given test image. Another idea is to apply a clustering algorithm to learn object categories without supervision.

Privacy-Related Project Ideas

The most popular notion of privacy, which has received much attention lately, is so-called "Differential Privacy" (see, e.g., here). Approaches under this privacy model have been studied for many machine learning methods (such as logistic regression: see here, and SVMs: see here). This notion of privacy is fundamentally different from the usual cryptographic one.

The datasets for this project can be found at the UCI machine learning archive (Please consult Rob Hall for more details about the datasets.).

Project Idea: Differentially Private Decision Trees
See whether it is possible to implement a decision tree learner in a differentially-private way. This would entail creating a randomized algorithm which outputs a decision tree. Furthermore, when one of the elements of the input data is changed, the distribution over the outputs should not change by much (cf, the definition of differential privacy). Analyze under what conditions the approach will work and analyze the error rate relative to that of the non-private decision tree.
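A standard building block for this kind of project is the Laplace mechanism for counting queries; a differentially private decision-tree learner can be built on such noisy counts. A minimal sketch (the epsilon value and the query are illustrative, and a full tree learner must split its privacy budget across all the counts it requests):

```python
import numpy as np

def private_count(data, predicate, epsilon, rng):
    """Laplace mechanism: a counting query has sensitivity 1 (changing
    one record changes the count by at most 1), so adding noise drawn
    from Laplace(scale=1/epsilon) makes the answer epsilon-differentially
    private."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

# A DP decision-tree learner would choose split attributes using noisy
# class counts like this one, instead of the exact counts.
rng = np.random.default_rng(0)
data = [0, 1, 1, 0, 1]
noisy = private_count(data, lambda x: x == 1, epsilon=0.5, rng=rng)
print(noisy)  # true count is 3; noise has scale 1/0.5 = 2
```

The error analysis asked for above then amounts to quantifying how this injected noise degrades split quality, and hence accuracy, relative to the non-private tree.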

Neural Network Experiments
  • The idea of this project is to play around with the many possible parameters and architectures in neural networks
  • First, download the MNIST dataset here.
  • Get started with a neural network codebase. There are a lot of options, from MATLAB toolboxes, to online code, to deep learning libraries like PyTorch and TensorFlow, to code that you have written yourself. Choose whatever you are most comfortable with.
  • Do a series of experiments comparing various parameters and architectures. Some experiments you should try:
    • Sigmoid vs. ReLU transfer functions
    • Network depth (2 vs 3 layers)
    • Network "width" (number of nodes per layer)
    • Learning rate schedules: fixed learning rate, decay schedule, RMSprop, Adam
    • Think of others as well
  • The main goal of this project should be thoughtful analysis. Use plots and visualizations to support your analysis.
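To make the sigmoid-vs-ReLU comparison concrete, here is a small NumPy demonstration of why deep sigmoid networks can suffer vanishing gradients (the depth values and pre-activation input are arbitrary, chosen only for illustration):

```python
import numpy as np

def chained_derivative(depth, activation_grad, z=2.0):
    """Product of activation derivatives through `depth` layers,
    ignoring weight matrices -- a rough proxy for how the gradient
    signal shrinks with depth during backpropagation."""
    g = 1.0
    for _ in range(depth):
        g *= activation_grad(z)
    return g

# The sigmoid derivative s(z)*(1-s(z)) is at most 0.25, so products of it
# shrink geometrically with depth; ReLU's derivative is exactly 1 when active.
sigmoid_grad = lambda z: (s := 1 / (1 + np.exp(-z))) * (1 - s)
relu_grad = lambda z: 1.0 if z > 0 else 0.0

for depth in (2, 5, 10):
    print(depth, chained_derivative(depth, sigmoid_grad),
          chained_derivative(depth, relu_grad))
```

In your actual experiments the weights matter too, but this is the kind of quantitative argument your analysis and plots should aim for.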

Brain Activity for Meanings of Nouns
Project Idea: Predict Nouns from fMRI Data
The goal of this project is to reproduce or improve the results from the paper, i.e., to predict nouns with a similar accuracy as in the paper. For example, you may implement decision trees, SVM, and/or Neural Nets. Talk about the decisions you made in terms of reproducing the results, and some thoughtful ideas about how the results could be improved (even if you were not able to actually produce such results).

NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records

Project ideas:
  • Outlier detection on the players: find out who the outstanding players are.
  • Predict the game outcome.
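For the outlier-detection idea, one simple baseline is a z-score test on a per-player statistic; everything below (the stat values and the threshold) is made up for illustration:

```python
import numpy as np

def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.abs(z) > threshold

# Hypothetical points-per-game for ten players; the last is far above the rest.
ppg = [10, 12, 11, 9, 13, 10, 12, 11, 10, 35]
print(zscore_outliers(ppg))
```

More robust variants (median/MAD-based scores, or multivariate methods over several stats at once) would make for a stronger project.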

Netflix Prize Dataset
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize.

Project idea:
  • Can you predict the rating a user will give on a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
  • Can you discover clusters of similar movies or users?
  • Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each (user, movie) pair was rated in 2006. Note that the actual rating is irrelevant; we just want to know whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website.
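The first idea above is the classic collaborative-filtering setting. A minimal user-based sketch in NumPy, using cosine similarity on a tiny made-up ratings matrix (the real Netflix data is far too large for this dense representation, so treat this only as a statement of the method):

```python
import numpy as np

def predict_rating(R, user, item):
    """Predict R[user, item] as a similarity-weighted average of the
    ratings that other users gave this item (0 marks 'not rated')."""
    rated = R[:, item] > 0
    rated[user] = False
    if not rated.any():
        return R[R > 0].mean()  # fall back to the global mean rating
    # Cosine similarity between the target user and each user who rated the item.
    sims = np.array([
        R[user] @ R[v] / (np.linalg.norm(R[user]) * np.linalg.norm(R[v]) + 1e-9)
        for v in np.where(rated)[0]
    ])
    ratings = R[rated, item]
    return sims @ ratings / (sims.sum() + 1e-9)

# Toy matrix: rows are users, columns are movies, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 4, 1],
              [1, 1, 5, 4],
              [5, 5, 0, 2]], dtype=float)
print(round(predict_rating(R, user=0, item=2), 2))
```

The same similarity matrix, computed over columns instead of rows, gives item-item filtering and also supports the clustering idea above.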


Human Choice Prediction
This dataset contains data from experiments on human choice behavior. In each experiment, participants faced a series of probabilistic decision problems with multiple options, with and without feedback. The data set is available here.

Project ideas:
  • Can you design a machine learning algorithm which predicts human choice behavior and models well-known choice phenomena such as the certainty effect and loss aversion?


Read and summarize a few papers

  • Privacy
  • Semi-Supervised Learning
  • Distributed Machine Learning
  • Boosting
  • Active Learning
  • Interactive Clustering
  • Online Learning
  • Clustering under Approximation Stability
  • Adversarial Machine Learning
  • Contextual Bandit Learning
  • Reinforcement Learning


Theoretical question in machine learning