10-831/90-921, Mining Massive Datasets (Special Topics in Machine Learning and Policy)

Sample Syllabus from Spring 2013

Course Description

This course is intended for Ph.D. students in Heinz College, the Machine Learning Department, and other university departments who wish to engage in detailed exploration of a specific topic at the intersection of machine learning and public policy. Qualified master's students may also enroll with permission of the instructor; all students are expected to have some prior background in machine learning and data mining (10-601, 10-701, 90-866, or a similar course).

This year's course will focus on the topic of Mining Massive Datasets. Many policy and management problems can benefit from the analysis of massive data at a societal scale, including the use of electronic medical records for health care, electronic crime reports and emergency call records for law enforcement, cellular telephone data to infer contact networks and/or track individuals' location and proximity, and a multitude of online data sources (Twitter, Facebook, search queries, purchases, recommendations, clickstreams, etc.). Such datasets may consist of millions to trillions of data records, be very high dimensional, and have complex structure. In many cases, standard machine learning methods are computationally infeasible at this scale and complexity, or produce poor results (e.g., due to the curse of dimensionality). However, new technologies (such as MapReduce/Hadoop), scalable algorithms, and novel machine learning methods can mitigate these problems and enable effective use of these massive quantities of data.

We will explore these challenges and opportunities in detail through readings, discussions of current research and future directions, and course projects, with the goals of understanding and advancing the current state of the art. Specific topics will be selected from Rajaraman, Leskovec, and Ullman's book, Mining of Massive Datasets, and supplemented with current research papers representing new advances and current or potential policy applications.

Course Objectives

Upon completion of this course, the student will be able to:

1. Discuss selected topics and research directions in Mining Massive Datasets, such as similarity search, streaming data, clustering, and graph mining.

2. Present current topics in machine learning and policy, focusing on methods for mining massive datasets and potential policy and management applications, by synthesizing and summarizing the current state of the art, and facilitating discussion by posing questions, preliminary conclusions, and ideas to explore.

3. Develop a research project relevant to Mining Massive Datasets and produce a report describing the project's background, methods, results, and conclusions.

Class Schedule

Mondays and Wednesdays, 10:30-11:50am, Hamburg Hall 1511

Grading

Class participation: 20%
Topic presentation 1: 20%
Topic presentation 2: 20%
Project proposal presentation (4/3): 5%
Project proposal (due 4/3): 5%
Final presentation (5/1): 5%
Final report (due 5/3): 25%

Class Participation

One major goal of this course is to have engaging and insightful group discussions about selected topics and research directions in Mining Massive Datasets, and thus active participation by all students in these discussions is an essential component of the course. Students are expected to attend all class meetings, to do the assigned readings and practice exercises in advance of each class meeting, and to contribute useful insights, comments, and questions to the discussions.

Topic Presentations

Eight of the thirteen course meetings will be devoted to discussion of four specific topics in Mining Massive Datasets: similarity search, streaming data, clustering, and graph mining. Each student is expected to give a high-quality, twenty-minute PowerPoint presentation (15 minutes + 5 minutes for questions) at two of these meetings; each class will consist of 2-3 presentations plus 20 minutes of open discussion. A minimally acceptable presentation ("B" grade) will clearly and succinctly present the important details of the assigned reading, while a quality presentation ("A" grade) will a) motivate the problem and solution in terms of current or potential future applications to public policy and management, b) bring in additional sources (such as current research papers) to supplement the presentation and discussion, c) describe the major open problems and promising directions for future work, and/or d) facilitate the open discussion by posing questions for discussion, preliminary conclusions, and ideas to explore. Given that multiple students will be presenting each day, you are responsible for coordinating the material covered with the other presenters.

Course Projects

All students are expected to be involved in a research project relevant to Mining Massive Datasets, to make significant progress on this research over the duration of the course, and to produce a written document describing the project's background (including a description of any previous work by the student and related work by others), methods, results, and conclusions. You are encouraged (but not required) to work in groups of two for this project; larger groups may be acceptable but require the instructor's permission. Each group will be expected to give two brief presentations of their work to the class (one at the beginning of the course, describing their proposed work, and one at the end of the course, describing their completed work), and to submit a short (1-2 page) proposal, thus providing opportunities for their work to benefit from feedback both from the instructor and from the class. If desired, the course project can be part of the students' ongoing doctoral research (in which case the group's proposal should make clear what specific aspect of this work will be addressed over the duration of the course), or can be a smaller-scale project specific to the course. Note that the course project requirement can be waived for students auditing the course, but all students are expected to give two presentations on the readings and to be active participants in class discussions.

Grading for course projects will be based on: 20% significance of the problem, 20% novelty of the proposed approach, 20% correctness of the methodology, 20% clarity and completeness of the writeup, and 20% progress made over the course duration.

Schedule of Topics and Readings

(M 3/18) Introduction to Mining Massive Datasets. Readings: Ch. 1.

(W 3/20) A Crash Course on Map-Reduce. Readings: Ch. 2, Sections 2.1-2.3. Practice Exercise: 2.3.1.

(W 3/27) Similarity Search I. Readings: Ch. 3, Sections 3.1-3.3 and 3.5.

(M 4/1) Similarity Search II. Readings: Ch. 3, Sections 3.4 and 3.6-3.9.

(W 4/3) Project Proposal Presentations

(M 4/8) Streaming Data I. Readings: Ch. 4, Sections 4.1-4.3 and 4.7.

(W 4/10) Streaming Data II. Readings: Ch. 4, Sections 4.4-4.6.

(M 4/15) Clustering I. Readings: Ch. 7, Sections 7.1 and 7.3.4-7.3.5 + two papers: Jain (required) and Koga (optional).

(W 4/17) Clustering II. Readings: Ch. 7, Sections 7.4 and 7.6. Optional reading: Section 7.5.

(M 4/22) Graph Mining I. Readings: Ch. 10, Sections 10.1-10.3.

(W 4/24) Graph Mining II. Readings: Ch. 10, Sections 10.5-10.7 + Pegasus paper (Kang et al.).

(M 4/29) Scaling Up Pattern Detection to Massive Datasets. Required readings: Krause and Guestrin; Neill et al. Optional reading: Christakis and Fowler.

(W 5/1) Final Project Presentations

(F 5/3) Final reports due by 11:59pm.