90-921/10-831, Special Topics in Machine Learning and Policy

Spring 2013: Mining Massive Datasets

Course Description

Special Topics in Machine Learning and Policy (90-921/10-831) is intended for Ph.D. students in Heinz College, the Machine Learning Department, and other university departments who wish to engage in detailed exploration of a specific topic at the intersection of machine learning and public policy. Qualified master's students may also enroll with permission of the instructor; all students are expected to have some prior background in machine learning and data mining (10-601, 10-701, 90-866, 90-904/10-830, or a similar course).

This year's course will focus on the topic of Mining Massive Datasets. Many policy and management problems can benefit from the analysis of massive data at the societal scale, including use of electronic medical records for health care, electronic crime reports and emergency call records for law enforcement, cellular telephone data to infer contact networks and/or track individuals' location and proximity, and a multitude of online data sources (Twitter, Facebook, search queries, purchases, recommendations, clickstreams, etc.). Such datasets may consist of millions to trillions of data records, be very high dimensional, and have complex structure (graphs/networks, relational data, spatial and temporal data, etc.). In many cases, standard machine learning methods are computationally infeasible at this scale and complexity of data, or produce poor results (e.g., due to the curse of dimensionality). However, new technologies (such as MapReduce/Hadoop), scalable algorithms, and novel machine learning methods can mitigate these problems and enable effective use of these massive quantities of data.

We will explore these challenges and opportunities in detail through readings, discussions of current research articles and future directions, and course projects, with the goals of understanding and advancing the current state of the art. Specific topics will be selected from Rajaraman and Ullman's book, Mining of Massive Datasets, and supplemented with current research papers representing new advances and current or potential policy applications. Potential topics include: distributed file systems and MapReduce, clustering and classification for very large and high dimensional data, similarity search, streaming data, web search, frequent pattern mining, event detection, and social network analysis.
