90-921/10-831, Special Topics in Machine Learning and Policy
Spring 2013: Mining Massive Datasets
Course Description
Special Topics in Machine Learning and Policy (90-921/10-831) is
intended for Ph.D. students in Heinz College, the Machine Learning
Department, and other university departments who wish to engage in
detailed exploration of a specific topic at the intersection of machine
learning and public policy. Qualified master's students may also enroll
with permission of the instructor; all students are expected to have
some prior background in machine learning and data mining (10-601,
10-701, 90-866, 90-904/10-830, or a similar course).
This year's course will focus on the topic of Mining Massive Datasets.
Many policy and management problems can benefit from the analysis of
massive data at societal scale, including the use of electronic medical
records for health care, electronic crime reports and emergency call
records for law enforcement, cellular telephone data to infer contact
networks and/or track individuals' location and proximity, and a
multitude of online data sources (Twitter, Facebook, search queries,
purchases, recommendations, clickstreams, etc.). Such datasets may
consist of millions to trillions of data records, be very high
dimensional, and have complex structure (graphs/networks, relational
data, spatial and temporal data, etc.). In many cases, standard machine
learning methods are computationally infeasible at the scale and
complexity of the data, or produce poor results (e.g., due to the curse of
dimensionality). However, new technologies (such as MapReduce/Hadoop),
scalable algorithms, and novel machine learning methods can mitigate
these problems and enable effective use of these massive quantities of
data.
We will explore these challenges and opportunities in detail through
readings, discussions of current research articles and future
directions, and course projects, with the goals of understanding and
advancing the current state of the art. Specific topics will be
selected from Rajaraman and Ullman's book, Mining of Massive Datasets, and
supplemented with current research papers representing new advances and
current or potential policy applications. Potential topics include:
distributed file systems and MapReduce, clustering and classification
for very large and high dimensional data, similarity search, streaming
data, web search, frequent pattern mining, event detection, and social
network analysis.