90-866, Large Scale Data Analysis for Public Policy
Course Description
The past decade has seen the increasing availability of very large scale
datasets, arising from the rapid growth of transformative technologies
such as the Internet and cellular telephones, along with the development
of new and powerful computational methods to analyze such datasets. Such
methods, developed in the closely related fields of machine learning, data
mining, and artificial intelligence, provide a powerful set of tools for
intelligent problem-solving and data-driven policy analysis. These
methods have the potential to dramatically improve the public welfare by
guiding policy decisions and interventions, and their incorporation into
intelligent information systems will improve public services in domains
ranging from medicine and public health to law enforcement and
security.
This course will provide a basic introduction to large scale data analysis
methods, focusing on three main problem paradigms (prediction, modeling,
and detection). Students will learn how to translate policy questions
into these paradigms, choose and apply the appropriate artificial
intelligence and machine learning tools, and correctly interpret,
evaluate, and apply the results for policy analysis and decision making.
We will emphasize tools that can "scale up" to real-world policy problems
involving reasoning in complex and uncertain environments, discovering new
and useful patterns, and drawing inferences from large amounts of
structured, high-dimensional, and multivariate data. No previous
knowledge of artificial intelligence or machine learning is
required.
(Note: this course was previously listed as "Artificial Intelligence Tools
for Policy", and is not open to students who have previously taken
90-866 under either title.)
Lecture slides (Fall 2010)
Module I: Prediction (pdf)
Module II: Modeling (pdf)
Module III: Detection (pdf)
Grading
Class participation: 5%
Project plan: 10%
Project preliminary report: 10%
Final project report: 25%
Final project presentation: 10%
Final exam: 40%
The projects will be done in teams of 2-3 people and will require the
application of AI/ML methods to real-world policy data. We plan to give
students the flexibility to define their own projects, enabling them to
focus on policy questions which are most relevant to their own specific
interests. However, each project should consist of the following
components:
- Define a relevant policy question to be answered using a dataset
of your choice. We will provide several example datasets, as well as other
suggested sources of publicly available data.
- Frame the problem in terms of one of the AI paradigms discussed in
this class. Discuss this problem framework in detail, justify your choice
of a problem framework, and report on methods that have been used to solve
the problem in past work.
- Choose an appropriate solution method for the problem. Describe the
solution method in detail, compare it to related methods, and defend your
choice of method.
- Find, or develop, an appropriate software implementation of this
method. We encourage you to use pre-existing toolkits such as Weka, though
it would also be acceptable to write your own functions in Matlab, R, etc.
if desired.
- Evaluate your method, reporting quantitative performance results
(e.g., cross-validation error) and discussing qualitatively how useful
the resulting models, explanations, etc. are for the given
domain.
- Consider extensions and variations of the original method, or
alternative methods, and examine/compare their effects on performance.
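As one illustration of the quantitative evaluation step above, the sketch
below estimates cross-validation error for a k-nearest-neighbors classifier.
It is a minimal example under stated assumptions, not part of the course
materials: the function names (knn_predict, cross_validation_error) and the
toy data are hypothetical, and Python is used only for illustration; the
same idea carries over to Weka, Matlab, or R.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbors.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def cross_validation_error(data, n_folds=5, k=3):
    """Estimate the misclassification rate with n-fold cross-validation."""
    errors = 0
    for fold in range(n_folds):
        # Hold out every n_folds-th record (offset by `fold`) for testing
        # and train on all remaining records.
        test = data[fold::n_folds]
        train = [rec for i, rec in enumerate(data) if i % n_folds != fold]
        errors += sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / len(data)


# Toy, made-up data (an assumption for illustration only):
# two well-separated groups of points in two dimensions.
toy_data = (
    [((float(i), float(i % 3)), "low") for i in range(10)]
    + [((float(i) + 20.0, float(i % 3)), "high") for i in range(10)]
)

cv_error = cross_validation_error(toy_data, n_folds=5, k=3)
print(f"Estimated cross-validation error: {cv_error:.2f}")
```

Each record is used for testing exactly once, so the reported error is an
average over predictions made on data the classifier did not see during
training. On real policy data, records should usually be shuffled before
being assigned to folds.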
Project teams will be self-selected. Typically, all team members will
receive the same grade, but we may make exceptions for unevenly
distributed workloads. Final project reports should contain a detailed
description of the contributions of each team member to the
project.
Occasionally, we will hand out short practice exercises to reinforce
understanding of the course material. You do not need to turn these in. We
will post answers with explanations on Blackboard, and these should help
you study for the final exam.
Sample syllabus
Lecture 1: Introduction to Large Scale Data Analysis
Course overview
Overview of artificial intelligence: viewpoints, successes, and
failures
Relevance of AI and machine learning for policy
Common AI/ML paradigms
Software tools for AI/ML
Lecture 2: Prediction and rule-based learning
The prediction problem (classification and regression)
Decision trees for classification and regression
Lecture 3: Instance-based learning
K-nearest neighbors for classification
Kernel regression
Cross-validation
Lecture 4: Model-based learning
Bayesian classification
The naive Bayes assumption
Lecture 5: Guest mini-lectures and panel discussion, "Prediction for
health policy"
Lecture 6: Representation and search
Goal-directed search: priority search and A*
State-space search: hill-climbing and simulated annealing
Lecture 7: Clustering for modeling groups
Hierarchical clustering
K-means clustering
Leader clustering
Lecture 8: Bayesian networks for modeling probabilities
Building Bayes Nets
Interpreting Bayes Nets
Lecture 9: More Bayesian networks
Inference with Bayes Nets
Learning Bayes Net structure
Lecture 10: Anomaly detection
Distance-based anomaly detection
Model-based anomaly detection
Detecting anomalies using Bayesian networks
Lecture 11: Guest mini-lectures and panel discussion, "Modeling and
detection for crime policy"
Lecture 12: Biosurveillance
An exciting application of pattern detection, and your instructor's main
research area.
Lecture 13: Pattern detection
Detecting patterns of anomalies
Spatial cluster detection