90-866, Large Scale Data Analysis for Public Policy

Course Description

The past decade has seen the increasing availability of very large scale datasets, arising from the rapid growth of transformative technologies such as the Internet and cellular telephones, along with the development of new and powerful computational methods to analyze them. These methods, developed in the closely related fields of machine learning, data mining, and artificial intelligence, provide a powerful set of tools for intelligent problem-solving and data-driven policy analysis. They have the potential to dramatically improve the public welfare by guiding policy decisions and interventions, and their incorporation into intelligent information systems will improve public services in domains ranging from medicine and public health to law enforcement and security.

This course will provide a basic introduction to large scale data analysis methods, focusing on three main problem paradigms (prediction, modeling, and detection). Students will learn how to translate policy questions into these paradigms, choose and apply the appropriate artificial intelligence and machine learning tools, and correctly interpret, evaluate, and apply the results for policy analysis and decision making. We will emphasize tools that can "scale up" to real-world policy problems involving reasoning in complex and uncertain environments, discovering new and useful patterns, and drawing inferences from large amounts of structured, high-dimensional, and multivariate data. No previous knowledge of artificial intelligence or machine learning is required.

(Note: this course was previously listed as "Artificial Intelligence Tools for Policy", and is not open to students who have previously taken 90-866 under that title.)

Lecture slides (Fall 2010)

Module I: Prediction (pdf)
Module II: Modeling (pdf)
Module III: Detection (pdf)

Grading

Class participation: 5%
Project plan: 10%
Project preliminary report: 10%
Final project report: 25%
Final project presentation: 10%
Final exam: 40%
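
As a worked illustration of the breakdown above, the following minimal sketch combines the six components into one overall score. The weights are taken from the list above; everything else is an assumption made for illustration only: that each component is scored on a 0-100 scale, that the overall grade is a simple weighted average, and the example scores themselves, which are hypothetical. This page does not specify how component scores are scaled or combined.

```python
# Weights come from the grading breakdown above.
# The 0-100 scale and the simple weighted average are assumptions.
WEIGHTS = {
    "Class participation": 0.05,
    "Project plan": 0.10,
    "Project preliminary report": 0.10,
    "Final project report": 0.25,
    "Final project presentation": 0.10,
    "Final exam": 0.40,
}


def weighted_grade(scores):
    """Return the weighted average of component scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student, for illustration only.
example_scores = {
    "Class participation": 100,
    "Project plan": 90,
    "Project preliminary report": 80,
    "Final project report": 85,
    "Final project presentation": 90,
    "Final exam": 75,
}

print(round(weighted_grade(example_scores), 2))  # prints 82.25
```

Under these assumptions the four project components together carry 55% of the grade, so in this example the project contributes 47.25 points and the final exam 30.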

The projects will be done in teams of 2-3 people and will require the application of AI/ML methods to real-world policy data. We plan to give students the flexibility to define their own projects, enabling them to focus on policy questions which are most relevant to their own specific interests. However, each project should consist of the following components:

Project teams will be self-selected. Typically, all team members will receive the same grade, but we may make exceptions for unevenly distributed workloads. Final project reports should contain a detailed description of the contributions of each team member to the project.

Occasionally, we will hand out short practice exercises to reinforce understanding of the course material. You do not need to turn these in. We will post answers with explanations on Blackboard, and these should help you study for the final exam.

Sample syllabus

Lecture 1: Introduction to Large Scale Data Analysis
Course overview
Overview of artificial intelligence: viewpoints, successes, and failures
Relevance of AI and machine learning for policy
Common AI/ML paradigms
Software tools for AI/ML

Lecture 2: Prediction and rule-based learning
The prediction problem (classification and regression)
Decision trees for classification and regression

Lecture 3: Instance-based learning
K-nearest neighbors for classification
Kernel regression
Cross-validation

Lecture 4: Model-based learning
Bayesian classification
The naive Bayes assumption

Lecture 5: Guest mini-lectures and panel discussion, "Prediction for health policy"

Lecture 6: Representation and search
Goal-directed search: priority search and A*
State-space search: hill-climbing and simulated annealing

Lecture 7: Clustering for modeling groups
Hierarchical clustering
K-means clustering
Leader clustering

Lecture 8: Bayesian networks for modeling probabilities
Building Bayes Nets
Interpreting Bayes Nets

Lecture 9: More Bayesian networks
Inference with Bayes Nets
Learning Bayes Net structure

Lecture 10: Anomaly detection
Distance-based anomaly detection
Model-based anomaly detection
Detecting anomalies using Bayesian networks

Lecture 11: Guest mini-lectures and panel discussion, "Modeling and detection for crime policy"

Lecture 12: Biosurveillance
An exciting application of pattern detection, and your instructor's main research area.

Lecture 13: Pattern detection
Detecting patterns of anomalies
Spatial cluster detection
