Research Seminar in Machine Learning and Policy (10-830/90-904)

Sample Syllabus from Spring 2013

Course Description

This research seminar is intended for Ph.D. students in Heinz College, the Machine Learning Department, and other university departments who wish to engage in cutting-edge research at the intersection of machine learning and public policy. Qualified master's students may also enroll with permission of the instructor; all students are expected to have some prior background in machine learning and/or artificial intelligence (10-601, 10-701, 90-866, or a similar course). The course has three main objectives: 1) to facilitate in-depth discussions of current research articles and essential topics in machine learning and policy, 2) to benefit the students' own ongoing research projects through presentations, critiques, and discussions, and 3) to encourage interdisciplinary research collaborations between students in Heinz, MLD, and other departments. We plan to achieve these goals through a discussion-based course format: students will present and discuss current research articles on selected topics in machine learning and policy, and will give presentations on their ongoing research projects and/or smaller-scale course projects in this domain.

Course Objectives

Upon completion of this course, the student will be able to:

Class Schedule

Mondays and Wednesdays, 10:30-11:50am, Hamburg Hall 1511

Grading

Class participation: 20%
Topic presentation 1: 20%
Topic presentation 2: 20%
Project proposal presentation (Wednesday, January 30th): 5%
Project proposal (due Wednesday, January 30th, at the beginning of class): 5%
Final presentation (Monday, March 4th): 5%
Final report (due Monday, March 4th, at 11:59pm Eastern time): 25%

Class Participation

One major goal of this course is to have engaging and insightful group discussions about selected topics and research directions at the intersection of machine learning and public policy, and thus active participation by all students in these discussions is an essential component of the course. Students are expected to attend all class meetings, to read the assigned research articles in advance, and to contribute useful insights, comments, and questions to the discussions.

Topic Presentations

Nine of the fourteen course meetings will be devoted to discussion of specific topics (such as causal discovery, social networks, and the wisdom of crowds). Each student is expected to give a high-quality, 20-minute PowerPoint presentation at two of these meetings. The goals of the presentation should be 1) to introduce the topic and provide essential background information, 2) to very briefly review the assigned readings and their relevance to the discussion, and 3) to facilitate the remainder of the discussion by posing questions for discussion, preliminary conclusions, and ideas to explore. For most of the topic discussions, we will have two student presenters: in this case, the students are responsible for coordinating their presentations to avoid unnecessary repetition and to explore different aspects or perspectives of the general topic under discussion. *** PLEASE LIMIT YOUR PRESENTATION TO 20 MINUTES TO ALLOW SUFFICIENT TIME FOR DISCUSSION!!! ***

To ensure that presentations will be useful and relevant for the class, the presenter(s) should send the instructor a brief text outline of the main topics/points that their presentation will cover, and a proposed set of 2-3 electronically available research articles that the class should read, at least one week (and preferably two weeks) prior to the presentation. The instructor will provide feedback and suggestions, and will post the articles on Blackboard so that the class can read them in advance of the presentation. The chosen research articles should present methods and approaches that are new (or not commonly known), that are likely to be of relevance to both ML and policy researchers, that cover a variety of perspectives, and that raise important issues for discussion.

An important thing to keep in mind is that you want to focus on papers and discussion topics with explicit connections to policy (these could be methodological connections, e.g. combining ML methods with or comparing them to methods used in policy, or could be applications of ML to general or specific policy areas). But it's fine for some of the papers/topics to be more general as well. Some good ways to find relevant papers include looking through recent ML conference proceedings (KDD, ICML, AAAI, NIPS) or various journals (economics/econometrics, statistics, policy analysis, etc.; I might be able to provide more specific recommendations depending on your topic of interest). Using Citeseer, Google Scholar, or just a Google search is also a good way to get started. Your main goals should be to get a general sense of the discussion topic (the current state of the art, and where the field is headed), and to find some specific papers and questions that will be interesting to discuss.

Course Projects

All students are expected to be involved in a research project relevant to machine learning and public policy, to make significant progress on this research over the duration of the course, and to produce a written document describing the project's background (including a description of previous work by the student and related work by others), methods, results, and conclusions. This final report should also include a brief description of how the progress of the work and the student's future research directions have been influenced by the semester's discussions. Students will also be expected to give two brief presentations of their work to the class (at the beginning of the course, describing their proposed work, and at the end of the course, describing their completed work), and to submit a short (1-2 page) proposal, thus providing opportunities for their work to benefit from feedback both from the instructor and from the class. If desired, the course project can be part of the student's ongoing doctoral research (in which case the student's proposal should make it clear what specific aspect of this work will be addressed over the duration of the course), or can be a smaller-scale project specific to the course. Note that the course project requirement can be waived for students auditing the course, but all students are expected to give two topic presentations and to be active participants in class discussions.

*** While we expect most projects to be done individually, students can also work in pairs with the instructor's permission. Permission to do so is much more likely to be granted if the two students have complementary skills and perspectives (e.g. ML methodology + expertise in the application domain of interest). Please talk to the instructor if you are interested in this possibility. ***

Syllabus

(M 1/14) The Big Picture
Introductions (be prepared to speak for 2-3 minutes each about your background and interests)
Discussion of the course syllabus (course structure, goals, topic presentations, course projects)
The current convergence of ML and public policy (research, curriculum, at CMU and elsewhere)
Preliminary discussion of the "big picture" (we will revisit many of these issues and questions at the end of the course)

(W 1/16) Quick Review of Core Machine Learning Concepts
This course assumes a knowledge of basic machine learning methods as a prerequisite. The instructor will provide a quick review of many of these core concepts, using slides from his "Large Scale Data Analysis for Policy" course. Please look over these slides before the lecture and come prepared to ask questions on any topics that may be unfamiliar. Topics to be covered include supervised learning (decision trees, k-nearest neighbor, naive Bayes), unsupervised learning (clustering, anomaly detection, anomalous pattern detection), graphical models (e.g. Bayesian networks), and other relevant ML paradigms (active learning, reinforcement learning, ...).

(W 1/23) Discussion Topic 1: New Directions in Supervised Learning- Explanation and Visualization
Readings: Domingos (required), El-Arini et al. (required), Harle et al. (optional), Green (optional), Guyon and Elisseeff (optional)

Some possible questions for discussion: Perhaps the most common current application of machine learning methods to policy is the use of simple classification and regression techniques (e.g. decision trees) for prediction. In policy analysis, we typically wish not only to achieve high-accuracy predictions of the output variable, but also to determine which input variables have the greatest influence on our predictions. What are the tradeoffs between accuracy and interpretability in classification? How can we design classifiers so that the results are easily visualizable and understandable, but without losing (much) prediction accuracy? When is it preferable to use a more interpretable classifier (such as decision trees or naive Bayes) and how can the outputs of each classifier be interpreted? How are our interpretations affected by correlations between input variables? When can we draw conclusions about the causal relationships between the input and output variables? When is it better to use a less interpretable classifier (such as neural networks or support vector machines), or combine classifiers (e.g. through bagging or boosting) for higher accuracy? Can we improve the interpretability of such methods without sacrificing accuracy, or can we learn an interpretable model which closely matches the output of the less interpretable model? How can we perform dimensionality reduction and data visualization such that the low-dimensional classifier has high accuracy, but the projected dimensions are still interpretable?

(M 1/28) Discussion Topic 2: Graphical Models and Causality
Readings: Mahmood (required), Stuart (required), Pearl (optional), Statnikov et al. (optional), Su et al. (optional)

Some possible questions for discussion: How can we use Bayesian networks and other graphical models to understand the relationships between variables in policy domains? What recent innovations have made inference and learning of Bayesian networks efficient and scalable (e.g. new methods for structure learning; variational inference; context-specific independence)? How can we deduce causal relationships from observational data, experimental data, or from a combination of data and prior knowledge? What assumptions need to be made to interpret a Bayesian network causally? What role do latent (hidden) variables play in causal discovery, and how can we model them?

(W 1/30) Project Proposal Presentations
Each student will give a short PowerPoint presentation on their proposed course project and turn in a short (1-2 page) proposal.

(M 2/4) Discussion Topic 3: Active Learning and User Interaction
Readings: Attenberg x 2 (1 required, 1 optional), Mozafari (required), Yan (optional).

Some possible questions for discussion: How can unlabeled data be used effectively to improve the accuracy of classification? Given a human user "in the loop" who can provide feedback to the system, how can active learning methods be used to choose the best points for the user to label? When are active learning methods useful for policy, and how can they best be applied? How can active learning be used to identify the most relevant patterns in massive amounts of data? How can systems and interfaces be designed to incorporate ML methods in ways that maximize benefit to users in the public sector?

(W 2/6) Discussion Topic 4: Integrating Social Science and Machine Learning 1- Economic Modeling
Readings: Angrist (required), Levitt (required), Danaher (optional), Zubizarreta (optional).

Some possible questions for discussion: How do the models, methods of analysis, and perspectives applied in typical social science, economics, and policy research compare to those developed by machine learning and data mining researchers? Is it possible to integrate ideas from these two fields to improve the quality of decision-making and policy analysis? Can economic and econometric techniques such as instrumental variables and propensity score matching be integrated with ML methods to make policy-relevant inferences from massive data?

(M 2/11) Discussion Topic 5: Integrating Social Science and Machine Learning 2- Social Network Analysis
Readings: Burt (required), Eisenstein (required), Kwon (required), Madan (optional), Somasundaran (optional)

Some possible questions for discussion: How can we integrate approaches to modeling and mining of social network data with a policy-based understanding of the formation and evolution of social ties? How do machine learning and data mining approaches to social network analysis (e.g. link analysis, group detection, probabilistic relational models, and mining of frequent graph structures) differ from methodologies used in the study of social and organizational behavior? How can these approaches be combined for better understanding and analysis of social networks in policy and management applications?

(W 2/13) Discussion Topic 6: Online Data 1- Utility
Readings: Bond (required), Ferrucci (required), Aral (optional), Radinsky (optional), Salganik (optional).

Some possible questions for discussion: What are the core methods that enable identification of relevant information in response to search queries, and what are the algorithmic advances that allow these methods to scale up to billions of queries per day? How can we model the structure and dynamics of the Web, and what algorithms are effective at mining such huge datasets? Much of the increasingly huge amount of online data available for analysis is only available as unstructured free text. How can we structure and make sense of this mass of data? How can we perform experiments using online social networks such as Facebook, or analyze existing data from these networks, to add to our understanding of human behavior?

(M 2/18) Discussion Topic 7: Online Data 2- Privacy
Readings: Acquisti (required), Verykios (required), Benitez (optional), Goldfarb (optional), Tucker (optional).

Some possible questions for discussion: With the increasing availability of massive amounts of online data including individuals' personal information (e.g. electronic health records), patterns of behavior (e.g. web search queries), and user-generated content (e.g. Facebook, Twitter), mining of massive data sets is increasingly important, but it can be used either for good or for evil (e.g. violating individuals' privacy). What are the ethical issues surrounding online data mining? Are the popular fears that data mining will be used to violate privacy justified? To what extent can data mining be used to recover personal information in undesirable ways (e.g. Alessandro Acquisti's work showing that social security numbers can be inferred from Facebook profiles)? How can we perform useful data mining tasks without violating individual privacy?

(W 2/20) Discussion Topic 8: Machine Learning for the Developing World
Readings: Blair (required), Hellstrom (required), Lane (required), Hasenfratz (optional), Kapoor (optional), Quinn (optional).

Some possible questions for discussion: How can machine learning methods best be applied to improve the quality of life in developing countries (e.g. influencing international health and development policy, disease surveillance and other event monitoring systems)? How can ML methods be adapted to account for poor quality and availability of data, and how can they be used to optimize limited resources (e.g. deciding which data to obtain)?

(M 2/25) Discussion Topic 9: Event Detection Using Twitter
Readings: Paul (required), Ritter (required), Sakaki (required), Li (optional), Petrovic (optional).

Some possible questions for discussion: How can geospatial data be used for the public good, e.g. applications to crime prediction and disease outbreak detection? How can fine-grained location data (e.g. from cellular telephones) be used to infer social structure, predict events, etc.? Since exhaustive search over all subsets of locations is typically infeasible, most methods either restrict the search space or perform a heuristic (approximate) search. Which of these approaches is preferable for which problems? When is it possible to find the most interesting subset of the data without performing an exhaustive search, and which methods can be used for this (e.g. matroids, submodularity, branch and bound, linear-time subset scanning)?

(M 3/4) Project Presentations
Each student will give a short PowerPoint presentation on their course project. Final project reports are due today at 11:59pm Eastern time.