CMU 15-859(B), Spring 2008
 MACHINE LEARNING THEORY
Time: MW 3:00-4:20, 
Place: Wean 4623
Office hours: After class and Thurs 2-3 are best for me, but
	    also feel free to stop by my office or make an appointment.
Course description: 
This course will focus on theoretical aspects of machine learning. We
will examine questions such as: What kinds of guarantees can one prove
about learning algorithms? What are good algorithms for achieving
certain types of goals? Can we devise models that are both amenable to
mathematical analysis and make sense empirically? What can we say
about the inherent ease or difficulty of learning problems? Addressing
these questions will require pulling in notions and ideas from
statistics, complexity theory, information theory, cryptography, game
    theory, and empirical machine learning research.
Grading will be based on 
6 homework assignments, class
      participation, a small class project, and a take-home final
      (worth about 2 homeworks).  Students from time
      to time will also be asked to help with the grading of
		assignments. 
[2007 version]
Text:
An Introduction to Computational Learning Theory by Michael Kearns 
and Umesh Vazirani, plus papers and notes for topics not in the book.
Handouts
 Lecture Notes & tentative plan
-  01/14: Introduction. PAC model and Occam's
		    razor.
-  01/16: The Mistake-Bound model.  Combining
	  expert advice.  Connections to info theory and game theory.
-  01/23: The Winnow algorithm + applications.
-  01/28: The Perceptron Algorithm, Margins, and Kernel functions.
-  01/30: Uniform convergence, tail
            inequalities (Chernoff/Hoeffding), VC-dimension I. 
            [more notes]
-  02/04: VC-dimension II.
-  02/06: Boosting I: weak vs strong learning, basic issues. 
            [plus some slides]
-  02/11: Boosting II: Adaboost + connection
	      to WM analysis + L_1 margin bounds.
-  02/13: Margins, random projection, kernels, and general similarity functions
-  02/18: Margin bounds, Support Vector Machines.
-  02/20: McDiarmid's inequality & Rademacher bounds.
-  02/25: Maxent and maximum-likelihood exponential models. Connection to winnow.
-  02/27: Cryptographic hardness results
-  03/03: Statistical Query model I
-  03/05: Statistical Query model II
-  03/17: Fourier-based algorithms
-  03/19: Membership Query algorithms
-  03/24: Membership Query algorithms II
-  03/26: Learning finite-state environments
-  03/31: MDPs and reinforcement learning I
-  04/02: MDPs and reinforcement learning II
-  04/07: Semi-supervised learning
-  04/09: Offline->online optimization
-  04/14: online learning and game theory I
-  04/16: online learning and game theory II
-  04/21: Active learning & course retrospective
-  04/23: [No Class Today]
-  04/28: Project presentations
-  04/30: Project presentations
 Additional Readings & More Information
Books and tutorials:
- O. Bousquet, S. Boucheron, and G. Lugosi, Introduction
to Statistical Learning Theory.
- PASCAL video lectures.
- N. Cristianini and J. Shawe-Taylor, 
     Kernel 
     Methods for Pattern Analysis, 2004.
- N. Cristianini and J. Shawe-Taylor, 
An Introduction to Support
Vector Machines (and other kernel-based learning methods), 2000.
- M. Anthony and P. Bartlett. Learning in Neural Networks :
			    Theoretical Foundations. Cambridge
			    University Press, 1999. 
- V. Vapnik. Statistical Learning Theory. Wiley, 1998.
- L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of
		    Pattern Recognition, Springer, New York, 1996.
-  My FOCS'03 tutorial on Machine
				    Learning Theory
 
Online Learning:
-  Nick Littlestone, Learning Quickly 
when Irrelevant Attributes Abound: A New Linear-threshold Algorithm.
Machine Learning 2:285--318, 1987.  (The version pointed to
	      here is the tech report UCSC-CRL-87-28.)
This is the paper that first defined the Mistake-bound model, and
	also introduced the Winnow algorithm.  A great paper.  
 
-  Littlestone and Warmuth, 
      
      The Weighted Majority Algorithm. 
      Information and Computation 108(2):212-261, 1994.
Introduces the weighted majority algorithm, along with a number of
		    variants.  Also a great paper.
 
-  Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David
Helmbold, Robert Schapire, and Manfred Warmuth, How
to use expert advice, Journal of the ACM, 44(3):427-485, May 1997.
Yoav Freund and Robert Schapire,  Adaptive game playing using
multiplicative weights, Games and Economic Behavior, 29:79-103,
			      1999.
Continuing on with line of research in the [LW] paper, these give
tighter analyses of multiplicative-weighting expert algorithms and
give a game-theoretic perspective, as well as address a number of other issues.
 
-  Adam Kalai and Santosh Vempala, Efficient algorithms for the online decision problem, COLT '03.  Martin Zinkevich, 
Online convex
programming and generalized infinitesimal gradient ascent, ICML '03.  
These papers give efficient algorithms for a broad class of settings that one
can view as having exponentially many "experts", but which are represented 
in an implicit compact way. 
 
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, Robert Schapire: The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput. 32(1): 48-77 (2002).  Brendan McMahan and Avrim Blum: Online Geometric Optimization
	    in the Bandit Setting Against an Adaptive Adversary, COLT '04.
Abie Flaxman, Adam Tauman Kalai, and Brendan McMahan: Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient, SODA '2005.  These papers extend above results to the bandit setting, in which only the loss or gain of the action actually played can be observed at each time step.
 
-  Survey articles: 
      Avrim Blum, On-Line 
      Algorithms in Machine Learning.  From "Online Algorithms: the 
      state of the art", Fiat and Woeginger eds., LNCS #1442, 1998.
      Avrim Blum and Yishay Mansour, Learning, Regret Minimization, and Equilibria, Chapter 4 in "Algorithmic Game Theory," Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay Vazirani, eds. (2007).
 
PAC sample complexity:
-  David Haussler Chapter on PAC learning model, and decision-theoretic generalizations, with applications to neural nets. From Mathematical
     Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995, containing reprinted material from "Decision Theoretic
     Generalizations of the PAC Model for Neural Net and Other Learning Applications", Information and Computation, Vol. 100,
     September, 1992, pp. 78-150. This is a really nice survey of the PAC
				model and various sample-complexity results.
 
- David Williamson, John Shawe-Taylor, Bernhard Schölkopf, Alex
Smola Sample
Based Generalization Bounds.  Gives tighter generalization bounds
where instead of using "the maximum number of ways of labeling a set of 2m
points" you can use "the number of ways of labeling your actual sample". 
Boosting:
More on Kernels:
Fourier analysis, weak learning, SQ learning:
- Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich,  Weakly Learning DNF and Characterizing Statistical Query Learning Using Fourier Analysis, STOC '94 pp. 253--262.
 
- Y. Mansour. Learning
Boolean Functions via the Fourier Transform. Survey article in
``Theoretical Advances in Neural Computation and Learning", 391--424
(1994).
 
-  A. Blum, C. Burch, and J. Langford, On
Learning Monotone Boolean Functions. Proceedings of the
39th Annual Symposium on Foundations of Computer Science (FOCS '98).
 
-  V. Feldman, P. Gopalan, S. Khot, A. Ponnuswami. New Results for Learning Noisy Parities and Halfspaces, FOCS 2006.
 
Computational hardness results:
 Web sites on maxent: