#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass amsart \language english \inputencoding default \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 2 \paperpagestyle default \layout Title Tutorial on Practical Prediction Theory for Classification \layout Author John Langford, IBM Research \layout Abstract We discuss basic prediction theory and it's impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and practically useful. \layout Abstract There are a two important implications of the results presented here: \layout Abstract (1) Common practices for reporting results in classification should change to use the test set bound. \layout Abstract (2) Train set bounds can sometimes be used to directly motivate learning algorithms. \layout Section Introduction \layout Standard Classifiers are functions which partition a set into two elements (the set of rainy days and the set of sunny days). Classifiers are the most simple nontrivial decision making element so studying the theory of learning classifiers is very fundamental to studying the theory of learning, in general. Classifiers are sufficiently complex that many phenomena observed in machine learning (theoretically or experimentally) can be observed in the classificatio n setting. Yet, classifiers are simple enough to make their analysis easy to understand. This combination of sufficient yet minimal complexity for capturing phenomena makes the study of classifiers especially fruitful. \layout Standard The goal of this paper is an introduction to the theory of prediction for classification. Many of these results have been presented elsewhere, although the style, elemental nature, and generality of the presentation may be new. This is a tutorial, so we limit our presentation to those results which are both theoretically sound and practically useful. \layout Standard There are several important aspects of learning which the theory here casts light on. Perhaps the most important of these is the problem of performance reporting for classifiers. Many people use some form of empirical variance to estimate upper and lower bounds. This is an error-prone practice, and the test set bound in section \begin_inset LatexCommand \ref{sec-test} \end_inset implies a better method by nearly any metric. Hopefully, this will become common practice. \layout Standard After discussing the test set bound we cover the Occam's Razor bound, the simplest train set bound, which explains (and quantifies) the common phenomena of overfitting in simplest form. We also prove that the Occam's Razor bound cannot be improved without incorpora ting extra information and apply the bound to decision trees. \layout Standard Next, we discuss two train set bounds, the PAC-Bayes bound and the Sample Compression bound, which have proved to give practical results for more general classifiers, such as Support Vector Machines and Neural Networks. \layout Standard All of the results here should be easily approachable and understandable. The proofs are simple, and examples are given. Pointers to related work are also given. \layout Standard It is important to note that all of the results presented here fall in the realm of classical statistics. In particular, all randomizations are over draws of the data, and our results have the form of confidence intervals. \layout Standard The layout of this document is as follows: \layout Itemize Section \begin_inset LatexCommand \ref{sec-formal} \end_inset presents the formal model \layout Itemize Section \begin_inset LatexCommand \ref{sec-test} \end_inset presents the test set bound \layout Itemize Section \begin_inset LatexCommand \ref{sec-train} \end_inset presents the Occam's Razor bound \layout Itemize Section \begin_inset LatexCommand \ref{sec:PAC-Bayes-Bound} \end_inset presents the PAC-Bayes bound \layout Itemize Section \begin_inset LatexCommand \ref{sec:Sparsity-Bound} \end_inset presents the Sample Compression bound \layout Standard The formal model and test set bound must be understood in order to appreciate all later results. There is no particular dependency between the various train set bounds we present. \layout Section Formal Model \layout Standard \begin_inset LatexCommand \label{sec-formal} \end_inset \layout Standard There are many somewhat arbitrary choices of learning model. The one we use can (at best) be motivated by its simplicity. Other models such as the online learning model \begin_inset LatexCommand \cite{Online} \end_inset , PAC learning \begin_inset LatexCommand \cite{Valiant} \end_inset , and the uniform convergence model \begin_inset LatexCommand \cite{Vapnik} \end_inset differ in formulation, generality, and in the scope of addressable questions. The strongest motivation for studying the prediction theory model here is simplicity and corresponding generality of results. Appendix section \begin_inset LatexCommand \ref{sec-model} \end_inset discusses the connections between various models. \layout Subsection Basic quantities \layout Standard We are concerned with a learning model in which examples of (input, output) pairs come independently from some unknown distribution. The goal is to find a function capable of predicting the output given the input. There are several mathematical objects we work with. \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Tabular \begin_inset Text \layout Standard Object \end_inset \begin_inset Text \layout Standard Description \end_inset \begin_inset Text \layout Standard \begin_inset Formula $X$ \end_inset \end_inset \begin_inset Text \layout Standard The (arbitrary) space of the input to a classifier \end_inset \begin_inset Text \layout Standard \begin_inset Formula $Y=\{-1,1\}$ \end_inset \end_inset \begin_inset Text \layout Standard The output of a classification. \end_inset \begin_inset Text \layout Standard \begin_inset Formula $D$ \end_inset \end_inset \begin_inset Text \layout Standard An (unknown) distribution over \begin_inset Formula $X\times Y$ \end_inset \end_inset \begin_inset Text \layout Standard \begin_inset Formula $S$ \end_inset \end_inset \begin_inset Text \layout Standard A set of examples drawn independently from \begin_inset Formula $D$ \end_inset . \end_inset \begin_inset Text \layout Standard \begin_inset Formula $m$ \end_inset \end_inset \begin_inset Text \layout Standard \begin_inset Formula $=|S|$ \end_inset the number of examples \end_inset \begin_inset Text \layout Standard \begin_inset Formula $c$ \end_inset \end_inset \begin_inset Text \layout Standard A function mapping \begin_inset Formula $X$ \end_inset to \begin_inset Formula $Y$ \end_inset \end_inset \end_inset \layout Standard There are several aberrations of this model from other (perhaps more familiar) models. There is no mention of a classifier space, because the results do not depend upon a classifier space. Also, the notion of a distribution on \begin_inset Formula $X\times Y$ \end_inset is strictly more general than the \begin_inset Quotes eld \end_inset target concept \begin_inset Quotes erd \end_inset model which assumes that there exists some function \begin_inset Formula $f:X\rightarrow Y$ \end_inset used to generate the label \begin_inset LatexCommand \cite{Valiant} \end_inset . In particular we can model noisy learning problems which do not have a particular \begin_inset Formula $Y$ \end_inset value for each \begin_inset Formula $X$ \end_inset value. This generalization is essentially \begin_inset Quotes eld \end_inset free \begin_inset Quotes erd \end_inset in the sense that it does not add to the complexity of presenting the results. \layout Standard It is worth noting that the \emph on only \emph default unverifiable assumption we make is examples are drawn independently from \begin_inset Formula $D$ \end_inset . The strength of all the results which follow rests upon the correctness of this assumption. \layout Standard Sometimes, we decorate these objects with labels like \begin_inset Formula $S_{\textrm{train}}$ \end_inset (a train set) or \begin_inset Formula $S_{\textrm{test}}$ \end_inset (a test set). These decorations should always be clear. \layout Example Weather prediction: Will it rain today or not? In this case \begin_inset Formula $X=$ \end_inset barometric pressure, observations of cloud cover or other sensory input and \begin_inset Formula $Y=0$ \end_inset if the prediction is \begin_inset Quotes eld \end_inset no rain \begin_inset Quotes erd \end_inset and \begin_inset Formula $1$ \end_inset otherwise. The distribution \begin_inset Formula $D$ \end_inset is over sensory inputs and outcomes. The sample set \begin_inset Formula $S$ \end_inset , might consist of \begin_inset Formula $m=100$ \end_inset (observation, outcome) pairs such as (pressure low, cloudy, rain), (pressure high, cloudy, not rain), etc. A classifier, \begin_inset Formula $c$ \end_inset , is any function which predicts \begin_inset Quotes eld \end_inset rain \begin_inset Quotes erd \end_inset or \begin_inset Quotes eld \end_inset not rain \begin_inset Quotes erd \end_inset based upon the observation. \layout Example Note that the independence assumption here is not perfectly satisfied although it seems to be a reasonable approximation. In any application of this theory, it must be carefully judged whether the independence assumption holds or not. \layout Subsection Derived quantities \layout Standard There are several derived quantities which the results are stated in terms of. \layout Definition (True Error) The true error \begin_inset Formula $c_{D}$ \end_inset of a classifier \begin_inset Formula $c$ \end_inset is defined as the probability that the classifier errs: \begin_inset Formula \[ c_{D}\equiv\Pr_{(x,y)\sim D}(c(x)\neq y)\] \end_inset \layout Standard The true error is sometimes called the \begin_inset Quotes eld \end_inset generalization error \begin_inset Quotes erd \end_inset . Unfortunately, the true error is not an observable quantity in our model because the distribution, \begin_inset Formula $D$ \end_inset , is unknown. However, there is a related quantity which is observable. \layout Definition (Empirical Error) Given a sample set \begin_inset Formula $S$ \end_inset , the \emph on empirical error \emph default , \begin_inset Formula $\hat{c}_{S}$ \end_inset is the observed rate of errors: \begin_inset Formula \[ \hat{c}_{S}\equiv\Pr_{(x,y)\sim S}(c(x)\neq y)=\frac{1}{m}\sum_{i=1}^{m}I(c(x_{i})\neq y_{i})\] \end_inset where \begin_inset Formula $I()$ \end_inset is a function which maps \begin_inset Quotes eld \end_inset true \begin_inset Quotes erd \end_inset to \begin_inset Formula $1$ \end_inset and \begin_inset Quotes eld \end_inset false \begin_inset Quotes erd \end_inset to \begin_inset Formula $0$ \end_inset . Also, \begin_inset Formula $\Pr_{(x,y)\sim S}(...)$ \end_inset is a probability taken with respect to the uniform distribution over the set of examples, \begin_inset Formula $S$ \end_inset . \layout Standard The empirical error is sometimes called the \begin_inset Quotes eld \end_inset training error \begin_inset Quotes erd \end_inset , \begin_inset Quotes eld \end_inset test error \begin_inset Quotes erd \end_inset , or \begin_inset Quotes eld \end_inset observed error \begin_inset Quotes erd \end_inset depending on whether it is the error rate on a training set, test set, or a more general set. \layout Example* (continued) The classifier \begin_inset Formula $c$ \end_inset which always predicts \begin_inset Quotes eld \end_inset not rain \begin_inset Quotes erd \end_inset might have an empirical error of \begin_inset Formula $\frac{38}{100}$ \end_inset and an unknown true error rate (which might in fact be \begin_inset Formula $0.5$ \end_inset ). \layout Subsection Addressable questions \layout Standard Given the true error, \begin_inset Formula $c_{D}$ \end_inset of a classifier \begin_inset Formula $c$ \end_inset we can precisely describe the distribution of success and failure on future examples drawn according to \begin_inset Formula $D$ \end_inset . This quantity is derived from the unknown distribution \begin_inset Formula $D$ \end_inset , so our effort is directed toward upper and lower bounding the value of \begin_inset Formula $c_{D}$ \end_inset for a classifier \begin_inset Formula $c$ \end_inset . \layout Standard The variations in all of the bounds that we present are related to the method of choosing a classifier, \begin_inset Formula $c$ \end_inset . We cover two types of bounds: \layout Enumerate Test: Use examples in a test set which were not used in picking \begin_inset Formula $c$ \end_inset . \layout Enumerate Train: Use examples for both choosing \begin_inset Formula $c$ \end_inset and evaluating \begin_inset Formula $c$ \end_inset . \layout Standard These methods are addressed in the next two sections. \layout Standard It is worth noting that one question that \emph on cannot \emph default be addressed in this model is \begin_inset Quotes eld \end_inset Can learning occur for my problem? \begin_inset Quotes erd \end_inset . Extra assumptions (as in \begin_inset LatexCommand \cite{Valiant} \end_inset \begin_inset LatexCommand \cite{Vapnik} \end_inset ) are inherently necessary. \layout Section The Test Set Method \layout Standard \begin_inset LatexCommand \label{sec-test} \end_inset \layout Standard The simplest bound arises for the classical technique of using \begin_inset Formula $m$ \end_inset fresh examples to evaluate a classifier. This section is organized into two subsections: \layout Itemize Subsection \begin_inset LatexCommand \ref{subsec-test-bound} \end_inset presents the basic upper bound on the true error rate, handy approximations, and a lower bound \layout Itemize Subsection \begin_inset LatexCommand \ref{subsec-test-implication} \end_inset discusses the implications of the test set bound on error reporting practice. A better method for error reporting is applied to several datasets and the results are shown. \layout Subsection The Bound \layout Standard \begin_inset LatexCommand \label{subsec-test-bound} \end_inset \layout Standard Before stating the bound, we note a few basic observations which make the results less surprising. \begin_inset Float figure wide false collapsed false \layout Standard \align center \begin_inset Graphics filename single_binomial.ps display monochrome rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-binom} \end_inset A depiction of the Binomial distribution. The cumulative of the Binomial is the area under the curve up to some point on the horizontal axis. \end_inset The principal observable quantity is the empirical error \begin_inset Formula $\hat{c}_{S}$ \end_inset of a classifier. What is the distribution of the empirical error for a fixed classifier? For each example, our independence assumption implies the probability that the classifier errs is given by the true error, \begin_inset Formula $c_{D}$ \end_inset . This can be modeled by a biased coin flip: heads if you are right and tails if you are wrong. \layout Standard What is the probability of observing \begin_inset Formula $k$ \end_inset errors (heads) out of \begin_inset Formula $m$ \end_inset examples (coin flips)? This is a very familiar distribution in statistics called the Binomial and so it should be unsurprising that the bounds presented here are fundamentally dependent upon the cumulative distribution of a Binomial. \layout Definition (Binomial Tail Distribution) \begin_inset Formula \[ \textrm{Bin}\left(\frac{k}{m},c_{D}\right)\equiv\Pr_{X_{1},...X_{m}\sim c_{D}^{m}}\left(\left.\sum_{i=1}^{m}X_{i}\leq k\right|c_{D}\right)=\sum_{j=0}^{k}\binom{m}{j}c_{D}^{j}(1-c_{D})^{m-j}\] \end_inset = the probability that \begin_inset Formula $m$ \end_inset examples (coins) with error rate (bias) \begin_inset Formula $c_{D}$ \end_inset produce \begin_inset Formula $k$ \end_inset or fewer errors (heads). \layout Definition A depiction of the Binomial distribution is given in figure \begin_inset LatexCommand \ref{fig-binom} \end_inset . \layout Standard For the learning problem, we always choose a bias of \begin_inset Formula $c_{D}$ \end_inset and \begin_inset Formula $X_{i}=$ \end_inset error or not on the \begin_inset Formula $i$ \end_inset th example. With these definitions, we can interpret the Binomial tail as the probability of an empirical error greater than or equal to \begin_inset Formula $\frac{k}{m}$ \end_inset . \layout Standard Since we are interested in calculating a bound on the true error given a confidence \begin_inset Formula $\delta$ \end_inset , and an empirical error \begin_inset Formula $\hat{c}_{S}$ \end_inset , it is handy to define the inversion of a Binomial tail. \layout Definition (Binomial Tail Inversion) \begin_inset Formula \[ \overline{\textrm{Bin}}\left(\frac{k}{m},\delta\right)\equiv\max_{p}\left\{ p:\,\,\textrm{Bin}\left(\frac{k}{m},p\right)\geq\delta\right\} \] \end_inset = the largest true error such that the probability of observing \begin_inset Formula $\frac{k}{m}$ \end_inset or more \begin_inset Quotes eld \end_inset heads \begin_inset Quotes erd \end_inset is at least \begin_inset Formula $\delta$ \end_inset . \layout Standard With these definitions finished, the results are all very simple statements. \layout Theorem \begin_inset LatexCommand \label{th-hb} \end_inset (Test Set Bound) For all classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta\right)\right)\geq1-\delta\] \end_inset \layout Standard Note that \begin_inset Formula $m$ \end_inset in this equation is \begin_inset Formula $m_{\textrm{test}}=|S_{\textrm{test}}|$ \end_inset , the size of the test set. \layout Proof (pictorially in \begin_inset LatexCommand \ref{fig-proof} \end_inset ) The proof is just a simple identification with the Binomial. For any distribution over \begin_inset Formula $(x,y)$ \end_inset pairs and any classifier, \begin_inset Formula $c$ \end_inset , there exists some probability, \begin_inset Formula $c_{D}$ \end_inset , that the classifier predicts incorrectly. We can regard this event as a coin flip with bias \begin_inset Formula $c_{D}$ \end_inset . Since each example is picked independently, the distribution of the empirical error is a Binomial distribution. \layout Proof Whatever our true error \begin_inset Formula $c_{D}$ \end_inset is, with probability \begin_inset Formula $1-\delta$ \end_inset the observation \begin_inset Formula $\hat{c}_{S}$ \end_inset will not fall into a tail of size \begin_inset Formula $\delta$ \end_inset . Assuming (correctly with probability \begin_inset Formula $1-\delta$ \end_inset ) that the empirical error is not in the Binomial tail, we can constrain (and therefore bound) the value of the true error \begin_inset Formula $c_{D}$ \end_inset . \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename binomials.ps display monochrome width 35col% rotateAngle 270 \end_inset \begin_inset Graphics filename cut_binomials.ps display monochrome width 35col% rotateAngle 270 \end_inset \layout Standard \begin_inset Graphics filename consistent_cut_binomials.ps display monochrome width 35col% rotateAngle 270 \end_inset \begin_inset Graphics filename worst_cut.ps display monochrome width 35col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-proof} \end_inset A graphical depiction of the test set bound. The first graph depicts several possible Binomials given their true error rates. The second depicts several Binomials, each with a tail cut. The third figure shows the Binomials consistent with the tail cut and observed test error. The worst case over all true error rates is the consistent Binomial with the largest bias. \end_inset \layout Standard The test set bound is, essentially, perfectly tight. For any classifier with a sufficiently large true error, the bound is violated exactly a \begin_inset Formula $\delta$ \end_inset portion of the time. \layout Subsubsection Approximations \layout Standard \begin_inset LatexCommand \label{sub:Corollaries} \end_inset \layout Standard There are several immediate corollaries of the test set bound ( \begin_inset LatexCommand \ref{th-hb} \end_inset ) which are more convenient when a computer is not handy. The first corollary applies to the limited \begin_inset Quotes eld \end_inset realizable \begin_inset Quotes erd \end_inset setting where you happen to observe \begin_inset Formula $0$ \end_inset test errors. \layout Corollary (Realizable Test Set Bouund) For all classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\hat{c}_{S}=0\Rightarrow c_{D}\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta\] \end_inset \layout Proof Specializing theorem \begin_inset LatexCommand \ref{th-hb} \end_inset to the zero empirical error case, we get: \begin_inset Formula \[ \textrm{Bin}\left(\frac{0}{m},\epsilon\right)=(1-\epsilon)^{m}\leq e^{-\epsilon m}\] \end_inset Setting this equal to \begin_inset Formula $\delta$ \end_inset and solving for \begin_inset Formula $\epsilon$ \end_inset gives us the result. The last inequality can be most simply motivated by comparing graphs as in figure \begin_inset LatexCommand \ref{fig:graphs} \end_inset . \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename exp.ps rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:graphs} \end_inset A graph suggesting \begin_inset Formula $e^{-\epsilon m}\geq(1-\epsilon)^{m}$ \end_inset . \end_inset \layout Standard Approximations which hold for arbitrary (nonzero) error rates rely upon the Chernoff bound which we state next, for completeness. \layout Lemma \begin_inset LatexCommand \label{lem:Relative-Entropy-Chernoff} \end_inset (Relative Entropy Chernoff Bound) For \begin_inset Formula $\frac{k}{m}0$ \end_inset , we have: \layout Proof \begin_inset Formula \[ \textrm{Bin}\left(\frac{k}{m},p\right)=\Pr_{X^{m}\sim p^{m}}\left(\frac{1}{m}\sum_{i=1}^{m}X_{i}\leq\frac{k}{m}\right)=\Pr_{X^{m}\sim p^{m}}\left(e^{m\lambda\frac{1}{m}\sum_{i=1}^{m}X_{i}}\leq e^{m\lambda\frac{k}{m}}\right)\] \end_inset \begin_inset Formula \[ =\Pr_{X^{m}\sim p^{m}}\left(e^{-\lambda\sum_{i=1}^{m}X_{i}}\geq e^{-\lambda k}\right)\] \end_inset Using Markov's inequality ( \begin_inset Formula $X\geq0$ \end_inset , \begin_inset Formula $EX=\mu$ \end_inset , \begin_inset Formula $\Rightarrow\Pr(X\geq\delta)\leq\frac{\delta}{\mu}$ \end_inset ): \begin_inset Formula \[ \leq\frac{e^{-\lambda k}}{E_{X^{m}\sim p^{m}}e^{-\lambda\sum_{i=1}^{m}X_{i}}}\] \end_inset Using Jensen's inequality ( \begin_inset Formula $\frac{1}{x}$ \end_inset concave \begin_inset Formula $\Rightarrow$ \end_inset \begin_inset Formula $\frac{1}{EX}\leq E\frac{1}{X}$ \end_inset ), we get: \begin_inset Formula \[ \leq e^{-\lambda k}E_{X^{m}\sim p^{m}}e^{\lambda\sum_{i=1}^{m}X_{i}}\] \end_inset Using independence, we get: \begin_inset Formula \[ =e^{-\lambda k}\left(pe^{\lambda}+(1-p)\right)^{m}\] \end_inset and rewriting, we get: \begin_inset Formula \[ =e^{-mf(\lambda)}\] \end_inset where \begin_inset Formula $f(\lambda)=\lambda\frac{k}{m}-\ln\left(pe^{\lambda}+1-p\right)$ \end_inset . \layout Proof \begin_inset Formula $\lambda$ \end_inset is a free parameter which can be optimized to find the tightest possible bound. To find the optimal value, find \begin_inset Formula $\lambda^{*}$ \end_inset so that \begin_inset Formula $f'(\lambda^{*})=0$ \end_inset . \begin_inset Formula \[ 0=f'(\lambda^{*})=\frac{k}{m}-\frac{pe^{\lambda^{*}}}{pe^{\lambda^{*}}+1-p}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}}{p}\left(pe^{\lambda^{*}}+1-p\right)=e^{\lambda^{*}}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}}{p}-\frac{k}{m}=\left(1-\frac{k}{m}\right)e^{\lambda^{*}}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}=e^{\lambda^{*}}\] \end_inset Using this, we get: \begin_inset Formula \[ f(\lambda^{*})=\frac{k}{m}\ln\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}-\ln\left(\frac{1-p}{1-\frac{k}{m}}\right)\] \end_inset \begin_inset Formula \[ =\frac{k}{m}\ln\frac{\frac{k}{m}}{p}+(1-\frac{k}{m})\ln\left(\frac{1-\frac{k}{m}}{1-p}\right)\] \end_inset \begin_inset Formula \[ =\textrm{KL}(\frac{k}{m}||p)\] \end_inset \layout Standard Using the Chernoff bound, we can loosen the test set bound to achieve a more analytic fom. \layout Corollary (Agnostic Test Set Bound) For all classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\textrm{KL}\left(\hat{c}_{S}||c_{D}\right)\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta\] \end_inset where \begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$ \end_inset is the Kullback-Leibler divergence between two coins of bias \begin_inset Formula $q,p$ \end_inset with \begin_inset Formula $q\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta P(c)\] \end_inset then, we apply the union bound in a nonuniform manner. The union bound says that \begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$ \end_inset . Applying the union bound to every classifier with a positive measure gives a total probability of failure of \begin_inset Formula \[ \sum_{c}\delta P(c)=\delta\sum_{c}P(c)=\delta\] \end_inset which implies \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta\] \end_inset Negating this again completes the proof. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename nonuniform_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \begin_inset Graphics filename deep_cut_nonuniform_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \layout Standard \begin_inset Graphics filename deep_cut_consistent_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \begin_inset Graphics filename deep_cut_bound.ps display monochrome width 30col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-train-proof} \end_inset The sequence of pictures is the pictorial representation of the proof of the Occam's Razor Bound. The first figure shows a set of classifiers, each with a tail cut of some varying depth. The second picture shows an observed training error and the possible Binomial distributions for a chosen classifier. The third picture shows the true errors which are consistent with the observati on and the tail cuts. The fourth picture shows the true error bound. \end_inset \layout Subsubsection Occam's Razor Corollaries \layout Standard Just as with the test set bound, we can relax the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with the Chernoff approximations to get a somewhat more tractable expression. \layout Corollary \begin_inset LatexCommand \label{th-horb} \end_inset (Chernoff Occam's Razor Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over classifiers, for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\leq\hat{c}_{S}+\sqrt{\frac{\ln\frac{1}{P(c)}+\ln\frac{1}{\delta}}{2m}}\right)\geq1-\delta\] \end_inset \layout Proof approximate the Binomial tail with the Chernoff Bound (lemma \begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff} \end_inset ). \layout Standard Many people are more familiar with a degenerate form of this bound where \begin_inset Formula $P(c)=\frac{1}{|H|}$ \end_inset and \begin_inset Formula $H$ \end_inset is some set of classifiers. In that case, simply replace \begin_inset Formula $\ln\frac{1}{P(c)}$ \end_inset with \begin_inset Formula $\ln|H|$ \end_inset . The form presented here is both more general and necessary if the bound is to be used in practice. \layout Standard Other corollaries as in section \begin_inset LatexCommand \ref{sub:Corollaries} \end_inset exist for the Occam's Razor bound. In general, just substitute \begin_inset Formula $\delta\rightarrow\delta P(c)$ \end_inset . \layout Subsubsection Occam's Razor Lower bound \layout Standard Just as for the test set bound, a lower bound of the same form applies. \layout Theorem \begin_inset LatexCommand \label{th-lorb} \end_inset (Occam's Razor Lower Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over the classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall c:\,\, c_{D}\geq\min_{p}\{ p:\,\,1-\textrm{Bin}\left(\hat{c}_{S},p\right)\geq\delta P(c)\}\right)\geq1-\delta\] \end_inset \layout Example* (continued) Suppose that instead of having \begin_inset Formula $100$ \end_inset test examples, we had \begin_inset Formula $100$ \end_inset train examples. Also suppose that before seeing the train examples, we committed to \begin_inset Formula $P(c)=0.1$ \end_inset for \begin_inset Formula $c$ \end_inset the constant classifier which predicts \begin_inset Quotes eld \end_inset no rain \begin_inset Quotes erd \end_inset . Then, the Chernoff approximations of the upper and lower bound give the interval, \begin_inset Formula $c_{D}\in[0.22,0.54]$ \end_inset . With an exact calculation, we get \begin_inset Formula $c_{D}\in[0.26,0.51]$ \end_inset . \layout Subsubsection The state of the art \layout Standard A very large amount of work has been done on train set bounds. In addition to those included here, there is: \layout Enumerate Reinterpretations of uniform convergence \begin_inset LatexCommand \cite{Vapnik} \end_inset results for continuously parameterized classifiers. \layout Enumerate Reinterpretations of PAC convergence \begin_inset LatexCommand \cite{Valiant} \end_inset results. \layout Enumerate Shell bounds \begin_inset LatexCommand \cite{Shell} \end_inset which take advantage of the distribution of true error rates on classifiers. \layout Enumerate Train and Test bounds \begin_inset LatexCommand \cite{TnT} \end_inset which combine train set and test set bounds. \layout Standard Of this large amount of work only a small fraction has been shown to be useful on real-world learning algorithm/learning problem pairs. The looseness of train set based bounds often precludes analytical use. \layout Subsection The Occam's Razor Bound is sometimes Tight \layout Standard \begin_inset LatexCommand \label{sec-lower_upper} \end_inset \layout Standard The question of tightness for train set bounds is important to address, as many of them have been extremely loose. The simplest method to address this tightness is constructive: exhibit a learning problem/algorithm pair for which the bound is almost achieved. For the test set bound, this is trivial as any classifier with a large enough true error will achieve the bound. For the train set bound, this is not so trivial. \layout Standard How tight is the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset )? The answer is \emph on sometimes \emph default tight. In particular, we can exhibit a set of learning problems where the Occam's Razor bound can not be made significantly tighter as a function of the observables, \begin_inset Formula $m$ \end_inset , \begin_inset Formula $\delta$ \end_inset , \begin_inset Formula $P(c)$ \end_inset , and \begin_inset Formula $\hat{c}_{S}$ \end_inset . After fixing the value of these quantities we construct a learning problem exhibiting this near equivalence to the Occam's Razor bound. \layout Theorem \begin_inset LatexCommand \label{th-dhlub} \end_inset (Occam's Razor tightness) For all \begin_inset Formula $P(c)$ \end_inset , \begin_inset Formula $\frac{k}{m}$ \end_inset , \begin_inset Formula $\delta$ \end_inset there exists a learning problem and algorithm such that: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\geq\delta-\delta^{2}\] \end_inset Furthermore, if \begin_inset Formula $c^{*}$ \end_inset is the classifier with minimal training error, then: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(c_{D}^{*}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S}^{*},\delta P(c)\right)\right)\geq\delta-\delta^{2}\] \end_inset \layout Standard Intuitively, this theorem implies that we can not improve significantly on the the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) without using extra information about our learning problem. \layout Proof The proof is constructive: we create a learning problem on which large deviation s are likely. We start with a prior \begin_inset Formula $P(c)$ \end_inset , probability of error \begin_inset Formula $\delta$ \end_inset , \begin_inset Formula $m$ \end_inset , and a targeted empirical error rate, \begin_inset Formula $\frac{k}{m}$ \end_inset . For succinctness we assume that \begin_inset Formula $P(c)$ \end_inset has support on a finite set of size \begin_inset Formula $n$ \end_inset . \layout Proof To define the learning problem, let: \begin_inset Formula $X=\{0,1\}^{n}$ \end_inset and \begin_inset Formula $Y=\{0,1\}$ \end_inset . \layout Proof The distribution \begin_inset Formula $D$ \end_inset can be drawn by first selecting \begin_inset Formula $Y$ \end_inset with a single unbiased coin flip, and then choosing the \begin_inset Formula $i$ \end_inset th component of the vector \begin_inset Formula $X$ \end_inset independently, \begin_inset Formula $\Pr((X_{1},...,X_{n})|Y)=\Pi_{i=1}^{n}\Pr(X_{i}|Y)$ \end_inset . The individual components are chosen so \begin_inset Formula $\Pr(X_{i}=Y|Y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$ \end_inset . \layout Proof The classifiers we consider just use one feature to make their classification: \begin_inset Formula $c_{i}(x)=x_{i}$ \end_inset . The true error of these classifiers is given by: \begin_inset Formula $c_{D}=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$ \end_inset . \layout Proof This particular choice of true errors implies that if any classifier has a too-small train error rate, then the classifier with minimal train error must have a too-small train error. \layout Proof Using this learning problem, we know that: \begin_inset Formula \[ \forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=\delta P(c)\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow\forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=1-\delta P(c)\] \end_inset (independence) \begin_inset Formula \[ \Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\forall c\,\, c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\prod_{c}\left(1-\delta P(c)\right)\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\exists c\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\] \end_inset \begin_inset Formula \[ =\sum_{i=1}^{n}\delta P(c_{i})\left(1-\Pr_{S\sim D^{m}}\left(\exists c\in\{ c_{1},...,c_{i-1}\}\,\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\right)\] \end_inset \begin_inset Formula \[ \geq\sum_{i=1}^{n}\delta P(c_{i})(1-\delta)\] \end_inset \begin_inset Formula \[ =\delta-\delta^{2}\] \end_inset \layout Standard The lower bound theorem implies that we can not improve an Occam's Razor like statement. However, it is important to note that large improvements are possible if we use other sources of information. To see this, just note the case where every single classifier happens to be the same. In this case the \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset bound would the be the \emph on test \emph default set bound, rather than the train set bound. The PAC-Bayes bound and the Sample Compression bound presented in the next sections use other sources of information. \layout Subsection Train set bound implications \layout Standard \begin_inset LatexCommand \label{subsec-train-set-practice} \end_inset \layout Subsubsection results \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \align center \begin_inset Graphics filename test_vs_micro.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-microchoice-holdout} \end_inset This is a plot comparing confidence intervals built based upon the test set bound ( \begin_inset LatexCommand \ref{th-hb} \end_inset ) with an 80%/20% train/test split on the left and the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with all data in the training set on the right. The Occam's razor bound is sometimes superior on the smaller data sets and always nonvacuous (in contrast to many other train set bounds). \end_inset \layout Standard The Occam's Razor bound is strongly related to compression. In particular, for any self-terminating description language, \begin_inset Formula $d(c)$ \end_inset , we can associate a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)=2^{-|d(c)|}$ \end_inset with the property that \begin_inset Formula $\sum_{c}P(c)\leq1$ \end_inset . Consequently, short description length classifiers tend to have a tighter convergence and the penalty term, \begin_inset Formula $\ln\frac{1}{P(c)}$ \end_inset is the number of \begin_inset Quotes eld \end_inset nats \begin_inset Quotes erd \end_inset (bits base e). For any language fixed before seeing the train set, classifiers with shorter description lengths have tighter bounds on the true error rate. \layout Standard One particularly useful description language to consider is the execution trace of a learning algorithm. If we carefully note the sequence of data-dependent choices which a learning algorithm makes, then the output classifier can be specified by a sequence such as \begin_inset Quotes eld \end_inset 2nd choice, third choice, first choice, etc... \begin_inset Quotes erd \end_inset This is the idea behind microchoice bounds \begin_inset LatexCommand \cite{MC_journal} \end_inset . Results for this approach are reported in Figure \begin_inset LatexCommand \ref{fig-microchoice-holdout} \end_inset and are strong enough to act as an empirical existence proof that Occam's Razor bounds can be made tight enough for useful application. \layout Section PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sec:PAC-Bayes-Bound} \end_inset \layout Standard The PAC-Bayes bound \begin_inset LatexCommand \cite{PB} \end_inset is particularly exciting because it can provide quantitatively useful results for classifiers with \emph on real valued \emph default parameters. This includes such commonly used classifiers as Support Vector Machines and Neural Networks. \begin_inset Foot collapsed true \layout Standard There is a caveat here---the bound only applies to stochastic versions of the classifiers. However, the probability that the stochastic classifier differs from the classifier can be made very small. \end_inset This section is divided into three parts: \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:The-PAC-Bayes-Bound} \end_inset States and proves the PAC-Bayes Bound. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:How-Tight-is} \end_inset shows that the PAC-Bayes Bound is nearly as tight as possible given the observations. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:Application-of-the} \end_inset discusses results from the application of the PAC-Bayes bound to support vector machines. \layout Subsection The PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sub:The-PAC-Bayes-Bound} \end_inset \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename pac-bayes.eps scale 50 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:PB-POL} \end_inset The \begin_inset Quotes eld \end_inset interactive proof of learning \begin_inset Quotes erd \end_inset associated with the PAC-Bayes bound. The figure is the same as for the Occam's razor bound, except that instead of committing to a single classifier, the PAC-Bayes bound applies to any distribution over classifiers. \end_inset \layout Standard The PAC-Bayes bound has been improved by tightening \begin_inset LatexCommand \cite{averaging} \end_inset and then with a much simpler proof \begin_inset LatexCommand \cite{Matthias} \end_inset since it was originally stated. The statement and proof presented here incorporate these improvements and improve on them slightly. \layout Standard The PAC-Bayes bound is dependent upon two derived quantities, an average true error: \begin_inset Formula \[ Q_{D}\equiv E_{c\sim Q}c_{D}\] \end_inset and an average train error: \begin_inset Formula \[ \hat{Q}_{S}\equiv E_{c\sim Q}\hat{c}_{S}\] \end_inset These quantities can be interpreted as the train error and true error of the meta-classifier which chooses a classifier according to \begin_inset Formula $Q$ \end_inset every time a classification is made. If we refer to this meta-classifier as \begin_inset Formula $Q$ \end_inset , the notation for error rates is consistent with our earlier notation. \layout Standard The \begin_inset Quotes eld \end_inset interactive proof of learning \begin_inset Quotes erd \end_inset viewpoint of the PAC-Bayes bound is shown in figure \begin_inset LatexCommand \ref{fig:PB-POL} \end_inset . It is essentially the same as for the Occam's Razor bound except for the commitment to the metaclassifier \begin_inset Formula $Q$ \end_inset rather than the classifier \begin_inset Formula $c$ \end_inset . \layout Theorem (PAC-Bayes Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over the classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall Q(c):\,\,\textrm{KL}\left(\hat{Q}_{S}||Q_{D}\right)\leq\frac{\textrm{KL}(Q||P)+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\] \end_inset Here \begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$ \end_inset for \begin_inset Formula $q\frac{\mu}{\delta})\leq\delta$ \end_inset ), we get: \layout Proof \begin_inset Formula \[ \forall P\,\,\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\leq\frac{m+1}{\delta}\right)\leq\delta\] \end_inset \layout Standard The next lemma shows that a certain expectation is bounded by the Kullback-Leibl er distance between two coin flips, just as for the relative entropy Chernoff bound (Lemma \begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff} \end_inset ). \layout Lemma \begin_inset LatexCommand \label{lemma: KL-approx} \end_inset For all \begin_inset Formula $Q(c)$ \end_inset : \begin_inset Formula $\frac{E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}}{m}\geq\textrm{KL}(\hat{Q}_{S}||Q_{D})$ \end_inset \layout Proof \begin_inset Formula $\frac{1}{m}E_{c\sim Q}\ln\frac{1}{\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)c_{D}^{m\hat{c}_{S}}(1-c_{D})^{m(1-\hat{c}_{S})}}$ \end_inset \begin_inset Formula \[ =E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]-\frac{E_{c\sim Q}\ln\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)}{m}\] \end_inset Jensen's inequality ( \begin_inset Formula $f$ \end_inset concave \begin_inset Formula $\Rightarrow$ \end_inset \begin_inset Formula $Ef(X)\geq f(EX)$ \end_inset ): \begin_inset Formula \[ E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]\geq\hat{Q}_{S}\ln\frac{1}{Q_{D}}+\left(1-\hat{Q}_{S}\right)\ln\frac{1}{1-Q_{D}}\] \end_inset \begin_inset Formula \[ \textrm{ and \,\,\,\,}\frac{E_{c\sim Q}\ln\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)}{m}\leq\frac{E_{c\sim Q}\ln e^{mH\left(\hat{c}_{S}\right)}}{m}=\frac{E_{c\sim Q}mH\left(\hat{c}_{S}\right)}{m}\leq H\left(\hat{Q}_{S}\right)\] \end_inset \layout Standard With these two lemmas, the PAC-Bayes theorem is easy to prove. \layout Proof (Of the PAC-Bayes theorem) Let \begin_inset Formula \[ P_{G}(c,\hat{c}_{S})=\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}}P(c)\] \end_inset \begin_inset Formula \[ \Rightarrow0\leq\textrm{KL}(Q||P_{G})=E_{c\sim Q}\ln\frac{Q(c)}{P(c)}\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset \begin_inset Formula \[ =\textrm{KL}(Q||P)-E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset \begin_inset Formula \[ \Rightarrow E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}\leq\textrm{KL}(Q||P)+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset Now, use lemma \begin_inset LatexCommand \ref{lem:exp-approx} \end_inset on the right hand term and lemma \begin_inset LatexCommand \ref{lemma: KL-approx} \end_inset on the left hand term to finish the proof. \layout Standard Note that one of the last inequalities in the proof is not necessary, and so a slightly tighter bound can be proved. \layout Subsection The PAC-Bayes Bound is sometimes Tight \layout Standard \begin_inset LatexCommand \label{sub:How-Tight-is} \end_inset \layout Standard Since the PAC-Bayes bound is (almost) a generalization of the Occam's Razor bound, the tightness result for Occam's Razor also applies to PAC-Bayes bounds. \layout Subsection Application of the PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sub:Application-of-the} \end_inset \layout Standard Applying the PAC-Bayes bound requires some significant specialization \begin_inset LatexCommand \cite{PB-margin} \end_inset . Here, we specialize to classifiers of the form: \begin_inset Formula \[ c(x)=\textrm{sign}\left(\vec{w}\cdot\vec{x}\right)\] \end_inset Note that via the kernel trick, Support Vector Machines also have this form. \layout Standard The specialization is naturally expressed in terms of a few derived quantities: \layout Enumerate The cumulative distribution of a Gaussian. Let \begin_inset Formula $\bar{F}(x)=\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$ \end_inset . Here we use \begin_inset Formula $\bar{F}$ \end_inset rather than \begin_inset Formula $F$ \end_inset to denote the fact that we integrate from \begin_inset Formula $\infty$ \end_inset to \begin_inset Formula $x$ \end_inset rather than \begin_inset Formula $-\infty$ \end_inset to \begin_inset Formula $x$ \end_inset . \layout Enumerate A \begin_inset Quotes eld \end_inset posterior \begin_inset Quotes erd \end_inset distribution \begin_inset Formula $Q(\vec{w},\mu)$ \end_inset which is \begin_inset Formula $N(\mu,1)$ \end_inset for some \begin_inset Formula $\mu>0$ \end_inset in the direction of \begin_inset Formula $\vec{w}$ \end_inset and \begin_inset Formula $N(0,1)$ \end_inset in all perpendicular directions. \layout Enumerate The normalized margin of the examples \begin_inset Formula \[ \gamma(\vec{x},y)=\frac{y\vec{w}\cdot\vec{x}}{||\vec{w}||||\vec{x}||}\] \end_inset \layout Enumerate A stochastic error rate, \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset \layout Standard This last quantity in particular is very important to understand. Consider the case as \begin_inset Formula $\mu$ \end_inset approaches \begin_inset Formula $\infty$ \end_inset . When the margin is negative (indicating an incorrect classification), \begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset approaches \begin_inset Formula $1$ \end_inset . When the margin is positive \begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset approaches \begin_inset Formula $0$ \end_inset . Thus, \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$ \end_inset is a softened form of the empirical error \begin_inset Formula $\hat{c}_{S}$ \end_inset which takes into account the margin. \layout Corollary (PAC-Bayes Margin Bound) For all distributions \begin_inset Formula $D$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset , we have: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall\vec{w},\mu:\,\,\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||Q(\vec{w},\mu)_{D}\right)\leq\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\] \end_inset \layout Proof The proof is very simple. We just choose the prior \begin_inset Formula $P=N(0,1)^{n}$ \end_inset and work out the implications. \layout Proof Since the Gaussian distribution is the same in every direction, we can reorient the coordinate system of the prior to have one dimension parallel to \begin_inset Formula $w$ \end_inset . Since the draws in the parallel and perpendicular directions are independent, we have: \begin_inset Formula \[ \textrm{KL}(Q||P)=\textrm{KL}(Q_{\perp}||P_{\perp})+\textrm{KL}(N(\mu,1)||N(0,1))\] \end_inset \begin_inset Formula \[ =\frac{\mu^{2}}{2}\] \end_inset as required. \layout Proof All that remains is calculating the stochastic error rate \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$ \end_inset . Fix a particular example, \begin_inset Formula $(\vec{x},y)$ \end_inset . This example has a natural decomposition \begin_inset Formula $\vec{x}=\vec{x}_{||}+\vec{x}_{\perp}$ \end_inset into a component, \begin_inset Formula $\vec{x}_{||}$ \end_inset parallel to the weight vector \begin_inset Formula $\vec{w}$ \end_inset and a component \begin_inset Formula $\vec{x}_{\perp}$ \end_inset perpendicular to the weight vector. \layout Proof To classify, we draw weight vector \begin_inset Formula $\vec{w}^{'}$ \end_inset from \begin_inset Formula $\hat{Q}(\vec{w},\mu)$ \end_inset . This \begin_inset Formula $\vec{w}^{'}$ \end_inset consists of \begin_inset Formula $3$ \end_inset components, \begin_inset Formula $\vec{w}^{'}=\vec{w}_{||}^{'}+\vec{w}_{\perp}^{'}+\vec{w}_{\perp\perp}^{'}$ \end_inset . Here \begin_inset Formula $\vec{w}_{||}^{'}\sim N(\mu,1)$ \end_inset is parallel to the original weight vector, \begin_inset Formula $\vec{w}_{\perp}^{'}\sim N(0,1)$ \end_inset which is parallel to \begin_inset Formula $\vec{x}_{\perp}$ \end_inset and \begin_inset Formula $\vec{w}_{\perp\perp}^{'}$ \end_inset is perpendicular to both \begin_inset Formula $\vec{w}$ \end_inset and \begin_inset Formula $\vec{x}$ \end_inset . We have: \begin_inset Formula \[ \hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\neq\textrm{sign}\left(\vec{w}^{'}\cdot\vec{x}\right)\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\vec{w}\cdot\vec{x}\leq0\right)\] \end_inset If we let \begin_inset Formula $w_{||}^{'}=||w_{||}^{'}||$ \end_inset , \begin_inset Formula $w_{\perp}^{'}=||w_{\perp}^{'}||$ \end_inset , \begin_inset Formula $x_{||}=||\vec{x}_{||}||$ \end_inset , and \begin_inset Formula $x_{\perp}=||\vec{x}_{\perp}||$ \end_inset , and assume (without loss of generality) that \begin_inset Formula $y=1$ \end_inset we get: \begin_inset Formula \[ =E_{\vec{x},y\sim S,w_{||}^{'}\sim N(\mu,1),w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{w_{||}^{'}\sim N(\mu,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq-yz^{'}-yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] \end_inset Using the symmetricity of the Gaussian, this is: \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq yz^{'}+yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] \end_inset Using the fact that the sum of two Gaussians is a Gaussian: \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{v\sim N\left(0,1+\frac{x_{\perp}^{2}}{x_{||}^{2}}\right)}I\left(y\mu\leq yv\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{v\sim N\left(0,\frac{1}{\gamma(\vec{x},y)^{2}}\right)}I\left(y\mu\leq yv\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)\] \end_inset finishing the proof. \layout Standard Using the corollary, the true error bound \begin_inset Formula $\bar{Q}(\vec{w},\mu)_{D}$ \end_inset satisfies the equation: \begin_inset Formula \[ \textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||\bar{Q}(\vec{w},\mu)_{D}\right)=\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\] \end_inset This is an implicit equation for \begin_inset Formula $\bar{Q}$ \end_inset which can be easily solved numerically. \layout Standard The bound is stated in terms of dot products here, so naturally it is possible to kernelize the result using methods from \begin_inset LatexCommand \cite{kernelize} \end_inset . In kernelized form, the bound applies to classifiers (as output by SVM learning algorithms) of the form: \begin_inset Formula \begin{equation} c(x)=\textrm{sign}\left(\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)\label{eq:class-kernel}\end{equation} \end_inset Since, by assumption, \begin_inset Formula $k$ \end_inset is a kernel, we know that \begin_inset Formula $k(x_{i},x)=\vec{\Phi}(x_{i})\cdot\vec{\Phi}(x)$ \end_inset where \begin_inset Formula $\vec{\Phi}(x)$ \end_inset is some projection into another space. In kernelized form, we get \begin_inset Formula $\vec{w}\cdot\vec{x}=\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)$ \end_inset , \begin_inset Formula $\vec{x}\cdot\vec{x}=k(x,x)$ \end_inset , \begin_inset Formula $\vec{w}\cdot\vec{w}=\sum_{i,j}\alpha_{i}\alpha_{j}k(x_{i},x_{j})$ \end_inset , defining all of the necessary quantities to calculate the normalized margin, \begin_inset Formula \[ \gamma(x,y)=\frac{\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)}{\sqrt{k(x,x)\sum_{i,j=1,1}^{m,m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})}}\] \end_inset \layout Standard One element remains, which is the value of \begin_inset Formula $\mu$ \end_inset . Unfortunately the bound can be nonmonotonic in the value of \begin_inset Formula $\mu$ \end_inset , but it turns out that for classifiers learned by support vector machines on reasonable datasets, there is only one value of \begin_inset Formula $\mu$ \end_inset which is (locally, and thus globally) minimal. A binary search over some reasonable range of \begin_inset Formula $\mu$ \end_inset (say from \begin_inset Formula $1$ \end_inset to \begin_inset Formula $100$ \end_inset ) can find the minima quickly, given the precomputation of the margins. It is worth noting again here that we are not \begin_inset Quotes eld \end_inset cheating \begin_inset Quotes erd \end_inset ---the bound holds for all values of \begin_inset Formula $\mu$ \end_inset simultaneously. \layout Standard The computational time of the bound calculation is dominated by the calculation of the margins which is \begin_inset Formula $O\left(m^{2}\right)$ \end_inset where \begin_inset Formula $m$ \end_inset is the number of support vectors with a nonzero associated \begin_inset Formula $\alpha$ \end_inset . This computational time is typically dominated by the time of the SVM learning algorithm. \layout Subsubsection Results \layout Standard Application of this bound to support vector machines is of significant importanc e because SVMs are reasonably effective and adaptable classifiers in common and widespread use. A SVM learns a kernelized classifier as per equation \begin_inset LatexCommand \ref{eq:class-kernel} \end_inset \begin_inset Foot collapsed false \layout Standard Some SVM learning algorithms actually learn a classifier of the form: \begin_inset Formula $c(x)=\textrm{sign}\left(b+\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)$ \end_inset . We don't handle this form here. \end_inset . \layout Standard We apply the support vector machine to 8 UCI database problems chosen to fit the criteria \begin_inset Quotes eld \end_inset two classes \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset real valued input features \begin_inset Quotes erd \end_inset . The problems vary in size over an order of magnitude from \begin_inset Formula $145$ \end_inset to \begin_inset Formula $1428$ \end_inset examples. In figure \begin_inset LatexCommand \ref{fig:test} \end_inset we use a 70/30 train/test split of the data. \layout Standard In all experiments, we use SVMlight with a Gaussian kernel and the default bandwidth. Results for other choices of the \begin_inset Quotes eld \end_inset C \begin_inset Quotes erd \end_inset , bandwidth, and/or kernel appear to be qualitatively similar (although of course differing quantitatively). \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_and_margin.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:test} \end_inset \layout Standard This figure shows the results of applying SVMlight to 8 datasets with a Gaussian kernel and a 70/30 train/test split. The observed test error rate is graphed as an X. On the test set, we calculate a Binomial confidence interval (probability of bound failure = \begin_inset Formula $0.01$ \end_inset ) which upper bounds the true error rate. On the training set we calculate the PAC-Bayes margin bound for an optimized choice of \begin_inset Formula $\mu$ \end_inset . \end_inset It is important to note that the PAC-Bayes margin bound is \emph on not \emph default precisely a bound (or confidence interval) on the true error rate of the learned classifier. Instead, it is a true error rate bound on an associated stochastic classifier chosen so as to have a similar test error rate. These bounds can be regarded as bounds for the original classifier only under an additional assumption: that picking a classifier according to the majority vote of this stochastic distribution does not worsen the true error rate. This is not true in general, but may be true in practice. \layout Standard It is of course unfair to compare the train set bound with the test set bound on a 70/30 train/test split because a very tight train set bound would imply that it is unnecessary to even have a test set. In figure \begin_inset LatexCommand \ref{fig:full-train} \end_inset we compare the true error bounds on all of the data to the true error bounds generated from the 70/30 train/test split. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_vs_margin.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:full-train} \end_inset \layout Standard In addition to comparing with everything in figure \begin_inset LatexCommand \ref{fig:test} \end_inset , we graph the margin bound when \emph on all \emph default of the data is used for the train set. Note that it improves somewhat on the margin bound calculated using the 70% train set (7/10 margin bound), but not enough to compete with the test set bound. \end_inset The results show that the PAC-Bayes margin bound is tight enough to give useful information, but still not competitive with the test set bounds. This is in strong contrast with a tradition of quantitatively impractical margin bounds. There are several uses available for bounds which provide some information but which are not fully tight. \layout Enumerate They might be combined with a train/test bound \begin_inset LatexCommand \cite{TnT} \end_inset . \layout Enumerate The train set bound might easily become tighter for smaller sample sizes. This was observed in \begin_inset LatexCommand \cite{TnT} \end_inset . \layout Enumerate The train set bound might still have the right \begin_inset Quotes eld \end_inset shape \begin_inset Quotes erd \end_inset for choosing an optimal parameter setting, such as \begin_inset Quotes eld \end_inset C \begin_inset Quotes erd \end_inset in a Support Vector Machine. \layout Section Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sec:Sparsity-Bound} \end_inset \layout Standard The sample compression bound \begin_inset LatexCommand \cite{LW} \end_inset , \begin_inset LatexCommand \cite{FW} \end_inset is like the PAC-Bayes bound in that it applies to arbitrary precision continuou s valued classifiers. Unlike the PAC-Bayes bound, it applies meaningfully to nonstochastic classifier s. Mainstream learning algorithms do not optimize the sample compression metric, so the bound application is somewhat rarer. Nonetheless, there do exist some reasonably competitive learning algorithms for which the sample compression bound produces significant results. \layout Standard The section is organized as follows: \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:The-Sample-Compression-Bound} \end_inset states and proves the sample compression bound. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:SC-tightness} \end_inset shows that the sample compression bound is nearly as tight as possible given the observations. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:SC-application} \end_inset discusses results from the application of the sample compression bound to support vector machines. \layout Subsection The Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sub:The-Sample-Compression-Bound} \end_inset \layout Standard The Sample Compression bound stated here is differs somewhat from other results ( \begin_inset LatexCommand \cite{LW} \end_inset \begin_inset LatexCommand \cite{FW} \end_inset and others) by generalization and simplification but the bound behavior is qualitatively identical. \layout Standard Suppose we have a learning algorithm, \begin_inset Formula $A(S)$ \end_inset whose training is \begin_inset Quotes eld \end_inset sparse \begin_inset Quotes erd \end_inset \begin_inset Foot collapsed true \layout Standard This is satisfied, for example, by the Support Vector Machine algorithm which only depends upon the set of support vectors. \end_inset in the sense that the output classifier is dependent upon only a subset of the data, \begin_inset Formula $A(S)=A(S')$ \end_inset for \begin_inset Formula $S'\subseteq S$ \end_inset . The sample compression bound is dependent on the error rate, \begin_inset Formula $\hat{c}_{S-S'}$ \end_inset on the subset \begin_inset Formula $S-S'$ \end_inset . The motivation here is that the examples which the learning algorithm does \emph on not \emph default depend upon are \begin_inset Quotes eld \end_inset almost \begin_inset Quotes erd \end_inset independent and so we can \begin_inset Quotes eld \end_inset almost \begin_inset Quotes erd \end_inset get a test set bound. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename sample_compression.eps scale 50 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:SC-POL} \end_inset The interactive proof of learning for the sample compression bound. Note that the learning algorithm is arbitrary here, similar to the test set bound. \end_inset \layout Standard Viewed as an interactive proof of learning (in figure \begin_inset LatexCommand \ref{fig:SC-POL} \end_inset ), the sample compression bound is unique amongst training set bounds because it does not require \emph on any \emph default initial commitment to a measure over the classifiers. \layout Theorem (Sample Compression Bound) For all \begin_inset Formula $\delta\in(0,1]$ \end_inset , \begin_inset Formula $D$ \end_inset , \begin_inset Formula $A$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\delta\] \end_inset \layout Proof Suppose we knew in advance that the learning algorithm will not depend upon some subset of the examples. Then, the \begin_inset Quotes eld \end_inset undependent \begin_inset Quotes erd \end_inset subset acts like a test set and gives us a test set bound: \begin_inset Formula \[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset (Note that, technically, it is possible to refer to \begin_inset Formula $S'$ \end_inset unambiguously before randomizing over \begin_inset Formula $S$ \end_inset by specifying the indices of \begin_inset Formula $S$ \end_inset contained in \begin_inset Formula $S'$ \end_inset .) Negating this, we get: \layout Proof \begin_inset Formula \[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset and using the union bound ( \begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$ \end_inset over each possible subset, \begin_inset Formula $S'$ \end_inset , we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\delta\] \end_inset Negating this again gives us the proof. \layout Subsection The Sample Compression Bound is Sometimes Tight \layout Standard \begin_inset LatexCommand \label{sub:SC-tightness} \end_inset \layout Standard We can construct a learning algorithm/learning problem pair such that the Sample compression bound is provably near optimal, as a function of it's observables. \layout Theorem (Sample Compression Tightness) For all \begin_inset Formula $\delta\in(0,1]$ \end_inset , \begin_inset Formula $\frac{k}{m}$ \end_inset , there exists a distribution \begin_inset Formula $D$ \end_inset and learning algorithm \begin_inset Formula $A$ \end_inset s.t. \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)>\delta-\delta^{2}\] \end_inset furthermore, if \begin_inset Formula $S^{*}$ \end_inset minimizes \begin_inset Formula $\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)$ \end_inset , then \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(c^{*}=A(S^{*}):\,\,\, c_{D}^{*}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S^{*}}^{*},\frac{\delta}{m{{m \choose |S-S^{*}|}}}\right)\right)>\delta-\delta^{2}\] \end_inset \layout Proof The proof is constructive and similar to the Occam's Razor tightness result. In particular, we show how to construct a learning algorithm which outputs classifiers that err independently depending on the subset \begin_inset Formula $S'$ \end_inset used. \layout Proof Consider an input space \begin_inset Formula $X=\{0,1\}^{2^{m}}$ \end_inset . Each variable in the input space \begin_inset Formula $x_{S'}$ \end_inset can be thought of as indexing a unique subset \begin_inset Formula $S'\subseteq S$ \end_inset of the examples. In the rest of the proof, we index variables by the subset they correspond to. \layout Proof Draws from the distribution \begin_inset Formula $D$ \end_inset can be made by first flipping an unbiased coin to get \begin_inset Formula $y=1$ \end_inset with probability \begin_inset Formula $0.5$ \end_inset and \begin_inset Formula $y=-1$ \end_inset with probability \begin_inset Formula $0.5$ \end_inset . The distribution on \begin_inset Formula $X$ \end_inset consists of a set of independent values after conditioning on \begin_inset Formula $y$ \end_inset . Choose \begin_inset Formula \[ \Pr(x_{S'}\neq y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset Now, the learning algorithm \begin_inset Formula $A(S')$ \end_inset is very simple---it just outputs the classifier \begin_inset Formula $c(x)=x_{S'}$ \end_inset . On the set \begin_inset Formula $S-S'$ \end_inset , we have: \begin_inset Formula \[ \forall S'\,\,\,\Pr_{S\sim D^{m}}\left(\hat{c}_{S-S'}\geq\frac{k}{m}\right)=1-\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset Using independence, we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}\geq\frac{k}{m}\right)=\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset Negating, we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}<\frac{k}{m}\right)=1-\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset and doing some algebra, we get the result. \layout Subsection Application of the Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sub:SC-application} \end_inset \layout Standard One obvious application of the Sample Compression bound is to support vector machines, since the learned classifier is only dependent on the set of support vectors. If \begin_inset Formula $S'$ \end_inset is the set of support vectors then \begin_inset Formula $S-S'$ \end_inset is the set of nonsupport vectors. Unfortunately, it turns out that this does not work so well, as observed in figure \begin_inset LatexCommand \ref{fig:sc-svm} \end_inset . \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_and_compress.ps display color width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:sc-svm} \end_inset The sample compression bound applied to the output of a support vector machine with a Gaussian kernel. Here we use \begin_inset Formula $\delta=0.01$ \end_inset \end_inset \layout Standard There are other less common learning algorithms for which the sample compression bound works well. The Set Covering machine \begin_inset LatexCommand \cite{SCM} \end_inset has an associated bound which is a variant of the Sample Compression Bound. \layout Section Discussion \layout Subsection Learning algorithm design \layout Standard \emph on Every \emph default train set bound implies a learning algorithm: choose the classifier which minimizes the true error bound. This sounds like a rich source of learning algorithms, but there are some severe caveats to that statement. \layout Enumerate It is important to note that the form of a train set bound does \emph on not \emph default imply that this minimization is a good idea. Choosing between two classifiers based upon their true error bound implies a better worst-case bound on the true error. It does not imply an improved true error. In many situations, there is some other metric of comparison (such as train error rate) which in fact creates better behavior. \layout Enumerate Another strong caveat is that, historically, train set bounds have simply not been tight enough on real datasets for a nonvacuous application. This is changing with new results, but more progress is necessary. \layout Enumerate Often the optimization problem is simply not very tractable. In addition to sample complexity, learning algorithms must be concerned with run time and space usage. \layout Subsection Philosophy \layout Standard Train set bounds teach us about ways in which verifiable learning is possible, a subject which borders on philosophy. The train set bound presented here essentially shows that a reasonable person will be convinced of learning success when a short-description classifie r does well on train set data. The results here do \emph on not \emph default imply that this is the only way to convincingly learn. In fact, the (sometimes large) looseness of the Occam's Razor bound suggests that other methods for convincing learning processes exist. This observation is partially shown by other train set bound results which are presented next. \layout Subsection Conclusion \layout Standard This introduction to prediction theory covered two styles of bound: the test set bound and the train set bound. There are two important lessons here: \layout Enumerate Test set bounds provide a better way to report error rates and confidence intervals on future error rates than some current methods. \layout Enumerate Train set bounds can provide useful information. \layout Standard It is important to note that the train set bound and test set bound techniques are not mutually exclusive. It is possible to use both simultaneously \begin_inset LatexCommand \cite{TnT} \end_inset , and doing so is often desirable. Test set bounds are improved by the \begin_inset Quotes eld \end_inset free \begin_inset Quotes erd \end_inset information about the training error and train set bounds become applicable, even when not always tight. \layout Acknowledgement I am indebted to many people for helpful suggestions assembling the content. This includes some anonymous reviewers, several coauthors on previous papers, Imre Kondor, Arindam Banerjee, and Alina Beygelzimer. \layout Section* Appendix \layout Standard \begin_inset LatexCommand \label{sec-model} \end_inset \layout Standard For those interested in comparing models, the uniform convergence model \begin_inset LatexCommand \cite{Vapnik} \end_inset requires the additional assumption of the axiom of choice (to achieve empirical risk minimization) and a bound on the hypothesis space complexity. Typical theorems are of the form \begin_inset Quotes eld \end_inset after \begin_inset Formula $m$ \end_inset examples, all training errors are near to true errors \begin_inset Quotes erd \end_inset . \layout Standard The PAC learning model \begin_inset LatexCommand \cite{Valiant} \end_inset requires a polynomial time complexity learning algorithm and the assumption that the learning problem comes from some class. Theorems are of the form \begin_inset Quotes eld \end_inset after \begin_inset Formula $m$ \end_inset examples learning will be achieved \begin_inset Quotes erd \end_inset . \layout Standard Both of these models can support stronger statements than the basic prediction theory model presented here. Results from both of these models can apply here after appropriate massaging of results. \layout Standard The online learning model \begin_inset LatexCommand \cite{Online} \end_inset makes \emph on no \emph default assumptions. Typical theorems have the form \begin_inset Quotes eld \end_inset This learning algorithm's performance will be nearly as good as anyone of a set of classifiers. \begin_inset Quotes erd \end_inset The online learning model has very general results and no ability to answer questions about future performance as we address here. \layout Standard The prediction theory model can most simply be understood as a slight refinement of the information theory model. \layout Bibliography \bibitem {BEHW} A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth. \begin_inset Quotes eld \end_inset Occam's Razor. \begin_inset Quotes erd \end_inset Information Processing Letters 24: 377-380, 1987. \layout Bibliography \bibitem {Progressive} Avrim Blum, Adam Kalai, and John Langford 1999. Beating the Holdout: Bounds for K-Fold and Progressive Cross-Validation. COLT99. http://www.cs.cmu.edu/~jcl/papers/progressive_validation/coltfinal.ps \layout Bibliography \bibitem {chernoff} H. Chernoff, \begin_inset Quotes eld \end_inset A Measure of the Asymptotic Efficiency of Tests of a Hypothesis Based Upon the Sum of the Observations \begin_inset Quotes erd \end_inset , Annals of Mathematical Statistics 23:493-507, 1952. \layout Bibliography \bibitem {Devroye} Luc Devroye, Laszlo Gyorfi, Gabor Lugosi, \begin_inset Quotes eld \end_inset A Probabilistic Theory of Pattern Recognition \begin_inset Quotes erd \end_inset Springer-Verlag New York, 1996. \layout Bibliography \bibitem {FW} Sally Floyd and Manfred Warmuth, \begin_inset Quotes eld \end_inset Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension \begin_inset Quotes erd \end_inset , Machine Learning, Vol.21 (3), pp. 269--304, December 1995 http://www.cse.ucsc.edu/~manfred/pubs/sallycompr.forgalleys. ps \layout Bibliography \bibitem {kernelize} R. Herbrich and T. Graepel. \begin_inset Quotes eld \end_inset Large scale Bayes point machines \begin_inset Quotes erd \end_inset In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information System Processing 13, pages 528-534, Cambridge, MA, 2001. \layout Bibliography \bibitem {Sanity} Michael Kearns and Dana Ron, \begin_inset Quotes eld \end_inset Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. \begin_inset Quotes erd \end_inset Neural Computation 11(6), pages 1427-1453, 1999. Also in Proceedings of the Tenth Annual Conference on Computational Learning Theory, ACM Press, 1997, pages 152--162. http://www.cis.upenn.edu/~mkearns/papers/multi.ps \layout Bibliography \bibitem {Online} J. Kivinen and M. Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction, " in Journal of Information and Computation, vol. 132, no. 1, pp. 1-64, January 1997. http://www.cse.ucsc.edu/~manfred/pubs/lin.ps \layout Bibliography \bibitem {bound} John Langford, Program \begin_inset Quotes eld \end_inset bound \begin_inset Quotes erd \end_inset , http://www-2.cs.cmu.edu/~jcl/programs/bound/bound.html \layout Bibliography \bibitem {MC_journal} John Langford and Avrim Blum 1999. Microchoice Bounds and Self Bounding learning algorithms. Machine Learning Journal. http://www.cs.cmu.edu/~jcl/papers/microchoice/journal/journal_final.ps \layout Bibliography \bibitem {Shell} John Langford and David McAllester, \begin_inset Quotes eld \end_inset Computable Shell Decomposition Bounds \begin_inset Quotes erd \end_inset COLT 2000. http://www-2.cs.cmu.edu/~jcl/papers/computable_shell/colt_final.ps \layout Bibliography \bibitem {averaging} John Langford and Matthias Seeger, \begin_inset Quotes eld \end_inset Bounds for Averaging Classifiers \begin_inset Quotes erd \end_inset Technical report, Carnegie Mellon 2001 http://www-2.cs.cmu.edu/~jcl/papers/averagi ng/averaging_tech.ps \layout Bibliography \bibitem {PB-margin} John Langford and John Shawe-Taylor, \begin_inset Quotes eld \end_inset PAC-Bayes & Margins \begin_inset Quotes erd \end_inset , NIPS 2002. http://www-2.cs.cmu.edu/~jcl/papers/PB-margin/PB-margin.ps \layout Bibliography \bibitem {TnT} John Langford, \begin_inset Quotes eld \end_inset Combining Train Set and Test Set Bounds \begin_inset Quotes erd \end_inset , ICML2002. http://www-2.cs.cmu.edu/~jcl/papers/tnt_icml/train_n_test_final.ps \layout Bibliography \bibitem {LW} Nick Littlestone and Manfred Warmuth, \begin_inset Quotes eld \end_inset Relating Data Compression and Learnability \begin_inset Quotes erd \end_inset Unpublished Manuscript http://www.cse.ucsc.edu/~manfred/pubs/lrnk-olivier.ps \layout Bibliography \bibitem {PB} David McAllester, ``PAC-Bayesian Model Averaging'' COLT 1999. http://www.autoreason.com/posterior01.ps \layout Bibliography \bibitem {SCM} Mario Marchand and John Shawe-Taylor, \begin_inset Quotes eld \end_inset The Set Covering Machine \begin_inset Quotes erd \end_inset , ICML 2001. http://www.ai.mit.edu/projects/jmlr/papers/volume3/marchand02a/marchand02a.pdf \layout Bibliography \bibitem {Matthias} Matthias Seeger, \begin_inset Quotes eld \end_inset PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification \begin_inset Quotes erd \end_inset , Journal of Machine Learning Research 3 (2002), 233--269. http://www.dai.ed.ac.uk/homes/seeger/papers/seeger02a.ps.gz \layout Bibliography \bibitem {Chernoff} Sebastian Seung, Unpublished notes \layout Bibliography \bibitem {Valiant} L.G. Valiant. \begin_inset Quotes eld \end_inset A Theory of the Leuarnable. \begin_inset Quotes erd \end_inset Communications of the ACM 27(11):1134-1142, November 1984 \layout Bibliography \bibitem {Vapnik} V. N. Vapnik and A. Y. Chervonenkis. \begin_inset Quotes eld \end_inset On the uniform convergence of relative frequencies of events to their probabilit ies. \begin_inset Quotes erd \end_inset Theory of Probab. and its Applications, 16(2):264-280, 1971. \the_end