#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass amsart
\language english
\inputencoding default
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 2
\paperpagestyle default
\layout Title
Tutorial on Practical Prediction Theory for Classification
\layout Author
John Langford, IBM Research
\layout Abstract
We discuss basic prediction theory and it's impact on classification success
evaluation, implications for learning algorithm design, and uses in learning
algorithm execution.
This tutorial is meant to be a comprehensive compilation of results which
are both theoretically rigorous and practically useful.
\layout Abstract
There are a two important implications of the results presented here:
\layout Abstract
(1) Common practices for reporting results in classification should change
to use the test set bound.
\layout Abstract
(2) Train set bounds can sometimes be used to directly motivate learning
algorithms.
\layout Section
Introduction
\layout Standard
Classifiers are functions which partition a set into two elements (the set
of rainy days and the set of sunny days).
Classifiers are the most simple nontrivial decision making element so studying
the theory of learning classifiers is very fundamental to studying the
theory of learning, in general.
Classifiers are sufficiently complex that many phenomena observed in machine
learning (theoretically or experimentally) can be observed in the classificatio
n setting.
Yet, classifiers are simple enough to make their analysis easy to understand.
This combination of sufficient yet minimal complexity for capturing phenomena
makes the study of classifiers especially fruitful.
\layout Standard
The goal of this paper is an introduction to the theory of prediction for
classification.
Many of these results have been presented elsewhere, although the style,
elemental nature, and generality of the presentation may be new.
This is a tutorial, so we limit our presentation to those results which
are both theoretically sound and practically useful.
\layout Standard
There are several important aspects of learning which the theory here casts
light on.
Perhaps the most important of these is the problem of performance reporting
for classifiers.
Many people use some form of empirical variance to estimate upper and lower
bounds.
This is an error-prone practice, and the test set bound in section
\begin_inset LatexCommand \ref{sec-test}
\end_inset
implies a better method by nearly any metric.
Hopefully, this will become common practice.
\layout Standard
After discussing the test set bound we cover the Occam's Razor bound, the
simplest train set bound, which explains (and quantifies) the common phenomena
of overfitting in simplest form.
We also prove that the Occam's Razor bound cannot be improved without incorpora
ting extra information and apply the bound to decision trees.
\layout Standard
Next, we discuss two train set bounds, the PAC-Bayes bound and the Sample
Compression bound, which have proved to give practical results for more
general classifiers, such as Support Vector Machines and Neural Networks.
\layout Standard
All of the results here should be easily approachable and understandable.
The proofs are simple, and examples are given.
Pointers to related work are also given.
\layout Standard
It is important to note that all of the results presented here fall in the
realm of classical statistics.
In particular, all randomizations are over draws of the data, and our results
have the form of confidence intervals.
\layout Standard
The layout of this document is as follows:
\layout Itemize
Section
\begin_inset LatexCommand \ref{sec-formal}
\end_inset
presents the formal model
\layout Itemize
Section
\begin_inset LatexCommand \ref{sec-test}
\end_inset
presents the test set bound
\layout Itemize
Section
\begin_inset LatexCommand \ref{sec-train}
\end_inset
presents the Occam's Razor bound
\layout Itemize
Section
\begin_inset LatexCommand \ref{sec:PAC-Bayes-Bound}
\end_inset
presents the PAC-Bayes bound
\layout Itemize
Section
\begin_inset LatexCommand \ref{sec:Sparsity-Bound}
\end_inset
presents the Sample Compression bound
\layout Standard
The formal model and test set bound must be understood in order to appreciate
all later results.
There is no particular dependency between the various train set bounds
we present.
\layout Section
Formal Model
\layout Standard
\begin_inset LatexCommand \label{sec-formal}
\end_inset
\layout Standard
There are many somewhat arbitrary choices of learning model.
The one we use can (at best) be motivated by its simplicity.
Other models such as the online learning model
\begin_inset LatexCommand \cite{Online}
\end_inset
, PAC learning
\begin_inset LatexCommand \cite{Valiant}
\end_inset
, and the uniform convergence model
\begin_inset LatexCommand \cite{Vapnik}
\end_inset
differ in formulation, generality, and in the scope of addressable questions.
The strongest motivation for studying the prediction theory model here
is simplicity and corresponding generality of results.
Appendix section
\begin_inset LatexCommand \ref{sec-model}
\end_inset
discusses the connections between various models.
\layout Subsection
Basic quantities
\layout Standard
We are concerned with a learning model in which examples of (input, output)
pairs come independently from some unknown distribution.
The goal is to find a function capable of predicting the output given the
input.
There are several mathematical objects we work with.
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Tabular
0$ \end_inset , we have: \layout Proof \begin_inset Formula \[ \textrm{Bin}\left(\frac{k}{m},p\right)=\Pr_{X^{m}\sim p^{m}}\left(\frac{1}{m}\sum_{i=1}^{m}X_{i}\leq\frac{k}{m}\right)=\Pr_{X^{m}\sim p^{m}}\left(e^{m\lambda\frac{1}{m}\sum_{i=1}^{m}X_{i}}\leq e^{m\lambda\frac{k}{m}}\right)\] \end_inset \begin_inset Formula \[ =\Pr_{X^{m}\sim p^{m}}\left(e^{-\lambda\sum_{i=1}^{m}X_{i}}\geq e^{-\lambda k}\right)\] \end_inset Using Markov's inequality ( \begin_inset Formula $X\geq0$ \end_inset , \begin_inset Formula $EX=\mu$ \end_inset , \begin_inset Formula $\Rightarrow\Pr(X\geq\delta)\leq\frac{\delta}{\mu}$ \end_inset ): \begin_inset Formula \[ \leq\frac{e^{-\lambda k}}{E_{X^{m}\sim p^{m}}e^{-\lambda\sum_{i=1}^{m}X_{i}}}\] \end_inset Using Jensen's inequality ( \begin_inset Formula $\frac{1}{x}$ \end_inset concave \begin_inset Formula $\Rightarrow$ \end_inset \begin_inset Formula $\frac{1}{EX}\leq E\frac{1}{X}$ \end_inset ), we get: \begin_inset Formula \[ \leq e^{-\lambda k}E_{X^{m}\sim p^{m}}e^{\lambda\sum_{i=1}^{m}X_{i}}\] \end_inset Using independence, we get: \begin_inset Formula \[ =e^{-\lambda k}\left(pe^{\lambda}+(1-p)\right)^{m}\] \end_inset and rewriting, we get: \begin_inset Formula \[ =e^{-mf(\lambda)}\] \end_inset where \begin_inset Formula $f(\lambda)=\lambda\frac{k}{m}-\ln\left(pe^{\lambda}+1-p\right)$ \end_inset . \layout Proof \begin_inset Formula $\lambda$ \end_inset is a free parameter which can be optimized to find the tightest possible bound. To find the optimal value, find \begin_inset Formula $\lambda^{*}$ \end_inset so that \begin_inset Formula $f'(\lambda^{*})=0$ \end_inset . \begin_inset Formula \[ 0=f'(\lambda^{*})=\frac{k}{m}-\frac{pe^{\lambda^{*}}}{pe^{\lambda^{*}}+1-p}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}}{p}\left(pe^{\lambda^{*}}+1-p\right)=e^{\lambda^{*}}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}}{p}-\frac{k}{m}=\left(1-\frac{k}{m}\right)e^{\lambda^{*}}\] \end_inset \begin_inset Formula \[ \Rightarrow\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}=e^{\lambda^{*}}\] \end_inset Using this, we get: \begin_inset Formula \[ f(\lambda^{*})=\frac{k}{m}\ln\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}-\ln\left(\frac{1-p}{1-\frac{k}{m}}\right)\] \end_inset \begin_inset Formula \[ =\frac{k}{m}\ln\frac{\frac{k}{m}}{p}+(1-\frac{k}{m})\ln\left(\frac{1-\frac{k}{m}}{1-p}\right)\] \end_inset \begin_inset Formula \[ =\textrm{KL}(\frac{k}{m}||p)\] \end_inset \layout Standard Using the Chernoff bound, we can loosen the test set bound to achieve a more analytic fom. \layout Corollary (Agnostic Test Set Bound) For all classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\textrm{KL}\left(\hat{c}_{S}||c_{D}\right)\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta\] \end_inset where \begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$ \end_inset is the Kullback-Leibler divergence between two coins of bias \begin_inset Formula $q,p$ \end_inset with \begin_inset Formula $q
\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta P(c)\] \end_inset then, we apply the union bound in a nonuniform manner. The union bound says that \begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$ \end_inset . Applying the union bound to every classifier with a positive measure gives a total probability of failure of \begin_inset Formula \[ \sum_{c}\delta P(c)=\delta\sum_{c}P(c)=\delta\] \end_inset which implies \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta\] \end_inset Negating this again completes the proof. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename nonuniform_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \begin_inset Graphics filename deep_cut_nonuniform_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \layout Standard \begin_inset Graphics filename deep_cut_consistent_binomials.ps display monochrome width 30col% rotateAngle 270 \end_inset \begin_inset Graphics filename deep_cut_bound.ps display monochrome width 30col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-train-proof} \end_inset The sequence of pictures is the pictorial representation of the proof of the Occam's Razor Bound. The first figure shows a set of classifiers, each with a tail cut of some varying depth. The second picture shows an observed training error and the possible Binomial distributions for a chosen classifier. The third picture shows the true errors which are consistent with the observati on and the tail cuts. The fourth picture shows the true error bound. \end_inset \layout Subsubsection Occam's Razor Corollaries \layout Standard Just as with the test set bound, we can relax the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with the Chernoff approximations to get a somewhat more tractable expression. \layout Corollary \begin_inset LatexCommand \label{th-horb} \end_inset (Chernoff Occam's Razor Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over classifiers, for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\leq\hat{c}_{S}+\sqrt{\frac{\ln\frac{1}{P(c)}+\ln\frac{1}{\delta}}{2m}}\right)\geq1-\delta\] \end_inset \layout Proof approximate the Binomial tail with the Chernoff Bound (lemma \begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff} \end_inset ). \layout Standard Many people are more familiar with a degenerate form of this bound where \begin_inset Formula $P(c)=\frac{1}{|H|}$ \end_inset and \begin_inset Formula $H$ \end_inset is some set of classifiers. In that case, simply replace \begin_inset Formula $\ln\frac{1}{P(c)}$ \end_inset with \begin_inset Formula $\ln|H|$ \end_inset . The form presented here is both more general and necessary if the bound is to be used in practice. \layout Standard Other corollaries as in section \begin_inset LatexCommand \ref{sub:Corollaries} \end_inset exist for the Occam's Razor bound. In general, just substitute \begin_inset Formula $\delta\rightarrow\delta P(c)$ \end_inset . \layout Subsubsection Occam's Razor Lower bound \layout Standard Just as for the test set bound, a lower bound of the same form applies. \layout Theorem \begin_inset LatexCommand \label{th-lorb} \end_inset (Occam's Razor Lower Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over the classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall c:\,\, c_{D}\geq\min_{p}\{ p:\,\,1-\textrm{Bin}\left(\hat{c}_{S},p\right)\geq\delta P(c)\}\right)\geq1-\delta\] \end_inset \layout Example* (continued) Suppose that instead of having \begin_inset Formula $100$ \end_inset test examples, we had \begin_inset Formula $100$ \end_inset train examples. Also suppose that before seeing the train examples, we committed to \begin_inset Formula $P(c)=0.1$ \end_inset for \begin_inset Formula $c$ \end_inset the constant classifier which predicts \begin_inset Quotes eld \end_inset no rain \begin_inset Quotes erd \end_inset . Then, the Chernoff approximations of the upper and lower bound give the interval, \begin_inset Formula $c_{D}\in[0.22,0.54]$ \end_inset . With an exact calculation, we get \begin_inset Formula $c_{D}\in[0.26,0.51]$ \end_inset . \layout Subsubsection The state of the art \layout Standard A very large amount of work has been done on train set bounds. In addition to those included here, there is: \layout Enumerate Reinterpretations of uniform convergence \begin_inset LatexCommand \cite{Vapnik} \end_inset results for continuously parameterized classifiers. \layout Enumerate Reinterpretations of PAC convergence \begin_inset LatexCommand \cite{Valiant} \end_inset results. \layout Enumerate Shell bounds \begin_inset LatexCommand \cite{Shell} \end_inset which take advantage of the distribution of true error rates on classifiers. \layout Enumerate Train and Test bounds \begin_inset LatexCommand \cite{TnT} \end_inset which combine train set and test set bounds. \layout Standard Of this large amount of work only a small fraction has been shown to be useful on real-world learning algorithm/learning problem pairs. The looseness of train set based bounds often precludes analytical use. \layout Subsection The Occam's Razor Bound is sometimes Tight \layout Standard \begin_inset LatexCommand \label{sec-lower_upper} \end_inset \layout Standard The question of tightness for train set bounds is important to address, as many of them have been extremely loose. The simplest method to address this tightness is constructive: exhibit a learning problem/algorithm pair for which the bound is almost achieved. For the test set bound, this is trivial as any classifier with a large enough true error will achieve the bound. For the train set bound, this is not so trivial. \layout Standard How tight is the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset )? The answer is \emph on sometimes \emph default tight. In particular, we can exhibit a set of learning problems where the Occam's Razor bound can not be made significantly tighter as a function of the observables, \begin_inset Formula $m$ \end_inset , \begin_inset Formula $\delta$ \end_inset , \begin_inset Formula $P(c)$ \end_inset , and \begin_inset Formula $\hat{c}_{S}$ \end_inset . After fixing the value of these quantities we construct a learning problem exhibiting this near equivalence to the Occam's Razor bound. \layout Theorem \begin_inset LatexCommand \label{th-dhlub} \end_inset (Occam's Razor tightness) For all \begin_inset Formula $P(c)$ \end_inset , \begin_inset Formula $\frac{k}{m}$ \end_inset , \begin_inset Formula $\delta$ \end_inset there exists a learning problem and algorithm such that: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\geq\delta-\delta^{2}\] \end_inset Furthermore, if \begin_inset Formula $c^{*}$ \end_inset is the classifier with minimal training error, then: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(c_{D}^{*}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S}^{*},\delta P(c)\right)\right)\geq\delta-\delta^{2}\] \end_inset \layout Standard Intuitively, this theorem implies that we can not improve significantly on the the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) without using extra information about our learning problem. \layout Proof The proof is constructive: we create a learning problem on which large deviation s are likely. We start with a prior \begin_inset Formula $P(c)$ \end_inset , probability of error \begin_inset Formula $\delta$ \end_inset , \begin_inset Formula $m$ \end_inset , and a targeted empirical error rate, \begin_inset Formula $\frac{k}{m}$ \end_inset . For succinctness we assume that \begin_inset Formula $P(c)$ \end_inset has support on a finite set of size \begin_inset Formula $n$ \end_inset . \layout Proof To define the learning problem, let: \begin_inset Formula $X=\{0,1\}^{n}$ \end_inset and \begin_inset Formula $Y=\{0,1\}$ \end_inset . \layout Proof The distribution \begin_inset Formula $D$ \end_inset can be drawn by first selecting \begin_inset Formula $Y$ \end_inset with a single unbiased coin flip, and then choosing the \begin_inset Formula $i$ \end_inset th component of the vector \begin_inset Formula $X$ \end_inset independently, \begin_inset Formula $\Pr((X_{1},...,X_{n})|Y)=\Pi_{i=1}^{n}\Pr(X_{i}|Y)$ \end_inset . The individual components are chosen so \begin_inset Formula $\Pr(X_{i}=Y|Y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$ \end_inset . \layout Proof The classifiers we consider just use one feature to make their classification: \begin_inset Formula $c_{i}(x)=x_{i}$ \end_inset . The true error of these classifiers is given by: \begin_inset Formula $c_{D}=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$ \end_inset . \layout Proof This particular choice of true errors implies that if any classifier has a too-small train error rate, then the classifier with minimal train error must have a too-small train error. \layout Proof Using this learning problem, we know that: \begin_inset Formula \[ \forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=\delta P(c)\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow\forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=1-\delta P(c)\] \end_inset (independence) \begin_inset Formula \[ \Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\forall c\,\, c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\prod_{c}\left(1-\delta P(c)\right)\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\exists c\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\] \end_inset \begin_inset Formula \[ =\sum_{i=1}^{n}\delta P(c_{i})\left(1-\Pr_{S\sim D^{m}}\left(\exists c\in\{ c_{1},...,c_{i-1}\}\,\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\right)\] \end_inset \begin_inset Formula \[ \geq\sum_{i=1}^{n}\delta P(c_{i})(1-\delta)\] \end_inset \begin_inset Formula \[ =\delta-\delta^{2}\] \end_inset \layout Standard The lower bound theorem implies that we can not improve an Occam's Razor like statement. However, it is important to note that large improvements are possible if we use other sources of information. To see this, just note the case where every single classifier happens to be the same. In this case the \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset bound would the be the \emph on test \emph default set bound, rather than the train set bound. The PAC-Bayes bound and the Sample Compression bound presented in the next sections use other sources of information. \layout Subsection Train set bound implications \layout Standard \begin_inset LatexCommand \label{subsec-train-set-practice} \end_inset \layout Subsubsection results \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \align center \begin_inset Graphics filename test_vs_micro.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-microchoice-holdout} \end_inset This is a plot comparing confidence intervals built based upon the test set bound ( \begin_inset LatexCommand \ref{th-hb} \end_inset ) with an 80%/20% train/test split on the left and the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with all data in the training set on the right. The Occam's razor bound is sometimes superior on the smaller data sets and always nonvacuous (in contrast to many other train set bounds). \end_inset \layout Standard The Occam's Razor bound is strongly related to compression. In particular, for any self-terminating description language, \begin_inset Formula $d(c)$ \end_inset , we can associate a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)=2^{-|d(c)|}$ \end_inset with the property that \begin_inset Formula $\sum_{c}P(c)\leq1$ \end_inset . Consequently, short description length classifiers tend to have a tighter convergence and the penalty term, \begin_inset Formula $\ln\frac{1}{P(c)}$ \end_inset is the number of \begin_inset Quotes eld \end_inset nats \begin_inset Quotes erd \end_inset (bits base e). For any language fixed before seeing the train set, classifiers with shorter description lengths have tighter bounds on the true error rate. \layout Standard One particularly useful description language to consider is the execution trace of a learning algorithm. If we carefully note the sequence of data-dependent choices which a learning algorithm makes, then the output classifier can be specified by a sequence such as \begin_inset Quotes eld \end_inset 2nd choice, third choice, first choice, etc... \begin_inset Quotes erd \end_inset This is the idea behind microchoice bounds \begin_inset LatexCommand \cite{MC_journal} \end_inset . Results for this approach are reported in Figure \begin_inset LatexCommand \ref{fig-microchoice-holdout} \end_inset and are strong enough to act as an empirical existence proof that Occam's Razor bounds can be made tight enough for useful application. \layout Section PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sec:PAC-Bayes-Bound} \end_inset \layout Standard The PAC-Bayes bound \begin_inset LatexCommand \cite{PB} \end_inset is particularly exciting because it can provide quantitatively useful results for classifiers with \emph on real valued \emph default parameters. This includes such commonly used classifiers as Support Vector Machines and Neural Networks. \begin_inset Foot collapsed true \layout Standard There is a caveat here---the bound only applies to stochastic versions of the classifiers. However, the probability that the stochastic classifier differs from the classifier can be made very small. \end_inset This section is divided into three parts: \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:The-PAC-Bayes-Bound} \end_inset States and proves the PAC-Bayes Bound. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:How-Tight-is} \end_inset shows that the PAC-Bayes Bound is nearly as tight as possible given the observations. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:Application-of-the} \end_inset discusses results from the application of the PAC-Bayes bound to support vector machines. \layout Subsection The PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sub:The-PAC-Bayes-Bound} \end_inset \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename pac-bayes.eps scale 50 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:PB-POL} \end_inset The \begin_inset Quotes eld \end_inset interactive proof of learning \begin_inset Quotes erd \end_inset associated with the PAC-Bayes bound. The figure is the same as for the Occam's razor bound, except that instead of committing to a single classifier, the PAC-Bayes bound applies to any distribution over classifiers. \end_inset \layout Standard The PAC-Bayes bound has been improved by tightening \begin_inset LatexCommand \cite{averaging} \end_inset and then with a much simpler proof \begin_inset LatexCommand \cite{Matthias} \end_inset since it was originally stated. The statement and proof presented here incorporate these improvements and improve on them slightly. \layout Standard The PAC-Bayes bound is dependent upon two derived quantities, an average true error: \begin_inset Formula \[ Q_{D}\equiv E_{c\sim Q}c_{D}\] \end_inset and an average train error: \begin_inset Formula \[ \hat{Q}_{S}\equiv E_{c\sim Q}\hat{c}_{S}\] \end_inset These quantities can be interpreted as the train error and true error of the meta-classifier which chooses a classifier according to \begin_inset Formula $Q$ \end_inset every time a classification is made. If we refer to this meta-classifier as \begin_inset Formula $Q$ \end_inset , the notation for error rates is consistent with our earlier notation. \layout Standard The \begin_inset Quotes eld \end_inset interactive proof of learning \begin_inset Quotes erd \end_inset viewpoint of the PAC-Bayes bound is shown in figure \begin_inset LatexCommand \ref{fig:PB-POL} \end_inset . It is essentially the same as for the Occam's Razor bound except for the commitment to the metaclassifier \begin_inset Formula $Q$ \end_inset rather than the classifier \begin_inset Formula $c$ \end_inset . \layout Theorem (PAC-Bayes Bound) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula $P(c)$ \end_inset over the classifiers, \begin_inset Formula $c$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall Q(c):\,\,\textrm{KL}\left(\hat{Q}_{S}||Q_{D}\right)\leq\frac{\textrm{KL}(Q||P)+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\] \end_inset Here \begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$ \end_inset for \begin_inset Formula $q
\frac{\mu}{\delta})\leq\delta$ \end_inset ), we get: \layout Proof \begin_inset Formula \[ \forall P\,\,\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\leq\frac{m+1}{\delta}\right)\leq\delta\] \end_inset \layout Standard The next lemma shows that a certain expectation is bounded by the Kullback-Leibl er distance between two coin flips, just as for the relative entropy Chernoff bound (Lemma \begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff} \end_inset ). \layout Lemma \begin_inset LatexCommand \label{lemma: KL-approx} \end_inset For all \begin_inset Formula $Q(c)$ \end_inset : \begin_inset Formula $\frac{E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}}{m}\geq\textrm{KL}(\hat{Q}_{S}||Q_{D})$ \end_inset \layout Proof \begin_inset Formula $\frac{1}{m}E_{c\sim Q}\ln\frac{1}{\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)c_{D}^{m\hat{c}_{S}}(1-c_{D})^{m(1-\hat{c}_{S})}}$ \end_inset \begin_inset Formula \[ =E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]-\frac{E_{c\sim Q}\ln\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)}{m}\] \end_inset Jensen's inequality ( \begin_inset Formula $f$ \end_inset concave \begin_inset Formula $\Rightarrow$ \end_inset \begin_inset Formula $Ef(X)\geq f(EX)$ \end_inset ): \begin_inset Formula \[ E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]\geq\hat{Q}_{S}\ln\frac{1}{Q_{D}}+\left(1-\hat{Q}_{S}\right)\ln\frac{1}{1-Q_{D}}\] \end_inset \begin_inset Formula \[ \textrm{ and \,\,\,\,}\frac{E_{c\sim Q}\ln\left(\begin{array}{c} m\\ m\hat{c}_{S}\end{array}\right)}{m}\leq\frac{E_{c\sim Q}\ln e^{mH\left(\hat{c}_{S}\right)}}{m}=\frac{E_{c\sim Q}mH\left(\hat{c}_{S}\right)}{m}\leq H\left(\hat{Q}_{S}\right)\] \end_inset \layout Standard With these two lemmas, the PAC-Bayes theorem is easy to prove. \layout Proof (Of the PAC-Bayes theorem) Let \begin_inset Formula \[ P_{G}(c,\hat{c}_{S})=\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}}P(c)\] \end_inset \begin_inset Formula \[ \Rightarrow0\leq\textrm{KL}(Q||P_{G})=E_{c\sim Q}\ln\frac{Q(c)}{P(c)}\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset \begin_inset Formula \[ =\textrm{KL}(Q||P)-E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset \begin_inset Formula \[ \Rightarrow E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}\leq\textrm{KL}(Q||P)+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\] \end_inset Now, use lemma \begin_inset LatexCommand \ref{lem:exp-approx} \end_inset on the right hand term and lemma \begin_inset LatexCommand \ref{lemma: KL-approx} \end_inset on the left hand term to finish the proof. \layout Standard Note that one of the last inequalities in the proof is not necessary, and so a slightly tighter bound can be proved. \layout Subsection The PAC-Bayes Bound is sometimes Tight \layout Standard \begin_inset LatexCommand \label{sub:How-Tight-is} \end_inset \layout Standard Since the PAC-Bayes bound is (almost) a generalization of the Occam's Razor bound, the tightness result for Occam's Razor also applies to PAC-Bayes bounds. \layout Subsection Application of the PAC-Bayes Bound \layout Standard \begin_inset LatexCommand \label{sub:Application-of-the} \end_inset \layout Standard Applying the PAC-Bayes bound requires some significant specialization \begin_inset LatexCommand \cite{PB-margin} \end_inset . Here, we specialize to classifiers of the form: \begin_inset Formula \[ c(x)=\textrm{sign}\left(\vec{w}\cdot\vec{x}\right)\] \end_inset Note that via the kernel trick, Support Vector Machines also have this form. \layout Standard The specialization is naturally expressed in terms of a few derived quantities: \layout Enumerate The cumulative distribution of a Gaussian. Let \begin_inset Formula $\bar{F}(x)=\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$ \end_inset . Here we use \begin_inset Formula $\bar{F}$ \end_inset rather than \begin_inset Formula $F$ \end_inset to denote the fact that we integrate from \begin_inset Formula $\infty$ \end_inset to \begin_inset Formula $x$ \end_inset rather than \begin_inset Formula $-\infty$ \end_inset to \begin_inset Formula $x$ \end_inset . \layout Enumerate A \begin_inset Quotes eld \end_inset posterior \begin_inset Quotes erd \end_inset distribution \begin_inset Formula $Q(\vec{w},\mu)$ \end_inset which is \begin_inset Formula $N(\mu,1)$ \end_inset for some \begin_inset Formula $\mu>0$ \end_inset in the direction of \begin_inset Formula $\vec{w}$ \end_inset and \begin_inset Formula $N(0,1)$ \end_inset in all perpendicular directions. \layout Enumerate The normalized margin of the examples \begin_inset Formula \[ \gamma(\vec{x},y)=\frac{y\vec{w}\cdot\vec{x}}{||\vec{w}||||\vec{x}||}\] \end_inset \layout Enumerate A stochastic error rate, \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset \layout Standard This last quantity in particular is very important to understand. Consider the case as \begin_inset Formula $\mu$ \end_inset approaches \begin_inset Formula $\infty$ \end_inset . When the margin is negative (indicating an incorrect classification), \begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset approaches \begin_inset Formula $1$ \end_inset . When the margin is positive \begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ \end_inset approaches \begin_inset Formula $0$ \end_inset . Thus, \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$ \end_inset is a softened form of the empirical error \begin_inset Formula $\hat{c}_{S}$ \end_inset which takes into account the margin. \layout Corollary (PAC-Bayes Margin Bound) For all distributions \begin_inset Formula $D$ \end_inset , for all \begin_inset Formula $\delta\in(0,1]$ \end_inset , we have: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall\vec{w},\mu:\,\,\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||Q(\vec{w},\mu)_{D}\right)\leq\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\] \end_inset \layout Proof The proof is very simple. We just choose the prior \begin_inset Formula $P=N(0,1)^{n}$ \end_inset and work out the implications. \layout Proof Since the Gaussian distribution is the same in every direction, we can reorient the coordinate system of the prior to have one dimension parallel to \begin_inset Formula $w$ \end_inset . Since the draws in the parallel and perpendicular directions are independent, we have: \begin_inset Formula \[ \textrm{KL}(Q||P)=\textrm{KL}(Q_{\perp}||P_{\perp})+\textrm{KL}(N(\mu,1)||N(0,1))\] \end_inset \begin_inset Formula \[ =\frac{\mu^{2}}{2}\] \end_inset as required. \layout Proof All that remains is calculating the stochastic error rate \begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$ \end_inset . Fix a particular example, \begin_inset Formula $(\vec{x},y)$ \end_inset . This example has a natural decomposition \begin_inset Formula $\vec{x}=\vec{x}_{||}+\vec{x}_{\perp}$ \end_inset into a component, \begin_inset Formula $\vec{x}_{||}$ \end_inset parallel to the weight vector \begin_inset Formula $\vec{w}$ \end_inset and a component \begin_inset Formula $\vec{x}_{\perp}$ \end_inset perpendicular to the weight vector. \layout Proof To classify, we draw weight vector \begin_inset Formula $\vec{w}^{'}$ \end_inset from \begin_inset Formula $\hat{Q}(\vec{w},\mu)$ \end_inset . This \begin_inset Formula $\vec{w}^{'}$ \end_inset consists of \begin_inset Formula $3$ \end_inset components, \begin_inset Formula $\vec{w}^{'}=\vec{w}_{||}^{'}+\vec{w}_{\perp}^{'}+\vec{w}_{\perp\perp}^{'}$ \end_inset . Here \begin_inset Formula $\vec{w}_{||}^{'}\sim N(\mu,1)$ \end_inset is parallel to the original weight vector, \begin_inset Formula $\vec{w}_{\perp}^{'}\sim N(0,1)$ \end_inset which is parallel to \begin_inset Formula $\vec{x}_{\perp}$ \end_inset and \begin_inset Formula $\vec{w}_{\perp\perp}^{'}$ \end_inset is perpendicular to both \begin_inset Formula $\vec{w}$ \end_inset and \begin_inset Formula $\vec{x}$ \end_inset . We have: \begin_inset Formula \[ \hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\neq\textrm{sign}\left(\vec{w}^{'}\cdot\vec{x}\right)\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\vec{w}\cdot\vec{x}\leq0\right)\] \end_inset If we let \begin_inset Formula $w_{||}^{'}=||w_{||}^{'}||$ \end_inset , \begin_inset Formula $w_{\perp}^{'}=||w_{\perp}^{'}||$ \end_inset , \begin_inset Formula $x_{||}=||\vec{x}_{||}||$ \end_inset , and \begin_inset Formula $x_{\perp}=||\vec{x}_{\perp}||$ \end_inset , and assume (without loss of generality) that \begin_inset Formula $y=1$ \end_inset we get: \begin_inset Formula \[ =E_{\vec{x},y\sim S,w_{||}^{'}\sim N(\mu,1),w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{w_{||}^{'}\sim N(\mu,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq-yz^{'}-yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] \end_inset Using the symmetricity of the Gaussian, this is: \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq yz^{'}+yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] \end_inset Using the fact that the sum of two Gaussians is a Gaussian: \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{v\sim N\left(0,1+\frac{x_{\perp}^{2}}{x_{||}^{2}}\right)}I\left(y\mu\leq yv\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}E_{v\sim N\left(0,\frac{1}{\gamma(\vec{x},y)^{2}}\right)}I\left(y\mu\leq yv\right)\] \end_inset \begin_inset Formula \[ =E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)\] \end_inset finishing the proof. \layout Standard Using the corollary, the true error bound \begin_inset Formula $\bar{Q}(\vec{w},\mu)_{D}$ \end_inset satisfies the equation: \begin_inset Formula \[ \textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||\bar{Q}(\vec{w},\mu)_{D}\right)=\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\] \end_inset This is an implicit equation for \begin_inset Formula $\bar{Q}$ \end_inset which can be easily solved numerically. \layout Standard The bound is stated in terms of dot products here, so naturally it is possible to kernelize the result using methods from \begin_inset LatexCommand \cite{kernelize} \end_inset . In kernelized form, the bound applies to classifiers (as output by SVM learning algorithms) of the form: \begin_inset Formula \begin{equation} c(x)=\textrm{sign}\left(\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)\label{eq:class-kernel}\end{equation} \end_inset Since, by assumption, \begin_inset Formula $k$ \end_inset is a kernel, we know that \begin_inset Formula $k(x_{i},x)=\vec{\Phi}(x_{i})\cdot\vec{\Phi}(x)$ \end_inset where \begin_inset Formula $\vec{\Phi}(x)$ \end_inset is some projection into another space. In kernelized form, we get \begin_inset Formula $\vec{w}\cdot\vec{x}=\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)$ \end_inset , \begin_inset Formula $\vec{x}\cdot\vec{x}=k(x,x)$ \end_inset , \begin_inset Formula $\vec{w}\cdot\vec{w}=\sum_{i,j}\alpha_{i}\alpha_{j}k(x_{i},x_{j})$ \end_inset , defining all of the necessary quantities to calculate the normalized margin, \begin_inset Formula \[ \gamma(x,y)=\frac{\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)}{\sqrt{k(x,x)\sum_{i,j=1,1}^{m,m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})}}\] \end_inset \layout Standard One element remains, which is the value of \begin_inset Formula $\mu$ \end_inset . Unfortunately the bound can be nonmonotonic in the value of \begin_inset Formula $\mu$ \end_inset , but it turns out that for classifiers learned by support vector machines on reasonable datasets, there is only one value of \begin_inset Formula $\mu$ \end_inset which is (locally, and thus globally) minimal. A binary search over some reasonable range of \begin_inset Formula $\mu$ \end_inset (say from \begin_inset Formula $1$ \end_inset to \begin_inset Formula $100$ \end_inset ) can find the minima quickly, given the precomputation of the margins. It is worth noting again here that we are not \begin_inset Quotes eld \end_inset cheating \begin_inset Quotes erd \end_inset ---the bound holds for all values of \begin_inset Formula $\mu$ \end_inset simultaneously. \layout Standard The computational time of the bound calculation is dominated by the calculation of the margins which is \begin_inset Formula $O\left(m^{2}\right)$ \end_inset where \begin_inset Formula $m$ \end_inset is the number of support vectors with a nonzero associated \begin_inset Formula $\alpha$ \end_inset . This computational time is typically dominated by the time of the SVM learning algorithm. \layout Subsubsection Results \layout Standard Application of this bound to support vector machines is of significant importanc e because SVMs are reasonably effective and adaptable classifiers in common and widespread use. A SVM learns a kernelized classifier as per equation \begin_inset LatexCommand \ref{eq:class-kernel} \end_inset \begin_inset Foot collapsed false \layout Standard Some SVM learning algorithms actually learn a classifier of the form: \begin_inset Formula $c(x)=\textrm{sign}\left(b+\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)$ \end_inset . We don't handle this form here. \end_inset . \layout Standard We apply the support vector machine to 8 UCI database problems chosen to fit the criteria \begin_inset Quotes eld \end_inset two classes \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset real valued input features \begin_inset Quotes erd \end_inset . The problems vary in size over an order of magnitude from \begin_inset Formula $145$ \end_inset to \begin_inset Formula $1428$ \end_inset examples. In figure \begin_inset LatexCommand \ref{fig:test} \end_inset we use a 70/30 train/test split of the data. \layout Standard In all experiments, we use SVMlight with a Gaussian kernel and the default bandwidth. Results for other choices of the \begin_inset Quotes eld \end_inset C \begin_inset Quotes erd \end_inset , bandwidth, and/or kernel appear to be qualitatively similar (although of course differing quantitatively). \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_and_margin.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:test} \end_inset \layout Standard This figure shows the results of applying SVMlight to 8 datasets with a Gaussian kernel and a 70/30 train/test split. The observed test error rate is graphed as an X. On the test set, we calculate a Binomial confidence interval (probability of bound failure = \begin_inset Formula $0.01$ \end_inset ) which upper bounds the true error rate. On the training set we calculate the PAC-Bayes margin bound for an optimized choice of \begin_inset Formula $\mu$ \end_inset . \end_inset It is important to note that the PAC-Bayes margin bound is \emph on not \emph default precisely a bound (or confidence interval) on the true error rate of the learned classifier. Instead, it is a true error rate bound on an associated stochastic classifier chosen so as to have a similar test error rate. These bounds can be regarded as bounds for the original classifier only under an additional assumption: that picking a classifier according to the majority vote of this stochastic distribution does not worsen the true error rate. This is not true in general, but may be true in practice. \layout Standard It is of course unfair to compare the train set bound with the test set bound on a 70/30 train/test split because a very tight train set bound would imply that it is unnecessary to even have a test set. In figure \begin_inset LatexCommand \ref{fig:full-train} \end_inset we compare the true error bounds on all of the data to the true error bounds generated from the 70/30 train/test split. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_vs_margin.ps display monochrome width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:full-train} \end_inset \layout Standard In addition to comparing with everything in figure \begin_inset LatexCommand \ref{fig:test} \end_inset , we graph the margin bound when \emph on all \emph default of the data is used for the train set. Note that it improves somewhat on the margin bound calculated using the 70% train set (7/10 margin bound), but not enough to compete with the test set bound. \end_inset The results show that the PAC-Bayes margin bound is tight enough to give useful information, but still not competitive with the test set bounds. This is in strong contrast with a tradition of quantitatively impractical margin bounds. There are several uses available for bounds which provide some information but which are not fully tight. \layout Enumerate They might be combined with a train/test bound \begin_inset LatexCommand \cite{TnT} \end_inset . \layout Enumerate The train set bound might easily become tighter for smaller sample sizes. This was observed in \begin_inset LatexCommand \cite{TnT} \end_inset . \layout Enumerate The train set bound might still have the right \begin_inset Quotes eld \end_inset shape \begin_inset Quotes erd \end_inset for choosing an optimal parameter setting, such as \begin_inset Quotes eld \end_inset C \begin_inset Quotes erd \end_inset in a Support Vector Machine. \layout Section Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sec:Sparsity-Bound} \end_inset \layout Standard The sample compression bound \begin_inset LatexCommand \cite{LW} \end_inset , \begin_inset LatexCommand \cite{FW} \end_inset is like the PAC-Bayes bound in that it applies to arbitrary precision continuou s valued classifiers. Unlike the PAC-Bayes bound, it applies meaningfully to nonstochastic classifier s. Mainstream learning algorithms do not optimize the sample compression metric, so the bound application is somewhat rarer. Nonetheless, there do exist some reasonably competitive learning algorithms for which the sample compression bound produces significant results. \layout Standard The section is organized as follows: \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:The-Sample-Compression-Bound} \end_inset states and proves the sample compression bound. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:SC-tightness} \end_inset shows that the sample compression bound is nearly as tight as possible given the observations. \layout Enumerate Subsection \begin_inset LatexCommand \ref{sub:SC-application} \end_inset discusses results from the application of the sample compression bound to support vector machines. \layout Subsection The Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sub:The-Sample-Compression-Bound} \end_inset \layout Standard The Sample Compression bound stated here is differs somewhat from other results ( \begin_inset LatexCommand \cite{LW} \end_inset \begin_inset LatexCommand \cite{FW} \end_inset and others) by generalization and simplification but the bound behavior is qualitatively identical. \layout Standard Suppose we have a learning algorithm, \begin_inset Formula $A(S)$ \end_inset whose training is \begin_inset Quotes eld \end_inset sparse \begin_inset Quotes erd \end_inset \begin_inset Foot collapsed true \layout Standard This is satisfied, for example, by the Support Vector Machine algorithm which only depends upon the set of support vectors. \end_inset in the sense that the output classifier is dependent upon only a subset of the data, \begin_inset Formula $A(S)=A(S')$ \end_inset for \begin_inset Formula $S'\subseteq S$ \end_inset . The sample compression bound is dependent on the error rate, \begin_inset Formula $\hat{c}_{S-S'}$ \end_inset on the subset \begin_inset Formula $S-S'$ \end_inset . The motivation here is that the examples which the learning algorithm does \emph on not \emph default depend upon are \begin_inset Quotes eld \end_inset almost \begin_inset Quotes erd \end_inset independent and so we can \begin_inset Quotes eld \end_inset almost \begin_inset Quotes erd \end_inset get a test set bound. \layout Standard \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename sample_compression.eps scale 50 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:SC-POL} \end_inset The interactive proof of learning for the sample compression bound. Note that the learning algorithm is arbitrary here, similar to the test set bound. \end_inset \layout Standard Viewed as an interactive proof of learning (in figure \begin_inset LatexCommand \ref{fig:SC-POL} \end_inset ), the sample compression bound is unique amongst training set bounds because it does not require \emph on any \emph default initial commitment to a measure over the classifiers. \layout Theorem (Sample Compression Bound) For all \begin_inset Formula $\delta\in(0,1]$ \end_inset , \begin_inset Formula $D$ \end_inset , \begin_inset Formula $A$ \end_inset : \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\delta\] \end_inset \layout Proof Suppose we knew in advance that the learning algorithm will not depend upon some subset of the examples. Then, the \begin_inset Quotes eld \end_inset undependent \begin_inset Quotes erd \end_inset subset acts like a test set and gives us a test set bound: \begin_inset Formula \[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset (Note that, technically, it is possible to refer to \begin_inset Formula $S'$ \end_inset unambiguously before randomizing over \begin_inset Formula $S$ \end_inset by specifying the indices of \begin_inset Formula $S$ \end_inset contained in \begin_inset Formula $S'$ \end_inset .) Negating this, we get: \layout Proof \begin_inset Formula \[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset and using the union bound ( \begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$ \end_inset over each possible subset, \begin_inset Formula $S'$ \end_inset , we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\delta\] \end_inset Negating this again gives us the proof. \layout Subsection The Sample Compression Bound is Sometimes Tight \layout Standard \begin_inset LatexCommand \label{sub:SC-tightness} \end_inset \layout Standard We can construct a learning algorithm/learning problem pair such that the Sample compression bound is provably near optimal, as a function of it's observables. \layout Theorem (Sample Compression Tightness) For all \begin_inset Formula $\delta\in(0,1]$ \end_inset , \begin_inset Formula $\frac{k}{m}$ \end_inset , there exists a distribution \begin_inset Formula $D$ \end_inset and learning algorithm \begin_inset Formula $A$ \end_inset s.t. \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)>\delta-\delta^{2}\] \end_inset furthermore, if \begin_inset Formula $S^{*}$ \end_inset minimizes \begin_inset Formula $\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)$ \end_inset , then \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(c^{*}=A(S^{*}):\,\,\, c_{D}^{*}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S^{*}}^{*},\frac{\delta}{m{{m \choose |S-S^{*}|}}}\right)\right)>\delta-\delta^{2}\] \end_inset \layout Proof The proof is constructive and similar to the Occam's Razor tightness result. In particular, we show how to construct a learning algorithm which outputs classifiers that err independently depending on the subset \begin_inset Formula $S'$ \end_inset used. \layout Proof Consider an input space \begin_inset Formula $X=\{0,1\}^{2^{m}}$ \end_inset . Each variable in the input space \begin_inset Formula $x_{S'}$ \end_inset can be thought of as indexing a unique subset \begin_inset Formula $S'\subseteq S$ \end_inset of the examples. In the rest of the proof, we index variables by the subset they correspond to. \layout Proof Draws from the distribution \begin_inset Formula $D$ \end_inset can be made by first flipping an unbiased coin to get \begin_inset Formula $y=1$ \end_inset with probability \begin_inset Formula $0.5$ \end_inset and \begin_inset Formula $y=-1$ \end_inset with probability \begin_inset Formula $0.5$ \end_inset . The distribution on \begin_inset Formula $X$ \end_inset consists of a set of independent values after conditioning on \begin_inset Formula $y$ \end_inset . Choose \begin_inset Formula \[ \Pr(x_{S'}\neq y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset Now, the learning algorithm \begin_inset Formula $A(S')$ \end_inset is very simple---it just outputs the classifier \begin_inset Formula $c(x)=x_{S'}$ \end_inset . On the set \begin_inset Formula $S-S'$ \end_inset , we have: \begin_inset Formula \[ \forall S'\,\,\,\Pr_{S\sim D^{m}}\left(\hat{c}_{S-S'}\geq\frac{k}{m}\right)=1-\frac{\delta}{m{{m \choose |S-S'|}}}\] \end_inset Using independence, we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}\geq\frac{k}{m}\right)=\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset Negating, we get: \begin_inset Formula \[ \Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}<\frac{k}{m}\right)=1-\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] \end_inset and doing some algebra, we get the result. \layout Subsection Application of the Sample Compression Bound \layout Standard \begin_inset LatexCommand \label{sub:SC-application} \end_inset \layout Standard One obvious application of the Sample Compression bound is to support vector machines, since the learned classifier is only dependent on the set of support vectors. If \begin_inset Formula $S'$ \end_inset is the set of support vectors then \begin_inset Formula $S-S'$ \end_inset is the set of nonsupport vectors. Unfortunately, it turns out that this does not work so well, as observed in figure \begin_inset LatexCommand \ref{fig:sc-svm} \end_inset . \begin_inset Float figure wide false collapsed false \layout Standard \begin_inset Graphics filename test_and_compress.ps display color width 60col% rotateAngle 270 \end_inset \layout Caption \begin_inset LatexCommand \label{fig:sc-svm} \end_inset The sample compression bound applied to the output of a support vector machine with a Gaussian kernel. Here we use \begin_inset Formula $\delta=0.01$ \end_inset \end_inset \layout Standard There are other less common learning algorithms for which the sample compression bound works well. The Set Covering machine \begin_inset LatexCommand \cite{SCM} \end_inset has an associated bound which is a variant of the Sample Compression Bound. \layout Section Discussion \layout Subsection Learning algorithm design \layout Standard \emph on Every \emph default train set bound implies a learning algorithm: choose the classifier which minimizes the true error bound. This sounds like a rich source of learning algorithms, but there are some severe caveats to that statement. \layout Enumerate It is important to note that the form of a train set bound does \emph on not \emph default imply that this minimization is a good idea. Choosing between two classifiers based upon their true error bound implies a better worst-case bound on the true error. It does not imply an improved true error. In many situations, there is some other metric of comparison (such as train error rate) which in fact creates better behavior. \layout Enumerate Another strong caveat is that, historically, train set bounds have simply not been tight enough on real datasets for a nonvacuous application. This is changing with new results, but more progress is necessary. \layout Enumerate Often the optimization problem is simply not very tractable. In addition to sample complexity, learning algorithms must be concerned with run time and space usage. \layout Subsection Philosophy \layout Standard Train set bounds teach us about ways in which verifiable learning is possible, a subject which borders on philosophy. The train set bound presented here essentially shows that a reasonable person will be convinced of learning success when a short-description classifie r does well on train set data. The results here do \emph on not \emph default imply that this is the only way to convincingly learn. In fact, the (sometimes large) looseness of the Occam's Razor bound suggests that other methods for convincing learning processes exist. This observation is partially shown by other train set bound results which are presented next. \layout Subsection Conclusion \layout Standard This introduction to prediction theory covered two styles of bound: the test set bound and the train set bound. There are two important lessons here: \layout Enumerate Test set bounds provide a better way to report error rates and confidence intervals on future error rates than some current methods. \layout Enumerate Train set bounds can provide useful information. \layout Standard It is important to note that the train set bound and test set bound techniques are not mutually exclusive. It is possible to use both simultaneously \begin_inset LatexCommand \cite{TnT} \end_inset , and doing so is often desirable. Test set bounds are improved by the \begin_inset Quotes eld \end_inset free \begin_inset Quotes erd \end_inset information about the training error and train set bounds become applicable, even when not always tight. \layout Acknowledgement I am indebted to many people for helpful suggestions assembling the content. This includes some anonymous reviewers, several coauthors on previous papers, Imre Kondor, Arindam Banerjee, and Alina Beygelzimer. \layout Section* Appendix \layout Standard \begin_inset LatexCommand \label{sec-model} \end_inset \layout Standard For those interested in comparing models, the uniform convergence model \begin_inset LatexCommand \cite{Vapnik} \end_inset requires the additional assumption of the axiom of choice (to achieve empirical risk minimization) and a bound on the hypothesis space complexity. Typical theorems are of the form \begin_inset Quotes eld \end_inset after \begin_inset Formula $m$ \end_inset examples, all training errors are near to true errors \begin_inset Quotes erd \end_inset . \layout Standard The PAC learning model \begin_inset LatexCommand \cite{Valiant} \end_inset requires a polynomial time complexity learning algorithm and the assumption that the learning problem comes from some class. Theorems are of the form \begin_inset Quotes eld \end_inset after \begin_inset Formula $m$ \end_inset examples learning will be achieved \begin_inset Quotes erd \end_inset . \layout Standard Both of these models can support stronger statements than the basic prediction theory model presented here. Results from both of these models can apply here after appropriate massaging of results. \layout Standard The online learning model \begin_inset LatexCommand \cite{Online} \end_inset makes \emph on no \emph default assumptions. Typical theorems have the form \begin_inset Quotes eld \end_inset This learning algorithm's performance will be nearly as good as anyone of a set of classifiers. \begin_inset Quotes erd \end_inset The online learning model has very general results and no ability to answer questions about future performance as we address here. \layout Standard The prediction theory model can most simply be understood as a slight refinement of the information theory model. \layout Bibliography \bibitem {BEHW} A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth. \begin_inset Quotes eld \end_inset Occam's Razor. \begin_inset Quotes erd \end_inset Information Processing Letters 24: 377-380, 1987. \layout Bibliography \bibitem {Progressive} Avrim Blum, Adam Kalai, and John Langford 1999. Beating the Holdout: Bounds for K-Fold and Progressive Cross-Validation. COLT99. http://www.cs.cmu.edu/~jcl/papers/progressive_validation/coltfinal.ps \layout Bibliography \bibitem {chernoff} H. Chernoff, \begin_inset Quotes eld \end_inset A Measure of the Asymptotic Efficiency of Tests of a Hypothesis Based Upon the Sum of the Observations \begin_inset Quotes erd \end_inset , Annals of Mathematical Statistics 23:493-507, 1952. \layout Bibliography \bibitem {Devroye} Luc Devroye, Laszlo Gyorfi, Gabor Lugosi, \begin_inset Quotes eld \end_inset A Probabilistic Theory of Pattern Recognition \begin_inset Quotes erd \end_inset Springer-Verlag New York, 1996. \layout Bibliography \bibitem {FW} Sally Floyd and Manfred Warmuth, \begin_inset Quotes eld \end_inset Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension \begin_inset Quotes erd \end_inset , Machine Learning, Vol.21 (3), pp. 269--304, December 1995 http://www.cse.ucsc.edu/~manfred/pubs/sallycompr.forgalleys. ps \layout Bibliography \bibitem {kernelize} R. Herbrich and T. Graepel. \begin_inset Quotes eld \end_inset Large scale Bayes point machines \begin_inset Quotes erd \end_inset In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information System Processing 13, pages 528-534, Cambridge, MA, 2001. \layout Bibliography \bibitem {Sanity} Michael Kearns and Dana Ron, \begin_inset Quotes eld \end_inset Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. \begin_inset Quotes erd \end_inset Neural Computation 11(6), pages 1427-1453, 1999. Also in Proceedings of the Tenth Annual Conference on Computational Learning Theory, ACM Press, 1997, pages 152--162. http://www.cis.upenn.edu/~mkearns/papers/multi.ps \layout Bibliography \bibitem {Online} J. Kivinen and M. Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction, " in Journal of Information and Computation, vol. 132, no. 1, pp. 1-64, January 1997. http://www.cse.ucsc.edu/~manfred/pubs/lin.ps \layout Bibliography \bibitem {bound} John Langford, Program \begin_inset Quotes eld \end_inset bound \begin_inset Quotes erd \end_inset , http://www-2.cs.cmu.edu/~jcl/programs/bound/bound.html \layout Bibliography \bibitem {MC_journal} John Langford and Avrim Blum 1999. Microchoice Bounds and Self Bounding learning algorithms. Machine Learning Journal. http://www.cs.cmu.edu/~jcl/papers/microchoice/journal/journal_final.ps \layout Bibliography \bibitem {Shell} John Langford and David McAllester, \begin_inset Quotes eld \end_inset Computable Shell Decomposition Bounds \begin_inset Quotes erd \end_inset COLT 2000. http://www-2.cs.cmu.edu/~jcl/papers/computable_shell/colt_final.ps \layout Bibliography \bibitem {averaging} John Langford and Matthias Seeger, \begin_inset Quotes eld \end_inset Bounds for Averaging Classifiers \begin_inset Quotes erd \end_inset Technical report, Carnegie Mellon 2001 http://www-2.cs.cmu.edu/~jcl/papers/averagi ng/averaging_tech.ps \layout Bibliography \bibitem {PB-margin} John Langford and John Shawe-Taylor, \begin_inset Quotes eld \end_inset PAC-Bayes & Margins \begin_inset Quotes erd \end_inset , NIPS 2002. http://www-2.cs.cmu.edu/~jcl/papers/PB-margin/PB-margin.ps \layout Bibliography \bibitem {TnT} John Langford, \begin_inset Quotes eld \end_inset Combining Train Set and Test Set Bounds \begin_inset Quotes erd \end_inset , ICML2002. http://www-2.cs.cmu.edu/~jcl/papers/tnt_icml/train_n_test_final.ps \layout Bibliography \bibitem {LW} Nick Littlestone and Manfred Warmuth, \begin_inset Quotes eld \end_inset Relating Data Compression and Learnability \begin_inset Quotes erd \end_inset Unpublished Manuscript http://www.cse.ucsc.edu/~manfred/pubs/lrnk-olivier.ps \layout Bibliography \bibitem {PB} David McAllester, ``PAC-Bayesian Model Averaging'' COLT 1999. http://www.autoreason.com/posterior01.ps \layout Bibliography \bibitem {SCM} Mario Marchand and John Shawe-Taylor, \begin_inset Quotes eld \end_inset The Set Covering Machine \begin_inset Quotes erd \end_inset , ICML 2001. http://www.ai.mit.edu/projects/jmlr/papers/volume3/marchand02a/marchand02a.pdf \layout Bibliography \bibitem {Matthias} Matthias Seeger, \begin_inset Quotes eld \end_inset PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification \begin_inset Quotes erd \end_inset , Journal of Machine Learning Research 3 (2002), 233--269. http://www.dai.ed.ac.uk/homes/seeger/papers/seeger02a.ps.gz \layout Bibliography \bibitem {Chernoff} Sebastian Seung, Unpublished notes \layout Bibliography \bibitem {Valiant} L.G. Valiant. \begin_inset Quotes eld \end_inset A Theory of the Leuarnable. \begin_inset Quotes erd \end_inset Communications of the ACM 27(11):1134-1142, November 1984 \layout Bibliography \bibitem {Vapnik} V. N. Vapnik and A. Y. Chervonenkis. \begin_inset Quotes eld \end_inset On the uniform convergence of relative frequencies of events to their probabilit ies. \begin_inset Quotes erd \end_inset Theory of Probab. and its Applications, 16(2):264-280, 1971. \the_end