#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass amsart
\language english
\inputencoding default
\fontscheme default
\graphics default
\paperfontsize default
\spacing single 
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 2
\paperpagestyle default

\layout Title

Tutorial on Practical Prediction Theory for Classification
\layout Author

John Langford, IBM Research
\layout Abstract

We discuss basic prediction theory and it's impact on classification success
 evaluation, implications for learning algorithm design, and uses in learning
 algorithm execution.
 This tutorial is meant to be a comprehensive compilation of results which
 are both theoretically rigorous and practically useful.
 
\layout Abstract

There are a two important implications of the results presented here: 
\layout Abstract

(1) Common practices for reporting results in classification should change
 to use the test set bound.
\layout Abstract

(2) Train set bounds can sometimes be used to directly motivate learning
 algorithms.
 
\layout Section

Introduction
\layout Standard

Classifiers are functions which partition a set into two elements (the set
 of rainy days and the set of sunny days).
 Classifiers are the most simple nontrivial decision making element so studying
 the theory of learning classifiers is very fundamental to studying the
 theory of learning, in general.
 Classifiers are sufficiently complex that many phenomena observed in machine
 learning (theoretically or experimentally) can be observed in the classificatio
n setting.
 Yet, classifiers are simple enough to make their analysis easy to understand.
 This combination of sufficient yet minimal complexity for capturing phenomena
 makes the study of classifiers especially fruitful.
 
\layout Standard

The goal of this paper is an introduction to the theory of prediction for
 classification.
 Many of these results have been presented elsewhere, although the style,
 elemental nature, and generality of the presentation may be new.
 This is a tutorial, so we limit our presentation to those results which
 are both theoretically sound and practically useful.
\layout Standard

There are several important aspects of learning which the theory here casts
 light on.
 Perhaps the most important of these is the problem of performance reporting
 for classifiers.
 Many people use some form of empirical variance to estimate upper and lower
 bounds.
 This is an error-prone practice, and the test set bound in section 
\begin_inset LatexCommand \ref{sec-test}

\end_inset 

 implies a better method by nearly any metric.
 Hopefully, this will become common practice.
 
\layout Standard

After discussing the test set bound we cover the Occam's Razor bound, the
 simplest train set bound, which explains (and quantifies) the common phenomena
 of overfitting in simplest form.
 We also prove that the Occam's Razor bound cannot be improved without incorpora
ting extra information and apply the bound to decision trees.
\layout Standard

Next, we discuss two train set bounds, the PAC-Bayes bound and the Sample
 Compression bound, which have proved to give practical results for more
 general classifiers, such as Support Vector Machines and Neural Networks.
\layout Standard

All of the results here should be easily approachable and understandable.
 The proofs are simple, and examples are given.
 Pointers to related work are also given.
 
\layout Standard

It is important to note that all of the results presented here fall in the
 realm of classical statistics.
 In particular, all randomizations are over draws of the data, and our results
 have the form of confidence intervals.
 
\layout Standard

The layout of this document is as follows:
\layout Itemize

Section 
\begin_inset LatexCommand \ref{sec-formal}

\end_inset 

 presents the formal model 
\layout Itemize

Section 
\begin_inset LatexCommand \ref{sec-test}

\end_inset 

 presents the test set bound
\layout Itemize

Section 
\begin_inset LatexCommand \ref{sec-train}

\end_inset 

 presents the Occam's Razor bound
\layout Itemize

Section 
\begin_inset LatexCommand \ref{sec:PAC-Bayes-Bound}

\end_inset 

 presents the PAC-Bayes bound
\layout Itemize

Section 
\begin_inset LatexCommand \ref{sec:Sparsity-Bound}

\end_inset 

 presents the Sample Compression bound
\layout Standard

The formal model and test set bound must be understood in order to appreciate
 all later results.
 There is no particular dependency between the various train set bounds
 we present.
\layout Section

Formal Model
\layout Standard


\begin_inset LatexCommand \label{sec-formal}

\end_inset 


\layout Standard

There are many somewhat arbitrary choices of learning model.
 The one we use can (at best) be motivated by its simplicity.
 Other models such as the online learning model 
\begin_inset LatexCommand \cite{Online}

\end_inset 

, PAC learning 
\begin_inset LatexCommand \cite{Valiant}

\end_inset 

, and the uniform convergence model 
\begin_inset LatexCommand \cite{Vapnik}

\end_inset 

 differ in formulation, generality, and in the scope of addressable questions.
 The strongest motivation for studying the prediction theory model here
 is simplicity and corresponding generality of results.
 Appendix section 
\begin_inset LatexCommand \ref{sec-model}

\end_inset 

 discusses the connections between various models.
\layout Subsection

Basic quantities
\layout Standard

We are concerned with a learning model in which examples of (input, output)
 pairs come independently from some unknown distribution.
 The goal is to find a function capable of predicting the output given the
 input.
 There are several mathematical objects we work with.
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center 

\begin_inset  Tabular
<lyxtabular version="3" rows="7" columns="2">
<features>
<column alignment="center" valignment="top" leftline="true" width="0pt">
<column alignment="center" valignment="top" leftline="true" rightline="true" width="0pt">
<row topline="true" bottomline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard

Object
\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

Description
\end_inset 
</cell>
</row>
<row topline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $X$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

The (arbitrary) space of the input to a classifier
\end_inset 
</cell>
</row>
<row topline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $Y=\{-1,1\}$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

The output of a classification.
\end_inset 
</cell>
</row>
<row topline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $D$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

An (unknown) distribution over 
\begin_inset Formula $X\times Y$
\end_inset 


\end_inset 
</cell>
</row>
<row topline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $S$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

A set of examples drawn independently from 
\begin_inset Formula $D$
\end_inset 

.
\end_inset 
</cell>
</row>
<row topline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $m$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $=|S|$
\end_inset 

 the number of examples
\end_inset 
</cell>
</row>
<row topline="true" bottomline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\layout Standard


\begin_inset Formula $c$
\end_inset 


\end_inset 
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\layout Standard

A function mapping 
\begin_inset Formula $X$
\end_inset 

 to 
\begin_inset Formula $Y$
\end_inset 


\end_inset 
</cell>
</row>
</lyxtabular>

\end_inset 


\layout Standard

There are several aberrations of this model from other (perhaps more familiar)
 models.
 There is no mention of a classifier space, because the results do not depend
 upon a classifier space.
 Also, the notion of a distribution on 
\begin_inset Formula $X\times Y$
\end_inset 

 is strictly more general than the 
\begin_inset Quotes eld
\end_inset 

target concept
\begin_inset Quotes erd
\end_inset 

 model which assumes that there exists some function 
\begin_inset Formula $f:X\rightarrow Y$
\end_inset 

 used to generate the label 
\begin_inset LatexCommand \cite{Valiant}

\end_inset 

.
 In particular we can model noisy learning problems which do not have a
 particular 
\begin_inset Formula $Y$
\end_inset 

 value for each 
\begin_inset Formula $X$
\end_inset 

 value.
 This generalization is essentially 
\begin_inset Quotes eld
\end_inset 

free
\begin_inset Quotes erd
\end_inset 

 in the sense that it does not add to the complexity of presenting the results.
\layout Standard

It is worth noting that the 
\emph on 
only 
\emph default 
unverifiable assumption we make is examples are drawn independently from
 
\begin_inset Formula $D$
\end_inset 

.
 The strength of all the results which follow rests upon the correctness
 of this assumption.
 
\layout Standard

Sometimes, we decorate these objects with labels like 
\begin_inset Formula $S_{\textrm{train}}$
\end_inset 

 (a train set) or 
\begin_inset Formula $S_{\textrm{test}}$
\end_inset 

 (a test set).
 These decorations should always be clear.
 
\layout Example

Weather prediction: Will it rain today or not? In this case 
\begin_inset Formula $X=$
\end_inset 

 barometric pressure, observations of cloud cover or other sensory input
 and 
\begin_inset Formula $Y=0$
\end_inset 

 if the prediction is 
\begin_inset Quotes eld
\end_inset 

no rain
\begin_inset Quotes erd
\end_inset 

 and 
\begin_inset Formula $1$
\end_inset 

 otherwise.
 The distribution 
\begin_inset Formula $D$
\end_inset 

 is over sensory inputs and outcomes.
 The sample set 
\begin_inset Formula $S$
\end_inset 

, might consist of 
\begin_inset Formula $m=100$
\end_inset 

 (observation, outcome) pairs such as (pressure low, cloudy, rain), (pressure
 high, cloudy, not rain), etc.
 A classifier, 
\begin_inset Formula $c$
\end_inset 

, is any function which predicts 
\begin_inset Quotes eld
\end_inset 

rain
\begin_inset Quotes erd
\end_inset 

 or 
\begin_inset Quotes eld
\end_inset 

not rain
\begin_inset Quotes erd
\end_inset 

 based upon the observation.
\layout Example

Note that the independence assumption here is not perfectly satisfied although
 it seems to be a reasonable approximation.
 In any application of this theory, it must be carefully judged whether
 the independence assumption holds or not.
 
\layout Subsection

Derived quantities
\layout Standard

There are several derived quantities which the results are stated in terms
 of.
 
\layout Definition

(True Error) The true error 
\begin_inset Formula $c_{D}$
\end_inset 

 of a classifier 
\begin_inset Formula $c$
\end_inset 

 is defined as the probability that the classifier errs:
\begin_inset Formula \[
c_{D}\equiv\Pr_{(x,y)\sim D}(c(x)\neq y)\]

\end_inset 


\layout Standard

The true error is sometimes called the 
\begin_inset Quotes eld
\end_inset 

generalization error
\begin_inset Quotes erd
\end_inset 

.
 Unfortunately, the true error is not an observable quantity in our model
 because the distribution, 
\begin_inset Formula $D$
\end_inset 

, is unknown.
 However, there is a related quantity which is observable.
 
\layout Definition

(Empirical Error) Given a sample set 
\begin_inset Formula $S$
\end_inset 

, the 
\emph on 
empirical error
\emph default 
, 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

 is the observed rate of errors:
\begin_inset Formula \[
\hat{c}_{S}\equiv\Pr_{(x,y)\sim S}(c(x)\neq y)=\frac{1}{m}\sum_{i=1}^{m}I(c(x_{i})\neq y_{i})\]

\end_inset 

 where 
\begin_inset Formula $I()$
\end_inset 

 is a function which maps 
\begin_inset Quotes eld
\end_inset 

true
\begin_inset Quotes erd
\end_inset 

 to 
\begin_inset Formula $1$
\end_inset 

 and 
\begin_inset Quotes eld
\end_inset 

false
\begin_inset Quotes erd
\end_inset 

 to 
\begin_inset Formula $0$
\end_inset 

.
 Also, 
\begin_inset Formula $\Pr_{(x,y)\sim S}(...)$
\end_inset 

 is a probability taken with respect to the uniform distribution over the
 set of examples, 
\begin_inset Formula $S$
\end_inset 

.
\layout Standard

The empirical error is sometimes called the 
\begin_inset Quotes eld
\end_inset 

training error
\begin_inset Quotes erd
\end_inset 

, 
\begin_inset Quotes eld
\end_inset 

test error
\begin_inset Quotes erd
\end_inset 

, or 
\begin_inset Quotes eld
\end_inset 

observed error
\begin_inset Quotes erd
\end_inset 

 depending on whether it is the error rate on a training set, test set,
 or a more general set.
\layout Example*

(continued) The classifier 
\begin_inset Formula $c$
\end_inset 

 which always predicts 
\begin_inset Quotes eld
\end_inset 

not rain
\begin_inset Quotes erd
\end_inset 

 might have an empirical error of 
\begin_inset Formula $\frac{38}{100}$
\end_inset 

 and an unknown true error rate (which might in fact be 
\begin_inset Formula $0.5$
\end_inset 

).
 
\layout Subsection

Addressable questions
\layout Standard

Given the true error, 
\begin_inset Formula $c_{D}$
\end_inset 

 of a classifier 
\begin_inset Formula $c$
\end_inset 

 we can precisely describe the distribution of success and failure on future
 examples drawn according to 
\begin_inset Formula $D$
\end_inset 

.
 This quantity is derived from the unknown distribution 
\begin_inset Formula $D$
\end_inset 

, so our effort is directed toward upper and lower bounding the value of
 
\begin_inset Formula $c_{D}$
\end_inset 

 for a classifier 
\begin_inset Formula $c$
\end_inset 

.
 
\layout Standard

The variations in all of the bounds that we present are related to the method
 of choosing a classifier, 
\begin_inset Formula $c$
\end_inset 

.
 We cover two types of bounds:
\layout Enumerate

Test: Use examples in a test set which were not used in picking 
\begin_inset Formula $c$
\end_inset 

.
 
\layout Enumerate

Train: Use examples for both choosing 
\begin_inset Formula $c$
\end_inset 

 and evaluating 
\begin_inset Formula $c$
\end_inset 

.
 
\layout Standard

These methods are addressed in the next two sections.
\layout Standard

It is worth noting that one question that 
\emph on 
cannot 
\emph default 
be addressed in this model is 
\begin_inset Quotes eld
\end_inset 

Can learning occur for my problem?
\begin_inset Quotes erd
\end_inset 

.
 Extra assumptions (as in 
\begin_inset LatexCommand \cite{Valiant}

\end_inset 

 
\begin_inset LatexCommand \cite{Vapnik}

\end_inset 

) are inherently necessary.
 
\layout Section

The Test Set Method
\layout Standard


\begin_inset LatexCommand \label{sec-test}

\end_inset 


\layout Standard

The simplest bound arises for the classical technique of using 
\begin_inset Formula $m$
\end_inset 

 fresh examples to evaluate a classifier.
 This section is organized into two subsections:
\layout Itemize

Subsection 
\begin_inset LatexCommand \ref{subsec-test-bound}

\end_inset 

 presents the basic upper bound on the true error rate, handy approximations,
 and a lower bound
\layout Itemize

Subsection 
\begin_inset LatexCommand \ref{subsec-test-implication}

\end_inset 

 discusses the implications of the test set bound on error reporting practice.
 A better method for error reporting is applied to several datasets and
 the results are shown.
\layout Subsection

The Bound
\layout Standard


\begin_inset LatexCommand \label{subsec-test-bound}

\end_inset 


\layout Standard

Before stating the bound, we note a few basic observations which make the
 results less surprising.
\begin_inset Float figure
wide false
collapsed false

\layout Standard
\align center 

\begin_inset Graphics
	filename single_binomial.ps
	display monochrome
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-binom}

\end_inset 

 A depiction of the Binomial distribution.
 The cumulative of the Binomial is the area under the curve up to some point
 on the horizontal axis.
\end_inset 

The principal observable quantity is the empirical error 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

 of a classifier.
 What is the distribution of the empirical error for a fixed classifier?
 For each example, our independence assumption implies the probability that
 the classifier errs is given by the true error, 
\begin_inset Formula $c_{D}$
\end_inset 

.
 This can be modeled by a biased coin flip: heads if you are right and tails
 if you are wrong.
 
\layout Standard

What is the probability of observing 
\begin_inset Formula $k$
\end_inset 

 errors (heads) out of 
\begin_inset Formula $m$
\end_inset 

 examples (coin flips)? This is a very familiar distribution in statistics
 called the Binomial and so it should be unsurprising that the bounds presented
 here are fundamentally dependent upon the cumulative distribution of a
 Binomial.
 
\layout Definition

(Binomial Tail Distribution) 
\begin_inset Formula \[
\textrm{Bin}\left(\frac{k}{m},c_{D}\right)\equiv\Pr_{X_{1},...X_{m}\sim c_{D}^{m}}\left(\left.\sum_{i=1}^{m}X_{i}\leq k\right|c_{D}\right)=\sum_{j=0}^{k}\binom{m}{j}c_{D}^{j}(1-c_{D})^{m-j}\]

\end_inset 

= the probability that 
\begin_inset Formula $m$
\end_inset 

 examples (coins) with error rate (bias) 
\begin_inset Formula $c_{D}$
\end_inset 

 produce 
\begin_inset Formula $k$
\end_inset 

 or fewer errors (heads).
\layout Definition

A depiction of the Binomial distribution is given in figure 
\begin_inset LatexCommand \ref{fig-binom}

\end_inset 

.
\layout Standard

For the learning problem, we always choose a bias of 
\begin_inset Formula $c_{D}$
\end_inset 

 and 
\begin_inset Formula $X_{i}=$
\end_inset 

error or not on the 
\begin_inset Formula $i$
\end_inset 

th example.
 With these definitions, we can interpret the Binomial tail as the probability
 of an empirical error greater than or equal to 
\begin_inset Formula $\frac{k}{m}$
\end_inset 

.
 
\layout Standard

Since we are interested in calculating a bound on the true error given a
 confidence 
\begin_inset Formula $\delta$
\end_inset 

, and an empirical error 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

, it is handy to define the inversion of a Binomial tail.
\layout Definition

(Binomial Tail Inversion) 
\begin_inset Formula \[
\overline{\textrm{Bin}}\left(\frac{k}{m},\delta\right)\equiv\max_{p}\left\{ p:\,\,\textrm{Bin}\left(\frac{k}{m},p\right)\geq\delta\right\} \]

\end_inset 

= the largest true error such that the probability of observing 
\begin_inset Formula $\frac{k}{m}$
\end_inset 

 or more 
\begin_inset Quotes eld
\end_inset 

heads
\begin_inset Quotes erd
\end_inset 

 is at least 
\begin_inset Formula $\delta$
\end_inset 

.
\layout Standard

With these definitions finished, the results are all very simple statements.
 
\layout Theorem


\begin_inset LatexCommand \label{th-hb}

\end_inset 

(Test Set Bound) For all classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 


\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta\right)\right)\geq1-\delta\]

\end_inset 


\layout Standard

Note that 
\begin_inset Formula $m$
\end_inset 

 in this equation is 
\begin_inset Formula $m_{\textrm{test}}=|S_{\textrm{test}}|$
\end_inset 

, the size of the test set.
 
\layout Proof

(pictorially in 
\begin_inset LatexCommand \ref{fig-proof}

\end_inset 

) The proof is just a simple identification with the Binomial.
 For any distribution over 
\begin_inset Formula $(x,y)$
\end_inset 

 pairs and any classifier, 
\begin_inset Formula $c$
\end_inset 

, there exists some probability, 
\begin_inset Formula $c_{D}$
\end_inset 

, that the classifier predicts incorrectly.
 We can regard this event as a coin flip with bias 
\begin_inset Formula $c_{D}$
\end_inset 

.
 Since each example is picked independently, the distribution of the empirical
 error is a Binomial distribution.
 
\layout Proof

Whatever our true error 
\begin_inset Formula $c_{D}$
\end_inset 

 is, with probability 
\begin_inset Formula $1-\delta$
\end_inset 

 the observation 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

 will not fall into a tail of size 
\begin_inset Formula $\delta$
\end_inset 

.
 Assuming (correctly with probability 
\begin_inset Formula $1-\delta$
\end_inset 

) that the empirical error is not in the Binomial tail, we can constrain
 (and therefore bound) the value of the true error 
\begin_inset Formula $c_{D}$
\end_inset 

.
 
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename binomials.ps
	display monochrome
	width 35col%
	rotateAngle 270

\end_inset 


\begin_inset Graphics
	filename cut_binomials.ps
	display monochrome
	width 35col%
	rotateAngle 270

\end_inset 


\layout Standard


\begin_inset Graphics
	filename consistent_cut_binomials.ps
	display monochrome
	width 35col%
	rotateAngle 270

\end_inset 


\begin_inset Graphics
	filename worst_cut.ps
	display monochrome
	width 35col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-proof}

\end_inset 

 A graphical depiction of the test set bound.
 The first graph depicts several possible Binomials given their true error
 rates.
 The second depicts several Binomials, each with a tail cut.
 The third figure shows the Binomials consistent with the tail cut and observed
 test error.
 The worst case over all true error rates is the consistent Binomial with
 the largest bias.
 
\end_inset 

 
\layout Standard

The test set bound is, essentially, perfectly tight.
 For any classifier with a sufficiently large true error, the bound is violated
 exactly a 
\begin_inset Formula $\delta$
\end_inset 

 portion of the time.
 
\layout Subsubsection

Approximations
\layout Standard


\begin_inset LatexCommand \label{sub:Corollaries}

\end_inset 


\layout Standard

There are several immediate corollaries of the test set bound (
\begin_inset LatexCommand \ref{th-hb}

\end_inset 

) which are more convenient when a computer is not handy.
 The first corollary applies to the limited 
\begin_inset Quotes eld
\end_inset 

realizable
\begin_inset Quotes erd
\end_inset 

 setting where you happen to observe 
\begin_inset Formula $0$
\end_inset 

 test errors.
\layout Corollary

(Realizable Test Set Bouund) For all classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 


\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\hat{c}_{S}=0\Rightarrow c_{D}\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta\]

\end_inset 


\layout Proof

Specializing theorem 
\begin_inset LatexCommand \ref{th-hb}

\end_inset 

 to the zero empirical error case, we get:
\begin_inset Formula \[
\textrm{Bin}\left(\frac{0}{m},\epsilon\right)=(1-\epsilon)^{m}\leq e^{-\epsilon m}\]

\end_inset 

Setting this equal to 
\begin_inset Formula $\delta$
\end_inset 

 and solving for 
\begin_inset Formula $\epsilon$
\end_inset 

 gives us the result.
 The last inequality can be most simply motivated by comparing graphs as
 in figure 
\begin_inset LatexCommand \ref{fig:graphs}

\end_inset 

.
 
\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename exp.ps
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:graphs}

\end_inset 

 A graph suggesting 
\begin_inset Formula $e^{-\epsilon m}\geq(1-\epsilon)^{m}$
\end_inset 

.
\end_inset 


\layout Standard

Approximations which hold for arbitrary (nonzero) error rates rely upon
 the Chernoff bound which we state next, for completeness.
 
\layout Lemma


\begin_inset LatexCommand \label{lem:Relative-Entropy-Chernoff}

\end_inset 

(Relative Entropy Chernoff Bound) For 
\begin_inset Formula $\frac{k}{m}<p$
\end_inset 

: 
\begin_inset Formula \[
\textrm{Bin}\left(\frac{k}{m},p\right)\leq e^{-m\textrm{KL}\left(\frac{k}{m}||p\right)}\]

\end_inset 


\layout Proof

(Based on 
\begin_inset LatexCommand \cite{Chernoff}

\end_inset 

) For all 
\begin_inset Formula $\lambda>0$
\end_inset 

, we have:
\layout Proof


\begin_inset Formula \[
\textrm{Bin}\left(\frac{k}{m},p\right)=\Pr_{X^{m}\sim p^{m}}\left(\frac{1}{m}\sum_{i=1}^{m}X_{i}\leq\frac{k}{m}\right)=\Pr_{X^{m}\sim p^{m}}\left(e^{m\lambda\frac{1}{m}\sum_{i=1}^{m}X_{i}}\leq e^{m\lambda\frac{k}{m}}\right)\]

\end_inset 


\begin_inset Formula \[
=\Pr_{X^{m}\sim p^{m}}\left(e^{-\lambda\sum_{i=1}^{m}X_{i}}\geq e^{-\lambda k}\right)\]

\end_inset 

Using Markov's inequality (
\begin_inset Formula $X\geq0$
\end_inset 

, 
\begin_inset Formula $EX=\mu$
\end_inset 

, 
\begin_inset Formula $\Rightarrow\Pr(X\geq\delta)\leq\frac{\delta}{\mu}$
\end_inset 

): 
\begin_inset Formula \[
\leq\frac{e^{-\lambda k}}{E_{X^{m}\sim p^{m}}e^{-\lambda\sum_{i=1}^{m}X_{i}}}\]

\end_inset 

Using Jensen's inequality (
\begin_inset Formula $\frac{1}{x}$
\end_inset 

 concave 
\begin_inset Formula $\Rightarrow$
\end_inset 

 
\begin_inset Formula $\frac{1}{EX}\leq E\frac{1}{X}$
\end_inset 

), we get:
\begin_inset Formula \[
\leq e^{-\lambda k}E_{X^{m}\sim p^{m}}e^{\lambda\sum_{i=1}^{m}X_{i}}\]

\end_inset 

Using independence, we get:
\begin_inset Formula \[
=e^{-\lambda k}\left(pe^{\lambda}+(1-p)\right)^{m}\]

\end_inset 

and rewriting, we get:
\begin_inset Formula \[
=e^{-mf(\lambda)}\]

\end_inset 

where 
\begin_inset Formula $f(\lambda)=\lambda\frac{k}{m}-\ln\left(pe^{\lambda}+1-p\right)$
\end_inset 

.
\layout Proof


\begin_inset Formula $\lambda$
\end_inset 

 is a free parameter which can be optimized to find the tightest possible
 bound.
 To find the optimal value, find 
\begin_inset Formula $\lambda^{*}$
\end_inset 

 so that 
\begin_inset Formula $f'(\lambda^{*})=0$
\end_inset 

.
 
\begin_inset Formula \[
0=f'(\lambda^{*})=\frac{k}{m}-\frac{pe^{\lambda^{*}}}{pe^{\lambda^{*}}+1-p}\]

\end_inset 


\begin_inset Formula \[
\Rightarrow\frac{\frac{k}{m}}{p}\left(pe^{\lambda^{*}}+1-p\right)=e^{\lambda^{*}}\]

\end_inset 


\begin_inset Formula \[
\Rightarrow\frac{\frac{k}{m}}{p}-\frac{k}{m}=\left(1-\frac{k}{m}\right)e^{\lambda^{*}}\]

\end_inset 


\begin_inset Formula \[
\Rightarrow\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}=e^{\lambda^{*}}\]

\end_inset 

Using this, we get: 
\begin_inset Formula \[
f(\lambda^{*})=\frac{k}{m}\ln\frac{\frac{k}{m}(1-p)}{p\left(1-\frac{k}{m}\right)}-\ln\left(\frac{1-p}{1-\frac{k}{m}}\right)\]

\end_inset 


\begin_inset Formula \[
=\frac{k}{m}\ln\frac{\frac{k}{m}}{p}+(1-\frac{k}{m})\ln\left(\frac{1-\frac{k}{m}}{1-p}\right)\]

\end_inset 


\begin_inset Formula \[
=\textrm{KL}(\frac{k}{m}||p)\]

\end_inset 


\layout Standard

Using the Chernoff bound, we can loosen the test set bound to achieve a
 more analytic fom.
\layout Corollary

(Agnostic Test Set Bound) For all classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 


\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\textrm{KL}\left(\hat{c}_{S}||c_{D}\right)\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta\]

\end_inset 

where 
\begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$
\end_inset 

 is the Kullback-Leibler divergence between two coins of bias 
\begin_inset Formula $q,p$
\end_inset 

 with 
\begin_inset Formula $q<p$
\end_inset 

.
\layout Proof

Loosening theorem 
\begin_inset LatexCommand \ref{th-hb}

\end_inset 

 with the Chernoff approximation for 
\begin_inset Formula $\frac{k}{m}<c_{D}$
\end_inset 

 we get:
\begin_inset Formula \[
\textrm{Bin}\left(\frac{k}{m},c_{D}\right)\leq e^{-m\textrm{KL}\left(\frac{k}{m}||c_{D}\right)}\]

\end_inset 

 setting this equal to 
\begin_inset Formula $\delta$
\end_inset 

, and solving for 
\begin_inset Formula $\epsilon$
\end_inset 

 gives the result.
\layout Standard

The agnostic test set bound can be further loosened by bounding the value
 of 
\begin_inset Formula $\textrm{KL}(q||p)$
\end_inset 

.
\layout Corollary

(Agnostic Test Set Bound II) For all classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 


\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(c_{D}\leq\hat{c}_{S}+\sqrt{\frac{\ln\frac{1}{\delta}}{2m}}\right)\geq1-\delta\]

\end_inset 


\layout Proof

Use the approximation:
\begin_inset Formula \[
\textrm{KL}(\frac{k}{m},c_{D})\geq2(c_{D}-\frac{k}{m})^{2}\]

\end_inset 

with the Chernoff bound and Test set bounds to get the result.
\layout Standard

There has been some confusion elsewhere caused by differences between results
 in the realizable and the agnostic case.
 The differences are fundamental, and are related to the decrease in the
 variance of a Binomial as the bias (i.e.
 true error) approaches 
\begin_inset Formula $0$
\end_inset 

.
 Note that this implies using the exact Binomial tail calculation can result
 in 
\emph on 
functional 
\emph default 
(rather than merely constant) improvements on the above corollary.
 
\layout Subsubsection

A Test Set Lower Bound
\layout Standard

The true error can be lower bounded using a symmetric application of the
 same techniques.
\layout Theorem


\begin_inset LatexCommand \label{th-lb}

\end_inset 

(Test Set Lower Bound) For all classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 


\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(c_{D}\geq\min_{p}\left\{ p:\,\,1-\textrm{Bin}\left(\hat{c}_{S},p\right)\geq\delta\right\} \right)\geq1-\delta\]

\end_inset 


\layout Standard

The proof is completely symmetric.
 Note that both bounds hold with probability 
\begin_inset Formula $1-2\delta$
\end_inset 

 since 
\begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$
\end_inset 

.
 This is particularly convenient when the square-root version of the Chernoff
 approximation is used in both directions to get: 
\begin_inset Formula \[
\forall c\,\,\Pr_{S\sim D^{m}}\left(|c_{D}-\hat{c}_{S}|\leq\sqrt{\frac{\ln\frac{2}{\delta}}{2m}}\right)\geq1-\delta\]

\end_inset 


\layout Example*

(continued) let 
\begin_inset Formula $\delta=0.1$
\end_inset 

.
 Using the the square root Chernoff bound with 
\begin_inset Formula $\hat{c}_{S}=\frac{38}{100}$
\end_inset 

, we get the confidence interval 
\begin_inset Formula $c_{D}\in[0.26,0.50]$
\end_inset 

.
 Using an exact calculation for the Binomial tail, we get: 
\begin_inset Formula $c_{D}\in[0.30,0.47]$
\end_inset 

.
 In general, as the observed error moves toward 
\begin_inset Formula $0$
\end_inset 

, the exact calculation provides a tighter confidence interval than the
 agnostic approximation.
 
\layout Subsubsection

The state of the art
\layout Standard

Although the test set bound is very well understood, the same cannot be
 said of other testing methods.
 Only weak general results in this model are known for some variants of
 cross validation (see 
\begin_inset LatexCommand \cite{Progressive}

\end_inset 

).
 For specific learning algorithms (such as nearest neighbor), stronger results
 are known (see 
\begin_inset LatexCommand \cite{Devroye}

\end_inset 

).
 There are a wide range of essentially unanalyzed methods and a successful
 analysis seems particularly tricky although very worthwhile if completed.
 
\layout Subsection

Test Set Bound Implications
\layout Standard


\begin_inset LatexCommand \label{subsec-test-implication}

\end_inset 


\layout Standard

There are some common practices in machine learning which can be improved
 by application of the test set bound.
 When attempting to calculate a confidence interval on the true error rate
 given the test set, many people follow a standard statistical prescription:
\layout Enumerate

Calculate the empirical mean 
\begin_inset Formula $\hat{\mu}=\hat{c}_{S_{\textrm{test}}}=\frac{1}{m}\sum_{i=1}^{m}I(h(x_{i})\neq y_{i})$
\end_inset 

.
\layout Enumerate

Calculate the empirical variance 
\begin_inset Formula $\hat{\sigma}^{2}=\frac{1}{m-1}\sum_{i=1}^{m}(I(c(x_{i})=y_{i})-\hat{\mu})^{2}$
\end_inset 

.
\layout Enumerate

Pretend that the distribution is Gaussian with the above variance and construct
 a confidence interval by cutting the tails of the Gaussian cumulative distribut
ion at the 
\begin_inset Formula $2\hat{\sigma}$
\end_inset 

 (or some other) point.
\layout Standard

This approach is motivated by the fact that for any 
\emph on 
fixed 
\emph default 
true error rate, the distribution of the observed accuracy behaves like
 a Gaussian 
\emph on 
asymptotically.
 
\emph default 
Here, asymptotically means 
\begin_inset Quotes eld
\end_inset 

in the limit as the number of test examples goes to infinity
\begin_inset Quotes erd
\end_inset 

.
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard
\align center 

\begin_inset Graphics
	filename test_vs_two_sigma.ps
	display monochrome
	width 60col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-two-sigma-test}

\end_inset 

This is a graph of the confidence intervals implied by the test set bound
 (
\begin_inset LatexCommand \ref{th-hb}

\end_inset 

) on the left, and the approximate confidence intervals implied using the
 common two sigma rule motivated by asymptotic normality on the right.
 The upper bounds of the test set bound have 
\begin_inset Formula $\delta=0.025$
\end_inset 

 failure rate, so as to be comparable with the 
\begin_inset Formula $2$
\end_inset 

-sigma approach.
 The test set bound is better behaved as the confidence interval is confined
 to the interval 
\begin_inset Formula $[0,1]$
\end_inset 

 and is never over-optimistic.
 
\end_inset 

The problem with this approach is that it leads to fundamentally misleading
 results as shown in figure 
\begin_inset LatexCommand \ref{fig-two-sigma-test}

\end_inset 

.
 This 
\begin_inset Quotes eld
\end_inset 

misleading
\begin_inset Quotes erd
\end_inset 

 is both pessimistic and (much worse) optimistic.
 The pessimism can be seen by intervals with boundaries less than 
\begin_inset Formula $0$
\end_inset 

 or greater than 
\begin_inset Formula $1$
\end_inset 

 and the optimism by observing what happens when the test error rate is
 
\begin_inset Formula $0$
\end_inset 

.
 When we observe perfect classification, our confidence interval should
 
\emph on 
not 
\emph default 
have size 
\begin_inset Formula $0$
\end_inset 

 for any finite 
\begin_inset Formula $m$
\end_inset 

.
 
\layout Standard

The basic problem with this approach is that the Binomial distribution is
 not similar to a Gaussian when the error rate is near 
\begin_inset Formula $0$
\end_inset 

.
 Since our goal is finding a classifier with a small true error, it is essential
 that the means we use to evaluate classifiers work in this regime.
 The test set bound can satisfy this requirement (and, in fact, operates
 well for all true error regimes).
 
\layout Enumerate

The test set bound approach is 
\emph on 
never 
\emph default 
optimistic.
 
\layout Enumerate

The test set bound based confidence interval always returns an upper and
 lower bound in 
\begin_inset Formula $[0,1]$
\end_inset 

.
\layout Standard

The 
\begin_inset Formula $2\hat{\sigma}$
\end_inset 

 method is a relic of times when computational effort was expensive.
 It is now simple and easy to calculate a bound based upon the cumulative
 distribution of the Binomial (see 
\begin_inset LatexCommand \cite{bound}

\end_inset 

 for a program which does this).
 
\layout Standard

The test set bound can be thought of as a game where a 
\begin_inset Quotes eld
\end_inset 

Learner
\begin_inset Quotes erd
\end_inset 

 attempts to convince a reasonable 
\begin_inset Quotes eld
\end_inset 

Verifier
\begin_inset Quotes erd
\end_inset 

 of the amount of learning which has occurred.
 Pictorially we can represent this as in figure 
\begin_inset LatexCommand \ref{fig-holdout-protocol}

\end_inset 

.
\begin_inset Float figure
wide false
collapsed false

\layout Standard
\align center 

\begin_inset Graphics
	filename test_set.eps
	display monochrome
	width 90col%

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-holdout-protocol}

\end_inset 

 For this diagram 
\begin_inset Quotes eld
\end_inset 

increasing time
\begin_inset Quotes erd
\end_inset 

 is pointing downwards.
 The only requirement for applying this bound is that the learner must commit
 to a classifier without knowledge of the test examples.
 A similar diagram for train set bounds is presented later (and is somewhat
 more complicated).
 We can think of the bound as a technique by which the 
\begin_inset Quotes eld
\end_inset 

Learner
\begin_inset Quotes erd
\end_inset 

 can convince the 
\begin_inset Quotes eld
\end_inset 

Verifier
\begin_inset Quotes erd
\end_inset 

 that learning has occurred (and the degree to which it has occurred).
 Each of the proofs can be thought of as a communication protocol for an
 interactive proof of learning by the Learner.
\end_inset 

 
\layout Section

The Occam's Razor Bound
\layout Standard


\begin_inset LatexCommand \label{sec-train}

\end_inset 


\layout Standard

Given that the bounds for the simple test set bound work well, why do we
 need to engage in further work? There is one serious drawback to the test
 set technique---it requires 
\begin_inset Formula $m_{\textrm{test}}$
\end_inset 

 otherwise unused examples.
 This can strongly degrade the value of the learned hypothesis because an
 extra 
\begin_inset Formula $m_{\textrm{test}}$
\end_inset 

 examples for the training set increases the true error of the learned hypothesi
s from 
\begin_inset Formula $0$
\end_inset 

 to 
\begin_inset Formula $0.5$
\end_inset 

 for some natural learning algorithm/learning problem pairs.
 
\layout Standard

There is another reason why training set based bounds are important.
 Many learning algorithms implicitly assume that the train set accuracy
 
\begin_inset Quotes eld
\end_inset 

behaves like
\begin_inset Quotes erd
\end_inset 

 the true error in choosing the hypothesis.
 With an inadequate number of training examples, there may be very little
 relationship between the behavior of the train set accuracy and the true
 error.
 Training set based bounds can be used 
\emph on 
in 
\emph default 
the training algorithm and can provide insight into the learning problem
 itself.
 
\layout Standard

This section is organized into three subsections.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{subsec-train-set-upper}

\end_inset 

 states and proves the Occam's Razor bound.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sec-lower_upper}

\end_inset 

 proves that the Occam's Razor bound cannot be improved in general.
 
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{subsec-train-set-practice}

\end_inset 

 discusses implications of the Occam's Razor bound and shows results for
 it's application.
 
\layout Subsection

The Occam's Razor bound
\layout Standard


\begin_inset LatexCommand \label{subsec-train-set-upper}

\end_inset 


\layout Standard

This bound in more approximate forms has appeared elsewhere 
\begin_inset LatexCommand \cite{BEHW}

\end_inset 


\begin_inset LatexCommand \cite{PB}

\end_inset 

.
 We use 
\begin_inset Quotes eld
\end_inset 

prior
\begin_inset Quotes erd
\end_inset 

 (with quotes) here because it is an arbitrary probability distribution
 over classifiers and not necessarily a Bayesian prior.
 The distinction is important, because the theory holds regardless of whether
 or not a Bayesian prior is used.
 
\layout Theorem


\begin_inset LatexCommand \label{th-ORB}

\end_inset 

 (Occam's Razor Bound) For all 
\begin_inset Quotes eld
\end_inset 

priors
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)$
\end_inset 

 over the classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\geq1-\delta\]

\end_inset 


\layout Standard

The application of the Occam's Razor bound is somewhat more complicated
 than the application of the test set bound.
 Pictorially, the protocol for bound application is given in figure 
\begin_inset LatexCommand \ref{fig-training-set}

\end_inset 

.
 It is very important to notice that the 
\begin_inset Quotes eld
\end_inset 

prior
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)$
\end_inset 

 must be selected 
\emph on 
before 
\emph default 
seeing the training examples.
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard
\align center 

\begin_inset Graphics
	filename training_set.eps
	display monochrome
	width 90col%

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-training-set}

\end_inset 

 In order to apply the Occam's Razor bound it is necessary that the choice
 of 
\begin_inset Quotes eld
\end_inset 

prior
\begin_inset Quotes erd
\end_inset 

 be made before seeing any training examples.
 Then, the bound is calculated based upon the chosen classifier.
 Note that it 
\emph on 
is 
\emph default 

\begin_inset Quotes eld
\end_inset 

legal
\begin_inset Quotes erd
\end_inset 

 to chose the classifier based upon the prior 
\begin_inset Formula $P(c)$
\end_inset 

 as well as the empirical error 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

.
 
\end_inset 


\layout Proof

(pictorially in figure 
\begin_inset LatexCommand \ref{fig-train-proof}

\end_inset 

) The proof starts with the test set bound:
\begin_inset Formula \[
\forall c\,\,\,\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\geq1-\delta P(c)\]

\end_inset 

Negating this statement, we get:
\begin_inset Formula \[
\forall c\,\,\,\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta P(c)\]

\end_inset 

then, we apply the union bound in a nonuniform manner.
 The union bound says that 
\begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$
\end_inset 

.
 Applying the union bound to every classifier with a positive measure gives
 a total probability of failure of
\begin_inset Formula \[
\sum_{c}\delta P(c)=\delta\sum_{c}P(c)=\delta\]

\end_inset 

which implies
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists c:\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\delta\]

\end_inset 

Negating this again completes the proof.
 
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename nonuniform_binomials.ps
	display monochrome
	width 30col%
	rotateAngle 270

\end_inset 


\begin_inset Graphics
	filename deep_cut_nonuniform_binomials.ps
	display monochrome
	width 30col%
	rotateAngle 270

\end_inset 


\layout Standard


\begin_inset Graphics
	filename deep_cut_consistent_binomials.ps
	display monochrome
	width 30col%
	rotateAngle 270

\end_inset 

 
\begin_inset Graphics
	filename deep_cut_bound.ps
	display monochrome
	width 30col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-train-proof}

\end_inset 

 The sequence of pictures is the pictorial representation of the proof of
 the Occam's Razor Bound.
 The first figure shows a set of classifiers, each with a tail cut of some
 varying depth.
 The second picture shows an observed training error and the possible Binomial
 distributions for a chosen classifier.
 The third picture shows the true errors which are consistent with the observati
on and the tail cuts.
 The fourth picture shows the true error bound.
 
\end_inset 


\layout Subsubsection

Occam's Razor Corollaries
\layout Standard

Just as with the test set bound, we can relax the Occam's Razor bound (theorem
 
\begin_inset LatexCommand \ref{th-ORB}

\end_inset 

) with the Chernoff approximations to get a somewhat more tractable expression.
 
\layout Corollary


\begin_inset LatexCommand \label{th-horb}

\end_inset 

 (Chernoff Occam's Razor Bound) For all 
\begin_inset Quotes eld
\end_inset 

priors
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)$
\end_inset 

 over classifiers, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\leq\hat{c}_{S}+\sqrt{\frac{\ln\frac{1}{P(c)}+\ln\frac{1}{\delta}}{2m}}\right)\geq1-\delta\]

\end_inset 


\layout Proof

approximate the Binomial tail with the Chernoff Bound (lemma 
\begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff}

\end_inset 

).
\layout Standard

Many people are more familiar with a degenerate form of this bound where
 
\begin_inset Formula $P(c)=\frac{1}{|H|}$
\end_inset 

 and 
\begin_inset Formula $H$
\end_inset 

 is some set of classifiers.
 In that case, simply replace 
\begin_inset Formula $\ln\frac{1}{P(c)}$
\end_inset 

 with 
\begin_inset Formula $\ln|H|$
\end_inset 

.
 The form presented here is both more general and necessary if the bound
 is to be used in practice.
 
\layout Standard

Other corollaries as in section 
\begin_inset LatexCommand \ref{sub:Corollaries}

\end_inset 

 exist for the Occam's Razor bound.
 In general, just substitute 
\begin_inset Formula $\delta\rightarrow\delta P(c)$
\end_inset 

.
\layout Subsubsection

Occam's Razor Lower bound
\layout Standard

Just as for the test set bound, a lower bound of the same form applies.
\layout Theorem


\begin_inset LatexCommand \label{th-lorb}

\end_inset 

 (Occam's Razor Lower Bound) For all 
\begin_inset Quotes eld
\end_inset 

priors
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)$
\end_inset 

 over the classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall c:\,\, c_{D}\geq\min_{p}\{ p:\,\,1-\textrm{Bin}\left(\hat{c}_{S},p\right)\geq\delta P(c)\}\right)\geq1-\delta\]

\end_inset 


\layout Example*

(continued) Suppose that instead of having 
\begin_inset Formula $100$
\end_inset 

 test examples, we had 
\begin_inset Formula $100$
\end_inset 

 train examples.
 Also suppose that before seeing the train examples, we committed to 
\begin_inset Formula $P(c)=0.1$
\end_inset 

 for 
\begin_inset Formula $c$
\end_inset 

 the constant classifier which predicts 
\begin_inset Quotes eld
\end_inset 

no rain
\begin_inset Quotes erd
\end_inset 

.
 Then, the Chernoff approximations of the upper and lower bound give the
 interval, 
\begin_inset Formula $c_{D}\in[0.22,0.54]$
\end_inset 

.
 With an exact calculation, we get 
\begin_inset Formula $c_{D}\in[0.26,0.51]$
\end_inset 

.
\layout Subsubsection

The state of the art
\layout Standard

A very large amount of work has been done on train set bounds.
 In addition to those included here, there is: 
\layout Enumerate

Reinterpretations of uniform convergence 
\begin_inset LatexCommand \cite{Vapnik}

\end_inset 

 results for continuously parameterized classifiers.
\layout Enumerate

Reinterpretations of PAC convergence 
\begin_inset LatexCommand \cite{Valiant}

\end_inset 

 results.
\layout Enumerate

Shell bounds 
\begin_inset LatexCommand \cite{Shell}

\end_inset 

 which take advantage of the distribution of true error rates on classifiers.
 
\layout Enumerate

Train and Test bounds 
\begin_inset LatexCommand \cite{TnT}

\end_inset 

 which combine train set and test set bounds.
 
\layout Standard

Of this large amount of work only a small fraction has been shown to be
 useful on real-world learning algorithm/learning problem pairs.
 The looseness of train set based bounds often precludes analytical use.
 
\layout Subsection

The Occam's Razor Bound is sometimes Tight
\layout Standard


\begin_inset LatexCommand \label{sec-lower_upper}

\end_inset 


\layout Standard

The question of tightness for train set bounds is important to address,
 as many of them have been extremely loose.
 The simplest method to address this tightness is constructive: exhibit
 a learning problem/algorithm pair for which the bound is almost achieved.
 For the test set bound, this is trivial as any classifier with a large
 enough true error will achieve the bound.
 For the train set bound, this is not so trivial.
\layout Standard

How tight is the Occam's Razor bound (
\begin_inset LatexCommand \ref{th-ORB}

\end_inset 

)? The answer is 
\emph on 
sometimes 
\emph default 
tight.
 In particular, we can exhibit a set of learning problems where the Occam's
 Razor bound can not be made significantly tighter as a function of the
 observables, 
\begin_inset Formula $m$
\end_inset 

, 
\begin_inset Formula $\delta$
\end_inset 

, 
\begin_inset Formula $P(c)$
\end_inset 

, and 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

.
 After fixing the value of these quantities we construct a learning problem
 exhibiting this near equivalence to the Occam's Razor bound.
\layout Theorem


\begin_inset LatexCommand \label{th-dhlub}

\end_inset 

(Occam's Razor tightness) For all 
\begin_inset Formula $P(c)$
\end_inset 

, 
\begin_inset Formula $\frac{k}{m}$
\end_inset 

, 
\begin_inset Formula $\delta$
\end_inset 

 there exists a learning problem and algorithm such that: 
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists c:\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\geq\delta-\delta^{2}\]

\end_inset 

 Furthermore, if 
\begin_inset Formula $c^{*}$
\end_inset 

 is the classifier with minimal training error, then: 
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(c_{D}^{*}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S}^{*},\delta P(c)\right)\right)\geq\delta-\delta^{2}\]

\end_inset 


\layout Standard

Intuitively, this theorem implies that we can not improve significantly
 on the the Occam's Razor bound (theorem 
\begin_inset LatexCommand \ref{th-ORB}

\end_inset 

) without using extra information about our learning problem.
 
\layout Proof

The proof is constructive: we create a learning problem on which large deviation
s are likely.
 We start with a prior 
\begin_inset Formula $P(c)$
\end_inset 

, probability of error 
\begin_inset Formula $\delta$
\end_inset 

, 
\begin_inset Formula $m$
\end_inset 

, and a targeted empirical error rate, 
\begin_inset Formula $\frac{k}{m}$
\end_inset 

.
 For succinctness we assume that 
\begin_inset Formula $P(c)$
\end_inset 

 has support on a finite set of size 
\begin_inset Formula $n$
\end_inset 

.
 
\layout Proof

To define the learning problem, let: 
\begin_inset Formula $X=\{0,1\}^{n}$
\end_inset 

 and 
\begin_inset Formula $Y=\{0,1\}$
\end_inset 

.
 
\layout Proof

The distribution 
\begin_inset Formula $D$
\end_inset 

 can be drawn by first selecting 
\begin_inset Formula $Y$
\end_inset 

 with a single unbiased coin flip, and then choosing the 
\begin_inset Formula $i$
\end_inset 

th component of the vector 
\begin_inset Formula $X$
\end_inset 

 independently, 
\begin_inset Formula $\Pr((X_{1},...,X_{n})|Y)=\Pi_{i=1}^{n}\Pr(X_{i}|Y)$
\end_inset 

 .
 The individual components are chosen so 
\begin_inset Formula $\Pr(X_{i}=Y|Y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$
\end_inset 

.
 
\layout Proof

The classifiers we consider just use one feature to make their classification:
 
\begin_inset Formula $c_{i}(x)=x_{i}$
\end_inset 

.
 The true error of these classifiers is given by: 
\begin_inset Formula $c_{D}=\overline{\textrm{Bin}}\left(\frac{k}{m},\delta P(c)\right)$
\end_inset 

.
 
\layout Proof

This particular choice of true errors implies that if any classifier has
 a too-small train error rate, then the classifier with minimal train error
 must have a too-small train error.
 
\layout Proof

Using this learning problem, we know that:
\begin_inset Formula \[
\forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=\delta P(c)\]

\end_inset 

(negation)
\begin_inset Formula \[
\Rightarrow\forall c,\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)=1-\delta P(c)\]

\end_inset 

(independence)
\begin_inset Formula \[
\Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\forall c\,\, c_{D}<\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)<\prod_{c}\left(1-\delta P(c)\right)\]

\end_inset 

(negation)
\begin_inset Formula \[
\Rightarrow\forall\delta\in(0,1]:\,\,\Pr_{S\sim D^{m}}\left(\exists c\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\]

\end_inset 


\begin_inset Formula \[
=\sum_{i=1}^{n}\delta P(c_{i})\left(1-\Pr_{S\sim D^{m}}\left(\exists c\in\{ c_{1},...,c_{i-1}\}\,\,\, c_{D}\geq\overline{\textrm{Bin}}\left(\hat{c}_{S},\delta P(c)\right)\right)\right)\]

\end_inset 


\begin_inset Formula \[
\geq\sum_{i=1}^{n}\delta P(c_{i})(1-\delta)\]

\end_inset 


\begin_inset Formula \[
=\delta-\delta^{2}\]

\end_inset 


\layout Standard

The lower bound theorem implies that we can not improve an Occam's Razor
 like statement.
 However, it is important to note that large improvements are possible if
 we use other sources of information.
 To see this, just note the case where every single classifier happens to
 be the same.
 In this case the 
\begin_inset Quotes eld
\end_inset 

right
\begin_inset Quotes erd
\end_inset 

 bound would the be the 
\emph on 
test 
\emph default 
set bound, rather than the train set bound.
 The PAC-Bayes bound and the Sample Compression bound presented in the next
 sections use other sources of information.
\layout Subsection

Train set bound implications 
\layout Standard


\begin_inset LatexCommand \label{subsec-train-set-practice}

\end_inset 


\layout Subsubsection

results
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard
\align center 

\begin_inset Graphics
	filename test_vs_micro.ps
	display monochrome
	width 60col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig-microchoice-holdout}

\end_inset 

This is a plot comparing confidence intervals built based upon the test
 set bound (
\begin_inset LatexCommand \ref{th-hb}

\end_inset 

) with an 80%/20% train/test split on the left and the Occam's Razor bound
 (
\begin_inset LatexCommand \ref{th-ORB}

\end_inset 

) with all data in the training set on the right.
 The Occam's razor bound is sometimes superior on the smaller data sets
 and always nonvacuous (in contrast to many other train set bounds).
 
\end_inset 


\layout Standard

The Occam's Razor bound is strongly related to compression.
 In particular, for any self-terminating description language, 
\begin_inset Formula $d(c)$
\end_inset 

, we can associate a 
\begin_inset Quotes eld
\end_inset 

prior
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)=2^{-|d(c)|}$
\end_inset 

 with the property that 
\begin_inset Formula $\sum_{c}P(c)\leq1$
\end_inset 

.
 Consequently, short description length classifiers tend to have a tighter
 convergence and the penalty term, 
\begin_inset Formula $\ln\frac{1}{P(c)}$
\end_inset 

 is the number of 
\begin_inset Quotes eld
\end_inset 

nats
\begin_inset Quotes erd
\end_inset 

 (bits base e).
 For any language fixed before seeing the train set, classifiers with shorter
 description lengths have tighter bounds on the true error rate.
 
\layout Standard

One particularly useful description language to consider is the execution
 trace of a learning algorithm.
 If we carefully note the sequence of data-dependent choices which a learning
 algorithm makes, then the output classifier can be specified by a sequence
 such as 
\begin_inset Quotes eld
\end_inset 

2nd choice, third choice, first choice, etc...
\begin_inset Quotes erd
\end_inset 

 This is the idea behind microchoice bounds 
\begin_inset LatexCommand \cite{MC_journal}

\end_inset 

.
 Results for this approach are reported in Figure 
\begin_inset LatexCommand \ref{fig-microchoice-holdout}

\end_inset 

 and are strong enough to act as an empirical existence proof that Occam's
 Razor bounds can be made tight enough for useful application.
 
\layout Section

PAC-Bayes Bound
\layout Standard


\begin_inset LatexCommand \label{sec:PAC-Bayes-Bound}

\end_inset 


\layout Standard

The PAC-Bayes bound
\begin_inset LatexCommand \cite{PB}

\end_inset 

 is particularly exciting because it can provide quantitatively useful results
 for classifiers with 
\emph on 
real valued
\emph default 
 parameters.
 This includes such commonly used classifiers as Support Vector Machines
 and Neural Networks.
\begin_inset Foot
collapsed true

\layout Standard

There is a caveat here---the bound only applies to stochastic versions of
 the classifiers.
 However, the probability that the stochastic classifier differs from the
 classifier can be made very small.
\end_inset 

 This section is divided into three parts:
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:The-PAC-Bayes-Bound}

\end_inset 

 States and proves the PAC-Bayes Bound.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:How-Tight-is}

\end_inset 

 shows that the PAC-Bayes Bound is nearly as tight as possible given the
 observations.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:Application-of-the}

\end_inset 

 discusses results from the application of the PAC-Bayes bound to support
 vector machines.
\layout Subsection

The PAC-Bayes Bound
\layout Standard


\begin_inset LatexCommand \label{sub:The-PAC-Bayes-Bound}

\end_inset 


\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename pac-bayes.eps
	scale 50

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:PB-POL}

\end_inset 

 The 
\begin_inset Quotes eld
\end_inset 

interactive proof of learning
\begin_inset Quotes erd
\end_inset 

 associated with the PAC-Bayes bound.
 The figure is the same as for the Occam's razor bound, except that instead
 of committing to a single classifier, the PAC-Bayes bound applies to any
 distribution over classifiers.
 
\end_inset 


\layout Standard

The PAC-Bayes bound has been improved by tightening 
\begin_inset LatexCommand \cite{averaging}

\end_inset 

 and then with a much simpler proof 
\begin_inset LatexCommand \cite{Matthias}

\end_inset 

 since it was originally stated.
 The statement and proof presented here incorporate these improvements and
 improve on them slightly.
\layout Standard

The PAC-Bayes bound is dependent upon two derived quantities, an average
 true error:
\begin_inset Formula \[
Q_{D}\equiv E_{c\sim Q}c_{D}\]

\end_inset 

and an average train error:
\begin_inset Formula \[
\hat{Q}_{S}\equiv E_{c\sim Q}\hat{c}_{S}\]

\end_inset 

These quantities can be interpreted as the train error and true error of
 the meta-classifier which chooses a classifier according to 
\begin_inset Formula $Q$
\end_inset 

 every time a classification is made.
 If we refer to this meta-classifier as 
\begin_inset Formula $Q$
\end_inset 

, the notation for error rates is consistent with our earlier notation.
 
\layout Standard

The 
\begin_inset Quotes eld
\end_inset 

interactive proof of learning
\begin_inset Quotes erd
\end_inset 

 viewpoint of the PAC-Bayes bound is shown in figure 
\begin_inset LatexCommand \ref{fig:PB-POL}

\end_inset 

.
 It is essentially the same as for the Occam's Razor bound except for the
 commitment to the metaclassifier 
\begin_inset Formula $Q$
\end_inset 

 rather than the classifier 
\begin_inset Formula $c$
\end_inset 

.
\layout Theorem

(PAC-Bayes Bound) For all 
\begin_inset Quotes eld
\end_inset 

priors
\begin_inset Quotes erd
\end_inset 

 
\begin_inset Formula $P(c)$
\end_inset 

 over the classifiers, 
\begin_inset Formula $c$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall Q(c):\,\,\textrm{KL}\left(\hat{Q}_{S}||Q_{D}\right)\leq\frac{\textrm{KL}(Q||P)+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\]

\end_inset 

Here 
\begin_inset Formula $\textrm{KL}(q||p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$
\end_inset 

 for 
\begin_inset Formula $q<p$
\end_inset 

 and 
\begin_inset Formula $\textrm{KL}(Q||P)=E_{c\sim Q}\ln\frac{Q(c)}{P(c)}$
\end_inset 

 is the Kullback-Leibler divergence between two distributions.
\layout Standard

Note that the PAC-Bayes bound applies to any 
\emph on 
distribution 
\emph default 
over classifiers.
 When 
\begin_inset Formula $Q$
\end_inset 

 is concentrated on one classifier, we have 
\begin_inset Formula $\textrm{KL}(Q||P)=\ln\frac{1}{P(c)}$
\end_inset 

, just as in the Occam's razor bound
\begin_inset Foot
collapsed true

\layout Standard

As weakened with the relative entropy Chernoff bound (Lemma 
\begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff}

\end_inset 

) on the Binomial.
 
\end_inset 

, with the only distinction being the additive 
\begin_inset Formula $\frac{\ln(m+1)}{m}$
\end_inset 

 term.
 It is somewhat surprising that the bound holds for 
\emph on 
every 
\emph default 
distribution 
\begin_inset Formula $Q$
\end_inset 

 with only the slight worsening by 
\begin_inset Formula $\frac{\ln(m+1)}{m}$
\end_inset 

.
\layout Standard

Since the KL-divergence applies to distributions over continuous valued
 parameters, the PAC-Bayes bound can be nontrivially tight in this setting
 as well.
 This fact is used in the application section.
\layout Standard

We first state a couple simple lemmas that are handy in the proof.
 The intuition behind this lemma is that the expected probability of an
 event is not too small.
 
\layout Lemma


\begin_inset LatexCommand \label{lem:exp-approx}

\end_inset 

For all 
\begin_inset Formula $P(c)$
\end_inset 

, For all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}\leq\frac{m+1}{\delta}\right)\geq1-\delta\]

\end_inset 


\layout Proof

Note that:
\begin_inset Formula \[
\forall c\,\,\, E_{S\sim D^{m}}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}=\sum_{\frac{k}{m}}\Pr_{S\sim D^{m}}\left(\hat{c}_{S}=\frac{k}{m}\right)\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=\frac{k}{m}\right)}\]

\end_inset 


\begin_inset Formula \[
=m+1\]

\end_inset 

Taking the expectation over classifiers according to 
\begin_inset Formula $P$
\end_inset 

 and switching the order of expectation, we get:
\layout Proof


\begin_inset Formula \[
E_{S\sim D^{m}}E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}=m+1\]

\end_inset 

and using the Markov inequality (
\begin_inset Formula $X\geq0$
\end_inset 

, 
\begin_inset Formula $EX=\mu$
\end_inset 

, 
\begin_inset Formula $\Rightarrow\Pr(X>\frac{\mu}{\delta})\leq\delta$
\end_inset 

), we get:
\layout Proof


\begin_inset Formula \[
\forall P\,\,\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\leq\frac{m+1}{\delta}\right)\leq\delta\]

\end_inset 


\layout Standard

The next lemma shows that a certain expectation is bounded by the Kullback-Leibl
er distance between two coin flips, just as for the relative entropy Chernoff
 bound (Lemma 
\begin_inset LatexCommand \ref{lem:Relative-Entropy-Chernoff}

\end_inset 

).
\layout Lemma


\begin_inset LatexCommand \label{lemma: KL-approx}

\end_inset 

For all 
\begin_inset Formula $Q(c)$
\end_inset 

: 
\begin_inset Formula $\frac{E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}}{m}\geq\textrm{KL}(\hat{Q}_{S}||Q_{D})$
\end_inset 


\layout Proof


\begin_inset Formula $\frac{1}{m}E_{c\sim Q}\ln\frac{1}{\left(\begin{array}{c}
m\\
m\hat{c}_{S}\end{array}\right)c_{D}^{m\hat{c}_{S}}(1-c_{D})^{m(1-\hat{c}_{S})}}$
\end_inset 


\begin_inset Formula \[
=E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]-\frac{E_{c\sim Q}\ln\left(\begin{array}{c}
m\\
m\hat{c}_{S}\end{array}\right)}{m}\]

\end_inset 

Jensen's inequality (
\begin_inset Formula $f$
\end_inset 

 concave 
\begin_inset Formula $\Rightarrow$
\end_inset 

 
\begin_inset Formula $Ef(X)\geq f(EX)$
\end_inset 

):
\begin_inset Formula \[
E_{c\sim Q}\left[\hat{c}_{S}\ln\frac{1}{c_{D}}+\left(1-\hat{c}_{S}\right)\ln\frac{1}{1-c_{D}}\right]\geq\hat{Q}_{S}\ln\frac{1}{Q_{D}}+\left(1-\hat{Q}_{S}\right)\ln\frac{1}{1-Q_{D}}\]

\end_inset 


\begin_inset Formula \[
\textrm{ and \,\,\,\,}\frac{E_{c\sim Q}\ln\left(\begin{array}{c}
m\\
m\hat{c}_{S}\end{array}\right)}{m}\leq\frac{E_{c\sim Q}\ln e^{mH\left(\hat{c}_{S}\right)}}{m}=\frac{E_{c\sim Q}mH\left(\hat{c}_{S}\right)}{m}\leq H\left(\hat{Q}_{S}\right)\]

\end_inset 


\layout Standard

With these two lemmas, the PAC-Bayes theorem is easy to prove.
\layout Proof

(Of the PAC-Bayes theorem) Let 
\begin_inset Formula \[
P_{G}(c,\hat{c}_{S})=\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}}P(c)\]

\end_inset 

 
\begin_inset Formula \[
\Rightarrow0\leq\textrm{KL}(Q||P_{G})=E_{c\sim Q}\ln\frac{Q(c)}{P(c)}\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\]

\end_inset 


\begin_inset Formula \[
=\textrm{KL}(Q||P)-E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\]

\end_inset 


\begin_inset Formula \[
\Rightarrow E_{c\sim Q}\ln\frac{1}{\Pr_{S\sim D^{m}}\left(\hat{c}_{S}\right)}\leq\textrm{KL}(Q||P)+\ln E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\]

\end_inset 

 Now, use lemma 
\begin_inset LatexCommand \ref{lem:exp-approx}

\end_inset 

 on the right hand term and lemma 
\begin_inset LatexCommand \ref{lemma: KL-approx}

\end_inset 

 on the left hand term to finish the proof.
\layout Standard

Note that one of the last inequalities in the proof is not necessary, and
 so a slightly tighter bound can be proved.
\layout Subsection

The PAC-Bayes Bound is sometimes Tight
\layout Standard


\begin_inset LatexCommand \label{sub:How-Tight-is}

\end_inset 


\layout Standard

Since the PAC-Bayes bound is (almost) a generalization of the Occam's Razor
 bound, the tightness result for Occam's Razor also applies to PAC-Bayes
 bounds.
 
\layout Subsection

Application of the PAC-Bayes Bound
\layout Standard


\begin_inset LatexCommand \label{sub:Application-of-the}

\end_inset 


\layout Standard

Applying the PAC-Bayes bound requires some significant specialization 
\begin_inset LatexCommand \cite{PB-margin}

\end_inset 

.
 Here, we specialize to classifiers of the form: 
\begin_inset Formula \[
c(x)=\textrm{sign}\left(\vec{w}\cdot\vec{x}\right)\]

\end_inset 

Note that via the kernel trick, Support Vector Machines also have this form.
 
\layout Standard

The specialization is naturally expressed in terms of a few derived quantities:
\layout Enumerate

The cumulative distribution of a Gaussian.
 Let 
\begin_inset Formula $\bar{F}(x)=\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$
\end_inset 

.
 Here we use 
\begin_inset Formula $\bar{F}$
\end_inset 

 rather than 
\begin_inset Formula $F$
\end_inset 

 to denote the fact that we integrate from 
\begin_inset Formula $\infty$
\end_inset 

 to 
\begin_inset Formula $x$
\end_inset 

 rather than 
\begin_inset Formula $-\infty$
\end_inset 

 to 
\begin_inset Formula $x$
\end_inset 

.
 
\layout Enumerate

A 
\begin_inset Quotes eld
\end_inset 

posterior
\begin_inset Quotes erd
\end_inset 

 distribution 
\begin_inset Formula $Q(\vec{w},\mu)$
\end_inset 

 which is 
\begin_inset Formula $N(\mu,1)$
\end_inset 

 for some 
\begin_inset Formula $\mu>0$
\end_inset 

 in the direction of 
\begin_inset Formula $\vec{w}$
\end_inset 

 and 
\begin_inset Formula $N(0,1)$
\end_inset 

 in all perpendicular directions.
\layout Enumerate

The normalized margin of the examples
\begin_inset Formula \[
\gamma(\vec{x},y)=\frac{y\vec{w}\cdot\vec{x}}{||\vec{w}||||\vec{x}||}\]

\end_inset 


\layout Enumerate

A stochastic error rate, 
\begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)$
\end_inset 


\layout Standard

This last quantity in particular is very important to understand.
 Consider the case as 
\begin_inset Formula $\mu$
\end_inset 

 approaches 
\begin_inset Formula $\infty$
\end_inset 

.
 When the margin is negative (indicating an incorrect classification), 
\begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$
\end_inset 

 approaches 
\begin_inset Formula $1$
\end_inset 

.
 When the margin is positive 
\begin_inset Formula $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$
\end_inset 

 approaches 
\begin_inset Formula $0$
\end_inset 

.
 Thus, 
\begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$
\end_inset 

 is a softened form of the empirical error 
\begin_inset Formula $\hat{c}_{S}$
\end_inset 

 which takes into account the margin.
 
\layout Corollary

(PAC-Bayes Margin Bound) For all distributions 
\begin_inset Formula $D$
\end_inset 

, for all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

, we have: 
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall\vec{w},\mu:\,\,\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||Q(\vec{w},\mu)_{D}\right)\leq\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\]

\end_inset 


\layout Proof

The proof is very simple.
 We just choose the prior 
\begin_inset Formula $P=N(0,1)^{n}$
\end_inset 

 and work out the implications.
 
\layout Proof

Since the Gaussian distribution is the same in every direction, we can reorient
 the coordinate system of the prior to have one dimension parallel to 
\begin_inset Formula $w$
\end_inset 

.
 Since the draws in the parallel and perpendicular directions are independent,
 we have:
\begin_inset Formula \[
\textrm{KL}(Q||P)=\textrm{KL}(Q_{\perp}||P_{\perp})+\textrm{KL}(N(\mu,1)||N(0,1))\]

\end_inset 


\begin_inset Formula \[
=\frac{\mu^{2}}{2}\]

\end_inset 

as required.
\layout Proof

All that remains is calculating the stochastic error rate 
\begin_inset Formula $\hat{Q}(\vec{w},\mu)_{S}$
\end_inset 

.
 Fix a particular example, 
\begin_inset Formula $(\vec{x},y)$
\end_inset 

.
 This example has a natural decomposition 
\begin_inset Formula $\vec{x}=\vec{x}_{||}+\vec{x}_{\perp}$
\end_inset 

 into a component, 
\begin_inset Formula $\vec{x}_{||}$
\end_inset 

 parallel to the weight vector 
\begin_inset Formula $\vec{w}$
\end_inset 

 and a component 
\begin_inset Formula $\vec{x}_{\perp}$
\end_inset 

 perpendicular to the weight vector.
 
\layout Proof

To classify, we draw weight vector 
\begin_inset Formula $\vec{w}^{'}$
\end_inset 

 from 
\begin_inset Formula $\hat{Q}(\vec{w},\mu)$
\end_inset 

.
 This 
\begin_inset Formula $\vec{w}^{'}$
\end_inset 

 consists of 
\begin_inset Formula $3$
\end_inset 

 components, 
\begin_inset Formula $\vec{w}^{'}=\vec{w}_{||}^{'}+\vec{w}_{\perp}^{'}+\vec{w}_{\perp\perp}^{'}$
\end_inset 

.
 Here 
\begin_inset Formula $\vec{w}_{||}^{'}\sim N(\mu,1)$
\end_inset 

 is parallel to the original weight vector, 
\begin_inset Formula $\vec{w}_{\perp}^{'}\sim N(0,1)$
\end_inset 

 which is parallel to 
\begin_inset Formula $\vec{x}_{\perp}$
\end_inset 

 and 
\begin_inset Formula $\vec{w}_{\perp\perp}^{'}$
\end_inset 

 is perpendicular to both 
\begin_inset Formula $\vec{w}$
\end_inset 

 and 
\begin_inset Formula $\vec{x}$
\end_inset 

.
 We have:
\begin_inset Formula \[
\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\neq\textrm{sign}\left(\vec{w}^{'}\cdot\vec{x}\right)\right)\]

\end_inset 


\begin_inset Formula \[
=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\vec{w}\cdot\vec{x}\leq0\right)\]

\end_inset 

 If we let 
\begin_inset Formula $w_{||}^{'}=||w_{||}^{'}||$
\end_inset 

, 
\begin_inset Formula $w_{\perp}^{'}=||w_{\perp}^{'}||$
\end_inset 

, 
\begin_inset Formula $x_{||}=||\vec{x}_{||}||$
\end_inset 

, and 
\begin_inset Formula $x_{\perp}=||\vec{x}_{\perp}||$
\end_inset 

, and assume (without loss of generality) that 
\begin_inset Formula $y=1$
\end_inset 

 we get: 
\begin_inset Formula \[
=E_{\vec{x},y\sim S,w_{||}^{'}\sim N(\mu,1),w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\]

\end_inset 


\begin_inset Formula \[
=E_{\vec{x},y\sim S}E_{w_{||}^{'}\sim N(\mu,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\]

\end_inset 


\begin_inset Formula \[
=E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq-yz^{'}-yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\]

\end_inset 

Using the symmetricity of the Gaussian, this is:
\begin_inset Formula \[
=E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq yz^{'}+yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\]

\end_inset 

Using the fact that the sum of two Gaussians is a Gaussian:
\begin_inset Formula \[
=E_{\vec{x},y\sim S}E_{v\sim N\left(0,1+\frac{x_{\perp}^{2}}{x_{||}^{2}}\right)}I\left(y\mu\leq yv\right)\]

\end_inset 


\begin_inset Formula \[
=E_{\vec{x},y\sim S}E_{v\sim N\left(0,\frac{1}{\gamma(\vec{x},y)^{2}}\right)}I\left(y\mu\leq yv\right)\]

\end_inset 


\begin_inset Formula \[
=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)\]

\end_inset 

finishing the proof.
\layout Standard

Using the corollary, the true error bound 
\begin_inset Formula $\bar{Q}(\vec{w},\mu)_{D}$
\end_inset 

 satisfies the equation: 
\begin_inset Formula \[
\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||\bar{Q}(\vec{w},\mu)_{D}\right)=\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\]

\end_inset 

This is an implicit equation for 
\begin_inset Formula $\bar{Q}$
\end_inset 

 which can be easily solved numerically.
 
\layout Standard

The bound is stated in terms of dot products here, so naturally it is possible
 to kernelize the result using methods from 
\begin_inset LatexCommand \cite{kernelize}

\end_inset 

.
 In kernelized form, the bound applies to classifiers (as output by SVM
 learning algorithms) of the form: 
\begin_inset Formula \begin{equation}
c(x)=\textrm{sign}\left(\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)\label{eq:class-kernel}\end{equation}

\end_inset 

Since, by assumption, 
\begin_inset Formula $k$
\end_inset 

 is a kernel, we know that 
\begin_inset Formula $k(x_{i},x)=\vec{\Phi}(x_{i})\cdot\vec{\Phi}(x)$
\end_inset 

 where 
\begin_inset Formula $\vec{\Phi}(x)$
\end_inset 

 is some projection into another space.
 In kernelized form, we get 
\begin_inset Formula $\vec{w}\cdot\vec{x}=\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)$
\end_inset 

, 
\begin_inset Formula $\vec{x}\cdot\vec{x}=k(x,x)$
\end_inset 

, 
\begin_inset Formula $\vec{w}\cdot\vec{w}=\sum_{i,j}\alpha_{i}\alpha_{j}k(x_{i},x_{j})$
\end_inset 

, defining all of the necessary quantities to calculate the normalized margin,
 
\begin_inset Formula \[
\gamma(x,y)=\frac{\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)}{\sqrt{k(x,x)\sum_{i,j=1,1}^{m,m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})}}\]

\end_inset 

 
\layout Standard

One element remains, which is the value of 
\begin_inset Formula $\mu$
\end_inset 

.
 Unfortunately the bound can be nonmonotonic in the value of 
\begin_inset Formula $\mu$
\end_inset 

, but it turns out that for classifiers learned by support vector machines
 on reasonable datasets, there is only one value of 
\begin_inset Formula $\mu$
\end_inset 

 which is (locally, and thus globally) minimal.
 A binary search over some reasonable range of 
\begin_inset Formula $\mu$
\end_inset 

 (say from 
\begin_inset Formula $1$
\end_inset 

 to 
\begin_inset Formula $100$
\end_inset 

) can find the minima quickly, given the precomputation of the margins.
 It is worth noting again here that we are not 
\begin_inset Quotes eld
\end_inset 

cheating
\begin_inset Quotes erd
\end_inset 

---the bound holds for all values of 
\begin_inset Formula $\mu$
\end_inset 

 simultaneously.
 
\layout Standard

The computational time of the bound calculation is dominated by the calculation
 of the margins which is 
\begin_inset Formula $O\left(m^{2}\right)$
\end_inset 

 where 
\begin_inset Formula $m$
\end_inset 

 is the number of support vectors with a nonzero associated 
\begin_inset Formula $\alpha$
\end_inset 

.
 This computational time is typically dominated by the time of the SVM learning
 algorithm.
 
\layout Subsubsection

Results
\layout Standard

Application of this bound to support vector machines is of significant importanc
e because SVMs are reasonably effective and adaptable classifiers in common
 and widespread use.
 A SVM learns a kernelized classifier as per equation 
\begin_inset LatexCommand \ref{eq:class-kernel}

\end_inset 


\begin_inset Foot
collapsed false

\layout Standard

Some SVM learning algorithms actually learn a classifier of the form: 
\begin_inset Formula $c(x)=\textrm{sign}\left(b+\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)$
\end_inset 

.
 We don't handle this form here.
\end_inset 

.
 
\layout Standard

We apply the support vector machine to 8 UCI database problems chosen to
 fit the criteria 
\begin_inset Quotes eld
\end_inset 

two classes
\begin_inset Quotes erd
\end_inset 

 and 
\begin_inset Quotes eld
\end_inset 

real valued input features
\begin_inset Quotes erd
\end_inset 

.
 The problems vary in size over an order of magnitude from 
\begin_inset Formula $145$
\end_inset 

 to 
\begin_inset Formula $1428$
\end_inset 

 examples.
 In figure 
\begin_inset LatexCommand \ref{fig:test}

\end_inset 

 we use a 70/30 train/test split of the data.
 
\layout Standard

In all experiments, we use SVMlight with a Gaussian kernel and the default
 bandwidth.
 Results for other choices of the 
\begin_inset Quotes eld
\end_inset 

C
\begin_inset Quotes erd
\end_inset 

, bandwidth, and/or kernel appear to be qualitatively similar (although
 of course differing quantitatively).
 
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename test_and_margin.ps
	display monochrome
	width 60col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:test}

\end_inset 


\layout Standard

This figure shows the results of applying SVMlight to 8 datasets with a
 Gaussian kernel and a 70/30 train/test split.
 The observed test error rate is graphed as an X.
 On the test set, we calculate a Binomial confidence interval (probability
 of bound failure = 
\begin_inset Formula $0.01$
\end_inset 

) which upper bounds the true error rate.
 On the training set we calculate the PAC-Bayes margin bound for an optimized
 choice of 
\begin_inset Formula $\mu$
\end_inset 

.
 
\end_inset 

It is important to note that the PAC-Bayes margin bound is 
\emph on 
not
\emph default 
 precisely a bound (or confidence interval) on the true error rate of the
 learned classifier.
 Instead, it is a true error rate bound on an associated stochastic classifier
 chosen so as to have a similar test error rate.
 These bounds can be regarded as bounds for the original classifier only
 under an additional assumption: that picking a classifier according to
 the majority vote of this stochastic distribution does not worsen the true
 error rate.
 This is not true in general, but may be true in practice.
 
\layout Standard

It is of course unfair to compare the train set bound with the test set
 bound on a 70/30 train/test split because a very tight train set bound
 would imply that it is unnecessary to even have a test set.
 In figure 
\begin_inset LatexCommand \ref{fig:full-train}

\end_inset 

 we compare the true error bounds on all of the data to the true error bounds
 generated from the 70/30 train/test split.
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename test_vs_margin.ps
	display monochrome
	width 60col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:full-train}

\end_inset 


\layout Standard

In addition to comparing with everything in figure 
\begin_inset LatexCommand \ref{fig:test}

\end_inset 

, we graph the margin bound when 
\emph on 
all 
\emph default 
of the data is used for the train set.
 Note that it improves somewhat on the margin bound calculated using the
 70% train set (7/10 margin bound), but not enough to compete with the test
 set bound.
 
\end_inset 

The results show that the PAC-Bayes margin bound is tight enough to give
 useful information, but still not competitive with the test set bounds.
 This is in strong contrast with a tradition of quantitatively impractical
 margin bounds.
 There are several uses available for bounds which provide some information
 but which are not fully tight.
 
\layout Enumerate

They might be combined with a train/test bound 
\begin_inset LatexCommand \cite{TnT}

\end_inset 

.
 
\layout Enumerate

The train set bound might easily become tighter for smaller sample sizes.
 This was observed in 
\begin_inset LatexCommand \cite{TnT}

\end_inset 

.
\layout Enumerate

The train set bound might still have the right 
\begin_inset Quotes eld
\end_inset 

shape
\begin_inset Quotes erd
\end_inset 

 for choosing an optimal parameter setting, such as 
\begin_inset Quotes eld
\end_inset 

C
\begin_inset Quotes erd
\end_inset 

 in a Support Vector Machine.
\layout Section

Sample Compression Bound
\layout Standard


\begin_inset LatexCommand \label{sec:Sparsity-Bound}

\end_inset 


\layout Standard

The sample compression bound 
\begin_inset LatexCommand \cite{LW}

\end_inset 

, 
\begin_inset LatexCommand \cite{FW}

\end_inset 

 is like the PAC-Bayes bound in that it applies to arbitrary precision continuou
s valued classifiers.
 Unlike the PAC-Bayes bound, it applies meaningfully to nonstochastic classifier
s.
 Mainstream learning algorithms do not optimize the sample compression metric,
 so the bound application is somewhat rarer.
 Nonetheless, there do exist some reasonably competitive learning algorithms
 for which the sample compression bound produces significant results.
\layout Standard

The section is organized as follows:
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:The-Sample-Compression-Bound}

\end_inset 

 states and proves the sample compression bound.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:SC-tightness}

\end_inset 

 shows that the sample compression bound is nearly as tight as possible
 given the observations.
\layout Enumerate

Subsection 
\begin_inset LatexCommand \ref{sub:SC-application}

\end_inset 

 discusses results from the application of the sample compression bound
 to support vector machines.
\layout Subsection

The Sample Compression Bound
\layout Standard


\begin_inset LatexCommand \label{sub:The-Sample-Compression-Bound}

\end_inset 


\layout Standard

The Sample Compression bound stated here is differs somewhat from other
 results (
\begin_inset LatexCommand \cite{LW}

\end_inset 


\begin_inset LatexCommand \cite{FW}

\end_inset 

 and others) by generalization and simplification but the bound behavior
 is qualitatively identical.
\layout Standard

Suppose we have a learning algorithm, 
\begin_inset Formula $A(S)$
\end_inset 

 whose training is 
\begin_inset Quotes eld
\end_inset 

sparse
\begin_inset Quotes erd
\end_inset 


\begin_inset Foot
collapsed true

\layout Standard

This is satisfied, for example, by the Support Vector Machine algorithm
 which only depends upon the set of support vectors.
\end_inset 

 in the sense that the output classifier is dependent upon only a subset
 of the data, 
\begin_inset Formula $A(S)=A(S')$
\end_inset 

 for 
\begin_inset Formula $S'\subseteq S$
\end_inset 

.
 The sample compression bound is dependent on the error rate, 
\begin_inset Formula $\hat{c}_{S-S'}$
\end_inset 

 on the subset 
\begin_inset Formula $S-S'$
\end_inset 

.
 The motivation here is that the examples which the learning algorithm does
 
\emph on 
not 
\emph default 
depend upon are 
\begin_inset Quotes eld
\end_inset 

almost
\begin_inset Quotes erd
\end_inset 

 independent and so we can 
\begin_inset Quotes eld
\end_inset 

almost
\begin_inset Quotes erd
\end_inset 

 get a test set bound.
\layout Standard


\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename sample_compression.eps
	scale 50

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:SC-POL}

\end_inset 

 The interactive proof of learning for the sample compression bound.
 Note that the learning algorithm is arbitrary here, similar to the test
 set bound.
\end_inset 


\layout Standard

Viewed as an interactive proof of learning (in figure 
\begin_inset LatexCommand \ref{fig:SC-POL}

\end_inset 

), the sample compression bound is unique amongst training set bounds because
 it does not require 
\emph on 
any 
\emph default 
initial commitment to a measure over the classifiers.
 
\layout Theorem

(Sample Compression Bound) For all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

, 
\begin_inset Formula $D$
\end_inset 

, 
\begin_inset Formula $A$
\end_inset 

:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\delta\]

\end_inset 


\layout Proof

Suppose we knew in advance that the learning algorithm will not depend upon
 some subset of the examples.
 Then, the 
\begin_inset Quotes eld
\end_inset 

undependent
\begin_inset Quotes erd
\end_inset 

 subset acts like a test set and gives us a test set bound:
\begin_inset Formula \[
\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\frac{\delta}{m{{m \choose |S-S'|}}}\]

\end_inset 

(Note that, technically, it is possible to refer to 
\begin_inset Formula $S'$
\end_inset 

 unambiguously before randomizing over 
\begin_inset Formula $S$
\end_inset 

 by specifying the indices of 
\begin_inset Formula $S$
\end_inset 

 contained in 
\begin_inset Formula $S'$
\end_inset 

.) Negating this, we get:
\layout Proof


\begin_inset Formula \[
\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\frac{\delta}{m{{m \choose |S-S'|}}}\]

\end_inset 

and using the union bound (
\begin_inset Formula $\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$
\end_inset 

 over each possible subset, 
\begin_inset Formula $S'$
\end_inset 

, we get:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\delta\]

\end_inset 

Negating this again gives us the proof.
\layout Subsection

The Sample Compression Bound is Sometimes Tight
\layout Standard


\begin_inset LatexCommand \label{sub:SC-tightness}

\end_inset 


\layout Standard

We can construct a learning algorithm/learning problem pair such that the
 Sample compression bound is provably near optimal, as a function of it's
 observables.
 
\layout Theorem

(Sample Compression Tightness) For all 
\begin_inset Formula $\delta\in(0,1]$
\end_inset 

, 
\begin_inset Formula $\frac{k}{m}$
\end_inset 

, there exists a distribution 
\begin_inset Formula $D$
\end_inset 

 and learning algorithm 
\begin_inset Formula $A$
\end_inset 

 s.t.
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)>\delta-\delta^{2}\]

\end_inset 

furthermore, if 
\begin_inset Formula $S^{*}$
\end_inset 

 minimizes 
\begin_inset Formula $\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)$
\end_inset 

, then 
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(c^{*}=A(S^{*}):\,\,\, c_{D}^{*}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S^{*}}^{*},\frac{\delta}{m{{m \choose |S-S^{*}|}}}\right)\right)>\delta-\delta^{2}\]

\end_inset 


\layout Proof

The proof is constructive and similar to the Occam's Razor tightness result.
 In particular, we show how to construct a learning algorithm which outputs
 classifiers that err independently depending on the subset 
\begin_inset Formula $S'$
\end_inset 

 used.
 
\layout Proof

Consider an input space 
\begin_inset Formula $X=\{0,1\}^{2^{m}}$
\end_inset 

.
 Each variable in the input space 
\begin_inset Formula $x_{S'}$
\end_inset 

 can be thought of as indexing a unique subset 
\begin_inset Formula $S'\subseteq S$
\end_inset 

 of the examples.
 In the rest of the proof, we index variables by the subset they correspond
 to.
\layout Proof

Draws from the distribution 
\begin_inset Formula $D$
\end_inset 

 can be made by first flipping an unbiased coin to get 
\begin_inset Formula $y=1$
\end_inset 

 with probability 
\begin_inset Formula $0.5$
\end_inset 

 and 
\begin_inset Formula $y=-1$
\end_inset 

 with probability 
\begin_inset Formula $0.5$
\end_inset 

.
 The distribution on 
\begin_inset Formula $X$
\end_inset 

 consists of a set of independent values after conditioning on 
\begin_inset Formula $y$
\end_inset 

.
 Choose 
\begin_inset Formula \[
\Pr(x_{S'}\neq y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\]

\end_inset 

Now, the learning algorithm 
\begin_inset Formula $A(S')$
\end_inset 

 is very simple---it just outputs the classifier 
\begin_inset Formula $c(x)=x_{S'}$
\end_inset 

.
 On the set 
\begin_inset Formula $S-S'$
\end_inset 

, we have:
\begin_inset Formula \[
\forall S'\,\,\,\Pr_{S\sim D^{m}}\left(\hat{c}_{S-S'}\geq\frac{k}{m}\right)=1-\frac{\delta}{m{{m \choose |S-S'|}}}\]

\end_inset 

Using independence, we get:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}\geq\frac{k}{m}\right)=\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\]

\end_inset 

Negating, we get:
\begin_inset Formula \[
\Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}<\frac{k}{m}\right)=1-\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\]

\end_inset 

and doing some algebra, we get the result.
\layout Subsection

Application of the Sample Compression Bound
\layout Standard


\begin_inset LatexCommand \label{sub:SC-application}

\end_inset 


\layout Standard

One obvious application of the Sample Compression bound is to support vector
 machines, since the learned classifier is only dependent on the set of
 support vectors.
 If 
\begin_inset Formula $S'$
\end_inset 

 is the set of support vectors then 
\begin_inset Formula $S-S'$
\end_inset 

 is the set of nonsupport vectors.
 Unfortunately, it turns out that this does not work so well, as observed
 in figure 
\begin_inset LatexCommand \ref{fig:sc-svm}

\end_inset 

.
 
\begin_inset Float figure
wide false
collapsed false

\layout Standard


\begin_inset Graphics
	filename test_and_compress.ps
	display color
	width 60col%
	rotateAngle 270

\end_inset 


\layout Caption


\begin_inset LatexCommand \label{fig:sc-svm}

\end_inset 

The sample compression bound applied to the output of a support vector machine
 with a Gaussian kernel.
 Here we use 
\begin_inset Formula $\delta=0.01$
\end_inset 


\end_inset 


\layout Standard

There are other less common learning algorithms for which the sample compression
 bound works well.
 The Set Covering machine 
\begin_inset LatexCommand \cite{SCM}

\end_inset 

 has an associated bound which is a variant of the Sample Compression Bound.
 
\layout Section

Discussion
\layout Subsection

Learning algorithm design
\layout Standard


\emph on 
Every 
\emph default 
train set bound implies a learning algorithm: choose the classifier which
 minimizes the true error bound.
 This sounds like a rich source of learning algorithms, but there are some
 severe caveats to that statement.
 
\layout Enumerate

It is important to note that the form of a train set bound does 
\emph on 
not 
\emph default 
imply that this minimization is a good idea.
 Choosing between two classifiers based upon their true error bound implies
 a better worst-case bound on the true error.
 It does not imply an improved true error.
 In many situations, there is some other metric of comparison (such as train
 error rate) which in fact creates better behavior.
\layout Enumerate

Another strong caveat is that, historically, train set bounds have simply
 not been tight enough on real datasets for a nonvacuous application.
 This is changing with new results, but more progress is necessary.
\layout Enumerate

Often the optimization problem is simply not very tractable.
 In addition to sample complexity, learning algorithms must be concerned
 with run time and space usage.
 
\layout Subsection

Philosophy
\layout Standard

Train set bounds teach us about ways in which verifiable learning is possible,
 a subject which borders on philosophy.
 The train set bound presented here essentially shows that a reasonable
 person will be convinced of learning success when a short-description classifie
r does well on train set data.
 The results here do 
\emph on 
not 
\emph default 
imply that this is the only way to convincingly learn.
 In fact, the (sometimes large) looseness of the Occam's Razor bound suggests
 that other methods for convincing learning processes exist.
 This observation is partially shown by other train set bound results which
 are presented next.
\layout Subsection

Conclusion
\layout Standard

This introduction to prediction theory covered two styles of bound: the
 test set bound and the train set bound.
 There are two important lessons here:
\layout Enumerate

Test set bounds provide a better way to report error rates and confidence
 intervals on future error rates than some current methods.
 
\layout Enumerate

Train set bounds can provide useful information.
 
\layout Standard

It is important to note that the train set bound and test set bound techniques
 are not mutually exclusive.
 It is possible to use both simultaneously 
\begin_inset LatexCommand \cite{TnT}

\end_inset 

, and doing so is often desirable.
 Test set bounds are improved by the 
\begin_inset Quotes eld
\end_inset 

free
\begin_inset Quotes erd
\end_inset 

 information about the training error and train set bounds become applicable,
 even when not always tight.
 
\layout Acknowledgement

I am indebted to many people for helpful suggestions assembling the content.
 This includes some anonymous reviewers, several coauthors on previous papers,
 Imre Kondor, Arindam Banerjee, and Alina Beygelzimer.
\layout Section*

Appendix
\layout Standard


\begin_inset LatexCommand \label{sec-model}

\end_inset 


\layout Standard

For those interested in comparing models, the uniform convergence model
 
\begin_inset LatexCommand \cite{Vapnik}

\end_inset 

 requires the additional assumption of the axiom of choice (to achieve empirical
 risk minimization) and a bound on the hypothesis space complexity.
 Typical theorems are of the form 
\begin_inset Quotes eld
\end_inset 

after 
\begin_inset Formula $m$
\end_inset 

 examples, all training errors are near to true errors
\begin_inset Quotes erd
\end_inset 

.
\layout Standard

The PAC learning model
\begin_inset LatexCommand \cite{Valiant}

\end_inset 

 requires a polynomial time complexity learning algorithm and the assumption
 that the learning problem comes from some class.
 Theorems are of the form 
\begin_inset Quotes eld
\end_inset 

after 
\begin_inset Formula $m$
\end_inset 

 examples learning will be achieved
\begin_inset Quotes erd
\end_inset 

.
\layout Standard

Both of these models can support stronger statements than the basic prediction
 theory model presented here.
 Results from both of these models can apply here after appropriate massaging
 of results.
 
\layout Standard

The online learning model 
\begin_inset LatexCommand \cite{Online}

\end_inset 

 makes 
\emph on 
no 
\emph default 
assumptions.
 Typical theorems have the form 
\begin_inset Quotes eld
\end_inset 

This learning algorithm's performance will be nearly as good as anyone of
 a set of classifiers.
\begin_inset Quotes erd
\end_inset 

 The online learning model has very general results and no ability to answer
 questions about future performance as we address here.
\layout Standard

The prediction theory model can most simply be understood as a slight refinement
 of the information theory model.
 
\layout Bibliography
\bibitem {BEHW}

A.
 Blumer, A.
 Ehrenfeucht, D.
 Haussler, M.
 Warmuth.
 
\begin_inset Quotes eld
\end_inset 

Occam's Razor.
\begin_inset Quotes erd
\end_inset 

 Information Processing Letters 24: 377-380, 1987.
\layout Bibliography
\bibitem {Progressive}

Avrim Blum, Adam Kalai, and John Langford 1999.
 Beating the Holdout: Bounds for K-Fold and Progressive Cross-Validation.
 COLT99.
 http://www.cs.cmu.edu/~jcl/papers/progressive_validation/coltfinal.ps
\layout Bibliography
\bibitem {chernoff}

H.
 Chernoff, 
\begin_inset Quotes eld
\end_inset 

A Measure of the Asymptotic Efficiency of Tests of a Hypothesis Based Upon
 the Sum of the Observations
\begin_inset Quotes erd
\end_inset 

, Annals of Mathematical Statistics 23:493-507, 1952.
\layout Bibliography
\bibitem {Devroye}

Luc Devroye, Laszlo Gyorfi, Gabor Lugosi, 
\begin_inset Quotes eld
\end_inset 

A Probabilistic Theory of Pattern Recognition
\begin_inset Quotes erd
\end_inset 

 Springer-Verlag New York, 1996.
\layout Bibliography
\bibitem {FW}

Sally Floyd and Manfred Warmuth, 
\begin_inset Quotes eld
\end_inset 

Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension
\begin_inset Quotes erd
\end_inset 

, Machine Learning, Vol.21 (3), pp.
 269--304, December 1995 http://www.cse.ucsc.edu/~manfred/pubs/sallycompr.forgalleys.
ps
\layout Bibliography
\bibitem {kernelize}

R.
 Herbrich and T.
 Graepel.
 
\begin_inset Quotes eld
\end_inset 

Large scale Bayes point machines
\begin_inset Quotes erd
\end_inset 

 In T.
 K.
 Leen, T.
 G.
 Dietterich, and V.
 Tresp, editors, Advances in Neural Information System Processing 13, pages
 528-534, Cambridge, MA, 2001.
\layout Bibliography
\bibitem {Sanity}

Michael Kearns and Dana Ron, 
\begin_inset Quotes eld
\end_inset 

Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.
\begin_inset Quotes erd
\end_inset 

 Neural Computation 11(6), pages 1427-1453, 1999.
 Also in Proceedings of the Tenth Annual Conference on Computational Learning
 Theory, ACM Press, 1997, pages 152--162.
 http://www.cis.upenn.edu/~mkearns/papers/multi.ps
\layout Bibliography
\bibitem {Online}

J.
 Kivinen and M.
 Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction,
" in Journal of Information and Computation, vol.
 132, no.
 1, pp.
 1-64, January 1997.
 http://www.cse.ucsc.edu/~manfred/pubs/lin.ps
\layout Bibliography
\bibitem {bound}

John Langford, Program 
\begin_inset Quotes eld
\end_inset 

bound
\begin_inset Quotes erd
\end_inset 

, http://www-2.cs.cmu.edu/~jcl/programs/bound/bound.html
\layout Bibliography
\bibitem {MC_journal}

John Langford and Avrim Blum 1999.
 Microchoice Bounds and Self Bounding learning algorithms.
 Machine Learning Journal.
 http://www.cs.cmu.edu/~jcl/papers/microchoice/journal/journal_final.ps
\layout Bibliography
\bibitem {Shell}

John Langford and David McAllester, 
\begin_inset Quotes eld
\end_inset 

Computable Shell Decomposition Bounds
\begin_inset Quotes erd
\end_inset 

 COLT 2000.
 http://www-2.cs.cmu.edu/~jcl/papers/computable_shell/colt_final.ps
\layout Bibliography
\bibitem {averaging}

John Langford and Matthias Seeger, 
\begin_inset Quotes eld
\end_inset 

Bounds for Averaging Classifiers
\begin_inset Quotes erd
\end_inset 

 Technical report, Carnegie Mellon 2001 http://www-2.cs.cmu.edu/~jcl/papers/averagi
ng/averaging_tech.ps
\layout Bibliography
\bibitem {PB-margin}

John Langford and John Shawe-Taylor, 
\begin_inset Quotes eld
\end_inset 

PAC-Bayes & Margins
\begin_inset Quotes erd
\end_inset 

, NIPS 2002.
 http://www-2.cs.cmu.edu/~jcl/papers/PB-margin/PB-margin.ps
\layout Bibliography
\bibitem {TnT}

John Langford, 
\begin_inset Quotes eld
\end_inset 

Combining Train Set and Test Set Bounds
\begin_inset Quotes erd
\end_inset 

, ICML2002.
 http://www-2.cs.cmu.edu/~jcl/papers/tnt_icml/train_n_test_final.ps
\layout Bibliography
\bibitem {LW}

Nick Littlestone and Manfred Warmuth, 
\begin_inset Quotes eld
\end_inset 

Relating Data Compression and Learnability
\begin_inset Quotes erd
\end_inset 

 Unpublished Manuscript http://www.cse.ucsc.edu/~manfred/pubs/lrnk-olivier.ps
\layout Bibliography
\bibitem {PB}

David McAllester, ``PAC-Bayesian Model Averaging'' COLT 1999.
 http://www.autoreason.com/posterior01.ps
\layout Bibliography
\bibitem {SCM}

Mario Marchand and John Shawe-Taylor, 
\begin_inset Quotes eld
\end_inset 

The Set Covering Machine
\begin_inset Quotes erd
\end_inset 

, ICML 2001.
 http://www.ai.mit.edu/projects/jmlr/papers/volume3/marchand02a/marchand02a.pdf
\layout Bibliography
\bibitem {Matthias}

Matthias Seeger, 
\begin_inset Quotes eld
\end_inset 

PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification
\begin_inset Quotes erd
\end_inset 

, Journal of Machine Learning Research 3 (2002), 233--269.
 http://www.dai.ed.ac.uk/homes/seeger/papers/seeger02a.ps.gz
\layout Bibliography
\bibitem {Chernoff}

Sebastian Seung, Unpublished notes
\layout Bibliography
\bibitem {Valiant}

L.G.
 Valiant.
 
\begin_inset Quotes eld
\end_inset 

A Theory of the Leuarnable.
\begin_inset Quotes erd
\end_inset 

 Communications of the ACM 27(11):1134-1142, November 1984
\layout Bibliography
\bibitem {Vapnik}

V.
 N.
 Vapnik and A.
 Y.
 Chervonenkis.
 
\begin_inset Quotes eld
\end_inset 

On the uniform convergence of relative frequencies of events to their probabilit
ies.
\begin_inset Quotes erd
\end_inset 

 Theory of Probab.
 and its Applications, 16(2):264-280, 1971.
 
\the_end