0 \)
\end_inset
,
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, e(h)\geq \hat{e}(h)+\sqrt{\frac{\ln |H|+\ln \frac{1}{\delta }}{2m}}\right) \leq \delta \]
\end_inset
\layout Proof
Loosen corollary (
\begin_inset LatexCommand \ref{th-REDHSCP}
\end_inset
) with the bound
\begin_inset Formula \( \textrm{KL}(q||p)\geq 2(q-p)^{2} \)
\end_inset
.
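\layout Standard
As a sanity check, the loosening inequality (a Bernoulli form of Pinsker's inequality) can be verified numerically. The sketch below is illustrative only and not part of the proof; the function name is ours.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence KL(q||p) between Bernoulli(q) and Bernoulli(p), in nats."""
    kl = 0.0
    if q > 0:
        kl += q * math.log(q / p)
    if q < 1:
        kl += (1 - q) * math.log((1 - q) / (1 - p))
    return kl

# Check KL(q||p) >= 2(q-p)^2 over a grid of Bernoulli parameters.
for i in range(1, 100):
    for j in range(1, 100):
        q, p = i / 100, j / 100
        assert kl_bernoulli(q, p) >= 2 * (q - p) ** 2 - 1e-12
```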
\layout Standard
The agnostic form of this bound has the advantage that it explicitly shows
that we are forcing the convergence (in high probability) of the empirical
error rate to the true error rate.
Graphically, we are forcing every empirical error to be near its true error.
The following picture represents this connection:
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 89 12
file convergence.eps
width 4 30
flags 9
\end_inset
\layout Standard
Later bounds will have more complicated convergence conditions with more
complicated graphs.
These more complicated bounds are necessary in order to avoid the limitations
of the holdout bound.
See figure
\begin_inset LatexCommand \ref{fig-simple-holdout}
\end_inset
for a comparison with the holdout set.
\layout Section
Lower Bounds
\layout Standard
It is important to study lower bounds in order to discover how much room
for improvement exists in our upper bounds.
\layout Standard
There are two forms of lower bounds: lower bounds on the true error rate
and lower
\emph on
upper
\emph default
bounds, which lower bound the value our upper bound could return (given
the available information).
For lower bounds on the true error rate, essentially the upper bound theorem
applies with the bound reversed.
This form of lower bound allows us to construct a two-sided confidence interval
on the location of the true error rate.
Typically, such lower bounds can be formed by simply changing the sign
in upper bound arguments.
\layout Theorem
\begin_inset LatexCommand \label{th-dhlb}
\end_inset
(Discrete Hypothesis Lower Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
\layout Proof
For every individual hypothesis, we know that:
\begin_inset Formula
\[
\Pr _{D^{m}}(e(h)\leq \bar{e}(m,\hat{e}(h),\delta ))\leq \delta \]
\end_inset
Applying the union bound for every hypothesis gives us the result.
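\layout Standard
The inverted binomial tail has no closed form, but it is easy to compute numerically. The sketch below assumes the standard tail conventions; whether Bin denotes the lower or upper tail determines which inversion applies, so both are shown. The function names are ours, not the text's.

```python
import math

def lower_tail(m, k, p):
    """Pr(at most k errors among m examples when the true error rate is p)."""
    return sum(math.comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k + 1))

def _root(f, lo, hi, tol=1e-12):
    """Bisection for a function f decreasing from f(lo) > 0 to f(hi) < 0."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def upper_conf(m, k, delta):
    """Largest p with Pr(X <= k | p) = delta: the max_p inversion."""
    return _root(lambda p: lower_tail(m, k, p) - delta, k / m, 1.0)

def lower_conf(m, k, delta):
    """Smallest p with Pr(X >= k | p) = delta: the min_p inversion."""
    if k == 0:
        return 0.0
    return _root(lambda p: delta - (1 - lower_tail(m, k - 1, p)), 0.0, k / m)
```

Together the two inversions give the two-sided confidence interval mentioned above.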
\layout Section
Lower Upper Bounds
\layout Standard
\begin_inset LatexCommand \label{sec-lower_upper}
\end_inset
\layout Standard
The second form of lower bound is a lower bound on the upper bound (similar
to the results of
\begin_inset LatexCommand \cite{AHKV}
\end_inset
).
If we can lower bound the upper bound (as a function of its observables),
then we can be confident that the upper bound is no looser than the gap
between the lower upper bound and the upper bound.
\layout Standard
How tight is the discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
)? The answer is
\emph on
sometimes
\emph default
tight.
In particular, we can exhibit a set of learning problems where the discrete
hypothesis bound can not be made significantly lower as a function of the
observables,
\begin_inset Formula \( m \)
\end_inset
,
\begin_inset Formula \( \delta \)
\end_inset
,
\begin_inset Formula \( |H| \)
\end_inset
, and
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
.
Fix the values of these quantities; we will then construct a learning
problem for which no significantly lower upper bound can be stated.
\layout Standard
Our learning problem will be defined over the input space
\begin_inset Formula \( X=\{0,1\}^{|H|} \)
\end_inset
.
The hypotheses will be
\begin_inset Formula \( h_{i}(x)=x_{i} \)
\end_inset
where
\begin_inset Formula \( x_{i} \)
\end_inset
is the
\begin_inset Formula \( i \)
\end_inset
th component of the vector
\begin_inset Formula \( x \)
\end_inset
.
This construction allows us to vary the true error rate of each hypothesis
independently of the other hypotheses.
In fact, we can pick any true error rate for any hypothesis by simply adjusting
the probability that
\begin_inset Formula \( x_{i}=y \)
\end_inset
.
Our family of learning problems therefore generates examples according
to the following algorithm:
\layout Algorithm
\begin_inset LatexCommand \label{alg-lp}
\end_inset
Draw_Sample(float
\begin_inset Formula \( e \)
\end_inset
)
\layout Enumerate
Pick
\begin_inset Formula \( y\in \{0,1\} \)
\end_inset
uniformly
\layout Enumerate
For each
\begin_inset Formula \( i \)
\end_inset
, pick
\begin_inset Formula \( x_{i}\neq y \)
\end_inset
with probability
\begin_inset Formula \( e \)
\end_inset
.
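\layout Standard
The sampling procedure above can be sketched directly in code. The names draw_sample and num_hypotheses are our own stand-ins for the quantities in the algorithm.

```python
import random

def draw_sample(e, num_hypotheses):
    """Draw one labeled example from the learning problem of the algorithm above.

    Each component x_i disagrees with the label y independently with
    probability e, so every hypothesis h_i(x) = x_i has true error e.
    """
    y = random.randint(0, 1)  # pick the label uniformly
    x = [1 - y if random.random() < e else y for _ in range(num_hypotheses)]
    return x, y
```

Averaging the disagreement of any fixed hypothesis over many draws recovers the error rate used in the construction.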
\layout Standard
By construction, the true error rate of each hypothesis
\begin_inset Formula \( h_{i} \)
\end_inset
 will be
\begin_inset Formula \( e \)
\end_inset
.
Now, we can prove the following theorem:
\layout Theorem
\begin_inset LatexCommand \label{th-dhlub}
\end_inset
(Discrete Hypothesis Lower Upper Bound) For all true error rates
\begin_inset Formula \( e(h)>\frac{\ln |H|+\ln \frac{1}{\delta }}{m} \)
\end_inset
there exists a learning problem and algorithm such that:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
\layout Standard
Intuitively, this theorem implies that we can not improve significantly
on the results of theorem
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
without using extra information about our learning problem.
Some of our later results do exactly this - they use extra information.
\layout Proof
Using the family of learning problems implicitly defined by algorithm
\begin_inset LatexCommand \ref{alg-lp}
\end_inset
, we know that
\begin_inset Formula
\[
\forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq \frac{\delta }{|H|}\]
\end_inset
(negation)
\begin_inset Formula
\[
\Rightarrow \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq 1-\frac{\delta }{|H|}\]
\end_inset
(independence)
\begin_inset Formula
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \forall h\, \, e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq \left( 1-\frac{\delta }{|H|}\right) ^{|H|}\]
\end_inset
(negation)
\begin_inset Formula
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\]
\end_inset
(approximation)
\begin_inset Formula
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\]
\end_inset
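\layout Standard
The final (approximation) step uses the elementary inequality 1 - x <= exp(-x), which gives the stated comparison. A quick numerical sketch:

```python
import math

# The (approximation) step: (1 - d/H)^H <= exp(-d) since 1 - x <= exp(-x),
# hence 1 - (1 - d/H)^H >= 1 - exp(-d), for confidence d and |H| = H.
for H in (1, 2, 10, 1000):
    for d in (0.01, 0.05, 0.5, 1.0):
        assert (1 - d / H) ** H <= math.exp(-d) + 1e-15
        assert 1 - (1 - d / H) ** H >= 1 - math.exp(-d) - 1e-15
```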
\layout Standard
For small
\begin_inset Formula \( \delta \)
\end_inset
, we get that
\begin_inset Formula \( 1-e^{-\delta }\simeq \delta \)
\end_inset
which implies that no significantly better true error bound can be stated
for all learning algorithms.
In particular, the empirical risk minimization algorithm (which chooses
the hypothesis with minimal empirical error) will have a significant probability
of a large deviation between the empirical and true error.
\layout Section
Structural Risk Minimization
\layout Standard
Structural Risk Minimization
\begin_inset LatexCommand \cite{Vapnik}
\end_inset
(SRM) is a technique used in the learning theory community to avoid the
difficulties associated with convergence on hypothesis sets that are too
\begin_inset Quotes eld
\end_inset
large
\begin_inset Quotes erd
\end_inset
.
SRM works with a sequence of nested hypothesis sets,
\begin_inset Formula \( H_{1}\subset H_{2}\subset \cdots \subset H_{l} \)
\end_inset
.
For each hypothesis set, a discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
) on the difference between empirical and true error exists.
For
\begin_inset Quotes eld
\end_inset
small
\begin_inset Quotes erd
\end_inset
hypothesis sets, this bound may be tight while for large hypothesis sets
it may be inherently loose.
However, we also expect that the best hypothesis in the hypothesis set
improves as the hypothesis set becomes larger.
This naturally induces a trade-off: there will be some hypothesis set
\begin_inset Formula \( H_{i} \)
\end_inset
for which the true error bound is minimized.
\layout Standard
We can't simply apply the discrete hypothesis bound to the meta-algorithm
which picks the algorithm (and associated hypothesis space) with the smallest
true error bound since this meta-algorithm could, potentially, output any
hypothesis in
\begin_inset Formula \( H_{l} \)
\end_inset
.
The simplest way to retrofit the bound to include all hypothesis sets is
with a simple theorem which essentially states that we can guarantee
\emph on
nearly
\emph default
the same bound as would apply on the smallest hypothesis space
\begin_inset Formula \( H_{i} \)
\end_inset
containing the output hypothesis,
\begin_inset Formula \( h \)
\end_inset
.
\layout Theorem
\begin_inset LatexCommand \label{th-SRM}
\end_inset
(Structural Risk Minimization) Let
\begin_inset Formula \( p(i) \)
\end_inset
be some measure across the
\begin_inset Formula \( l \)
\end_inset
hypothesis sets with
\begin_inset Formula \( \sum _{i=1}^{l}p(i)=1 \)
\end_inset
.
Then:
\begin_inset Formula
\[
\forall p(i):\, \, \, \Pr _{D^{m}}\left( \exists h\in H_{i}\in \{H_{1},...,H_{l}\}:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{p(i)\delta }{|H_{i}|}\right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
.
\layout Proof
Apply the union bound to the discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
).
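\layout Standard
The trade-off behind SRM can be illustrated with a Hoeffding-style relaxation of the bound, as in the agnostic form above. The hypothesis set sizes, best empirical errors, and the uniform measure p(i) = 1/l below are illustrative assumptions, not values from the text.

```python
import math

def srm_bound(emp_err, H_size, p_i, m, delta):
    """Hoeffding-style relaxation of the SRM true error bound for a
    hypothesis in H_i, where p_i is the measure placed on the i-th set."""
    return emp_err + math.sqrt(
        (math.log(H_size) + math.log(1 / (p_i * delta))) / (2 * m)
    )

# Hypothetical nested sets: |H_i| grows exponentially while the best
# empirical error in H_i shrinks; the bound is minimized in between.
m, delta, l = 1000, 0.05, 8
sizes = [2 ** (10 * (i + 1)) for i in range(l)]   # |H_1|, ..., |H_l|
best_emp = [0.5 / (i + 1) for i in range(l)]      # best empirical error in H_i
bounds = [srm_bound(best_emp[i], sizes[i], 1 / l, m, delta) for i in range(l)]
best_i = min(range(l), key=lambda i: bounds[i])
```

Neither the smallest nor the largest set minimizes the bound; some intermediate set wins, which is exactly the trade-off SRM exploits.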
\layout Standard
The SRM bound is slightly inefficient in the sense that the bound for all
hypotheses in
\begin_inset Formula \( H_{2} \)
\end_inset
includes a bound for every hypothesis in
\begin_inset Formula \( H_{1} \)
\end_inset
.
This effect is typically small because the size of the hypothesis sets
usually grows exponentially, implying that the extra confidence given to
a hypothesis
\begin_inset Formula \( h \)
\end_inset
in
\begin_inset Formula \( H_{1} \)
\end_inset
by the bounds used on hypothesis set
\begin_inset Formula \( H_{2},H_{3},... \)
\end_inset
is small relative to the confidence given by the bound for
\begin_inset Formula \( H_{1} \)
\end_inset
.
One can remove this slack in the Structural Risk Minimization bound by
\begin_inset Quotes eld
\end_inset
cutting out
\begin_inset Quotes erd
\end_inset
the nested portion of each hypothesis set in the formulation of
\begin_inset Formula \( H_{1},...,H_{l} \)
\end_inset
.
We will call this Disjoint Structural Risk Minimization (also mentioned
in
\begin_inset LatexCommand \cite{MC_journal}
\end_inset
).
\layout Section
Incorporating a
\begin_inset Quotes eld
\end_inset
Prior
\begin_inset Quotes erd
\end_inset
\layout Standard
In constructing the discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
), it is notable that an arbitrary choice was made.
We decided to give the same error allowance to every hypothesis.
This is an arbitrary choice which, in practice, we will often wish to make
differently.
The next theorem is essentially a restatement of the discrete hypothesis
bound with this arbitrary choice made explicit.
The same bound using the Hoeffding approximation has appeared elsewhere
\begin_inset LatexCommand \cite{BEHW}
\end_inset
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Theorem
\begin_inset LatexCommand \label{th-ORB}
\end_inset
(Occam's Razor Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Quotes eld
\end_inset
priors
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
over the hypothesis space,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\forall p(h)\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\delta p(h)\right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
\layout Standard
It is very important to notice that the
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
must be selected
\emph on
before
\emph default
seeing the examples.
\layout Proof
The proof again starts with the basic observation that:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( e(h)\geq \bar{e}(m,\hat{e}(h),\delta )\right) \leq \delta \]
\end_inset
then, we apply the union bound in a
\emph on
nonuniform
\emph default
manner.
In particular, we allocate confidence
\begin_inset Formula \( \delta p(h) \)
\end_inset
to hypothesis
\begin_inset Formula \( h \)
\end_inset
.
Since
\begin_inset Formula \( p \)
\end_inset
is normalized, we know that
\begin_inset Formula
\[
\sum _{h}\delta p(h)=\delta \]
\end_inset
which implies that the union bound completes the proof.
\layout Standard
Once again, we can relax the Occam's Razor bound (theorem
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
) with the relative entropy Chernoff bound (
\begin_inset LatexCommand \ref{eq-recb}
\end_inset
) to get a somewhat more tractable expression.
\layout Corollary
\begin_inset LatexCommand \label{th-reorb}
\end_inset
(Relative Entropy Occam's Razor Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Quotes eld
\end_inset
priors
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
over the hypothesis space,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))\geq \frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end_inset
\layout Proof
Approximate the Binomial tail with (
\begin_inset LatexCommand \ref{eq-recb}
\end_inset
) and solve for the minimum.
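\layout Standard
To turn the corollary into an explicit upper bound on the true error, one inverts the Bernoulli KL divergence numerically. A sketch by bisection follows; the function names and the illustrative values of m, delta, and p(h) are our own assumptions.

```python
import math

def kl_bernoulli(q, p):
    """KL(q||p) between Bernoulli distributions, in nats."""
    kl = 0.0
    if q > 0:
        kl += q * math.log(q / p)
    if q < 1:
        kl += (1 - q) * math.log((1 - q) / (1 - p))
    return kl

def kl_inverse(q, bound, tol=1e-12):
    """Largest p >= q with KL(q||p) <= bound; KL(q||p) increases in p for p >= q."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers (not from the text): a hypothesis with "prior"
# p(h) = 2^-20 and empirical error 0.1 on m = 1000 examples.
m, delta, p_h, emp_err = 1000, 0.05, 2 ** -20, 0.1
e_upper = kl_inverse(emp_err, (math.log(1 / p_h) + math.log(1 / delta)) / m)
```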
\layout Standard
The Occam's razor bound is often nonvacuous for discrete learning algorithms
such as decision lists and decision trees.
The next chapter will discuss a particular well-motivated choice of the measure
\begin_inset Formula \( p(h) \)
\end_inset
which can lead to much tighter bounds in practice.
\layout Standard
The application of the Occam's Razor bound is somewhat more complicated
than the application of the test set bound.
Pictorially, the protocol for bound application is given in figure
\begin_inset LatexCommand \ref{fig-training-set}
\end_inset
.
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 159
file thesis-presentation/training_set.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-training-set}
\end_inset
In order to apply this bound it is necessary that the choice of
\begin_inset Quotes eld
\end_inset
Prior
\begin_inset Quotes erd
\end_inset
be made before seeing any training examples.
Then, the bound is calculated based upon the chosen hypothesis.
Note that it
\emph on
is
\emph default
\begin_inset Quotes eld
\end_inset
legal
\begin_inset Quotes erd
\end_inset
to choose the hypothesis based upon the prior
\begin_inset Formula \( p(h) \)
\end_inset
as well as the empirical error
\begin_inset Formula \( \hat{e}_{S}(h) \)
\end_inset
.
\end_float
\layout Standard
Examples of the calculation of these bounds are detailed in appendix section
\begin_inset LatexCommand \ref{sec-train-bound-calc}
\end_inset
.
\layout Standard
The
\begin_inset Quotes eld
\end_inset
Occam's Razor bound
\begin_inset Quotes erd
\end_inset
is strongly related to compression.
In particular, for any self-terminating description language,
\begin_inset Formula \( d(h) \)
\end_inset
, we can associate a
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h)=2^{-d(h)} \)
\end_inset
with the property that
\begin_inset Formula \( \sum _{h}p(h)\leq 1 \)
\end_inset
.
Consequently, short description length hypotheses will tend to have a tighter
convergence and the penalty term,
\begin_inset Formula \( \ln \frac{1}{p(h)} \)
\end_inset
, is the number of
\begin_inset Quotes eld
\end_inset
nats
\begin_inset Quotes erd
\end_inset
(bits base e).
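\layout Standard
The property that the description-length measure sums to at most one is the Kraft inequality for prefix-free (self-terminating) codes. A small sketch with a hypothetical prefix-free description language:

```python
import math

# Hypothetical prefix-free code: codewords 0, 10, 110, 111
# with description lengths 1, 2, 3, 3 bits.
lengths = [1, 2, 3, 3]
prior = [2 ** -d for d in lengths]
assert sum(prior) <= 1  # Kraft inequality: sum_h 2^-d(h) <= 1

# The penalty term ln(1/p(h)) in nats is the description length
# in bits times ln 2.
for d, p in zip(lengths, prior):
    assert math.isclose(math.log(1 / p), d * math.log(2))
```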
\layout Part
New Techniques
\layout Chapter
Microchoice Bounds (the algebra of choices)
\layout Standard
\begin_inset LatexCommand \label{sec-MC}
\end_inset
\layout Standard
The work in this chapter is joint with Avrim Blum and was first presented
at ICML
\begin_inset LatexCommand \cite{MC}
\end_inset
and then in the Machine Learning Journal
\begin_inset LatexCommand \cite{MC_journal}
\end_inset
.
The presentation here generalizes, unifies, and improves the earlier work.
Microchoice bounds can be thought of as unifying
\begin_inset Quotes eld
\end_inset
Self-bounding Learning Algorithms
\begin_inset Quotes erd
\end_inset
\begin_inset LatexCommand \cite{SB}
\end_inset
and the Occam's razor bound,
\begin_inset LatexCommand \cite{BEHW}
\end_inset
.
\layout Standard
The bounds of the previous chapter do not use much structure on the hypothesis
space.
Yet learning algorithms induce a natural structure.
In particular, many learning algorithms work by an iterative process in
which they take a sequence of steps each from a small set of choices (small
in comparison to the overall hypothesis set size).
Local optimization algorithms such as hill-climbing or simulated annealing,
for example, work in this manner.
Each step in a local optimization algorithm can be viewed as making a choice
from a small set of possible steps to take.
If we take into account the number of choices made and the size (and other
properties) of each choice set, can we produce a tighter bound? This chapter
introduces microchoice bounds, which use this type of information to construct
high confidence bounds on the future error rate.
\layout Standard
The microchoice bounds can be thought of in several ways.
The simple microchoice bound (given in section
\begin_inset LatexCommand \ref{sec:mc}
\end_inset
) can be thought of as a well motivated application of the Occam's Razor
bound (
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
).
The idea is to use the learning algorithm itself to define a description
language for hypotheses, so that the description length of the hypothesis
actually produced gives a bound on the estimation error.
The adaptive microchoice theorem (in section
\begin_inset LatexCommand \ref{sec:query}
\end_inset
) can be thought of as a computationally tractable adaptation of Self-bounding
Learning algorithms
\begin_inset LatexCommand \cite{SB}
\end_inset
.
Microchoice bounds tie together and show the relationship between these
different sample complexity bounds.
\layout Standard
Microchoice bounds also provide insight into the nature of choices.
In general, we know that choice is
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
for the purposes of creating a uniform bound on the true error rate.
The microchoice bounds give a quantitative understanding of how much choice
is
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
.
In particular, the log of the choice space size is the natural measure
of
\begin_inset Quotes eld
\end_inset
badness
\begin_inset Quotes erd
\end_inset
.
This is directly related to the log of the hypothesis space size in the
discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
).
There is also an indirect relationship with information theory where the
log of the alphabet size is an important parameter for specifying the number
of bits required to send a message.
\layout Standard
Viewed as an interactive proof of learning, microchoice bounds can be described
pictorially as in figure
\begin_inset LatexCommand \ref{fig-microchoice-protocol}
\end_inset
.
\layout Standard
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 176
file thesis-presentation/microchoice.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-microchoice-protocol}
\end_inset
Microchoice bounds collapse the two-round Occam's Razor style protocol
into a one round protocol.
This is (essentially) done by providing a compiler for the verifier which
takes a learning algorithm as input, and extracts a choice of
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
.
\end_float
\layout Standard
Important early work developing approximately self-bounding learning algorithms
was also done by Domingos
\begin_inset LatexCommand \cite{Domingos}
\end_inset
.
\layout Section
A Motivating Observation
\layout Standard
\begin_inset LatexCommand \label{sec:motivating}
\end_inset
\layout Standard
Imagine, for the moment, that we know the (unknown) problem distribution,
\begin_inset Formula \( D \)
\end_inset
.
For a given learning algorithm
\begin_inset Formula \( A \)
\end_inset
, a distribution
\begin_inset Formula \( D \)
\end_inset
on labeled examples induces a distribution
\begin_inset Formula \( q(h) \)
\end_inset
over the possible hypotheses
\begin_inset Formula \( h\in H \)
\end_inset
produced by algorithm
\begin_inset Formula \( A \)
\end_inset
after
\begin_inset Formula \( m \)
\end_inset
examples
\begin_float footnote
\layout Standard
\begin_inset Formula \( q(h) \)
\end_inset
is the probability over draws of examples and any internal randomization
of the algorithm.
\end_float
.
A natural choice for the Occam's Razor bound (
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
) is the measure
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
.
Is this choice optimal? The answer is
\begin_inset Quotes eld
\end_inset
yes
\begin_inset Quotes erd
\end_inset
, given the right notion of optimal.
In particular, if we start with the relative entropy Occam's razor bound
(
\begin_inset LatexCommand \ref{th-reorb}
\end_inset
), we can show that
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
minimizes the expected value of the bound on the Kullback-Leibler divergence
between the empirical error and true error.
\layout Theorem
(KL divergence minimization)
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
minimizes the expected value of the KL divergence in the relative entropy
Occam's Razor bound (
\begin_inset LatexCommand \ref{th-reorb}
\end_inset
).
\layout Proof
We need to show that
\begin_inset Formula
\[
q(h)=\textrm{argmin}_{p(h)}\sum _{h}q(h)\frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\]
\end_inset
Removing the terms on which the minimum does not depend, we get:
\begin_inset Formula
\[
\textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{1}{p(h)}\]
\end_inset
Adding a constant which does not depend on p(h), we get:
\begin_inset Formula
\[
\textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{q(h)}{p(h)}\]
\end_inset
This is the Kullback-Leibler divergence between the distributions
\begin_inset Formula \( q(h) \)
\end_inset
 and
\begin_inset Formula \( p(h) \)
\end_inset
, which is minimized when
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
.
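\layout Standard
The last step is Gibbs' inequality: the cross-entropy sum is minimized when the two distributions coincide. A numeric sketch over a small hypothetical distribution:

```python
import math

def cross_entropy(q, p):
    """Sum_h q(h) * ln(1/p(h)), the quantity minimized in the proof."""
    return sum(qh * math.log(1 / ph) for qh, ph in zip(q, p) if qh > 0)

q = [0.5, 0.25, 0.25]
# Any other distribution p gives a larger cross-entropy against q.
for p in ([0.4, 0.3, 0.3], [0.7, 0.2, 0.1], [1 / 3] * 3):
    assert cross_entropy(q, p) >= cross_entropy(q, q)
```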
\layout Standard
Using the KL divergence as our notion of loss is somewhat non-intuitive.
However, it is mathematically simple and not unreasonable.
For all true errors, the KL divergence is bounded above and below by quantities
comparable to an
\begin_inset Formula \( l_{1} \)
\end_inset
 and an
\begin_inset Formula \( l_{2} \)
\end_inset
 metric.
Since these are two of the most common metrics, the KL divergence based
metric should behave similarly well.
\layout Standard
The point of these observations is to notice that if the structure of the
learning algorithm produces a choice
\begin_inset Formula \( p(h) \)
\end_inset
that approximates
\begin_inset Formula \( q(h) \)
\end_inset
, the result should be better estimation bounds.
\layout Section
The Simple Microchoice Bound
\layout Standard
\begin_inset LatexCommand \label{sec:mc}
\end_inset
\layout Standard
The simple microchoice bound is essentially a compelling and easy way to
select a measure
\begin_inset Formula \( p(h) \)
\end_inset
for learning algorithms that operate by making a series of small choices.
In particular, consider a learning algorithm that works by making a sequence
of choices,
\begin_inset Formula \( c_{1},...,c_{d} \)
\end_inset
, from a sequence of choice sets,
\begin_inset Formula \( C_{1},...,C_{d} \)
\end_inset
, finally producing a hypothesis,
\begin_inset Formula \( h\in H \)
\end_inset
.
Specifically, the algorithm first looks at the choice set
\begin_inset Formula \( C_{1} \)
\end_inset
and the data
\begin_inset Formula \( z^{N} \)
\end_inset
to produce choice
\begin_inset Formula \( c_{1}\in C_{1} \)
\end_inset
.
The choice
\begin_inset Formula \( c_{1} \)
\end_inset
then determines the next choice set
\begin_inset Formula \( C_{2} \)
\end_inset
(different initial choices produce different choice sets for the second
level).
The algorithm again looks at the data to make some choice
\begin_inset Formula \( c_{2}\in C_{2} \)
\end_inset
.
This choice then determines the next choice set
\begin_inset Formula \( C_{3} \)
\end_inset
, and so on.
These choice sets can be thought of as nodes in a
\emph on
choice tree
\emph default
, where each node in the tree corresponds to some internal state of the
learning algorithm, and a node containing some choice set
\begin_inset Formula \( C \)
\end_inset
has branching factor
\begin_inset Formula \( |C| \)
\end_inset
.
Pictorially, we can draw the tree as follows:
\layout Standard
\begin_inset Figure size 270 112
file thesis-presentation/microchoice_tree.eps
width 4 91
flags 9
\end_inset
\layout Standard
Depending on the learning algorithm, sub-trees of the overall tree may be
identical.
We address optimization of the bound for this case later.
Eventually there is a final choice leading to a leaf, and a single hypothesis
is output.
\layout Standard
For example, the decision list algorithm of Rivest
\begin_inset LatexCommand \cite{Rivest}
\end_inset
, applied to a set of
\begin_inset Formula \( n \)
\end_inset
features, uses the data to choose one of
\begin_inset Formula \( 4n+2 \)
\end_inset
rules (e.g.,
\begin_inset Quotes eld
\end_inset
if
\begin_inset Formula \( \bar{x}_{3} \)
\end_inset
then
\begin_inset Formula \( - \)
\end_inset
\begin_inset Quotes erd
\end_inset
) to put at the top.
Based on the choice made, it moves to a choice set of
\begin_inset Formula \( 4n-2 \)
\end_inset
possible rules to put at the next level, then a choice set of size
\begin_inset Formula \( 4n-6 \)
\end_inset
, and so on, until eventually it chooses a rule such as
\begin_inset Quotes eld
\end_inset
else
\begin_inset Formula \( + \)
\end_inset
\begin_inset Quotes erd
\end_inset
leading to a leaf.
\layout Standard
The microchoice bound calculation program is as follows:
\layout Algorithm
Calculate_Microchoice
\layout Enumerate
set
\begin_inset Formula \( p\leftarrow 1 \)
\end_inset
\layout Enumerate
while the learning algorithm has not halted:
\begin_deeper
\layout Enumerate
\begin_inset Formula \( |C|\leftarrow \)
\end_inset
number of possible data-dependent choices
\layout Enumerate
\begin_inset Formula \( p\leftarrow \frac{p}{|C|} \)
\end_inset
\end_deeper
\layout Enumerate
return
\begin_inset Formula \( p \)
\end_inset
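\layout Standard
The Calculate_Microchoice algorithm above can be sketched directly. The learning loop is stood in for by a given list of choice set sizes, since tracking those sizes is all the bound calculation requires.

```python
def calculate_microchoice(choice_set_sizes):
    """Measure p(h) = prod_i 1/|C_i| accumulated over the choices made.

    choice_set_sizes stands in for the sizes |C| observed in each pass
    through the loop of Calculate_Microchoice.
    """
    p = 1.0
    for size in choice_set_sizes:
        p /= size  # split the remaining supply evenly among the choices
    return p

# Example: a decision list over n = 10 features that makes d = 3 choices
# from choice sets of sizes 4n+2, 4n-2, and 4n-6.
p_h = calculate_microchoice([42, 38, 34])
```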
\layout Standard
Pictorially, this algorithm can be thought of as taking a
\begin_inset Quotes eld
\end_inset
supply
\begin_inset Quotes erd
\end_inset
of probability at the root of the choice tree.
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 267 139
file thesis-presentation/microchoice_alg.eps
width 4 90
flags 9
\end_inset
\layout Standard
The root takes its supply and splits it equally among all its children.
Recursively, each child then does the same: it takes the supply it is given
and splits it evenly among its children, until all of the supplied probability
is allocated among the leaves.
If we examine some leaf containing a hypothesis
\begin_inset Formula \( h \)
\end_inset
, we see that this method gives at least probability
\begin_inset Formula \( p(h)=\prod _{i=1}^{d(h)}\frac{1}{|C_{i}(h)|} \)
\end_inset
to each
\begin_inset Formula \( h \)
\end_inset
for any path of depth
\begin_inset Formula \( d(h) \)
\end_inset
reaching the hypothesis
\begin_inset Formula \( h \)
\end_inset
.
\layout Standard
Note that it is possible that several leaves will contain the same hypothesis
\begin_inset Formula \( h \)
\end_inset
, and in that case one should really add the allocated measures together.
However, the microchoice bound neglects this issue, implying that it will
be unnecessarily loose for learning algorithms which can arrive at the
same hypothesis in multiple ways.
The reason for neglecting this is that now,
\begin_inset Formula \( p(h) \)
\end_inset
is something the learning algorithm itself can calculate by simply keeping
track of the sizes of the choice sets it has encountered so far.
It is important to notice that this construction is defined before observing
any data.
Consequently, every hypothesis has some bound associated with it before
the data is used to pick a particular hypothesis and its corresponding
bound.
\layout Standard
Another way to view this process is that we cannot know in advance which
choice sequence the algorithm will make.
However, a distribution
\begin_inset Formula \( D \)
\end_inset
on labeled examples induces a probability distribution over choice sequences,
inducing a probability distribution
\begin_inset Formula \( q(h) \)
\end_inset
over hypotheses.
Ideally we would like to use
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
in our bounds as noted above.
However, we cannot calculate
\begin_inset Formula \( q(h) \)
\end_inset
(since the distribution
\begin_inset Formula \( D \)
\end_inset
is unknown), so instead, our choice of
\begin_inset Formula \( p(h) \)
\end_inset
will be just an estimate.
We hope that the algorithm designer has chosen a
\begin_inset Quotes eld
\end_inset
good
\begin_inset Quotes erd
\end_inset
learning algorithm which induces a distribution
\begin_inset Formula \( p(h) \)
\end_inset
over the final hypotheses which is near to
\begin_inset Formula \( q(h) \)
\end_inset
.
Our estimate
\begin_inset Formula \( p(h) \)
\end_inset
is the probability distribution resulting from picking each choice uniformly
at random from the current choice set at each level (note: this is different
from picking a final hypothesis uniformly at random).
I.e., it can be viewed as the measure associated with the assumption that
at each step, all choices are equally likely.
\layout Standard
We immediately find the following theorem:
\layout Theorem
\begin_inset LatexCommand \label{th-smb}
\end_inset
(Microchoice Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),\frac{\delta }{\prod _{i=1}^{d(h)}|C_{i}(h)|})\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
\layout Proof
Specialization of the Occam's Razor bound (
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
).
\layout Standard
Once again, it will be worthwhile to slightly loosen this bound with the
following corollary:
\layout Corollary
\begin_inset LatexCommand \label{th-remb}
\end_inset
(Relative Entropy Microchoice Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |C_{i}(h)|+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end_inset
\layout Standard
The point of the microchoice bound is that the quantity
\begin_inset Formula \( \bar{e}(...) \)
\end_inset
is something the algorithm can calculate as it goes along, based on the
sizes of the choice sets encountered.
To see this, note that the hypothesis dependent term is
\begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \)
\end_inset
.
We can calculate
\begin_inset Formula \( d(h) \)
\end_inset
by noting the number of choices made.
The choice sets,
\begin_inset Formula \( C_{i}(h) \)
\end_inset
, can often be easily deduced by reasoning about the possible microchoices
the algorithm could have made given different datasets.
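\layout Standard
To make this running computation concrete, the following is a purely illustrative Python sketch (the function names and the bisection tolerance are our own choices, not part of the bound): it tracks the per-hypothesis confidence as the choice-set sizes are observed, and inverts the binomial tail to obtain the bound on the true error rate.

```python
from math import comb

def binom_tail(m, k, p):
    # Bin(m, k, p): probability of observing at most k errors in m samples
    # when the true error rate is p
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k + 1))

def inverted_bound(m, k, delta, tol=1e-10):
    # bar-e(m, k/m, delta): the largest p with Bin(m, k, p) >= delta,
    # found by bisection since Bin(m, k, p) is decreasing in p
    lo, hi = k / m, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail(m, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def microchoice_delta(delta, choice_set_sizes):
    # per-hypothesis confidence delta / prod_i |C_i(h)|,
    # updated online as each choice set is encountered
    for c in choice_set_sizes:
        delta /= c
    return delta
```

The bound for a hypothesis reached through choice sets of sizes 3 and 2, with 10 empirical errors out of 100 samples, is then `inverted_bound(100, 10, microchoice_delta(0.05, [3, 2]))`.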
\layout Standard
In many natural cases, a
\begin_inset Quotes eld
\end_inset
fortuitous distribution and target concept
\begin_inset Quotes erd
\end_inset
corresponds to a shallow leaf or a part of the tree with low branching,
resulting in a better bound.
For instance, in the decision list case,
\begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \)
\end_inset
is roughly
\begin_inset Formula \( d\ln n \)
\end_inset
where
\begin_inset Formula \( d \)
\end_inset
is the length of the list produced and
\begin_inset Formula \( n \)
\end_inset
is the number of features.
Notice that
\begin_inset Formula \( d\ln n \)
\end_inset
is also the description length of the final hypothesis produced in the
natural encoding, thus in this case these theorems yield similar bounds
to a simple application of Occam's razor or SRM.
\layout Standard
More generally, the microchoice bound is similar to Occam's razor or SRM
bounds when each
\begin_inset Formula \( k \)
\end_inset
-ary choice in the tree corresponds to
\begin_inset Formula \( \log k \)
\end_inset
bits in the natural encoding of the final hypothesis
\begin_inset Formula \( h \)
\end_inset
.
However, sometimes this may not be the case.
Consider, for instance, a local optimization algorithm in which there are
\begin_inset Formula \( n \)
\end_inset
parameters and each step adds or subtracts 1 from one of the parameters.
Suppose in addition the algorithm knows certain constraints that these
parameters must satisfy (perhaps a set of linear inequalities) and the
algorithm restricts itself to choices in the legal region.
In this case, the branching factor, at most
\begin_inset Formula \( 2n \)
\end_inset
, might become much smaller if we are
\begin_inset Quotes eld
\end_inset
lucky
\begin_inset Quotes erd
\end_inset
and head toward a highly constrained portion of the solution space.
One could always reverse-engineer an encoding of hypotheses based on the
choice tree, but the microchoice approach is much more natural.
\layout Standard
There is also an opportunity to use
\emph on
a
\protected_separator
priori
\emph default
knowledge in the choice of
\begin_inset Formula \( p(h) \)
\end_inset
.
In particular, instead of splitting our confidence equally at each node
of the tree, we could split it unevenly, according to some heuristic function
\begin_inset Formula \( g \)
\end_inset
.
If
\begin_inset Formula \( g \)
\end_inset
is
\begin_inset Quotes eld
\end_inset
good
\begin_inset Quotes erd
\end_inset
it may produce error bounds similar to the bounds when
\begin_inset Formula \( p(h)=q(h) \)
\end_inset
.
In fact, the method of section (
\begin_inset LatexCommand \ref{sec:query}
\end_inset
) where we combine these results with Freund's query-tree
\begin_inset LatexCommand \cite{SB}
\end_inset
approach can be thought of as an attempt to do exactly this.
\layout Subsection
Examples
\layout Standard
It is difficult to create a bound which is universally better than previous
bounds.
The microchoice bound can be much better than the discrete hypothesis bound
(
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
) and can be slightly worse.
To develop some understanding of how they compare we consider several cases.
\layout Subsubsection
Greedy Set Cover
\layout Standard
Consider a greedy set cover algorithm for learning an OR function over
\begin_inset Formula \( F \)
\end_inset
Boolean features.
The algorithm begins with a choice space of size
\begin_inset Formula \( F+1 \)
\end_inset
(one per feature or halt) and chooses the feature that covers the most
positive examples while covering no negative ones.
It then moves to a choice space of size
\begin_inset Formula \( F \)
\end_inset
(one per feature remaining or halt) and chooses the best remaining feature
and so on until it halts.
If the number of features chosen is
\begin_inset Formula \( k \)
\end_inset
then the microchoice bound is:
\layout Standard
\begin_inset Formula
\[
\epsilon (h)=\frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{i=1}^{k}\ln (F-i+2)\right) \leq \frac{1}{m}\left( \ln \frac{1}{\delta }+k\ln (F+1)\right) \]
\end_inset
\layout Standard
The bound of (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
) is:
\layout Standard
\begin_inset Formula
\[
\epsilon =\frac{1}{m}\left( \ln \frac{1}{\delta }+F\ln 2\right) .\]
\end_inset
\layout Standard
If
\begin_inset Formula \( k \)
\end_inset
is small, then the microchoice bound is a lot better, but if
\begin_inset Formula \( k=O(F) \)
\end_inset
then the microchoice bound is slightly worse than the discrete hypothesis
bound.
Notice that in this case the microchoice bound is essentially the same
as the standard Occam's razor analysis when one uses
\begin_inset Formula \( O(\ln F) \)
\end_inset
bits per feature to describe the hypothesis.
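\layout Standard
The comparison above is simple to compute directly. A short Python sketch (function names are ours) of the two log-penalty terms:

```python
from math import log

def microchoice_penalty(F, k):
    # the choice-set sizes are F+1, F, ..., F-k+2,
    # so the penalty is sum_{i=1}^{k} ln(F - i + 2)
    return sum(log(F - i + 2) for i in range(1, k + 1))

def discrete_hypothesis_penalty(F):
    # ln|H| = F ln 2 for the 2^F OR functions over F Boolean features
    return F * log(2)
```

With F = 100, a short cover (k = 3) gives a microchoice penalty of about 13.8 against 69.3 for the discrete hypothesis bound, while k = F reverses the ordering, matching the discussion above.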
\layout Subsubsection
Decision Trees
\layout Standard
Decision trees over discrete sets (say,
\begin_inset Formula \( \{0,1\}^{F} \)
\end_inset
) are another natural setting for application of the microchoice bound.
\layout Standard
A decision tree differs from a decision list in that the size of the available
choice set is larger due to the fact that there are multiple nodes where
a new test may be applied.
In particular, for a decision tree with
\begin_inset Formula \( K \)
\end_inset
leaves at an average depth of
\begin_inset Formula \( d \)
\end_inset
, the choice set size is
\begin_inset Formula \( K(F-d) \)
\end_inset
, giving a bound noticeably worse than the bound for the decision list.
This motivates a slightly different decision tree algorithm which considers
only one leaf node at a time.
The algorithm adds a new test or decides to never add a new test at this
node.
In this case, there are
\begin_inset Formula \( (F-d(v)+1) \)
\end_inset
choices for a node
\begin_inset Formula \( v \)
\end_inset
at depth
\begin_inset Formula \( d(v) \)
\end_inset
, implying the bound:
\begin_inset Formula
\begin{equation}
\label{eqn:dectreemc}
\textrm{KL}(\hat{e}(h)||e(h))\leq \frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{v}\ln (F-d(v)+1)\right)
\end{equation}
\end_inset
where
\begin_inset Formula \( v \)
\end_inset
ranges over the nodes of the decision tree.
Once again, this is very similar to what might be produced by an Occam's
Razor Bound with an appropriate choice of prior.
This result is again sometimes much better than the Discrete Hypothesis
bound and sometimes slightly worse.
\layout Subsection
Pruning
\layout Standard
Decision tree algorithms for real-world learning problems often have some
form of
\begin_inset Quotes eld
\end_inset
pruning
\begin_inset Quotes erd
\end_inset
as in
\begin_inset LatexCommand \cite{Quinlan}
\end_inset
and
\begin_inset LatexCommand \cite{Mingers}
\end_inset
.
The tree is first grown to full size producing a hypothesis with minimum
empirical error.
Then the tree is
\begin_inset Quotes eld
\end_inset
pruned
\begin_inset Quotes erd
\end_inset
starting at the leaves and progressing up through the tree toward the root
node using some test for the significance of an internal node.
An internal node is not significant if the reduction in total error is
small in comparison to the complexity of its children.
Insignificant internal nodes are replaced with a leaf resulting in a smaller
tree.
\layout Standard
Microchoice bounds have the property that they incidentally prove a bound
for every decision tree which can be found by pruning internal nodes.
In particular, one of the choices available when constructing a node is
to make the node a leaf.
Therefore, if we begin with the tree
\begin_inset Formula \( T \)
\end_inset
and then prune to the smaller tree
\begin_inset Formula \( T' \)
\end_inset
, we can apply the bound (
\begin_inset LatexCommand \ref{eqn:dectreemc}
\end_inset
) to
\begin_inset Formula \( T' \)
\end_inset
\emph on
as if
\emph default
the algorithm had constructed
\begin_inset Formula \( T' \)
\end_inset
directly rather than having gone first through the tree
\begin_inset Formula \( T \)
\end_inset
.
This suggests another possible pruning criterion: prune a node if the pruning
would result in an improved microchoice bound.
That is, prune if the increase in empirical error is less than the decrease
in
\begin_inset Formula \( \epsilon (h) \)
\end_inset
.
This pruning criterion is a
\begin_inset Quotes eld
\end_inset
pessimistic criteria
\begin_inset Quotes erd
\end_inset
\begin_inset LatexCommand \cite{Yishay}
\end_inset
.
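\layout Standard
This criterion fits in a few lines of Python. The sketch below (names ours) uses the loose square-root form of the microchoice penalty for readability; a tighter binomial inversion could be substituted without changing the decision rule.

```python
from math import log, sqrt

def epsilon(m, delta, log_choice_sizes):
    # loose square-root form of the microchoice penalty:
    # sqrt((sum_i ln|C_i| + ln(1/delta)) / (2m))
    return sqrt((sum(log_choice_sizes) + log(1 / delta)) / (2 * m))

def should_prune(m, delta, err_full, logs_full, err_pruned, logs_pruned):
    # prune when the rise in empirical error is no more than the
    # drop in the bound term epsilon(h)
    return (err_pruned + epsilon(m, delta, logs_pruned)
            <= err_full + epsilon(m, delta, logs_full))
```

Because pruning removes choice-set terms from the sum, the pruned tree pays a strictly smaller penalty, and the rule prunes exactly when that saving covers the increase in empirical error.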
\layout Standard
The similarities to SRM are discussed next.
\layout Subsection
Microchoice and Structural Risk Minimization
\layout Standard
The microchoice bound is essentially a compelling application of the Disjoint
SRM bound
\begin_inset LatexCommand \ref{th-SRM}
\end_inset
where the description language for a hypothesis is the sequence of data-dependent
choices which the algorithm makes in the process of deciding upon the hypothesis.
The hypothesis set
\begin_inset Formula \( H_{i} \)
\end_inset
is all hypotheses with the same description length in this language.
\layout Standard
As an example, consider a binary decision tree with
\begin_inset Formula \( F \)
\end_inset
Boolean features and a Boolean label.
The first hypothesis set,
\begin_inset Formula \( H_{1} \)
\end_inset
will consist of
\begin_inset Formula \( 2 \)
\end_inset
hypotheses; always false and always true.
In general, we will have one hypothesis set for every legal configuration
of internal nodes.
The size of a hypothesis set where every tree contains
\begin_inset Formula \( k \)
\end_inset
internal nodes will be
\begin_inset Formula \( 2^{k+1} \)
\end_inset
because there are
\begin_inset Formula \( k+1 \)
\end_inset
leaves each of which can take
\begin_inset Formula \( 2 \)
\end_inset
values.
The weighting
\begin_inset Formula \( p(i) \)
\end_inset
across the different hypothesis sets is defined by the microchoice allocation
of confidence.
\layout Standard
The principal disadvantage of the microchoice bound is that the sequence
of data-dependent choices may contain redundancy.
A different SRM bound with a different set of disjoint hypothesis sets
might be able to better avoid redundancy.
As an example, assume that we are working with a decision tree on
\begin_inset Formula \( F \)
\end_inset
binary features.
There are
\begin_inset Formula \( F+2 \)
\end_inset
choices (any of
\begin_inset Formula \( F \)
\end_inset
features or
\begin_inset Formula \( 2 \)
\end_inset
labels) at the top node.
At the next node down there will be
\begin_inset Formula \( F+1 \)
\end_inset
choices in both the left and right children.
Repeat until a maximal decision tree is constructed.
There will be
\begin_inset Formula \( \prod _{i=0}^{F}(F-i+2)^{2^{i}} \)
\end_inset
possible trees.
This number is somewhat larger than the number of Boolean functions on
\begin_inset Formula \( F \)
\end_inset
features:
\begin_inset Formula \( 2^{2^{F}} \)
\end_inset
.
\layout Section
Combining Microchoice with Freund's Query Tree approach
\layout Standard
\begin_inset LatexCommand \label{sec:query}
\end_inset
\layout Standard
The next section is devoted to an improvement of the microchoice bound called
adaptive microchoice, which arises from synthesizing Freund's query trees
\begin_inset LatexCommand \cite{SB}
\end_inset
with the microchoice bound.
This improvement is not easily expressed as a simplification of Structural
Risk Minimization.
In essence, the adaptive microchoice bound can gain from dependence on
the learning problem distribution
\begin_inset Formula \( D \)
\end_inset
and can take advantage of an
\begin_inset Quotes eld
\end_inset
easy
\begin_inset Quotes erd
\end_inset
distribution.
\layout Standard
First we require some background material in order to state and understand
Freund's bound.
\layout Subsection
Preliminaries and Definitions
\layout Standard
The statistical query framework introduced by Kearns
\begin_inset LatexCommand \cite{SQ}
\end_inset
restricts learning algorithms to access the data only through statistical
queries.
A statistical query takes as input a binary predicate,
\begin_inset Formula \( \chi \)
\end_inset
, mapping examples to a binary output:
\begin_inset Formula \( (X,Y)\rightarrow \{0,1\} \)
\end_inset
.
The output of the statistical query is the average of
\begin_inset Formula \( \chi \)
\end_inset
over the examples seen.
Let
\begin_inset Formula \( z_{i} \)
\end_inset
be the
\begin_inset Formula \( i \)
\end_inset
th labeled example, then:
\begin_inset Formula
\[
\hat{D}_{\chi }=\frac{1}{m}\sum _{i=1}^{m}\chi (z_{i})\]
\end_inset
\layout Standard
The output is an empirical estimate of the true value
\begin_inset Formula \( D_{\chi }={\textbf {E}}_{D}[\chi (z)]=\Pr _{z\sim D}(\chi (z)=1) \)
\end_inset
of the query under the distribution
\begin_inset Formula \( D \)
\end_inset
\begin_float footnote
\layout Standard
In the real SQ model there is no set of examples.
The algorithm asks a query
\begin_inset Formula \( \chi \)
\end_inset
and is given a response
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
that is guaranteed to be near to the true value
\begin_inset Formula \( D_{\chi } \)
\end_inset
.
That is, the true SQ model is an abstraction of the scenario described
here where
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
is computed from an observed sample.
\end_float
.
One simple example of a predicate
\begin_inset Formula \( \chi \)
\end_inset
is
\begin_inset Quotes eld
\end_inset
the first bit is
\begin_inset Formula \( 1 \)
\end_inset
\begin_inset Quotes erd
\end_inset
.
A more complicated predicate might be
\begin_inset Quotes eld
\end_inset
the third bit xor the fourth bit is
\begin_inset Formula \( 0 \)
\end_inset
\begin_inset Quotes erd
\end_inset
.
Naturally, the distribution of
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
will be the familiar Binomial distribution.
\layout Standard
It is convenient to define
\begin_inset Formula
\[
\bar{I}_{D_{\chi }}(\delta )=\frac{1}{m}\max _{k}\left\{ k:\, \, 1-\textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \]
\end_inset
and
\begin_inset Formula
\[
\underline{I}_{D_{\chi }}(\delta )=\frac{1}{m}\min _{k}\left\{ k:\, \, \textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \]
\end_inset
and let
\begin_inset Formula
\[
I_{\chi }(\delta )=\left[ \underline{I}_{D_{\chi }}(\delta ),\bar{I}_{D_{\chi }}(\delta )\right] \]
\end_inset
Intuitively,
\begin_inset Formula \( I_{\chi }(\delta ) \)
\end_inset
is a (fixed) interval in which the random variable
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
will fall with high probability.
In other words, we know that:
\begin_inset Formula
\[
\Pr \left[ \hat{D}_{\chi }\not \in I_{\chi }(\delta )\right] \leq \delta \]
\end_inset
Now, we want to construct a confidence interval based upon the high confidence
interval
\begin_inset Formula \( I_{\chi }(\delta ) \)
\end_inset
.
We can do this using the inversion lemma (
\begin_inset LatexCommand \ref{lem-inversion}
\end_inset
) to get:
\begin_inset Formula
\[
\bar{D}_{\chi }(\delta )=\max _{p}\left\{ p:\underline{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \]
\end_inset
and
\begin_inset Formula
\[
\underline{D}_{\chi }(\delta )=\min _{p}\left\{ p:\bar{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \]
\end_inset
\layout Standard
The random interval defined here contains the
\begin_inset Quotes eld
\end_inset
real
\begin_inset Quotes erd
\end_inset
answer
\begin_inset Formula \( D_{\chi } \)
\end_inset
with high probability.
In other words, we have:
\begin_inset Formula
\[
\Pr \left[ D_{\chi }\not \in [\underline{D}_{\chi }(\delta ),\bar{D}_{\chi }(\delta )]\right] \leq \delta \]
\end_inset
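\layout Standard
When the true query value is known, these intervals are directly computable from the binomial distribution. A Python sketch (names ours; the binomial CDF is summed directly rather than taken from a statistics library):

```python
from math import comb

def binom_cdf(m, k, p):
    # Bin(m, k, p): probability of at most k ones among m draws of chi
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k + 1))

def interval(m, true_value, delta):
    # I_chi(delta): a fixed interval containing the empirical average
    # hat-D_chi with probability at least 1 - delta
    upper = max(k for k in range(m + 1)
                if 1 - binom_cdf(m, k, true_value) >= delta / 2)
    lower = min(k for k in range(m + 1)
                if binom_cdf(m, k, true_value) >= delta / 2)
    return lower / m, upper / m
```

For example, with m = 100 samples, a true value of 0.5, and delta = 0.05, the interval is [0.40, 0.59]; the observable confidence interval for the true value is obtained by inverting these endpoints as in the inversion lemma.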
\layout Subsection
Background and Summary
\layout Standard
Freund
\begin_inset LatexCommand \cite{SB}
\end_inset
considers choice algorithms that at each step perform a Statistical Query
on the sample, using the result to determine which choice to take.
For an algorithm
\begin_inset Formula \( A \)
\end_inset
, tolerance
\begin_inset Formula \( \alpha \)
\end_inset
(defined next), and distribution
\begin_inset Formula \( D \)
\end_inset
, Freund defines the query tree
\begin_inset Formula \( T_{A}(D,\alpha ) \)
\end_inset
as the choice tree created by considering only those choices resulting
from answers
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
to queries
\begin_inset Formula \( \chi \)
\end_inset
such that
\begin_inset Formula \( \left| \hat{D}_{\chi }-D_{\chi }\right| \leq \alpha \)
\end_inset
.
The idea is that if a particular predicate,
\begin_inset Formula \( \chi \)
\end_inset
, is true with probability
\begin_inset Formula \( .9 \)
\end_inset
(for example) on a random sample it is very unlikely that the empirical
result of the query will be
\begin_inset Formula \( .1 \)
\end_inset
.
More generally, the chance the answer to a given query is off by more than
\begin_inset Formula \( \alpha \)
\end_inset
is at most
\begin_inset Formula \( 2e^{-2m\alpha ^{2}} \)
\end_inset
by Hoeffding's inequality.
So, if the entire tree contains a total of
\begin_inset Formula \( |Q(T_{A}(D,\alpha ))| \)
\end_inset
queries in it, the probability
\emph on
any
\emph default
of these queries is off by more than
\begin_inset Formula \( \alpha \)
\end_inset
is at most
\begin_inset Formula \( 2\cdot |Q(T_{A}(D,\alpha ))|\cdot e^{-2m\alpha ^{2}} \)
\end_inset
.
In other words, this is an upper bound on the probability the algorithm
ever
\begin_inset Quotes eld
\end_inset
falls off the tree
\begin_inset Quotes erd
\end_inset
and makes a low probability choice.
The point of this is that we can allocate half (say) of the confidence
parameter
\begin_inset Formula \( \delta \)
\end_inset
to the event that the algorithm ever falls off the tree, and then spread
the remaining half evenly on the hypotheses in the tree (which hopefully
is a much smaller set than the entire hypothesis set).
\layout Standard
Unfortunately, the query tree suffers from the same problem as the
\begin_inset Formula \( q(h) \)
\end_inset
distribution considered in section (
\begin_inset LatexCommand \ref{sec:motivating}
\end_inset
), namely that to compute it, one needs to know
\begin_inset Formula \( D \)
\end_inset
.
So, Freund proposes an algorithmic method to find a super-set approximation
of the tree.
The idea is that by analyzing the results of queries, it is possible to
determine which outcomes would have been unlikely given that the empirical
answer is close to the true value.
In particular, each time a query
\begin_inset Formula \( \chi \)
\end_inset
is asked and a response
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
is received, if it is true that
\begin_inset Formula \( |\hat{D}_{\chi }-D_{\chi }|\leq \alpha \)
\end_inset
, then the range
\begin_inset Formula \( \left[ \hat{D}_{\chi }-2\alpha ,\hat{D}_{\chi }+2\alpha \right] \)
\end_inset
contains the range
\begin_inset Formula \( \left[ D_{\chi }-\alpha ,D_{\chi }+\alpha \right] \)
\end_inset
.
Thus, under the assumption that no query in the
\emph on
correct
\emph default
tree is answered badly, a super-set of the correct tree can be produced
by exploring all choices resulting from responses within
\begin_inset Formula \( 2\alpha \)
\end_inset
of the response actually received.
By applying this method to every node in the query tree we can generate
an empirically observable super-set of the query tree: that is, the original
query tree is a pruning of the empirically constructed tree.
\layout Standard
A drawback of this method is that it can easily take exponential time to
produce the approximate tree, because even the smaller correct tree can
have a size exponential in the running time of the learning algorithm.
Instead, we would much rather simply keep track of the choices actually
made and the sizes of the nodes actually followed, which is what the microchoice
approach allows us to do.
As a secondary point, given
\begin_inset Formula \( \delta \)
\end_inset
, computing a good value of
\begin_inset Formula \( \alpha \)
\end_inset
for Freund's approach is not trivial, see
\begin_inset LatexCommand \cite{SB}
\end_inset
; we will be able to finesse that issue and use the tighter bound
\begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta ) \)
\end_inset
.
\layout Standard
In order to apply the microchoice approach, we modify Freund's query tree
so that different nodes in the tree receive different confidence,
\begin_inset Formula \( \delta \)
\end_inset
, much in the same way that different hypotheses
\begin_inset Formula \( h \)
\end_inset
in our choice tree receive different values of
\begin_inset Formula \( \delta (h) \)
\end_inset
.
\layout Subsection
Microchoice Bounds for Query Trees
\layout Standard
The manipulations of the choice tree are now reasonably straightforward.
We begin by describing the
\emph on
true
\emph default
microchoice query tree and then give the algorithmic approximation.
As with the choice tree in section (
\begin_inset LatexCommand \ref{sec:mc}
\end_inset
), one should think of each node in the tree as representing the current
internal state of the algorithm.
\layout Standard
We incorporate Freund's approach into the choice tree construction by having
each internal node allocate a portion,
\begin_inset Formula \( p' \)
\end_inset
of its
\begin_inset Quotes eld
\end_inset
supply
\begin_inset Quotes erd
\end_inset
of failure probability to the event that
\begin_inset Formula \( \hat{D}_{\chi }\not \in I_{\chi }(\delta *p') \)
\end_inset
.
The node then splits the remainder of its supply evenly among the children
corresponding to choices that result from answers
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
with
\begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta *p') \)
\end_inset
.
Choices that would result from
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
answers to the query are
\emph on
pruned away
\emph default
from the tree and get nothing.
This continues down the tree to the leaves.
Pictorially, this looks like:
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 267 116
file thesis-presentation/amicro_tree.eps
width 4 90
flags 9
\end_inset
\layout Standard
How should
\begin_inset Formula \( p' \)
\end_inset
be chosen? Smaller values of
\begin_inset Formula \( p' \)
\end_inset
result in larger intervals
\begin_inset Formula \( I_{\chi }(\delta *p') \)
\end_inset
leading to more children in the pruned tree and less confidence given to
each.
Larger values of
\begin_inset Formula \( p' \)
\end_inset
leave less of the supply to divide among the children.
Unfortunately, our algorithmic approximation (which only sees the empirical
answers
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
and needs to be efficient) will not be able to make this optimization.
Therefore, we define
\begin_inset Formula \( p' \)
\end_inset
in the
\emph on
true
\emph default
microchoice query tree to be
\begin_inset Formula \( \frac{p}{d+1} \)
\end_inset
where
\begin_inset Formula \( d \)
\end_inset
is the depth of the current node.
This choice will imply that the adaptive microchoice bound is never much
worse than the Microchoice bound, and sometimes much better.
\layout Standard
\added_space_bottom 0.3cm
Since a particular query value
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
implies a particular choice
\begin_inset Formula \( c \)
\end_inset
, we can think of the interval
\begin_inset Formula \( I_{\chi }(\delta ) \)
\end_inset
as containing
\emph on
choices
\emph default
rather than query results.
After all, we only care about the choices the algorithm makes.
We can calculate the probability assigned to a hypothesis in the
\emph on
true
\emph default
adaptive microchoice query tree according to the following algorithm:
\layout Algorithm
True_Adaptive_Microchoice(
\begin_inset Formula \( \delta \)
\end_inset
)
\layout Enumerate
set
\begin_inset Formula \( p\leftarrow 1 \)
\end_inset
\layout Enumerate
set
\begin_inset Formula \( d\leftarrow 1 \)
\end_inset
\layout Enumerate
while learning algorithm has not halted.
\begin_deeper
\layout Enumerate
\begin_inset Formula \( d\leftarrow d+1 \)
\end_inset
\layout Enumerate
\begin_inset Formula \( p'\leftarrow \frac{p}{d} \)
\end_inset
\layout Enumerate
Let
\begin_inset Formula \( C \)
\end_inset
= the current set of possible data-dependent choices.
\layout Enumerate
\begin_inset Formula \( |\hat{C}|\leftarrow |\{c\in C|c\in I_{\chi }(\delta *p')\}| \)
\end_inset
\layout Enumerate
\begin_inset Formula \( p\leftarrow \frac{p-p'}{|\hat{C}|} \)
\end_inset
\end_deeper
\layout Enumerate
return
\begin_inset Formula \( p \)
\end_inset
\layout Standard
There are two important things to note about this algorithm.
First of all, we could plug the value
\begin_inset Formula \( p(h) \)
\end_inset
it returns into the Occam's Razor Bound
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
and receive a bound on the true error rate of our chosen classifier.
\layout Standard
Second, this algorithm cannot be executed.
The essential problem is determining whether or not
\begin_inset Formula \( c\in I_{\chi }(\delta *p') \)
\end_inset
which cannot be done without knowledge of the underlying distribution
\begin_inset Formula \( D \)
\end_inset
.
However, we
\emph on
can
\emph default
calculate an approximate version of this algorithm which, with high probability,
returns a value
\begin_inset Formula \( p \)
\end_inset
which is smaller.
Since a smaller value is pessimistic, we can use it in our bounds.
\layout Standard
The algorithmic approximation uses the idea in
\begin_inset LatexCommand \cite{SB}
\end_inset
of including all choices within the
\emph on
double
\emph default
confidence interval of the observed value
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
.
Unlike
\begin_inset LatexCommand \cite{SB}
\end_inset
, however, we do not actually create the tree; instead we just follow the
path taken by the learning algorithm, and argue that the
\begin_inset Quotes eld
\end_inset
supply
\begin_inset Quotes erd
\end_inset
probability remaining at the leaf is no greater than the amount that would
have been there in the original tree.
Finally, the algorithm outputs a bound calculated with
\begin_inset Formula \( p(h) \)
\end_inset
.
\layout Standard
Specifically, the algorithm is as follows.
Suppose we are at a node of the tree containing statistical query
\begin_inset Formula \( \chi \)
\end_inset
at depth
\begin_inset Formula \( d(\chi ) \)
\end_inset
and we have a remaining supply
\begin_inset Formula \( p \)
\end_inset
of the confidence parameter.
(If the current node is the root, then
\begin_inset Formula \( p=1 \)
\end_inset
and
\begin_inset Formula \( d(\chi )=1 \)
\end_inset
).
We choose
\begin_inset Formula \( p'=p/(d+1) \)
\end_inset
, ask the query
\begin_inset Formula \( \chi \)
\end_inset
, and receive
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
.
Let
\begin_inset Formula
\[
\bar{\bar{D}}_{\chi }(\delta )=\min _{k}\left\{ \frac{k}{m}:\, \, 1-\textrm{Bin}\left( m,k,\bar{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \]
\end_inset
and
\begin_inset Formula
\[
\underline{\underline{D}}_{\chi }(\delta )=\max _{k}\left\{ \frac{k}{m}:\, \, \textrm{Bin}\left( m,k,\underline{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \]
\end_inset
with
\begin_inset Formula
\[
\hat{I}_{\chi }(\delta )=\left[ \underline{\underline{D}}_{\chi }(\delta ),\bar{\bar{D}}_{\chi }(\delta )\right] \]
\end_inset
\layout Standard
We now let
\begin_inset Formula \( k \)
\end_inset
be the number of children of our node corresponding to answers in the range
\begin_inset Formula \( \hat{I}_{\chi }(p') \)
\end_inset
.
We then go to the child corresponding to the answer
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
that we received, giving this child a confidence parameter supply of
\begin_inset Formula \( (p-p')/k \)
\end_inset
.
This is the same as we would have given it had we allocated
\begin_inset Formula \( p-p' \)
\end_inset
to the children equally.
We then continue from that child.
Finally, when we reach a leaf, we output the probability left for the hypothesis.
Pictorially, this looks like:
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 267 137
file thesis-presentation/amcicro_alg.eps
width 4 90
flags 9
\end_inset
\layout Standard
Notice that the second choice set is larger than in the true adaptive microchoice
tree.
This can easily happen and it makes our results somewhat more pessimistic.
The approximate adaptive microchoice algorithm is specified as follows:
\layout Algorithm
Approximate_Adaptive_Microchoice(
\begin_inset Formula \( \delta \)
\end_inset
)
\layout Enumerate
set
\begin_inset Formula \( p\leftarrow 1 \)
\end_inset
\layout Enumerate
set
\begin_inset Formula \( d\leftarrow 1 \)
\end_inset
\layout Enumerate
while learning algorithm has not halted.
\begin_deeper
\layout Enumerate
\begin_inset Formula \( d\leftarrow d+1 \)
\end_inset
\layout Enumerate
\begin_inset Formula \( p'\leftarrow \frac{p}{d} \)
\end_inset
\layout Enumerate
Let
\begin_inset Formula \( C \)
\end_inset
= the current set of possible data-dependent choices.
\layout Enumerate
\begin_inset Formula \( |\hat{C}|\leftarrow |\{c\in C|c\in \hat{I}_{\chi }(\delta *p')\}| \)
\end_inset
\layout Enumerate
\begin_inset Formula \( p\leftarrow \frac{p-p'}{|\hat{C}|} \)
\end_inset
\end_deeper
\layout Enumerate
return
\begin_inset Formula \( p \)
\end_inset
\layout Standard
Let
\begin_inset Formula \( d(h) \)
\end_inset
be the depth of some hypothesis
\begin_inset Formula \( h \)
\end_inset
in the empirical path and
\begin_inset Formula \( \hat{C}_{1}(h) \)
\end_inset
,
\begin_inset Formula \( \hat{C}_{2}(h) \)
\end_inset
,
\begin_inset Formula \( \ldots \)
\end_inset
,
\begin_inset Formula \( \hat{C}_{d}(h) \)
\end_inset
be the sequence of choice sets resulting in
\begin_inset Formula \( h \)
\end_inset
in the algorithmic construction; i.e.,
\begin_inset Formula \( \hat{C}_{i}(h) \)
\end_inset
is the number of unpruned children of the
\begin_inset Formula \( i \)
\end_inset
-th node.
Then, the confidence placed on
\begin_inset Formula \( h \)
\end_inset
will be:
\layout Standard
\begin_inset Formula
\begin{equation}
\label{eqn:adaptiveMC}
p(h)=\prod _{i=1}^{d(h)}\left( \frac{i}{i+1}\frac{1}{|\hat{C}_{i}(h)|}\right) =\frac{1}{d(h)+1}\prod _{i=1}^{d(h)}\frac{1}{|\hat{C}_{i}(h)|}
\end{equation}
\end_inset
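\layout Standard
The confidence above can be accumulated exactly as the supply is split while walking down the tree, and the telescoping identity can be checked directly. A Python sketch (names ours):

```python
from math import isclose

def adaptive_confidence(choice_set_sizes):
    # p(h) = prod_{i=1}^{d} (i/(i+1)) * (1/|C_i(h)|), accumulated
    # one level at a time as the supply is split
    p = 1.0
    for i, c in enumerate(choice_set_sizes, start=1):
        p *= i / (i + 1) / c
    return p
```

Since the i/(i+1) factors telescope to 1/(d(h)+1), the price of adaptivity over the plain microchoice allocation is only an additive ln(d(h)+1) term in the relative entropy form of the bound.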
\layout Theorem
\begin_inset LatexCommand \label{th-amb}
\end_inset
(Adaptive Microchoice Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),p(h)*\delta )\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end_inset
, and
\begin_inset Formula \( p(h) \)
\end_inset
is as defined in equation
\begin_inset LatexCommand \ref{eqn:adaptiveMC}
\end_inset
.
\layout Proof
\noindent
By design, with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
all queries in the true microchoice query tree receive good answers,
\emph on
and
\emph default
all hypotheses in that tree have their true errors within their estimates.
\layout Proof
\noindent
We will prove that in the high probability case, the output of the Approximate_Adaptive_Microchoice algorithm is no greater than the output of the True_Adaptive_Microchoice algorithm.
Since a smaller
\begin_inset Formula \( p(h) \)
\end_inset
only makes the bound more pessimistic, this suffices to prove the theorem.
\layout Proof
\noindent
Assume inductively that at the current node of our empirical path the supply
\begin_inset Formula \( p_{\mathrm{emp}} \)
\end_inset
is no greater than the supply
\begin_inset Formula \( p_{\mathrm{true}} \)
\end_inset
given to that node in the true tree.
This is clearly satisfied in the base case when
\begin_inset Formula \( p_{\mathrm{emp}}=p_{\mathrm{true}}=1 \)
\end_inset
.
\layout Proof
\noindent
Under the assumption that the response
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
falls in the interval
\begin_inset Formula \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \)
\end_inset
, it must be the case that the interval
\begin_inset Formula \( \hat{I}_{\chi }(p_{\mathrm{emp}}/(d+1)) \)
\end_inset
contains the interval
\begin_inset Formula \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \)
\end_inset
.
Therefore, the supply given to any child in the empirical path is no greater
than the supply given in the true tree.
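\layout Standard
The quantity
\begin_inset Formula \( \bar{e} \)
\end_inset
in the theorem statement is an inversion of the Binomial tail; it has no closed form but is easily computed by bisection, since the tail probability decreases monotonically in the true error rate. A sketch (function names are hypothetical; Bin(m,k,p) is taken to be the probability of at most k errors in m trials):

```python
from math import comb

def binomial_tail(m, k, p):
    # Bin(m, k, p): probability of k or fewer errors among m examples
    # when the true error rate is p.
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

def e_bar(m, k, delta, tol=1e-10):
    # Largest p with Bin(m, k, p) >= delta, found by bisection.
    lo, hi = k / m, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_tail(m, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with no observed errors on 100 examples and delta = 0.05, the computed true error bound is about 0.0295.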
\layout Standard
The corresponding relative entropy corollary is:
\layout Corollary
\begin_inset LatexCommand \label{th-reamb}
\end_inset
(Relative Entropy Microchoice Bound) For all hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
,
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |\hat{C}_{i}(h)|+\ln (d(h)+1)+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end_inset
\layout Standard
The bound in theorem (
\begin_inset LatexCommand \ref{th-amb}
\end_inset
) is very similar to (
\begin_inset LatexCommand \ref{th-smb}
\end_inset
) except that the choice complexity is slightly worsened with the
\begin_inset Formula \( \ln (d(h)+1) \)
\end_inset
term but improved by replacing
\begin_inset Formula \( C_{i}(h) \)
\end_inset
with the smaller
\begin_inset Formula \( \hat{C}_{i}(h) \)
\end_inset
.
\layout Subsection
Allowing batch queries
\layout Standard
Most natural Statistical Query algorithms make each choice based on responses
to a
\emph on
set
\emph default
of queries, not just one.
For instance, to decide what variable to put at the top of a decision tree,
we ask
\begin_inset Formula \( F \)
\end_inset
queries, one for each feature; we then choose the feature whose answer
was most
\begin_inset Quotes eld
\end_inset
interesting
\begin_inset Quotes erd
\end_inset
.
This suggests generalizing the query tree model to allow each tree node
to contain a set of queries, executed in batch.
Requiring each node in the query tree to contain just a single query as
in the above construction would result in an unfortunately high branching
factor just for the purpose of
\begin_inset Quotes eld
\end_inset
remembering
\begin_inset Quotes erd
\end_inset
the answers received so far.
\begin_float footnote
\layout Standard
Consider a decision tree algorithm attempting to find the right feature
for a node.
If the first query returns a value of
\begin_inset Formula \( \hat{D}_{\chi } \)
\end_inset
with a confidence of
\begin_inset Formula \( \delta \)
\end_inset
then the branching factor would be approximately
\begin_inset Formula \( m\cdot \hat{I}_{\chi }(\delta ) \)
\end_inset
.
This branching factor would be approximately the same for further queries
required by the algorithm to make a decision about what feature to use.
This results in a total multiplied choice space size of approximately
\begin_inset Formula \( \left[ 4m\cdot \hat{I}_{\chi }(\delta )\right] ^{F} \)
\end_inset
which can be reduced to
\begin_inset Formula \( F \)
\end_inset
or less using a batch query.
\end_float
\layout Standard
Extending the algorithmic construction to allow for batch queries is easily
done.
If a node has
\begin_inset Formula \( q \)
\end_inset
queries
\begin_inset Formula \( \chi _{1},\ldots ,\chi _{q} \)
\end_inset
, we choose the query confidence
\begin_inset Formula \( \delta *p' \)
\end_inset
as before, but we now split the mass evenly among all
\begin_inset Formula \( q \)
\end_inset
queries.
We then let
\begin_inset Formula \( k \)
\end_inset
be the number of children corresponding to answers to the queries
\begin_inset Formula \( \chi _{1},\ldots ,\chi _{q} \)
\end_inset
in the ranges
\begin_inset Formula \( \hat{I}_{\chi _{1}}(\delta p'/q),\ldots ,\hat{I}_{\chi _{q}}(\delta p'/q) \)
\end_inset
respectively.
We then go to the child corresponding to the answers we actually received,
and as before give the child a probability supply of
\begin_inset Formula \( (p-p')/k \)
\end_inset
.
Theorem (
\begin_inset LatexCommand \ref{th-amb}
\end_inset
) holds exactly as before; the only change is that
\begin_inset Formula \( |\hat{C}_{i}(h)| \)
\end_inset
means the size of the
\begin_inset Formula \( i \)
\end_inset
-th choice set in the batch tree rather than the size in the single-query-per-no
de tree.
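\layout Standard
The confidence bookkeeping at a single batch node can be sketched as follows. The function is hypothetical, and it assumes the node's query share is chosen as
\begin_inset Formula \( p'=p/(d+1) \)
\end_inset
, a choice used in the decision tree example below:

```python
def batch_node_supplies(p, d, q, k):
    # Split a node's confidence supply p (at depth d) between a batch of
    # q queries and the k children that survive pruning.
    p_query = p / (d + 1)          # share spent on this node's queries
    per_query = p_query / q        # even split among the q queries
    per_child = (p - p_query) / k  # remaining supply, split among children
    return per_query, per_child
```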
\layout Subsection
Example: Batch Queries for Decision trees
\layout Standard
When growing a decision tree, it is natural to make a batch of queries and
then make a decision about which feature to place in a node.
The process is then repeated to grow the full tree structure.
As in the decision tree example described in the simple microchoice section,
if we have
\begin_inset Formula \( F \)
\end_inset
features and are considering adding a node at depth
\begin_inset Formula \( d(v) \)
\end_inset
, there are
\begin_inset Formula \( F-d(v)+1 \)
\end_inset
possible features that could be chosen for placement in a particular node.
The decision of which feature to use is made by comparing the results of
\begin_inset Formula \( F-d(v)+1 \)
\end_inset
queries to pick the best feature according to some criteria, such as informatio
n gain.
We can choose
\begin_inset Formula \( p'=p/(d+1) \)
\end_inset
, then further divide
\begin_inset Formula \( p' \)
\end_inset
into confidences of size
\begin_inset Formula \( p'/(F-d(v)+1) \)
\end_inset
, placing each divided confidence on one of the
\begin_inset Formula \( F-d(v)+1 \)
\end_inset
statistical queries.
We now may be able to eliminate some of the
\begin_inset Formula \( F-d(v)+1 \)
\end_inset
choices from consideration, allowing the remaining confidence,
\begin_inset Formula \( p-p' \)
\end_inset
to be apportioned evenly amongst the remaining choices.
Depending on the underlying distribution this could substantially reduce
the size of the choice set.
The best case occurs when one feature partitions all examples reaching
the node perfectly and all other features are independent of the target.
In this case the choice set will have size
\begin_inset Formula \( 1 \)
\end_inset
if there are enough examples.
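\layout Standard
The elimination step can be sketched as follows: a feature remains in the choice set only if its confidence interval overlaps the interval of the apparently best feature. The scores and interval half-widths below are hypothetical stand-ins for statistical query responses:

```python
def surviving_features(scores, half_widths):
    # A feature survives pruning if its upper confidence bound reaches
    # the best lower confidence bound among all features.
    best_lower = max(s - w for s, w in zip(scores, half_widths))
    return sum(1 for s, w in zip(scores, half_widths) if s + w >= best_lower)

# One clearly dominant feature: the choice set collapses to size 1.
assert surviving_features([0.9, 0.10, 0.15], [0.05, 0.05, 0.05]) == 1
```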
\layout Subsection
Adaptive Microchoice vs.
\protected_separator
Basic Microchoice
\layout Standard
The adaptive microchoice bound is a significant improvement over the simple
microchoice bound when the distribution is such that each choice is clear.
For example, consider
\begin_inset Formula \( F \)
\end_inset
Boolean features and
\begin_inset Formula \( N=O(F) \)
\end_inset
examples.
Suppose that one feature is identical to the label and all the rest of
the features are determined with a coin flip independent of the label.
\layout Standard
When we apply a decision tree to a data set generated with this distribution,
what will be the resulting bound? Given enough examples, with high probability
there will only be one significant choice for the first batch query: the
feature identical to the label.
The second and third batch queries, corresponding to the children of the
root feature, will also have a choice space of size
\begin_inset Formula \( 1 \)
\end_inset
with very high probability.
The
\begin_inset Quotes eld
\end_inset
right
\begin_inset Quotes erd
\end_inset
choice will be the label value.
Each choice set has size
\begin_inset Formula \( 1 \)
\end_inset
resulting in a complexity of
\begin_inset Formula \( \ln 4 \)
\end_inset
due to allocation of confidence to the statistical queries necessary for
learning the decision tree.
\begin_inset Formula \( \ln 4 \)
\end_inset
is considerably better than
\begin_inset Formula \( \ln (F+2)+2\ln (F+1) \)
\end_inset
which the simple version of the microchoice bound provides.
Note that the complexity reduction only occurs with a large enough number
of examples
\begin_inset Formula \( m \)
\end_inset
implying that the value of
\begin_inset Formula \( \bar{e} \)
\end_inset
calculated can improve faster than (inverse) linearly in the number of
examples.
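\layout Standard
The complexity comparison in this example is easy to reproduce numerically; the sketch below (hypothetical names) evaluates both complexity terms for 100 features:

```python
from math import log

def adaptive_complexity(choice_sizes):
    # Complexity term of the relative entropy microchoice bound:
    # sum_i ln|C_i(h)| + ln(d(h)+1).
    d = len(choice_sizes)
    return sum(log(c) for c in choice_sizes) + log(d + 1)

num_features = 100
# Ideal case from the text: three batch choice sets, each of size 1.
adaptive = adaptive_complexity([1, 1, 1])              # = ln 4
simple = log(num_features + 2) + 2 * log(num_features + 1)
assert adaptive < simple  # about 1.39 vs about 13.85
```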
\layout Standard
The adaptive microchoice bound is never much looser than the simple microchoice
bound because under the assumption that choice sets are of size at least
\begin_inset Formula \( 2 \)
\end_inset
, the penalty for using the adaptive version,
\begin_inset Formula \( \ln (d(h)+1) \)
\end_inset
, is always small compared to the complexity term for the simple microchoice
bound,
\begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \)
\end_inset
.
\layout Subsection
Other Adaptive Microchoice
\layout Standard
The adaptive microchoice bound provides a simple scheme for dividing confidence
between choices and queries.
There are other choices which may be useful in some settings.
Any scheme which
\emph on
a priori
\emph default
divides the confidence between queries and choices at every node will generally
work.
Here are two schemes which may be useful:
\layout Itemize
Assign a constant proportion of confidence to the query.
This scheme is more aggressive than the one used in the adaptive microchoice bound and may result in a lower complexity when many choices can be eliminated.
The drawback is we no longer get the telescoping in equation (
\begin_inset LatexCommand \ref{eqn:adaptiveMC}
\end_inset
) and so the term logarithmic in
\begin_inset Formula \( d(h) \)
\end_inset
in theorem (
\begin_inset LatexCommand \ref{th-reamb}
\end_inset
) becomes linear in
\begin_inset Formula \( d(h) \)
\end_inset
.
\layout Itemize
For a decision tree, assign a portion dependent on the depth of the node
in the decision tree that the choice set is over.
It is unlikely that choices can be eliminated from nodes far from the root because the number of examples available at a node typically decays exponentially with the (decision tree) depth.
A progressive scheme which allocates less confidence to queries for deep
nodes will probably behave better in practice.
\layout Subsection
Comparison with Freund's Self-Bounding algorithms
\layout Standard
Freund's approach for self-bounding learning algorithms can require exponentially more computation than the microchoice approach.
In its basic form, it requires explicit construction of every path in the
state space of the algorithm not pruned in the tree.
There exist some learning algorithms where this process can be done implicitly
making the computation feasible.
However, in general this does not appear to be possible.
\layout Standard
The adaptive microchoice bound only requires explicit construction of the
size of each subset from which a choice is made.
Because many common learning algorithms work by a process of making choices
from small subsets, this is often computationally easy.
The adaptive microchoice bound does poorly, however, when Freund's query
tree has a high degree of sharing; for example, when many nodes of the
tree correspond to the same query, or many leaves of the tree have the
same final hypothesis.
Allowing batch queries alleviates the most egregious examples of this.
It is also possible to interpolate between the adaptive microchoice bound
and Freund's bound by a process of conglomerating the subsets of the microchoic
e bound.
\layout Subsection
Choice Set Conglomeration
\layout Standard
The mechanism of choice set conglomeration is similar to the batch query technique.
It allows you to trade increased computation for a tighter bound.
When starting with the simple microchoice bound, this technique can smoothly
interpolate with the discrete hypothesis bound (
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
).
When starting with the adaptive microchoice bound, we can interpolate with
Freund's bound.
\layout Standard
Consider a particular choice set,
\begin_inset Formula \( \hat{C}_{i} \)
\end_inset
, with elements
\begin_inset Formula \( c_{i} \)
\end_inset
.
Each
\begin_inset Formula \( c_{i} \)
\end_inset
indexes another choice set,
\begin_inset Formula \( \hat{C}_{i+1}(c_{i}) \)
\end_inset
.
If the computational resources exist to calculate the union
\begin_inset Formula \( \hat{C}_{i,i+1}=\bigcup _{c_{i}\in \hat{C}_{i}}\hat{C}_{i+1}(c_{i}) \)
\end_inset
, then
\begin_inset Formula \( |\hat{C}_{i,i+1}| \)
\end_inset
can be used in place of
\begin_inset Formula \( |\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \)
\end_inset
in the adaptive microchoice bound.
The conglomeration can be done repeatedly to build large choice sets and
also applies to the simple microchoice bound (
\begin_inset LatexCommand \ref{th-smb}
\end_inset
).
Conglomeration can be useful for tightening the bound when there are multiple
choice sequences leading to the same hypothesis.
However, choice set conglomeration is not always helpful because it trades
away the fine granularity of the microchoice bound.
The extreme case where all choice sets are conglomerated into one choice
set and every hypothesis and query have the same weight is equivalent to
Freund's bound.
\layout Standard
When the choices of the attached choice sets are all different, conglomeration
will have little use because the size of the union of the choice sets is
the sum of the sizes of each choice set
\begin_inset Formula \( |\hat{C}_{i,i+1}|=\sum _{c_{i}\in \hat{C}_{i}}|\hat{C}_{i+1}(c_{i})| \)
\end_inset
.
If the child sets each have the same size
\begin_inset Formula \( |\hat{C}_{i+1}| \)
\end_inset
then this simplifies to
\begin_inset Formula \( |\hat{C}_{i,i+1}|=|\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \)
\end_inset
which results in the same confidence applied to each choice whether conglomerat
ing or not.
The best case for conglomeration is equivalent to the batch query case:
every sub-choice set contains the same elements.
Then we have
\begin_inset Formula \( |\hat{C}_{i,i+1}|=|\hat{C}_{i+1}| \)
\end_inset
and pay no cost for the choice set
\begin_inset Formula \( |\hat{C}_{i}| \)
\end_inset
.
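\layout Standard
The size bookkeeping for conglomeration is a simple set union. A minimal sketch (hypothetical representation: each parent choice maps to its child choice set):

```python
def conglomerate(parent_to_children):
    # |C_{i,i+1}|: size of the union of the child choice sets reachable
    # from the elements of the parent choice set.
    union = set()
    for children in parent_to_children.values():
        union |= set(children)
    return union

# Best case (batch-query-like): identical sub-choice sets, cost |C_{i+1}|.
assert len(conglomerate({"a": {1, 2}, "b": {1, 2}})) == 2
# Worst case: disjoint sub-choice sets, the sizes add.
assert len(conglomerate({"a": {1, 2}, "b": {3, 4}})) == 4
```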
\layout Section
Microchoice discussion
\layout Standard
Microchoice bounds can be used in practice and do yield results comparable
with holdout set based techniques (see figure \SpecialChar \@.
\begin_inset LatexCommand \ref{fig-microchoice-holdout}
\end_inset
and the surrounding section for details).
There are two significant insights which the microchoice bounds provide.
\layout Enumerate
The
\begin_inset Quotes eld
\end_inset
cost
\begin_inset Quotes erd
\end_inset
of choices is made very explicit.
The cost of a choice (in terms of sample complexity) is the log of the
number of choices.
\layout Enumerate
It is possible to improve upon the Occam's Razor Bound (theorem
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
) by using information from the sample set to infer properties (such as
the choice tree) of the distribution
\begin_inset Formula \( D \)
\end_inset
.
This can be done
\emph on
without
\emph default
any explicit knowledge of the distribution
\begin_inset Formula \( D \)
\end_inset
.
\layout Problem
(Open) Is there a satisfying, natural bound for the continuous case? Preliminary work and thoughts by several people have occurred, but nothing has yet come of it.
\layout Chapter
PAC-Bayes bounds
\layout Standard
\begin_inset LatexCommand \label{sec-PB}
\end_inset
\layout Standard
The work presented here is also published in
\begin_inset LatexCommand \cite{averaging_tech}
\end_inset
.
\layout Standard
PAC-Bayes bounds are a generalization of the Occam's razor bound for algorithms
which output a
\emph on
distribution
\emph default
over classifiers rather than just a single classifier.
This includes the possibility of a distribution over a single classifier,
so it is a generalization.
Most classifiers do not output a distribution over base classifiers.
Instead, they output either a single classifier or an average over base classifiers.
Nonetheless, PAC-Bayes bounds are interesting for several reasons:
\layout Enumerate
PAC-Bayes bounds are much tighter (in practice) than most common VC-related
\begin_inset LatexCommand \cite{Vapnik}
\end_inset
approaches on continuous classifier spaces.
This can be shown by application to stochastic neural networks (see section
\begin_inset LatexCommand \ref{SNN}
\end_inset
) as well as other classifiers.
It also can be seen by observation: when specializing the PAC-Bayes bounds
on discrete hypothesis spaces, only
\begin_inset Formula \( O(\ln m) \)
\end_inset
sample complexity is lost.
\layout Enumerate
Due to the achievable tightness, the result motivates new learning algorithms
which strongly limit the amount of overfitting that a learning algorithm
will incur.
\layout Enumerate
The result found here will turn out to be useful for averaging hypotheses.
\layout Standard
PAC-Bayes bounds were first introduced by McAllester
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Standard
There are three relatively independent observations in this chapter:
\layout Enumerate
A quantitative improvement of the PAC-Bayes bound by retrofitting it with the relative entropy Chernoff bound
\begin_inset LatexCommand \ref{eq-recb}
\end_inset
.
This retrofit is not as trivial as might be expected, but it can be done.
The result is the tightest known PAC-Bayes bound.
In addition to the quantitative improvements, this tightening simplifies
the proof and adds to our qualitative understanding of the bound.
\layout Enumerate
A method for (partially) derandomizing the PAC-Bayes stochastic hypothesis.
\layout Enumerate
A method for stochastic evaluation of the empirical error.
\layout Standard
The first observation is the most important.
Observation (3) is important for many practical applications because it safely avoids a (sometimes) very complicated evaluation problem.
Observation (2) is of little theoretical interest, but it might interest
some people who feel reassured when every classifier randomized over has
a low empirical error rate.
\layout Standard
Figure
\begin_inset LatexCommand \ref{fig-pb-protocol}
\end_inset
shows what the PAC-Bayes bound looks like as an interactive proof of learning.
\layout Standard
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 152
file thesis-presentation/pac-bayes.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-pb-protocol}
\end_inset
The PAC-Bayes bound can be viewed as a new style for a proof of learning.
The learner must commit to a
\begin_inset Quotes eld
\end_inset
Prior
\begin_inset Quotes erd
\end_inset
as in the Occam's Razor Bound
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
before seeing examples, but it does not commit to a single hypothesis.
Instead, it commits to a distribution over hypotheses,
\begin_inset Formula \( q(h) \)
\end_inset
, and the bound applies to a randomization with respect to the distribution
\begin_inset Formula \( q(h) \)
\end_inset
.
\end_float
\layout Section
PAC-Bayes Basics
\layout Standard
In the PAC-Bayes setting, a classifier is defined by a distribution
\begin_inset Formula \( q(h) \)
\end_inset
over the hypothesis space.
Each classification is carried out according to a hypothesis
\emph on
sampled
\emph default
from
\begin_inset Formula \( q(h) \)
\end_inset
.
We are interested in the gap between the
\emph on
expected
\emph default
generalization error
\begin_inset Formula \( e_{q}(h)\equiv E_{q}\left[ e(h)\right] \)
\end_inset
and the
\emph on
expected
\emph default
empirical error
\begin_inset Formula \( \hat{e}_{q}(h)=E_{q}\left[ \hat{e}(h)\right] \)
\end_inset
, where both expectations are taken with respect to
\begin_inset Formula \( q(h) \)
\end_inset
.
The gap will be parameterized by the Kullback-Leibler divergence (see
\begin_inset LatexCommand \cite{Cover}
\end_inset
).
Recall that:
\layout Standard
\begin_inset Formula
\begin{equation}
\label{eq-relent}
\textrm{KL}(q||p)=E_{h\sim q(h)}\ln \frac{q(h)}{p(h)}
\end{equation}
\end_inset
If the support is finite, we have
\begin_inset Formula
\begin{equation}
\label{eq-frelent}
\textrm{KL}(q||p)=\sum _{h}q(h)\ln \frac{q(h)}{p(h)}
\end{equation}
\end_inset
The relative entropy is an asymmetric distance measure between probability
distributions, with
\begin_inset Formula \( \textrm{KL}(q||p)=0\Leftrightarrow q=p \)
\end_inset
almost everywhere.
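\layout Standard
For finite support, the relative entropy is a one-line computation. A minimal sketch (hypothetical function; distributions are given as aligned lists, with 0 ln 0 taken to be 0):

```python
from math import log

def kl(q, p):
    # KL(q||p) in nats for finite distributions on a common support.
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

assert kl([0.5, 0.5], [0.5, 0.5]) == 0.0
```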
\layout Theorem
\begin_inset LatexCommand \label{th-pbb}
\end_inset
(PAC-Bayes
\begin_inset LatexCommand \cite{PB}
\end_inset
) For all
\begin_inset Quotes eld
\end_inset
priors
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
and for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\layout Theorem
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists q(h):\, \, e_{q}(h)\geq \hat{e}_{q}(h)+\sqrt{\frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }+2}{2m-1}}\right) \leq \delta \]
\end_inset
\layout Proof
Given in
\begin_inset LatexCommand \cite{PB}
\end_inset
.
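\layout Standard
The width of this bound is simple to evaluate. A sketch (hypothetical name) computing the gap term of the theorem:

```python
from math import log, sqrt

def pac_bayes_gap(kl_qp, m, delta):
    # Amount by which e_q(h) may exceed the expected empirical error under
    # McAllester's bound: sqrt((KL(q||p) + ln(m/delta) + 2) / (2m - 1)).
    return sqrt((kl_qp + log(m / delta) + 2) / (2 * m - 1))

# A peaked posterior (KL = 0) with 1000 examples at delta = 0.05
# gives a gap of about 0.077.
```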
\layout Standard
This PAC-Bayes bound is almost the same as the Occam's Razor bound (theorem
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
) when the distribution is peaked on a single hypothesis and the Occam's
razor bound is proved using the looser Hoeffding inequality.
This can be seen by noting that the KL-divergence when
\begin_inset Formula \( q \)
\end_inset
is all on one hypothesis,
\begin_inset Formula \( h \)
\end_inset
satisfies
\begin_inset Formula \( \textrm{KL}(q||p)=\ln \frac{1}{p(h)} \)
\end_inset
.
Comparing with the Occam's Razor bound, we see that a (small) extra term
of size
\begin_inset Formula \( \frac{\ln m}{m} \)
\end_inset
has been introduced in return for the capability to average with respect
to
\emph on
any
\emph default
posterior
\begin_inset Formula \( q(h) \)
\end_inset
.
It is not yet clear whether this
\begin_inset Formula \( \frac{\ln m}{m} \)
\end_inset
term needs to be there.
\layout Problem
(Open) Remove the
\begin_inset Formula \( \frac{\ln m}{m} \)
\end_inset
term from the sample complexity.
\layout Standard
The real power of the PAC-Bayes bound occurs when the average is over many
hypotheses.
This might occur if the distribution
\begin_inset Formula \( q(h) \)
\end_inset
is picked using Bayes law or a Gibbs distribution.
One of the most interesting aspects of the PAC-Bayes bound is that it holds
for finite
\emph on
and
\emph default
infinite hypothesis spaces.
\layout Section
A Tighter PAC-Bayes Bound
\layout Standard
We can tighten this bound by employing a more accurate tail bound on the
Binomial distribution.
The proof of this improved lemma is not as straightforward as a simple
substitution of the Hoeffding bound with the relative entropy Chernoff
bound
\begin_inset LatexCommand \ref{eq-recb}
\end_inset
but it can be worked out nonetheless.
\layout Theorem
\begin_inset LatexCommand \label{th-repbb}
\end_inset
(Relative Entropy PAC-Bayes bound) For all binary loss functions,
\begin_inset Formula \( l(h,(x,y)) \)
\end_inset
, for all
\begin_inset Quotes eld
\end_inset
priors
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
and for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\layout Theorem
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists q(h):\, \, \textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\geq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\right) \leq \delta \]
\end_inset
\layout Standard
This bound is always at least as tight as the original PAC-Bayes bound
\begin_inset LatexCommand \cite{PB}
\end_inset
and sometimes much tighter, such as when the average empirical error is
near
\begin_inset Formula \( 0 \)
\end_inset
.
In particular, when the average empirical error is zero (
\begin_inset Formula \( \hat{e}_{q}(h)=0 \)
\end_inset
) the bound can be significantly tighter as shown in figure
\begin_inset LatexCommand \ref{fig-bounds}
\end_inset
on page
\begin_inset LatexCommand \pageref{fig-bounds}
\end_inset
.
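\layout Standard
Turning this theorem into a numeric bound on the expected true error requires inverting the binary relative entropy, which can be done by bisection. A sketch under the stated conventions (hypothetical names; 0 ln 0 = 0):

```python
from math import log

def kl_ber(q, p):
    # Binary relative entropy KL(q||p) in nats.
    total = 0.0
    if q > 0:
        total += q * log(q / p)
    if q < 1:
        total += (1 - q) * log((1 - q) / (1 - p))
    return total

def kl_inverse_upper(e_hat, bound, tol=1e-10):
    # Largest e with KL(e_hat||e) <= bound, found by bisection.
    lo, hi = e_hat, 1.0 - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_ber(e_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def re_pac_bayes_bound(e_hat_q, kl_qp, m, delta):
    # Upper bound on e_q(h) implied by the relative entropy PAC-Bayes bound.
    rhs = (kl_qp + log(m / delta)) / (m - 1)
    return kl_inverse_upper(e_hat_q, rhs)
```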
\layout Standard
One interesting new feature of this PAC-Bayes bound is
\begin_inset Quotes eld
\end_inset
dimensional consistency
\begin_inset Quotes erd
\end_inset
\begin_float footnote
\layout Standard
My thanks to Patrick Haffner for pointing this out.
\end_float
.
In particular, each side of the inequality is an expectation of log probabilities
---
\begin_inset Quotes eld
\end_inset
nats
\begin_inset Quotes erd
\end_inset
.
Note, this does not hold when the Hoeffding approximation is used.
Rewriting, we get that with high probability, approximately the following
holds:
\begin_inset Formula
\[
m\textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq \textrm{KL}(q||p)\]
\end_inset
There is a coding theory interpretation of KL divergence:
\begin_inset Formula \( \textrm{KL}(q||p)= \)
\end_inset
the expected number of
\emph on
extra
\emph default
bits required to encode symbols drawn from
\begin_inset Formula \( q \)
\end_inset
given a code designed for symbols drawn from
\begin_inset Formula \( p \)
\end_inset
rather than from
\begin_inset Formula \( q \)
\end_inset
.
\layout Standard
Using the coding theory interpretation of KL divergence, this says approximately
:
\begin_inset Quotes eld
\end_inset
With high probability the number of
\emph on
extra
\emph default
bits required to encode the empirical errors is less than the number of
\emph on
extra
\emph default
bits required to encode hypotheses drawn from the posterior.
\begin_inset Quotes erd
\end_inset
\layout Standard
The retrofit of the PAC-Bayes bound is accomplished by reproving a technical
lemma about distributions.
The proof relies upon two lemmas.
The first is Lemma 22 from
\begin_inset LatexCommand \cite{PB}
\end_inset
which is given by:
\layout Lemma
\begin_inset LatexCommand \label{lem-opt}
\end_inset
For all
\begin_inset Formula \( \beta >0,K>0 \)
\end_inset
and
\begin_inset Formula \( Q,P,y\in R^{n} \)
\end_inset
satisfying
\begin_inset Formula \( P_{i}>0,Q_{i}>0 \)
\end_inset
and
\begin_inset Formula \( \sum _{i}Q_{i}=1 \)
\end_inset
, if
\begin_inset Formula
\[
\sum _{i=1}^{n}P_{i}e^{\beta y_{i}}\leq K\]
\end_inset
then
\layout Lemma
\begin_inset Formula
\[
\sum _{i=1}^{n}Q_{i}y_{i}\leq \frac{\textrm{KL}(Q||P)+\ln K}{\beta }\]
\end_inset
\layout Proof
Given in
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Standard
We will need to prove the following lemma in order to tighten the PAC-Bayes
bound.
It is analogous to Lemma 17 from
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Lemma
\begin_inset LatexCommand \label{lem-tech}
\end_inset
For all
\begin_inset Quotes eld
\end_inset
priors
\begin_inset Quotes erd
\end_inset
\begin_inset Formula \( p(h) \)
\end_inset
, for all
\begin_inset Formula \( \delta ,\alpha \in (0,1) \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}>\frac{1}{\alpha \delta }\right) \leq \delta \]
\end_inset
\layout Proof
For any given hypothesis
\begin_inset Formula \( h \)
\end_inset
we will prove the following.
\begin_inset Formula
\begin{equation}
\label{eq-expectation}
\forall h\, \, E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }
\end{equation}
\end_inset
The Lemma then follows from the sequence:
\begin_inset Formula
\[
\Rightarrow \forall p\, \, E_{p}E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall p\, \, E_{D^{m}}E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall p\, \, \Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\geq \frac{1}{\alpha \delta }\right) \leq \delta \]
\end_inset
\layout Proof
Consequently, we need only prove equation
\begin_inset LatexCommand \ref{eq-expectation}
\end_inset
.
Given the hypothesis, we have a fixed true error rate,
\begin_inset Formula \( e(h) \)
\end_inset
, and the empirical error rate
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
will be distributed like a Binomial.
Let
\begin_inset Formula \( R(e(h)) \)
\end_inset
be the random variable with a cumulative distribution given by the relative
entropy Chernoff bound for a hypothesis with true error
\begin_inset Formula \( e(h) \)
\end_inset
.
In other words, define a cumulative distribution function on
\begin_inset Formula \( [0,1] \)
\end_inset
according to:
\begin_inset Formula
\[
e^{-m\textrm{KL}\left( R||e(h)\right) }\]
\end_inset
(note that we defined
\begin_inset Formula \( \textrm{KL}() \)
\end_inset
so that it is always
\begin_inset Formula \( 0 \)
\end_inset
when
\begin_inset Formula \( R>e(h) \)
\end_inset
).
Note that the relative entropy Chernoff bound implies
\begin_inset Formula \( R \)
\end_inset
satisfies:
\begin_inset Formula
\[
\textrm{Bin}(m,mR,e(h))\leq e^{-m\textrm{KL}\left( R||e(h)\right) }\]
\end_inset
whenever
\begin_inset Formula \( mR \)
\end_inset
is an integer.
\layout Proof
Since
\begin_inset Formula \( e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))} \)
\end_inset
increases monotonically as
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
decreases, and
\begin_inset Formula \( R(e(h)) \)
\end_inset
is stochastically smaller than
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
, taking the expectation with respect to
\begin_inset Formula \( R(e(h)) \)
\end_inset
can only increase the value.
In other words:
\begin_inset Formula
\[
E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq E_{R}e^{(1-\alpha )m\textrm{KL}(R||e(h))}\]
\end_inset
The probability distribution function of
\begin_inset Formula \( R \)
\end_inset
is given by:
\begin_inset Formula
\[
f(x)=\left\{ \begin{array}{ll}
0 & \textrm{for}\; x\geq p\\
& \\
-m\frac{\partial \textrm{KL}(x||p)}{\partial x}e^{-m\textrm{KL}(x||p)} & \textrm{for}\; x\leq p
\end{array}\right. \]
\end_inset
Taking the expectation with respect to this distribution gives us:
\begin_inset Formula
\begin{eqnarray*}
E_{R}\left[ e^{(1-\alpha )m\textrm{KL}(R||e(h))}\right] & = & \int _{0}^{1}e^{(1-\alpha )m\textrm{KL}(x||e(h))}f(x)dx\\
& & \\
& = & \int _{0}^{e(h)}-m\frac{\partial \textrm{KL}(x||e(h))}{\partial x}e^{-\alpha m\textrm{KL}(x||e(h))}dx\\
& = & \left. \frac{1}{\alpha }e^{-\alpha m\textrm{KL}(x||e(h))}\right| _{0}^{e(h)}\\
& \leq & \frac{1}{\alpha }
\end{eqnarray*}
\end_inset
This finishes the proof of the technical lemma.
\layout Standard
Now, we can prove the relative entropy PAC-Bayes bound
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
.
\layout Proof
First, we can specialize lemma
\protected_separator
\begin_inset LatexCommand \ref{lem-tech}
\end_inset
with
\begin_inset Formula \( \alpha =\frac{1}{m} \)
\end_inset
to get that with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
\begin_inset Formula
\[
E_{p}e^{(m-1)\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{m}{\delta }\]
\end_inset
Apply lemma
\begin_inset LatexCommand \ref{lem-opt}
\end_inset
with
\begin_inset Formula \( K=\frac{m}{\delta } \)
\end_inset
,
\begin_inset Formula \( \beta =m-1 \)
\end_inset
,
\begin_inset Formula \( Q_{i}=q(h_{i}) \)
\end_inset
and
\begin_inset Formula \( y_{i}=\textrm{KL}(\hat{e}(h)||e(h)) \)
\end_inset
to get:
\begin_inset Formula
\[
E_{q}\textrm{KL}(\hat{e}(h)||e(h))=\sum _{i=1}^{n}Q_{i}\textrm{KL}(\hat{e}(h_{i})||e(h_{i}))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\]
\end_inset
Jensen's inequality gives us:
\begin_inset Formula
\[
\textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq E_{q}\textrm{KL}(\hat{e}(h)||e(h))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\]
\end_inset
which proves the theorem for the finite case.
For the infinite case, a sequence of limits can be defined just as in
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Section
PAC-Bayes Approximations
\layout Subsection
Approximating the empirical error
\layout Standard
\begin_inset LatexCommand \label{pb-approx}
\end_inset
\layout Standard
In practice, it is not always easy to calculate some of the observable variables
in the PAC-Bayes bound.
In particular,
\begin_inset Formula \( \hat{e}_{q}(h) \)
\end_inset
is not necessarily easy to calculate when
\begin_inset Formula \( q \)
\end_inset
is some continuous distribution.
We can avoid the need for a direct evaluation by using a Monte Carlo estimate
together with a bound on the tail of that estimate.
Let
\begin_inset Formula \( \hat{e}_{\hat{q}}(h)\equiv \Pr _{\hat{q},S}(h(x)\neq y) \)
\end_inset
be the observed rate of failure of
\begin_inset Formula \( n \)
\end_inset
random hypotheses drawn according to
\begin_inset Formula \( q(h) \)
\end_inset
and applied to a random training example.
Once again, we have a familiar Binomial distribution.
Direct calculation will give us:
\layout Theorem
(Monte Carlo Sampling Bound) For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
\begin_inset Formula
\[
\Pr _{q^{n}}\left( \, \, \hat{e}(h)>\bar{e}\left( n,\hat{e}_{\hat{q}}(h),\delta \right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\left\{ p:\, \, \textrm{Bin}(n,k,p)\geq \delta \right\}  \)
\end_inset
\layout Proof
Observe that the Monte Carlo estimate is distributed like a Binomial distribution
and apply the Binomial Tail bound.
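\layout Standard
The bound function can be computed by numerically inverting the Binomial tail. A minimal sketch, assuming Bin(n, k, p) denotes the probability of observing at most k failures in n draws at true rate p:

```python
import math

def bin_tail(n, k, p):
    """Bin(n, k, p): probability of at most k failures in n draws at rate p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def e_bar(n, k, delta, tol=1e-10):
    """max{p : Bin(n, k, p) >= delta}, found by bisection.
    Bin(n, k, p) is decreasing in p, so the feasible set is an interval [0, e_bar]."""
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bin_tail(n, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. zero observed failures out of 100 draws, at 95% confidence:
print(e_bar(100, 0, 0.05))  # roughly 0.0295
```

For k = 0 the inversion has a closed form, 1 - delta**(1/n), which the bisection reproduces.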
\layout Standard
In order to calculate a bound on the expected true error rate, we can first
bound the expected empirical error rate
\begin_inset Formula \( \hat{e}_{q}(h) \)
\end_inset
with confidence
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
then bound the expected true error rate
\begin_inset Formula \( e_{q}(h) \)
\end_inset
with confidence
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
, using our bound on
\begin_inset Formula \( \hat{e}_{q}(h) \)
\end_inset
.
Since the total probability of failure is only
\begin_inset Formula \( \frac{\delta }{2}+\frac{\delta }{2}=\delta \)
\end_inset
our bound will hold with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
.
\layout Subsection
Derandomizing the PAC-Bayes bound
\layout Standard
It is sometimes desirable to derandomize the PAC-Bayes bound.
There are several ways to do this.
The next chapter will talk about replacing the randomization over
\begin_inset Formula \( q(h) \)
\end_inset
with a thresholded average.
Another technique is to simply pick a hypothesis according to
\begin_inset Formula \( q(h) \)
\end_inset
.
While this would probably be effective in practice, the theoretical guarantees
that can be made for this technique are weak.
Strong theoretical guarantees
\emph on
can
\emph default
be made for a similar technique.
\layout Standard
Suppose we make
\begin_inset Formula \( n \)
\end_inset
draws from
\begin_inset Formula \( q(h) \)
\end_inset
.
Let the drawn hypotheses be
\begin_inset Formula \( \{h_{1},...,h_{n}\} \)
\end_inset
.
We can form a new distribution
\begin_inset Formula \( \hat{q}(h) \)
\end_inset
which is uniform over the
\begin_inset Formula \( n \)
\end_inset
draws.
The true error rate of this distribution can be bound with high probability
according to the following theorem.
\layout Theorem
(PAC-Bayes Derandomization) For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
\begin_inset Formula
\[
\Pr _{q^{n}}\left( \, \, e_{\hat{q}}(h)>\bar{e}\left( n,e_{q}(h),\delta \right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\left\{ p:\, \, \textrm{Bin}(n,k,p)\geq \delta \right\}  \)
\end_inset
\layout Proof
Observe that the distribution of
\begin_inset Formula \( e_{\hat{q}}(h) \)
\end_inset
is distributed like a Binomial around
\begin_inset Formula \( e_{q}(h) \)
\end_inset
and apply the Binomial Tail bound.
\layout Standard
Note that this theorem and the last theorem are essentially the same.
\layout Standard
This theorem allows us to do an (incomplete) derandomization.
Instead of drawing from
\begin_inset Formula \( q(h) \)
\end_inset
in order to evaluate an input, we can draw from
\begin_inset Formula \( \hat{q}(h) \)
\end_inset
which requires a fixed finite number of bits.
This may allow for more efficient algorithms, and some people may find
it reassuring that every hypothesis in
\begin_inset Formula \( \{h_{1},...,h_{n}\} \)
\end_inset
has a low empirical error.
The same confidence splitting trick of the last section can be used in
order to guarantee
\begin_inset Formula \( e_{q}(h) \)
\end_inset
is bounded and
\begin_inset Formula \( e_{\hat{q}}(h) \)
\end_inset
is bounded given that
\begin_inset Formula \( e_{q}(h) \)
\end_inset
is bounded.
\layout Standard
It is worth mentioning that
\emph on
no
\emph default
assumption of independence applies to either this theorem or the last theorem
since we explicitly control (and create) the independence ourselves.
The preconditions of these theorems are fully verifiable.
\layout Section
Application of the PAC-Bayes bound
\layout Standard
The goal of this chapter is to make PAC-Bayes bounds more applicable.
This is done by tightening the analysis from a Hoeffding-like to a Chernoff-like
statement, and by noting that Monte Carlo evaluation can be used to quickly
and safely bound the stochastic empirical error rate.
\layout Standard
In work detailed in chapter
\begin_inset LatexCommand \ref{SNN}
\end_inset
, results for the application of PAC-Bayes bounds to stochastic neural networks
are presented.
PAC-Bayes bounds are one of very few approaches capable of producing nonvacuous
learning theory bounds on continuous valued classifiers for real-world
problems.
\layout Chapter
Averaging Bounds (Improved margin)
\layout Standard
\begin_inset LatexCommand \label{sec-average}
\end_inset
\layout Standard
The work in this chapter is joint with Matthias Seeger and Nimrod Megiddo.
It was first presented at ICML
\begin_inset LatexCommand \cite{averaging_icml}
\end_inset
.
\layout Standard
Averaging bounds are specialized for averaging classifiers.
An average has the form
\begin_inset Formula
\[
f(x)=\int q(h)h(x)dh\textrm{ or }f(x)=\sum _{i=1}^{N}q(h_{i})h_{i}(x)\]
\end_inset
where
\begin_inset Formula \( q(h_{i})\geq 0 \)
\end_inset
and
\begin_inset Formula \( \int q(h)dh=1 \)
\end_inset
.
Averaging classifiers have the form:
\begin_inset Formula
\[
c(x)=\textrm{sign}\left( f(x)\right) \]
\end_inset
Averaging bounds are especially interesting because there are many learning
algorithms which use averaging.
These techniques include:
\layout Enumerate
Boosting
\begin_inset LatexCommand \cite{Boosting}
\end_inset
\layout Enumerate
Bayes-Optimal learning (see section 6.7 of
\begin_inset LatexCommand \cite{Tom}
\end_inset
)
\layout Enumerate
Support Vector Machines
\begin_inset LatexCommand \cite{SVM}
\end_inset
\layout Enumerate
Bagging
\begin_inset LatexCommand \cite{Bagging}
\end_inset
\layout Enumerate
Maximum Entropy classification
\begin_inset LatexCommand \cite{ME}
\end_inset
\layout Standard
Viewed as an interactive proof of learning (see figure
\begin_inset LatexCommand \ref{fig-averaging-protocol}
\end_inset
) the bound presented here is almost the same as the PAC-Bayes bound except
that it applies to the
\emph on
average
\emph default
over the posterior rather than to stochastic choices over the posterior.
\layout Standard
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 152
file thesis-presentation/averaging.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-averaging-protocol}
\end_inset
For the averaging bound, the learner must commit to some measure
\begin_inset Formula \( p(h) \)
\end_inset
, receive examples, and then commit to another measure
\begin_inset Formula \( q(h) \)
\end_inset
.
The true error bound applies to the
\emph on
average
\emph default
over
\begin_inset Formula \( q(h) \)
\end_inset
rather than stochastic choices as in the PAC-Bayes bound
\begin_inset LatexCommand \ref{sec-PB}
\end_inset
.
\end_float
\layout Standard
The bound in this section is a qualitative improvement on prior results
for averaging bounds.
For the average learning algorithms listed above, the form of the improvements
is most interesting for Maximum Entropy Classification and Bayes-Optimal
classification.
All (currently known) specialized averaging bounds use as a parameter the
\begin_inset Quotes eld
\end_inset
margin
\begin_inset Quotes erd
\end_inset
.
For this section only, suppose the label and the hypotheses have value
\begin_inset Formula \( -1 \)
\end_inset
or
\begin_inset Formula \( 1 \)
\end_inset
.
(
\begin_inset Formula \( y\in \{-1,1\} \)
\end_inset
and
\begin_inset Formula \( h(x)\in \{-1,1\} \)
\end_inset
).
Then, the margin will be defined as
\begin_inset Formula
\[
t(x,y)=yf(x)\]
\end_inset
Some simple observations are immediate:
\layout Enumerate
The margin is bounded.
\begin_inset Formula \( t(x,y)\in [-1,1] \)
\end_inset
\layout Enumerate
If the classifier is correct, the margin is positive.
\begin_inset Formula \( c(x)=y\Rightarrow t(x,y)\in (0,1] \)
\end_inset
\layout Standard
The
\emph on
error
\emph default
at some margin is the quantity actually used in the averaging bounds.
The empirical error at margin
\begin_inset Formula \( \theta \)
\end_inset
is defined by:
\begin_inset Formula
\[
\hat{e}_{\theta }(c)=\Pr _{S}(t(x,y)\leq \theta )\]
\end_inset
and the true error at margin
\begin_inset Formula \( \theta \)
\end_inset
is defined by:
\begin_inset Formula
\[
e_{\theta }(c)=\Pr _{D}(t(x,y)\leq \theta )\]
\end_inset
Note that
\begin_inset Formula \( e_{0}(c)=e_{D}(c) \)
\end_inset
(the true error rate of
\begin_inset Formula \( c \)
\end_inset
).
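\layout Standard
To make these definitions concrete, the following sketch computes the margin and the empirical error at margin theta for a small, entirely hypothetical averaging classifier built from threshold functions on the real line:

```python
# Hypothetical base hypotheses h_i: R -> {-1, 1} and weights q_i summing to 1.
hypotheses = [lambda x: 1 if x > 0 else -1,
              lambda x: 1 if x > 1 else -1,
              lambda x: 1 if x > -1 else -1]
weights = [0.5, 0.25, 0.25]

def f(x):
    """The average f(x) = sum_i q(h_i) h_i(x)."""
    return sum(q * h(x) for q, h in zip(weights, hypotheses))

def margin(x, y):
    """t(x, y) = y f(x), always in [-1, 1]."""
    return y * f(x)

def empirical_margin_error(sample, theta):
    """e_hat_theta(c) = Pr_S(t(x, y) <= theta)."""
    return sum(1 for (x, y) in sample if margin(x, y) <= theta) / len(sample)

S = [(2.0, 1), (0.5, 1), (-0.5, -1), (-2.0, -1)]
print(empirical_margin_error(S, 0.0))  # fraction of examples at margin <= 0
```

On this toy sample every example is classified correctly (margin error 0 at theta = 0), while half the examples have margin at most 1/2.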
\layout Standard
The
\begin_inset Quotes eld
\end_inset
margin
\begin_inset Quotes erd
\end_inset
is a useful way to parameterize our learning algorithms.
It will turn out that the sample complexity will be low (and the guarantees
we can make better) when the margin of most of the training examples is
large.
There is, however, a price associated with using the margin: some hypotheses
have no notion of a margin.
Thus the theory in this chapter is less general than appears elsewhere.
\layout Section
Earlier Results
\layout Standard
\begin_inset LatexCommand \label{sec-earlier}
\end_inset
\layout Standard
The improved averaging bound arises from improving one critical step in
the proof of the original margin bound
\begin_inset LatexCommand \cite{Margin}
\end_inset
which is stated next.
\layout Theorem
(Margin Bound
\begin_inset LatexCommand \cite{Margin}
\end_inset
)
\begin_inset LatexCommand \label{th-margin}
\end_inset
For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
, for all base hypothesis spaces,
\begin_inset Formula \( H \)
\end_inset
,
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)>\hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\ln |H|}{\theta ^{2}}\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \]
\end_inset
\layout Proof
Given in
\begin_inset LatexCommand \cite{Margin}
\end_inset
.
It is a simplification of the proof of the improved averaging bound given
in section
\begin_inset LatexCommand \ref{sec-main-proof}

\end_inset

.
\layout Standard
Here, the notation
\begin_inset Formula \( b(m)=O(a(m)) \)
\end_inset
means there exists a constant
\begin_inset Formula \( C \)
\end_inset
such that
\begin_inset Formula \( b(m)\leq C\cdot a(m) \)
\end_inset
for all
\begin_inset Formula \( m \)
\end_inset
.
This margin bound implies that if most training examples have a large margin
\begin_inset Formula \( \theta \)
\end_inset
(i.e.
\begin_inset Formula \( t(x,y)>\theta \)
\end_inset
for most
\begin_inset Formula \( (x,y)\in S \)
\end_inset
) and the hypothesis space is not too large, then the generalization error
cannot be large.
This theorem can only be non-vacuous when the base hypothesis space is
finite.
There are various extensions (see
\begin_inset LatexCommand \cite{Margin}
\end_inset
) of this bound for continuous hypothesis spaces based upon VC dimension
and covering number techniques.
However, the extensions tend to result in extremely loose guarantees and
are not relevant to the discussion here.
One of the advantages of the improved averaging bound is that it
\emph on
can
\emph default
apply in a non-vacuous way to infinite hypothesis spaces.
This generalization comes about with essentially zero loosening of the
underlying bound.
\layout Section
A generalized averaging bound
\layout Standard
Before discussing the main theorem, it is important to notice that the averaging
classifier,
\begin_inset Formula \( c(x) \)
\end_inset
\emph on
implies
\emph default
a distribution over the base hypothesis space
\begin_inset Formula \( H \)
\end_inset
.
This implied distribution is
\begin_inset Formula \( q_{c}(h) \)
\end_inset
where
\begin_inset Formula
\[
c(x)=\textrm{sign}(\int h(x)q_{c}(h)dh)\]
\end_inset
The distribution
\begin_inset Formula \( q_{c} \)
\end_inset
is used in the following theorem.
\layout Theorem
\begin_inset LatexCommand \label{th-averaging}
\end_inset
(Relative Entropy Averaging Theorem) For all distributions
\begin_inset Formula \( p(h) \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\layout Theorem
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, \textrm{KL}\left( \hat{e}_{\theta }(c)||e(c)\right) \geq O\left( \frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}\right) \right) \leq \delta \]
\end_inset
\layout Proof
Given in the next section.
\layout Standard
The main theorem uses a KL-divergence based pseudodistance which is a bit
hard to understand intuitively.
In order to gain intuition, we can relax the tightness of the proof with
an inequality.
\layout Standard
\begin_inset Formula
\[
\textrm{KL}(\hat{e}_{\theta }(c)||e(c))\geq 2(\hat{e}_{\theta }(c)-e(c))^{2}\]
\end_inset
\layout Standard
This relaxation gives us an immediate corollary.
\layout Corollary
(Relative Entropy Averaging Theorem) For all distributions
\begin_inset Formula \( p(h) \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\layout Theorem
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)\geq \hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \]
\end_inset
\layout Standard
This theorem improves upon theorem
\begin_inset LatexCommand \ref{th-margin}
\end_inset
because
\begin_inset Formula \( \textrm{KL}(q_{c}||p) \)
\end_inset
is used instead of
\begin_inset Formula \( \ln |H| \)
\end_inset
.
For the case of a uniform distribution on
\begin_inset Formula \( |H| \)
\end_inset
different base classifiers, these results will agree when the average is
over just one classifier.
As the average becomes
\begin_inset Quotes eld
\end_inset
broader
\begin_inset Quotes erd
\end_inset
the results will improve.
In the limit when the average is over nearly all classifiers, the term
\begin_inset Formula \( \textrm{KL}(q_{c}||p) \)
\end_inset
will be nearly
\begin_inset Formula \( 0 \)
\end_inset
.
\layout Standard
The theorems are stated in an asymptotic fashion which is not very useful
in practical applications.
Section
\begin_inset LatexCommand \ref{sec-tighten}
\end_inset
gives some ideas of how to tighten the result, and the non-asymptotic form
(
\begin_inset LatexCommand \ref{eq-finalbound}
\end_inset
) given at the end of the proof can be used directly in practice.
\layout Standard
The improved averaging bound applies to averages over continuous hypothesis
spaces.
In this setting, the average needs to be an integral over an uncountably-infinite
set of hypotheses or the KL-divergence will not converge to a finite
value.
It is exactly because of this limitation that the improvements of this
bound are most applicable to Bayes Optimal and Maximum Entropy classifiers.
\layout Standard
In practice, the limitation may not be a significant problem because machine
learning algorithms over large hypothesis spaces typically have some parameter
stability.
In other words, a small shift in the parameters of the learned model produces
a small change in the prediction of the hypothesis.
With hypothesis stability, we can convert any average over a finite set
of hypotheses into an average over an infinite set of hypotheses without
significantly altering the predictions of the average.
This technique is explored in chapter
\begin_inset LatexCommand \ref{SNN}
\end_inset
with positive results.
\layout Section
Proof of main theorem
\layout Standard
\begin_inset LatexCommand \label{sec-main-proof}
\end_inset
\layout Subsection
Definitions and observations
\layout Standard
The proof has the same structure as the original margin bound (
\begin_inset LatexCommand \ref{th-margin}
\end_inset
) proof with one step replaced by the application of the relative entropy
PAC-Bayes theorem (
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
).
\layout Standard
Let
\begin_inset Formula \( N \)
\end_inset
be any natural number; later, the choice of
\begin_inset Formula \( N \)
\end_inset
will be optimized.
In the first part of the proof, we regard
\begin_inset Formula \( \theta \)
\end_inset
and
\begin_inset Formula \( N \)
\end_inset
as fixed.
Later we generalize this so that they may depend on the sample
\begin_inset Formula \( S \)
\end_inset
.
\layout Standard
We construct the distribution
\begin_inset Formula \( Q_{N} \)
\end_inset
as follows.
Draw
\begin_inset Formula \( N \)
\end_inset
hypotheses
\begin_inset Formula \( h_{i}\sim q(h) \)
\end_inset
independently.
\begin_inset Formula \( Q_{N} \)
\end_inset
might therefore be viewed as the product distribution
\begin_inset Formula
\begin{equation}
q_{c}(h)^{N}.
\end{equation}
\end_inset
Given the
\begin_inset Formula \( h_{i} \)
\end_inset
we define
\begin_inset Formula
\begin{equation}
\label{eq-sample}
g(x)=\frac{1}{N}\sum _{i=1}^{N}h_{i}(x)
\end{equation}
\end_inset
\layout Standard
For fixed
\begin_inset Formula \( x,\, y \)
\end_inset
, the values of
\begin_inset Formula \( yh_{i}(x) \)
\end_inset
are i.i.d.
Bernoulli-like variables (with values in
\begin_inset Formula \( \{-1,1\} \)
\end_inset
) whose mean equals the expected margin:
\begin_inset Formula
\begin{equation}
E_{h\sim q}\left[ yh(x)\right] =yf(x)
\end{equation}
\end_inset
Therefore,
\begin_inset Formula \( \forall x,y\, \, E_{g\sim Q_{N}}[yg(x)]=yf(x) \)
\end_inset
and
\begin_inset Formula \( yg(x) \)
\end_inset
is distributed according to the familiar Binomial distribution.
Later, we will apply a Binomial tail bound on this quantity.
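\layout Standard
This observation can be made explicit: writing B for the number of the N draws with y h_i(x) = 1, we have B distributed as Binomial(N, (1 + yf(x))/2) and yg(x) = (2B - N)/N. A short sketch (with arbitrary illustrative values) verifying the mean exactly and computing the tail probability of the margin of g:

```python
import math

def mean_yg(N, yf):
    """E[yg(x)] computed exactly from B ~ Binomial(N, (1 + yf)/2),
    where yg(x) = (2B - N)/N; the result equals yf(x)."""
    p = (1 + yf) / 2
    return sum(math.comb(N, b) * p**b * (1 - p)**(N - b) * (2 * b - N) / N
               for b in range(N + 1))

def tail_yg(N, yf, theta):
    """Pr[yg(x) <= theta/2] = Pr[B <= floor(N (1 + theta/2) / 2)]."""
    p = (1 + yf) / 2
    cutoff = math.floor(N * (1 + theta / 2) / 2)
    return sum(math.comb(N, b) * p**b * (1 - p)**(N - b)
               for b in range(cutoff + 1))
```

The exact mean recovers yf(x) for any N, and the tail is a plain Binomial cumulative probability, which is what the Binomial tail bounds below control.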
\layout Standard
Before writing out the proof mathematically, it is helpful to consider a
graphical view of what we will prove.
We will force convergence between three quantities.
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 148 19
file averaging.eps
width 4 50
flags 9
\end_inset
\layout Standard
The convergences are then tied together with triangle inequalities resulting
in the overall proof.
The critical improvement of this chapter is a refined version of the second
convergence.
\layout Subsection
The Proof
\layout Standard
For every
\begin_inset Formula \( \theta \in (0,1] \)
\end_inset
and for every (fixed)
\begin_inset Formula \( g \)
\end_inset
, the following simple inequality holds:
\begin_inset Formula
\begin{equation}
\label{eq-simple}
e(f)=\Pr _{D}[yf(x)\leq 0]
\end{equation}
\end_inset
\begin_inset Formula
\[
=\Pr _{D}[yg(x)\leq \frac{\theta }{2},\, yf(x)\leq 0]+\Pr _{D}[yg(x)>\frac{\theta }{2},\, yf(x)\leq 0]\]
\end_inset
\begin_inset Formula
\[
\leq \Pr _{D}[yg(x)\leq \frac{\theta }{2}]+\Pr _{D}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\]
\end_inset
Note that the left-hand side does not depend on
\begin_inset Formula \( g \)
\end_inset
.
By taking the expectation over
\begin_inset Formula \( g\sim Q_{N} \)
\end_inset
(and exchanging the order of expectations in the second term on the right-hand
side), we arrive at
\begin_inset Formula
\begin{equation}
e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ P_{g\sim Q_{N}}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\right] .
\end{equation}
\end_inset
For the rightmost term, we can apply a Binomial tail bound to get:
\layout Standard
\begin_inset Formula
\begin{equation}
e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ 1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \right]
\end{equation}
\end_inset
\begin_inset Formula
\begin{equation}
=E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right)
\end{equation}
\end_inset
We would like to apply the PAC-Bayes theorem
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
to the right-hand term.
Here we use the loss function
\begin_inset Formula \( I_{\{yg(x)\leq \theta /2\}} \)
\end_inset
.
The PAC-Bayes theorem applies for any
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
distribution.
We use as the
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
the distribution
\begin_inset Formula \( P_{N}=p(h)^{N} \)
\end_inset
where
\begin_inset Formula \( p(h) \)
\end_inset
is the
\begin_inset Quotes eld
\end_inset
prior
\begin_inset Quotes erd
\end_inset
over the base hypothesis space.
The choice of this prior allows us to use the identity:
\begin_inset Formula
\[
\textrm{KL}(Q_{N}||P_{N})=N\, \textrm{KL}(q(h)||p(h))\]
\end_inset
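\layout Standard
This product-distribution identity can be checked directly on a small discrete example by enumerating all N-tuples (a brute-force sketch over a hypothetical two-element hypothesis space):

```python
import math
from itertools import product

def kl_discrete(q, p):
    """KL(q||p) for distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def kl_product(q, p, N):
    """KL(q^N || p^N) computed by enumerating every N-tuple of outcomes."""
    total = 0.0
    for idx in product(range(len(q)), repeat=N):
        q_tuple = math.prod(q[i] for i in idx)
        p_tuple = math.prod(p[i] for i in idx)
        if q_tuple > 0:
            total += q_tuple * math.log(q_tuple / p_tuple)
    return total

print(kl_product([0.8, 0.2], [0.5, 0.5], 3),
      3 * kl_discrete([0.8, 0.2], [0.5, 0.5]))  # the two values agree
```
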
It follows from Theorem
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
that: with probability at least
\begin_inset Formula \( 1-\delta \)
\end_inset
over random choices of
\begin_inset Formula \( S \)
\end_inset
, for every
\begin_inset Formula \( Q \)
\end_inset
,
\layout Standard
\begin_inset Formula
\begin{equation}
\label{eq-usemcall}
\textrm{KL}\left( \hat{e}_{\frac{\theta }{2}}(g)||e_{\frac{\theta }{2}}(g)\right) \leq \frac{\textrm{KL}(Q_{N}||P_{N})+\ln \frac{m}{\delta }}{m-1}
\end{equation}
\end_inset
where
\begin_inset Formula \( e_{\frac{\theta }{2}}(g)=1 \)
\end_inset
with probability
\begin_inset Formula \( E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] \)
\end_inset
and
\begin_inset Formula \( 0 \)
\end_inset
otherwise, and
\begin_inset Formula \( \hat{e}_{\frac{\theta }{2}}(g)=1 \)
\end_inset
with probability
\begin_inset Formula \( E_{g\sim Q_{N}}\left[ \Pr _{S}[yg(x)\leq \frac{\theta }{2}]\right] \)
\end_inset
and
\begin_inset Formula \( 0 \)
\end_inset
otherwise.
\layout Standard
By the same argument as in (
\begin_inset LatexCommand \ref{eq-simple}
\end_inset
), we get:
\begin_inset Formula
\begin{equation}
\hat{e}_{\frac{\theta }{2}}(g)=\Pr _{S}[yg(x)\leq \frac{\theta }{2}]\leq \Pr _{S}[yg(x)\leq \frac{\theta }{2},\; yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ]
\end{equation}
\end_inset
\begin_inset Formula
\begin{equation}
\leq \Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ]
\end{equation}
\end_inset
\begin_inset Formula
\begin{equation}
=\Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\hat{e}_{\theta }(f)
\end{equation}
\end_inset
Again, we take expectations over
\begin_inset Formula \( g\sim Q_{N} \)
\end_inset
on both sides, interchange the order of the expectations and apply the
Binomial tail bound to get:
\begin_inset Formula
\begin{equation}
\label{eq-secondhoeff}
E_{S}\left[ P_{g\sim Q_{N}}[yg(x)\leq \frac{\theta }{2}]\right] \leq \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] .
\end{equation}
\end_inset
Combining our inequalities, we conclude that with probability at least
\begin_inset Formula \( 1-\delta \)
\end_inset
, for every
\begin_inset Formula \( q(h) \)
\end_inset
\begin_inset Formula
\begin{equation}
\textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}
\end{equation}
\end_inset
where
\begin_inset Formula \( q_{S}=1 \)
\end_inset
with probability
\begin_inset Formula \( \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] \)
\end_inset
and
\begin_inset Formula \( 0 \)
\end_inset
otherwise, and
\begin_inset Formula \( p_{D}=1 \)
\end_inset
with probability
\begin_inset Formula \( \Pr _{D}[yf(x)\leq 0]-1+\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \)
\end_inset
and
\begin_inset Formula \( 0 \)
\end_inset
otherwise.
\layout Standard
This bound holds for any fixed
\begin_inset Formula \( N \)
\end_inset
and
\begin_inset Formula \( \theta \)
\end_inset
, which is not yet what we need here, since we want to allow these to depend
on the data
\begin_inset Formula \( S \)
\end_inset
.
In essence, the bound we proved so far is a statement about certain events,
parameterized by
\begin_inset Formula \( N \)
\end_inset
and
\begin_inset Formula \( \theta \)
\end_inset
, namely the probability of each event is smaller than
\begin_inset Formula \( \delta \)
\end_inset
.
However, we need to prove that the probability of the
\emph on
union
\emph default
of all these events is smaller than
\begin_inset Formula \( \delta \)
\end_inset
.
To this end, we first observe that this union is contained in the union
over only a
\emph on
countable
\emph default
number of events.
Note that
\begin_inset Formula \( g(x)\in \{(2k-N)/N|k=0,1,\ldots ,N\} \)
\end_inset
.
Thus, even with all the possible (positive) values of
\begin_inset Formula \( \theta \)
\end_inset
, there are no more than
\begin_inset Formula \( N+1 \)
\end_inset
events of the form
\begin_inset Formula \( \{yg(x)\leq \theta /2\} \)
\end_inset
.
Denote by
\begin_inset Formula \( k(\theta ,N) \)
\end_inset
the largest integer
\begin_inset Formula \( k \)
\end_inset
such that
\begin_inset Formula \( k/N\leq \theta /2 \)
\end_inset
.
We observe that for every
\begin_inset Formula \( \theta >0 \)
\end_inset
, every
\begin_inset Formula \( g \)
\end_inset
and every distribution over
\begin_inset Formula \( (x,y) \)
\end_inset
:
\begin_inset Formula
\begin{equation}
\Pr \left[ yg(x)\leq \theta /2\right] =\Pr \left[ yg(x)\leq \frac{k(\theta ,N)}{N}\right] .
\end{equation}
\end_inset
This means that the middle step in the proof above, i.e.
\latex latex
\latex default
the application of theorem
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
, depends on
\begin_inset Formula \( (N,\theta ) \)
\end_inset
only through
\begin_inset Formula \( (N,k) \)
\end_inset
.
Since the other steps are true with probability one, we see that we can
restrict ourselves to the union of countably many events, indexed by
\begin_inset Formula \( (N,k) \)
\end_inset
.
Now, we
\begin_inset Quotes eld
\end_inset
allocate
\begin_inset Quotes erd
\end_inset
parts of the confidence quantity
\begin_inset Formula \( \delta \)
\end_inset
to each of these events, namely
\begin_inset Formula \( (N,k) \)
\end_inset
receives
\begin_inset Formula \( \delta _{N,k}=\delta /(N(N+1)^{2}),\; N=1,2,\dots ;\, k=0,\dots ,N \)
\end_inset
.
It follows easily that the union of all these events has probability at
most
\begin_inset Formula \( \sum _{N,k}\delta _{N,k}=\delta \)
\end_inset
.
Therefore we have proved that with probability at least
\begin_inset Formula \( 1-\delta \)
\end_inset
over random choices of
\begin_inset Formula \( S \)
\end_inset
it holds true that for
\emph on
all
\emph default
\begin_inset Formula \( N \)
\end_inset
and
\emph on
all
\emph default
\begin_inset Formula \( \theta \in (0,1] \)
\end_inset
,
\begin_inset Formula
\begin{equation}
\label{eq-finalbound}
\textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(q||p)+\ln \frac{2m}{\delta _{N,k}}}{m-1}\leq \frac{N\textrm{KL}(q||p)+\ln \frac{2m}{\delta }+3\ln N+1}{m-1}
\end{equation}
\end_inset
We can now choose
\begin_inset Formula \( N \)
\end_inset
to minimize this bound.
\begin_inset Formula \( N \)
\end_inset
may depend on
\begin_inset Formula \( \theta ,\, Q \)
\end_inset
and the sample
\begin_inset Formula \( S \)
\end_inset
.
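\layout Standard
As an aside, the confidence allocation above indeed sums to delta: summing delta_{N,k} = delta/(N(N+1)^2) over k = 0,...,N and N = 1,2,... gives delta times the telescoping series of 1/(N(N+1)). A quick numerical check of the partial sums:

```python
def delta_mass(delta, n_max):
    """Sum of delta_{N,k} = delta/(N(N+1)^2) over k = 0..N, N = 1..n_max.
    Each N contributes (N+1) equal terms, so this is delta * sum 1/(N(N+1))."""
    return sum((N + 1) * delta / (N * (N + 1) ** 2) for N in range(1, n_max + 1))

# The partial sums telescope: sum_{N<=n} 1/(N(N+1)) = 1 - 1/(n+1).
print(delta_mass(0.05, 1000))  # approaches 0.05 from below
```
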
\layout Standard
The asymptotic bound stated in the theorem can be derived by choosing
\begin_inset Formula \( N \)
\end_inset
(with respect to
\begin_inset Formula \( \theta \)
\end_inset
and
\begin_inset Formula \( Q \)
\end_inset
) so as to
\emph on
approximately
\emph default
minimize the bound we have derived above.
The first step in this minimization is the replacement of the Binomial
tail with a looser approximation such as the Hoeffding bound which gives
us:
\begin_inset Formula
\[
1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\]
\end_inset
\begin_inset Formula
\[
\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\]
\end_inset
\layout Standard
We can then choose
\begin_inset Formula
\[
N=\left\lceil 8\frac{\ln \frac{m}{\textrm{KL}(q||p)}}{\theta ^{2}}\right\rceil .\]
\end_inset
\layout Standard
This choice gives us:
\layout Standard
\begin_inset Formula
\[
e^{-\frac{N\theta ^{2}}{8}}=\frac{\textrm{KL}(q||p)}{m}\]
\end_inset
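\layout Standard
A quick numerical check (with arbitrary illustrative values of m, theta and KL(q||p)) that this choice of N drives the Hoeffding approximation of both Binomial tails below KL(q||p)/m:

```python
import math

def choose_N(m, theta, kl_qp):
    """N = ceil(8 ln(m / KL(q||p)) / theta^2), as chosen in the proof."""
    return math.ceil(8 * math.log(m / kl_qp) / theta**2)

m, theta, kl_qp = 1000, 0.2, 2.5
N = choose_N(m, theta, kl_qp)
# The Hoeffding bound on each Binomial tail is exp(-N theta^2 / 8):
tail = math.exp(-N * theta**2 / 8)
print(N, tail)  # tail is at most KL(q||p)/m = 0.0025
```

Because of the ceiling, the tail is always at most KL(q||p)/m, never merely close to it.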
\layout Standard
Which implies we have an equation of the form:
\layout Standard
\begin_inset Formula
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) \leq \frac{\theta ^{-2}\textrm{KL}(q||p)\ln m+\ln m+\ln \frac{1}{\delta }}{m}
\end{equation}
\end_inset
In order to prove the result, we must show that:
\begin_inset Formula
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) =O\left( \textrm{KL}\left( q||p\right) \right)
\end{equation}
\end_inset
\layout Standard
We can do this by taking the difference to get an equation of the form:
\layout Standard
\begin_inset Formula \( (q+k)\ln \frac{q+k}{p-k}+(1-q-k)\ln \frac{1-q-k}{1-p+k}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\end_inset
\layout Standard
If
\begin_inset Formula \( p-q>2k \)
\end_inset
and
\begin_inset Formula \( k<\frac{1}{2} \)
\end_inset
then we have:
\layout Standard
\begin_inset Formula \( \simeq (q+k)\ln \frac{q+k+\frac{k}{p}}{p}+(1-q-k)\ln \frac{1-q-k-\frac{k}{1-p}}{1-p}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\end_inset
\layout Standard
\begin_inset Formula \( \simeq (q+k)\left[ \ln \frac{q}{p}+\frac{k+\frac{k}{p}}{\frac{q}{p}}\right] +(1-q-k)\left[ \ln \frac{1-q}{1-p}-\frac{k+\frac{k}{1-p}}{\frac{1-q}{1-p}}\right] -q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\end_inset
\layout Standard
\begin_inset Formula \( \simeq (q+k)[\ln \frac{q}{p}+k(\frac{p+1}{q})]+(1-q-k)[\ln \frac{1-q}{1-p}-k(\frac{2-p}{1-q})]-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\end_inset
\layout Standard
\begin_inset Formula \( \simeq k(1+p)-k(2-p) \)
\end_inset
\layout Standard
\begin_inset Formula \( \simeq k \)
\end_inset
\layout Standard
Since we have that the difference
\begin_inset Formula
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) -\textrm{KL}\left( q||p\right) \simeq \frac{\textrm{KL}(q||p)}{m}
\end{equation}
\end_inset
the main theorem follows immediately.
\layout Section
Methods for tightening
\layout Standard
\begin_inset LatexCommand \label{sec-tighten}
\end_inset
\layout Standard
The previous section showed a bound in asymptotic form which is good for
understanding the trade-offs between the number of examples (
\begin_inset Formula \( m \)
\end_inset
), the size of the hypothesis space (
\begin_inset Formula \( |H| \)
\end_inset
), the margin (
\begin_inset Formula \( \theta \)
\end_inset
) and the entropy of the average (
\begin_inset Formula \( H(Q) \)
\end_inset
).
However, it is not a good form for those interested in quantitative application
of the bound to specific problems.
We state improvements which aid in the development of a quantitatively
applicable bound.
We can tighten the bound above through several techniques:
\layout Enumerate
Parameterizing and then optimizing the parameterization of arbitrary choices
within the proof.
In the (improved) margin bound proof, we arbitrarily decided to work with
the margin of the randomly produced function
\begin_inset Formula \( g(x) \)
\end_inset
at
\begin_inset Formula \( \frac{\theta }{2} \)
\end_inset
.
This is a good heuristic, but not the optimal choice when we use the improved
tail bounds.
Since the decision of the margin for the random function
\begin_inset Formula \( g(x) \)
\end_inset
is a parameter of the proof, we are free to optimize it.
\layout Enumerate
Tighter argument within the proof.
The optimal value of
\begin_inset Formula \( N \)
\end_inset
is a function of
\begin_inset Formula \( \theta ,m,\textrm{KL}(q||p) \)
\end_inset
and
\begin_inset Formula \( \delta \)
\end_inset
.
All of these are known in advance except for
\begin_inset Formula \( \textrm{KL}(q||p) \)
\end_inset
.
If we can estimate in advance the value of
\begin_inset Formula \( \textrm{KL}(q||p) \)
\end_inset
, then it becomes possible to optimize the value of
\begin_inset Formula \( N \)
\end_inset
in a data-independent manner.
Consequently, it becomes unnecessary to split our confidence over the possible
values of
\begin_inset Formula \( N \)
\end_inset
and we need only split the confidence over the values of
\begin_inset Formula \( \theta \)
\end_inset
in proving the bound.
The effect of this improvement is reducing
\begin_inset Formula \( 1/\delta _{N,k}=1/(N(N+1)^{2}) \)
\end_inset
to
\begin_inset Formula \( 1/\delta _{N,k}=1/(N(N+1)) \)
\end_inset
giving us a small improvement in the low order terms of the improved averaging
bound.
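\layout Standard
The first technique is easy to illustrate. The sketch below uses a purely hypothetical bound function (its form is an assumption for illustration, not the improved margin bound itself) to show that treating the working margin as a free parameter and optimizing it numerically can only improve on a fixed heuristic choice:

```python
import math

def toy_bound(t, theta=0.1, m=1000, size_H=10**6):
    """A stand-in bound depending on the working margin t of the random
    function g(x); NOT the actual improved margin bound from the text."""
    if not 0 < t < theta:
        return float("inf")
    # small t inflates the cover term, large t eats into the slack theta - t
    return math.sqrt((math.log(size_H) / t**2 + math.log(m)) / m) + (theta - t)

heuristic = toy_bound(0.05)                    # the theta/2 heuristic
optimized = min(toy_bound(0.001 * i) for i in range(1, 100))
print(heuristic, optimized)
```

Since the heuristic point is on the search grid, the optimized value can never be worse; the same reasoning applies to any free parameter introduced inside a proof.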
\layout Section
Final thoughts for Averaging Bounds
\layout Standard
The practical implication of this more-refined analysis of averaging bounds
is stronger support for the practice of averaging.
Sufficient averaging over the hypothesis space can reduce the
\begin_inset Quotes eld
\end_inset
complexity
\begin_inset Quotes erd
\end_inset
allowing tight estimates on the true error rate---possibly even tighter
than could be guaranteed with a single hypothesis.
\layout Standard
The improved averaging bound is not yet the tightest possible and it appears
there are several possible theoretical improvements.
\layout Enumerate
Remove the low order terms from the bound to make it more quantitatively
applicable.
\layout Enumerate
Improve the argument to take into account the
\emph on
distribution
\emph default
of the margin rather than the margin at some point.
\layout Enumerate
Prove a lower bound which corresponds to the upper bound given here.
Since no good lower bound yet exists, we do not know that large improvements
in the upper bound are not possible.
\layout Standard
Open Problem: In practice, is the averaging bound (
\begin_inset LatexCommand \ref{eq-finalbound}
\end_inset
) ever better than the PAC-Bayes bound (theorem
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
)? The extra triangle inequality applications in the averaging bound may
make it worse than the PAC-Bayes bound in practice.
\layout Chapter
Computable Shell bounds
\layout Standard
\begin_inset LatexCommand \label{sec-shell}
\end_inset
\layout Standard
The first shell bound paper was joint work with David McAllester and was
presented at COLT
\begin_inset LatexCommand \cite{Shell}
\end_inset
.
The work presented here incorporates significant refinement, generalization,
and simplification of the first COLT paper.
\layout Standard
Roughly speaking, the shell bound (usually) provides much tighter true error
rate upper bounds on learned hypotheses than the conventional Occam's Razor
bound (theorem
\begin_inset LatexCommand \ref{th-ORB}
\end_inset
) or the PAC-Bayes bound (theorem
\begin_inset LatexCommand \ref{th-repbb}
\end_inset
).
It does this without violating the lower upper bounds of section
\begin_inset LatexCommand \ref{sec-lower_upper}
\end_inset
by incorporating
\emph on
much
\emph default
more information into the bound calculation.
\layout Standard
The inspiration behind the work on Shell bounds rests on two pieces of work.
In
\begin_inset LatexCommand \cite{old_shell}
\end_inset
by Haussler, Kearns, Seung, and Tishby, learning theory curves are investigated
from an omniscient point of view where the true error rates of various
hypotheses are known.
The principal improvement in this paper is that our bounds are reduced
to
\emph on
observable
\emph default
quantities.
Put another way, we do not need to know the underlying learning distribution,
\begin_inset Formula \( D \)
\end_inset
.
In
\begin_inset LatexCommand \cite{Tobias}
\end_inset
, an analysis was made assuming some distribution over true error rates.
Our analysis does not rely on any assumption about the distribution of
true error rates---only the independence assumption is made.
Despite using only observable information and making no extra assumptions,
the results here are quite tight and yield practical results.
\layout Standard
We start with the distribution of empirical errors over hypotheses and subtract
a small amount from the empirical error rates to create a pessimistic distribution.
With high probability, the cumulative of the pessimistic distribution will
lower bound the cumulative distribution of hypothesis true error rates.
Given this, we can directly calculate a bound on the probability that a
\begin_inset Quotes eld
\end_inset
large
\begin_inset Quotes erd
\end_inset
hypothesis will produce a misleadingly small error.
This bound can be
\emph on
much
\emph default
tighter than standard union bound techniques, although the amount of improvement
is highly problem dependent.
\layout Standard
After presenting the first bound, we will transform it into a bound on continuous
hypothesis spaces using a PAC-Bayes like approach
\begin_inset LatexCommand \cite{PB}
\end_inset
.
\layout Standard
Viewed as an interactive proof of learning (figure
\begin_inset LatexCommand \ref{fig-shell-protocol}
\end_inset
), the stochastic shell bound is much like the PAC-Bayes bound.
\layout Standard
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 153
file thesis-presentation/shell.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-shell-protocol}
\end_inset
The stochastic shell bound, as an interactive proof of learning, has the
same general outline as the PAC-Bayes bound except that
\emph on
much
\emph default
more information is required in order to calculate the bound.
The shell bound (proved first below) is a simplification which is somewhat
tighter when the
\begin_inset Quotes eld
\end_inset
Posterior
\begin_inset Quotes erd
\end_inset
places all mass on one hypothesis.
\end_float
\layout Standard
The strongest criticism of shell bounds is, in fact, that too much information
is required.
While this information is always theoretically observable, it may not be
tractable to collect.
There are two answers to this criticism given here.
The first is an empirical application to decision tree learning algorithms
which shows that in practice, there are often shortcuts which make the
information gathering feasible.
The second answer is to construct a sampled version of the shell bound
which uses approximate versions of the information required by the shell bound.
We will:
\layout Enumerate
Present the discrete shell bound.
\layout Enumerate
Present the sampled version of the shell bound.
\layout Enumerate
Extend the discrete shell bound to continuous spaces.
\layout Section
The Discrete Shell Bound
\layout Subsection
Knowledge of learning distribution
\layout Standard
Let
\begin_inset Formula \( B(m,K,p)=\binom{m}{K}p^{K}(1-p)^{m-K} \)
\end_inset
be the probability that a hypothesis with true error
\begin_inset Formula \( p \)
\end_inset
produces
\begin_inset Formula \( K \)
\end_inset
errors on
\begin_inset Formula \( m \)
\end_inset
independent examples.
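\layout Standard
In code, this is just the binomial point probability. A minimal sketch:

```python
from math import comb

def B(m, K, p):
    """Probability that a hypothesis with true error p produces exactly
    K errors on m independent examples."""
    return comb(m, K) * p**K * (1 - p)**(m - K)

# a near-random hypothesis (p = 0.5) almost never lands on 3/100 errors,
# while a good one (p = 0.05) does so with appreciable probability
print(B(100, 3, 0.5), B(100, 3, 0.05))
```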
\layout Standard
The discrete shell bound works directly with the probability that there
will be a confusingly small empirical error.
Let
\begin_inset Formula
\[
P(\epsilon ,K)\equiv \sum _{h:e(h)>\frac{K}{m}+\epsilon }B(m,K,e(h))\]
\end_inset
\layout Standard
Intuitively,
\begin_inset Formula \( P(\epsilon ,K) \)
\end_inset
is a bound on the probability that a hypothesis with a true error rate
larger than
\begin_inset Formula \( \frac{K}{m}+\epsilon \)
\end_inset
will have an empirical error rate of
\begin_inset Formula \( \frac{K}{m} \)
\end_inset
.
The contribution to the sum will fall off exponentially as the true error,
\begin_inset Formula \( e(h) \)
\end_inset
, increases.
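\layout Standard
The exponential falloff is dramatic even in a small illustration. The toy hypothesis space below (the error values and counts are invented for illustration) has a million near-random hypotheses, yet their total contribution to the sum is negligible compared to the union-bound count of \begin_inset Formula \( |H| \)
\end_inset
:

```python
from math import comb

def B(m, K, p):
    # binomial point probability
    return comb(m, K) * p**K * (1 - p)**(m - K)

def P(eps, K, counts, m):
    """Sum of B(m, K, e(h)) over hypotheses with e(h) > K/m + eps,
    given a histogram {true error rate: number of hypotheses}."""
    return sum(n * B(m, K, e) for e, n in counts.items() if e > K / m + eps)

# two good hypotheses and a million near-random ones (invented numbers)
counts = {0.05: 1, 0.10: 1, 0.5: 10**6}
m, K, eps = 1000, 30, 0.1
print(P(eps, K, counts, m))
```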
\layout Standard
Our first step is stating a shell bound which requires unknown information.
The purpose of this bound is motivational: it provides insight into why
we can expect a large improvement.
Later, we will remove the unknown information requirements and recover
a useful bound.
\layout Theorem
\begin_inset LatexCommand \label{Full_knowledge}
\end_inset
(Full knowledge theorem) For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
\end_inset
where
\begin_inset Formula \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, P(\epsilon ,K)=\frac{\delta }{m}\right\} \)
\end_inset
\layout Standard
The full knowledge theorem relies on unobservable information---the true
error rates of all hypotheses.
This theorem is not (quite) trivial because it does not rely upon information
about
\emph on
which
\emph default
hypothesis has a particular true error.
\layout Proof
For every hypothesis with a true error rate of
\begin_inset Formula
\[
e(h)>\frac{K}{m}+\epsilon (\delta ,K)\]
\end_inset
the probability of producing an empirical error of
\begin_inset Formula
\[
\hat{e}(h)=\frac{K}{m}\]
\end_inset
is
\begin_inset Formula \( B(m,K,e(h)) \)
\end_inset
.
Applying the union bound over all hypotheses and all
\begin_inset Formula \( m \)
\end_inset
possible nontrivial values of
\begin_inset Formula \( K \)
\end_inset
completes the proof.
\layout Standard
There are a couple of things to note about this theorem.
First, for most balanced machine learning problems, most hypotheses typically
have a true error rate near to
\begin_inset Formula \( 0.5 \)
\end_inset
.
Given this, and noticing that Binomial tails fall off exponentially, dramatic
improvements in the bound are feasible.
\layout Standard
Second, we must use
\begin_inset Formula \( \frac{\delta }{m} \)
\end_inset
rather than
\begin_inset Formula \( \delta \)
\end_inset
in order to make the proof work.
It is possible that the theorem holds without splitting
\begin_inset Formula \( \delta \)
\end_inset
\begin_inset Quotes eld
\end_inset
\begin_inset Formula \( m \)
\end_inset
-ways
\begin_inset Quotes erd
\end_inset
.
Removing the factor of
\begin_inset Formula \( m \)
\end_inset
is an open problem.
For the special case of empirical risk minimization algorithms, McDiarmid's
inequality
\begin_inset LatexCommand \cite{McD}
\end_inset
implies that the range of hypotheses with minimum empirical error is of
size
\begin_inset Formula \( O(\sqrt{m}) \)
\end_inset
with high probability.
Therefore, we need only apply the union bound to
\begin_inset Formula \( O(\sqrt{m}) \)
\end_inset
possible minimum empirical error rates.
\layout Subsection
No knowledge of learning distribution
\layout Standard
Applying the full knowledge theorem (
\begin_inset LatexCommand \ref{Full_knowledge}
\end_inset
) is impractical in almost all learning settings because we do not know
the distribution of true error rates.
Therefore, it is necessary to construct a slightly looser theorem which
relies upon only observable quantities.
Surprisingly, this is possible while introducing only slightly more slack.
\layout Standard
First, we need a couple of definitions.
\layout Standard
\begin_inset Formula
\[
\bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ \frac{K}{m}+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,i,p)\geq \delta \right\} \right\} \]
\end_inset
Intuitively,
\begin_inset Formula \( \bar{e}(\epsilon ,K,\delta ,\frac{i}{m}) \)
\end_inset
is a
\emph on
lower
\emph default
bound on the true error rate of a hypothesis with an empirical error of
\begin_inset Formula \( \frac{i}{m} \)
\end_inset
.
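\layout Standard
The inner minimization is a one-dimensional inversion: since \begin_inset Formula \( 1-\textrm{Bin}(m,i,p) \)
\end_inset
 is increasing in \begin_inset Formula \( p \)
\end_inset
, the minimal \begin_inset Formula \( p \)
\end_inset
 can be found by bisection. A sketch (the tolerance is an arbitrary choice):

```python
from math import comb

def binom_cdf(m, i, p):
    """Bin(m, i, p): probability of at most i errors on m examples."""
    return sum(comb(m, j) * p**j * (1 - p)**(m - j) for j in range(i + 1))

def e_bar(eps, K, delta, i, m, tol=1e-10):
    """max{K/m + eps, min{p : 1 - Bin(m, i, p) >= delta}} by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - binom_cdf(m, i, mid) >= delta:
            hi = mid          # mid already satisfies the constraint
        else:
            lo = mid
    return max(K / m + eps, hi)
```

For example, with 10 errors on 100 examples and delta = 0.05, the inner minimum lies somewhat below the empirical rate 0.1, acting as a high-confidence lower bound on the true error.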
\layout Standard
\begin_inset Formula
\[
\hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{h}B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\]
\end_inset
The quantity
\begin_inset Formula \( \hat{P}(\epsilon ,K,\delta ) \)
\end_inset
is an upper bound on the probability that one of the hypotheses with a
true error rate of
\begin_inset Formula \( \bar{e} \)
\end_inset
or more could produce
\begin_inset Formula \( K \)
\end_inset
empirical errors.
\layout Standard
Noting that there are only
\begin_inset Formula \( m+1 \)
\end_inset
possible empirical errors, we can first let
\begin_inset Formula
\[
c\left( \frac{i}{m}\right) =\left| \left\{ h:\, \hat{e}(h)=\frac{i}{m}\right\} \right| \]
\end_inset
be the number of hypotheses with empirical error
\begin_inset Formula \( \frac{i}{m} \)
\end_inset
.
Then we can redefine:
\begin_inset Formula
\[
\hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}c\left( \frac{i}{m}\right) B(m,K,\bar{e}(\epsilon ,K,\delta ,\frac{i}{m}))\]
\end_inset
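\layout Standard
Given the histogram, this redefined sum is directly computable. The sketch below inverts the binomial tail by bisection to obtain the pessimistic error rates and then evaluates the sum (all example numbers are invented):

```python
from math import comb

def B(m, K, p):
    # binomial point probability
    return comb(m, K) * p**K * (1 - p)**(m - K)

def binom_cdf(m, i, p):
    # Bin(m, i, p): probability of at most i errors on m examples
    return sum(comb(m, j) * p**j * (1 - p)**(m - j) for j in range(i + 1))

def e_bar(eps, K, delta, i, m):
    # max{K/m + eps, min{p : 1 - Bin(m, i, p) >= delta}} by bisection
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if 1 - binom_cdf(m, i, mid) >= delta:
            hi = mid
        else:
            lo = mid
    return max(K / m + eps, hi)

def P_hat(eps, K, delta, counts, m):
    """2 sum_i c(i/m) B(m, K, e_bar(eps, K, delta, i/m))."""
    return 2 * sum(c * B(m, K, e_bar(eps, K, delta, i, m))
                   for i, c in counts.items())

# invented histogram: one good hypothesis, many near-random ones
val = P_hat(0.1, 5, 0.05, {5: 1, 50: 1000}, 100)
print(val)
```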
\layout Standard
Later, we will prove that with high probability,
\begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \)
\end_inset
.
Given that this is so, we can prove a theorem which
\emph on
only
\emph default
relies on observable quantities.
\layout Theorem
\begin_inset LatexCommand \label{Observable}
\end_inset
(Observable Shell Bound) For all
\begin_inset Formula \( \delta >0 \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
\end_inset
where
\begin_inset Formula \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \)
\end_inset
\layout Standard
The observable shell bound preserves the important locality property of
the full knowledge shell bound.
In particular, when most of the true error rates are
\begin_inset Quotes eld
\end_inset
far
\begin_inset Quotes erd
\end_inset
from the empirical error rate (and
\begin_inset Formula \( m \)
\end_inset
is large enough), we expect to make large (functional) improvements on
the discrete hypothesis bound
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
.
\layout Standard
The proof rests upon a technical lemma which allows us to bound the unobservable
\begin_inset Quotes eld
\end_inset
probability of a misleading event
\begin_inset Quotes erd
\end_inset
with an observable event.
\layout Lemma
\begin_inset LatexCommand \label{unobservable bound}
\end_inset
(Unobservable bound) For all empirical errors,
\begin_inset Formula \( K \)
\end_inset
, for all distributions
\begin_inset Formula \( Q(h) \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( 2\sum _{h}Q(h)B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\leq \sum _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))\right) \leq \delta \]
\end_inset
\layout Standard
This lemma is powerful because it bounds the unobservable right hand side
in terms of the observable left hand side.
\layout Proof
Let
\begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, 1-\textrm{Bin}(m,k,p)\geq \delta \} \)
\end_inset
\begin_inset Formula
\[
\forall \alpha \in (0,1]\, \, \forall h:\, \, E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall P(h)\, \, E_{P}E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall P(h)\, \, E_{D^{m}}E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\geq \frac{\alpha }{\delta }\right) \leq \delta \]
\end_inset
\begin_inset Formula
\[
\Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( \Pr _{P}\left( e(h)<\bar{e}(m,\hat{e}(h),\alpha )\right) \geq \frac{\alpha }{\delta }\right) \leq \delta \]
\end_inset
Let
\begin_inset Formula \( P(h)\propto Q(h)B(m,K,e(h)) \)
\end_inset
.
Then:
\begin_inset Formula
\[
\Rightarrow \forall Q(h)\, \, \Pr _{D^{m}}\left( \frac{\sum _{h:e(h)>\frac{K}{m}+\epsilon \wedge e(h)>\bar{e}(m,\hat{e}(h),\alpha )}Q(h)B(m,K,e(h))}{\sum _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))}\geq \frac{\alpha }{\delta }\right) \leq \delta \]
\end_inset
set
\begin_inset Formula \( \alpha =\frac{\delta }{2} \)
\end_inset
and replace
\begin_inset Formula \( e(h) \)
\end_inset
with
\begin_inset Formula \( \bar{e}(m,\hat{e}(h),\alpha ) \)
\end_inset
to achieve the result.
\layout Standard
We now have all the tools required to prove the theorem.
\layout Proof
(of theorem
\begin_inset LatexCommand \ref{Observable}
\end_inset
) Choose
\begin_inset Formula \( Q(h) \)
\end_inset
= the uniform distribution on our hypothesis space.
Then, we know that with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
,
\begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \)
\end_inset
.
Therefore, we can (arbitrarily) allocate a
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
probability of failure to the unobservable bound
\begin_inset LatexCommand \ref{unobservable bound}
\end_inset
and a
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
probability of failure to the full knowledge bound
\begin_inset LatexCommand \ref{Full_knowledge}
\end_inset
.
Assuming
\begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \)
\end_inset
, the observable bound will be more pessimistic than the full knowledge
bound.
\layout Standard
The Observable Shell bound behaves in a strange manner which is unlike other
true error bounds.
In particular, the true error bound can be discontinuous in the value of
\begin_inset Formula \( \delta \)
\end_inset
.
This discontinuity implies that relatively small improvements in the shell
bounds can result in dramatic improvements in the value of the true error
bound.
While dramatic improvements can happen with small improvements in the shell
bound, we expect that in practice, such large improvements will not be
too common, simply because a small improvement is unlikely (amongst all
learning problems) to shift the bound across one of these discontinuous
transitions.
\layout Section
Sampling Shell Bound
\layout Standard
The Shell bound relies upon the distribution of empirical errors across
the entire hypothesis space.
Calculating
\begin_inset Formula \( c\left( \frac{i}{m}\right) \)
\end_inset
, while theoretically possible, is often practically intractable.
For example, the space of all binary functions on
\begin_inset Formula \( n \)
\end_inset
features has size
\begin_inset Formula \( 2^{2^{n}} \)
\end_inset
and for a decision tree with a number of nodes
\begin_inset Formula \( k=O(m) \)
\end_inset
there are more than
\begin_inset Formula \( 2^{k} \)
\end_inset
hypotheses with the same number of nodes.
In order to avoid this difficulty, we will use sampling which is made
possible by noticing that the Shell bound does not require exact knowledge
of the empirical error distribution.
Instead, we can safely count a hypothesis twice because over-counting
monotonically worsens the bound.
Assume that we have an oracle which can be used to sample uniformly from
the set of all hypotheses.
Then, we can bin the sampled hypotheses according to their error rate on
the example set.
After repeating
\begin_inset Formula \( k \)
\end_inset
times we will arrive at an empirical distribution over error rates which
can be altered into a bound on the true distribution of error rates by
upper bounding the count
\begin_inset Formula \( c\left( \frac{i}{m}\right) \)
\end_inset
.
Let
\begin_inset Formula \( \hat{c}\left( \frac{i}{m}\right) \)
\end_inset
be the observed number of hypotheses out of
\begin_inset Formula \( k \)
\end_inset
uniform choices with empirical error
\begin_inset Formula \( \frac{i}{m} \)
\end_inset
and define:
\begin_inset Formula
\[
\bar{c}\left( k,\frac{\hat{c}}{k},\delta ,|H|\right) \equiv |H|\cdot \max _{p}\left\{ p:\, \, \textrm{Bin}(k,\hat{c},p)=\frac{\delta }{m}\right\} \]
\end_inset
Intuitively,
\begin_inset Formula \( \bar{c} \)
\end_inset
will be a high confidence upper bound on the number of hypotheses with
empirical error rate
\begin_inset Formula \( \frac{i}{m} \)
\end_inset
.
Given these quantities, we can calculate an approximate
\begin_inset Formula \( \hat{P} \)
\end_inset
according to the formula:
\begin_inset Formula
\[
\hat{\hat{P}}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}\bar{c}\left( k,\frac{\hat{c}\left( \frac{i}{m}\right) }{k},\frac{\delta }{2m},|H|\right) B\left( m,K,\bar{e}\left( \epsilon ,K,\frac{\delta }{2},\frac{i}{m}\right) \right) \]
\end_inset
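\layout Standard
The upper bound on the count is again a one-dimensional inversion: since \begin_inset Formula \( \textrm{Bin}(k,\hat{c},p) \)
\end_inset
 is decreasing in \begin_inset Formula \( p \)
\end_inset
, the largest admissible \begin_inset Formula \( p \)
\end_inset
 can be found by bisection. A sketch (the example numbers are invented):

```python
from math import comb

def binom_cdf(n, i, p):
    # Bin(n, i, p): probability of at most i successes out of n at rate p
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(i + 1))

def c_bar(k, c_hat, delta, size_H, m):
    """|H| * max{p : Bin(k, c_hat, p) >= delta/m}: a high-confidence upper
    bound on the number of hypotheses in a bin, from c_hat hits in k
    uniform samples of the hypothesis space."""
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if binom_cdf(k, c_hat, mid) >= delta / m:
            lo = mid          # mid is still admissible
        else:
            hi = mid
    return size_H * lo

# 100 of 1000 sampled hypotheses fell in the bin: the bound exceeds the
# naive point estimate 0.1 * |H| by a sampling-error margin
print(c_bar(1000, 100, 0.05, 10**6, 100))
```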
\layout Theorem
\begin_inset LatexCommand \label{th-sample_shell}
\end_inset
(Sampling Shell Bound) For all
\begin_inset Formula \( \delta >0 \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
\end_inset
where
\begin_inset Formula \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, \hat{\hat{P}}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \)
\end_inset
\layout Proof
For every
\begin_inset Formula \( i \)
\end_inset
, we know that
\begin_inset Formula \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \)
\end_inset
with probability at least
\begin_inset Formula \( 1-\frac{\delta }{2} \)
\end_inset
.
This implies that
\begin_inset Formula \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq P\left( \epsilon ,K\right) \)
\end_inset
with probability at least
\begin_inset Formula \( 1-\delta \)
\end_inset
.
Applying the union bound with the full knowledge theorem gives us the result.
\layout Standard
The Sampling Shell bound is still relatively fast to calculate given the
sampling results, but it is worth noting that
\emph on
many
\emph default
samples are required---the bound will typically tighten linearly with
\begin_inset Formula \( \ln k \)
\end_inset
.
In other words, an exponentially large
\begin_inset Formula \( k \)
\end_inset
is required to realize all of the gains of the shell bound.
Thus, the sampling shell bound will interpolate between the discrete hypothesis
bound
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
and the shell bound as
\begin_inset Formula \( \ln k \)
\end_inset
moves from
\begin_inset Formula \( 1 \)
\end_inset
to
\begin_inset Formula \( \ln |H| \)
\end_inset
.
\layout Section
Lower Bounds
\layout Standard
The lower upper bound of section
\begin_inset LatexCommand \ref{sec-lower_upper}
\end_inset
does not apply to shell bounds because we are using more information than
just the empirical error rate of a learned hypothesis.
In particular, we are using the empirical error rates of
\emph on
all
\emph default
the hypotheses in calculating the bound.
Is there a lower upper bound which applies for the information used by
the shell bound? The same independent hypothesis technique will allow us
to lower bound the full knowledge theorem
\begin_inset LatexCommand \ref{Full_knowledge}
\end_inset
.
In particular, assume that we are given a set of independent hypotheses,
each with some true error
\begin_inset Formula \( e(h) \)
\end_inset
.
What is a lower bound on the probability that one of these hypotheses will
have an empirical error of
\begin_inset Formula \( \frac{K}{m} \)
\end_inset
?
\layout Standard
If
\begin_inset Formula \( A \)
\end_inset
and
\begin_inset Formula \( B \)
\end_inset
are independent events, then:
\begin_inset Formula
\[
\Pr (A\textrm{ or }B)=\Pr (A)+\Pr (B)-\Pr (A\textrm{ and }B)\]
\end_inset
\begin_inset Formula
\[
=\Pr (A)+\Pr (B)-\Pr (A)\Pr (B)\]
\end_inset
\begin_inset Formula
\[
=(1-\Pr (B))\Pr (A)+\Pr (B)\]
\end_inset
This implies that we can
\begin_inset Quotes eld
\end_inset
add
\begin_inset Quotes erd
\end_inset
the independent probabilities together as long as we rescale.
In particular,
\begin_inset Formula \( B \)
\end_inset
might be the probability of a
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
hypothesis in some set of hypotheses and
\begin_inset Formula \( A \)
\end_inset
might be the probability that some new hypothesis with a large true error
rate has a small empirical error.
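\layout Standard
A quick Monte Carlo check of this rescaled addition (the probabilities 0.2 and 0.3 are arbitrary example values):

```python
import random

random.seed(0)
pA, pB = 0.2, 0.3          # arbitrary independent events
trials = 200_000
hits = sum((random.random() < pA) or (random.random() < pB)
           for _ in range(trials))
estimate = hits / trials
exact = (1 - pB) * pA + pB   # = P(A) + P(B) - P(A)P(B) = 0.44
print(estimate, exact)
```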
\layout Standard
Using this fact, we get the following theorem:
\layout Theorem
(Lower Upper Shell Bound) For all true errors
\begin_inset Formula \( e(h) \)
\end_inset
,
\begin_inset Formula \( K \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h:e(h)>\frac{K}{m}+\epsilon \textrm{ and }\hat{e}(h)=\frac{K}{m}\right) \geq P(\epsilon ,K)(1-P(\epsilon ,K))\]
\end_inset
\layout Proof
The proof is by finite induction on the set of hypotheses with a large true
error rate.
Let
\begin_inset Formula
\[
P_{H}(\epsilon ,K)=\sum _{h\in H}B(m,K,e(h))\]
\end_inset
be the sum of the probabilities that each hypothesis in
\begin_inset Formula \( H \)
\end_inset
produces an empirical error of
\begin_inset Formula \( \frac{K}{m} \)
\end_inset
.
Now, we want to prove that:
\begin_inset Formula
\[
\forall H\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
\end_inset
This is true for the base case of
\begin_inset Formula \( |H|=1 \)
\end_inset
.
Assuming that it is true for the case of
\begin_inset Formula \( |H|=n \)
\end_inset
, we need to prove it for the case of
\begin_inset Formula \( H'=H\cup \{h\} \)
\end_inset
.
In particular, we have assumed
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
\end_inset
Using the earlier independence principle, we get that:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H':\, \, \hat{e}(h)=\frac{K}{m}\right) \]
\end_inset
\begin_inset Formula
\[
\geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K)))B(m,K,e(h))\]
\end_inset
\begin_inset Formula
\[
\geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K))B(m,K,e(h))\]
\end_inset
\begin_inset Formula
\[
=P_{H'}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
\end_inset
\begin_inset Formula
\[
\geq P_{H'}(\epsilon ,K)(1-P_{H'}(\epsilon ,K))\]
\end_inset
By induction, this property therefore holds for the set of all hypotheses
with a large true error rate.
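\layout Standard
The inequality can be checked by simulation. The sketch below (hypothesis count, error rates, and sample sizes are all invented) draws independent hypotheses and confirms empirically that the probability some hypothesis lands exactly on K errors is at least P(1-P):

```python
import random
from math import comb

random.seed(1)

def B(m, K, p):
    # binomial point probability
    return comb(m, K) * p**K * (1 - p)**(m - K)

m, K = 20, 2
true_errors = [0.3, 0.35, 0.4]   # independent hypotheses, invented rates
P = sum(B(m, K, e) for e in true_errors)

trials = 50_000
hit = sum(
    any(sum(random.random() < e for _ in range(m)) == K for e in true_errors)
    for _ in range(trials)
)
print(hit / trials, P * (1 - P))
```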
\layout Standard
Assuming that
\begin_inset Formula \( \delta <\frac{1}{2} \)
\end_inset
, the lower upper shell bound is tight to within a factor (in
\begin_inset Formula \( \delta \)
\end_inset
) of
\begin_inset Formula \( 2m \)
\end_inset
with the full knowledge bound
\begin_inset LatexCommand \ref{Full_knowledge}
\end_inset
.
Given the exponential behavior of Binomial tails, this usually (but not
always) implies a small impact on the true error bound.
One important question remains: how does this bound compare to the observable
shell bound
\begin_inset LatexCommand \ref{Observable}
\end_inset
? In the observable shell bound, the distribution of true errors is replaced
with a pessimistic distribution based upon the observed empirical errors.
The
\begin_inset Quotes eld
\end_inset
size
\begin_inset Quotes erd
\end_inset
of this pessimism in terms of the true error bound is, in general, of size
\begin_inset Formula \( \frac{1}{\sqrt{m}} \)
\end_inset
.
Thus, the gap between the lower upper shell bound and the upper shell bound
is typically of size
\begin_inset Formula \( \frac{1}{\sqrt{m}} \)
\end_inset
.
\layout Section
Shell Bounds for Continuous Spaces
\layout Standard
Applying Shell bounds to continuous hypothesis spaces is not easy.
In fact, upon first inspection, this appears to be impossible since shell
bounds require knowledge of the number of hypotheses with a particular
empirical error.
We can avoid these difficulties, while introducing a small amount of slack,
by always being concerned with the
\emph on
measure
\emph default
rather than the count of hypotheses.
In particular, we will keep track of the
\emph on
measure
\emph default
of hypotheses with a confusingly small error and the
\emph on
measure
\emph default
of the hypotheses that we pick.
The approach here is similar to the approach in section
\begin_inset LatexCommand \ref{sec-PB}
\end_inset
although more simplistic.
\layout Standard
First assume that there is some measure
\begin_inset Formula \( Q \)
\end_inset
over the hypothesis space
\begin_inset Formula \( H \)
\end_inset
.
Suppose that we choose some subset,
\begin_inset Formula \( U \)
\end_inset
, of the hypotheses with measure
\begin_inset Formula \( Q(U) \)
\end_inset
.
We will be concerned with the
\emph on
largest
\emph default
empirical error rate
\begin_inset Formula \( \hat{e}_{U}(h)=\max _{h\in U}\hat{e}(h) \)
\end_inset
and the
\emph on
average
\emph default
true error rate,
\begin_inset Formula \( e_{U}(h)=E_{Q_{U}}e(h) \)
\end_inset
.
A bound on the gap between the largest empirical error rate and the average
true error rate will imply a true error bound for the stochastic classifier
which chooses randomly and evaluates.
\layout Standard
We will need a different definition of
\begin_inset Formula \( P \)
\end_inset
.
Let:
\layout Standard
\begin_inset Formula
\[
P_{s}(\epsilon ,\, \, K)\equiv \int _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))dh\]
\end_inset
We will also need the concept of rounding.
Choose
\begin_inset Formula \( i \)
\end_inset
such that
\begin_inset Formula \( \frac{1}{m^{i}}\geq Q(U)\geq \frac{1}{m^{i+1}} \)
\end_inset
then, define:
\begin_inset Formula
\[
\left\lfloor Q(U)\right\rfloor =\frac{1}{m^{i+1}}\]
\end_inset
Now, the following theorem holds:
\layout Theorem
(Stochastic Full knowledge)
\begin_inset LatexCommand \label{Stochastic Full Knowledge}
\end_inset
For all distributions
\begin_inset Formula \( Q \)
\end_inset
, For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists U\, \, e_{U}(h)\geq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{m^{3}}\right\} \)
\end_inset
\layout Proof
Call a hypothesis with a large true error (
\begin_inset Formula \( e(h)>\frac{K}{m}+\epsilon \)
\end_inset
) and small empirical error (
\begin_inset Formula \( \hat{e}(h)\leq \frac{K}{m} \)
\end_inset
) a
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
hypothesis.
\begin_inset Formula \( P_{s}(\epsilon ,K) \)
\end_inset
is the expected measure of the bad hypotheses.
We will use Markov's inequality to bound the actual measure of bad hypotheses.
Then, given that the quantity is bounded, we can bound the expected true
error by assuming that we included every bad hypothesis in the set
\begin_inset Formula \( U \)
\end_inset
.
\layout Proof
Let
\begin_inset Formula
\[
\epsilon _{Ki}=\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta }{m^{3+i}}\right\} \]
\end_inset
Intuitively,
\begin_inset Formula \( \epsilon _{Ki} \)
\end_inset
is the value we will use when
\begin_inset Formula \( \hat{e}=\frac{K}{m} \)
\end_inset
and
\begin_inset Formula \( \left\lfloor Q(U)\right\rfloor =\frac{1}{m^{i}} \)
\end_inset
.
Also let
\begin_inset Formula
\[
\hat{P}_{s}(\epsilon ,K)=\int _{h:e(h)>\frac{K}{m}+\epsilon \wedge \hat{e}(h)\leq \frac{K}{m}}p(h)dh\]
\end_inset
be the actual measure of bad hypotheses.
Then, Markov's inequality tells us:
\begin_inset Formula
\[
\forall \delta >0\, \, \Pr _{D}(\hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K))\leq \frac{\delta }{m^{2}}\]
\end_inset
Taking the union bound over all values of
\begin_inset Formula \( \epsilon _{Ki} \)
\end_inset
, we get:
\begin_inset Formula
\[
\forall \delta >0\, \, \Pr _{D}(\exists K,i\, \, \hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K))\leq \delta \]
\end_inset
So, with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
for all values of
\begin_inset Formula \( K \)
\end_inset
and
\begin_inset Formula \( i \)
\end_inset
, we have:
\begin_inset Formula \( \hat{P}_{s}(\epsilon _{Ki},K)\leq \frac{1}{m^{1+i}} \)
\end_inset
.
Therefore, if
\begin_inset Formula \( Q(U)\geq \frac{1}{m^{i}} \)
\end_inset
, we know that
\begin_inset Formula \( \frac{Q(U)}{\hat{P}_{s}(\epsilon _{Ki},K)}\geq m \)
\end_inset
.
Assuming that all of the bad hypotheses have true error
\begin_inset Formula \( 1 \)
\end_inset
and all the rest have true error at most
\begin_inset Formula \( \frac{K}{m}+\epsilon _{Ki} \)
\end_inset
, we get the following true error bound:
\begin_inset Formula
\[
e_{U}(h)\leq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon _{\hat{e}i}\]
\end_inset
\layout Standard
When applied to the full knowledge setting, the stochastic full knowledge
bound is loose by the substitution
\begin_inset Formula \( \delta \rightarrow \frac{\delta }{m^{2}} \)
\end_inset
.
Typically, this results in only a
\begin_inset Formula \( \frac{2\ln m}{m} \)
\end_inset
increase in the size of
\begin_inset Formula \( \epsilon \)
\end_inset
, although the increase can sometimes be much larger near phase transitions.
These factors of
\begin_inset Formula \( \frac{\ln m}{m} \)
\end_inset
may be removable with improved argumentation.
Naturally, the stochastic full knowledge bound can be used to prove a stochastic
observable bound.
\layout Standard
The next theorem is the observable analog of the stochastic full knowledge
bound.
Here, we eliminate the unobservable quantities to produce a stochastic
observable shell bound.
The observable quantity we will use is:
\layout Standard
\begin_inset Formula
\[
\hat{P}_{s}(\epsilon ,K,\delta )\equiv 2\sum _{h}Q(h)\textrm{Bin}(K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)),m)\]
\end_inset
where
\begin_inset Formula
\[
\bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ K+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,K,p)\geq \delta \right\} \right\} \]
\end_inset
is a slightly pessimal estimate of the true error rate given the empirical
error rate as before.
\layout Theorem
(Stochastic Observable Shell Bound) For all distributions
\begin_inset Formula \( Q \)
\end_inset
, For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists U:\, \, e_{U}(h)>\hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{2m^{3}}\right\} \)
\end_inset
\layout Proof
First note that the unobservable bound lemma
\begin_inset LatexCommand \ref{unobservable bound}
\end_inset
implies that with probability
\begin_inset Formula \( 1-\frac{\delta }{2} \)
\end_inset
, we have
\begin_inset Formula \( \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\geq P_{s}(\epsilon ,K) \)
\end_inset
.
Given that this is the case, our choice of
\begin_inset Formula \( \epsilon \)
\end_inset
will be at least as pessimistic as the choice defined by the Stochastic
Full Knowledge bound
\begin_inset LatexCommand \ref{Stochastic Full Knowledge}
\end_inset
.
We thus have two sources of failure: the unobservable bound lemma will
fail with probability at most
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
and the stochastic full knowledge bound will fail with probability
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
.
The union bound then implies that the Stochastic Observable Shell Bound
holds with probability
\begin_inset Formula \( 1-\delta \)
\end_inset
.
\layout Standard
The information requirements for the continuous space shell bound are even
more severe than the requirements for the discrete space shell bound.
In particular, we need to know the exact measure of the hypotheses with
any particular empirical error.
Clearly, this is unrealistic.
We can relax this requirement to observations computable in a finite amount
of time using the sampling techniques of theorem
\begin_inset LatexCommand \ref{th-sample_shell}
\end_inset
.
In particular, given a well-behaved distribution
\begin_inset Formula \( Q \)
\end_inset
, we can make a bounded estimate of the measure of hypotheses with some
empirical error rate.
Given this bounded estimate, we can then calculate a pessimistic shell
bound for the continuous space.
\layout Section
Conclusion
\layout Standard
Details for calculating the shell bound are given in appendix section
\begin_inset LatexCommand \ref{sec-shell-bound-calc}
\end_inset
.
\layout Standard
Shell bounds are easily calculable and can provide large improvements when
we can afford to enumerate the hypotheses.
Their application is less clear when it is not possible to enumerate the
hypotheses because the computational burden may become too large.
Via sampling techniques, it is possible to smoothly improve from earlier
techniques to the best achievable shell bound result in an anytime fashion.
\layout Standard
There remain several important open questions:
\layout Enumerate
Can we remove the remaining division of
\begin_inset Formula \( \delta \)
\end_inset
by
\begin_inset Formula \( m \)
\end_inset
? This would make shell bounds a bit more elegant and tight.
\layout Enumerate
Is there a natural lower bound on the true error rate which uses the shell
approach? This improvement is of principally theoretical interest.
\layout Enumerate
For the continuous space shell bounds, two additional divisions by
\begin_inset Formula \( m \)
\end_inset
were introduced.
Is it possible to remove these factors with an improved argument? This
improvement would clean up the continuous shell bound.
\layout Enumerate
Our extension to the continuous case was done in the style of PAC-Bayes
bounds, but a more common technique for extending to the continuous case
is via the use of covering numbers.
Is there a natural way to extend Shell bounds to the continuous case using
the concept of a covering number? This is another approach which might
yield a result requiring less calculation.
\layout Chapter
Tight covering number bounds
\layout Standard
\begin_inset LatexCommand \label{sec-cover}
\end_inset
\layout Section
Introduction
\layout Standard
Covering number bounds are used to bound the true error rate of classifiers
chosen from an infinite hypothesis space using
\begin_inset Formula \( m \)
\end_inset
examples
\begin_inset LatexCommand \cite{Haussler}
\end_inset
.
A cover is a finite set of hypotheses which satisfies the following property:
every hypothesis in the infinite space is
\begin_inset Quotes eld
\end_inset
near
\begin_inset Quotes erd
\end_inset
some element in the finite cover.
When a Lipschitz condition holds on the hypothesis space, it is generally
possible to construct these covers and the existence of a cover is required
for learnability
\begin_inset LatexCommand \cite{BI}
\end_inset
.
Alternatively, Sauer's lemma (see
\begin_inset LatexCommand \cite{Pollard}
\end_inset
or
\begin_inset LatexCommand \cite{Vapnik}
\end_inset
) bounds the size of the cover in terms of the VC dimension which is defined
combinatorially: the VC dimension is the largest number of examples which
the hypothesis space can classify in an arbitrary manner.
\layout Standard
The principal disadvantage of covering number results is that they are notoriously
loose, to the point that they are often useless when applied in practice
(see
\begin_inset Quotes eld
\end_inset
criticisms
\begin_inset Quotes erd
\end_inset
in
\begin_inset LatexCommand \cite{Haussler}
\end_inset
).
Here,
\begin_inset Quotes eld
\end_inset
useless
\begin_inset Quotes erd
\end_inset
means that the bound on the true error rate is
\begin_inset Quotes eld
\end_inset
always wrong
\begin_inset Quotes erd
\end_inset
.
The amount of
\begin_inset Quotes eld
\end_inset
looseness
\begin_inset Quotes erd
\end_inset
can be quantified by comparison with other bounds in the regimes where
other bounds hold.
On a finite hypothesis space we have near-perfect agreement between the
upper bound
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
and the lower upper bound
\begin_inset LatexCommand \ref{th-dhlub}
\end_inset
for independent hypotheses.
In fact, as the number of examples goes to infinity, the agreement is perfect,
regardless of the size of the hypothesis space.
When we apply covering number bounds to this problem such properties do
not arise.
Since part of the argument involves splitting the examples into
\begin_inset Formula \( 2 \)
\end_inset
sets, the difference between a covering number based upper bound and the
lower upper bound can be large even when the number of examples goes to
infinity.
In practice, the covering number bound at least squares the discrete hypothesis
size.
Obviously, further loosening of covering number bounds with Sauer's lemma
result in even worse bounds.
\layout Standard
Can we construct a calculable true error upper bound for continuous hypothesis
spaces which is at least asymptotically tight? In a sense, this has already
been done with PAC-Bayes bounds in chapter
\begin_inset LatexCommand \ref{sec-PB}
\end_inset
, but there are drawbacks in applicability to that approach since PAC-Bayes
bounds do not apply in a meaningful way to a single hypothesis drawn from
an infinite hypothesis space.
A covering number argument would hopefully apply in a meaningful way to
a single hypothesis.
A covering number bound which is asymptotically tight on
\emph on
some
\emph default
learning problems does exist and is covered next.
\layout Section
The Setting and Prior Results
\layout Standard
We will first discuss standard covering number bounds.
Define a
\begin_inset Quotes eld
\end_inset
distance
\begin_inset Quotes erd
\end_inset
in terms of how often hypotheses disagree according to:
\begin_inset Formula
\[
d_{D}(h,f)=\Pr _{D}(h(x)\neq f(x))\]
\end_inset
Now, start with an epsilon net defined by:
\begin_inset Formula
\[
N(H,\epsilon ,d_{D})=\textrm{inf}_{F}\left| F:\, \, \forall h\in H\exists f\in F:\, \, d_{D}(h,f)\leq \epsilon \right| \]
\end_inset
An epsilon net is the minimum size of a set which contains an element
\begin_inset Quotes eld
\end_inset
near
\begin_inset Quotes erd
\end_inset
to every element in
\begin_inset Formula \( H \)
\end_inset
.
\layout Standard
Then a covering number is defined as:
\begin_inset Formula
\[
C(H,\epsilon )=\textrm{sup}_{D}N(H,\epsilon ,d_{D})\]
\end_inset
The covering number is the size of the epsilon net under the worst-case
distribution.
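As an illustration, the epsilon net definition can be evaluated by brute force for a tiny finite class. A minimal Python sketch (function names hypothetical; for simplicity the net centers are restricted to elements of H, whereas the definition allows arbitrary F):

```python
from itertools import combinations

def d(h, f, xs):
    """Empirical disagreement rate d_D(h, f) under the empirical
    distribution over the points xs."""
    return sum(h(x) != f(x) for x in xs) / len(xs)

def epsilon_net_size(H, eps, xs):
    """Size of the smallest F (here restricted to subsets of H) such
    that every h in H is within eps of some f in F.  Brute force,
    only feasible for tiny H."""
    for k in range(1, len(H) + 1):
        for F in combinations(H, k):
            if all(any(d(h, f, xs) <= eps for f in F) for h in H):
                return k
    return len(H)

# Threshold classifiers h_a(x) = [x >= a] on a 10-point grid
H = [lambda x, a=a: x >= a for a in (0.0, 0.3, 0.6, 0.9)]
xs = [i / 10 for i in range(10)]
```

For these four thresholds, a single center suffices at eps = 0.9, while two centers are needed at eps = 0.3.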
\layout Theorem
\begin_inset LatexCommand \label{th-covering}
\end_inset
(Covering number bound) For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\forall H\, \, \Pr _{D^{m}}\left( e(h)\geq \hat{e}(h)+4\sqrt{\frac{\ln 4C(H,\epsilon )+\ln \frac{1}{\delta }}{m}}\right) \leq \delta \]
\end_inset
\layout Proof
In
\begin_inset LatexCommand \cite{Haussler}
\end_inset
.
\layout Standard
How tight is this bound when applied to a finite independent hypothesis
space? We can improve the constants by using an argument with fewer triangle
inequalities in the discrete case and get the following results:
\begin_inset Formula
\[
\Pr _{D}\left( e(h)-\hat{e}(h)\geq 2\sqrt{\frac{\ln 4|H|+\ln \frac{1}{\delta }}{m}}\right) \leq \delta \]
\end_inset
Comparing this with a very loose application of the discrete hypothesis
bound
\begin_inset LatexCommand \ref{th-adhscb}
\end_inset
we see that the penalty term in the covering number bound is worse by factor
of
\begin_inset Formula \( 2\sqrt{2} \)
\end_inset
.
Put another way, dividing the number of samples by
\begin_inset Formula \( 8 \)
\end_inset
or increasing the hypothesis space size to
\begin_inset Formula \( |H|^{8} \)
\end_inset
and then applying a sloppy discrete hypothesis bound is about equivalent
to applying a very specialized covering number bound.
We seek a covering number bound which does not divide the effective value
of a hypothesis by
\begin_inset Formula \( 8 \)
\end_inset
.
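The factor of
\begin_inset Formula \( 2\sqrt{2} \)
\end_inset
can be checked numerically; a sketch with hypothetical function names, comparing the two penalty terms:

```python
import math

def covering_penalty(H_size, m, delta):
    # Penalty term from the specialized covering-number display above
    return 2 * math.sqrt((math.log(4 * H_size) + math.log(1 / delta)) / m)

def discrete_penalty(H_size, m, delta):
    # Agnostic discrete-hypothesis-space penalty for comparison
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

# As ln|H| dominates ln 4 and ln(1/delta), the ratio approaches 2*sqrt(2)
r = covering_penalty(10 ** 9, 1000, 0.05) / discrete_penalty(10 ** 9, 1000, 0.05)
```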
\layout Section
Bracketing Covering Number Bound
\layout Standard
There is an alternative version of the covering number bounds which has
been little used in learning theory.
This is mentioned as the
\begin_inset Quotes eld
\end_inset
direct method
\begin_inset Quotes erd
\end_inset
in
\begin_inset LatexCommand \cite{Haussler}
\end_inset
and Section II.2 of
\begin_inset LatexCommand \cite{Pollard}
\end_inset
discusses this approach but is only concerned with asymptotic consistency
rather than rates of convergence.
\layout Standard
We start with a more restricted notion of covering---a covering in which
we bracket the hypotheses.
Let
\begin_inset Formula \( Z \)
\end_inset
be the space of
\begin_inset Formula \( (x,y) \)
\end_inset
labeled examples and
\begin_inset Formula \( h(Z)=\{(x,y):h(x)\neq y\} \)
\end_inset
in:
\begin_inset Formula
\[
N_{[]}(H,\gamma ,d_{D})=\textrm{inf}_{F}\left| F:\forall h\in H\, \, \exists (f,f')\in F:\, \, \begin{array}{cc}
d_{D}(f,f')\leq \gamma & \\
\textrm{and }f(Z)>h(Z)>f'(Z) &
\end{array}\right| \]
\end_inset
In other words,
\begin_inset Formula \( f \)
\end_inset
and
\begin_inset Formula \( f' \)
\end_inset
are two sets which are
\begin_inset Quotes eld
\end_inset
above
\begin_inset Quotes erd
\end_inset
and
\begin_inset Quotes eld
\end_inset
below
\begin_inset Quotes erd
\end_inset
every hypothesis that they cover.
Note that it is very important in this definition that the sets
\begin_inset Formula \( f \)
\end_inset
and
\begin_inset Formula \( f' \)
\end_inset
are not required to correspond to functions in
\begin_inset Formula \( H \)
\end_inset
.
In fact, they can simply be understood as arbitrary sets of
\begin_inset Formula \( (x,y) \)
\end_inset
pairs.
\layout Standard
It is immediately obvious that:
\begin_inset Formula
\[
N_{[]}(H,\gamma ,d_{D})\geq N(H,\gamma ,d_{D})\]
\end_inset
However, the relationship between
\begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \)
\end_inset
and
\begin_inset Formula \( C(H,\gamma ) \)
\end_inset
is ambiguous: either could be larger.
\layout Standard
Using this alternative notion of a covering, we have the following theorem:
\layout Theorem
\begin_inset LatexCommand \label{th-bracketing_cover}
\end_inset
Let
\begin_inset Formula \( f(h) \)
\end_inset
be the upper bracket of any hypothesis,
\begin_inset Formula \( h \)
\end_inset
.
For all
\begin_inset Formula \( \gamma \in (0,1] \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c}
e(h)\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \textrm{ }\\
\textrm{or }\hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right)
\end{array}\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \begin{array}{c}
\bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\
\textrm{and }b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\}
\end{array} \)
\end_inset
.
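Both
\begin_inset Formula \( \bar{e} \)
\end_inset
and
\begin_inset Formula \( b \)
\end_inset
are inversions of a Binomial tail and are straightforward to compute numerically. A sketch with hypothetical names, where Bin(m, K, p) denotes the probability of at most K errors in m trials as in the rest of the document:

```python
import math

def bin_cdf(m, K, p):
    """Bin(m, K, p): probability of at most K errors among m examples,
    each erring independently with probability p."""
    return sum(math.comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(K + 1))

def e_bar(m, K, delta):
    """max{p : Bin(m, K, p) = delta}, found by bisection.
    bin_cdf is decreasing in p, so the equation has a unique root."""
    lo, hi = K / m, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if bin_cdf(m, K, mid) > delta:
            lo = mid
        else:
            hi = mid
    return lo

def b(m, p, delta):
    """min{K/m : 1 - Bin(m, K, p) <= delta}, by scanning K upward."""
    for K in range(m + 1):
        if 1 - bin_cdf(m, K, p) <= delta:
            return K / m
    return 1.0
```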
\layout Standard
This theorem constrains the distance between
\begin_inset Formula \( \hat{e}(f(h)) \)
\end_inset
and
\begin_inset Formula \( e(f(h)) \)
\end_inset
(which upper bounds
\begin_inset Formula \( e(h) \)
\end_inset
) and the distance between
\begin_inset Formula \( \hat{e}(f(h)) \)
\end_inset
and
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
.
Consequently, it constrains the distance between
\begin_inset Formula \( e(h) \)
\end_inset
and
\begin_inset Formula \( \hat{e}(h) \)
\end_inset
.
The proof of the theorem rests on the following two lemmas which each assert
a convergence.
Graphically, the proof has the following form:
\layout Standard
\added_space_top 0.3cm \added_space_bottom 0.3cm \align center
\begin_inset Figure size 148 10
file bracketing_cover.eps
width 4 50
flags 9
\end_inset
\layout Lemma
Let
\begin_inset Formula \( f(h) \)
\end_inset
be the upper bracket of any hypothesis,
\begin_inset Formula \( h \)
\end_inset
.
For all
\begin_inset Formula \( \gamma \in (0,1] \)
\end_inset
, for all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\} \)
\end_inset
\layout Proof
If
\begin_inset Formula \( f(h),f'(h) \)
\end_inset
bracket
\begin_inset Formula \( h \)
\end_inset
, we have:
\begin_inset Formula
\[
e(f(h))-e(f'(h))\leq \gamma \]
\end_inset
Therefore a coin which is ``heads'' when
\begin_inset Formula \( f(h)(z)\neq f'(h)(z) \)
\end_inset
has a bias of
\begin_inset Formula \( \gamma \)
\end_inset
or less.
Since
\begin_inset Formula \( f'(h) \)
\end_inset
is correct everywhere that
\begin_inset Formula \( h \)
\end_inset
is correct, we also know:
\begin_inset Formula
\[
\forall \epsilon \, \, \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(h)\geq \epsilon )\leq \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(f'(h))\geq \epsilon )\]
\end_inset
Therefore, we have at most
\begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \)
\end_inset
Binomial distributions, each with bias at most
\begin_inset Formula \( \gamma \)
\end_inset
, and wish to bound the probability of a large deviation.
Using an upper binomial tail calculation, we get the result.
\layout Lemma
For all
\begin_inset Formula \( \delta \in (0,1] \)
\end_inset
:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists f,f'\in N_{[]}(H,\gamma ,d_{D}):\, \, e(f(h))\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \} \)
\end_inset
\layout Proof
Application of the discrete hypothesis space bound
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
for
\begin_inset Formula \( f \)
\end_inset
in every
\begin_inset Formula \( (f,f') \)
\end_inset
pair.
\layout Standard
We can now combine the lemmas to get the theorem.
\layout Proof
(of theorem
\begin_inset LatexCommand \ref{th-bracketing_cover}
\end_inset
)
\latex latex
\latex default
Allocate
\begin_inset Formula \( \frac{\delta }{2} \)
\end_inset
confidence to each lemma and use the union bound with both lemmas to get:
\begin_inset Formula
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c}
e(f(h))\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \\
\textrm{or }\hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right)
\end{array}\textrm{ }\right) \leq \delta \]
\end_inset
where
\begin_inset Formula \( \begin{array}{c}
\bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\
\textrm{and }b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\}
\end{array} \)
\end_inset
.
Since
\begin_inset Formula \( e(h)\leq e(f(h)) \)
\end_inset
by construction, the proof is complete.
\layout Standard
The alternative covering number argument has a significant advantage over
the standard argument: when restricted to a finite hypothesis space, the
argument becomes tight.
In particular, on a finite hypothesis space, we can set
\begin_inset Formula \( \gamma =0 \)
\end_inset
and get:
\begin_inset Formula \( N_{[]}(H,0,d_{D})\leq |H| \)
\end_inset
.
The bound reduces to the standard discrete hypothesis space bound
\begin_inset LatexCommand \ref{th-DHSCP}
\end_inset
.
Consequently, there exist learning problems where this bound is quite tight.
\layout Standard
The bracketing covering approach has the following advantages and disadvantages:
\layout Enumerate
Advantage: Covering defined in terms of the actual problem rather than a
worst case over all problems.
\layout Enumerate
Advantage: Asymptotically tight for some learning problems.
\layout Enumerate
Disadvantage: Less general.
There may exist spaces which are not coverable in the alternative approach.
\layout Enumerate
Disadvantage:
\begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \)
\end_inset
is more difficult to compute than
\begin_inset Formula \( N(H,\gamma ,d_{D}) \)
\end_inset
.
\layout Standard
In a sense, this bound is useless because it requires knowledge of the unknown
distribution
\begin_inset Formula \( D \)
\end_inset
in order to calculate a covering number.
In the next section, we will see that bounds on the bracketing covering
number which hold for all
\begin_inset Formula \( D \)
\end_inset
can often be found.
\layout Section
Covering number calculations
\layout Standard
It is important to demonstrate that this covering number is feasible to
calculate and gives a better answer than the traditional approach.
We will do this by first calculating the bracketing covering number for
a very simple continuous classifier and then comparing the results with
the traditional covering number approach.
\layout Standard
Bracketing covering numbers have already been proved for many function classes
\begin_inset LatexCommand \cite{Dudley}
\end_inset
\begin_inset LatexCommand \cite{VW}
\end_inset
.
Here, we will present a proof for the simplest of continuous hypothesis
spaces: the step function on a line segment.
Each hypothesis will be indexed by a number
\begin_inset Formula \( a\in [0,1] \)
\end_inset
according to:
\begin_inset Formula
\[
h_{a}(x)=\textrm{sign}(x-a)\]
\end_inset
What is
\begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \)
\end_inset
for this hypothesis set?
\layout Theorem
Assume that
\begin_inset Formula \( D \)
\end_inset
can be described by a probability density function, then:
\layout Theorem
\begin_inset Formula
\[
N_{[]}(H,\gamma ,d_{D})\leq \frac{1}{\gamma }+1\]
\end_inset
\layout Proof
Consider a range of hypotheses from
\begin_inset Formula \( h_{a} \)
\end_inset
to
\begin_inset Formula \( h_{b} \)
\end_inset
.
For this range of hypotheses, we can choose a bracketing pair,
\begin_inset Formula \( (f,f') \)
\end_inset
.
In particular, we can choose
\begin_inset Formula \( f \)
\end_inset
and
\begin_inset Formula \( f' \)
\end_inset
which agree on
\begin_inset Formula \( [0,a) \)
\end_inset
and
\begin_inset Formula \( [b,1] \)
\end_inset
and always predict either incorrectly (
\begin_inset Formula \( f \)
\end_inset
) or correctly (
\begin_inset Formula \( f' \)
\end_inset
) on
\begin_inset Formula \( [a,b) \)
\end_inset
.
The distance between these functions satisfies:
\begin_inset Formula
\[
d_{D}(f,f')=\Pr _{D}(x\in [a,b))\]
\end_inset
and every hypothesis
\begin_inset Formula \( h_{c} \)
\end_inset
in
\begin_inset Formula \( [a,b) \)
\end_inset
satisfies
\begin_inset Formula \( \forall z\, \, f(z)\geq h_{c}(z)\geq f'(z) \)
\end_inset
.
Consequently, if
\begin_inset Formula \( a \)
\end_inset
and
\begin_inset Formula \( b \)
\end_inset
are chosen appropriately, we will observe
\begin_inset Formula \( d_{D}(f,f')\leq \gamma \)
\end_inset
.
\layout Proof
If
\begin_inset Formula \( D \)
\end_inset
can be described by a probability density function, then we can simply calculate
the marginal distribution,
\begin_inset Formula \( D_{x} \)
\end_inset
, and the cumulative distribution of the marginal,
\begin_inset Formula \( F_{D}(x) \)
\end_inset
.
Now, find
\begin_inset Formula \( a_{i} \)
\end_inset
for which
\begin_inset Formula \( F_{D}(a_{i})=i\gamma  \)
\end_inset
for
\begin_inset Formula \( i<\frac{1}{\gamma } \)
\end_inset
.
Choose
\begin_inset Formula \( b_{i}=a_{i+1} \)
\end_inset
.
There are at most
\begin_inset Formula \( \frac{1}{\gamma }+1 \)
\end_inset
intervals, each with a measure (according to
\begin_inset Formula \( D \)
\end_inset
) of at most
\begin_inset Formula \( \gamma \)
\end_inset
.
Consequently, we can cover
\begin_inset Formula \( H \)
\end_inset
with
\begin_inset Formula \( \frac{1}{\gamma }+1 \)
\end_inset
pairs of
\begin_inset Formula \( (f_{i},f_{i}') \)
\end_inset
.
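The quantile construction in this proof can be sketched numerically for a distribution with a known inverse CDF (names hypothetical):

```python
import math

def bracket_endpoints(F_inv, gamma):
    """Quantile cut points a_0 = F_inv(0) < a_1 < ... as in the proof.
    F_inv is the inverse CDF of the marginal D_x; consecutive points
    bound intervals of D-measure at most gamma."""
    n = math.ceil(1 / gamma)
    # at most 1/gamma + 1 intervals between consecutive points
    return [F_inv(min(1.0, i * gamma)) for i in range(n + 1)]

# Uniform D on [0,1]: the inverse CDF is the identity
pts = bracket_endpoints(lambda u: u, 0.25)
```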
\layout Standard
Given that the bracketing cover is
\begin_inset Formula \( \frac{1}{\gamma }+1 \)
\end_inset
, we can use theorem
\begin_inset LatexCommand \ref{th-bracketing_cover}
\end_inset
to define a constraint that the true error rate must satisfy with high
probability.
Setting
\begin_inset Formula \( \gamma =\frac{1}{m-1} \)
\end_inset
, we get:
\begin_inset Formula
\[
\hat{e}(f(h))\leq \hat{e}(h)+b\left( m,\frac{1}{m-1},\frac{\delta }{2m}\right) \]
\end_inset
and
\begin_inset Formula
\[
e(h)\leq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2m}\right) \]
\end_inset
To be fair in comparison to the standard covering number approaches, we
should relax our theorem to use the Hoeffding approximation.
Note that this is a bit unfair because the first inequality is (inherently)
a highly biased Binomial with lower variance.
Relaxing to the Hoeffding bound, we get:
\begin_inset Formula
\[
\hat{e}(f(h))\leq \hat{e}(h)+\frac{1}{m-1}+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\]
\end_inset
and
\begin_inset Formula
\[
e(h)\leq \hat{e}(f(h))+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\]
\end_inset
which implies:
\begin_inset Formula
\[
e(h)\leq \hat{e}(h)+\frac{1}{m-1}+2\sqrt{\frac{\ln 2m+\ln \frac{1}{\delta }}{2m}}\]
\end_inset
Note again that we are being
\begin_inset Quotes eld
\end_inset
unfair
\begin_inset Quotes erd
\end_inset
to the new approach by using the Hoeffding approximation rather than exact
Binomial-tail bounds.
The standard covering number approach has not yet been reduced to exact
Binomial-tail bounds.
Using the standard approach, we get a covering number of
\begin_inset Formula \( C(H,\frac{1}{m-1})=m \)
\end_inset
.
This implies a bound of:
\begin_inset Formula
\[
e(h)\leq \hat{e}(h)+4\sqrt{\frac{\ln 8m+\ln \frac{1}{\delta }}{m}}\]
\end_inset
Comparing the bounds, we see that the new approach is about
\begin_inset Formula \( \frac{16}{2}=8 \)
\end_inset
times more efficient in the number of examples required to achieve a bound
on a given deviation.
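The efficiency claim can be checked numerically by finding, for each bound, how many examples are needed before its penalty term drops below a target deviation. A rough sketch with hypothetical names; doubling granularity is enough for the ratio:

```python
import math

def new_penalty(m, delta):
    # Bracketing-cover bound: 1/(m-1) + 2*sqrt((ln 2m + ln 1/delta)/(2m))
    return 1 / (m - 1) + 2 * math.sqrt(
        (math.log(2 * m) + math.log(1 / delta)) / (2 * m))

def old_penalty(m, delta):
    # Standard covering-number bound: 4*sqrt((ln 8m + ln 1/delta)/m)
    return 4 * math.sqrt((math.log(8 * m) + math.log(1 / delta)) / m)

def examples_needed(penalty, target, delta):
    m = 2
    while penalty(m, delta) > target:
        m *= 2          # power-of-2 granularity suffices for a rough ratio
    return m

ratio = (examples_needed(old_penalty, 0.1, 0.05)
         / examples_needed(new_penalty, 0.1, 0.05))
```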
\layout Subsection
Note on the Bracketing Cover proof
\layout Standard
There are several important things to note about this proof.
\layout Enumerate
We used the property that a small change in the hypothesis only affected
the prediction on a small portion of the input space.
\layout Enumerate
The bound on
\begin_inset Formula \( N_{[]} \)
\end_inset
holds for all
\begin_inset Formula \( D \)
\end_inset
with a density function, not just the
\begin_inset Formula \( D \)
\end_inset
which we happen to observe.
\layout Enumerate
The bound on
\begin_inset Formula \( N_{[]}(H,\gamma ,D) \)
\end_inset
is exactly the same as a bound on
\begin_inset Formula \( N\left( H,\frac{\gamma }{2},D\right) \)
\end_inset
.
\layout Standard
In fact, the proof can be extended to
\emph on
all
\emph default
\begin_inset Formula \( D \)
\end_inset
(even ones with point masses) at the cost of a factor of
\begin_inset Formula \( 2 \)
\end_inset
worsening and a messier argument.
Property (2) is desirable because it is often not the case that we know
the distribution
\begin_inset Formula \( D \)
\end_inset
when we wish to apply the bound.
Property (1) is an essential technique that can be used to prove other
covering number bounds for this notion of covering number.
\layout Standard
Can we show partial order covering number bounds for other classifiers?
There is a straightforward extension of the previous proof for classifiers
which consist of axis parallel intervals in
\begin_inset Formula \( R^{n} \)
\end_inset
.
More work is required to prove partial order covering numbers for the hypothesis
spaces of standard learning algorithms.
\layout Section
Conclusion and Future Work
\layout Standard
We presented an alternative covering number argument and showed that the
true error rate bounds constructed using this argument are within
\begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \)
\end_inset
of the lower bound on some learning problems.
This is a significant improvement over prior results which just bound the
ratio of the lower and upper bounds up to a constant.
We also presented a simple improvement on PAC-Bayes bounds for stochastic
classifiers which achieves a similar
\begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \)
\end_inset
difference between the lower and upper bounds.
\layout Standard
It is interesting to examine the relationship between the bracketing covering
number and the PAC-Bayes bound.
With this notion of covering number we can guarantee that all of the hypotheses
covered by the same bracketing pair have similar empirical as well as true
errors.
Thus, we can relate the error rate of an individual hypothesis to a set
of hypotheses with a significant measure --- exactly the setting where
the PAC-Bayes bound is tight.
\layout Standard
Much work remains to be done in order to fulfill a quest for quantitatively
tight learning bounds.
\layout Enumerate
Proofs on the size of the partial order covering number need to be made
for common learning algorithms.
\layout Enumerate
Can this alternate form of covering number be related to the VC dimension
or to the standard definition of covering number?
\layout Enumerate
Can we extend the class of problems for which the lower and upper bounds
differ by only
\begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \)
\end_inset
to a larger set?
\layout Chapter
Holdout bounds: Progressive Validation
\layout Standard
\begin_inset LatexCommand \label{sec-PV}
\end_inset
\layout Section
Progressive Validation Technique
\layout Standard
Progressive validation is a technique which allows you to use almost
\begin_inset Formula \( \frac{1}{2} \)
\end_inset
of the data in a holdout set for training purposes while still providing
the same guarantee as the holdout bound.
It first appeared in
\begin_inset LatexCommand \cite{Progressive}
\end_inset
and is discussed in a more refined and detailed form here.
\layout Standard
Suppose that you have a training set of size
\begin_inset Formula \( m_{\textrm{train}} \)
\end_inset
and test set of size
\begin_inset Formula \( m_{\textrm{pv}} \)
\end_inset
.
Progressive validation starts by first learning a hypothesis on the training
set and then testing on the first example of the test set.
Then, we train on the training set plus the first example of the test set and
test on the second example of the test set.
The process continues for
\begin_inset Formula \( m_{\textrm{pv}} \)
\end_inset
iterations.
Let
\begin_inset Formula \( m \)
\end_inset
abbreviate
\begin_inset Formula \( m_{\textrm{pv}} \)
\end_inset
.
Then, we have
\begin_inset Formula \( m \)
\end_inset
hypotheses,
\begin_inset Formula \( h_{1},...,h_{m} \)
\end_inset
and
\begin_inset Formula \( m \)
\end_inset
error observations,
\begin_inset Formula \( \hat{e}_{1},...,\hat{e}_{m} \)
\end_inset
.
The hypothesis output by progressive validation is the randomized hypothesis
which chooses uniformly from
\begin_inset Formula \( h_{1},...,h_{m} \)
\end_inset
and evaluates the chosen hypothesis to produce an output.
Note that this protocol is similar to those in
\begin_inset LatexCommand \cite{Littlestone89}
\end_inset
; the new contribution here is the analysis of performance.
\layout Standard
Since we are randomizing over hypotheses trained on
\begin_inset Formula \( m_{\textrm{train}} \)
\end_inset
to
\begin_inset Formula \( m_{\textrm{train}}+m_{\textrm{pv}}-1 \)
\end_inset
examples, the expected number of training examples used by a randomly chosen
hypothesis is
\begin_inset Formula \( m_{\textrm{train}}+\frac{m_{\textrm{pv}}-1}{2} \)
\end_inset
.
Given that training can exhibit phase transitions, the extra examples
can greatly improve the accuracy of the trained hypothesis.
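\layout Standard
The expected count is just the average of the training-set sizes
\begin_inset Formula \( m_{\textrm{train}},m_{\textrm{train}}+1,...,m_{\textrm{train}}+m_{\textrm{pv}}-1 \)
\end_inset
, which a quick arithmetic check confirms (the numerical values below are
illustrative only):

```python
m_train, m_pv = 100, 50
# h_i is trained on m_train + (i - 1) examples, for i = 1..m_pv
sizes = [m_train + i for i in range(m_pv)]
expected = sum(sizes) / m_pv
assert expected == m_train + (m_pv - 1) / 2  # matches the formula in the text
```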
\layout Standard
Viewed as an interactive proof of learning, the progressive validation technique
follows the protocol of figure
\begin_inset LatexCommand \ref{fig-pv-protocol}
\end_inset
.
\layout Standard
\begin_float fig
\layout Standard
\align center
\begin_inset Figure size 267 175
file thesis-presentation/progressive_validation.eps
width 4 90
flags 9
\end_inset
\layout Caption
\begin_inset LatexCommand \label{fig-pv-protocol}
\end_inset
The progressive validation protocol has a learner repeatedly commit to
a hypothesis before it is given a new example.
Based upon the test errors, a bound on the true error rate of the metahypothesis
which chooses uniformly from
\begin_inset Formula \( (h_{1},...,h_{m}) \)
\end_inset
before each evaluation is provided.
\end_float
\layout Standard
The true error rate of this randomized hypothesis will be:
\begin_inset Formula
\[
e_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}e(h_{i})\]
\end_inset
where
\begin_inset Formula \( e(h_{i})=\Pr _{D}(h_{i}(x)\neq y) \)
\end_inset
and the empirical error estimate of this randomized hypothesis will be:
\begin_inset Formula
\[
\hat{e}_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}\hat{e}_{i}\]
\end_inset
\layout Section
Variance Analysis
\layout Standard
Bounding the deviation of
\begin_inset Formula \( \hat{e}_{\textrm{pv}} \)
\end_inset
is more difficult than bounding the deviation of a holdout error.
To understand this, we can think of two games, the holdout game and the
progressive validation game.
\layout Standard
In the holdout game, your opponent chooses a bias and then nature flips
\begin_inset Formula \( m \)
\end_inset
coins with that bias.
If the deviation of the average number of heads is larger than
\begin_inset Formula \( \epsilon \)
\end_inset
, then you lose.
Otherwise, you win.
\layout Standard
In the progressive validation game, the opponent chooses the bias of
\emph on
each
\emph default
coin just before it is flipped.
The goal of the opponent remains the same, and the opponent wins if a large
deviation is observed.
\layout Standard
The progressive validation opponent is at least as strong as the holdout
opponent since the progressive validation opponent could choose the same
bias for every coin.
Nonetheless, we will see that the progressive validation opponent is
\emph on
not
\emph default
much stronger.
\layout Standard
There are two ways in which we can show that the progressive validation
opponent is not much stronger.
The first technique will show that the variance of the progressive validation
estimate is smaller than might be expected.
Then we will show that the
\emph on
deviations
\emph default
of the progressive validation opponent behave much like the deviations of
an independent opponent.
\layout Theorem
Suppose we test the progressive validation hypothesis on
\begin_inset Formula \( m_{\textrm{test}}=m_{\textrm{pv}} \)
\end_inset
additional examples.
Let
\begin_inset Formula \( \hat{e}_{\textrm{test}} \)
\end_inset
be the empirical error on these examples.
Then, we have:
\begin_inset Formula
\[
E_{(x,y)^{m_{\textrm{pv}}+m_{\textrm{test}}}\sim D^{m_{\textrm{pv}}+m_{\textrm{test}}}}(\hat{e}_{\textrm{test}}-e_{\textrm{pv}})^{2}\]
\end_inset
\begin_inset Formula
\[
\geq E_{(x,y)^{m_{\textrm{pv}}}\sim D^{m_{\textrm{pv}}}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\]
\end_inset
\layout Proof
Every example on the left hand side can be thought of as a coin with bias
\begin_inset Formula \( e_{\textrm{pv}}. \)
\end_inset
Since
\begin_inset Formula \( \hat{e}_{\textrm{test}} \)
\end_inset
is the average of
\begin_inset Formula \( m \)
\end_inset
such coins, the left hand side is the expected value of
\begin_inset Formula \( \frac{e_{\textrm{pv}}(1-e_{\textrm{pv}})}{m} \)
\end_inset
.
The right hand side is:
\begin_inset Formula
\[
E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\]
\end_inset
\begin_inset Formula
\[
=\frac{1}{m^{2}}E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i=1}^{m}(\hat{e}_{i}-e_{i})\right] ^{2}\]
\end_inset
where
\begin_inset Formula \( e_{i}=e(h_{i}) \)
\end_inset
\begin_inset Formula
\[
=\frac{1}{m^{2}}E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i\neq j}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\sum _{i=1}^{m}(\hat{e}_{i}-e_{i})^{2}\right] \]
\end_inset
\begin_inset Formula
\[
=\frac{1}{m^{2}}\left[ \sum _{i\neq j}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\sum _{i=1}^{m}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})^{2}\right] \]
\end_inset
The cross product term is:
\begin_inset Formula
\[
E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})\]
\end_inset
\layout Proof
Without loss of generality, assume that
\begin_inset Formula \( i