0 \),
\[
\Pr _{D^{m}}\left( \exists h\in H:\, e(h)\geq \hat{e}(h)+\sqrt{\frac{\ln |H|+\ln \frac{1}{\delta }}{2m}}\right) \leq \delta \]
\end{cor}
\begin{proof}
Loosen corollary (\ref{th-REDHSCP}) with the bound \( \textrm{KL}(q||p)\geq 2(q-p)^{2} \).
\end{proof}
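As a numerical illustration, the deviation term of this corollary can be computed directly (a minimal sketch; the function name is ours):

```python
import math

def hoeffding_deviation(m, H_size, delta):
    """Deviation term from the corollary: sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2.0 * m))

# e.g. 1000 examples, 100 hypotheses, confidence 1 - delta = 0.95
eps = hoeffding_deviation(m=1000, H_size=100, delta=0.05)
```

Note the \( O(1/\sqrt{m}) \) decay: quadrupling the sample size halves the deviation term.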
The agnostic form of this bound has the advantage of showing explicitly that
we are forcing the convergence (with high probability) of the empirical error
rate to the true error rate. Graphically, we are forcing every empirical error
to be near its true error, as the following picture of the connection illustrates:
\vspace{0.3cm}
{\par\centering \resizebox*{0.3\columnwidth}{!}{\includegraphics{convergence.eps}} \par}
\vspace{0.3cm}
Later bounds will have more complicated convergence conditions and correspondingly
more complicated graphs. These more complicated bounds are necessary in order
to avoid the limitations of the holdout bound. See figure \ref{fig-simple-holdout}
for a comparison with the holdout set.
\section{Lower Bounds}
It is important to study lower bounds in order to discover how much room for
improvement exists in our upper bounds.
There are two forms of lower bounds: lower bounds on the true error rate and
lower \emph{upper} bounds, which lower bound the value our upper bound could
return (given the available information). For lower bounds on the true error
rate, essentially the upper bound theorem applies with the inequality reversed.
This form of lower bound allows us to construct a two-sided confidence interval
on the location of the true error rate. Typically, such lower bounds can be
formed by simply changing the sign in the upper bound arguments.
\begin{thm}
\label{th-dhlb} (Discrete Hypothesis Lower Bound) For all hypothesis spaces,
\( H \), for all \( \delta \in (0,1] \)
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq \delta \]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end{thm}
\begin{proof}
For every individual hypothesis, we know that:
\[
\Pr _{D^{m}}(e(h)\leq \bar{e}(m,\hat{e}(h),\delta ))\leq \delta \]
Applying the union bound for every hypothesis gives us the result.
\end{proof}
\section{Lower Upper Bounds}
\label{sec-lower_upper}
The second form of lower bound is a lower bound on the upper bound (similar
to the results of \cite{AHKV}). If we can lower bound the upper bound (as a
function of its observables), then we can be confident that the upper bound
is no looser than the gap between the lower upper bound and the upper bound.
How tight is the discrete hypothesis bound (\ref{th-DHSCP})? The answer is
\emph{sometimes} tight. In particular, we can exhibit a set of learning problems
where the discrete hypothesis bound cannot be made significantly lower as a
function of the observables, \( m \), \( \delta \), \( |H| \), and \( \hat{e}(h) \).
Fix the value of these quantities; we will then construct a learning problem
for which a significantly lower upper bound cannot be stated.
Our learning problem will be defined over the input space \( X=\{0,1\}^{|H|} \).
The hypotheses will be \( h_{i}(x)=x_{i} \) where \( x_{i} \) is the \( i \)th
component of the vector \( x \). This construction allows us to vary the true
error rate of each hypothesis independent of the other hypotheses. In fact,
we can pick any true error rate for any hypothesis by simply adjusting the probability
that \( x_{i}=y \). Our learning problems can therefore be generated according
to the following algorithm:
\begin{algorithm}
\label{alg-lp} Draw\_Sample(float \( e \))
\end{algorithm}
\begin{enumerate}
\item Pick \( y\in \{0,1\} \) uniformly
\item For each \( i \), pick \( x_{i}\neq y \) independently with probability \( e \).
\end{enumerate}
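A minimal sketch of this sampling algorithm in Python (the function name and the list representation of \( x \) are illustrative only):

```python
import random

def draw_sample(e, num_hypotheses, rng=random):
    """One labeled example from the construction: y is uniform over {0,1},
    and each bit x_i disagrees with y independently with probability e,
    so each hypothesis h_i(x) = x_i has true error rate e."""
    y = rng.randrange(2)
    x = [y ^ (1 if rng.random() < e else 0) for _ in range(num_hypotheses)]
    return x, y
```

With \( e=0 \) every hypothesis is always correct; with \( e=1 \) every hypothesis is always wrong.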
By construction, the true error rate of each hypothesis will be \( e(h_{i})=e \).
Now, we can prove the following theorem:
\begin{thm}
\label{th-dhlub}(Discrete Hypothesis lower upper bound) For all true error
rates \( e(h)>\frac{\ln |H|+\ln \frac{1}{\delta }}{m} \) there exists a learning
problem and algorithm such that:
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end{thm}
Intuitively, this theorem implies that we cannot improve significantly on the
results of theorem \ref{th-DHSCP} without using extra information about our
learning problem. Some of our later results do exactly this: they use extra
information.
\begin{proof}
Using the family of learning problems implicitly defined by algorithm \ref{alg-lp},
we know that
\[
\forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq \frac{\delta }{|H|}\]
(negation)
\[
\Rightarrow \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <1-\frac{\delta }{|H|}\]
(independence)
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \forall h\, \, e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\]
(negation)
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\]
(approximation)
\[
\Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\]
\end{proof}
For small \( \delta \), we have \( 1-e^{-\delta }\simeq \delta \), which implies
that no significantly better true error bound can be stated for all
learning algorithms. In particular, the empirical risk minimization algorithm
(which chooses the hypothesis with minimal empirical error) will have a significant
probability of a large deviation between the empirical and true error.
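The approximation step in the proof relies on \( 1-x\leq e^{-x} \), which gives \( \left( 1-\frac{\delta }{|H|}\right) ^{|H|}\leq e^{-\delta } \). A quick numerical check of this step (a sketch; the helper name is ours):

```python
import math

def exact_lower(delta, H_size):
    """1 - (1 - delta/|H|)^|H|, the exact probability lower bound in the proof."""
    return 1 - (1 - delta / H_size) ** H_size

# 1 - x <= exp(-x) implies (1 - d/|H|)^|H| <= exp(-d), hence the approximation:
for d in (0.01, 0.05, 0.1):
    for H in (2, 10, 1000):
        assert exact_lower(d, H) >= 1 - math.exp(-d)
```

For small \( \delta \) the relaxed quantity \( 1-e^{-\delta } \) is itself very close to \( \delta \).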
\section{Structural Risk Minimization}
Structural Risk Minimization \cite{Vapnik} (SRM) is a technique used in the
learning theory community to avoid the difficulties associated with convergence
on hypothesis sets that are too ``large''. SRM works with a sequence of nested
hypothesis sets, \( H_{1}\subset H_{2}\subset \cdots \subset H_{l} \). For each
hypothesis set, a discrete hypothesis bound (\ref{th-DHSCP}) on the difference
between empirical and true error exists. For ``small'' hypothesis sets, this
bound may be tight while for large hypothesis sets it may be inherently loose.
However, we also expect that the best hypothesis in the hypothesis set improves
as the hypothesis set becomes larger. This naturally induces a trade-off: there
will be some hypothesis set \( H_{i} \) for which the true error bound is minimized.
We can't simply apply the discrete hypothesis bound to the meta-algorithm which
picks the algorithm (and associated hypothesis space) with the smallest true
error bound since this meta-algorithm could, potentially, output any hypothesis
in \( H_{l} \). The simplest way to retrofit the bound to include all hypothesis
sets is with a simple theorem which essentially states that we can guarantee
\emph{nearly} the same bound as would apply on the smallest hypothesis space
\( H_{i} \) containing the output hypothesis, \( h \).
\begin{thm}
\label{th-SRM} (Structural Risk Minimization) Let \( p(i) \) be some measure
across the \( l \) hypothesis sets with \( \sum _{i=1}^{l}p(i)=1 \). Then:
\[
\forall p(i):\, \, \, \Pr _{D^{m}}\left( \exists h\in H_{i}\in \{H_{1},...,H_{l}\}:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{p(i)\delta }{|H_{i}|}\right) \right) \leq \delta \]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \).
\end{thm}
\begin{proof}
Apply the union bound to the discrete hypothesis bound (\ref{th-DHSCP}).
\end{proof}
The SRM bound is slightly inefficient in the sense that the bound for all hypotheses
in \( H_{2} \) includes a bound for every hypothesis in \( H_{1} \). This
effect is typically small because the size of the hypothesis sets usually grows
exponentially, implying that the extra confidence given to a hypothesis \( h \)
in \( H_{1} \) by the bounds used on hypothesis set \( H_{2},H_{3},... \)
is small relative to the confidence given by the bound for \( H_{1} \). One
can remove this slack in the Structural Risk Minimization bound by ``cutting out''
the nested portion of each hypothesis set in the formulation of \( H_{1},...,H_{l} \).
We will call this Disjoint Structural Risk Minimization (also mentioned in \cite{MC_Journal}).
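As a sketch, the per-level complexity term of the relative-entropy relaxation of this bound, \( \frac{\ln |H_{i}|+\ln \frac{1}{p(i)\delta }}{m} \), can be computed as follows (a hypothetical helper, assuming the relaxed form of the bound; SRM then trades this penalty off against the best empirical error within each \( H_{i} \)):

```python
import math

def srm_penalty(m, H_sizes, p, delta):
    """Per-level penalty (ln|H_i| + ln(1/(p_i * delta))) / m.

    H_sizes: sizes of the nested sets H_1 c H_2 c ... c H_l.
    p: the measure p(i) across the sets, summing to 1.
    """
    assert abs(sum(p) - 1.0) < 1e-9  # p(i) must be a probability measure
    return [(math.log(h) + math.log(1.0 / (pi * delta))) / m
            for h, pi in zip(H_sizes, p)]
```

Larger hypothesis sets receive a larger penalty, making the trade-off described above explicit.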
\section{Incorporating a ``Prior''}
In constructing the discrete hypothesis bound (\ref{th-DHSCP}), it is notable
that an arbitrary choice was made: we decided to give the same error allowance
to every hypothesis. In practice, we will often wish to make this choice
differently. The next theorem is essentially a restatement of the
discrete hypothesis bound with this arbitrary choice made explicit. The same
bound using the Hoeffding approximation has appeared elsewhere \cite{BEHW}\cite{PB}.
\begin{thm}
\label{th-ORB} (Occam's Razor Bound) For all hypothesis spaces, \( H \), for
all ``priors'' \( p(h) \) over the hypothesis space, \( H \), for all \( \delta \in (0,1] \):
\[
\forall p(h)\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\delta p(h)\right) \right) \leq \delta \]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end{thm}
It is very important to notice that the ``prior'' \( p(h) \) must be selected
\emph{before} seeing the examples.
\begin{proof}
The proof again starts with the basic observation that:
\[
\Pr _{D^{m}}\left( e(h)\geq \bar{e}(m,\hat{e}(h),\delta )\right) \leq \delta \]
then, we apply the union bound in a \emph{nonuniform} manner. In particular,
we allocate confidence \( \delta p(h) \) to hypothesis \( h \). Since \( p \)
is normalized, we know that
\[
\sum _{h}\delta p(h)=\delta \]
which implies that the union bound completes the proof.
\end{proof}
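The tail inversion \( \bar{e} \) can be evaluated numerically by bisection. A minimal sketch, assuming \( \textrm{Bin}(m,k,p) \) denotes the cumulative probability of at most \( k \) errors in \( m \) trials at true error rate \( p \) (which is decreasing in \( p \), so the equation \( \textrm{Bin}(m,k,p)=\delta \) can be solved by bisection):

```python
import math

def bin_cdf(m, k, p):
    """Cumulative binomial: probability of at most k errors in m trials at rate p."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

def bin_tail_inversion(m, k, delta):
    """Largest p with bin_cdf(m, k, p) >= delta: the max_p form of e-bar.

    bin_cdf is decreasing in p, so plain bisection finds the crossing point.
    """
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if bin_cdf(m, k, mid) >= delta:
            lo = mid  # tail still heavy enough: true error could be larger
        else:
            hi = mid
    return lo
```

For example, with \( k=0 \) observed errors, \( \textrm{Bin}(m,0,p)=(1-p)^{m} \), so the inversion has the closed form \( 1-\delta ^{1/m} \).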
Once again, we can relax the Occam's Razor bound (theorem \ref{th-ORB}) with
the relative entropy Chernoff bound (\ref{eq-recb}) to get a somewhat more
tractable expression.
\begin{cor}
\label{th-reorb} (Relative Entropy Occam's Razor Bound) For all hypothesis
spaces, \( H \), for all ``priors'' \( p(h) \) over the hypothesis space,
\( H \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))\geq \frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end{cor}
\begin{proof}
approximate the Binomial tail with (\ref{eq-recb}) and solve for the minimum.
\end{proof}
The Occam's razor bound is often nonvacuous for discrete learning algorithms
such as decision lists and decision trees. The next chapter will discuss a particular
motivated choice of the measure \( p(h) \) which can lead to much tighter bounds
in practice.
The application of the Occam's Razor bound is somewhat more complicated than
the application of the test set bound. Pictorially, the protocol for bound application
is given in figure \ref{fig-training-set}.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/training_set.eps}} \par}
\caption{\label{fig-training-set} In order to apply this bound it is necessary that
the choice of ``Prior'' be made before seeing any training examples. Then,
the bound is calculated based upon the chosen hypothesis. Note that it \emph{is}
``legal'' to choose the hypothesis based upon the prior \protect\( p(h)\protect \)
as well as the empirical error \protect\( \hat{e}_{S}(h)\protect \).}
\end{figure}
Examples of the calculation of these bounds are detailed in appendix section \ref{sec-train-bound-calc}.
The ``Occam's Razor bound'' is strongly related to compression. In particular,
for any self-terminating description language, \( d(h) \), we can associate
a ``prior'' \( p(h)=2^{-d(h)} \) with the property that \( \sum _{h}p(h)\leq 1 \).
Consequently, short description length hypotheses will tend to have a tighter
convergence, and the penalty term, \( \ln \frac{1}{p(h)} \), is the description
length measured in ``nats'' (bits base \( e \)).
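The compression connection is easy to make concrete (a sketch; the helper names are ours):

```python
import math

def description_length_prior(d_bits):
    """p(h) = 2^{-d(h)} for a self-terminating description language d(h)."""
    return 2.0 ** (-d_bits)

def penalty_nats(d_bits):
    """The Occam's Razor penalty ln(1/p(h)) = d(h) * ln 2, in nats."""
    return d_bits * math.log(2)
```

A hypothesis describable in 3 bits receives prior \( 2^{-3}=0.125 \) and penalty \( 3\ln 2 \) nats.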
\part{New Techniques}
\chapter{Microchoice Bounds (the algebra of choices)}
\label{sec-MC}
The work in this chapter is joint with Avrim Blum and was first presented at
ICML \cite{MC} and then in the Machine Learning Journal \cite{MC_journal}.
The presentation here generalizes, unifies, and improves the earlier work. Microchoice
bounds can be thought of as unifying ``Self-bounding Learning Algorithms''
\cite{SB} and the Occam's razor bound, \cite{BEHW}.
The bounds of the previous chapter do not use much structure on the hypothesis
space. Yet learning algorithms induce a natural structure. In particular, many
learning algorithms work by an iterative process in which they take a sequence
of steps each from a small set of choices (small in comparison to the overall
hypothesis set size). Local optimization algorithms such as hill-climbing or
simulated annealing, for example, work in this manner. Each step in a local
optimization algorithm can be viewed as making a choice from a small set of
possible steps to take. If we take into account the number of choices made and
the size (and other properties) of each choice set, can we produce a tighter
bound? This chapter introduces microchoice bounds which use this type of information
to construct high confidence bounds on the future error rate.
The microchoice bounds can be thought of in several ways. The simple microchoice
bound (given in section \ref{sec:mc}) can be thought of as a well motivated
application of the Occam's Razor bound (\ref{th-ORB}). The idea is to use the
learning algorithm itself to define a description language for hypotheses, so
that the description length of the hypothesis actually produced gives a bound
on the estimation error. The adaptive microchoice theorem (in section \ref{sec:query})
can be thought of as a computationally tractable adaptation of Self-bounding
Learning algorithms \cite{SB}. Microchoice bounds tie together and show the
relationship between these different sample complexity bounds.
Microchoice bounds also provide insight into the nature of choices. In general,
we know that choice is ``bad'' for the purposes of creating a uniform bound
on the true error rate. The microchoice bounds give a quantitative understanding
of how much choice is ``bad''. In particular, the log of the choice space
size is the natural measure of ``badness''. This is directly related to the
log of the hypothesis space size in the discrete hypothesis bound (\ref{th-DHSCP}).
There is also an indirect relationship with information theory where the log
of the alphabet size is an important parameter for specifying the number of
bits required to send a message.
Viewed as an interactive proof of learning, microchoice bounds can be described
pictorially as in figure \ref{fig-microchoice-protocol}.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/microchoice.eps}} \par}
\caption{\label{fig-microchoice-protocol} Microchoice bounds collapse the two-round
Occam's Razor style protocol into a one round protocol. This is (essentially)
done by providing a compiler for the verifier which takes a learning algorithm
as input, and extracts a choice of ``prior''. }
\end{figure}
Important early work developing approximately self-bounding learning algorithms
was also done by Domingos \cite{Domingos}.
\section{A Motivating Observation}
\label{sec:motivating}
Imagine, for the moment, that we know the (unknown) problem distribution, \( D \).
For a given learning algorithm \( A \), a distribution \( D \) on labeled
examples induces a distribution \( q(h) \) over the possible hypotheses \( h\in H \)
produced by algorithm \( A \) after \( m \) examples\footnote{
\( q(h) \) is the probability over draws of examples and any internal randomization
of the algorithm.
}. A natural choice for the Occam's Razor bound (\ref{th-ORB}) is the measure
\( p(h)=q(h) \). Is this choice optimal? The answer is ``yes'', given the
right notion of optimal. In particular, if we start with the relative entropy
Occam's razor bound (\ref{th-reorb}), we can show that \( p(h)=q(h) \) minimizes
the expected value of the bound on the Kullback-Leibler divergence between the
empirical error and true error.
\begin{thm}
(KL divergence minimization) \( p(h)=q(h) \) minimizes the expected value of
the KL divergence in the relative entropy Occam's Razor bound (\ref{th-reorb}).
\end{thm}
\begin{proof}
We need to show that
\[
q(h)=\textrm{argmin}_{p(h)}\sum _{h}q(h)\frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\]
removing terms which the minimum does not depend on, we get:
\[
\textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{1}{p(h)}\]
adding a constant, we get:
\[
\textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{q(h)}{p(h)}\]
This is equivalent to minimizing the Kullback-Leibler divergence between the
distribution of \( q(h) \) and \( p(h) \) which is minimized for \( q(h)=p(h) \).
\end{proof}
Using the KL divergence as our notion of loss is somewhat non-intuitive. However,
it is mathematically simple and not unreasonable. For all true errors, the KL
divergence is bounded between an \( l_{1} \) and an \( l_{2} \) metric. Since
these are two of the most common metrics, a KL-divergence-based metric should
behave similarly well.
The point of these observations is to notice that if the structure of the learning
algorithm produces a choice \( p(h) \) that approximates \( q(h) \), the result
should be better estimation bounds.
\section{The Simple Microchoice Bound}
\label{sec:mc}
The simple microchoice bound is essentially a compelling and easy way to select
a measure \( p(h) \) for learning algorithms that operate by making a series
of small choices. In particular, consider a learning algorithm that works by
making a sequence of choices, \( c_{1},...,c_{d} \), from a sequence of choice
sets, \( C_{1},...,C_{d} \), finally producing a hypothesis, \( h\in H \).
Specifically, the algorithm first looks at the choice set \( C_{1} \) and the
data \( z^{m} \) to produce choice \( c_{1}\in C_{1} \). The choice \( c_{1} \)
then determines the next choice set \( C_{2} \) (different initial choices
produce different choice sets for the second level). The algorithm again looks
at the data to make some choice \( c_{2}\in C_{2} \). This choice then determines
the next choice set \( C_{3} \), and so on. These choice sets can be thought
of as nodes in a \emph{choice tree}, where each node in the tree corresponds
to some internal state of the learning algorithm, and a node containing some
choice set \( C \) has branching factor \( |C| \). Pictorially, we can draw
the tree as follows:
\resizebox*{0.91\columnwidth}{!}{\includegraphics{thesis-presentation/microchoice_tree.eps}}
Depending on the learning algorithm, sub-trees of the overall tree may be identical.
We address optimization of the bound for this case later. Eventually there is
a final choice leading to a leaf, and a single hypothesis is output.
For example, the decision list algorithm of Rivest \cite{Rivest}, applied to
a set of \( n \) features, uses the data to choose one of \( 4n+2 \) rules
(e.g., ``if \( \bar{x}_{3} \) then \( - \)'') to put at the top. Based on
the choice made, it moves to a choice set of \( 4n-2 \) possible rules to put
at the next level, then a choice set of size \( 4n-6 \), and so on, until eventually
it chooses a rule such as ``else \( + \)'' leading to a leaf.
The microchoice bound calculation program is as follows:
\begin{algorithm}
Calculate\_Microchoice
\end{algorithm}
\begin{enumerate}
\item set \( p\leftarrow 1 \)
\item while the learning algorithm has not halted:
\begin{enumerate}
\item \( |C|\leftarrow \) number of possible data-dependent choices
\item \( p\leftarrow \frac{p}{|C|} \)
\end{enumerate}
\item return \( p \)
\end{enumerate}
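This calculation can be sketched in a few lines of Python (an illustrative sketch; representing the path through the choice tree as a list of choice-set sizes is our assumption):

```python
def calculate_microchoice(choice_set_sizes):
    """p(h) = prod_i 1/|C_i(h)| along the sequence of choices actually made.

    choice_set_sizes: |C_1(h)|, ..., |C_d(h)| for the path taken by the
    learning algorithm; an empty path leaves the full supply p = 1.
    """
    p = 1.0
    for size in choice_set_sizes:
        p /= size
    return p

# decision-list example from the text with n = 3 features:
# first choice set has 4n + 2 = 14 rules, the next 4n - 2 = 10, ...
p = calculate_microchoice([14, 10])  # = 1/140
```

The learning algorithm can maintain this product as it runs, which is the central practical point of the microchoice bound.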
Pictorially, this algorithm can be thought of as taking a ``supply'' of probability
at the root of the choice tree.
\vspace{0.3cm}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/microchoice_alg.eps}} \par}
\vspace{0.3cm}
The root takes its supply and splits it equally among all its children. Recursively,
each child then does the same: it takes the supply it is given and splits it
evenly among its children, until all of the supplied probability is allocated
among the leaves. If we examine some leaf containing a hypothesis \( h \),
we see that this method gives at least probability \( p(h)=\prod _{i=1}^{d(h)}\frac{1}{|C_{i}(h)|} \)
to each \( h \) for any path of depth \( d(h) \) reaching the hypothesis \( h \).
Note it is possible that several leaves will contain the same hypothesis \( h \),
and in that case one should really add the allocated measures together. However,
the microchoice bound neglects this issue, implying that it will be unnecessarily
loose for learning algorithms which can arrive at the same hypothesis in multiple
ways. The reason for neglecting this is that now, \( p(h) \) is something the
learning algorithm itself can calculate by simply keeping track of the sizes
of the choice sets it has encountered so far. It is important to notice that
this construction is defined before observing any data. Consequently, every
hypothesis has some bound associated with it before the data is used to pick
a particular hypothesis and its corresponding bound.
Another way to view this process is that we cannot know in advance which choice
sequence the algorithm will make. However, a distribution \( D \) on labeled
examples induces a probability distribution over choice sequences, inducing
a probability distribution \( q(h) \) over hypotheses. Ideally we would like
to use \( p(h)=q(h) \) in our bounds as noted above. However, we cannot calculate
\( q(h) \) (since the distribution \( D \) is unknown), so instead, our choice
of \( p(h) \) will be just an estimate. We hope that the algorithm designer
has chosen a ``good'' learning algorithm which induces a distribution \( p(h) \)
over the final hypotheses which is near to \( q(h) \). Our estimate \( p(h) \)
is the probability distribution resulting from picking each choice uniformly
at random from the current choice set at each level (note: this is different
from picking a final hypothesis uniformly at random). I.e., it can be viewed
as the measure associated with the assumption that at each step, all choices
are equally likely.
We immediately find the following theorem:
\begin{thm}
\label{th-smb}(Microchoice Bound) For all hypothesis spaces, \( H \), for
all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),\frac{\delta }{\prod _{i=1}^{d(h)}|C_{i}(h)|})\right) \leq \delta \]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \)
\end{thm}
\begin{proof}
Specialization of the Occam's Razor bound (\ref{th-ORB}).
\end{proof}
Once again, it will be worthwhile to slightly loosen this bound with the following
corollary:
\begin{cor}
\label{th-remb} (Relative Entropy Microchoice Bound) For all hypothesis spaces,
\( H \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |C_{i}(h)|+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end{cor}
The point of the microchoice bound is that the quantity \( \bar{e}(...) \)
is something the algorithm can calculate as it goes along, based on the sizes
of the choice sets encountered. To see this, note that the hypothesis dependent
term is \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \). The quantity \( d(h) \) can
be calculated by just noting the number of choices made before the learning
algorithm terminates. The choice sets, \( C_{i}(h) \), can often be easily
deduced by reasoning about the possible microchoices the algorithm could have
made given different datasets.
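Both sides of the relative entropy microchoice bound are directly computable. A sketch (the Bernoulli KL helper clips its arguments away from \( 0 \) and \( 1 \) for numerical safety):

```python
import math

def bernoulli_kl(q, p):
    """KL(q||p) between Bernoulli(q) and Bernoulli(p), the divergence in the corollary."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def microchoice_kl_bound(m, choice_set_sizes, delta):
    """Right-hand side of the bound: (sum_i ln|C_i(h)| + ln(1/delta)) / m."""
    return (sum(math.log(c) for c in choice_set_sizes)
            + math.log(1.0 / delta)) / m
```

With high probability, `bernoulli_kl(empirical_error, true_error)` does not exceed `microchoice_kl_bound(...)`.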
In many natural cases, a ``fortuitous distribution and target concept'' corresponds
to a shallow leaf or a part of the tree with low branching, resulting in a better
bound. For instance, in the decision list case, \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \)
is roughly \( d\ln n \) where \( d \) is the length of the list produced and
\( n \) is the number of features. Notice that \( d\ln n \) is also the description
length of the final hypothesis produced in the natural encoding, thus in this
case these theorems yield similar bounds to a simple application of Occam's
razor or SRM.
More generally, the microchoice bound is similar to Occam's razor or SRM bounds
when each \( k \)-ary choice in the tree corresponds to \( \log k \) bits
in the natural encoding of the final hypothesis \( h \). However, sometimes
this may not be the case. Consider, for instance, a local optimization algorithm
in which there are \( n \) parameters and each step adds or subtracts 1 from
one of the parameters. Suppose in addition the algorithm knows certain constraints
that these parameters must satisfy (perhaps a set of linear inequalities) and
the algorithm restricts itself to choices in the legal region. In this case,
the branching factor, at most \( 2n \), might become much smaller if we are
``lucky'' and head toward a highly constrained portion of the solution space.
One could always reverse-engineer an encoding of hypotheses based on the choice
tree, but the microchoice approach is much more natural.
There is also an opportunity to use \emph{a~priori} knowledge in the choice
of \( p(h) \). In particular, instead of splitting our confidence equally at
each node of the tree, we could split it unevenly, according to some heuristic
function \( g \). If \( g \) is ``good'' it may produce error bounds similar
to the bounds when \( p(h)=q(h) \). In fact, the method of section (\ref{sec:query})
where we combine these results with Freund's query-tree \cite{SB} approach
can be thought of as an attempt to do exactly this.
\subsection{Examples}
It is difficult to create a bound which is universally better than previous
bounds. The microchoice bound can be much better than the discrete hypothesis
bound (\ref{th-DHSCP}) and can be slightly worse. To develop some understanding
of how they compare we consider several cases.
\subsubsection{Greedy Set Cover}
Consider a greedy set cover algorithm for learning an OR function over \( F \)
Boolean features. The algorithm begins with a choice space of size \( F+1 \)
(one per feature or halt) and chooses the feature that covers the most positive
examples while covering no negative ones. It then moves to a choice space of
size \( F \) (one per feature remaining or halt) and chooses the best remaining
feature and so on until it halts. If the number of features chosen is \( k \)
then the microchoice bound is:
\[
\epsilon (h)=\frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{i=1}^{k}\ln (F-i+2)\right) \leq \frac{1}{m}\left( \ln \frac{1}{\delta }+k\ln (F+1)\right) \]
The bound of (\ref{th-DHSCP}) is:
\[
\epsilon =\frac{1}{m}\left( \ln \frac{1}{\delta }+F\ln 2\right) .\]
If \( k \) is small, then the microchoice bound is much better, but if \( k=O(F) \)
then the microchoice bound is slightly worse than the discrete hypothesis bound.
Notice that in this case the microchoice bound is essentially the same as the
standard Occam's razor analysis when one uses \( O(\ln F) \) bits per feature
to describe the hypothesis.
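The two displayed expressions can be compared directly (a sketch; the function names are ours):

```python
import math

def microchoice_eps(m, F, k, delta):
    """Microchoice bound for greedy set cover:
    (1/m)(ln(1/delta) + sum_{i=1}^{k} ln(F - i + 2))."""
    return (math.log(1.0 / delta)
            + sum(math.log(F - i + 2) for i in range(1, k + 1))) / m

def discrete_eps(m, F, delta):
    """Discrete hypothesis bound for |H| = 2^F: (1/m)(ln(1/delta) + F ln 2)."""
    return (math.log(1.0 / delta) + F * math.log(2)) / m
```

For \( F=100 \), \( m=1000 \), \( \delta =0.05 \), the microchoice bound at \( k=2 \) is roughly six times smaller than the discrete hypothesis bound, while at \( k=F \) it is larger.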
\subsubsection{Decision Trees}
Decision trees over discrete sets (say, \( \{0,1\}^{F} \)) are another natural
setting for application of the microchoice bound.
A decision tree differs from a decision list in that the size of the available
choice set is larger due to the fact that there are multiple nodes where a new
test may be applied. In particular, for a decision tree with \( K \) leaves
at an average depth of \( d \), the choice set size is \( K(F-d) \), giving
a bound noticeably worse than the bound for the decision list. This motivates
a slightly different decision algorithm which considers only one leaf node at
a time. The algorithm adds a new test or decides to never add a new test at
this node. In this case, there are \( (F-d(v)+1) \) choices for a node \( v \)
at depth \( d(v) \), implying the bound:
\begin{equation}
\label{eqn:dectreemc}
\textrm{KL}(\hat{e}(h)||e(h))\leq \frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{v}\ln (F-d(v)+1)\right)
\end{equation}
where \( v \) ranges over the nodes of the decision tree. Once again, this
is very similar to what might be produced by an Occam's Razor Bound with an
appropriate choice of prior. This result is again sometimes much better than
the Discrete Hypothesis bound and sometimes slightly worse.
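The right-hand side of (\ref{eqn:dectreemc}) is again directly computable from quantities the algorithm observes (a sketch; representing the tree by the list of its node depths is our assumption):

```python
import math

def dectree_mc_bound(m, F, node_depths, delta):
    """Bound (1/m)(ln(1/delta) + sum_v ln(F - d(v) + 1)) from the text,
    where node_depths lists the depth d(v) of each node v in the tree."""
    return (math.log(1.0 / delta)
            + sum(math.log(F - d + 1) for d in node_depths)) / m
```

Shallow, small trees yield a small sum and hence a tight bound.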
\subsection{Pruning }
Decision tree algorithms for real-world learning problems often have some form
of ``pruning'' as in \cite{Quinlan} and \cite{Mingers}. The tree is first
grown to full size producing a hypothesis with minimum empirical error. Then
the tree is ``pruned'' starting at the leaves and progressing up through the
tree toward the root node using some test for the significance of an internal
node. An internal node is not significant if the reduction in total error is
small in comparison to the complexity of its children. Insignificant internal
nodes are replaced with a leaf resulting in a smaller tree.
Microchoice bounds have the property that they incidentally prove a bound for
every decision tree which can be found by pruning internal nodes. In particular,
one of the choices available when constructing a node is to make the node a
leaf. Therefore, if we begin with the tree \( T \) and then prune to the smaller
tree \( T' \), we can apply the bound (\ref{eqn:dectreemc}) to \( T' \) \emph{as
if} the algorithm had constructed \( T' \) directly rather than having gone
first through the tree \( T \). This suggests another possible pruning criterion:
prune a node if the pruning would result in an improved microchoice bound. That
is, prune if the increase in empirical error is less than the decrease in \( \epsilon (h) \).
This pruning criterion is a ``pessimistic criterion'' \cite{Yishay}.
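The pruning test just described reduces to a one-line comparison (a sketch under the assumption that the before/after empirical errors and \( \epsilon (h) \) values are supplied by the caller):

```python
def should_prune(emp_err_before, emp_err_after, eps_before, eps_after):
    """Prune when the rise in empirical error is smaller than the fall in
    the microchoice penalty eps(h), i.e. when the true error bound improves."""
    return (emp_err_after - emp_err_before) < (eps_before - eps_after)
```

For example, trading 0.01 of empirical error for a 0.05 reduction in \( \epsilon (h) \) is a prune; trading 0.08 for 0.05 is not.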
The similarities to SRM are discussed next.
\subsection{Microchoice and Structural Risk Minimization}
The microchoice bound is essentially a compelling application of the Disjoint
SRM bound (\ref{th-SRM}) where the description language for a hypothesis is the
sequence of data-dependent choices which the algorithm makes in the process
of deciding upon the hypothesis. The hypothesis set \( H_{i} \) is all hypotheses
with the same description length in this language.
As an example, consider a binary decision tree with \( F \) Boolean features
and a Boolean label. The first hypothesis set, \( H_{1} \) will consist of
\( 2 \) hypotheses: always false and always true. In general, we will have
one hypothesis set for every legal configuration of internal nodes. The size
of a hypothesis set where every tree contains \( k \) internal nodes will be
\( 2^{k+1} \) because there are \( k+1 \) leaves each of which can take \( 2 \)
values. The weighting \( p(i) \) across the different hypothesis sets is defined
by the microchoice allocation of confidence.
The principal disadvantage of the microchoice bound is that the sequence of
data-dependent choices may contain redundancy. A different SRM bound with a
different set of disjoint hypothesis sets might be able to better avoid redundancy.
As an example, assume that we are working with a decision tree on \( F \) binary
features. There are \( F+2 \) choices (any of \( F \) features or \( 2 \)
labels) at the top node. At the next node down there will be \( F+1 \) choices
in both the left and right children. Repeat until a maximal decision tree is
constructed. There will be \( \prod _{i=0}^{F}(F-i+2)^{2^{i}} \) possible trees.
This number is somewhat larger than the number of Boolean functions on \( F \)
features: \( 2^{2^{F}} \).
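As a sanity check on this counting argument, the product can be computed directly. Below is a minimal sketch (the function names are ours, not from the text):

```python
from math import prod

def num_maximal_trees(F):
    # prod_{i=0}^{F} (F - i + 2)^(2^i): at depth i there are 2**i nodes,
    # each offering F - i + 2 choices (remaining features plus two labels).
    return prod((F - i + 2) ** (2 ** i) for i in range(F + 1))

def num_boolean_functions(F):
    # Number of Boolean functions on F features: 2^(2^F).
    return 2 ** (2 ** F)
```

For \( F=2 \) the product is \( 4\cdot 3^{2}\cdot 2^{4}=576 \), which indeed exceeds \( 2^{2^{2}}=16 \), illustrating the redundancy in the choice-sequence description language.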
\section{Combining Microchoice with Freund's Query Tree approach}
\label{sec:query}
The next section is devoted to an improvement of the microchoice bound called
adaptive microchoice, which arises from synthesizing Freund's query trees \cite{SB}
with the microchoice bound. This improvement is not easily expressed as a simplification
of Structural Risk Minimization. In essence, the adaptive microchoice bound
can gain from dependence on the learning problem distribution \( D \) and can
take advantage of an ``easy'' distribution.
First we require some background material in order to state and understand Freund's
bound.
\subsection{Preliminaries and Definitions}
The statistical query framework introduced by Kearns \cite{SQ} restricts learning
algorithms to accessing the data only through statistical queries. A statistical query
takes as input a binary predicate, \( \chi \), mapping examples to a binary
output: \( (X,Y)\rightarrow \{0,1\} \). The output of the statistical query
is the average of \( \chi \) over the examples seen. Let \( z_{i} \) be the
\( i \)th labeled example, then:
\[
\hat{D}_{\chi }=\frac{1}{m}\sum _{i=1}^{m}\chi (z_{i})\]
The output is an empirical estimate of the true value \( D_{\chi }={\textbf {E}}_{D}[\chi (z)]=\Pr _{z\sim D}(\chi (z)=1) \)
of the query under the distribution \( D \)\footnote{
In the real SQ model there is no set of examples. The algorithm asks a query
\( \chi \) and is given a response \( \hat{D}_{\chi } \) that is guaranteed
to be near to the true value \( D_{\chi } \). That is, the true SQ model is
an abstraction of the scenario described here where \( \hat{D}_{\chi } \) is
computed from an observed sample.
} . One simple example of a predicate \( \chi \) is ``the first bit is \( 1 \)''.
A more complicated predicate might be ``the third bit xor the fourth bit is \( 0 \)''.
Naturally, the distribution of \( \hat{D}_{\chi } \) will be the familiar Binomial
distribution.
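The empirical answer to a statistical query is just a sample average. A minimal sketch (the predicate and sample here are illustrative, not from the text):

```python
def sq_estimate(chi, sample):
    # Empirical statistical-query answer: the average of the binary
    # predicate chi over the m observed labeled examples z_i.
    return sum(chi(z) for z in sample) / len(sample)

# Illustrative predicate: "the first feature bit is 1".
def first_bit(z):
    (x, y) = z
    return x[0]

sample = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]
```

Here \texttt{sq\_estimate(first\_bit, sample)} returns \( 0.5 \), an empirical estimate of the true value \( D_{\chi } \).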
It is convenient to define
\[
\bar{I}_{D_{\chi }}(\delta )=\frac{1}{m}\max _{k}\left\{ k:\, \, 1-\textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \]
and
\[
\underline{I}_{D_{\chi }}(\delta )=\frac{1}{m}\min _{k}\left\{ k:\, \, \textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \]
and let
\[
I_{\chi }(\delta )=\left[ \underline{I}_{D_{\chi }}(\delta ),\bar{I}_{D_{\chi }}(\delta )\right] \]
Intuitively, \( I_{\chi }(\delta ) \) is a (fixed) interval in which the random
variable \( \hat{D}_{\chi } \) will fall with high probability. In other words,
we know that:
\[
\Pr \left[ \hat{D}_{\chi }\not \in I_{\chi }(\delta )\right] \leq \delta \]
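These interval endpoints can be computed by scanning the Binomial cumulative distribution; the sketch below assumes the reading \( \textrm{Bin}(m,k,p)=\Pr (X\leq k) \) for \( X\sim \textrm{Binomial}(m,p) \), and takes \( p\in (0,1) \):

```python
from math import comb

def binom_cdf(m, k, p):
    # Bin(m, k, p): probability that a Binomial(m, p) draw is at most k.
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k + 1))

def interval(m, p, delta):
    # I_chi(delta): a fixed interval containing the empirical answer
    # with probability at least 1 - delta (delta/2 of mass cut per tail).
    lo = min(k for k in range(m + 1) if binom_cdf(m, k, p) >= delta / 2)
    hi = max(k for k in range(m + 1) if 1 - binom_cdf(m, k, p) >= delta / 2)
    return lo / m, hi / m
```

For example, \( m=100 \), \( D_{\chi }=0.5 \), \( \delta =0.05 \) yields an interval around \( 0.5 \) that captures all but at most a \( \delta \) fraction of the Binomial mass.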
Now, we want to construct a confidence interval based upon the high confidence
interval \( I_{\chi }(\delta ) \). We can do this using the inversion lemma
(\ref{lem-inversion}) to get:
\[
\bar{D}_{\chi }(\delta )=\max _{p}\left\{ p:\underline{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \]
and
\[
\underline{D}_{\chi }(\delta )=\min _{p}\left\{ p:\bar{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \]
The random interval defined here contains the ``real'' answer \( D_{\chi } \)
with high probability. In other words, we have:
\[
\Pr \left[ D_{\chi }\not \in [\underline{D}_{\chi }(\delta ),\bar{D}_{\chi }(\delta )]\right] \leq \delta \]
\subsection{Background and Summary}
Freund \cite{SB} considers choice algorithms that at each step perform a Statistical
Query on the sample, using the result to determine which choice to take. For
an algorithm \( A \), tolerance \( \alpha \) (defined next), and distribution
\( D \), Freund defines the query tree \( T_{A}(D,\alpha ) \) as the choice
tree created by considering only those choices resulting from answers \( \hat{D}_{\chi } \)
to queries \( \chi \) such that \( \left| \hat{D}_{\chi }-D_{\chi }\right| \leq \alpha \).
The idea is that if a particular predicate, \( \chi \), is true with probability
\( .9 \) (for example) on a random sample it is very unlikely that the empirical
result of the query will be \( .1 \). More generally, the chance the answer
to a given query is off by more than \( \alpha \) is at most \( 2e^{-2m\alpha ^{2}} \)
by Hoeffding's inequality. So, if the entire tree contains a total of \( |Q(T_{A}(D,\alpha ))| \)
queries in it, the probability \emph{any} of these queries is off by more than
\( \alpha \) is at most \( 2\cdot |Q(T_{A}(D,\alpha ))|\cdot e^{-2m\alpha ^{2}} \).
In other words, this is an upper bound on the probability the algorithm ever
``falls off the tree'' and makes a low probability choice. The point of this
is that we can allocate half (say) of the confidence parameter \( \delta \)
to the event that the algorithm ever falls off the tree, and then spread the
remaining half evenly on the hypotheses in the tree (which hopefully is a much
smaller set than the entire hypothesis set).
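This union-bound argument is simple enough to state numerically. A sketch (the parameter values are illustrative):

```python
from math import exp

def fall_off_bound(num_queries, m, alpha):
    # Probability that any of the |Q(T_A(D, alpha))| queries in the tree
    # is off by more than alpha: 2 * |Q| * exp(-2 * m * alpha^2),
    # by Hoeffding's inequality and the union bound.
    return 2 * num_queries * exp(-2 * m * alpha ** 2)
```

With \( m=1000 \) examples and tolerance \( \alpha =0.1 \), even a tree containing \( 50 \) queries ``falls off'' with probability below \( 10^{-6} \); shrinking \( \alpha \) weakens the guarantee.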
Unfortunately, the query tree suffers from the same problem as the \( q(h) \)
distribution considered in section (\ref{sec:motivating}), namely that to compute
it, one needs to know \( D \). So, Freund proposes an algorithmic method to
find a super-set approximation of the tree. The idea is that by analyzing the
results of queries, it is possible to determine which outcomes were unlikely
given that the query is close to the desired outcome. In particular, each time
a query \( \chi \) is asked and a response \( \hat{D}_{\chi } \) is received,
if it is true that \( |\hat{D}_{\chi }-D_{\chi }|\leq \alpha \), then the
range \( \left[ \hat{D}_{\chi }-2\alpha ,\hat{D}_{\chi }+2\alpha \right] \)
contains the range \( \left[ D_{\chi }-\alpha ,D_{\chi }+\alpha \right] \).
Thus, under the assumption that no query in the \emph{correct} tree is answered
badly, a super-set of the correct tree can be produced by exploring all choices
resulting from responses within \( 2\alpha \) of the response actually received.
By applying this method to every node in the query tree we can generate an empirically
observable super-set of the query tree: that is, the original query tree is
a pruning of the empirically constructed tree.
A drawback of this method is that it can easily take exponential time to produce
the approximate tree, because even the smaller correct tree can have a size
exponential in the running time of the learning algorithm. Instead, we would
much rather simply keep track of the choices actually made and the sizes of
the nodes actually followed, which is what the microchoice approach allows us
to do. As a secondary point, given \( \delta \), computing a good value of
\( \alpha \) for Freund's approach is not trivial (see \cite{SB}); we will
be able to finesse that issue and use the tighter bound \( \hat{D}_{\chi }\in I_{\chi }(\delta ) \).
In order to apply the microchoice approach, we modify Freund's query tree so
that different nodes in the tree receive different confidence, \( \delta \),
much in the same way that different hypotheses \( h \) in our choice tree receive
different values of \( \delta (h) \).
\subsection{Microchoice Bounds for Query Trees}
The manipulations of the choice tree are now reasonably straightforward. We
begin by describing the \emph{true} microchoice query tree and then give the
algorithmic approximation. As with the choice tree in section (\ref{sec:mc}),
one should think of each node in the tree as representing the current internal
state of the algorithm.
We incorporate Freund's approach into the choice tree construction by having
each internal node allocate a portion, \( p' \), of its ``supply'' of failure
probability to the event that \( \hat{D}_{\chi }\not \in I_{\chi }(\delta *p') \).
The node then splits the remainder of its supply evenly among the children corresponding
to choices that result from answers \( \hat{D}_{\chi } \) with \( \hat{D}_{\chi }\in I_{\chi }(\delta *p') \).
Choices that would result from ``bad'' answers to the query are \emph{pruned
away} from the tree and get nothing. This continues down the tree to the leaves.
Pictorially, this looks like:
\vspace{0.3cm}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/amicro_tree.eps}} \par}
\vspace{0.3cm}
How should \( p' \) be chosen? Smaller values of \( p' \) result in larger
intervals \( I_{\chi }(\delta *p') \) leading to more children in the pruned
tree and less confidence given to each. Larger values of \( p' \) result in
less left over to divide among the children. Unfortunately, our algorithmic
approximation (which only sees the empirical answers \( \hat{D}_{\chi } \)
and needs to be efficient) will not be able to make this optimization. Therefore,
we define \( p' \) in the \emph{true} microchoice query tree to be \( \frac{p}{d+1} \)
where \( d \) is the depth of the current node. This choice will imply that
the adaptive microchoice bound is never much worse than the Microchoice bound,
and sometimes much better.
Since a particular query value \( \hat{D}_{\chi } \) implies a particular choice
\( c \), we can think of the interval \( I_{\chi }(\delta ) \) as containing
\emph{choices} rather than query results. After all, we only care about the
choices the algorithm makes. We can calculate the probability assigned to a
hypothesis in the \emph{true} adaptive microchoice query tree according to the
following algorithm:
\vspace{0.3cm}
\begin{algorithm}
True\_Adaptive\_Microchoice(\( \delta \))
\end{algorithm}
\begin{enumerate}
\item set \( p\leftarrow 1 \)
\item set \( d\leftarrow 1 \)
\item while the learning algorithm has not halted.
\begin{enumerate}
\item \( d\leftarrow d+1 \)
\item \( p'\leftarrow \frac{p}{d} \)
\item Let \( C \) = the current set of possible data-dependent choices.
\item \( |\hat{C}|\leftarrow |\{c\in C|c\in I_{\chi }(\delta *p')\}| \)
\item \( p\leftarrow \frac{p-p'}{|\hat{C}|} \)
\end{enumerate}
\item return \( p \)
\end{enumerate}
There are two important things to note about this algorithm. First of all, we
could plug the value \( p(h) \) it returns into the Occam's Razor Bound \ref{th-ORB}
and receive a bound on the true error rate of our chosen classifier.
Second, this algorithm cannot be executed. The essential problem is determining
whether or not \( c\in I_{\chi }(\delta *p') \) which cannot be done without
knowledge of the underlying distribution \( D \). However, we \emph{can} calculate
an approximate version of this algorithm which, with high probability, returns
a value \( p \) which is smaller. Since a smaller value is pessimistic, we
can use it in our bounds.
The algorithmic approximation uses the idea in \cite{SB} of including all choices
within the \emph{double} confidence interval of the observed value \( \hat{D}_{\chi } \).
Unlike \cite{SB}, however, we do not actually create the tree; instead we just
follow the path taken by the learning algorithm, and argue that the ``supply''
probability remaining at the leaf is no greater than the amount that would have
been there in the original tree. Finally, the algorithm outputs a bound calculated
with \( p(h) \).
Specifically, the algorithm is as follows. Suppose we are at a node of the tree
containing statistical query \( \chi \) at depth \( d(\chi ) \), and we have
a supply \( p \) of confidence parameter. (If the current node is the root, then \( p=1 \)
and \( d(\chi )=1 \)). We choose \( p'=p/(d+1) \), ask the query \( \chi \),
and receive \( \hat{D}_{\chi } \). Let
\[
\bar{\bar{D}}_{\chi }(\delta )=\min _{k}\left\{ \frac{k}{m}:\, \, 1-\textrm{Bin}\left( m,k,\bar{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \]
and
\[
\underline{\underline{D}}_{\chi }(\delta )=\max _{k}\left\{ \frac{k}{m}:\, \, \textrm{Bin}\left( m,k,\underline{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \]
with
\[
\hat{I}_{\chi }(\delta )=\left[ \underline{\underline{D}}_{\chi }(\delta ),\bar{\bar{D}}_{\chi }(\delta )\right] \]
We now let \( k \) be the number of children of our node corresponding to answers
in the range \( \hat{I}_{\chi }(p') \). We then go to the child corresponding
to the answer \( \hat{D}_{\chi } \) that we received, giving this child a confidence
parameter supply of \( (p-p')/k \). This is the same as we would have given
it had we allocated \( p-p' \) to the children equally. We then continue from
that child. Finally, when we reach a leaf, we output the probability left for
the hypothesis. Pictorially, this looks like:
\vspace{0.3cm}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/amcicro_alg.eps}} \par}
\vspace{0.3cm}
Notice that the second choice set is larger than in the true adaptive microchoice
tree. This can easily happen, and it makes our results somewhat more pessimistic.
The approximate adaptive microchoice algorithm is specified as follows:
\begin{algorithm}
Approximate\_Adaptive\_Microchoice(\( \delta \))
\end{algorithm}
\begin{enumerate}
\item set \( p\leftarrow 1 \)
\item set \( d\leftarrow 1 \)
\item while the learning algorithm has not halted.
\begin{enumerate}
\item \( d\leftarrow d+1 \)
\item \( p'\leftarrow \frac{p}{d} \)
\item Let \( C \) = the current set of possible data-dependent choices.
\item \( |\hat{C}|\leftarrow |\{c\in C|c\in \hat{I}_{\chi }(\delta *p')\}| \)
\item \( p\leftarrow \frac{p-p'}{|\hat{C}|} \)
\end{enumerate}
\item return \( p \)
\end{enumerate}
Let \( d(h) \) be the depth of some hypothesis \( h \) in the empirical path
and \( \hat{C}_{1}(h) \), \( \hat{C}_{2}(h) \), \( \ldots \), \( \hat{C}_{d}(h) \)
be the sequence of choice sets resulting in \( h \) in the algorithmic construction;
i.e., \( \hat{C}_{i}(h) \) is the set of unpruned children of the \( i \)-th
node. Then, the confidence placed on \( h \) will be:
\begin{equation}
\label{eqn:adaptiveMC}
p(h)=\prod _{i=1}^{d(h)}\left( \frac{i}{i+1}\frac{1}{|\hat{C}_{i}(h)|}\right) =\frac{1}{d(h)+1}\prod _{i=1}^{d(h)}\frac{1}{|\hat{C}_{i}(h)|}
\end{equation}
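The telescoping in equation (\ref{eqn:adaptiveMC}) is easy to verify numerically. A sketch (the choice-set sizes are illustrative):

```python
from math import prod

def adaptive_confidence(choice_set_sizes):
    # Walk the path: at level i the node keeps fraction i/(i+1) of its
    # supply after paying the query, then splits evenly among |C_i| children.
    p = 1.0
    for i, size in enumerate(choice_set_sizes, start=1):
        p *= (i / (i + 1)) / size
    return p

def closed_form(choice_set_sizes):
    # Telescoped form: 1/(d+1) * prod_i 1/|C_i|.
    d = len(choice_set_sizes)
    return 1.0 / ((d + 1) * prod(choice_set_sizes))
```

For sizes \( 3,2,4 \) both forms give \( \frac{1}{4\cdot 24}=\frac{1}{96} \).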
\begin{thm}
\label{th-amb}(Adaptive Microchoice Bound) For all hypothesis spaces, \( H \),
for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),p(h)*\delta )\right) \leq \delta \]
where \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \),
and \( p(h) \) is as defined in equation \ref{eqn:adaptiveMC}.
\end{thm}
\begin{proof}
\noindent By design, with probability \( 1-\delta \) all queries in the true
microchoice query tree receive good answers, \emph{and} all hypotheses in that
tree have their true errors within their estimates.
\noindent We will prove that in the high probability case, the output of the
Approximate\_Adaptive\_Microchoice algorithm is less than the output of the
True\_Adaptive\_Microchoice algorithm. Since a smaller \( p(h) \) makes the
bound more pessimistic, we will prove the bound.
\noindent Assume inductively that at the current node of our empirical path
the supply \( p_{\mathrm{emp}} \) is no greater than the supply \( p_{\mathrm{true}} \)
given to that node in the true tree. This is clearly satisfied in the base case
when \( p_{\mathrm{emp}}=p_{\mathrm{true}}=1 \).
\noindent Under the assumption that the response \( \hat{D}_{\chi } \) falls
in the interval \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \), it must be the case
that the interval \( \hat{I}_{\chi }(p_{\mathrm{emp}}/(d+1)) \) contains the
interval \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \). Therefore, the supply given
to any child in the empirical path is no greater than the supply given in the
true tree.
\end{proof}
The corresponding relative entropy corollary is:
\begin{cor}
\label{th-reamb}(Relative Entropy Microchoice Bound) For all hypothesis spaces,
\( H \), for all \( \delta \in (0,1] \),
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |\hat{C}_{i}(h)|+\ln (d(h)+1)+\ln \frac{1}{\delta }}{m}\right) \leq \delta \]
\end{cor}
The bound in theorem (\ref{th-amb}) is very similar to (\ref{th-smb}) except
that the choice complexity is slightly worsened with the \( \ln (d(h)+1) \)
term but improved by replacing \( C_{i}(h) \) with the smaller \( \hat{C}_{i}(h) \).
\subsection{Allowing batch queries}
Most natural Statistical Query algorithms make each choice based on responses
to a \emph{set} of queries, not just one. For instance, to decide what variable
to put at the top of a decision tree, we ask \( F \) queries, one for each
feature; we then choose the feature whose answer was most ``interesting''.
This suggests generalizing the query tree model to allow each tree node to contain
a set of queries, executed in batch. Requiring each node in the query tree to
contain just a single query as in the above construction would result in an
unfortunately high branching factor just for the purpose of ``remembering''
the answers received so far.\footnote{
Consider a decision tree algorithm attempting to find the right feature for
a node. If the first query returns a value \( \hat{D}_{\chi } \) with a
confidence of \( \delta \), then the branching factor would be approximately
\( m\cdot \hat{I}_{\chi }(\delta ) \). This branching factor would be approximately
the same for further queries required by the algorithm to make a decision about
what feature to use. This results in a total multiplied choice space size of
approximately \( \left[ 4m\cdot \hat{I}_{\chi }(\delta )\right] ^{F} \) which
can be reduced to \( F \) or less using a batch query.
}
Extending the algorithmic construction to allow for batch queries is easily
done. If a node has \( q \) queries \( \chi _{1},\ldots ,\chi _{q} \), we
choose the query confidence \( \delta *p' \) as before, but we now split the
mass evenly among all \( q \) queries. We then let \( k \) be the number of
children corresponding to answers to the queries \( \chi _{1},\ldots ,\chi _{q} \)
in the ranges \( \hat{I}_{\chi _{1}}(\delta p'/q),\ldots ,\hat{I}_{\chi _{q}}(\delta p'/q) \)
respectively. We then go to the child corresponding to the answers we actually
received, and as before give the child a probability supply of \( (p-p')/k \).
Theorem (\ref{th-amb}) holds exactly as before; the only change is that \( |\hat{C}_{i}(h)| \)
means the size of the \( i \)-th choice set in the batch tree rather than the
size in the single-query-per-node tree.
\subsection{Example: Batch Queries for Decision trees}
When growing a decision tree, it is natural to make a batch of queries and then
make a decision about which feature to place in a node. The process is then
repeated to grow the full tree structure. As in the decision tree example described
in the simple microchoice section, if we have \( F \) features and are considering
adding a node at depth \( d(v) \), there are \( F-d(v)+1 \) possible features
that could be chosen for placement in a particular node. The decision of which
feature to use is made by comparing the results of \( F-d(v)+1 \) queries to
pick the best feature according to some criteria, such as information gain.
We can choose \( p'=p/(d+1) \), then further divide \( p' \) into confidences
of size \( p'/(F-d(v)+1) \), placing each divided confidence on one of the
\( F-d(v)+1 \) statistical queries. We now may be able to eliminate some of
the \( F-d(v)+1 \) choices from consideration, allowing the remaining confidence,
\( p-p' \) to be apportioned evenly amongst the remaining choices. Depending
on the underlying distribution this could substantially reduce the size of the
choice set. The best case occurs when one feature partitions all examples reaching
the node perfectly and all other features are independent of the target. In
this case the choice set will have size \( 1 \) if there are enough examples.
\subsection{Adaptive Microchoice vs.~Basic Microchoice}
The adaptive microchoice bound is a significant improvement over the simple
microchoice bound when the distribution is such that each choice is clear. For
example, consider \( F \) Boolean features and \( N=O(F) \) examples. Suppose
that one feature is identical to the label and all the rest of the features
are determined with a coin flip independent of the label.
When we apply a decision tree to a data set generated with this distribution,
what will be the resulting bound? Given enough examples, with high probability
there will only be one significant choice for the first batch query: the feature
identical to the label. The second and third batch queries, corresponding to
the children of the root feature, will also have a choice space of size \( 1 \)
with very high probability. The ``right'' choice will be the label value.
Each choice set has size \( 1 \) resulting in a complexity of \( \ln 4 \)
due to allocation of confidence to the statistical queries necessary for learning
the decision tree. \( \ln 4 \) is considerably better than \( \ln (F+2)+2\ln (F+1) \)
which the simple version of the microchoice bound provides. Note that the complexity
reduction only occurs once the number of examples \( m \) is large enough, implying
that the calculated value of \( \bar{e} \) can improve faster than (inverse) linearly
in the number of examples.
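The gap between the two complexities grows with the number of features. A sketch of the comparison (the constants come from the three-query decision tree example above):

```python
from math import log

def simple_complexity(F):
    # Simple microchoice complexity for this tree: ln(F+2) + 2 ln(F+1).
    return log(F + 2) + 2 * log(F + 1)

# Adaptive complexity in the easy case: each choice set has size 1, so
# only the ln(d+1) = ln 4 confidence-allocation term remains (d = 3).
ADAPTIVE = log(4)
```

The adaptive complexity stays constant at \( \ln 4 \) while the simple microchoice complexity grows logarithmically in \( F \).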
The adaptive microchoice bound is never much looser than the simple microchoice
bound because under the assumption that choice sets are of size at least \( 2 \),
the penalty for using the adaptive version, \( \ln d \), is always small compared
to the complexity term for the simple microchoice bound, \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \).
\subsection{Other Adaptive Microchoice}
The adaptive microchoice bound provides a simple scheme for dividing confidence
between choices and queries. There are other choices which may be useful in
some settings. Any scheme which \emph{a priori} divides the confidence between
queries and choices at every node will generally work. Here are two schemes
which may be useful:
\begin{itemize}
\item Assign a constant proportion of confidence to the query. This scheme is more
aggressive than the one used in the adaptive microchoice bounds and may result
in a lower complexity when many choices can be eliminated. The drawback is that we
no longer get the telescoping in equation (\ref{eqn:adaptiveMC}) and so the
term logarithmic in \( d(h) \) in theorem (\ref{th-reamb}) becomes linear
in \( d(h) \).
\item For a decision tree, assign a portion dependent on the depth of the node in
the decision tree that the choice set is over. It is unlikely that choices can
be eliminated from nodes far from the root because the number of examples available
at a node typically decays exponentially with the (decision tree) depth. A progressive
scheme which allocates less confidence to queries for deep nodes will probably
behave better in practice.
\end{itemize}
\subsection{Comparison with Freund's Self-Bounding algorithms}
Freund's approach for self-bounding learning algorithms can require exponentially
more computation than the microchoice approach. In its basic form, it requires
explicit construction of every path in the state space of the algorithm not
pruned in the tree. There exist some learning algorithms where this process
can be done implicitly, making the computation feasible. However, in general
this does not appear to be possible.
The adaptive microchoice bound only requires explicit construction of the size
of each subset from which a choice is made. Because many common learning algorithms
work by a process of making choices from small subsets, this is often computationally
easy. The adaptive microchoice bound does poorly, however, when Freund's query
tree has a high degree of sharing; for example, when many nodes of the tree
correspond to the same query, or many leaves of the tree have the same final
hypothesis. Allowing batch queries alleviates the most egregious examples of
this. It is also possible to interpolate between the adaptive microchoice bound
and Freund's bound by a process of conglomerating the subsets of the microchoice
bound.
\subsection{Choice Set Conglomeration }
The mechanism of choice set conglomeration is similar to the batch query technique.
It allows you to trade increased computation for a tighter bound. When starting
with the simple microchoice bound, this technique can smoothly interpolate with
the discrete hypothesis bound (\ref{th-DHSCP}). When starting with the adaptive
microchoice bound, we can interpolate with Freund's bound.
Consider a particular choice set, \( \hat{C}_{i} \), with elements \( c_{i} \).
Each \( c_{i} \) indexes another choice set, \( \hat{C}_{i+1}(c_{i}) \). If
the computational resources exist to calculate the union \( \hat{C}_{i,i+1}=\bigcup _{c_{i}\in \hat{C}_{i}}\hat{C}_{i+1}(c_{i}) \),
then \( |\hat{C}_{i,i+1}| \) can be used in place of \( |\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \)
in the adaptive microchoice bound. The conglomeration can be done repeatedly
to build large choice sets and also applies to the simple microchoice bound
(\ref{th-smb}). Conglomeration can be useful for tightening the bound when
there are multiple choice sequences leading to the same hypothesis. However,
choice set conglomeration is not always helpful because it trades away the fine
granularity of the microchoice bound. The extreme case where all choice sets
are conglomerated into one choice set and every hypothesis and query have the
same weight is equivalent to Freund's bound.
When the choices of the attached choice sets are all different, conglomeration
will have little use because the size of the union of the choice sets is the
sum of the sizes of each choice set \( |\hat{C}_{i,i+1}|=\sum _{c_{i}\in \hat{C}_{i}}|\hat{C}_{i+1}(c_{i})| \).
If the child sets each have the same size \( |\hat{C}_{i+1}| \) then this simplifies
to \( |\hat{C}_{i,i+1}|=|\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \) which results
in the same confidence applied to each choice whether conglomerating or not.
The best case for conglomeration is equivalent to the batch query case: every
sub-choice set contains the same elements. Then we have \( |\hat{C}_{i,i+1}|=|\hat{C}_{i+1}| \)
and can pay no cost for the choice set \( |\hat{C}_{i}| \).
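A sketch of conglomeration on explicit sets (the set contents are illustrative), showing the two extremes described above:

```python
def conglomerate(child_sets):
    # C_{i,i+1}: the union of the choice sets reachable from the choices
    # in C_i; its size can replace |C_i| * |C_{i+1}| in the bound.
    union = set()
    for s in child_sets:
        union |= s
    return union

# Worst case: disjoint child sets -- the union size is the sum, no gain.
disjoint = [{1, 2}, {3, 4}]
# Best case: identical child sets -- the union is one child's set.
shared = [{1, 2}, {1, 2}]
```

In the disjoint case \( |\hat{C}_{i,i+1}| \) is the sum of the child-set sizes; in the shared case it collapses to a single child's size.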
\section{Microchoice discussion}
Microchoice bounds can be used in practice and do yield results comparable with
holdout set based techniques (see figure \ref{fig-microchoice-holdout}
and the surrounding section for details). There are two significant insights
which the microchoice bounds provide.
\begin{enumerate}
\item The ``cost'' of choices is made very explicit. The cost of a choice (in terms
of sample complexity) is the log of the number of choices.
\item It is possible to improve upon the Occam's Razor Bound (theorem \ref{th-ORB})
by using information from the sample set to infer properties (such as the choice
tree) of the distribution \( D \). This can be done \emph{without} any explicit
knowledge of the distribution \( D \).
\end{enumerate}
\begin{problem}
(Open) Is there a satisfying, natural bound for the continuous case? Preliminary
work and thought by several people has occurred, but nothing has yet come of
it.
\end{problem}
\chapter{PAC-Bayes bounds}
\label{sec-PB}
The work presented here is also published in \cite{averaging_tech}.
PAC-Bayes bounds are a generalization of the Occam's razor bound for algorithms
which output a \emph{distribution} over classifiers rather than just a single
classifier. This includes the possibility of a distribution concentrated on a single
classifier, so it is a strict generalization. Most learning algorithms do not output
a distribution over base classifiers. Instead, they output either a single classifier
or an average over base classifiers. Nonetheless, PAC-Bayes bounds are interesting for several
reasons:
\begin{enumerate}
\item PAC-Bayes bounds are much tighter (in practice) than most common VC-related
\cite{Vapnik} approaches on continuous classifier spaces. This can be shown
by application to stochastic neural networks (see section \ref{SNN}) as well
as other classifiers. It can also be seen by observation: when specializing
the PAC-Bayes bound to discrete hypothesis spaces, only \( O(\ln m) \) sample
complexity is lost.
\item Due to the achievable tightness, the result motivates new learning algorithms
which strongly limit the amount of overfitting that a learning algorithm will
incur.
\item The result found here will turn out to be useful for averaging hypotheses.
\end{enumerate}
PAC-Bayes bounds were first introduced by McAllester \cite{PB}.
There are three relatively independent observations in this chapter:
\begin{enumerate}
\item A quantitative improvement of the PAC-Bayes bound by retrofitting it with the
relative entropy Chernoff bound \ref{eq-recb}. This retrofit is not as trivial as might be expected,
but it can be done. The result is the tightest known PAC-Bayes bound. In addition
to the quantitative improvements, this tightening simplifies the proof and adds
to our qualitative understanding of the bound.
\item A method for (partially) derandomizing the PAC-Bayes stochastic hypothesis.
\item A method for stochastic evaluation of the empirical error.
\end{enumerate}
The first observation is the most important. Observation (3) is important for
many practical applications because it safely avoids a (sometimes) very complicated
evaluation problem. Observation (2) is of little theoretical interest, but it
might interest some people who feel reassured when every classifier randomized
over has a low empirical error rate.
Figure \ref{fig-pb-protocol} shows what the PAC-Bayes bound looks like as an
interactive proof of learning.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/pac-bayes.eps}} \par}
\caption{\label{fig-pb-protocol} The PAC-Bayes bound can be viewed as a new style for
a proof of learning. The learner must commit to a ``Prior'' as in the Occam's
Razor Bound \ref{th-ORB} before seeing examples, but it does not commit to
a single hypothesis. Instead, it commits to a distribution over hypotheses,
\protect\( q(h)\protect \) and the bound applies to a randomization with respect
to the distribution \protect\( q(h)\protect \).}
\end{figure}
\section{PAC-Bayes Basics}
In the PAC-Bayes setting, a classifier is defined by a distribution \( q(h) \)
over the hypothesis space. Each classification is carried out according to a
hypothesis \emph{sampled} from \( q(h) \). We are interested in the gap between
the \emph{expected} generalization error \( e_{q}(h)\equiv E_{q}\left[ e(h)\right] \)
and the \emph{expected} empirical error \( \hat{e}_{q}(h)=E_{q}\left[ \hat{e}(h)\right] \),
where both expectations are taken with respect to \( q(h) \). The gap will
be parameterized by the Kullback-Leibler divergence (see \cite{Cover}). Recall
that:
\begin{equation}
\label{eq-relent}
\textrm{KL}(q||p)=E_{h\sim q(h)}\ln \frac{q(h)}{p(h)}
\end{equation}
If the support is finite, we have
\begin{equation}
\label{eq-frelent}
\textrm{KL}(q||p)=\sum _{h}q(h)\ln \frac{q(h)}{p(h)}
\end{equation}
The relative entropy is an asymmetric distance measure between probability
distributions, with \( \textrm{KL}(q||p)=0\Leftrightarrow q=p \) almost everywhere.
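For finite supports, equation (\ref{eq-frelent}) is a direct sum. A sketch (the two distributions are illustrative):

```python
from math import log

def kl(q, p):
    # KL(q || p) = sum_h q(h) ln(q(h) / p(h)), with 0 ln 0 taken as 0.
    return sum(qh * log(qh / p[h]) for h, qh in q.items() if qh > 0)

q = {"h1": 0.5, "h2": 0.5}
p = {"h1": 0.9, "h2": 0.1}
```

Note the asymmetry: \( \textrm{KL}(q||p)\neq \textrm{KL}(p||q) \) in general, while \( \textrm{KL}(q||q)=0 \).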
\begin{thm}
\label{th-pbb} (PAC-Bayes \cite{PB}) For all ``priors'' \( p(h) \) and
for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists q(h):\, \, e_{q}(h)\geq \hat{e}_{q}(h)+\sqrt{\frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }+2}{2m-1}}\right) \leq \delta \]
\end{thm}
\begin{proof}
Given in \cite{PB}.
\end{proof}
This PAC-Bayes bound is almost the same as the Occam's Razor bound (theorem
\ref{th-ORB}) when the distribution is peaked on a single hypothesis and the
Occam's razor bound is proved using the looser Hoeffding inequality. This can
be seen by noting that when \( q \) places all its mass on a single hypothesis
\( h \), the KL-divergence satisfies \( \textrm{KL}(q||p)=\ln \frac{1}{p(h)} \). Comparing with
the Occam's Razor bound, we see that a (small) extra term of size \( \frac{\ln m}{m} \)
has been introduced in return for the capability to average with respect to
\emph{any} posterior \( q(h) \). It is not yet clear whether this \( \frac{\ln m}{m} \)
term needs to be there.
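Numerically, the extra \( \frac{\ln m}{m} \) term is visible when comparing the slack of theorem (\ref{th-pbb}) with the Hoeffding-style Occam's Razor slack. A sketch (the parameter values are illustrative):

```python
from math import log, sqrt

def pac_bayes_gap(kl_qp, m, delta):
    # Slack of the PAC-Bayes bound:
    # sqrt((KL(q||p) + ln(m/delta) + 2) / (2m - 1)).
    return sqrt((kl_qp + log(m / delta) + 2) / (2 * m - 1))

def occam_gap(log_inv_prior, m, delta):
    # Hoeffding-style Occam's Razor slack:
    # sqrt((ln(1/p(h)) + ln(1/delta)) / (2m)).
    return sqrt((log_inv_prior + log(1 / delta)) / (2 * m))
```

With \( q \) peaked on one hypothesis, \( \textrm{KL}(q||p)=\ln \frac{1}{p(h)} \), and the PAC-Bayes slack is slightly larger than the Occam's Razor slack, reflecting the \( \frac{\ln m}{m} \) term.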
\begin{problem}
(Open) Remove the \( \frac{\ln m}{m} \) term from the sample complexity.
\end{problem}
The real power of the PAC-Bayes bound occurs when the average is over many hypotheses.
This might occur if the distribution \( q(h) \) is picked using Bayes law or
a Gibbs distribution. One of the most interesting aspects of the PAC-Bayes bound
is that it holds for finite \emph{and} infinite hypothesis spaces.
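To make the statement of theorem \ref{th-pbb} concrete, the deviation term can be evaluated numerically. The following is a minimal sketch (the values of \( \textrm{KL}(q||p) \), \( m \) and \( \delta \) used here are hypothetical):

```python
import math

def pac_bayes_gap(kl, m, delta):
    """Deviation term of the PAC-Bayes bound:
    sqrt((KL(q||p) + ln(m/delta) + 2) / (2m - 1))."""
    return math.sqrt((kl + math.log(m / delta) + 2.0) / (2.0 * m - 1.0))

# The gap shrinks roughly as 1/sqrt(m) and grows with KL(q||p).
for m in (100, 1000, 10000):
    print(m, pac_bayes_gap(kl=5.0, m=m, delta=0.05))
```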
\section{A Tighter PAC-Bayes Bound}
We can tighten this bound by employing a more accurate tail bound on the Binomial
distribution. The proof of this improved lemma is not as straightforward as
a simple substitution of the relative entropy Chernoff bound (\ref{eq-recb})
for the Hoeffding bound, but it can be worked out nonetheless.
\begin{thm}
\label{th-repbb} (Relative Entropy PAC-Bayes bound) For all binary loss functions,
\( l(h,(x,y)) \), for all ``priors'' \( p(h) \) and for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists q(h):\, \, \textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\geq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\right) \leq \delta \]
\end{thm}
This bound is always at least as tight as the original PAC-Bayes bound \cite{PB}
and sometimes much tighter, such as when the average empirical error is near
\( 0 \). In particular, when the average empirical error is zero (\( \hat{e}_{q}(h)=0 \))
the bound can be significantly tighter as shown in figure \ref{fig-bounds}
on page \pageref{fig-bounds}.
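Using theorem \ref{th-repbb} in practice requires inverting the binary KL-divergence: given \( \hat{e}_{q}(h) \) and the right-hand side \( B \), we want the largest \( e \) with \( \textrm{KL}(\hat{e}_{q}(h)||e)\leq B \). A bisection sketch (the numerical tolerance is an arbitrary choice):

```python
import math

def kl_binary(q, p):
    """Binary KL divergence KL(q||p) in nats."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def kl_upper_inverse(q_hat, bound, tol=1e-10):
    """Largest e >= q_hat with KL(q_hat||e) <= bound, by bisection
    (KL(q_hat||e) is increasing in e on [q_hat, 1])."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_binary(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# When the empirical error is 0, KL(0||e) = ln(1/(1-e)), so the bound
# inverts to e <= 1 - exp(-B), much tighter than the Hoeffding-style sqrt(B/2).
```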
One interesting new feature of this PAC-Bayes bound is ``dimensional consistency''\footnote{
My thanks to Patrick Haffner for pointing this out.
}. In particular, each side of the equation is an expectation of log probabilities---``nats''.
Rewriting, we get that with high probability, approximately the following holds:
\[
m\textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq \textrm{KL}(q||p)\]
There is a coding theory interpretation of KL divergence: \( \textrm{KL}(q||p)= \)
the expected number of \emph{extra} bits required to encode symbols drawn from
\( q \) given a code designed for symbols drawn from \( p \) rather than from
\( q \).
Using the coding theory interpretation of KL divergence, this says approximately:
``With high probability the number of \emph{extra} bits required to encode
the empirical errors is less than the number of \emph{extra} bits required to
encode hypotheses drawn from the posterior.''
The retrofit of the PAC-Bayes bound is accomplished by reproving a technical
lemma about distributions. The proof relies upon two lemmas. The first is Lemma
22 from \cite{PB} which is given by:
\begin{lem}
\label{lem-opt}For all \( \beta >0,K>0 \) and \( Q,P,y\in R^{n} \) satisfying
\( P_{i}>0,Q_{i}>0 \) and \( \sum _{i}Q_{i}=1 \), if
\[
\sum _{i=1}^{n}P_{i}e^{\beta y_{i}}\leq K\]
then
\[
\sum _{i=1}^{n}Q_{i}y_{i}\leq \frac{\textrm{KL}(Q||P)+\ln K}{\beta }\]
\end{lem}
\begin{proof}
Given in \cite{PB}.
\end{proof}
We will need to prove the following lemma in order to tighten the PAC-Bayes
bound. It is analogous to Lemma 17 from \cite{PB}.
\begin{lem}
\label{lem-tech}For all ``priors'' \( p(h) \), for all \( \delta ,\alpha \in (0,1) \):
\[
\Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}>\frac{1}{\alpha \delta }\right) \leq \delta \]
\end{lem}
\begin{proof}
For any given hypothesis \( h \) we will prove the following.
\begin{equation}
\label{eq-expectation}
\forall h\, \, E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }
\end{equation}
The Lemma then follows from the sequence:
\[
\Rightarrow \forall p\, \, E_{p}E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\]
\[
\Rightarrow \forall p\, \, E_{D^{m}}E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\]
\[
\Rightarrow \forall p\, \, \Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}>\frac{1}{\alpha \delta }\right) \leq \delta \]
Consequently, we need only prove equation \ref{eq-expectation}. Given the hypothesis,
we have a fixed true error rate, \( e(h) \), and the empirical error rate \( \hat{e}(h) \)
will be distributed like a Binomial. Let \( R(e(h)) \) be the random variable
with a cumulative distribution given by the relative entropy Chernoff bound
for a hypothesis with true error \( e(h) \). In other words, writing \( p=e(h) \),
define a cumulative distribution function on \( [0,1] \) according to:
\[
F(x)=e^{-m\textrm{KL}\left( x||p\right) }\]
(note that we defined \( \textrm{KL}() \) so that it is always \( 0 \) when
\( x>p \)). Note that the relative entropy Chernoff bound implies
\( R \) satisfies:
\[
\textrm{Bin}(m,mR,p)\leq e^{-m\textrm{KL}\left( R||p\right) }\]
whenever \( mR \) is integer.
Since \( e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))} \) increases monotonically
with decreasing \( \hat{e}(h) \), and \( R(e(h)) \) is stochastically smaller than
\( \hat{e}(h) \), the expectation with respect to \( R(e(h)) \) is at least as
large. In other words:
\[
E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq E_{R}e^{(1-\alpha )m\textrm{KL}(R||e(h))}\]
The probability density function of \( R \) is given by:
\[
f(x)=\left\{ \begin{array}{ll}
0 & \textrm{for}\; x\geq p\\
& \\
-m\frac{\partial \textrm{KL}(x||p)}{\partial x}e^{-m\textrm{KL}(x||p)} & \textrm{for}\; x\leq p
\end{array}\right. \]
Taking the expectation with respect to this distribution gives us:
\begin{eqnarray*}
E_{R}\left[ e^{(1-\alpha )m\textrm{KL}(R||e(h))}\right] & = & \int _{0}^{1}e^{(1-\alpha )m\textrm{KL}(x||e(h))}f(x)dx\\
& & \\
& = & \int _{0}^{e(h)}-m\frac{\partial \textrm{KL}(x||e(h))}{\partial x}e^{-\alpha m\textrm{KL}(x||e(h))}dx\\
& = & \left. \frac{1}{\alpha }e^{-\alpha m\textrm{KL}(x||e(h))}\right| _{0}^{e(h)}\\
& \leq & \frac{1}{\alpha }
\end{eqnarray*}
This finishes the proof of the technical lemma.
\end{proof}
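Since equation \ref{eq-expectation} is a statement about a Binomial expectation, it can be checked exactly for small \( m \) by summing over all outcomes. A sketch (with the KL-divergence truncated to \( 0 \) when its first argument exceeds its second, as in the lemma; the parameter values are arbitrary):

```python
import math

def kl_trunc(q, p):
    """Lower-tail KL(q||p), defined as 0 when q >= p."""
    if q >= p:
        return 0.0
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def lemma_expectation(m, e, alpha):
    """E_{D^m} exp((1-alpha) * m * KL(ê||e)) with m*ê ~ Binomial(m, e)."""
    total = 0.0
    for k in range(m + 1):
        pmf = math.comb(m, k) * e**k * (1 - e)**(m - k)
        total += pmf * math.exp((1 - alpha) * m * kl_trunc(k / m, e))
    return total

# Equation (eq-expectation) asserts this expectation is at most 1/alpha.
```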
Now, we can prove the relative entropy PAC-Bayes bound (theorem \ref{th-repbb}).
\begin{proof}
First, we can specialize lemma~\ref{lem-tech} with \( \alpha =\frac{1}{m} \)
to get that with probability \( 1-\delta \)
\[
E_{p}e^{(m-1)\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{m}{\delta }\]
Apply lemma \ref{lem-opt} with \( K=\frac{m}{\delta } \), \( \beta =m-1 \),
\( Q_{i}=q(h_{i}) \) and \( y_{i}=\textrm{KL}(\hat{e}(h_{i})||e(h_{i})) \) to get:
\[
E_{q}\textrm{KL}(\hat{e}(h)||e(h))=\sum _{i=1}^{n}Q_{i}\textrm{KL}(\hat{e}(h_{i})||e(h_{i}))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\]
Jensen's inequality gives us:
\[
\textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq E_{q}\textrm{KL}(\hat{e}(h)||e(h))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\]
which proves the theorem for the finite case. For the infinite case, a sequence
of limits can be defined just as in \cite{PB}.
\end{proof}
\section{PAC-Bayes Approximations}
\subsection{Approximating the empirical error}
\label{pb-approx}
In practice, it is not always easy to calculate some of the observable variables
in the PAC-Bayes bound. In particular, \( \hat{e}_{q}(h) \) is not necessarily
easy to calculate when \( q \) is some continuous distribution. We can avoid
the need for a direct evaluation by a Monte Carlo evaluation and a bound on
the tail of the Monte Carlo evaluation. Let \( \hat{e}_{\hat{q}}(h)\equiv \Pr _{\hat{q},S}(h(x)\neq y) \)
be the observed rate of failure of \( n \) random hypotheses drawn according
to \( q(h) \) and applied to a random training example. Once again, we have
a familiar Binomial distribution. Direct calculation will give us:
\begin{thm}
(Monte Carlo Sampling Bound) For all \( \delta \in (0,1] \)
\[
\Pr _{q^{n}}\left( \, \, \hat{e}_{q}(h)>\bar{e}\left( n,\hat{e}_{\hat{q}}(h),\delta \right) \right) \leq \delta \]
where \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(n,k,p)\geq \delta \} \)
\end{thm}
\begin{proof}
Observe that the Monte Carlo estimate is distributed like a Binomial distribution
and apply the Binomial Tail bound.
\end{proof}
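A sketch of the inversion \( \bar{e} \), assuming the convention \( \textrm{Bin}(n,k,p)=\Pr (X\leq k) \) for \( X\sim \textrm{Binomial}(n,p) \); the search tolerance is an arbitrary choice:

```python
import math

def bin_tail(n, k, p):
    """Bin(n, k, p) = Pr(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def e_bar(n, k, delta, tol=1e-10):
    """max{p : Bin(n,k,p) >= delta} by bisection (Bin decreases in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bin_tail(n, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, observing \( 0 \) failures among \( n \) draws gives \( \bar{e}(n,0,\delta )=1-\delta ^{1/n} \), which the bisection reproduces.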
In order to calculate a bound on the expected true error rate, we can first
bound the expected empirical error rate \( \hat{e}_{q}(h) \) with confidence
\( \frac{\delta }{2} \) then bound the expected true error rate \( e_{q}(h) \)
with confidence \( \frac{\delta }{2} \), using our bound on \( \hat{e}_{q}(h) \).
Since the total probability of failure is only \( \frac{\delta }{2}+\frac{\delta }{2}=\delta \)
our bound will hold with probability \( 1-\delta \).
\subsection{Derandomizing the PAC-Bayes bound}
It is sometimes desirable to derandomize the PAC-Bayes bound. There are several
ways to do this. The next chapter will talk about replacing the randomization
over \( q(h) \) with a thresholded average. Another technique is to simply
pick a hypothesis according to \( q(h) \). While this would probably be effective
in practice, the theoretical guarantees that can be made for this technique
are weak. Strong theoretical guarantees \emph{can} be made for a similar technique.
Suppose we make \( n \) draws from \( q(h) \). Let the drawn hypotheses be
\( \{h_{1},...,h_{n}\} \). We can form a new distribution \( \hat{q}(h) \)
which is uniform over the \( n \) draws. The true error rate of this distribution
can be bound with high probability according to the following theorem.
\begin{thm}
(PAC-Bayes Derandomization) For all \( \delta \in (0,1] \)
\[
\Pr _{q^{n}}\left( \, \, e_{\hat{q}}(h)>\bar{e}\left( n,e_{q}(h),\delta \right) \right) \leq \delta \]
where \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(n,k,p)\geq \delta \} \)
\end{thm}
\begin{proof}
Observe that \( e_{\hat{q}}(h) \) is distributed like a Binomial (scaled by
\( \frac{1}{n} \)) with mean \( e_{q}(h) \), and apply the Binomial Tail bound.
\end{proof}
Note that this theorem and the last theorem are essentially the same theorem.
This theorem allows us to do an (incomplete) derandomization. Instead of drawing
from \( q(h) \) in order to evaluate an input, we can draw from \( \hat{q}(h) \)
which requires a fixed finite number of bits. This may allow for more efficient
algorithms, and some people may find it reassuring that every hypothesis in
\( \{h_{1},...,h_{n}\} \) has a low empirical error. The same confidence splitting
trick of the last section can be used in order to guarantee \( e_{q}(h) \)
is bounded and \( e_{\hat{q}}(h) \) is bounded given that \( e_{q}(h) \) is
bounded.
It is worth mentioning that \emph{no} assumption of independence applies to
either this theorem or the last theorem, since we explicitly control (and create)
the independence ourselves. The preconditions of these theorems are fully verifiable.
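A minimal sketch of this (incomplete) derandomization, using a hypothetical two-hypothesis posterior and hypothetical true error rates:

```python
import random

def derandomize(posterior, n, rng):
    """Draw n hypotheses from q(h) and return the uniform distribution
    over the draws (with multiplicity)."""
    hs, ws = zip(*posterior)            # hypothetical (hypothesis, weight) pairs
    draws = rng.choices(hs, weights=ws, k=n)
    return [(h, 1 / n) for h in draws]

def expected_error(distribution, true_error):
    """e_q(h) = E_q[e(h)] for a discrete distribution over hypotheses."""
    return sum(w * true_error[h] for h, w in distribution)

rng = random.Random(0)
q = [("h1", 0.7), ("h2", 0.3)]          # hypothetical posterior q(h)
errs = {"h1": 0.10, "h2": 0.40}         # hypothetical true error rates
q_hat = derandomize(q, n=1000, rng=rng)
# expected_error(q_hat, errs) concentrates near expected_error(q, errs) = 0.19
```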
\section{Application of the PAC-Bayes bound}
The goal of this chapter is making PAC-Bayes bounds more applicable. This is
done by tightening the analysis from a Hoeffding-like to a Chernoff-like statement,
and by noting that we can use Monte Carlo evaluation to safely bound the stochastic
empirical error rate quickly.
In work detailed in chapter \ref{SNN}, results for the application of PAC-Bayes
bounds to stochastic neural networks are presented. PAC-Bayes bounds are one
of very few approaches capable of producing nonvacuous learning theory bounds
on continuous valued classifiers for real-world problems.
\chapter{Averaging Bounds (Improved margin)}
\label{sec-average}
The work in this chapter is joint with Matthias Seeger and Nimrod Megiddo. It
was first presented at ICML \cite{averaging_icml}.
Averaging bounds are specialized for averaging classifiers. An average has the
form
\[
f(x)=\int q(h)h(x)dh\textrm{ or }f(x)=\sum _{i=1}^{N}q(h_{i})h_{i}(x)\]
where \( q(h_{i})\geq 0 \) and \( \int q(h)dh=1 \) or \( \sum _{i=1}^{N}q(h_{i})=1 \). Averaging classifiers
have the form:
\[
c(x)=\textrm{sign}\left( f(x)\right) \]
Averaging bounds are especially interesting because there are many learning
algorithms which use averaging. These techniques include:
\begin{enumerate}
\item Boosting \cite{Boosting}
\item Bayes-Optimal learning (see section 6.7 of \cite{Tom})
\item Support Vector Machines \cite{SVM}
\item Bagging \cite{Bagging}
\item Maximum Entropy classification \cite{ME}
\end{enumerate}
Viewed as an interactive proof of learning (see figure \ref{fig-averaging-protocol})
the bound presented here is almost the same as the PAC-Bayes bound except that
it applies to the \emph{average} over the posterior rather than to stochastic
choices over the posterior.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/averaging.eps}} \par}
\caption{\label{fig-averaging-protocol} For the averaging bound, the learner must
commit to some measure \protect\( p(h)\protect \), receive examples, and then
commit to another measure \protect\( q(h)\protect \). The true error bound
applies to the \emph{average} over \protect\( q(h)\protect \) rather than to stochastic
choices as in the PAC-Bayes bound (section \ref{sec-PB}).}
\end{figure}
The bound in this section is a qualitative improvement on prior results for
averaging bounds. For the averaging learning algorithms listed above, the form
of the improvements is most interesting for Maximum Entropy Classification and
Bayes-Optimal classification. All (currently known) specialized averaging bounds
use as a parameter the ``margin''. For this section only, suppose the label
and the hypotheses have value \( -1 \) or \( 1 \). (\( y\in \{-1,1\} \) and
\( h(x)\in \{-1,1\} \)). Then, the margin will be defined as
\[
t(x,y)=yf(x)\]
Some simple observations are immediate:
\begin{enumerate}
\item The margin is bounded. \( t(x,y)\in [-1,1] \)
\item If the classifier is correct, the margin is positive. \( c(x)=y\Rightarrow t(x,y)\in (0,1] \)
\end{enumerate}
The \emph{error} at some margin is the quantity actually used in the averaging
bounds. The empirical error at margin \( \theta \) is defined by:
\[
\hat{e}_{\theta }(c)=\Pr _{S}(t(x,y)\leq \theta )\]
and the true error at margin \( \theta \) is defined by:
\[
e_{\theta }(c)=\Pr _{D}(t(x,y)\leq \theta )\]
Note that \( e_{0}(c)=e_{D}(c) \) (the true error rate of \( c \)).
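For a finite average, the margin and the empirical error at margin \( \theta \) are straightforward to compute. A sketch with hypothetical \( \pm 1 \)-valued threshold classifiers on the real line:

```python
def margin(weights, hypotheses, x, y):
    """t(x,y) = y * f(x) with f(x) = sum_i q(h_i) h_i(x), h_i(x) in {-1,+1}."""
    f = sum(q * h(x) for q, h in zip(weights, hypotheses))
    return y * f

def empirical_error_at_margin(weights, hypotheses, sample, theta):
    """ê_theta(c) = fraction of (x,y) in S with t(x,y) <= theta."""
    return sum(margin(weights, hypotheses, x, y) <= theta
               for x, y in sample) / len(sample)

# Three hypothetical threshold classifiers, uniformly weighted.
hs = [lambda x, t=t: 1 if x > t else -1 for t in (0.0, 0.5, 1.0)]
w = [1 / 3, 1 / 3, 1 / 3]
S = [(-1.0, -1), (0.25, -1), (0.75, 1), (2.0, 1)]  # hypothetical sample
```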
The ``margin'' is a useful way to parameterize our learning algorithms. It
will turn out that the sample complexity will be low (and the guarantees we
can make better) when the margin of most of the training examples is large.
There is, however, a price associated with using the margin: some hypotheses
have no notion of a margin. Thus the theory in this chapter is less general
than that presented elsewhere.
\section{Earlier Results}
\label{sec-earlier}
The improved averaging bound arises from improving one critical step in the
proof of the original margin bound \cite{Margin} which is stated next.
\begin{thm}
(Margin Bound \cite{Margin}) \label{th-margin} For all \( \delta \in (0,1] \),
for all base hypothesis spaces, \( H \),
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)>\hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\ln |H|}{\theta ^{2}}\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \]
\end{thm}
\begin{proof}
Given in \cite{Margin}. A simplification of the improved averaging bound proof.
\end{proof}
Here, the notation \( b(m)=O(a(m)) \) means there exists a constant \( C \)
such that \( b(m)\leq C\cdot a(m) \) for all \( m \). This margin bound implies
that if most training examples have a large margin \( \theta \) (i.e. \( t(x,y)>\theta \)
for most \( (x,y)\in S \)) and the hypothesis space is not too large, then
the generalization error cannot be large. This theorem can only be non-vacuous
when the base hypothesis space is finite. There are various extensions (see
\cite{Margin}) of this bound for continuous hypothesis spaces based upon VC
dimension and covering number techniques. However, the extensions tend to result
in extremely loose guarantees and are not relevant to the discussion here. One
of the advantages of the improved averaging bound is that it \emph{can} apply
in a non-vacuous way to infinite hypothesis spaces. This generalization comes
about with essentially zero loosening of the underlying bound.
\section{A generalized averaging bound}
Before discussing the main theorem, it is important to notice that the averaging
classifier, \( c(x) \) \emph{implies} a distribution over the base hypothesis
space \( H \). This implied distribution is \( q_{c}(h) \) where
\[
c(x)=\textrm{sign}(\int h(x)q_{c}(h)dh)\]
The distribution \( q_{c} \) is used in the following theorem.
\begin{thm}
\label{th-averaging} (Relative Entropy Averaging Theorem) For all distributions
\( p(h) \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, \textrm{KL}\left( \hat{e}_{\theta }(c)||e(c)\right) \geq O\left( \frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}\right) \right) \leq \delta \]
\end{thm}
\begin{proof}
Given in the next section.
\end{proof}
The main theorem uses a KL-divergence based pseudodistance which is a bit hard
to understand intuitively. In order to gain intuition, we can relax the tightness
of the proof with an inequality.
\[
\textrm{KL}(\hat{e}_{\theta }(c)||e(c))\geq 2(\hat{e}_{\theta }(c)-e(c))^{2}\]
This relaxation gives us an immediate corollary.
\begin{cor}
(Relative Entropy Averaging Theorem) For all distributions \( p(h) \), for all
\( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)\geq \hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \]
\end{cor}
This theorem improves upon theorem \ref{th-margin} because \( \textrm{KL}(q_{c}||p) \)
is used instead of \( \ln |H| \). For the case of a uniform distribution on
\( |H| \) different base classifiers, these results will agree when the average
is over just one classifier. As the average becomes ``broader'' the results
will improve. In the limit when the average is over nearly all classifiers,
the term \( \textrm{KL}(q_{c}||p) \) will be nearly \( 0 \).
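The comparison with \( \ln |H| \) can be checked directly: for a uniform ``prior'' over \( |H| \) base classifiers and \( q_{c} \) uniform over a subset of size \( k \), \( \textrm{KL}(q_{c}||p)=\ln (|H|/k) \). A sketch:

```python
import math

def kl_discrete(q, p):
    """KL(q||p) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def uniform_subset_kl(H, k):
    """KL between uniform-on-k-classifiers and the uniform-on-H prior."""
    p = [1 / H] * H
    q = [1 / k] * k + [0.0] * (H - k)
    return kl_discrete(q, p)

# k = 1 recovers ln|H| (the margin-bound complexity);
# k = |H| gives 0 (averaging over everything costs nothing).
```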
The theorems are stated in an asymptotic fashion which is not very useful
in practical applications. Section \ref{sec-tighten} gives some ideas of how
to tighten the result, and the non-asymptotic form (\ref{eq-finalbound}) given
at the end of the proof can be used directly in practice.
The improved averaging bound applies to averages over continuous hypothesis
spaces. In this setting, the average needs to be an integral over an uncountably-infinite
set of hypotheses or the KL-divergence will not converge to a finite value.
It is exactly because of this limitation that the improvements of this bound
are most applicable to Bayes Optimal and Maximum Entropy classifiers.
In practice, the limitation may not be a significant problem because machine
learning algorithms over large hypothesis spaces typically have some parameter
stability. In other words, a small shift in the parameters of the learned model
produces a small change in the prediction of the hypothesis. With hypothesis
stability, we can convert any average over a finite set of hypotheses into an
average over an infinite set of hypotheses without significantly altering the
predictions of the average. This technique is explored in chapter \ref{SNN}
with positive results.
\section{Proof of main theorem}
\label{sec-main-proof}
\subsection{Definitions and observations}
The proof has the same structure as the original margin bound (\ref{th-margin})
proof with one step replaced by the application of the relative entropy PAC-Bayes
theorem (\ref{th-repbb}).
Let \( N \) be any natural number; later, the choice of \( N \) will be optimized.
In the first part of the proof, we regard \( \theta \) and \( N \) as fixed.
Later we generalize this so that they may depend on the sample \( S \).
We construct the distribution \( Q_{N} \) as follows. Draw \( N \) hypotheses
\( h_{i}\sim q_{c}(h) \) independently. \( Q_{N} \) may therefore be viewed
as the product distribution
\begin{equation}
q_{c}(h)^{N}.
\end{equation}
Given the \( h_{i} \) we define
\begin{equation}
\label{eq-sample}
g(x)=\frac{1}{N}\sum _{i=1}^{N}h_{i}(x)
\end{equation}
For fixed \( x,\, y \), the values \( yh_{i}(x) \) are i.i.d. \( \pm 1 \)-valued
variables with mean equal to the expected margin:
\begin{equation}
E_{h\sim q_{c}}\left[ yh(x)\right] =yf(x)
\end{equation}
Therefore, \( \forall x,y\, \, E_{g\sim Q_{N}}[yg(x)]=yf(x) \) and \( yg(x) \)
is distributed according to the familiar Binomial distribution. Later, we will
apply a Binomial tail bound on this quantity.
Before writing out the proof mathematically, it is helpful to consider a graphical
view of what we will prove. We will force convergence between three quantities.
\vspace{0.3cm}
{\par\centering \resizebox*{0.5\columnwidth}{!}{\includegraphics{averaging.eps}} \par}
\vspace{0.3cm}
The convergences are then tied together with triangle inequalities resulting
in the overall proof. The critical improvement of this paper is a refined version
of the second convergence.
\subsection{The Proof}
For every \( \theta \in (0,1] \) and for every (fixed) \( g \), the following
simple inequality holds:
\begin{equation}
\label{eq-simple}
e(f)=\Pr _{D}[yf(x)\leq 0]
\end{equation}
\[
=\Pr _{D}[yg(x)\leq \frac{\theta }{2},\, yf(x)\leq 0]+\Pr _{D}[yg(x)>\frac{\theta }{2},\, yf(x)\leq 0]\]
\[
\leq \Pr _{D}[yg(x)\leq \frac{\theta }{2}]+\Pr _{D}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\]
Note that the left-hand side does not depend on \( g \). By taking the expectation
over \( g\sim Q_{N} \) (and exchanging the order of expectations in the second
term on the right-hand side), we arrive at
\begin{equation}
e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ P_{g\sim Q_{N}}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\right] .
\end{equation}
For the rightmost term, we can apply a Binomial tail bound to get:
\begin{equation}
e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ 1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1}{2}\right) \right]
\end{equation}
\begin{equation}
=E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1}{2}\right)
\end{equation}
We would like to apply the PAC-Bayes theorem (\ref{th-repbb}) to the right-hand
term. Here we use the loss function \( I_{\{yg(x)\leq \theta /2\}} \). The
PAC-Bayes theorem applies for any ``prior'' distribution. We use as the ``prior''
the distribution \( P_{N}=p(h)^{N} \) where \( p(h) \) is the ``prior''
over the base hypothesis space. The choice of this prior allows us to use the
identity:
\[
\textrm{KL}(Q_{N}||P_{N})=N\textrm{KL}(q(h)||p(h))\]
It follows from theorem \ref{th-repbb} that with probability at least \( 1-\delta \)
over random choices of \( S \), for every \( Q \),
\begin{equation}
\label{eq-usemcall}
\textrm{KL}\left( \hat{e}_{\frac{\theta }{2}}(g)||e_{\frac{\theta }{2}}(g)\right) \leq \frac{\textrm{KL}(Q_{N}||P_{N})+\ln \frac{m}{\delta }}{m-1}
\end{equation}
where \( e_{\frac{\theta }{2}}(g)=1 \) with probability \( E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] \)
and \( 0 \) otherwise, and \( \hat{e}_{\frac{\theta }{2}}(g)=1 \) with probability
\( E_{g\sim Q_{N}}\left[ \Pr _{S}[yg(x)\leq \frac{\theta }{2}]\right] \) and
\( 0 \) otherwise.
By the same argument as in (\ref{eq-simple}), we get:
\begin{equation}
\hat{e}_{\frac{\theta }{2}}(g)=\Pr _{S}[yg(x)\leq \frac{\theta }{2}]\leq \Pr _{S}[yg(x)\leq \frac{\theta }{2},\; yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ]
\end{equation}
\begin{equation}
\leq \Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ]
\end{equation}
\begin{equation}
=\Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\hat{e}_{\theta }(f)
\end{equation}
Again, we take expectations over \( g\sim Q_{N} \) on both sides, interchange
the order of the expectations and apply the Binomial tail bound to get:
\begin{equation}
\label{eq-secondhoeff}
E_{S}\left[ P_{g\sim Q_{N}}[yg(x)\leq \frac{\theta }{2}]\right] \leq \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] .
\end{equation}
Combining our inequalities, we conclude that with probability at least \( 1-\delta \),
for every \( q(h) \)
\begin{equation}
\textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}
\end{equation}
where \( q_{S}=1 \) with probability \( \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] \)
and \( 0 \) otherwise, and \( p_{D}=1 \) with probability \( \Pr _{D}[yf(x)\leq 0]-1+\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1}{2}\right) \)
and \( 0 \) otherwise.
This bound holds for any fixed \( N \) and \( \theta \), which is not yet
what we need here, since we want to allow these to depend on the data \( S \).
In essence, the bound we proved so far is a statement about certain events,
parameterized by \( N \) and \( \theta \), namely the probability of each
event is smaller than \( \delta \). However, we need to prove that the probability
of the \emph{union} of all these events is smaller than \( \delta \). To this
end, we first observe that this union is contained in the union over only a
\emph{countable} number of events. Note that \( g(x)\in \{(2k-N)/N|k=0,1,\ldots ,N\} \).
Thus, even with all the possible (positive) values of \( \theta \), there
are no more than \( N+1 \) events of the form \( \{yg(x)\leq \theta /2\} \).
Denote by \( k(\theta ,N) \) the largest integer \( k \) such that \( k/N\leq \theta /2 \).
We observe that for every \( \theta >0 \), every \( g \) and every distribution
over \( (x,y) \):
\begin{equation}
\Pr \left[ yg(x)\leq \theta /2\right] =\Pr \left[ yg(x)\leq \frac{k(\theta ,N)}{N}\right] .
\end{equation}
This means that the middle step in the proof above, i.e. the application of
theorem \ref{th-repbb}, depends on \( (N,\theta ) \) only through \( (N,k) \).
Since the other steps are true with probability one, we see that we can restrict
ourselves to the union of countably many events, indexed by \( (N,k) \). Now,
we ``allocate'' parts of the confidence quantity \( \delta \) to each of
these events, namely \( (N,k) \) receives \( \delta _{N,k}=\delta /(N(N+1)^{2}),\; N=1,2,\dots ;\, k=0,\dots ,N \).
It follows easily that the union of all these events has probability at most
\( \sum _{N,k}\delta _{N,k}=\delta \). Therefore we have proved that with
probability at least \( 1-\delta \) over random choices of \( S \) it holds
true that for \emph{all} \( N \) and \emph{all} \( \theta \in (0,1] \),
\begin{equation}
\label{eq-finalbound}
\textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(q||p)+\ln \frac{2m}{\delta _{N,k}}}{m-1}\leq \frac{N\textrm{KL}(q||p)+\ln \frac{2m}{\delta }+3\ln N+1}{m-1}
\end{equation}
We can now choose \( N \) to minimize this bound. \( N \) may depend on \( \theta ,\, Q \)
and the sample \( S \).
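The confidence allocation \( \delta _{N,k}=\delta /(N(N+1)^{2}) \) sums to \( \delta \) because \( \sum _{N\geq 1}(N+1)/(N(N+1)^{2})=\sum _{N\geq 1}1/(N(N+1)) \) telescopes to \( 1 \); a quick numerical check of the partial sums:

```python
# Partial sums of sum_{N>=1} sum_{k=0}^{N} 1/(N*(N+1)^2),
# i.e. sum_{N>=1} 1/(N*(N+1)), which telescopes to 1.
def allocation_partial_sum(n_max):
    return sum((N + 1) / (N * (N + 1) ** 2) for N in range(1, n_max + 1))

# The partial sum up to n_max equals 1 - 1/(n_max + 1).
```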
The asymptotic bound stated in the theorem can be derived by choosing \( N \)
(with respect to \( \theta \) and \( Q \)) so as to \emph{approximately}
minimize the bound we have derived above. The first step in this minimization
is the replacement of the Binomial tail with a looser approximation such as
the Hoeffding bound which gives us:
\[
1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1}{2}\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\]
\[
\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\]
We can then choose
\[
N=\left\lceil 8\frac{\ln \frac{m}{\textrm{KL}(q||p)}}{\theta ^{2}}\right\rceil .\]
This choice gives us:
\[
e^{-\frac{N\theta ^{2}}{8}}=\frac{\textrm{KL}(q||p)}{m}\]
This implies we have an equation of the form:
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) \leq \frac{\theta ^{-2}\textrm{KL}(q||p)\ln m+\ln m+\ln \frac{1}{\delta }}{m}
\end{equation}
In order to prove the result, we must show that:
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) =O\left( \textrm{KL}\left( q||p\right) \right)
\end{equation}
We can do this by taking the difference to get an equation of the form:
\( (q+k)\ln \frac{q+k}{p-k}+(1-q-k)\ln \frac{1-q-k}{1-p+k}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
If \( p-q>2k \) and \( k<\frac{1}{2} \) then we have:
\( \simeq (q+k)\ln \frac{q+k+\frac{k}{p}}{p}+(1-q-k)\ln \frac{1-q-k-\frac{k}{1-p}}{1-p}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\( \simeq (q+k)\left[ \ln \frac{q}{p}+\frac{k+\frac{k}{p}}{\frac{q}{p}}\right] +(1-q-k)\left[ \ln \frac{1-q}{1-p}-\frac{k+\frac{k}{1-p}}{\frac{1-q}{1-p}}\right] -q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\( \simeq (q+k)[\ln \frac{q}{p}+k(\frac{p+1}{q})]+(1-q-k)[\ln \frac{1-q}{1-p}-k(\frac{2-p}{1-q})]-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \)
\( \simeq k(1+p)-k(2-p) \)
\( \simeq k \)
Since the difference satisfies
\begin{equation}
\textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) -\textrm{KL}\left( q||p\right) \simeq k=\frac{\textrm{KL}(q||p)}{m}=O\left( \textrm{KL}\left( q||p\right) \right)
\end{equation}
the main theorem follows immediately.
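The choice of \( N \) in the proof can be checked numerically: with the ceiling, \( e^{-N\theta ^{2}/8}\leq \textrm{KL}(q||p)/m \) whenever \( m>\textrm{KL}(q||p) \). A sketch with hypothetical parameter values:

```python
import math

def choose_N(theta, m, kl):
    """N = ceil(8 * ln(m / KL(q||p)) / theta^2), as in the proof."""
    return math.ceil(8 * math.log(m / kl) / theta ** 2)

def tail_term(N, theta):
    """Hoeffding relaxation of the Binomial tail: exp(-N * theta^2 / 8)."""
    return math.exp(-N * theta ** 2 / 8)

# The ceiling guarantees tail_term(choose_N(...), theta) <= kl / m.
```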
\section{Methods for tightening}
\label{sec-tighten}
The previous section showed a bound in asymptotic form which is good for understanding
the trade-offs between the number of examples (\( m \)), the size of the hypothesis
space (\( |H| \)), the margin (\( \theta \)) and the entropy of the average
(\( H(Q) \)). However, it is not a good form for those interested in quantitative
application of the bound to specific problems. We state improvements which aid
in the development of a quantitatively applicable bound. We can tighten the
bound above through several techniques:
\begin{enumerate}
\item Parameterizing and then optimizing the parameterization of arbitrary choices
within the proof. In the (improved) margin bound proof, we arbitrarily decided
to work with the margin of the randomly produced function \( g(x) \) at \( \frac{\theta }{2} \).
This is a good heuristic, but not the optimal choice when we use the improved
tail bounds. Since the decision of the margin for the random function \( g(x) \)
is a parameter of the proof, we are free to optimize it.
\item Tighter argument within the proof. The optimal value of \( N \) is a function
of \( \theta ,m,\textrm{KL}(q||p) \) and \( \delta \). All of these are known
in advance except for \( \textrm{KL}(q||p) \). If we can estimate in advance
the value of \( \textrm{KL}(q||p) \), then it becomes possible to optimize
the value of \( N \) in a data-independent manner. Consequently, it becomes
unnecessary to split our confidence over the possible values of \( N \) and
we need only split the confidence over the values of \( \theta \) in proving
the bound. The effect of this improvement is reducing the confidence allocation
from \( \delta _{N,k}=\delta /(N(N+1)^{2}) \) to \( \delta _{N,k}=\delta /(N(N+1)) \),
giving us a small improvement in the low order terms of the improved averaging bound.
\end{enumerate}
\section{Final thoughts for Averaging Bounds}
The practical implication of this more-refined analysis of averaging bounds
is more support for the practice of averaging. Sufficient averaging over the
hypothesis space can reduce the ``complexity'' allowing tight estimates on
the true error rate---possibly even tighter than could be guaranteed with a
single hypothesis.
The improved averaging bound is not yet the tightest possible and it appears
there are several possible theoretical improvements.
\begin{enumerate}
\item Remove the low order terms from the bound to make it more quantitatively applicable.
\item Improve the argument to take into account the \emph{distribution} of the margin
rather than the margin at some point.
\item Prove a lower bound which corresponds to the upper bound given here. Since no
good lower bound yet exists, we do not know that large improvements in the upper
bound are not possible.
\end{enumerate}
\begin{problem}
(Open) In practice, is the averaging bound (\ref{eq-finalbound}) ever
better than the PAC-Bayes bound (theorem \ref{th-repbb})? The extra triangle inequality
applications in the averaging bound may make it worse than the PAC-Bayes bound
in practice.
\end{problem}
\chapter{Computable Shell bounds }
\label{sec-shell}
The first shell bound paper was joint work with David McAllester and was presented
at Colt \cite{Shell}. The work presented here incorporates significant refinement,
generalization, and simplification of the first Colt paper.
Roughly speaking, the shell bound (usually) provides much tighter true error
rate upper bounds on learned hypotheses than the conventional Occam's Razor bound
(theorem \ref{th-ORB}) or the PAC-Bayes bound (theorem \ref{th-repbb}). It does
this without violating lower upper bounds (section \ref{sec-lower_upper}) by incorporating
\emph{much} more information into the bound calculation.
The inspiration behind the work on Shell bounds rests on two pieces of work.
In \cite{old_shell} by Haussler, Kearns, Seung, and Tishby, learning theory
curves are investigated from an omniscient point of view where the true error
rates of various hypotheses are known. The principal improvement in this paper
is that our bounds are reduced to \emph{observable} quantities. Put another
way, we do not need to know the underlying learning distribution, \( D \).
In \cite{Tobias}, an analysis was made assuming some distribution over true
error rates. Our analysis does not rely on any assumption about the distribution
of true error rates---only the independence assumption is made. Despite using
only observable information and making no extra assumptions, the bounds here
are quite tight and practically useful.
We start with the distribution of empirical errors over hypotheses and subtract
a small amount from the empirical error rates to create a pessimistic distribution.
With high probability, the cumulative of the pessimistic distribution will lower
bound the cumulative distribution of hypothesis true error rates. Given this,
we can directly calculate a bound on the probability that a ``large'' hypothesis
will produce a misleadingly small error. This bound can be \emph{much} tighter
than standard union bound techniques although the quantity of improvement is
highly problem dependent.
After presenting the first bound, we will transform it into a bound on continuous
hypothesis spaces using a PAC-Bayes like approach \cite{PB}.
Viewed as an interactive proof of learning (figure \ref{fig-shell-protocol}),
the stochastic shell bound is much like the PAC-Bayes bound.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/shell.eps}} \par}
\caption{\label{fig-shell-protocol} The stochastic shell bound, as an interactive proof
of learning, has the same general outline as the PAC-Bayes bound except that
\emph{much} more information is required in order to calculate the bound. The
shell bound (proved first below) is a simplification which is somewhat tighter
when the ``Posterior'' places all mass on one hypothesis.}
\end{figure}
The strongest criticism of shell bounds is, in fact, that too much information
is required. While this information is always theoretically observable, it may
not be tractable to collect. There are two answers to this criticism given here.
The first is an empirical application to decision tree learning algorithms which
shows that in practice, there are often shortcuts which make the information
gathering feasible. The second answer is to construct a sampled version of the
shell bound which uses approximate versions of the information required by the
shell bound. We will:
\begin{enumerate}
\item Present the discrete shell bound.
\item Present the sampled version of the shell bound.
\item Extend the discrete shell bound to continuous spaces.
\end{enumerate}
\section{The Discrete Shell Bound}
\subsection{Knowledge of learning distribution}
Let \( B(m,K,p)=\binom{m}{K}p^{K}(1-p)^{m-K} \) be the probability that a
hypothesis with true error \( p \) produces \( K \) errors on \( m \) independent
examples.
The discrete shell bound works directly with the probability that there will
be a confusingly small empirical error. Let
\[
P(\epsilon ,K)\equiv \sum _{h:e(h)>\frac{K}{m}+\epsilon }B(m,K,e(h))\]
Intuitively, \( P(\epsilon ,K) \) is a bound on the probability that a hypothesis
with a true error rate larger than \( \frac{K}{m}+\epsilon \) will have an
empirical error rate of \( \frac{K}{m} \). The contribution to the sum will
fall off exponentially as the true error, \( e(h) \), increases.
Our first step is to state a shell bound which requires unknown information.
The purpose of this bound is motivational---it provides insight into why we can
expect a large improvement. Later, we will remove the unknown information requirements
and recover a useful bound.
\begin{thm}
\label{Full_knowledge}(Full knowledge theorem) For all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
where \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, P(\epsilon ,K)=\frac{\delta }{m}\right\} \)
\end{thm}
The full knowledge theorem relies on unobservable information---the true error
rates of all hypotheses. This theorem is not (quite) trivial because it does
not rely upon information about \emph{which} hypothesis has a particular true
error.
\begin{proof}
For every hypothesis with a true error rate of
\[
e(h)>\frac{K}{m}+\epsilon (\delta ,K)\]
the probability of producing an empirical error of
\[
\hat{e}(h)=\frac{K}{m}\]
is \( B(m,K,e(h)) \). Applying the union bound over all hypotheses and all
\( m \) possible nontrivial values of \( K \) completes the proof.
\end{proof}
There are a couple of things to note about this theorem. First, for most balanced
machine learning problems most hypotheses typically have a true error rate near
\( 0.5 \). Given this, and noticing that Binomial tails fall off exponentially,
dramatic improvements in the bound are feasible.
Second, we must use \( \frac{\delta }{m} \) rather than \( \delta \) in order
to make the proof work. It is possible that the theorem holds without splitting
\( \delta \) ``\( m \)-ways''. Removing the factor of \( m \) is an open
problem. For the special case of empirical risk minimization algorithms, McDiarmid's
inequality \cite{McD} implies that the range of hypotheses with minimum empirical
error is of size \( O(\sqrt{m}) \) with high probability. Therefore, we need
only apply the union bound to \( O(\sqrt{m}) \) possible minimum empirical
error rates.
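To make the definitions above concrete, here is a small Python sketch (an illustration, not part of the original development) that computes \( B(m,K,p) \), \( P(\epsilon ,K) \), and the resulting \( \epsilon (\delta ,\frac{K}{m}) \) by grid search. The list of true error rates is the hypothetical ``full knowledge'' input the theorem assumes.

```python
import math

def binom_pmf(m, k, p):
    """B(m, K, p): probability of exactly k errors among m examples."""
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

def P(eps, k, m, true_errors):
    """P(eps, K): summed probability that a hypothesis with true error
    above K/m + eps nevertheless makes exactly K empirical errors."""
    return sum(binom_pmf(m, k, e) for e in true_errors if e > k / m + eps)

def epsilon(delta, k, m, true_errors, grid=1000):
    """Smallest eps on a grid with P(eps, K) <= delta / m; a grid scan
    suffices because P is nonincreasing in eps."""
    for i in range(grid + 1):
        eps = i / grid
        if P(eps, k, m, true_errors) <= delta / m:
            return eps
    return 1.0
```

When most true error rates sit near \( 0.5 \), the binomial terms are exponentially small and the returned \( \epsilon \) is tiny, which is exactly the improvement the theorem promises.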
\subsection{No knowledge of learning distribution}
Applying the full knowledge theorem (\ref{Full_knowledge}) is impractical
in almost all learning settings because we do not know the distribution of true
error rates. Therefore, it is necessary to construct a slightly looser theorem
which relies upon only observable quantities. Surprisingly, this is possible
while introducing only slightly more slack.
First, we need a couple of definitions.
\[
\bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ \frac{K}{m}+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,i,p)\geq \delta \right\} \right\} \]
Intuitively, \( \bar{e}(\epsilon ,K,\delta ,\frac{i}{m}) \) is a \emph{lower}
bound on the true error rate of the hypothesis with an empirical error of \( \frac{i}{m} \).
\[
\hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{h}B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\]
The quantity \( \hat{P}(\epsilon ,K,\delta ) \) is an upper bound on the probability
that one of the hypotheses with a true error rate of \( \bar{e} \) or more
could produce \( K \) empirical errors.
Noting that there are only \( m+1 \) possible empirical errors, we can first
let
\[
c\left( \frac{i}{m}\right) =\left| \left\{ h:\, \hat{e}(h)=\frac{i}{m}\right\} \right| \]
be the count of empirical errors at \( \frac{i}{m} \). Then we can redefine:
\[
\hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}c\left( \frac{i}{m}\right) B(m,K,\bar{e}(\epsilon ,K,\delta ,\frac{i}{m}))\]
Later, we will prove that with high probability, \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \).
Given that this is so, we can prove a theorem which \emph{only} relies on observable
quantities.
\begin{thm}
\label{Observable}(Observable Shell Bound) For all \( \delta >0 \):
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
where \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \)
\end{thm}
The observable shell bound preserves the important locality property of the
full knowledge shell bound. In particular, when most of the true error rates
are ``far'' from the empirical error rate (and \( m \) is large enough),
we expect to make large (functional) improvements on the discrete hypothesis
bound \ref{th-DHSCP}.
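The observable quantities \( \bar{e} \) and \( \hat{P} \) can be computed by numerically inverting the binomial tail. The Python sketch below (illustrative code under the definitions above, not from the original) uses bisection, which is valid because \( 1-\textrm{Bin}(m,i,p) \) is monotone increasing in \( p \).

```python
import math

def binom_pmf(m, k, p):
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

def binom_cdf(m, k, p):
    """Bin(m, k, p): probability of at most k errors among m examples."""
    return sum(binom_pmf(m, j, p) for j in range(k + 1))

def bar_e(eps, k, delta, i, m, tol=1e-9):
    """Pessimistic true error for a hypothesis with i empirical errors:
    max{K/m + eps, min{p : 1 - Bin(m, i, p) >= delta}}."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:            # 1 - Bin(m, i, p) increases with p
        mid = (lo + hi) / 2
        if 1 - binom_cdf(m, i, mid) >= delta:
            hi = mid
        else:
            lo = mid
    return max(k / m + eps, hi)

def P_hat(eps, k, delta, counts, m):
    """2 * sum_i c(i/m) * B(m, K, bar_e(...)) with counts[i] = c(i/m)."""
    return 2 * sum(c * binom_pmf(m, k, bar_e(eps, k, delta, i, m))
                   for i, c in enumerate(counts))
```

Since \( \hat{P} \) is nonincreasing in \( \epsilon \), the bound's \( \epsilon (\delta ,\frac{K}{m}) \) can then be found by the same grid or bisection search used for the full knowledge bound.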
The proof rests upon a technical lemma which allows us to bound the unobservable
``probability of a misleading event'' with an observable event.
\begin{lem}
\label{unobservable bound}(Unobservable bound) For all empirical errors, \( K \),
for all distributions \( Q(h) \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( 2\sum _{h}Q(h)B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\leq \sum _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))\right) \leq \delta \]
\end{lem}
This lemma is powerful because it bounds the unobservable right hand side in
terms of the observable left hand side.
\begin{proof}
Let \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, 1-\textrm{Bin}(m,k,p)\geq \delta \} \)
\[
\forall \alpha \in (0,1]\, \, \forall h:\, \, E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\[
\Rightarrow \forall P(h)\, \, E_{P}E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\[
\Rightarrow \forall P(h)\, \, E_{D^{m}}E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \]
\[
\Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\geq \frac{\alpha }{\delta }\right) \leq \delta \]
\[
\Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( \Pr _{P}\left( e(h)<\bar{e}(m,\hat{e}(h),\alpha )\right) \geq \frac{\alpha }{\delta }\right) \leq \delta \]
Let \( P(h)\propto Q(h)B(m,K,e(h)) \). Then:
\[
\Rightarrow \forall Q(h)\, \, \Pr _{D^{m}}\left( \frac{\sum _{h:e(h)>\frac{K}{m}+\epsilon \wedge e(h)>\bar{e}(m,\hat{e}(h),\alpha )}Q(h)B(m,K,e(h))}{\sum _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))}\geq \frac{\alpha }{\delta }\right) \leq \delta \]
Set \( \alpha =\frac{\delta }{2} \) and replace \( e(h) \) with \( \bar{e}(m,\hat{e}(h),\alpha ) \)
to achieve the result.
\end{proof}
We now have all the tools required to prove the theorem.
\begin{proof}
(of theorem \ref{Observable}) Choose \( Q(h) \) to be the uniform distribution
on our hypothesis space. Then, we know that with probability \( 1-\delta \),
\( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \). Therefore, we can (arbitrarily)
allocate a \( \frac{\delta }{2} \) probability of failure to the unobservable
bound \ref{unobservable bound} and a \( \frac{\delta }{2} \) probability of
failure to the full knowledge bound \ref{Full_knowledge}. Assuming \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \),
the observable bound will be more pessimistic than the full knowledge bound.
\end{proof}
The Observable Shell bound behaves in a strange manner which is unlike other
true error bounds. In particular, the true error bound can be discontinuous
in the value of \( \delta \). This discontinuity implies that relatively small
improvements in the shell bounds can result in dramatic improvements in the
value of the true error bound. While dramatic improvements can happen with small
improvements in the shell bound, we expect that in practice, such large improvements
will not be too common, simply because a small improvement is unlikely (amongst
all learning problems) to shift the bound across one of these discontinuous
transitions.
\section{Sampling Shell Bound}
The Shell bound relies upon the distribution of empirical errors across the
entire hypothesis space. Calculating \( c\left( \frac{i}{m}\right) \), while
theoretically possible, is often practically intractable. For example, the
space of all binary functions on \( n \) features has size \( 2^{2^{n}} \)
and for a decision tree with \( k=O(m) \) nodes there are more
than \( 2^{k} \) hypotheses with the same number of nodes. In order to avoid
this difficulty, we will use sampling, which is made possible by noticing that
the Shell bound does not require exact knowledge of the empirical error distribution.
Instead, we can safely count a hypothesis twice because over-counting monotonically
worsens the bound. Assume that we have an oracle which can be used to sample
uniformly from the set of all hypotheses. Then, we can bin the sampled hypotheses
according to their error rate on the example set. After repeating \( k \) times
we will arrive at an empirical distribution over error rates which can be altered
into a bound on the true distribution of error rates by upper bounding the count
\( c\left( \frac{i}{m}\right) \). Let \( \hat{c}\left( \frac{i}{m}\right) \)
be the observed number of hypotheses out of \( k \) uniform choices with empirical
error \( \frac{i}{m} \) and define:
\[
\bar{c}\left( k,\frac{\hat{c}}{k},\delta ,|H|\right) \equiv |H|\cdot \max _{p}\left\{ p:\, \, \textrm{Bin}(k,\hat{c},p)=\frac{\delta }{m}\right\} \]
Intuitively, \( \bar{c} \) will be a high confidence upper bound on the number
of hypotheses with a given empirical error rate. Given these quantities,
we can calculate an approximate \( \hat{P} \) according to the formula:
\[
\hat{\hat{P}}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}\bar{c}\left( k,\frac{\hat{c}\left( \frac{i}{m}\right) }{k},\frac{\delta }{2m},|H|\right) B\left( m,K,\bar{e}\left( \epsilon ,K,\frac{\delta }{2},\frac{i}{m}\right) \right) \]
\begin{thm}
\label{th-sample_shell} (Sampling Shell Bound) For all \( \delta >0 \):
\[
\Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \]
where \( \epsilon (\delta ,K)=\min \left\{ \epsilon :\, \, \hat{\hat{P}}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \)
\end{thm}
\begin{proof}
For every \( i \), we know that \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \)
with probability at least \( 1-\frac{\delta }{2} \). This implies that \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq P\left( \epsilon ,K\right) \)
with probability at least \( 1-\delta \). Applying the union bound with the
full knowledge theorem gives us the result.
\end{proof}
The Sampling Shell bound is still relatively fast to calculate given the sampling
results, but it is worth noting that \emph{many} samples are required---the
bound will typically tighten linearly with \( \ln k \). In other words, an
exponentially large \( k \) is required to realize all of the gains of the
shell bound. Thus, the sampling shell bound will interpolate between the discrete
hypothesis bound \ref{th-DHSCP} and the shell bound as \( \ln k \) moves from
\( 1 \) to \( \ln |H| \).
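The upper-confidence count \( \bar{c} \) can be computed by bisection, since \( \textrm{Bin}(k,\hat{c},p) \) is decreasing in \( p \). A minimal sketch (illustrative, not from the original):

```python
import math

def binom_cdf(k, j, p):
    """Bin(k, j, p): probability of at most j hits among k samples."""
    return sum(math.comb(k, t) * p**t * (1 - p)**(k - t)
               for t in range(j + 1))

def bar_c(k, c_hat, delta_over_m, H_size, tol=1e-9):
    """|H| * max{p : Bin(k, c_hat, p) >= delta/m}: a high-confidence
    upper bound on a bin's count, from c_hat hits in k uniform samples.
    Bin(k, c_hat, p) decreases in p, so bisect for the crossing point."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, c_hat, mid) >= delta_over_m:
            lo = mid
        else:
            hi = mid
    return H_size * lo
```

Plugging \( \bar{c} \) into the earlier \( \hat{P} \) formula in place of the exact counts \( c\left( \frac{i}{m}\right) \) yields the doubly pessimistic \( \hat{\hat{P}} \).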
\section{Lower Bounds}
The lower upper bound \ref{sec-lower_upper} does not apply to shell bounds
because we are using more information than just the empirical error rate of
a learned hypothesis. In particular, we are using the empirical error rates
of \emph{all} the hypotheses in calculating the bound. Is there a lower upper
bound which applies for the information used by the shell bound? The same independent
hypothesis technique will allow us to lower bound the full knowledge theorem
\ref{Full_knowledge}. In particular, assume that we are given a set of independent
hypotheses, each with some true error \( e(h) \). What is a lower bound on
the probability that one of these hypotheses will have an empirical error of
\( \frac{K}{m} \)?
If \( A \) and \( B \) are independent events, then:
\[
\Pr (A\textrm{ or }B)=\Pr (A)+\Pr (B)-\Pr (A\textrm{ and }B)\]
\[
=\Pr (A)+\Pr (B)-\Pr (A)\Pr (B)\]
\[
=(1-\Pr (B))\Pr (A)+\Pr (B)\]
This implies that we can ``add'' the independent probabilities together as
long as we rescale. In particular, \( B \) might be the probability of a ``bad''
hypothesis in some set of hypotheses and \( A \) might be the probability that
some new hypothesis with a large true error rate has a small empirical error.
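The rescaled-addition identity can be checked numerically; the probabilities below are arbitrary illustrative values.

```python
import random

# illustrative probabilities for two independent events A and B
pa, pb = 0.3, 0.45
rescaled = (1 - pb) * pa + pb        # "add after rescaling"
inclusion = pa + pb - pa * pb        # inclusion-exclusion w/ independence
assert abs(rescaled - inclusion) < 1e-12

# Monte Carlo check of Pr(A or B) against the closed form
random.seed(0)
trials = 200_000
hits = sum((random.random() < pa) or (random.random() < pb)
           for _ in range(trials))
assert abs(hits / trials - rescaled) < 0.01
```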
Using this fact, we get the following theorem:
\begin{thm}
(Lower Upper Shell Bound) For all true errors \( e(h) \), \( K \):
\[
\Pr _{D^{m}}\left( \exists h:e(h)>\frac{K}{m}+\epsilon \textrm{ and }\hat{e}(h)=\frac{K}{m}\right) \geq P(\epsilon ,K)(1-P(\epsilon ,K))\]
\end{thm}
\begin{proof}
The proof is by finite induction on the set of hypotheses with a large true
error rate. Let
\[
P_{H}(\epsilon ,K)=\sum _{h\in H}B(m,K,e(h))\]
be the sum of the probabilities that each hypothesis in \( H \) produces an
empirical error of \( \frac{K}{m} \). Now, we want to prove that:
\[
\forall H\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
This is true for the base case of \( |H|=1 \). Assuming that it is true for
the case of \( |H|=n \), we need to prove it for the case of \( H'=H\cup \{h\} \).
In particular, we have assumed
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
Using the earlier independence principle, we get that:
\[
\Pr _{D^{m}}\left( \exists h\in H':\, \, \hat{e}(h)=\frac{K}{m}\right) \]
\[
\geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K)))B(m,K,e(h))\]
\[
\geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K))B(m,K,e(h))\]
\[
=P_{H'}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\]
\[
\geq P_{H'}(\epsilon ,K)(1-P_{H'}(\epsilon ,K))\]
By induction, this property therefore holds for the set of all hypotheses with
a large true error rate.
\end{proof}
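The sandwich \( P_{H}(1-P_{H})\leq \Pr (\exists h:\, \hat{e}(h)=\frac{K}{m})\leq P_{H} \) for independent hypotheses can be verified numerically; the true error rates below are hypothetical values above the \( \frac{K}{m}+\epsilon \) threshold.

```python
import math

def binom_pmf(m, k, p):
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

# hypothetical independent "bad" hypotheses
m, K = 50, 5
true_errors = [0.3, 0.35, 0.4, 0.5]
qs = [binom_pmf(m, K, e) for e in true_errors]   # chance each looks good
P_H = sum(qs)                                    # the union-bound sum
exact = 1 - math.prod(1 - q for q in qs)         # exact, by independence
# sandwich: P_H(1 - P_H) <= Pr(some bad h looks good) <= P_H
assert P_H * (1 - P_H) <= exact <= P_H
```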
Assuming that \( \delta <\frac{1}{2} \), the lower upper shell bound is tight
to within a factor (in \( \delta \)) of \( 2m \) with the full knowledge
bound \ref{Full_knowledge}. Given the exponential behavior of Binomial tails,
this usually (but not always) implies a small impact on the true error bound.
One important question remains: how does this bound compare to the observable
shell bound \ref{Observable}? In the observable shell bound, the distribution
of true errors is replaced with a pessimistic distribution based upon the observed
empirical errors. The ``size'' of this pessimism in terms of the true error
bound is, in general, of size \( \frac{1}{\sqrt{m}} \). Thus, the gap between
the lower upper shell bound and the upper shell bound is typically of size \( \frac{1}{\sqrt{m}} \).
\section{Shell Bounds for Continuous Spaces}
Applying Shell bounds to continuous hypothesis spaces is not easy. In fact,
upon first inspection, this appears to be impossible since shell bounds require
knowledge of the number of hypotheses with a particular empirical error. We
can avoid these difficulties, while introducing a small amount of slack, by
always being concerned with the \emph{measure} rather than the count of hypotheses.
In particular, we will keep track of the \emph{measure} of hypotheses with a
confusingly small error and the \emph{measure} of the hypotheses that we pick.
The approach here is similar to the approach in section \ref{sec-PB} although
more simplistic.
First assume that there is some measure \( Q \) over the hypothesis space \( H \).
Suppose that we choose some subset, \( U \), of the hypotheses with measure
\( Q(U) \). We will be concerned with the \emph{largest} empirical error rate
\( \hat{e}_{U}(h)=\max _{h\in U}\hat{e}(h) \) and the \emph{average} true error
rate, \( e_{U}(h)=E_{Q_{U}}e(h) \). A bound on the gap between the largest
empirical error rate and the average true error rate will imply a true error
bound for the stochastic classifier which chooses randomly and evaluates.
We will need a different definition of \( P \). Let:
\[
P_{s}(\epsilon ,\, \, K)\equiv \int _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))dh\]
We will also need the concept of rounding. Choose \( i \) such that \( \frac{1}{m^{i}}\geq Q(U)\geq \frac{1}{m^{i+1}} \)
then, define:
\[
\left\lfloor Q(U)\right\rfloor =\frac{1}{m^{i+1}}\]
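The rounding operation is straightforward to implement; a minimal sketch:

```python
def floor_Q(q, m):
    """Round Q(U) down to the nearest power 1/m^{i+1}, where i satisfies
    1/m^i >= Q(U) >= 1/m^{i+1}."""
    i = 0
    while 1.0 / m ** (i + 1) > q:
        i += 1
    return 1.0 / m ** (i + 1)
```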
Now, the following theorem holds:
\begin{thm}
(Stochastic Full Knowledge) \label{Stochastic Full Knowledge} For all distributions
\( Q \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists U:\, \, e_{U}(h)>\hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \leq \delta \]
where \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{m^{3}}\right\} \)
\end{thm}
\begin{proof}
Call a hypothesis with a large true error (\( e(h)>\frac{K}{m}+\epsilon \))
and small empirical error (\( \hat{e}(h)\leq \frac{K}{m} \)) a ``bad'' hypothesis.
\( P_{s}(\epsilon ,K) \) is the expected measure of the bad hypotheses. We
will use Markov's inequality to bound the actual measure of bad hypotheses.
Then, given that the quantity is bounded, we can bound the expected true error
by assuming that we included every bad hypothesis in the set \( U \).
Let
\[
\epsilon _{Ki}=\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta }{m^{3+i}}\right\} \]
Intuitively, \( \epsilon _{Ki} \) is the value we will use when \( \hat{e}=\frac{K}{m} \)
and \( \left\lfloor Q(U)\right\rfloor =\frac{1}{m^{i}} \). Also let
\[
\hat{P}_{s}(\epsilon ,K)=\int _{h:e(h)>\frac{K}{m}+\epsilon \wedge \hat{e}(h)\leq \frac{K}{m}}Q(h)dh\]
be the actual measure of bad hypotheses. Then, Markov's inequality tells us:
\[
\forall \delta >0\, \, \Pr _{D^{m}}\left( \hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K)\right) \leq \frac{\delta }{m^{2}}\]
Taking the union bound over all values of \( \epsilon _{Ki} \), we get:
\[
\forall \delta >0\, \, \Pr _{D^{m}}\left( \exists K,i\, \, \hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K)\right) \leq \delta \]
So, with probability \( 1-\delta \) for all values of \( K \) and \( i \),
we have: \( \hat{P}_{s}(\epsilon _{Ki},K)\leq \frac{1}{m^{1+i}} \). Therefore,
if \( Q(U)\geq \frac{1}{m^{i}} \), we know that \( \frac{Q(U)}{\hat{P}_{s}(\epsilon _{Ki},K)}\geq m \).
Assuming that all of the bad hypotheses have true error \( 1 \) and all the
rest have true error at most \( \frac{K}{m}+\epsilon _{Ki} \), we get the following
true error bound:
\[
e_{U}(h)\leq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon _{\hat{e}i}\]
\end{proof}
The stochastic full knowledge bound is loose when applied to the full knowledge
setting by \( \delta \rightarrow \frac{\delta }{m^{2}} \). Typically, this
results in only a \( \frac{2\ln m}{m} \) increase in the size of \( \epsilon \),
although the increase can sometimes be much larger near phase transitions. These
factors of \( \frac{\ln m}{m} \) may be removable with improved argumentation.
Naturally, the stochastic full knowledge bound can be used to prove a stochastic
observable bound.
The next theorem is the observable analog of the stochastic full knowledge bound.
Here, we eliminate the unobservable quantities to produce a stochastic observable
shell bound. The observable quantity we will use is:
\[
\hat{P}_{s}(\epsilon ,K,\delta )\equiv 2\int _{h}Q(h)B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))dh\]
where
\[
\bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ \frac{K}{m}+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,i,p)\geq \delta \right\} \right\} \]
is a slightly pessimistic estimate of the true error rate given the empirical
error rate, as before.
\begin{thm}
(Stochastic Observable Shell Bound) For all distributions \( Q \), for all
\( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists U:\, \, e_{U}(h)>\hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \leq \delta \]
where \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{2m^{3}}\right\} \)
\end{thm}
\begin{proof}
First note that the unobservable bound lemma \ref{unobservable bound} implies
that with probability \( 1-\frac{\delta }{2} \), we have \( \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\geq P_{s}(\epsilon ,K) \).
Given that this is the case, our choice of \( \epsilon \) will be at least
as pessimistic as the choice defined by the Stochastic Full Knowledge bound
\ref{Stochastic Full Knowledge}. We thus have two sources of failure: the unobservable
bound lemma will fail with probability at most \( \frac{\delta }{2} \) and
the stochastic full knowledge bound will fail with probability \( \frac{\delta }{2} \).
The union bound then implies that the Stochastic Observable Shell Bound holds
with probability \( 1-\delta \).
\end{proof}
The information requirements for the continuous space shell bound are even more
severe than the requirements for the discrete space shell bound. In particular,
we need to know the exact measure of the hypotheses with any particular empirical
error. Clearly, this is unrealistic. We can relax this requirement to observations
computable in a finite amount of time using the sampling techniques of theorem
\ref{th-sample_shell}. In particular, given a well-behaved distribution \( Q \),
we can make a bounded estimate of the measure of hypotheses with some empirical
error rate. Given this bounded estimate, we can then calculate a pessimistic
shell bound for the continuous space.
\section{Conclusion}
Details for calculating the shell bound are given in appendix section \ref{sec-shell-bound-calc}.
Shell bounds are easily calculable and can provide large improvements when we
can afford to enumerate the hypotheses. Their application is less clear when
it is not possible to enumerate the hypotheses because the computational burden
may become too large. Via sampling techniques, it is possible to smoothly improve
from earlier techniques to the best achievable shell bound result in an anytime
fashion.
There remain several important open questions:
\begin{enumerate}
\item Can we remove the remaining division of \( \delta \) by \( m \)? This would
make shell bounds a bit more elegant and tight.
\item Is there a natural lower bound on the true error rate which uses the shell approach?
This improvement is of principally theoretical interest.
\item For the continuous space shell bounds, two additional divisions by \( m \)
were introduced. Is it possible to remove these factors with an improved argument?
This improvement would clean up the continuous shell bound.
\item Our extension to the continuous case was done in the style of PAC-Bayes bounds,
but a more common technique for extending to the continuous case is via the
use of covering numbers. Is there a natural way to extend Shell bounds to the
continuous case using the concept of a covering number? This is another approach
which might yield a result requiring less calculation.
\end{enumerate}
\chapter{Tight covering number bounds}
\label{sec-cover}
\section{Introduction}
Covering number bounds are used to bound the true error rate of classifiers
chosen from an infinite hypothesis space using \( m \) examples \cite{Haussler}.
A cover is a finite set of hypotheses which satisfies the following property:
every hypothesis in the infinite space is ``near'' some element in the finite
cover. When a Lipschitz condition holds on the hypothesis space, it is generally
possible to construct these covers and the existence of a cover is required
for learnability \cite{BI}. Alternatively, Sauer's lemma (see \cite{Pollard}
or \cite{Vapnik}) bounds the size of the cover in terms of the VC dimension
which is defined combinatorially: the VC dimension is the largest number of
examples which the hypothesis space can classify in an arbitrary manner.
The principal disadvantage of covering number results is that they are notoriously
loose, to the point that they are often useless when applied in practice (see
``criticisms'' in \cite{Haussler}). Here, ``useless'' means that the bound
on the true error rate is ``always wrong''. The amount of ``looseness''
can be quantified by comparison with other bounds in the regimes where other
bounds hold. On a finite hypothesis space we have near-perfect agreement between
the upper bound \ref{th-DHSCP} and the lower upper bound \ref{th-dhlub} for
independent hypotheses. In fact, as the number of examples goes to infinity,
the agreement is perfect, regardless of the size of the hypothesis space. When
we apply covering number bounds to this problem such properties do not arise.
Since part of the argument involves splitting the examples into \( 2 \) sets,
the difference between a covering number based upper bound and the lower upper
bound can be large even when the number of examples goes to infinity. In practice,
the covering number bound at least squares the discrete hypothesis size. Obviously,
further loosening of covering number bounds with Sauer's lemma result in even
worse bounds.
Can we construct a calculable true error upper bound for continuous hypothesis
spaces which is at least asymptotically tight? In a sense, this has already
been done with PAC-Bayes bounds in chapter \ref{sec-PB}, but there are drawbacks
in applicability to that approach since PAC-Bayes bounds do not apply in a meaningful
way to a single hypothesis drawn from an infinite hypothesis space. A covering
number argument would hopefully apply in a meaningful way to a single hypothesis.
A covering number bound which is asymptotically tight on \emph{some} learning
problems does exist and is covered next.
\section{The Setting and Prior Results}
We will first discuss standard covering number bounds. Define a ``distance''
in terms of how often hypotheses disagree according to:
\[
d_{D}(h,f)=\Pr _{D}(h(x)\neq f(x))\]
Now, start with an epsilon net defined by:
\[
N(H,\epsilon ,d_{D})=\inf _{F}\left\{ \left| F\right| :\, \, \forall h\in H\, \, \exists f\in F:\, \, d_{D}(h,f)\leq \epsilon \right\} \]
An epsilon net is the minimum size of a set which contains an element ``near''
to every element in \( H \).
Then a covering number is defined as:
\[
C(H,\epsilon )=\sup _{D}N(H,\epsilon ,d_{D})\]
The covering number is the size of the worst-case epsilon net over distributions.
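Exact covering numbers are hard to compute, but a greedy construction over a finite hypothesis sample under the empirical disagreement distance illustrates the definition. The random hypotheses below are purely illustrative; the greedy result is a valid (not necessarily minimal) epsilon cover of the sample.

```python
import random

random.seed(1)
m, n_hyp = 200, 60
# represent each hypothesis by its label vector on an m-point sample,
# so dist(h, f) is the empirical disagreement rate d(h, f)
hyps = [tuple(random.randint(0, 1) for _ in range(m)) for _ in range(n_hyp)]

def dist(h, f):
    return sum(a != b for a, b in zip(h, f)) / len(h)

def greedy_net(hyps, eps):
    """Greedily pick centers so every hypothesis lies within eps of one;
    the center set witnesses an upper bound on N(H, eps, d_D) for the
    empirical distribution."""
    centers = []
    for h in hyps:
        if all(dist(h, c) > eps for c in centers):
            centers.append(h)
    return centers

net = greedy_net(hyps, 0.45)
assert all(any(dist(h, c) <= 0.45 for c in net) for h in hyps)
```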
\begin{thm}
\label{th-covering} (Covering number bound) For all \( \delta \in (0,1] \):
\[
\forall H\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \hat{e}(h)+4\sqrt{\frac{\ln 4C(H,\epsilon )+\ln \frac{1}{\delta }}{m}}\right) \leq \delta \]
\end{thm}
\begin{proof}
In \cite{Haussler}.
\end{proof}
How tight is this bound when applied to a finite independent hypothesis space?
We can improve the constants by using an argument with fewer triangle inequalities
in the discrete case and get the following results:
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)-\hat{e}(h)\geq 2\sqrt{\frac{\ln 4|H|+\ln \frac{1}{\delta }}{m}}\right) \leq \delta \]
Comparing this with a very loose application of the discrete hypothesis bound
\ref{th-adhscb} we see that the penalty term in the covering number bound is
worse by a factor of \( 2\sqrt{2} \). Put another way, dividing the number of
samples by \( 8 \) or increasing the hypothesis space size to \( |H|^{8} \)
and then applying a sloppy discrete hypothesis bound is about equivalent to
applying a very specialized covering number bound. We seek a covering number
bound which does not divide the effective value of a hypothesis by \( 8 \).
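The \( 2\sqrt{2} \) factor can be checked numerically; the values of \( m \), \( |H| \), and \( \delta \) below are arbitrary, chosen so that \( \ln |H| \) dominates the constants.

```python
import math

# hypothetical regime where ln|H| dominates ln 4 and ln(1/delta)
m, delta = 1000, 0.05
ln_H = 100 * math.log(2)                      # |H| = 2^100
covering = 2 * math.sqrt((math.log(4) + ln_H + math.log(1 / delta)) / m)
discrete = math.sqrt((ln_H + math.log(1 / delta)) / (2 * m))
# the ratio approaches 2*sqrt(2) ~ 2.83 as ln|H| grows
assert abs(covering / discrete - 2 * math.sqrt(2)) < 0.1
```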
\section{Bracketing Covering Number Bound }
There is an alternative version of the covering number bounds which has been
little used in learning theory. This is mentioned as the ``direct method''
in \cite{Haussler} and Section II.2 of \cite{Pollard} discusses this approach
but is only concerned with asymptotic consistency rather than rates of convergence.
We start with a more restricted notion of covering---a covering in which we
bracket the hypotheses. Let \( Z \) be the space of \( (x,y) \) labeled examples
and \( h(Z)=\{(x,y):h(x)\neq y\} \) in:
\[
N_{[]}(H,\gamma ,d_{D})=\inf _{F}\left\{ \left| F\right| :\, \, \forall h\in H\, \, \exists (f,f')\in F:\, \, d_{D}(f,f')\leq \gamma \textrm{ and }f(Z)\supseteq h(Z)\supseteq f'(Z)\right\} \]
In other words, \( f \) and \( f' \) are two sets which are ``above'' and
``below'' every hypothesis that they cover. Note that it is very important
in this definition that the sets \( f \) and \( f' \) are not required to
correspond to functions in \( H \). In fact, they can simply be understood
as arbitrary sets of \( (x,y) \) pairs.
It is immediately obvious that:
\[
N_{[]}(H,\gamma ,d_{D})\geq N(H,\gamma ,d_{D})\]
However, the relationship between \( N_{[]}(H,\gamma ,d_{D}) \) and \( C(H,\gamma ) \)
is ambiguous: either could be larger.
Using this alternative notion of a covering, we have the following theorem:
\begin{thm}
\label{th-bracketing_cover} Let \( f(h) \) be the upper bracket of any hypothesis,
\( h \). For all \( \gamma \in (0,1] \), for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c}
e(h)\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \textrm{ }\\
\textrm{or }\hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right)
\end{array}\right) \leq \delta \]
where \( \begin{array}{c}
\bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\
\textrm{and }b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\}
\end{array} \).
\end{thm}
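The two inverted binomial tails \( \bar{e} \) and \( b \) are straightforward to compute exactly. A minimal sketch (assuming, as elsewhere in the text, that \( \textrm{Bin}(m,k,p) \) denotes the lower tail \( \sum _{i\leq k}\binom{m}{i}p^{i}(1-p)^{m-i} \)):

```python
import math

def bin_tail(m, k, p):
    # Bin(m, k, p): probability of at most k errors among m Bernoulli(p) trials
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

def bar_e(m, k, delta):
    # max_p { p : Bin(m, k, p) = delta }; the tail is decreasing in p,
    # so the root is unique and bisection finds it
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if bin_tail(m, k, mid) > delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def b(m, p, delta):
    # min_K { K/m : 1 - Bin(m, K, p) <= delta }
    for k in range(m + 1):
        if 1 - bin_tail(m, k, p) <= delta:
            return k / m
    return 1.0
```

For example, \( \bar{e}(100,0,0.05)\approx 0.0295 \), reproducing the usual test-set upper confidence interval for zero observed errors.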
This theorem constrains the distance between \( \hat{e}(f(h)) \) and \( e(h) \)
and the distance between \( \hat{e}(f(h)) \) and \( \hat{e}(h) \). Consequently,
it constrains the distance between \( e(h) \) and \( \hat{e}(h) \). The proof
of the theorem rests on the following two lemmas which each assert a convergence.
Graphically, the proof has the following form:
\vspace{0.3cm}
{\par\centering \resizebox*{0.5\columnwidth}{!}{\includegraphics{bracketing_cover.eps}} \par}
\vspace{0.3cm}
\begin{lem}
Let \( f(h) \) be the upper bracket of any hypothesis, \( h \). For all \( \gamma \in (0,1] \),
for all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \leq \delta \]
where \( b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\} \)
\end{lem}
\begin{proof}
If \( f(h),f'(h) \) bracket \( h \), we have:
\[
e(f(h))-e(f'(h))\leq \gamma \]
Therefore a coin which is ``heads'' when \( f(h)(z)\neq f'(h)(z) \) has a
bias of \( \gamma \) or less. Since \( f'(h) \) is correct everywhere that
\( h \) is correct, we also know:
\[
\forall \epsilon \, \, \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(h)\geq \epsilon )\leq \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(f'(h))\geq \epsilon )\]
Therefore, we have at most \( N_{[]}(H,\gamma ,d_{D}) \) Binomial distributions,
each with bias at most \( \gamma \), and wish to bound the probability of
a large deviation. Using an upper binomial tail calculation, we get the result.
\end{proof}
\begin{lem}
For all \( \delta \in (0,1] \):
\[
\Pr _{D^{m}}\left( \exists (f,f')\textrm{ in the bracketing cover}:\, \, e(f)\geq \bar{e}\left( m,\hat{e}(f),\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \leq \delta \]
where \( \bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \} \)
\end{lem}
\begin{proof}
Application of the discrete hypothesis space bound \ref{th-DHSCP} for \( f \)
in every \( (f,f') \) pair.
\end{proof}
We can now combine the lemmas to get the theorem.
\begin{proof}
(of theorem \ref{th-bracketing_cover}) Allocate \( \frac{\delta }{2} \) confidence
to each lemma and use the union bound with both lemmas to get:
\[
\Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c}
e(f(h))\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \\
\textrm{or }\hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right)
\end{array}\textrm{ }\right) \leq \delta \]
where \( \begin{array}{c}
\bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\
\textrm{and }b\left( m,p,\delta \right) \equiv \min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\}
\end{array} \). Since \( e(h)\leq e(f(h)) \) by construction, the proof is complete.
\end{proof}
The alternative covering number argument has a significant advantage over the
standard argument: when restricted to a finite hypothesis space, the argument
becomes tight. In particular, on a finite hypothesis space, we can set \( \gamma =0 \)
and get: \( N_{[]}(H,0,d_{D})\leq |H| \). The bound reduces to the standard
discrete hypothesis space bound \ref{th-DHSCP}. Consequently, there exist learning
problems where this bound is quite tight.
The bracketing covering approach has the following advantages and disadvantages:
\begin{enumerate}
\item Advantage: Covering defined in terms of the actual problem rather than a worst
case over all problems.
\item Advantage: Asymptotically tight for some learning problems.
\item Disadvantage: Less general. There may exist spaces which are not coverable in
the alternative approach.
\item Disadvantage: \( N_{[]}(H,\gamma ,d_{D}) \) is more difficult to compute than
\( N(H,\gamma ,d_{D}) \).
\end{enumerate}
In a sense, this bound is useless because it requires knowledge of the unknown
distribution \( D \) in order to calculate a covering number. In the next section,
we will see that bounds on the bracketing covering number which hold for all
\( D \) can often be found.
\section{Covering number calculations}
It is important to demonstrate that this covering number is feasible to calculate
and gives a better answer than the traditional approach. We will do this by
first calculating the bracketing covering number for a very simple continuous
classifier and then comparing the results with the traditional covering number
approach.
Bounds on bracketing covering numbers have already been proved for many function
\cite{Dudley}\cite{VW}. Here, we will present a proof for the simplest of
continuous hypothesis spaces: the step function on a line segment. Each hypothesis
will be indexed by a number \( a\in [0,1] \) according to:
\[
h_{a}(x)=\textrm{sign}(x-a)\]
What is \( N_{[]}(H,\gamma ,d_{D}) \) for this hypothesis set?
\begin{thm}
Assume that \( D \) can be described by a probability density function, then:
\[
N_{[]}(H,\gamma ,d_{D})\leq \frac{1}{\gamma }+1\]
\end{thm}
\begin{proof}
Consider a range of hypotheses from \( h_{a} \) to \( h_{b} \). For this range
of hypotheses, we can choose a bracketing pair, \( (f,f') \). In particular,
we can choose \( f \) and \( f' \) which agree on \( [0,a) \) and \( [b,1] \)
and always predict either incorrectly (\( f \)) or correctly (\( f' \)) on
\( [a,b) \). The distance between these functions satisfies:
\[
d_{D}(f,f')=\Pr _{D}(x\in [a,b))\]
and every hypothesis \( h_{c} \) in \( [a,b) \) satisfies \( \forall z\, \, f(z)\geq h_{c}(z)\geq f'(z) \).
Consequently, if \( a \) and \( b \) are chosen appropriately, we will observe
\( d_{D}(f,f')\leq \gamma \).
If \( D \) can be described by a probability density function, then we can simply
calculate the marginal distribution, \( D_{x} \), and its cumulative distribution,
\( F_{D}(x) \). Now, find \( a_{i} \) for which \( F_{D}(a_{i})=i\gamma \)
for \( i<\frac{1}{\gamma } \). Choose \( b_{i}=a_{i+1} \). There are at most
\( \frac{1}{\gamma }+1 \) intervals, each with a measure (according to \( D \))
of at most \( \gamma \). Consequently, we can cover \( H \) with \( \frac{1}{\gamma }+1 \)
pairs of \( (f_{i},f_{i}') \).
\end{proof}
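The construction in the proof is easy to carry out explicitly. A small sketch (for an assumed marginal distribution given by its inverse CDF; the uniform case is shown):

```python
import math

def bracket_breakpoints(inv_cdf, gamma):
    # Quantile points a_i with F_D(a_i) = i*gamma, as in the proof's construction.
    n = math.ceil(1.0 / gamma)                 # at most 1/gamma + 1 intervals
    return [inv_cdf(min(i * gamma, 1.0)) for i in range(n + 1)]

# Uniform marginal on [0,1]: the inverse CDF is the identity.
pts = bracket_breakpoints(lambda q: q, 0.3)
n_pairs = len(pts) - 1                         # one (f_i, f_i') pair per interval
masses = [pts[i + 1] - pts[i] for i in range(n_pairs)]
# each interval has measure at most gamma, so each pair brackets its
# hypotheses within distance gamma
```

Each interval \( [a_{i},a_{i+1}) \) then supplies one bracketing pair \( (f_{i},f_{i}') \) as described above.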
Given that the bracketing cover is \( \frac{1}{\gamma }+1 \), we can use theorem
\ref{th-bracketing_cover} to define a constraint that the true error rate must
satisfy with high probability. Setting \( \gamma =\frac{1}{m-1} \), we get:
\[
\hat{e}(f(h))\leq \hat{e}(h)+b\left( m,\frac{1}{m-1},\frac{\delta }{2m}\right) \]
and
\[
e(h)\leq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2m}\right) \]
To be fair in comparison to the standard covering number approaches, we should
relax our theorem to use the Hoeffding approximation. Note that this is a bit
unfair because the first inequality is (inherently) a highly biased Binomial
with lower variance. Relaxing to the Hoeffding bound, we get:
\[
\hat{e}(f(h))\leq \hat{e}(h)+\frac{1}{m-1}+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\]
and
\[
e(h)\leq \hat{e}(f(h))+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\]
which implies:
\[
e(h)\leq \hat{e}(h)+\frac{1}{m-1}+2\sqrt{\frac{\ln 2m+\ln \frac{1}{\delta }}{2m}}\]
Note again that we are being ``unfair'' to the new approach by using the Hoeffding
approximation rather than exact Binomial-tail bounds. The standard covering
number approach has not yet been reduced to exact Binomial-tail bounds. Using
the standard covering number approach, we get \( C(H,\frac{1}{m-1})=m \).
This implies a bound of:
\[
e(h)\leq \hat{e}(h)+4\sqrt{\frac{\ln 8m+\ln \frac{1}{\delta }}{m}}\]
Comparing the bounds, we see that the new approach is about \( \frac{16}{2}=8 \)
times more efficient in the number of examples required to achieve a bound on
a given deviation.
\subsection{Note on the Bracketing Cover proof}
There are several important things to note about this proof.
\begin{enumerate}
\item We used the property that a small change in the hypothesis only affected the
prediction on a small portion of the input space.
\item The bound on \( N_{[]} \) holds for all \( D \) with a density function, not
just the \( D \) which we happen to observe.
\item The bound on \( N_{[]}(H,\gamma ,D) \) is exactly the same as a bound on \( N\left( H,\frac{\gamma }{2},D\right) \).
\end{enumerate}
In fact, the proof can be extended to \emph{all} \( D \) (even ones with point
masses) at the cost of a factor of \( 2 \) worsening and a messier argument.
Property (2) is desirable because it is often not the case that we know the
distribution \( D \) when we wish to apply the bound. Property (1) is an essential
technique that can be used to prove other covering number bounds for this notion
of covering number.
Can we show partial order covering number bounds for other classifiers? There
is a straightforward extension of the previous proof for classifiers which consist
of axis parallel intervals in \( R^{n} \). More work is required to prove partial
order covering numbers for the hypothesis spaces of standard learning algorithms.
\section{Conclusion and Future Work}
We presented an alternative covering number argument and showed that the true
error rate bounds constructed using this argument are within \( O\left( \frac{\ln m}{m}\right) \)
of the lower bound on some learning problems. This is a significant improvement
over prior results which just bound the ratio of the lower and upper bounds
up to a constant. We also presented a simple improvement on PAC-Bayes bounds
for stochastic classifiers which achieves a similar \( O\left( \frac{\ln m}{m}\right) \)
difference between the lower and upper bounds.
It is interesting to examine the relationship between the bracketing covering
number and the PAC-Bayes bound. With this notion of covering number we can guarantee
that all of the hypotheses covered by the same bracketing pair have similar
empirical as well as true errors. Thus, we can relate the error rate of an individual
hypothesis to a set of hypotheses with a significant measure --- exactly the
setting where the PAC-Bayes bound is tight.
Much work remains to be done in order to fulfill a quest for quantitatively
tight learning bounds.
\begin{enumerate}
\item Proofs on the size of the partial order covering number need to be made for
common learning algorithms.
\item Can this alternate form of covering number be related to the VC dimension or
to the standard definition of covering number?
\item Can we extend the class of problems for which the lower and upper bounds differ
by only \( O\left( \frac{\ln m}{m}\right) \) to a larger set?
\end{enumerate}
\chapter{Holdout bounds: Progressive Validation}
\label{sec-PV}
\section{Progressive Validation Technique}
Progressive validation is a technique which allows you to use almost \( \frac{1}{2} \)
of the data in a holdout set for training purposes while still providing the
same guarantee as the holdout bound. It first appeared in \cite{Progressive}
and is discussed in a more refined and detailed form here.
Suppose that you have a training set of size \( m_{\textrm{train}} \) and test
set of size \( m_{\textrm{pv}} \). Progressive validation starts by first learning
a hypothesis on the training set and then testing on the first example of the
test set. Then, we train on training set plus the first example of the test
set and test on the second example of the test set. The process continues for
\( m_{\textrm{pv}} \) iterations. Let \( m \) abbreviate \( m_{\textrm{pv}} \). Then, we have \( m \)
hypotheses, \( h_{1},...,h_{m} \) and \( m \) error observations, \( \hat{e}_{1},...,\hat{e}_{m} \).
The hypothesis output by progressive validation is the randomized hypothesis
which chooses uniformly from \( h_{1},...,h_{m} \) and evaluates the chosen
hypothesis. Note that this protocol is similar to those in \cite{Littlestone89}
and the new thing here is an analysis of performance.
Since we are randomizing over hypotheses trained on \( m_{\textrm{train}} \)
to \( m_{\textrm{train}}+m_{\textrm{pv}}-1 \) examples, the expected number
of examples used by any hypothesis is \( m_{\textrm{train}}+\frac{m_{\textrm{pv}}-1}{2} \).
Given that training can exhibit phase transitions, the extra few examples can
greatly improve the accuracy of the trained hypothesis.
Viewed as an interactive proof of learning, the progressive validation technique
follows the protocol of figure \ref{fig-pv-protocol}.
\begin{figure}
{\par\centering \resizebox*{0.9\columnwidth}{!}{\includegraphics{thesis-presentation/progressive_validation.eps}} \par}
\caption{\label{fig-pv-protocol} The progressive validation protocol has a learner
repeatedly commit to a hypothesis before it is given a new example. Based upon
the test errors, a bound on the true error rate of the metahypothesis which
chooses randomly from each of \protect\( (h_{1},...,h_{m})\protect \) before
each evaluation is provided. }
\end{figure}
The true error rate of this randomized hypothesis will be:
\[
e_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}e(h_{i})\]
where \( e(h_{i})=\Pr _{D}(h_{i}(x)\neq y) \) and the empirical error estimate
of this randomized hypothesis will be:
\[
\hat{e}_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}\hat{e}_{i}\]
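The protocol itself is only a few lines of code. A minimal sketch (the `learn` interface and the majority-vote learner are illustrative stand-ins, not part of the text):

```python
def progressive_validation(train, test, learn):
    # learn(data) -> hypothesis; returns the empirical estimate \hat{e}_pv
    data = list(train)
    errors = []
    for x, y in test:
        h = learn(data)               # commit to h_i before seeing example i
        errors.append(1 if h(x) != y else 0)
        data.append((x, y))           # then absorb the example for training
    return sum(errors) / len(errors)

def majority_learner(data):
    # hypothetical stand-in learner: predict the majority label seen so far
    ones = sum(y for _, y in data)
    label = 1 if 2 * ones >= len(data) else 0
    return lambda x: label

train = [(i, 1) for i in range(10)]
test = [(i, 1) for i in range(10, 20)]
e_pv = progressive_validation(train, test, majority_learner)
```

Each test example is scored exactly once, by a hypothesis committed before that example is revealed, which is what makes the holdout-style guarantee possible.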
\section{Variance Analysis}
Bounding the deviation of \( \hat{e}_{\textrm{pv}} \) is more difficult than
bounding the deviation of a holdout error. To understand this, we can think
of two games, the holdout game and the progressive validation game.
In the holdout game, your opponent chooses a bias and then nature flips \( m \)
coins with that bias. If the deviation of the average number of heads is larger
than \( \epsilon \), then you lose. Otherwise, you win.
In the progressive validation game, the opponent chooses the bias of \emph{each}
coin just before it is flipped. The goal of the opponent remains the same, and
the opponent wins if a large deviation is observed.
The progressive validation opponent is at least as strong as the holdout opponent
since the progressive validation opponent could choose the same bias for every
coin. Nonetheless, we will see that the progressive validation opponent is \emph{not}
much stronger.
There are two ways in which we can show that the progressive validation opponent
is not much stronger. The first technique will show that the variance of the
progressive validation estimate is smaller than might be expected. Then we will
show that the \emph{deviations} of the progressive validation opponent behave
much like the deviations of an independent opponent.
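A small simulation makes this claim plausible. The adaptive drift rule below is an invented opponent, not from the text; the point is that even though each bias depends on the earlier flips, the deviation variance stays at the level of \( m \) independent coins:

```python
import random

def pv_game_deviation(m, rng):
    # the opponent picks each coin's bias after seeing all earlier flips
    heads, bias_sum, flip_sum = 0, 0.0, 0
    for i in range(m):
        # adaptive rule: bias drifts with the running average of past flips
        bias = 0.25 + 0.5 * (heads / i if i else 0.5)
        bias_sum += bias
        flip = 1 if rng.random() < bias else 0
        heads += flip
        flip_sum += flip
    return flip_sum / m - bias_sum / m

m, trials = 50, 20000
rng = random.Random(0)
devs = [pv_game_deviation(m, rng) for _ in range(trials)]
var = sum(d * d for d in devs) / trials
# martingale cross terms cancel in expectation, so var stays below 1/(4m),
# just as it would for m independent coins
```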
\begin{thm}
Suppose we test the progressive validation hypothesis on \( m_{\textrm{test}}=m_{\textrm{pv}} \)
additional examples. Let \( \hat{e}_{\textrm{test}} \) be the empirical error
on these examples. Then, we have:
\[
E_{(x,y)^{m_{\textrm{pv}}+m_{\textrm{test}}}\sim D^{m_{\textrm{pv}}+m_{\textrm{test}}}}(\hat{e}_{\textrm{test}}-e_{\textrm{pv}})^{2}\]
\[
\geq E_{(x,y)^{m_{\textrm{pv}}}\sim D^{m_{\textrm{pv}}}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\]
\end{thm}
\begin{proof}
Every example on the left hand side can be thought of as a coin with bias \( e_{\textrm{pv}} \).
The LHS is therefore \( \frac{e_{\textrm{pv}}(1-e_{\textrm{pv}})}{m} \), the variance of an average of \( m \) independent coins.
The right hand side is:
\[
E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\]
\[
=\frac{1}{m^{2}}E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i=1}^{m}(\hat{e}_{i}-e_{i})\right] ^{2}\]
where \( e_{i}=e(h_{i}) \)
\[
=\frac{1}{m^{2}}E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i\neq j}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\sum _{i=1}^{m}(\hat{e}_{i}-e_{i})^{2}\right] \]
\[
=\frac{1}{m^{2}}\sum _{i\neq j}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\frac{1}{m^{2}}\sum _{i=1}^{m}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})^{2}\]
The cross product term is:
\[
E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})\]
Without loss of generality, assume that \( i