\documentstyle[colt00e,times]{article}
%\usepackage[T1]{fontenc}
%\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
%\providecommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\@}
%\newcommand{\noun}[1]{\textsc{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
%\theoremstyle{plain}
\newtheorem{thm}{Theorem}[section]
%\numberwithin{equation}{section} %% Comment out for sequentially-numbered
%\numberwithin{figure}{section} %% Comment out for sequentially-numbered
%\theoremstyle{plain}
\newtheorem{lem}[thm]{Lemma} %%Delete [thm] to re-start numbering
\newtheorem{stm}{Statement}[section]
\newenvironment{proof_outline}{\par{\bf Proof Outline:}}{\qed \par}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\input{preamble}
%\input{psfig}
\textwidth = 6.5in
\textheight = 8.5in
\setlength\topmargin{-0.25in}
\setlength\oddsidemargin{-0.25in}
\setlength\evensidemargin{-0.25in}
\newcommand{\ignore}[1]{}
\newcommand{\mtt}[1]{\mbox{\tt #1}}
\newcommand{\calh}{{\mathcal H}}
\newcommand{\calx}{{\mathcal X}}
\newcommand{\tuple}[1]{{\mbox{$\langle#1\rangle$}}}
\newcommand{\hateps}{\hat{e}}
\newcommand{\hate}{\hat{e}}
\newcommand{\hats}{\hat{s}}
\newcommand{\hathats}{\hat{s}^+}
\newcommand{\tils}{\bar{s}}
\newcommand{\hatL}{\hat{L}}
\newcommand{\hatl}{\hat{\ell}}
\newcommand{\hatR}{\hat{R}}
\newcommand{\hathateps}{\hat{\hat{e}}}
\newcommand{\Cgamma}{{\cal C}_\gamma}
\newcommand{\bareps}{\overline{\epsilon}}
\newcommand{\hatP}{\hat{P}}
\newcommand{\hatZ}{\hat{Z}}
\newcommand{\myfloor}[1]{\lfloor#1\rfloor}
\newcommand{\myceil}[1]{\lceil#1\rceil}
\newcommand{\dceil}[1]{\lceil \lceil #1 \rceil \rceil}
\newtheorem{theorem}{\noindent Theorem}
\newtheorem{lemma}[theorem]{\noindent Lemma}
\newtheorem{corollary}[theorem]{\noindent Corollary}
\newtheorem{conjecture}[theorem]{\noindent Conjecture}
\newtheorem{proposition}[theorem]{\noindent Proposition}
\newtheorem{example}{\noindent Example}
\newtheorem{definition}{\noindent Definition}
\newtheorem{claim}[theorem]{\noindent Claim}
\newtheorem{fact}[theorem]{\noindent Fact}
\newenvironment{proof}{\par{\bf Proof:}}{\qed \par}
\newcommand{\qed}{\mbox{$\;\;\;\Box$}}
\newcommand{\lub}{\mtt{lub}}
\newcommand{\glb}{\mtt{glb}}
%\newlength{\linelength}
%\setlength{\linelength}{0.25\textwidth}
\title{Computable Shell Decomposition Bounds}
\author{ {\bf John Langford} \\
School of Computer Science \\
Carnegie Mellon University\\
jcl@cs.cmu.edu \\
www.cs.cmu.edu/\~{ }jcl \\
\And
{\bf David McAllester} \\
AT\&T Labs-Research \\
dmac@research.att.com \\
www.research.att.com/\~{ }dmac
}
\begin{document}
\maketitle
\begin{abstract}
Haussler, Kearns, Seung and Tishby introduced the notion of a shell
decomposition of the union bound as a means of understanding certain
empirical phenomena in learning curves such as phase transitions. Here
we use a variant of their ideas to derive an upper bound on the
generalization error of a hypothesis computable from its training
error and the histogram of training errors for the hypotheses in the
class. In most cases this new bound is significantly tighter than
traditional bounds computed from the training error and the
cardinality or VC dimension of the class. Our results
can also be viewed as providing PAC theoretical foundations for a
model selection algorithm proposed by Scheffer and Joachims.
\end{abstract}
\section{Introduction}
For an arbitrary finite hypothesis class we consider the hypothesis of
minimal training error. We give a new upper bound on the
generalization error of this hypothesis computable from the training
error of the hypothesis and the histogram of the training errors of
the other hypotheses in the class. This new bound is typically much
tighter than more traditional upper bounds computed from the training
error and cardinality or VC dimension of the class.
As a simple example, suppose that we observe that all but one
empirical error in a hypothesis space is $1/2$ and one empirical error
is $0$. Furthermore, suppose that the sample size is large enough
(relative to the size of the hypothesis class) that with high
confidence we have that, for all hypotheses in the class, the true
(generalization) error of a hypothesis is within $1/5$ of its training
error. This implies that, with high confidence, hypotheses with
training error near $1/2$ have true error in $[3/10,7/10]$.
Intuitively, we would expect the true error of the hypothesis
with minimum empirical error to be very near $0$, rather than simply
less than $1/5$, because none of the hypotheses that produced an
empirical error of $1/2$ could have a true error close enough to $0$
to give a significant probability of producing $0$ empirical
error. The bound presented here validates this intuition. We show
that you can ignore hypotheses with training error near $1/2$ in
calculating an ``effective size'' of the class for hypotheses with
training error near $0$. This new effective class size allows us to
calculate a tighter bound on the difference between training error and
true error for hypotheses with training error near $0$. The new bound
is proved using a distribution-dependent application of the union
bound similar in spirit to the shell decomposition introduced by
Haussler, Kearns, Seung and Tishby \cite{HKST96}.
We actually give two upper bounds on generalization error --- an
uncomputable bound and a computable bound. The uncomputable bound is
a function of the unknown distribution of true error rates of the
hypotheses in the class. The computable bound is, essentially, the
uncomputable bound with the unknown distribution of true errors
replaced by the known histogram of training errors. Our main
contribution is that this replacement is sound, i.e., the computable
version remains, with high confidence, an upper bound on
generalization error.
When considering asymptotic properties of learning theory bounds it is
important to take limits in which the cardinality (or VC dimension) of
the hypothesis class is allowed to grow with the size of the sample.
In practice more data typically justifies a larger hypothesis class.
For example, the size of a decision tree is generally proportional to the
amount of training data available. Here we analyze the asymptotic
properties of our bounds by considering an infinite sequence of
hypothesis classes $\calh_m$, one for each sample size $m$, such that
$\frac{\ln|\calh_m|}{m}$ approaches a limit larger than zero. This
kind of asymptotic analysis provides a clear account of the
improvement achieved by bounds that are functions of error rate
distributions rather than simply the size (or VC dimension) of the
class.
We give a lower bound on generalization error showing that the
uncomputable upper bound is asymptotically as tight as possible ---
any upper bound on generalization error given as a function of the
unknown distribution of true error rates must asymptotically be
greater than or equal to our uncomputable upper bound.
Our lower bound on generalization error also shows that
there is essentially no loss in working with an upper bound computed from the
true error distribution rather than expectations computed from this
distribution as used by Scheffer and Joachims \cite{Tobias}.
Asymptotically, the computable bound is simply the uncomputable bound
with the unknown distribution of true errors replaced with the observed
histogram of training errors. Unfortunately, we can show that in
limits where $\frac{\ln|\calh_m|}{m}$ converges to a value greater
than zero, the histogram of training errors need not converge to the
distribution of true errors --- the histogram of training errors is a
``smeared out'' version of the distribution of true errors. This
smearing loosens the bound even in the large-sample asymptotic limit.
We give a precise asymptotic characterization of this smearing effect
for the case where distinct hypotheses have independent training errors.
In spite of the divergence between the uncomputable and computable
bounds, the computable bound is still significantly tighter than
classical bounds not involving error distributions.
The computable bound can be used for model selection. In the case of
model selection we can assume an infinite sequence of finite model
classes $\calh_0, \calh_1, \ldots$ where each $\calh_j$ is a finite
class with $\ln|\calh_j|$ growing linearly in $j$. To perform model
selection we find the hypothesis of minimal training error in each
class and use the computable bound to bound its generalization error.
We can then select, among these, the model with the smallest upper
bound on generalization error. Scheffer and Joachims propose
(without formal justification) replacing the distribution of true errors
with the histogram of training errors.
Under this replacement, the model selection algorithm based on our
computable upper bound is asymptotically identical to the algorithm
proposed by Scheffer and Joachims.
The shell decomposition is a distribution-dependent use of the union
bound. Distribution-dependent uses of the union bound have been
previously exploited in so-called self-bounding algorithms. Freund
\cite{Freund} defines, for a given learning algorithm \emph{and data
distribution}, a set \( S \) of hypotheses such that with high
probability over the sample, the algorithm will always return a
hypothesis in that set. Although \( S \) is defined in terms of the
unknown data distribution, Freund gives a way of computing a set \( S'
\) from the given algorithm and the sample such that, with high
confidence, \( S' \) contains \( S \) and hence the ``effective size''
of the hypothesis class is bounded by \( |S'| \). Langford and Blum
\cite{Microchoice} give a more practical version of this
algorithm. Given an algorithm and data distribution they conceptually
define a weighting over the possible executions of the algorithm.
Although the data distribution is unknown, they give a way of
computing a lower bound on the weight of the particular execution of
the algorithm generated by the sample at hand. In this paper we
consider distribution dependent union bounds defined independently of
any particular learning algorithm.
\section{Mathematical Preliminaries}
For an arbitrary measure on an arbitrary sample space we use the
notation \( \forall ^{\delta }S\; \Phi [S,\delta] \) to mean that with
probability at least \( 1-\delta \) over the choice of the sample \( S
\) we have that \( \Phi [S,\delta] \) holds. In practice \( S \) is the
training sample of a learning algorithm. Note that \( \forall x\;
\forall ^{\delta }S\; \Phi [x,\; S,\; \delta] \) does not imply \( \forall
^{\delta }S\; \forall x\; \Phi [x,\; S,\; \delta] \). If $X$ is a finite set,
and for all $x \in X$ we have the assertion $\forall \delta > 0
\;\forall^\delta S\;\Phi[S,x,\delta]$ then by a standard application
of the union bound we have the assertion $\forall \delta > 0
\;\forall^\delta S\;\forall x \in X \;\Phi[S,x,\frac{\delta}{|X|}]$. We will
call this the quantification rule. If $\forall \delta > 0
\;\forall^\delta S\;\Phi[S,\delta]$ and $\forall \delta > 0 \;\forall^\delta
S\;\Psi[S,\;\delta]$ then by a standard application of the union bound
we have $\forall \delta > 0 \;\forall^\delta S\;\Phi[S,\frac{\delta}{2}]
\wedge \Psi[S,\frac{\delta}{2}]$. We will call this the conjunction rule.
The KL-divergence
of \( q \) from \( p \), denoted \( D(q||p) \), is \( q\ln (\frac{q}{p})+(1-q)\ln (\frac{1-q}{1-p}) \)
with \( 0 \ln (\frac{0}{p}) =0 \) and \( q \ln (\frac{q}{0}) = \infty \).
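As a concrete aid, the divergence and its boundary conventions can be computed directly; the following Python sketch (the name \mtt{kl} is ours) implements the definition above, including the $0 \ln (\frac{0}{p}) = 0$ and $q \ln (\frac{q}{0}) = \infty$ conventions.

```python
import math

def kl(q, p):
    """Binary KL-divergence D(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)),
    with the conventions 0 ln(0/p) = 0 and q ln(q/0) = infinity."""
    if q <= 0.0:
        return math.inf if p >= 1.0 else -math.log(1.0 - p)
    if q >= 1.0:
        return math.inf if p <= 0.0 else -math.log(p)
    if p <= 0.0 or p >= 1.0:
        return math.inf  # 0 < q < 1 but p puts no mass on one outcome
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
```

Note that the inequality $D(q||p) \geq 2(q-p)^2$ used later can be spot-checked against this implementation.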
Let \( \hat{p} \) be
the fraction of heads in a sequence \( S \) of \( m \) tosses of a biased
coin where the probability of heads is \( p \). For $\hat{p} \geq p$ we have the following
inequality given by Chernoff in 1952 \cite{Ch52}.
\begin{equation}
\label{eqn:relent-prelim}
\forall q\in [p,1]:\; \; Pr(\hat{p}\geq q)\; \; \; \leq \; \; \; e^{-mD(q||p)}
\end{equation}
This bound can be rewritten as follows.
\begin{equation}
\label{eqn:relent}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\max(\hat{p},p)||p)\leq \frac{\ln (\frac{1}{\delta })}{m}
\end{equation}
To derive (\ref{eqn:relent}) from (\ref{eqn:relent-prelim}) note that
$Pr(D(\max(\hat{p},p)||p) \geq \frac{\ln (\frac{1}{\delta })}{m})$
equals $Pr(\hat{p} \geq q)$ where $q \geq p$ and $D(q||p) = \frac{\ln (\frac{1}{\delta})}{m}$.
By (\ref{eqn:relent-prelim}) we then have that this probability
is no larger than $e^{-mD(q||p)} = \delta$.
It is just as easy to derive (\ref{eqn:relent-prelim}) from (\ref{eqn:relent})
so the two statements are equivalent.
By duality, i.e., by considering the problem defined by replacing $p$ by $1-p$, we get the following.
\begin{equation}
\label{eqn:relent2}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\min(\hat{p},p)||p)\leq \frac{\ln ( \frac{1}{\delta })}{m}
\end{equation}
Conjoining (\ref{eqn:relent}) and (\ref{eqn:relent2}) yields the following
corollary of (\ref{eqn:relent-prelim}).
\begin{equation}
\label{eqn:relent-main}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\hat{p}||p)\leq \frac{\ln (\frac{2}{\delta })}{m}
\end{equation}
%This bound can be thought of as essentially the same
%as the standard Chernoff bound except that instead of an $l_2$ metric the
%more natural $D(\hat{p}||p)$ is used.
Using the inequality \( D(q||p)\geq 2(q-p)^{2} \) one can show that
(\ref{eqn:relent-main}) implies the following better known form of the Chernoff bound.
\begin{equation}
\label{eqn:chernoff}
\forall \delta >0\; \forall ^{\delta }S\; \; \; |p-\hat{p}| \leq \sqrt{\frac{\ln (\frac{2}{\delta })}{2m}}
\end{equation}
Using the inequality $D(q||p) \geq \frac{(p-q)^2}{2q}$, which holds for $q \leq p$, we can show
that (\ref{eqn:relent2}) implies the following.\footnote{A derivation of this formula
can be found in \cite{Mc1} or \cite{Mc2}. To see the need for the last term consider the case
where $\hat{p}=0$.}
\begin{equation}
\label{eqn:chernoff2}
\forall \delta >0\; \forall ^{\delta }S\; \; \; p \leq \hat{p} + \sqrt{\frac{2\hat{p}\ln (\frac{1}{\delta })}{m}}
+ \frac{2\ln(\frac{1}{\delta})}{m}
\end{equation}
Note that for small values of $\hat{p}$ formula (\ref{eqn:chernoff2}) gives a tighter upper bound on $p$ than does
(\ref{eqn:chernoff}).
The upper bound on $p$ implicit in (\ref{eqn:relent-main}) is somewhat tighter than
the minimum of the bounds given by (\ref{eqn:chernoff}) and (\ref{eqn:chernoff2}).
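This comparison is easy to check numerically. The sketch below (helper names and sample values are ours) recovers the upper bound on $p$ implicit in (\ref{eqn:relent-main}) by binary search, using the fact that $D(\hat{p}||q)$ is nondecreasing in $q$ for $q \geq \hat{p}$; by the inequality $D(q||p) \geq 2(q-p)^2$ the result can never exceed the bound of (\ref{eqn:chernoff}).

```python
import math

def kl(q, p):
    """Binary KL-divergence D(q||p) with the usual boundary conventions."""
    if q <= 0.0:
        return math.inf if p >= 1.0 else -math.log(1.0 - p)
    if q >= 1.0:
        return math.inf if p <= 0.0 else -math.log(p)
    if p <= 0.0 or p >= 1.0:
        return math.inf
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_bound(phat, m, delta, tol=1e-10):
    """lub{ q >= phat : D(phat||q) <= ln(2/delta)/m }, by binary search
    (D(phat||q) is nondecreasing in q on [phat, 1])."""
    eps = math.log(2.0 / delta) / m
    lo, hi = phat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl(phat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

phat, m, delta = 0.01, 1000, 0.05
b_kl = kl_upper_bound(phat, m, delta)
# Additive Chernoff form: phat + sqrt(ln(2/delta) / (2m)).
b_additive = phat + math.sqrt(math.log(2.0 / delta) / (2.0 * m))
```

For small $\hat{p}$ the divergence bound is markedly below the additive form, matching the discussion above.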
We now consider a formal setting for hypothesis learning. We assume a finite
set ${\mathcal H}$ of hypotheses and a space
\( {\mathcal{X}} \) of instances. We assume that each hypothesis represents
a function from \( {\mathcal{X}} \) to \( \{0,1\} \) where we write \( h(x) \)
for the value of the function represented by hypothesis \( h \) when applied
to instance \( x \). We also assume a distribution \( D \) on pairs \( \tuple {x,\; y} \)
with $x \in {\mathcal X}$ and $y \in \{0,1\}$.
For any hypothesis \( h \) we define the error rate of \( h \), denoted \( e(h) \),
to be \( P_{\tuple {x,\; y}\sim D}(h(x)\not =y) \). For a given sample \( S \)
of \( m \) pairs drawn from \( D \) we write \( \hateps (h) \) to denote
the fraction of the pairs \( \tuple {x,\; y} \) in \( S \) such that \( h(x)\not =y \).
Quantifying over $h \in \calh$ in (\ref{eqn:relent-main}) yields the following
second corollary of (\ref{eqn:relent-prelim}).
\begin{equation}
\label{eqn:tight-naive}
\forall ^{\delta }S\; \; \; \forall h \in \calh\; \; \; D(\hateps (h)||e(h))\leq
\frac{\ln|{\mathcal H}|+\ln (\frac{2}{\delta})}{m}
\end{equation}
By considering bounds on $D(q||p)$ we can derive the following better known corollary
of (\ref{eqn:tight-naive}).
\begin{equation}
\label{eqn:naive2}
\forall ^{\delta }S\; \; \; \forall h\in \calh \; \; \;|e(h)-\hateps(h)|
\leq \sqrt{\frac{\ln|\calh| +\ln (\frac{2}{\delta })}{2m}}
\end{equation}
These two formulas both limit the distance between $\hateps(h)$ and
$e(h)$. In this paper we will work with (\ref{eqn:tight-naive}) rather
than (\ref{eqn:naive2}) because it yields an (uncomputable)
upper bound on generalization error that is optimal up to asymptotic equality.
\ignore{
(\ref{eqn:tight-naive}) is a strict improvement on
(\ref{eqn:naive2}) which requires no extra information over what is
used in standard PAC bounds. Consequently, (\ref{eqn:tight-naive})
provides a simple quick improvement for all ``agnostic'' PAC bounds.
Now we want to learn what can be done with more information.
}
\section{The Upper Bound}
\label{sec:upper}
Our goal now is to improve on (\ref{eqn:tight-naive}). Our first step
is to divide the hypotheses in $\calh$ into $m$ disjoint sets based on their
true error rates. More specifically, for $p \in [0,1]$ define $\dceil{p}$
to be $\frac{\max(1,\myceil{mp})}{m}$. Note that $\dceil{p}$ is of the form $\frac{k}{m}$
where either $p = 0$ and $k = 1$ or $p > 0$ and $p \in (\frac{k-1}{m},\;\frac{k}{m}]$.
In either case we have $\dceil{p} \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ and
if $\dceil{p} = \frac{k}{m}$ then $p \in [\frac{k-1}{m},\;\frac{k}{m}]$.
Now we define $\calh(\frac{k}{m})$ to be the set of $h \in \calh$ such that $\dceil{e(h)} = \frac{k}{m}$.
We define $s(\frac{k}{m})$ to be $\ln(\max(1,|\calh(\frac{k}{m})|))$.
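The shell bookkeeping can be sketched in a few lines of Python (the function names and the toy error values are ours): each hypothesis is assigned to the shell indexed by its rounded true error, and $s(\frac{k}{m})$ is the log of the clipped shell size.

```python
import math
from collections import Counter

def dceil_index(p, m):
    """Shell index k of dceil(p) = max(1, ceil(m p)) / m, so k in {1, ..., m}."""
    return max(1, math.ceil(m * p))

def shell_log_sizes(true_errors, m):
    """s(k/m) = ln(max(1, |H(k/m)|)), where H(k/m) holds the hypotheses h
    with dceil(e(h)) = k/m."""
    counts = Counter(dceil_index(e, m) for e in true_errors)
    return {k: math.log(max(1, counts.get(k, 0))) for k in range(1, m + 1)}

# Four hypotheses with (assumed known) true errors, m = 4 shells:
s = shell_log_sizes([0.0, 0.3, 0.3, 0.9], m=4)
```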
We now have the following lemma.
\begin{lem}
\label{lem:main1} $\forall \delta > 0 \; \forall^\delta S\;\;\; \forall h \in \calh$
\medskip
$$D(\hateps(h)||e(h)) \leq \frac{s(\dceil{e(h)}) + \ln(\frac{2m}{\delta})}{m}$$
\end{lem}
\begin{proof}
Quantifying over $p \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ and $h \in \calh(p)$ in
(\ref{eqn:relent-main}) gives $\forall \delta > 0$, $\forall^\delta S$,
$\forall p \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$, $\forall h \in \calh(p)$,
\medskip
$$D(\hateps(h)||e(h)) \leq \frac{\ln|\calh(p)| + \ln(\frac{2m}{\delta})}{m}$$
But this implies the lemma.
\end{proof}
Lemma~\ref{lem:main1} imposes a constraint, and hence a bound, on
$e(h)$. More specifically, we have the following
where $\lub\;\{x:\;\Phi[x]\}$ denotes the least upper bound
(the maximum) of the set $\{x:\;\Phi[x]\}$.
{\footnotesize \begin{equation}
\label{eqn:main1-equation}
e(h) \leq \lub\;\{q: D(\hateps(h)||q) \leq \frac{s(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\}
\end{equation}}
This is our uncomputable bound. It is uncomputable because
the $m$ numbers $s(\frac{1}{m})$, $\ldots$, $s(\frac{m}{m})$ are
unknown. Ignoring this problem, however, we can see that this bound
is typically significantly tighter than (\ref{eqn:tight-naive}). More
specifically, we can rewrite (\ref{eqn:tight-naive}) as follows.
\begin{equation}
\label{eqn:tight-naive2}
e(h) \leq \lub\;\{q :D(\hateps(h)||q) \leq \frac{\ln|\calh| + \ln(\frac{2}{\delta})}{m}\}
\end{equation}
Since $s(\frac{k}{m}) \leq \ln|\calh|$, and since $\frac{\ln m}{m}$ is small for large $m$,
we have that (\ref{eqn:main1-equation}) is never significantly looser than (\ref{eqn:tight-naive2}).
Now consider a hypothesis $h$ such that the bound on $e(h)$ given by
(\ref{eqn:tight-naive}), or equivalently, (\ref{eqn:tight-naive2}),
is significantly less than 1/2. Assuming $m$ is large, the bound given by
(\ref{eqn:main1-equation}) must also be significantly less than 1/2. But for $q$ significantly
less than 1/2 we will typically have that $s(\dceil{q})$ is significantly smaller
than $\ln|\calh|$. For example, suppose $\calh$ is the set of all decision trees
of size $m/10$. For large $m$, a random decision tree of this size
will have error rate near 1/2. The set of decision trees with error rate significantly
smaller than 1/2 will be an exponentially small fraction of the set of all possible trees.
So for $q$ small compared to 1/2 we get that $s(\dceil{q})$ is significantly smaller than $\ln|\calh|$.
This will make the bound given by (\ref{eqn:main1-equation}) significantly tighter than the bound
given by (\ref{eqn:tight-naive2}).
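The effect described above can be illustrated numerically. In the following sketch (helper names and the shell profile are ours; the profile is an invented stand-in, not an actual decision-tree count) the least upper bound in (\ref{eqn:main1-equation}) is evaluated by scanning a grid of candidate values of $q$, once with the flat profile $s(\dceil{q}) = \ln|\calh|$, which essentially reproduces (\ref{eqn:tight-naive2}), and once with a profile that is small for $q$ far below $1/2$.

```python
import math

def kl(q, p):
    """Binary KL-divergence D(q||p) with the usual boundary conventions."""
    if q <= 0.0:
        return math.inf if p >= 1.0 else -math.log(1.0 - p)
    if q >= 1.0:
        return math.inf if p <= 0.0 else -math.log(p)
    if p <= 0.0 or p >= 1.0:
        return math.inf
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def shell_bound(ehat, s_of_k, m, delta, grid=5000):
    """lub{ q : D(ehat||q) <= (s(dceil(q)) + ln(2m/delta)) / m }, scanned on
    a uniform grid over [0, 1]; s_of_k maps shell index k to s(k/m)."""
    log_term = math.log(2.0 * m / delta)
    best = 0.0
    for i in range(grid + 1):
        q = i / grid
        k = max(1, math.ceil(m * q))
        if kl(ehat, q) <= (s_of_k(k) + log_term) / m:
            best = q
    return best

m, delta, ln_H = 1000, 0.05, 200.0
# Flat profile: every shell counted as large as the whole class.
flat = shell_bound(0.0, lambda k: ln_H, m, delta)
# Invented profile: tiny shells below error 0.3, everything else near 1/2.
shaped = shell_bound(0.0, lambda k: 5.0 if k <= 300 else ln_H, m, delta)
```

With the shaped profile the hypotheses near error $1/2$ never enter the feasible set, and the bound for $\hateps(h) = 0$ drops by an order of magnitude.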
We now show that the distribution of true errors can be replaced, essentially,
by the histogram of training errors. We first introduce the following definitions.
{\scriptsize \begin{eqnarray*}
\ignore{\hat{\calh}\left(\frac{k}{m}\right) & \equiv & \left\{h\in \calh:\;\; \hate(h) = \frac{k}{m}\right\} \\
\\
\hat{s}\left(\frac{k}{m}\right) & \equiv
& \ln\left(\max\left(1,\;\left|\hat{\calh}\left(\frac{k}{m}\right)\right|\right)\right) \\
\\}
\hat{\calh}\left(\frac{k}{m},\delta\right)\!\!\!\!\! & \equiv & \!\!\left\{h\in \calh:\;\; \left|\hateps(h) - \frac{k}{m}\right|
\leq \frac{1}{m} + \sqrt{\frac{\ln(\frac{16m^2}{\delta})}{2m-1}}\right\} \\
\\
\hats\left(\frac{k}{m},\;\delta\right) \!\! & \equiv &
\ln\left(\max\left(1,\;2\left|\hat{\calh}\left(\frac{k}{m},\;\delta\right)\right|\right)\right)
\end{eqnarray*}}
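As a sketch of this definition in Python (the name \mtt{shat} and the toy histogram are ours), $\hats(\frac{k}{m},\;\delta)$ simply counts the hypotheses whose training error falls within the stated window of $\frac{k}{m}$.

```python
import math

def shat(training_errors, k, m, delta):
    """hat-s(k/m, delta) = ln(max(1, 2 |hat-H(k/m, delta)|)), where
    hat-H(k/m, delta) holds the hypotheses whose training error is within
    1/m + sqrt(ln(16 m^2 / delta) / (2m - 1)) of k/m."""
    width = 1.0 / m + math.sqrt(math.log(16.0 * m * m / delta) / (2.0 * m - 1.0))
    count = sum(1 for e in training_errors if abs(e - k / m) <= width)
    return math.log(max(1, 2 * count))

# Invented histogram: 100 hypotheses with training error 1/2, one with 0.
errs = [0.5] * 100 + [0.0]
```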
\ignore{The value $\hats\left(\frac{k}{m}\right)$ corresponds to the natural notion of the empirical histogram of training
errors. However, for the computable upper bound to be sound we need to use the slightly larger
value $\hathats(\frac{k}{m},\;\delta)$.
The width of the interval of values of $\hate(h)$ allowed in
the definition of $\hat{\calh}^+\left(\frac{k}{m}\right)$
declines as $O(\sqrt{\frac{\ln (m)}{m}})$. If $m$ is large, and the histogram of empirical errors
is ``smooth'' in the sense that $\hats(\frac{k'}{m})$ is roughly constant for
$\frac{k'}{m}$ in the interval allowed in the definition of $\hat{\calh}^+\left(\frac{k}{m}\right)$, then
we get that $\frac{\hathats(\dceil{p},\;\delta)}{m}$ must be near $\frac{\hats(\dceil{p})}{m}$.
In the limit
defined in section~\ref{sec:asym-upper} we have the following for any fixed value
of $p$ and $\delta$.
$$\lim_{m\rightarrow \infty} \frac{\hats(\dceil{p})}{m} =
\lim_{m\rightarrow \infty} \frac{\hathats(\dceil{p},\;\delta)}{m}$$
}
The definition of $\hats\left(\frac{k}{m},\;\delta\right)$ is motivated by the following lemma.
\begin{lem}
\label{lem:main2}
$\forall \delta > 0$, $\forall^\delta S$, $\forall q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$,
\medskip
$$s(q) \leq \hats(q,\;2\delta)$$
\end{lem}
Before proving lemma~\ref{lem:main2} we note that
by conjoining (\ref{eqn:main1-equation}) and lemma~\ref{lem:main2}
we get the following. This is our main result.
\begin{thm}
\label{thm:main1} $\forall \delta > 0$, $\forall^\delta S$, $\forall h \in \calh$,
\medskip
{\footnotesize $$e(h) \leq \lub\;\left\{q: D(\hateps(h)||q)
\leq \frac{\hats(\dceil{q},\;\delta) + \ln (\frac{4m}{\delta})}{m}\right\}
$$}
\end{thm}
As for lemma~\ref{lem:main1}, the bound implicit in theorem~\ref{thm:main1} is typically
significantly tighter than the bound in (\ref{eqn:tight-naive}) or its equivalent
form (\ref{eqn:tight-naive2}).
The argument
for the improved tightness of theorem~\ref{thm:main1} over (\ref{eqn:tight-naive2})
is similar to the argument for (\ref{eqn:main1-equation}). More specifically,
consider a hypothesis $h$ for which the bound in (\ref{eqn:tight-naive2}) is significantly
less than 1/2. Since $\hats(\dceil{q},\;\delta) \leq \ln|\calh|$,
the set of values of $q$ satisfying the condition in theorem~\ref{thm:main1}
must all be significantly less than 1/2. But for large $m$ we have that
$\sqrt{\frac{\ln(16m^2/\delta)}{2m-1}}$ is small. So if $q$ is significantly less
than 1/2 then all hypotheses in $\hat{\calh}(\dceil{q}, \delta)$ have empirical
error rates significantly less than 1/2. But for most hypothesis classes, e.g., decision trees,
the set of hypotheses with empirical error rates far from 1/2 should be an exponentially small
fraction of the class. Hence we get that $\hats(\dceil{q},\;\delta)$
is significantly less than $\ln|\calh|$ and theorem~\ref{thm:main1} is tighter
than (\ref{eqn:tight-naive2}).
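The bound of theorem~\ref{thm:main1} is directly computable from the histogram of training errors. The following Python sketch (all names and the toy histogram are ours) evaluates it by precomputing $\hats(\frac{k}{m},\;\delta)$ for each shell and scanning a grid of candidate values of $q$; for comparison it also evaluates the closed form of the classical bound (\ref{eqn:tight-naive2}) at $\hateps(h) = 0$, where $D(0||q) = -\ln(1-q)$.

```python
import math

def kl(q, p):
    """Binary KL-divergence D(q||p) with the usual boundary conventions."""
    if q <= 0.0:
        return math.inf if p >= 1.0 else -math.log(1.0 - p)
    if q >= 1.0:
        return math.inf if p <= 0.0 else -math.log(p)
    if p <= 0.0 or p >= 1.0:
        return math.inf
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def shat_table(err_counts, m, delta):
    """hat-s(k/m, delta) for k = 1, ..., m, computed once from a histogram
    {training error value: number of hypotheses}."""
    width = 1.0 / m + math.sqrt(math.log(16.0 * m * m / delta) / (2.0 * m - 1.0))
    return {k: math.log(max(1, 2 * sum(c for e, c in err_counts.items()
                                       if abs(e - k / m) <= width)))
            for k in range(1, m + 1)}

def theorem1_bound(ehat, err_counts, m, delta, grid=2000):
    """lub{ q : D(ehat||q) <= (hat-s(dceil(q), delta) + ln(4m/delta)) / m }."""
    table = shat_table(err_counts, m, delta)
    log_term = math.log(4.0 * m / delta)
    best = 0.0
    for i in range(grid + 1):
        q = i / grid
        k = max(1, math.ceil(m * q))
        if kl(ehat, q) <= (table[k] + log_term) / m:
            best = q
    return best

# One hypothesis with zero training error among a million with error 1/2.
m, delta = 1000, 0.05
err_counts = {0.5: 10 ** 6, 0.0: 1}
bound = theorem1_bound(0.0, err_counts, m, delta)
# Classical bound from ln|H| at hat-e = 0, using D(0||q) = -ln(1 - q).
N = sum(err_counts.values())
naive = 1.0 - math.exp(-(math.log(N) + math.log(2.0 / delta)) / m)
```

In this example the million hypotheses near error $1/2$ contribute nothing to the effective class size at small $q$, so the computable shell bound is substantially below the classical one.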
The remainder of this section is a proof of lemma~\ref{lem:main2}.
Our departure point for the proof is the following lemma from \cite{PACBayes2}.
\ignore{
\begin{lem}
\label{lem:helper1}
\[ \forall \delta > 0 \;\forall ^{\delta }S \;\;\;\frac{1}{|H|}\sum_h e^{(m-1)D(\hat{e}(h)||e(h))}
\leq \frac{2m}{\delta } \]
\end{lem}
\begin{proof}
First, for any given hypothesis $h$ we prove the following.
\begin{equation}
\label{eqn:expectation}
E_{S}[e^{(m-1)D(\hate(h)||e(h))}]\leq 2m
\end{equation}
Lemma~\ref{lem:helper1} follows from (\ref{eqn:expectation}) by taking an expectation over selecting $h$
according to a uniform distribution on $H$, reversing the two expectations, and applying Markov's inequality.
We now show that (\ref{eqn:expectation}) follows from (\ref{eqn:relent-prelim}) and its dual.
More specifically,
we maximize $\int_0^1 e^{(m-1)D(x||p)}f(x)dx$ over all functions $f(x)$
satisfying the following for all $q_1 \geq e(h)$ and $q_2 \leq e(h)$.
$$\int_{q_1}^1 f(x)dx \leq e^{-mD(q_1||e(h))}$$
$$\int_0^{q_2} f(x)dx \leq e^{-mD(q_2||e(h))}$$
The value of $E_{S}[e^{(m-1)D(\hate(h)||e(h))}]$ must be less than this maximum.
The integral $\int_0^1 e^{(m-1)D(x||e(h))}f(x)dx$ is maximized when $f(x)$
is as ``spread out'' as possible, i.e., when the above inequalities are
replaced by equalities. This gives the following.
$$f(x) = \left\{ \begin{array}{ll} m\frac{\partial D(x||p)}{\partial x} e^{-mD(x||p)} & \mbox{for}\; x \geq p \\
\\
- m\frac{\partial D(x||p)}{\partial x} e^{-mD(x||p)} & \mbox{for}\; x \leq p
\end{array} \right.$$
Which, in turn, gives the following.
\begin{eqnarray*}
E_S[E^{(m-1)D(\hate(h)||e(h))}] & \leq & \int_0^1 e^{(m-1)D(x||e(h))}f(x)dx \\
\\
& = & \int_0^{e(h)} -m \frac{\partial D(x||e(h))}{\partial x} e^{-D(x||e(h))} dx
+ \int_{e(h)}^1 m \frac{\partial D(x||e(h))}{\partial x} e^{-D(x||e(h))} dx \\
& \leq & 2m
\end{eqnarray*}
\end{proof}
}
\begin{lem}[McAllester 99]
\label{lem:helper1}
For any measure on any hypothesis class we have
the following where $\mtt{E}_h f(h)$ denotes the expectation of $f(h)$ under the
given measure on $h$.
\[ \forall \delta > 0 \;\forall ^{\delta }S \;\;\;\mtt{E}_h e^{(2m-1)(\hat{e}(h)-e(h))^2}
\leq \frac{4m}{\delta } \]
\end{lem}
Intuitively, this lemma states that with high confidence over the choice of the sample
most hypotheses have empirical error near their true error. This will allow us to prove
that $\hats(\dceil{q},\;\delta)$ bounds $s(\dceil{q})$. More specifically,
by considering the uniform distribution on $\calh(\frac{k}{m})$, lemma
\ref{lem:helper1} implies the following.
{\scriptsize
\begin{eqnarray*}
\mtt{E}_{h\sim \calh\left(\frac{k}{m}\right)} \left( e^{(2m-1)(\hat{e}(h)-e(h))^2} \right)
& \leq & \frac{4m}{\delta } \\
\\
Pr_{h\sim \calh\left(\frac{k}{m}\right)} \left(e^{(2m-1)(\hat{e}(h)-e(h))^2} \geq \frac{8m}{\delta }\right)
& \leq & \frac{1}{2} \\
\\
Pr_{h\sim \calh\left(\frac{k}{m}\right)} \left(e^{(2m-1)(\hat{e}(h)-e(h))^2} \leq \frac{8m}{\delta }\right)
& \geq & \frac{1}{2} \\
\\
\left|\left\{h\in \calh(\frac{k}{m}):\;|\hateps(h)-e(h)|
\leq \sqrt{\frac{\ln(\frac{8m}{\delta})}{2m-1}}\right\}\right|
& \geq & \frac{1}{2}|\calh(\frac{k}{m})| \\
\\
\left|\left\{h\in \calh(\frac{k}{m}):\;|\hateps(h) - \frac{k}{m}| \leq \frac{1}{m} +
\sqrt{\frac{\ln(\frac{8m}{\delta})}{2m-1}}\right\}\right|
& \geq & \frac{1}{2}|\calh(\frac{k}{m})| \\
\end{eqnarray*}
$$\left|\calh(\frac{k}{m})\right| \leq 2\left|\hat{\calh}\left(\frac{k}{m},\;2m\delta\right)\right|$$
}
Lemma~\ref{lem:main2} now follows by quantification over $q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$.
\qed
\section{Asymptotic Analysis and Phase Transitions}
\label{sec:asym-upper}
The bounds given in (\ref{eqn:main1-equation}) and theorem~\ref{thm:main1}
exhibit phase transitions. More specifically, the bounding expression
can be discontinuous in $\delta$ and $m$, e.g., arbitrarily small changes in $\delta$
can cause large changes in the bound. To see how this happens consider
the following constraint on the quantity $q$.
\begin{equation}\label{eqn:constraint}
D(\hateps(h)||q) \leq \frac{s(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}
\end{equation}
The bound given by~(\ref{eqn:main1-equation}) is the least upper bound of the
values of $q$ satisfying~(\ref{eqn:constraint}).
Assume that $m$ is sufficiently large that we can think of $\frac{s(\dceil{q})}{m}$
as a continuous function of $q$ which we will write as $\tils(q)$.
We can then rewrite (\ref{eqn:constraint}) as follows where $\lambda$
is a quantity not depending on $q$ and $\tils(q)$ does not depend on $\delta$.
\begin{equation}
\label{eqn:constraint2}
D(\hateps(h)||q) \leq \tils(q) + \lambda
\end{equation}
For $q \geq \hateps(h)$ we have that $D(\hateps(h)||q)$ is a monotonically
increasing function of $q$. It is reasonable to assume that for $q
\leq 1/2$ we also have that $\tils(q)$ is a monotonically increasing
function of $q$. But even under these conditions it is possible that
the feasible values of $q$, i.e., those satisfying
(\ref{eqn:constraint2}), can be divided into separated regions.
Furthermore, increasing $\lambda$ can cause a new feasible region to
come into existence. When this happens the bound, which is the
least upper bound of the feasible values, can increase discontinuously. At a
more intuitive level, consider a large number of high error concepts
and a smaller number of lower error concepts. At a certain confidence
level the high error concepts can be ruled out. But as the confidence
requirement becomes more stringent suddenly (and discontinuously) the
high error concepts must be considered. A similar discontinuity can
occur in sample size. Phase transitions in shell decomposition bounds
are discussed in more detail by Haussler et al. \cite{HKST96}.
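The discontinuity is easy to reproduce numerically. In the sketch below (names and the shell profile $\tils$ are ours; the profile is an invented toy example) a small increase in $\lambda$ makes the large shell near $1/2$ feasible, and the least upper bound of the feasible values of $q$ jumps discontinuously.

```python
import math

def kl(q, p):
    """Binary KL-divergence D(q||p) with the usual boundary conventions."""
    if q <= 0.0:
        return math.inf if p >= 1.0 else -math.log(1.0 - p)
    if q >= 1.0:
        return math.inf if p <= 0.0 else -math.log(p)
    if p <= 0.0 or p >= 1.0:
        return math.inf
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def feasible_lub(ehat, tils, lam, grid=20000):
    """lub{ q in [0,1] : D(ehat||q) <= tils(q) + lam }, scanned on a grid."""
    best = 0.0
    for i in range(grid + 1):
        q = i / grid
        if kl(ehat, q) <= tils(q) + lam:
            best = q
    return best

# Toy limiting shell profile: a few low-error hypotheses plus a large
# shell of hypotheses with true error near 1/2.
def tils(q):
    return 0.5 if 0.4 <= q <= 0.6 else 0.01

before = feasible_lub(0.0, tils, 0.005)  # high error shell still ruled out
after = feasible_lub(0.0, tils, 0.02)    # shell becomes feasible: bound jumps
```

Here a change in $\lambda$ of $0.015$ moves the bound from roughly $0.015$ to roughly $0.4$, the discontinuous behavior described above.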
Phase transitions complicate asymptotic analysis. But asymptotic
analysis illuminates the nature of phase transitions. As mentioned in
the introduction, in the asymptotic analysis of learning theory
bounds it is important that one not hold $\calh$ fixed as the sample
size increases. If we hold $\calh$ fixed then $\lim_{m\rightarrow
\infty} \frac{\ln|\calh|}{m} = 0$. But this is not what one expects
for large samples in practice. As the sample size increases
one typically uses larger hypothesis classes. Intuitively, we
expect that even for very large $m$ we have that
$\frac{\ln|\calh|}{m}$ is far from zero.
For the asymptotic analysis of the bound in~(\ref{eqn:main1-equation})
we assume an infinite sequence of hypothesis classes
$\calh_1$, $\calh_2$, $\calh_3$ $\ldots$ and an infinite sequence
of data distributions
$D_1$, $D_2$, $D_3$, $\ldots$. Let $s_m(\frac{k}{m})$ be $s(\frac{k}{m})$
defined relative to $\calh_m$ and $D_m$.
In the asymptotic analysis we assume that
the sequence of functions $\frac{s_m(\dceil{q})}{m}$, viewed as functions of $q \in [0,1]$, converge uniformly to
a continuous function $\tils(q)$. This means that for any $\epsilon > 0$ there
exists a $k$ such that for all $m > k$ we have the following.
$$\forall q \in [0,1] \;\; |\frac{s_m(\dceil{q})}{m} - \tils(q)| \leq \epsilon$$
Given the functions $\frac{s_m(\dceil{p})}{m}$ and their limit function
$\tils(p)$, we define the following functions of an empirical error
rate $\hat{e}$.
{\footnotesize \begin{eqnarray*}
B_m(\hat{e}) & \equiv & \lub\;\left\{q:D(\hat{e}||q) \leq \frac{s_m(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\right\} \\
\\
B(\hat{e}) & \equiv & \lub\;\{q:D(\hat{e}||q) \leq \tils(q)\}
\end{eqnarray*}}
The function $B_m(\hat{e})$ corresponds directly to the upper bound in (\ref{eqn:main1-equation}).
The function $B(\hat{e})$ is intended to be the large $m$ asymptotic limit of $B_m(\hat{e})$.
However, phase transitions complicate asymptotic analysis. The bound $B(\hat{e})$
need not be a continuous function of $\hat{e}$. A value of $\hat{e}$ where
the bound $B(\hat{e})$ is discontinuous corresponds to a phase transition in the bound.
At a phase transition the sequence $B_m(\hat{e})$ need not converge.
Away from phase transitions, however, we have the following theorem.
\begin{thm}
\label{thm:upper-limit}
If the bound $B(\hat{e})$ is continuous at the point $\hat{e}$ (so we are not at
a phase transition), and the functions $\frac{s_m(\dceil{q})}{m}$, viewed as functions
of $q \in [0,1]$, converge uniformly to a continuous function $\tils(q)$, then
we have the following.
$$\lim_{m\rightarrow \infty} B_m(\hat{e}) = B(\hat{e})$$
\end{thm}
\begin{proof}
Define the set $F_m(\hat{e})$ as follows.
{\footnotesize $$F_m(\hat{e}) \equiv \left\{q: \;D(\hat{e}||q)
\leq \frac{s_m(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\right\}$$}
This gives the following.
$$B_m(\hat{e}) = \mtt{lub}\; F_m(\hat{e})$$
Similarly, define $F(\hat{e},\;\epsilon)$ and $B(\hat{e},\;\epsilon)$
as follows.
\begin{eqnarray*}
F(\hat{e},\;\epsilon) & \equiv & \{q \in [0,1]:\;D(\hat{e}||q) \leq \tils(q) + \epsilon\} \\
B(\hat{e},\;\epsilon) & \equiv & \mtt{lub}\; F(\hat{e},\;\epsilon)
\end{eqnarray*}
We first show that the continuity of $B(\hat{e})$ at the point $\hat{e}$
implies the continuity of $B(\hat{e},\;\epsilon)$ at the point
$\tuple{\hat{e},\;0}$.
We note that there exists a continuous function $f(\hat{e},\;\epsilon)$
with $f(\hat{e},\;0) =\hat{e}$ and such that
for any $\epsilon$ sufficiently near 0 we have the following.
$$D(f(\hat{e},\;\epsilon)||q) = D(\hat{e}||q) - \epsilon$$
We then get the following equation.
$$B(\hat{e},\;\epsilon) = B(f(\hat{e},\;\epsilon))$$
Since $f$ is continuous,
and $B(\hat{e})$ is continuous at the point $\hat{e}$, we get that
$B(\hat{e},\;\epsilon)$ is continuous at the point $\tuple{\hat{e},\;0}$.
We now prove the theorem.
The functions of the form $\frac{s_m(\dceil{q}) + \ln\frac{2m}{\delta}}{m}$ converge uniformly
to the function $\tils(q)$. This implies that for any $\epsilon > 0$
there exists a $k$ such that for all $m > k$ we have the following.
$$F(\hat{e},\;-\epsilon) \subseteq F_m(\hat{e}) \subseteq F(\hat{e},\;\epsilon)$$
But this in turn implies the following.
\begin{equation}
\label{eqn:proof-bounds}
B(\hat{e},\;-\epsilon) \leq B_m(\hat{e}) \leq B(\hat{e},\;\epsilon)
\end{equation}
The theorem now follows from the continuity of the function $B(\hat{e},\;\epsilon)$
at the point $\tuple{\hat{e},\;0}$.
\end{proof}
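As a numerical sanity check of this convergence (again purely illustrative), the sketch below uses a hypothetical constant limit curve $\tils(q)$, for which $B(\hat{e})$ has no phase transition, and compares $B_m(\hat{e})$ with $B(\hat{e})$ on a grid for small and large $m$.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) between Bernoulli distributions."""
    if q <= 0.0 or q >= 1.0:
        return float("inf")
    d = 0.0
    if p > 0.0:
        d += p * math.log(p / q)
    if p < 1.0:
        d += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return d

def lub(pred, grid=20000):
    """Grid approximation of lub{q in [0,1] : pred(q)}."""
    return max((i / grid for i in range(grid + 1) if pred(i / grid)),
               default=0.0)

s_til = lambda q: 0.05          # hypothetical constant limit curve
e_hat, delta = 0.1, 0.05

def B_finite(m):
    # Here s_m(q)/m equals s_til(q) exactly, so only the slack term varies.
    slack = math.log(2 * m / delta)
    return lub(lambda q: kl(e_hat, q) <= s_til(q) + slack / m)

B_limit = lub(lambda q: kl(e_hat, q) <= s_til(q))
gap_small = abs(B_finite(100) - B_limit)
gap_large = abs(B_finite(100000) - B_limit)
```

The residual gap at finite $m$ comes entirely from the $\frac{\ln(2m/\delta)}{m}$ slack, which vanishes as $m$ grows, consistent with the bound being asymptotically independent of $\delta$.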
Theorem~\ref{thm:upper-limit} can be interpreted as saying that for large sample
sizes, and for values of $\hat{e}$ other than the special phase transition values,
the bound has a well defined value independent of the confidence parameter $\delta$
and determined only by a smooth function $\tils(q)$.
A similar statement can be made for the bound in
theorem~\ref{thm:main1} --- for
large $m$, and at points other than phase transitions, the bound is
independent of $\delta$ and is determined by a smooth limit curve.
For the asymptotic analysis of theorem~\ref{thm:main1}
we assume
an infinite sequence $\calh_1$, $\calh_2$, $\calh_3$, $\ldots$ of hypothesis
classes and an infinite sequence $S_1$, $S_2$, $S_3$, $\ldots$ of samples
such that sample $S_m$ has size $m$. Let $\calh_m(\frac{k}{m},\;\delta)$
and $\hats_m(\frac{k}{m},\;\delta)$ be $\calh(\frac{k}{m},\;\delta)$ and
$\hats(\frac{k}{m},\;\delta)$ respectively defined relative to hypothesis class $\calh_m$
and sample $S_m$.
Let $U_m(\frac{k}{m})$ be the set of hypotheses in $\calh_m$ having an empirical
error of exactly $\frac{k}{m}$ in the sample $S_m$.
Let $u_m(\frac{k}{m})$ be $\ln(\max(1,\;|U_m(\frac{k}{m})|))$.
In the analysis of theorem~\ref{thm:main1} we allow that the functions
$\frac{u_m(\dceil{q})}{m}$ are only locally uniformly convergent
to a continuous function $\bar{u}(q)$, i.e., for
any $q \in [0,1]$ and any $\epsilon> 0$ there exists an integer
$k$ and real number $\gamma > 0$ satisfying the following.
$$\forall m > k, \;\forall p \in (q-\gamma,\;q+\gamma)\; | \frac{u_m(\dceil{p})}{m} - \bar{u}(p)| \leq \epsilon$$
Locally uniform convergence plays a role in the analysis in section~\ref{sec:smearing}.
\begin{thm}
\label{thm:upper-agreement}
If the functions $\frac{u_m(\dceil{q})}{m}$ converge locally
uniformly to a continuous function $\bar{u}(q)$ then,
for any fixed value of $\delta$,
the functions $\frac{\hats_m(\dceil{q},\;\delta)}{m}$ also converge locally
uniformly to $\bar{u}(q)$. If the convergence of $\frac{u_m(\dceil{q})}{m}$
is uniform, then so is the convergence of $\frac{\hats_m(\dceil{q},\;\delta)}{m}$.
\end{thm}
\begin{proof} Consider an arbitrary value $q \in [0,1]$ and $\epsilon > 0$.
We will construct the desired $k$ and $\gamma$. More specifically,
select $k$ sufficiently large and $\gamma$ sufficiently small that
we have the following properties.
{\footnotesize
$$\forall m > k, \;\forall p \in (q-2\gamma,\;q+2\gamma)\; \left| \frac{u_m(\dceil{p})}{m} - \bar{u}(p)\right|
< \frac{\epsilon}{3}$$}
\medskip
$$\forall p \in (q-2\gamma,\;q+2\gamma)\;\; |\bar{u}(p)-\bar{u}(q)| \leq \frac{\epsilon}{3}$$
\medskip
$$\frac{1}{k} + \sqrt{\frac{\ln(\frac{16k^2}{\delta})}{2k-1}} < \gamma$$
\medskip
$$\frac{\ln k}{k} \leq \frac{\epsilon}{3}$$
Consider an $m > k$ and $p \in (q-\gamma,\;q+\gamma)$.
It now suffices to show the following.
$$\left|\frac{\hats_m(\dceil{p},\;\delta)}{m} - \bar{u}(p)\right| \leq \epsilon$$
Because $U_m(\dceil{p})$ is a subset of $\calh_m(\dceil{p},\;\delta)$
we have the following.
$$\frac{\hats_m(\dceil{p},\;\delta)}{m} \geq \frac{u_m(\dceil{p})}{m} \geq \bar{u}(p) - \frac{\epsilon}{3}$$
We can also upper bound $\frac{\hats_m(\dceil{p},\;\delta)}{m}$
as follows.
\begin{eqnarray*}
|\calh_m(\dceil{p},\delta)| & \leq & \sum_{|\frac{j}{m} - p|
\leq \gamma} \left|U_m\left(\frac{j}{m}\right)\right| \\
\\
& \leq & \sum_{|\frac{j}{m} - p| \leq \gamma} e^{u_m(\frac{j}{m})} \\
\\
& \leq & \sum_{|\frac{j}{m} - p| \leq \gamma} e^{m(\bar{u}(\frac{j}{m}) + \frac{\epsilon}{3})} \\
\\
& \leq & \sum_{|\frac{j}{m} - p| \leq \gamma} e^{m(\bar{u}(p) + \frac{2\epsilon}{3})} \\
\\
& \leq & me^{m(\bar{u}(p) + \frac{2\epsilon}{3})} \\
\\
\frac{\hats_m(\dceil{p},\;\delta)}{m} & \leq & \bar{u}(p) + \frac{2\epsilon}{3} + \frac{\ln m}{m} \\
\\
& \leq & \bar{u}(p) + \epsilon
\end{eqnarray*}
A similar argument shows that if $\frac{u_m(\dceil{q})}{m}$ converges
uniformly to $\bar{u}(q)$ then so does $\frac{\hats_m(\dceil{q},\;\delta)}{m}$.
\end{proof}
Given quantities $\frac{\hats_m(\dceil{q},\;\delta)}{m}$ that converge uniformly to $\bar{u}(q)$
the remainder of the analysis is identical to that for the asymptotic analysis of
(\ref{eqn:main1-equation}). We define the following upper bounds.
{\scriptsize \begin{eqnarray*}
\hat{B}_m(\hat{e}) &\equiv & \lub\;\left\{q: D(\hat{e}||q)
\leq \frac{\hats_m(\dceil{q},\;\delta) + \ln\left(\frac{4m}{\delta}\right)}{m}\right\} \\
\\
\hat{B}(\hat{e}) & \equiv & \lub\;\{q: D(\hat{e}||q) \leq \bar{u}(q)\}
\end{eqnarray*}}
Again we say that $\hat{e}$ is at a phase transition if the function $\hat{B}(\hat{e})$
is discontinuous at the value $\hat{e}$.
We then get the following theorem, whose proof is identical to that of theorem~\ref{thm:upper-limit}.
\begin{thm}
\label{thm:upper-limit2}
If the bound $\hat{B}(\hat{e})$ is continuous at the point $\hat{e}$ (so we are not at
a phase transition), and the functions $\frac{u_m(\dceil{q})}{m}$
converge uniformly to $\bar{u}(q)$, then
we have the following.
$$\lim_{m\rightarrow \infty} \hat{B}_m(\hat{e}) = \hat{B}(\hat{e})$$
\end{thm}
\section{Asymptotic Optimality of (\ref{eqn:main1-equation})}
\label{sec:lower}
Formula~(\ref{eqn:main1-equation})
can be viewed as providing an upper bound on $e(h)$ as a function of
$\hateps(h)$ and the function $s$. In this section we show that for
any entropy curve $s$ and value $\hateps$
there exists a hypothesis class and data distribution such that
the upper bound in (\ref{eqn:main1-equation}) is realized
up to asymptotic equality. Up to asymptotic equality,
(\ref{eqn:main1-equation}) is the tightest possible bound computable from
$\hateps(h)$ and the $m$ numbers $s(\frac{1}{m})$, $\ldots$, $s(\frac{m}{m})$.
The classical VC dimension bounds are nearly optimal
over bounds computable from $\hateps(h^*)$ and the class $\calh$.
The $m$ numbers $s(\frac{1}{m})$, $\ldots$, $s(\frac{m}{m})$ depend on both $\calh$ and
the data distribution. Hence the bound in (\ref{eqn:main1-equation})
uses information about the distribution and so can be tighter than classical VC bounds.
A similar statement applies to the bound in theorem~\ref{thm:main1}
computed from the empirically observable numbers $\hats(\frac{1}{m})$, $\ldots$, $\hats(\frac{m}{m})$.
In this case, the bound uses more information from the sample than just
$\hateps(h)$.
The optimality theorem given here also differs from the traditional lower
bound results for VC dimension in that here the lower bounds match the upper
bounds up to asymptotic equality.
The departure point for our optimality analysis
is the following lemma from~\cite{TnC}.
\begin{lem}[Cover and Thomas]
\label{lem:ct2}
If $\hat{p}$ is the fraction of heads out of $m$ tosses of a coin
where the true probability of heads is $p$ then for $q \geq p$ we have
the following.
$$Pr(\hat{p} \geq q) \geq \frac{1}{m+1} e^{-mD(q||p)}$$
\end{lem}
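Lemma~\ref{lem:ct2} can be checked by simulation. The Python sketch below (with illustrative values of $m$, $p$, and $q$) estimates $Pr(\hat{p} \geq q)$ by Monte Carlo and compares the estimate with the stated lower bound.

```python
import math
import random

def kl(p, q):
    """Relative entropy D(p||q) for Bernoulli parameters in (0,1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

m, p, q = 50, 0.3, 0.5            # hypothetical coin and threshold
cover_thomas = math.exp(-m * kl(q, p)) / (m + 1)

random.seed(0)
trials = 20000
hits = 0
for _ in range(trials):
    heads = sum(random.random() < p for _ in range(m))
    if heads / m >= q:
        hits += 1
estimate = hits / trials          # Monte Carlo estimate of Pr(p_hat >= q)
```

The empirical frequency comfortably exceeds the $\frac{1}{m+1}e^{-mD(q||p)}$ lower bound while remaining within a polynomial factor of it, reflecting the tightness of the Chernoff upper bound~(\ref{eqn:relent-prelim}).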
This lower bound on $Pr(\hat{p} \geq q)$ is very close to Chernoff's 1952
upper bound~(\ref{eqn:relent-prelim}). The tightness of (\ref{eqn:main1-equation})
is a direct reflection of the tightness of (\ref{eqn:relent-prelim}). To exploit
Lemma~\ref{lem:ct2} we need to construct hypothesis classes and data distributions
where distinct hypotheses have independent training errors. More specifically,
we say that a set of hypotheses $\{h_1,\;\ldots,\;h_n\}$ has independent training errors
if the random variables
$\hateps(h_1)$, $\ldots$, $\hateps(h_n)$ are independent.
By an argument similar to the derivation of (\ref{eqn:relent2}) from (\ref{eqn:relent-prelim})
we can prove the following from Lemma \ref{lem:ct2}.
{\footnotesize \begin{equation}
\label{eqn:lower-relent}
Pr\left(D(\min(\hat{p},p)||p) \geq \frac{\ln(\frac{1}{\delta}) - \ln(m+1)}{m}\right) \geq \delta
\end{equation}}
\begin{lem}
\label{lem:exists}
Let $X$ be any finite set, $S$ a random variable,
and $\Theta[S,x,\delta]$ a formula such that
for every $x\in X$ and $\delta >0$
we have \( Pr(\Theta [S,x,\delta ])\geq \delta \), and
$Pr(\forall x\in X\; \Theta [S,\; x,\; \delta])=\prod _{x\in X}Pr(\Theta [S,\; x,\; \delta])$.
We then have $\forall \delta >0 \;\forall ^{\delta }S \;\;\;\exists x\in X \;\;\Theta [S,x,\frac{\ln (\frac{1}{\delta })}{|X|}]$.
\end{lem}
\begin{proof}
\[
\begin{array}{ccc}
Pr(\Theta [S,x,\frac{\ln (\frac{1}{\delta })}{|X|}]) & \geq & \frac{\ln (\frac{1}{\delta })}{|X|}\\
Pr(\neg \Theta [S,x,\frac{\ln(\frac{1}{\delta })}{|X|}]) & \leq & 1-\frac{\ln (\frac{1}{\delta })}{|X|}\\
& \leq & e^{-\frac{\ln (\frac{1}{\delta })}{|X|}}\\
Pr(\forall x \in X \neg \Theta[S,\;x,\frac{\ln(\frac{1}{\delta })}{|X|}]) & \leq & e^{-\ln ({\frac{1}{\delta}})} = \delta
\end{array}\]
\end{proof}
Now define $h^*(\frac{k}{m})$ to be the hypothesis of minimal training
error in the set $\calh(\frac{k}{m})$. Let $\glb\;\{x:\;\Phi[x]\}$
denote the greatest lower bound (the minimum) of the set $\{x:\;\Phi[x]\}$.
We now have the following lemma.
\begin{lem}
\label{lem:ct-consequence}
If the hypotheses in the class $\calh(\dceil{q})$ are independent then
$\forall \delta > 0$, $\forall^\delta S$, $\forall q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$,
{\scriptsize
$$\hateps(h^*(q)) \leq \glb\;\left\{\hateps: \begin{array}{l}
D(\min(\hateps,\;q-\frac{1}{m})||q) \\
\\
\leq \frac{s(q) - \ln(m+1) - \ln(\ln(\frac{m}{\delta}))}{m}
\end{array}\right\}$$
}
\end{lem}
\begin{proof}
To prove lemma~\ref{lem:ct-consequence} let $q$ be a fixed rational number of the form $\frac{k}{m}$.
Assuming independent hypotheses we can apply Lemma~\ref{lem:exists} to (\ref{eqn:lower-relent})
to get $\forall \delta > 0$, $\forall^\delta S$, $\exists h \in \calh(\frac{k}{m})$,
{\scriptsize $$D(\min(\hateps(h),e(h))||e(h)) \geq \frac{s(q) - \ln(m+1) -
\ln(\ln(\frac{1}{\delta}))}{m}$$}
Let $w$ be the hypothesis in $\calh(q)$ satisfying this formula.
We now have $\hateps(h^*(q)) \leq \hateps(w)$
and $q-\frac{1}{m} \leq e(w) \leq q$.
These two conditions imply $\forall \delta > 0$, $\forall^\delta S$,
$$\begin{array}{l}
D(\min(\hateps(h^*(q)),q-\frac{1}{m})||q) \\
\\
\geq \frac{s(q) - \ln(m+1) -\ln(\ln\frac{1}{\delta})}{m}
\end{array}$$
This implies the following.
{\footnotesize
$$\hateps(h^*(q)) \leq \glb\;\left\{\hateps: \begin{array}{l}
D(\min(\hateps,\;q-\frac{1}{m})||q) \\
\\
\leq \frac{s(q) - \ln(m+1) - \ln(\ln(\frac{1}{\delta}))}{m}
\end{array}\right\}$$}
Lemma~\ref{lem:ct-consequence} now follows by quantification over
$q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$.
\end{proof}
For $q \in [0,1]$, lemma~\ref{lem:main1} implies the following.
{\scriptsize
$$\hateps(h^*(\dceil{q})) \geq \glb\;\left\{\hateps:\begin{array}{l}
D\left(\hateps||\dceil{q}-\frac{1}{m}\right) \\
\\
\leq \frac{s(\dceil{q}) + \ln\left(\frac{2m}{\delta}\right)}{m}
\end{array}\right\}$$
}
We now have upper and lower bounds on the quantity $\hateps(h^*(\dceil{q}))$
which agree up to asymptotic equality --- in a large $m$ limit where
$\frac{s_m(\dceil{q})}{m}$ converges (pointwise) to a continuous function
$\tils(q)$ we have that the upper and lower bounds on
$\hateps(h^*(\dceil{q}))$ both converge (pointwise) to the following.
$$\hateps(h^*(q)) = \glb\;\{\hateps: D(\hateps||q) \leq \tils(q)\}$$
This asymptotic value of $\hateps(h^*(q))$ is a continuous
function of $q$. Since $q$ is held fixed in calculating the bounds on $\hateps(h^*(\dceil{q}))$,
phase transitions are not an issue and uniform
convergence of the functions $\frac{s_m(\dceil{q})}{m}$ is not required.
Note that for
large $m$ and independent hypotheses we get that $\hateps(h^*(q))$ is determined as a
function of the true error rate $q$ and $\frac{s_m(\dceil{q})}{m}$.
The following theorem states that any limit function $\tils(p)$ is consistent
with the possibility that hypotheses are independent. This, together with
lemma~\ref{lem:ct-consequence}, implies that no uniform bound on $e(h)$
as a function of $\hateps(h)$ and $|\calh(\frac{1}{m})|$, $\ldots$,
$|\calh(\frac{m}{m})|$ can be asymptotically tighter than
(\ref{eqn:main1-equation}).
\begin{thm}
\label{thm:construction}
Let $\tils(p)$ be any continuous function of $p \in [0,1]$.
There exists an infinite sequence of hypothesis spaces $\calh_1$, $\calh_2$,
$\calh_3$, $\ldots$, and an infinite sequence of data distributions $D_1$, $D_2$, $D_3$, $\ldots$
such that each class $\calh_m$ has independent hypotheses for data distribution
$D_m$ and such that
$\frac{s_m(\dceil{p})}{m}$ converges (pointwise) to $\tils(p)$.
\end{thm}
\begin{proof}
First we show that if $|\calh_m(\frac{i}{m})| =
e^{m\tils(\frac{i}{m})}$ then the functions
$\frac{s_m(\dceil{p})}{m}$ converge (pointwise) to $\tils(p)$.
Under this assumption
we have the following.
$$\frac{s_m(\dceil{p})}{m} = \tils(\dceil{p})$$
Since $\tils(p)$ is continuous, for any fixed value of $p$ we get that
$\frac{s_m(\dceil{p})}{m}$ converges to $\tils(p)$.
Recall that $D_m$ is a probability distribution on pairs $\tuple{x,\;y}$
with $y \in \{0,1\}$ and $x \in X_m$ for some set $X_m$.
We take $\calh_m$ to be a disjoint union of sets
$\calh_m(\frac{k}{m})$ where $|\calh_m(\frac{k}{m})|$ is selected as above.
Let $f_1$, $\ldots$, $f_N$ be the elements of $\calh_m$
with $N = |\calh_m|$.
Let $X_m$ be the set of all $N$-bit bit strings and define
$f_i(x)$ to be the value of the $i$th bit of the bit vector $x$.
Now define the distribution $D_m$ on pairs $\tuple{x,\;y}$
by selecting $y$ to be 1 with probability 1/2 and then selecting
each bit of $x$ independently where the $i$th bit is selected to disagree
with $y$ with probability $\frac{k}{m}$ where $k$ is such that $f_i \in \calh_m(\frac{k}{m})$.
\end{proof}
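The construction in this proof can be sketched in code. The Python fragment below builds a small instance with hypothetical error rates: hypothesis $f_i$ reads bit $i$ of $x$, and the bits of $x$ are drawn independently given the label $y$, so distinct hypotheses have independent training errors.

```python
import random

# Hypothetical true error rates: hypothesis f_i belongs to the class
# of hypotheses with true error err[i].
err = [0.1, 0.1, 0.4, 0.4, 0.7]
N = len(err)

def draw_example():
    """One pair <x, y> from D_m: y is uniform on {0,1}, and bit i of x
    disagrees with y independently with probability err[i]."""
    y = random.randint(0, 1)
    x = [y ^ (random.random() < err[i]) for i in range(N)]
    return x, y

random.seed(1)
m = 400
sample = [draw_example() for _ in range(m)]
# Empirical error of f_i: the fraction of examples whose ith bit differs from y.
emp = [sum(x[i] != y for x, y in sample) / m for i in range(N)]
```

Because the bits of $x$ are independent given $y$, each $\hateps(f_i)$ is an independent $\frac{1}{m}\mbox{Binomial}(m,\;\mtt{err}[i])$ random variable with mean equal to the true error rate of $f_i$.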
\section{Relating $\hat{s}$ and $s$}
\label{sec:smearing}
In this section we show that in large $m$ limits of
the type discussed in section~\ref{sec:asym-upper} the
histogram of empirical errors need not converge to the histogram
of true errors. So even in the large $m$ asymptotic limit,
the bound given by theorem~\ref{thm:main1} is significantly weaker
than the bound given by~(\ref{eqn:main1-equation}).
To show that $\hats(\dceil{q},\;\delta)$ can be asymptotically
different from $s(\dceil{q})$ we consider the case of independent
hypotheses. More specifically, given a continuous function $\tils(p)$
we construct an infinite sequence of
hypothesis spaces $\calh_1$, $\calh_2$, $\calh_3$, $\ldots$ and an
infinite sequence of data distributions $D_1$, $D_2$, $D_3$, $\ldots$
using the construction in the proof of theorem~\ref{thm:construction}.
We note that if $\tils(p)$ is differentiable with
bounded derivative then the functions $\frac{s_m(\dceil{p})}{m}$
converge uniformly to $\tils(p)$.
For a given infinite sequence of data distributions
we generate an infinite sample sequence
$S_1$, $S_2$, $S_3$, $\ldots$, by selecting $S_m$ to consist of $m$
pairs $\tuple{x,\;y}$ drawn IID from distribution $D_m$. For a given sample
sequence and $h \in \calh_m$ we define $\hateps_m(h)$ and
$\hats_m(\frac{k}{m},\;\delta)$ in a manner similar to $\hateps(h)$
and $\hats(\frac{k}{m},\;\delta)$ but for sample $S_m$.
The main result of this section is the following.
\begin{stm}
\label{thm:smearing}
If each $\calh_m$ has independent hypotheses under data distribution $D_m$,
and the functions $\frac{s_m(\dceil{p})}{m}$ converge uniformly
to a continuous function $\tils(p)$, then for any $\delta > 0$ and $p \in [0,1]$,
we have the following with probability 1 over the generation of the sample sequence.
$$\lim_{m\rightarrow \infty} \frac{\hats_m(\dceil{p},\delta)}{m} = \sup_{q\in[0,1]} \left(\tils(q) - D(p||q)\right)$$
\end{stm}
We call this a statement rather than a theorem because the proof
has not been worked out to a high level of rigor.
Nonetheless, we believe the proof sketch given below can be
expanded to a fully rigorous argument.
Before giving the proof sketch we note that the limiting value of
$\frac{\hats_m(\dceil{p},\;\delta)}{m}$ is independent of $\delta$. This is
consistent with theorem~\ref{thm:upper-agreement}. Define
$\bar{\hats}(p)$ as follows.
$$\bar{\hats}(p) \equiv \sup_{q \in [0,1]}\; \left(\tils(q) - D(p||q)\right)$$
Note that $\bar{\hats}(p) \geq \tils(p)$. This gives an
asymptotic version of lemma~\ref{lem:main2}. But since $D(p||q)$ can
be locally approximated as $c(p-q)^2$ (up to its second order Taylor
expansion), if $\tils(p)$ is increasing at the point $p$ then
we also get that $\bar{\hats}(p)$ is strictly larger than $\tils(p)$.
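The smeared curve $\bar{\hats}(p)$ is easy to compute numerically. The Python sketch below evaluates the supremum on a grid for a hypothetical increasing limit curve $\tils(q)$ and exhibits the strict gap $\bar{\hats}(p) > \tils(p)$.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) between Bernoulli distributions."""
    if q <= 0.0 or q >= 1.0:
        return float("inf")
    d = 0.0
    if p > 0.0:
        d += p * math.log(p / q)
    if p < 1.0:
        d += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return d

def s_bar_hat(p, s_til, grid=4000):
    """Grid approximation of sup over q of s_til(q) - D(p||q)."""
    return max(s_til(i / grid) - kl(p, i / grid)
               for i in range(1, grid))

s_til = lambda q: 0.2 * q      # hypothetical increasing limit curve
p = 0.3
smeared = s_bar_hat(p, s_til)
```

Since $D(p||q)$ grows only quadratically near $q = p$ while $\tils$ is increasing, the supremum is attained at some $q^* > p$, so \mtt{smeared} strictly exceeds $\tils(p) = 0.06$ here.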
\begin{proof_outline}
To prove statement~\ref{thm:smearing} we first define $\calh_m(p,\;q)$
for $p,q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ to be the set of
all $h \in \calh_m(q)$ such that $\hateps_m(h) = p$. Intuitively,
$\calh_m(p,\;q)$ is the set of concepts with true error rate near $q$
that have empirical error rate $p$. Ignoring factors that are only
polynomial in $m$, the probability of a hypothesis with true error
rate $q$ having empirical error rate $p$ can be written as
(approximately) $e^{-mD(p||q)}$. So the expected size of
$\calh_m(p,\;q)$ can be written as $|\calh_m(q)|e^{-mD(p||q)}$, or
alternatively, (approximately) as $e^{m\tils(q)}e^{-mD(p||q)}$ or
$e^{m(\tils(q)-D(p||q))}$. More formally, we have the following for
any fixed value of $p$ and $q$.
$$\lim_{m \rightarrow \infty} \frac{\ln(\max(1,\;\mtt{E}(|\calh_m(\dceil{p},\;\dceil{q})|)))}{m}
= \max(0,\;\tils(q) - D(p||q))$$
We now show that the expectation can be eliminated from the above
limit.
First, consider distinct values of $p$ and $q$ such that $\tils(q) - D(p||q) > 0$.
Since $p$ and $q$ are distinct, the probability that a fixed
hypothesis in $\calh_m(\dceil{q})$ is in $\calh_m(\dceil{p},\;\dceil{q})$ declines
exponentially in $m$. Since $\tils(q) -D(p||q) > 0$ the expected size of
$\calh_m(\dceil{p},\;\dceil{q})$ grows exponentially in $m$.
Since the hypotheses are independent, the distribution of possible values of
$|\calh_m(\dceil{p},\;\dceil{q})|$ becomes essentially a Poisson distribution
with an expected number of arrivals growing exponentially in $m$.
The probability that $|\calh_m(\dceil{p},\;\dceil{q})|$ deviates
from its expectation by as much as a factor of 2 declines exponentially in $m$.
We say that a sample sequence is safe after $k$ if for all $m > k$
we have that $|\calh_m(\dceil{p},\;\dceil{q})|$ is within a factor of 2 of its expectation.
Since the probability of being unsafe at $m$ declines exponentially in $m$,
for any $\delta$ there exists a $k$
such that with probability at least $1-\delta$ the sample sequence is
safe after $k$. So for any $\delta > 0$ we have that with probability at least $1-\delta$
the sequence is safe after some $k$. But since this holds for
all $\delta > 0$, with probability 1 such a $k$ must exist. It follows that
with probability 1 we have the following.
$$\lim_{m\rightarrow \infty} \;\frac{\ln(\max(1,|\calh_m(\dceil{p},\;\dceil{q})|))}{m}
= \tils(q) - D(p||q)$$
We now define $s_m(\dceil{p},\;\dceil{q})$ as follows.
$$s_m(\dceil{p},\;\dceil{q}) \equiv \ln(\max(1,\;|\calh_m(\dceil{p},\;\dceil{q})|))$$
It is also possible to show that for $p=q$, with probability 1,
$\frac{s_m(\dceil{p},\;\dceil{q})}{m}$ approaches $\tils(p)$,
and that for distinct $p$ and $q$ with $\tils(q)-D(p||q) \leq 0$, with probability 1,
$\frac{s_m(\dceil{p},\;\dceil{q})}{m}$ approaches 0. Putting these together
yields that with probability 1 we have the following.
{\footnotesize \begin{equation}
\label{eqn:smearing-limit1}
\lim_{m\rightarrow \infty} \;\frac{s_m(\dceil{p},\;\dceil{q})}{m} =
\max(0,\;\tils(q) - D(p||q))
\end{equation}}
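The expectation version of this limit can be checked exactly, since $\mtt{E}(|\calh_m(\dceil{p},\;\dceil{q})|)$ is $|\calh_m(\dceil{q})|$ times a binomial probability. The Python sketch below (with illustrative values of $p$, $q$, and $\tils(q)$) compares $\frac{\ln(\max(1,\;\mtt{E}(\cdot)))}{m}$ against $\max(0,\;\tils(q)-D(p||q))$ at a moderately large $m$.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) for Bernoulli parameters in (0,1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def log_binom_pmf(m, k, q):
    """ln of the Binomial(m, q) probability of exactly k successes."""
    log_choose = (math.lgamma(m + 1) - math.lgamma(k + 1)
                  - math.lgamma(m - k + 1))
    return log_choose + k * math.log(q) + (m - k) * math.log(1 - q)

def normalized_log_expected_size(m, p, q, s_til_q):
    """ln(max(1, E|H_m(p,q)|))/m with |H_m(q)| = exp(m * s_til(q))."""
    log_expect = m * s_til_q + log_binom_pmf(m, round(p * m), q)
    return max(0.0, log_expect) / m

p, q, s_til_q = 0.2, 0.4, 0.3      # hypothetical values
limit = max(0.0, s_til_q - kl(p, q))
approx = normalized_log_expected_size(4000, p, q, s_til_q)
```

The discrepancy at finite $m$ is of order $\frac{\ln m}{m}$, the polynomial factor the text says is being ignored.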
Define $U_m(\frac{k}{m})$ and $u_m(\frac{k}{m})$ as in section~\ref{sec:asym-upper}.
We now have the following equality.
$$U_m(\dceil{p}) = \cup_{q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}}
\calh_m(\dceil{p},\;q)$$ We will now show that with probability 1
we have that $\frac{u_m(\dceil{p})}{m}$ approaches $\bar{\hats}(p)$. First,
consider a $p \in [0,1]$ such that $\bar{\hats}(p) > 0$. Since
$\tils(q)-D(p||q)$ is a continuous function of $q$, and $[0,1]$ is a compact
set, $\sup_{q\in[0,1]} \left(\tils(q) - D(p||q)\right)$ must be
realized at some value $q^* \in [0,1]$, so that
$\tils(q^*)-D(p||q^*)$ equals $\bar{\hats}(p)$. We have that
$u_m(\dceil{p}) \geq s_m(\dceil{p},\;\dceil{q^*})$. This, together
with (\ref{eqn:smearing-limit1}), implies the following.
$$\liminf_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} \geq
\bar{\hats}(p)$$ We will now say that the sample sequence is safe at
$m$ and $\frac{k}{m}$ if $|\calh_m(\dceil{p},\;\dceil{\frac{k}{m}})|$
does not exceed twice the expectation of
$|\calh_m(\dceil{p},\;\dceil{q^*})|$. Assuming uniform convergence of
$\frac{s_m(\dceil{p})}{m}$, the probability of not being
safe at $m$ and $\frac{k}{m}$ declines exponentially in $m$ at a rate
at least as fast as the rate of decline of the probability of not
being safe at $m$ and $\dceil{q^*}$. By the union bound this implies
that for a given $m$ the probability that there exists an unsafe
$\frac{k}{m}$ also declines exponentially. We say that the sequence
is safe after $N$ if it is safe for all $m$ and $\frac{k}{m}$ with $m
> N$. The probability of not being safe after $N$ also declines
exponentially with $N$. By an argument similar to that given above,
this implies that with probability 1 over the choice of the sequence
there exists an $N$ such that the sequence is safe after $N$. But if
we are safe at $m$ then $|U_m(\dceil{p})| \leq
2m\mtt{E}|\calh_m(\dceil{p},\;\dceil{q^*})|$. This implies the following.
$$\limsup_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} \leq \bar{\hats}(p)$$
Putting the two bounds together we get the following.
$$\lim_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} = \bar{\hats}(p)$$
The above argument establishes (to some level of rigor) pointwise
convergence of $\frac{u_m(\dceil{p})}{m}$ to $\bar{\hats}(p)$. It is
also possible to establish a convergence rate that is a continuous
function of $p$. This implies that the convergence of
$\frac{u_m(\dceil{p})}{m}$ can be made locally uniform.
Theorem~\ref{thm:upper-agreement} then implies the desired result.
\end{proof_outline}
\section{Future Work}
A practical difficulty with the bound implicit in
theorem~\ref{thm:main1} is that it is usually impossible to enumerate
the elements of an exponentially large hypothesis class and hence
impractical to compute the histogram of training errors for the
hypotheses in the class. In practice the values of $s(\frac{k}{m})$
might be estimated using some form of Markov chain Monte Carlo
sampling over the hypotheses. For certain hypothesis spaces it might
also be possible to directly calculate the empirical error
distribution without evaluating every hypothesis.
\ignore{Consider the set of all maximal size decision trees made from
k binary features. For a particular example, half of the decision
trees will predict correctly and half incorrectly. A second different
example will be classified incorrectly by half the trees independent
of whether or not the first example was classified correctly. To see
this, just note that the leaf which classifies the first and second
examples differ because of the assumption the binary tree is full. In
general, the distribution of empirical errors will be a binomial in
the number of distinct examples. One important direction of future
work will be analyzing particular hypothesis spaces in this manner to
make the computable bound ``fast''. }
\ignore{ Given that a failure has occurred, what can be said in the
typical PAC setting? With an application of the conditional limit
theorem \cite{TnC} we can say that with high probability ``bad''
events in the typical PAC setting will not be ``very bad'' i.e. the
true error will not be far from the \( \hat{e}_{min}+\epsilon \). This
is no longer the case when using a distribution dependent PAC bound -
``bad'' events can be ``very bad'' (an error arbitrarily large) with
high probability when the bound is used. There are two caveats to
this statement. First, whenever ``very bad'' errors can occur the
corresponding standard PAC bound will use a ``very large'' value of
$\epsilon$. The other caveat is that the ``very bad'' errors can only
occur when a phase transition in the Distribution Dependent PAC bound
will occur at a larger $m$. }
Here we have emphasized asymptotic properties of our bound but we have
not addressed rates of convergence. For finite sample sizes the rate
at which the bound converges to its asymptotic behavior can be
important. Before mentioning some ways that the convergence rate
might be improved, however, we note that near phase transitions
standard notions of convergence rate are inappropriate. Near a phase
transition the bound is ``unstable'' --- replacing $\delta$ by
$\delta/2$ can alter the bound significantly. In fact, near a phase
transition it is likely that $e(h^*)$ is significantly different for
different samples even though $\hate(h^*)$ is highly predictable.
Intuitively, we would like a notion of convergence rate that measures
the size of the ``region of instability'' around a phase transition.
As the sample size increases the phase transition becomes sharper and
the region of instability smaller. It would be nice to have a formal
definition for the region of instability and the rate at which the
size of this region goes to zero, i.e., the rate at which phase
transitions in the bound become sharp.
The rate of convergence of our bound might be improved in various
ways:
\begin{itemize}
\item Removing the discretization of true errors.
\item Using one-sided bounds.
\item Using nonuniform union bounds over discrete values of the form $\frac{k}{m}$.
\item Tightening the Chernoff bound using direct calculation of
Binomial coefficients.
\item Improving Lemma \ref{lem:helper1}.
\end{itemize}
The above ideas may allow one to remove
all $\ln(m)$ terms from the statement of the bound.
\section{Conclusion}
Traditional PAC bounds are stated in terms of the training error
and class size or VC dimension.
The computable bound given here is typically much tighter because it exploits
the additional information in the histogram of training errors.
The uncomputable bound uses the additional (unavailable) information
in the distribution of true errors. Any distribution of true errors
can be realized in a case with independent hypotheses. We have shown
that in such cases this uncomputable bound is asymptotically equal to the actual
generalization error. Hence this is the tightest possible bound,
up to asymptotic equality, over all bounds expressed as functions of
$\hate(h^*)$ and the distribution of true errors. We have also shown
that the use of the histogram of empirical errors results in a bound
that, while still tighter than traditional bounds,
is looser than the uncomputable bound even in the large sample asymptotic limit.
{\bf Acknowledgments:}
Yoav Freund, Avrim Blum, and Tobias Scheffer all provided useful
discussion in forming this paper.
\begin{thebibliography}{1}
\bibitem{HKST96}David Haussler, Michael Kearns, H. Sebastian Seung,
and Naftali Tishby, ``Rigorous learning curve bounds from statistical
mechanics'', Machine Learning 25, 195-236, 1996.
\bibitem{TnC}Thomas M. Cover and Joy A. Thomas, ``Elements of Information Theory'', Wiley, 1991.
\bibitem{Ch52}H. Chernoff, ``A measure of asymptotic efficiency for tests of a hypothesis based on
the sum of observations'', Annals of Mathematical Statistics, 23:493-507, 1952.
\bibitem{Tobias}Tobias Scheffer and Thorsten Joachims, ``Expected error analysis for model
selection'', International Conference on Machine Learning (ICML), 1999.
\bibitem{Freund} Yoav Freund, ``Self bounding algorithms'', COLT, 1998.
\bibitem{PACBayes2}David McAllester, ``Pac-Bayesian model averaging'', COLT, 1999.
\bibitem{Microchoice}John Langford and Avrim Blum, ``Microchoice and self-bounding algorithms'', COLT,
1999.
\bibitem{Mc1}Yishay Mansour and David McAllester, ``Generalization bounds for decision trees'', COLT, 2000.
\bibitem{Mc2}David McAllester and Robert Schapire, ``On the convergence rate of Good-Turing estimators'', COLT, 2000.
\end{thebibliography}
\end{document}