%
% Things to do:
% DONE: 1. cite Mike's paper.
% DONE: 2. Say what WE mean by CV
% (OKAY) DONE: 3. Make algorithm clear 
% DONE: 4. Stick in new graph
% DONE: 5. Discussion Section: Why should not relate to training data in obvious way.
% DONE: 6. Intro: expand on notion of "no complexity"
% DONE:	     - hypotheses not organized into hierarchy.
% DONE:	     - individual hypotheses drawn iid; not some trained on 
% DONE:			data using a more complex hypothesis class.
%        Then point to discussion section, to redirect objections.

%\documentstyle[ml97,named]{article} 
\documentstyle[ml97,psfig]{article}

\newlength{\linelength}
\setlength{\linelength}{0.35\textwidth}

\def\Proof {{\bf Proof: \enspace}}
\def\argmin {{\rm argmin}}
\def\hb {\hfil\break}
\def\entropy {{\cal H}}
\def\implies {\Rightarrow}
%\def\ehat {\hat\epsilon}
\def\ehat {\hat\varepsilon}
\def\nhat {\hat{n}}
\def\ehatbar{\overline{\hat\varepsilon}}
\def\iid {{\it i.i.d.}}
\def\Pr {{\rm Pr}}
\def\E {{\rm E}}
\def\nopt {{n_{\it opt}}}
\def\nopthat {\widehat{n_{\it opt}}}
\def\testeq {\stackrel{?}{=}}
\def\hstar {h^{\ast}}
\def\kopt {{k_{\it opt}}}
\def\kopthat {\widehat{\kopt}}

\newcounter{my-figures-ctr}


\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{observation}{Interesting Observation}
 
\title{Preventing ``Overfitting'' of Cross-Validation Data}
 
\author{ {\bf Andrew Y.~Ng} \\
School of Computer Science \\
Carnegie Mellon University \\
Pittsburgh PA 15213 \\
Andrew.Ng@cs.cmu.edu \\
} 
 
\begin{document} 
 
\maketitle 
 
\begin{abstract}

Suppose that, for a learning task, we have to select one hypothesis
out of a set of hypotheses (that may, for example, have been generated 
by multiple applications of a randomized learning algorithm). A common 
approach is to evaluate each hypothesis in the set on some previously 
unseen cross-validation data, and then to select the hypothesis that had 
the lowest cross-validation error. But when the cross-validation data
is partially corrupted, for example by noise, and the set of hypotheses 
we are selecting from is large, ``folklore'' warns about 
``overfitting'' the cross-validation data \cite{klockars86:mc,tukey49:cimitaov,tukey53:tpomc}.
In this paper, we explain how this ``overfitting'' really occurs, and 
show the surprising result that it can be overcome by selecting a hypothesis 
with a {\em higher} cross-validation error, over others with lower 
cross-validation errors. We give reasons for not selecting the hypothesis 
with the lowest cross-validation error, and propose a new algorithm, 
LOOCVCV, that uses a computationally efficient form of leave--one--out 
cross-validation to select such a hypothesis. Finally, we present experimental 
results for one domain that show LOOCVCV consistently beating the strategy 
of picking the hypothesis with the lowest cross-validation error, even when 
using reasonably large cross-validation sets.

\end{abstract}

% Sample LaTeX file for creating a paper in the Morgan Kaufmannn two 
% column, 8 1/2 by 11 inch proceedings format. 

%\newpage
%\thispagestyle{empty}
%\ \ 
%\newpage
%\setcounter{page}{1}

\section{Introduction}
		\label{section-intro}	

A basic problem in learning is that of selecting a hypothesis 
out of a set of hypotheses -- that is, determining which in 
a given set of hypotheses is ``best.'' 
When given such a set of hypotheses (generated, for example,
by multiple applications of a randomized learning 
algorithm to a set of training data, or equivalently, by applying a 
deterministic algorithm multiple times but with randomly 
re-chosen parameters each time),
a common approach is to test all of them on some set 
of previously unseen cross-validation (CV) data, and then to 
pick the hypothesis that had the smallest CV error 
\cite{mosteller68:dais,stone74:cvcaaosp,stone78:afaacv}.
For example, we may generate a set of hypotheses by training
multiple Neural Networks 
with Backpropagation on a fixed set of training data, with each 
Neural Network having different initial weights, and then pick
the Neural Network that has the lowest error on a set of 
hold-out CV data. But if the CV data is partially corrupted 
by noise, then ``folklore'' also warns that if we used too small a 
set of CV data to test too large a number of hypotheses, we may
end up picking a poor hypothesis that had fit the corrupted CV 
data well ``just by chance'' 
\cite{klockars86:mc,tukey49:cimitaov,tukey53:tpomc}.
%\footnote{There was some debate on whether this should be called
%``overfitting'' since there is no notion of complexity. However,
%some other authors have used this terminology in similar context,
%and we will continue to do so, in the understanding our paper will
%continue to make clear what we are mean by ``overfitting.''}

In this paper, we examine this problem of ``overfitting'' of 
CV data. This is an unusual form of overfitting because, unlike
overfitting by single applications of learning algorithms such 
as Decision Trees and Neural Networks, 
{\em this overfitting comes from 
having tested too many hypotheses, rather than from 
having chosen too complex a hypothesis.} 
For example, within the
set of hypotheses we are selecting from, there is no notion 
of a structure or a sequence of nested hypothesis classes 
of increasing complexity such as assumed in some 
models \cite{kearns96:aboteocv,vapnik71:otucorfoettp}, 
or of some hypotheses having been trained using a more complex 
hypothesis class. Because of this, even 
though the literature treating overfitting is 
rich with algorithms for selecting/limiting the {\em complexity} of the 
output hypothesis, this literature does not generalize 
to this other important problem of ``overfitting'' 
through having tested too {\em many} hypotheses with too small a CV set. 
(The mentioned literature is too wide to survey here, but for 
a sample, see 
\cite{baum89:wsngvg,judd90:nndatcol,kearns95:aeatcomsm,miller90:ssir,rissanen78:mbsdd}.)
And, while there are interesting similarities between 
``too many'' and ``too complex,'' the relationship between 
them is not obvious, and we defer a further discussion of them to 
Section~\ref{section-discussion}, when we will have developed a 
notation for addressing these issues.
%->something to point to later part of the paper, to overcome 
%-> objections? 
%We will briefly discuss this issue further in the paper.

%quinlan89:idtutmdlp,
%vapnik71:otucorfoettp}.
%Also, for literature discussing an alternative way of using a 
%set of hypotheses, see also \cite{littlestone91:twma,littlestone94:twma}.

Our contributions are two-fold: First, we explain how this 
overfitting really occurs, and show the surprising result that 
we can do better than picking the hypothesis with the lowest 
CV error. Note that this is a much {\em stronger} result than merely 
that the hypothesis with the lowest CV error is unlikely to be 
the hypothesis with the lowest generalization error; we are claiming 
that it is sometimes possible to actually {\em find} a hypothesis 
that we can expect to have lower generalization error than the 
minimum-CV-error hypothesis. And second, we will present an algorithm, 
{\em Leave-one-out Cross-Validated Cross-Validation} (LOOCVCV), 
to choose such a hypothesis, and show experimentally that 
LOOCVCV does consistently beat picking the minimum--CV--error 
hypothesis in one domain. 

\section{Definitions}
		\label{section-definitions}

Henceforth, for simplicity, let us assume a boolean target function $f$.
Let $H=\{h_i\}_{i=1}^{n}$ be the set of $n$ hypotheses that 
we are choosing from, and let us also assume that each hypothesis was 
drawn \iid\/ from some fixed distribution of hypotheses $D_H$ (possibly
a distribution induced by application of a randomized learning algorithm). 
We will also consider the case where we are given access to an oracle
that samples $h \in D_H$, but will not distinguish between this and
the case where we do not have access to such an oracle, because the
two cases are so similar. For example, when $n$ is large, statements 
pertaining to when we have access to an oracle can naturally be 
approximately translated to when we do not, by ``constructing'' our 
own oracle that samples, possibly with replacement, from $H$.
Continuing, let $S = \{(x_i,y_i)\}_{i=1}^{m}$
be the set of $m$ samples forming the CV data, where 
each $x_i$ is an arbitrary-dimensioned input 
vector drawn \iid\/ from some distribution $D$, and $y_i \in \{0,1\}$,
the labels of the data, have independently been corrupted 
with some fixed but unknown {\em noise rate} $\eta$, so
that $y_i = f(x_i)$ with probability $1-\eta$, and $y_i = \neg f(x_i)$ with
probability $\eta$. We shall also write $p_i = (x_i, y_i)$, so that 
$S = \{p_i\}_{i=1}^{m}$, and let us also define 
$S_{-i} = S \setminus \{p_i\}$.
Furthermore, we shall say that hypothesis $h_i$ 
{\em apparently misclassifies} some element of the CV set 
$(x_j,y_j)$ if $h_i(x_j) \neq y_j$, and that it 
{\em truly misclassifies} it if $h_i(x_j) \neq f(x_j)$. 
Let the CV error $\ehat_S(h_i) \in [0,1]$ be the fraction of the CV set $S$
that hypothesis $h_i$ apparently 
misclassifies, 
and let the generalization error (with respect to
uncorrupted data) of each hypothesis $h_i$ 
be $\varepsilon_{D,f}(h_i) = {\rm Pr}_{x\in D}[h_i(x) \neq f(x)]$,
following the PAC framework 
%of \cite{valiant84:atotl},
(although we will, for notational brevity, often drop
the subscripts in $\ehat_S$ and $\varepsilon_{D,f}$).
Finally, the hypothesis $h_i$ in a set $H$ of size $n$ 
that has the lowest CV error $\ehat_S(h_i)$
will be called the ``best--of--$n$ hypothesis.'' 

Note that by CV, we mean what is sometimes called ``simple cross-validation,''
where there is a fixed cross-validation set, rather than $m$-fold or 
leave-one-out cross-validation. And in our notation, the goal of hypothesis 
selection, informally, is then to choose a hypothesis $h$ such 
that $\varepsilon_{D,f}(h)$ is expected to be ``small,'' and the 
conventional approach is choosing the best--of--$n$ hypothesis.
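As a concrete illustration of these definitions, the CV error $\ehat_S(h)$ and best--of--$n$ selection can be sketched as follows (a minimal Python sketch; the function names and toy data are ours, not the paper's):

```python
def cv_error(h, S):
    """ehat_S(h): the fraction of the CV set S that h apparently misclassifies."""
    return sum(1 for (x, y) in S if h(x) != y) / len(S)

def best_of_n(H, S):
    """Conventional selection: the hypothesis in H with the lowest CV error."""
    return min(H, key=lambda h: cv_error(h, S))

# Toy boolean target f(x) = [x > 0]; the last CV label is corrupted by noise.
S = [(-2, 0), (-1, 0), (1, 1), (2, 1), (3, 0)]
h1 = lambda x: int(x > 0)   # truly misclassifies nothing
h2 = lambda x: int(x > 2)
print(cv_error(h1, S), cv_error(h2, S))  # 0.2 0.6
```

Here $h_1$ apparently misclassifies only the corrupted point, so its CV error of $0.2$ overstates its generalization error of $0$.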

%Then, informally, the goal of hypothesis selection is, given 
%$S$ and either $D_H$ or $H$, to choose a 
%hypothesis $h$ such that $\varepsilon_{D,f}(h)$ is expected to be ``small,''
%and the conventional approach is choosing the 
%best--of--$n$ hypothesis.

\section{Overfitting by a Simulated Learning Algorithm}
		\label{section-simulated-algorithm}

As mentioned, when we choose the hypothesis $h$ that has the 
lowest $\ehat(h)$, we may be choosing a poor hypothesis that 
had fit the CV data well ``just by chance.'' 
For example, if $\ehat(h) = 0$ so that it apparently classifies 
the CV data perfectly, but some of the CV data's 
labels had been corrupted, then the hypothesis must 
actually be {\em truly misclassifying} the corrupted 
samples in the CV data, and therefore have non-zero 
(and possibly large) generalization error. 

\begin{figure*}[t]
%\vspace*{-0.2in}
%\hskip -0.55in 
\parbox{3.25in}
{
\hskip 0.275in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.17.ps,width=2.70in,height=1.7in,angle=270}
%  \caption{posterior distribution for $\varepsilon(h) | \ehat(h) = 0.17$.
%	    $\E[\varepsilon(h) | \ehat(h) = 0.17] \approx 0.0648$.
%           See Section~\protect\ref{section-simulated-algorithm} for details.}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior17}
%\end{figure} 
  \centerline{\parbox{3.01in} {Figure~\ref{figure-posterior17}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.17$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.17] \approx 0.0648$.}}
%           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}
\parbox{0.37in}
{ \hbox to 0.18in {} }
\parbox{3.25in}
{
\hskip 0.263in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.20.ps,width=2.70in,height=1.7in,angle=270}
%  \caption{posterior distribution for $\varepsilon(h) | \ehat(h) = 0.20$.
%	    $\E[\varepsilon(h) | \ehat(h) = 0.20] \approx 0.0252$.
%           See Section~\protect\ref{section-simulated-algorithm} for details.}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior20}
%\end{figure}
  \centerline{\parbox{3.01in} {Figure~\ref{figure-posterior20}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.20$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.20] \approx 0.0252$.}}
%           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}
%\end{figure*}

%\begin{figure*}
%\hskip -0.55in
\parbox{3.25in}
{
\hskip 0.275in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.23.ps,width=2.70in,height=1.7in,angle=270}
%  \caption{posterior distribution for $\varepsilon(h) | \ehat(h) = 0.23$.
%	    $\E[\varepsilon(h) | \ehat(h) = 0.23] \approx 0.0727$.
%           See Section~\protect\ref{section-simulated-algorithm} for details.}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior23}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-posterior23}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.23$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.23] \approx 0.0727$.}}
%           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}
\parbox{0.37in}
{ \hbox to 0.18in {} }
\parbox{3.25in}
{
\hskip 0.263in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/num60000.com.ps,width=2.70in,height=1.7in,angle=270}
%  \psfig{figure=/afs/andrew/usr/an2i/cv/new_com.ps,width=2.70in,height=1.7in,angle=270}
% \caption{$n$ varies along the $x$-axis, and the top-most curve
%           the expected generalization error if we draw $n$ hypotheses 
%           and pick the one with the lowest CV error. The middle,
%           step-like line is the expected error if we picked the 
%           $k$\/-th percentile hypotheses, using $k = 100 (1- (1/n))$.
%	    The bottom line is the generalization error of LOOCVCV achieves.}
\refstepcounter{my-figures-ctr}
  \label{figure-simulated-results}
%\end{figure}
 \centerline{\parbox{3.01in} 
	{Figure~\ref{figure-simulated-results}: $n$ on the $x$-axis, generalization error on the $y$-axis. 
	   The top curve is best--of--$n$; the middle, stepped line is the
		$k$\/-th percentile hypothesis, with $k = 100 (1- (1/(n+1)))$;
	    the bottom line is LOOCVCV.}}
}
\end{figure*}

\subsection{A Simulated Learning Algorithm}
	\label{subsection-simulated-algorithm}

To examine the overfitting effect, let us first look at a 
particular model of a learning algorithm and learning task:

\begin{itemize}
  \item We have a black box that draws successive hypotheses $h_i$ \iid\ 
	from $D_H$ (say, by applying a randomized algorithm to
	a set of training data), such that the random variable
	$\varepsilon_{D,f}(h_i)$ (induced by drawing $h_i \in D_H$), 
	is distributed $\varepsilon_{D,f}(h_i) \sim {\rm Uniform}(0,1)$.
  \item We fix $S$ to have $m=100$ CV samples, and assume that our 
	noise process happened to corrupt exactly 20 of this 
	set's labels. (Note that, in this example, it is only to make 
	our exposition cleaner that we have ``fixed'' the noise,
	which should technically be random.)
  \item Each hypothesis $h_i$ has an {\em independent}\/ $\varepsilon(h_i)$
		chance of truly misclassifying each of the $m$ CV samples,
		where the independence is across both hypotheses and CV
		samples.
\end{itemize}

So, to simulate testing a hypothesis with a given generalization error
$\varepsilon(h_i)$, we would flip a 0--1 coin with bias 
$\varepsilon(h_i)$ 100 times, negate the result of (say) the 
last 20 flips, and calculate the fraction of resulting
`1's to get the CV error $\ehat(h_i)$. Note also that of the assumptions 
above, the one of uncorrelated errors across hypotheses is probably the 
most unrealistic, but we include it to have a simple and analyzable 
example of a learning algorithm.
%And before proceeding, we also wish to note that we also found 
%empirically that our results are only very weakly dependent on the 
%that $\varepsilon(h_i)$ is distributed ${\rm Uniform}(0,1)$. 
The goal, then, is to generate some number of hypotheses, test 
them on the $100$-sample CV data, and then to somehow pick a 
hypothesis $h$ with ``small'' expected generalization 
error.
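The simulated test just described can be transcribed directly (our own sketch of the model's assumptions; the constants 100 and 20 are from the text):

```python
import random

M, CORRUPTED = 100, 20   # CV set size; exactly 20 labels corrupted

def simulated_cv_error(eps, rng):
    """Simulate testing one hypothesis with generalization error eps:
    flip a coin with bias eps once per CV sample (true misclassifications),
    then negate the outcome on the 20 corrupted samples (apparent errors)."""
    true_miss = [rng.random() < eps for _ in range(M)]
    apparent = [(not t) if i < CORRUPTED else t for i, t in enumerate(true_miss)]
    return sum(apparent) / M

rng = random.Random(0)
eps = rng.random()                    # eps(h) ~ Uniform(0,1)
ehat = simulated_cv_error(eps, rng)   # one draw of ehat(h) given eps(h)
```

Note that a hypothesis with $\varepsilon(h) = 0$ always scores $\ehat(h) = 0.20$ here, since it ``misses'' exactly the 20 corrupted labels; this fact is used in the next subsection.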

\subsection{Why Picking the ``Best'' is Bad}
	\label{subsection-why-best-bad}

%In this section, we give a reason for why picking the
%``apparent best'' hypothesis may not be optimal when the CV
%data's labels have been corrupted by noise. Succinctly, it is
%the surprising result that lower CV error does not necessarily 
%imply lower expected generalization error (where the random variable
%that the expectation is over hypotheses drawn from $D_H$).

%Succinctly, is is that
%even though lower $\ehat(h) \not\implies$ 
%lower ${\rm E}[\varepsilon(h)|\ehat(h)]$, because of differing 
%{\em variances} of $\ehat(h)|\varepsilon(h)$'s.
%
%  \item If errors between the hypotheses are uncorrelated, then some
%		``mean'' hypothesis, that somehow agrees with most of the 
%		other hypotheses, can be expected to do well. And at least 
%		on one metric (to be explained), the best--of--$n$ 
%		hypothesis is maximally different from doing this.

%\subsubsection{Lower Test Error \protect$\not\implies$ 
%				Lower Generalization Error}
%	\label{section-lower-not-imply}

In this section, we will explain why picking the ``apparent best'' 
hypothesis may not be optimal when the CV data's labels 
have been corrupted by noise, and this will lead to results that will 
be the cornerstone of the rest of this work. While, for the purpose
of clarity in exposition, we will continue to assume that exactly 
$20$ CV data labels had been corrupted, it is worth keeping in 
mind that our main results will readily apply to the same problem
using any other fixed CV sample $S$ with some corrupted labels, 
whatever the level of label corruption may be. Naturally, the 
algorithms we propose later in this paper will also be applicable 
without requiring knowledge of the level of label corruption 
in $S$.

To motivate the rest of this discussion, let us first consider 
a simpler distribution of hypotheses than the one described above, 
which clearly shows why one might not want to 
pick the ``apparent best'' hypothesis.
Let us continue to assume that exactly $20$ of the $100$ CV data 
labels were corrupted, but that $D_H$ is now such that each hypothesis
$h \in D_H$ has either $\varepsilon(h) = 0.00$ or $\varepsilon(h) = 0.10$.
Now, consider the particular case of choosing between two 
hypotheses $h_1$ and $h_2$, where $\ehat(h_1) = 0.17$, and 
$\ehat(h_2) = 0.20$. Suppose we chose $h_1$. Then, we know that 
we must necessarily have chosen a hypothesis with 
$\varepsilon(h) = 0.10$, because if $\varepsilon(h) = 0.00$, 
then $h$ would have classified all of 
the CV sample correctly with respect to the true target function, 
and would therefore have had $\ehat(h) = 0.20$ because $20$ elements of 
the CV sample were corrupted. 
On the other hand, if we chose $h_2$ with $\ehat(h_2) = 0.20$, then 
we have a non-zero chance of picking a hypothesis with 
$\varepsilon(h) = 0.00$. Hence, 
$E_{h \in D_H}[\varepsilon(h)|\ehat(h)=0.20] < E_{h \in D_H}[\varepsilon(h)|\ehat(h)=0.17]$
for this case. 

The previous example might have seemed slightly contrived, but a more general 
explanation for the hypothesis with lower $\ehat(h)$ not having 
lower expected generalization error 
%$E[\varepsilon(h)|\ehat(h)]$ 
lies in the 
differing {\em variances} of $\ehat(h)$. Given 
$S$, $D$, $D_H$ and $f$, then $\ehat_S(h)$, 
as a random variable dependent on $\varepsilon(h)$ 
(which is in turn induced by drawing $h \in D_H$), is such that  
$\E_{h \in D_H}[\ehat_S(h)|\,\varepsilon(h)]$ monotonically increases with 
$\varepsilon(h)$ as expected; but, its 
variance ${\rm Var}_{h \in D_H}(\ehat(h) |\, \varepsilon(h))$
also increases as a function of $\varepsilon(h)$, at least
for $\varepsilon(h) \in [0,0.5]$. For example, if $\varepsilon(h) = 0$, 
then $\ehat(h) = 0.20$ always, so that
${\rm Var}_{h \in D_H}(\ehat(h)|\,\varepsilon(h)=0) = 0$; 
but ${\rm Var}_{h \in D_H}(\ehat(h)|\,\varepsilon(h)=0.5) = 0.25/m = 0.0025 > 0$.
So, despite hypotheses with lower $\varepsilon(h)$'s having 
lower ${\rm E}_{h \in D_H}[\ehat(h)|\varepsilon(h)]$'s, it is possible for them
to have a {\em smaller} probability of having ``extremely low'' 
$\ehat(h)$'s ``by chance,'' because their distributions of $\ehat(h)$ 
are less spread-out (lower variance). In words, {\em hypotheses with lower
generalization errors do have lower expected CV errors, but may have
a smaller chance of having ``extremely low'' CV errors because their
distributions of CV errors are less spread--out than those of hypotheses
with higher generalization errors (and higher expected CV errors).}
And for certain ``priors'' such as the simplified $D_H$ above, this leads 
to the effect of ${\rm low}\: \ehatbar \not\implies {\rm low}\: 
\E_{h \in D_H}[\varepsilon(h)|\,\ehat(h)=\ehatbar]$, which 
causes the overfitting phenomenon. 
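The monotone-variance claim can be verified exactly under the model of Section~\ref{subsection-simulated-algorithm}, by convolving the two binomials that make up $\ehat(h)$: 80 clean labels missed with probability $\varepsilon$, and 20 corrupted labels apparently missed with probability $1-\varepsilon$ (a sketch; the function names are ours):

```python
from math import comb

def binom_pmf(n, p, x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def ehat_pmf(eps, m=100, corrupted=20):
    """Exact distribution of m * ehat(h), given eps(h) = eps."""
    clean = m - corrupted
    pmf = [0.0] * (m + 1)
    for i in range(clean + 1):            # true misses among clean labels
        for j in range(corrupted + 1):    # apparent misses among corrupted ones
            pmf[i + j] += binom_pmf(clean, eps, i) * binom_pmf(corrupted, 1 - eps, j)
    return pmf

def var_ehat(eps, m=100):
    pmf = ehat_pmf(eps, m)
    mean = sum(p * k / m for k, p in enumerate(pmf))
    return sum(p * (k / m - mean) ** 2 for k, p in enumerate(pmf))

print(var_ehat(0.0))  # 0.0: ehat is always exactly 0.20
print(var_ehat(0.5))  # analytically eps*(1-eps)/m = 0.0025
```

Since each CV sample is an independent indicator with variance $\varepsilon(1-\varepsilon)$ under this model, the variance is $\varepsilon(1-\varepsilon)/m$, increasing on $[0, 0.5]$ as claimed.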

Finally, let us return to the model of 
Section~\ref{subsection-simulated-algorithm}. Recall that 
exactly $20$ of the $100$ CV data labels were corrupted. Then, using
our ``prior'' for $h \in D_H$ that is such that 
$\varepsilon_{D,f}(h) \sim {\rm Uniform}(0,1)$, we can calculate,
for each value of $\ehatbar$, the posterior distribution 
$\varepsilon(h) | \ehat(h) = \ehatbar$, which we denote by 
$f_{\varepsilon|\ehat}$:
\begin{eqnarray}
f_{\varepsilon|\ehat}(\varepsilon(h) | \ehat(h)=\ehatbar) &=&
		\frac{f_{\ehat|\varepsilon}(\ehat(h)=\ehatbar|\varepsilon(h)) f_\varepsilon(\varepsilon(h))}
			{f_{\ehat}(\ehat(h)=\ehatbar)}  \nonumber \\
		&=& k f_{\ehat|\varepsilon}(\ehat(h)=\ehatbar|\varepsilon(h)) 
					\nonumber \\
		&=& k \sum_{i=0}^{100\ehatbar} (B(80,\varepsilon(h), i) 
					\nonumber \\
		& & \hbox to 0.15in{} 
					B(20, 1-\varepsilon(h), 100\ehatbar-i))
					\nonumber
\end{eqnarray}
\noindent
where $k$ is a normalizing constant, and 
% that makes the integral of 
%  the probability distribution function equal 1, and
%				\left(\begin{array}{c}80\\i\end{array}\right) 
%			\varepsilon(h)^i (1-\varepsilon(h))^{80-i} \nonumber \\
%&&				\left(\begin{array}{c}20\\100\ehatbar-i\end{array}\right)
%				(1-\varepsilon(h))^{100\ehatbar-i} \nonumber\\
%&&				\varepsilon(h)^{20-(100\ehatbar-i)} \nonumber
%
% Original displaymath-style definition of B(.,.,.)
%\begin{displaymath}
%	B(n,p,x) = 
%		\left(\begin{array}{c}n\\x\end{array}\right) p^x (1-p)^{n-x} 
%		\nonumber	
%\end{displaymath}
%
$B(n,p,x) = {n \choose x} p^x (1-p)^{n-x}$
%\noindent
is the probability of a ${\rm Binomial}(n,p)$ distribution returning $x$
(so that $i$ in the summation is the number of apparent misclassifications
that are also true misclassifications). 
These posterior distributions 
of $\varepsilon(h)$ for $\ehatbar = 0.17$, $0.20$,\/ and $0.23$ 
%were evaluated numerically, and 
are shown in 
Figures~\ref{figure-posterior17}\/--\/\ref{figure-posterior23}.
From the figures, we see that if we picked a hypothesis 
with $\ehat(h)=0.17$ or $\ehat(h)=0.23$, we can
expect the hypothesis to have higher generalization error than one 
with $\ehat(h)=0.20$. It was, of course, expected that picking 
a hypothesis with CV error $0.20$ beat picking one with $0.23$, 
but it was surprising that picking a hypothesis with a {\em higher} error 
on the CV data ($\ehat(h)=0.20$ rather than $0.17$) could, on average,
result in a hypothesis with {\em lower} generalization error.

This was unexpected because we expect hypotheses 
with lower generalization error to have lower CV error.  
The point is that {\em the reverse is not necessarily true,}\/ and 
hypotheses with lower CV error do not necessarily have lower
expected generalization error. 
That is, given fixed $S$, $D_H$, and $f$ for this problem,
    ${\rm low}\ \overline{\varepsilon} \implies 
	{\rm low}\ \E_{h\in D_H}[\ehat(h)|\,\varepsilon(h) = \overline{\varepsilon}]$, 
but ${\rm low}\ \overline{\ehat} \not\implies 
	{\rm low}\ \E_{h\in D_H}[\varepsilon(h)|\,\ehat(h) = \overline{\ehat}]$.
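The posterior means shown in Figures~\ref{figure-posterior17}\/--\/\ref{figure-posterior23} can be reproduced numerically from the formula above, integrating against the Uniform(0,1) prior over a grid of $\varepsilon$ values (a sketch; the grid resolution and the function names are our choices, and the normalizing constant cancels in the ratio):

```python
from math import comb

def binom_pmf(n, p, x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def posterior_mean(ehat_bar, m=100, corrupted=20, grid=1001):
    """E[eps(h) | ehat(h) = ehat_bar] under the Uniform(0,1) prior,
    by Riemann-sum integration over eps."""
    r = round(m * ehat_bar)   # number of apparent misclassifications
    clean = m - corrupted
    num = den = 0.0
    for g in range(grid):
        eps = g / (grid - 1)
        # i = apparent misclassifications that are also true misclassifications
        lik = sum(binom_pmf(clean, eps, i) * binom_pmf(corrupted, 1 - eps, r - i)
                  for i in range(max(0, r - corrupted), min(r, clean) + 1))
        num += eps * lik
        den += lik
    return num / den

for e in (0.17, 0.20, 0.23):
    print(e, posterior_mean(e))
```

Consistent with the figures, the posterior mean for $\ehat(h) = 0.20$ comes out lowest of the three.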

% commented out: 9/14/96
%\subsubsection{Agreement between hypotheses}
%
%Suppose that we were told that the noise rate $\eta$ in
%the test data was 0.5, so that we are, from an information
%standpoint, given only an unlabeled test sample
%(i.e. only the $x_i$'s) as the $y_i$'s are contain no information. Then, 
%informally, it seems reasonable to pick some hypothesis that somehow
%``agrees'' with the other hypotheses on the classifications it makes, 
%with the intention that if errors of the hypotheses are uncorrelated,
%then some ``mean'' hypothesis may have lower expected
%generalization error than one chosen randomly from the pool.
%
%However, in picking the best--of--$n$ hypothesis, we are doing exactly
%the opposite of this: We are picking the hypothesis that has the lowest
%test error, and which therefore {\em maximally disagrees with all the other 
%hypotheses on how many labels in the test sample are incorrect}. 
%
%In our experimental results (not reported),
%%in the decision--tree domain
%%we describe in Section~\ref{section-experiment}, 
%we found that, 
%even with moderately high noise rates in the test data, this can 
%led to picking hypotheses whose expected generalization errors are 
%significantly higher than picking a hypothesis at random from the pool.
%%which, with a sufficiently high $\eta$, in turn does worse than picking 
%%the hypothesis with median test error $\ehat(h)$ (which may {\em not} 
%%be a 50\%--error hypothesis).
%
%This also suggests that another natural heuristic to use in hypothesis 
%selection is to pick a hypothesis that agrees with the other hypothesis
%maximally in some sense. However, we consider this issue to be 
%somewhat orthogonal to our problem of hypothesis selection using 
%test data, and shall not explore it further in this paper.


%\section{An Alternative to Choosing the ``Best''} 
\section{Not Choosing the ``Best''} 
	\label{section-alternative}

Because of this ``overfitting'' of CV data, 
``folklore'' suggests generating only a limited number 
of hypotheses from $D_H$ before picking the one 
that has the lowest CV error. The rationale is that
if we generate too few hypotheses, then we might not generate 
any good hypotheses; whereas if we generate too many, we 
increase the risk of generating poor hypotheses that 
fit the CV data well ``by chance,'' and overfit.  So, 
the optimal number of hypotheses to generate should be between
the two extremes, and we will call this number $\nopt$.
But, such practice seems sub-optimal. For example, 
it is not clear how $\nopt$ can be found. Moreover, if we 
knew the value of $\nopt$, but have already generated more 
than $\nopt$ hypotheses, then it is unclear if we should throw-away 
all but $\nopt$ of them, or if there is a better way to use 
the extra hypotheses. Finally, we would also like an algorithm 
that asymptotically improves as it is given more hypotheses.

The problem with generating ``too many'' hypotheses was that those
with very low CV errors were actually ``overfitting,'' and had high 
generalization error. But because, in any large set of 
hypotheses drawn from the fixed distribution of hypotheses,
the fraction of them that ``overfit'' should be approximately 
the same, we propose the following algorithm, which we
call {\em percentile-cv}:

%[should probably either explain the name (cv), or call it something
%else.]

\begin{enumerate}
   \item Generate as many hypotheses as is computationally feasible.
   \item Test them all on our CV data $S$, and sort them in descending 
		order of $\ehat_S(h_i)$.
   \item For some $k$, pick the hypothesis in the $k$\/-th percentile 
		in the list of sorted 
		hypotheses. (i.e. if we have $n$ hypotheses, then 
		choose the one that was ranked 
		$\left\lceil kn/100 \right\rceil$ in the sorted list.) 
\end{enumerate}

If multiple hypotheses fall into the $k$-th percentile by having the
same CV error, then we pick uniformly at random from among them. Note 
also that we have introduced a new variable, $k$, that we have not 
specified a way to choose yet, but also that we no longer need 
to guess the value of $\nopt$. 
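The three steps above can be transcribed directly (our own sketch; tie-breaking is uniform, as just described):

```python
import random

def percentile_cv(hypotheses, cv_errors, k, rng=random):
    """percentile-cv: sort hypotheses in descending order of CV error and
    pick the one ranked ceil(k*n/100), breaking ties uniformly at random."""
    n = len(hypotheses)
    ranked = sorted(zip(cv_errors, range(n)), reverse=True)
    rank = -(-k * n // 100)              # ceil(k*n/100), a 1-indexed rank
    chosen_err = ranked[rank - 1][0]
    ties = [i for (e, i) in ranked if e == chosen_err]
    return hypotheses[rng.choice(ties)]

hyps = ["h1", "h2", "h3", "h4", "h5"]
errs = [0.50, 0.40, 0.30, 0.20, 0.10]
print(percentile_cv(hyps, errs, k=80))   # rank ceil(4.0) = 4 -> "h4"
```

Note that $k = 100$ recovers the minimum--CV--error choice, while smaller $k$ deliberately passes over the lowest CV errors.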

The motivation behind percentile-cv is that when we sort the hypotheses 
in order of $\ehat$'s
as in Step 2 above, we hope that the hypotheses are 
approximately sorted into order of ``truly bad'' 
(high $\varepsilon$, high $\ehat$), followed by ``good'' 
(low $\varepsilon$, low $\ehat$), followed by ``bad, but 
overfit CV data by chance'' (high $\varepsilon$, very low $\ehat$). 
Then, if we chose $k$ well, we would pick something between 
the ``truly bad'' and ``bad, but overfit CV data by chance,'' 
and choose a good hypothesis. Note, however, that we will be 
{\em choosing a hypothesis with higher CV error over some
others with lower CV errors}.

Also, as the number of hypotheses generated tends 
towards $\infty$, since the hypotheses are drawn \iid\/ from a fixed 
distribution, we expect fixed proportions of them to be ``truly bad,'' 
``good,'' and ``bad, but overfit,'' so we will still be choosing 
a hypothesis that is ``good.'' And as an example, in the domain 
of the simulated algorithm described in 
Section~\ref{section-simulated-algorithm}, we would hope that
$k$ is chosen such that the $k$\/-th percentile hypothesis is one 
that has $\ehat(h)=0.20$.

We now present a result that further justifies using percentile-cv, 
which generates a large set of hypotheses and picks some $k$-th percentile 
hypothesis from it, instead of  best--of--$n$, which generates some 
finite set of $n$ hypotheses, and picks the minimum CV-error 
hypothesis in it.
The following theorem states that, with a well chosen $k$, 
percentile-cv always does at least as well as best--of--$n$
(i.e. $\exists k$ such 
that $k$-th percentile beats best--of--$n$, for all $n$).
%and can therefore be considered a ``superset'' of best--of--$n$:

\begin{theorem}  \label{theorem-percentile-win}
For any $D_H$, $f$, and $S$, 
$\exists k$ such that, $\forall n$, choosing the $k$-th percentile 
hypothesis from a (sufficiently) large set of hypotheses 
drawn \iid\/ from $D_H$, gives an expected generalization error 
no higher than picking the ``apparent best'' of $n$ randomly 
generated hypotheses.
\end{theorem}
{\bf Proof (Sketch):} For any fixed $n$, 
picking the best--of--$n$ is equivalent
to the following in terms of performance: 
(1) Generating a set $H$ of $n$ 
hypotheses, noting down the lowest CV error 
$\ehat_0 = {\rm min}_{h \in H}\; \ehat(h)$, and then (2) generating fresh 
hypotheses until one has the same CV error of $\ehat_0$, and 
outputting that hypothesis, which we shall call $h_{\rm f}$. 
%Now, step (1) chooses $\ehat_0$, and step (2) draws and outputs a 
%hypothesis from the distribution $D_H | \ehat(h) = \ehat_0$. 
Now, step (2) is simply drawing $h_{\rm f}$ from the conditional 
distribution $h \in D_H | \ehat(h)=\ehat_0$; and of the possible 
$\ehat_0$'s that could be chosen, one is ``best'' in terms of giving
the conditional distribution that minimizes 
%%${\rm E}[\varepsilon(h_{\rm f})| \ehat_0]$ during this step.
${\rm E}_{h_{\rm f} \in D_H}[\varepsilon(h_{\rm f})| \ehat(h_{\rm f}) = \ehat_0]$ during this step.
%which draws $h$ from the distribution $D_H | \ehat(h)=\ehat_0$. 
Call this $\ehat_{\rm opt}$.
Then, there exists $k$ such that the $k$-th percentile hypothesis
will have $\ehat(h) = \ehat_{\rm opt}$ with probability 1, meaning
that this $k$-th percentile hypothesis has an expected generalization 
error of 
${\rm E}_{h \in D_H}[\varepsilon(h)| \ehat(h) = \ehat_{\rm opt}]$.
So, for all $n$,
$\exists k$ such that $k$-th percentile (not strictly) beats the apparent
best of $n$. This implies that some optimal $k$ 
must beat {\em all} $n$, which is the statement of the theorem.

\medskip

\noindent
\begin{corollary}  \label{corollary-percentile-strict-win}
If 
${\rm E}_{h_{\rm f} \in D_H}[\varepsilon(h_{\rm f})| \ehat(h_{\rm f}) = \ehat_0]$ 
is not constant over all possible $\ehat_0$, $\exists k$ such 
that $\forall n$, the expected generalization error of the 
$k$-th percentile hypothesis is strictly less than that of 
the best--of--$n$ hypothesis.
\end{corollary}

\section{Choosing a Percentile}
	\label{section-choose-percentile}

Using the conventional hypothesis selection method, we 
had to estimate $\nopt$, the number of hypotheses to generate
before picking the one that had the lowest CV error. 
In this section, we describe 
{\em Leave-one-out Cross-Validated Cross-Validation} (LOOCVCV), 
a new algorithm 
that first finds an estimate $\nhat$ of $\nopt$, then chooses 
$k$ from that, and finally applies percentile-cv using that 
value of $k$. We will first show how to choose $k$ from $\nopt$ (or 
an estimate $\nhat$ of it), and then how to compute the estimate $\nhat$.

\subsection{Choosing $k$ from $\nopt$}
	\label{subsection-k-from-nopt}

For a fixed CV set $S$, since the hypotheses 
$h_i$\/'s are drawn from a fixed distribution, their CV 
errors $\ehat_S(h_i)$ are also drawn from some fixed distribution. 
Hence, if we generate $\nopt$ hypotheses and pick the one 
with the lowest CV error, then to a ``good approximation,'' we 
expect to pick a hypothesis 
whose CV error just falls into the bottom $1/(\nopt+1)$ fraction of all
CV errors drawn from our distribution of $\ehat$'s. 
(The proof of this is from the literature on the statistical 
theory of Extreme Value Distributions.
See \cite{leadbetter80:earporsap}, for example.)
We therefore propose using the following value of $k$\/:

\begin{equation}
  k = 100 \cdot (1- \frac{1}{\nopt+1})   \label{eqn-k-from-nopt}
\end{equation}
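To make this mapping concrete, here is a minimal sketch (with hypothetical function names, and the hypothesis pool represented only by its list of CV errors) of deriving $k$ from an estimate $\nhat$ and then selecting the $k$-th percentile hypothesis:

```python
def percentile_from_n(n_hat):
    """k = 100 * (1 - 1/(n_hat + 1)), as in the equation above."""
    return 100.0 * (1.0 - 1.0 / (n_hat + 1.0))

def kth_percentile_hypothesis(cv_errors, k):
    """Return the index of the k-th percentile hypothesis, where
    k = 100 corresponds to the lowest CV error (the 'apparent best')
    and k = 0 to the highest.  cv_errors[i] is hypothesis i's CV error."""
    order = sorted(range(len(cv_errors)), key=lambda i: cv_errors[i])
    # position measured from the low-error end of the sorted pool
    idx = min(int(len(order) * (100.0 - k) / 100.0), len(order) - 1)
    return order[idx]
```

For example, with $\nhat = 101$ this gives $k \approx 99.0$, so from a pool of 10{,}000 hypotheses we would pick roughly the 99th-best one by CV error, rather than the apparent best.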

%[Might be clearer if we got rid of the constant 100 everywhere,
% and made it fraction-cv or something instead. (?)]

Using notation from the proof of Theorem~\ref{theorem-percentile-win}, 
we can also think of picking the best--of--$n$ hypothesis as a way 
of choosing $\ehat_0$ (and then picking a hypothesis $h$ 
with $\ehat(h) = \ehat_0$). Now, suppose we already have a good 
estimate $\nhat$ of $\nopt$\/; then a natural alternative to 
picking the apparent best of $\nhat$ is to pick the $k$-th percentile 
hypothesis, where $k = 100(1-(1/(\nhat+1)))$. The advantage of 
doing this is that it reduces the {\em variance} of $\ehat_0$, 
and hopefully the ``variance'' of the hypothesis we pick as well. 
Later, we will also give experimental results demonstrating that
this can do significantly better than picking the best--of--$\nhat$.

\subsection{Estimating $\nopt$}

We will only sketch the main ideas of the algorithm here,
and leave the details to the Appendix.
We want to use Leave-one-out Cross-Validation (LOOCV) as follows: 
For each value of $\nhat$, we want a measure of 
the generalization error of the best--of--$\nhat$ 
hypothesis. So, for each sample $p_i \in S$, we shall ``leave-out'' 
$p_i$ and use $S_{-i}$ only,
choosing the hypothesis $\hstar_{i,\nhat}$ that has the minimum
empirical error on $S_{-i}$ among a set of $\nhat$ randomly
generated hypotheses, and then testing $\hstar_{i,\nhat}$ on 
$p_i$ to see if $\hstar_{i,\nhat}(x_i) \testeq y_i$. Then, 
the fraction of values of $i$ for which $\hstar_{i,\nhat}(x_i) \neq y_i$, 
which we call the {\em LOOCV-error}, is a measure of the 
generalization error of the best--of--$\nhat$ hypothesis. By running
this process for different values of $\nhat$, we obtain a measure 
of the LOOCV-error for each, and can pick the 
value of $\nhat$ that minimizes the LOOCV-error.
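The procedure just described can be sketched as follows (a naive Monte Carlo rendering with hypothetical names; `gen_hypothesis` stands in for one run of the randomized learner):

```python
def loocv_error_mc(S, gen_hypothesis, n_hat, trials=50):
    """Monte Carlo estimate of the LOOCV-error of the best-of-n_hat
    rule.  S is a list of (x, y) pairs; gen_hypothesis() returns a
    fresh randomly generated hypothesis (a function x -> prediction)."""
    m = len(S)
    total = 0.0
    for _ in range(trials):
        mistakes = 0
        for i in range(m):
            S_minus_i = S[:i] + S[i + 1:]          # leave p_i out
            hyps = [gen_hypothesis() for _ in range(n_hat)]
            # h*_{i,n_hat}: lowest empirical error on S_{-i}
            h_star = min(hyps,
                         key=lambda h: sum(h(x) != y for x, y in S_minus_i))
            x_i, y_i = S[i]
            mistakes += (h_star(x_i) != y_i)       # test on p_i
        total += mistakes / m
    return total / trials
```

As discussed below, repeating this for many values of $\nhat$ requires generating intractably many hypotheses, which is what motivates LOOCVCV's bootstrap-style closed-form computation.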


%[Incidentally, doing LOOCV directly on the ``desired'' error doesn't 
%work, for not-so-obvious reasons. The main problem of this is
%that the estimate of the error of the hypothesis that got, say, 19 wrong,
%will be completely determined by the number of hypothesis that got 19
%wrong, and the number of hypothesis that got 20 wrong. Unless we do
%something clever, it'll also always pick the hypothesis that had the 
%{\em highest} test error (yes!). Maybe we can do LOOCV on the percentile, 
%though (emphasis on the ``Maybe''), and do something similar to the 
%bootstrap done in LOOCVCV (described below). But, the math doesn't really 
%work out as nicely as it does for LOOCVCV, and I'm not sure that
%we can really to do the ``simulate bootstrap'' thing (also described 
%below).]

The problem with the current formulation of this algorithm 
is that choosing the best--of--$\nhat$ hypothesis is a noisy process, as it
involves generating randomly chosen hypotheses. So, the entire process
has to be repeated many times in a Monte Carlo fashion for each 
value of $\nhat$, to get estimates of the LOOCV-error. This will, 
unfortunately, generally require generating an intractably large 
number of hypotheses. So instead, we will generate only a 
large but finite set $H$ of hypotheses, and sample uniformly 
with replacement from this set whenever we need a new hypothesis
(a procedure closely resembling the non-parametric bootstrap 
from the statistics literature \cite{efron79:bmalatj}).
However, even doing this, we found that we would still require averaging
over an intractable or otherwise very large number of repetitions 
to get a smooth curve for the LOOCV-error as a function of $\nhat$. 
Fortunately, there is a way to calculate 
explicitly, from $\nhat$, $H$, and $S$, the LOOCV-error value we would 
get if we really repeated this bootstrap-like process an infinite 
number of times, and this is what LOOCVCV does (details in the Appendix).

So, for every value of $\nhat$, we now have a tractable way of estimating 
the generalization error of the best--of--$\nhat$ hypothesis,
and we can 
use the value of $\nhat$ that minimizes the LOOCV-error as our 
estimate of $\nopt$. With this, we can use 
Equation~\ref{eqn-k-from-nopt} to derive $k$, our desired 
percentile, which we can then use to apply percentile-cv. This 
completes our formulation of the LOOCVCV algorithm.


\section{Results Using the Simulated Learning Algorithm}
		\label{section-results-simulated-algo}	

We ask the reader to refer to Figure~\ref{figure-simulated-results},
which shows results of different hypothesis selection methods applied to
the Simulated Learning Algorithm 
described in Section~\ref{section-simulated-algorithm}.
The value of $n$ varies along the $x$-axis, and the top-most curve
is the expected generalization error if we draw $n$
hypotheses and pick the one with the lowest CV error. The middle,
stepped, line is the expected generalization error if we picked the 
$k$\/-th percentile hypothesis, with $k = 100 (1- (1/(n+1)))$,
using an ``infinitely'' large set of hypotheses. Finally, the bottom 
horizontal line is the generalization 
error that LOOCVCV achieves, also using an ``infinitely'' large 
set of hypotheses.

%[top curve: Actually generated by bootstrap on 60000 hypotheses. Should
%  be pretty accurate. middle: data from 2,000,000 simulated hypotheses. 
%  Should be very accurate.]

For this particular problem, $\nopt \approx 101$. Notice, however, that 
if we generated exactly $\nopt$ hypotheses and picked the one with
the lowest CV error, our expected generalization error would be 
about 0.035, which is higher than if we picked the 
$k=100(1 - (1/(101+1))) \approx 99.0$\/-th 
percentile hypothesis from an ``infinitely'' large sample of hypotheses,
which gives an expected generalization error of about 0.025.
The reason for this is that with best--of--$n$, the CV error
$\ehat(h)$ of the hypothesis we choose in the end has non-zero variance,
and therefore performs worse than percentile-cv, by our argument 
of Section~\ref{subsection-k-from-nopt}.
% Why did I originally put Section~\ref{section-lower-not-imply}?
That is, 
$\E_{h \in D_H}[\varepsilon(h) | \ehat(h) = \ehatbar]$ 
is minimized when $\ehatbar = 0.20$. However, in the process of
drawing $\nopt$ hypotheses and picking the one with lowest CV error,
we may occasionally pick hypotheses with other values of $\ehat(h)$,
which is why its expected generalization error is higher. On the other
hand, in the ``infinitely'' large sample of hypotheses, 
the $99.0$\/-th percentile hypothesis has $\ehat(h) = 0.20$ 
with probability $1$, which is why picking
the $99.0$\/-th percentile hypothesis results in a lower expected
generalization error than best--of--101: There is less 
variance in the CV error of the hypothesis we pick.

Note too that, for most values of $n$, the 
$k$-th percentile method 
(the middle curve) beats picking the apparent best of $n$ 
(the top curve). We argue that this is due to the same
reason: using $k$-th percentile results in lower variance in the CV error
of the chosen hypothesis. This is further justification for our 
earlier argument that, if we did not want to apply LOOCVCV, but 
already had an estimate $\nhat$ of $\nopt$, then we may be better off
picking the $100(1-(1/(\nhat+1)))$\/-th percentile hypothesis, instead of 
the best--of--$\nhat$ hypothesis.

Finally, we note that when LOOCVCV is given an ``infinitely''
large set of hypotheses for this problem, it will always achieve an 
expected generalization error of about 0.025 which, as is also
suggested by the stepped line in Figure~\ref{figure-simulated-results} 
having a lowest value of 0.025, is provably the best that any 
algorithm using percentile-cv can achieve on this problem. 
The best--of--$\infty$ hypothesis, in contrast, has a perhaps
surprisingly high expected generalization error of about 0.206, which is the
value to which the topmost curve of Figure~\ref{figure-simulated-results} 
eventually asymptotes.

%\newpage
%\vspace*{-.2in}
%\hskip -0.55in
\begin{figure*}[t]
\parbox{3.25in}
{
\hskip 0.275in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xy20.ps0,width=2.70in,height=1.7in,angle=270}
%  \caption{Noise in CV data on $x$-axis, generalization error on $y$-axis.
%	m=20. Vertical dashes show 95\% confidence intervals for means. }
\refstepcounter{my-figures-ctr}
  \label{figure-xy20}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xy20}: Noise rate on $x$-axis, generalization error on $y$-axis.
	$m=20$. Vertical dashes show 95\% confidence intervals for means. }}
}
\parbox{0.37in}  
{ \hbox to 0.18in {} }
\parbox{3.25in}
{
\hskip 0.263in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xy40.ps0,width=2.70in,height=1.7in,angle=270}
%  \caption{Noise in CV data on $x$-axis, generalization error on $y$-axis.
%	m=40. Vertical dashes show 95\% confidence intervals for means. }
\refstepcounter{my-figures-ctr}
  \label{figure-xy40}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xy40}: Noise rate on $x$-axis, generalization error on $y$-axis.
	$m=40$. Vertical dashes show 95\% confidence intervals for means. }}
}
%\end{figure*}

%\vspace*{-1.0in}
%\hskip -0.55in
%\begin{figure*}
\parbox{3.25in}
{
\hskip 0.275in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xytest.ps0,width=2.70in,height=1.7in,angle=270}
%  \caption{m (number of CV-data samples, log scale) on $x$-axis,
%	generalization error on $y$-axis.
%	Uppermost curve is ``apparent best'' of 1000, lower curve is LOOCVCV.}
\refstepcounter{my-figures-ctr}
  \label{figure-xytest}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xytest}: $m$ (number of test-data samples, log scale) on $x$-axis,
	generalization error on $y$-axis.
	Upper curve is best--of--1000, lower curve is LOOCVCV.}}
}
%\hbox to 0.2in {}
\parbox{0.37in}
{ \hbox to 0.18in {} }
\parbox{3.25in}
{
\hskip 0.263in
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xyoptint.ps0,width=2.70in,height=1.7in,angle=270}
%  \caption{Noise rate on $x$-axis, generalization error on $y$-axis.
%	Uppermost curve is best--of--$\nopt$, middle curve is the
%	$k$-th percentile hypothesis, where $k$ was derived from $\nopt$ as 
%	in (\protect\ref{eqn-k-from-nopt}), bottom curve is $k_{\it opt}$-th percentile 
%	hypothesis where $k_{\it opt}$ is the optimal percentile.}
\refstepcounter{my-figures-ctr}
 \label{figure-xyopt}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xyopt}: Noise on $x$-axis, error on $y$-axis.
	Uppermost curve is best--of--$\nopt$, middle curve is 
	$k$-th percentile hypothesis, with $k = 100 (1- (1/(\nopt+1)))$,
	bottom curve is $\kopt$-th percentile hypothesis. }}
}
\end{figure*}

\section{Experimental Results}
	\label{section-experiment}

To examine a slightly more realistic hypothesis selection problem, 
we considered a small learning task involving decision trees. The target
function was a boolean function over binary attributes $x_0$ to 
$x_4$, with $f(x_0,x_1,x_2,x_3,x_4) = (x_0 \wedge x_1 \wedge x_2) \vee 
(x_3 \wedge x_4)$. The input distribution $D$ was uniform over the 
instance space of size 32, 
and $n = 1000$ hypotheses were generated to form the set $H$.
Finally, $D_H$ was such that each hypothesis was generated by drawing 
$20$ samples from $D$ with their correct labels, and running an ID3-like 
decision 
tree algorithm using a greedy information-gain heuristic to choose 
splits \cite{quinlan86:iodt}. (Of course, no practitioner would ever see 
such a distribution of hypotheses; but, we are primarily interested in 
the problem of selecting 
a hypothesis out of a set, and this was a convenient way of 
creating such a set of hypotheses.) 

All experimental results reported in this section are averages of 
400 trials. First, using $m = 20$ CV samples, the results for different 
noise rates $\eta$ in the CV data are
in Figure~\ref{figure-xy20}. We see that for noise rates
of about $\eta=0.1$ and higher, picking the best--of--1000 hypothesis
loses to LOOCVCV. Note too that picking the apparent
best could do even worse than these results suggest,
if we had used more than the 1000 hypotheses we had 
restricted ourselves to for computational reasons.

One might think that this effect would go away with a larger
CV sample. However, it was still significant when the size
of the CV sample was increased to 40 (and note that the size of the 
instance space is 32), keeping everything else the same as before
(Figure~\ref{figure-xy40}). Also, we tried 
varying the CV-sample size, across the larger range of 
5 to 80, using a fixed noise rate $\eta$ of 0.3, and the results 
(Figure~\ref{figure-xytest}) also show LOOCVCV doing better consistently.

Next, to allow us to perform certain calculations exactly such 
as finding $\nopt$, we carried out the next few 
experiments by first drawing 1000 hypotheses from $D_H$, and then sampling 
with replacement from this pool of 1000 each time we needed to generate a
hypothesis, instead of drawing a new one from $D_H$. 
(We also verified that these results are close to those we would have 
obtained drawing directly from $D_H$.)

Recall that in Section~\ref{subsection-k-from-nopt}, we claimed that
if we had a good estimate $\nhat$ to use in picking best--of--$\nhat$,
we might instead wish to use the $k$-th percentile hypothesis,
with $k = 100 (1- (1/(\nhat+1)))$. We see this effect again in our next experiment.
Using $m=20$ CV samples again, the top line of Figure~\ref{figure-xyopt} 
is the expected error if we knew exactly what the best $n$, $\nopt$ 
were and chose the best--of--$\nopt$. The middle curve is the error
of the $k$-th percentile hypothesis, where $k = 100(1-(1/(\nopt+1)))$, and
we see that it consistently beats best--of--$\nopt$. Finally, the 
lowermost curve is if we knew the optimal $k$, $k_{\it opt}$, and chose
the $k_{\it opt}$-th percentile hypothesis, which, as predicted by
Theorem~\ref{theorem-percentile-win}, beats best--of--$\nopt$ also.

\section{Discussion}
	\label{section-discussion}

The results we have presented so far generally show LOOCVCV 
beating best--of--$n$, by picking a hypothesis other than the 
one with the smallest CV error. But best--of--$n$, especially with 
large $n$, is like using $k=100$; so if there were insufficient 
data to find an accurate estimate $\kopthat$ of $\kopt$, or if $\kopt$ is 
close to $100$, then we might expect best--of--$n$ to beat 
LOOCVCV, which uses a possibly noisy estimate of $\kopt$. In fact, at 
the noise rate $\eta = 0$, LOOCVCV generally does lose
to best--of--$n$ (Figures~\ref{figure-xy20} and~\ref{figure-xy40}). 
Also, some preliminary results (not reported) involving 
artificial neural networks, 
trained by backpropagation on a fixed set of training data and 
with randomly re-initialized weights each time,
also showed LOOCVCV losing slightly to best--of--$n$.

Leaving considerations of a loss function of $k$ aside, 
we might therefore decide not to use LOOCVCV unless we had 
reason to believe that the ``overfitting'' problem is 
significant in the distribution of hypotheses in our hypothesis 
selection problem, or that LOOCVCV will find a good $\kopthat$ 
(such as when the CV sample is large).  However, if we do 
have an exceedingly large number of hypotheses to choose from, 
and have a reasonably large CV-set, then LOOCVCV may give
much better performance than simply selecting the hypothesis with the 
lowest CV error. 

Before closing, one reformulation of our problem is worth commenting 
on here.
Let us try calling $H$ a ``hypothesis class,'' and $S$ 
``training data.'' 
With this formulation, one might then have thought that for any 
fixed input space, $n$ would have been a good surrogate measure for the 
``complexity'' of $H$, and might have tried to apply conventional 
complexity regularization techniques using, say, $\log n$ as some
``measure'' of the ``complexity'' of $H$. But this turns out to
be unsound as a general approach, because it fails to take into 
account the effect of $D_H$. For example, one can easily produce 
two different $D_H$'s such that the ``expressiveness'' or richness 
of, say, $n=1000$ hypotheses drawn from the first $D_H$ is vastly 
greater than the expressiveness of 1000 hypotheses drawn from the second.
However, with this ``hypothesis class'' and ``training data'' formulation, 
one point of interest is in noting that the arguments of 
this paper still apply exactly as before, and suggest the 
surprising result 
that it might not be optimal to pick the hypothesis from a 
hypothesis class $H$ that has the lowest training error, even 
when there is no notion of complexity within $H$ in that $H$ is not, 
for example, partitioned into a nested sequence of hypothesis 
classes of increasing complexity. This possibility will be a subject 
of our future research.

Finally, these results might have implications for the practitioner
as well. If cross-validation is used to decide how many epochs to 
train a backpropagation neural network (a setting which, while not 
falling within our assumed framework of \iid\ $h$'s, is similar), and 
the practitioner does not somehow invoke the common assumption that 
hypotheses trained with more epochs are more ``complex''; or if 
cross-validation is used to select from a large pool of learning 
algorithms, then our results suggest it may {\em not} be advisable 
to pick the minimum cross-validation-error hypothesis. However, 
how strongly the results of this paper will 
bear out in future practice remains to be seen.

%[mention MDL alternative?]

\section{Conclusions} 

This paper explained how overfitting of CV data occurs despite
there being no notion of {\em complexity} within the set of hypotheses,
and presented percentile-cv and LOOCVCV, which try to correct 
this problem. It also proved that percentile-cv with an appropriately chosen 
percentile will always do at least as well as best--of--$n$, 
and demonstrated experimentally that LOOCVCV consistently beats 
best--of--$n$ in one domain. However, there are also domains where 
overfitting of CV data is not a significant problem, in which 
LOOCVCV loses slightly to best--of--$n$; and the characterization 
of such domains is an open problem. 

%This paper explained how overfitting of CV data occurs in hypothesis
%selection, and showed the surprising result that it can be overcome
%by selecting a hypothesis other than the minimum-CV-error hypothesis. 
%It also presented new algorithms percentile-cv and LOOCVCV, 
%and proved that percentile-cv with an appropriately chosen percentile 
%will always do at least as well as best--of--$n$, and demonstrated 
%experimentally that LOOCVCV consistently beats best--of--$n$ in one 
%domain. However, there are also domains where overfitting of CV 
%data is not a significant problem, in which LOOCVCV loses slightly 
%to best--of--$n$; the characterization of such domains is an open 
%problem.

% START-OBSERVATIONS
%\section{Other Random Stuff} 

%\begin{observation} 
%If $\E[\varepsilon(h)|\,\ehat(h)]$ as a function of $\ehat(h)$ is convex,
%and $\ehatbar$ is the expected $\ehat(h)$ under a certain algorithm (such
%as ``best of $\nopt$ hypotheses''), then we can expect to always do at 
%least as well or better by picking a hypothesis with $\ehat(h) = \ehatbar$
%(if one exists).
%\end{observation}
%\Proof By Jensen's inequality. \hb
%However, there may not be any such hypothesis. (i.e. if 100 test samples, and
%$\ehatbar = 0.203$.) So, maybe we can pick the nearest one (0.20 in this 
%case)? But that won't guarantee always doing better. So, as an alternative to
%LOOCVCV, how about somehow choosing randomly from the two nearest ones (0.20 and 0.21) 
%instead? I think this can be made to always do at least as well/better. 
%Empirically, however, percentile-cv almost always chooses something very 
%similar to this. (That is, given n, if we look at 
%$E[\ehat({\mbox{\rm best of n hypotheses}})]$
%and the $\ehat$ that percentile-cv chooses using $k$ as in Equation~\ref{eqn-k-from-nopt},
%they will almost always be very close.) So, does this serve as another justification for 
%percentile-cv? (Incidentally, percentile-cv does {\em not} always beat 
%``best of $n$'', using $k$ and $n$ related as in 
%Equation~\ref{eqn-k-from-nopt}.)
%\hb
%
%
%\begin{observation} 
%From Figure~\ref{figure-simulated-results}, our loss if we overestimate $\nopt$ is
%much smaller than if we underestimated it by the same amount. 
%\end{observation}
%Maybe if we can prove that it'll always be so, using only simple assumptions.
%Anyway, might be worth mentioning somewhere, since if we aren't sure what $\nopt$ is, it's
%better to ``guess high'' than to ``guess low.'' Also, should add something
%about ``guessing high'' into LOOCVCV?
%
%\begin{observation} 
%Yet another advantage of LOOCVCV: It allows things like ``$\nopt = 3.5$,'' where
%it'll simply figure $k$ out from that, and pick a hypothesis.
%\end{observation}
%(Incidentally, the mathematical formulas used in LOOCVCV have a natural way of 
%allowed $n$ to be treated as a continuous real variable, rather than as an integer.
%So, this is how we might get non-integral $\nopt$ from it.)

%END-OBSERVATIONS

\subsubsection*{Acknowledgements} 

I give warm thanks to Michael Kearns and Andrew Moore for 
many interesting and informative discussions that 
contributed much to this work.
%without with this work would not have been possible. 
%%to Avrim Blum 
%for his input on writing this paper.

\section*{Appendix: The LOOCVCV Algorithm}

%\noindent
%\subsection*{The LOOCVCV Algorithm}
Please refer to Section~\ref{section-choose-percentile}
for the details of the LOOCVCV algorithm. Here, we describe how it 
finds an estimate $\hat{n}$ of $\nopt$ using $H$, from which we can
derive $k$, for picking the $k$-th percentile hypothesis.\hb
%\hb
\def\err {{\rm err}}
\def\ebar {\overline{\rm e}}
\def\empt {\hbox to 0.0in{}}
{\bf Input:} $\hat{n}$, $H[1..n]$, $S$, $i$ 	\hb
{\bf Output:} leave-one-out error with $p_i$ ``left-out,''
		for best--of--$\hat{n}$ hypothesis drawn from $H$	\hb
{\bf Notes:} The key idea is sorting $H$ in ascending order of 
	empirical error on $S_{-i}$ and then calculating, for each $h \in H$, 
	the {\em probability}
	(the first factor in the summand of step 12)
	that it will be the lowest-error hypothesis in a set of 
	$\nhat$ drawn from $H$.
	These probabilities can then be used to find the probability that 
	the selected hypothesis will misclassify $p_i$. $\ebar$
	is used to accommodate multiple hypotheses in the set of
	$\nhat$ having the same lowest error, in which
	case we pick uniformly from them. \hb
%Also, $I$ is the indicator
%	function, so $I(A) = 1$ if $A=$ True, 0 otherwise.\hb
\noindent
1.  function loocv-error($\hat{n}$, $H[1..n]$, $S$, $i$) 	\hb
2.  \ \ \ \ \ create array $H'[1..n].\{\ehat_{-i},e\}$		\hb
3.  \ \ \ \ \ for ($j = 1 {\rm \ to\ } n$) do begin 		\hb
4.  \ \ \ \ \ \ \ \ $H'[j].\ehat_{-i} = 			
% uncomment next 2 lines for TWOCOLUMN 
%\hb
%\empt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 
| \{H[j](x_k) \neq y_k | 
					(x_k, y_k) \in S_{-i}\} |$ \hb
%5.  \ \ \ \ \ \ \ \ \ $H'[j].e = I\{H[j](x_i) \neq y_i\}$ 	\hb
5.  \ \ \ \ \ \ \ \ $H'[j].e = 1$ if $H[j](x_i) \neq y_i$, $0$ otherwise \hb
6.  \ \ \ \ \ \ \ \ end 						\hb
7.  \ \ \ \ \ Sort $H'[]$ in ascending order of $H'[].\ehat_{-i}$ \hb
%8.  \ \ \ \ \ For each value $x$ of $H'[?].\ehat_{-1}$, 		
%% uncomment next 2 lines for TWOCOLUMN 
%\hb
%\empt \ \ \ \ \ \ \ \ \ \ \ 
%let $\ebar(x)$ be the mean of the $H'[j].e$'s 
%% uncomment next 2 lines for TWOCOLUMN 
%\hb
%\empt \ \ \ \ \ \ \ \ \ \ \ 
%that have $H'[j].\ehat_{-1} = x$.		\hb
8.  \ \ \ \ \ for ($x = 0 {\rm \ to\ } |S|-1$) do begin \hb
9.  \ \ \ \ \ \ \ \ $A = \{j | H'[j].\ehat_{-i} = x\}$ \hb
10.     \ \ \ \ \ \ \ let $\ebar(x) = \frac{1}{|A|}\sum_{j\in A} H'[j].e$ if $|A| \neq 0$, \hb
\empt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ undefined otherwise. \hb
11.     \ \ \ \ \ \ \ end \hb
12. return $\sum_{j=1}^{n} 
            \left( 
                (\frac{n-j+1}{n})^{\hat{n}} - (\frac{n-j}{n})^{\hat{n}} 
            \right) 
				\ebar(H'[j].\ehat_{-i})$ \hb
\hb
{\bf Input:} $H$, $S$						\hb
{\bf Output:} Recommended $\nhat$				\hb
13. function loocvcv-estimate-nopt($H$,$S$) 		\hb
14. return  \hb
\empt \ \ \ \ \ \ \ \ \ \ \ \
    ${\rm argmin}_{\nhat \in [1,\infty)} 
		\{ \sum_{i=1}^m \hbox{{\rm loocv-error}}(\nhat, H, S, i)$\} \hb
\hb
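As a check on the pseudocode, here is a direct (hypothetical) Python transcription of loocv-error, with the pool represented by two parallel arrays: the error counts on $S_{-i}$ and the 0/1 outcomes on the left-out sample $p_i$:

```python
def loocv_error(n_hat, cv_errs_minus_i, errs_on_i):
    """Expected error on the left-out sample p_i of the best-of-n_hat
    hypothesis drawn uniformly with replacement from the pool.
    cv_errs_minus_i[j] = hypothesis j's error count on S_{-i};
    errs_on_i[j] = 1 if hypothesis j misclassifies p_i, else 0."""
    n = len(cv_errs_minus_i)
    # e_bar(x): mean of errs_on_i over hypotheses tied at error x (steps 8-11)
    groups = {}
    for c, e in zip(cv_errs_minus_i, errs_on_i):
        groups.setdefault(c, []).append(e)
    e_bar = {c: sum(es) / len(es) for c, es in groups.items()}
    # sort the pool in ascending order of error on S_{-i} (step 7)
    order = sorted(range(n), key=lambda j: cv_errs_minus_i[j])
    total = 0.0
    for rank, j in enumerate(order):  # rank = j - 1 in the pseudocode
        # probability that the minimum of n_hat uniform draws from the
        # pool lands at this rank (step 12); ties are shared via e_bar
        p = ((n - rank) / n) ** n_hat - ((n - rank - 1) / n) ** n_hat
        total += p * e_bar[cv_errs_minus_i[j]]
    return total
```

With $\nhat = 1$ this reduces to the plain average of the 0/1 outcomes, and as $\nhat$ grows it converges to $\ebar$ evaluated at the pool's minimum error, as one would expect from steps 7 and 12.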
%Original TWOCOLUMN version below
%9.  \ \ \ \ \ for ($j = 1 {\rm \ to\ } n$) do			\hb
%10. \ \ \ \ \ \ \ \ \ $H'[j].e \leftarrow \ebar(H'[j].\ehat_{-i})$ \hb
%11. return $\sum_{j=1}^{n} ((n-j+1)^{\hat{n}} - (n-j)^{\hat{n}}) H`[j].e$\hb
%\hb
%{\bf Input:} $H$, $S$						\hb
%{\bf Output:} Recommended $\hat{n}$				\hb
%12. function loocvcv-($H$,$S$) \hb
%13. return 							\hb
%\empt \ \ \ \ \ \ ${\rm argmin}_{\nhat \in [1,\infty)} 
%		\{ \sum_{i=1}^m \hbox{{\rm loocv-error}}(\nhat, H, S, i)$\} \hb


\bibliographystyle{apalike}
\bibliography{cv-bib}

%-----------------------------


\end{document}



% original stuff from ml97.tex (sample formatting info)
 
\begin{abstract} 
The Abstract paragraph should be indented 0.25 inch (1.5 picas) on
both left-and right-hand margins.  Use 10-point type, with a vertical
spacing of 11~points.  {\bf Abstract} must be centered, bold, and in
point size 12. Two line spaces precede the Abstract. The Abstract must
be limited to one paragraph.
\end{abstract} 
 
\section{GENERAL FORMATTING INSTRUCTIONS} 
 
Papers are in 2 columns with the overall line width of 6.75~inches
(41~picas).  Each column is 3.25~inches wide (19.5~picas).  The space
between the columns is .25~inches wide (1.5~picas).  The left margin
is 1~inch (6~picas).  Use 10-point type with a vertical spacing of
11~points.  Times Roman is the preferred typeface throughout.
 
Paper title is 16~point, caps/lc, bold, centered between 2~horizontal
rules.  Top rule is 1-point thick and bottom rule is 1-point thick.
Allow 1/4-inch space above and below title to rules.
 
Authors' names are centered, initial caps.  The lead author's name is
to be listed first (left-most), and the co-authors' names (if
different address) are set to follow.  If only one co-author, center
both the author and co-author, side-by-side.  Include internet address
for each author.
 
Allow one-half line space between paragraphs, with no indent.

Note: if you use \LaTeX\ or troff, you can obtain the appropriate
macros via ftp.  Connect to csftp.vuse.vanderbilt.edu, login as anonymous with
your email account as password, and change directory (cd) to
pub/users/icml97/macros, where you will find the necessary files (ml97.sty and
ml97.tex for \LaTeX, and ml97.trf for troff). You may also obtain
these files via the WWW at
http://cswww.vuse.vanderbilt.edu/\verb+~+icml97/authors/.

 
\section{FIRST-LEVEL HEADINGS} 
 
First-level headings are all caps, flush left, bold and in point size
12.  Allow one line space before the first-level heading and 1/2-line
space after the first-level heading.
 
\subsection{SECOND-LEVEL HEADING} 
 
Second-level headings must be flush left, all caps, bold and in point
size 10.  Allow one line space before the second-level heading and
1/2-line space after the second-level heading.
 
\subsubsection{Third-Level Heading} 
 
Third-level headings must be flush left, initial caps, bold, and in
point size 10.  Allow one line space before the third-level heading
and 1/2-line space after the third-level heading.
 
\subsubsubsection{Fourth-Level Heading}
 
Fourth-level headings must be flush left, initial caps and Roman type.
Allow one line space before the fourth-level heading and 1/2-line
space after the fourth-level heading.
 
\subsection{CITATIONS, FIGURES, REFERENCES} 

 
\subsubsection{Citations in Text} 
 
Citations within the text should include the author's last name and 
year, e.g., (Smith, 1995). Reference style should follow the style 
that you are used to using, as long as the citation style is 
consistent. 
 
\subsubsection{Footnotes} 
 
Indicate footnotes with a superscript number\footnote{Sample of the
first footnote.} in the text. Use 8-point type for footnotes.  Place
the footnotes at the bottom of the page on which they appear.  Precede
the footnote with a 1/2-point horizontal rule 1-inch (6 picas)
long.\footnote{Sample of the second footnote}
 
\subsubsection{Figures}  
 
All artwork must be centered, neat, clean, and legible.  Avoid
screens and pattern-fills. All lines should be very dark, not screened
and at least 1/2 point thickness, for purposes of reproduction. Artwork 
should not be hand-drawn.  Figure number and caption always appear below the
figure.  Leave 2 line spaces between the figure and the caption. The
figure caption is caps/lc and each figure is numbered consecutively.
 
Make sure that the figure caption does not get separated from the
figure. Leave extra white space at the bottom of the page rather than
splitting the figure and figure caption.

If the figure is wide enough to span both columns, place the figure at
the top of bottom of the page (not in the middle).  Photos should {\em
not} be affixed to the page.  Leave adequate space for the figure and
provide the photo separately in an envelope clipped to the page on
which it should be placed.  Be sure it is labeled (in pencil on the
back) with author's name and figure number.

\begin{figure}[h] 
\begin{center}
\begin{tabular}{ccc}
\fbox{\begin{tabular}{c}Source\\Problem\end{tabular}}&
$\rightarrow$&
\fbox{\begin{tabular}{c}Source\\Solution\end{tabular}}\\
%
$\uparrow$&&$\downarrow$\rule{0pt}{12pt}\\[.5em]
%
\fbox{\begin{tabular}{c}Target\\Problem\end{tabular}}&
$\Rightarrow$&
\fbox{\begin{tabular}{c}Target\\Solution\end{tabular}}
\end{tabular}\rule{0pt}{12pt}
\end{center}
\caption{Sample Figure Caption} 
\end{figure} 
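
For instance, a centered, single-column PostScript figure with its
caption below might be included as follows, using the {\tt psfig}
package loaded in the preamble (the filename and caption are
placeholders); a figure spanning both columns would use the
{\tt figure*} environment at the top of a page instead:

\begin{verbatim}
\begin{figure}[t]
\begin{center}
\mbox{\psfig{figure=results.ps,
             width=\linelength}}
\end{center}
\vspace{2\baselineskip}% two line spaces
\caption{Placeholder Figure Caption}
\end{figure}
\end{verbatim}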
 
\subsubsection{Tables} 
 
All tables must be centered, neat, clean, and legible. Do not use
hand-drawn tables. Table number and title always appear above the
table.  See Table~\ref{sample-table}.
 
Allow one line space before the table title, one line space after the
table title, and one line space after the table. The table title must
be caps/lc and each table numbered consecutively.
 
\begin{table}[htb] 
\caption{Sample Table Title} 
\label{sample-table} 
\begin{center} 
\begin{tabular}{ll} 
\multicolumn{1}{c}{\bf PART}  &\multicolumn{1}{c}{\bf DESCRIPTION} \\ \hline
Dendrite         &Input terminal \rule{0pt}{12pt}\\ 
Axon             &Output terminal \\ 
Soma             &Cell body (contains cell nucleus)
\end{tabular} 
\end{center} 
\end{table} 
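
A table wide enough to span both columns follows the same conventions
but uses the starred environment at the top of a page; the title and
entries below are placeholders:

\begin{verbatim}
\begin{table*}[t]
\caption{Placeholder Wide Table Title}
\begin{center}
\begin{tabular}{lll}
\multicolumn{1}{c}{\bf PART} &
\multicolumn{1}{c}{\bf DESCRIPTION} &
\multicolumn{1}{c}{\bf NOTES} \\ \hline
First  & placeholder & entry
        \rule{0pt}{12pt}\\
Second & placeholder & entry
\end{tabular}
\end{center}
\end{table*}
\end{verbatim}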
 

\subsubsection{Identification}  
 
Make certain that your name is written in pencil on the back of every
page of your masters, and number pages sequentially. This information
is for identification only. Final page numbers will be assigned by the
publisher. If you have preferred wording for the abbreviation of a
long title for the running head, please include this when you send your
paper.
 
\subsubsection*{Acknowledgements} 
 
Use an unnumbered third-level heading for the acknowledgements.  All
acknowledgements go at the end of the paper, just before the
references.
 
 
\subsubsection*{References} 
 
References follow the acknowledgements.  Use an unnumbered third-level 
heading for the references.  APA citation style is preferred, but any
choice of citation style is acceptable as long as you are consistent. 
 
 
Alspector, J., Gupta, B.~\& Allen, R.~B.~(1989). Performance of a 
stochastic learning microchip.  In D. S. Touretzky (ed.), {\it Advances 
in Neural Information Processing Systems 1}, 748-760.  San Francisco, CA: 
Morgan Kaufmann. 
 
Rosenblatt, F.~(1962). {\it Principles of Neurodynamics.} Washington, 
D.C.: Spartan Books. 
 
Tesauro, G.~(1989). Neurogammon wins Computer Olympiad.  {\it Neural 
Computation} {\bf 1}(3):321-323. 
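
If BibTeX is used, an author-year bibliography style such as
{\tt named} (corresponding to the commented-out document-style option
in the preamble) produces entries in roughly this format; the
bibliography file name below is a placeholder:

\begin{verbatim}
\bibliographystyle{named}
\bibliography{mypaper}
\end{verbatim}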
 
\end{document} 

