%\documentstyle[12pt,twocolumn,psfig]{article}
\documentstyle[12pt,psfig]{article}

%\documentstyle[twocolumn,psfig]{article}
%\documentstyle[psfig]{article}

%\psdraft

%\input{colt.bib}

% FORMAT SPECIFICATIONS

% My attempt at fitting 40 lines onto each page. (Use w/12pt) Yucks.
\columnsep=0.25in
\setlength{\textheight}{7.5in}		% ICML97 requirement. DO NOT CHANGE
\setlength{\textwidth}{5.5in}		% ICML97 requirement. DO NOT CHANGE

\setlength{\topmargin}{0.8in}
\setlength{\headsep}{0pt}
\setlength{\headheight}{0pt}
\setlength{\oddsidemargin}{0.5in}
\setlength{\evensidemargin}{0.5in}
%\addtolength{\leftmargin}{-0.6in}
%\addtolength{\leftmargin}{-0.625in}

\begin{document}
%\begin{onecolumn}

\newlength{\linelength}
\setlength{\linelength}{0.35\textwidth}

\pagenumbering{arabic}

\def\Proof {{\bf Proof: \enspace}}
\def\argmin {{\rm argmin}}
\def\hb {\hfil\break}
\def\entropy {{\cal H}}
\def\implies {\Rightarrow}
%\def\ehat {\hat\epsilon}
\def\ehat {\hat\varepsilon}
\def\nhat {\hat{n}}
\def\ehatbar{\overline{\hat\varepsilon}}
\def\iid {{\it i.i.d.}}
\def\Pr {{\rm Pr}}
\def\E {{\rm E}}
\def\nopt {{n_{\it opt}}}
\def\nopthat {\widehat{n_{\it opt}}}
\def\testeq {\stackrel{?}{=}}
\def\hstar {h^{\ast}}
\def\kopt {{k_{\it opt}}}
\def\kopthat {\widehat{\kopt}}


\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{observation}{Interesting Observation}

\title{Preventing ``Overfitting'' of Cross-Validation Data}

\author{}
%\author{
%Andrew Y. Ng 
%%\thanks{
%%\hbox{email: $<$Andrew.Ng@cs.cmu.edu$>$} \hbox to 4.36in{} \hbox to 0.12in{}
%%\hbox to 0.2in{}
%%paper mail: Carnegie Mellon University,
%%	Box 3088, 1000 Morewood Ave, Pittsburgh PA 15213, USA.}
%\\
%\makebox[\linelength]{Carnegie Mellon University}\\
%\makebox[\linelength]{Pittsburgh, Pennsylvania} \\
%%\\
%%\makebox[\linelength]{Andrew.Ng@cs.cmu.edu}
%}

\date{\today}
%\date{September 18, 1996}

\maketitle
\thispagestyle{empty}

%---------------------------------------------------------------------
% Main text starts here. 
% START

\begin{abstract}

Suppose that, for a learning task, we have to select one hypothesis
out of a set of hypotheses (that may, for example, have been generated 
by multiple applications of a randomized learning algorithm). A common 
approach is to evaluate each hypothesis in the set on some previously 
unseen cross-validation data, and then to select the hypothesis that had 
the lowest cross-validation error. But when the cross-validation data
is partially corrupted, for example by noise, and the set of hypotheses 
we are selecting from is large, then ``folklore'' also warns about 
``overfitting'' the cross-validation data \cite{klockars86:mc,tukey49:cimitaov,tukey53:tpomc}.
In this paper, we explain how this ``overfitting'' really occurs, and 
show the surprising result that it can be overcome by selecting a hypothesis 
with a {\em higher} cross-validation error, over others with lower 
cross-validation errors. We give reasons for not selecting the hypothesis 
with the lowest cross-validation error, and propose a new algorithm, 
LOOCVCV, that uses a computationally efficient form of leave--one--out 
cross-validation to select such a hypothesis. Finally, we present experimental 
results for one domain that show LOOCVCV consistently beating the practice 
of picking the hypothesis with the lowest cross-validation error, even when 
using reasonably large cross-validation sets.

\end{abstract}

%\newpage
%\thispagestyle{empty}
%\ \ 
%\newpage
%\setcounter{page}{1}

\section{Introduction}
		\label{section-intro}	

A basic problem in learning is that of 
selecting a hypothesis out of a set of hypotheses -- that 
is, determining which in a given set of hypotheses is ``best.'' 
So, when given such a set of hypotheses (that was, for example,
possibly generated by multiple applications of a randomized learning 
algorithm 
to a set of training data, or equivalently, by applying a 
deterministic algorithm multiple times but with randomly 
re-chosen parameters each time),
a common approach is to test all of them on some set 
of previously unseen cross-validation (CV) data, and then to 
pick the hypothesis that had the smallest CV error 
\cite{mosteller68:dais,stone74:cvcaaosp,stone78:afaacv}.
For example, we may generate a set of hypotheses by training
multiple Neural Networks 
with Backpropagation on a fixed set of training data, with each 
Neural Network having different initial weights, and then pick
the Neural Network that had the lowest error on a set of 
hold-out CV data. But if the CV data is partially corrupted 
by noise, then ``folklore'' also warns that if we used too small a 
set of CV data to test too large a number of hypotheses, we may
end up picking a poor hypothesis that had fit the corrupted CV 
data well ``just by chance'' 
\cite{klockars86:mc,tukey49:cimitaov,tukey53:tpomc}.
%\footnote{There was some debate on whether this should be called
%``overfitting'' since there is no notion of complexity. However,
%some other authors have used this terminology in similar context,
%and we will continue to do so, in the understanding our paper will
%continue to make clear what we are mean by ``overfitting.''}
In this paper,   	% HERE
we examine this problem of ``overfitting'' of 
CV data. This is an unusual form of overfitting because, unlike
overfitting by single applications of learning algorithms such 
as Decision Trees and Neural Networks, {\em this overfitting comes from 
having tested too many hypotheses, rather than from 
having chosen too complex a hypothesis;} so that, 
although the literature treating overfitting is rich with 
algorithms for selecting/limiting the {\em complexity} of the output 
hypothesis, this literature does not
generalize to this other important problem of ``overfitting'' 
through having tested too {\em many} hypotheses with too small a CV set. 
(The mentioned literature is too wide to survey here, but for 
a sample, see 
\cite{baum89:wsngvg,judd90:nndatcol,miller90:ssir,rissanen78:mbsdd}.)
%quinlan89:idtutmdlp,
%vapnik71:otucorfoettp}.
%Also, for literature discussing an alternative way of using a 
%set of hypotheses, see also \cite{littlestone91:twma,littlestone94:twma}.

In this paper, our contributions are two-fold: First, we explain how 
this overfitting really occurs, and show the surprising 
result that we can do better than picking the hypothesis 
with the lowest CV error. Note that this is a much {\em stronger}
result than merely that the hypothesis with the lowest CV error 
is unlikely to be the hypothesis with the lowest generalization
error; we are claiming that it is sometimes possible to 
actually {\em find} a hypothesis that we can expect to have 
lower generalization error than the minimum-CV-error hypothesis. 
And second, we will present an algorithm, 
{\em Leave-one-out Cross Validated Cross Validation} (LOOCVCV), 
to choose such a hypothesis, and show experimentally that 
LOOCVCV does consistently beat picking the minimum--CV--error 
hypothesis in one domain. 

% HERE
%[These results also extend to using Cross-Validation for model selection.]

\section{Definitions}
		\label{section-definitions}
Henceforth, for simplicity, let us assume a boolean target function $f$.
Let $H=\{h_i\}_{i=1}^{n}$ be the set of $n$ hypotheses that 
we are choosing from, and let us also assume that each hypothesis was 
drawn \iid\/ from some fixed distribution of hypotheses $D_H$ (possibly
a distribution induced by repeated application of a randomized 
learning algorithm). Also, let $S = \{(x_i,y_i)\}_{i=1}^{m}$
be the set of $m$ samples forming the CV data, where 
each $x_i$ is an arbitrary-dimensioned input 
vector drawn \iid\/ from some distribution $D$, and $y_i \in \{0,1\}$,
the labels of the data, have independently been corrupted 
with some fixed but unknown {\em noise rate} $\eta$, so
that $y_i = f(x_i)$ with probability $1-\eta$, and $y_i = \neg f(x_i)$ with
probability $\eta$. We shall also write $p_i = (x_i, y_i)$, so that 
$S = \{p_i\}_{i=1}^{m}$, and let us also define 
$S_{-i} = S \setminus \{p_i\}$.
Furthermore, we shall say that hypothesis $h_i$ 
{\em apparently misclassifies} some element of the CV set 
$(x_j,y_j)$ if $h_i(x_j) \neq y_j$, and that it 
{\em truly misclassifies} it if $h_i(x_j) \neq f(x_j)$. 
Let the CV error $\ehat_S(h_i) \in [0,1]$ be the fraction of the CV set $S$
that hypothesis $h_i$ apparently 
misclassifies, 
and let the generalization error (with respect to
uncorrupted data) of each hypothesis $h_i$ 
be $\varepsilon_{D,f}(h_i) = {\rm Pr}_{x\in D}[h_i(x) \neq f(x)]$,
following the PAC framework 
%of \cite{valiant84:atotl},
(although we will, for notational brevity, often drop
the subscripts in $\ehat_S$ and $\varepsilon_{D,f}$).
Finally, the hypothesis $h_i$ in a set $H$ of size $n$ 
that has the lowest CV error $\ehat_S(h_i)$
will be called the ``best--of--$n$ hypothesis.'' 

Informally, the goal of hypothesis selection is then to choose a 
hypothesis $h$ such that $\varepsilon_{D,f}(h)$ is expected to be ``small,''
and the conventional approach is to choose the 
best--of--$n$ hypothesis.

%Then, informally, the goal of hypothesis selection is, given 
%$S$ and either $D_H$ or $H$, to choose a 
%hypothesis $h$ such that $\varepsilon_{D,f}(h)$ is expected to be ``small,''
%and the conventional approach is choosing the 
%best--of--$n$ hypothesis.

\section{Overfitting by a Simulated Learning Algorithm}
		\label{section-simulated-algorithm}

As mentioned, when we choose the hypothesis $h$ that has the 
lowest $\ehat(h)$, we may be choosing a poor hypothesis that 
had fit the CV data well ``just by chance.'' 
For example, if $\ehat(h) = 0$ so that it apparently classifies 
the CV data perfectly, but some of the CV data's 
labels have been corrupted, then the hypothesis must 
actually be {\em truly misclassifying} the corrupted 
samples in the CV data, and therefore have non-zero 
(and possibly large) generalization error. 

\subsection{A Simulated Learning Algorithm}
	\label{subsection-simulated-algorithm}

To examine the overfitting effect, let us first look at a 
particular model of a learning algorithm and learning task:

\begin{itemize}
  \item We have a black box that draws successive hypotheses $h_i$ \iid\ 
	from $D_H$ (say, by applying a randomized algorithm to
	a set of training data), such that the random variable
	$\varepsilon_{D,f}(h_i)$ (induced by drawing $h_i \in D_H$), 
	is distributed $\varepsilon_{D,f}(h_i) \sim {\rm Uniform}(0,1)$.
  \item We fix $S$ to have $m=100$ CV samples, and assume that our 
	noise process happened to corrupt exactly 20 of this 
	set's labels. (Note that, in this example, it is only to make 
	our exposition cleaner that we have ``fixed'' the noise,
	which should technically be random.)
  \item Each hypothesis $h_i$ has an {\em independent}\/ $\varepsilon(h_i)$
		chance of truly misclassifying each of the $m$ CV samples,
		where the independence is across both hypotheses and CV
		samples.
\end{itemize}

So, to simulate testing a hypothesis with a given
$\varepsilon(h_i)$, we would flip a 0--1 coin with bias 
$\varepsilon(h_i)$ 100 times, negate the result of (say) the 
last 20 flips, and calculate the fraction of resulting
`1's to get $\ehat(h_i)$. Note also that of the assumptions above, 
the one of uncorrelated errors across hypotheses is probably the most 
unrealistic, but we include it to have a simple and analyzable 
example of a learning algorithm.
%And before proceeding, we also wish to note that we also found 
%empirically that our results are only very weakly dependent on the 
%that $\varepsilon(h_i)$ is distributed ${\rm Uniform}(0,1)$. 
The goal, then, is to generate some number of hypotheses, test 
them on the $100$-sample CV data, and then to somehow pick a 
hypothesis $h$ with ``small'' expected generalization 
error.
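For concreteness, the simulation just described can be sketched in a few
lines of Python (our own illustration; the function name and defaults are
arbitrary):

```python
import random

def simulated_cv_error(eps, m=100, n_corrupt=20, rng=None):
    # Flip a 0-1 coin with bias eps once per CV sample (1 = true
    # misclassification), negate the last n_corrupt flips to model the
    # corrupted labels, and return the fraction of resulting 1s.
    rng = rng or random.Random(0)
    flips = [1 if rng.random() < eps else 0 for _ in range(m)]
    for i in range(m - n_corrupt, m):
        flips[i] = 1 - flips[i]
    return sum(flips) / m
```

Note that a hypothesis with $\varepsilon(h)=0$ always gets
$\ehat(h)=0.20$ under this simulation, since exactly the $20$ corrupted
samples are apparently misclassified.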

\subsection{Why Picking the ``Best'' is Bad}
	\label{subsection-why-best-bad}

%In this section, we give a reason for why picking the
%``apparent best'' hypothesis may not be optimal when the CV
%data's labels have been corrupted by noise. Succinctly, it is
%the surprising result that lower CV error does not necessarily 
%imply lower expected generalization error (where the random variable
%that the expectation is over hypotheses drawn from $D_H$).

%Succinctly, is is that
%even though lower $\ehat(h) \not\implies$ 
%lower ${\rm E}[\varepsilon(h)|\ehat(h)]$, because of differing 
%{\em variances} of $\ehat(h)|\varepsilon(h)$'s.
%
%  \item If errors between the hypotheses are uncorrelated, then some
%		``mean'' hypothesis, that somehow agrees with most of the 
%		other hypotheses, can be expected to do well. And at least 
%		on one metric (to be explained), the best--of--$n$ 
%		hypothesis is maximally different from doing this.

%\subsubsection{Lower Test Error \protect$\not\implies$ 
%				Lower Generalization Error}
%	\label{section-lower-not-imply}

In this section, we will explain why picking the ``apparent best'' 
hypothesis may not be optimal when the CV data's labels 
have been corrupted by noise. These results will be the cornerstone
of the rest of this work. But to motivate the rest of this 
discussion, let us first consider a simpler distribution 
of hypotheses than described above, that shows this effect clearly. 
Again, let us assume that exactly $20$ of the $100$ CV data labels 
were corrupted, and that $D_H$ is such that each hypothesis
$h \in D_H$ has either $\varepsilon(h) = 0.00$ or $\varepsilon(h) = 0.10$.
Now, consider the particular case of choosing between two 
hypotheses $h_1$ and $h_2$, where $\ehat(h_1) = 0.17$, and 
$\ehat(h_2) = 0.20$. Suppose we chose $h_1$. Then, we know that 
we must necessarily have 
chosen a hypothesis with $\varepsilon(h) = 0.10$, because if 
$\varepsilon(h) = 0.00$, then $h$ would have classified all of 
the CV sample correctly with respect to the true target function, 
and would therefore have had $\ehat(h) = 0.20$ because $20$ elements of 
the CV sample were corrupted. 
On the other hand, if we chose $h_2$ with $\ehat(h_2) = 0.20$, then 
we have a non-zero chance of picking a hypothesis with 
$\varepsilon(h) = 0.00$. Hence, 
$E_{h \in D_H}[\varepsilon(h)|\ehat(h)=0.20] < E_{h \in D_H}[\varepsilon(h)|\ehat(h)=0.17]$
for this case. 
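A quick simulation (our own illustration, not the paper's code) confirms
this: drawing hypotheses from the two-point prior above, every hypothesis
observed at $\ehat(h)=0.17$ must have $\varepsilon(h)=0.10$, while those
observed at $\ehat(h)=0.20$ are dominated by the
$\varepsilon(h)=0$ hypotheses:

```python
import random

def apparent_error_count(eps, m=100, n_corrupt=20, rng=random):
    # Number of the m CV samples the hypothesis apparently misclassifies:
    # a clean label is apparently wrong iff truly wrong (prob eps); a
    # corrupted label is apparently wrong iff truly right (prob 1 - eps).
    truly_wrong = [rng.random() < eps for _ in range(m)]
    on_clean = sum(truly_wrong[:m - n_corrupt])
    on_corrupt = n_corrupt - sum(truly_wrong[m - n_corrupt:])
    return on_clean + on_corrupt

rng = random.Random(1)
draws = [rng.choice([0.0, 0.10]) for _ in range(20000)]
obs = [(eps, apparent_error_count(eps, rng=rng)) for eps in draws]
eps_at_17 = [eps for eps, e in obs if e == 17]
eps_at_20 = [eps for eps, e in obs if e == 20]
```

Here `eps_at_17` contains only $0.10$ entries, while the average of
`eps_at_20` is far below $0.10$, matching the argument above.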

The previous example might have seemed slightly contrived, but a more general 
explanation for the hypothesis with lower $\ehat(h)$ not having 
lower expected generalization error 
%$E[\varepsilon(h)|\ehat(h)]$ 
is in the 
differing {\em variances} of $\ehat(h)$. Given 
$S$, $D$, $D_H$ and $f$, then $\ehat_S(h)$, 
as a random variable jointly distributed with $\varepsilon(h)$ 
(which is in turn induced by drawing $h \in D_H$), is such that  
$\E_{h \in D_H}[\ehat_S(h)|\,\varepsilon(h)]$ monotonically increases with 
$\varepsilon(h)$ as expected; but, its 
variance ${\rm Var}_{h \in D_H}(\ehat(h) |\, \varepsilon(h))$
also increases as a function of $\varepsilon(h)$, at least
for $\varepsilon(h) \in [0,0.5]$ (here, in fact, 
${\rm Var}_{h \in D_H}(\ehat(h)|\,\varepsilon(h)) = 
\varepsilon(h)(1-\varepsilon(h))/100$). For example, if $\varepsilon(h) = 0$, 
then $\ehat(h) = 0.20$ always, so that
${\rm Var}_{h \in D_H}(\ehat(h)|\,\varepsilon(h)=0) = 0$; 
but ${\rm Var}_{h \in D_H}(\ehat(h)|\,\varepsilon(h)=0.5) = 0.0025 > 0$.
So, in spite of hypotheses with lower $\varepsilon(h)$'s having 
lower ${\rm E}_{h \in D_H}[\ehat(h)|\varepsilon(h)]$'s, it is possible for them
to have a {\em smaller} probability of having ``extremely low'' 
$\ehat(h)$'s ``by chance,'' because their distributions of $\ehat(h)$ 
are less spread-out (lower variance). In words, {\em hypotheses with lower
generalization errors do have lower expected CV errors, but may have
a smaller chance of having ``extremely low'' CV errors because their
distributions of CV errors are less spread--out than those of hypotheses
with higher generalization errors (and higher expected CV errors).}
And for certain ``priors'' such as the simplified $D_H$ above, this leads 
to the effect of ${\rm low}\: \ehatbar \not\implies {\rm low}\: 
\E_{h \in D_H}[\varepsilon(h)|\,\ehat(h)=\ehatbar]$, which 
causes the overfitting phenomenon. 

Finally, let us return to the model of 
Section~\ref{subsection-simulated-algorithm}. Recall that 
exactly $20$ of the $100$ CV data labels were corrupted. Then, using
our ``prior'' for $h \in D_H$ that is such that 
$\varepsilon_{D,f}(h) \sim {\rm Uniform}(0,1)$, we can calculate,
for each value of $\ehatbar$, the posterior distribution 
$\varepsilon(h) | \ehat(h) = \ehatbar$, which we denote by 
$f_{\varepsilon|\ehat}$:
\begin{eqnarray}
f_{\varepsilon|\ehat}(\varepsilon(h) | \ehat(h)=\ehatbar) &=&
		\frac{f_{\ehat|\varepsilon}(\ehat(h)=\ehatbar|\varepsilon(h)) f_\varepsilon(\varepsilon(h))}
			{f_{\ehat}(\ehat(h)=\ehatbar)}  \nonumber \\
		&=& k f_{\ehat|\varepsilon}(\ehat(h)=\ehatbar|\varepsilon(h)) 
					\nonumber \\
		&=& k \sum_{i=0}^{100\ehatbar} \left(B(80,\varepsilon(h), i) 
% uncomment for TWOCOLUMN 
%					\nonumber \\
%		& & \hbox to 0.05in{} 
					B(20, 1-\varepsilon(h), 100\ehatbar-i) \right)
					\nonumber
\end{eqnarray}
\noindent
where $k$ is a normalizing constant, and 
% that makes the integral of 
%  the probability distribution function equal 1, and
%				\left(\begin{array}{c}80\\i\end{array}\right) 
%			\varepsilon(h)^i (1-\varepsilon(h))^{80-i} \nonumber \\
%&&				\left(\begin{array}{c}20\\100\ehatbar-i\end{array}\right)
%				(1-\varepsilon(h))^{100\ehatbar-i} \nonumber\\
%&&				\varepsilon(h)^{20-(100\ehatbar-i)} \nonumber
%
% Original displaymath-style definition of B(.,.,.)
%\begin{displaymath}
%	B(n,p,x) = 
%		\left(\begin{array}{c}n\\x\end{array}\right) p^x (1-p)^{n-x} 
%		\nonumber	
%\end{displaymath}
%
$B(n,p,x) = {n \choose x} p^x (1-p)^{n-x}$
%\noindent
is the probability of a ${\rm Binomial}(n,p)$ distribution returning $x$
(so that $i$ in the summation is the number of apparent misclassifications
that are also true misclassifications). 
These posterior distributions 
of $\varepsilon(h)$ for $\ehatbar = 0.17$, $0.20$,\/ and $0.23$ 
%were evaluated numerically, and 
are shown in 
Figures~\ref{figure-posterior17}\/--\/\ref{figure-posterior23}.
From the figures, we see that if we picked a hypothesis 
with $\ehat(h)=0.17$ or $\ehat(h)=0.23$, we can
expect the hypothesis to have higher generalization error than one 
with $\ehat(h)=0.20$. It was, of course, expected that picking 
a hypothesis with CV error $0.20$ would beat picking one with $0.23$, 
but it was surprising that picking a hypothesis with a {\em higher} error 
on the CV data ($\ehat(h)=0.20$ rather than $0.17$) could, on average,
result in a hypothesis with {\em lower} generalization error.
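The posterior above is straightforward to evaluate numerically on a grid.
The sketch below (our own code; the grid size and function names are
arbitrary) reproduces the qualitative finding that the posterior mean of
$\varepsilon(h)$ given $\ehat(h)=0.20$ is lower than given $\ehat(h)=0.17$
or $0.23$:

```python
from math import comb

def B(n, p, x):
    # Probability that a Binomial(n, p) random variable equals x.
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def posterior_mean(e_count, m=100, n_corrupt=20, grid=400):
    # Posterior mean of eps under a Uniform(0,1) prior, given e_count
    # apparent misclassifications; i in the sum is the number of apparent
    # misclassifications that are also true misclassifications.
    eps_vals = [(j + 0.5) / grid for j in range(grid)]
    w = [sum(B(m - n_corrupt, e, i) * B(n_corrupt, 1 - e, e_count - i)
             for i in range(e_count + 1)) for e in eps_vals]
    z = sum(w)
    return sum(e * wi for e, wi in zip(eps_vals, w)) / z
```

For this model, `posterior_mean(20)` is below both `posterior_mean(17)`
and `posterior_mean(23)`, consistent with the figures.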

This was unexpected because we expect hypotheses 
with lower generalization error to have lower CV error.  
The point is that {\em the reverse is not necessarily true,}\/ and 
hypotheses with lower CV error do not necessarily have lower
expected generalization error. 
That is, given fixed $S$, $D_H$, and $f$,
    ${\rm low}\ \overline{\varepsilon} \implies 
	{\rm low}\ \E_{h\in D_H}[\ehat(h)|\,\varepsilon(h) = \overline{\varepsilon}]$, 
but ${\rm low}\ \overline{\ehat} \not\implies 
	{\rm low}\ \E_{h\in D_H}[\varepsilon(h)|\,\ehat(h) = \overline{\ehat}]$.

% commented out: 9/14/96
%\subsubsection{Agreement between hypotheses}
%
%Suppose that we were told that the noise rate $\eta$ in
%the test data was 0.5, so that we are, from an information
%standpoint, given only an unlabeled test sample
%(i.e. only the $x_i$'s) as the $y_i$'s are contain no information. Then, 
%informally, it seems reasonable to pick some hypothesis that somehow
%``agrees'' with the other hypotheses on the classifications it makes, 
%with the intention that if errors of the hypotheses are uncorrelated,
%then some ``mean'' hypothesis may have lower expected
%generalization error than one chosen randomly from the pool.
%
%However, in picking the best--of--$n$ hypothesis, we are doing exactly
%the opposite of this: We are picking the hypothesis that has the lowest
%test error, and which therefore {\em maximally disagrees with all the other 
%hypotheses on how many labels in the test sample are incorrect}. 
%
%In our experimental results (not reported),
%%in the decision--tree domain
%%we describe in Section~\ref{section-experiment}, 
%we found that, 
%even with moderately high noise rates in the test data, this can 
%led to picking hypotheses whose expected generalization errors are 
%significantly higher than picking a hypothesis at random from the pool.
%%which, with a sufficiently high $\eta$, in turn does worse than picking 
%%the hypothesis with median test error $\ehat(h)$ (which may {\em not} 
%%be a 50\%--error hypothesis).
%
%This also suggests that another natural heuristic to use in hypothesis 
%selection is to pick a hypothesis that agrees with the other hypothesis
%maximally in some sense. However, we consider this issue to be 
%somewhat orthogonal to our problem of hypothesis selection using 
%test data, and shall not explore it further in this paper.


%\section{An Alternative to Choosing the ``Best''} 
\section{Not Choosing the ``Best''} 
	\label{section-alternative}

Because of this ``overfitting'' of CV data, 
``folklore'' suggests generating only a limited number 
of hypotheses from $D_H$ before picking the one 
that has the lowest CV error. The rationale is that
if we generate too few hypotheses, then we might not generate 
any good hypotheses; whereas if we generate too many, we 
increase the risk of generating poor hypotheses that 
fit the CV data well ``by chance,'' and overfit.  So, 
the optimal number of hypotheses to generate should be between
the two extremes, and we will call this number $\nopt$.
But, such practice seems sub-optimal. For example, 
it is not clear how $\nopt$ can be found. Moreover, if we 
knew the value of $\nopt$, but have already generated more 
than $\nopt$ hypotheses, then it is unclear whether we should throw away 
all but $\nopt$ of them, or if there is a better way to use 
the extra hypotheses. Finally, we would also like an algorithm 
that asymptotically improves as it is given more hypotheses.

The problem with generating ``too many'' hypotheses was that those
with very low CV errors were actually ``overfitting,'' and had high 
generalization error. But because, in any large set of 
hypotheses drawn from the fixed distribution of hypotheses,
the fraction of them that ``overfit'' should be approximately 
the same, we propose the following procedure, which we
call {\em percentile-cv}:

%HERE
%[should probably either explain the name (cv), or call it something
%else.]

\begin{enumerate}
   \item Generate as many hypotheses as is computationally feasible.
   \item Test them all on our CV data $S$, and sort them in descending 
		order of $\ehat_S(h_i)$.
   \item For some $k$, pick the hypothesis in the $k$\/-th percentile 
		in the list of sorted 
		hypotheses. (i.e. if we have $n$ hypotheses, then 
		choose the one that was ranked 
		$\left\lceil kn/100 \right\rceil$ in the sorted list.) 
\end{enumerate}
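The three steps above can be sketched in a few lines (a minimal Python
illustration; the function and variable names are our own):

```python
import math

def percentile_cv(hypotheses, cv_errors, k):
    # Step 2: sort indices in descending order of CV error.
    order = sorted(range(len(hypotheses)), key=lambda i: -cv_errors[i])
    # Step 3: take the hypothesis ranked ceil(k*n/100) in the sorted list
    # (so k = 100 recovers the minimum-CV-error hypothesis).
    n = len(hypotheses)
    rank = min(max(math.ceil(k * n / 100), 1), n)
    return hypotheses[order[rank - 1]]
```

Note that percentile-cv itself is independent of how the hypotheses were
generated; only the CV errors matter.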

Note that we have introduced a new variable, $k$, for which we have 
not yet specified a way to choose a value; but note also that we no 
longer need to guess the value of $\nopt$. 

The motivation behind percentile-cv is that when we sort the hypotheses 
in order of $\ehat$'s
as in Step 2 above, we hope that the hypotheses are 
approximately sorted into order of ``truly bad'' 
(high $\varepsilon$, high $\ehat$), followed by ``good'' 
(low $\varepsilon$, low $\ehat$), followed by ``bad, but 
overfit CV data by chance'' (high $\varepsilon$, very low $\ehat$). 
Then, if we chose $k$ well, we would pick something between 
the ``truly bad'' and ``bad, but overfit CV data by chance,'' 
and choose a good hypothesis. Note, however, that we will be 
{\em choosing a hypothesis with higher CV error over some
others with lower CV errors}.

Also, as the number of hypotheses generated tends 
towards $\infty$, since the hypotheses are drawn \iid\/ from a fixed 
distribution, we expect fixed proportions of them to be ``truly bad,'' 
``good,'' and ``bad, but overfit,'' so we will still be choosing 
a hypothesis that is ``good.'' And as an example, in the domain 
of the simulated algorithm described in 
Section~\ref{section-simulated-algorithm}, we would hope that $k$ 
is chosen such that the $k$\/-th percentile hypothesis is one 
that has $\ehat(h)=0.20$.

We now present a result that further justifies using percentile-cv, 
which generates a large set of hypotheses and picks some $k$-th percentile 
hypothesis from it, instead of  best--of--$n$, which generates some 
finite set of $n$ hypotheses, and picks the minimum CV-error 
hypothesis in it.
The following theorem states that, with a well-chosen $k$, 
percentile-cv always does at least as well as best--of--$n$
(i.e. $\exists k$ such 
that $k$-th percentile beats best--of--$n$, for all $n$).
%and can therefore be considered a ``superset'' of best--of--$n$:

\begin{theorem}  \label{theorem-percentile-win}
For any $D_H$, $f$, and $S$, 
$\exists k$ such that, $\forall n$, choosing the $k$-th percentile 
hypothesis from a (sufficiently) large set of hypotheses 
drawn \iid\/ from $D_H$, gives an expected generalization error 
no higher than picking the ``apparent best'' of $n$ randomly 
generated hypotheses.
\end{theorem}
{\bf Proof (Sketch):} Picking the best--of--$n$ is equivalent
to the following in terms of performance: 
(1) Generating a set $H$ of $n$ 
hypotheses, noting down the lowest CV error 
$\ehat_0 = {\rm min}_{h \in H}\; \ehat(h)$, and then (2) generating fresh 
hypotheses until one has the same CV error of $\ehat_0$, and 
outputting that hypothesis, which we shall call $h_{\rm f}$. 
%Now, step (1) chooses $\ehat_0$, and step (2) draws and outputs a 
%hypothesis from the distribution $D_H | \ehat(h) = \ehat_0$. 
Now, step (2) is simply drawing $h_{\rm f}$ from the conditional 
distribution $h \in D_H | \ehat(h)=\ehat_0$; and of the possible 
$\ehat_0$'s that could be chosen, one is ``best'' in terms of giving
the conditional distribution that minimizes 
${\rm E}[\varepsilon(h_{\rm f})| \ehat_0]$ during this step.
%%${\rm E}_{h_{\rm f} \ in D_H}[\varepsilon(h_{\rm f})| \ehat(h_{\rm f}) = \ehat_0]$ during this step.
%which draws $h$ from the distribution $D_H | \ehat(h)=\ehat_0$. 
Call this $\ehat_{\rm opt}$.
Then, there exists $k$ such that the $k$-th percentile hypothesis
will have $\ehat(h) = \ehat_{\rm opt}$ with probability 1. So, for all $n$,
$\exists k$ such that $k$-th percentile (not strictly) beats the apparent
best of $n$. This implies that some optimal $k$ 
must beat {\em all} $n$, which is the statement of the theorem.

\medskip

\noindent
\begin{corollary}  \label{corollary-percentile-strict-win}
If ${\rm E}[\varepsilon(h_{\rm f})| \ehat_0]$ is not constant over 
all possible $\ehat_0$, $\exists k$ such that $\forall n$, the
expected generalization error of the $k$-th percentile hypothesis 
is strictly less than that of the best--of--$n$ hypothesis.
\end{corollary}

\section{Choosing a Percentile}
	\label{section-choose-percentile}

Using the conventional hypothesis selection method, we 
had to estimate $\nopt$, the number of hypotheses to generate
before picking the one that had the lowest CV error. 
In this section, we describe 
{\em Leave-one-out Cross Validated Cross Validation} (LOOCVCV), 
a new algorithm 
that first finds an estimate $\nhat$ of $\nopt$, then chooses 
$k$ from that, and finally applies percentile-cv using that 
value of $k$. We will first show how to choose $k$ from $\nopt$ or 
an estimate $\nhat$ of it, and then how to estimate $\nhat$.

\subsection{Choosing $k$ from $\nopt$}
	\label{subsection-k-from-nopt}

For a fixed CV set $S$, since the hypotheses 
$h_i$\/'s are drawn from a fixed distribution, their CV 
errors $\ehat_S(h_i)$ are also drawn from some fixed distribution. 
Hence, if we generate $\nopt$ hypotheses and pick the one 
with the lowest CV error, then to a ``good approximation,'' we 
expect to pick a hypothesis 
whose CV error just falls into the bottom $1/(\nopt+1)$ fraction of all
CV errors drawn from our distribution of $\ehat$'s. 
(The proof follows from the statistical literature on Extreme 
Value Distributions; see \cite{leadbetter80:earporsap}, for example.)
We therefore propose using the following value of $k$\/:

\begin{equation}
  k = 100 \cdot (1- \frac{1}{\nopt+1})   \label{eqn-k-from-nopt}
\end{equation}
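As a quick numeric check of Equation~\ref{eqn-k-from-nopt} (the function
name below is ours): $\nopt=1$ gives $k=50$, the median, and $\nopt=99$
gives $k=99$.

```python
def percentile_from_n_opt(n_opt):
    # k = 100 * (1 - 1/(n_opt + 1)): picking the best of n_opt hypotheses
    # roughly picks into the bottom 1/(n_opt + 1) fraction of CV errors.
    return 100.0 * (1.0 - 1.0 / (n_opt + 1))
```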

% HERE
%[Might be clearer if we got rid of the constant 100 everywhere,
% and made it fraction-cv or something instead. (?)]

Using notation from the proof of Theorem~\ref{theorem-percentile-win}, 
we can also think of picking the best--of--$n$ hypothesis as a way 
of choosing $\ehat_0$ (and then picking a hypothesis $h$ 
with $\ehat(h) = \ehat_0$). Now, suppose we already have a good 
estimate $\nhat$ of $\nopt$\/; then a natural alternative to 
picking the apparent best of $\nhat$ is to pick the $k$-th percentile 
hypothesis, where $k = 100(1-(1/(\nhat+1)))$. The advantage 
of doing this is that it reduces the {\em variance} of $\ehat_0$, 
and hopefully the ``variance'' of the hypothesis we pick as well. 
Later, we will also give experimental results demonstrating that 
this can do significantly better than picking the best--of--$\nhat$.

\subsection{Estimating $\nopt$}

We will only sketch the main ideas of the algorithm here,
and leave the details to the Appendix.
We want to use Leave-one-out Cross Validation (LOOCV) as follows: 
For each value of $\nhat$, we want a measure of 
the generalization error of the best--of--$\nhat$ 
hypothesis. So, for each sample $p_i \in S$, we shall ``leave-out'' 
$p_i$ and use $S_{-i}$ only,
choosing the hypothesis $\hstar_{i,\nhat}$ that has the minimum
empirical error on $S_{-i}$ among a set of $\nhat$ randomly
generated hypotheses, and then testing $\hstar_{i,\nhat}$ on 
$p_i$ to see if $\hstar_{i,\nhat}(x_i) \testeq y_i$. Then, 
the number of values of $i$ for which $\hstar_{i,\nhat}(x_i) \neq y_i$, 
which we call the {\em LOOCV-error}, is a measure of the 
generalization error of the best--of--$\nhat$ hypothesis. And running
this process on different $\nhat$'s, we can get a measure 
of the LOOCV-error for each value of $\nhat$, and use the 
value of $\nhat$ that minimizes the LOOCV-error.


% HERE
%[Incidentally, doing LOOCV directly on the ``desired'' error doesn't 
%work, for not-so-obvious reasons. The main problem of this is
%that the estimate of the error of the hypothesis that got, say, 19 wrong,
%will be completely determined by the number of hypothesis that got 19
%wrong, and the number of hypothesis that got 20 wrong. Unless we do
%something clever, it'll also always pick the hypothesis that had the 
%{\em highest} test error (yes!). Maybe we can do LOOCV on the percentile, 
%though (emphasis on the ``Maybe''), and do something similar to the 
%bootstrap done in LOOCVCV (described below). But, the math doesn't really 
%work out as nicely as it does for LOOCVCV, and I'm not sure that
%we can really to do the ``simulate bootstrap'' thing (also described 
%below).]

The problem with the current formulation of this algorithm 
is that choosing the best--of--$\nhat$ hypothesis is a noisy process, as it
involves generating randomly chosen hypotheses, so the entire process
has to be repeated many times for each value of $\nhat$ to get estimates
of the LOOCV-error. This will, unfortunately, generally require 
generating an intractable number of hypotheses. So, instead, we shall
only generate a large but finite set $H$ of hypotheses, and choose uniformly 
and with replacement from this set whenever we need to generate a hypothesis
(this is very similar to the non-parametric bootstrap 
from the Statistics literature \cite{efron79:bmalatj}).
However, even doing this, we found that we would still require averaging
over an intractable number of repetitions to get a smooth curve for the 
LOOCV-error as a function of $\nhat$. Fortunately, there is 
a way to calculate explicitly, from $\nhat$, $H$, and $S$, 
the LOOCV-error value we would get if we really repeated 
this bootstrap-like process an infinite number of 
times, and this is what LOOCVCV does.
(Details are given in the appendix.)

So, for every value of $\nhat$, we now have a tractable way of estimating 
the generalization error of the best--of--$\nhat$ hypothesis,
and we can 
use the value of $\nhat$ that minimizes the LOOCV-error as our 
estimate of $\nopt$. With this, we can use 
Equation~\ref{eqn-k-from-nopt} to derive $k$, our desired 
percentile, which we can then use to apply percentile-cv. This 
completes our formulation of the LOOCVCV algorithm.
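The conversion from $\nhat$ to the percentile $k$ is a one-line computation; a sketch of Equation~\ref{eqn-k-from-nopt} (the function name is ours, for illustration):

```python
def percentile_from_nhat(nhat):
    """k = 100 (1 - 1/(nhat + 1)): the percentile at which percentile-cv
    picks a hypothesis, given the estimate nhat of n_opt."""
    return 100.0 * (1.0 - 1.0 / (nhat + 1.0))
```

For instance, `percentile_from_nhat(103)` gives roughly the $99.0$\/-th percentile quoted in the next section.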


\section{Results Using the Simulated Learning Algorithm}
		\label{section-results-simulated-algo}	

We ask the reader to refer to Figure~\ref{figure-simulated-results},
which shows results of different model selection methods applied to
the Simulated Learning Algorithm 
described in Section~\ref{section-simulated-algorithm}.
The value of $n$ varies along the $x$-axis, and the top-most curve
is the expected generalization error if we draw $n$
hypotheses and pick the one with the lowest CV error. The middle,
stepped line is the expected generalization error if we picked the 
$k$\/-th percentile hypothesis, with $k = 100 (1- (1/(n+1)))$,
using an ``infinitely'' large set of hypotheses. Finally, the bottom 
horizontal line is the generalization 
error that LOOCVCV achieves, also using an ``infinitely'' large 
set of hypotheses.


For this particular problem, $\nopt \approx 103$. Notice, however, that 
even if we generated exactly $\nopt$ hypotheses and picked the one with
the lowest CV error, we would expect to get a generalization
error of about 0.035, which is higher than if we picked the 
$k=100(1 - (1/(103+1))) \approx 99.0$\/-th 
percentile hypothesis from an ``infinitely'' large sample of hypotheses,
which gives an expected generalization error of about 0.025.
The reason for this is that with best--of--$n$, the CV error
$\ehat(h)$ of the hypothesis we choose in the end has non-zero variance,
and therefore performs worse than percentile-cv, by our argument 
of Section~\ref{subsection-k-from-nopt}.
That is, 
$\E_{h \in D_H}[\varepsilon(h) | \ehat(h) = \overline{\hat\varepsilon}]$ 
is minimized when $\ehatbar = 0.20$. However, in the process of
drawing $\nopt$ hypotheses and picking the one with lowest CV error,
we may occasionally pick hypotheses with other values of $\ehat(h)$,
which is why its expected generalization error is higher. On the other
hand, in the ``infinitely'' large sample of hypotheses, 
the $99.0$\/-th percentile hypothesis has $\ehat(h) = 0.20$ 
with probability $1$, which is why picking
the $99.0$\/-th percentile hypothesis results in a lower expected
generalization error than best--of--103: There is less 
variance in the CV error of the hypothesis we pick.
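This variance argument can be illustrated with a toy simulation. The Gaussian pool of CV errors below is an assumption for illustration only (it is not the distribution $D_H$ of the Simulated Learning Algorithm): the $\ehat$ of the best--of--$n$ pick varies from repetition to repetition, while the percentile pick from a fixed large pool is a single deterministic value.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical pool of CV errors ehat(h) for a large set of hypotheses
# (Gaussian shape is an assumption made purely for illustration).
pool = sorted(rng.gauss(0.30, 0.05) for _ in range(10000))

n = 103
# best-of-n: the ehat of the chosen hypothesis varies across repetitions
best_of_n_ehats = [min(rng.choices(pool, k=n)) for _ in range(2000)]
spread = statistics.pstdev(best_of_n_ehats)   # strictly positive

# k-th percentile pick from the full pool: a fixed value (zero variance),
# sitting at the quantile 1/(n+1) from the bottom of the pool
k = 100.0 * (1.0 - 1.0 / (n + 1))
percentile_ehat = pool[int(len(pool) * (1.0 - k / 100.0))]
```

The two rules land near the same typical $\ehat$, but only the best--of--$n$ rule has spread around it.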

Similarly, note too that, for most values of $n$, picking the 
$k$\/-th percentile hypothesis (the middle curve) beats picking 
the apparent best of $n$ (the top curve). We argue that this is 
due to the same reason: the percentile rule results in lower variance 
in the CV error of the chosen hypothesis. This is further justification 
for our earlier argument that, if we did not want to apply LOOCVCV, but 
already had an estimate $\nhat$ of $\nopt$, then we may be better off
picking the $100(1-(1/(\nhat+1)))$\/-th percentile hypothesis, instead of 
the best--of--$\nhat$ hypothesis.

Finally, we note that when LOOCVCV is given an ``infinitely''
large set of hypotheses for this problem, it will always achieve an 
expected generalization error of about 0.025, which is provably
the best that any algorithm using percentile-cv can achieve on this problem.

\section{Experimental Results}
	\label{section-experiment}

To examine a slightly more realistic hypothesis selection problem, 
we considered a small learning task involving decision trees. The target
function was a boolean function over binary attributes $x_0$ to 
$x_4$, with $f(x_0,x_1,x_2,x_3,x_4) = (x_0 \wedge x_1 \wedge x_2) \vee 
(x_3 \wedge x_4)$. The input distribution $D$ was uniform over the 
instance space of size 32, 
and $n = 1000$ hypotheses were generated to form the set $H$.
Finally, $D_H$ was such that each hypothesis was generated by drawing 
$20$ samples from $D$ with their correct labels, and running an ID3-like 
decision 
tree algorithm using a greedy information-gain heuristic to choose 
splits \cite{quinlan86:iodt}. (Of course, no practitioner would ever see 
such a distribution of hypotheses; but, we are interested only in 
the problem of selecting 
a hypothesis out of a set, and this was a convenient way of 
creating such a set of hypotheses.) 
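A minimal sketch of this setup follows; the target function and the uniform instance space of size 32 are as stated above, while the ID3-like tree-growing step is omitted (any function names here are ours, for illustration):

```python
from itertools import product

def target(x0, x1, x2, x3, x4):
    """The target boolean function f from the experiment:
    f = (x0 AND x1 AND x2) OR (x3 AND x4)."""
    return int((x0 and x1 and x2) or (x3 and x4))

# The input distribution D is uniform over this instance space of size 32;
# each hypothesis would be an ID3-like tree trained on 20 labeled samples.
instance_space = [(bits, target(*bits)) for bits in product((0, 1), repeat=5)]
```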

All experimental results reported in this section are averages of 
400 trials. First, using $m = 20$ CV samples, the results for different 
noise rates $\eta$ in the CV data are
in Figure~\ref{figure-xy20}. We see that for noise rates
of about $\eta=0.1$ and higher, picking the best--of--1000 hypothesis
loses to LOOCVCV. Note too that picking the apparent
best could do even worse than these results suggest,
had we used more than the 1000 hypotheses to which we 
restricted ourselves for computational reasons.

One might think that this effect would go away with a larger
CV sample. However, it was still significant when the size
of the CV sample was increased to 40 (and note that the size of the 
instance space is 32), keeping everything else the same as before
(Figure~\ref{figure-xy40}). Also, we tried 
varying the CV-sample size across the larger range of 
5 to 80, using a fixed noise rate $\eta$ of 0.3; the results 
(Figure~\ref{figure-xytest}) again show LOOCVCV consistently doing better.

Next, to allow certain quantities, such as $\nopt$, to be 
computed exactly, we carried out the following 
experiments by first drawing 1000 hypotheses from $D_H$, and then sampling 
with replacement from this pool of 1000 each time we needed to generate a
hypothesis, instead of drawing a new one from $D_H$. 
(We also verified that these results are close to those we would have 
obtained using $D_H$ directly.)

Recall that in Section~\ref{subsection-k-from-nopt}, we claimed that
if we had an estimate $\nhat$ of a good $n$ to use in picking best--of--$\nhat$,
we might instead wish to use the $k$-th percentile hypothesis,
with $k = 100 (1- (1/(\nhat+1)))$. We see this effect again in our next experiment.
Using $m=20$ CV samples again, the top line of Figure~\ref{figure-xyopt} 
is the expected error if we knew the best $n$, namely $\nopt$, 
exactly and chose the best--of--$\nopt$ hypothesis. The middle curve is the error
of the $k$-th percentile hypothesis, where $k = 100(1-(1/(\nopt+1)))$, and
we see that it consistently beats best--of--$\nopt$. Finally, the 
lowermost curve is if we knew the optimal $k$, $k_{\it opt}$, and chose
the $k_{\it opt}$-th percentile hypothesis, which, as predicted by
Theorem~\ref{theorem-percentile-win}, beats best--of--$\nopt$ also.

\section{Discussion}

The results we have presented so far generally show
LOOCVCV beating best--of--$n$ by picking a hypothesis other than
the one with the smallest CV error. Of course, best--of--$n$ 
is like using $k=100$, so if there were insufficient data to
find a good estimate $\kopthat$ of $\kopt$, or if $\kopt$ is close to $100$,
then we would expect best--of--$n$ to beat LOOCVCV, which uses a 
possibly noisy estimate of $\kopt$. In fact, at the
noise rate $\eta = 0$, LOOCVCV generally does lose
to best--of--$n$ (Figures~\ref{figure-xy20} and~\ref{figure-xy40}). 
Also, some preliminary results (not reported) involving 
artificial neural networks, 
trained by backpropagation on a fixed set of training data and 
with randomly re-initialized weights each time,
also showed LOOCVCV losing slightly to best--of--$n$.

Leaving aside considerations of a loss function over $k$, 
we might therefore decide not to use LOOCVCV unless we had 
reason to believe that the ``overfitting'' problem is 
significant in the distribution of hypotheses in our hypothesis 
selection problem, or that LOOCVCV will find a good $\kopthat$ 
(such as when the CV sample is large).  However, if we do 
have an exceedingly large number of hypotheses to choose from, 
and have a reasonably large CV-set, then LOOCVCV may give
much better performance than simply selecting the hypothesis with the 
lowest CV error. 

Lastly, we wish to comment on some of the implications that these 
results may have for the practitioner reading this paper. 
As an example, if cross-validation is used 
to decide how many epochs to train a backpropagation neural network 
(which, while not falling within our assumed framework of \iid\/ $h$'s, 
is similar), or to select from a large pool of learning algorithms,
then it may {\em not} be advisable 
to pick the minimum cross-validation error hypothesis to
use as the final hypothesis. However, how strongly the
results of this paper will bear out in future practice 
remains to be seen.


\section{Conclusions} 

This paper explained how overfitting of CV data occurs despite
there being no notion of {\em complexity},
and presented percentile-cv and LOOCVCV, which try to correct 
this problem. It also proved that percentile-cv with an appropriately chosen 
percentile will always do at least as well as best--of--$n$, 
and demonstrated experimentally that LOOCVCV consistently beats 
best--of--$n$ in one domain. However, there are also domains where 
overfitting of CV data is not a significant problem, in which 
LOOCVCV loses slightly to best--of--$n$; the characterization of 
such domains is an open problem.




\section*{Appendix: The LOOCVCV Algorithm}

We ask the reader to refer to Section~\ref{section-choose-percentile}
for the details of the LOOCVCV algorithm. Here, we describe how it 
finds an estimate $\hat{n}$ of $\nopt$ using $H$, from which we can
derive $k$, for picking the $k$-th percentile hypothesis.
%(In practice, 
%we pick the $k$-th percentile hypothesis within $H$, rather than
%drawing a fresh set.) 
\hb
{\bf Input:} $\hat{n}$, $H[1..n]$, $S$, $i$ 	\hb
{\bf Output:} leave-one-out error with $p_i$ ``left-out,''
		for the best--of--$\hat{n}$ hypothesis drawn from $H$	\hb
{\bf Notes:} The idea of this function is to sort $H$ in order of the
	leave-one-out $\ehat$'s, and to calculate, for each $h \in H$, the 
	{\em probability} that it is the lowest--CV--error hypothesis in a 
	set of $\nhat$ hypotheses drawn from $H$ (step 9 below). \hb
\noindent
\def\err {{\rm err}}
\def\ebar {\overline{\rm e}}
\def\empt {\hbox to 0.0in{}}
1.  function loocv-error($\hat{n}$, $H[1..n]$, $S$, $i$) 	\hb
2.  \ \ \ \ \ create $H'[1..n].\{\ehat_{-1},e\}$		\hb
3.  \ \ \ \ \ for ($j = 1 {\rm \ to\ } n$) do begin 		\hb
4.  \ \ \ \ \ \ \ \ \ $H'[j].\ehat_{-1} = 
	| \{ (x_k, y_k) \in S_{-i} : H[j](x_k) \neq y_k \} |$ \hb
5.  \ \ \ \ \ \ \ \ \ $H'[j].e = 1$ if $H[j](x_i) \neq y_i$, $0$ otherwise \hb
6.  \ \ \ \ \ \ \ \ \ end 						\hb
7.  \ \ \ \ \ Sort $H'[]$ in ascending order of $H'[].\ehat_{-1}$ \hb
8.  \ \ \ \ \ for each value $x$ taken by $H'[j].\ehat_{-1}$, 		
let $\ebar(x)$ be the mean of the $H'[j].e$'s 
that have $H'[j].\ehat_{-1} = x$.		\hb
9. return $\sum_{j=1}^{n} 
            \left( 
                (\frac{n-j+1}{n})^{\hat{n}} - (\frac{n-j}{n})^{\hat{n}} 
            \right) 
				\ebar(H'[j].\ehat_{-1})$ \hb
\hb
{\bf Input:} $H$, $S$						\hb
{\bf Output:} Recommended $\nhat$				\hb
10. function loocvcv-estimate-nopt($H$,$S$) 		\hb
11. return ${\rm argmin}_{\nhat \in [1,\infty)} 
		\{ \sum_{i=1}^m \hbox{{\rm loocv-error}}(\nhat, H, S, i)$\} \hb
\hb
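For readers who prefer an executable form, the following is a Python rendering of the pseudocode above. The representation of hypotheses as callables $h(x) \rightarrow$ label, and the truncation of the argmin over $[1,\infty)$ at a finite cap, are assumptions made for this sketch:

```python
from collections import defaultdict

def loocv_error(nhat, H, S, i):
    """Closed-form expected leave-one-out error, with S[i] held out, of
    the best-of-nhat hypothesis drawn with replacement from the pool H.
    H is a list of callables h(x) -> label; S is the CV sample
    [(x_1, y_1), ..., (x_m, y_m)]."""
    n = len(H)
    x_i, y_i = S[i]
    S_minus_i = S[:i] + S[i + 1:]
    # steps 3-6: leave-one-out CV error and held-out indicator per hypothesis
    records = [(sum(h(x) != y for x, y in S_minus_i),
                1 if h(x_i) != y_i else 0) for h in H]
    records.sort(key=lambda r: r[0])                      # step 7
    # step 8: ebar(x) = mean held-out indicator over hypotheses tied at x
    groups = defaultdict(list)
    for ehat, e in records:
        groups[ehat].append(e)
    ebar = {x: sum(es) / len(es) for x, es in groups.items()}
    # step 9: weight rank j by the probability that the lowest-CV-error
    # hypothesis among nhat uniform draws from H has sorted rank j
    return sum((((n - j + 1) / n) ** nhat - ((n - j) / n) ** nhat)
               * ebar[records[j - 1][0]] for j in range(1, n + 1))

def estimate_nopt(H, S, max_nhat=200):
    """Steps 10-11, with the argmin over [1, infinity) truncated at
    max_nhat (an assumption made here) for tractability."""
    return min(range(1, max_nhat + 1),
               key=lambda nhat: sum(loocv_error(nhat, H, S, i)
                                    for i in range(len(S))))
```

Note that with $\nhat = 1$ every rank receives weight $1/n$, so the result reduces to the mean held-out error over $H$, as expected.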

\vspace*{-0.2in}
\newcounter{my-figures-ctr}
\hskip -0.55in 
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.17.ps,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior17}
%\end{figure} 
  \centerline{\parbox{3.01in} {Figure~\ref{figure-posterior17}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.17$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.17] \approx 0.0648$.
           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}
\hbox to 0.2in {}
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.20.ps,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior20}
%\end{figure}
  \centerline{\parbox{3.01in} {Figure~\ref{figure-posterior20}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.20$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.20] \approx 0.0253$.
           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}

\hskip -0.55in
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/posterior.new.23.ps,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-posterior23}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-posterior23}: posterior distribution for $\varepsilon(h) | \ehat(h) = 0.23$.
	    $\E_{h\in D_H}[\varepsilon(h) | \ehat(h) = 0.23] \approx 0.0727$.
           See Section~\protect\ref{section-simulated-algorithm} for details.}}
}
\hbox to 0.2in {}
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/cv/num60000.com.ps,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-simulated-results}
%\end{figure}
 \centerline{\parbox{3.01in} 
	{Figure~\ref{figure-simulated-results}: $n$ on $x$-axis, generalization error on $y$-axis. 
	   Top curve is best--of--$n$; middle, stepped line is the
		$k$\/-th percentile hypothesis, with $k = 100 (1- (1/(n+1)))$;
	    bottom line is LOOCVCV.}}
}

\newpage
\vspace*{-.2in}
\hskip -0.55in
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xy20.ps0,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-xy20}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xy20}: Noise rate on $x$-axis, generalization error on $y$-axis.
	m=20. Vertical dashes show 95\% confidence intervals for means. }}
}
\hbox to 0.2in {}
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xy40.ps0,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-xy40}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xy40}: Noise rate on $x$-axis, generalization error on $y$-axis.
	m=40. Vertical dashes show 95\% confidence intervals for means. }}
}

%\vspace*{-1.0in}
\hskip -0.55in
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xytest.ps0,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
  \label{figure-xytest}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xytest}: m (number of test-data samples, log scale) on $x$-axis,
	generalization error on $y$-axis.
	Uppermost curve is best--of--1000, lower curve is LOOCVCV.}}
}
\hbox to 0.2in {}
\parbox{3.01in}
{
%\begin{figure}[h]
  \psfig{figure=/afs/andrew/usr/an2i/learn-5/postscripts/xyoptint.ps0,width=2.70in,height=1.7in,angle=270}
\refstepcounter{my-figures-ctr}
 \label{figure-xyopt}
%\end{figure}
  \centerline{\parbox{3.01in}
	{Figure~\ref{figure-xyopt}: Noise on $x$-axis, error on $y$-axis.
	Uppermost curve is best--of--$\nopt$, middle curve is the
	$k$-th percentile hypothesis, with $k = 100 (1- (1/(\nopt+1)))$,
	bottom curve is the $k_{\it opt}$-th percentile hypothesis.}}
}

\bibliographystyle{plain}
\bibliography{cv-bib}

\end{document}
