\documentclass[11pt]{article}



% FORMAT SPECIFICATIONS

% My attempt at fitting 40 lines onto each page. (Use w/12pt) Yucks.
\columnsep=0.25in
\setlength{\textheight}{7.5in}		% ICML97 requirement. DO NOT CHANGE
\setlength{\textwidth}{5.5in}		% ICML97 requirement. DO NOT CHANGE

\setlength{\topmargin}{0.8in}
\setlength{\headsep}{0pt}
\setlength{\headheight}{0pt}
\setlength{\oddsidemargin}{0.5in}
\setlength{\evensidemargin}{0.5in}

\begin{document}

\newlength{\linelength}
\setlength{\linelength}{0.35\textwidth}

\newcounter{my-figures-ctr}

\pagenumbering{arabic}

\def\Proof {{\bf Proof: \enspace}}
\def\argmin {\mathop{\rm argmin}}
\def\argmax {\mathop{\rm argmax}}
\def\hb {\hfil\break}
\def\entropy {{\cal H}}
\def\implies {\Rightarrow}
%\def\ehat {\hat\epsilon}
\def\ehat {\hat\varepsilon}
\def\nhat {\hat{n}}
\def\ehatbar{\overline{\hat\varepsilon}}
\def\iid {{\it i.i.d.}}
\def\Pr {{\rm Pr}}
\def\E {{\rm E}}
\def\nopt {{n_{\it opt}}}
\def\nopthat {\widehat{n_{\it opt}}}
\def\testeq {\stackrel{?}{=}}
\def\hstar {h^{\ast}}
\def\kopt {{k_{\it opt}}}
\def\kopthat {\widehat{\kopt}}


\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{observation}{Interesting Observation}

 
\title{Preventing ``Overfitting'' of Cross-Validation Data}
 
\author{ {\bf Andrew Y.~Ng} \\
School of Computer Science \\
Carnegie Mellon University \\
Pittsburgh PA 15213 \\
Andrew.Ng@cs.cmu.edu \\ 
\\
Advisor: {\bf Andrew W. Moore} \\
} 
 
\date{January 20, 1997}
 
\maketitle 
 
\begin{abstract}

Suppose that, for a learning task, we must select one hypothesis out of
a set of hypotheses (which may, for example, have been generated by
multiple runs of a randomized learning algorithm). A common approach is
to evaluate each hypothesis in the set on some previously unseen
cross-validation data, and then to select the hypothesis with the lowest
cross-validation error. But when the cross-validation data is partially
corrupted, for example by noise, and the set of hypotheses we are
selecting from is large, ``folklore'' warns of ``overfitting'' the
cross-validation data.
In this paper, we explain how this ``overfitting'' actually occurs, and
show the surprising result that it can be overcome by selecting a hypothesis
with a {\em higher} cross-validation error, in preference to others with
lower cross-validation errors. We give reasons for not selecting the
hypothesis with the lowest cross-validation error, and propose a new
algorithm, LOOCVCV, that uses a computationally efficient form of
leave-one-out cross-validation to select such a hypothesis. Finally, we
present experimental results for one domain showing that LOOCVCV
consistently outperforms selecting the hypothesis with the lowest
cross-validation error, even when reasonably large cross-validation sets
are used.

\end{abstract}

\thispagestyle{empty}

\bigskip
This work will be presented at the Fourteenth International Conference on
Machine Learning.
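% The abstract describes a baseline selection rule (evaluate each
% candidate hypothesis on held-out cross-validation data and pick the
% one with the lowest cross-validation error) that the paper argues can
% ``overfit'' the cross-validation set. A minimal sketch of that
% baseline rule, assuming 0/1 classification loss, is given below; the
% function and variable names are illustrative only, and the LOOCVCV
% algorithm itself is not reproduced here.

```python
def cv_error(hypothesis, cv_data):
    """Fraction of cross-validation examples the hypothesis misclassifies.

    cv_data is a list of (x, y) pairs; hypothesis is a callable x -> label.
    """
    mistakes = sum(1 for x, y in cv_data if hypothesis(x) != y)
    return mistakes / len(cv_data)

def select_lowest_cv_error(hypotheses, cv_data):
    """Baseline rule from the abstract: return the hypothesis whose
    cross-validation error is lowest. The paper's point is that, with
    noisy cv_data and many candidate hypotheses, this rule can pick a
    hypothesis that merely fits the noise in the cross-validation set."""
    return min(hypotheses, key=lambda h: cv_error(h, cv_data))
```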

\end{document}

