Return-path: <danar@cs.huji.ac.il>
X-Andrew-Authenticated-as: 0;andrew.cmu.edu;Network-Mail
Received: from po3.andrew.cmu.edu via trymail
          ID </afs/andrew.cmu.edu/usr25/an2i/Mailbox/wjMnhvC00Udb4QsU5F>;
          Sun, 12 Mar 1995 13:20:43 -0500 (EST)
Received: from cmu2.cs.cmu.edu (CMU2.CS.CMU.EDU [128.2.206.46]) by po3.andrew.cmu.edu (8.6.9/8.6.9) with ESMTP id NAA07390 for <an2i+@andrew.cmu.edu>; Sun, 12 Mar 1995 13:20:40 -0500
Received: (from daemon@localhost) by cmu2.cs.cmu.edu (8.6.9/8.6.9) id NAA27020 for an2i+@andrew.cmu.edu; Sun, 12 Mar 1995 13:20:39 -0500
Received: via localmail; Sun, 12 Mar 1995 13:20:37 -0500 (EST)
Received: from cs.huji.ac.il (cs.huji.ac.il [132.65.16.10]) by cmu2.cs.cmu.edu (8.6.9/8.6.9) with SMTP id NAA27016 for <lion+@CMU.EDU>; Sun, 12 Mar 1995 13:19:09 -0500
Received: from minuet.cs.huji.ac.il by cs.huji.ac.il with SMTP id AA05814
  (5.67b/HUJI 4.153 for <lion+@CMU.EDU>); Sun, 12 Mar 1995 20:22:54 +0200
Received: by minuet.cs.huji.ac.il id AA24347
  (5.65c/HUJI 4.114 for <lion+@CMU.EDU>); Sun, 12 Mar 1995 20:18:53 +0200
Date: Sun, 12 Mar 1995 20:18:53 +0200
From: Dana Ron <danar@cs.huji.ac.il>
Message-Id: <199503121818.AA24347@minuet.cs.huji.ac.il>
To: Andrew Y Ng <lion+@CMU.EDU>
Subject: Re:  TeX version of paper

I think what I have is the submitted version.
I am enclosing it below. What is your other paper about?

Dana.

----------------------------------------------------------------
%From mkearns@research.att.com Fri Jan  6 01:53:58 1995

\documentstyle{article}

\input{psfig}

% FORMAT SPECIFICATIONS

\columnsep=0.25in
\setlength{\topmargin}{-0.250in}
\setlength{\headsep}{0pt}
\setlength{\headheight}{0pt}
\setlength{\textheight}{8.8in}
\setlength{\textwidth}{6.75in} %\setlength{\textwidth}{6in}
\setlength{\oddsidemargin}{0in}
\addtolength{\leftmargin}{-0.5in}
% \renewcommand{\baselinestretch}{0.95}

\iffalse
\columnsep=0.25in
\addtolength{\topmargin}{-0.75in}
\setlength{\textheight}{8.5in}
\setlength{\textwidth}{6.5in} %\setlength{\textwidth}{6in}
\addtolength{\oddsidemargin}{-0.75in}
\renewcommand{\baselinestretch}{1.0}
\pagestyle{plain}
\fi

\begin{document}

\newlength{\linelength}
\setlength{\linelength}{0.35\textwidth}

\title{An Experimental and Theoretical Comparison\\
	of Model Selection Methods}

\author{
Michael Kearns\\
\makebox[\linelength]{AT\&T Bell Laboratories}\\
\makebox[\linelength]{Murray Hill, New Jersey}
\and
Yishay Mansour\\
\makebox[\linelength]{Tel Aviv University}\\
\makebox[\linelength]{Tel Aviv, Israel}
\and
Andrew Y. Ng\\
\makebox[\linelength]{Carnegie Mellon University}\\
\makebox[\linelength]{Pittsburgh, Pennsylvania}
\and
Dana Ron\\
\makebox[\linelength]{Hebrew University}\\
\makebox[\linelength]{Jerusalem, Israel}
}

\date{\today}
\maketitle
\thispagestyle{empty}

\begin{abstract}
We investigate the problem of {\em model selection\/} in the
setting of supervised learning of boolean functions from independent
random examples. More precisely, we compare methods for finding a balance
between the complexity of the hypothesis chosen and its observed error on
a random training sample of limited size, when the goal is 
that of minimizing the
resulting generalization error. We undertake a detailed
comparison of three well-known model 
selection methods --- a variation of Vapnik's
{\em Guaranteed Risk Minimization\/} (GRM), an instance of Rissanen's
{\em Minimum Description Length Principle\/} (MDL), and cross validation (CV).
We introduce a general class of model selection methods (called
{\em $\hat{\epsilon}$-based\/} methods) that includes both GRM and MDL, with the
goal of providing general methods for analyzing such rules.

We provide both controlled experimental evidence and formal theorems
to support the following conclusions:
\begin{description}
\item $\bullet$ Even on simple model selection problems, the behavior of the
	methods examined can be both complex and incomparable. Furthermore,
	no amount of ``tuning'' of the rules investigated (such as
	introducing constant multipliers on the complexity penalty
	terms, or a distribution-specific ``effective dimension'')
	can eliminate this incomparability.
\item $\bullet$ It is possible to give rather general bounds on the generalization
	error, as a function of sample size,
	for $\hat{\epsilon}$-based methods. The quality of such bounds
	depends in a precise way on the extent to which the 
	method considered automatically limits
	the complexity of the hypothesis selected.
\item $\bullet$ For {\em any\/} model selection problem, the additional error
	of cross validation compared to {\em any\/}
	other method can be bounded above by the sum of two terms.
	The first term is large only if the learning curve of
	the underlying function classes
	experiences a ``phase transition'' between $(1-\gamma)m$ and
	$m$ examples (where $\gamma$ is the fraction saved for testing
	in CV). The second and competing term can be made arbitrarily
	small by increasing $\gamma$.
\item $\bullet$ The class of $\hat{\epsilon}$-based methods is fundamentally handicapped
	in the sense that there exist two types of model
	selection problems for which every $\hat{\epsilon}$-based method must
	incur large generalization error on at least one, while CV 
	enjoys small generalization error on both.
\end{description}
Despite the inescapable incomparability of model selection methods under
certain circumstances, we conclude with a discussion of our belief that the
balance of the evidence provides specific reasons to prefer CV to other
methods, unless one is in possession of detailed problem-specific information.
\end{abstract}

\newpage

% MACROS

\def\Exp {{\bf E}}
\def\Pr {{\bf Pr}}
\def\argmin {{\rm argmin}}

\def\ehat {{\hat{\epsilon}}}
\def\vs {{\it VS}}
\def\etwiddle {{\tilde{\epsilon}}}

\def\noiserate {{\eta}}
\def\noisyeps {{\epsilon_{\noiserate}}}
\def\noisyetw {{{\tilde{\epsilon}}_{\noiserate}}}

\def\entropy {{\cal H}}

\def\epsilonsrm {{\epsilon}_{\mbox{\footnotesize \sc grm}}}
\def\epsilonmdl {{\epsilon}_{\mbox{\footnotesize \sc mdl}}}
\def\epsiloncv {{\epsilon}_{\mbox{\footnotesize \sc cv}}}
\def\epsilonsgrm {{\epsilon}_{\mbox{\footnotesize \sc sgrm}}}
\def\epsilonM {{\epsilon}_{\mbox{\footnotesize \sc m}}}
\def\lengthsrm {{\hat{d}}_{\mbox{\footnotesize \sc grm}}}
\def\lengthmdl {{\hat{d}}_{\mbox{\footnotesize \sc mdl}}}
\def\lengthcv {{\hat{d}}_{\mbox{\footnotesize \sc cv}}}

\def\dmax {{d_{\mbox{\footnotesize \sc max}}}}

% AVRIM'S MARGINAL NOTE MACRO

\addtolength{\marginparwidth}{-0.4in}
\newcommand{\mnote}[1]
    {\marginpar%
        [{\tiny\begin{minipage}[t]{\marginparwidth}\raggedright#1%
                        \end{minipage}}]%
        {\tiny\begin{minipage}[t]{\marginparwidth}\raggedright#1%
                        \end{minipage}}%
    }

% GENERAL FORMS 

% Following from John Riecke

\newenvironment{labproof}[1]{\begin{trivlist}\item[\hskip\labelsep{\bf
Proof of #1:}]}{\blackslug\end{trivlist}} 
\newenvironment{labtheorem}[1]{\begin{trivlist}\begin{it}\item[\hskip\labelsep{\bf
Theorem #1}]}{\end{it}\end{trivlist}} 
\newenvironment{lablemma}[1]{\begin{trivlist}\begin{it}\item[\hskip\labelsep{\bf
Lemma #1}]}{\end{it}\end{trivlist}} 

% An endproof symbol  (a little solid black box)
\newcommand{\blackslug}{\rule{7pt}{7pt}}

\newcommand{\proof}{{\bf Proof: \enspace}}
\newcommand{\comment}[1]{}
\newcommand{\hs}{\enspace}
\newcommand{\hhs}{\thinspace}
\newcommand{\diverges}{\uparrow}
\newcommand{\converges}{\downarrow}
\newcommand{\inter}{\bigcap}
\newcommand{\real}{\ifmmode {\rm R} \else ${\rm R}$ \fi}
\newcommand{\nat}{\ifmmode {\rm N} \else ${\rm N}$  \fi}
\newcommand{\tot}{\ifmmode {\cal T} \else ${\cal T}$ \fi}
\newcommand{\sigstar}{\ifmmode \Sigma^{\ast} \else $\Sigma^{\ast}$ \fi}
\renewcommand{\star}{\ast}
\newcommand{\eqstar}{=^\ast}
\newcommand{\eqa}{=^a}
\newcommand{\eqk}{=^k}
\newcommand{\iok}{\exists_k^{\infty}}
\newcommand{\aek}{\forall_k^{\infty}}
\newcommand{\inn}{\ifmmode \in \else $\in$ \fi}
\renewcommand{\phi}{\ifmmode \varphi \else $\varphi$ \fi}
\renewcommand{\le}{\ifmmode \leq \else $\leq$ \fi}
\renewcommand{\ge}{\ifmmode \geq \else $\geq$ \fi}
\renewcommand{\ne}{\ifmmode \neq \else $\neq$ \fi}
\newcommand{\lt}{\ifmmode < \else $<$ \fi}
\newcommand{\gt}{\ifmmode > \else $>$ \fi}
\newcommand{\eq}{\ifmmode = \else $=$ \fi}
\newcommand{\half}{\ifmmode \frac{1}{2} \else $\frac{1}{2}$ \fi}
\newcommand{\oneovern}{\ifmmode \frac{1}{n} \else $\frac{1}{n}$ \fi}
\newcommand{\nseqm}{M_1,M_2,\ldots,M_n}
\newcommand{\ra}{\ifmmode \rightarrow \else $\rightarrow$ \fi}
\newcommand{\imp}{\ifmmode \Rightarrow \else $\Rightarrow$ \fi}
\newcommand{\implies}{\ifmmode \Rightarrow \else $\Rightarrow$ \fi}
% \newcommand{\qed}{\ifmmode \;\; {\rm QED} \else QED \fi}
\newcommand{\qed}{\hfill{\setlength{\fboxsep}{0pt}
\framebox[7pt]{\rule{0pt}{7pt}}}}
\newcommand{\pr}{\Pr}
\renewcommand{\notin}{\ifmmode \not\in \else $\not\in$ \fi}
% \newtheorem{theorem}{Theorem}[chapter]
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{conjecture}[theorem]{Conjecture}
\newlength{\thislabel}
\newcommand{\labsize}[1]{\settowidth{\thislabel}{#1}}
\def\lablistlabel#1{\rm #1\hfil}
\newenvironment{lenlist}[1]{
\labsize{#1}
\envbegin{list}{}{%
  \setlength{\labelwidth}{\thislabel}
 % \setlength{\baselineskip}{0.35\baselineskip}
  \setlength{\itemindent}{0pt}
  \setlength{\labelsep}{1em}
  \setlength{\rightmargin}{2em}
  \setlength{\leftmargin}{4em} %{\thislabel}
   % \addtolength{\leftmargin}{\labelsep}
  \setlength{\listparindent}{0pt}
  \let \makelabel \lablistlabel}
}{\envend{list}}

\pagenumbering{arabic}

\section{Introduction}

Stated informally, the model selection problem is simply that of
balancing the complexity of a statistical model with its goodness of
fit to the empirical data. In various forms, it arises repeatedly
in statistical estimation, machine learning and scientific inquiry
in general. Concrete instances of the problem include
choosing the number of hidden nodes in a neural network derived from
data, deciding on the amount of pruning to be performed on a decision tree,
and choosing the degree of a polynomial to be fit to a set of points.

The model selection problem is coarsely prefigured by Occam's Razor:
given two hypotheses that fit the data equally well, prefer the simpler one.
Unfortunately, Occam's Razor does not explicitly address the more complex,
more interesting and more common problem in which we have a simple model
with poor fit to the data, and a complex model with good fit to the data.
Here we require not a qualitative statement of a preference for simplicity,
but a {\em quantitative prescription\/} --- 
a formula or algorithm --- specifying the
relative merit of simplicity and goodness of fit. 

A great number of model selection methods have been proposed in
the literature of many research communities, too many to productively
survey here. (A more detailed history of the problem will be given in
the full paper.) Various types of analysis have been used to judge
the performance of these methods, including proofs of asymptotic consistency
in the statistical sense~\cite{v-edbed-82,s-afacv-77}, 
proofs of asymptotic optimality in coding-theoretic
measures~\cite{r-scsi-89}, and in rare cases, proofs of 
generalization error convergence rate~\cite{bc-mcde-91}.
Perhaps surprisingly, despite the many and diverse proposals, to our knowledge
no previous attempt to directly compare large classes of model selection methods
has been undertaken.

The goal of this paper is to provide such a comparison, and more importantly,
to describe the general conclusions to which it has led. As stated in the
abstract, we compare three well-known model selection methods and attempt 
to identify their relative and absolute strengths and weaknesses. We are not
interested in making blanket claims in favor of any particular
model selection method over all
others, and in fact the reader shall see that one of our main conclusions is
that such claims are inherently ill-founded, at least for the classes
of methods examined. Rather, we attempt to delineate
the conditions under which various methods will succeed or fail, and to provide
general methods for analyzing the behavior and performance of model selection
methods. The hope is that the informed practitioner can then make an educated
choice of a model selection method, perhaps based on some known 
properties of the model selection
problem confronting them.

We refer the reader to the abstract for a more detailed summary of
our findings, and proceed with the technical portion of the paper.

\section{Definitions and Model Selection Folk Wisdom} \label{section-defns}

Throughout the paper we assume that a fixed boolean
{\em target function\/} $f$ is used
to label inputs drawn randomly according to a fixed distribution $D$. 
For any boolean function $h$, we define the {\em generalization error\/}
$\epsilon(h) = \epsilon_{f,D}(h) \equiv \Pr_{x \in D}[h(x) \neq f(x)]$
\footnote{Except in circumstances where
confusion may result, for brevity we shall adopt the notational convention of
leaving implicit the many dependencies of the various quantities we define.
Thus, we suppress the obvious dependence of $\epsilon(h)$ on $f$ and $D$,
the dependence of empirical quantities on the random sample $S$, and so on.}.
We use $S$ to denote the random variable
$S = \langle x_1, b_1 \rangle,\ldots, \langle x_m, b_m \rangle$,
where $m$ is the {\em sample size\/}, each $x_i$ is drawn randomly and
independently according to $D$, and each $b_i$ 
is $f(x_i)\oplus c_i$,
where $c_i \in \{0,1\}$ is 0 with probability $1 - \noiserate$; we call
$\noiserate \in [0,1/2)$ the {\em noise rate\/}.
In the case that $\noiserate \neq 0$, we will sometimes
wish to discuss the generalization error of $h$
with respect to the noisy examples,
so we define
$\noisyeps(h) \equiv \Pr_{x \in D, b}[h(x) \neq f(x)\oplus b]$.
Note that $\epsilon(h)$ and $\noisyeps(h)$ are directly related by the equality
$\noisyeps(h) = (1 - \noiserate)\epsilon(h) + \noiserate(1 - \epsilon(h))
	= (1 - 2\noiserate)\epsilon(h) + \noiserate$. Thus, $\noisyeps(h)$ is
simply a ``damped'' version of $\epsilon(h)$, and both quantities are minimized
by the same $h$.
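
The first equality can be verified by conditioning on whether the label
bit was flipped:
\begin{eqnarray*}
\noisyeps(h) & = & \Pr[c = 0]\cdot\Pr_{x \in D}[h(x) \neq f(x)]
	+ \Pr[c = 1]\cdot\Pr_{x \in D}[h(x) = f(x)] \\
 & = & (1 - \noiserate)\epsilon(h) + \noiserate(1 - \epsilon(h))\;.
\end{eqnarray*}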
% MK: space saver
% For this reason, we use the term {\em generalization error\/}
% informally to refer to either quantity, making the distinction only when it is important.

For simplicity throughout the paper, we will use the expression
``with high probability'' to mean with probability $1 - \delta$ over
the draw of $S$, at a cost of a factor of $\log(1/\delta)$ in the bounds ---
thus, our bounds all contain ``hidden'' logarithmic factors, but our
handling of confidence is entirely standard and will be spelled out in
the full paper.

We assume a nested sequence of {\em hypothesis classes\/}
% MK: space saver
% \footnote{Such a nested sequence is called a {\em structure\/}
% by Vapnik~\cite{v-edbed-82} and is sometimes, but not always, the setting in which
% model selection methods are examined.}
$F_1 \subseteq \cdots \subseteq F_d \subseteq \cdots$.
The target function $f$
may or may not be contained in any of these classes, so we define
$h_d \equiv \argmin_{h \in F_d}\{\epsilon(h)\}$
and $\epsilon(d) \equiv \epsilon(h_d)$ (similarly,
$\noisyeps(d) \equiv \noisyeps(h_d)$).
Thus, $h_d$ is the best approximation to $f$ (with
respect to $D$) in the class $F_d$, and $\epsilon(d)$ measures the quality of
this approximation.
Note that $\epsilon(d)$ is a nonincreasing function of $d$ since
the hypothesis function classes are nested. Thus, larger values of $d$ can only
improve the {\em potential\/} approximative 
power of the hypothesis class. Of course,
the difficulty is to realize this potential on the basis of a small sample.

With this notation, the {\em model selection problem\/} can be stated informally:
on the basis of a 
random sample $S$ of a fixed size $m$, the goal is to choose
a {\em hypothesis complexity\/} $\hat{d}$, and
a hypothesis $\hat{h} \in F_{\hat{d}}$, such that the resulting 
generalization error $\epsilon(\hat{h})$
is minimized. In many treatments of model selection, including ours,
it is explicitly or implicitly assumed that the model selection method only has control
over the choice of the {\em complexity\/} $\hat{d}$, and that the final
{\em hypothesis\/} $\hat{h}$ is then an element of $F_{\hat{d}}$ chosen by
some fixed algorithm 
that is attempting to minimize or approximately
minimize the empirical error, given the allowed complexity $\hat{d}$.
To make these ideas more precise,
we define the {\em empirical error\/}
$\ehat(h) = \ehat_S(h) \equiv
	|\{\langle x_i, b_i \rangle \in S: h(x_i) \neq b_i\}|/m$,
and the {\em version space\/}
$\vs(d) = \vs_S(d) \equiv
	\{h \in F_d: \ehat(h) = \min_{h' \in F_d}\{\ehat(h')\}\}$.
Note that $\vs(d) \subseteq F_d$
may contain more than one function in $F_d$ --- 
several functions may minimize the
empirical error. If we are lucky, we have in our possession a
(possibly randomized)
{\em learning algorithm\/} $L$ that takes as 
input any sample $S$ and any complexity
value $d$, and outputs a member $\hat{h}_d$
of $\vs(d)$ (using some unspecified criterion to break ties if
$|\vs(d)| > 1$).
More generally, it may be the case that finding {\em any\/} function
in $\vs(d)$ is intractable, and that $L$ is simply a heuristic
(such as backpropagation or ID3) that does the best job it can

at finding $\hat{h}_d \in F_d$ with small empirical error on
input $S$ and $d$. 
% MK: space saver
% In this paper we will consider both specific
% problems for which there is an efficient algorithm $L$ for selecting
% a function from the version space, and the more abstract case in
% which $L$ may be arbitrary. 
In either case, we define
$\ehat(d) = \ehat_{L,S}(d) \equiv \ehat(\hat{h}_d)$, where 
$\hat{h}_d = L(S,d)$. Note that we expect $\ehat(d)$, like
$\epsilon(d)$, to be a nonincreasing function of $d$ --- by
going to a larger complexity, we can only reduce our empirical error.
Indeed, we may even expect there to be a sufficiently large value $\dmax$
(determined by the sequence of function classes, the learning algorithm,
the target function and distribution) such that $\ehat(\dmax) = 0$
always.

With these conventions and notation, we can give
a more precise statement of the model selection
problem: given the sample $S$, and the sequence of functions
$\hat{h}_1 = L(S,1),\ldots,\hat{h}_d = L(S,d),\ldots$ determined by the
underlying learning algorithm $L$,
select a complexity value $\hat{d}$ 
such that $\hat{h}_{\hat{d}}$ minimizes
the resulting
generalization error. Thus, a model selection method has
at its disposal both the sample $S$ and the sequence of
(increasingly complex) hypotheses derived by $L$ from $S$.
Notice that ``special'' model selection
criteria that incorporate knowledge about the behavior of
the learning algorithm $L$ may be appropriate in certain cases;
however, we hold that {\em good general model selection methods
should at least perform reasonably well
in the case that $L$ is actually an empirical error minimization
procedure.\/}

The current formalization suffices to motivate a key definition and
a discussion of the fundamental issues in model selection.
We define
$\etwiddle(d) = \etwiddle_{L,S}(d) \equiv
	\epsilon(\hat{h}_d)$. Thus, $\etwiddle(d)$ is a random variable
(determined by the random variable $S$) that gives the
{\em true generalization error\/} of the function $\hat{h}_d$ chosen by $L$
from the class $F_d$. Of course, $\etwiddle(d)$ is
not directly accessible to a model selection method; it can only 
be estimated in various ways from the sample $S$.

A simple but important observation to 
be made about the scenario we have constructed
is the following: if we consider 
{\em any\/} method for using the sample $S$ to choose
a hypothesis complexity $\hat{d}$ (and thus implicitly choosing the hypothesis
$\hat{h}_{\hat{d}}$ from the 
sequence $\hat{h}_1,\ldots,\hat{h}_d,\ldots$), no such
method can achieve generalization error $\epsilon$
less than $\min_d\{\etwiddle(d)\}$.
% MK: do we really need following?
% (If there is noise,
% no such method can achieve $\noisyeps$ less than $\min_d\{\noisyetw(d)\}$.)
Thus the behavior of the function 
$\etwiddle(d)$ --- especially the location and value of its
minimum --- is in some sense 
the essential quantity of interest in model selection as we have formulated it.

The prevailing folk wisdom in several research communities posits the following
picture for the ``typical'' behavior of 
$\etwiddle(d)$, at least in the optimistic case that
the learning algorithm $L$ implements empirical
error minimization. (In the ensuing discussion,
if there is classification noise 
the quantities $\noisyeps$ and $\noisyetw$ should
be substituted for $\epsilon$ and $\etwiddle$).
First, for small values
of $d$ ($d << m$), $\etwiddle(d)$ is large, due simply to the fact that 
$\epsilon(d)$ is large for small $d$, and
$\etwiddle(d) \geq \epsilon(d)$ always holds.
At such small $d$, empirical errors will be close to generalization errors (that
is, $\ehat(h) \approx \epsilon(h)$ for all $h \in F_d$ --- also known as
{\em uniform convergence\/}, or small 
``variance''\footnote{We put the terms ``bias'' and
``variance'' in quotes in this 
paragraph to distinguish our informal use
of them from their precise statistical 
counterparts.}), and $\vs(d)$ will contain only
functions whose true generalization error is near the best possible in $F_d$.
But this best generalization error is large, because
we have poor approximation power for
small $d$ (that is, we have a strong ``bias''). 
For large values of $d$ (usually $d \approx m$),
$\etwiddle(d)$ is again large, 
but for a very different reason. Here we expect that
$\epsilon(d)$ may actually be quite small (that is, we have a weak ``bias'', and
$F_d$ contains a good approximation 
to the target function $f$). But because $F_d$ is
so powerful, $\vs(d)$ will
contain many poor approximations as well (that is,
$\vs(d)$ contains functions $h$ with $\ehat(h) << \epsilon(h)$ --- so uniform
convergence does {\em not\/} hold in $F_d$, or we have large ``variance'')
\footnote{
A common way of informally expressing 
this behavior is to say that for small $d$,
the functions in $\vs(d)$ ``underfit'' the sample $S$,
% MK: space saver
% meaning that $F_d$ is not
% sufficiently expressive to capture the underlying regularities of $f$ exposed by
% $S$, 
and for large $d$, the functions in $\vs(d)$ 
``overfit'' the sample $S$.}.
%, meaning that the great expressive power of $F_d$
% causes $\vs(d)$ to contain many functions 
% that perform well empirically by modeling
% the idiosyncracies of the particular sample $S$ (or the noise),
% but perform poorly in generalization
% due to their failure to model the underlying function $f$ that generated $S$

As a demonstration of the validity 
of this view, and as an introduction to a particular
instance of the model selection problem 
that we will examine
throughout the paper, we call the reader's attention to
Figures~\ref{figure-folkepsilon} and~\ref{figure-folketw}.
For these figures the input domain is simply the real line segment $[0,1]$,
and the target function divides $[0,1]$ into 100
segments of equal width $1/100$; the target function alternates
from positive to negative labels on each consecutive segment.
The hypothesis class $F_d$ is simply the class of all boolean functions over
$[0,1]$ in which we allow
at most $d$ alternations of label,
so there are (at most) $d+1$
contiguous maximal regions of common label
(thus, $F_d$ contains what is sometimes
called {\em unions of $d/2$ intervals\/}). The problem of choosing the
best complexity $\hat{d}$ for 
this function class sequence on the basis of a random sample,
which we shall refer to in the sequel as
the {\em intervals model selection problem\/},
is particularly nice for our experimental investigations, since
it permits an efficient empirical error minimization algorithm
based on dynamic programming.
We have implemented this algorithm, and
in Figure~\ref{figure-folketw}
we experimentally plot $\etwiddle(d)$ as
a function of $d$ when $S$ consists of 
$m = 2000$ random examples (drawn from the uniform input distribution)
of the target function.
(Details of the algorithm and
the experimental results of the 
paper are provided in Appendix A.) 
For our current discussion
it suffices to note that $\etwiddle(d)$ 
does indeed experience its minimum at a value of
$d$ considerably larger than 0 and considerably smaller than $m$.
Not surprisingly, the minimum occurs near (but not exactly at)
the target complexity of 100.
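
For concreteness, one standard dynamic program for this empirical error
minimization can be sketched as follows (a Python sketch with illustrative
names, not the actual experimental code): it computes, for each $d$, the
fewest training mistakes of any function with at most $d$ alternations.
\begin{verbatim}
def min_errors_by_alternations(xs, bs, d_max):
    # For each d = 0..d_max, the fewest training mistakes of any boolean
    # function on [0,1] with at most d alternations of label.
    # xs: the sample inputs, bs: their (possibly noisy) 0/1 labels.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [bs[i] for i in order]
    INF = float("inf")
    # best[k][c]: fewest mistakes so far using exactly k alternations,
    # with the current segment labeled c.
    best = [[0, 0]] + [[INF, INF] for _ in range(d_max)]
    for b in labels:
        new = [[INF, INF] for _ in range(d_max + 1)]
        for k in range(d_max + 1):
            for c in (0, 1):
                stay = best[k][c]
                switch = best[k - 1][1 - c] if k > 0 else INF
                new[k][c] = min(stay, switch) + (b != c)
        best = new
    return [min(best[k][c] for k in range(d + 1) for c in (0, 1))
            for d in range(d_max + 1)]
\end{verbatim}
Dividing entry $d$ of the returned list by $m$ gives the training error
curve $\ehat(d)$ referred to below.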

Thus, according to Figure~\ref{figure-folketw} and
conventional wisdom, the best choice of $\hat{d}$ should be an intermediate
value (that is, {\em not\/} $\hat{d} \approx 0$ or $\hat{d} \approx m$) 
that minimizes the
two competing sources of error. But how should we choose $\hat{d}$ when the most
common empirical measure of generalization 
ability --- the function $\ehat(d)$ --- simply
decreases with increasing $d$, so that its
straightforward minimization will therefore always
result in a large value of $d$ 
that causes overfitting? This is the central question
raised by
the model selection problem, and many answers
have been proposed and analyzed. We review
three of them now.

\section{Three Proposals for Model Selection} \label{section-threemethods}

The first two methods of choosing $\hat{d}$ that we shall consider are members
of a general class that we shall informally refer to as
{\em $\ehat$-based methods\/} (and shall formally define shortly). The
common theme behind these methods is their attempt to construct
an approximation to the desired function $\etwiddle(d)$ solely on the
basis of the empirical error
function $\ehat(d)$ and the complexity $d$, usually by trying to
``correct'' $\ehat(d)$ by the amount that it underestimates $\etwiddle(d)$
through the addition of a ``complexity penalty'' term.

Vapnik's {\em Guaranteed Risk Minimization\/} 
(GRM in the sequel)~\cite{v-edbed-82}
is such a method, and is
based on a more precise formalization of the conventional
description of $\etwiddle(d)$ given above.
% MK: deleted following elaboration to save space.
\iffalse
the empirical error $\ehat(d)$ becomes a less and
less reliable approximation to $\etwiddle(d)$ as $d$ increases; in
particular, $\ehat(d)$ {\em underestimates\/} $\etwiddle(d)$
(in the presence of classification noise, 
$\ehat(d)$ underestimates $\noisyetw(d)$)
by an amount that increases with $d$. Thus, perhaps we can accurately
recover $\etwiddle(d)$ by {\em correcting\/} $\ehat(d)$ --- for instance,
by adding some quantity to $\ehat(d)$ (a quantity that depends
on $d$ and possibly other parameters)
such that the resulting sum is approximately $\etwiddle(d)$.
\fi
In our experiments
we shall consider a slight variant of Vapnik's original GRM
in which $\hat{d}$ is chosen according to the rule
\begin{equation} \label{srm-rule}
	\hat{d} = \argmin_d\{\ehat(d) + ({d}/{m})
		(1 + \sqrt{1 + \ehat(d){m}/{d}})\}
\end{equation}
where for convenience but without loss of generality we have assumed
that $d$ is the Vapnik-Chervonenkis dimension~\cite{v-edbed-82,vc-ucrfep-71}
of the class $F_d$
\footnote{Vapnik's original GRM actually multiplies the second term
inside the $\argmin\{\cdot\}$
above by a logarithmic factor intended to guard against worst-case
choices from $\vs(d)$. Since this 
pessimistic prefactor renders GRM uncompetitive on
the ensuing experiments, we consider this modified and quite competitive
rule whose spirit is the same as the original.}. 
The origin of this rule 
can be summarized as follows: it has been
shown~\cite{v-edbed-82} (ignoring logarithmic factors)
that $\sqrt{d/m}$ is an upper bound on $|\ehat(d) - \etwiddle(d)|$.
% MK: space saver
% (in fact, the stronger uniform convergence property holds:
% $|\ehat(h) - \epsilon(h)| \leq \sqrt{d/m}$ for all $h \in F_d$; the analogous
% statment holds for $\ehat(h)$ and $\noisyeps(h)$ in the $\noiserate \neq 0$
% case). 
Thus, by
simply adding $\sqrt{d/m}$ to $\ehat(d)$, we ensure that the resulting
sum upper bounds $\etwiddle(d)$, and if we are optimistic we might
further hope that the sum is in fact a rather close approximation to
$\etwiddle(d)$, and that its minimization is therefore tantamount to a
minimization of $\etwiddle(d)$. The actual rule given in
Equation (\ref{srm-rule}) is slightly more complex than this, and
reflects a refined uniform convergence bound that behaves like $d/m$
when $\ehat(d)$ is close to 0 and like $\sqrt{d/m}$ otherwise.
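
As an illustrative sketch (not the code used in our experiments), the rule
of Equation (\ref{srm-rule}) can be written as follows, assuming an array
{\tt ehat} holding the training errors $\ehat(1),\ldots,\ehat(\dmax)$
(for instance, as computed by the dynamic program sketched in
Section~\ref{section-defns}):
\begin{verbatim}
from math import sqrt

def grm_choice(ehat, m):
    # ehat[d-1] holds the training error ehat(d) for d = 1..len(ehat);
    # return the complexity d minimizing the penalized training error
    # in the GRM rule above.
    def penalty(d):
        e = ehat[d - 1]
        return e + (d / m) * (1 + sqrt(1 + e * m / d))
    return min(range(1, len(ehat) + 1), key=penalty)
\end{verbatim}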

The next method we consider,
the {\em Minimum Description Length Principle\/} 
(MDL in the sequel)~\cite{r-msdd-78,r-scm-86,r-scsi-89,bc-mcde-91,qr-idtmdlp-89}
is another $\ehat$-based model selection method,
but has rather different origins than GRM. MDL
is actually a rather broad class of methods with a common
information-theoretic motivation, whose instantiations
require the choice of a 
specific {\em coding\/} or {\em representation\/} scheme.
To illustrate the method, we shall examine the intervals model
selection problem with a particular coding scheme
\footnote{Our goal in choosing this coding scheme is simply to
give one reasonable instantiation of the MDL paradigm in the
setting of supervised function learning. We by no means propose
that it is optimal in any coding-theoretic sense --- in particular,
we choose to enforce a two-part coding of the sample consisting of a hypothesis
and corrections to it. Other coding schemes are obviously possible;
however, many of our results (such as those for all $\ehat$-based
methods) will hold for wide classes of MDL instantiations.}.
% MK: deleted following to save space.
\iffalse
The familiar MDL motivation regards each potential hypothesis
function as a code for the {\em labels\/} in the sample $S$,
assuming the code recipient has access only to the inputs in $S$: thus,
the ``best'' hypothesis is the one minimizing the total code length
for the labels in the given coding scheme
(the number of bits needed to represent the hypothesis function,
plus the number of bits needed to represent the labels given the
hypothesis function).
\fi

Suppose that $h$ is a function over $[0,1]$
with exactly $d$ alternations of label.
With respect to its behavior on the fixed set of $m$ inputs $x_1,\ldots,x_m$
appearing in the sample $S$, we can
represent the function $h$ by
specifying the $d$ inputs
where $h$ switches value (that is, the points $x_i$ such that
$h(x_i) \neq h(x_{i+1})$), which takes $\log {{m}\choose{d}}$ bits
\footnote{In the full paper we justify our use of the sample points
to describe $h$; it is quite similar to representing $h$ using
a grid of resolution $1/p(m)$ for some polynomial $p(\cdot)$.}.
Dividing by $m$ to obtain an order 1 quantity,
we obtain $(1/m)\log {{m}\choose{d}} \rightarrow \entropy(d/m)$
as $m$ becomes large~\cite{ct-eit-91}.
Now given the behavior of $h$ on the inputs in $S$, the labels can
be described simply by coding the {\em exceptions\/} or
mistakes of $h$ (that is, those indices $i$ where $h(x_i) \neq f(x_i)$).
This can be done by first sending the value $\ehat(h)\cdot m$, the
number of mistakes made by $h$ on $S$, at a cost of
$\log(\ehat(h)\cdot m) \leq \log m$; if we again normalize by
dividing by $m$, this cost becomes negligible for large $m$.
We then send the location of the exceptions in the sample, at a
cost of $\log {{m}\choose{\ehat(h)\cdot m}}$, whose normalized
value approaches $\entropy(\ehat(h))$ for large $m$.
Thus, the version of MDL that we shall examine for the intervals
model selection problem dictates
the following choice of $\hat{d}$:
\begin{equation} \label{mdl-rule}
	\hat{d} = \argmin_d\{\entropy(\ehat(d)) +
		\entropy({d}/{m})\}.
\end{equation}
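
In the same illustrative spirit as before, this rule can be sketched as
follows (again assuming the array {\tt ehat} of training errors;
{\tt binary\_entropy} computes $\entropy(\cdot)$ in bits, and the names
are ours):
\begin{verbatim}
from math import log

def binary_entropy(p):
    # H(p) in bits, with H(0) = H(1) = 0.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log(p, 2) - (1 - p) * log(1 - p, 2)

def mdl_choice(ehat, m):
    # ehat[d-1] holds ehat(d); return d minimizing H(ehat(d)) + H(d/m).
    return min(range(1, len(ehat) + 1),
               key=lambda d: binary_entropy(ehat[d - 1])
                             + binary_entropy(d / m))
\end{verbatim}
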
We regard MDL (as we shall refer to this rather specific instance of
the general class of model selection rules for convenience) 
% MK: space saver
% does not have
% the form of a simple complexity penalty added to $\ehat(d)$ as in GRM
% (although even in GRM this complexity penalty depended not only on $d$ but
% on $\ehat(d)$ as well). But we still view MDL 
as being similar in spirit
to GRM, since it proposes a 
correction to $\ehat(d)$ consisting of a
nonlinear transformation (the application 
of the entropy function $\entropy(\cdot)$),
and the addition of the complexity penalty $\entropy(d/m)$. More formally,
any model selection rule of the form
\begin{equation}	
	\hat{d} = \argmin_d\{G(\ehat(d),{d}/{m})\}
\end{equation}	
shall be called an {\em $\ehat$-based method\/}. Such methods assign the value
$G(\ehat(d),{d}/{m})$ (which is a random variable
determined by the sample $S$) 
to each $d$, and then minimize over $d$. Based on the
remarks we have made earlier, an 
ideal $\ehat$-based method would somehow manage
to consistently assign a value very near $\etwiddle(d)$ to each $d$
(or would at least assign values that preserve the {\em ordering\/} of
$d$ values induced by
$\etwiddle(d)$).

The third model selection method that we examine has a rather different spirit
than the $\ehat$-based GRM and MDL. 
In {\em cross validation\/} (CV in the sequel)~\cite{s-cvcasp-74,s-afacv-77},
rather than attempt to {\em reconstruct\/} $\etwiddle(d)$ from $\ehat(d)$ and $d$, we
instead settle for a ``worse'' $\etwiddle(d)$ (in a sense made precise
momentarily) that we can {\em directly estimate\/}.
More specifically, in CV we 
use only a fraction $(1 - \gamma)$ of the examples in $S$
in order to obtain the hypothesis sequence
$\hat{h}_1 \in F_1,\ldots,\hat{h}_d \in F_d,\ldots$ --- 
that is, each $\hat{h}_d$ is now $L(S',d)$,
where $S'$ consists of the first $(1 - \gamma)m$ examples in $S$.
Here $\gamma \in [0,1]$
is a parameter of the CV 
method whose tuning we discuss briefly later and in some detail
in a separate paper~\cite{fkmnr-ip-95}. Notice 
that we expect the quantity $\epsilon(\hat{h}_d)$, regarded
as a random variable determined by $S$, 
to be (perhaps considerably) larger than in the
case of GRM and MDL, because now 
$\hat{h}_d$ was chosen by $L$ from $F_d$ on the basis of only
$(1 - \gamma)m$ examples rather than 
all $m$ examples. For this reason we wish to introduce
the more general notation 
$\etwiddle^{\;\gamma}(d) \equiv \epsilon(\hat{h}_d)$ to indicate
the fraction of the sample withheld
from the $\hat{h}_d$ selection process. 
Note that the original definition of 
$\etwiddle(d)$ is equivalent to $\etwiddle^{\;0}(d)$.
Now we can
more formally express our statement 
regarding the difference between the $\etwiddle(d)$ functions
of the $\ehat$-based methods and CV as
$\Exp_S[\etwiddle^{\;\gamma}(d)] \geq \Exp_S[\etwiddle(d)]$.

The CV method settles for 
$\etwiddle^{\;\gamma}(d)$ instead of $\etwiddle(d)$ in order to
have an independent test with which to 
directly estimate $\etwiddle^{\;\gamma}(d)$. That is,
CV chooses $\hat{d}$ according to the rule
\begin{equation} \label{cv-rule}
	\hat{d} = \argmin_d\{\ehat_{S''}(\hat{h}_d)\}
\end{equation}
where $\ehat_{S''}(\hat{h}_d)$ is 
the empirical error of $\hat{h}_d$ on $S''$, the
last $\gamma m$ examples of $S$ that were withheld in selecting $\hat{h}_d$.
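
A Python sketch of this rule follows; here {\tt learner} stands for the
underlying algorithm $L$ and is assumed to return a 0/1-valued hypothesis,
and the names are again illustrative only:
\begin{verbatim}
def cv_choice(learner, sample, d_values, gamma=0.1):
    # Train on the first (1-gamma)*m examples for each d, and return
    # the d whose hypothesis errs least on the gamma*m held-out examples.
    m = len(sample)
    split = int(round((1 - gamma) * m))
    train, test = sample[:split], sample[split:]
    def test_error(d):
        h = learner(train, d)
        return sum(h(x) != b for (x, b) in test) / len(test)
    return min(d_values, key=test_error)
\end{verbatim}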

For concreteness in the experimental
results we will examine CV using the test set fraction $\gamma = 0.1$.
In a companion paper~\cite{fkmnr-ip-95}, we will also examine the behavior of various
forms of {\em multi-fold cross validation\/}, in which many (either disjoint
or overlapping) training sets are selected from the original sample. For disjoint
training sets, we can prove that the expected generalization 
error will be the same as the basic CV described above, but with lower variance.
For this reason it is fair to say that we err on the side of pessimism when
evaluating the performance of CV-type methods throughout the investigation.
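
For reference, one common form of multi-fold cross validation (with
overlapping training sets) can be sketched as follows; this is only an
illustration of the general idea, not necessarily the precise scheme
analyzed in the companion paper:
\begin{verbatim}
def kfold_cv_choice(learner, sample, d_values, k=10):
    # Each of k blocks serves once as the test set; the average held-out
    # error over the k blocks scores each complexity d.
    m = len(sample)
    blocks = [sample[i * m // k:(i + 1) * m // k] for i in range(k)]
    def avg_error(d):
        total = 0.0
        for i in range(k):
            train = [ex for j in range(k) if j != i for ex in blocks[j]]
            h = learner(train, d)
            total += sum(h(x) != b for (x, b) in blocks[i]) / len(blocks[i])
        return total / k
    return min(d_values, key=avg_error)
\end{verbatim}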

\section{A Controlled Experimental Comparison} \label{section-experimental}

Our results begin with a comparison of the performance and properties of
the three methods that we have described in a carefully controlled experimental
setting (namely, the intervals model selection problem)
that is nevertheless nontrivial: the experiments
are performed on large and noisy samples, and the function classes are of
large dimension (measured by the number of free parameters, the VC dimension,
the coding length, or any other reasonable metric), even though the
input space is one-dimensional. 
Among the 
obvious scientific advantages of such 
controlled experiments, at least in comparison to 
empirical results on data of unknown origin, are
our ability to exactly measure generalization error (since we know the target
function generating the data and the distribution), and our ability to precisely
study the effects of varying parameters of the data
(such as noise rate, target function complexity, sample size,
and so on) on the performance and behavior of model selection methods.
As we shall see, some rather interesting behavior already results from
a comparison on our simple
problem; perhaps most importantly, this experimental behavior foreshadows a
number of important themes that we shall draw upon in our more abstract
theoretical results.

We begin our discussion of
the experimental results with
Figure~\ref{figure-compare10err}
\footnote{For completeness, Figures~\ref{figure-compare0err}
and~\ref{figure-compare0len} plot analogous quantities for the noise-free case;
however, things become interesting when noise is introduced.}.
To obtain this figure, data was generated
from a uniform input distribution and labeled
according to an intervals function over $[0,1]$ consisting of 100
intervals of alternating label, corrupted with noise at rate $\noiserate = 0.1$. In
Figure~\ref{figure-compare10err},
we have plotted the {\em true\/} generalization 
errors $\epsilonsrm$, $\epsilonmdl$ and
$\epsiloncv$ (again, using $\gamma = 0.1$) of the hypotheses chosen by
the three methods (measured with respect to the noise-free source of examples)
as a function of sample size $m$, which ranged from 1 to 6500 examples.
Details of the code used to 
perform these experiments are given in Appendix A.
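
For the reader who wishes to reproduce the qualitative behavior reported
here, the data can be generated along the following lines (a sketch of
the sampling process just described, not the actual experimental code):
\begin{verbatim}
import random

def make_sample(m, s=100, noise_rate=0.1, seed=None):
    # m examples of the s-interval target over [0,1]: the label
    # alternates on consecutive segments of width 1/s, and each label
    # is flipped independently with probability noise_rate.
    rng = random.Random(seed)
    sample = []
    for _ in range(m):
        x = rng.random()
        label = int(x * s) % 2
        if rng.random() < noise_rate:
            label = 1 - label
        sample.append((x, label))
    return sample
\end{verbatim}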

Figure~\ref{figure-compare10err} demonstrates the subtlety
involved in comparing the three methods: in particular, we see that
{\em none of the three methods outperforms the others for all sample sizes.\/}
Thus we can immediately dismiss the notion that one of the methods examined
can be said to be optimal for this problem in any standard sense. Getting into 
the details, we see that there is an initial regime (for $m$ from 1 to
slightly less than 1000) in which $\epsilonmdl$ is the lowest of the three
errors, sometimes outperforming GRM by a considerable margin.
Then there is a second regime
(for $m$ about
1000 to slightly more than 1500) 
where an interesting reversal of relative performance occurs, since 
now $\epsilonsrm$ is the lowest error,
considerably outperforming $\epsilonmdl$, whose rate of decrease is slowing.
In both of these first two regimes, $\epsiloncv$ remains the intermediate
performer. In the third and final regime, $\epsilonmdl$ makes a rather dramatic
transition to match $\epsilonsrm$ and the slightly larger $\epsiloncv$, and the
performance of all three methods remains quite similar for the remainder of the
plot.

Insight into the cause of this behavior is given by
Figure~\ref{figure-compare10len},
where for the same runs used to obtain 
Figure~\ref{figure-compare10err}, we instead plot
the quantities $\lengthsrm$, $\lengthmdl$ and $\lengthcv$, the
number of intervals in the hypotheses chosen by GRM, MDL and CV
respectively (thus, the ``correct'' value, in the sense of simply
having the same number of intervals as the target function, is 100).
Here we see that for small sample sizes, corresponding roughly to
the first regime discussed for 
Figure~\ref{figure-compare10err} above, the length $\lengthsrm$ of
the hypothesis chosen by GRM is slowly approaching 100 from below,
reaching and remaining at the target value by about $m=1000$.
Although we have not shown it explicitly, GRM is incurring nonzero
training error throughout the entire range of $m$. In comparison,
for a long initial period (corresponding to the first two regimes
of $m$), MDL is simply choosing the shortest hypothesis that incurs
no training error (and thus encodes both ``legitimate'' intervals
and noise), and thus $\lengthmdl$ grows in an uncontrolled fashion.
This ``overcoding'' behavior of MDL is actually preferable,
in terms of generalization error, to the initial ``undercoding''
behavior of GRM, as verified by 
Figure~\ref{figure-compare10err}. Once $\lengthsrm$ reaches
100, however, the overcoding of MDL is a relative liability, resulting
in the second regime.
Figure~\ref{figure-compare10len} clearly shows that the transition
from the second to the third regime (where approximate parity is
achieved) is the direct result of a dramatic correction to
$\lengthmdl$ from the zero training error length to the target value
of 100.
Finally, $\lengthcv$ makes a more 
rapid but noisier approach to 100 than $\lengthsrm$, and in fact
also overshoots 100, but much 
less dramatically than $\lengthmdl$. This
more rapid initial increase again results in superior generalization
error compared to GRM for small $m$, but the 
inability of $\lengthcv$ to settle at 100 results in slightly higher
error for larger $m$.

We shall momentarily discuss further the interesting
behavior of $\lengthsrm$ and $\lengthmdl$, but first we call
attention to Figures~\ref{figure-compare20err} to~\ref{figure-compare40len}.
These figures, which come in pairs, show experiments identical to that
of Figures~\ref{figure-compare10err} and~\ref{figure-compare10len}, but
with the increasing noise rates $\noiserate = 0.2, 0.3$ and 0.4. Notice
that as $\noiserate$ increases, the initial period of undercoding by GRM
seems to increase slightly, but the initial period of overcoding by MDL
increases tremendously, the result being that the first regime of
generalization error covers approximately the same values of $m$ (about
1 to 1000), but the second regime covers a wider and wider range of $m$,
until at $\noiserate = 0.4$, $\lengthmdl$ has not corrected to 100 even
at $m= 6500$ (further experiments revealed that $m = 15000$ is still not
sufficient).

The behavior of the lengths $\lengthsrm$ and $\lengthmdl$ can
be traced to the form of the total penalty functions for the two
methods. For instance, in Figures~\ref{figure-mdlpen500},
~\ref{figure-mdlpen2000},
and~\ref{figure-mdlpen4000}, we plot the total MDL penalty
as a function of complexity $d$ for the fixed sample sizes
$m = 500, 2000$ and $4000$ respectively, using noise
rate $\noiserate = 0.20$. At $m = 500$, we see
that the rather dramatic total penalty curve has its global minimum
at approximately $d = 200$, which as expected (we are in the
MDL overcoding regime at this small sample size) is the point
of consistency with the noisy sample. However, a small local
minimum is already developing near the target value of $d = 100$.
By $m = 2000$, this local minimum is quite pronounced, and beginning
to compete with the global consistency minimum (which for this noise rate
and sample size has now moved out to approximately $d = 650$). At
$m = 4000$, the former local minimum at $d = 100$ has become the global
minimum. The rapid transition of $\lengthmdl$ that marks
the start of the final
regime of generalization error discussed above (approximate parity of
the three methods)
is thus explained by the switching of the global total penalty minimum
from the $d$ required for consistency to $d = 100$
\footnote{In the full paper, we examine
this switching of the minimum in some detail,
and show that for any function of $s$ alternating intervals of equal width,
it is governed by a competition between the quantities
$\entropy(\noiserate) + \entropy(s/m)$ and
$\entropy(2\noiserate(1-\noiserate) + (s/m)(1-2\noiserate)^2)$.
Which of these quantities is smaller may actually be a nonmonotonic
function of $\noiserate$ for some values of $s/m$, implying that
the performance of MDL may actually improve by increasing the noise rate
under certain circumstances.}.
In Figures~\ref{figure-srmpen500},
~\ref{figure-srmpen2000},
and~\ref{figure-srmpen4000}, we give plots of the total GRM penalty
for the same three sample sizes and noise rate. Here the behavior is
much more controlled --- for each sample size, the total penalty has
the same single-minimum bowl shape, with the minimum starting to the
left of $d = 100$ (the minimum occurs at roughly $d = 40$ for $m = 500$),
and gradually moving over $d = 100$ and sharpening for large $m$.

A natural question to pose after examining these experiments is the following:
is there an $\ehat$-based 
method that enjoys the best properties of both GRM and
MDL on this class of problems? By 
this we would mean a method that approaches the correct
$d$ value (whatever it may be)
more rapidly than GRM, but does so without suffering the long,
uncontrolled ``overcoding'' period 
of MDL. An obvious candidate for such a method is
simply a modified version of GRM or MDL, 
in which (for example) we reason that perhaps
the GRM penalty for complexity is 
too large for this problem (resulting in the reluctance
to code), and thus we multiply the complexity 
penalty term in the GRM rule by a constant
less than 1 (or analogously, multiply the 
MDL penalty term by a constant greater than 1
to reduce overcoding). The 
results of an experiment on such a modified version of GRM
are shown in Figures~\ref{figure-srmcompare20err}
and~\ref{figure-srmcompare20len}, where 
the original GRM performance is compared
to a modified version in which the complexity penalty is multiplied by 0.5.
Interestingly and perhaps unfortunately, we see that there is no free lunch:
while the modified version does indeed code more rapidly and thus reduce the
small $m$ generalization error, this comes at the cost of an MDL-like overcoding
regime with a corresponding degradation in generalization error (and in fact a
considerably slower return to $d = 100$ than MDL under the same conditions)
\footnote{Similar results are obtained in experiments in which every occurrence
of $d$ in the GRM rule is replaced 
by an ``effective dimension''~\cite{vll-mvlm-94,m-enp-91}
$c_0 d$ for any constant $c_0 < 1$.}.
The reverse phenomenon (reluctance to code) is experienced for MDL with an
increased complexity penalty multiplier, as demonstrated by
Figures~\ref{figure-mdlcompare20err}
and~\ref{figure-mdlcompare20len}. 

Let us summarize some of the key points 
demonstrated by the experiments --- points that
we shall return to in our formal results. First, none of the three methods
dominates the others for all sample sizes.
Second, the two $\ehat$-based methods
we examined seem to have either an inclination towards or bias against coding
that is overcome by the inherent properties of the data asymptotically, but
that can have a large effect on generalization error for small to moderate
sample sizes. Third, the overcoding or undercoding bias of the $\ehat$-based
methods cannot be overcome simply by adjusting the relative weight of error
and complexity penalties without reversing the bias of the resulting rule and
suffering increased generalization error for some range of $m$.
Fourth, while CV is not the best of the methods for any value of $m$, it does
manage to fairly closely track the best $\ehat$-based method
for each value of $m$, and considerably
beats both GRM and MDL in their regimes of weakness. We now turn our attention
to our theoretical results, where each of these key points will be developed
more formally.

\section{Bounds on Generalization Error for $\ehat$-Based Methods}
\label{section-epsbased}

We begin the discussion of our theoretical results with a basic
but nontrivial question: what
form should bounds on the expected 
generalization error of model selection methods
take? Since we have already pointed out that for any fixed sample,
$\min_d\{\etwiddle(d)\}$
lower bounds the generalization error of any model selection method,
it would seem that the ideal form for such bounds would be
$\min_d\{\etwiddle(d)\} + r(m)$ (with high probability),
where $r(m)$ is a decreasing function
depending only on $m$ (that is, not on $d$). Such a bound, if possible,
specifies the additional error, $r(m)$, suffered by the method over and
above the information-theoretic optimal.
Note that the expected value of $\min_d\{\etwiddle(d)\}$ 
itself also depends on $m$,
and we expect it to approach $\min_d\{\epsilon(d)\}$ as
$m \rightarrow \infty$.

The drawback of such a bound is that it is problem-specific 
(namely, it depends on the target function, $f$, and on the distribution on
the examples, $D$, since $\etwiddle(d)$ depends on $f$ and $D$), and thus
may involve difficult problem-specific calculations.
In particular, if we are to argue that the generalization error of a method 
is not much larger than $\min_d\{\etwiddle(d)\}$, then it seems that
quite accurate information concerning the form of $\etwiddle(d)$ (as a 
function of $d$) is needed. We can indeed prove reasonably close upper
and lower bounds on $\etwiddle(d)$ for the intervals problem, and thereby
obtain generalization error bounds of the desired form for various methods
(details in the full paper), but in general such a program seems quite
difficult to carry out.
%Perhaps as a result of such difficulties, the few
%papers that propose bounds use a form that is different but still interesting,
%and that we now illustrate with a simple example.

In the two theorems presented in this section we introduce two types
of {\em problem-independent\/} bounds for
general $\ehat$-based model selection rules
of the form $\hat{d} = \argmin_d\{G(\ehat(d),d/m)\}$, where the only 
assumption we make on $G(\cdot,\cdot)$ is that it is
an increasing function of its first argument, and a non-decreasing function
of its second argument.
Our search for bounds of the type described in the theorems
was inspired by work of Barron and Cover~\cite{bc-mcde-91}.
Barron and Cover give bounds of a similar form (which they
call the {\em index of resolution\/})
on the generalization error of
MDL in the context of density estimation. 
% MK: space saver
% Following the statements of the theorems, we discuss them shortly, and
% illustrate how they can be applied  to several specific methods.
% Full proofs of the theorems appear in Appendix B.

In the first theorem we provide general conditions under which we can obtain
a bound on the generalization error of a given $\ehat$-based method
which depends only on the function $\epsilon(d)$ and on $d$. We refer
to such a bound as $\epsilon$-{\em based\/}. In the second theorem
we remove these conditions and obtain a bound which is ``almost''
$\epsilon$-based, in the sense that it is a sum of two terms: the
first depends only on $\epsilon(d)$ (and $d$) while the second depends
on the selected complexity, $\hat{d}$. 
The two conditions required to apply Theorem~\ref{theorem-epsbased1} can
be described informally as follows. The meaning of the first
condition is simply that there exists a kind of 
{\em uniform convergence envelope\/} around $\ehat(d)$,
bounding its error both from above and from below.
This envelope is defined by two functions, $F_1$ and $F_2$.
The second condition asks that
$G$ be such that the contribution of its second (complexity) argument
``corrects'' the bound on the error of $\ehat(d)$ in $F_1$,
resulting in some expression depending {\em only \/} on $\epsilon(d)$,
which can be described by a function $G_0(\epsilon(d))$.
We state the theorems precisely first, then provide intuition for
their form and give several applications to specific model selection rules.
\begin{theorem}
\label{theorem-epsbased1}
\footnote{For simplicity, all of the results in this section and the following
one are stated for the noise-free case. However, all have straightforward
analogues for the noisy case that we give in the full paper.}
Let $G:[0,1] \times \Re \rightarrow \Re$
define the $\ehat$-based model selection rule
$\hat{d} = \argmin_d\{G(\ehat(d),d/m)\}$.
Assume that the following two conditions hold:
\vspace{-6pt}
\begin{enumerate}
\item %({\sf Uniform convergence envelope condition})
There exist functions
$F_1 : [0,1] \times \Re \rightarrow [0,1]$, and
$F_2 : [0,1] \times \Re \rightarrow [0,1]$ 
such that, with high probability, for all $d$,
$F_1(\epsilon(d),d/m) \; \leq \; \ehat(d) \;\leq \;  F_2(\epsilon(d),d/m)$.
%Both $F_1$ and $F_2$ should be 
%non-decreasing in their first argument, 
%but only $F_2$ need be non-decreasing in its second argument.
\vspace{-4pt}
\item 	
There exists an increasing function 
${G_0} : [0,1] \rightarrow \Re$, such that
for all $d$,
${G_0}(\epsilon(d)) \;\leq\; G(F_1(\epsilon(d),d/m),d/m)$.
\end{enumerate}
\vspace{-6pt}
Then, with high probability,
%\begin{equation}
$	\epsilon(\hat{d}) \; \leq \; 
		\min_d\left\{
		{G_0}^{-1}\left(G(F_2(\epsilon(d),d/m),d/m)\right)
            		\right\}$.
%\end{equation}
\end{theorem}
\begin{theorem} \label{theorem-epsbased2}
Let $G:[0,1] \times \Re \rightarrow \Re$
define the $\ehat$-based model selection rule
$\hat{d} = \argmin_d\{G(\ehat(d),d/m)\}$.
Then with high probability
\vspace{-7pt}
\[
	\epsilon(\hat{d}) \; \leq \; 
		\min_d\left\{
		G_0^{-1}\left(G\left(\epsilon(d) + \sqrt{d/m},d/m\right)\right)
		\right\} + \sqrt{\hat{d}/m}
\]
\vspace{-12pt}

\noindent
where $G_0^{-1}(y)$ is the inverse of the function $G(x,0)$.
\end{theorem}

Let us examine the bounds on the generalization error
given in the two theorems above a little more carefully.
Note that the first term in the bound given in Theorem~\ref{theorem-epsbased2}
is very similar to the $\epsilon$-based bound in 
Theorem~\ref{theorem-epsbased1}. More precisely, if we ignore the somewhat
different definitions of $G_0$ in the two theorems, then this first 
term is a special case of the bound in Theorem~\ref{theorem-epsbased1},
where $F_2(\epsilon(d),d/m) = \epsilon(d) + \sqrt{d/m}$.
This choice of $F_2$ is simply the upper bound in
the uniform convergence bound mentioned previously which states
that with high probability,
$|\epsilon(d) - \ehat(d)| \leq \sqrt{d/m}$ for all $d$.
Suppose $d^*$ satisfies $\epsilon(d^*) = \min_d\{\epsilon(d)\}$ (that is,
$F_{d^*}$ contains the best possible approximation to the target function).
Then as $m \rightarrow \infty$,
\vspace{-6pt}
\begin{equation}
G_0^{-1}\left(G\left(\epsilon(d^*) + \sqrt{d^*/m},d^*/m\right)\right) 
  \;\rightarrow\; 	G_0^{-1}\left(G(\epsilon(d^*),0)\right) \;=\;
	\epsilon(d^*)\;.
\end{equation}
\vspace{-12pt}

\noindent
Thus the first term of the bound on $\epsilon(\hat{d})$ in
Theorem~\ref{theorem-epsbased2}
approaches the optimal generalization error in the limit of large $m$,
and a slightly more complicated expression can be obtained
for the more general bound in Theorem~\ref{theorem-epsbased1}.

Consider now the second term, $\sqrt{\hat{d}/m}$,
of the bound on $\epsilon(\hat{d})$ in
Theorem~\ref{theorem-epsbased2}.
If we want the {\em sum\/} of the two terms in the bound to be meaningful,
we must also give a
bound on $\sqrt{\hat{d}/m}$
that decays to 0 with $m$, preferably as rapidly
as possible. In other words, {\em we must be able to argue that
the complexity of the hypothesis chosen is limited\/}.
If we can do so, then combined with the bound on the first term
we have a proof of the method's {\em statistical consistency\/} (that is,
approach to the optimal error in the large sample limit), and may even have
a nice {\em rate\/} of approach to the optimal error. If we cannot do so,
then we are forced to consider the possibility that our method is simply fitting
the sample, and incurring large error as a result. Such a possibility
was clearly realized in the experimental results for MDL, where a long period
of unbounded hypothesis complexity directly caused a long period of essentially
constant generalization error as a function of $m$.

 From the discussion above, there seems to be a clear
benefit in having an $\epsilon$-based bound on the generalization
error as opposed to a bound which contains a potentially ``uncontrolled''
term such as the bound given in Theorem~\ref{theorem-epsbased2}.
However, the $\epsilon$-based bound given in 
Theorem~\ref{theorem-epsbased1} has its price. 
The selection rules to which it can be applied are
{\em conservative\/}, in the sense 
that the contribution of the
second (complexity) argument in the rule must be large enough so as
to ``correct'' the bound on the error of $\ehat(d)$ in $F_1$.
This conservative feature is illustrated in a GRM variant
we discuss next, and its drawbacks were already exemplified by the tendency
of GRM to undercode when $m$ is small in our experimental results.

Theorem~\ref{theorem-epsbased1} can be applied
to a {\em simplified\/} GRM variant (SGRM in the sequel), defined by
$\hat{d} = \argmin_d\{\ehat(d) + \sqrt{d/m}\}$.
Recall the uniform convergence bound which states
that with high probability,
$|\epsilon(d) - \ehat(d)| \leq \sqrt{d/m}$ for all $d$.
Using the notation introduced in Theorem~\ref{theorem-epsbased1},
$G(\ehat(d),d/m)$ is defined to be $\ehat(d) + \sqrt{d/m}$,
$F_1(\epsilon(d),d/m) = \epsilon(d) - \sqrt{d/m}$, and
$F_2(\epsilon(d),d/m) = \epsilon(d) + \sqrt{d/m}$. Substituting
$F_1(\epsilon(\hat{d}),\hat{d}/m)$ for $\ehat(\hat{d})$ in
$G(\ehat(\hat{d}),\hat{d}/m)$, we have that
$G(\ehat(\hat{d}),\hat{d}/m) \geq \epsilon(\hat{d})$, and we
can choose ${G_0}$ to be the identity function.
Thus, the generalization bound we get 
for SGRM using Theorem~\ref{theorem-epsbased1} is
\vspace{-6pt}
\begin{equation} \label{eqn-sgrmbound}
%\epsilon(\hat{d}) 
\epsilonsgrm
    \; \leq \; \min_d\left\{\epsilon(d) + 2\sqrt{{d}/{m}}\right\}\;.
\end{equation}
\vspace{-12pt}

This problem-independent bound expresses the
generalization error as the minimum of the sum of the best
possible error within each class $F_d$ and a penalty for complexity. Such a
bound seems entirely reasonable, given that it is essentially the expected
value of the empirical quantity we minimized to choose $\hat{d}$ in the
first place. Furthermore, if
$\epsilon(d) + \sqrt{d/m}$ approximates $\etwiddle(d)$ well, then such a
bound is about the best we could hope for. 
However, %for particular problems
there is no reason in general to expect this to be the case. Clearly,
if for a specific problem there are known tighter bounds $F_1$ and $F_2$, then
we can modify $G$ accordingly (to exactly correct the bound on the error
of $\ehat(d)$ in $F_1$), and derive better generalization bounds. 
Unfortunately, this brings us back to the potential difficulty of
problem-specific calculations.
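
To make the rule concrete, the following minimal sketch (ours, in Python, with
illustrative names only; it is not the implementation used in our experiments)
computes the SGRM choice of $\hat{d}$ from the observed training errors:
\begin{verbatim}
import math

def sgrm_select(train_errors, m):
    """Return the d minimizing ehat(d) + sqrt(d/m).

    train_errors[d] is the observed training error ehat(d) of the
    hypothesis returned from F_d on the m training examples."""
    best_d, best_total = 0, float("inf")
    for d, ehat in enumerate(train_errors):
        total = ehat + math.sqrt(d / m)  # empirical error plus complexity penalty
        if total < best_total:
            best_d, best_total = d, total
    return best_d
\end{verbatim}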

As an example of the application of Theorem~\ref{theorem-epsbased2}
to MDL we derive the following bound on $\epsilonmdl$:

\vspace{-11pt}
\begin{equation}
\epsilonmdl \leq \min_d\{
	\entropy^{-1}(\entropy(\epsilon(d) + 
		\sqrt{d/m}) + \entropy(d/m))\} +
		\sqrt{\lengthmdl/m}
\leq
		\min_d\{
		\entropy(\epsilon(d)) + 2\entropy(\sqrt{d/m})\}
		+ \sqrt{\lengthmdl/m} \label{eqn-mdlbound},
\end{equation}
\vspace{-11pt}

\noindent
where we have used the facts that $\entropy^{-1}(y) \leq y$ and
$\entropy(x + y) \leq \entropy(x) + \entropy(y)$. Again, we emphasize that
the bound given by Equation (\ref{eqn-mdlbound}) is useless without
a bound on $\lengthmdl$, which we know from the experiments can be of
order $m$. However, by analyzing the point at which $\lengthmdl$ makes
its transition to the correct value in the experimental setting we examined,
we can in fact give an expression for $\lengthmdl$ that depends on the
target function complexity, the sample size and the noise rate. This expression,
combined with Equation (\ref{eqn-mdlbound}), gives a rather accurate theoretical
explanation for the experimental findings (details in the full paper).

As a final example, we show how 
Theorem~\ref{theorem-epsbased2} can be applied
to a {\em variant\/} of MDL in which the penalty for coding is
increased over the original:
\vspace{-9pt}
\begin{equation}
\hat{d} = \argmin_d\left\{\entropy(\ehat(d)) + 
               {1\over\lambda^2}\entropy\left({d\over m}\right)\right\}\;
\end{equation}
\vspace{-10pt}

\noindent
where $\lambda$ is some parameter that may depend on $d$ and $m$. Assuming that we
never choose $\hat{d}$ whose total penalty is larger than 1
(which is a reasonable assumption since it holds under the condition
that the fair coin belongs to some $F_d$, for a small $d$), we have
that $\entropy({\hat{d}\over m}) \leq \lambda^2$. Since 
$\entropy(x) \geq x$ for all $x \leq 1/2$, it follows that 
$\sqrt{\hat{d}/m} \leq \lambda$. If $\lambda$ is some decreasing function
of $m$ (say, $m^{-\alpha}$ for some $0 < \alpha < 1$), then 
the bound on $\epsilon(\hat{d})$ approaches $\epsilon(d^{\ast})$ at
a reasonable rate.
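
A minimal sketch of this variant (ours, for illustration only; the function
names are hypothetical) makes the role of $\lambda$ explicit:
\begin{verbatim}
import math

def binary_entropy(x):
    """H(x) = -x*log2(x) - (1-x)*log2(1-x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def mdl_variant_select(train_errors, m, lam):
    """Return the d minimizing H(ehat(d)) + (1/lam^2) * H(d/m)."""
    def total_penalty(d):
        return binary_entropy(train_errors[d]) + binary_entropy(d / m) / lam**2
    return min(range(len(train_errors)), key=total_penalty)
\end{verbatim}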

\section{Bounding the Additional Error of Cross Validation}
\label{section-cvtracking}

In this section we state a rather general theorem bounding the additional
generalization error suffered by cross validation compared to any other
model selection method.
The only assumption we need to introduce to obtain
such a bound is that of a maximum complexity value $\dmax$. For most
natural problems, $\dmax$ is of the same order as $m$, and simply represents
the smallest $d$ such that $F_d$ can realize any labeling of $m$ inputs.
The theorem is most conveniently stated in terms of expected generalization
error; however, similar bounds holding with high probability can also be given.
\begin{theorem} \label{theorem-cvtracking}
\footnote{Again, for simplicity we state the theorem only for the noise-free
case, leaving the analogue for the noisy case for the full paper.}
Let $L$ be any learning algorithm, and
let the hypothesis sequence
$\hat{h}_1 \in F_1,\ldots,\hat{h}_{\dmax} \in F_{\dmax}$ be
a random variable
determined by $\hat{h}_d = L(S,d)$, where $S$ is a random sample
of $m$ (possibly noisy) examples of any fixed target function $f$
and distribution $D$.
Let $M$ be any model selection method for choosing
$\hat{d} \in \{1,\ldots,\dmax\}$, and let $\epsilonM(m)$
and $\epsiloncv(m)$ denote the expected generalization error
of the hypothesis $\hat{h}_{\hat{d}}$
chosen by $M$ and CV respectively.
Then with high probability
\begin{equation}
	\epsiloncv(m) \leq \epsilonM((1 - \gamma)m)
		+ 2\sqrt{{\log\dmax}/{\gamma m}}.
\end{equation}
In other words, the generalization error of CV on $m$ examples is
at most the generalization error of any other method $M$ on $(1-\gamma)m$
examples plus the ``test penalty term''
$2\sqrt{{\log\dmax}/{\gamma m}}$.
\end{theorem} 
A proof sketch for this theorem is provided in Appendix C.
Let us discuss the form of the bound on $\epsiloncv(m)$ given by
Theorem~\ref{theorem-cvtracking}. Note that the bound does {\em not\/}
claim $\epsiloncv(m) \leq \epsilonM(m)$ for all $M$ (which would
mean that cross validation is an optimal model selection method). 
The bound given is
weaker than this ideal in two important ways. First, and perhaps most
importantly,
$\epsilonM((1-\gamma)m)$ may be considerably larger
than $\epsilonM(m)$. This could either be due to properties of the
underlying learning algorithm $L$, or due to inherent {\em phase transitions\/}
(sudden decreases)
in the optimal information-theoretic learning 
curve~\cite{sst-smle-92,hkst-rlcbsm-94} --- thus, it could be
that the generalization error that can be achieved within some class $F_d$
by training on $m$ examples is
close to 0, but that the optimal 
generalization error that can be achieved in $F_d$
by training on a slightly smaller 
sample is near $1/2$. This issue will be discussed in
some detail in the full paper, but this is intuitively the worst case
for cross validation --- when the small fraction of the sample saved for testing
was critically needed for training in order to achieve nontrivial performance ---
and is reflected in the first term of our bound. Obviously the risk of ``missing''
phase transitions can be minimized by decreasing the test fraction $\gamma$, but
only at the expense of increasing the test penalty term, which is the second
way in which our bound falls short of the ideal. However, unlike the potentially
unbounded difference
$\epsilonM((1-\gamma)m) - \epsilonM(m)$, 
our bound on the test penalty can be
decreased without any problem-specific knowledge by 
simply {\em increasing\/} the test fraction $\gamma$.
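
For concreteness, the selection rule whose error is bounded by
Theorem~\ref{theorem-cvtracking} can be sketched as follows (ours, in Python,
for illustration only; \verb|learn(S, d)| stands for the underlying learning
algorithm $L$ and is assumed to be supplied by the caller):
\begin{verbatim}
def cv_select(sample, dmax, gamma, learn):
    """Cross validation: train learn(S', d) on the first (1-gamma)*m
    examples and choose the d whose hypothesis has the smallest error
    on the gamma*m held-out examples (assumes gamma*m >= 1).

    sample is a list of (x, b) pairs; learn(S, d) is assumed to
    return a hypothesis h with h(x) in {0, 1}."""
    m = len(sample)
    split = int(round((1.0 - gamma) * m))
    train, test = sample[:split], sample[split:]
    best_d, best_err = 0, float("inf")
    for d in range(dmax + 1):
        h = learn(train, d)
        err = sum(1 for (x, b) in test if h(x) != b) / len(test)
        if err < best_err:
            best_d, best_err = d, err
    return best_d
\end{verbatim}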

Despite these two competing sources of additional error for
CV in Theorem~\ref{theorem-cvtracking},
the bound has several strengths worth discussing. First of all,
% MK: space saver
% the bound does not simply compare the
% worst-case error of CV to the worst-case
% error of $M$ over a wide class of model selection problems;
the bound holds for {\em any\/} model selection problem --- that is, for any
specific sequence of function classes,
any algorithm $L$ for choosing a hypothesis from
each class, any target function, and any input distribution. This is a point
worth dwelling on, since we believe that giving similarly general bounds for any
$\ehat$-based method would be extremely difficult, if not impossible. The reason
for this belief essentially arises from the diversity of learning curve behavior
documented by the statistical mechanics approach~\cite{sst-smle-92,hkst-rlcbsm-94},
among other sources. In the same
way that there is no universal learning curve behavior, there is no universal
behavior for the relationship between the functions $\ehat(d)$ and $\etwiddle(d)$ ---
the relationship between these quantities may depend critically on the target function
and the input distribution
\footnote{This point is made more formally in Section~\ref{section-ehatlimit}.}.
CV is sensitive to this dependence by virtue of its
target function-dependent and distribution-dependent estimate of $\etwiddle(d)$.
In contrast, by their very nature, 
$\ehat$-based methods propose a universal penalty
to be assigned to the observation 
of error $\ehat(h)$ for a hypothesis $h$ of complexity $d$.

A more technical feature of the bound of Theorem~\ref{theorem-cvtracking} is that by
letting $M$ be an $\ehat$-based method, we can combine the bound with those
given in Section~\ref{section-epsbased} to give $\epsilon$-based bounds on CV.
For example, letting $M$ be the SGRM method described in Section~\ref{section-epsbased},
and combining Equation~(\ref{eqn-sgrmbound}) with Theorem~\ref{theorem-cvtracking} yields
\begin{eqnarray}
\epsiloncv(m) 
	& \leq & \epsilonsgrm((1-\gamma)m) 
		+ 2\sqrt{{\log\dmax}/{\gamma m}} \\
	& \leq & \min_d\left\{\epsilon(d) + 2\sqrt{d/((1-\gamma)m)}\right\}
		+ 2\sqrt{{\log\dmax}/{\gamma m}}\;. \label{eqn-cvepsbased}
\end{eqnarray}
In addition to providing a general $\epsilon$-based bound on CV, this final
expression raises an intriguing possibility: if we only knew the form of
$\epsilon(d)$ (or even had bounds on it), then in principle we could minimize
the bound of Equation (\ref{eqn-cvepsbased})
as a function of $\gamma$ to derive a recommended training/test
split. Such a program is feasible for many 
specific problems (such as the intervals problem), or by investigating general
but plausible bounds on the approximation rate $\epsilon(d)$, such as
$\epsilon(d) \leq c_0/d$ for some constant $c_0 > 0$. We pursue this line
of inquiry in some detail in a separate forthcoming paper~\cite{fkmnr-ip-95}. For now,
we simply note that the bound of Equation (\ref{eqn-cvepsbased}) tells us that
in cases in which the power law
decay of generalization error within each $F_d$
that is implicitly assumed by GRM and related methods
is approximately correct
(and thus that bounds of the form given in Equation (\ref{eqn-sgrmbound})
are nearly optimal), the performance of CV will be rather competitive with these
or any other methods. This makes perfect
sense in light of the preceding analysis of the two
sources for additional CV error: in problems with power law learning curve behavior,
we have a power law bound on $\epsilonM((1-\gamma)m) - \epsilonM(m)$, and thus CV
``tracks'' any other method closely in terms of generalization error. This is
exactly the behavior observed in
the experiments described in Section~\ref{section-experimental}, for which
the power law is known to hold approximately.
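
As a purely illustrative calculation (ours; the approximation-rate assumption
$\epsilon(d) \leq c_0/d$ and all function names are hypothetical), the
right-hand side of Equation (\ref{eqn-cvepsbased}) can be minimized over
$\gamma$ by a simple grid search:
\begin{verbatim}
import math

def cv_bound(gamma, m, dmax, c0):
    """Right-hand side of the bound, assuming epsilon(d) <= c0/d."""
    approx = min(c0 / d + 2.0 * math.sqrt(d / ((1.0 - gamma) * m))
                 for d in range(1, dmax + 1))
    return approx + 2.0 * math.sqrt(math.log(dmax) / (gamma * m))

def recommended_gamma(m, dmax, c0, grid=99):
    """Grid search for the test fraction gamma minimizing the bound."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates, key=lambda g: cv_bound(g, m, dmax, c0))
\end{verbatim}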

\section{On the Limitations of $\ehat$-Based Methods}
\label{section-ehatlimit}

Recall that our experimental findings suggested that it may sometimes
be fair to think of $\ehat$-based methods as being either conservative
or liberal in the amount of coding they are willing to allow in their
hypothesis, and that bias in either direction can result in suboptimal
generalization that is not easily overcome by tinkering with the form
of the rule. In this section we treat this intuition more formally, by
giving a theorem demonstrating some fundamental limitations on the diversity
of problems that can be effectively handled by any fixed $\ehat$-based method.
Briefly, we show that there are (at least) two very different forms that
the relationship between $\ehat(d)$ and $\etwiddle(d)$ can assume, and that
any $\ehat$-based method can perform well on only one of these. Intuitively, we show that
there exist pairs of model selection problems, in one of which a hypothesis whose complexity is
close to the sample size is the optimal choice, while in the other a hypothesis whose
complexity is close to 0 is the optimal choice, yet both problems generate $\ehat(d)$ curves
with insufficient information to distinguish which is the case.
For the problems we
choose, CV can in fact succeed on both; but as we have discussed previously,
the use of CV is not without pitfalls of its own. We therefore conclude the paper with
a summary of the rather different risks involved with
each type of method, and a discussion of
our belief that in the absence of detailed problem-specific knowledge, our
overall analysis favors the use of CV. 
\begin{theorem} \label{theorem-ehatlimit}
There exist
function class sequences
$F^1_1 \subseteq \cdots \subseteq F^1_d \cdots$ and
$F^2_1 \subseteq \cdots \subseteq F^2_d \cdots$, input distributions
$D_1$ and $D_2$, and a constant $\gamma$ such that for any sample size $m$,
there are target functions $f_1$ and $f_2$ such that for any $\ehat$-based
model selection method $G$, either
$\epsilon^1_G(m) \geq \min_d\{\etwiddle_1(d)\} + \gamma$ or
$\epsilon^2_G(m) \geq \min_d\{\etwiddle_2(d)\} + \gamma$.
Here $\etwiddle_i(d)$ is the function $\etwiddle(d)$ defined by
the target function $f_i$, distribution $D_i$ and sequence
$F^i_1 \subseteq \cdots \subseteq F^i_d \cdots$, and
$\epsilon^i_G(m)$ is the expected generalization error
of method $G$ (over training samples of size $m$) on this target function, distribution
and sequence.
Thus, on at least one of the two model selection problems,
the generalization error of $G$ is lower bounded away from the optimal value
by a constant independent of $m$.
\end{theorem}
A sketch of the proof of this theorem is provided in Appendix D.
We note that although Theorem~\ref{theorem-ehatlimit} was designed to create
two model selection problems with the most disparate behavior possible, the
proof technique can be used to give lower bounds on the generalization error
of $\ehat$-based methods under much more general settings. We will elaborate on
this point in the full paper. There we will also argue that for the two problems
considered, the generalization error of CV is in fact close to
$\min_d\{\etwiddle(d)\}$ (that is, within a small additive term that
decreases rapidly with $m$) for {\em both\/} problems.
Finally, we remark that Theorem~\ref{theorem-ehatlimit} can be strengthened
to hold for a {\em single\/} model selection problem (that is, a single function
class sequence and distribution), with only the target function changing to
obtain the two different behaviors. This rules out the salvation of the $\ehat$-based
methods via problem-specific parameters to be tuned, 
such as ``effective dimension''~\cite{vll-mvlm-94,m-enp-91}.

\section{Conclusions}
Based on both our experimental and theoretical results, we offer the following
conclusions:
\begin{description}
\item Model selection methods that attempt to reconstruct the curve $\etwiddle(d)$
	solely by examining the curve $\ehat(d)$ often have a tendency to overcode
	or undercode in their hypothesis for small sample sizes, which is exactly
	the sample size regime in which model selection is an issue. Such tendencies
	are not easily eliminated without suffering the reverse tendency.
\item There exist pairs of model selection problems in which a hypothesis whose complexity is
	close to the sample size should be chosen in one, and a hypothesis whose
	complexity is close to 0 should be chosen in the other, but that generate $\ehat(d)$ curves
	with insufficient information to distinguish which is the case. The $\ehat$-based
	methods cannot succeed in both cases, whereas CV can.
\item The error of CV can be bounded in terms of the error of any other method.
	The only cases in which the CV error may be dramatically worse are those
	in which phase transitions occur in the underlying learning curves at
	a sample size larger than that held out for training by CV.
\end{description}
Thus we see that both types of methods considered have their own Achilles' Heel. For
$\ehat$-based methods, it is an inability to distinguish two types of problems that
call for drastically different hypothesis complexities. For CV, it is phase transitions
that unluckily fall between $(1-\gamma)m$ examples and $m$ examples. On balance, we
feel that the evidence we have gathered favors use of CV in most common circumstances.
Perhaps the best way of stating our position is as follows: given the
rather general upper bound
on CV error we have obtained, and the limited applicability of any fixed $\ehat$-based
rule demonstrated by Theorem~\ref{theorem-ehatlimit} and the experimental results,
the burden of proof lies with the practitioner who favors an $\ehat$-based method over CV.
In other words, such a practitioner should have concrete evidence (experimental or theoretical)
that their method will outperform CV on the problem of interest.
Such evidence {\em must\/} arise from rather detailed problem-specific knowledge,
since we have demonstrated here the diversity 
of behavior that is possible in natural model selection
problems.

Finally, we wish to remark that although we have limited our attention here to
the case of supervised learning of boolean functions, we believe that many of
the principles uncovered (such as the limitations of $\ehat$-based methods,
and the tracking abilities of cross validation)
will be applicable to practically any learning setting in which
a model minimizing an expected loss (generalization error) must be derived
from independent observations of a source. A prime example for further investigation
would be distribution learning with 
respect to the Kullback-Leibler divergence (log loss),
where $\epsilon$-based upper bounds for 
MDL-like rules are already known~\cite{bc-mcde-91}, yet there
also exist phase transitions for natural problems~\cite{hkst-rlcbsm-94}.

\section*{Acknowledgements}

We give warm thanks to Yoav Freund and Ronitt Rubinfeld for their collaboration on
various portions of the work presented here, and for their insightful comments.
Thanks to Sebastian Seung and Vladimir Vapnik for interesting and helpful conversations.

\bibliographystyle{plain}
\bibliography{/home/mkearns/coltbib/colt}

\section*{Appendix}

\noindent
{\bf A. Experimental Details}

All experimental results described in this paper are obtained for
the intervals model selection problem. Recall that in this problem,
the function class $F_d$ consists of all boolean functions over 
the domain $[0,1]$ which have at most $d$ alternations of label.
There are two main reasons for choosing this problem for our investigation.
The first is that the complexity of the hypothesis functions is unlimited;
in particular, it is not hard to show that the Vapnik-Chervonenkis dimension
of $F_d$ is $d$, and thus as $d$ increases we allow arbitrarily complex
functions. The second reason is that this is one of the few cases in
which empirical error minimization is feasible
\footnote{This is important in
light of our earlier assertion that a good model selection method should
at least perform well when the underlying learning algorithm implements
empirical error minimization, and we do not wish any of our experimental
results to be artifacts of the unknown properties 
of heuristics such as backpropagation or ID3.}.
(A number of papers provide
evidence for the intractability of empirical error minimization for
a variety of natural function classes~\cite{pv-clle-88,br-t3nnn-89,kss-teal-92}.) 

More precisely, there is an algorithm that on input an {\em arbitrary\/}
sample $S = \{\langle x_i,b_i \rangle\}$ (where $x_i \in [0,1]$ and
$b_i \in \{0,1\}$) and complexity value $d$, 
outputs a function in $\vs_S(d)$. The algorithm
is based on dynamic programming, and a straightforward implementation
yields a running time of $O(dm^2)$. However, we have developed a
more sophisticated implementation that runs in time
$O(m\log m)$. Details of this faster algorithm will be provided in the
full paper.
The algorithm was implemented in the C++ programming language on
an SGI Challenge XL with eight 150~MHz processors and 1 gigabyte of RAM.
This implementation
allowed execution of the empirical error minimization algorithm on samples
of size up to $m \approx 15000$ in only a few seconds of real time.
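
For illustration only, the following simple dynamic program (ours; it is a
straightforward formulation and not the faster implementation described above)
computes, for every $d$, the minimum number of sample points misclassified by
a labeling of the sorted sample having at most $d$ label alternations;
dividing by $m$ gives $\ehat(d)$:
\begin{verbatim}
def min_disagreements(xs, bs, dmax):
    """For each d = 0..dmax, the fewest points misclassified by some
    labeling of the sorted sample with at most d label alternations."""
    labels = [b for _, b in sorted(zip(xs, bs))]
    INF = float("inf")
    # cost[k][v]: fewest errors on the points processed so far, using
    # exactly k alternations, with the current block labeled v.
    cost = [[0, 0]] + [[INF, INF] for _ in range(dmax)]
    for b in labels:
        new = [[INF, INF] for _ in range(dmax + 1)]
        for k in range(dmax + 1):
            for v in (0, 1):
                stay = cost[k][v]
                switch = cost[k - 1][1 - v] if k > 0 else INF
                new[k][v] = min(stay, switch) + (0 if b == v else 1)
        cost = new
    best, out = INF, []
    for k in range(dmax + 1):  # "at most d" = best over all k <= d
        best = min(best, cost[k][0], cost[k][1])
        out.append(best)
    return out
\end{verbatim}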

The fast empirical error minimization code was the heart of a more elaborate
experimental tool that offered the following features:
\begin{description}
\item $\bullet$
The user specifies a target intervals function over $[0,1]$ in a file
that indicates the values at which the function changes label. Thus, a
file containing the values 0.15, 0.40, 0.75 specifies the boolean function
that is 1 on the interval $[0,0.15)$, 0 on the region $[0.15,0.40)$, 1
on the region $[0.40,0.75)$ and 0 on the region $[0.75,1.0]$
(an illustrative sketch of this labeling convention appears after this list).
\item $\bullet$
The user specifies the sample size $m$, and the noise rate $\noiserate$
with which the labels in the sample will be corrupted with noise.
The user also specifies one or more model selection methods, 
such as GRM, MDL or CV.
\item $\bullet$
A random sample $S$ of size $m$ of
the specified target function corrupted by the specified
noise rate is then generated by the program (inputs are
drawn according to the uniform distribution). For each value of $d$ from $0$
to $m$, $S$ and $d$ are then given to the empirical error minimization code.
This code returns a function $\hat{h}_d \in \vs_S(d)$. If $\vs_S(d)$ contains
functions giving different labelings to $S$, the code chooses the least function
under a lexicographic ordering described in the full paper.
The hypothesis selected from $\vs_S(d)$ always has its
label alternation points exactly midway between sample inputs.
\item $\bullet$
For each $\hat{h}_d$, the 
true generalization error $\epsilon(\hat{h}_d)$ is
computed with respect to the specified target function, thus allowing
exact computation of the curve $\etwiddle(d)$.
\item $\bullet$
For each $\hat{h}_d$, the total penalty 
assigned to $\hat{h}_d$ by the chosen
model selection method is computed from $\hat{h}_d$, $S$ and $d$. Minimization
of this total penalty with respect to $d$ is then performed by the
code, resulting in the hypothesis $\hat{h}_{\hat{d}}$ chosen by the specified
model selection method. The error of this hypothesis
can then be compared with that of other model selection methods, as
well as the optimal value $\min_d\{\etwiddle(d)\}$.
\end{description}
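
The sketch referred to in the first item above (ours, for illustration; the
helper names are hypothetical) shows the labeling convention for the
changepoint files together with the generation of a noisy sample from the
uniform distribution:
\begin{verbatim}
import random

def target_label(x, changepoints):
    """Label of the intervals target at x: 1 on [0, c1), 0 on [c1, c2),
    and so on, alternating at each changepoint."""
    alternations = sum(1 for c in changepoints if c <= x)
    return 1 - (alternations % 2)

def noisy_sample(changepoints, m, noise_rate, rng=random):
    """m inputs drawn uniformly from [0,1], labeled by the target and
    flipped independently with probability noise_rate."""
    sample = []
    for _ in range(m):
        x = rng.random()
        b = target_label(x, changepoints)
        if rng.random() < noise_rate:
            b = 1 - b
        sample.append((x, b))
    return sample
\end{verbatim}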

The experiments given in the paper were all performed using a target
function of 100 regions of equal width and alternating label. 
The code
provides an option for repeated trials at each sample size, which was
used extensively in the experiments. The code produces plot files that
were averaged where appropriate. The postscript
plots shown were generated
by reading the plot files generated by the code into
the Xmaple system, which allows postscript output.\\\\

\noindent
{\bf B. Proofs of Theorems~\ref{theorem-epsbased1} and~\ref{theorem-epsbased2}}

\noindent
{\bf Proof of Theorem~\ref{theorem-epsbased1}:~~}
By definition of $\hat{d}$, we have that
for all $d$,
\begin{equation} \label{eqn-ehatmin1}
G\left(\ehat(\hat{d}),\hat{d}/m\right) \; \leq \; G\left(\ehat(d),d/m\right)\;.
\end{equation}
Since by the first condition in the theorem,
$\ehat(\hat{d}) \geq F_1(\epsilon(\hat{d}),\hat{d}/m)$,
while for all $d$, $\ehat(d) \leq F_2(\epsilon(d),d/m)$, 
and since $G(\cdot,\cdot)$
is an increasing function in its first argument, we have that
\begin{equation}
G\left(\ehat(\hat{d}),\hat{d}/m\right) 
 \; \geq \; G\left(F_1\left(\epsilon(\hat{d}),\hat{d}/m\right),\hat{d}/m\right),
\;\;\mbox{ and } \;\;
%\end{equation}
%\begin{equation}
G\left(\ehat(d),d/m\right) 
   \;\leq\; G\left(F_2\left(\epsilon(d),d/m\right),d/m\right)\;.
\end{equation}
Hence, for all $d$,
\begin{equation} \label{eqn-epsintro1}
G\left(F_1\left(\epsilon(\hat{d}),\hat{d}/m\right),\hat{d}/m\right) \;
   \leq \; G\left(F_2\left(\epsilon(d),d/m\right),d/m\right)\;.
\end{equation}
According to the second condition in the theorem we have that
\begin{equation}
G\left(F_1\left(\epsilon(\hat{d}),\hat{d}/m\right),\hat{d}/m\right) \;
   \geq \; G_0\left(\epsilon(\hat{d})\right)\;,
\end{equation}
from which we get the final bound:
\begin{equation}
\epsilon(\hat{d}) \; \leq \; 
   \min_d\left\{{G_0}^{-1}\left(G(F_2(\epsilon(d),d/m),d/m)\right)\right\}\;.
\end{equation}
\qed

\noindent
{\bf Proof of Theorem~\ref{theorem-epsbased2}:~~}
Similarly to the proof of Theorem~\ref{theorem-epsbased1}, we start
with the inequality
\begin{equation} \label{eqn-ehatmin2}
G\left(\ehat(\hat{d}),\hat{d}/m\right) 
   \; \leq \; G\left(\ehat(d),d/m\right)\;.
\end{equation}
We next use the uniform convergence bound considered previously
which states that $|\epsilon(d) - \ehat(d)| \leq \sqrt{d/m}$ for all $d$,
and obtain
\begin{equation} \label{eqn-epsintro2}
G\left(\epsilon(\hat{d})-\sqrt{\hat{d}/m},\hat{d}/m\right) \; \leq \;
                         G\left(\epsilon(d)+\sqrt{d/m},d/m\right)\;.
\end{equation}
Since $G$ is a non-decreasing function in its second argument, we can
further weaken Equation (\ref{eqn-epsintro2}) to obtain 
\begin{equation}
G\left(\epsilon(\hat{d})-\sqrt{\hat{d}/m},0\right) \; \leq \;
                         G\left(\epsilon(d)+\sqrt{d/m},d/m\right)\;.
\end{equation}
Since $G_0(x)$ was defined to be $G(x,0)$, we may write
\begin{equation}
\epsilon(\hat{d}) \; \leq \;
	  G_0^{-1}\left(G\left(\epsilon(d) + \sqrt{d/m},d/m\right)\right)
          	+ \sqrt{\hat{d}/m}\;,
\end{equation}
which gives the final form of the bound:
\begin{equation}
	\epsilon(\hat{d}) \; \leq \; 
		\min_d\left\{
		G_0^{-1}\left(G\left(\epsilon(d) + \sqrt{d/m},d/m\right)\right)
		\right\} + \sqrt{\hat{d}/m}
\end{equation}
\qed\\\\

\noindent
{\bf C. Proof Sketch for Theorem~\ref{theorem-cvtracking}}

% DR slightly changed the order of the proof
The logic of the proof is intuitive and somewhat standard.
Let $S = (S',S'')$ be a random sample of $m$ examples, where
$|S'| = (1 - \gamma)m$ and $|S''| = \gamma m$. We 
now show that for {\em any\/} choice of $S'$, with high probability
over the random choice of $S''$, the error of CV given the sample $S$
is upper bounded by the sum of the error of any other method that
is given only the sample $S'$,
and the additional term $2\sqrt{{\log\dmax}/{\gamma m}}$.
Let us thus fix $S'$, and let
$\hat{h}_1' \in F_1,\ldots,\hat{h}_{\dmax}' \in F_{\dmax}$ be
determined by $\hat{h}_d' = L(S',d)$.
By definition of CV, $\hat{d}$ is chosen according to
$\hat{d} = \argmin_d\{\ehat_{S''}(\hat{h}_d')\}$.
By standard uniform convergence arguments we 
have that $\left|\epsilon(\hat{h}_d') - \ehat_{S''}(\hat{h}_d')\right| \leq
                \sqrt{{\log\dmax}/{\gamma m}}$ for all $d$
with high probability over the draw of $S''$.
Therefore, (with high probability)
\begin{eqnarray}
\epsiloncv &=& \epsilon(\hat{h}'_{\hat{d}}) \\
&\leq& \ehat_{S''}(\hat{h}'_{\hat{d}}) 
                             + \sqrt{{\log\dmax}/{\gamma m}} \\
 &\leq& \min_d\{\epsilon(\hat{h}_d')\} 
                             + 2\sqrt{{\log\dmax}/{\gamma m}} \\
&=& \min_d\{\etwiddle_{S'}(d)\}
             + 2\sqrt{{\log\dmax}/{\gamma m}}\;.
\end{eqnarray}
But as we have previously observed,
the generalization error of {\em any\/} model selection method (including $M$)
on input $S'$ is lower bounded by
$\min_d\{\etwiddle_{S'}(d)\}$, and our claim directly follows.
\qed\\\\

\iffalse
The logic of the proof is intuitive and somewhat standard.
Let $S = (S',S'')$ be a fixed sample of $m$ examples, where
$|S'| = (1 - \gamma)m$ and $|S''| = \gamma m$. Let
$\hat{h}_1' \in F_1,\ldots,\hat{h}_{\dmax}' \in F_{\dmax}$ be
determined by $\hat{h}_d' = L(S',d)$. Now as we have previously observed,
the generalization error of {\em any\/} model selection method (including $M$)
on input $S'$ is obviously lower bounded by
$\min_d\{\etwiddle_{S'}(d)\} = \min_d\{\epsilon(\hat{h}_d')\}$.
Thus $\epsilonM((1 - \gamma)m)$ is lower bounded by
$\min_d\{\etwiddle_{S'}(d)\}$.
On the other hand, if we leave $S'$ fixed but
choose $S''$ randomly, then by standard uniform convergence arguments we will
have $\left|\epsilon(\hat{h}_d') - \ehat_{S''}(\hat{h}_d')\right| \leq
                \sqrt{{\log\dmax}/{\gamma m}}$ for all $d$
with high probability over the draw of $S'$.
The theorem can be shown to follow from the fact that
CV chooses $\hat{d}$ according to
$\hat{d} = \argmin_d\{\ehat_{S''}(\hat{h}_d')\}$.
\fi

\noindent
{\bf D. Proof Sketch for Theorem~\ref{theorem-ehatlimit}}

The proof involves a number of technicalities, so we begin with a description
of the most important ideas. For notational convenience, in the proof
we use $\ehat(d)$ and $\etwiddle(d)$ to refer to the expected values of these
functions.

We will arrange things so that the first model
selection problem (Problem 1, described below) has the following properties
(see Figure~\ref{figure-ehatlimit}):
\begin{description}
\item $\bullet$ The function $\ehat_1(d)$ lies between two linear functions
        with $y$-intercepts $\noiserate_1$ and $\noiserate_1(1 - \noiserate_1)$
        and common $x$-intercept $2\noiserate_1(1 - \noiserate_1)m \leq m/2$;
\item $\bullet$ The function $\etwiddle_1(d)$ is minimized at $d = 0$, and furthermore,
        for any constant $c$ we have $\etwiddle_1(cm) \geq c/2$.
% MK: figure out right dependence on c above
\end{description}
We will next arrange that the second model selection problem (Problem 2,
described below) will obey:
\begin{description}
\item $\bullet$ The function $\ehat_2(d) = a_1$
        for $0 \leq d \leq 2\noiserate_1(1 - \noiserate_1)m \leq m/2$,
        where $\noiserate_1(1 - \noiserate_1) > a_1$;
\item $\bullet$ The function $\etwiddle_2(d)$ is lower bounded by $a_1$ for
        $0 \leq d < m/2$, but $\etwiddle_2(m/2) = 0$.
\end{description}
Thus, in Problem 1, the {\em empirical error decays gradually with $d$\/},
the ``right'' choice of $d$ is 0, and {\em any additional coding
directly results in larger generalization
error\/}.
In Problem 2, the {\em empirical error remains large as $d$ increases\/}
until $d = m/2$, where the empirical error drops rapidly, and
{\em a large amount of coding is required to achieve nontrivial generalization error\/}.
In Figure~\ref{figure-ehatlimit} we illustrate the conditions on $\ehat(d)$ for
the two problems, and also include hypothetical instances of $\ehat_1(d)$ and $\ehat_2(d)$
that are consistent with these conditions (and are furthermore representative of the ``true''
behavior of the $\ehat(d)$ functions actually obtained for
the two problems we define momentarily).

We can now give the
underlying logic of the proof using the
hypothetical $\ehat_1(d)$ and $\ehat_2(d)$. First consider the behavior of $G$ on
Problem 2. In this problem we know by our assumptions on $\etwiddle_2(d)$
that if $G$ fails to choose $\hat{d} \geq m/2$, $\epsilon_G \geq a_1$, already giving
a constant lower bound on $\epsilon_G$ for this problem. This is the easier case; thus
let us now assume that $\hat{d} \geq m/2$ for Problem 2, and now consider the behavior
of $G$ on Problem 1. Referring to Figure~\ref{figure-ehatlimit}, we see that for
$0 \leq d \leq d_0$, $\ehat_1(d) > \ehat_2(d)$, and thus
$G(\ehat_1(d),d/m) \geq G(\ehat_2(d),d/m)$ (because $\ehat$-based
methods assign greater
penalties for greater empirical error or greater complexity). Thus, in moving from
Problem 2 to Problem 1, the complexity values between 0 to $d_0$ can only be
assigned a larger total penalty by $G$. On the other hand, since
$\ehat_1(m/2) = \ehat_2(m/2) = 0$, the total penalty assigned to $d = m/2$ is
the same for both problems. 
From these observations we can infer that in Problem 1, $G$ {\em cannot\/}
choose $0 \leq \hat{d} \leq d_0$. By the second condition on Problem 1 above,
this implies that $\epsilon_G \geq \etwiddle(d_0)$; if we arrange that $d_0 = cm$
for some constant $c$, then we have a constant lower bound on $\epsilon_G$ for
Problem 1.

Now we describe the two problems used in the proof,
and briefly argue why they have the desired
properties; details will appear in the full paper. We are in fact already familiar
with the first problem: the class $F_d^1$ is simply the class of all $d$-alternation
functions over $[0,1]$, the target function is the 0-alternation function that classifies
all examples as negative, the input distribution $D_1$ is uniform over $[0,1]$, and
we may choose any constant noise rate $\noiserate_1$. Now clearly under these settings
we have $\epsilon_1(0) = \etwiddle_1(0) = 0$ and $\etwiddle_1(d) > 0$ for any $d > 0$ (because
the noise in the sample will cause us to code ``false alternations''). Furthermore, each
additional false interval that is coded will result in an additional $\Theta(1/m)$
generalization error, thus resulting in the desired property $\etwiddle_1(cm) \geq c/2$.
% MK: again, tune constants
Finally, we obviously expect $\ehat_1(0) = \noiserate_1$, and it can be shown that
the expected number of label alternations required to achieve empirical error 0 is
$2\noiserate_1(1 - \noiserate_1)m$. Furthermore,
it can be shown that
$\ehat_1(d)$ is a piecewise linear function in which the linear segments have negative
slopes that are integer multiples of $1/2$, and
whose absolute values decrease with $d$ to the most gradual slope of $-1/2$.
These facts yield the desired envelope on $\ehat_1(d)$.

For the second problem, let us begin with
the input space $\{0,1\}^N$ for some value $N \gg m$.
The function class $F_d^2$ consists of all parity functions in which only the
variables $x_1,\ldots,x_d$ are permitted to appear, the target function $f \in F_{m/2}^2$
is $f(\vec{x}) = x_1 \oplus \cdots \oplus x_{m/2}$, and the input distribution $D_2$ is
uniform over $\{0,1\}^N$. The noise rate $\noiserate_2 = 0$ (larger values will work
as well). Under these settings, it is not difficult to show that $\epsilon_2(d) = 1/2$
for $0 \leq d < m/2$, thus implying that $\etwiddle_2(d) \geq 1/2$ in the same range.
Furthermore, it can be shown that $\epsilon_2(m/2) = \etwiddle_2(m/2) = 0$, and that
$\ehat_2(d) \approx 1/2$ for $0 \leq d < m/2$. Note that we have almost satisfied the
desired conditions on Problem 2, using the value $a_1 = 1/2$;
however, the conditions on Problem 2
and the lower bound argument given above require further that
$\noiserate_1(1 - \noiserate_1) > a_1$. We can easily arrange this final condition
by simply scaling down $a_1$, by adding a ``special'' point to the domain
on which all functions in $F_d^2$ agree (details are omitted).
Referring to Figure~\ref{figure-ehatlimit}, notice that the ``optimal'' setting of
$a_1$ is determined by the trade-off between $a_1$ (which lower bounds
the error of methods failing on Problem 2) and $d_0/m$ (which lower bounds
the error of methods failing on Problem 1).
\qed

\newpage

\input{figures}

\end{document}

