\documentclass{article}
\usepackage{psfig,colt99e,times}
\title{Beating the Hold-Out:\\ Bounds for K-fold and
Progressive Cross-Validation}
\author{ {\bf Avrim Blum\thanks{Supported in part by NSF grant CCR-9732705}}\\
School of Computer Science\\
Carnegie Mellon University\\
Pittsburgh, PA 15213\\
avrim+@cs.cmu.edu
\And
{\bf Adam Kalai\thanks{Supported by an NSF Graduate Fellowship.}}\\
School of Computer Science\\
Carnegie Mellon University\\
Pittsburgh, PA 15213\\
akalai+@cs.cmu.edu
\And
{\bf John Langford}\\
School of Computer Science\\
Carnegie Mellon University\\
Pittsburgh, PA 15213\\
jcl+@cs.cmu.edu
}
\newcommand{\qed}{\rule{5pt}{7pt}\bigskip}
\providecommand{\subK}{_{K}}
\providecommand{\subP}{_{P}}
\begin{document}
\bibliographystyle{alpha}
\maketitle
\begin{abstract}
The empirical error on a test set, the {\em hold-out estimate}, often
is a more reliable estimate of generalization error than the observed
error on the training set, the {\em training estimate}. K-fold cross
validation is used in practice with the hope of being more accurate
than the hold-out estimate without reducing the number of training
examples. We argue that the k-fold estimate does in fact achieve this
goal. Specifically, we show that for any nontrivial learning problem and
learning algorithm that is
insensitive to example ordering, the k-fold estimate is strictly more
accurate than a single hold-out estimate on
$1/k$ of the data, for $2 < k < n$. We also show that the {\em progressive
validation} procedure gives as good an error estimate as a hold-out set of
the same size, while training on more examples.
\end{abstract}

\section{INTRODUCTION}
A learning procedure takes a set of labelled examples and produces both a
hypothesis and an estimate of that hypothesis's error.
Such a procedure can be scored in two dimensions: the true error of
its hypothesis, and the {\em error discrepancy}: $|(\mbox{estimated
error})-(\mbox{true error})|$.
The {\em resubstitution procedure}
generates a hypothesis by training on all the data and generates
an error estimate by measuring the number of mistakes of the learned
hypothesis on the same data used for training.
Since the training error can be a very optimistic
estimate of the true error, quantities such as
the VC dimension are used to bound the error discrepancy.
The {\em hold-out procedure} divides the data in two parts: the training set,
on which the hypothesis is trained, and the hold-out set, on which its
performance is measured. Among the nice properties of this procedure
are Hoeffding bounds guaranteeing, regardless of the learning algorithm,
that with high probability the error discrepancy will be small.
The {\em k-fold procedure} divides the data into $k$ equally sized
{\em folds}. It then produces a hypothesis by training on $k-1$ folds
and testing on the remaining fold. This is repeated for each fold,
and the observed errors are averaged to form the {\em k-fold
estimate}. It is not obvious what hypothesis to output along with
this error estimate. In a previous analysis \cite{AH98}, the final
hypothesis was a new hypothesis trained on all the data. Because this
hypothesis was constructed using more data than the hypotheses used
for computing the error estimate, in order to argue
for the accuracy of the estimate one needs some assumption that limits
the effect of this extra training data. In particular, previous work gives
{\em sanity-check bounds} which show that the k-fold estimate is almost
as good as the training error estimate in the resubstitution
procedure, under the assumption that the learning algorithm has some
form of hypothesis stability.
Our k-fold procedure, instead, outputs the {\em k-fold hypothesis}, a
meta-hypothesis that, given an example $x$, randomly chooses one of
the $k$ generated hypotheses $h_i$ and outputs the prediction of that
hypothesis, $h_i(x)$. Alternatively, if hypotheses are allowed to
make predictions in $[0,1]$ and we are using $L_1$ loss, this is
equivalent, in terms of its true error, to outputting the average
value of $h_1(x), \ldots, h_k(x)$, since the true labels are in $\{0,1\}$. We show
that this k-fold procedure produces a better estimate than the
hold-out procedure, in the sense that the error discrepancy
has smaller absolute moments, and that
Hoeffding bounds still apply.\footnote{Since our bounds compare an
estimate to the hold-out estimate instead of the training error
estimate, they are not sanity-check bounds, so they must be {\em
insanity-check bounds}.}
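For concreteness, the k-fold procedure and its randomized meta-hypothesis can be sketched as follows. This is a minimal illustration, not the paper's code; the \verb|learner| interface (a function mapping a list of labelled examples to a hypothesis $h: X \rightarrow [0,1]$) is our own assumption.

```python
import random

def k_fold(data, k, learner):
    """Return (k-fold error estimate, k-fold meta-hypothesis).

    data    -- list of (x, label) pairs, labels in {0, 1}
    k       -- number of folds (must divide len(data))
    learner -- hypothetical interface: maps a training list to h(x) -> [0, 1]
    """
    n = len(data)
    assert n % k == 0
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    hypotheses, fold_errors = [], []
    for i in range(k):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        random.shuffle(train)   # enforce insensitivity to example ordering
        h = learner(train)
        hypotheses.append(h)
        # observed error of h on the held-out fold i
        fold_errors.append(sum(abs(h(x) - y) for x, y in folds[i]) / len(folds[i]))
    estimate = sum(fold_errors) / k      # the k-fold estimate
    def meta_hypothesis(x):
        # randomly choose one of the k hypotheses on each prediction
        return random.choice(hypotheses)(x)
    return estimate, meta_hypothesis
```

The meta-hypothesis re-randomizes on every call, matching the definition above: its true error is the average of the true errors of the $k$ trained hypotheses.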
The {\em progressive validation procedure}, like the hold-out
procedure, first selects $s$ examples for testing, and the remainder
are for training only. It generates a sequence of $s$ hypotheses,
where the $i$th hypothesis is trained on all of training data {\em
plus} the first $i-1$ examples of the test set, and tested on the
$i$th example of the test set. Repeating this for $1 \leq i \leq s$,
we count the number of mistakes to produce an error estimate. The
hypothesis returned, as above, is a meta-hypothesis which randomly
selects among the $s$ generated hypotheses to make its
prediction. This procedure is very similar to methods used to convert
online to batch learning algorithms \cite{Littlestone89,HW95}, but the
kinds of guarantees we are looking for are somewhat different. In
particular, we argue that the progressive validation procedure gives
as good an estimate as the hold-out procedure with a hold-out of size
$s$, while training on more examples.
\section{PRELIMINARY DEFINITIONS}
Let $X$ be the instance space and let $\mathcal{D}$ be a fixed
distribution over $X$. We also assume a fixed target function $f: X
\longrightarrow \{0,1\}$.
A learning algorithm produces a hypothesis $h: X \longrightarrow [0,1]$. We
allow the range to be $[0,1]$, for convenience, so that we have a notion of
averaging hypotheses.
The error of this hypothesis on a particular example $x \in X$
is $e_h(x) = |h(x)-f(x)|.$
The {\em true error} of this hypothesis is
$\bar{e}_h = E_{x \in \mathcal{D}}[e_h(x)].$
\section{K-FOLD ANALYSIS}
Imagine that we will flip an unfair coin ten times, and we want to estimate
the probability of heads $p$. The full estimator ``$\hat{p}_{10} =
(\mbox{total number of heads})/10$'' seems better than the one-flip estimator
``$\hat{p}_1 = 1$ if the first flip is a head and $\hat{p}_1 = 0$
otherwise'', but in what sense? For
$p=1/100$, the chance that $|\hat{p}_1-p|>0.05$ is 1/100, while the chance
that $|\hat{p}_{10}-p|>0.05$ is nearly 10/100, namely the chance that
any of the flips were heads. Thus, $\hat{p}_{10}$ doesn't completely
dominate $\hat{p}_1$ under every conceivable notion of ``better''.
Instead, what {\em can} be said is that $E\left[|\hat{p}_{10}-p|^m\right]
< E\left[|\hat{p}_1-p|^m\right]$, for all $m \geq 1$. We make a
similar statement
about the k-fold procedure in comparison to a hold-out of size $n/k$.
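The coin-flip comparison is easy to check by simulation. The sketch below (our own illustration; the trial count and seed are arbitrary choices) estimates both moments by Monte Carlo for the ten-flip example:

```python
import random

def moment(estimator, p, m, flips=10, trials=100_000, seed=0):
    """Monte Carlo estimate of E[|estimator(flips) - p|^m]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        heads = [rng.random() < p for _ in range(flips)]
        total += abs(estimator(heads) - p) ** m
    return total / trials

p_hat_10 = lambda heads: sum(heads) / len(heads)  # full estimator
p_hat_1 = lambda heads: float(heads[0])           # one-flip estimator

for m in (1, 2):
    full, single = moment(p_hat_10, 0.01, m), moment(p_hat_1, 0.01, m)
    # the full estimator's m-th absolute moment comes out smaller
    print(f"m={m}: full={full:.5f}  single={single:.5f}")
```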
Say we have a labelled data set of size $n$, with $k$ dividing $n$ (all
probabilities are taken over the draw of the full data set). The k-fold
procedure partitions the data into $k$ folds of size $n/k$; let $h_i$ be the
hypothesis produced by training on all folds but the $i$th, let $\hat{e}_i$
be its observed error on the $i$th fold, and let $\bar{e}_i$ be its true
error. The k-fold estimate is then
$\hat{e}\subK = (\hat{e}_1+\cdots+\hat{e}_k)/k$, and the k-fold hypothesis
has true error $\bar{e}\subK = (\bar{e}_1+\cdots+\bar{e}_k)/k$.
In addition, our proof will need to
assume that the instance space $X$ is finite, and that the learning
algorithm is insensitive to example ordering. This insensitivity can
be enforced in our k-fold procedure simply by shuffling the training
examples before giving them to the learning algorithm, on each of the
$k$ runs.
It is interesting to note that the k-fold estimate can be identical to the
single hold-out estimate if $k=n$ or $k=2$. In the case where $k=n$
(leave-one-out), Kearns and Ron
\cite{KR97} give several nice examples of poor performance.
For instance, a learning algorithm that uses the rule ``if I have seen
an even number of positive examples then predict positive, else
predict negative'' will have the property that no matter what the
data, $\hat{e}_1 = \hat{e}_2 = \cdots = \hat{e}_n$; thus the
leave-one-out estimate will be exactly the same as a hold-out of size
1. Furthermore, if the underlying distribution has 50\%
positive examples, then the true errors will be the same as well.
In the case where $k=2$, an example is as follows. Suppose
that we are to predict the label of integers drawn uniformly in some
range $[1, \ldots, 2t]$,
and the truth is that all labels are 0. Our hypotheses have a single
parameter $p$, predicting $p$ on even integers, and $1-p$ on odd integers,
thus having true error 50\% regardless of $p$. Furthermore, our ``learning''
algorithm chooses $p$ to be the fraction of even examples seen in the input.
Now, if $k=2$, we will have two hypotheses with parameters $p_1$ and $p_2$, and
$\hat{e}_1 = p_1 p_2 + (1-p_1)(1-p_2) = \hat{e}_2$. So the two-fold estimate,
which is identical to the hold-out estimate, is no better an estimate of the
50\% true error.
\begin{theorem}Suppose the example space is finite, our learning algorithm
is insensitive to example ordering, and the hold-out estimate is not always
perfect, i.e.\ $Pr[\hat{e}_1\neq \bar{e}_1]>0$. Then, for
$2 < k < n$ and all $m \geq 1$,
$$E\left[|\hat{e}\subK-\bar{e}\subK|^m\right] <
E\left[|\hat{e}_1-\bar{e}_1|^m\right].$$
\end{theorem}
\qed
It is interesting to consider when the k-fold estimate will be much better
than the hold-out. It is sufficient that $\hat{e}_i-\bar{e}_i$ have a
significant chance of being different from $\hat{e}_j-\bar{e}_j$, i.e.\ that
these variables are not completely correlated. One scenario in which this is
the case is when we have a form of hypothesis stability, which could
guarantee that $\bar{e}_j$ is close to $\bar{e}_i$.
Finally, we show a worst-case type of result, that Hoeffding bounds
can still be used for the k-fold estimate, as if we had just a hold-out
of size $n/k$:
\begin{theorem}
Hoeffding bounds hold as if we used $n/k$ testing examples. In particular,
$$Pr[\hat{e}\subK>\bar{e}\subK+a] \leq e^{-2a^2n/k} \mbox{\ and\ }
Pr[\hat{e}\subK<\bar{e}\subK-a] \leq e^{-2a^2n/k}.$$
\end{theorem}
\noindent
{\bf Proof (sketch).}
The proof of Hoeffding bounds for the standard hold-out case of $\hat{e}_1$ and
$\bar{e}_1$ with a hold-out set of size $s = n/k$, e.g. \cite{AlSp91}, begins by
bounding $E[e^{\lambda s(\hat{e}_1-\bar{e}_1)}].$
Then they use Markov's inequality with this bound,
$$Pr[\hat{e}_1>\bar{e}_1+a] = Pr[e^{\lambda s(\hat{e}_1-\bar{e}_1)}>e^{\lambda
a}]
\leq \frac{E[e^{\lambda s(\hat{e}_1-\bar{e}_1)}]}{e^{\lambda sa}}.$$
However, since $e^{\lambda sx}$ is a convex function of $x$, Jensen's
inequality implies that,
\begin{eqnarray*}
e^{\lambda s(\hat{e}\subK-\bar{e}\subK)} &=&
e^{\frac{\lambda s}{k}(\hat{e}_1-\bar{e}_1+\cdots+\hat{e}_k-\bar{e}_k)} \\
&\leq&
\frac{e^{\lambda s(\hat{e}_1-\bar{e}_1)}+\cdots+e^{\lambda
s(\hat{e}_k-\bar{e}_k)}}{k}.
\end{eqnarray*}
Thus, since by insensitivity to example ordering each $\hat{e}_i-\bar{e}_i$
has the same distribution,
$E[e^{\lambda s(\hat{e}\subK-\bar{e}\subK)}] \leq E[e^{\lambda s(\hat{e}_1-\bar{e}_1)}]$, and
the proof goes through. \qed
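The Jensen step above can be sanity-checked numerically: for any reals $x_1, \ldots, x_k$ standing in for the $\hat{e}_i - \bar{e}_i$, and any $\lambda s > 0$, the exponential of the average is at most the average of the exponentials. A small illustrative check (our own, not part of the proof):

```python
import math
import random

rng = random.Random(1)
for _ in range(1000):
    k = rng.randint(2, 10)
    lam_s = rng.uniform(0.1, 5.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]  # stand-ins for e_i discrepancies
    lhs = math.exp(lam_s * sum(xs) / k)              # e^{(lam*s/k)(x_1+...+x_k)}
    rhs = sum(math.exp(lam_s * x) for x in xs) / k   # average of e^{lam*s*x_i}
    assert lhs <= rhs + 1e-12                        # Jensen: exp is convex
```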
\section{PROGRESSIVE VALIDATION ANALYSIS}
Again, we suppose we have a data set of size $n$. This time we break it into
two sets, a training set and a testing set, with the test set
having $s$ elements. In this section, we redefine $h_i$, $\hat{e}_i$, and
$\bar{e}_i$. Hypothesis $h_i$ is generated by training on the
training set and the first $i-1$ elements of the testing set. It is tested on
the $i$th element of the testing set to yield an estimate $\hat{e}_i$ of its
true error, $\bar{e}_i$. The progressive hypothesis chooses
randomly among the $s$ hypotheses to label an example. Thus it has true
error $\bar{e}\subP$, with
$$\bar{e}\subP = \frac{\bar{e}_1+\bar{e}_2+\cdots+\bar{e}_s}{s}.$$
Finally, we let the progressive error estimate be the average
$\hat{e}\subP = (\hat{e}_1+\hat{e}_2+\cdots+\hat{e}_s)/s.$
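As with the k-fold procedure, the progressive validation procedure can be sketched in a few lines. The \verb|learner| interface is again our own illustrative assumption, not from the paper:

```python
import random

def progressive_validation(train, test, learner):
    """Return (progressive error estimate, progressive meta-hypothesis).

    train, test -- lists of (x, label) pairs, labels in {0, 1}
    learner     -- hypothetical interface: maps a training list to h(x) -> [0, 1]
    """
    s = len(test)
    hypotheses, errors = [], []
    for i in range(s):
        # h_{i+1} is trained on the training set plus the first i test examples
        h = learner(train + test[:i])
        hypotheses.append(h)
        x, y = test[i]            # ...and tested on the (i+1)st test example
        errors.append(abs(h(x) - y))
    estimate = sum(errors) / s    # the progressive estimate e_P
    def meta_hypothesis(x):
        # randomly choose among the s generated hypotheses
        return random.choice(hypotheses)(x)
    return estimate, meta_hypothesis
```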
We would like to show that the progressive error with a hold-out of size $s$
is as good an estimate of the true error of the progressive hypothesis as the
hold-out error is of the hold-out hypothesis. First we show that the same
Hoeffding bounds apply:
\begin{theorem}
Hoeffding bounds hold as if we used a hold-out set of size $s$. In particular,
$$Pr[\hat{e}\subP>\bar{e}\subP+a] \leq e^{-2a^2s} \mbox{\ and\ }
Pr[\hat{e}\subP<\bar{e}\subP-a] \leq e^{-2a^2s}.$$
\end{theorem}
Littlestone \cite{Littlestone89} gives a quite detailed proof
of the multiplicative (Chernoff-style) version of this theorem. The
sketch below uses the same basic argument.
\noindent
{\bf Proof (sketch).}
As before, we only need to make a slight modification to the standard proof of
Hoeffding bounds. The standard proof \cite{AlSp91} begins by bounding
$E[e^{\lambda s(\hat{e}\subP-\bar{e}\subP)}].$ This bound
is achieved by writing $e^{\lambda s(\hat{e}\subP-\bar{e}\subP)}$ as a
product of $s$ terms $e^{\lambda Y_i}$, with
\begin{eqnarray}
E[e^{\lambda Y_i}]\leq e^{\lambda^2/8}. \label{eqn:bound}
\end{eqnarray}
The $Y_i$'s in the standard proof are independent variables, perhaps
coin flips, adjusted so that they each have mean 0.
In our setting, $Y_i$ corresponds to $\hat{e}_i-\bar{e}_i$, the error
discrepancy of the $i$th hypothesis with the $i$th measurement. In the
ordinary setting these
$Y_i$'s are independent because the $i$th hypothesis doesn't depend on any
previous data in the hold-out. In our setting, they are not independent.
But, while the previous data in the hold-out may not be independent of the
hypothesis, it is independent of the $i$th hold-out example. Since
(\ref{eqn:bound}) holds regardless of the hypothesis, we still have that
$E[e^{\lambda Y_i}|Y_1,Y_2,\ldots,Y_{i-1}] \leq e^{\lambda^2/8}$.
Now, for two non-negative random variables $A$ and $B$, with
$E[A] \leq c_1$ and $E[B|A]\leq c_2$, it is true that $E[AB] \leq
E[Ac_2] \leq c_1c_2$.
Thus, by induction, even though the $Y_i$'s aren't independent,
$E[e^{\lambda s(\hat{e}\subP-\bar{e}\subP)}] = E[\prod e^{\lambda Y_i}]
\leq e^{\lambda^2s/8}$, which is all that is needed for the proof.
\qed
In fact, if we consider just the variance (the second moment) we can
make a stronger statement. In particular, the variance of the
progressive validation estimate, with respect to the true error of the
progressive validation hypothesis, is no worse than the variance of an
estimate produced by testing the progressive validation hypothesis on
a new, extra, hold-out of size $s$.
\begin{theorem}
Let $\hat{e}\subP'$ be an estimate of the progressive validation
hypothesis's error measured on a new, independently chosen hold-out of
size $s$.
Then,
$$E[(\hat{e}\subP-\bar{e}\subP)^2] \leq E[(\hat{e}\subP'-\bar{e}\subP)^2].$$
\end{theorem}
\noindent
{\bf Proof.} Both quantities above are averages of $s$ terms. The RHS is the
variance of the sum of independent terms, which is the sum of the variances.
Each of these i.i.d. terms has a $1/s$ chance of being distributed like
$\hat{e}_i$, for each $i$. Thus the RHS is
$$\frac{E[\sum_{i=1}^s(\hat{e}_i-\bar{e}\subP)^2]}{s^2}
= \frac{E[\sum_{i=1}^s (\hat{e}_i^2-\bar{e}\subP^2)]}{s^2}.$$ The LHS is $E[(\hat{e}\subP-\bar{e}\subP)^2] =
E[(\sum_{i=1}^s(\hat{e}_i-\bar{e}_i))^2]/s^2.$ We would like to again use the fact that
the variance of a sum of independent terms is the sum of variances. While
these terms
are not independent, we do have the martingale-like property that
$E[\hat{e}_j-\bar{e}_j|\hat{e}_i-\bar{e}_i] = 0$ for $i < j$.