%% This LaTeX-file was created by Wed Aug 5 22:55:48 1998
%% LyX 0.12 (C) 1995-1998 by Matthias Ettrich and the LyX Team
%% Do not edit this file unless you know what you are doing.
\documentclass{article}
\usepackage[T1]{fontenc}
\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\spacefactor1000}
\makeatother
\begin{document}
\title{An analysis of microchoice learning algorithms}
\author{John Langford \\ jcl@cs.cmu.edu}
\date{8/5/98}
\maketitle
Consider a learning algorithm that proceeds by making a sequence of small choices
which move the algorithm through the hypothesis space. If the true error for
each (small) hypothesis space can be bounded, then we can bound the true error
for all the choices the learning algorithm makes by a triangle inequality.
The proof of such a bound is straightforward, but the application to a particular
algorithm may not be.
\section{Related work}
Pedro Domingos \cite{1} works in essentially the same domain, deriving a heuristic
which attempts to approximate the generalization error. The difference in this
work is that I want to derive a bound rather than a heuristic. Yoav Freund \cite{2}
worked on self-bounding learning algorithms, which is a more general setting
than the one I am considering here.
\section{A (trivial) theorem to bound the error of many choices}
\subsection{Starting point}
Take as a starting point one of the generalized PAC theorems by Haussler:
\paragraph{Theorem: Assume that \protect\( H_{I}\protect \) and the loss function, \protect\( l\protect \),
are such that \protect\( H_{I}^{l}\protect \) is permissible. Let \protect\( P\protect \)
be any probability distribution on \protect\( Z\protect \). Assume that \protect\( N\geq 1,\protect \)
\protect\( \nu >0,\protect \) and \protect\( 0<\alpha <1\protect \). Let \protect\( z_{N}\protect \)
be generated by \protect\( N\protect \) independent draws from \protect\( Z\protect \)
according to \protect\( P\protect \). Then \protect\( P(\exists H_{i}\in H_{I}:d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\alpha )\leq 4C(\frac{\alpha \nu }{8},H_{I})e^{\frac{-\alpha ^{2}\nu N}{8}}\protect \). }
This is essentially the standard PAC theorem, generalized in several ways. For
any nonobvious definitions, see http://www.cs.cmu.edu/\~{}jcl/ltol/staged.ps.
It's useful to put this equation into a simpler form. \( E(H_{i}^{l},P)>\hat{E}(H_{i}^{l},z_{N})+\epsilon \)
implies \( d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\frac{\epsilon }{2+\nu }=\alpha \).
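The implication can be checked directly, assuming Haussler's metric \( d_{\nu }(r,s)=\frac{|r-s|}{\nu +r+s} \)
and losses bounded in \( [0,1] \): if \( E(H_{i}^{l},P)>\hat{E}(H_{i}^{l},z_{N})+\epsilon \),
then, since both errors lie in \( [0,1] \),
\[
d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))=\frac{E(H^{l}_{i},P)-\hat{E}(H^{l}_{i},z_{N})}{\nu +E(H^{l}_{i},P)+\hat{E}(H^{l}_{i},z_{N})}>\frac{\epsilon }{\nu +2}.\]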
Choosing \( \alpha =\frac{\epsilon }{3},\nu =1 \) (a convenient, nearly optimal
choice of \( \alpha \) and \( \nu \)) implies:
\[
P(\exists H_{i}\in H_{I}:E(H^{l}_{i},P)>\hat{E}(H^{l}_{i},z_{N})+\epsilon )\leq 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}\]
Letting \( 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}=\delta \)
and solving for \( \epsilon \), we get: \( \epsilon ^{2}=-\frac{72}{N}\ln \frac{\delta }{4C(\frac{\epsilon }{24},H_{I})} \)
which implies \( \epsilon =\sqrt{\frac{72}{N}(\ln \frac{1}{\delta }+\ln 4C(\frac{\epsilon }{24},H_{I}))} \),
which implies:
\[
\overline{er}(H_{i})\leq \widehat{er}(H_{i},S)+6\sqrt{2}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }}{N}}\]
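As a quick numerical sanity check (my own sketch, not part of the derivation),
the right-hand side can be evaluated once the capacity term is known. Here the
capacity \( C(\frac{\epsilon }{24},H_{I}) \) is simply treated as a given number;
the value \( 10^{6} \), the sample size, and the empirical error below are all hypothetical.

```python
import math

def haussler_bound(emp_error, capacity, n, delta):
    """Evaluate the simplified bound
        er <= emp_error + 6*sqrt(2)*sqrt((ln(4*C) + ln(1/delta)) / N),
    treating the capacity C(eps/24, H_I) as a known number."""
    eps = 6 * math.sqrt(2) * math.sqrt(
        (math.log(4 * capacity) + math.log(1 / delta)) / n)
    return emp_error + eps

# Hypothetical numbers: capacity 1e6, N = 10000 samples, delta = 0.05.
print(haussler_bound(0.10, 1e6, 10_000, 0.05))
```

As expected, the deviation term shrinks like \( \frac{1}{\sqrt{N}} \), so quadrupling
the sample size halves it.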
\subsection{The plan}
Now, if we have an algorithm (such as backprop) which makes a sequence of choices
to move through a large hypothesis space, there are two ways to bound the error.
One is to find the capacity, \( C(\epsilon ,H_{I}) \), of the
space of all reachable hypotheses and make an a priori bound. Unfortunately,
this is not very good in almost all cases, for several reasons:
\begin{enumerate}
\item The \( H_{I} \) used may just be a (loose) bound on the real \( H_{I} \),
the set of all hypotheses which the learning algorithm could possibly reach.
\item The learning algorithm will often terminate at some point, quite possibly well
before it can explore the full \( H_{I} \).
\end{enumerate}
If we instead look at the algorithm as making a sequence of choices in hypothesis
spaces, \( H_{I1},H_{I2},...,H_{IM} \), any one of which may go wrong, then
it may be possible to give a better bound for the true error of the learning
algorithm. If we apply the PAC theorem above to each individual hypothesis space
with \( \alpha \rightarrow \frac{\alpha }{M} \) and \( \delta \rightarrow \frac{\delta }{M} \),
then using the triangle inequality we get:
\paragraph{Simple theorem: If \protect\( \overline{er}(H_{Im})\leq \widehat{er}(H_{Im},S)+6\sqrt{2}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24M},H_{Im})+\ln \frac{M}{\delta }}{N}}\protect \)
for each hypothesis space a choice is made in, \protect\( H_{Im}\protect \),
then the final output hypothesis has error bounded by \protect\( \overline{er}(H_{final})\leq \widehat{er}(H_{final},S)+6\sqrt{2}\sum ^{M}_{m=1}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24M},H_{Im})+\ln \frac{M}{\delta }}{N}}\protect \)
with probability \protect\( 1-\delta \protect \).}
Proof: the triangle inequality applied \( M-1 \) times, together with a union
bound over the \( M \) choice spaces.
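To make the bookkeeping concrete, here is a small sketch (mine, with hypothetical
capacities and sample size) that accumulates per-choice deviation terms of the form
\( 6\sqrt{2}\sqrt{(\ln 4C_{m}+\ln \frac{M}{\delta })/N} \), one per choice space at
confidence \( \frac{\delta }{M} \), and sums them by the triangle inequality:

```python
import math

def choice_sequence_bound(emp_error, capacities, n, delta):
    """Each of the M choice spaces gets confidence delta/M and
    contributes a deviation term
        6*sqrt(2) * sqrt((ln(4*C_m) + ln(M/delta)) / N);
    the final bound sums these by the triangle inequality."""
    m_total = len(capacities)
    deviation = 0.0
    for c in capacities:
        deviation += 6 * math.sqrt(2) * math.sqrt(
            (math.log(4 * c) + math.log(m_total / delta)) / n)
    return emp_error + deviation

# Hypothetical: three choice spaces with capacities 100, 1000, and 10000.
print(choice_sequence_bound(0.05, [100, 1000, 10000], 10**6, 0.05))
```

Note that each small space contributes only its own (small) capacity term, which
is the whole point of bounding the choices individually.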
\subsection{Simple extensions}
There are several simple extensions to this theorem.
\begin{enumerate}
\item The confidence can be divided nonuniformly amongst the individual choice spaces,
so that for space \( H_{Im} \) the chance of a ``bad'' hypothesis is bounded
by \( a_{m}\delta \) with \( \sum _{m}a_{m}=1 \).
\item The error can be split up in a similar manner, allocating \( b_{m}\epsilon \)
to space \( H_{Im} \) with \( \sum _{m}b_{m}=1 \).
\item Take as a starting point, Dave McAllester's PAC-Bayes theorem from COLT98. This
is sometimes tighter than Haussler's theorem and allows the incorporation of
a prior into the learning algorithm.
\end{enumerate}
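The first extension is easy to illustrate. The sketch below (mine; the capacities,
weights, and sample size are hypothetical) replaces the uniform \( \frac{\delta }{M} \)
allocation with weights \( a_{m} \) summing to one, so that space \( H_{Im} \) gets
confidence \( a_{m}\delta \):

```python
import math

def nonuniform_bound(emp_error, capacities, weights, n, delta):
    """Split the confidence nonuniformly: the m-th choice space gets
    confidence a_m * delta (with sum_m a_m = 1), so its deviation term
    uses ln(1/(a_m*delta)) in place of ln(M/delta)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    deviation = 0.0
    for c, a in zip(capacities, weights):
        deviation += 6 * math.sqrt(2) * math.sqrt(
            (math.log(4 * c) + math.log(1 / (a * delta))) / n)
    return emp_error + deviation

# Hypothetical: spend more of the confidence on the larger choice spaces.
uniform = nonuniform_bound(0.05, [100, 1000, 10000], [1/3, 1/3, 1/3], 10**6, 0.05)
skewed = nonuniform_bound(0.05, [100, 1000, 10000], [0.1, 0.3, 0.6], 10**6, 0.05)
print(uniform, skewed)
```

With \( a_{m}=\frac{1}{M} \) this recovers the uniform split exactly; a good
nonuniform allocation helps only when it reflects where the capacity actually is.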
\subsection{More complicated improvements}
Because the PAC bounds hold for a range of \( (\epsilon ,\delta ) \)
values, simply adding together the per-choice bounds on these random variables is not optimal.
Instead, a convolution could be done at each step to find a distribution
over errors at the final step, which could then be converted into a better \( (\epsilon , \)
\( \delta ) \) bound.
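A toy version of the idea (mine; the histograms are hypothetical) convolves two
discretized per-choice deviation distributions and reads an \( (\epsilon ,\delta ) \)
pair off the tail of the resulting distribution of the sum:

```python
def convolve(p, q):
    """Discrete convolution: the distribution of the sum of two
    independent binned deviation variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# Hypothetical per-choice deviation histograms on a common error grid.
p1 = [0.7, 0.2, 0.1]
p2 = [0.6, 0.3, 0.1]
p_sum = convolve(p1, p2)

# Convert to an (epsilon, delta) pair: find the smallest error bin whose
# upper tail has mass at most delta.
delta = 0.05
tail = 1.0  # P(sum >= bin 0)
for b, mass in enumerate(p_sum):
    if tail <= delta:
        break
    tail -= mass
print(p_sum, b)
```

A full version would repeat the convolution across all \( M \) choice
distributions on a shared, finer error grid.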
\subsection{Application to learning algorithms}
These bounds may (or may not...) be meaningful for real machine learning algorithms.
Consider standard backprop. If we instrument backprop so that at every update
of the weights the algorithm first finds the size of the hypothesis
space from which it is choosing, then we can give an
alternative to the normal (terribly loose) bound without interfering with the
algorithm. Fortunately, it is not necessary to enumerate the full choice space:
random draws from a space of unknown size should be able to determine the
size of the space in \( O(\sqrt{C(\epsilon ,H_{Im})}) \) draws, which hopefully isn't
too large. I'm planning to look into whether or not this process can be useful.
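The \( O(\sqrt{C}) \) claim is the birthday paradox: drawing uniformly from a set
of size \( n \), the first repeated draw tends to appear after about \( \sqrt{\pi n/2} \)
draws, so the index of the first collision gives a (crude, high-variance) size
estimate. A toy sketch, mine, assuming uniform draws from a finite choice space:

```python
import math
import random

def estimate_size(draw, max_draws=10**6):
    """Crude birthday-paradox size estimate: if the first collision
    among uniform draws occurs at draw k, the set size is on the order
    of k**2 (here the point estimate 2*k*k/pi is used)."""
    seen = set()
    for k in range(1, max_draws + 1):
        x = draw()
        if x in seen:
            return 2 * k * k / math.pi
        seen.add(x)
    return float("inf")  # no collision observed

# Hypothetical choice space of size 10000; average many noisy estimates.
random.seed(0)
true_size = 10_000
estimates = [estimate_size(lambda: random.randrange(true_size))
             for _ in range(200)]
print(sum(estimates) / len(estimates))
```

A single-collision estimate is very noisy (its standard deviation is comparable
to the estimate itself), so in practice one would average many restarts or track
several collisions.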
\begin{thebibliography}{}
\bibitem{1}Domingos, Pedro, ``A Process-Oriented Heuristic for Model Selection'', ICML
98, http://www.gia.ist.utl.pt/\~{}pedrod/mlc98.ps.gz
\bibitem{2}Freund, Yoav, ``Self bounding learning algorithms'', COLT 98, http://www.research.att.com/\~{}yoav/papers/lsearch.ps.gz
\end{thebibliography}
\end{document}