#This file was created by Wed Aug 5 22:57:57 1998
#LyX 0.12 (C) 1995-1998 Matthias Ettrich and the LyX Team
\lyxformat 2.15
\textclass article
\language default
\inputencoding default
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default
\layout Author
John Langford
\layout Title
An analysis of microchoice learning algorithms
\layout Standard
Consider a learning algorithm that proceeds by making a sequence of small
choices which move the algorithm through the hypothesis space.
If the true error for each (small) hypothesis space can be bounded, then
we can bound the true error for all the choices that the learning algorithm
makes by a triangle inequality.
The proof of such a bound is straightforward, but the application to a
particular algorithm may not be.
\layout Section
Related work
\layout Standard
Pedro Domingos
\begin_inset LatexCommand \cite{1}
\end_inset
works in essentially the same domain, deriving a heuristic that attempts
to approximate the generalization error.
The difference in this work is that I want to derive a bound rather than
a heuristic.
Yoav Freund
\begin_inset LatexCommand \cite{2}
\end_inset
worked on self-bounding learning algorithms, a more general setting than
the one I am considering here.
\layout Section
A theorem to bound the error of many microchoices
\layout Subsection
Starting point
\layout Standard
Take as a starting point one of the generalized PAC theorems by Haussler:
\layout Paragraph
Theorem: Assume that
\begin_inset Formula \( H_{I} \)
\end_inset
and the loss function,
\begin_inset Formula \( l \)
\end_inset
, are such that
\begin_inset Formula \( H_{I}^{l} \)
\end_inset
is permissible.
Let
\begin_inset Formula \( P \)
\end_inset
be any probability distribution on
\begin_inset Formula \( Z \)
\end_inset
.
Assume that
\begin_inset Formula \( N\geq 1, \)
\end_inset
\begin_inset Formula \( \nu >0, \)
\end_inset
and
\begin_inset Formula \( 0<\alpha <1 \)
\end_inset
.
Let
\begin_inset Formula \( z_{N} \)
\end_inset
be generated by
\begin_inset Formula \( N \)
\end_inset
independent draws from
\begin_inset Formula \( Z \)
\end_inset
according to
\begin_inset Formula \( P \)
\end_inset
.
Then
\begin_inset Formula \( P(\exists H_{i}\in H_{I}:d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\alpha )\leq 4C(\frac{\alpha \nu }{8},H_{I})e^{\frac{-\alpha ^{2}\nu N}{8}} \)
\end_inset
.
\layout Standard
This is essentially the normal PAC theorem generalized in several ways.
For any non-obvious definitions, please see http://www.cs.cmu.edu/~jcl/ltol/staged.ps.
\layout Standard
It's useful to put this equation into a simpler form.
\begin_inset Formula \( E(H_{i}^{l},P)>\hat{E}(H_{i}^{l},z_{N})+\epsilon \)
\end_inset
implies
\begin_inset Formula \( d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\frac{\epsilon }{2+\nu }=\alpha \)
\end_inset
.
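(To see this, note that, assuming Haussler's standard definition,
\begin_inset Formula \( d_{\nu }(r,s)=\frac{|r-s|}{\nu +r+s} \)
\end_inset
; since both errors lie in
\begin_inset Formula \( [0,1] \)
\end_inset
, the denominator is at most
\begin_inset Formula \( \nu +2 \)
\end_inset
, so a gap of more than
\begin_inset Formula \( \epsilon  \)
\end_inset
 gives
\begin_inset Formula \( d_{\nu }>\frac{\epsilon }{\nu +2} \)
\end_inset
.)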
Maximizing
\begin_inset Formula \( \alpha ^{2}\nu \)
\end_inset
gives the substitutions:
\begin_inset Formula \( \alpha =\frac{\epsilon }{3},\nu =1 \)
\end_inset
, which implies:
\begin_inset Formula
\[
P(\exists H_{i}\in H_{I}:E(H^{l}_{i},P)>\hat{E}(H^{l}_{i},z_{N})+\epsilon )\leq 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}\]
\end_inset
\layout Standard
Letting
\begin_inset Formula \( 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}=\delta \)
\end_inset
and solving for
\begin_inset Formula \( \epsilon \)
\end_inset
, we get:
\begin_inset Formula \( \epsilon ^{2}=-\frac{72}{N}\ln \frac{\delta }{4C(\frac{\epsilon }{24},H_{I})} \)
\end_inset
which implies
\begin_inset Formula \( \epsilon =\sqrt{\frac{72}{N}(\ln \frac{1}{\delta }+\ln 4C(\frac{\epsilon }{24},H_{I}))} \)
\end_inset
, which implies:
\layout Standard
\begin_inset Formula
\[
\overline{er}(H_{i})\leq \widehat{er}(H_{i},S)+6\sqrt{2}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }}{N}}\]
\end_inset
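\layout Standard
As a rough numeric sanity check, the deviation term above can be evaluated
directly.
The sketch below is illustrative: the function name and the sample values
are assumptions, and the log-capacity term is passed in as a single number
rather than computed from a hypothesis space.

```python
import math

def pac_bound(log_capacity, delta, n):
    """Deviation term 6*sqrt(2)*sqrt((ln 4C + ln(1/delta)) / N).

    log_capacity stands in for ln 4C(eps/24, H_I), which is not
    computed here.
    """
    return 6 * math.sqrt(2) * math.sqrt((log_capacity + math.log(1 / delta)) / n)

# Illustrative values: ln 4C = 100, delta = 0.01, N = 100000 samples.
eps = pac_bound(log_capacity=100.0, delta=0.01, n=100_000)
print(round(eps, 3))
```

As usual for uniform-convergence bounds, the deviation only becomes small
once the sample size is large relative to the log-capacity.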
\layout Subsection
The plan
\layout Standard
Now, if we have an algorithm (such as backprop) which makes a sequence of
choices to move through a large hypothesis space, there are two ways to
bound the error.
One is to find the capacity,
\begin_inset Formula \( C(\epsilon ,H_{I}) \)
\end_inset
, of the space of all reachable hypotheses and derive an a priori bound.
Unfortunately, this is not very good in almost all cases, for several reasons:
\layout Enumerate
The
\begin_inset Formula \( H_{I} \)
\end_inset
used may just be a (loose) bound on the real
\begin_inset Formula \( H_{I} \)
\end_inset
= the set of all hypotheses which the learning algorithm could possibly reach.
\layout Enumerate
The learning algorithm will often terminate at some point, quite possibly
well before it can explore the full
\begin_inset Formula \( H_{I} \)
\end_inset
.
\layout Standard
If we instead look at the algorithm as making a sequence of choices in
hypothesis spaces,
\begin_inset Formula \( H_{I1},H_{I2},...,H_{IM} \)
\end_inset
, any one of which may go wrong, then it may be possible to give a better
bound on the true error of the learning algorithm.
If we apply the PAC theorem above to each individual hypothesis space with
\begin_inset Formula \( \alpha \rightarrow \frac{\alpha }{M} \)
\end_inset
and
\begin_inset Formula \( \delta \rightarrow \frac{\delta }{M} \)
\end_inset
, then using the triangle inequality we get:
\layout Paragraph
Simple theorem: If
\begin_inset Formula \( \overline{er}(H_{im})\leq \widehat{er}(H_{im},S)+\frac{6\sqrt{2}}{M}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24M},H_{Im})+\ln \frac{M}{\delta }}{N}} \)
\end_inset
 for each hypothesis space,
\begin_inset Formula \( H_{Im} \)
\end_inset
, in which a choice is made, then the final output hypothesis has error
bounded by
\begin_inset Formula \( \overline{er}(H_{final})\leq \widehat{er}(H_{final},S)+\frac{6\sqrt{2}}{M}\sum ^{M}_{m=1}\sqrt{\frac{\ln 4C(\frac{\epsilon }{24M},H_{Im})+\ln \frac{M}{\delta }}{N}} \)
\end_inset
 with probability
\begin_inset Formula \( 1-\delta  \)
\end_inset
.
\layout Standard
Proof: triangle inequality applied
\begin_inset Formula \( M-1 \)
\end_inset
times.
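\layout Standard
In slightly more detail (a sketch): by the union bound, with probability
at least
\begin_inset Formula \( 1-\delta  \)
\end_inset
 the per-space bound holds simultaneously for all
\begin_inset Formula \( M \)
\end_inset
 choice spaces; on that event, chaining the per-choice deviations through
the hypotheses visited at each step, the triangle inequality bounds the
total deviation of
\begin_inset Formula \( H_{final} \)
\end_inset
 by the sum of the
\begin_inset Formula \( M \)
\end_inset
 individual terms.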
\layout Subsection
Simple extensions
\layout Standard
There are several simple extensions to this theorem.
\layout Enumerate
The confidence can be divided nonuniformly amongst the individual choice
spaces, so that for space
\begin_inset Formula \( H_{Im} \)
\end_inset
 the chance of a
\begin_inset Quotes eld
\end_inset
bad
\begin_inset Quotes erd
\end_inset
 hypothesis is bounded by
\begin_inset Formula \( a_{m}\delta  \)
\end_inset
 with
\begin_inset Formula \( \sum _{m}a_{m}=1 \)
\end_inset
.
\layout Enumerate
The error can also be split up in a similar manner, allocating
\begin_inset Formula \( b_{m}\epsilon  \)
\end_inset
 to each space such that
\begin_inset Formula \( \sum _{m}b_{m}=1 \)
\end_inset
.
\layout Enumerate
Take as a starting point Dave McAllester's PAC-Bayes theorem from COLT 98.
This is sometimes tighter than Haussler's theorem and allows the incorporation
of a prior into the learning algorithm.
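\layout Standard
Combining the first two extensions, and assuming the same proof goes through,
the bound would read (a sketch, not a proven statement):
\begin_inset Formula
\[
\overline{er}(H_{final})\leq \widehat{er}(H_{final},S)+6\sqrt{2}\sum ^{M}_{m=1}b_{m}\sqrt{\frac{\ln 4C(\frac{b_{m}\epsilon }{24},H_{Im})+\ln \frac{1}{a_{m}\delta }}{N}}\]
\end_inset
which recovers the simple theorem when
\begin_inset Formula \( a_{m}=b_{m}=\frac{1}{M} \)
\end_inset
.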
\layout Subsection
More complicated improvements
\layout Standard
Because the PAC bounds hold for a range of
\begin_inset Formula \( \epsilon \)
\end_inset
,
\begin_inset Formula \( \delta \)
\end_inset
values, simply adding together these bounds on random variables is not
optimal.
Instead, a convolution could be done at each step, yielding a distribution
over errors at the final step, which could then be converted into a better
\begin_inset Formula \( (\epsilon , \)
\end_inset
\begin_inset Formula \( \delta ) \)
\end_inset
bound.
\layout Subsection
Application to learning algorithms
\layout Standard
These bounds may (or may not...) be meaningful for real machine learning algorithms.
Consider standard backprop.
If we instrument backprop so that, at every update of the weights, the
algorithm first estimates the size of the hypothesis space from which it
is choosing, then we could give an alternative to the normal (terribly
loose) bound without interfering with the algorithm.
Fortunately, it is not necessary to enumerate the full choice space: random
draws from a space of unknown size should be able to determine the size
of the space in
\begin_inset Formula \( O(\sqrt{C(\epsilon ,H_{Im})}) \)
\end_inset
 draws, which hopefully isn't too large.
I'm planning to look into whether or not this process can be useful.
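\layout Standard
The collision-counting idea can be sketched as follows.
This is a toy illustration, not an instrumented learner: the estimator,
its parameters, and the sample sizes are all assumptions.
Drawing
\begin_inset Formula \( k \)
\end_inset
 uniform samples from a set of unknown size
\begin_inset Formula \( C \)
\end_inset
 yields about
\begin_inset Formula \( \frac{k(k-1)}{2C} \)
\end_inset
 colliding pairs, so counting collisions estimates
\begin_inset Formula \( C \)
\end_inset
 after roughly
\begin_inset Formula \( O(\sqrt{C}) \)
\end_inset
 draws.

```python
import random

def estimate_size(draw, num_draws):
    """Estimate the size of a finite set from collisions among uniform draws.

    With k draws from a set of size C, the expected number of colliding
    pairs is about k*(k-1)/(2*C), so C is estimated as k*(k-1)/(2*collisions).
    """
    counts = {}
    for _ in range(num_draws):
        s = draw()
        counts[s] = counts.get(s, 0) + 1
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None  # too few draws to see a collision; need more than ~sqrt(C)
    return num_draws * (num_draws - 1) / (2 * collisions)

random.seed(0)
true_size = 10_000
estimate = estimate_size(lambda: random.randrange(true_size), num_draws=2_000)
print(estimate)  # close to true_size, up to sampling noise
```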
\layout Bibliography
\bibitem {1}
Domingos, Pedro,
\begin_inset Quotes eld
\end_inset
A Process-Oriented Heuristic for Model Selection
\begin_inset Quotes erd
\end_inset
, ICML 98, http://www.gia.ist.utl.pt/~pedrod/mlc98.ps.gz
\layout Bibliography
\bibitem {2}
Freund, Yoav,
\begin_inset Quotes eld
\end_inset
Self bounding learning algorithms
\begin_inset Quotes erd
\end_inset
, COLT 98, http://www.research.att.com/~yoav/papers/lsearch.ps.gz
\the_end