%% This LaTeX-file was created by Wed Aug 12 21:09:56 1998
%% LyX 0.12 (C) 1995-1998 by Matthias Ettrich and the LyX Team
%% Do not edit this file unless you know what you are doing.
\documentclass{article}
\usepackage[T1]{fontenc}
\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\spacefactor1000}
\makeatother
\begin{document}
\author{John Langford jcl@cs.cmu.edu 8/10/98}
\title{Adaptations of Yoav's self-bounding learning algorithms}
\maketitle
\section{Theorem 1}
Theorem 1 states that \( \forall \alpha >0,\forall \epsilon >0 \) the generalization
error of the final hypothesis is bounded by:
\[
P_{S\sim D^{m}}[err(h^{*})-\widehat{err}(h^{*})>\epsilon ]\leq 2|Q(T_{A}(D,\alpha ))|e^{-2m\alpha ^{2}}+|H(T_{A}(D,\alpha ))|e^{-2m\epsilon ^{2}}\]
The first term bounds the probability that some query returns an answer
`bad' enough to step off the tree, \( T_{A}(D,\alpha ) \).
The second term bounds the (agnostic) difference between the empirical and true
error of the hypothesis.
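As an illustrative sketch (not part of the original theorem), the right-hand side of this bound can be evaluated numerically. The function name and argument names below are my own; the example assumes you already know \( |Q(T_{A}(D,\alpha ))| \) and \( |H(T_{A}(D,\alpha ))| \) for your algorithm.

```python
import math

def theorem1_bound(num_queries, num_hypotheses, m, alpha, epsilon):
    """Evaluate the right-hand side of Theorem 1's bound:

        2|Q| e^{-2 m alpha^2} + |H| e^{-2 m epsilon^2}

    num_queries    -- |Q(T_A(D, alpha))|, queries in the pruned tree
    num_hypotheses -- |H(T_A(D, alpha))|, hypotheses reachable in the tree
    m              -- sample size
    """
    query_term = 2 * num_queries * math.exp(-2 * m * alpha ** 2)
    hypothesis_term = num_hypotheses * math.exp(-2 * m * epsilon ** 2)
    return query_term + hypothesis_term
```

Both terms decay exponentially in \( m \), so the bound becomes nontrivial once \( m \) is large relative to \( \frac{1}{\alpha ^{2}}\log |Q| \) and \( \frac{1}{\epsilon ^{2}}\log |H| \).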
\subsection{First generalization}
We can generalize this theorem to state:
\[
P_{S\sim D^{m}}[err(h^{*})-\widehat{err}(h^{*})>\sqrt{\frac{1}{2m}\log \frac{1}{\delta _{h^{*}}}}]\leq 2|Q(T_{A}(D,\alpha ))|e^{-2m\alpha ^{2}}+\delta \]
where \( \delta =\sum _{h\in H(T_{A}(D,\alpha ))}\delta _{h} \). One usable
choice of \( \delta _{h} \) is \( \frac{\delta }{|Q_{1T}|\, |Q_{2T}|\cdots |Q_{dT}|} \),
where \( Q_{iT} \) is the branching factor (contained in \( T_{A} \)) at step
\( i \) of the algorithm, and the chain of queries leads to the choice of hypothesis
\( h \).
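A small sketch of this apportionment, with hypothetical function names of my own choosing: given the branching factors along the query chain that produced \( h \), compute \( \delta _{h} \) and the resulting deviation \( \sqrt{\frac{1}{2m}\log \frac{1}{\delta _{h}}} \).

```python
import math

def per_hypothesis_delta(delta, branching_factors):
    """delta_h = delta / (|Q_1T| * |Q_2T| * ... * |Q_dT|), where
    branching_factors lists the branching factor at each step of the
    query chain leading to hypothesis h."""
    product = 1
    for b in branching_factors:
        product *= b
    return delta / product

def error_deviation(m, delta_h):
    """The deviation sqrt((1/2m) log(1/delta_h)) appearing in the bound."""
    return math.sqrt(math.log(1.0 / delta_h) / (2 * m))
```

Note that hypotheses reached through chains with small branching factors receive a larger share of \( \delta \), and hence a tighter deviation term.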
\subsection{Second generalization}
I believe (though I am unsure, not having read Kearns's paper) that a similar
trick can be played with the first term to apportion confidence amongst the
various queries in the tree.
\section{A slightly more general microchoice theorem}
In http://www.cs.cmu.edu/\~{}jcl/microchoice/moremicro.ps, section 5, we gave
a theorem that a bound on the generalization error could be given by:
\[
P_{S\sim D^{m}}[err(h^{*})-\widehat{err}(h^{*})>\sqrt{\frac{1}{2m}\log \frac{1}{\delta _{h^{*}}}}]\leq \delta \]
with \( \delta _{h}=\frac{\delta }{|H_{I1}|\, |H_{I2}|\cdots |H_{IM}|} \).
This theorem can be made slightly more general by moving to a setting where
a `candidate hypothesis' is not required. Instead, identify every choice node
in the algorithm; choice nodes are locations where choices that could affect
the final hypothesis are made. For example, one choice node might rest on the
output of a statistical query. Denote these choice nodes by \( C_{i} \). Then
we can apportion our confidence as \( \delta _{h}=\frac{\delta }{|C_{1}|\, |C_{2}|\cdots |C_{d}|} \),
where \( d \) indexes the last choice made, resulting in the choice of hypothesis
\( h \).
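One way to picture this theorem in use (the class below is my own illustrative sketch, not part of the theorem): as the algorithm runs, record the size \( |C_{i}| \) of each choice node it passes through, then convert the accumulated product into a deviation bound for the final hypothesis.

```python
import math

class MicrochoiceTally:
    """Track the sizes |C_i| of the choice nodes visited during a run,
    then turn the product into the deviation bound
    sqrt((1/2m) log(1/delta_h)) with delta_h = delta / (|C_1|...|C_d|)."""

    def __init__(self, delta):
        self.delta = delta
        self.log_product = 0.0  # accumulates sum_i log|C_i|

    def choose(self, num_options):
        """Record passing through a choice node with num_options branches."""
        self.log_product += math.log(num_options)

    def deviation(self, m):
        """log(1/delta_h) = log(1/delta) + sum_i log|C_i|."""
        return math.sqrt(
            (math.log(1.0 / self.delta) + self.log_product) / (2 * m)
        )
```

Working in log space keeps the running product from overflowing when the algorithm makes many choices; only the choices actually made along the run's path contribute, which is the point of the microchoice approach.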
\section{Comparison with Yoav}
The more general microchoice theorem is similar to Yoav's work in that it includes
choice nodes, which allows application to algorithms that use statistical
queries. The more general microchoice theorem is tighter in cases where \( T_{A}(D,\alpha ) \)
includes all possible choices along the path of the algorithm. Yoav's theorem
is tighter in cases where
\begin{itemize}
\item The algorithm happens to choose a final hypothesis with an above-average
inverse measure (this does not hold for the first generalization above).
\item The branching factor of the tree is smaller than the branching factor of the
choice points.
\end{itemize}
The second case can be removed by changing the measure to reflect which choices
are `low probability' choices, using a `luckiness function' across a set of
choices.
\end{document}