%% This LaTeX file was created on Tue Aug 11 13:44:23 1998
%% LyX 0.12 (C) 1995-1998 by Matthias Ettrich and the LyX Team
%% Do not edit this file unless you know what you are doing.
\documentclass{article}
\usepackage[T1]{fontenc}
\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\spacefactor1000}
\makeatother
\begin{document}
John Langford, jcl@cs.cmu.edu 8/4/98
These are some thoughts on the recent ``PAC-Bayesian Theorems'' paper by David
McAllester, appearing in COLT 1998.
\section{bound improvement}
I think the bounds can be slightly improved. At one point, the equation:
\[
\epsilon (U)\leq (1-\frac{1}{m})\frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta \beta }+\ln m}{(1-\beta )m}+\frac{1}{m}\]
is simplified to:
\[
\epsilon (U)\leq \frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta \beta }+\ln m}{(1-\beta )m}+\frac{1}{m}\]
then, \( \beta =\frac{1}{m} \) is substituted to get:
\[
\epsilon (U)\leq \frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta }+2\ln m}{m-1}+\frac{1}{m}\]
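The substitution uses only the identities
\[
\ln \frac{1}{\delta \beta }=\ln \frac{1}{\delta }+\ln m\qquad \textrm{and}\qquad (1-\beta )m=m-1\qquad \textrm{when }\beta =\frac{1}{m}.\]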
Instead, we should be able to avoid the simplification and substitute \( \beta =\frac{1}{m} \)
directly to get:
\[
\epsilon (U)\leq (1-\frac{1}{m})\frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta }+2\ln m}{(1-\frac{1}{m})m}+\frac{1}{m}\]
\[
=\frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta }+2\ln m+1}{m}\]
slightly improving the bound.
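To quantify the gain, abbreviate \( X=\ln \frac{1}{P(U)}+\ln \frac{1}{\delta }+2\ln m \); the difference between the original bound and the new one is
\[
\frac{X}{m-1}+\frac{1}{m}-\frac{X+1}{m}=X\left( \frac{1}{m-1}-\frac{1}{m}\right) =\frac{X}{m(m-1)}\]
so the improvement is of order \( \frac{\ln m}{m^{2}} \).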
\section{Relationship to PAC theorems}
\subsection{goal}
A specialization of the PAC-Bayes result can be made to arrive at something
very similar to a (generalized) PAC theorem. The goal here is to arrive at a bound
similar to Haussler's generalized PAC theorem found in ``Machine Learning Theory'':
\subparagraph{Theorem: Assume that \protect\( H_{I}\protect \) and the loss function, \protect\( l\protect \),
are such that \protect\( H_{I}^{l}\protect \) is permissible. Let \protect\( P\protect \)
be any probability distribution on \protect\( Z\protect \). Assume that \protect\( N\geq 1,\protect \)
\protect\( \nu >0,\protect \) and \protect\( 0<\alpha <1\protect \). Let \protect\( z_{N}\protect \)
be generated by \protect\( N\protect \) independent draws from \protect\( Z\protect \)
according to \protect\( P\protect \). Then \protect\( P(\exists H_{i}\in H_{I}:d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\alpha )\leq 4C(\frac{\alpha \nu }{8},H_{I})e^{\frac{-\alpha ^{2}\nu N}{8}}\protect \). }
For the definitions of the quantities used here, see:
http://www.cs.cmu.edu/\~{}jcl/ltol/staged.ps
It's useful to put this theorem into the same form as the PAC-Bayes theorem.
\( E(H_{i}^{l},P)>\hat{E}(H_{i}^{l},z_{N})+\epsilon \) implies \( d_{\nu }(E(H^{l}_{i},P),\hat{E}(H^{l}_{i},z_{N}))>\frac{\epsilon }{2+\nu }=\alpha \).
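This implication can be checked directly, assuming Haussler's definition \( d_{\nu }(r,s)=\frac{|r-s|}{\nu +r+s} \) and losses normalized to \( [0,1] \) (so that \( E+\hat{E}\leq 2 \)):
\[
d_{\nu }(E(H_{i}^{l},P),\hat{E}(H_{i}^{l},z_{N}))=\frac{E(H_{i}^{l},P)-\hat{E}(H_{i}^{l},z_{N})}{\nu +E(H_{i}^{l},P)+\hat{E}(H_{i}^{l},z_{N})}>\frac{\epsilon }{2+\nu }.\]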
Maximizing \( \alpha ^{2}\nu \) gives the substitutions: \( \alpha =\frac{\epsilon }{3},\nu =1 \),
which implies:
\[
P(\exists H_{i}\in H_{I}:E(H^{l}_{i},P)>\hat{E}(H^{l}_{i},z_{N})+\epsilon )\leq 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}\]
Letting \( 4C(\frac{\epsilon }{24},H_{I})e^{\frac{-\epsilon ^{2}N}{72}}=\delta \)
and solving for \( \epsilon \), we get \( \epsilon ^{2}=-\frac{72}{N}\ln \frac{\delta }{4C(\frac{\epsilon }{24},H_{I})} \),
which implies \( \epsilon =\sqrt{\frac{72}{N}\left( \ln \frac{1}{\delta }+\ln \left( 4C(\frac{\epsilon }{24},H_{I})\right) \right) } \),
or, using \( \sqrt{72}=6\sqrt{2} \), writing \( m \) for \( N \), and dropping the \( \ln 4 \) term, in the language of McAllester's paper:
\[
\overline{l}(c)\leq \hat{l}(c,S)+6\sqrt{2}\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }}{m}}\]
\subsection{starting point}
The starting point in the specialization is David's PAC-Bayes theorem for the
agnostic case which states: \( \forall \delta >0\textrm{ }\forall ^{\delta }S\textrm{ }\forall U:P(U)>0 \)
\[
\overline{l}(U)\leq \hat{l}(U,S)+\sqrt{\frac{\ln \frac{1}{P(U)}+\ln \frac{1}{\delta }+2\ln m}{2m}}+\frac{1}{m}\]
(here I've included the slight improvement from the first section)
Since this applies for all measurable subsets, let's consider a partition of
the hypothesis space which realizes the cover, \( C(\frac{\epsilon }{24},H_{I}) \).
With our hypothesis space split into \( C(\frac{\epsilon }{24},H_{I}) \) subsets,
assign a uniform probability to each set so that \( P(U)=\frac{1}{C(\frac{\epsilon }{24},H_{I})} \).
In this case, the PAC-Bayes theorem will specialize to:
\[
\overline{l}(U)\leq \hat{l}(U,S)+\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{2m}}+\frac{1}{m}\]
Let's assume all of the probability mass within a set is placed on its central
element, giving:
\[
\overline{l}(c_{center})\leq \hat{l}(c_{center},S)+\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{2m}}+\frac{1}{m}\]
Now, because the cover, \( C(\frac{\epsilon }{24},H_{I}) \), is defined so
that every hypothesis is within error \( \frac{\epsilon }{24} \)
of the central element of some set, we know that \( \overline{l}(c)-\frac{\epsilon }{24}\leq \overline{l}(c_{center}) \),
implying:
\[
\overline{l}(c)\leq \hat{l}(c_{center},S)+\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{2m}}+\frac{1}{m}+\frac{\epsilon }{24}\]
We need to relate \( \hat{l}(c_{center},S) \) to \( \hat{l}(c,S) \). Because
the cover, \( C(\frac{\epsilon }{24},H_{I}) \), involves a maximum over all
probability distributions, we know that \( |\hat{l}(c_{center},z)-\hat{l}(c,z)|<\frac{\epsilon }{24} \)
for any single example \( z \). Proof: suppose, for contradiction, that \( |\hat{l}(c_{center},z)-\hat{l}(c,z)|\geq \frac{\epsilon }{24} \);
then the probability distribution which places all of its mass on \( z \)
violates the guarantee that \( |\overline{l}(c)-\overline{l}(c_{center})|<\frac{\epsilon }{24} \).
Since the difference in losses is less than \( \frac{\epsilon }{24} \) for
every example, we know that \( |\hat{l}(c_{center},S)-\hat{l}(c,S)|<\frac{\epsilon }{24} \),
implying:
\[
\overline{l}(c)\leq \hat{l}(c,S)+\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{2m}}+\frac{1}{m}+\frac{\epsilon }{12}\]
This equation is a little bit tricky because it essentially has \( \epsilon \)
defined in terms of itself:
\[
\epsilon =\frac{1}{\sqrt{2}}\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{m}}+\frac{1}{m}+\frac{\epsilon }{12}\]
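Abbreviating the square-root term as \( A \), the self-reference through \( \frac{\epsilon }{12} \) is linear and can be eliminated exactly:
\[
\epsilon =A+\frac{1}{m}+\frac{\epsilon }{12}\Longrightarrow \frac{11}{12}\epsilon =A+\frac{1}{m}\Longrightarrow \epsilon =\frac{12}{11}A+\frac{12}{11}\frac{1}{m}.\]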
Solving to eliminate the easy dependencies gives:
\[
\overline{l}(c)\leq \hat{l}(c,S)+\frac{12}{11\sqrt{2}}\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }+2\ln m}{m}}+\frac{12}{11}\frac{1}{m}\]
which is remarkably similar to the PAC result:
\[
\overline{l}(c)\leq \hat{l}(c,S)+6\sqrt{2}\sqrt{\frac{\ln C(\frac{\epsilon }{24},H_{I})+\ln \frac{1}{\delta }}{m}}\]
\subsection{comparison}
Haussler's result has a worse constant but fewer terms. For (very) large \( m \),
McAllester's bound will become:
\[
\overline{l}(c)\leq \hat{l}(c,S)+\frac{12}{11}\sqrt{\frac{\ln m}{m}}\]
which is looser than Haussler's bound. For \( m\simeq \ln (C(\frac{\epsilon }{24},H_{I})\frac{1}{\delta })\gg 1 \),
McAllester's bound is tighter by a factor of around \( 11 \). For small \( m \)
near \( 1 \), Haussler's bound is again tighter.
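The factor of \( 11 \) is just the ratio of the leading constants in the two bounds:
\[
\frac{6\sqrt{2}}{\frac{12}{11\sqrt{2}}}=\frac{6\sqrt{2}\cdot 11\sqrt{2}}{12}=11.\]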
It is interesting that a specialization of a more general theorem can lead to
a tighter bound.
\end{document}