\documentclass{article}
\usepackage{nips2002e}
\def\comment#1{}
%\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\newtheorem{thm}{Theorem}[section]
\newtheorem{lem}[thm]{Lemma} %%Delete [thm] to re-start numbering
\newtheorem{cor}[thm]{Corollary} %%Delete [thm] to re-start numbering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\usepackage[T1]{fontenc}
\usepackage{amssymb}
\newenvironment{proof}[1][Proof]{\textbf{#1.} }{\ \rule{0.5em}{0.5em}}
\newcommand{\bfx}{\mathbf{x}}
\newcommand{\bfw}{\mathbf{w}}
\newcommand{\bfe}{\mathbf{e}}
\newcommand{\<}{\langle}
\renewcommand{\>}{\rangle}
\newcommand{\KL}{\mathrm{KL}}
\renewcommand{\Re}{\mathbb{R}}
\newcommand{\PC}{\mathrm{PC}}
\newcommand{\sign}{\mathrm{sign}}

\begin{document}

\title{PAC-Bayes \& Margins}

\author{John Langford \\ IBM Research \\ Email: {\tt jcl@cs.cmu.edu} \And John Shawe-Taylor\\ Royal Holloway, University of London\\ Email: {\tt jst@cs.rhul.ac.uk}}

\maketitle

\begin{abstract}
We show two related things: (1) Given a classifier which consists of a weighted sum of features with a large margin, we can construct a stochastic classifier with a negligibly larger training error rate. The stochastic classifier has a future error rate bound that depends on the margin distribution and is independent of the size of the base hypothesis class. (2) A new true error bound for classifiers with a margin which is simpler, functionally tighter, and more data-dependent than all previous bounds.
\end{abstract}

\section{Introduction}

{\renewcommand{\baselinestretch}{1.0}}
PAC-Bayes bounds \cite{PB} (improved in \cite{averaging_tech} and again in \cite{Seeger}) are interesting for constructing bounds on future error rate in classification, given only the assumption that examples are drawn independently from some (unknown) distribution. One drawback of PAC-Bayes bounds is that they apply strongly only to \emph{stochastic} classifiers---classifiers which make randomized predictions.
Most learning algorithms do not produce stochastic predictors, so the PAC-Bayes bound can only be applied to classifiers altered to form a stochastic classifier. This alteration can be done either by a sensitivity analysis \cite{SNN} or by using the Bayesian posterior \cite{Seeger} directly. Although stochastic classifiers are rare in practice, voting classifiers are quite common. These include ``Bayes optimal'' classifiers, Maximum Entropy classifiers \cite{ME}, Adaboost \cite{Boosting}, and others, which has motivated analysis of true error bounds specialized to voting classifiers. In particular, we know that when the vote is decided by a margin of at least $\gamma$ on the training set, the true error rate is more tightly bound than might be naively expected. Unfortunately, these bounds are still unsatisfying for practical use because they often give the meaningless prediction that the ``future error rate is less than $1$''.

Here we analyze the connection between PAC-Bayes and margin bounds. In particular, we show that the ``build a stochastic classifier'' approach \cite{SNN} to constructing a non-vacuous PAC-Bayes bound is always possible given that some margin $\gamma$ holds for the training set. This also shows that PAC-Bayes bounds are capable of capturing the ``low effective sample complexity'' which exists when a large margin classifier is used. An alteration to the PAC-Bayes result proves a new bound for margins which has a shorter argument and is much tighter than previous margin bounds.

There are two mathematical flavors of margin bound, dependent upon the weights $w_i$ of the vote and the features $x_i$ that the vote is taken over.
\begin{enumerate}
\item Those (\cite{ShaBarWilAnt98}, \cite{Bartlett98}) with a bound on $\sum_i w_i^2$ and $\sum_i x_i^2$ (``$l_2/l_2$'' bounds).
\item Those (\cite{Margin}, \cite{averaging_icml}) with a bound on $\sum_i w_i$ and $\max_i x_i$ (``$l_1/l_\infty$'' bounds).
\end{enumerate}
The results here are of the ``$l_2/l_2$'' form.
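The two scalings can be made concrete with a small numeric sketch. The weight and feature values below are hypothetical, chosen only to illustrate the quantities the two families of bounds constrain:

```python
import numpy as np

# Hypothetical weights of a voting classifier and one example's features
# (illustrative values only; not from the paper).
w = np.array([0.5, -0.25, 0.25])
x = np.array([1.0, -1.0, 0.5])

margin = np.dot(w, x)  # unnormalized vote margin w . x

# "l_2/l_2"-style bounds constrain these norms:
l2_scale = np.sqrt(np.sum(w**2)) * np.sqrt(np.sum(x**2))

# "l_1/l_inf"-style bounds constrain these instead:
l1_scale = np.sum(np.abs(w)) * np.max(np.abs(x))
```

The ratio of the margin to the relevant scale is what enters each bound; the same weight vector can look large-margin under one normalization and small-margin under the other.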
We improve on Shawe-Taylor et al.~\cite{ShaBarWilAnt98} and Bartlett~\cite{Bartlett98} by a $\log(m)^2$ sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18, as suggested by Section~\ref{discussion1}). In addition, the bound here covers margin errors without weakening the error-free case. Herbrich and Graepel \cite{HerbGraep} moved significantly towards the approach adopted in our paper, but their methodology meant that their result does not scale well to high dimensional feature spaces, as the bound here (and earlier results) do. The layout of our paper is simple: we first show how to construct a stochastic classifier with a good true error bound given a margin, and then construct a margin bound.

\section{Margin Implies PAC-Bayes Bound}

\subsection{Notation and theorem}

Consider a feature space $X$ which may be used to make predictions about the value in an output space $Y=\{-1,+1\}$. We use the notation $\bfx =(\bfx _{1},\ldots ,\bfx _{N})$ to denote an $N$ dimensional vector. Let the vote of a voting classifier be given by:
\[
v_\bfw(\bfx) = \bfw \cdot \bfx = \sum _{i}\bfw _{i}\bfx_{i}.
\]
The classifier is given by $c(\bfx)=\sign\left( v_\bfw(\bfx) \right)$. The number of ``margin violations'' or ``margin errors'' at $\gamma$ is given by:
\[
\hat{e}_{\gamma}(c) = \Pr_{(\bfx,y) \sim U(S)} \left( yv_\bfw(\bfx) < \gamma \right),
\]
where $U(S)$ is the uniform distribution over the sample set $S$. For convenience, we assume $v_\bfx(\bfx) \leq 1$ and $v_\bfw(\bfw) \leq 1$. Without this assumption, our results scale as $\sqrt{v_\bfx(\bfx)} \sqrt{v_\bfw(\bfw)} / \gamma$ rather than $1 / \gamma$.

Any margin bound applies to a vector $\bfw$ in $N$ dimensional space. For every example, we can decompose the example into a portion which is parallel to $\bfw$ and a portion which is perpendicular to $\bfw$:
\begin{eqnarray*}
\bfx _{\top } = \bfx -\frac{v_\bfw (\bfx)}{\Vert \bfw \Vert^2 }\bfw \mbox{\qquad} \bfx _{\Vert } = \bfx -\bfx _{\top }
\end{eqnarray*}
The argument is simple: we exhibit a ``prior'' over the weight space and a ``posterior'' over the weight space with an analytical form for the KL-divergence. The stochastic classifier defined by the posterior has a slightly larger empirical error and a small true error bound.

For the next theorem, let $\bar{F}(x) = 1 - \int_{-\infty}^x \frac{1}{\sqrt{2\pi}} e^{- t^2/2}\, dt$ be the tail probability of a Gaussian with mean $0$ and variance $1$. Also let $e_{Q(\bfw,\gamma,\epsilon)} = \Pr_{(\bfx,y) \sim D, h \sim Q(\bfw,\gamma,\epsilon)} (h(\bfx)\neq y)$ be the true error rate of a stochastic classifier with distribution $Q(\bfw,\gamma,\epsilon)$ dependent on a free parameter $\epsilon$, the weights $\bfw$ of an averaging classifier, and a margin $\gamma$.
\begin{thm}
\label{thm-pb-from-margin}
There exists a function $Q$ mapping a weight vector $\bfw$, margin $\gamma$, and value $\epsilon > 0$ to a distribution $Q(\bfw, \gamma, \epsilon)$ such that
\[
\Pr _{S\sim D^{m}}\left( \forall \bfw,\gamma,\epsilon:\,\,\, \KL \left(\hat{e}_{\gamma}(c)+\epsilon \,\Big\Vert\, e_{Q(\bfw,\gamma,\epsilon)}\right) \leq \frac{ \ln \frac{1}{\bar{F}\left(\frac{\bar{F}^{-1}(\epsilon)}{\gamma}\right)} +\ln\frac{m+1}{\delta }}{m}\right) \geq 1-\delta
\]
where $\KL( q \Vert p ) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.
\end{thm}
This theorem follows from a more general theorem which applies to any margin distribution.
\begin{thm}
\label{thm-pb-from-margin-full}
For all weight vectors $\bfw$ and all $\epsilon > 0$ stochastic error rates, if we choose $\mu > 0$ such that
\begin{eqnarray}
\label{mu-formula}
E_{(\bfx,y) \sim S}\, \bar{F} \left( \frac{y x_\Vert }{x_\top} \mu \right) = \epsilon
\end{eqnarray}
then there exists a posterior distribution $Q(\bfw,\mu,\epsilon)$ such that
\[
\Pr _{S\sim D^{m}}\left( \forall \epsilon,\bfw,\mu:\,\,\, \KL \left(\epsilon \,\Big\Vert\, e_{Q(\bfw,\mu,\epsilon)}\right) \leq \frac{ \ln \frac{1}{\bar{F}\left(\mu\right)} +\ln\frac{m+1}{\delta }}{m}\right) \geq 1-\delta
\]
where $\KL( q \Vert p ) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.
\end{thm}

\subsection{Proof of Theorem \ref{thm-pb-from-margin-full}}
\begin{proof}
(sketch) Choose the ``prior'' $P = N(0,I)$ over weight vectors. The ``posterior'' $Q(\bfw,\mu,\epsilon)$ draws a weight vector $\hat{\bfw}$ whose component parallel to $\bfw$ has length drawn from $N(0,1)$ conditioned on being at least $\mu$ (a rectified Gaussian), and whose component perpendicular to $\bfw$ is drawn from the prior. Since conditioning on an event of probability $\bar{F}(\mu)$ increases the KL-divergence by at most $\ln \frac{1}{\bar{F}(\mu)}$, the PAC-Bayes bound \cite{PB} gives the stated true error bound. For an example $(\bfx,y)$, a drawn classifier classifies correctly when
\[
y(\mu x _{\Vert }+\hat{w }_{\top }x _{\top }) > 0
\]
Since $\hat{w}_\top$ is drawn from $N(0,1)$, the probability of this event is:
\[
\Pr \left( y(\mu x _{\Vert }+\hat{w }_{\top }x _{\top }) > 0 \right) \geq 1 - \bar{F} \left( \frac{y x _\Vert }{x_\top} \mu \right)
\]
And so, the empirical error rate of the stochastic classifier is bounded by:
\[
\hat{e}_Q \leq E_{(\bfx,y) \sim S}\, \bar{F} \left( \frac{y x _\Vert }{x_\top} \mu \right) = \epsilon
\]
as required.
\end{proof}

\subsection{Proof of Theorem \ref{thm-pb-from-margin}}
\begin{proof}
(sketch) The theorem follows from a relaxation of Theorem~\ref{thm-pb-from-margin-full}. In particular, we treat every example with a margin less than $\gamma$ as an error and use the bounds $\Vert \bfx_\top \Vert \leq 1$ and $\Vert\bfx_\Vert\Vert \geq \gamma$.
\end{proof}

\subsection{Further results}
Several aspects of Theorem \ref{thm-pb-from-margin-full} appear arbitrary, but they are not. In particular, the choice of ``prior'' is not arbitrary, as the following lemma indicates.
\begin{lem}
The set of distributions $P$ satisfying $\exists P_{\Vert \Vert }:\,\, P(\bfx)=P_{\Vert \Vert }(\Vert \bfx\Vert^2 )$ (rotational invariance) and $P(\bfx)=\prod _{i=1}^{N}p_{i}(\bfx_i)$ (independence of each dimension) is $\{N(0, \lambda I) : \lambda > 0\}$.
\end{lem}
\begin{proof}
Rotational invariance together with the dimension independence imply that for all $i,j,x:\,\,p_{i}(x)=p_{j}(x)$, which implies that:
\[
P(\bfx)=\prod _{i=1}^{N}p(\bfx_{i})
\]
for some function $p(\cdot)$.
Applying rotational invariance, we have that:
\[
P(\bfx)=P_{\Vert \Vert }(\Vert \bfx\Vert^2 )=\prod _{i=1}^{N}p(\bfx_{i})
\]
This implies:
\[
\log P_{\Vert \Vert }\left( \sum _{i=1}^{N}\bfx_{i}^{2}\right) =\sum _{i=1}^{N}\log p(\bfx_{i}).
\]
Taking the derivative of this equation with respect to $\bfx_i$ gives
\[
\frac{P'_{\Vert \Vert }(\Vert \bfx\Vert^2 )2\bfx_i}{P_{\Vert \Vert }(\Vert \bfx\Vert^2 )} = \frac{p'(\bfx_i)}{p(\bfx_i)}.
\]
Since this holds for all values of $\bfx$, we must have $P'_{\Vert \Vert }(t) = \lambda P_{\Vert \Vert }(t)$ for some constant $\lambda$, or $P_{\Vert \Vert }(t) = C\exp(\lambda t)$, for some constant $C$. Hence, $P(\bfx) = C\exp(\lambda \Vert \bfx \Vert^2)$, which is a Gaussian (normalizability forces $\lambda < 0$), as required.
\end{proof}
The constant $\lambda$ in the previous lemma is a free parameter. However, the results do not depend upon the precise value of $\lambda$, so we choose variance $1$ for simplicity. Some freedom in the choice of the ``posterior'' $Q$ does exist, and the results are dependent on this choice. A rectified Gaussian appears simplest.

\section{Margin Implies Margin Bound}
\label{sec-margin-bound}
There are two methods for constructing a margin bound for the original averaging classifier. The first method is simplest, while the second is sometimes significantly tighter.

\subsection{Simple Margin Bound}
First we note a trivial bound arising from a folk theorem, and its relationship to our result.
\begin{lem}
\label{lem-simple}(Simple Averaging bound) For any stochastic classifier with distribution $Q$ and true error rate $e_Q$, the averaging classifier,
\[
c_Q(\bfx) = \sign \left(\int_H h(\bfx)\, dQ(h) \right)
\]
has true error rate:
\[
e(c_Q) \leq 2 e_Q
\]
\end{lem}
\begin{proof}
For every example $(\bfx,y)$, every time the averaging classifier errs, the probability of the stochastic classifier erring must be at least $1/2$.
\end{proof}
This result is interesting and of practical use when the empirical error rate of the original averaging classifier is low.
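The factor-of-two relationship of Lemma \ref{lem-simple} is easy to check numerically, since the argument holds pointwise for each example: whenever the majority vote errs, at least half of the drawn classifiers err. The sketch below uses synthetic data and a hypothetical Gaussian posterior $Q$ centered at $\bfw$ (none of these choices come from the paper) and estimates both error rates by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data and a fixed mean weight vector (illustrative assumptions).
m, n = 200, 5
X = rng.normal(size=(m, n))
w = rng.normal(size=n)
y = np.sign(X @ w + 0.5 * rng.normal(size=m))  # noisy labels
y[y == 0] = 1

# Stochastic classifier Q: K weight vectors drawn from N(w, I).
K = 2000
W = w + rng.normal(size=(K, n))
votes = np.sign(X @ W.T)                 # m x K predictions in {-1, 0, +1}
e_Q = np.mean(votes != y[:, None])       # stochastic (Gibbs) error rate

# Averaging classifier c_Q: sign of the average vote over Q.
c_Q = np.sign(np.mean(votes, axis=1))
e_cQ = np.mean(c_Q != y)                 # averaging classifier error rate

# Lemma: e(c_Q) <= 2 e_Q, example by example.
assert e_cQ <= 2 * e_Q
```

Both rates here are empirical on the same sample, but the inequality is deterministic: on each example the averaging classifier's error indicator is at most twice the fraction of erring draws.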
Furthermore, we can prove that $c_Q(\bfx)$ \emph{is} the original averaging classifier.
\begin{lem}
\label{sto-det}
For $Q = Q(\bfw,\gamma,\epsilon)$ derived according to Theorems \ref{thm-pb-from-margin} and \ref{thm-pb-from-margin-full} and $c_Q(\bfx)$ as in Lemma \ref{lem-simple}:
\[
c_Q(\bfx) = \sign\left( v_\bfw (\bfx) \right)
\]
\end{lem}
\begin{proof}
For every $\bfx$ this equation holds because of two simple facts:
\begin{enumerate}
\item For any $\hat{\bfw}$ that classifies an input $\bfx$ differently from the averaging classifier, there is a unique equiprobable paired weight vector that agrees with the averaging classifier.
\item If $v_\bfw(\bfx) \neq 0$, then there exists a nonzero measure of classifier pairs which always agree with the averaging classifier.
\end{enumerate}
Condition (1) is met by reversing the sign of $\hat{\bfw}_\top$ and noting that either the original random vector or the reversed random vector must agree with the averaging classifier. Condition (2) is met by the randomly drawn classifier $\hat{\bfw} = \lambda \bfw$ for any $\lambda > 0$, and nearby classifiers. Since the example is not on the hyperplane, there exists some small sphere of paired classifiers (in the sense of condition (1)). This sphere has a positive measure.
\end{proof}
The simple averaging bound is elegant, but it breaks down when the empirical error is large because:
\[
e(c) \leq 2 e_Q = 2 (\hat{e}_Q + \Delta_m) \simeq 2 \hat{e}_\gamma(c) + 2 \Delta_m
\]
where $\hat{e}_Q$ is the empirical error rate of a stochastic classifier and $\Delta_m$ goes to zero as $m \rightarrow \infty$. Next, we construct a bound of the form $e(c_Q) \leq \hat{e}_\gamma(c) + \Delta_m'$ where $\Delta_m' > \Delta_m$, but the empirical term $\hat{e}_\gamma(c)$ avoids the factor of $2$.

\subsection{A (Sometimes) Tighter Bound}
By altering our choice of $\mu$ and our notion of ``error'' we can construct a bound which holds \emph{without} randomization.
In particular, we have the following theorem:
\begin{thm}
\label{thm-margin-bound}
For all averaging classifiers $c$ with normalized weights $\bfw$, for all ``extra'' error rates $\epsilon > 0$ and margins $\gamma>0$:
\[
\Pr _{S\sim D^{m}}\left( \forall \epsilon,\bfw,\gamma:\,\,\, \KL \left(\hat{e}_\gamma(c) + \epsilon \,\Big\Vert\, e(c) - \epsilon\right) \leq \frac{ \ln \frac{1}{\bar{F}\left(\frac{2 \bar{F}^{-1}(\epsilon)}{\gamma}\right)} +2 \ln\frac{m+1}{\delta }}{m}\right) \geq 1-\delta
\]
where $\KL( q \Vert p ) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.
\end{thm}
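To see how Theorem \ref{thm-margin-bound} behaves numerically, the following sketch evaluates the right-hand side and then inverts the KL-divergence by bisection to obtain an explicit bound on $e(c)$. The sample size, margin, empirical error rate, $\delta$, and $\epsilon$ below are hypothetical values chosen for illustration, not numbers from the paper:

```python
import math

def Fbar(x):
    # Gaussian tail probability: P(Z >= x) for Z ~ N(0, 1)
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Fbar_inv(p, lo=-10.0, hi=10.0):
    # Invert the (decreasing) tail function by bisection.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if Fbar(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def kl(q, p):
    # KL-divergence between two coins of bias q and p.
    tiny = 1e-12
    q = min(max(q, tiny), 1 - tiny)
    p = min(max(p, tiny), 1 - tiny)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inv_upper(q, bound):
    # Largest p in [q, 1) with KL(q || p) <= bound, by bisection.
    lo, hi = q, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical inputs: sample size, margin, empirical margin error
# rate, confidence parameter, and free parameter epsilon.
m, gamma, e_hat, delta, eps = 10000, 0.5, 0.05, 0.05, 0.01

rhs = (math.log(1.0 / Fbar(2.0 * Fbar_inv(eps) / gamma))
       + 2.0 * math.log((m + 1) / delta)) / m

# The theorem bounds KL(e_hat + eps || e(c) - eps), so:
true_error_bound = kl_inv_upper(e_hat + eps, rhs) + eps
```

The numerator combines the margin-dependent term $\ln \frac{1}{\bar{F}(2\bar{F}^{-1}(\epsilon)/\gamma)}$ with the confidence term $2\ln\frac{m+1}{\delta}$; for reasonable sample sizes the resulting bound on $e(c)$ stays well below $1$, in contrast to the vacuous predictions of earlier margin bounds mentioned in the introduction.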