0$ stochastic error rates. If we choose $\mu > 0$ such that \begin{eqnarray} \label{mu-formula} E_{\bfx,y \sim S} \bar{F} \left( \frac{y x_\Vert }{x_\top} \mu \right) = \epsilon \end{eqnarray} then there exists a posterior distribution $Q(\bfw,\mu,\epsilon)$ such that \[ \Pr _{S\sim D^{m}}\left( \forall \epsilon,\bfw,\mu:\,\,\, \KL (\epsilon \Vert e_{Q(\bfw,\mu,\epsilon)}) \leq \frac{ \ln \frac{1}{\bar{F}\left(\mu\right)} +\ln\frac{m+1}{\delta }}{m}\right) \geq 1-\delta \] where $\KL( q \Vert p ) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q$ and $p$.
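Equation~(\ref{mu-formula}) only defines $\mu$ implicitly. As a minimal numeric sketch (hypothetical helper names; $\bar{F}$ taken to be the standard Gaussian tail, and every sample ratio $y x_\Vert / x_\top$ assumed positive so that the expectation decreases strictly from $1/2$ toward $0$ in $\mu$), the defining equation can be solved by bisection:

```python
import math

def bar_F(t):
    # Gaussian tail: Pr_{z ~ N(0,1)}(z > t), via the complementary error function.
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def stochastic_error(mu, ratios):
    # Empirical version of E_{(x,y) ~ S} bar_F((y x_par / x_perp) * mu),
    # where ratios[i] = y_i * x_par_i / x_perp_i for the i-th sample point.
    return sum(bar_F(r * mu) for r in ratios) / len(ratios)

def solve_mu(ratios, eps, lo=1e-6, hi=1e6, iters=100):
    # Bisection for the mu > 0 satisfying stochastic_error(mu, ratios) = eps.
    # With all ratios positive, stochastic_error is strictly decreasing in mu,
    # so any eps in (0, 1/2) has a unique solution inside (lo, hi).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stochastic_error(mid, ratios) > eps:
            lo = mid  # error still above target: mu must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The resulting $\mu$ then enters the bound only through the complexity term $\ln(1/\bar{F}(\mu))$.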

0 \] Since $\hat{w}_\top$ is drawn from $N(0,1)$, the probability of this event is: \[ \Pr \left( y(\mu x _{\Vert }+\hat{w }_{\top }x _{\top }) > 0 \right) \geq 1 - \bar{F} \left( \frac{y x _\Vert }{x_\top} \mu \right) \] And so the empirical error rate of the stochastic classifier is bounded by: \[ \hat{e}_Q \leq E_{\bfx,y \sim S} \bar{F} \left( \frac{y x _\Vert }{x_\top} \mu \right) = \epsilon \] as required. \end{proof} \subsection{Proof of Theorem \ref{thm-pb-from-margin}} \begin{proof} (sketch) The theorem follows from a relaxation of Theorem~\ref{thm-pb-from-margin-full}. In particular, we treat every example with margin less than $\gamma$ as an error and apply the bounds $\Vert \bfx_\top \Vert \leq 1$ and $\Vert \bfx_\Vert \Vert \geq \gamma$. \end{proof} \subsection{Further results} Several aspects of Theorem \ref{thm-pb-from-margin-full} appear arbitrary, but they are not. In particular, the choice of ``prior'' is essentially forced, as the following lemma indicates. \begin{lem} The set of \( P \) satisfying \( \exists P_{\Vert \Vert }:\,\, \) \( P(\bfx)=P_{\Vert \Vert }(\Vert \bfx\Vert^2 ) \) (rotational invariance) and \( P(\bfx)=\prod _{i=1}^{N}p_{i}(\bfx_i) \) (independence of each dimension) consists exactly of the spherical Gaussians \( N(0, \lambda I) \) for \( \lambda > 0 \). \end{lem} \begin{proof} Rotational invariance together with dimensionwise independence implies that for all \( i,j,x:\,\,p_{i}(x)=p_{j}(x) \), so that \[ P(\bfx)=\prod _{i=1}^{N}p(\bfx_{i})\] for some function \( p(\cdot) \). Applying rotational invariance, we have: \[ P(\bfx)=P_{\Vert \Vert }(\Vert \bfx\Vert^2 )=\prod _{i=1}^{N}p(\bfx_{i})\] This implies: \[ \log P_{\Vert \Vert }\left( \sum _{i=1}^{N}\bfx_{i}^{2}\right) =\sum _{i=1}^{N}\log p(\bfx_{i}).\] Taking the derivative of this equation with respect to $\bfx_i$ gives \[ \frac{P'_{\Vert \Vert }(\Vert \bfx\Vert^2 )2\bfx_i}{P_{\Vert \Vert }(\Vert \bfx\Vert^2 )} = \frac{p'(\bfx_i)}{p(\bfx_i)}. 
\] After dividing by $2\bfx_i$, the left-hand side depends on $\bfx$ only through $\Vert \bfx \Vert^2$ while the right-hand side depends only on $\bfx_i$; since the equality holds for all values of $\bfx$, both sides must equal a constant, giving \[ P'_{\Vert \Vert }(t) = \lambda P_{\Vert \Vert }(t) \] for some constant $\lambda$, or $P_{\Vert \Vert }(t) = C\exp(\lambda t)$ for some constant $C$. Hence, $ P(\bfx) = C\exp(\lambda \Vert \bfx \Vert^2), $ which is normalizable only when the exponent is negative, i.e., $P$ is a spherical Gaussian, as required. \end{proof} The constant \( \lambda \) in the previous lemma is a free parameter. However, the results do not depend upon the precise value of \( \lambda \), so we choose \( \lambda = 1 \) for simplicity. Some freedom in the choice of the ``posterior'' \( Q \) does exist, and the results do depend on this choice. A rectified Gaussian appears simplest. \section{Margin Implies Margin Bound} \label{sec-margin-bound} There are two methods for constructing a margin bound for the original averaging classifier. The first method is simpler, while the second is sometimes significantly tighter. \subsection{Simple Margin Bound} First we note a trivial bound arising from a folk theorem, and its relationship to our result. \begin{lem} \label{lem-simple}(Simple Averaging bound) For any stochastic classifier with distribution $Q$ and true error rate $e_Q$, the averaging classifier, \[ c_Q(\bfx) = \mbox{sign} \left(\int_H h(\bfx) dQ(h) \right) \] has true error rate: \[ e(c_Q) \leq 2 e_Q \] \end{lem} \begin{proof} Whenever the averaging classifier errs on an example $(\bfx,y)$, the $Q$-average vote has the wrong sign, so classifiers of total measure at least $1/2$ under $Q$ must err on that example. Taking expectations over examples gives $e(c_Q) \leq 2 e_Q$. \end{proof} This result is interesting and of practical use when the empirical error rate of the original averaging classifier is low. Furthermore, we can prove that $c_Q(\bfx)$ \emph{is} the original averaging classifier. 
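The counting argument behind Lemma~\ref{lem-simple} (a wrong averaged vote forces at least half of the $Q$-measure to err) can be checked exhaustively on a toy discrete stand-in for $Q$; the helper names below are hypothetical:

```python
import itertools

def averaged_label(preds, weights):
    # Sign of the Q-weighted vote; a tie (sign 0) counts as an error.
    s = sum(w * p for w, p in zip(weights, preds))
    return 1 if s > 0 else (-1 if s < 0 else 0)

def erring_mass(preds, weights, y):
    # Total Q-measure of voters that disagree with the true label y.
    return sum(w for w, p in zip(weights, preds) if p != y)

# Three voters with measures summing to 1 stand in for Q; true label y = +1.
weights = [0.25, 0.25, 0.5]
for preds in itertools.product([-1, 1], repeat=3):
    if averaged_label(preds, weights) != 1:  # the averaging classifier errs
        assert erring_mass(preds, weights, 1) >= 0.5
```

The same check passes for any weights summing to one; taking expectations over examples turns this pointwise fact into $e(c_Q) \leq 2 e_Q$.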
\begin{lem} \label{sto-det} For $Q = Q(\bfw,\gamma,\epsilon)$ derived according to Theorems \ref{thm-pb-from-margin} and \ref{thm-pb-from-margin-full} and $c_Q(\bfx)$ as in Lemma \ref{lem-simple}: \[ c_Q(\bfx) = \mbox{sign}\left( v_\bfw (\bfx) \right) \] \end{lem} \begin{proof} For every $\bfx$ this equation holds because of two simple facts: \begin{enumerate} \item For any $\hat{\bfw}$ that classifies an input $\bfx$ differently from the averaging classifier, there is a unique equiprobable paired weight vector that agrees with the averaging classifier. \item If $v_\bfw(\bfx) \neq 0$, then there exists a nonzero measure of classifier pairs which always agree with the averaging classifier. \end{enumerate} Condition (1) is met by reversing the sign of $\hat{\bfw}_\top$ and noting that either the original random vector or the reversed random vector must agree with the averaging classifier. Condition (2) is met by the randomly drawn classifier $\hat{\bfw} = \lambda \bfw$ and nearby classifiers, for any $\lambda > 0$. Since the example is not on the hyperplane, there exists some small sphere of paired classifiers (in the sense of condition (1)). This sphere has positive measure. \end{proof} The simple averaging bound is elegant, but it breaks down when the empirical error is large, because: \[ e(c) \leq 2 e_Q \leq 2 (\hat{e}_Q + \Delta_m) \simeq 2 \hat{e}_\gamma(c) + 2 \Delta_m \] where $\hat{e}_Q$ is the empirical error rate of the stochastic classifier and $\Delta_m$ goes to zero as $m \rightarrow \infty$. Next, we construct a bound of the form $e(c_Q) \leq \hat{e}_\gamma(c) + \Delta_m'$ where \( \Delta_m' > \Delta_m \) but the empirical term \( \hat{e}_\gamma(c) \) replaces \( 2 \hat{e}_\gamma(c) \). \subsection{A (Sometimes) Tighter Bound} By altering our choice of \( \mu \) and our notion of ``error'', we can construct a bound which holds \emph{without} randomization. 
In particular, we have the following theorem: \begin{thm} \label{thm-margin-bound} For all averaging classifiers $c$ with normalized weights $\bfw$, for all ``extra'' error rates $\epsilon > 0$ and margins $\gamma>0$: \[ \Pr _{S\sim D^{m}}\left( \forall \epsilon,\bfw,\gamma:\,\,\, \KL (\hat{e}_\gamma(c) + \epsilon \Vert e(c) - \epsilon) \leq \frac{ \ln \frac{1}{\bar{F}\left(\frac{2 \bar{F}^{-1}(\epsilon)}{\gamma}\right)} +2 \ln\frac{m+1}{\delta }}{m}\right) \geq 1-\delta \] where $\KL( q \Vert p ) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q$ and $p$. \end{thm}
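To use the theorem numerically, one computes the right-hand side and then inverts $\KL$ in its second argument. A sketch (hypothetical helper names; $\bar{F}$ and $\bar{F}^{-1}$ taken as the standard Gaussian tail and its inverse; note that the $\bar{F}$ term can underflow for very small $\gamma$):

```python
import math

def bar_F(t):
    # Gaussian tail: Pr_{z ~ N(0,1)}(z > t).
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def bar_F_inv(p, lo=-40.0, hi=40.0, iters=200):
    # Invert the strictly decreasing Gaussian tail by bisection.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bar_F(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kl(q, p):
    # KL divergence between coins of bias q and p (0 < q, p < 1).
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inv_upper(q, bound, iters=200):
    # Largest p >= q with kl(q, p) <= bound; kl(q, .) increases on [q, 1).
    lo, hi = q, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def margin_bound(e_hat_gamma, eps, gamma, m, delta):
    # Theorem: KL(e_hat_gamma + eps || e(c) - eps) <= rhs with prob. 1 - delta,
    # so e(c) <= kl_inv_upper(e_hat_gamma + eps, rhs) + eps.
    rhs = (math.log(1.0 / bar_F(2.0 * bar_F_inv(eps) / gamma))
           + 2.0 * math.log((m + 1) / delta)) / m
    return kl_inv_upper(e_hat_gamma + eps, rhs) + eps
```

For instance, with $\hat{e}_\gamma(c)=0.05$, $\epsilon=0.01$, $\gamma=0.5$, $m=10^5$, and $\delta=0.05$, this yields a bound on $e(c)$ of roughly $0.08$, and the bound tightens as $m$ grows.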