\frac{\textrm{KL}(Q||P)+\ln\frac{1}{m\delta}}{m}\right)\geq\delta-\delta^{2}\] \end{thm} \begin{proof} The proof is essentially the same as for the Occam's Razor tightness result (theorem \ref{th-dhlub}), with a few twists. \end{proof} \subsection{Application of the PAC-Bayes Bound} \label{sub:Application-of-the} Actually applying the PAC-Bayes bound requires some specialization \cite{PB-margin}. Here, we specialize to classifiers of the form: \[ c(x)=\textrm{sign}\left(\vec{w}\cdot\vec{x}\right)\] Note that via the kernel trick, Support Vector Machines also have this form. The specialization is naturally expressed in terms of a few derived quantities: \begin{enumerate} \item The cumulative distribution of a Gaussian. Let $\bar{F}(x)=\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-t^{2}/2}dt$. Here we use $\bar{F}$ rather than $F$ to denote the fact that we integrate from $x$ to $\infty$ rather than from $-\infty$ to $x$. \item A {}``posterior'' distribution $Q(\vec{w},\mu)$ which is $N(\mu,1)$ in the direction of $\vec{w}$ and $N(0,1)$ in all perpendicular directions. \item The normalized margin of the examples\[ \gamma(\vec{x},y)=\frac{y\vec{w}\cdot\vec{x}}{||\vec{w}||\,||\vec{x}||}\] \item A stochastic error rate, $\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim U(S)}\bar{F}\left(\mu\gamma(\vec{x},y)\right)$. \end{enumerate} This last quantity in particular is very important to understand. Consider the case as $\mu$ approaches $\infty$. When the margin is negative (indicating an incorrect classification), $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ approaches $1$. When the margin is positive, $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$ approaches $0$. Thus, $\hat{Q}(\vec{w},\mu)_{S}$ is a softened form of the empirical error $\hat{c}_{S}$ which takes the margin into account.
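The derived quantities above are straightforward to compute. As a minimal sketch (function names are illustrative, not from the text), $\bar{F}$ can be evaluated via the identity $\bar{F}(x)=\frac{1}{2}\textrm{erfc}(x/\sqrt{2})$, and the stochastic error rate is then an average over the sample:

```python
# Sketch: the Gaussian tail F-bar and the stochastic error rate
# Q-hat(w, mu)_S, given a list of normalized margins gamma(x, y)
# for the sample S. Names are illustrative assumptions.
import math

def F_bar(x):
    # F-bar(x) = integral from x to infinity of the standard normal density
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def stochastic_error_rate(margins, mu):
    # Q-hat(w, mu)_S = average of F-bar(mu * gamma(x, y)) over the sample
    return sum(F_bar(mu * g) for g in margins) / len(margins)
```

For large $\mu$, each negatively-margined example contributes nearly $1$ and each positively-margined example nearly $0$, recovering the soft-empirical-error behavior described above.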
\begin{cor} (PAC-Bayes Margin Bound) For all distributions $D$, for all $\delta\in(0,1]$, we have: \[ \Pr_{S\sim D^{m}}\left(\forall\vec{w},\mu:\,\,\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||Q(\vec{w},\mu)_{D}\right)\leq\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta\] \end{cor} \begin{proof} The proof is simple: we choose the prior $P=N(0,1)^{n}$ and work out the implications. Since the Gaussian distribution is the same in every direction, we can reorient the coordinate system of the prior to have one dimension parallel to $\vec{w}$. Since the draws in the parallel and perpendicular directions are independent, and $Q$ agrees with $P$ in every perpendicular direction, we have:\[ \textrm{KL}(Q||P)=\textrm{KL}(Q_{\perp}||P_{\perp})+\textrm{KL}(N(\mu,1)||N(0,1))\] \[ =0+\frac{\mu^{2}}{2}=\frac{\mu^{2}}{2}\] as required. All that remains is calculating the stochastic error rate $\hat{Q}(\vec{w},\mu)_{S}$. Fix a particular example $(\vec{x},y)$. This example has a natural decomposition $\vec{x}=\vec{x}_{||}+\vec{x}_{\perp}$ into a component $\vec{x}_{||}$ parallel to the weight vector $\vec{w}$ and a component $\vec{x}_{\perp}$ perpendicular to the weight vector. To classify, we draw a weight vector $\vec{w}^{'}$ from $Q(\vec{w},\mu)$. This $\vec{w}^{'}$ consists of three components, $\vec{w}^{'}=\vec{w}_{||}^{'}+\vec{w}_{\perp}^{'}+\vec{w}_{\perp\perp}^{'}$. Here $\vec{w}_{||}^{'}\sim N(\mu,1)$ is parallel to the original weight vector, $\vec{w}_{\perp}^{'}\sim N(0,1)$ is parallel to $\vec{x}_{\perp}$, and $\vec{w}_{\perp\perp}^{'}$ is perpendicular to both $\vec{w}$ and $\vec{x}$.
We have:\[ E_{\vec{x},y\sim U(S),\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\neq\textrm{sign}\left(\vec{w}^{'}\cdot\vec{x}\right)\right)\] \[ =E_{\vec{x},y\sim U(S),\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\vec{w}^{'}\cdot\vec{x}\leq0\right)\] If we let $w_{||}^{'}$ and $w_{\perp}^{'}$ denote the (signed) components of $\vec{w}^{'}$ along $\vec{w}$ and $\vec{x}_{\perp}$, let $x_{||}=||\vec{x}_{||}||$ and $x_{\perp}=||\vec{x}_{\perp}||$, and assume (without loss of generality) that $y=1$, we get: \[ =E_{\vec{x},y\sim U(S),w_{||}^{'}\sim N(\mu,1),w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] \[ =E_{\vec{x},y\sim U(S)}E_{w_{||}^{'}\sim N(\mu,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\] Shifting $w_{||}^{'}$ by $\mu$ and dividing through by $x_{||}$, this is:\[ =E_{\vec{x},y\sim U(S)}E_{w_{||}^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq-yw_{||}^{'}-yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] Using the symmetry of the Gaussian, this is:\[ =E_{\vec{x},y\sim U(S)}E_{w_{||}^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq yw_{||}^{'}+yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\] Using the fact that the sum of two independent Gaussians is a Gaussian:\[ =E_{\vec{x},y\sim U(S)}E_{v\sim N\left(0,1+\frac{x_{\perp}^{2}}{x_{||}^{2}}\right)}I\left(y\mu\leq yv\right)\] \[ =E_{\vec{x},y\sim U(S)}E_{v\sim N\left(0,\frac{1}{\gamma(\vec{x},y)^{2}}\right)}I\left(y\mu\leq yv\right)\] \[ =E_{\vec{x},y\sim U(S)}\bar{F}\left(\mu\gamma(\vec{x},y)\right)\] finishing the proof. \end{proof} Using the corollary, the true error bound $\bar{Q}(\vec{w},\mu)_{D}$ satisfies the equation: \[ \textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||\bar{Q}(\vec{w},\mu)_{D}\right)=\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\] This is an implicit equation for $\bar{Q}$ which can easily be solved numerically. The bound is stated in terms of dot products here, so it is naturally possible to kernelize the result using methods from \cite{kernelize}.
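Solving the implicit equation numerically amounts to inverting the Bernoulli KL divergence in its second argument. A minimal sketch (names and structure are illustrative, not from the text): binary search for the largest $\bar{Q}\geq\hat{Q}$ whose divergence from $\hat{Q}$ equals the right-hand side.

```python
# Sketch: invert KL(q_hat || q_bar) = rhs for the largest q_bar >= q_hat,
# where KL is the relative entropy between Bernoulli distributions.
import math

def kl_bernoulli(q, p):
    # KL(q || p) for Bernoulli distributions; clamp to avoid log(0)
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, rhs, iters=100):
    # Binary search: KL(q_hat || .) is increasing on [q_hat, 1]
    lo, hi = q_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(q_hat, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return lo
```

Here `rhs` would be $\left(\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}\right)/m$, and `q_hat` the stochastic error rate $\hat{Q}(\vec{w},\mu)_{S}$.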
In kernelized form, the bound applies to classifiers of the form: \begin{equation} c(x)=\textrm{sign}\left(\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)\label{eq:class-kernel}\end{equation} Since, by assumption, $k$ is a kernel, we know that $k(x_{i},x)=\vec{\Phi}(x_{i})\cdot\vec{\Phi}(x)$ where $\vec{\Phi}(x)$ is some projection into another space. In kernelized form, we get $\vec{w}\cdot\vec{x}=\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)$, $\vec{x}\cdot\vec{x}=k(x,x)$, and $\vec{w}\cdot\vec{w}=\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})$, defining all of the quantities necessary to calculate the normalized margin, \[ \gamma(x,y)=\frac{y\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)}{\sqrt{k(x,x)\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})}}\] One element remains: the choice of $\mu$. Unfortunately, the bound can be nonmonotonic in the value of $\mu$, but it turns out that for classifiers learned by support vector machines on reasonable datasets, there is only one value of $\mu$ which is (locally, and thus globally) minimal. A binary search over some reasonable range of $\mu$ (say from $1$ to $100$) can find the minimum quickly, given the precomputation of the margins. It is worth noting again here that we are not {}``cheating''---the bound holds for all values of $\mu$ simultaneously. The computational time of the bound calculation is dominated by the calculation of the margins, which is $O\left(m^{2}\right)$ where $m$ is the number of support vectors (examples with a nonzero associated $\alpha_{i}$). This computational time is typically dominated by the time of the SVM learning algorithm. \subsubsection{Results} Application of this bound to support vector machines is of significant importance because SVMs are reasonably effective and adaptable classifiers in common and widespread use.
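The normalized margins in the kernelized form use only the Gram matrix and the coefficients $\alpha_{i}$. A minimal sketch of the precomputation, assuming a Gram matrix `K` with `K[i][j]` $=k(x_{i},x_{j})$ (names are illustrative, not from the text):

```python
# Sketch: normalized margins gamma(x_j, y_j) for a kernelized classifier
# c(x) = sign(sum_i alpha_i k(x_i, x)), computed on the training sample.
import math

def normalized_margins(K, alphas, ys):
    # K[i][j] = k(x_i, x_j); alphas and ys indexed the same way
    m = len(alphas)
    # ||w||^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j)
    w_norm = math.sqrt(sum(alphas[i] * alphas[j] * K[i][j]
                           for i in range(m) for j in range(m)))
    margins = []
    for j in range(m):
        # w . x_j = sum_i alpha_i k(x_i, x_j);  ||x_j|| = sqrt(k(x_j, x_j))
        dot = sum(alphas[i] * K[i][j] for i in range(m))
        margins.append(ys[j] * dot / (math.sqrt(K[j][j]) * w_norm))
    return margins
```

With margins in hand, evaluating the bound for each candidate $\mu$ is cheap, so the search over $\mu$ (say from $1$ to $100$) adds negligible time beyond the $O(m^{2})$ margin computation.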
An SVM learns a kernelized classifier as per equation \ref{eq:class-kernel}% \footnote{Some SVM learning algorithms actually learn a classifier of the form: $c(x)=\textrm{sign}\left(b+\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)$. We do not handle this form here.% }. We apply the support vector machine to 8 UCI database problems chosen to fit the criteria {}``two classes'' and {}``real valued input features''. The problems vary in size over an order of magnitude, from $145$ to $1428$ examples. In figure \ref{fig:test} we use a 70/30 train/test split of the data. In all experiments, we use SVMlight with a Gaussian kernel and the default bandwidth. Results for other choices of {}``C'', the bandwidth, and/or the kernel appear to be qualitatively similar (although of course differing quantitatively). % \begin{figure} \includegraphics[% width=0.60\columnwidth, angle=270]{test_and_margin.ps} \caption{\label{fig:test}This figure shows the results of applying SVMlight to 8 datasets with a Gaussian kernel and a 70/30 train/test split. The observed test error rate is graphed as an X. On the test set, we calculate a Binomial confidence interval (probability of bound failure $=0.01$) which upper bounds the true error rate. On the training set we calculate the PAC-Bayes margin bound for an optimized choice of $\mu$.} \end{figure} It is important to note that the PAC-Bayes margin bound is \emph{not} precisely a bound (or confidence interval) on the true error rate of the learned classifier. Instead, it is a true error rate bound on an associated stochastic classifier, chosen so as to have a similar test error rate. These bounds can be regarded as bounds for the original classifier only under an additional assumption: that picking a classifier according to the majority vote of this stochastic distribution does not worsen the true error rate. This is not true in general, but may be true in practice.
It is of course unfair to compare the train set bound with the test set bound on a 70/30 train/test split, because a very tight train set bound would imply that it is unnecessary to even have a test set. In figure \ref{fig:full-train} we compare the true error bounds on all of the data to the true error bounds generated from the 70/30 train/test split. % \begin{figure} \includegraphics[% width=0.60\columnwidth, angle=270]{test_vs_margin.ps} \caption{\label{fig:full-train}In addition to comparing with everything in figure \ref{fig:test}, we graph the margin bound when \emph{all} of the data is used for the train set. Note that it improves somewhat on the margin bound calculated using the 70\% train set (7/10 margin bound), but not enough to compete with the test set bound.} \end{figure} The results show that the PAC-Bayes margin bound is tight enough to give useful information, but still not competitive with the test set bounds. This is in strong contrast with a tradition of quantitatively impractical margin bounds. There are several uses for bounds which provide some information but which are not fully tight. \begin{enumerate} \item They might be combined with a train/test bound \cite{TnT}. \item The train set bound might easily become tighter for smaller sample sizes. This was observed in \cite{TnT}. \end{enumerate} \section{Sample Compression Bound} \label{sec:Sparsity-Bound} The sample compression bound \cite{LW}, \cite{FW} is like the PAC-Bayes bound in that it applies to arbitrary precision continuous valued classifiers. Unlike the PAC-Bayes bound, it applies meaningfully to nonstochastic classifiers. Mainstream learning algorithms do not optimize the sample compression metric, so application of the bound is somewhat rarer. Nonetheless, there do exist some reasonably competitive learning algorithms for which the sample compression bound produces significant results.
The section is organized as follows: \begin{enumerate} \item Subsection \ref{sub:The-Sample-Compression-Bound} states and proves the sample compression bound. \item Subsection \ref{sub:SC-tightness} shows that the sample compression bound is nearly as tight as possible given the observations. \item Subsection \ref{sub:SC-application} discusses results from the application of the sample compression bound to support vector machines. \end{enumerate} \subsection{The Sample Compression Bound} \label{sub:The-Sample-Compression-Bound} The sample compression bound stated here differs somewhat from other results, via generalization and simplification, but its qualitative behavior is identical. Suppose we have a learning algorithm $A(S)$ whose training is {}``sparse''% \footnote{This is satisfied, for example, by the Support Vector Machine algorithm, which only depends upon the set of support vectors.% } in the sense that the output classifier depends upon only a subset of the data: $A(S)=A(S')$ for some $S'\subseteq S$. The sample compression bound depends on the error rate $\hat{c}_{S-S'}$ on the subset $S-S'$. The motivation here is that the examples which the learning algorithm does \emph{not} depend upon are {}``almost'' independent, and so we can {}``almost'' get a test set bound. \begin{thm} (Sample Compression Bound) For all $\delta\in(0,1]$, $D$, $A$:\[ \Pr_{S\sim D^{m}}\left(\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{m \choose |S-S'|}}\right)\right)\geq1-\delta\] \end{thm} \begin{proof} Suppose we knew in advance that the learning algorithm will not depend upon some subset of the examples.
Then, the unused subset acts like a test set and gives us a test set bound.\[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\frac{\delta}{m{{m \choose |S-S'|}}}\] (Note that, technically, it is possible to refer to $S'$ unambiguously before randomizing over $S$ by specifying the indices of $S$ contained in $S'$.) Negating this, we get: \[ \forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\frac{\delta}{m{{m \choose |S-S'|}}}\] and using the union bound ($\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$) over each possible subset $S'$, we get:\[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\delta\] Negating this again gives us the proof. \end{proof} \subsection{The Sample Compression Bound is Sometimes Tight} \label{sub:SC-tightness} We can construct a learning algorithm/learning problem pair such that the sample compression bound is provably near optimal as a function of its observables.
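The bound is easy to evaluate for a given subset. Below is a minimal sketch (names are illustrative, not from the text), assuming $\overline{\textrm{Bin}}(\hat{c},\delta)$ denotes the binomial tail inversion: the largest true error rate $p$ such that observing an empirical error rate of $\hat{c}$ or less still has probability at least $\delta$.

```python
# Sketch: evaluating the sample compression bound for one subset S'.
# n_holdout = |S - S'|, k_errors = number of errors on S - S'.
import math

def binom_tail(n, k, p):
    # Pr(Bin(n, p) <= k): probability of at most k errors in n trials
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bin_bar(n, k, delta, iters=100):
    # Binomial tail inversion: largest p with Pr(Bin(n, p) <= k) >= delta
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_tail(n, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def sample_compression_bound(m, n_holdout, k_errors, delta):
    # Confidence is split over the m * C(m, |S - S'|) possible subsets
    adjusted = delta / (m * math.comb(m, n_holdout))
    return bin_bar(n_holdout, k_errors, adjusted)
```

The $m{m \choose |S-S'|}$ factor in the denominator is what makes the bound loose when the holdout $S-S'$ is small relative to $m$, which matches the weak empirical results reported below for SVMs.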
\begin{thm} (Sample Compression Tightness) For all $\delta\in(0,1]$ and $\frac{k}{m}$, there exists a distribution $D$ and learning algorithm $A$ s.t.\[ \Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{m \choose |S-S'|}}\right)\right)>\delta-\delta^{2}\] furthermore, if $S^{*}$ minimizes $\overline{\textrm{Bin}}\left(\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)$, then \[ \Pr_{S\sim D^{m}}\left(c^{*}=A(S^{*}):\,\,\, c_{D}^{*}>\overline{\textrm{Bin}}\left(\hat{c}_{S-S^{*}}^{*},\frac{\delta}{m{m \choose |S-S^{*}|}}\right)\right)>\delta-\delta^{2}\] \end{thm} \begin{proof} The proof is constructive and similar to the Occam's Razor tightness result. In particular, we show how to construct a learning algorithm which outputs classifiers that err independently, depending on the subset $S'$ used. Consider an input space $X=\{0,1\}^{2^{m}}$. Each variable $x_{S'}$ in the input space can be thought of as indexing a unique subset $S'\subseteq S$ of the examples. In the rest of the proof, we index variables by the subset they correspond to. Draws from the distribution $D$ can be made by first flipping an unbiased coin to get $y=1$ with probability $0.5$ and $y=-1$ with probability $0.5$. The distribution on $X$ consists of a set of independent values after conditioning on $y$. Choose \[ \Pr(x_{S'}\neq y)=\overline{\textrm{Bin}}\left(\frac{k}{m},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] Now, the learning algorithm $A(S')$ is very simple: it just outputs the classifier $c(x)=x_{S'}$.
On the set $S-S'$, we have:\[ \forall S'\,\,\,\Pr_{S\sim D^{m}}\left(\hat{c}_{S-S'}\geq\frac{k}{m}\right)=1-\frac{\delta}{m{{m \choose |S-S'|}}}\] Using independence, we get:\[ \Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}\geq\frac{k}{m}\right)=\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] Negating, we get:\[ \Pr_{S\sim D^{m}}\left(\exists S'\,\,\,\hat{c}_{S-S'}<\frac{k}{m}\right)=1-\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\] and doing some algebra, we get the result. \end{proof} \subsection{Application of the Sample Compression Bound} \label{sub:SC-application} One obvious application of the sample compression bound is to support vector machines, since the learned classifier depends only on the set of support vectors. If $S'$ is the set of support vectors, then $S-S'$ is the set of nonsupport vectors. Unfortunately, it turns out that this does not work so well, as observed in figure \ref{fig:sc-svm}. % \begin{figure} \includegraphics[% width=0.60\columnwidth, angle=270]{test_and_compress.ps} \caption{\label{fig:sc-svm}The sample compression bound applied to the output of a support vector machine with a Gaussian kernel. Here we use $\delta=0.01$.} \end{figure} There are other, less common learning algorithms for which the sample compression bound works well. The Set Covering Machine \cite{SCM} has an associated bound which is a variant of the sample compression bound. \section{Conclusion} This introduction to sample complexity in learning theory covered two styles of bound: the test set bound and the train set bound. There are two important lessons here: \begin{enumerate} \item Test set bounds provide a better way to report error rates and confidence intervals on future error rates than some current methods. \item Train set bounds can be used practically to drive a learning algorithm. \end{enumerate} It is important to note that the train set bound and test set bound techniques are not mutually exclusive.
It is possible to use both simultaneously \cite{TnT}, and doing so is often desirable. Test set bounds are improved by the {}``free'' information about the training error, and train set bounds benefit from widely increased applicability. \section*{Appendix} \label{sec-model} For those interested in comparing models, the uniform convergence model \cite{Vapnik} requires the additional assumption of the axiom of choice (to achieve empirical risk minimization) and a bound on the hypothesis space complexity. Typical theorems are of the form {}``after $m$ examples, all training errors are near to true errors''. The PAC learning model \cite{Valiant} requires a polynomial time complexity learning algorithm and the assumption that the learning problem comes from some class. Theorems are of the form {}``after $m$ examples learning will be achieved''. Both of these models can support stronger statements than the basic sample complexity model presented here, and results from both can apply to it after appropriate massaging. The online learning model \cite{Online} makes \emph{no} assumptions. Typical theorems have the form {}``this learning algorithm's performance will be nearly as good as any one of a set of classifiers''. The online learning model has very general results but no ability to answer questions about future performance of the sort we address here. The sample complexity model can most simply be understood as a slight refinement of the information theory model. \begin{thebibliography}{10} \bibitem{BEHW}A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. {}``Occam's Razor.'' Information Processing Letters 24: 377--380, 1987. \bibitem{Progressive}Avrim Blum, Adam Kalai, and John Langford. {}``Beating the Holdout: Bounds for K-Fold and Progressive Cross-Validation.'' COLT 1999.
http://www.cs.cmu.edu/\textasciitilde{}jcl/papers/progressive\_validation/coltfinal.ps \bibitem{Devroye}Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. {}``A Probabilistic Theory of Pattern Recognition.'' Springer-Verlag, New York, 1996. \bibitem{FW}Sally Floyd and Manfred Warmuth. {}``Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension.'' Machine Learning, Vol. 21 (3), pp. 269--304, December 1995. http://www.cse.ucsc.edu/\textasciitilde{}manfred/pubs/sallycompr.forgalleys.ps \bibitem{kernelize}R. Herbrich and T. Graepel. {}``Large scale Bayes point machines.'' In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 528--534, Cambridge, MA, 2001. \bibitem{Sanity}Michael Kearns and Dana Ron. {}``Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.'' Neural Computation 11(6), pages 1427--1453, 1999. Also in Proceedings of the Tenth Annual Conference on Computational Learning Theory, ACM Press, 1997, pages 152--162. http://www.cis.upenn.edu/\textasciitilde{}mkearns/papers/multi.ps \bibitem{Online}J. Kivinen and M. Warmuth. {}``Additive Versus Exponentiated Gradient Updates for Linear Prediction.'' Information and Computation, vol. 132, no. 1, pp. 1--64, January 1997. http://www.cse.ucsc.edu/\textasciitilde{}manfred/pubs/lin.ps \bibitem{bound}John Langford. Program {}``bound''. http://www-2.cs.cmu.edu/\textasciitilde{}jcl/programs/bound/bound.html \bibitem{MC_journal}John Langford and Avrim Blum. {}``Microchoice Bounds and Self Bounding Learning Algorithms.'' Machine Learning Journal. http://www.cs.cmu.edu/\textasciitilde{}jcl/papers/microchoice/journal/journal\_final.ps \bibitem{Shell}John Langford and David McAllester. {}``Computable Shell Decomposition Bounds.'' COLT 2000.
http://www-2.cs.cmu.edu/\textasciitilde{}jcl/papers/computable\_shell/colt\_final.ps \bibitem{averaging}John Langford and Matthias Seeger. {}``Bounds for Averaging Classifiers.'' Technical report, Carnegie Mellon, 2001. http://www-2.cs.cmu.edu/\textasciitilde{}jcl/papers/averaging/averaging\_tech.ps \bibitem{PB-margin}John Langford and John Shawe-Taylor. {}``PAC-Bayes \& Margins.'' NIPS 2002. http://www-2.cs.cmu.edu/\textasciitilde{}jcl/papers/PB-margin/PB-margin.ps \bibitem{TnT}John Langford. {}``Combining Train Set and Test Set Bounds.'' ICML 2002. http://www-2.cs.cmu.edu/\textasciitilde{}jcl/papers/tnt\_icml/train\_n\_test\_final.ps \bibitem{LW}Nick Littlestone and Manfred Warmuth. {}``Relating Data Compression and Learnability.'' Unpublished manuscript. http://www.cse.ucsc.edu/\textasciitilde{}manfred/pubs/lrnk-olivier.ps \bibitem{PB}David McAllester. {}``PAC-Bayesian Model Averaging.'' COLT 1999. http://www.autoreason.com/posterior01.ps \bibitem{SCM}Mario Marchand and John Shawe-Taylor. {}``The Set Covering Machine.'' ICML 2001. http://www.ai.mit.edu/projects/jmlr/papers/volume3/marchand02a/marchand02a.pdf \bibitem{Matthias}Matthias Seeger. {}``PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification.'' Journal of Machine Learning Research 3 (2002), 233--269. http://www.dai.ed.ac.uk/homes/seeger/papers/seeger02a.ps.gz \bibitem{Valiant}L. G. Valiant. {}``A Theory of the Learnable.'' Communications of the ACM 27(11):1134--1142, November 1984. \bibitem{Vapnik}V. N. Vapnik and A. Y. Chervonenkis. {}``On the uniform convergence of relative frequencies of events to their probabilities.'' Theory of Probability and its Applications, 16(2):264--280, 1971. \end{thebibliography} \end{document}