In the PAC-Bayes setting, a classifier is defined by a distribution $Q$ over the hypothesis space $H$. Each classification is carried out according to a hypothesis $h$ sampled from $Q$. We are interested in the gap between the expected generalization error $\mathbb{E}_{h \sim Q}\, e_D(h)$ and the expected empirical error $\mathbb{E}_{h \sim Q}\, \hat{e}_S(h)$ on a sample $S$ of $m$ examples drawn from the data distribution $D$, where both expectations are taken with respect to $Q$. The gap will be parameterized by the Kullback-Leibler divergence $\mathrm{KL}(Q \| P)$ between $Q$ and a prior distribution $P$ chosen before the sample is drawn (see [10]). Recall that:
\[
\mathrm{KL}(Q \| P) \;=\; \mathbb{E}_{h \sim Q} \ln \frac{Q(h)}{P(h)}. \qquad (6.1.1)
\]

THEOREM (PAC-Bayes bound). For all priors $P$ over $H$ and all $\delta \in (0,1]$, with probability at least $1-\delta$ over the draw of the sample $S \sim D^m$, simultaneously for all posteriors $Q$:
\[
\mathbb{E}_{h \sim Q}\, e_D(h) \;\leq\; \mathbb{E}_{h \sim Q}\, \hat{e}_S(h) \;+\; \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln \frac{m+1}{\delta}}{2m}}. \qquad (6.1.2)
\]
PROOF. Given in [39]. ▫
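To make the statement concrete, the following short Python sketch (purely illustrative; the finite hypothesis set, the uniform prior, and the particular posterior are arbitrary assumptions, not taken from the text) evaluates the right-hand side of (6.1.2):

import numpy as np

def pac_bayes_bound(emp_errors, Q, P, m, delta):
    # emp_errors[i]: empirical error of hypothesis i on the m training examples
    # Q[i], P[i]:    posterior and prior weights of hypothesis i (each sums to 1)
    gibbs_emp_error = np.dot(Q, emp_errors)          # expected empirical error under Q
    nz = Q > 0                                       # terms with Q[i] = 0 contribute nothing
    kl_qp = np.sum(Q[nz] * np.log(Q[nz] / P[nz]))    # KL(Q || P) as in (6.1.1)
    slack = np.sqrt((kl_qp + np.log((m + 1) / delta)) / (2 * m))
    return gibbs_emp_error + slack                   # right-hand side of (6.1.2)

# Illustrative numbers only: 1000 hypotheses, uniform prior, posterior spread
# uniformly over the 10 hypotheses with the lowest empirical error.
rng = np.random.default_rng(0)
m, delta = 1000, 0.05
emp_errors = rng.uniform(0.1, 0.5, size=1000)
P = np.full(1000, 1.0 / 1000)
Q = np.zeros(1000)
Q[np.argsort(emp_errors)[:10]] = 1.0 / 10
print("bound on the expected true error:", pac_bayes_bound(emp_errors, Q, P, m, delta))

Note that with a uniform prior, a posterior spread uniformly over $k$ hypotheses has $\mathrm{KL}(Q \| P) = \ln(|H|/k)$, so averaging shrinks the complexity term relative to a point mass on a single hypothesis.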
This PAC-Bayes bound is almost the same as the Occam's Razor bound (theorem 4.6.1) when the distribution $Q$ is peaked on a single hypothesis and the Occam's Razor bound is proved using the looser Hoeffding inequality. This can be seen by noting that when $Q$ places all of its mass on a single hypothesis $h$, the KL-divergence satisfies $\mathrm{KL}(Q \| P) = \ln \frac{1}{P(h)}$. Comparing with the Occam's Razor bound, we see that a (small) extra term of size $\ln(m+1)$ has been introduced inside the square root in return for the capability to average with respect to any posterior $Q$. It is not yet clear whether this term needs to be there.
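In detail (spelling out the substitution into the form of (6.1.2) given above): if $Q$ places probability one on a single hypothesis $h$, then
\[
e_D(h) \;\leq\; \hat{e}_S(h) \;+\; \sqrt{\frac{\ln \frac{1}{P(h)} + \ln (m+1) + \ln \frac{1}{\delta}}{2m}},
\]
which is the Hoeffding form of the Occam's Razor bound with the extra $\ln(m+1)$ term inside the square root.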
The real power of the PAC-Bayes bound appears when the average is over many hypotheses. This might occur if the distribution $Q$ is picked using Bayes' law or a Gibbs distribution. One of the most interesting aspects of the PAC-Bayes bound is that it holds for both finite and infinite hypothesis spaces.
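Continuing the numerical sketch above (again purely illustrative; the inverse temperature $\beta$ and its value are arbitrary assumptions), a Gibbs posterior over the same finite hypothesis set can be formed and plugged into the bound:

# Gibbs posterior: Q(h) proportional to P(h) * exp(-beta * m * emp_error(h)).
# Small beta keeps Q spread over many hypotheses; large beta concentrates it.
beta = 0.01
logits = np.log(P) - beta * m * emp_errors
Q_gibbs = np.exp(logits - logits.max())   # subtract the max for numerical stability
Q_gibbs /= Q_gibbs.sum()
print("bound with a Gibbs posterior:", pac_bayes_bound(emp_errors, Q_gibbs, P, m, delta))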