The first shell bound paper was joint work with David McAllester and was presented at COLT [33]. The work presented here incorporates significant refinement, generalization, and simplification of that COLT paper.
Roughly speaking, the shell bound usually provides much tighter upper bounds on the true error rates of learned hypotheses than the conventional Occam's Razor bound (theorem 4.6.1) or PAC-Bayes bounds (theorem 6.2.1). It does this without violating the lower bounds of section 4.4, by incorporating much more information into the bound calculation.
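For orientation, a standard Hoeffding-style form of an Occam's Razor bound (the notation here is generic, and the exact statement of theorem 4.6.1 may differ) is: with probability at least $1-\delta$ over the draw of $m$ independent examples,
\[
\forall h \in H: \quad e(h) \;\leq\; \hat{e}(h) + \sqrt{\frac{\ln\frac{1}{p(h)} + \ln\frac{1}{\delta}}{2m}},
\]
where $e(h)$ is the true error rate of $h$, $\hat{e}(h)$ its empirical error rate, and $p(h)$ is any prior weighting over hypotheses satisfying $\sum_{h} p(h) \leq 1$. Roughly speaking, the shell bound replaces the single worst-case penalty term $\ln\frac{1}{p(h)}$ with information drawn from the entire observed distribution of empirical error rates over hypotheses.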
The inspiration behind the work on shell bounds rests on two pieces of work. In [22], by Haussler, Kearns, Seung, and Tishby, learning curves are investigated from an omniscient point of view where the true error rates of various hypotheses are known. The principal improvement here is that our bounds are reduced to observable quantities; put another way, we do not need to know the underlying learning distribution. In [47], an analysis was made assuming some distribution over true error rates. Our analysis does not rely on any assumption about the distribution of true error rates; only the independence assumption is made. Despite using only observable information and making no extra assumptions, the results here are quite tight and yield practical results.
We start with the distribution of empirical errors over hypotheses and subtract a small amount from the empirical error rates to create a pessimistic distribution. With high probability, the cumulative distribution of the pessimistic error rates lower bounds the cumulative distribution of the hypothesis true error rates. Given this, we can directly calculate a bound on the probability that a hypothesis with large true error produces a misleadingly small empirical error. This bound can be much tighter than standard union bound techniques, although the amount of improvement is highly problem dependent.
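To make the ingredients concrete, the following is a minimal illustrative sketch, not the bound derived later in the chapter: it shifts each empirical error rate down by a Hoeffding-style deviation term, histograms the shifted rates into shells, and sums, over shells above a target error level, the shell size times the binomial probability of committing at most k errors. The function names (binom_cdf, misleading_probability_sketch), the choice of deviation term, the shell width, and the union-bound bookkeeping are all placeholder assumptions.

```python
import math

def binom_cdf(k, m, p):
    """P[Bin(m, p) <= k]: probability of at most k errors on m samples."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0 if k < m else 1.0
    total = 0.0
    for i in range(k + 1):
        total += math.comb(m, i) * p**i * (1.0 - p)**(m - i)
    return min(total, 1.0)

def misleading_probability_sketch(empirical_errors, m, k, threshold,
                                  delta=0.05, width=0.01):
    """Illustration only: roughly how likely is it that some hypothesis whose
    (pessimistically shifted) error rate exceeds `threshold` would nonetheless
    commit at most k errors on m samples?  The deviation `eps`, the shell
    `width`, and the simple summation below stand in for the more careful
    bookkeeping of the actual shell bound."""
    n = len(empirical_errors)
    # Hoeffding-style deviation, union-bounded over the n hypotheses.
    eps = math.sqrt(math.log(n / delta) / (2.0 * m))
    pessimistic = [max(0.0, e - eps) for e in empirical_errors]

    # Histogram ("shells") of the pessimistically shifted error rates.
    shells = {}
    for e in pessimistic:
        idx = int(e / width)
        shells[idx] = shells.get(idx, 0) + 1

    # Sum, over shells above the threshold, (shell size) x (chance of <= k errors
    # at the shell's lower edge, which is the conservative choice).
    bound = 0.0
    for idx, count in shells.items():
        level = idx * width
        if level > threshold:
            bound += count * binom_cdf(k, m, level)
    return min(bound, 1.0)
```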
After presenting the first bound, we will transform it into a bound for continuous hypothesis spaces using a PAC-Bayes-like approach [39].
Viewed as an interactive proof of learning (figure 8.0.1), the stochastic shell bound is much like the PAC-Bayes bound.
The strongest criticism of shell bounds is, in fact, that too much information is required. While this information is always theoretically observable, it may not be tractable to collect. Two answers to this criticism are given here. The first is an empirical application to decision tree learning algorithms which shows that, in practice, there are often shortcuts that make the information gathering feasible. The second is to construct a sampled version of the shell bound which uses approximate versions of the information required by the exact shell bound. We will: