Let $\mathrm{Bin}(m,k,\epsilon) \equiv \binom{m}{k}\epsilon^{k}(1-\epsilon)^{m-k}$ be the probability that a hypothesis with true error $\epsilon$ produces exactly $k$ errors on $m$ independent examples.
The discrete shell bound works directly with the probability that there will be a confusingly small empirical error. Let
$$\mu(k,\epsilon) \equiv \sum_{h \in H:\; e(h) > \epsilon} \mathrm{Bin}(m,k,e(h)).$$
Intuitively, $\mu(k,\epsilon)$ is a bound on the probability that a hypothesis with a true error rate larger than $\epsilon$ will have an empirical error rate of $\frac{k}{m}$. The contribution of each hypothesis to the sum falls off exponentially as its true error, $e(h)$, increases.
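As a concrete sketch, the two quantities above can be computed directly for a small hypothesis class; the names `bin_prob` and `mu` and the toy error rates below are illustrative, not notation from the text:

```python
from math import comb

def bin_prob(m, k, eps):
    """Probability that a hypothesis with true error eps produces
    exactly k errors on m independent examples."""
    return comb(m, k) * eps**k * (1.0 - eps)**(m - k)

def mu(m, k, eps, true_errors):
    """Union-bound sum over all hypotheses whose true error exceeds eps:
    an upper bound on the chance that one of them shows empirical error k/m."""
    return sum(bin_prob(m, k, e) for e in true_errors if e > eps)

# Toy hypothesis class: four hypotheses with known true error rates.
true_errors = [0.05, 0.3, 0.45, 0.5]
print(mu(100, 2, 0.2, true_errors))  # only the three hypotheses above 0.2 contribute
```

Note how small the sum is: a hypothesis with true error 0.3 or more essentially never produces only 2 errors on 100 examples, which is exactly the exponential falloff the text describes.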
Our first step is stating a shell bound which requires unknown information. The purpose of this bound is motivational: it provides insight into why we can expect a large improvement. Later, we will remove the unknown information requirements and recover a useful bound.
The full knowledge theorem relies on unobservable information—the true error rates of all hypotheses. This theorem is not (quite) trivial because it does not rely upon information about which hypothesis has a particular true error.
PROOF. For every hypothesis $h$ with a true error rate of $e(h)$, the probability of producing an empirical error of $\frac{k}{m}$ is $\mathrm{Bin}(m,k,e(h))$. Applying the union bound over all hypotheses and all possible nontrivial values of $k$ completes the proof. ▫
There are a couple of things to note about this theorem. First, for most balanced machine learning problems, most hypotheses typically have a true error rate near $\frac{1}{2}$. Given this, and noticing that binomial tails fall off exponentially, dramatic improvements in the bound are feasible.
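A quick numeric check of this falloff (a sketch; the particular values of $m$ and $k$ are only illustrative): a hypothesis with true error near $\frac{1}{2}$ has an astronomically small chance of showing a low empirical error, so its shell contributes essentially nothing to the bound.

```python
from math import comb

def bin_prob(m, k, eps):
    # probability of exactly k errors in m trials at true error eps
    return comb(m, k) * eps**k * (1.0 - eps)**(m - k)

m, k = 100, 10
for eps in (0.1, 0.3, 0.5):
    print(eps, bin_prob(m, k, eps))
# the eps = 0.5 term is smaller than the eps = 0.1 term by many orders of magnitude
```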
Second, we must use $\frac{\delta}{m+1}$ rather than $\delta$ in order to make the proof work. It is possible that the theorem holds without splitting $\delta$ “$(m+1)$-ways”. Removing the factor of $m+1$ is an open problem. For the special case of empirical risk minimization algorithms, McDiarmid’s inequality [40] implies that the range of hypotheses with minimum empirical error is of size $O(\sqrt{m})$ with high probability. Therefore, we need only apply the union bound to $O(\sqrt{m})$ possible minimum empirical error rates.
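To get a feel for how mild the cost of this split is, compare the square-root deviation term of a standard Occam-style discrete hypothesis bound with and without dividing $\delta$ by $m+1$ (the function name and the example sizes below are illustrative): the split only adds $\ln(m+1)$ inside the logarithm.

```python
from math import log, sqrt

def deviation(m, H, delta):
    # square-root deviation term of a discrete-hypothesis (Occam) style bound
    return sqrt(log(H / delta) / (2 * m))

m, H, delta = 1000, 10**6, 0.01
plain = deviation(m, H, delta)            # delta used whole
split = deviation(m, H, delta / (m + 1))  # delta split (m+1)-ways
print(plain, split)  # the two deviations differ by well under 0.02 here
```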
Applying the full knowledge theorem (8.1.1) is impractical in almost all learning settings because we do not know the distribution of true error rates. Therefore, it is necessary to construct a slightly looser theorem which relies upon only observable quantities. Surprisingly, this is possible while introducing only slightly more slack.
First, we need a couple of definitions.
Intuitively, $\underline{\epsilon}(k)$ is a lower bound on the true error rate of a hypothesis with an empirical error of $\frac{k}{m}$.
The quantity $\hat{\mu}(k,\epsilon)$ is an upper bound on the probability that one of the hypotheses with a true error rate of $\epsilon$ or more could produce $k$ empirical errors.
Noting that there are only $m+1$ possible empirical errors, we can first let $\hat{c}_{j} \equiv \left|\left\{h \in H : \hat{e}(h) = \frac{j}{m}\right\}\right|$ be the count of empirical errors at $\frac{j}{m}$. Then we can redefine:
$$\hat{\mu}(k,\epsilon) \equiv \sum_{j:\; \underline{\epsilon}(j) \ge \epsilon} \hat{c}_{j}\, \mathrm{Bin}\!\left(m,k,\underline{\epsilon}(j)\right).$$
Later, we will prove that, with high probability, $\underline{\epsilon}(k)$ is indeed a lower bound on the true error rate of every hypothesis with empirical error $\frac{k}{m}$. Given that this is so, we can prove a theorem which relies only on observable quantities.
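Since only $m+1$ empirical error values are possible, the counts are cheap to tabulate from the observed empirical error rates. A minimal sketch (the helper name `empirical_counts` and the toy rates are illustrative):

```python
from collections import Counter

def empirical_counts(m, empirical_errors):
    """c_hat[j] = number of hypotheses with empirical error j/m;
    only the m+1 values j = 0, ..., m are possible."""
    c = Counter(round(e * m) for e in empirical_errors)
    return [c.get(j, 0) for j in range(m + 1)]

m = 10
emp = [0.0, 0.1, 0.1, 0.5, 0.5, 0.5]  # toy empirical error rates of six hypotheses
c_hat = empirical_counts(m, emp)
print(c_hat)  # index j holds the count of hypotheses at empirical error j/10
```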
The observable shell bound preserves the important locality property of the full knowledge shell bound. In particular, when most of the true error rates are “far” from the empirical error rate (and $m$ is large enough), we expect to make large (functional) improvements on the discrete hypothesis bound (4.2.1).
The proof rests upon a technical lemma which allows us to bound the unobservable “probability of a misleading event” with an observable event.
This lemma is powerful because it bounds the unobservable right hand side in terms of the observable left hand side.
PROOF. Let $X$ denote the (nonnegative) quantity on the right hand side. By Markov’s inequality, $\Pr_{S}\!\left(X \ge \frac{\mathbf{E}_{S}[X]}{\delta}\right) \le \delta$. Setting the failure probability to $\delta$ and replacing $\mathbf{E}_{S}[X]$ with the observable left hand side achieves the result. ▫
We now have all the tools required to prove the theorem.
PROOF. (of theorem 8.1.2) Choose $P$ = the uniform distribution on our hypothesis space. Then, we know that with probability $1-\frac{\delta}{2}$, the observable quantities bound their unobservable counterparts. Therefore, we can (arbitrarily) allocate a probability of failure of $\frac{\delta}{2}$ to the unobservable bound (lemma 8.1.3) and a probability of failure of $\frac{\delta}{2}$ to the full knowledge bound (8.1.1). Assuming neither failure occurs, the observable bound will be more pessimistic than the full knowledge bound. ▫
The observable shell bound behaves in a strange manner unlike other true error bounds. In particular, the true error bound can be discontinuous as a function of the observed quantities. This discontinuity implies that relatively small improvements in the shell bounds can result in dramatic improvements in the value of the true error bound. While such dramatic improvements can happen, we expect that in practice they will not be too common, simply because a small improvement is unlikely (amongst all learning problems) to shift the bound across one of these discontinuous transitions.