It is important to demonstrate that this covering number is feasible to calculate and gives a better answer than the traditional approach. We will do this by first calculating the bracketing covering number for a very simple continuous classifier and then comparing the results with the traditional covering number approach.
Bracketing covering number bounds have already been proved for many function classes [13][49]. Here, we will present a proof for the simplest of continuous hypothesis spaces: the step functions on a line segment. Each hypothesis is indexed by a threshold $\theta$ on the segment according to $h_\theta(x) = I(x \geq \theta)$. What is the bracketing covering number for this hypothesis set?
PROOF. Consider the range of hypotheses with thresholds from $\theta_1$ to $\theta_2$. For this range of hypotheses, we can choose a bracketing pair, $(g^-, g^+)$. In particular, we can choose $g^-$ and $g^+$ which agree with every hypothesis in the range outside of $[\theta_1, \theta_2)$ and always predict either incorrectly ($g^+$) or correctly ($g^-$) on $[\theta_1, \theta_2)$. The distance between these functions satisfies $d(g^-, g^+) = \Pr_{x \sim D_X}(x \in [\theta_1, \theta_2))$, and every hypothesis $h_\theta$ with $\theta \in [\theta_1, \theta_2]$ satisfies $g^- \leq h_\theta \leq g^+$ in the partial order: on every example, the loss of $g^-$ is at most the loss of $h_\theta$, which is at most the loss of $g^+$. Consequently, if $\theta_1$ and $\theta_2$ are chosen appropriately, we will observe $d(g^-, g^+) \leq \epsilon$.
If $D$ can be described by a probability density, then we can simply calculate the marginal distribution, $D_X$, and the cumulative distribution of the marginal, $F(x) = \Pr_{x' \sim D_X}(x' \leq x)$. Now, find the points $x_i$ for which $F(x_i) = i\epsilon$ for $i \in \{1, \ldots, \lceil 1/\epsilon \rceil - 1\}$. Choose the bracketing pairs on the consecutive intervals $[x_i, x_{i+1})$, taking $x_0$ and $x_{\lceil 1/\epsilon \rceil}$ to be the endpoints of the segment. There are at most $\lceil 1/\epsilon \rceil$ intervals, each with a measure (according to $D_X$) of at most $\epsilon$. Consequently, we can cover the hypothesis space with $\lceil 1/\epsilon \rceil$ pairs of functions. ▫
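As a concrete illustration of the quantile construction in the proof, the following sketch splits the segment at the $\epsilon$-quantiles of a hypothetical continuous marginal distribution, so that consecutive thresholds bound an interval of mass at most $\epsilon$. The function name bracket_thresholds and the choice of a Beta(2,5) marginal are illustrative assumptions, not part of the original argument.

```python
import math
from scipy.stats import beta  # stands in for a hypothetical continuous marginal D_X

def bracket_thresholds(quantile, epsilon):
    """Split the segment at the epsilon-quantiles of the marginal, as in the
    proof above: consecutive thresholds bound an interval of mass <= epsilon."""
    n_intervals = math.ceil(1.0 / epsilon)
    return [quantile(min(i * epsilon, 1.0)) for i in range(n_intervals + 1)]

epsilon = 0.1
marginal = beta(2, 5)                      # illustrative choice of D_X on [0, 1]
ts = bracket_thresholds(marginal.ppf, epsilon)

# Each consecutive pair of thresholds defines one bracketing pair (g^-, g^+):
# g^- predicts correctly and g^+ predicts incorrectly on the interval between
# them, so d(g^-, g^+) equals the interval's mass, which is at most epsilon.
masses = [marginal.cdf(b) - marginal.cdf(a) for a, b in zip(ts, ts[1:])]
print(len(masses), max(masses))            # ceil(1/epsilon) intervals, each of mass <= epsilon
```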
Given that the bracketing cover is of size $\lceil 1/\epsilon \rceil$, we can use theorem 9.3.1 to define a constraint that the true error rate must satisfy with high probability. Setting , we get: and To be fair in comparison to the standard covering number approaches, we should relax our theorem to use the Hoeffding approximation. Note that this is a bit unfair, because the first inequality is (inherently) a highly biased Binomial with lower variance. Relaxing to the Hoeffding bound, we get: and which implies: Note again that we are being “unfair” to the new approach by using the Hoeffding approximation rather than exact Binomial-tail bounds; the standard covering number approach has not yet been reduced to exact Binomial-tail bounds. Using the standard approach, we get a covering number of . This implies a bound of: Comparing the bounds, we see that the new approach is about times more efficient in the number of examples required to achieve a bound on a given deviation.
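Since the explicit constants from theorem 9.3.1 and from the standard covering number bound are not reproduced above, the following sketch only illustrates the mechanics of the comparison: it inverts a generic Hoeffding-style bound of the form (leading constant) · (cover size) · exp(−c·m·γ²) ≤ δ for the required number of examples m. The leading constants and exponent scales below are placeholders chosen for illustration, not the values appearing in this section's theorems.

```python
import math

def examples_needed(cover_size, gamma, delta, leading_constant=1.0, exponent_scale=2.0):
    """Smallest m with leading_constant * cover_size * exp(-exponent_scale * m * gamma^2) <= delta.
    Generic Hoeffding-style inversion; the constants in this section's actual
    theorems may differ, so the outputs are qualitative only."""
    return math.ceil(math.log(leading_constant * cover_size / delta) / (exponent_scale * gamma ** 2))

epsilon, gamma, delta = 0.05, 0.1, 0.05
cover = math.ceil(1 / epsilon)                        # bracketing cover of size ceil(1/epsilon)
m_bracketing = examples_needed(cover, gamma, delta)   # placeholder constants: 1 * N * exp(-2 m gamma^2)
m_standard = examples_needed(cover, gamma, delta,     # looser placeholder constants of the kind that appear
                             leading_constant=4.0,    # in symmetrization-based bounds (not from theorem 9.3.1)
                             exponent_scale=1.0 / 32)
print(m_bracketing, m_standard)
```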
There are several important things to note about this proof.
In fact, the proof can be extended to all distributions (even ones with point masses) at the cost of a constant-factor worsening and a messier argument. Property (2) is desirable because it is often not the case that we know the distribution when we wish to apply the bound. Property (1) describes an essential technique that can be used to prove other covering number bounds for this notion of covering number.
Can we show partial order covering number bounds for other classifiers? There is a straightforward extension of the previous proof to classifiers which consist of axis-parallel intervals in $\mathbb{R}^n$. More work is required to prove partial order covering number bounds for the hypothesis spaces of standard learning algorithms.