The next section is devoted to an improvement of the microchoice bound called adaptive microchoice, which arises from synthesizing Freund’s query trees [17] with the microchoice bound. This improvement is not easily expressed as a simplification of Structural Risk Minimization. In essence, the adaptive microchoice bound can benefit from dependence on the distribution of the learning problem, and so can take advantage of an “easy” distribution.
First we require some background material in order to state and understand Freund’s bound.
The statistical query framework introduced by Kearns [26] restricts learning algorithms to access the data only through statistical queries. A statistical query takes as input a binary predicate, $\chi$, mapping examples to a binary output: $\chi : X \times Y \to \{0,1\}$. The output of the statistical query is the average of $\chi$ over the examples seen. Let $(x_i, y_i)$ be the $i$th labeled example; then:
$$\hat{P}_\chi = \frac{1}{m} \sum_{i=1}^{m} \chi(x_i, y_i)$$
The output $\hat{P}_\chi$ is an empirical estimate of the true value of the query, $P_\chi$, under the distribution $D$. One simple example of a predicate is “the first bit of $x$ is $1$”. A more complicated predicate might be “the third bit of $x$ xor the fourth bit of $x$ is $1$”. Naturally, the distribution of $m\hat{P}_\chi$ will be the familiar binomial distribution.
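Concretely, the empirical answer to a statistical query is just the sample mean of a 0/1 predicate. The following sketch uses a hypothetical synthetic data set; the function and variable names are ours:

```python
import random

def statistical_query(chi, sample):
    """Empirical answer: the average of the 0/1 predicate chi over the sample."""
    return sum(chi(x, y) for x, y in sample) / len(sample)

# Hypothetical data: x is a pair of uniform random bits, y a random binary label.
random.seed(0)
sample = [((random.randint(0, 1), random.randint(0, 1)), random.randint(0, 1))
          for _ in range(1000)]

# The predicate "the first bit of x is 1".
first_bit_is_one = lambda x, y: x[0]
p_hat = statistical_query(first_bit_is_one, sample)  # close to 0.5 here
```

Since each $\chi(x_i, y_i)$ is an independent coin with bias $P_\chi$, the count $m\hat{P}_\chi$ is binomially distributed, which is what makes the deviation bounds below applicable.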
It is convenient to define $P_\chi = \Pr_{(x,y) \sim D}[\chi(x,y) = 1]$ and $\tau(\delta) = \sqrt{\frac{\ln\frac{2}{\delta}}{2m}}$, and let $I(\delta) = [P_\chi - \tau(\delta),\, P_\chi + \tau(\delta)]$. Intuitively, $I(\delta)$ is a (fixed) interval in which the random variable $\hat{P}_\chi$ will fall with high probability. In other words, we know that:
$$\Pr_{S \sim D^m}\left[\hat{P}_\chi \in I(\delta)\right] \ge 1 - \delta$$
Now, we want to construct a confidence interval for $P_\chi$ based upon the high confidence interval $I(\delta)$. We can do this using the inversion lemma (3.4.1) to get:
$$\underline{P}_\chi(\delta) = \hat{P}_\chi - \tau(\delta) \quad \text{and} \quad \overline{P}_\chi(\delta) = \hat{P}_\chi + \tau(\delta)$$
The random interval $[\underline{P}_\chi(\delta), \overline{P}_\chi(\delta)]$ defined here contains the “real” answer $P_\chi$ with high probability. In other words, we have:
$$\Pr_{S \sim D^m}\left[P_\chi \in [\underline{P}_\chi(\delta), \overline{P}_\chi(\delta)]\right] \ge 1 - \delta$$
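A minimal sketch of this construction, assuming the two-sided Hoeffding width $\tau(\delta) = \sqrt{\ln(2/\delta)/2m}$ (the inversion lemma gives a tighter binomial-tail version; the function names are ours):

```python
import math

def tau(delta, m):
    # Two-sided Hoeffding width: Pr[|p_hat - p| > tau] <= 2 exp(-2 m tau^2) = delta.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def confidence_interval(p_hat, delta, m):
    """Random interval containing the true query value with probability >= 1 - delta."""
    t = tau(delta, m)
    return max(0.0, p_hat - t), min(1.0, p_hat + t)

# Example: empirical answer 0.52 from m = 1000 examples at confidence delta = 0.05.
lo, hi = confidence_interval(0.52, 0.05, 1000)
```

At $m = 1000$ and $\delta = 0.05$ the interval has half-width about $0.043$, illustrating how the width shrinks as $O(1/\sqrt{m})$.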
Freund [17] considers choice algorithms that at each step perform a statistical query on the sample, using the result to determine which choice to take. For an algorithm $A$, tolerance $\tau$ (defined next), and distribution $D$, Freund defines the query tree $T(A, \tau, D)$ as the choice tree created by considering only those choices resulting from answers $\hat{P}_\chi$ to queries $\chi$ such that $|\hat{P}_\chi - P_\chi| \le \tau$. The idea is that if a particular predicate, $\chi$, is true with probability $P_\chi$, it is very unlikely that the empirical result of the query on a random sample will fall far from $P_\chi$. More generally, the chance the answer to a given query is off by more than $\tau$ is at most $2e^{-2\tau^2 m}$ by Hoeffding’s inequality. So, if the entire tree contains a total of $Q$ queries in it, the probability any of these queries is off by more than $\tau$ is at most $2Qe^{-2\tau^2 m}$. In other words, this is an upper bound on the probability that the algorithm ever “falls off the tree” and makes a low probability choice. The point of this is that we can allocate half (say) of the confidence parameter $\delta$ to the event that the algorithm ever falls off the tree, and then spread the remaining half evenly on the hypotheses in the tree (which hopefully is a much smaller set than the entire hypothesis set).
Unfortunately, the query tree suffers from the same problem as the distribution considered in section (5.1), namely that to compute it, one needs to know $D$. So, Freund proposes an algorithmic method to find a super-set approximation of the tree. The idea is that by analyzing the results of queries, it is possible to determine which outcomes were unlikely, given that each query is answered close to its true value. In particular, each time a query $\chi$ is asked and a response $\hat{P}_\chi$ is received, if it is true that $|\hat{P}_\chi - P_\chi| \le \tau$, then the range $[\hat{P}_\chi - 2\tau,\, \hat{P}_\chi + 2\tau]$ contains the range $[P_\chi - \tau,\, P_\chi + \tau]$. Thus, under the assumption that no query in the correct tree is answered badly, a super-set of the correct tree can be produced by exploring all choices resulting from responses within $2\tau$ of the response actually received. By applying this method to every node in the query tree we can generate an empirically observable super-set of the query tree: that is, the original query tree is a pruning of the empirically constructed tree.
A drawback of this method is that it can easily take exponential time to produce the approximate tree, because even the smaller correct tree can have a size exponential in the running time of the learning algorithm. Instead, we would much rather simply keep track of the choices actually made and the sizes of the choice sets at the nodes actually followed, which is what the microchoice approach allows us to do. As a secondary point, given $\delta$, computing a good value of $\tau$ for Freund’s approach is not trivial (see [17]); we will be able to finesse that issue and use a tighter bound.
In order to apply the microchoice approach, we modify Freund’s query tree so that different nodes in the tree receive different confidences, $\delta_v$, much in the same way that different hypotheses in our choice tree receive different values of $\delta(h)$.
The manipulations of the choice tree are now reasonably straightforward. We begin by describing the true microchoice query tree and then give the algorithmic approximation. As with the choice tree in section (5.2), one should think of each node in the tree as representing the current internal state of the algorithm.
We incorporate Freund’s approach into the choice tree construction by having each internal node allocate a portion, $\gamma$, of its “supply” $\delta_v$ of failure probability to the event that $|\hat{P}_\chi - P_\chi| > \tau(\gamma)$. The node then splits the remainder of its supply evenly among the children corresponding to choices that result from answers with $|\hat{P}_\chi - P_\chi| \le \tau(\gamma)$. Choices that would result from “bad” answers to the query are pruned away from the tree and get nothing. This continues down the tree to the leaves. Pictorially, this looks like:
How should $\gamma$ be chosen? Smaller values of $\gamma$ result in larger intervals, leading to more children in the pruned tree and less confidence given to each. Larger values of $\gamma$ result in less left over to divide among the children. Unfortunately, our algorithmic approximation (which only sees the empirical answers and needs to be efficient) will not be able to make this optimization. Therefore, we define $\gamma$ in the true microchoice query tree to be $\frac{\delta_v}{d+2}$, where $d$ is the depth of the current node and $\delta_v$ is its supply. This choice will imply that the adaptive microchoice bound is never much worse than the microchoice bound, and sometimes much better.
Since a particular query value $\hat{P}_\chi$ implies a particular choice, we can think of the interval $[P_\chi - \tau,\, P_\chi + \tau]$ as containing choices rather than query results. After all, we only care about the choices the algorithm makes. We can calculate the probability assigned to a hypothesis in the true adaptive microchoice query tree according to the following algorithm:
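The algorithm can be sketched as follows, assuming the allocation $\gamma = \delta_v/(d+2)$ and the Hoeffding width $\tau(\gamma)$ from above. Because the true tree requires knowing $D$, the sketch takes the (unknowable) true query values as input; the function and parameter names are ours:

```python
import math

def tau(delta, m):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def true_adaptive_microchoice(path, delta, m):
    """Confidence delta(h) assigned to the hypothesis at the end of `path`.

    `path` is a list of (true_value, answers) pairs, one per internal node:
    true_value is the query value under D, and answers lists the possible
    empirical answers, one per child.  Only children whose answers lie
    within tau of the true value survive the pruning.
    """
    supply = delta
    for depth, (true_value, answers) in enumerate(path):
        gamma = supply / (depth + 2)          # portion spent on this query
        t = tau(gamma, m)
        unpruned = [a for a in answers if abs(a - true_value) <= t]
        supply = (supply - gamma) / len(unpruned)
    return supply

# Example: two nodes, each with exactly one unpruned child.
d_h = true_adaptive_microchoice([(0.5, [0.5, 0.9]), (0.2, [0.21])],
                                delta=0.05, m=1000)
```

In the example, both pruned choice sets have size one, so the leaf retains $\delta/3$: the query allocations cost only the factor $(d+1) = 3$.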
There are two important things to note about this algorithm. First of all, we could plug the value it returns into the Occam’s Razor Bound 4.6.1 and receive a bound on the true error rate of our chosen classifier.
Second, this algorithm cannot be executed. The essential problem is determining whether or not $|\hat{P}_\chi - P_\chi| \le \tau$, which cannot be done without knowledge of the underlying distribution $D$. However, we can compute an approximate version of this algorithm which, with high probability, returns a value which is smaller. Since a smaller value is pessimistic, we can use it in our bounds.
The algorithmic approximation uses the idea in [17] of including all choices within the double confidence interval $[\hat{P}_\chi - 2\tau,\, \hat{P}_\chi + 2\tau]$ of the observed value $\hat{P}_\chi$. Unlike [17], however, we do not actually create the tree; instead we just follow the path taken by the learning algorithm, and argue that the “supply” probability remaining at the leaf is no greater than the amount that would have been there in the true tree. Finally, the algorithm outputs a bound calculated with the confidence remaining at the leaf.
Specifically, the algorithm is as follows. Suppose we are at a node of the tree containing statistical query $\chi$ at depth $d$, and we have a supply $\delta_v$ of confidence parameter. (If the current node is the root, then $d = 0$ and $\delta_v = \delta$.) We choose $\gamma = \frac{\delta_v}{d+2}$, ask the query $\chi$, and receive the answer $\hat{P}_\chi$. We now let $k$ be the number of children of our node corresponding to answers in the range $[\hat{P}_\chi - 2\tau(\gamma),\, \hat{P}_\chi + 2\tau(\gamma)]$. We then go to the child corresponding to the answer that we received, giving this child a confidence parameter supply of $\frac{\delta_v - \gamma}{k}$. This is the same as we would have given it had we allocated $\delta_v - \gamma$ to the children equally. We then continue from that child. Finally, when we reach a leaf, we output the probability left for the hypothesis. Pictorially, this looks like:
Notice that the second choice set is larger than in the true adaptive microchoice query tree. This can easily happen, and it makes our results somewhat more pessimistic. The approximate adaptive microchoice algorithm is specified as follows:
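One pass down the empirical path can be sketched as follows, again assuming the Hoeffding-style width and $\gamma = \delta_v/(d+2)$; `candidates` stands in for whatever choice structure the learning algorithm exposes, and the names are ours:

```python
import math

def tau(delta, m):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def approximate_adaptive_microchoice(path, delta, m):
    """Follow the path actually taken; return the confidence left at the leaf.

    `path` is a list of (received, candidates) pairs: the empirical answer
    received at each node and the answers its children correspond to.
    """
    supply = delta
    for depth, (received, candidates) in enumerate(path):
        gamma = supply / (depth + 2)
        t = tau(gamma, m)
        # Double confidence interval: keep children whose answers fall
        # within 2*tau of the answer actually received.
        k = sum(1 for a in candidates if abs(a - received) <= 2.0 * t)
        supply = (supply - gamma) / k
    return supply

# Example mirroring the true-tree case: one surviving child per node.
leaf_supply = approximate_adaptive_microchoice(
    [(0.51, [0.51, 0.9]), (0.2, [0.2])], delta=0.05, m=1000)
```

Note that only the path actually followed is ever touched, so the running time is linear in the number of queries the learning algorithm asks, not in the size of the tree.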
Let $d$ be the depth of some hypothesis $h$ in the empirical path and $C'_1, C'_2, \ldots, C'_d$ be the sequence of choice sets resulting in $h$ in the algorithmic construction; i.e., $|C'_i|$ is the number of unpruned children of the $i$-th node. Then, the confidence placed on $h$ will be:
$$\delta(h) = \frac{\delta}{(d+1) \prod_{i=1}^{d} |C'_i|} \tag{5.3.1}$$
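To make the allocation concrete: under the $\gamma = \delta_v/(d+2)$ scheme, a hypothesis at depth $d = 3$ whose unpruned choice sets have sizes $2$, $1$, and $3$ receives $\delta(h) = \delta / (4 \cdot 2 \cdot 1 \cdot 3) = \delta/24$. A sketch (the helper name is ours):

```python
from math import prod

def delta_h(delta, choice_set_sizes):
    """Confidence on a hypothesis at depth d = len(choice_set_sizes)."""
    d = len(choice_set_sizes)
    return delta / ((d + 1) * prod(choice_set_sizes))

value = delta_h(0.05, [2, 1, 3])  # 0.05 / 24
```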
(Adaptive Microchoice Bound) For all hypothesis spaces $H$, for all distributions $D$:
$$\Pr_{S \sim D^m}\left[\exists h \in H :\; e_D(h) > \hat{e}_S(h) + \sqrt{\frac{\ln\frac{1}{\delta(h)}}{2m}}\right] \le \delta$$
where $\hat{e}_S(h)$ is the empirical error rate of $h$, and $\delta(h)$ is as defined in equation 5.3.1.
PROOF. By design, with probability at least $1 - \delta$, all queries in the true microchoice query tree receive good answers, and all hypotheses in that tree have their true errors within their estimates.
We will prove that, in the high probability case, the output of the Approximate_Adaptive_Microchoice algorithm is less than the output of the True_Adaptive_Microchoice algorithm. Since a smaller $\delta(h)$ makes the bound more pessimistic, this suffices to prove the bound.
Assume inductively that at the current node of our empirical path the supply is no greater than the supply given to that node in the true tree. This is clearly satisfied in the base case, when the current node is the root and the supply in both trees is $\delta$.
Under the assumption that the response $\hat{P}_\chi$ falls in the interval $[P_\chi - \tau,\, P_\chi + \tau]$, it must be the case that the interval $[\hat{P}_\chi - 2\tau,\, \hat{P}_\chi + 2\tau]$ contains the interval $[P_\chi - \tau,\, P_\chi + \tau]$. Consequently, every child unpruned in the true tree is also unpruned in the empirical path, so the empirical choice set is at least as large as the true one. Therefore, the supply given to any child in the empirical path is no greater than the supply given in the true tree. ▫
The corresponding relative entropy corollary is:
The bound in theorem (5.3.3) is very similar to (5.2.2), except that the choice complexity is slightly worsened by the additive term $\ln(d+1)$ but improved by replacing $|C_i|$ with the smaller $|C'_i|$.
Most natural statistical query algorithms make each choice based on responses to a set of queries, not just one. For instance, to decide which variable to put at the top of a decision tree, we ask $n$ queries, one for each feature; we then choose the feature whose answer was most “interesting”. This suggests generalizing the query tree model to allow each tree node to contain a set of queries, executed in batch. Requiring each node in the query tree to contain just a single query, as in the above construction, would result in an unfortunately high branching factor just for the purpose of “remembering” the answers received so far.
Extending the algorithmic construction to allow for batch queries is easily done. If a node has queries $\chi_1, \ldots, \chi_q$, we choose the query confidence $\gamma$ as before, but we now split the mass evenly, giving $\frac{\gamma}{q}$ to each query. We then let $k$ be the number of children corresponding to answers to the queries in the ranges $[\hat{P}_{\chi_j} - 2\tau(\gamma/q),\, \hat{P}_{\chi_j} + 2\tau(\gamma/q)]$ respectively. We then go to the child corresponding to the answers we actually received, and as before give the child a probability supply of $\frac{\delta_v - \gamma}{k}$. Theorem (5.3.3) holds exactly as before; the only change is that $|C'_i|$ now means the size of the $i$-th choice set in the batch tree rather than the size in the single-query-per-node tree.
When growing a decision tree, it is natural to make a batch of queries and then make a decision about which feature to place in a node. The process is then repeated to grow the full tree structure. As in the decision tree example described in the simple microchoice section, if we have $n$ features and are considering adding a node at depth $j$, there are $n - j$ possible features that could be chosen for placement in a particular node. The decision of which feature to use is made by comparing the results of $n - j$ queries to pick the best feature according to some criterion, such as information gain. We can choose $\gamma$, then further divide $\gamma$ into confidences of size $\frac{\gamma}{n-j}$, placing each divided confidence on one of the statistical queries. We now may be able to eliminate some of the choices from consideration, allowing the remaining confidence, $\delta_v - \gamma$, to be apportioned evenly amongst the remaining choices. Depending on the underlying distribution, this could substantially reduce the size of the choice set. The best case occurs when one feature partitions all examples reaching the node perfectly and all other features are independent of the target. In this case the choice set will have size $1$ if there are enough examples.
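The elimination step can be sketched as follows, under our running Hoeffding-width assumption: each feature's empirical gain carries a double confidence interval of half-width $2\tau(\gamma/q)$, and a feature survives only if its interval overlaps the best feature's interval. The gain values here are hypothetical and the names are ours:

```python
import math

def tau(delta, m):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def surviving_features(gains, gamma, m):
    """Indices of features that could still have the best true gain.

    Each of the len(gains) queries receives confidence gamma / len(gains);
    feature f survives if its double interval [g_f - 2t, g_f + 2t] overlaps
    the best feature's double interval, i.e. g_f >= max(gains) - 4t.
    """
    t = tau(gamma / len(gains), m)
    best = max(gains)
    return [f for f, g in enumerate(gains) if g >= best - 4.0 * t]

# Hypothetical empirical gains for 5 features; feature 2 is clearly best.
gains = [0.01, 0.02, 0.48, 0.015, 0.03]
choice_set = surviving_features(gains, gamma=0.01, m=10000)
```

With enough examples the intervals shrink and only the dominant feature survives, so the choice set collapses to size one, which is exactly the best case described above.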
The adaptive microchoice bound is a significant improvement over the simple microchoice bound when the distribution is such that each choice is clear. For example, consider a learning problem with $n$ Boolean features and $m$ examples. Suppose that one feature is identical to the label, and all of the other features are determined by coin flips independent of the label.
When we apply a decision tree learner to a data set generated from this distribution, what will be the resulting bound? Given enough examples, with high probability there will be only one significant choice for the first batch query: the feature identical to the label. The second and third batch queries, corresponding to the children of the root feature, will also have a choice space of size $1$ with very high probability; the “right” choice will be the label value. Each choice set has size $1$, resulting in a complexity of only the $\ln(d+1)$ term due to the allocation of confidence to the statistical queries necessary for learning the decision tree. This is considerably better than the roughly $\ln n$ per choice which the simple version of the microchoice bound provides. Note that the complexity reduction only occurs with a large enough number of examples, implying that the value of the bound calculated can improve faster than (inverse) linearly in the number of examples.
The adaptive microchoice bound is never much looser than the simple microchoice bound because, under the assumption that choice sets have size at least $2$, the penalty for using the adaptive version, $\ln(d+1)$, is always small compared to the complexity term for the simple microchoice bound, $\sum_{i=1}^{d} \ln |C_i| \ge d \ln 2$.
The adaptive microchoice bound provides a simple scheme for dividing confidence between choices and queries. There are alternatives which may be useful in some settings: any scheme which a priori divides the confidence between queries and choices at every node will generally work. Here are two such schemes:
Freund’s approach for self-bounding learning algorithms can require exponentially more computation than the microchoice approach. In its basic form, it requires explicit construction of every path in the state space of the algorithm not pruned from the tree. For some learning algorithms this construction can be done implicitly, making the computation feasible; in general, however, this does not appear to be possible.
The adaptive microchoice bound only requires explicit computation of the size of each subset from which a choice is made. Because many common learning algorithms work by a process of making choices from small subsets, this is often computationally easy. The adaptive microchoice bound does poorly, however, when Freund’s query tree has a high degree of sharing; for example, when many nodes of the tree correspond to the same query, or many leaves of the tree have the same final hypothesis. Allowing batch queries alleviates the most egregious examples of this. It is also possible to interpolate between the adaptive microchoice bound and Freund’s bound by a process of conglomerating the subsets of the microchoice bound.
The mechanism of choice set conglomeration is similar to the batch query technique. It allows you to trade increased computation for a tighter bound. When starting with the simple microchoice bound, this technique can smoothly interpolate with the discrete hypothesis bound (4.2.1). When starting with the adaptive microchoice bound, it can smoothly interpolate with Freund’s bound.
Consider a particular choice set, $C$, with elements $c_1, \ldots, c_k$. Each element $c_i$ indexes another choice set, $C_{c_i}$. If the computational resources exist to calculate the union $C' = \bigcup_{i=1}^{k} C_{c_i}$, then $C'$ can be used in place of the individual choice sets $C_{c_i}$ in the adaptive microchoice bound. The conglomeration can be done repeatedly to build large choice sets, and it also applies to the simple microchoice bound (5.2.2). Conglomeration can be useful for tightening the bound when there are multiple choice sequences leading to the same hypothesis. However, choice set conglomeration is not always helpful, because it trades away the fine granularity of the microchoice bound. The extreme case, where all choice sets are conglomerated into one choice set and every hypothesis and query have the same weight, is equivalent to Freund’s bound.
When the choices of the attached choice sets are all different, conglomeration is of little use, because the size of the union of the choice sets is the sum of the sizes of the individual sets: $|C'| = \sum_{i=1}^{k} |C_{c_i}|$. If the child sets each have the same size $s$, then this simplifies to $|C'| = ks$, which results in the same confidence applied to each choice whether conglomerating or not. The best case for conglomeration is equivalent to the batch query case: every sub-choice set contains the same elements. Then we have $|C'| = s$, and we pay no cost for the choice set $C$.
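The size arithmetic of these two extremes can be sketched directly (the sets here are hypothetical):

```python
def conglomerate(child_sets):
    """Union of the child choice sets, usable in place of the two-level choice."""
    merged = set()
    for s in child_sets:
        merged |= s
    return merged

# Worst case: disjoint children, |union| = k * s, no confidence is saved.
disjoint = [{"a", "b"}, {"c", "d"}, {"e", "f"}]

# Best case: identical children, |union| = s, the factor k is saved entirely.
identical = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
```

In the worst case each hypothesis ends up with confidence $\delta/(ks)$ either way; in the best case conglomeration improves this to $\delta/s$.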