15-859(B) Machine Learning Theory                              04/21/14

(A few loose ends)
 * Compression bounds
 * Bshouty's algorithm for learning decision trees in PAC+MQ.
===============================================================

Compression Bounds
==================

For some learning algorithms, the hypothesis produced can be uniquely
described by a small subset of the training examples, say k of them.
E.g., if you are learning an interval on the line using the simple
algorithm ``take the smallest interval that encloses all the positive
examples,'' then the hypothesis can be reconstructed from just the two
outermost positive examples, so k = 2.  For a conservative Mistake-Bound
learning algorithm, you can reconstruct the hypothesis by just looking at
the examples on which a mistake was made, so k <= M, where M is the
algorithm's mistake bound.  Note that in this case, we would also care
about the {\em order} in which those examples arrived.

Let's say that a learning algorithm compresses a sample S of size m down
to a hypothesis of size k if it can produce a set (or sequence) of k
examples in S such that its hypothesis can be reconstructed from that
sequence.  The question we want to ask is: can we get a sample-complexity
bound (or error bound) as a function of k?

Let's be a little more formal, and then see how we can prove a PAC
guarantee based on k.  First, before seeing the data, the algorithm
specifies a description language: a mapping from sequences of examples to
hypotheses.  So, for a given sequence S' of examples we have a
well-defined hypothesis h_{S'}.

Claim: If |S| >= (1/epsilon)[k ln(|S|) + epsilon*k + ln(1/delta)], then
with probability at least 1 - delta, every h_{S'} with zero error on S
has true error <= epsilon (for S' a sequence of k examples in S).
[[Note: we have a ln(|S|) on the RHS, but it can be replaced with
O(ln(k/(epsilon*delta))).]]

Proof: Fix a sequence s' of k indices from 1..m.  For a given list of m
examples S, define S|s' to be the examples of S in the positions given by
the indices in s'.  Now imagine choosing S by drawing the examples in the
positions s' first.  At this point, h_{S|s'} is fully defined.  If the
true error of h_{S|s'} is greater than epsilon, then the probability that
it makes zero mistakes on the remaining m - k examples is at most
(1 - epsilon)^{m-k}.  Formally, let A be the event ``some h_{S'} with
zero error on S has true error > epsilon'' (whose probability we wish to
bound), and let A_{s'} be the event ``h_{S|s'} has error 0 on
S - {x_i : i in s'} but true error > epsilon'', so that
A = Union_{s'} A_{s'}.  Then Pr(A_{s'}) <= (1 - epsilon)^{m-k}, and
therefore Pr(A) <= m^k * (1 - epsilon)^{m-k}, which is at most delta for
m satisfying the above inequality.

Note that it's important that we are doing a union bound over the
sequences of indices s' and not over the set of possible hypotheses
h_{S'}.  In particular, there are potentially infinitely many possible
hypotheses h_{S'}, so you don't want to union-bound over all possible
k-example sequences S' drawn from D.
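As a quick numerical illustration (not part of the original notes), here
is a small Python sketch that finds the smallest m with
m^k (1 - epsilon)^{m-k} <= delta by direct search and compares it to the
claim's closed-form sufficient condition.  The function names and the
particular values of k, epsilon, delta are just for illustration.

    import math

    def union_bound(m, k, eps):
        # The quantity bounded in the proof: m^k * (1 - epsilon)^(m - k).
        return m**k * (1.0 - eps)**(m - k)

    def smallest_m(k, eps, delta):
        # Smallest m making the union bound drop below delta.  The bound is
        # unimodal in m (increases, then decays geometrically), so a linear
        # scan finds the threshold.
        m = k + 1
        while union_bound(m, k, eps) > delta:
            m += 1
        return m

    k, eps, delta = 2, 0.1, 0.05      # e.g., the interval example has k = 2
    m = smallest_m(k, eps, delta)
    # The claim's (slightly more conservative) sufficient condition, evaluated at m:
    rhs = (1.0 / eps) * (k * math.log(m) + eps * k + math.log(1.0 / delta))
    print("direct threshold:", m, "  claim's formula at this m:", round(rhs, 1))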
================================================
Bshouty's algorithm
===================

And now for something completely different...  When we were talking about
membership-query algorithms, we discussed the KM algorithm for finding
large Fourier coefficients and argued how this allowed us to learn
decision trees over the uniform distribution.  A natural question then
is: can we use MQs to learn decision trees with respect to arbitrary
distributions?  We've seen that without MQs we can learn decision trees
in n^(log s) time by converting to decision lists, but we want polynomial
time.

To put this into context: it turns out you can show that a bad
distribution can make MQs worthless for learning DNF, but this argument
doesn't go through for decision trees.  The problem was solved by
Bshouty, who gave an algorithm for learning DTs in the PAC+MQ model.  In
fact, the algorithm works for a larger class: "XOR of terms".  We'll
assume the target can be written as f = T_1 XOR T_2 XOR ... XOR T_m,
where the T_i are conjunctions.  Why is this a generalization of DTs?
(Because the leaves of a decision tree correspond to disjoint
conjunctions, so the OR of the terms for the positive leaves is the same
as their XOR.)

We're going to solve this by building a decision-tree-like thing (a
"parity decision graph") of depth n and width O(m).  So, by Occam
results, we just need to take some initial sample S of size
O(nm/epsilon), and then use our membership queries to find a consistent
hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at
the next level, etc., until we get more than m leaves.  This will happen
at some depth d ~ log(m).  Define T_1', ..., T_m' to be what you get by
taking the original terms T_1, ..., T_m and removing all mention of the
variables x_1, ..., x_d from them.  So, if we look at any leaf of our
tree so far, the value of the target function, given that the example
takes you to that leaf, can be written as an XOR of some subset of the T'
terms.  For example, say

    f = x_1 x_3 x_5  XOR  not(x_1) x_4 x_6  XOR  x_2 x_3 not(x_4)

and say d = 2.  Then the leaves (ordered by x_1 x_2 = 11, 10, 01, 00)
look like: (T1' XOR T3'), T1', (T2' XOR T3'), T2'.

Now, the claim is that since there are more than m leaves, but each leaf
is (as a function) in the span of T_1', ..., T_m' over GF(2), some of
these leaves must be XORs of others.  In fact, there must exist some set
L of at most m leaves such that any other leaf is an XOR of some subset
of the leaves in L.  (L is a maximal linearly independent set mod 2.)
E.g., above, any leaf is the XOR of the other three leaves.

So the plan is now:
1. (somehow) find such a set L, and write down for each other leaf how it
   can be written as an XOR of leaves in L.  (Oops: the description
   length of a level could be as high as m^2, so let's have S be of size
   nm^2/epsilon.)
2. Keep building the tree down from the leaves in L until we have more
   than m leaves again, and repeat.
3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: For each example, vary its first d bits (using MQs) so that
the example reaches each of the active leaves in turn.  ==> We get a
vector with one entry per active leaf, i.e., what the output would have
been if the first d bits were set to take you to leaf i.  If we write
these vectors out for all the examples, we get a matrix with one row per
example, and the set L corresponds to a set of at most m columns such
that every other column is a linear combination of these (mod 2).  Find
these with Gaussian elimination (see the sketch below).

A subtle issue here: say this tells us that leaf 1 is the XOR of leaves
2, 3, 5.  We don't really know whether this is true over ALL of {0,1}^n;
we just know it's true for strings whose suffixes match examples in S and
whose prefixes match the paths down to the current nodes.  But that's OK,
since our goal is just to find a hypothesis consistent with S, and we
will maintain inductively (see step 3 below) that in following the
decision graph to classify an example in S we are always looking at
strings with that property.
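To make step 1 concrete, here is a minimal Python sketch (ours, not
Bshouty's code) of the mod-2 Gaussian elimination: given the 0/1 matrix
of MQ answers, one row per example and one column per active leaf, it
returns a maximal linearly independent set of columns L and, for every
other column, the subset of L whose XOR agrees with it on all rows.  The
function name and matrix encoding are assumptions for illustration.

    import numpy as np

    def find_basis_leaves(M):
        # M: 0/1 matrix with one row per example in S and one column per
        # active leaf; entry (i, j) is the MQ answer when example i's first
        # d bits are rewritten so that it reaches leaf j.
        # Returns (L, expr): L is a list of column indices forming a maximal
        # linearly independent set mod 2, and expr[j] (for each j not in L)
        # is the subset of L whose XOR equals column j on every row of M.
        M = np.asarray(M, dtype=np.uint8) % 2
        pivots = []            # (pivot row, reduced column, subset of L it equals)
        L, expr = [], {}
        for j in range(M.shape[1]):
            residual = M[:, j].copy()
            coeffs = set()     # subset of L accumulated during reduction
            for r, vec, c in pivots:
                if residual[r]:
                    residual ^= vec
                    coeffs ^= c
            nz = np.flatnonzero(residual)
            if nz.size == 0:   # column j is dependent: record its expression
                expr[j] = coeffs
            else:              # column j is new: add it to the basis L
                L.append(j)
                pivots.append((int(nz[0]), residual, coeffs ^ {j}))
        return L, expr

    # Toy matrix in the spirit of the example above: the last column is the
    # XOR of the other three, so L should be [0, 1, 2] and expr[3] = {0, 1, 2}.
    M = [[1, 1, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 1]]
    print(find_basis_leaves(M))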
To do 3: Follow the example down the tree.  When you get to an "inactive"
node, replace that position by the *set* of positions listed at the node,
and conceptually assign the prefixes accordingly.  So our "current
location" in the tree is now a set of nodes.  More generally, if our
current set of nodes is V and some node y in V is inactive, then move to
the set (V - {y}) XOR (the positions listed at y); this is a symmetric
difference, since any node appearing twice contributes its value twice
and so cancels mod 2.

Finally, as we go down the tree, the T' terms become constant 1 or 0, and
we just take the XOR of these constants over our current set of nodes.
Pretty clever.
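To tie this back to the running example, here is a short Python sketch
(ours, not from the notes) that restricts each term of
f = x_1 x_3 x_5 XOR not(x_1) x_4 x_6 XOR x_2 x_3 not(x_4) by each depth-2
prefix and checks by brute force over x_3..x_6 that every leaf computes
exactly the stated XOR of T' terms.  The term encoding is an assumption
for illustration.

    from itertools import product

    # The running example; each term is stored as
    # (set of positive variables, set of negated variables).
    TERMS = [({1, 3, 5}, set()),     # T1 = x_1 x_3 x_5
             ({4, 6}, {1}),          # T2 = not(x_1) x_4 x_6
             ({2, 3}, {4})]          # T3 = x_2 x_3 not(x_4)

    def eval_term(pos, neg, x):
        # Evaluate a conjunction on the assignment x (dict: variable -> 0/1).
        return int(all(x[v] for v in pos) and not any(x[v] for v in neg))

    def f(x):
        # The target: XOR of the terms.
        val = 0
        for pos, neg in TERMS:
            val ^= eval_term(pos, neg, x)
        return val

    def restrict(pos, neg, prefix):
        # Remove x_1..x_d from the term given their values in the prefix;
        # return None if the prefix already falsifies the term (it then
        # contributes 0 at this leaf).
        for v, b in prefix.items():
            if (v in pos and b == 0) or (v in neg and b == 1):
                return None
        return (pos - set(prefix), neg - set(prefix))

    # For each depth-2 leaf (a setting of x_1, x_2), check over all settings
    # of x_3..x_6 that f equals the XOR of the surviving restricted terms.
    for x1, x2 in product([1, 0], repeat=2):
        prefix = {1: x1, 2: x2}
        survivors = [restrict(pos, neg, prefix) for pos, neg in TERMS]
        for rest in product([0, 1], repeat=4):
            x = {**prefix, **dict(zip([3, 4, 5, 6], rest))}
            leaf = 0
            for t in survivors:
                if t is not None:
                    leaf ^= eval_term(t[0], t[1], x)
            assert leaf == f(x)
        names = ["T%d'" % (i + 1) for i, t in enumerate(survivors) if t is not None]
        print("leaf x_1=%d x_2=%d:" % (x1, x2), " XOR ".join(names))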