Notes for class 03/09/98 * Using membership queries to: - find large fourier coefficients - learn XOR of terms, which is a generalization of Decision Trees ============================================================================= So far, we've been talking pretty much talking only about learning from passive observations. Today: start on algorithms for learning when algorithm can be more active. In particular, when algorithm can request classification of examples of its own choosing. These are called Membership Queries. Today, give two interesting algorithms. The first one (by Kushilevitz and Mansour, and also by Goldreich and Levin) is for finding all large (1/poly) fourier coefficients of some boolean function, over parity basis. Reminder: if A is some set of variables, then the Ath fourier coefficient of f (over parity basis) is defined as the correlation of f with the parity function phi_A, over the uniform distribution. If f is a boolean function, then the sum of the coeffs squared equals 1. Last time we showed: - any DNF has at least one non-negligable fourier coefficient. This means we can use the algorithm to WEAK-LEARN any DNF with respect to the uniform distribution (using membership queries). Jeff Jackson showed how could combine with boosting to STRONG-LEARN with respect to the uniform distribution - if can find "most" of f (fourier coeffs whose sum of squares is almost 1) then have a good approx of f, with respect to uniform distribution. This is nice since, for instance, any poly size DECISION TREE can be approximated by its large fourier coefficients. Rough Argument: Fourier coeff is zero for any parity that is not a subset of some path to leaf. (prove by induction. For instance, look at tree with x1 at root and x2 and x3 as children, and the parity function x1+x2+x3). Also, very deep leaves don't affect coeffs much since unlikely a random example will go that far. So, there is a relatively small number of parities that are sharing most of the function. KM algorithm: how to find the large fourier coeffs -------------------------------------------------- Let's say our goal is to find all \hat{f}_A such that (\hat{f}_A)^2 > \tau. Say S is the set of all parity functions, so the sum, over A in S, of the (\hat{f}_A)^2 equals 1. Suppose we could split S into two equal-sized sets S_0 and S_1 and for each one, ask for the sum of squares of the \hat{f}_A in that set. And, suppose we could do this recursively. Then what? Then we could build up a tree, splitting each node whose answer is > tau, and ignoring those < tau. - max width of tree is 1/tau. - Just keep going to leaves at depth n. => Only n/tau questions to ask. So, all boils down to: how can we ask for the sum of squares of the \hat{f}_A's inside these sets? First of all, to be able to do this, we're going to need to split these sets in a particular way. Specifically, we'll do this by fixing a prefix of the variables, and then the set will be "all parities consistent with this prefix". For instance, S_0 = {A: A doesn't include x_1} S_1 = {A: A does include x_1} Then, will split S_0 into S_00 = {A: A doesn't include x_1 or x_2} S_01 = {A: A doesn't include x_1 but does include x_2} etc. In general, our goal is now to estimate, for some string alpha of 0s and 1s, the sum of the squares of the \hat{f}_A's for all A in S_alpha. (Note: we won't find the sum exactly. We'll find an estimate of the sum. But that's fine. It just means we throw out nodes whose estimate is < tau/2 instead of those whose estimate is < tau) There's now a neat trick for doing this. Let's do case of alpha = 00000 (a string of k zeros). So, we want the sum of squares over all parities that don't include any of first k variables. Remember, f is a {-1, +1}-valued function. The claim is that this sum just happens to equal the expected value, over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx). I.e. pick random suffix x. Pick two random prefixes y and z and multiply f(yx) by f(zx). Repeat many times and take the average. Intuition: independent y and z will cancel out correlations we don't want. E.g., what if f was a parity function? If f agrees with alpha, then f(yx)=f(zx) so E[f(yx)f(zx)] = 1. If f doesn't agree with alpha, then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0. Now, since any f can be written as a sum of parities, hopefully we can reduce the case of general f to the case of "f is a parity function"..... It's actually almost but not quite so easy. Question we need to answer is: for two (possibly different) sets A and B, what is E [ phi_A(yx) * phi_B(zx) ] ? x,y,z Claim: if A contains (at least) one of first k bits then this is 0 if B contains (at least) one of first k bits then this is 0 else if A != B, then this is 0 else (A = B and they don't contain any of first k bits) then this is 1. This is enough. Now we get: E[f(yx)f(zx)] = E[ (sum_A \hat{f}_A \phi_A(yx)) (sum_B \hat{f}_B \phi_B(zx))] = what we want. So, we're done for the case of alpha = "all zeros". What about other alpha (i.e., requiring parity to have a certain fixed dependence on the first k bits)? Since we know the dependence (alpha) just "undo it". Specifically, we instead ask for: E[f(yx)f(zx)\phi_{\alpha}(y)\phi_{\alpha}(z)] where phi_alpha is the parity of the variables set to 1 in alpha. That's it. qed. Ta da. Actually, one technical point: we need to estimate an expectation. We'll do it by sampling. OK since the quantity inside the expectation is bounded (-1 or +1). In Jeff Jackson's result about DNF, what he shows is that instead of modifying the distribution, you can modify f to a non-boolean function. Then, need to argue that (based on how the boosting results work) f doesn't get too large, so you can still estimate f(yx)f(zx) from sampling. ============================================================================ Here's another Membership query algorithm that in some ways is more powerful than the above. Use to learn decision trees over *any* distribution. "PAC + MQ" model: draw from D, and also can ask MQs. In fact, algorithm works for a larger class: "XOR of terms". We'll assume target can be written as f = T_1 XOR T_2 XOR ... XOR T_t, where T_i are conjunctions. Why is this a generalization of DTs? There's a trick to this.... We're going to solve by building a decision-tree-like-thing (a "parity decision graph"?) of depth n and width O(t). So, by Occam results, just need to take some initial sample S of size O(nt/epsilon), and then use our membership queries to find a consistent hypothesis of size O(nt). Idea: let's just build a decision tree with x_1 at the root, then x_2 at the next level, etc., until we get to > t leaves. This will happen at some depth d = log(t). Define \hat{T}_1, ..., \hat{T}_t to be what you get by taking the original terms T_1,...,T_t and removing all mention of variables x_1,...,x_d from them. So, if we look at all leaves of our tree so far, the value of the target function, given that the example takes you to that leaf, can be written as an XOR of some subset of the \hat{T}'s. For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4) and say d=2. Then the leaves look like ((add hats to these)): (T1 XOR T3), T1, (T2 XOR T3), T2. Now, the claim is that since there are > t leaves, some of these must be XORs of others. In fact, there must exist some set L of at most t leaves, such that any other leaf is an XOR of some subset of leaves in L. E.g., above any leaf is the XOR of the other three leaves. So the plan is now: 1. (somehow) find such a set L, and write down for other leaves how they can be written as an XOR of leaves in L. (oops: the description length of a level could be as high as t^2, so let's have S be of size nt^2/epsilon to be safe....) 2. keep building the tree down from leaves in L until have more than t leaves again, and repeat. 3. (somehow) understand what this means in terms of a final hypothesis. Let's do 1: For each example, vary the first d bits (using MQs) to make the example reach all active leaves. ==> Get a vector of outputs. I.e., what would the output have been if first d bits were set to take you to leaf i. If we write this out for all the examples, we get a big matrix where the set L corresponds to a set of <= t columns such that every other column is some linear combination of these (mod 2). Find these with Gaussian elimination. (What's a little confusing here is: say this tells us that leaf 1 is the XOR of leaves 2,3,5. We don't really know if this is true over ALL of {0,1}^n. We just know it's true over examples with suffixes matching S. But, that's OK since our goal is just to find a hypothesis consistent with S.) To do 3: Follow example down the tree. When you get to an "inactive" node, then replace position by the *set* of positions listed in the node. So, our "current location" in the tree is now a set of nodes. More generally, if our current set of nodes is N and say node y in N is inactive, then look at set (N - {y}) XOR (positions listed at y). Why is this legal? We can see this by arguing bottom-up: Say (as above) that L is the set of active nodes at the current level, and inductively assume that the trees rooted at nodes of L are correct on all examples produced by taking some x in S and replacing its prefix by whatever is needed to reach that node. Then (by construction) the inactive nodes (the ones not in L) are also correct on examples produced by taking some x in S and replacing its prefix with whatever was needed to reach them, too. Therefore, all *parents* of this level are correct on examples produced by taking some x in S and replacing prefixes. And so on.