15-859(B) Machine Learning Theory 04/03/06
Membership Query learning, cont'd
* Recap of where we are now
* Fourier spectrum of Decision Trees and DNF
* Bshouty's algorithm for learning decision trees and XORs of terms
=======================================================================
Recap of where we are now:
- Model is we have an unknown function f from {0,1}^n -> {-1,1},
which we can ping with inputs x and get out f(x). Our goal is to come
up with an approximation g which agrees with f on most of the domain.
- Fourier idea: if we think of a function as the vector of its values,
scaled by sqrt(1/2^n), then <f,g> = Pr(f(x)=g(x))-Pr(f(x)!=g(x)) =
correlation of f and g. There are 2^n parity functions phi_v:
x_1, x_2, x_1+x_2,... that are all pairwise uncorrelated, so they
form a basis. This means we can write any function f as: f(x) =
sum_v hat{f}_v phi_v(x), where hat{f}_v = <f,phi_v>.
- KM algorithm: find all large (magnitude > 1/poly) fourier coefficients
over parity basis. Alg idea: estimate sum of squares of all
fourier coefficients corresponding to a given prefix. Trick of
using E[f(yx)f(zx)]. Grow a tree, expand nodes with value > tau.
- If most of "length" of f is in large fourier coefficients, then
get a good approx wrt uniform distribution. If get at least
some non-negligible coefficient, then have weak-learning.
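As a small sanity check of the E[f(yx)f(zx)] trick, here is a sketch in Python (with a made-up example target f; "km" below just names the KM-style estimate). For a prefix a of length k, the sum of squared coefficients hat{f}_{as} over all suffixes s equals E_{x,y,z}[f(yx)f(zx)phi_a(y)phi_a(z)]; on a tiny n we can compute both sides exactly instead of sampling:

```python
from itertools import product

n, k = 4, 2                          # n variables, prefix length k
cube = lambda m: list(product((0, 1), repeat=m))

def f(x):                            # example target: the parity x1 XOR x3, as +/-1
    return 1 - 2 * ((x[0] ^ x[2]) & 1)

def phi(v, x):                       # parity basis function (-1)^(v.x)
    return 1 - 2 * (sum(a & b for a, b in zip(v, x)) % 2)

a = (1, 0)                           # prefix: restrict to v whose first k bits are 1,0
# Direct: sum of hat{f}_v^2 over all v with prefix a
direct = 0.0
for s in cube(n - k):
    v = a + s
    coef = sum(f(x) * phi(v, x) for x in cube(n)) / 2**n
    direct += coef ** 2

# KM-style estimate (computed exactly here rather than by sampling):
# sum_s hat{f}_{as}^2 = E_{x,y,z}[ f(yx) f(zx) phi_a(y) phi_a(z) ]
km = sum(f(y + x) * f(z + x) * phi(a, y) * phi(a, z)
         for x in cube(n - k) for y in cube(k) for z in cube(k))
km /= 2 ** (n - k) * 4 ** k
print(direct, km)
```

Since this f is exactly the parity phi_{(1,0,1,0)}, all of its Fourier weight sits under the prefix (1,0), and both quantities come out to 1.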
On the Fourier Spectrum of DNF and Decision Trees
=================================================
Consider a single conjunction (not necessarily monotone). Let T(x) =
1 if x satisfies the conjunction and T(x)=0 otherwise. So, it's not a
boolean fn in our sense. In fact <T,T> = E[T(x)T(x)] = 1/2^|T|.
If we analyze the fourier coefficients (which we did last time) we
get:
hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]
= 0 if phi_v has a relevant variable outside T (by
usual parity argument)
= (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the
value of phi_v on an example satisfying T.
So the spectrum is very simple. The sum of absolute values of the
coefficients is 1, and the sum of squares is 1/2^|T| (which we knew already).
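These two facts can be verified by brute force. A sketch in Python, using a hypothetical term T = x1 AND not(x2) over n=3 variables (so |T|=2):

```python
from itertools import product

n = 3
def T(x):                              # 0/1-valued conjunction x1 AND not(x2), |T| = 2
    return 1 if x[0] == 1 and x[1] == 0 else 0

def phi(v, x):                         # parity basis function (-1)^(v.x)
    return 1 - 2 * (sum(a & b for a, b in zip(v, x)) % 2)

cube = list(product((0, 1), repeat=n))
coeffs = {v: sum(T(x) * phi(v, x) for x in cube) / 2**n
          for v in product((0, 1), repeat=n)}

abs_sum = sum(abs(c) for c in coeffs.values())   # sum of |hat{T}_v|: should be 1
sq_sum  = sum(c * c for c in coeffs.values())    # sum of squares: should be 1/2^|T| = 1/4
print(abs_sum, sq_sum)
```

As predicted, the only nonzero coefficients are the 2^|T| = 4 parities over {x1,x2}, each of magnitude 1/2^|T| = 1/4.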
Analyzing Decision Trees
========================
Suppose f is a decision tree. Using the terms corresponding to
positive leaves, we can write f as
f(x) = [T_1(x) + ... + T_m(x)]*2 - 1
So, to get the fourier coefficients hat(f)_v, we just add up for the
corresponding functions. One easy thing to notice is that each
function here has a sum of absolute values of coefficients equal to 1
(even the function "-1" which has coefficient 1 along the empty parity
function and 0 everywhere else). So, really crudely, this immediately
means that
sum_v |hat(f)_v| <= 2m+1
where m is the number of positive leaves.
Claim: this fact by itself implies that f must have most of its length
in large fourier coefficients. In particular, any vector f with L_2
length 1 (i.e., <f,f>=1) has the property that the sum of squares of
the coefficients less than epsilon/L_1(f) is at most epsilon, where
L_1(f) is the L_1 length of f (the sum of absolute-values of
coefficients).
Proof: a coefficient of magnitude alpha contributes alpha^2 to the sum
of squares, and alpha to the sum of absolute values. So, the small
coefficients contribute at least a L_1(f)/epsilon factor *more* to the
latter than to the former. Since in total they can contribute at
*most* L_1(f) to the sum of absolute values, they can contribute at
most epsilon to the sum of squares. QED
Analyzing DNF
=============
For DNF we can't say that most of the DNF is in the large fourier
coefficients. The problem with the above reasoning is that the terms
aren't disjoint so we can't just add them up. But we *can* say that a
DNF formula must have at least one reasonably-sized fourier
coefficient.
Claim: any m-term DNF f has a fourier coefficient of magnitude at least 1/(4m).
Proof: We can assume f has a term T of size at most lg(4m), else the
probability that a random example is positive for f is at most
1/4, and so the empty "all negative" parity function has good correlation.
Now, here's the point: f has a noticeable correlation with T, because
when T=1, f=1 too, and when T=0 it doesn't matter. Formally, <f,T> =
E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).
Now, we use what we know about T to argue f has reasonable fourier
coefficients too. Formally,
<f,T> = sum_v hat(f)_v * hat(T)_v.
Now, using what we know about hat(T)_v, we get:
<f,T> = sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)phi_v(T)
and we know this equals 1/2^|T|. So, we get:
sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.
Since there are only 2^|T| <= 4m parities v with all relevant variables
inside T, at least one of them has magnitude at least 1/(4m).
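The claim can also be checked by brute force on a small example. A sketch with a hypothetical 2-term DNF (the formula is just an arbitrary choice):

```python
from itertools import product

n, m = 4, 2
def f(x):        # DNF (x1 AND x2) OR (not(x3) AND x4), as a +/-1 function
    return 1 if (x[0] and x[1]) or ((not x[2]) and x[3]) else -1

def phi(v, x):   # parity basis function (-1)^(v.x)
    return 1 - 2 * (sum(a & b for a, b in zip(v, x)) % 2)

cube = list(product((0, 1), repeat=n))
best = max(abs(sum(f(x) * phi(v, x) for x in cube)) / 2**n
           for v in product((0, 1), repeat=n))
print(best, 1 / (4 * m))   # best should be >= 1/(4m) = 0.125
```

For this formula the largest coefficient (0.375, on the parity over a term's variables) easily clears the 1/(4m) = 0.125 bound.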
======================================================
Can we use MQs to learn DTs wrt arbitrary distributions?
To put this into context: It turns out you can show that a bad
distribution can make MQs worthless for learning DNF, if crypto-hard
functions exist. But this doesn't go through for decision trees.
Problem solved by Bshouty: algorithm for learning DTs in PAC+MQ model.
Note: algorithm won't be very practical in the sense that not only
will it use MQs, but it will also be very sensitive to noise. (in
contrast, the Fourier algorithm would still be reasonable if there was
random noise, since this would affect the correlations in a reasonable way).
Bshouty's algorithm
===================
Here is a simpler version of Bshouty's original algorithm, from an
email message that he sent me (not sure if this simpler version is in
print). You'll see some similarities to KM.
In fact, algorithm works for a larger class: "XOR of terms". We'll
assume target can be written as f = T_1 XOR T_2 XOR ... XOR T_m, where
T_i are conjunctions.
Why is this a generalization of DTs? Because the terms corresponding to
the positive leaves of a decision tree are satisfied by disjoint sets of
examples, so their OR is the same as their XOR.
We're going to solve by building a decision-tree-like-thing (a "parity
decision graph"?) of depth n and width O(m). So, by Occam results,
just need to take some initial sample S of size O(nm/epsilon), and then use
our membership queries to find a consistent hypothesis of size O(nm).
Idea: let's just build a decision tree with x_1 at the root, then x_2
at the next level, etc., until we get to > m leaves. This will happen
at some depth d of roughly log(m).
Define T_1', ..., T_m' to be what you get by taking the
original terms T_1,...,T_m and removing all mention of variables
x_1,...,x_d from them.
So, if we look at all leaves of our tree so far, the value of the
target function, given that the example takes you to that leaf, can be
written as an XOR of some subset of the T' terms.
For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4)
and say d=2. Then the leaves look like:
(T1' XOR T3'), T1', (T2' XOR T3'), T2'.
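The bookkeeping for this example can be recomputed mechanically. A small Python sketch (terms encoded as dicts from 0-indexed variable to required value; this encoding is just for illustration): a term T_i survives at a leaf iff the leaf's prefix satisfies T_i's constraints on x_1..x_d, and then contributes T_i' to that leaf's XOR.

```python
from itertools import product

# term = {var_index: required value}; f = T1 XOR T2 XOR T3
T = [{0: 1, 2: 1, 4: 1},        # T1 = x1 x3 x5
     {0: 0, 3: 1, 5: 1},        # T2 = not(x1) x4 x6
     {1: 1, 2: 1, 3: 0}]        # T3 = x2 x3 not(x4)
d = 2

leaves = {}
for prefix in product((0, 1), repeat=d):
    # T_i' survives at this leaf iff the prefix meets T_i's constraints on
    # x_1..x_d; otherwise T_i is killed and drops out of the XOR.
    leaves[prefix] = [i + 1 for i, t in enumerate(T)
                      if all(prefix[v] == b for v, b in t.items() if v < d)]
print(leaves)
```

Reading the leaves in the order (1,1), (1,0), (0,1), (0,0) reproduces the list above: (T1' XOR T3'), T1', (T2' XOR T3'), T2'.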
Now, the claim is that since there are > m leaves, some of these must
be XORs of others. In fact, there must exist some set L of at most m
leaves, such that any other leaf is an XOR of some subset of leaves in
L. (L is a maximal linearly-independent set mod 2.) E.g., above any
leaf is the XOR of the other three leaves.
So the plan is now:
1. (somehow) find such a set L, and write down for other leaves how
they can be written as an XOR of leaves in L.
(oops: the description length of a level could be as high as m^2, so
let's have S be of size nm^2/epsilon to be safe....)
2. keep building the tree down from leaves in L until have more than m
leaves again, and repeat.
3. (somehow) understand what this means in terms of a final hypothesis.
Let's do 1: For each example, vary the first d bits (using MQs) to
make the example reach all active leaves. ==> Get a vector of
outputs. I.e., what would the output have been if first d bits were
set to take you to leaf i.
If we write this out for all the examples, we get a big matrix with 1
row per example, where the set L corresponds to a set of <= m columns
such that every other column is some linear combination of these (mod
2). Find these with Gaussian elimination. (What's a little confusing
here is: say this tells us that leaf 1 is the XOR of leaves 2,3,5. We
don't really know if this is true over ALL of {0,1}^n. We just know
it's true over examples with suffixes matching S. But, that's OK
since our goal is just to find a hypothesis consistent with S.)
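Step 1's linear algebra can be sketched as a mod-2 Gaussian elimination over the leaf columns. A minimal version (function name and encoding are illustrative, not from Bshouty's paper), demonstrated on the 4-leaf example above:

```python
from itertools import product

def independent_columns(rows):
    """rows: one 0/1 list per example; columns are leaf outputs.
    Returns (L, expr): L = indices of a maximal independent column set,
    expr[j] = the columns of L whose XOR equals column j (over these rows)."""
    ncols = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(ncols)]
    basis = []                     # (pivot row, reduced vector, set of L-columns)
    L, expr = [], {}
    for j in range(ncols):
        vec, used = list(cols[j]), set()
        for piv, b, e in basis:    # reduce column j against the basis
            if vec[piv]:
                vec = [a ^ d for a, d in zip(vec, b)]
                used ^= e          # track which original L-columns got XORed in
        if any(vec):               # independent: column j joins L
            basis.append((vec.index(1), vec, used ^ {j}))
            L.append(j)
        else:                      # dependent: column j = XOR of columns in `used`
            expr[j] = sorted(used)
    return L, expr

# The example from above: columns (T1'^T3', T1', T2'^T3', T2'),
# with one row per setting of (T1', T2', T3').
rows = [[t1 ^ t3, t1, t2 ^ t3, t2]
        for t1, t2, t3 in product((0, 1), repeat=3)]
L, expr = independent_columns(rows)
print(L, expr)
```

On this input it reports that the first three columns are independent and the fourth is their XOR, matching the observation that any leaf is the XOR of the other three. As the parenthetical above notes, such a relation is only guaranteed over the rows (examples) given, which is all the algorithm needs.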
To do 3: Follow example down the tree. When you get to an "inactive"
node, then replace position by the *set* of positions listed in the node.
So, our "current location" in the tree is now a set of nodes. More
generally, if our current set of nodes is N and say node y in N is
inactive, then look at set (N - {y}) XOR (positions listed at y).
Finally, as we go down the tree, the T' terms will become constant 1
or 0, and we just take the XOR.
Why is this legal? We can see this by arguing bottom-up: Say (as
above) that L is the set of active nodes at the current level, and
inductively assume that the trees rooted at nodes of L are correct on all
examples produced by taking some x in S and replacing its prefix by
whatever is needed to reach that node. Then (by construction) the
inactive nodes (the ones not in L) are also correct on examples
produced by taking some x in S and replacing its prefix with whatever
was needed to reach them, too. Therefore, all *parents* of this level
are correct on examples produced by taking some x in S and replacing
prefixes. And so on.
========================================================
Summary of learnability results
===============================
PAC+MQ: (arbitrary distribution, with Memb queries)
- monotone DNF
- decision trees
- XORs of terms
- DFAs (target is a DFA of n states. Inputs are strings. Will do
this next class)
MQ+uniform (memb queries. learning wrt uniform distribution)
- general DNF formulas
No queries, but allow n^polylog(n) time:
- decision trees
- DNF over uniform distribution (AC^0 too) [didn't do this in class]
No queries, polynomial time:
- decision lists, LTFs, k-CNF, k-DNF.
========================================================