15-859(B) Machine Learning Theory                              03/24/08

Membership Query learning, contd
 * Fourier spectrum of Decision Trees and DNF
 * Bshouty's algorithm for learning decision trees and XORs of terms
=======================================================================

From last time: We are looking at learning a function f: {0,1}^n -> {-1,1},
where our goal is to come up with an approximation g that agrees with f on
most of the domain (i.e., we want to do well with respect to the uniform
distribution).  In the Fourier view, think of a function as a vector in a
2^n-dimensional space.  In this view, any boolean function corresponds to a
vector of L_2 length equal to 1.

Last time we saw that an interesting quantity to look at is the L_1 length
of the target under the parity basis.  (Note: the L_2 length doesn't depend
on the basis, but the L_1 length *does* depend on the basis.)  As a quick
sanity check: what are the smallest and largest possible L_1 lengths for a
unit L_2-length vector?  [Smallest = 1.  Largest = sqrt(2^n).]

We proved:

Thm: any boolean function f must have most of its L_2 length in Fourier
coefficients of magnitude at least epsilon/L_1(f):

   sum_{v: |hat{f}_v| < epsilon/L_1(f)} (hat{f}_v)^2 < epsilon.

KM algorithm: if you have the ability to make membership queries, you can
find all large (magnitude > 1/poly) Fourier coefficients over the parity
basis.  Algorithm idea: estimate the sum of squares of all Fourier
coefficients corresponding to a given prefix, using the trick of estimating
E[f(yx)f(zx)].  Grow a tree, expanding nodes with value > tau.  Combining
this with the structural statement above, we can learn any function in
time/queries polynomial in L_1(f).


On the Fourier Spectrum of DNF and Decision Trees
=================================================

Consider a single conjunction (not necessarily monotone).  Let T(x) = 1 if
x satisfies the conjunction and T(x) = 0 otherwise.  So, it's not a boolean
function in our sense.  In fact, <T,T> = E[T(x)T(x)] = 1/2^|T|.  If we
analyze the Fourier coefficients we get:

   hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1] * Pr[T(x)=1]
            = 0 if phi_v has a relevant variable outside T
                 (by the usual parity argument)
            = (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the value of
                 phi_v on an example satisfying T.

So the spectrum is very simple.  The sum of absolute values of the
coefficients is 1, and the sum of squares is 1/2^|T| (which we knew
already).


Analyzing Decision Trees
========================

Suppose f is a decision tree.  Using the terms corresponding to the positive
leaves, we can write f as

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1.

So, to get the Fourier coefficients hat{f}_v, we just add up the
corresponding coefficients of these functions.  One easy thing to notice is
that each function here has a sum of absolute values of coefficients equal
to 1 (even the function "-1", which has coefficient 1 on the empty parity
function and 0 everywhere else).  So, really crudely, this immediately means
that

   sum_v |hat{f}_v| <= 2m+1,

where m is the number of positive leaves.  So, combining this with KM, we
can learn decision trees with membership queries in time polynomial in the
size of the tree.
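
As a sanity check of the two calculations above, here is a small brute-force
Python sketch (the choice of n, the particular terms, and the encoding of a
conjunction as a {variable index: required bit} dictionary are illustrative
assumptions, not notation from the lecture).  It verifies that a single
conjunction has coefficients of magnitude 0 or 1/2^|T| summing in absolute
value to 1, and that a small 2-positive-leaf tree obeys the 2m+1 bound.

    from itertools import product, combinations

    n = 5
    cube = list(product([0, 1], repeat=n))          # all of {0,1}^n

    def phi(v, x):
        """Parity basis function phi_v(x) = (-1)^(sum of x_i over i in v)."""
        return -1 if sum(x[i] for i in v) % 2 else 1

    def fourier(f):
        """All 2^n Fourier coefficients hat{f}_v = E_x[f(x)*phi_v(x)]."""
        return {v: sum(f(x) * phi(v, x) for x in cube) / 2**n
                for r in range(n + 1) for v in combinations(range(n), r)}

    def term(x, t):
        """Conjunction encoded as {index: required bit}; returns 0/1."""
        return 1 if all(x[i] == b for i, b in t.items()) else 0

    T = {0: 1, 2: 0}                                # T = x_0 AND not(x_2)
    cT = fourier(lambda x: term(x, T))
    assert abs(sum(abs(c) for c in cT.values()) - 1) < 1e-9       # L_1(T) = 1
    assert all(abs(c) < 1e-9 or abs(abs(c) - 1/2**len(T)) < 1e-9  # 0 or 1/2^|T|
               for c in cT.values())

    # decision tree with m=2 positive leaves, written as 2*(T_1 + T_2) - 1
    T1, T2 = {0: 1, 1: 1}, {0: 0, 2: 1}             # disjoint (differ on x_0)
    cf = fourier(lambda x: 2*(term(x, T1) + term(x, T2)) - 1)
    print(sum(abs(c) for c in cf.values()), "<= 2m+1 =", 2*2 + 1)
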

Analyzing DNF
=============

For DNF we can't say that most of the function sits in the large Fourier
coefficients.  The problem with the above reasoning is that the terms aren't
disjoint, so we can't just add them up.  But we *can* say that a DNF formula
must have at least one reasonably-sized Fourier coefficient.

Claim: any m-term DNF f has a Fourier coefficient of magnitude at least
1/(4m).

Proof: We can assume f has a term T of size at most lg(4m), since otherwise
the probability that a random example is positive for f is at most 1/4, and
so the empty "all negative" parity function already has good correlation
with f.  Now, here's the point: f has a noticeable correlation with T,
because when T=1 we have f=1 too, and when T=0 it doesn't matter.  Formally,

   <f,T> = E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).

Now, we use what we know about T to argue that f has a reasonable Fourier
coefficient too.  Formally, <f,T> = sum_v hat{f}_v * hat{T}_v.  Using what
we know about hat{T}_v, we get

   <f,T> = sum_{v with rel vars inside T} hat{f}_v * (1/2^|T|)*phi_v(T),

and we know this equals 1/2^|T|.  So, we get

   sum_{v with rel vars inside T} hat{f}_v * phi_v(T) = 1.

Since this sum has only 2^|T| <= 4m terms, at least one of them must have
magnitude at least 1/(4m).  QED

======================================================

Strong-learning DNF wrt the uniform distribution: the above result gives us
weak-learning.  Jackson (CMU PhD thesis '94) showed how to use a specific
boosting algorithm, one that doesn't change the distribution too much, to
turn this into strong learning.  Key idea: in the Fourier view, we can think
of keeping the distribution fixed but having the target change and become
non-boolean.  If it doesn't change too much, we can still argue about having
a large Fourier coefficient.

======================================================

Can we use MQs to learn DTs wrt arbitrary distributions?  To put this into
context: you can show that a bad distribution can make MQs worthless for
learning DNF, if crypto-hard functions exist.  But that argument doesn't go
through for decision trees.  The problem was solved by Bshouty, who gave an
algorithm for learning DTs in the PAC+MQ model.  Note: unlike the
Fourier-based algorithms, this one will be very sensitive to noise.

Bshouty's algorithm
===================

Here is a simpler version of Bshouty's original algorithm, from an email
message that he sent me (I'm not sure if this simpler version is in print).
You'll see some similarities to KM.  In fact, the algorithm works for a
larger class: "XORs of terms".  We'll assume the target can be written as
f = T_1 XOR T_2 XOR ... XOR T_m, where the T_i are conjunctions.  Why is
this a generalization of DTs?

We're going to solve this by building a decision-tree-like thing (a "parity
decision graph") of depth n and width O(m).  So, by Occam results, we just
need to take some initial sample S of size O(nm/epsilon), and then use our
membership queries to find a consistent hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at the
next level, etc., until we get to > m leaves.  This will happen at some
depth d = log(m).  Define T_1', ..., T_m' to be what you get by taking the
original terms T_1,...,T_m and removing all mention of the variables
x_1,...,x_d from them.  Then, for any leaf of the tree so far, the value of
the target function, given that the example takes you to that leaf, can be
written as an XOR of some subset of the T' terms.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4), and
say d=2.  Then the leaves look like: (T1' XOR T3'), T1', (T2' XOR T3'), T2'.

Now, the claim is that since there are > m leaves, some of these must be
XORs of others.  In fact, there must exist some set L of at most m leaves
such that any other leaf is an XOR of some subset of the leaves in L.  (L is
a maximal linearly-independent set mod 2.)  E.g., above, any leaf is the XOR
of the other three leaves.
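
Here is a tiny brute-force Python check of this claim on the example above
(the dictionary encoding of terms and the helper names are illustrative, not
from the lecture): on every setting of x_3,...,x_6, the four depth-2 leaf
functions XOR to 0, so any one of them is the XOR of the other three.

    from itertools import product

    def term(x, lits):
        """Conjunction given as {variable index: required bit}; returns 0/1."""
        return int(all(x[i] == b for i, b in lits.items()))

    TERMS = [{1: 1, 3: 1, 5: 1},       # T1 = x_1 x_3 x_5
             {1: 0, 4: 1, 6: 1},       # T2 = not(x_1) x_4 x_6
             {2: 1, 3: 1, 4: 0}]       # T3 = x_2 x_3 not(x_4)

    def f(x):                          # f = T1 XOR T2 XOR T3
        return term(x, TERMS[0]) ^ term(x, TERMS[1]) ^ term(x, TERMS[2])

    def leaf(b1, b2, suffix):
        """Value of f at the depth-2 leaf x_1=b1, x_2=b2, on suffix x_3..x_6."""
        x = {1: b1, 2: b2}
        x.update({i + 3: bit for i, bit in enumerate(suffix)})
        return f(x)

    # the four leaf functions XOR to 0 on every suffix, so each leaf is
    # the XOR of the other three (only 3 of the 4 are independent mod 2)
    for s in product([0, 1], repeat=4):
        assert (leaf(1,1,s) ^ leaf(1,0,s) ^ leaf(0,1,s) ^ leaf(0,0,s)) == 0
    print("any leaf = XOR of the other three")
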
So the plan is now:

1. (somehow) find such a set L, and write down, for each other leaf, how it
   can be written as an XOR of leaves in L.  (Oops: the description length
   of a level could be as high as m^2, so let's have S be of size
   nm^2/epsilon.)

2. Keep building the tree down from the leaves in L until we have more than
   m leaves again, and repeat.

3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: For each example in S, vary the first d bits (using MQs) to make
the example reach all the active leaves.  ==> We get a vector of at most 2m
outputs, i.e., what the output would have been if the first d bits were set
to take you to leaf i.  If we write these vectors out for all the examples,
we get a matrix with one row per example, where the set L corresponds to a
set of <= m columns such that every other column is some linear combination
of these (mod 2).  Find these with Gaussian elimination (a small sketch of
this mod-2 elimination appears at the end of these notes).

A subtle issue here: say this tells us that leaf 1 is the XOR of leaves
2, 3, 5.  We don't really know if this is true over ALL of {0,1}^n.  We just
know it's true for strings whose suffixes match examples in S and whose
prefixes match the paths down to the current nodes.  But that's OK, since
our goal is just to find a hypothesis consistent with S, and we will
maintain inductively (see step 3 below) that in following the decision graph
to classify an example in S, we are always looking at strings with that
property.

To do 3: Follow the example down the tree.  When you get to an "inactive"
node, replace your position by the *set* of positions listed at that node,
and conceptually assign the prefixes accordingly.  So, our "current
location" in the tree is now a set of nodes.  More generally, if our current
set of nodes is N and some node y in N is inactive, then move to the set
(N - {y}) XOR (positions listed at y).  Finally, as we go down the tree, the
T' terms become constant 1 or 0, and we just take the XOR.

========================================================

Summary of learnability results
===============================

PAC+MQ (arbitrary distribution, with membership queries):
 - monotone DNF
 - decision trees
 - XORs of terms
 - DFAs (target is a DFA of n states; inputs are strings.  Will do this
   next class.)

MQ + uniform (membership queries, learning wrt the uniform distribution):
 - general DNF formulas

No queries, but allow n^polylog(n) time:
 - decision trees
 - DNF over the uniform distribution; AC^0 too [didn't do this in class]

No queries, polynomial time:
 - decision lists, LTFs, k-CNF, k-DNF

========================================================
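
Appendix: as referenced in step 1 of Bshouty's algorithm above, here is a
minimal Python sketch of the mod-2 Gaussian elimination (my own
illustration, not from Bshouty's write-up; the function name and the
bitmask encoding of columns as Python ints are assumptions).  Given the 0/1
matrix whose columns record each leaf's outputs on the sample, it greedily
picks a maximal linearly-independent set of columns L and records, for every
other column, which columns of L XOR to it.

    def xor_basis(columns):
        """Greedy Gaussian elimination over GF(2).

        `columns` are 0/1 column vectors packed into ints (bit i = row i).
        Returns (basis_idx, expansions): basis_idx lists a maximal
        linearly-independent set of column indices; expansions[j], for every
        other column j, is the set of basis indices whose columns XOR to
        column j."""
        basis = []            # (pivot bit, reduced vector, original-column combo)
        basis_idx, expansions = [], {}
        for j, col in enumerate(columns):
            v, combo = col, set()
            for pivot, vec, src in basis:
                if v & pivot:             # eliminate this entry's pivot bit
                    v ^= vec
                    combo ^= src          # track which original columns we used
            if v:                         # independent: becomes a new basis entry
                basis.append((v & -v, v, combo ^ {j}))
                basis_idx.append(j)
            else:                         # dependent: col_j = XOR of cols in combo
                expansions[j] = combo
        return basis_idx, expansions

    # toy example: column 2 equals column 0 XOR column 1
    cols = [0b0110, 0b0100, 0b0010, 0b1000]
    print(xor_basis(cols))                # -> ([0, 1, 3], {2: {0, 1}})
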