15-859(A) Machine Learning Theory                               03/04/04

Membership Query learning, contd
 * KM algorithm for finding large Fourier coefficients
 * Fourier spectrum of DNF and Decision Trees
=========================================================================

Recap of fourier representation
===============================
Let's recap the fourier representation.  View a function f over {0,1}^n
as a vector f = (f_0, f_1, ...) where f_x = sqrt(Pr(x))*f(x).  So,

   <f,g> = sum_x Pr(x)f(x)g(x) = E_{x in D}[f(x)g(x)].

Boolean functions are unit-length vectors (Boolean now means
f:{0,1}^n -> {-1,+1}).

If we fix D to be the uniform distribution, then the parity functions
are orthogonal and form a basis.  Let's index the parity functions by
their indicator vectors: phi_v(x) = v.x mod 2, scaled to be {-1,+1}.
We can write any function f as a sum of parity functions:

   f(x) = sum_v hat(f)_v phi_v(x),
     where hat(f)_v = <f,phi_v> = E_x[f(x)phi_v(x)]

and the expectation is over the uniform distribution.  Since this is
just a change of basis, we can write

   <f,g> = sum_v hat(f)_v * hat(g)_v.

Suppose we are able to find most of the fourier coefficients of f, e.g.,
all but epsilon of the sum of the (hat(f)_v)^2.  Then the resulting
function g may not be boolean, but it is close (as a vector) to f: in
particular, |f-g|^2 <= epsilon.  Then sign(g) has to be a good
approximation to f, since by definition E[(f(x)-g(x))^2] <= epsilon, and
every x for which sign(g(x)) is wrong contributes at least 1 to the
quantity inside the expectation.  So, Pr[f(x) != sign(g(x))] <= epsilon.

We now use the algorithm by Kushilevitz and Mansour to find all v such
that (hat(f)_v)^2 > tau, in time poly(n, 1/tau).  This is good for
functions that have most of their length in large fourier coefficients.

KM ALGORITHM
============
-> See notes from last time.

To convert this guarantee into theorems about functions we're interested
in (like DNF formulas, decision trees, etc.), we need to be able to
analyze what their fourier coefficients look like.

On the Fourier Spectrum of DNF and Decision Trees
=================================================
Let's start by analyzing the fourier spectrum of a single term
(conjunction) T.  In fact, let's define the function T(x) as: T(x) = 1
if x satisfies term T, and T(x) = 0 otherwise.  This is a little weird
since it is not a boolean function according to our {-1,+1} definition,
but it is still a legal function and will be very helpful, since we can
start adding such functions up.  E.g., since a decision tree can be
viewed as an OR of the terms going to its positive leaves, and since any
example satisfies at most one of them, we can write

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1,

where the T_i are all the terms corresponding to positive leaves.

What does the fourier decomposition of T look like?

   hat(T)_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]
            = 0 if phi_v has a relevant variable outside T (by the usual
              parity argument)
            = (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the value
              of phi_v on an example satisfying T.

So it's very simple.  The sum of the absolute values of the coefficients
is 1.  The sum of squares is 1/2^|T| (which we knew already, since the
sum of squares = <T,T> = E_x[T(x)^2] = Pr_x[T(x)=1]).

Analyzing Decision Trees
------------------------
Suppose f is a decision tree.  Using the terms corresponding to its
positive leaves, we can write f as

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1.

So, to get the fourier coefficients hat(f)_v, we just add up the
corresponding coefficients of the functions on the right-hand side.
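As a quick sanity check on the formula for hat(T)_v before continuing
the decision-tree analysis, here is a small Python sketch (not part of
the original notes) that brute-forces the fourier coefficients of a term
over the uniform distribution on {0,1}^n and compares them to the closed
form above.  The particular term, n = 4, and all identifiers are
arbitrary choices made for illustration.

   from itertools import product

   n = 4
   # the term T, as a dict mapping variable index -> required bit value
   T = {0: 1, 2: 0}        # |T| = 2; this particular term is arbitrary

   def T_val(x):
       # T(x) = 1 if x satisfies the term, 0 otherwise ({0,1}-valued version)
       return 1 if all(x[i] == b for i, b in T.items()) else 0

   def phi(v, x):
       # parity phi_v(x) = v.x mod 2, scaled to {-1,+1}
       return -1 if sum(v[i] * x[i] for i in range(n)) % 2 == 1 else 1

   def fourier_coeff(f, v):
       # hat(f)_v = E_x[f(x)*phi_v(x)] over the uniform distribution on {0,1}^n
       return sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n

   for v in product((0, 1), repeat=n):
       coeff = fourier_coeff(T_val, v)
       if any(v[i] == 1 and i not in T for i in range(n)):
           predicted = 0.0                       # relevant variable outside T
       else:
           x_sat = tuple(T.get(i, 0) for i in range(n))  # some x satisfying T
           predicted = phi(v, x_sat) / 2**len(T)         # (1/2^|T|)*phi_v(T)
       assert abs(coeff - predicted) < 1e-9

   print("all 2^n coefficients of T match the closed form")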
One easy thing to notice is that each function here has a sum of
absolute values of coefficients equal to 1 (even the constant function
"-1", which has coefficient -1 along the empty parity function and 0
everywhere else).  So, really crudely, this immediately means that

   sum_v |hat(f)_v| <= 2m+1,

where m is the number of positive leaves.

Claim: this fact by itself implies that f must have most of its length
in large fourier coefficients.  In particular, any vector f with L_2
length 1 (i.e., <f,f> = 1) has the property that the sum of squares of
the coefficients of magnitude less than epsilon/L_1(f) is at most
epsilon, where L_1(f) is the L_1 length of f (the sum of absolute values
of its coefficients).

Proof: a coefficient of magnitude alpha contributes alpha^2 to the sum
of squares and alpha to the sum of absolute values.  So, each small
coefficient (of magnitude less than epsilon/L_1(f)) contributes at least
an L_1(f)/epsilon factor *more* to the latter than to the former.  Since
in total the small coefficients can contribute at *most* L_1(f) to the
sum of absolute values, they can contribute at most epsilon to the sum
of squares.  So, that's it.

Analyzing DNF
=============
For DNF, we can't say that most of the function's length is in the large
fourier coefficients.  The problem with the above reasoning is that the
terms aren't disjoint, so we can't just add them up.  But we *can* say
that a DNF formula must have at least one reasonably-sized fourier
coefficient.

Claim: any m-term DNF f has a fourier coefficient of magnitude at least
1/(4m).

Proof: We can assume f has a term T of size at most lg(4m): otherwise
each term is satisfied with probability less than 1/(4m), so by a union
bound the probability that a random example is positive for f is at most
1/4, and the empty "all negative" parity function has good correlation
with f (its coefficient hat(f)_0 = E[f(x)] is at most -1/2).

Now, here's the point: f has a noticeable correlation with T, because
when T(x)=1 we have f(x)=1 too, and when T(x)=0 it doesn't matter.
Formally,

   <f,T> = E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).

Now, we use what we know about T to argue that f has a reasonably large
fourier coefficient too.  Formally,

   <f,T> = sum_v hat(f)_v * hat(T)_v.

Using what we know about hat(T)_v, we get:

   <f,T> = sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)*phi_v(T),

and we know this equals 1/2^|T|.  So, we get:

   sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.

Since there are only 2^|T| <= 4m parities v whose relevant variables all
lie inside T, and each |phi_v(T)| = 1, at least one of these hat(f)_v
must have magnitude at least 1/(4m).

=========
This means we can weak-learn DNF over the uniform distribution (with
membership queries).  Jackson then combines this with a carefully
controlled form of boosting to get strong learning.
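As a small illustration of the claim (again a sketch, not part of the
original notes), the following Python program brute-forces every fourier
coefficient of a small m-term DNF over the uniform distribution and
checks that the largest one has magnitude at least 1/(4m).  The
particular 3-term formula and n = 5 are arbitrary choices.

   from itertools import product

   n = 5
   # each term is a dict mapping variable index -> required bit value;
   # this particular formula is an arbitrary example
   terms = [{0: 1, 1: 1}, {2: 0, 3: 1, 4: 1}, {1: 0, 4: 0}]
   m = len(terms)

   def f(x):
       # the DNF as a {-1,+1}-valued function: +1 iff some term is satisfied
       sat = any(all(x[i] == b for i, b in t.items()) for t in terms)
       return 1 if sat else -1

   def phi(v, x):
       # parity phi_v(x) = v.x mod 2, scaled to {-1,+1}
       return -1 if sum(v[i] * x[i] for i in range(n)) % 2 == 1 else 1

   # brute-force every fourier coefficient over the uniform distribution
   coeffs = {v: sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
             for v in product((0, 1), repeat=n)}

   best_v, best = max(coeffs.items(), key=lambda kv: abs(kv[1]))
   print("largest coefficient is at v =", best_v, "with value", best)
   print("claimed lower bound 1/(4m) =", 1 / (4 * m))
   assert abs(best) >= 1 / (4 * m)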