15-859(B) Machine Learning Theory 03/19/14
More Fourier:
* Learning via Membership queries II
- Connecting KM algorithm to L_1 length
- Fourier spectrum of Decision Trees and DNF
* Classes of high SQ dimension have no large-margin kernels.
=======================================================================
Fourier analysis and Query learning
===================================
Looking at learning function f: {0,1}^n -> {-1,1}, where our goal is
to come up with an approximation g which agrees with f on most of the
domain (i.e., want to do well with respect to the uniform distribution).
Considering basis of parity functions. I.e., \hat{f}_v = <f,phi_v> =
E[f(x)phi_v(x)] is the correlation of f with the parity function
having indicator vector v, with respect to the uniform distribution.
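For small n, these coefficients can be computed by brute force over the whole domain. A minimal sketch (the helper names phi and fourier_coeff are my own, not from lecture):

```python
from itertools import product

def phi(v, x):
    # Parity function with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

def fourier_coeff(f, v, n):
    # \hat{f}_v = E[f(x) phi_v(x)], averaged over all 2^n inputs (uniform)
    return sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n

# Example: if f is itself a parity function, its coefficient on that v is 1
# and its coefficient on every other parity is 0.
n = 4
v = (1, 0, 1, 0)
f = lambda x: phi(v, x)
print(fourier_coeff(f, v, n))             # 1.0
print(fourier_coeff(f, (1, 1, 0, 0), n))  # 0.0
```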
Last time, we proved:
THM (KM algorithm): if you have ability to make membership queries,
can find all large (|\hat{f}_v| > 1/poly(n)) fourier coeffs in poly time.
Alg idea: estimate sum of squares of all fourier coefficients
corresponding to a given prefix. Trick of using E[f(yx)f(zx)]. Grow
a tree, expand nodes with value > tau.
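A toy version of the algorithm, enumerating the expectations exactly instead of estimating them by sampling as KM actually would (all function names are illustrative):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

def prefix_weight(f, alpha, n):
    # Sum of \hat{f}_{alpha.beta}^2 over all suffixes beta.  Uses the trick
    # from lecture: this equals E_x[(E_y[f(yx) phi_alpha(y)])^2], which
    # expands to E_{x,y,z}[f(yx) f(zx) phi_alpha(y) phi_alpha(z)].
    k = len(alpha)
    total = 0.0
    for x in product((0, 1), repeat=n - k):
        s = sum(f(y + x) * phi(alpha, y)
                for y in product((0, 1), repeat=k)) / 2**k
        total += s * s
    return total / 2**(n - k)

def km_large_coeffs(f, n, tau):
    # Grow a tree of prefixes, expanding only nodes with weight > tau^2.
    # Any v with |\hat{f}_v| > tau has every prefix-weight > tau^2, so it
    # survives; Parseval keeps each level's frontier below 1/tau^2 nodes.
    frontier = [()]
    for _ in range(n):
        frontier = [a + (b,) for a in frontier for b in (0, 1)
                    if prefix_weight(f, a + (b,), n) > tau**2]
    return frontier

# Example: a single parity function; KM recovers its one big coefficient.
n, v = 4, (1, 0, 1, 0)
f = lambda x: phi(v, x)
print(km_large_coeffs(f, n, tau=0.5))   # [(1, 0, 1, 0)]
```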
So, this means we can learn functions that have most of their length
in their large fourier coefficients. Today: what kinds of functions
have this property? Start with a useful def and structural result:
Define L_1(f) = sum_v |hat(f)_v|.
So, this is the L_1 length of f in the parity basis. Note that while
L_2 length is independent of your choice of basis (and for boolean
functions it's always 1), the L_1 length depends on your basis! E.g.,
if f is a parity function, then it has L_1 length = 1 in the parity
basis, but it has L_1 length = 2^{n/2} (as do all boolean functions)
in the standard basis where you list f as (f(0)*2^{-n/2}, f(1)*2^{-n/2},...).
THM: any boolean function f must have most of its L_2 length in
fourier coefficients that are larger than epsilon/L_1(f):
sum_{v: |hat{f}_v| < epsilon/L_1(f)} (hat{f}_v)^2 < epsilon
Proof: LHS < sum_{v: |hat{f}_v| < epsilon/L_1(f)} |hat{f}_v|*epsilon/L_1(f)
<= epsilon/L_1(f) * sum_{v} |hat{f}_v|
= epsilon. QED
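Both Parseval and this theorem can be sanity-checked on a small random function by brute-forcing all 2^n coefficients (a hypothetical test; phi is my own helper):

```python
import random
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

# A random boolean function f: {0,1}^4 -> {-1,1}
random.seed(0)
n = 4
table = {x: random.choice((-1, 1)) for x in product((0, 1), repeat=n)}
f = table.__getitem__

coeffs = [sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)]
L1 = sum(abs(c) for c in coeffs)                  # L_1 length, parity basis
assert abs(sum(c * c for c in coeffs) - 1) < 1e-9  # Parseval: L_2 length is 1

# The theorem: coefficients smaller than eps/L_1(f) carry < eps of the mass.
eps = 0.25
small_mass = sum(c * c for c in coeffs if abs(c) < eps / L1)
assert small_mass < eps
```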
On the Fourier Spectrum of DNF and Decision Trees
=================================================
Consider a single conjunction (not necessarily monotone). Let T(x) =
1 if x satisfies the conjunction and T(x)=0 otherwise. So, it's not a
boolean fn in our sense. In fact <T,T> = E[T(x)T(x)] = 1/2^|T|.
If we analyze the fourier coefficients we get:
hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]
= 0 if phi_v has a relevant variable outside T (by
usual parity argument)
= (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the
value of phi_v on an example satisfying T.
So it's very simple: the sum of absolute values of the coefficients
is 1, and the sum of squares is 1/2^|T| (which we knew already).
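A quick numeric check of these facts for a sample conjunction (the formula and names like T_vars are illustrative, not from lecture):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
# Conjunction T = (x0 AND NOT x1) as a 0/1 indicator; |T| = 2 literals.
T_vars = {0: 1, 1: 0}          # variable index -> required value
T = lambda x: 1 if all(x[i] == b for i, b in T_vars.items()) else 0

coeffs = {v: sum(T(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)}

# Coefficient is 0 whenever phi_v has a relevant variable outside T ...
assert all(c == 0 for v, c in coeffs.items()
           if any(v[i] for i in range(n) if i not in T_vars))
# ... sum of absolute values is 1, and sum of squares is 1/2^|T| = 1/4.
assert abs(sum(abs(c) for c in coeffs.values()) - 1) < 1e-9
assert abs(sum(c * c for c in coeffs.values()) - 1 / 2**len(T_vars)) < 1e-9
```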
Analyzing Decision Trees
========================
Suppose f is a decision tree. Using the terms corresponding to its
positive leaves (these are disjoint: each x reaches exactly one leaf),
we can write f as
f(x) = [T_1(x) + ... + T_m(x)]*2 - 1
So, to get the fourier coefficients hat(f)_v, we just add up the
coefficients of the corresponding functions. One easy thing to notice is that each
function here has a sum of absolute values of coefficients equal to 1
(even the function "-1" which has coefficient 1 along the empty parity
function and 0 everywhere else). So, really crudely, this immediately
means that
sum_v |hat(f)_v| <= 2m+1
where m is the number of positive leaves. So, this means we can use
the KM algorithm to learn Decision Trees with queries wrt the uniform
distribution.
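The L_1 bound can be verified on a tiny decision tree (the tree and all names below are my own example):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 3
# Decision tree: test x0; if 1, output +1; else test x1: if 1 output -1,
# if 0 output +1.  Positive leaves give terms T1=(x0=1), T2=(x0=0, x1=0),
# so m = 2 and f = 2*(T1 + T2) - 1.
def f(x):
    return 1 if x[0] == 1 or x[1] == 0 else -1

m = 2  # number of positive leaves
coeffs = [sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)]
L1 = sum(abs(c) for c in coeffs)
assert L1 <= 2 * m + 1                              # the crude L_1 bound
assert abs(sum(c * c for c in coeffs) - 1) < 1e-9   # Parseval sanity check
```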
Analyzing DNF
=============
For DNF we can't say that most of the function's length is in the
large fourier coefficients. The problem with the above reasoning is
that the terms aren't disjoint, so we can't just add them up. But we
*can* say that a DNF formula must have at least one reasonably-sized
fourier coefficient.
Claim: any m-term DNF f has a fourier coefficient of magnitude at least 1/(4m).
Proof: We can assume f has a term T of size at most lg(4m), else the
probability that a random example is positive for f is at most 1/4,
and so the empty "all negative" parity function has good correlation.
Now, here's the point: f has a noticeable correlation with T, because
when T=1, f=1 too, and when T=0 it doesn't matter. Formally, <f,T> =
E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).
Now, we use what we know about T to argue f has reasonable fourier
coefficients too. Formally,
<f,T> = sum_v hat(f)_v * hat(T)_v.
Now, using what we know about hat(T)_v, we get:
= sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)phi_v(T)
and we know this equals 1/2^|T|. So, we get:
sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.
There are only 2^|T| <= 4m terms in this sum, and each phi_v(T) is
+-1, so at least one hat(f)_v must have magnitude at least
1/2^|T| >= 1/(4m).
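A brute-force check of the claim on a small example DNF (a hypothetical formula; for real n one would estimate coefficients by sampling as in KM):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
# 2-term DNF: (x0 AND x1) OR (x2 AND NOT x3), as a +-1 function.
terms = [{0: 1, 1: 1}, {2: 1, 3: 0}]
def f(x):
    return 1 if any(all(x[i] == b for i, b in t.items())
                    for t in terms) else -1

m = len(terms)
coeffs = [abs(sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n)
          for v in product((0, 1), repeat=n)]
# The claim: some fourier coefficient has magnitude >= 1/(4m).
assert max(coeffs) >= 1 / (4 * m)
```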
(plus this says something about what their Fourier coefficients look like)
======================================================
Strong-learning DNF wrt uniform distribution: the above result gives
us weak-learning. Jackson (CMU PhD thesis '94) showed how to use a
specific boosting algorithm, one which doesn't change the distribution
too much, to convert this into strong learning. Key idea: in the
Fourier view, one can think of keeping the distribution fixed, but
having the target change and become non-boolean. If it doesn't change
too much, can still argue about having a large fourier coeff.
======================================================
High SQ dimension => no large-margin kernel
===========================================
Two lectures ago we proved that any class C with N > poly(n) pairwise
uncorrelated functions wrt D cannot be weak-learned over D in the SQ
model. Will start today with a related result (in some sense an
implication): any such class cannot have a large-margin kernel.
In particular, if C has N pairwise uncorrelated functions wrt D, then
for any kernel K, there must be some f in C with margin < 8/sqrt(N).
Moreover, this holds even if we allow average hinge-loss as large as 1/2.
Proof:
- Say the N uncorrelated functions are f_1, ..., f_N
- We know for any hypothesis h, sum_i E_D[h(x)f_i(x)]^2 <= 1.
- So, their average is at most 1/N, and since E[|X|] <= sqrt(E[X^2]),
    avg_i |E_D[h(x)f_i(x)]| <= 1/sqrt(N).
- Now, suppose we had a good kernel for all f_i (say margin > gamma
on all points).
- Fix one such f_i. We showed earlier that for a *random* linear
separator h, E_h[|E_D[h(x)f_i(x)]|] = Omega(gamma).
- In fact, even if we allow average hinge-loss as large as 1/2, we get
    E_h[|E_D[h(x)f_i(x)]|] >= gamma/8.
- Averaging over i and swapping the order of expectations gives
  E_h[avg_i |E_D[h(x)f_i(x)]|] >= gamma/8, so there must *exist* an h
  such that
    avg_i |E_D[h(x)f_i(x)]| >= gamma/8.
- But we showed above that avg_i |E_D[h(x)f_i(x)]| <= 1/sqrt(N) for
  *every* h. So gamma/8 <= 1/sqrt(N), i.e., gamma <= 8/sqrt(N). QED
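The Parseval and averaging steps at the start of the proof can be checked numerically, taking the parity functions as the pairwise-uncorrelated class and an arbitrary boolean h (an illustrative test with my own helper names):

```python
import math
import random
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
N = 2**n   # the N = 2^n parity functions are pairwise uncorrelated wrt uniform

# An arbitrary boolean hypothesis h: {0,1}^4 -> {-1,1}
random.seed(1)
table = {x: random.choice((-1, 1)) for x in product((0, 1), repeat=n)}
h = table.__getitem__

# Correlations E_D[h(x) f_i(x)] of h with each f_i (each parity).
corrs = [sum(h(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
         for v in product((0, 1), repeat=n)]

assert sum(c * c for c in corrs) <= 1 + 1e-9                    # Parseval
assert sum(abs(c) for c in corrs) / N <= 1 / math.sqrt(N) + 1e-9  # avg bound
```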