15-859(B) Machine Learning Theory                                03/17/08

 * Use of Fourier analysis for learning (preliminaries)
 * Some simple algorithms for learning wrt the uniform distrib.
=========================================================================

Recap SQ analysis
=================
- Fourier view of functions as vectors (once we fix a distribution over
  data).
- Showed that if the target is picked randomly from a large set of
  orthogonal functions, then any SQ alg is in trouble.
- Can view *any* SQ alg as basically proposing a unit vector and asking
  for its correlation (dot-product) with the target.  (Can break any SQ
  into one part that looks like this, and one part that depends only on
  the data distribution.)
- Then use the fact that in any orthogonal coordinate system, a unit
  vector can have at most t^2 coefficients that are >= 1/t.
- This means that if the target is picked from a set S of orthogonal
  functions, Pr(dot-product >= 1/t) <= t^2/|S|.
- So, if there are > poly(n) orthogonal functions, then we can't even
  weak-learn with a poly number of queries, each of tolerance 1/poly.

Using Fourier analysis for learning
===================================
- Useful tool for learning wrt the uniform distribution (basis of parity
  fns).
- Especially useful with queries.  Will see next time an algorithm for
  *finding* large Fourier coefficients over the parity basis if we can
  query the target f on inputs x of our own choosing.

Some general results
====================
Suppose phi_1(x), phi_2(x), ... is a basis of boolean functions.

  f(x) = sum_i hat{f}_i phi_i(x)

  hat{f}_i = correlation of f with phi_i = <f, phi_i> = E[f(x) phi_i(x)]
  (expectation over the underlying distribution).

Claim 1: if you can find a non-negligible Fourier coefficient then you
can weak-learn.

Proof: Just directly from the defn of correlation.  Use phi_i or its
complement.

Now, suppose we can find "most" of the Fourier coefficients.  Say we can
find a set S that captures all but epsilon of the (hat{f}_i)^2.  Let
g = sum_{i in S} hat{f}_i phi_i.  Then, as vectors, we have
|f-g|^2 = epsilon, and also we have <f,g> = 1 - epsilon.

Claim 2: Pr[f(x) != sign(g(x))] <= epsilon.  So sign(g) is a good apx
to f.

Proof: By definition, E[(f(x)-g(x))^2] = epsilon.  Notice that every x
where sign(g) is wrong contributes at least 1 to (f(x)-g(x))^2, since
f(x) is +/-1 and g(x) has the opposite sign.  So, the error of sign(g)
is at most epsilon.

In fact, we can see from the above argument that we don't need to get
the Fourier coefficients exactly.  If someone hands us S, we can
estimate the coefficients by looking at correlations, and get a good apx
to g.  That's all we need.

A more refined bound
====================
What if we can only find a set S that captures, say, half of the
(hat{f}_i)^2?  Then the above isn't so helpful.  Can we do any better?

Claim 3: Say |f-g|^2 = epsilon.  Then we can define a randomized
function h based on g such that Pr[f(x) != h(x)] <= epsilon/2.  So,
this is interesting even for epsilon = 1/2.

Let's first prove something a little easier.

Claim 4: Say <f,g> = 1 - epsilon *and* g(x) in [-1,1].  Then we can
define a randomized h based on g s.t. Pr[f(x) != h(x)] <= epsilon/2.

Proof: Define h as:

  h(x) =  1 with probability (1 + g(x))/2,
  h(x) = -1 with probability (1 - g(x))/2.

So, for any given x, Pr[f(x) != h(x)] = (1 - f(x)g(x))/2.  So, overall,
Pr[f(x) != h(x)] = (1 - E[f(x)g(x)])/2 = epsilon/2.

For the proof of Claim 3, we need to define h in a messier way:

  h(x) =  1 with probability (1 + g(x))^2 / (2(1 + g^2(x))),
  h(x) = -1 with probability (1 - g(x))^2 / (2(1 + g^2(x))).

[A little algebra to check this sums to 1:
(1+g)^2 + (1-g)^2 = 2(1 + g^2).]

So, for any given x, Pr[f(x) != h(x)] = (f(x) - g(x))^2 / (2(1+g^2(x))).
So, overall, Pr[f(x) != h(x)] <= E[(f(x)-g(x))^2]/2 = epsilon/2.
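
As a concrete illustration of Claim 2 (and the remark that estimated
coefficients suffice), here is a minimal Python sketch, assuming we are
simply handed a candidate set S of parities (subsets of variables) and
can sample f on uniform random +/-1 inputs.  The function names
(estimate_coefficient, sign_of_g_hypothesis) and the toy target are made
up for the example, not part of the notes.

    import random
    from itertools import combinations

    def parity(S, x):
        """chi_S(x) = product of x_i over i in S, with x_i in {-1,+1}."""
        v = 1
        for i in S:
            v *= x[i]
        return v

    def estimate_coefficient(f, S, n, samples=2000):
        """Estimate hat{f}_S = E[f(x) chi_S(x)] over uniform x in {-1,+1}^n."""
        total = 0
        for _ in range(samples):
            x = [random.choice((-1, 1)) for _ in range(n)]
            total += f(x) * parity(S, x)
        return total / samples

    def sign_of_g_hypothesis(f, subsets, n):
        """Claim 2: build g = sum_{S} hat{f}_S chi_S and predict with sign(g)."""
        coeffs = {S: estimate_coefficient(f, S, n) for S in subsets}
        def h(x):
            g = sum(c * parity(S, x) for S, c in coeffs.items())
            return 1 if g >= 0 else -1
        return h

    if __name__ == "__main__":
        # Toy target: majority of the first 3 of n=6 variables.
        n = 6
        f = lambda x: 1 if x[0] + x[1] + x[2] > 0 else -1
        # Hand the learner all parities of degree <= 2 as the set S.
        subsets = [S for k in range(3) for S in combinations(range(n), k)]
        h = sign_of_g_hypothesis(f, subsets, n)
        test = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(2000)]
        err = sum(h(x) != f(x) for x in test) / len(test)
        print("empirical error of sign(g):", err)

Here the degree <= 2 parities capture 3/4 of the squared coefficients of
this target, yet sign(g) already recovers it almost everywhere.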
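
Similarly, here is a minimal sketch of the randomized predictor from
Claim 4, assuming we already have some g(x) in [-1,1] correlated with f;
Claim 3's version just swaps in the probabilities
(1 +/- g)^2 / (2(1+g^2)).  The helper names and toy f, g are
hypothetical.

    import random

    def randomized_predictor(g):
        """Claim 4: given g(x) in [-1,1], output +1 w.p. (1+g(x))/2, else -1.
        Then Pr[h(x) != f(x)] = (1 - f(x)g(x))/2 for every x."""
        def h(x):
            return 1 if random.random() < (1 + g(x)) / 2 else -1
        return h

    if __name__ == "__main__":
        n = 5
        f = lambda x: 1 if sum(x) > 0 else -1   # majority on 5 +/-1 bits
        g = lambda x: sum(x) / n                # bounded correlate of f, in [-1,1]
        h = randomized_predictor(g)
        trials = 20000
        errs, corr = 0, 0.0
        for _ in range(trials):
            x = [random.choice((-1, 1)) for _ in range(n)]
            errs += (h(x) != f(x))
            corr += f(x) * g(x)
        print("empirical Pr[h != f]:", errs / trials)
        print("(1 - E[fg]) / 2     :", (1 - corr / trials) / 2)

The two printed quantities should agree up to sampling noise, which is
exactly the statement of Claim 4.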
Learning wrt uniform distrib
============================
The general statements above hold in a generic setting.  Now, focus on
learning wrt the uniform distribution, with the specific basis of parity
functions.

So, view it like this: we have some unknown function f on n inputs in a
box.  We can see f on uniform random points.  Maybe we're also allowed
to query f on inputs of our choosing (the Membership Query model).
Goal: find g that approximates f in the sense that it gets most of the
domain correct.

E.g., next time: the Kushilevitz-Mansour algorithm that, using
Membership Queries, will find all i such that (hat{f}_i)^2 > tau in time
poly(n, 1/tau).  Can combine this with the structural statement that any
Decision Tree must have most of its Fourier weight on large coefficients
to get a poly-time algorithm to learn decision trees with MQs wrt the
uniform distribution (via Claim 2 above).

For now, some simple algorithms that don't use queries.
-------------------
DNF: We can complement our SQ lower bound with a simple n^{O(log s)}-time
algorithm to learn DNF of s terms over the uniform distribution.
 + There must exist *some* term that covers at least a 1/s fraction of
   the positives.
 + We can find such a term in time n^{O(log(s/epsilon))}.
 + Do greedy set-cover.  Get a DNF of s*log(1/epsilon) terms.  The size
   is bounded, so we can use Occam to upper-bound the sample size
   needed.  (A code sketch of this greedy covering step appears at the
   end of these notes.)
 - Note: LMN use Fourier analysis to get this kind of n^{polylog(s)}
   bound for learning AC^0 circuits.  (DNF are depth-2 circuits.)

(Weak)-Learning Monotone functions
----------------------------------
A monotone function is one such that if you take some example x and flip
some bit from 0 to 1, then the value of f(x) can't change from positive
to negative.

Quick sanity check: why are we focusing on the uniform distribution?  If
you allow an arbitrary distribution, then you're sunk.  E.g., consider
putting all the probability mass on examples x with |x| = n/2 (the
"middle slice").  The target can now be arbitrary on these, so there is
nothing to learn.

But under uniform, you can't have too many orthogonal monotone
functions.  In particular, a classic result is that any monotone
function has to have Omega(1/n) correlation with one of
{0, 1, x_1, ..., x_n}.  Specifically:

- If Pr(f(x) = 1) is far from 1/2, we're done (we can just say
  "everything is positive" or "everything is negative").  So, for now
  let's assume Pr(f(x) = 1) is between 1/4 and 3/4.

- Want to argue that some variable has correlation.  I.e., for some x_i,
  Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.

  [Sanity check that this is the same as our previous def of
  correlation, i.e., that x_i is a weak hypothesis: if the error rate is
  alpha when x_i = 0, then the error rate is at most 1-(alpha+epsilon)
  when x_i = 1, so the overall error rate is their average, which is at
  most (1-epsilon)/2.  A code sketch of the resulting weak learner
  appears at the end of these notes.]

How to prove this: look at the boolean cube {0,1}^n.  Let's color an
example red if positive and blue if negative.  Assume roughly half of
each: e.g., say at least 1/4 of the nodes are red and at least 1/4 are
blue.  Want: no matter how you color, there are a lot of edges crossing
from red to blue.  Say we can prove that there must be Omega(2^n) edges
with one endpoint of each color.  Then there must be some direction i
with Omega(2^n / n) edges in that direction.  Then x_i has good
correlation [[why?]].

So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges.  We can prove
this using a "canonical path" argument:
For each pair of examples (x,y) where x is red and y is blue, define the
canonical path from x to y to be: "starting with x, scan down the bits
from left to right, flipping where needed".  This path must use at least
one red-blue edge.  There are at least (1/4)(3/4)(2^{2n}) pairs.  But
any *given* red-blue edge can be used by at most 2^{n-1} of these paths.
(Why: if the edge is 01010100 -> 01011100, flipping the 5th bit, then x
must end with 0100 and y must begin with 01011, so there are at most
2^4 * 2^3 = 2^{n-1} such pairs.)  So, there must be at least
(3/8)(2^n) red-blue edges.

In fact, you can even show a stronger result (which we won't have time
to prove today): any apx balanced monotone function must have
Omega(1/sqrt(n)) correlation with the majority function over all the
variables.  In particular, this implies an even simpler algorithm: just
ask two SQs:

 1. What is Pr(f(x) = 1)?
 2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

Claim: any monotone function must have Omega(1/sqrt(n)) correlation with
(a) the all-positive function, (b) the all-negative function, or (c) the
majority function "predict positive iff |x| > n/2".
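
Here is the weak-learner sketch promised in the monotone section above:
a minimal Python illustration, assuming we can sample f on uniform
random {0,1}^n inputs with +/-1 labels.  It just estimates the
correlation of every hypothesis from the claim (the two constants, each
single variable x_i, and majority) and returns the best one; the helper
names and the toy monotone target are made up.

    import random

    def correlation(f, h, n, samples=20000):
        """Estimate E[f(x)h(x)] over uniform x in {0,1}^n (labels in {-1,+1})."""
        total = 0
        for _ in range(samples):
            x = [random.randint(0, 1) for _ in range(n)]
            total += f(x) * h(x)
        return total / samples

    def monotone_weak_learner(f, n):
        """Try all-positive, all-negative, majority, and each x_i; for a
        monotone f, one of these has correlation Omega(1/sqrt(n))."""
        candidates = [("always +1", lambda x: 1),
                      ("always -1", lambda x: -1),
                      ("majority",  lambda x: 1 if sum(x) > n / 2 else -1)]
        for i in range(n):
            candidates.append(("x_%d" % i,
                               lambda x, i=i: 1 if x[i] == 1 else -1))
        scored = [(correlation(f, h, n), name, h) for name, h in candidates]
        return max(scored, key=lambda t: t[0])

    if __name__ == "__main__":
        n = 7
        # A monotone target: f(x) = +1 iff (x_0 AND x_1) OR x_2.
        f = lambda x: 1 if (x[0] and x[1]) or x[2] else -1
        corr, name, h = monotone_weak_learner(f, n)
        print("best hypothesis:", name, "correlation about", round(corr, 3))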
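
And here is the greedy covering sketch promised in the DNF discussion:
a toy Python illustration of the n^{O(log(s/epsilon))} idea, assuming we
work from a finite sample of labeled examples rather than statistical
queries, enumerate all conjunctions of at most k literals, keep only
those false on every negative example, and greedily cover the positives.
The names (greedy_dnf_cover, etc.), the hand-set k, and the toy target
are all just for the example.

    import random
    from itertools import combinations, product

    def term_satisfied(term, x):
        """term is a list of (index, required_bit) pairs; x is a 0/1 vector."""
        return all(x[i] == b for i, b in term)

    def all_short_terms(n, k):
        """All conjunctions of at most k literals over n 0/1 variables."""
        for size in range(1, k + 1):
            for idxs in combinations(range(n), size):
                for bits in product((0, 1), repeat=size):
                    yield list(zip(idxs, bits))

    def greedy_dnf_cover(pos, neg, n, k):
        """Greedy set cover: repeatedly pick a short term that is false on
        every negative example and covers the most uncovered positives."""
        terms = [t for t in all_short_terms(n, k)
                 if not any(term_satisfied(t, x) for x in neg)]
        uncovered = list(pos)
        hypothesis = []
        while uncovered and terms:
            best = max(terms,
                       key=lambda t: sum(term_satisfied(t, x) for x in uncovered))
            if sum(term_satisfied(best, x) for x in uncovered) == 0:
                break  # no consistent short term covers anything more
            hypothesis.append(best)
            uncovered = [x for x in uncovered if not term_satisfied(best, x)]
        return hypothesis  # predict +1 iff some term in the list is satisfied

    if __name__ == "__main__":
        n, k = 8, 2
        target = lambda x: (x[0] and x[1]) or (x[2] and not x[3])  # 2-term DNF
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(500)]
        pos = [x for x in sample if target(x)]
        neg = [x for x in sample if not target(x)]
        print("learned terms:", greedy_dnf_cover(pos, neg, n, k))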