15-859(B) Machine Learning Theory                                   03/27/06

* finish SQ hardness
* use of Fourier analysis for learning (preliminaries)
* some simple algorithms for learning wrt the uniform distribution
=========================================================================

Recap of SQ analysis
====================
- Fourier view of functions as vectors (once we fix a distribution over data).
- Showed that if the target is picked randomly from a large set of orthogonal
  functions, then any SQ algorithm is in trouble.
- Can view *any* SQ as basically proposing a unit vector and asking for its
  correlation (dot-product) with the target. (Can break any SQ into one part
  that looks like this, and one part that depends only on the data
  distribution.)
- Then use the fact that in any orthonormal coordinate system, a unit vector
  can have at most 1/tau^2 coefficients that are >= tau.
- This means that if the target is picked uniformly from a set S of
  orthogonal functions, then Pr(dot-product >= tau) <= 1/(tau^2 |S|).
- So, if the class contains more than poly(n) orthogonal functions, we can't
  even weak-learn with a polynomial number of queries, each of tolerance
  1/poly(n).

Using Fourier analysis for learning
===================================
- A useful tool for learning with respect to the uniform distribution.
- Especially useful with queries.

Can think of the problem like this: we have a black box containing some
unknown function f:{0,1}^n -> {0,1} from some class C. We can prod f with
inputs x of our own choosing. Our goal is to come up with a good
approximation of f, in the sense that it gets most of the domain correct.

Some general results
====================
Suppose \phi_1(x), \phi_2(x), ... is an orthonormal basis of boolean
functions. Then

    f(x) = \sum_i \hat{f}_i \phi_i(x),

where \hat{f}_i = <f, \phi_i> is the correlation of f with \phi_i.

Claim 1: if you can find a non-negligible Fourier coefficient, then you can
weak-learn.

Proof: directly from the definition of correlation. Use \phi_i or its
complement, whichever agrees with f on more than half the domain.

Now, suppose we can find "most" of the Fourier coefficients. Say we can find
a set S of indices that captures all but epsilon of \sum_i (\hat{f}_i)^2.
Let g = \sum_{i in S} \hat{f}_i \phi_i. Then, as vectors, we have
|f-g|^2 <= epsilon.

Claim 2: even if the above g is not a boolean function, sign(g) is a good
approximation to f.

Proof: by definition, E[(f(x)-g(x))^2] <= epsilon (here we view f as
+/-1-valued). Notice that every x where sign(g(x)) != f(x) contributes at
least 1 to this expectation. So, the error rate of sign(g) is at most
epsilon.

In fact, the argument above shows we don't need the Fourier coefficients
exactly. If someone hands us the set S, we can estimate each coefficient by
measuring correlations on random samples, and still get a good approximation
to g. That's all we need. (A concrete sketch of this recipe is given below.)

Learning wrt the uniform distribution
=====================================
The general statements above hold in a generic setting. Now we focus on
learning with respect to the uniform distribution, using the specific basis
of parity functions.

So, view it like this: we have some unknown function f on n inputs in a box.
We can see f on uniform random points. Maybe we're also allowed to query f
on inputs of our choosing (the Membership Query model). Goal: find g that
approximates f in the sense that it gets most of the domain correct.

E.g., next time we will see the Kushilevitz-Mansour algorithm, which, using
Membership Queries, finds all i such that (\hat{f}_i)^2 > tau in time
poly(n, 1/tau). Combining this with the structural statement that any small
decision tree must have most of its length (L_2 weight) in large Fourier
coefficients, we get (via Claim 2 above) a poly-time algorithm for learning
decision trees with MQs with respect to the uniform distribution.

For now, some simple algorithms that don't use queries.
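Before turning to those, here is a minimal Python sketch of the recipe
behind Claim 2 (my own illustration, not from the lecture): given a
candidate set S of index sets, estimate each coefficient as the empirical
correlation of f with the corresponding parity function under the uniform
distribution, then predict with sign(g). The sample size m and all names
are illustrative assumptions, and f is assumed to be a +/-1-valued black
box.

    import random

    def parity(x, T):
        # phi_T(x) = (-1)^(sum of x_i for i in T): the +/-1 parity on T.
        return -1 if sum(x[i] for i in T) % 2 == 1 else 1

    def estimate_coeffs(f, n, S, m=10000):
        # Estimate hat{f}_T = E[f(x) * parity(x, T)] for each T in S,
        # using m uniform random samples per coefficient.
        coeffs = {}
        for T in S:
            total = 0
            for _ in range(m):
                x = [random.randint(0, 1) for _ in range(n)]
                total += f(x) * parity(x, T)
            coeffs[T] = total / m
        return coeffs

    def sign_of_g(coeffs):
        # Claim 2: predict with sign(g), g(x) = sum_T hat{f}_T * parity(x, T).
        def h(x):
            g = sum(c * parity(x, T) for T, c in coeffs.items())
            return 1 if g >= 0 else -1
        return h

E.g., taking S to be the empty tuple () plus all singletons (i,) for i in
range(n) builds the best approximation to f by a degree-<=1 polynomial.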
-------------------
DNF: We can complement our SQ lower bound with a simple n^{O(log s)}-time
algorithm for learning DNF formulas of s terms over the uniform
distribution. Ideas?

+ How important are terms of size >> log(s/epsilon)? Not very: such a term
  is satisfied with probability < epsilon/s, so dropping all of them changes
  f on less than an epsilon fraction of inputs. This leaves only
  n^{O(log(s/epsilon))} candidate terms to consider.
+ Do greedy set-cover over these candidates. Get a DNF of O(s log(1/epsilon))
  terms. The hypothesis size is bounded, so we can use Occam to upper-bound
  the sample size needed. [Can't do "list-and-cross-off" because the
  hypothesis space is too big.]
- Note: LMN use Fourier analysis to get this kind of n^{polylog(s)} bound
  for learning AC^0 circuits. (DNF formulas are depth-2 circuits.)
- On the homework, you are giving an algorithm to learn decision trees of
  size s over *any* distribution in time n^{O(log s)}.

(Weak)-Learning Monotone Functions
----------------------------------
A monotone function is one such that if you take some example x and flip
some bit from 0 to 1, then the value of f(x) can't change from positive to
negative.

Quick sanity check: why are we focusing on the uniform distribution? If you
allow an arbitrary distribution, then you're sunk. E.g., consider putting
all probability mass on examples x with |x| = n/2 (the "middle slice"). The
target can be arbitrary on these points, so there is nothing to learn.

But under the uniform distribution, you can't have too many orthogonal
monotone functions. In particular, a classic result is that any monotone
function must have Omega(1/n) correlation with one of {0, 1, x_1, ..., x_n}.
Specifically:

- If Pr(f(x)=1) is far from 1/2, we're done (can just say "everything is
  positive" or "everything is negative"). So, for now, let's assume
  Pr(f(x)=1) is between 1/4 and 3/4.
- Want to argue that some variable has correlation: i.e., for some x_i,
  Pr(f(x)=1 | x_i=1) > Pr(f(x)=1 | x_i=0) + epsilon.
  [Sanity check that this is the same as our previous definition of
  correlation, i.e., that x_i is a weak hypothesis: if the error rate of
  predicting with x_i is alpha when x_i=0, then it is at most
  1-(alpha+epsilon) when x_i=1, so the overall error rate is their average,
  which is at most (1-epsilon)/2.]

How to prove this: look at the boolean cube {0,1}^n, and color an example
red if positive and blue if negative. Assume roughly half of each; e.g.,
say at least 1/4 of the nodes are red and at least 1/4 are blue. Want: no
matter how you color, there are a lot of edges crossing from red to blue.
Say we can prove that there must be Omega(2^n) edges with one endpoint of
each color. Then there must be some direction i with Omega(2^n / n) such
edges, and then x_i has good correlation. [[Why? By monotonicity, each
red-blue edge in direction i goes from negative at x_i=0 to positive at
x_i=1, and Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0) equals the number of such
edges divided by 2^{n-1}, the total number of edges in direction i.]]

So, it all boils down to saying that if you want to slice a cube into two
roughly equal halves, you need to cut a lot of edges. We can prove this
using a "canonical path" argument. For each pair of examples (x,y) where x
is red and y is blue, define the canonical path from x to y to be:
"starting with x, scan down the bits from left to right, flipping where
needed". Each path must use at least one red-blue edge. There are at least
(1/4)(3/4)2^{2n} pairs. But any *given* red-blue edge can be used by at
most 2^{n-1} of these paths. (Why: if the edge is 01010100 -> 01011100,
flipping the 5th bit, then x must end with 0100 and y must begin with
01011; in general this leaves 2^{i-1} choices for x and 2^{n-i} choices
for y, i.e., 2^{n-1} pairs.) So, there must be at least (3/8)2^n red-blue
edges. A sketch of the weak learner this argument yields appears below.
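A minimal sketch of that weak learner (my own illustration, with arbitrary
sample size and thresholds, not code from the notes): estimate Pr(f(x)=1);
if it is noticeably unbalanced, output a constant; otherwise find the
variable x_i with the largest estimated gap Pr(f=1 | x_i=1) -
Pr(f=1 | x_i=0), which the argument above guarantees is Omega(1/n) for
some i.

    import random

    def weak_learn_monotone(f, n, m=20000):
        # Weak learner for monotone f: {0,1}^n -> {0,1} under the uniform
        # distribution. m is an illustrative sample size; to detect a gap
        # of tau it should grow like 1/tau^2.
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
        labels = [f(x) for x in sample]
        p = sum(labels) / m
        if p >= 0.6:                 # f unbalanced: a constant weak-learns
            return lambda x: 1
        if p <= 0.4:
            return lambda x: 0
        def gap(i):                  # est. Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0)
            ones = [y for x, y in zip(sample, labels) if x[i] == 1]
            zeros = [y for x, y in zip(sample, labels) if x[i] == 0]
            return sum(ones) / max(len(ones), 1) - sum(zeros) / max(len(zeros), 1)
        best = max(range(n), key=gap)
        return lambda x: x[best]     # predict f(x) = x_best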
In fact, you can show an even stronger result (which we won't have time to
prove today): any approximately balanced monotone function must have
Omega(1/sqrt(n)) correlation with the majority function over all the
variables. In particular, this implies an even simpler algorithm that just
asks two SQs:
1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

Claim: any monotone function must have Omega(1/sqrt(n)) correlation with
(a) the all-positive function, (b) the all-negative function, or (c) the
majority function "predict positive iff |x| > n/2".
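A sketch of the resulting two-query algorithm, simulating each SQ with
uniform samples (again my own illustration; the sample size and advantage
threshold gamma are assumptions, and the final complement branch is just a
safe default):

    import random

    def two_sq_learner(f, n, m=20000, gamma=0.05):
        # Weak learner for monotone f using only the two statistics above,
        # each simulated by sampling. The Claim guarantees an
        # Omega(1/sqrt(n)) advantage for one of branches (a), (b), (c).
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
        labels = [f(x) for x in sample]
        p = sum(labels) / m                      # SQ 1: Pr(f(x) = 1)
        if p >= 0.5 + gamma:
            return lambda x: 1                   # (a) all-positive
        if p <= 0.5 - gamma:
            return lambda x: 0                   # (b) all-negative
        upper = [y for x, y in zip(sample, labels) if sum(x) > n / 2]
        q = sum(upper) / max(len(upper), 1)      # SQ 2: Pr(f=1 | more 1s)
        if q >= 0.5:
            return lambda x: 1 if sum(x) > n / 2 else 0   # (c) majority
        return lambda x: 0 if sum(x) > n / 2 else 1       # safe default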