15-859(B) Machine Learning Theory                               02/28/02

1. Finish proof from last time (Fourier analysis to prove that we can't
   learn C in the SQ model if C has more than polynomially-many
   uncorrelated functions).
2. More SQ algorithms: weak-learning monotone functions under the
   uniform distribution.
========================================================================

1. --> See notes from last time.

2. Weak-learning monotone functions under the uniform distribution
==================================================================

Here is an interesting SQ algorithm for weak-learning (getting 1/n
correlation with) any *monotone* function over the uniform distribution.
We learn from examples drawn uniformly at random from {0,1}^n, and the
target function is *monotone*: if you take some example x and flip some
bit from 0 to 1, the value of f(x) can't change from positive to
negative. There are no queries: we just observe uniform random examples.

Given only that the function is monotone, it turns out there is an easy
way to *weak-learn* with respect to the uniform distribution. Any
guesses?

- Assume Pr(f(x) = 1) = 1/2 (at least approximately), since otherwise
  we can just predict "everything is positive" or "everything is
  negative".

- We want to argue that some variable has noticeable correlation with
  the target. I.e., for some x_i,

      Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.

  (If we can find such an x_i, then we can predict positive when
  x_i = 1 and negative when x_i = 0.) In other words, the target has
  some noticeable Fourier coefficient among these n+1 (the n singleton
  coefficients plus the constant).

How to prove this: look at the boolean cube {0,1}^n, and color an
example red if it is positive and blue if it is negative. Assume roughly
half of each; say at least 1/4 of the nodes are red and at least 1/4 are
blue. We want to show that no matter how you color, there must be a lot
of edges crossing from red to blue. Say we can prove that there must be
Omega(2^n) edges with one endpoint of each color. Then there must be
some direction i with Omega(2^n / n) edges in that direction.
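Concretely, the weak learner this argument justifies is very simple:
estimate the gap Pr(f(x) = 1 | x_i = 1) - Pr(f(x) = 1 | x_i = 0) for
each coordinate from the sample, and predict with the best single
variable. A minimal sketch, using a hypothetical monotone target chosen
only for illustration (not from the notes):

```python
import random

random.seed(0)
n = 8
SAMPLES = 20000

def target(x):
    # hypothetical monotone target, for illustration only:
    # f(x) = x_0 OR (x_1 AND x_2)
    return 1 if x[0] or (x[1] and x[2]) else 0

# observe uniform random labeled examples (no queries)
examples = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
labels = [target(x) for x in examples]

def gap(i):
    # empirical estimate of Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0)
    on = [y for x, y in zip(examples, labels) if x[i] == 1]
    off = [y for x, y in zip(examples, labels) if x[i] == 0]
    return sum(on) / len(on) - sum(off) / len(off)

best = max(range(n), key=gap)  # the most correlated single variable

# weak hypothesis: predict f(x) = x_best; check accuracy on fresh data
fresh = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
accuracy = sum(x[best] == target(x) for x in fresh) / SAMPLES
print(best, round(accuracy, 3))
```

Note each gap(i) is itself a statistical query, so this runs unchanged
in the SQ model: a polynomial-sized sample estimates all n gaps to
within the tolerance needed to detect an Omega(1/n) gap.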
Then x_i has good correlation [[why?]]. (Hint: by monotonicity, on
every bichromatic edge in direction i the x_i = 1 endpoint is the red
one, so each such edge contributes 2^{-(n-1)} to
Pr(f(x) = 1 | x_i = 1) - Pr(f(x) = 1 | x_i = 0).)

So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges. We can prove
this using a "canonical path" argument: for each pair of examples (x,y)
where x is red and y is blue, define the canonical path from x to y to
be "starting with x, scan down the bits from left to right, flipping
wherever x and y disagree". Since each color occupies between 1/4 and
3/4 of the cube, there are at least (1/4)(3/4)(2^{2n}) = (3/16)(2^{2n})
such pairs. But any *given* edge can be used by at most 2^n of these
paths. (Why? Because if the edge is 01010100 -> 01011100, then x must
end with 100 and y must begin with 0101.) So, there must be at least
(3/16)(2^n) bichromatic edges.

This algorithm gives 1/n correlation. It turns out you can get
1/sqrt(n) correlation by an even simpler algorithm that just asks two
SQs:

1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

I.e., we're asking: what is the correlation of f with the "all
positive" hypothesis, and what is the correlation of f with the
majority function?

One can also prove that no algorithm using a polynomial-sized sample
can hope to do better than log(n)/sqrt(n) correlation for an arbitrary
monotone function. We won't do the proofs here; this is from a paper
with Carl Burch and John Langford.

Nice (hard) open question: if you assume the target can be written as a
polynomial-size monotone DNF formula, can you do better? [The lower
bound doesn't hold in that case.]

(Probably easier) open problem: close the log(n) gap.
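As a sanity check on the two-SQ algorithm above, here is a small
simulation estimating both queries from uniform samples. The target
function and sample size are illustrative assumptions, not from the
notes:

```python
import random

random.seed(1)
n = 9  # odd, so "more 1s than 0s" is a strict majority
SAMPLES = 40000

def target(x):
    # hypothetical monotone target, for illustration only:
    # true iff at least 4 of the first 7 bits are 1
    return 1 if sum(x[:7]) >= 4 else 0

examples = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
labels = [target(x) for x in examples]

# SQ 1: Pr(f(x) = 1) -- correlation with the "all positive" hypothesis
p_all = sum(labels) / SAMPLES

# SQ 2: Pr(f(x) = 1 | x has more 1s than 0s) -- correlation with majority
maj_labels = [y for x, y in zip(examples, labels) if sum(x) > n // 2]
p_maj = sum(maj_labels) / len(maj_labels)

print(round(p_all, 3), round(p_maj, 3))
```

For this target Pr(f = 1) is about 1/2, while the conditional
probability given a majority of 1s is noticeably larger, so predicting
with the majority function already beats random guessing, exactly as
the two-SQ algorithm requires.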