15-859(A) Machine Learning Theory 02/26/04
Recap SQ analysis from last time.
New problem: (Weak-)learning monotone functions over uniform distribution
- good case-study of analyzing distrib-specific learning
- Upper-bounds will be simple SQ algs. Lower bounds will be info-based.
========================================================================
Recap SQ analysis from last time.
================================
- Fourier view of functions as vectors (once we fix a distribution over data)
- Showed that if target is picked randomly from a large set of
orthogonal functions, then SQ alg is in trouble.
- can view *any* SQ alg as basically proposing a unit vector and
asking for correlation (dot-product) with target.
(Can break any SQ into one part that looks like this, and one
part that depends only on data distribution)
- Then use the fact that in any orthogonal coordinate system, a
unit vector can have at most 1/tau^2 coefficients that are >= tau
- This means that if target is picked from set S of orthogonal
functions, Pr(dot-product >= tau) <= 1/(tau^2 |S|).
- So, if > poly(n) orthogonal functions, then can't even weak-learn
with poly # of queries, each of tolerance 1/poly.
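The pigeonhole fact above is easy to sanity-check numerically (this sketch is not from the lecture; the dimension and tau are just sample values): the squared coordinates of a unit vector sum to 1, so at most 1/tau^2 of them can have magnitude >= tau.

```python
import math
import random

random.seed(0)

def random_unit_vector(dim):
    """Gaussian vector scaled to unit length."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

tau = 0.1
v = random_unit_vector(10_000)
# Squared coordinates sum to 1, so at most 1/tau^2 = 100 of them
# can have |coordinate| >= tau.
big = sum(1 for x in v if abs(x) >= tau)
assert big <= 1 / tau ** 2
```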
(Weak)-Learning Monotone functions
==================================
Today, we're going to look at the problem of (weak)-learning MONOTONE
functions, over the uniform distribution on {0,1}^n. Monotone means
that if you take some example x and flip some bit from 0 to 1, then
the value of f(x) can't change from positive to negative. Use |x| to
denote the number of 1s in x (the "size" of x).
We're going to allow targets that are arbitrarily complicated (i.e.,
consider algs that don't depend on description length of target) and
see what kinds of upper and lower bounds we can get. It turns out
that even if you assume the target has poly-size representation as
monotone DNF, no better upper-bounds (algorithmic guarantees) are
known than the ones we will discuss today. Whether you can do better
in that case is a big open question.
Quick sanity check: why are we focusing on uniform distribution? If
you allow arbitrary distribution, then you're sunk. E.g., consider
all probability mass on examples x with |x|=n/2 (the "middle slice").
Target can now be arbitrary on these, so nothing to learn.
But under uniform, you can't have too many orthogonal monotone
functions. In particular, the first analysis we will do is show that
any monotone function has to have Omega(1/n) correlation with one of
{0,1, x_1, ..., x_n}.
Specifically:
- If Pr(f(x) = 1) is far from 1/2, we're done (can just say
"everything is positive" or "everything is negative"). So, for now
let's assume Pr(f(x)=1) is between 1/4 and 3/4.
- Want to argue that some variable has correlation. I.e., for some
x_i, Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.
[Sanity check that this is same as our previous def of correlation,
i.e., x_i is a weak hypothesis. If error rate is alpha when x_i=0, then
error rate is at most 1-(alpha+epsilon) when x_i=1, so overall error
rate is their average which is (1-epsilon)/2.]
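The bracketed averaging step can be checked with concrete numbers (alpha and epsilon here are just sample values, not from the lecture):

```python
# If Pr(f=1 | x_i=0) = alpha, the hypothesis h(x) = x_i errs with
# probability alpha on the x_i=0 half and at most 1-(alpha+eps) on
# the x_i=1 half; under uniform, each half has probability 1/2.
alpha, eps = 0.30, 0.10
err_when_0 = alpha
err_when_1 = 1 - (alpha + eps)
overall = 0.5 * err_when_0 + 0.5 * err_when_1
assert abs(overall - (1 - eps) / 2) < 1e-12  # overall error = (1-eps)/2
```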
How to prove: look at boolean cube {0,1}^n. Let's color example red if
positive and blue if negative. Assume roughly half of each. E.g.,
say at least 1/4 of nodes are red and at least 1/4 are blue. Want: no
matter how you color, a lot of edges crossing from red to blue. Say
we can prove that there must be Omega(2^n) edges with one endpoint of each
color. Then there must be some direction i with Omega(2^n / n) of those
edges. And then x_i has good correlation: the 2^{n-1} edges in
direction i pair each example with x_i=0 against its neighbor with
x_i=1, and by monotonicity every crossing edge goes from negative
(x_i=0) to positive (x_i=1), so
Pr(f(x)=1 | x_i=1) - Pr(f(x)=1 | x_i=0)
= (#crossing edges in direction i)/2^{n-1} = Omega(1/n).
So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges.
Can prove this using a "canonical path" argument. For each pair of
examples (x,y) where x is red and y is blue, define the canonical path
from x to y to be: "starting with x, scan down the bits from left to
right, flipping where needed". There are at least (1/4)(3/4)(2^{2n})
pairs. But, any *given* edge can be used by at most 2^n of these
paths (why: because if edge is 01010100 -> 01011100, then x must end
with 100 and y must begin with 0101). So, there must be at least
(3/16)(2^n) edges.
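The edge-counting theorem can be verified by brute force on a small cube (this check is not from the lecture; majority is just one sample monotone function with both color classes large):

```python
from itertools import product

def crossing_edges(f, n):
    """Count cube edges whose two endpoints get different labels.
    Each edge is counted once, from its x_i = 0 endpoint."""
    count = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if f(x) != f(y):
                    count += 1
    return count

n = 10
maj = lambda x: sum(x) > n // 2          # monotone, roughly balanced
frac_pos = sum(maj(x) for x in product([0, 1], repeat=n)) / 2 ** n
assert 1 / 4 <= frac_pos <= 3 / 4        # both color classes are large
# theorem: at least (3/16) * 2^n crossing edges
assert crossing_edges(maj, n) >= (3 / 16) * 2 ** n
```

Here every crossing edge sits between the middle slice and the slice above it, and the count (1260) comfortably beats the (3/16)*2^10 = 192 bound.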
This algorithm gives error 1/2 - Omega(1/n). Can we improve? There is
a simple lower bound of Omega(1/sqrt(n)) on the error: the "slice
function" case, where the target is random on the middle slice
(negative below the slice, positive above it). Each example has an
Omega(1/sqrt(n)) chance of having |x|=n/2, and on the slice no
algorithm can beat random guessing. This is a pretty big gap, though.
Turns out we can do better on both sides:
- an algorithm with error 1/2 - Omega(1/sqrt(n)).
- a lower bound showing no algorithm can guarantee error better than
1/2 - O(log(n)/sqrt(n)).
A better algorithm
==================
Actually, the algorithm is even simpler than the one above. Just asks two SQs:
1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?
Claim: any monotone function must have Omega(1/sqrt(n)) correlation
with (a) the all positive function, (b) the all negative function, or
(c) the majority function "predict positive iff |x| > n/2".
Proof: (modulo a classic thm called Kruskal-Katona thm)
- for target c, define p_k = Pr[c(x)=1 | |x|=k].
E.g., what does p_k look like for c = x_1? For c = majority?
- Claim 1: for all k, p_k >= p_{k-1}. Seems intuitively like it has
to be true. Proof: one way to pick a random x at a given level is to
start from the all 0s example and put 1s into it at random. Since
example can only switch neg->pos in this process, Pr(x is positive
after k 1s) >= Pr(x is positive after k-1 1s). So, p_k >= p_{k-1}.
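Claim 1 is easy to check by brute force for any particular monotone target (a sketch, not from the lecture; the function x_1 OR (x_2 AND x_3) is just a sample monotone f):

```python
from itertools import product

n = 8

def f(x):
    # a sample monotone function: x_1 OR (x_2 AND x_3)
    return x[0] or (x[1] and x[2])

# p_k = Pr[f(x)=1 | |x| = k], the positive density of each level
p = []
for k in range(n + 1):
    level = [x for x in product([0, 1], repeat=n) if sum(x) == k]
    p.append(sum(1 for x in level if f(x)) / len(level))

# Claim 1: p_k is nondecreasing in k
assert all(p[k] >= p[k - 1] for k in range(1, n + 1))
assert p[0] == 0 and p[n] == 1
```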
- Claim 2: In fact, Kruskal-Katona gives a quantitative version of
Claim 1: the p_k increase quickly enough that, given
1/4 <= Pr(f(x)=1) <= 3/4, we get
Pr(f(x)=1 | |x| > n/2) - Pr(f(x)=1 | |x| < n/2) = Omega(1/sqrt(n)).
So one of (a), (b), (c) has the claimed Omega(1/sqrt(n)) correlation.
A lower bound (even with membership queries)
============================================
Theorem: there is a distribution P_s over monotone targets c such that
for any algorithm making at most s membership queries (MQs),
Pr(alg predicts c(x) incorrectly) >= 1/2 - O(log(sn)/sqrt(n)),
where probability is over choice of both c and x.
We will prove a weaker bound, using [log(sn)]^{3/2} instead of log(sn).
Here is P_s.
- let t = lg(3sn).
- Put in each conjunction of t variables into c iid with probability p.
- define p so that x with |x|=n/2 has 50/50 chance of being positive.
I.e.,
(1-p)^{(n/2 choose t)} = 1/2.
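Solving for p is one line (a sketch, not from the lecture; n and s are small illustrative sizes):

```python
import math

n, s = 40, 10                         # illustrative sizes; s = # of MQs
t = round(math.log2(3 * s * n))       # t = lg(3sn), rounded to an int
m = math.comb(n // 2, t)              # an |x|=n/2 example satisfies m
                                      # candidate terms of size t
p = 1 - 0.5 ** (1 / m)                # solves (1-p)^m = 1/2
assert abs((1 - p) ** m - 0.5) < 1e-6
```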
The first part of the analysis: how well can an algorithm do without
making any MQs? If an example x has |x|=n/2 then we can't do better
than guessing, by definition of p. But what if, say, |x| = n/2 - k? Then,
Pr(x is negative) = (1-p)^{(n/2 - k choose t)}
                  = (1/2)^{(n/2 - k choose t)/(n/2 choose t)}
                  ~ (1/2)^{(1 - 2k/n)^t}
                  ~ (1/2)^{1 - 2kt/n}
                  = (1/2)*(1 + O(kt/n)),
so examples with fewer 1s are (slightly) more likely to be negative.
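This approximation chain can be checked numerically (a sketch with illustrative sizes, not from the lecture; the approximation needs kt << n):

```python
import math

n, s = 10_000, 10
t = round(math.log2(3 * s * n))        # t = lg(3sn) = 18 here
k = 100                                 # an example with |x| = n/2 - k

# exact: (1-p)^C(n/2-k, t) = (1/2)^{C(n/2-k,t)/C(n/2,t)}
exact = 0.5 ** (math.comb(n // 2 - k, t) / math.comb(n // 2, t))
approx = 0.5 ** (1 - 2 * k * t / n)     # the (1/2)^{1 - 2kt/n} step
assert exact > 0.5                      # fewer 1s => more likely negative
assert abs(exact - approx) < 0.05
```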
So, say we set k to be O(sqrt(n log n)). There's at most a 1/n chance
that a random x has |x| \not\in [n/2 - k, n/2 + k]. For those rare x,
maybe we can get a good bias, but for the rest, our bias is at most
O(kt/n) = O(log(sn)^{3/2}/sqrt(n)).
So, that's where our log^{3/2} is coming from. Can be more careful
and just make it a single log. But now the rest of the proof is
showing that MQs don't help much.
Here's the idea. Let V_s be a vector of (n choose t) entries, giving
the probability of each term of size t being in the target function
conditioned on the information the alg has so far. So, initially we have:
V_s = (p,p,.....,p).
Now, suppose the algorithm asks a query x and gets back the answer
"negative". Then the conditional distribution is just to zero out
the entries for conjunctions of size t satisfied by x. For positive
answers, to keep the conditional clean, let's actually give the
algorithm *more* information, by telling it the lexicographically
first term in the target that is satisfied by x. So, after a positive
example, we get something like:
V_s = (p,...,p, 0,0,...,0,1,p,....,p)
Claim 1: after s queries, we have <= s ones in V_s.
Proof: obvious.
Claim 2: after s queries, Pr(#zeros > 2s/p) < e^{-s/4}.
Proof: Can think of each query as flipping coins of bias p until get a
heads or else you run out of coins (terms that x might satisfy). So,
the question is equivalent to asking: how many times will we flip a
coin of bias p until we see s heads? In 2s/p flips we expect to see
2s heads. Chernoff says that the chance we see less than half is at
most e^{-expectation/8} = e^{-s/4}. Since the flipping ends at s
heads, this gives the claim.
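A quick Monte Carlo check of Claim 2's coin-flipping view (a sketch, not from the lecture; p and s are sample values, and we let each query's flips run until a head, which can only overcount zeros relative to running out of candidate terms):

```python
import random

random.seed(1)
p, s, trials = 0.05, 50, 1000
exceed = 0
for _ in range(trials):
    zeros = 0
    for _ in range(s):                 # one geometric run per query
        while random.random() >= p:    # each tail zeros out one entry
            zeros += 1
    if zeros > 2 * s / p:              # did we pass 2s/p zeros?
        exceed += 1
# Claim 2: Pr(#zeros > 2s/p) < e^{-s/4}, about 3.7e-6 for s = 50
assert exceed == 0
```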
So, now say we have a random x.
Q: What is Pr(x satisfies a "1-entry")?
A: <= s/2^t = s/(3sn) = 1/(3n).
Q: What is the expected number of 0-entries satisfied?
A: <= (2s/p)/2^t = 2/(3pn).
By Markov, at most 1/sqrt(n) chance it satisfies more than 2/(3p*sqrt(n)).
==> since each term is in target with prob p, this affects the
probability that x is positive by at most p*[2/(3p*sqrt(n))] = 2/(3sqrt(n)).
So, that does it. The queries can only help by O(1/sqrt(n)).
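The final arithmetic, with t = lg(3sn) kept exact so that 2^t = 3sn (a sketch, not from the lecture; p and the sizes are illustrative):

```python
import math

n, s, p = 40, 10, 0.01
t = math.log2(3 * s * n)                 # keep t exact: 2^t = 3sn
# Pr(x satisfies a 1-entry) <= s/2^t = 1/(3n)
assert abs(s / 2 ** t - 1 / (3 * n)) < 1e-9
# E[# 0-entries satisfied] <= (2s/p)/2^t = 2/(3pn)
assert abs((2 * s / p) / 2 ** t - 2 / (3 * p * n)) < 1e-6
# Markov: at most 1/sqrt(n) chance of more than sqrt(n) times that,
# and p times 2/(3p*sqrt(n)) terms shifts Pr(x positive) by 2/(3*sqrt(n))
shift = p * (2 / (3 * p * math.sqrt(n)))
assert abs(shift - 2 / (3 * math.sqrt(n))) < 1e-12
```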
Open problems
=============
Hard open question: strong-learn polynomial-size monotone DNF formulas
without queries?
Probably easier problem: close the log(n) gap.