Notes for Feb 11, 1998 -> probabilistic inequalities -> weak and strong learning
=================================================================================

So far, we have talked about guarantees that are useful when there exists a
consistent hypothesis.  But what about the common case where there isn't?
What kind of guarantee might we want then?

 - All hyps that look good are actually pretty good.
 - All hyps have observed error close to their true error.
   (called uniform convergence bounds)

--> See the Probabilistic inequality notes.

How to prove Chernoff bounds (roughly)
--------------------------------------
Say we have a fair coin.  Flip it m times.  We expect m/2 heads.  We want
to show that

    Pr[# heads >= m/2 + k*sqrt(m)/2] <= e^{-(3/8)k^2}

(i.e., k is the number of standard devs, since the standard deviation of
the number of heads is sqrt(m)/2).  So, the chance that the number of heads
is >= k standard devs above the expectation is at most e^{-0.375 k^2}.

Proof: Imagine a weird random walk on the real line.  Start at 1.  Flip a
coin.  If heads, you multiply your position by (1 + k/(2*sqrt(m))), and if
tails you multiply by (1 - k/(2*sqrt(m))).  This is not exactly fair, since
if you have equal numbers of heads and tails you end up to the left.  E.g.,
if you have one of each, your final position is 1 - k^2/(4m).  On the other
hand, it turns out the EXPECTED value of your final position is 1.

To calculate this formally, define the random variable Y_i to be the value
you multiply your position by on the ith coin toss.  So, the final position
is Y = 1 * Y_1 * Y_2 * ... * Y_m.

    E[Y_i] = 1/2 * (1 + k/(2*sqrt(m))) + 1/2 * (1 - k/(2*sqrt(m))) = 1.
    E[Y]   = product of the E[Y_i]  (because they are INDEPENDENT)  = 1.

A simple inequality called "Markov's inequality" says that for any
non-negative random variable X, Pr[X >= t*E[X]] <= 1/t.  (Can anyone see
why this is true?)  Applying this to Y with t = e^{(3/8)k^2}:

    Pr[Y >= e^{(3/8)k^2}] <= e^{-(3/8)k^2}.

But now, notice that it doesn't take *too* many extra heads to pull Y up a
lot, since it is multiplicative.  In particular, if we get m/2 + k*sqrt(m)/2
heads (and m/2 - k*sqrt(m)/2 tails), then solving for Y (first pairing up
the heads with the tails, then looking at the k*sqrt(m) extra heads), we get

    Y = (1 - k^2/(4m))^{m/2 - k*sqrt(m)/2} * (1 + k/(2*sqrt(m)))^{k*sqrt(m)}

which (using (1+epsilon)^n is approximately e^{epsilon*n}) is approximately

    e^{-(k^2)/8 + (k^3)/(8*sqrt(m)) + (k^2)/2}  >=  e^{(3/8)k^2}.

So getting >= m/2 + k*sqrt(m)/2 heads forces Y >= e^{(3/8)k^2}, which by the
Markov bound above happens with probability at most e^{-(3/8)k^2}.
(Technically, one of the above "approx"s is slightly in the wrong
direction...)
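To see the bound in action, here is a small simulation in Python (not part
of the original derivation; the choices m = 1000, k = 2, and 20000 trials
are arbitrary, and the function names are mine).  It estimates
Pr[# heads >= m/2 + k*sqrt(m)/2] empirically, compares it to e^{-(3/8)k^2},
and checks that the average final position of the multiplicative walk is
close to 1, as the E[Y] = 1 calculation says it should be.

    # Rough empirical check of the Chernoff bound and the multiplicative walk.
    import math
    import random

    def walk_trial(m, k):
        """Flip m fair coins; return (# heads, final walk position Y)."""
        up = 1 + k / (2 * math.sqrt(m))
        down = 1 - k / (2 * math.sqrt(m))
        heads, y = 0, 1.0
        for _ in range(m):
            if random.random() < 0.5:
                heads += 1
                y *= up
            else:
                y *= down
        return heads, y

    def check_bound(m=1000, k=2.0, trials=20000):
        threshold = m / 2 + k * math.sqrt(m) / 2
        exceed, y_sum = 0, 0.0
        for _ in range(trials):
            heads, y = walk_trial(m, k)
            y_sum += y
            if heads >= threshold:
                exceed += 1
        print("empirical Pr[#heads >= m/2 + k*sqrt(m)/2]:", exceed / trials)
        print("Chernoff bound e^{-(3/8)k^2}:             ", math.exp(-0.375 * k * k))
        print("average final walk position (E[Y] = 1):   ", y_sum / trials)

    if __name__ == "__main__":
        check_bound()

With the defaults, the empirical frequency should come out well below the
e^{-(3/8)k^2} value (the bound is far from tight for a fair coin), and the
average walk position should be close to 1.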
Weak and strong learning
------------------------
We defined the PAC (prediction) model like this:

Def 1: Alg A PAC-learns concept class C if for any c in C, any distribution
D, and any epsilon, delta > 0, with probability 1-delta, A produces a
poly-time evaluatable hypothesis h with error at most epsilon, given
poly(n, 1/epsilon, 1/delta, size(c)) examples and running time.

What if we change the definition to say delta = 1/2?  Or even
delta = 1 - 1/poly(n)?  I.e., there is *some* noticeable chance that it
succeeds.

Def 2: Same as Def 1, but replace "any delta > 0" with
"some delta < 1 - 1/poly(n)".

If C is learnable using Def 2, is it also learnable using Def 1?  Yes:
given an alg A that works using Def 2, run it poly(n)*log(1/delta) times,
with epsilon/4 as the error parameter.  Set the number of runs so that with
probability 1-delta/3, at least one of those runs succeeded.  Now draw a
new sample of m points and test the hypotheses on it, picking one whose
observed error is < epsilon/2.  Use Chernoff bounds:

    Pr(the good hyp looks bad)      <  e^{-m*epsilon/12}
    Pr(a given bad hyp looks good)  <  e^{-m*epsilon/8}

We want the first quantity to be less than delta/3, and the second to be
less than delta/(3*(# of hypotheses)), say delta^2/poly(n).  Solving,
m = O((1/epsilon)*log(poly(n)/delta)).  (A code sketch of this procedure
appears at the end of these notes.)

So, delta is not so important: we can fix it to 1/2 and still get the same
notion of "learnability".  What about epsilon?

Def 3: Alg A WEAK-LEARNS class C if for all c in C and all distributions D,
THERE EXIST epsilon, delta > 1/poly(n,s) such that A achieves error
1/2 - epsilon with probability at least delta.  I.e., with some noticeable
probability you get a noticeable correlation.

Question: Suppose we had defined the PAC model this way.  Does this change
the notion of what is learnable and what is not?

Answer: It doesn't.  Given an algorithm that satisfies Def 3, we can
"boost" it to an algorithm that satisfies Def 1.
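That accuracy boosting (Def 3 -> Def 1) is a bigger construction than the
one above.  The sketch below, promised earlier, is only of the simpler
delta-boosting procedure from the Def 2 -> Def 1 argument.  It is not from
the original notes: learner, draw, and the other names are hypothetical
stand-ins for whatever Def 2 algorithm and example oracle you actually have.

    def observed_error(h, sample):
        """Fraction of the labeled (x, y) pairs that hypothesis h gets wrong."""
        return sum(1 for x, y in sample if h(x) != y) / len(sample)

    def boost_confidence(learner, draw, epsilon, runs, m):
        """Delta-boosting sketch (Def 2 -> Def 1).

        learner(draw, eps): the Def 2 algorithm; it may only rarely return a
            hypothesis with error <= eps.
        draw(m): returns m labeled examples (x, y) drawn i.i.d. from D.
        Per the notes, runs should be about poly(n)*log(1/delta) and m about
        O((1/epsilon)*log(poly(n)/delta)) to succeed with probability 1-delta.
        """
        # Run the low-confidence learner many times with error parameter eps/4.
        candidates = [learner(draw, epsilon / 4) for _ in range(runs)]
        # Test all candidates on one fresh validation sample.
        test_sample = draw(m)
        best = min(candidates, key=lambda h: observed_error(h, test_sample))
        if observed_error(best, test_sample) < epsilon / 2:
            return best
        # Every run failed; with runs and m set as above this happens with
        # probability at most delta.
        return None

A call would look something like boost_confidence(A, draw, 0.1, runs, m) for
whatever A and draw you actually have; the point is just that the fresh
validation sample is what turns "succeeds once in a while" into "succeeds
with probability 1-delta".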