15-859(B) Machine Learning Theory                              01/31/02

* Occam recap
* Chernoff and Hoeffding bounds
* relations between PAC and MB models

========================================================================

Occam recap
-----------

Chernoff and Hoeffding bounds   -> see inequalities handout.
-----------------------------

Relating PAC and Mistake-Bound models:
--------------------------------------
The PAC model should be easier than the MB model, since we are
restricting the examples to come from a distribution.  We can make
this formal:

Theorem: Say we have an algorithm for learning C in the MB model with
some mistake bound M.  Then we can PAC-learn using a training set of
size O((M/epsilon) * log(M/delta)).

Proof: (think of the Winnow algorithm)  Assume the MB algorithm is
"conservative", i.e., it changes its hypothesis only when it makes a
mistake.  Run it on the data set and look at the sequence of
hypotheses produced: h_1, h_2, ...  For each one, if it is consistent
with the following (1/epsilon)*log(M/delta) examples, then stop.  If
h_i has true error > epsilon, the chance we stopped at h_i is at most
delta/M.  So there is at most a delta chance we are fooled by *any* of
the hypotheses.

You can actually get a better bound of O((1/epsilon)[M + log(1/delta)])
using the Chernoff bounds we just talked about.  This is nice because
it looks just like our standard PAC bound, with M replacing ln|H|.
Here's how we can do it.  First of all, we're going to split our data
set into two pieces.  We'll use the first piece S1 to generate
hypotheses, and then we'll use the second piece S2 as a "hold-out set"
to test them.  What we want is (1) at least one of the hypotheses
generated from S1 has true error < epsilon/2, and (2) S2 is big enough
so that whp the one of error < epsilon/2 looks better than any of the
(at most M-1) hypotheses of error > epsilon.

To analyze the first part, think of flipping a coin of bias epsilon/2.
Suppose we can show that in N flips, the probability of seeing M or
more heads is at least 1-delta.  Then that means the probability of
success in part (1) using an S1 of size N is also at least 1-delta.
(Reason: so long as the current hypothesis has true error >= epsilon/2,
each fresh example causes a mistake with probability at least
epsilon/2; since a conservative algorithm makes at most M mistakes,
whp some hypothesis of error < epsilon/2 must appear along the way.)
So, all we have to do is solve for N.  Use Chernoff bounds.  Set
N = max(4M/epsilon, (16/epsilon)ln(1/delta)).  The expected number of
heads is N*epsilon/2.  By Chernoff, the chance we see less than half
of that (less than N*epsilon/4, which is at least M) is at most
e^{-(expectation)/8}, which is at most delta.

Now we'll analyze the 2nd part.  We want the best of the M h's to look
better than all the ones of true error > epsilon.  Just to be crude,
let's pick the threshold 3*epsilon/4 and argue that whp the good one
has empirical error below it, and whp all the bad ones have empirical
error above it.  Let's assume the good one has true error *exactly*
epsilon/2, just for clarity.  The claim is that we just need
(32/epsilon)ln(M/delta) examples: the first event has failure prob at
most e^{-(expectation)/12} < delta/M, and each of the other (at most
M-1) events has failure prob at most e^{-(expectation)/32} <= delta/M.
So, that's it.  Notice that the size of S2 is a low-order term (since
it involves just log(M) rather than M).
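
As a concrete illustration (not part of the original notes), here is a
minimal Python sketch of the simpler conversion from the theorem's
proof, i.e. the O((M/epsilon)*log(M/delta)) version rather than the
improved S1/S2 version.  It assumes a hypothetical interface in which
the conservative MB learner exposes predict(x) and update(x, y) methods
and changes its hypothesis only when update is called on a mistake; the
names mb_to_pac, predict, and update are illustrative, not from any
particular library.

    import math

    def mb_to_pac(mb_learner, examples, M, epsilon, delta):
        """Convert a conservative mistake-bound learner into a PAC learner.

        mb_learner: hypothetical object with predict(x) and update(x, y)
                    methods; being conservative, it changes its hypothesis
                    only when update is called, i.e. only on mistakes.
        examples:   iterable of (x, y) pairs drawn i.i.d. from D.
        """
        # Accept the current hypothesis once it is consistent with this
        # many consecutive fresh examples (the (1/epsilon)*log(M/delta)
        # consistency test from the proof; log here is the natural log).
        streak_needed = math.ceil((1.0 / epsilon) * math.log(M / delta))
        streak = 0
        for x, y in examples:
            if mb_learner.predict(x) == y:
                streak += 1
                if streak >= streak_needed:
                    return mb_learner          # this hypothesis passed the test
            else:
                mb_learner.update(x, y)        # mistake: hypothesis may change
                streak = 0                     # restart the consistency count
        return mb_learner                      # data exhausted (whp does not happen)

Since a conservative learner switches hypotheses at most M times, the
loop consumes at most roughly (M+1) times the streak length, i.e.
O((M/epsilon)*log(M/delta)) examples, before some hypothesis passes the
consistency test.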
========================================================================

How to prove Chernoff/Hoeffding bounds (roughly)
------------------------------------------------

Say we have a fair coin.  Flip it m times.  We expect m/2 heads.
Let's show that

    Pr[# heads >= m/2 + k*sqrt(m)/2]  <  e^{-(3/8)k^2}

(i.e., k is the number of standard deviations).  So, the chance that
the number of heads is >= k standard devs above the expectation is at
most e^{-0.375 k^2}.

Proof: Imagine a weird random walk on the real line.  Start at 1.
Flip a coin.  If heads, you multiply your position by
(1 + k/(2*sqrt(m))), and if tails you multiply by (1 - k/(2*sqrt(m))).
This is not exactly fair, since if you have equal numbers of heads and
tails, you end up to the left.  E.g., if you have one of each, then
your final position is 1 - k^2/(4m).  On the other hand, it turns out
that the EXPECTED value of your final position is 1.  To calculate
this formally, define the random variable Y_i to be the value you
multiply your position by on the ith coin toss.  So, the final
position is Y = 1 * Y_1 * Y_2 * ... * Y_m.

    E[Y_i] = (1/2)(1 + k/(2*sqrt(m))) + (1/2)(1 - k/(2*sqrt(m))) = 1.
    E[Y]   = product of the E[Y_i]  (because they are INDEPENDENT) = 1.

A simple inequality called "Markov's inequality" says that for any
non-negative random variable X, Pr[X >= t*E[X]] <= 1/t.  (Can anyone
see why this is true?)  Applying it to Y:

    Pr[Y >= e^{(3/8)k^2}]  <=  e^{-(3/8)k^2}.

But now, notice that it doesn't take *too* many extra heads to pull Y
up a lot, since it is multiplicative.  In particular, if we get
m/2 + k*sqrt(m)/2 heads (and m/2 - k*sqrt(m)/2 tails), then solving
for Y (first pairing up the heads with the tails, then looking at the
extra heads), we get:

    Y = (1 - k^2/(4m))^{m/2 - k*sqrt(m)/2} * (1 + k/(2*sqrt(m)))^{k*sqrt(m)}

which (using the fact that (1+epsilon)^n is approximately e^{epsilon*n})
is approximately

    e^{-k^2/8 + k^3/(8*sqrt(m)) + k^2/2}  >  e^{(3/8)k^2}.

Since Y only gets larger with more heads, getting at least
m/2 + k*sqrt(m)/2 heads forces Y >= e^{(3/8)k^2}, which by Markov
happens with probability at most e^{-(3/8)k^2}.  (Technically, one of
the "approx"s above is slightly in the wrong direction...)
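
To make the tail bound concrete, here is a small simulation sketch (my
own illustration, again not part of the notes): it runs the
multiplicative walk alongside m fair coin flips, estimates
Pr[# heads >= m/2 + k*sqrt(m)/2] empirically, compares it against
e^{-(3/8)k^2}, and checks that the average final position of the walk
is close to its expectation of 1.  The parameters m, k, and the number
of runs are arbitrary choices.

    import math
    import random

    def walk_trial(m, k):
        """m fair-coin flips; returns (number of heads, final walk position Y)."""
        up = 1 + k / (2 * math.sqrt(m))      # multiplier on heads
        down = 1 - k / (2 * math.sqrt(m))    # multiplier on tails
        heads, y = 0, 1.0
        for _ in range(m):
            if random.random() < 0.5:
                heads += 1
                y *= up
            else:
                y *= down
        return heads, y

    # Arbitrary illustrative parameters.
    m, k, runs = 2000, 2.0, 5000
    threshold = m / 2 + k * math.sqrt(m) / 2   # k standard deviations above the mean
    exceed, y_total = 0, 0.0
    for _ in range(runs):
        heads, y = walk_trial(m, k)
        exceed += heads >= threshold
        y_total += y

    print("empirical Pr[#heads >= m/2 + k*sqrt(m)/2]:", exceed / runs)
    print("bound e^{-(3/8)k^2}:                      ", math.exp(-3 * k * k / 8))
    print("average final position (E[Y] = 1 exactly):", y_total / runs)

Of course, in the proof the walk is purely an analysis device (nothing
is ever actually "run"); the simulation just illustrates that E[Y] = 1
and that the observed tail probability sits below the bound.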