Notes for 2/4/98 [current material is in the book.] Last time: "Occam's razor theorem": simple hypotheses are trustworthy since not too many of them. Another way to think about it: if you can find a simple enough hypothesis, then you can be confident with less data. One nice way of stating this is given as Theorem 2.1 in the book: Theorem: Suppose alg A, given m examples labeled according to c in C finds a consistent hypothesis h in H such that size(h) < (n*size(c))^alpha * m^beta where beta < 1. Then A PAC learns C by H. Proof is just that since RHS grows less than linear with m, it won't be too long until size(h) is sufficiently smaller than m to apply our theorem. --> compression implies learning Example: learning a monotone disjunction with few relevant variables. Goal: use fact that target has few relevants to get away with less data. Standard alg, get m = 1/epsilon[log(2^n) + log(1/delta)], or about n/epsilon We saw one way to take advantage of having few relevant variables using Winnow, which had linear threshold hypotheses. Analyzed in MB model. Can convert to PAC using conversion described last time. Another way: "Greedy set cover" algorithm. Given a sample of m data points, throw out all inconsistent variables. But, instead of just returning the OR of all of them, repeatedly do this: - pick the variable that covers the *most* positive examples, and put into h. - throw out the (positive) examples covered. Idea: if we're lucky, we'll have found a consistent OR function with not too many variables, so we don't need as much data. Theorem: if the data is consistent with a disjunction of r variables, then we will find a disjunction of at most r*log(m) variables. Proof: each time we throw out at least a 1/r fraction of the positive examples remaining. So, after t steps, the number remaining is at most m(1-1/r)^t. This is less than 1 for t = r*log(m). To get a PAC guarantee, just need all bad "small" disjunctions to be inconsistent. One way to say this: let's say we describe a disjunction by listing the variables in it. Takes log(n) bits to describe a variable. So, size(h) = (# vars in h)*log(n). So, to learn in PAC model, suffices to see m examples for m > 1/epsilon [r*log(m)*log(n)*ln(2) + ln(1/delta)] This is a mess to calculate exactly but ignoring delta and loglog terms this comes out to O(1/epsilon [r*log(n)*log(r/epsilon)]) Is having too *many* choices of hypothesis necessarily bad? ---------------------------------------------------------- One way to look at our argument: if H has N hypotheses of error epsilon, then E[ # consistent ] = |H|(1-eps)^m. Now we note that if expected number of events is small (say 0.1) then Prob(at least 1 event occurs) must be small too. Can we say that if expectation is large, then prob(at least 1) is large too? Answer: it depends. This is often not true. Could be tiny chance of (at least 1) but when it occurs you get a lot. E.g, if you buy 100 lottery tickets then expected[#dollars won] > 1 but prob[>0] is small. BUT it is true when events are INDEPENDENT. If INDEP, then expectation = 1 implies reasonable prob of at least 1. E.g., say the events that the h in H are consistent with sample are independent. Then, if m = 1/epsilon * ln(N), then for a fixed h in H, the probability it is consistent with data is (1-epsilon)^m, which is roughly 1/N. So, expectation is approx 1, and Pr[at least 1 consistent] ~ 1 - (1 - 1/N)^N ~ 1 - 1/e So, we have a chance about 1 - 1/e of being fooled. VC dimension ------------ To get a better handle on our question, we need a way of measuring how many "really different" hypotheses there are under consideration. VC dimension, and the "effective number of hypotheses" are a way of getting at this. First, another motivating problem: Consider problem of learning an initial subinterval of [0,1]. Example is a real number in [0,1]. C = {[0,a] : a <= 1} Here's an algorithm: pick the interval [0,b] where b is the largest positive example seen so far. Question: how much data does this need to achieve PAC guarantee? Answer: consider the region [a',a] of probability measure espilon. If we see an example inside that region, then we're done. Prob we fail in m examples is (1-epsilon)^m. ==> 1/epsilon * log(1/delta) are sufficient. We're doing a lot better than formula that had log(|C|) in it. Why?? --> Only one "degree of freedom". How much is a degree of freedom worth? Are they all worth the same? --> Look at "effective number" of concepts as a function of number of data points seen. Only grows linearly.... Turns out, can use this "effective number" in place of actual number. But, it's not obvious. Define: C[m] = max number of ways to split m points using concepts in C. (book calls this PI_C(m)) Theorem: If number of examples seen m satisfies m > 2/epsilon * [log_2(2C[2m]) + log_2(1/delta)] then whp (1-delta), all bad (error > epsilon) hypotheses in C are inconsistent with data. C[m] is sometimes hard to calculate exactly, but turns out can get a pretty tight bound on it using something called "VC-dimension". ==> go to slides