15-681 Machine Learning. Lecture 9/5 o Recap: The PAC model, intervals. o PAC and consistency, conjunctions o decision lists o Occam's razor ========================================================================= Note: web page has all handouts, plus lecture notes, slides, etc. Note: See Terri in Wean 4117 if want to reg or reg'd late. ***The PAC model*** Assume seeing data from some probability distribution D over instance space X. Assume labeled by some target concept c in C. Defn: Learner learns class C (by H) in PAC model if for any target concept c in C, any distribution D, any epsilon,delta > 0: with probability > 1-delta it finds a hypothesis h in H with error < epsilon. Allowed: Time, examples poly(n,size(c),1/epsilon,1/delta). Example from last time: learning an initial subinterval. C = { [0,a]: a > 0} target concept is some specific interval. _________________ ____ _____ _____/ \______/ \ / \ / \___/ \____ |-----------------|----------|-------------------------| a_epsilon a Algorithm: produce hypothesis [0,b], where b is the largest positive example seen. Analyze in PAC model. Clearly poly time per example. Question: how many examples? Idea: define a_epsilon as the number less than *a* such that the interval [a_epsilon, a] has probability epsilon. Notice that we succeed exactly when we have seen some example in this range. Say we see M examples. Want Prob(failure) < delta. Prob(seen no examples in [a_epsilon, a]) < delta. (1-epsilon)^M < delta. [Note: for a given example, it's in range with prob epsilon, so not in range with prob 1-epsilon. Want the AND of M independent events, so multiply them to get probability.] Solve for M. We want: M ln(1-epsilon) < ln(delta), or -M ln(1-epsilon) > ln(1/delta) M > ln(1/delta) / [-ln(1-epsilon)] Use: -ln(1-x) = x + x^2/2 + x^3/3..., so suffices to have: M > (1/epsilon) ln(1/delta). This is polynomial in 1/epsilon and 1/delta. If we see this many examples, w.h.p we will have seen one in the desired range (even though we don't know what the range is) and therefore we will have succeeded. Let's step back a bit. Notes: 1. H=C is saying we want to learn anything we can represent. Could make H = poly-time programs. (i.e., we don't care what representation the learner uses so long as it can predict well) then this is typically called ``PAC prediction''. 2. Could make H smaller than C. define epsilon_H = error of best hypothesis in H. Then define "good" as error only epsilon_H + epsilon. E.g., set C = all functions. Goal is to find best h in H without assumptions. E.g: say we don't assume anything about our data in the 1-d example above. We might still want to find the "approximately best" subinterval. In fact you can do that by taking a bunch of data, looking at interval [0,b] for all positive examples b, and taking the one with least error. But, it takes a more sophisticated analysis to prove that this works. 3. Model is robust. Say we fix epsilon=1/4, delta = 1/4. Get same notion of learnability. ***Conjunctions*** Remember last time we were asking "why should consistency mean anything about predicting well on new examples?" Let's look at our FIND-S algorithm for finding the most specific conjunction, and try to prove that it succeeds in the PAC model. In fact, will show that any algorithm that finds a consistent conjunction (and has polynomial running time per example) works. Why find the most specific? Because it's something we can do efficiently. Finding the most general (i.e., smallest) conjunction is NP-hard. Here's a nice way to prove it. 1. Consider some specific ``bad'' conjunction whose error is at least epsilon. The probability that this bad conjunction is consistent with M examples drawn from D is at most (1-epsilon)^M. 2. How many conjunctions are there over n variables: 3^n. 3. (1) and (2) imply that given M examples drawn from D, the probability THERE EXISTS a bad conjunction consistent with all of them is at most: 3^n(1-epsilon)^M. Why? Prob(A or B) <= Prob(A) + Prob(B). Let's think about this: Suppose that M is sufficiently large so that 3^n(1-epsilon)^M < delta. That means that with probability (1-delta), all BAD conjunctions are INCONSISTENT. Therefore, our algorithm (which finds a consistent one) succeeds. Solving for M just like before: M ln(1-epsilon) + n ln(3) < ln(delta) M > (1/epsilon)[n ln(3) + ln(1/delta)] We're polynomial in 1/epsilon, 1/delta, and n. ***PAC and CONSISTENCY*** Let's generalize from this one example.... THEOREM: Say we have an efficient (fast enough running time) algorithm for learning class C in the consistency model. Then we can learn C in the PAC model, and we need at most (1/epsilon)[ ln|C| + ln(1/delta) ] examples. -> In PAC setting, consistency DOES imply predicting well on new data, so long as class of concepts is restricted. Gives an answer to why the conjunction alg seemed more like it was learning than the DNF alg that basically memorized the data. If you let C = DNF formulas, then |C| = 2^2^n, so ln(C) is exponential in n. This is saying you aren't guaranteed success until you see (roughly) all the examples. ***Another example: Decision lists*** Example: If x_1 then positive else if x_3 then negative else if x_5 then positive else negative. How many decision lists are there? Hard to count exactly but how about an upper bound: 2n (or 4n if allow negations) different rules. So at most (4n)! different lists. Sounds like a lot, but we get to take the log, so we're OK in PAC model. How about finding a consistent one? Is there a consistent decision list? 1 0 0 1 1 + 0 1 1 0 0 - 1 1 1 0 0 + 0 0 0 1 0 - 1 1 0 1 1 + 1 0 0 0 1 - Algorithm: build from top down, putting down any consistent rule you can that classifies at least 1 example. Works because you never get stuck. Look at examples you haven't classified so far. Each is classified by some rule in target function. Top one of these is a good legal rule to put down next. ***OCCAM's RAZOR*** William of Occam (approx 1320 AD) said: ``Entities should not be multiplied unnecessarily'' (in Latin) There are a bunch of ways you could interpret this but the simplest one is: all things equal, prefer simpler explanations. Let's try to justify this in the PAC model. Let's write down all explanations from simplest on up: h_1, h_2, h_3, h_4,... Let's define "size" or "complexity" of explanation h_i as number of bits it takes me to tell it to you. So, there are at most 2^s hypotheses of size < s. (Since there are at most 2^s different strings of length < s). Remember we were arguing that for a hypothesis class H, w.h.p. there are no bad concepts consistent with the data so long as |H|*(1-epsilon)^M < delta, or, |H| < delta*e^{epsilon*M} Let's apply to our setting. What is the chance that there is a bad consistent hypothesis of size < s? Replace |H| by 2^s. w.h.p. there are no bad consistent hypotheses with size less than s < 1/ln(2) (epsilon*M - ln(1/delta)) What does this mean? Can think of this as saying that simple (small size) hypotheses are "safe" in the sense that it's unlikely one of them is bad and yet looks consistent with the data. And, the reason is that there aren't that many simple hypotheses. Therefore, Occam's razor is a good idea. Really saying that ANY bias is fine (doesn't say that MY notion of simpler is better than YOUR notion of simpler). The only requirement is that you specify the bias ahead of time. Of course, there is no guarantee that there EXISTS a consistent small hypothesis. Our statement could be satisfied by all small hypotheses being inconsistent. So, a good ordering is one in which you put the things you think are more likely to be true first, to increase the chances that there will in fact be a small consistent hypothesis. **To end up** Our bound 1/epsilon(ln(|C| + ln(1/delta)) is an UPPER BOUND. It's a guarantee, but you probably in practice will need many fewer examples. In practice, you would train on some data, and then test on some new data to see if in fact your error was small enough. Where is overcounting coming from: 1. assuming all bad hypotheses are sitting at just above epsilon error. In practice a lot of bad hypotheses are really bad and get thrown out much sooner. 2. using prob(A or B) < prob(A) + prob(B). If hypotheses are correlated, then fact that one is inconsistent means the other is probably inconsistent too. E.g., there are an infinite number of intervals [0,a] and yet we got a nice bound by a direct analysis.