OCCAM'S RAZOR ------------- Today: start discussion of Occam's razor. Start with mathematical caluclations. Come up with some theorems. Try to interpret. Look at some experimental results and try to interpret them too. Take today plus part of Thursday. --> Assignment for Thurs: look at "Decision Forest" paper. Occam's razor: - why is it often a good idea? (language? psych? physics? mathematics?) - when is it a good idea? - If two people have different ordering of "simplicity" how can it possibly be a good idea for both? questions to try to address: (1) Why pick simple explanations? (2) Are they generally better than complicated ones? (3) Given two hyps A and B consistent with data, one simple one more complicated, should you prefer the simpler? Start by doing analysis from PAC model. -------------------------------------- What do we mean by simple? Each have different notions. For each of us, can imagine ordering possible explanations in the world so that if A is simpler than B then A < B. Key idea will be: "simple hypotheses that look good are probably truly good because there aren't too many of them --- so it's unlikely one will just have fooled you by chance." In particular, can do calculation just like did last week: if # samples m > (1/epsilon)*(ln(N) + ln(1/delta)), then w.h.p, there is no bad consistent hypothesis with index < N. ("bad" hyp is one with error > epsilon.) epsilon=10%,delta=5%, get: m = 10(ln(N) + 3). Or, to tie into more natural notion of "simplicity". Say use "description length in English" and I only allow you rules you can explain in L letters. (maybe L=100). Then feel confident if: m >= (1/epsilon)*(ln(26^L) + ln(1/delta) == roughly 33(L+1) for 10%/5%. This is way overcounting since only very few strings of characters of length < 100 are hypotheses. DEFINE: A legal "size" function of a hypothesis is one such that there are at most 2^s hypotheses of size s or less. IN PARTICULAR: use size = log_2(index) = index in binary. Then we get: If draw m examples from distribution, Prob (over the draw of examples) [there exists a consistent bad hyp of size <=s] < 2^s * (1-epsilon)^m -- and by rearranging -- (just like last week) | Prob(exists bad consistent hyp of size <=s | have seen >= 1/epsilon (sln2 + ln(1/delta)) examples) < delta. | This statement says: if I see m data points, with high prob all consistent simple hyps will be good. DOESN'T SAY: if I see m data pts and there is a consistent simple hyp, then with high prob it will be good. less cryptically: This doesn't say: Prob(all consistent simple hyps are good | see m data pts AND there is a simple consistent hyp) is high. -------------------------- Push a little further on this. Say there is no simple consistent hyp. Want to keep asking for more data until some simple enough hypothesis is consistent and then stop. I.e., Want to say: feel confident if hyp is small enough compared with data and Not give a hard boundary beforehand. One thing might try. Fix delta=10%. Get some m examples. See if a suff. small hyp is consistent. If so, stop. Otherwise draw more examples, recalculate s (again with delta=10%) and see again. Keep going until get small consistent hyp. How confident are we? Can we be 90% sure???? Problem: NEED TO HAVE DELTA DECREASE. THEOREM: Suppose alg only outputs a hypothesis if (A) it's consistent with data, and (B) size of hyp s satisfies: (1/eps)*(s ln2 + ln(2s^2 / delta)) <= # examples. ---or, roughly equivalently--- s <= (1/ln2)*[m*epsilon - ln(2m^2/delta)] (m=#examples) THEN with prob >= 1-delta it never outputs a bad hypothesis. i.e., if can compress this much, then can be confident. WHY: Prob(failure) | | | | | |____________________________ s Chance ever fails is at most sum (s=1 to infinity) of delta/2s^2. = delta/2 * sum(1/s^2). Last is pi^2/6. < delta. Discussion/interpretation: --------------- A. Does not say that expect good results if condition on hyps of fixed size s being consistent. E.g., if pick s too small, the only times its consistent will by definition be times when data was not representative. B. Doesn't address (2), (3). Let's try to address (2). First, can't hope to say "complicated hyps are bad" because some will by luck happen to be good. So most we could hope to say (in PAC framework) is "complicated hyps don't inspire confidence". If you think about it, this is really the point of Occam's razor anyway. Sometimes, though, this is just wrong (according to above defn of simple). IN FACT, sometimes complicated ones can be just as good or better: --> enjoy-sport hypothesis A = sunny hypothesis B = sunny or wins_lottery or (July and snowy). ___ _ _ _ --> line with strange slope vs __| |_______| |_| |_| |_____. (i.e., 2nd function is intuitively "more complicated" even though line takes more bits to write down.) ---> foreshadow later discussion of VC dim <---- Still, can do analysis to get at (2), but will require assumption that is usually COMPLETELY FALSE. Do anyway because helps to understand where "Occam's razor" or "MDL" breaks down. Say: - N hypotheses - ALL ARE INDEPENDENT (for any set of hyps S, Pr(all correct) = product of probs each one is correct) (this is the STRONG STRONG OFTEN VERY BAD ASSUMPTION) Now, see m examples and some hyps are consistent. What kind of true error can you expect? Say all have error epsilon. Expected[# consistent] = N(1-eps)^m. (don't need indep for this) Now, want to say that if N is large enough, then good chance that at least one hyp just by luck will be consistent. Will mean that no reason to believe that consistent hyp has error < epsilon. Note: if expected number of events is small (say 0.1) then Prob(at least 1 event occurs) must be small too. This is analysis from before. Want to say here is that if expectation is large, then prob(at least 1) is large too. This is often not true. Could be tiny chance of (at least 1) but when it occurs you get a lot. E.g, if buy 5 lottery tickets then expected[#dollars won] > 1 but prob[>0] is small. BUT Is true when events are INDEPENDENT. If INDEP, then expectation = 1 implies reasonable prob of at least 1. Say N = 1/(1-eps)^m, so Expected[#] = 1. Then, Pr(at least one consistent) = 1 - (chance ALL are NOT consisent) = 1 - (1 - 1/N)^N = 1 - 1/e = 0.63... So, if independent, this says that to be confident in error < epsilon you should have N < 1/(1-eps)^m. Use this to explain real data. From presidential elections. Crook county voted for the winner 27 times in a row. What do you think is the chance they will get the right answer on the next one? --> show slide of newpaper article. Note: 3141 counties. --> go through calculation to get: between 0.7 and 0.75 prob correct. (Using first calculation we get that it's unlikely (< 20% chance) that ANY hyp of error > 30% survives, and using independence calculation, we get that if all have error 25% then its likely at least one will survive). [This ignores the problem that if no such county existed then the article might have been about cities, etc.] Suppose all 0.75 predictors, then expected number surviving: | # | surviving | 4.2 | 2.4 |_______________1.33__ 1976 1984 1992 ACTUAL NUMBER SURVIVING: 4 in 1976, 2 in 1984, 1 in 1992. Summary: In PAC model, simple-enough consistent hyps are trustworthy, and complicated ones may or may not be: less so if more independence.