LAST TIME: went though some calculations in PAC model related to Occam's razor. This time, describe one more pt of view: BAYESIAN LAST TIME. Summary: In PAC model, for any legal defn of "simple" that you have, simple-enough consistent hyps are trustworthy because there are not so many of them. Complicated ones may or may not be: less so if more independence. ( line vs complicated-looking function: lines not indep Pr(line A predicts correctly | line B does) >> initial prob. ) PAC theorem really about bias: for any bias you pick, how much data lets you justify going a certain distance down your list? "Simple" is just a bias we can agree on. DIDN'T GET AT: why should there ever exist simple consistent hyps. Maybe the way in which thm is satisfied is just that rarely see simple consistent hyps. --->> Why should there exist simple good hyps??? BAYESIAN/MDL view ----------------- In some sense, only way to really answer "should I prefer A to B?" is to be able to say: "which was a priori more likely?" What do we mean by "which is more plausible?" One way to look at this: model the world as having a prob distribution on explanations (target concepts) and imagine that the world picks one at random (according to this distribution) to be correct one. Enjoy-sport: say everyone in the world has an enjoy-sport function which is a conjunction. Different people have different functions --> frequency distribution. Think of friend as random from this distrib. Now it makes sense to talk about which of A or B is more likely to be correct. A B C D .... Say initially: 20%, 10%, 10%, 5%, 5%, ...., 5%. Now, only B,C,D survive. Posterior probs are 40%,40%,20%. Now, really the target wasn't selected by a distribution, but Bayesian approach is to put set of priors -- how likely did you think it was that H was the correct answer before you saw the data --- on the hypotheses. Use this to calculate posterior probs. Treat above probs as our degrees of belief that A,B,... was correct. Advantages: (1) Put prior knowledge into prior probs. (2) Can also talk about probabilistic hyps: in particular gives a way of talking about evidence that doesn't completely rule out a hyp. (this is main use of Bayesian approach) Say have die you roll and hyps are (A) its fair and (B) its lopsided in certain way. BAYES RULE: Pr(hyp | data) = Pr(hyp AND data)/Pr(data) = Pr(data | hyp)*Pr(hyp) / Pr(data). i.e., Pr(hyp | data) proportional to Pr(data | hyp)*Pr(hyp) [in case of examples, "data" means classifications. So, for deterministic hyps, Pr(data | hyp) is 0 or 1] Do example: --> Person thinks chance they might be afflicted with some disease. Hyps H (healthy) and H-bar (not healthy). Test with 10% false-negative and only 5 % false positive. Say prior is P(H-bar)=0.001 --> calculate. What does this have to do with Occam's razor?? Suppose you pick a concept at random from priors and want to communicate which one it was, to minimize expected number of bits transmitted. Needs to be "self-terminating" transmission. Best possible is sum[ p_i * log(1/p_i) ] bits. Use log(1/p) bits to describe concept of prob p. E.g., probs 1/4, 1/4, 1/4, 1/8, 1/8 / \ /\ /\ A B C /\ D E Key point (obvious): more likely have fewer bits. Like a good editor. So, in a good language, shorter explanations will have higher apriori probs. Rissanen's MDL: take -logs of Bayes rule. In optimal language, get highest postiori by minimizing: (description-length of hyp) + (description length of data given hyp) --> motivates "decision trees with exception lists". What's a good prior?? If have ordering, Rissanen suggests Prob(H_i) is proportional to 1/[i*log(i)*loglog(i)*....] as "slowly growing" to express "maximum ignorance" in a sense. COMBINE WITH PAC --------------- PAC: math says unlikely simple hyps will fool us: i.e. we expect that all simple ones that look good will be good predictors. BAYES: In a good language, simplicity of expression correspond to more likely according to physics of world. I.e., in a good language (evolutionary pt of view) there actually WILL be simple good predictors. COMBINE: in a good language there will likely be simple good predictors and math says simple bad ones won't fool us. Caveat: ------- In Bayesian setting, if you are picking a hypothesis for purpose of prediction, YOU DON'T ALWAYS WANT THE HYPOTHESIS OF HIGHEST POSTIORI PROBABILITY. Take above case 20%,10%,5%....5%. What if all the 5% hyps are "essentially the same"? "Bayes optimal" prediction: take a vote of predictions of each hypothesis, weighted by current probabilities. Or "Gibbs alg": pick a hyp at random from distrib. ---> Note: Gibbs a lot easier to do and at most twice as bad as Bayes-optimal. DISCUSSION ---------- --> discussion on decision forest paper. 1. How can we explain what they see? 2. What lessons should we draw from it?