Lecture notes for 2/2/98 * PAC model * MB ==> PAC * Occam's razor ------------------------- Last time we introduced the PAC model. - assuming data is coming from a fixed distribution D. This will let us make statistical statements about relation between performance on a training set and performance on new unseen data. - Like in MB model, want to make guarantees like "if data is consistent with a concept in C then the alg does well" - Instead of a mistake-bound, we'll ask: how much data do we need to guarantee that with prob > 1-delta, the alg finds a hypothesis with error < epsilon? - Also, more general question: if alg is using hypothesis class H, then how much data do we need to see so that all hypotheses have observed error near to true error? Or, maybe just that the hypothesis with *best* observed error should have nearly best true error, etc. - In general, try for algs with running time poly(1/eps,1/delta, n, size(c)) ---------------------------------------------- PAC model should be easier than MB model since restricting examples to be coming from a distribution. Can make this formal: Theorem: Say we have an algorithm for learning C in MB model with some mistake-bound M. Then, in PAC model, can get hyp of error < epsilon, (with confidence 1-delta) given a training set of size O(M/epsilon * log(M/delta)). Proof: (think of Winnow algorithm) Assume MB alg is "conservative". Run it on the data set. Look at sequence of hypotheses produced: h_1, h_2,... For each one, if consistent with following X examples, then stop. What's X? if mistake bound M is known: 1/epsilon * log(M/delta). If hyp had true error greater than epsilon, then with prob 1-delta/M it would have been discovered to be bad. So, the chance that *any* of our hypotheses fools us is at most 1-delta. What if M is not known? In this case, for the ith hypothesis, can use: 1/epsilon * log(2i^2/delta). Chance we're ever fooled is sum_i delta/2i^2 < delta. Or, as Kamal pointed out, can run MB alg over the data you have and then pick the hypothesis that lasted the longest. This is nicer since it doesn't put "epsilon" and "delta" into the algorithm itself. Turns out, can get a better bound of O(1/epsilon [M + log(1/delta)]) using the fact that *most* bad hypotheses will not take fully 1/epsilon*log(1/delta) time to be discovered. Roughly, the problem is: we're flipping a coin of bias epsilon, and want to know how long we need to flip so that with prob 1-delta we've seen M heads. To get this analysis to work, you use one portion of data to generate the hypotheses and then test them on a separate portion of data. Maybe will get back to later. -------------------------- Last time: if you see m = 1/epsilon[ln(|H|) + ln(1/delta)] examples, then whp all hyps in H of true error >= epsilon have observed error greater than zero (at least 1 mistake). So, if we can solve consistency problem for H efficiently, then we can learn H in PAC model, so long as ln(|H|) is reasonable (e.g., OK for conjunctions, not OK for DNF). What about decision lists? How many are there? ---------------------------------------------- What if we want to talk about hypothesis spaces that have more and less complex hypotheses in them? Def: A {\bf legal size function} on concepts is one such that at most $2^s$ concepts have size $s$. E.g., size == description length in bits, using your favorite language. Theorem: Given $m$ examples, with probability > 1-delta there are no bad (error > epsilon) consistent hypotheses of size less than s, so long as: m > 1/epsilon [s*ln(2) + ln(1/delta)] or s < 1/ln(2) [epsilon*m - ln(1/delta)] Proof: follows simply by noticing that the total number of hypotheses of size less than $s$ is at most $2^s$, and plugging in our earlier bound. Interesting relation to philosophical notion of "Occam's razor". The following statement is attributed to William of Occam ( 1320 AD): ``Entities should not be multiplied unnecessarily'' (in Latin) which is generally interpreted to mean "don't make unnecessarily complex explanations". In our context, can see one justification: simple consistent explanations are trustworthy. Why? Because there aren't too many of them. On the other hand, there are lots of complex explanations so it's easier to get one of them to fit the data just by chance. ------------------------------------------------------- A couple tricky issues regarding the above simple theorems: Suppose I repeatedly perform this experiment: For some favorite hypothesis space H (or hypothesis representation) I pick 10,000 random examples from some database and observe the frequencies of the following events: a. There is no consistent simple hypothesis. b. There is a consistent simple hypothesis, and it has low true error. c. There is a consistent simple hypothesis with high true error. Question: is Prob(c) < Prob(b)? I.e., are simple hypotheses good if we condition on the event that simple consistent hypotheses exist? Answer: no What does the theorem say? Theorem says Prob(c) is small. But, Prob(b) could be even smaller or zero. --------------------------------------------------------- Here is an interesting extension of the above theorem. Theorem: Suppose we have an algorithm that continues to request examples until it finds a hypothesis such that (1) the hypothesis is consistent with the data, and (2) the size s of the hypothesis satisfies s <= 1/(ln 2) * [epsilon * m - ln(2m^2/delta)] where m is there number of examples it has seen so far. Then, with probability 1 - \delta, this algorithm will never output a hypothesis with error > epsilon. Proof: Consider a fixed value of m. The chance of any small, ``bad'' hypotheses being consistent with the data so far is only \delta' = \frac{\delta}{2m^2}, by our previous analysis. Since sum delta/(2m^2) <= delta m the probability that we'll ever output a bad hypothesis is less than delta. =============================================================================