15-859(B) Machine Learning Theory                              01/29/02

 * PAC model
 * some simple results
 * relations between PAC and MB
========================================================================

One problem with the mistake-bound model is that you can't answer
questions like: how many examples do you need to converge to a good
hypothesis?  That's because the MB model doesn't make any assumptions
about the order of examples.  (Of course, this is a strength of the
model too!)  To address this kind of question, we now go on to the PAC
model.

The PAC model
-------------
The PAC model is a model for "batch" learning: you train your
algorithm on some fixed set of data, and then deploy it in the real
world.  To reason about this, we need some relation between the
training set and the world.

The basic idea of the PAC model is to assume that examples are being
provided from a fixed (but perhaps unknown) distribution over the
instance space.  The assumption of a fixed distribution gives us hope
that what we learn based on some training data will carry over to new
test data we haven't seen yet.  Also, it provides us a well-defined
notion of the error of a hypothesis with respect to the target
concept.

Definition: Given a distribution D over examples, the *error* of a
hypothesis h with respect to a target concept c is
Pr_{x in D}[h(x) != c(x)].  (A small sampling-based illustration of
this definition appears after the remarks below.)

Suppose we are seeing examples labeled according to some target
concept in class C.  What kind of guarantee could we hope to make?
Two issues:

1) We can't hope to get the target exactly, since the distribution
   might place low weight on some part of the instance space.  So,
   instead, the goal is to approximate the target.

2) We can't necessarily guarantee to approximate the target, since we
   might be unlucky and get a data sample that doesn't accurately
   reflect the distribution.  So, our goal will be to get an
   approximation with high probability.

-> Probably Approximately Correct (PAC) model.

Definition: An algorithm A PAC-learns concept class C by hypothesis
class H if for any c in C, any distribution D over the instance
space, and any epsilon, delta > 0, A needs at most
poly(n, 1/epsilon, 1/delta, size(c)) examples and running time to
produce a hypothesis h in H that, with probability at least 1-delta,
has error at most epsilon.

The quantity epsilon is usually called the accuracy parameter and
delta the confidence parameter.  A hypothesis with error at most
epsilon is often called ``epsilon-good''.

This definition allows us to make statements such as: ``the class of
k-term DNF formulas is learnable by the hypothesis class of k-CNF
formulas.''

Remark 1: If we require H = C, then this is sometimes called ``proper
PAC learning''.  If we allow H to be the class of polynomial-time
programs (i.e., we don't care what representation the learner uses so
long as it can predict well), then this is typically called ``PAC
prediction''.  I will usually say ``concept class C is PAC-learnable''
to mean that C is learnable in the PAC-prediction sense.

Remark 2: One nice extension of this model is, instead of requiring
the error of h to be <= epsilon, to just require that the error be at
most 1/2 - epsilon.  This is called *weak learning* and we will talk
more about it later.

Remark 3: Another nice extension is to the case where H is not
necessarily a superset of C.  In this case, let epsilon_H be the
least possible error using hypotheses from H.  Now, we relax the goal
to having the error of h be at most epsilon + epsilon_H.
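As promised, here is a minimal Python sketch illustrating the error
definition: it estimates Pr_{x in D}[h(x) != c(x)] by sampling.  The
names (estimate_error, draw_example) and the particular choices of D,
c, and h are mine, picked just for illustration; they are not part of
the notes.

    import random

    def estimate_error(h, c, draw_example, num_samples=100000):
        """Estimate Pr_{x in D}[h(x) != c(x)] by sampling from D."""
        disagreements = 0
        for _ in range(num_samples):
            x = draw_example()          # one draw from D
            if h(x) != c(x):            # h and c evaluated on the same x
                disagreements += 1
        return disagreements / num_samples

    # Example: D = uniform over {0,1}^4, target c = (x1 AND x2),
    # hypothesis h = x1.
    n = 4
    draw = lambda: tuple(random.randint(0, 1) for _ in range(n))
    c = lambda x: x[0] == 1 and x[1] == 1
    h = lambda x: x[0] == 1
    print(estimate_error(h, c, draw))   # about 0.25: h errs iff x1=1, x2=0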
If we let C be the set of all concepts (and we remove size(c) from
the set of parameters we are allowed to be polynomial in), then this
is often called the *agnostic* model of learning: we simply want to
find the (approximately) best h in H we can, without any prior
assumptions on the target concept.  This is like what we got with
Weighted-Majority for H = {single-variable concepts}.  It is a lot
harder to get this guarantee efficiently for interesting H's.

A simple example
----------------
Let's show that the simple algorithm for learning conjunctions in the
consistency model is a PAC-learning algorithm for this class.  There
are many ways of showing this fact.  Here is one particularly nice
way.

1) Consider some specific ``bad'' conjunction whose error is at least
   epsilon.  The probability that this bad conjunction is consistent
   with m examples drawn from D is at most (1-epsilon)^m.

2) Notice that there are (only) 3^n conjunctions over n variables.

3) (1) and (2) imply that given m examples drawn from D, the
   probability there *exists* a bad conjunction consistent with all
   of them is at most 3^n (1-epsilon)^m.  Suppose that m is
   sufficiently large so that this quantity is at most delta.  That
   means that with probability 1-delta there are *no* consistent
   conjunctions whose error is more than epsilon.  Since our
   algorithm finds a consistent conjunction whenever one exists (and
   we are assuming one exists here), this means that with probability
   at least 1-delta our algorithm finds a hypothesis with error at
   most epsilon.

4) The final step is to calculate the value of m needed, and to make
   sure that it is polynomial in our parameters (from the description
   of the algorithm, it is clear that its running time is polynomial
   so long as m is).  We can do this as follows.

   Want: 3^n (1-epsilon)^m <= delta.

   Now, (1-epsilon)^{1/epsilon} <= 1/e  [e.g., (1/2)^2, (99/100)^100],
   so it suffices to have 3^n (1/e)^{epsilon*m} <= delta, or
   equivalently, e^{epsilon*m} >= 3^n / delta.  Taking logs,

       m >= (1/epsilon)[n ln(3) + ln(1/delta)].

   Since this is polynomial in our parameters, we have a PAC-learning
   algorithm.  qed

Relating Consistency and the PAC model
---------------------------------------
Let's go over the analysis again:

1. The probability that a given bad hypothesis is consistent with m
   examples is at most (1-epsilon)^m.
2. So, the probability that there exists a bad consistent hypothesis
   is at most |H|(1-epsilon)^m.
3. Setting this to delta and solving, the number of examples needed
   is at most (1/epsilon)[ln|H| + ln(1/delta)].

This gives us a way of relating the consistency model to the PAC
model.

Theorem: Let A be an algorithm that learns class C by H in the
consistency model (i.e., it finds a consistent h in H whenever one
exists).  Then A needs only (1/epsilon)(ln|H| + ln(1/delta)) examples
to learn C (by H) in the PAC model.  Therefore, A is a PAC-learning
algorithm so long as this quantity is polynomial in 1/epsilon,
1/delta, size(c), and n.

The above quantity is polynomial for conjunctions, k-CNF, and k-DNF.
It is not polynomial for general DNF.  This explains why our DNF
algorithm didn't feel right, while we felt more confident in the
conjunctions algorithm.
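As a quick sanity check on the theorem, here is a small Python sketch
(the function name sample_bound is mine, not from the notes) that
computes the number of examples it prescribes:

    import math

    def sample_bound(hypothesis_count, epsilon, delta):
        """Examples sufficient so that, with probability >= 1-delta,
        no hypothesis with error > epsilon is consistent with all of
        them:  m >= (1/epsilon)(ln|H| + ln(1/delta))."""
        return math.ceil((math.log(hypothesis_count)
                          + math.log(1.0 / delta)) / epsilon)

    # Conjunctions over n=10 variables: |H| = 3^10.
    print(sample_bound(3**10, epsilon=0.1, delta=0.05))   # 140 examples

For conjunctions, ln|H| = n ln(3), so this recovers exactly the bound
derived in the example above.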
=============================================================

What if we want to talk about hypothesis spaces that contain both
more and less complex hypotheses?  E.g., one way to describe a
monotone conjunction is by giving its indicator vector --- this uses
n bits for each conjunction.  But another way is by giving the
indices of the relevant variables: this uses O(r log n) bits to
describe a conjunction of r variables.

Let size(c) = the number of bits it takes to describe c in whatever
description language we are using.  The only fact we'll use about the
language is that at most 2^s concepts can have size less than s.

Theorem: Given m examples, with probability at least 1-delta there
are no bad (error > epsilon) consistent hypotheses of size less than
s, so long as

    m > (1/epsilon)[s ln(2) + ln(1/delta)],

or equivalently,

    s < (1/ln 2)[epsilon*m - ln(1/delta)].

Proof: just replace |H| with 2^s in the previous argument.

There is an interesting relation here to the philosophical notion of
"Occam's razor".  The following statement is attributed to William of
Occam (c. 1320 AD): ``Entities should not be multiplied
unnecessarily'' (originally in Latin), which is generally interpreted
to mean "don't make unnecessarily complex explanations".  In our
context, we can see one justification: simple consistent explanations
are trustworthy.  Why?  Because there aren't too many of them.  On
the other hand, there are lots of complex explanations, so it's
easier for one of them to fit the data just by chance.

-------------------------------------------------------
A couple of tricky issues regarding the above simple theorems.

Suppose I repeatedly perform this experiment: for some favorite
hypothesis space H (or hypothesis representation), I pick 10,000
random examples from some database and observe the frequencies of the
following events:

  a. There is no consistent simple hypothesis.
  b. There is at least one consistent simple hypothesis, and they all
     have low true error.
  c. There is a consistent simple hypothesis with high true error.

Question: is Prob(c) < Prob(b)?  I.e., are simple hypotheses good if
we condition on the event that simple consistent hypotheses exist?

Answer: no.  What does the theorem say?  The theorem says Prob(c) is
small.  But Prob(b) could be even smaller, or zero.

---------------------------------------------------------
Here is an interesting extension of the above theorem.  (A sketch of
the algorithm it describes appears at the end of these notes.)

Theorem: Suppose we have an algorithm that continues to request
examples until it finds a hypothesis such that (1) the hypothesis is
consistent with the data, and (2) the size s of the hypothesis
satisfies

    s <= (1/ln 2)[epsilon*m - ln(m(m+1)/delta)],

where m is the number of examples it has seen so far.  Then, with
probability at least 1-delta, this algorithm will never output a
hypothesis with error > epsilon.

Proof: Consider a fixed value of m.  By our previous analysis, the
chance that any small ``bad'' hypothesis is consistent with the data
seen so far is only delta/(m(m+1)).  Since

    sum_{m>=1} 1/(m(m+1)) = 1,

the probability that we ever output a bad hypothesis is less than
delta.

=============================================================================
Relating PAC and Mistake-Bound models
-------------------------------------
The PAC model should be easier than the MB model, since we are
restricting the examples to come from a distribution.  We can make
this formal:

Theorem: Say we have an algorithm for learning C in the MB model with
some mistake bound M.  Then we can PAC-learn using a training set of
size O((M/epsilon) log(M/delta)).

Proof: (Think of the Winnow algorithm.)  Assume the MB algorithm is
"conservative", i.e., it changes its hypothesis only when it makes a
mistake.  Run it on the data set, and look at the sequence of
hypotheses produced: h_1, h_2, ....  For each one, if it is
consistent with the following (1/epsilon) log(M/delta) examples, then
stop and output it.  If h_i has error > epsilon, the chance that it
survives this test (so that we stop with it) is at most delta/M.
Since a conservative algorithm with mistake bound M can have at most
M bad hypotheses in this sequence (once it has made M mistakes, its
final hypothesis can never err again), there is at most a delta
chance that we are fooled by *any* of the hypotheses.

You can actually get a better bound of O(M/epsilon +
(1/epsilon) log(1/delta)) using the fact that *most* bad hypotheses
will not survive anywhere near (1/epsilon) log(M/delta) examples
before being discovered.
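To make the basic conversion concrete, here is a minimal Python
sketch.  The mb_learner interface (predict/update methods) and the
draw_labeled_example callback are assumptions of mine, not from the
notes:

    import math

    def mb_to_pac(mb_learner, draw_labeled_example, M, epsilon, delta):
        """PAC-learn using a conservative mistake-bound learner.

        mb_learner           -- has .predict(x) and .update(x, y);
                                conservative: it changes its hypothesis
                                only when it makes a mistake
        draw_labeled_example -- returns (x, c(x)), x drawn i.i.d. from D
        M                    -- mistake bound of mb_learner

        Each hypothesis is tested on k = (1/epsilon) ln(M/delta) fresh
        examples; a hypothesis with error > epsilon survives them all
        with probability at most (1-epsilon)^k <= delta/M.
        """
        k = math.ceil(math.log(M / delta) / epsilon)
        while True:
            for _ in range(k):
                x, y = draw_labeled_example()
                if mb_learner.predict(x) != y:
                    mb_learner.update(x, y)  # mistake: reject this hypothesis
                    break                    # restart the test with the new one
            else:
                # k consecutive correct predictions: output this hypothesis
                return mb_learner.predict

Since the learner makes at most M mistakes, the outer loop runs at
most M+1 times, for a total of at most (M+1)k =
O((M/epsilon) log(M/delta)) examples, matching the theorem.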
We'll get back to this improved bound once we talk about Chernoff
bounds.
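Finally, returning to the example-requesting algorithm from the Occam
extension theorem above, here is a minimal sketch.  The
find_consistent subroutine -- returning some consistent hypothesis
together with its size in bits, or None -- is a hypothetical stand-in
for, e.g., the conjunction learner:

    import math

    def occam_learner(find_consistent, draw_labeled_example,
                      epsilon, delta):
        """Request examples until we hold a consistent hypothesis
        whose size s (in bits) satisfies
            s <= (1/ln 2)[epsilon*m - ln(m(m+1)/delta)],
        where m is the number of examples seen so far.  By the theorem
        above, with probability >= 1-delta the hypothesis returned has
        error <= epsilon.
        """
        examples = []
        while True:
            examples.append(draw_labeled_example())
            m = len(examples)
            result = find_consistent(examples)  # (h, size_bits) or None
            if result is None:
                continue
            h, s = result
            if s <= (epsilon * m
                     - math.log(m * (m + 1) / delta)) / math.log(2):
                return h

Note that for small m the size threshold is negative, so the algorithm
simply keeps requesting examples until enough data has accumulated to
certify whatever hypothesis it currently holds.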