15-859(A) Machine Learning Theory                                01/15/04

* Mistake-bound model
   - simple results
   - relation to consistency model
   - Halving and Standard Optimal algorithm

=======================================================================

Mistake-bound model
-------------------
The consistency model had the problem that there was nothing in it about
being able to predict well on new data.  The Mistake-Bound model addresses
this problem by explicitly modeling learning as an on-line process.

In the MB model, learning is in stages.  In each stage:
  1. Learner gets unlabeled example.
  2. Learner predicts classification.
  3. Learner is told correct label.

Goal: bound the total number of mistakes.

DEFINITION: Algorithm A has mistake-bound M for learning class C if A makes
at most M mistakes on any sequence of examples that is consistent with a
function in C.  [Later we'll talk about extending this to the case where
there is only a "mostly consistent" f in C.]

Notice that since we're not assuming anything about how examples are
selected, we can't necessarily talk about "how much data do I need to
converge".  E.g., maybe we end up seeing the same example over and over
again and don't learn anything new.  But that's OK, because then we
(hopefully) won't be making mistakes either.  Later on, when we talk about
models that assume data is coming from some fixed probability distribution,
we will be able to convert mistake bounds into bounds on how much data we
need to converge.  But the mistake-bound setting is nice because it's very
clean -- no probabilities.

We'll think of A as being good if its mistake bound is polynomial in n =
the size of the examples, and s = the description length of the smallest
consistent function (for classes like DNF and Decision Trees that can have
very complicated functions in them).  [We are thinking of a "concept class"
as both a set of functions and a description language for them.]

DEFINITION: C is *learnable* in the mistake-bound model if there exists an
algorithm with mistake bound poly(n,s), and furthermore whose running time
per stage is poly(n,s).

Question: why do we talk about a *class* of functions as being learnable or
not, rather than just single functions?

----------------examples------------------------------------------

Conjunctions:

  Start: h = x_1 \Not{x}_1 x_2 \Not{x}_2 ... x_n \Not{x}_n.
  Mistake on positive: remove from h all literals not satisfied.

  Guarantee: {literals in h} contains {literals in target}.  This implies
  no mistakes on negatives (if the data is consistent with a conjunction).
  The first mistake removes n literals, and each subsequent mistake removes
  at least 1, which implies that # mistakes <= n+1.  [A code sketch of this
  algorithm appears at the end of this section.]

--------------------------

Disjunctions: same thing, just reverse positive and negative.

--------------------------

Lower bound of n:

We've seen an algorithm that makes at most n+1 mistakes.  In fact, *no*
deterministic algorithm can guarantee a mistake bound less than n.  Easiest
to see with disjunctions.  Imagine seeing the following sequence of
examples:

  1 0 0 ... 0
  0 1 0 ... 0
  0 0 1 ... 0
      ...
  0 0 0 ... 1

*Any* labeling of these is consistent with some concept in C, so for any
algorithm there exists a sequence of examples forcing it to make n
mistakes.  (And if we imagine labeling them with a random OR-function, we
can see that even a randomized algorithm has to make n/2 mistakes in
expectation.)  [This in some ways foreshadows the notion of "VC-dimension"
that we will get to later...]
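For concreteness, here is a minimal Python sketch (not part of the original
notes) of the conjunction learner above.  Representing a literal as a pair
(i, neg) -- variable x_i, negated if neg is True -- and the function names
are just illustrative choices.

    # Mistake-bound learner for conjunctions over n boolean variables.
    def satisfied(lit, x):
        i, neg = lit
        return (not x[i]) if neg else x[i]

    def learn_conjunction(n, examples):
        # Start with h = conjunction of all 2n literals.
        h = {(i, neg) for i in range(n) for neg in (False, True)}
        mistakes = 0
        for x, label in examples:          # x is a tuple of n booleans
            pred = all(satisfied(lit, x) for lit in h)
            if label and not pred:
                # Mistake on a positive example: delete every literal
                # that x does not satisfy.
                h = {lit for lit in h if satisfied(lit, x)}
                mistakes += 1
            elif pred and not label:
                # Cannot happen if the data really is consistent with
                # some conjunction.
                mistakes += 1
        return h, mistakes

    # E.g., with target x_1 AND not-x_3 (0-indexed literals (0,False), (2,True)):
    #   learn_conjunction(3, [((True, True, False), True),
    #                         ((True, False, True), False)])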
Relation to consistency model
-----------------------------
Theorem: if class C is learnable in the MB model by an algorithm A that
uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes.  Let's assume the algorithm is
"conservative", meaning that it only changes its hypothesis when it makes a
mistake.  Then, if it makes a pass with no mistakes, we have a consistent
hypothesis.  On the other hand, if it keeps making mistakes and exceeds its
mistake bound, then we know there is no consistent hypothesis.  Poly
mistake bound + poly time per example ==> poly time overall.  [A code
sketch of this reduction appears below.]

What about "non-conservative" algorithms?
  --> We can make the algorithm conservative by resetting its state to what
      it was before, whenever it sees an example it predicts correctly.
  --> Or, we can just use its current hypothesis to check whether there is
      any inconsistent example, and if so feed that example in.

Note: the other direction is not necessarily so easy, if we want our
algorithm to be efficient.  You can construct weird concept classes where
the consistency problem can be solved easily, but achieving a polynomial
mistake bound requires being able to invert a one-way function (e.g., like
factoring).  Even for natural concept classes, going in the other direction
is not necessarily very obvious.  E.g., we saw last time how to learn
decision lists in the consistency model.  On your hwk you will show how to
do that in the MB model.

One more example:
  * Each example is a number in the range 0 to 2^n - 1 (think of it as
    written in binary).
  * C = { [0,a] : a < 2^n }  (i.e., C = initial sub-intervals).

(E.g., imagine that you are observing some device that fails when it gets
too hot, and each day you write down the current temperature and whether
the device is working right or not.)

Simple algorithm for the consistency model: choose the concept [0,b] where
b is the largest positive example seen.  Then verify that no negative
example is < b.
  - Does this work for the MB model?
  - What would work for the MB model?  Use as the current hypothesis [0,b]
    where b is halfway between the largest positive and smallest negative
    examples seen so far.  This cuts the unknown region in half on every
    mistake, so we make at most n mistakes.  [A sketch of this algorithm
    also appears below.]

========================================================================

Halving algorithm
-----------------
If we don't care about computation time, we can generalize the above
algorithm to learn an *arbitrary* concept class C with at most lg(|C|)
mistakes.

HALVING ALGORITHM: predict using a majority vote over all concepts in C
consistent with the past data.

Each mistake of the halving algorithm cuts the number of remaining
consistent concepts in half (or more).  So, it makes at most lg|C|
mistakes.  E.g., there are 3^n conjunctions, so this makes at most O(n)
mistakes.  Notice that it takes lg|C| bits to describe a concept in C (if
you use the same number of bits per concept), so we are making at most 1
mistake per bit.  [A brute-force sketch of the halving algorithm appears
below as well.]

What about concept classes with both simple and complex functions?  E.g.,
decision trees or DNF formulas or C programs.  Let's define C_b to be the
set of functions in C that take < b bits to write down, so |C_b| < 2^b.
Then we can do a "guess and double" on b: run the halving algorithm on C_b,
and whenever it uses up its lg|C_b| < b allowed mistakes, double b and
restart.  If the target takes s bits, the largest value of b we ever need
is at most 2s, so the total number of mistakes is at most
1 + 2 + 4 + ... + 2s < 4s.

So, this means that if we remove the running-time restriction in the
definition of learnability, then *everything* is learnable in the MB model.
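Here is a minimal Python sketch (not in the original notes) of the
reduction in the proof above.  It assumes a conservative on-line learner
exposed through hypothetical predict(x) / update(x, label) methods and a
known mistake bound M; those names are just for illustration.

    # Turn a conservative mistake-bound learner into a consistency-model
    # algorithm: loop over the data set until a mistake-free pass, or give
    # up once the learner has exceeded its mistake bound.
    def find_consistent(learner, S, M):
        mistakes = 0
        while True:
            clean_pass = True
            for x, label in S:
                if learner.predict(x) != label:
                    learner.update(x, label)   # conservative update
                    mistakes += 1
                    clean_pass = False
                    if mistakes > M:
                        return None    # no function in C is consistent with S
            if clean_pass:
                return learner         # current hypothesis is consistent with S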
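A sketch (also not from the notes) of the mistake-bound algorithm for
initial sub-intervals; the variable names lo/hi are just illustrative.

    # Learn C = { [0,a] : a < 2^n } with at most about n mistakes by keeping
    # the hypothesis threshold halfway between the largest positive and
    # smallest negative examples seen so far.
    def learn_initial_interval(n, examples):
        lo = -1           # largest positive example seen so far (none yet)
        hi = 2 ** n       # smallest negative example seen so far (none yet)
        mistakes = 0
        for x, label in examples:     # x is an integer in [0, 2^n - 1]
            b = (lo + hi) // 2        # current hypothesis is [0, b]
            pred = (x <= b)
            if pred != label:
                mistakes += 1
            # Shrink the unknown region using the revealed label.
            if label:
                lo = max(lo, x)
            else:
                hi = min(hi, x)
        return mistakes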
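A brute-force sketch of the halving algorithm itself (not from the notes,
and of course ignoring computation time); here a concept is just any Python
function mapping an example to True/False.

    # Halving algorithm: keep the version space (all concepts consistent
    # with the data so far) and predict by majority vote.  Makes at most
    # lg|C| mistakes, at a cost of time proportional to |C| per stage.
    def halving(concepts, examples):
        version_space = list(concepts)
        mistakes = 0
        for x, label in examples:
            votes_pos = sum(1 for c in version_space if c(x))
            pred = (2 * votes_pos >= len(version_space))   # majority vote
            if pred != label:
                mistakes += 1
            # Discard every concept inconsistent with the revealed label.
            version_space = [c for c in version_space if c(x) == label]
        return mistakes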
Question: What if we had a "prior" p on the different functions in C?  Can
we make at most lg(1/p_f) mistakes, where f is the target function?

Ans: Sure, just do a weighted vote, cutting the available probability mass
down by half (at least) on every mistake.

This actually gives a better algorithm for the case of functions of
different sizes.  Just give each function f a weight of 1/2^size(f).  If
our description language is a prefix code (e.g., like a Huffman code: think
of a binary tree with concepts at the leaves), then the sum of all the
weights is <= 1 to start [why?].  If the target has size s, then the number
of mistakes is at most lg(2^s) = s.  So, we got rid of the "4".

Theorem: if computation time is no object, then we can learn any concept
class C with # mistakes <= # bits in the description of the target
function.  [A sketch of this weighted vote appears at the end of this
section.]

---- a few asides ----

Question: what if the target function really was picked according to our
prior?  What can we say about our *expected* number of mistakes?

Ans: E[M] <= sum_f [ p_f * lg(1/p_f) ] = the *entropy* of the distribution p.

What's also interesting is that our algorithm is now equivalent to the
"Bayes optimal" strategy: at each step we are predicting the most likely
outcome given our information so far, under the assumption that the target
was picked from p.  So this gives two motivations for the same algorithm:
it's the correct thing to do if our prior is right (Bayes motivation), but
it's also a good thing to do if we are just hoping our prior is reasonable
(mistake-bound motivation).

One more aside: what if the true distribution is p, but we thought it was
q?  What does our bound look like now?

Ans: sum_f [ p_f * lg(1/q_f) ].  The difference between these two
quantities is called the KL divergence between p and q: KL(p||q).  It is
not symmetric.

=======================================================================

Question: is the halving algorithm optimal?  Say you have unlimited
computation time: does it have the best possible mistake bound?
Surprisingly, the answer is no.

On-line learning as a 2-person game
-----------------------------------
If we ignore the issue of computational efficiency, we can think of
learning in the mistake-bound model as a 2-person game.  In particular, for
a deterministic algorithm, the game goes like this:

  1. The opponent selects some example x, which splits C into two sets:
     C_0(x), the set of concepts that label x negative, and C_1(x), the set
     of concepts that label x positive.
  2. The algorithm gets to pick one of C_0(x) and C_1(x) to throw away (by
     predicting negative and making a mistake it throws out C_0(x), or by
     predicting positive and making a mistake it throws out C_1(x)).

Keep going until only one concept is left (view two semantically equivalent
concepts as the same).  The value of this game (to the opponent) is the
number of rounds the game is played.

Looking at the problem this way, we can see a couple of things.  First,
there is a well-defined notion of an optimal strategy for each player, and
second, there is a well-defined number OPT(C) that is the optimal mistake
bound for concept class C.

What is the optimal strategy?  Given an example x, we ``just'' calculate
OPT(C_0(x)) and OPT(C_1(x)) (by applying this idea recursively), and throw
out whichever set has the larger mistake bound.  Another way to look at it
is that OPT(C) is defined recursively as:

  If |C| = 1 then OPT(C) = 0.
  Else OPT(C) = 1 + max_x [ min( OPT(C_0(x)), OPT(C_1(x)) ) ].

This algorithm is often called the "Standard Optimal Algorithm" (there is
more discussion in the Littlestone paper).  This is *not* necessarily the
same thing as the Halving Algorithm (see hwk problem #5).  [A brute-force
sketch of this recursion appears at the end of this section.]
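For concreteness, here is a sketch (not from the notes) of the weighted
vote; 'concepts' is assumed to be a list of (concept, weight) pairs with
total weight <= 1, e.g., weight 1/2^size(f) under a prefix-free description
language.

    # Weighted halving: predict with the weighted majority of the concepts
    # still consistent with the data.  When the prediction is wrong, at
    # least half of the remaining weight is removed, and the target's
    # weight p_f survives forever, so # mistakes <= lg(1/p_f).
    def weighted_halving(concepts, examples):
        alive = list(concepts)              # (concept, weight) pairs
        mistakes = 0
        for x, label in examples:
            w_pos = sum(w for c, w in alive if c(x))
            w_neg = sum(w for c, w in alive if not c(x))
            pred = (w_pos >= w_neg)         # weighted-majority prediction
            if pred != label:
                mistakes += 1
            # Keep only concepts that agree with the revealed label.
            alive = [(c, w) for c, w in alive if c(x) == label]
        return mistakes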
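And a brute-force sketch (again not from the notes) of the OPT(C)
recursion; a concept is represented here as its tuple of labels over a
small finite example space range(m), so this takes exponential time and is
purely for illustration.

    from functools import lru_cache

    # OPT(C) = 1 + max over splitting examples x of
    #              min( OPT(C_0(x)), OPT(C_1(x)) ),
    # and OPT = 0 once no example splits the remaining concepts.
    def opt(C, m):
        @lru_cache(maxsize=None)
        def rec(S):
            best = 0
            for x in range(m):
                C0 = frozenset(c for c in S if not c[x])  # label x negative
                C1 = frozenset(c for c in S if c[x])      # label x positive
                if C0 and C1:                             # x splits S
                    best = max(best, 1 + min(rec(C0), rec(C1)))
            return best
        return rec(frozenset(C))

    # E.g., opt({(0,0,0), (1,0,0), (1,1,0), (1,1,1)}, 3) returns 2.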
-----

Another thing to think about: what if there isn't any perfect target
function in C?  One thing we might hope to do is perform comparably to the
*best* function in C in hindsight.  Next class we'll see a really neat
algorithm for doing this that is like a flexible version of the halving
algorithm: instead of throwing out the inconsistent functions, you just
give them a reduced vote.  By doing this right we'll be able to get a bound
that looks like:

  log(|C|) * (1/epsilon) + (1 + epsilon) * M_best,

where M_best is the number of mistakes of the best function in C in
hindsight.