15-859(A) Machine Learning Theory                                01/23/06

* Overview of main learning models
* Mistake-bound model
   - relation to consistency model
   - Halving and Standard Optimal algorithms
=======================================================================

Overview of main learning models
--------------------------------
The consistency model had the problem that there was nothing in it
about being able to predict well on new data.  In this course, we will
be looking at a number of models that address this in different ways:

- In the distributional (PAC) model, we assume data comes from some
  fixed probability distribution over the instance space, labeled by an
  unknown target function.  We assume training data is drawn from this
  distribution and then ask questions like: how much data do I need to
  see so that if I do well over it, I can expect to do well on new
  points also drawn from this distribution?

- We will also look at "active learning" models where the algorithm can
  choose which points it wants to have labeled.  (2 main versions: one
  where the algorithm can construct new examples out of whole cloth,
  and one where it can just select from an existing pool (like web
  pages or documents).)

- In the other direction (making things harder) we can look at the case
  where an adversary selects the order in which examples are presented
  (i.e., our algorithm should work for any order).  This is the model
  we are going to start with.  In many ways it's the cleanest to think
  about, and it also yields some very nice algorithms.

Mistake-bound model
-------------------
In the MB model, learning proceeds in stages.  In each stage:
 1. The learner gets an unlabeled example.
 2. The learner predicts its classification.
 3. The learner is told the correct label.
Goal: bound the total number of mistakes.

DEFINITION: Algorithm A has mistake-bound M for learning class C if A
makes at most M mistakes on any sequence of examples that is consistent
with a function in C.  [later we'll talk about extending to when there's
a "mostly consistent" f in C]

Notice that since we're not assuming anything about how examples are
selected, we can't necessarily talk about "how much data do I need to
converge".  E.g., maybe we end up seeing the same example over and over
again and don't learn anything new.  But that's OK, because then we
(hopefully) won't be making mistakes either.  Later on, when we talk
about distributional models, we will be able to convert mistake bounds
into bounds on how much data we need to converge.  But the mistake-bound
setting is nice because it's very clean -- no probabilities.  So this is
the toughest model for positive results.

We'll think of A as being good if its mistake bound is polynomial in
n = the size of the examples, and s = the description length of the
smallest consistent function (for classes like DNF and Decision Trees
that can contain very complicated functions).  [We are thinking of a
"concept class" as both a set of functions and a description language
for them.]

DEFINITION: C is *learnable* in the mistake-bound model if there exists
an algorithm with mistake bound poly(n,s), and furthermore whose running
time per stage is poly(n,s).

Question: why do we talk about a *class* of functions as being learnable
or not, rather than just single functions?

----------------Examples------------------------------------------

Monotone conjunctions:
 - Start: h = x_1 ^ x_2 ^ x_3 ^ ... ^ x_n.  (conjunction of everything)
 - Mistake on a positive example: remove from h all variables not
   satisfied by that example.

Guarantee: {variables in h} contains {variables in target}.  This
implies no mistakes on negatives (if the data is consistent with a
conjunction).  Each mistake removes at least 1 variable, which implies
that # mistakes <= n.  (Let's say the AND of no variables is the
"all positive" function.)
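As a concrete illustration (not part of the original notes), here is a
minimal Python sketch of the elimination algorithm above; examples are
assumed to be 0/1 vectors, labels are True/False, and the function name
is just for illustration.

    # Sketch of the mistake-bound learner for monotone conjunctions.
    # Hypothesis = set of variable indices currently in the conjunction;
    # we start with all of them and only ever delete (on mistakes on positives).

    def learn_monotone_conjunction(n, stream):
        """stream yields (example, label) pairs; example is a 0/1 list of length n."""
        h = set(range(n))                              # h = x_1 ^ ... ^ x_n to start
        mistakes = 0
        for x, label in stream:
            prediction = all(x[i] == 1 for i in h)     # AND of the variables still in h
            if prediction != label:
                mistakes += 1
                # With consistent data, a mistake can only happen on a positive example:
                # drop every variable that this positive example does not satisfy.
                h -= {i for i in h if x[i] == 0}
        return h, mistakes

    # Example run: target is x_0 ^ x_2 over n = 4 variables.
    data = [([1, 1, 1, 0], True), ([1, 0, 0, 1], False), ([1, 0, 1, 1], True)]
    print(learn_monotone_conjunction(4, data))         # h shrinks toward {0, 2}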
Disjunctions: same thing, just reverse the roles of +/- and 1/0.

--------------------------
Lower bound of n:

We've seen an algorithm that makes at most n mistakes.  In fact, *no*
deterministic algorithm can guarantee a mistake bound less than n.
Easiest to see with disjunctions: imagine seeing the following sequence
of examples (the n unit vectors):

    1 0 0 ... 0
    0 1 0 ... 0
    0 0 1 ... 0
        ...
    0 0 0 ... 1

*Any* labeling of these is consistent with some concept in C, so for any
deterministic algorithm there exists a labeling forcing it to make n
mistakes.  (And if we imagine labeling them with a random OR-function,
we can see that even a randomized algorithm has to make n/2 mistakes in
expectation.)  [this in some ways foreshadows the notion of
"VC-dimension" that we will get to later...]

Relation to consistency model
-----------------------------
Theorem: if class C is learnable in the MB model by an algorithm A that
uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes.  Let's assume the algorithm
is "conservative", meaning that it only changes its hypothesis when it
makes a mistake.  Then, if it makes a pass with no mistakes, we have a
consistent hypothesis.  On the other hand, if it keeps making mistakes
and exceeds its mistake bound, then we know there is no consistent
hypothesis.  Poly mistake bound + poly time per example ==> poly time
overall.

What about "non-conservative" algorithms?
 --> can roll the algorithm's state back after any example on which it
     did not make a mistake (making it effectively conservative),
 --> or, can just use its current hypothesis to check whether there is
     any inconsistent example in S, and if so feed that example in.

Note: the other direction is not necessarily so easy, if we want our
algorithm to be efficient.  You can construct weird concept classes
where the consistency problem can be solved easily, but achieving a
polynomial mistake bound requires being able to invert a one-way
function (e.g., like factoring).  Even for natural concept classes,
going in the other direction is not necessarily very obvious.  E.g., we
saw last time how to learn decision lists in the consistency model.  On
your hwk you will show how to do that in the MB model.

One more example:
 * Each example is a number in the range 0 to 2^n - 1 (think of it as
   written in binary).
 * C = { [0,a] : a < 2^n }  (i.e., C = initial sub-intervals).
 (E.g., imagine that you are observing some device that fails when it
 gets too hot, and each day you write down the current temperature and
 whether the device is working right or not.)

Simple alg for the consistency model: choose concept [0,b] where b is
the largest positive example seen.  Then verify that no negative example
is < b.
 - Does this work for the MB model?
 - What would work for the MB model?  Use as the current hypothesis
   [0,b] where b is halfway between the largest positive example and the
   smallest negative example seen so far.  This cuts the unknown region
   in half on every mistake, so it makes at most n mistakes.

========================================================================
Halving algorithm
-----------------
If we don't care about computation time, we can generalize the above
algorithm to learn an *arbitrary* finite concept class C with at most
lg(|C|) mistakes.

HALVING ALGORITHM: predict using majority vote over all concepts in C
consistent with the past data.

Each mistake of the halving algorithm cuts down the number of available
concepts by at least half, so it makes at most lg|C| mistakes.  E.g.,
there are 2^n monotone conjunctions, so this makes at most n mistakes.
Notice that it takes lg|C| bits to describe a concept in C (if you use
the same number of bits per concept), so we are making at most 1 mistake
per bit.
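Here is a minimal sketch (again not from the notes) of the halving
algorithm for a finite class given explicitly as a list of functions;
the names and the example class are just for illustration.

    # Sketch of the halving algorithm: keep the "version space" of concepts
    # consistent so far, predict by majority vote, and discard the losers
    # whenever the label is revealed.  Runs in time |C| per example, so this
    # is only practical when the class is small enough to enumerate.

    def halving_learner(concepts, stream):
        """concepts: list of functions x -> bool; stream yields (x, label) pairs."""
        version_space = list(concepts)
        mistakes = 0
        for x, label in stream:
            votes_pos = sum(1 for c in version_space if c(x))
            prediction = votes_pos * 2 >= len(version_space)   # majority vote (ties -> +)
            if prediction != label:
                mistakes += 1
            # Keep only the concepts consistent with the revealed label.
            version_space = [c for c in version_space if c(x) == label]
        return version_space, mistakes

    # Example: C = initial intervals [0,a] for a in {0,...,15}; target a = 9.
    concepts = [(lambda x, a=a: x <= a) for a in range(16)]
    stream = [(12, False), (3, True), (9, True), (10, False)]
    print(halving_learner(concepts, stream))   # at most lg(16) = 4 mistakes

Whenever the majority vote is wrong, more than half of the remaining
concepts were wrong and get filtered out, which is exactly where the
lg|C| bound comes from.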
What about concept classes with both simple and complex functions?
E.g., decision trees or DNF formulas or C-programs.  Let's define C_b to
be the set of functions in C that take < b bits to write down; so
|C_b| < 2^b.  Then we can do a "guess and double" on b: run the halving
algorithm on C_b, and if it exceeds its mistake bound of lg|C_b| < b
(which means the target is not in C_b), double b and start over.  If the
target takes s bits, say 2^j <= s < 2^{j+1}, then the guesses used are
1, 2, 4, ..., 2^{j+1} <= 2s, so the total number of mistakes is at most
1 + 2 + 4 + ... + 2s < 4s.

So, this means that if we remove the running-time restriction in the
definition of learnability, then *everything* is learnable in the MB
model.

Question: What if we had a "prior" p on the different functions in C?
Can we make at most lg(1/p_f) mistakes, where f is the target function?

Ans: Sure, just do a weighted vote, cutting down the available
probability mass by at least half on every mistake.

This actually gives a better algorithm for the case of functions of
different sizes.  Just give each function f a weight of 1/2^size(f).
If our description language is a prefix code (e.g., like a Huffman code:
think of a binary tree with concepts at the leaves) then the sum of all
the weights is <= 1 to start.  If the target has size s, then the number
of mistakes is at most lg(2^s) = s.  So, we got rid of the "4".

Theorem: if computation time is no object, then we can learn any concept
class C with # mistakes <= # bits in the description of the target
function.

---- a few asides ----
Question: what if the target function really was picked according to our
prior?  What can we say about our *expected* number of mistakes?

Ans: E[M] <= sum_f [p_f * lg(1/p_f)] = the *entropy* of the
distribution p.

What's also interesting is that our algorithm is now equivalent to the
"Bayes optimal" strategy: at each step we are predicting the most likely
outcome given our info so far, under the assumption that the target was
picked from p.  So this gives two motivations for the same algorithm:
it's the correct thing to do if our prior is right (Bayes motivation),
but it's also a good thing to do if we are just hoping our prior is
reasonable (mistake-bound motivation).

One more aside: what if the true distribution is p, but we thought it
was q?  What does our bound look like now?
Ans: sum_f [p_f * lg(1/q_f)].  The difference between this quantity and
the entropy of p is called the KL divergence between p and q: KL(p||q).
It is not symmetric.

=======================================================================
Question: is the halving algorithm optimal?  Say you have unlimited
computation time; does it have the best possible mistake bound?  What do
you think?

Turns out the answer is no.  You can think of the halving algorithm as
saying "I want to go with the prediction such that, if I am wrong, I cut
down the number of functions left by as much as possible".  But what you
*really* want is to go with the prediction that, if you are wrong, cuts
down the *mistake bound* of the set of functions left over as much as
possible.  E.g., it's possible for the MB of a set of functions C to be
much *less* than log|C|.  Examples?

So, if MB(C) is the optimal mistake bound for C (the best mistake bound
achievable by a deterministic algorithm), and the current example splits
C into C+ and C-, then you predict + if MB(C+) > MB(C-), else predict -.
In fact, for deterministic algorithms, you can view this as a 2-player
alternating-move game between the algorithm and the adversary.  This
algorithm is often called the "Standard Optimal Algorithm" (more
discussion in the Littlestone paper).  This is *not* necessarily the
same thing as the Halving Algorithm (see hwk problem #5).
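To make the game-tree recursion concrete, here is a small sketch (not
from the notes) that computes the optimal mistake bound MB(C) for a
finite class over a finite instance space and uses it to predict as
described above.  It is exponentially slow and purely illustrative;
concepts are assumed to be given as tuples of their labels on the
instance space.

    from functools import lru_cache

    # Sketch of the Standard Optimal Algorithm for a small finite class C over
    # a finite instance space X (a list of example indices).  Concept c labels
    # example i as c[i].  mb(S) is the optimal mistake bound of the set S: the
    # adversary picks the example, and the learner predicts so that a mistake
    # leaves it with the side having the smaller bound.

    def MB(C, X):
        @lru_cache(maxsize=None)
        def mb(S):
            if len(S) <= 1:
                return 0
            best = 0
            for i in X:
                pos = frozenset(c for c in S if c[i])
                neg = frozenset(c for c in S if not c[i])
                if pos and neg:   # adversary can force a mistake on example i
                    best = max(best, 1 + min(mb(pos), mb(neg)))
            return best
        return mb(frozenset(C))

    def soa_predict(S, X, i):
        """Predict the label of example i given the consistent set S."""
        pos = frozenset(c for c in S if c[i])
        neg = frozenset(c for c in S if not c[i])
        if not pos or not neg:          # all remaining concepts agree on i
            return bool(pos)
        # Predict + exactly when MB(C+) > MB(C-): if we're wrong, we are left
        # with the side that has the smaller optimal mistake bound.
        return MB(pos, X) > MB(neg, X)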
========================================================================
What if there isn't any perfect target function in C?

One thing we might hope to do is perform comparably to the *best*
function in C in hindsight.  In a few lectures, we'll see a really neat
algorithm for doing this that is like a flexible version of the halving
algorithm: instead of throwing out the inconsistent functions, you just
give them a reduced vote, and predict probabilistically.  By doing this
right we'll be able to get a bound that looks like

    log(|C|) * (1/epsilon) + (1 + epsilon)*M_best,

where M_best is the number of mistakes of the best function in C in
hindsight.  This then has a number of nice connections to game theory.

First, though, we're going to look at some very nice efficient
algorithms, including one that gets at the question: how can we design
learning algorithms that handle large numbers of irrelevant features?
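As a preview of the reduced-vote idea just described (the actual
algorithm and its analysis come in a later lecture), here is a minimal
sketch; the penalty parameter beta, the name reduced_vote_predictor,
and the specific update are my own illustration, not the notes'
definition.

    import random

    # Preview sketch of the "reduced vote" idea: keep a weight per concept,
    # multiply a concept's weight by beta whenever it is wrong, and predict 1
    # with probability equal to the weighted fraction of concepts voting 1.
    # (How to set beta, and the resulting bound, are covered in a later lecture.)

    def reduced_vote_predictor(concepts, stream, beta=0.5):
        weights = [1.0] * len(concepts)
        mistakes = 0
        for x, label in stream:
            total = sum(weights)
            w_pos = sum(w for c, w in zip(concepts, weights) if c(x))
            prediction = random.random() < w_pos / total   # probabilistic prediction
            if prediction != label:
                mistakes += 1
            # Penalize (rather than discard) every concept that got this one wrong.
            weights = [w * beta if c(x) != label else w
                       for c, w in zip(concepts, weights)]
        return mistakes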