Notes for Jan 19, 1998 Handouts: Minsky-Paper "Perceptrons" chapter. Littlestone paper. Admin: These handouts are related to today's topic and also topics for next few lectures. On web page: putting my typed notes and also related readings that you might want to look at. -- e.g., paper that describes the neural-net NP-hardness proof. ------------------------------------------------------------------------------- To finish up from last time: algorithm for Decision Lists in the consistency model: A DL is a list of if-then rules where each condition is a literal and each consequent is + or -. (Equivalent to a decision tree with just one long path) Let's find algorithm to learn DLs in the consistency model. Extending to Mistake-bound model is on the homework. An example: 1 0 0 1 1 + 0 1 1 0 0 - 1 1 1 0 0 + 0 0 0 1 0 - 1 1 0 1 1 + 1 0 0 0 1 - Algorithm: 1. Find some rule consistent with the current set of examples that applies to at least one of them. If no such rule exists, halt with failure. 2. Put the rule at the bottom of the hypothesis. 3. Throw out those examples classified by the hypothesis so far. 4. If there are any examples left, repeat. It is clear that this algorithm runs in polynomial time, since each iteration removes at least one example, and each iteration can be performed in poly time. To prove: if there exists a consistent list, this algorithm will find one. Proof: If there is a decision list L consistent with remaining data, then there is at least one choice for step 1 (the highest rule in L satisfied by at least one example). Furthermore, if there was a consistent DL at the beginning, then there must exist a DL consistent with the remaining data --- namely, the original DL. ta da. (qed) ------------------------------------------------------------------------------- Last time: ended showing that alg for MB model could be used directly to solve consistency problem is hypotheses belong to class C. An example where not so easy to go the other way: * each example is a number in range: 0 to 2^n-1. (think of as binary). * C = { [0,a] : a < 2^n} (i.e., C = initial sub-intervals). (e.g., imagine that you are observing some device that fails when it gets too hot, and each day you write down current temperature and whether device is working right or not) Simple alg for consistency model: choose concept [0,b] where b is the largest positive example seen. Then verify that no negative example is < b. - Does this work for MB model? - What would work for MB model? use current hypothesis as [0,b] where b is halfway between largest positive and smallest negative example. Cuts down unknown region in half on every mistake. At most n mistakes. Halving algorithm ----------------- Suppose we don't care about computation time. Can we generalize the above idea to learn an *arbitrary* concept class C making at most log(|C|) mistakes, where |C| is the number of concepts in C? Halving algorithm: predict using majority vote over all concepts in C consistent with past data. E.g., there are 3^n conjunctions, so this makes at most O(n) mistakes. (later, we'll see a nice noise-tolerant version of this) Question: is halving algorithm optimal? Say you have unlimited computation time, does this have the best possible mistake bound? On-line learning as a 2-person game ----------------------------------- If we ignore the issue of computational efficiency. we can think of learning in the mistake-bound model as a 2-person game. In particular, for a deterministic algorithm, the game goes like this: 1. The adversary selects some example x, which splits C into two sets: C_0(x) is the set of concepts that label x negative, and C_1(x) is the set of concepts that label x positive. 2. The algorithm gets to pick one of C_0(x) and C_1(x) to throw away (by predicting postive and making a mistake it throws out C_1(x), or by predicting negative and making a mistake it throws out C_2(x)) 3. If only one concept left, then done (view two semantically equivalent concepts as the same). Else, go back to step 1. The value of this game (to the adversary) is the number of rounds the game is played. Looking at the problem this way, we can see a couple of things. First, there is a well-defined notion of optimal strategies for each player and a well-defined number OPT(C) that is the optimal mistake bound for concept class C. What is optimal strategy? Given an example x, we ``just'' calculate OPT(C_0(x)) and OPT(C_1(x)) (by applying this idea recursively), and throw out whichever set has larger mistake bound. Another way to look at it is that OPT(C) is defined recursively as: If |C| = 1 then OPT(C) = 0. Else OPT(C) = 1 + max( min( OPT(C_0(x)), OPT(C_1(x)) ) ) x This algorithm is often called the "Standard optimal algorithm" (more discussion in Littlestone paper). Notice that this is {\em not} necessarily the same thing as the Halving Algorithm. The Perceptron Algorithm ------------------------ Now, I want to describe one of the oldest algorithms (from 60's or earlier) in machine learning, called perceptron algorithm. Perceptron Algorithm: Algorithm for learning a linear threshold function. For simplicity, use threshold of 0, so looking at functions like: w_1x_1 + w_2x_2 + ... + w_nx_n > 0. (write this as w \cdot x > 0) (can simulate a nonzero threshold with "dummy" input x_0 that is always 1) Guarantee is that if data is consistent with a linear threshold function W, then number of mistakes is at most (1/delta)^2, where delta = min |W \cdot x|/(|W|*|x|) x (delta is a measure of the amount of wiggle room available: if we normalize all examples to have length==1, then it is the maximum, out of all consistent planes, of the minimum distance of an example to that plane) Algorithm: Start with all-zeroes weight vector w. Predict positive iff w \cdot x > 0. Mistake on positive: w <- w + x Mistake on negative: w <- w - x (if some examples have really large feature values, then can normalize examples to length 1. I.e., add or subtract x/|x| instead of adding or subtracting x). Example: let's say we have examples: 0 0 + 1 0 - 1 1 + and we cycle through. Let's add dummy x_0 = 1 to all examples. Get sequence of weight vectors: 0 0 0 1 0 0 0 -1 0 1 0 1 0 -1 1 1 0 2 0 -1 2 1 -1 2 Proof of convergence: First, for simplicity, let's assume target plane doesn't pass through any examples. Also, for simplicity, let's replace negative point (x_0, x_1,..., x_n) by positive point (-x_0, ..., -x_n). Also, normalize so all examples have length 1. let w^* be a unit vector representing the target plane. delta = min_x (w^* \cdot x) let w be our hypothesis plane. Look at: (w \cdot w^*) / |w| if we add x to w, then: numerator goes up by x \cdot w^* >= delta. (denominator)^2 becomes (w+x)^2 = w^2 + x^2 + 2(w \cdot x) < w^2 + 1 because w \cdot x is negative. Since we start with w = all zeroes, after m mistakes the numerator is at least delta*m, and the denominator is at most sqrt(m). Numerator cannot be greater than denominator since the quantity we're looking at is the cosine of the angle between w and w^*, so delta*m <= \sqrt{m}, so m <= 1/delta^2. QED