15-859(B) Machine Learning Theory                              01/25/12

Online learning contd
 * Online learning and game theory (from slides from last time)
 * The Winnow algorithm for disjunctions
 * Winnow for k-of-r functions and general LTFs in terms of L_1 margin
=======================================================================

First, a recap of last time and some connections to game theory (see
slides).

WINNOW ALGORITHM
================
If you think about the problem of learning an OR-function, we saw an
algorithm -- "list all features and cross off bad ones on negative
examples" -- that makes at most n mistakes.  But what if most features
are irrelevant?  E.g., if we represent a document as a vector indicating
which words appear in it and which don't, then n is pretty large!  What
if the target is an OR of r relevant features, where r is a lot smaller
than n?  Can we get a better bound in that case?

What could we do if computation time were no object?  How many bits does
it take to describe an OR of r variables, where r << n?  Ans: O(r log n).
So, in principle, we'd like to obtain a bound like this.  Winnow will
give us a bound of O(r log n) mistakes efficiently.  This means we pay
only a small penalty for "throwing lots of features at the problem".
In general, we will say that an algorithm with only polylog dependence
on n is "attribute-efficient".

Winnow Algorithm (basic version):

1. Initialize the weights w_1, ..., w_n of the variables to 1.

2. Given an example x = (x_1, ..., x_n), output + if
   w_1 x_1 + w_2 x_2 + ... + w_n x_n >= n, else output -.

3. If the algorithm makes a mistake:

   (a) If the algorithm predicts negative on a positive example, then
       for each x_i equal to 1, double the value of w_i.

   (b) If the algorithm predicts positive on a negative example, then
       for each x_i equal to 1, set w_i = 0.

4. Repeat (goto 2).

THEOREM: The Winnow Algorithm learns the class of disjunctions in the
Mistake Bound model, making at most 1 + 2r(1 + lg n) mistakes when the
target concept is an OR of r variables.
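Before the proof, here is the basic algorithm as a short Python sketch
(the function name and example stream below are illustrative, not from
the lecture):

```python
def winnow(examples, n):
    """Basic Winnow: predict + iff w.x >= n; on a false negative,
    double the weights of the on-variables; on a false positive,
    zero them out.  Returns the number of mistakes made."""
    w = [1.0] * n
    mistakes = 0
    for x, label in examples:          # x is a 0/1 list, label is +1/-1
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else -1
        if pred != label:
            mistakes += 1
            if label == 1:             # mistake on a positive example
                w = [2 * wi if xi else wi for wi, xi in zip(w, x)]
            else:                      # mistake on a negative example
                w = [0.0 if xi else wi for wi, xi in zip(w, x)]
    return mistakes
```

Note that only the weights of variables that are "on" in the example are
ever touched, which is what makes the relevant-weight accounting in the
proof below work.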
PROOF: Let us first bound the number of mistakes that will be made on
positive examples.  Any mistake made on a positive example must double
at least one of the weights in the target function (the *relevant*
weights), and a mistake made on a negative example will *not* zero any
of these weights, by definition of a disjunction.  Furthermore, each of
these weights can be doubled at most 1 + lg(n) times, since only weights
that are less than n can ever be doubled.  Therefore, Winnow makes at
most r(1 + lg(n)) mistakes on positive examples.

Now we bound the number of mistakes made on negative examples.  The
total weight summed over all the variables is initially n.  Each mistake
made on a positive example increases the total weight by at most n
(since before doubling, we must have had w_1 x_1 + ... + w_n x_n < n).
On the other hand, each mistake made on a negative example decreases the
total weight by at least n (since before zeroing, we must have had
w_1 x_1 + ... + w_n x_n >= n).  The total weight never drops below zero.
Therefore, the number of mistakes made on negative examples is at most
the number of mistakes made on positive examples, plus 1: that is, at
most 1 + r(1 + lg(n)).  Adding this to the bound on the number of
mistakes on positive examples yields the theorem.  QED.

Open question: is there an attribute-efficient algorithm for Decision
Lists?  We want a mistake bound poly(L, log n), where L is the length
of the list.

--------------------------------------------------------------------------

Winnow for Majority Vote functions
===================================
How about learning a majority-vote (k of r) function: n variables total,
r of them relevant, and an example is positive iff at least k of the
relevant variables are on.  (E.g., classify a message as SPAM if at
least k keywords are present.)  Say we know k.  Let epsilon = 1/(2k).
Our algorithm will multiply the weights of the on-variables by 1+epsilon
on a mistake on a positive example, and divide them by 1+epsilon on a
mistake on a negative example.
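This multiplicative-update variant can be sketched in Python as follows
(a sketch under the same threshold-n prediction rule as before; the
function name is mine, not from the notes):

```python
def winnow_kofr(examples, n, k):
    """Balanced Winnow for k-of-r majority-vote targets: predict + iff
    w.x >= n; multiply on-weights by 1+eps on false negatives, divide
    by 1+eps on false positives, with eps = 1/(2k)."""
    eps = 1.0 / (2 * k)
    w = [1.0] * n
    mistakes = 0
    for x, label in examples:          # x is a 0/1 list, label is +1/-1
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else -1
        if pred != label:
            mistakes += 1
            factor = (1 + eps) if label == 1 else 1.0 / (1 + eps)
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return mistakes
```

The only change from the basic version is that the harsh double/zero
updates become gentle multiply/divide updates, which is what the chip
argument below analyzes.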
Think of multiplying a weight by 1+epsilon as putting a poker chip on
that weight, and dividing as removing a chip (a weight can have a
negative number of chips).  The maximum number of chips ever on the
relevant weights is <= r*log_{1+epsilon}(n) ~ (r/epsilon)*ln(n).

Mistake on positive: we put >= k chips on relevant weights (at least k
  of them are on).  The total weight goes up by at most epsilon*n
  (since the total weight was below n before the update).
Mistake on negative: we take <= k-1 chips off relevant weights.  The
  total weight goes down by at least n*(epsilon/(1+epsilon)) (since the
  total weight was at least n before the update).

Use these to create two inequalities in the two unknowns M_pos, M_neg:

  (1)  k*M_pos - (k-1)*M_neg <= (r/epsilon)*ln(n)        [chip count]
  (2)  n + M_pos*n*epsilon >= M_neg*n*(epsilon/(1+epsilon))
                                                   [total weight >= 0]

Solve: rewrite (2) as

  M_neg <= (1+epsilon)*M_pos + (1+epsilon)/epsilon.

Plug this into (1) to get:

  [k - (k-1)(1+epsilon)]*M_pos
       <= (r/epsilon)*ln(n) + (k-1)(1+epsilon)/epsilon.

We set epsilon = 1/(2k) precisely so that (k-1)(1+epsilon) < k - 1/2,
which gives:

  (1/2)*M_pos <= (r/epsilon)*ln(n) + (k-1)(1+epsilon)/epsilon.

So M_pos = O(rk*log(n)), and likewise for M_neg.

What if we don't know k?  If we know r, we can guess and double.  If we
don't know r either, we can guess and double r too, but then the cost
is O(r^2*log(n)).  So we have:

THEOREM: Winnow learns the class of "k of r functions" with at most
O(r^2 log n) mistakes.

----------------------------------------------------------------

Using Winnow for general LTFs
=============================
What if the target is a general linear threshold function of the form
w*_1 x_1 + ... + w*_n x_n >= w*_0?  Let's scale so that all weights are
integers, and assume all are non-negative (we can arrange that by
introducing new variables y_i = 1 - x_i).  Let W = w*_1 + ... + w*_n.

If we know W (or, say, W is just an upper bound on the true value), then
we can solve the problem like this: just repeat each variable W times.
We then have a "k = w*_0 out of r = W" problem, so we get a mistake
bound of O(W^2 log(nW)).

Now, here's a cute thing: the above algorithm --- repeating each
variable W times --- does *exactly* the same thing as if we had run the
algorithm without repeating each variable!
(It's equivalent to initializing each weight to W instead of 1 and using
a threshold of nW instead of n, which in turn is the same as initializing
the weights to 1 and using a threshold of n.)  So we really didn't have
to do anything!  We get a good bound as a function of the L_1-size of
the solution.  (In fact, the W inside the log can be removed as well.)

More generally, one can argue directly, using the same basic analysis,
that if the target vector w* is a general LTF such that w* \cdot x > c
on positive examples and w* \cdot x < c - alpha on negative examples,
then the mistake bound is O((L_1(w*)/alpha)^2 log(n)).  I.e., we don't
have to assume integer weights.

This can also be extended to the case where the features x_i aren't just
in {0,1}.  In that case, we need to multiply L_1(w*) by L_infty(X), the
maximum magnitude of any feature in the instance space.  [Will do this
on hwk 2.]  The quantity

  gamma = alpha / [L_1(w*) * L_infty(X)]

is called the "L_1 margin" of the separator, and our bound is
O((1/gamma^2) * log(n)).

The Perceptron algorithm, which we'll look at next time, has a mistake
bound of O(1/gamma^2) where gamma = alpha / [L_2(w*) * L_2(X)].  If n is
large but most features are irrelevant (i.e., the target is sparse but
the examples are dense), then Winnow is better, because adding
irrelevant features increases L_2(X) but not L_infty(X).  On the other
hand, if the target is dense and the examples are sparse, then
Perceptron is better.
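To make the sparse-target comparison concrete, here is a small sketch
that computes the two (constant-free) bound expressions for a made-up
sparse target over dense boolean examples; the numbers and helper names
are illustrative, not from the lecture:

```python
import math

def winnow_bound(w, X, alpha, n):
    """(1/gamma^2) * log n with gamma = alpha / (L1(w) * Linf(X))."""
    l1 = sum(abs(wi) for wi in w)
    linf = max(abs(xij) for x in X for xij in x)
    gamma = alpha / (l1 * linf)
    return (1.0 / gamma ** 2) * math.log(n)

def perceptron_bound(w, X, alpha):
    """1/gamma^2 with gamma = alpha / (L2(w) * L2(X))."""
    l2w = math.sqrt(sum(wi * wi for wi in w))
    l2x = max(math.sqrt(sum(xij * xij for xij in x)) for x in X)
    gamma = alpha / (l2w * l2x)
    return 1.0 / gamma ** 2
```

With n = 10000 features, a target with 5 unit weights (L_1 = 5,
L_2 = sqrt(5)), dense boolean examples (L_infty(X) = 1 but
L_2(X) = 100), and margin alpha = 1, Winnow's bound is about
25 * ln(10000) while Perceptron's is 5 * 10000, illustrating the
trade-off in the last paragraph.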