15-859(A) Machine Learning Theory                                  04/06/04

 * Maximum entropy and maximum likelihood exponential model
 * Connection to Winnow.
===========================================================================
- 2 different, natural optimization criteria that have the same solution.
- turns out to do well in practice too.
- Winnow (balanced/multiclass version) can be viewed as a fast
  approximation.

We'll be assuming that examples are given by n boolean features. It is
also convenient to have a "dummy feature" x_0 = 1 in every example.

High level
----------
First of all, we are going to be producing a hypothesis that, given x,
outputs a probability distribution over labels. E.g., labels might be
{+,-} or might come from a larger set.

  p_h(y|x) = the amount of probability mass that h(x) assigns to label y.

Let S = the data sample. p_S(y|x) = 1 if the label of x is y, and 0
otherwise.

Second, we're not going to make assumptions about there being a correct
target function of some form --- we're just going to ask for the
hypothesis that maximizes a certain objective function. This is going to
be a convex optimization problem.

Here are 2 natural criteria that turn out to be equivalent.

1. Maximum entropy
==================
This is a general principle for fitting a "maximally ignorant" prob dist
given known constraints. E.g., you roll a die many times and find that
the average value is 4 instead of 3.5. You could then ask for the
maximum entropy distribution subject to this constraint.

The entropy of a distribution p over y is defined as:

  H(p) = sum_y p(y)*log(1/p(y))

It's easiest to think of in terms of information theory. Say I want to
send a stream of symbols y to you using the fewest possible bits, under
the assumption that each symbol is generated iid according to
distribution p. If you ignore the fractionality issue (or, equivalently,
you encode blocks of symbols at a time), the optimal encoding uses
log(1/p(y)) bits to encode y, so H(p) is the average number of bits per
symbol. The uniform distribution has the highest entropy.
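To make the die example concrete, here is a small Python sketch (not from the notes; the mean-4 distribution below is just one made-up distribution with that mean, not the actual maxent solution):

```python
import math

def entropy(p):
    """H(p) = sum_y p(y)*log(1/p(y)), in bits; terms with p(y)=0 contribute 0."""
    return sum(py * math.log2(1.0 / py) for py in p if py > 0)

uniform = [1/6] * 6                             # fair die, mean 3.5
mean4   = [0.10, 0.15, 0.15, 0.15, 0.15, 0.30]  # made-up distribution with mean 4
point   = [0, 0, 0, 1, 0, 0]                    # always rolls 4; mean 4, entropy 0

# The uniform distribution has the highest entropy; constraining the mean
# to be 4 forces the entropy down, all the way to 0 for a point mass.
assert entropy(uniform) > entropy(mean4) > entropy(point)
```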
In the context of a probabilistic function over a dataset, the standard
measure used is the average entropy over the data. I.e.,

  H(p_h) = average_{x in S} sum_y p_h(y|x) log[1/p_h(y|x)]

The maximum entropy p_h is the one that predicts y uniformly at random,
no matter what x is. But, let's say we want h to agree in certain ways
with the sample. E.g., if the sample is 30% positive, then we want h
(over x in S) to be 30% positive on average. If the sample is 45%
positive given that x_i = 1, then we want h to agree with that too.
I.e.,

  1. for all y, sum_{x in S} p_h(y|x) = sum_{x in S} p_S(y|x)
  2. for all y, for all x_i,
     sum_{x in S: x_i=1} p_h(y|x) = sum_{x in S: x_i=1} p_S(y|x).

Note: it's easy to achieve these by just memorizing the data: let
p_h = p_S.
Note: we can merge constraint 1 into constraint 2 because of the dummy
feature x_0.

DEF: Let P be the set of probability distributions satisfying these
constraints. Our objective: find the MAXIMUM ENTROPY h in P.

[Note 1: usually in this area, a "feature" is a property of the example
and the label, like a statistical query. E.g., "x_i = 1 and the label is
+". But I will use "feature" in the standard machine-learning sense. It
doesn't matter much.]

[Note 2: this is a pretty strange objective. In particular, it says
nothing about what h should do on points x not in S! Luckily, there will
exist "reasonable" functions that optimize this objective.]

2. MAXIMUM LIKELIHOOD "EXPONENTIAL MODEL" or "LOGISTIC LINEAR SEPARATOR"
========================================================================
Here's a natural family (let's call it "Q") of probabilistic hypotheses.
Let w_{yi} be a weight for feature i and label y, and let w_y be the
weight vector for label y. Let p_h(y|x) be proportional to e^{w_y.x}.
I.e.,

  p_h(y|x) = e^{w_y.x}/Z(x),  where Z(x) = sum_y e^{w_y.x}.

Of all these, pick the maximum likelihood distribution. I.e., we want
the h in Q that maximizes the likelihood of the data.
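A distribution in the family Q is just a softmax of per-label linear scores. Here is a Python sketch (the weights and feature vector are made-up toy values):

```python
import math

def p_h(w, x, labels):
    """Exponential model: p_h(y|x) = e^{w_y.x} / Z(x), Z(x) = sum_y e^{w_y.x}.
    w maps each label y to its weight vector w_y; x includes the dummy x_0 = 1."""
    scores = {y: math.exp(sum(wi * xi for wi, xi in zip(w[y], x)))
              for y in labels}
    Z = sum(scores.values())                 # normalizer Z(x)
    return {y: s / Z for y, s in scores.items()}

# Toy weights for 2 labels over 3 features (x_0 is the dummy feature).
w = {'+': [0.0, 1.0, -0.5], '-': [0.0, -1.0, 0.5]}
dist = p_h(w, [1, 1, 0], ['+', '-'])
assert abs(sum(dist.values()) - 1.0) < 1e-9  # a valid distribution over labels
assert dist['+'] > dist['-']                 # w_+ . x = 1 beats w_- . x = -1
```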
If you just think of positive and negative examples, you can think of
this as asking for the best linear separator (with w_i = w_{i+} - w_{i-})
according to a measure that penalizes according to negative
log-likelihood. Points near the separator get a small penalty, and
points on the wrong side of the separator get a penalty that is roughly
proportional to their distance to the plane (do the calculation...).

Interesting theorem: criteria 1 and 2 are duals. A p* in the
intersection of P and Q is the optimum solution to both.

ANALYSIS of MAXENT and MAXIMUM-LIKELIHOOD
=========================================
First, for a general distribution p_h, what is the likelihood? It is

  product_{x in S} p_h(label(x)|x).

Let's take the log and divide by |S|. I.e., we want to find h in Q to:

  maximize average_{x in S} log(p_h(label(x)|x)).

Let's negate and minimize. Using the definition of p_S, we can write it
like this:

  minimize average_{x in S} sum_y p_S(y|x) log[1/p_h(y|x)].

This has an interesting interpretation. For two distributions p,q,
let's define:

  H(p,q) = sum_y p(y)log[1/q(y)].

This is the average number of bits to encode y distributed according to
p, if we use the optimal code for q. H(p,p) = H(p). Clearly,
H(p,q) >= H(p,p). Let's extend this definition to average over {x in S}
if p,q are functions over x:

  H(p,q) = average_{x in S} sum_y p(y|x)log[1/q(y|x)]

So, HERE IS THE KEY POINT:
 - for criterion (2) we want the h in Q that minimizes H(p_S,p_h).
 - for criterion (1) we want the h in P that maximizes H(p_h,p_h).

The rest of the analysis will follow from the following lemma:

LEMMA: let p, p' in P, q in Q. Then H(p,q) = H(p',q).

[I.e., if we optimize our code for q, then all distribs in P are equally
good for us. We can call this quantity H(P,q), since it's the same for
all distributions in P.]

PROOF: H(p,q) = average_{x in S} sum_y p(y|x)*[log(Z(x)) - w_y.x]

Notice that Z doesn't depend on y, and sum_y p(y|x) = 1. So, the Z part
just gives average_{x in S}[log(Z(x))], which doesn't depend on p.
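The inequality H(p,q) >= H(p,p) used above (Gibbs' inequality) can be spot-checked numerically; here is a Python sketch with made-up distributions:

```python
import math

def cross_entropy(p, q):
    """H(p,q) = sum_y p(y)*log(1/q(y)): bits to encode y ~ p using q's optimal code."""
    return sum(py * math.log2(1.0 / qy) for py, qy in zip(p, q) if py > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# H(p,p) = H(p), and coding for the wrong distribution q can only cost more:
assert cross_entropy(p, q) >= cross_entropy(p, p)
```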
For the rest of it, we have:

  - average_{x in S} sum_y p(y|x)*[sum_i w_{yi}*x_i]
    = - sum_y sum_i w_{yi}*[average_{x in S} p(y|x)*x_i]

Notice that for a given i,y, the inside part is non-zero only for x in S
such that x_i=1. It really only depends on the average over
{x in S: x_i=1} of p(y|x). By definition of P, this is the same for all
p in P: we can replace p with p_S. So, it didn't matter which p in P we
were looking at. QED.

OK, now let's use this to prove that if p* is in both P and Q, then it's
the optimum solution to both criteria.

Maximizing likelihood over Q: I.e., we want to show that
H(p_S, p*) <= H(p_S, q) for any q in Q. Notice that p_S is in P. So, by
the lemma, the LHS equals H(p*,p*) and the RHS equals H(p*,q), so by the
definition of H, LHS <= RHS.

Maximizing entropy over P: I.e., we want to show that H(p*,p*) >= H(p,p)
for any p in P. Since p* is in Q, by the lemma the LHS equals H(p,p*).
So again we have what we want by the definition of H. This proves the
result.

All that's left is to argue that such a p* exists: i.e., it is possible
to satisfy the sample constraints of P using a model of the exponential
form Q. Actually, we're in a little trouble here: what if S has 100%
positive examples, or all examples with x_5=1 are negative? We can
handle this by either (a) replacing Q with its closure \bar{Q}, or else
(b) redefining p_S to imagine the dataset had at least one positive and
one negative example with all 1's in it (this is probably the better
thing to do in practice). We won't prove this part of the result, but
this is where the dummy feature x_0 comes in.

RELATION TO BALANCED/MULTICLASS WINNOW
======================================
Here is a version of Winnow called "Balanced Winnow" that also works for
multiclass problems:
 - have a vector of weights w_y for each label y.
 - initialize all w_{yi} to 1. Also have the dummy feature x_0 as
   before.
 - Given example x, predict the label y such that w_y . x is largest.
 - If we make a mistake, then for each i with x_i = 1, do:
     For the correct label y:   w_{yi} = w_{yi} * (1+epsilon)
     For the predicted label y: w_{yi} = w_{yi} * (1-epsilon)

So, in terms of a decision procedure, the rule used has the same form as
Q (but we're just outputting our highest-probability label).

Claim: Balanced-Winnow also approximately satisfies the constraints in
the defn of P (in the sense that the number of times it predicts y given
that x_i = 1 is apx equal to the actual number in the data seen so far).

Proof: For simplicity, let's just look at the case of +/- labels. Then,

1. The total weight starts at 2n and never increases. (On a mistake, the
   correct label's weight on the active features goes up by
   epsilon*(w_y . x), and the predicted label's goes down by at least
   that much, since the predicted label had the larger dot-product.)

2. The number of mistakes made on positives (m_p) is approximately the
   same as the number of mistakes made on negatives (m_n). This is
   because of the "dummy" variable x_0 and the fact that the weights
   w_{0,+} and w_{0,-} can't get larger than 2n. So, for instance,

     (1+epsilon)^{m_p} * (1-epsilon)^{m_n} < 2n,

   so m_p cannot be too much more than m_n. Similarly, m_n cannot be too
   much more than m_p. This means the number of times it predicts
   positive is approximately the number of positive examples in the
   dataset.

3. This statement holds over S_i = {x in S: x_i = 1} too. It's just the
   same argument using i > 0.

In practice (at Whizbang, anyway), Winnow gives almost the same behavior
as true maxent but is a lot faster...

--------------------------------------------
More info on maxent:
 - http://www.cs.cmu.edu/~aberger/maxent.html and
 - http://www.cis.upenn.edu/~adwait/statnlp.html

(this lecture was in part based on some of the above material)
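The Balanced Winnow procedure described above can be sketched in Python as follows (a toy illustration; the dataset, epsilon, and number of passes are made-up choices, not from the notes):

```python
def predict(w, x, labels):
    """Predict the label y whose dot-product w_y . x is largest."""
    return max(labels, key=lambda y: sum(wi * xi for wi, xi in zip(w[y], x)))

def balanced_winnow(examples, labels, epsilon=0.1, passes=100):
    """Balanced Winnow: one weight vector per label, all weights start at 1.
    examples: list of (x, y) pairs, where x is a 0/1 feature vector whose
    first coordinate is the dummy feature x_0 = 1.  On a mistake, active
    weights of the correct label are multiplied by (1+epsilon) and those
    of the predicted label by (1-epsilon)."""
    n = len(examples[0][0])
    w = {y: [1.0] * n for y in labels}
    for _ in range(passes):
        for x, y in examples:
            pred = predict(w, x, labels)
            if pred != y:
                for i, xi in enumerate(x):
                    if xi == 1:
                        w[y][i] *= 1 + epsilon
                        w[pred][i] *= 1 - epsilon
    return w

# Toy dataset: label is '+' exactly when x_1 = 1 (x_0 is the dummy).
data = [([1, 1, 0], '+'), ([1, 1, 1], '+'), ([1, 0, 1], '-'), ([1, 0, 0], '-')]
w = balanced_winnow(data, ['+', '-'])

# Point 1 of the proof: total weight starts at 2n = 6 and never increases.
assert sum(w['+']) + sum(w['-']) <= 6.0 + 1e-9
# After enough passes, the learned weights classify the toy data correctly.
assert all(predict(w, x, ['+', '-']) == y for x, y in data)
```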