Notes for Jan 21, 1998
----------------------------------------------------------------------
Today, talk about a problem of "predicting from expert advice" and a
nice alg called the weighted majority algorithm.
Setting: each day we get a prediction from some number of experts
about whether it is going to rain. Then, we use them to make our
prediction. Our goal: do nearly as well as best of the experts in
hindsight.
Can view this as basically a fancy way of saying the following:
Let's say we extend the mistake-bound model to allow noise.
Originally, we said our goal was to make at most some bounded
number of mistakes, assuming that the data is consistent with some
concept in C. Could extend to say: "our goal is to make at most some
poly(n) number of mistakes, plus a small constant times the number of
mistakes made by the best concept in C". Then, experts setting (and in
particular, the goal of doing nearly as well as the best expert in
hindsight) is really the case of C = {single variable concepts}.
Weighted-majority Algorithm (2 versions)
---------------------------
Version 1: initialize all experts to weight 1. Given an example, we
poll all the experts and predict based on a weighted majority
vote of their predictions. We then cut in half the weights of all
experts that make a mistake. (No rewards for being right.)
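A minimal sketch of Version 1 in Python (the data here is a made-up
illustration, not from the notes: expert 0 is always right and the
rest are always wrong):

```python
import math

def weighted_majority(expert_preds, outcomes):
    """Deterministic weighted majority (Version 1).

    expert_preds: list of T lists, each holding n binary predictions (0/1).
    outcomes: list of T correct answers (0/1).
    Returns the number of mistakes the algorithm makes.
    """
    n = len(expert_preds[0])
    w = [1.0] * n                          # all experts start at weight 1
    mistakes = 0
    for preds, truth in zip(expert_preds, outcomes):
        # predict by weighted majority vote
        w1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if w1 >= sum(w) / 2 else 0
        if guess != truth:
            mistakes += 1
        # halve the weight of every expert that was wrong (no rewards)
        w = [wi / 2 if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes

# hypothetical data: expert 0 always right, the other three always wrong
n, T = 4, 10
outcomes = [t % 2 for t in range(T)]
preds = [[y] + [1 - y] * (n - 1) for y in outcomes]
M = weighted_majority(preds, outcomes)
bound = 2.41 * (math.log2(n) + 0)          # best expert makes m = 0 mistakes
```

On this data the algorithm makes 2 mistakes (the bad experts start with
3/4 of the weight and get halved twice before they are outvoted), well
inside the 2.41(log_2 n + m) bound proved below.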
Claim: this makes at most some small constant times the # of mistakes
of the best expert, plus additive amount that is logarithmic in the
number of experts.
Analysis: Let
n = number of experts
M = # mistakes we've made so far.
m = # mistakes best expert so far has made.
W = total weight.
After each mistake we make, at least half the total weight was on
the wrong answer and gets halved, so W <-- at most W*(3/4).
So, W <= n*(3/4)^M.
weight of best expert = (1/2)^m.
Now, notice that the total weight cannot be smaller than the
weight of the best expert.
Solve: n*(3/4)^M >= (1/2)^m, or, (4/3)^M <= n*2^m.
Taking logs, M log_2(4/3) <= log_2 n + m, and since
1/log_2(4/3) ~ 2.41, we get M <= 2.41(log_2 n + m).
Comparison with Halving Alg
---------------------------
How does this compare with the halving algorithm from last time? Can
think of each concept in a class C as an expert. In the no-noise MB
setting (there exists *some* correct concept), the halving alg made
log(|C|) mistakes. That's gone up by a factor of about 2.4. But, now we're
robust to the case where the optimal concept is occasionally faulty:
we make at most 2.4 mistakes for every one made by best concept in
hindsight.
Randomized WM
-------------
If best concept (or best expert) has, say, 40% error rate, then this
guarantee is not so good. We'd like to improve dependence on m.
Version 2: instead of predicting based on majority vote, let's use the
weights as probabilities. (Randomized weighted majority). E.g., w_0
weight on outcome 0, w_1 on outcome 1, etc. W = sum. Then, we'll
predict outcome i with prob w_i/W. Goal: bound our worst-case
expected number of mistakes, and let's assume that the adversary (the
world) has to select one of the answers as correct before we make our
coin toss.
Why is this better in the worst case? Idea: worst case for
deterministic algorithm was when weights split 50/50. But, now it's
not so bad since we also have 50/50 chance of getting it right.
Also, to trade-off between dependence on m and log(n), we'll
generalize to multiply by beta < 1, instead of necessarily 1/2.
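A sketch of Version 2 in Python. Rather than flipping coins, it tracks
F_i (the fraction of weight on the wrong answer at step i, defined in
the analysis below), since our expected number of mistakes is exactly
sum(F_i). The test data is the same hypothetical setup as before:

```python
import math

def randomized_wm_expected(expert_preds, outcomes, beta=0.5):
    """Randomized weighted majority (Version 2), binary outcomes.

    F_i = fraction of weight on the wrong answer at step i is the
    probability we err at step i, so expected mistakes = sum(F_i).
    Each wrong expert's weight is multiplied by beta < 1.
    Returns (expected mistakes, mistakes of the best expert).
    """
    n = len(expert_preds[0])
    w = [1.0] * n
    expert_mistakes = [0] * n
    expected_M = 0.0
    for preds, truth in zip(expert_preds, outcomes):
        W = sum(w)
        F = sum(wi for wi, p in zip(w, preds) if p != truth) / W
        expected_M += F                    # prob. of a mistake this step
        for j, p in enumerate(preds):
            if p != truth:
                w[j] *= beta               # penalize the wrong experts
                expert_mistakes[j] += 1
    return expected_M, min(expert_mistakes)

# hypothetical data: expert 0 always right, the other three always wrong
n, T = 4, 10
outcomes = [t % 2 for t in range(T)]
preds = [[y] + [1 - y] * (n - 1) for y in outcomes]
EM, m = randomized_wm_expected(preds, outcomes, beta=0.5)
bound = (m * math.log(1 / 0.5) + math.log(n)) / (1 - 0.5)
```

With beta = 1/2 on this data, the expected mistake count comes out
around 2.4, under the (m ln(1/beta) + ln n)/(1-beta) bound of 2 ln 4.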
Analysis: Define F_i to be the fraction of weight on the WRONG
answers at the ith example. So, F_i is the probability we make a mistake
on the ith example. Let M be a random variable denoting our total
number of mistakes. So E[M] = sum(F_i), using the fact that
Expectation is additive.
On ith example, W <-- W(1 - (1 - beta)F_i).
Reason: on F_i fraction, we're multiplying by beta.
So, final weight W = n(1 - (1-beta)F_1)(1 - (1-beta)F_2)...
Let's say that m is the number of mistakes of the best expert so far.
We can use the inequality W >= beta^m.
Now we solve. First take natural log of both sides. We get:
m ln(beta) <= ln(n) + sum[ln(1 - (1-beta)F_i)]
Simplify: ln(1-x) = - x - x^2/2 - x^3/3 - ... <= -x.
So, ln(1 - (1-beta)F_i) <= -(1-beta)F_i.
m ln(beta) <= ln(n) - (1-beta)*sum(F_i)
Now, use E[M] = sum(F_i).
Result is:
                    m ln(1/beta) + ln(n)
          E[M] <=  ----------------------
                         (1 - beta)
Same rough form as before. Have we made progress? Yes: e.g.,
beta=1/2, get 1.39m + 2ln(n)
beta=3/4, get 1.15m + 4ln(n)
Roughly, of the form (1+epsilon)*m + 1/epsilon*ln(n).
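These constants can be checked directly from the bound, with
ln(1/beta)/(1-beta) multiplying m and 1/(1-beta) multiplying ln(n):

```python
import math

for beta in (0.5, 0.75):
    m_coeff = math.log(1 / beta) / (1 - beta)   # coefficient on m
    n_coeff = 1 / (1 - beta)                    # coefficient on ln(n)
    print(f"beta={beta}: {m_coeff:.2f}*m + {n_coeff:.0f}*ln(n)")
```

As beta -> 1 the coefficient on m drops toward 1 while the coefficient
on ln(n) blows up, which is the (1+epsilon) vs. 1/epsilon trade-off.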
------------------------------------------------------------------------
Uses of Randomized WM
---------------------
One nice feature of RWM is that it can be applied in settings where
the experts' predictions are not things that can be combined (or
can't be combined easily).
Example: each expert is telling you a different way to drive to work.
You pick one using RWM. Later you find out how well you would have
done (as you drive you listen to the traffic report) and penalize
appropriately.
To do this right, we want to generalize from just "loss" of 0 or 1 to
losses in [0,1]. Goal of having expected loss be not too much worse
than loss of best expert.
How to generalize? Answer: multiply an expert's weight by beta^loss.
I.e., two examples of loss 1/2 give the same weight change as one
example of loss 1 and one example of loss 0.
Analysis still goes through: basically, just need to replace F_i by
p_i*l_i where p_i is our prob distribution at time i, and l_i is the
loss vector at time i. (So, if losses are {0,1} this is the same
thing as F_i). E.g.,
Our expected loss at time i is p_i*l_i, so our total expected
loss is sum(p_i*l_i).
The weight change part is a little trickier:
New weight of expert j is (old weight) * beta^(its loss).
Since beta^x <= 1 - (1-beta)x for x in [0,1] (beta^x is convex, so
it lies below the chord from (0,1) to (1,beta)), this is at most
(1 - (1-beta)*(its loss))*(old weight)
So, this means we can still use the formula
(new W) <= (old W)*(1 - (1 - beta)p_i*l_i)
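A sketch of the [0,1]-loss version in Python (the loss vectors below
are made up for illustration: expert 0 is perfect, expert 1 is always
wrong, expert 2 is middling):

```python
import math

def rwm_losses(loss_vectors, beta=0.5):
    """Multiplicative-weights update for losses in [0,1].

    loss_vectors: list of T lists, where l[j] = loss of expert j
    at that time step. We play the distribution p = w / sum(w) and
    multiply each expert's weight by beta**loss.
    Returns (our expected total loss, best expert's total loss).
    """
    n = len(loss_vectors[0])
    w = [1.0] * n
    totals = [0.0] * n
    our_loss = 0.0
    for l in loss_vectors:
        W = sum(w)
        p = [wi / W for wi in w]
        our_loss += sum(pi * li for pi, li in zip(p, l))  # p_i . l_i
        w = [wi * beta ** li for wi, li in zip(w, l)]     # beta^loss
        totals = [t + li for t, li in zip(totals, l)]
    return our_loss, min(totals)

losses = [[0.0, 1.0, 0.5] for _ in range(20)]
ours, best = rwm_losses(losses, beta=0.5)
bound = (best * math.log(1 / 0.5) + math.log(3)) / (1 - 0.5)
```

The same (m ln(1/beta) + ln n)/(1-beta) bound holds with m replaced
by the best expert's total loss, and the run above stays inside it.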
Neat use of RWM:
---------------
--> Use in setting of 2-player games.