Notes for 1/26/98

Admin: did the extra copies of the book come in yet?  Note: book
doesn't talk about material we've been doing so far, but we'll be
switching topics soon to material that's in the book.

--------------------------------------------------------------------------

-- recap from last time about Randomized WM:

- algorithm: use weights as probabilities.  Update multiplicatively.

- we found this had a nice bound:  M <= (m*ln(1/beta) + ln(n)) / (1 - beta)

As time goes on, the first term will be increasing and the second is
static (it's an additive constant), so want to adjust beta with time.

For instance, you can get that in T trials,

	M <= m + ln(n) + O(sqrt(T ln n))

One way of looking at this, if we divide both sides by T, is that our
average loss per trial is approaching that of the best expert at a
rate like 1/sqrt(T).

Here is a neat example for the case of 2 experts.  Imagine someone was
flipping a coin of bias p.  You want to predict outcome.  Best
prediction is 0 if p<1/2 or 1 if p>1/2.  prob of making a mistake is
min(p, 1-p).  If you didn't know p, you could take majority vote of
all outcomes you'd seen so far.  In the limit, your prob of mistake
would approach this.

Now, what if you remove assumption bits are coming from a coin.  Just
some guy saying H, T, T, T, H, ....  Can you have a strategy where, if
in hindsight there were 30% heads, you made a mistake only a bit more
than 30% of the time? 

This is a special case of the 2-player game setting

zero-sum 2-player games  (aka matrix games)
-------------------------------------------
Here is a game: A and B each hide either a nickel or a dime behind
their backs.  Then they reveal their coins.  If match, then A gets
both. Else, B gets both.  This is an example of a 2-person zero-sum
game.  Zero-sum means that what A wins = what B loses, and vice versa.

Is this a fair game? (who would you rather be?)  What's a good strategy?

This kind of game is often called a MATRIX GAME:

Payoff matrix (payoff to A)
				5  -5
			       -10 10	

row-player picks a row and column-player picks a column, and you get payoff.

"Pure strategy"  -> deterministic choice of row(column)

"Mixed strategy" -> probabilistic choice.


For a given (pure or mixed) strategy S_A of A and S_B of B, the value
V(S_A, S_B) of the game to A is A's expected winning per round, given
that both players follow this strategy.  E.g., if S_A is "A picks
randomly" and S_B is "B always hides a nickel", then V() = -2.5 cents.

For a given S_A, what is   min  V(S_A, S_B)?
			   S_B

Can think of this as the value of the game if A uses S_A, and
announces this to B, who can then use this information to pick his
best strategy S_B.

In fact, if A announces his strategy, then B can pick a pure
(non-randomized) strategy.  This is like the problem of selecting
stocks when you know the future -- there's no reason to diversify.

For a given S_B, what does    max  V(S_A, S_B) mean?
	     	              S_A

Can think of this as the value of the game if B uses S_B, and
announces this to A, who can then use this information to pick his
best strategy S_A.

Suppose that someone has to announce their strategy first.  Which
would you rather be?

Minimax theorem:

	max  min  V(S_A, S_B)   =   min max V(S_A, S_B)
        S_A  S_B                    S_B S_A

This is the value of the game V.

Easy direction to see is the <=

To get the >= direction, suppose for contradiction that the LHS was V-epsilon.
Now, lets run our on-line algorithm.  On every play, our opponent
knows our randomized strategy so we should not be able to make more
than V-epsilon on average per game.  But, in hindsight, there must be
a fixed row that would have achieved value at least V.   This
contradicts the theorem.

----------------------------------------------

Turns out this kind of thing has been proved over and over in
different areas in the past (though improvements/simplifications over time)
	Hannan '57. 
	Banos '68
	Megiddo '80
	learning theory papers in '90s

Problem in game setting is that the weight on expert/strategy i
depends on just the total loss of expert i.  E.g., if I'm playing
against this algorithm, and I say "rock" 100 times in a row, then it
will do great, but the weight on paper will be 1, the weight on rock
will be 2^(-50), and the weight on scissors will be 2^(-100).
So, if I now say scissors, it will keep saying paper for a long time
(about 50 times).  Then I can do paper and rock for a while since it
is not going to say scissors for a long time.

So, to be good in practice, we'd like to be more adaptive.  Can you
think of ways to do that?

-- could refuse to lower weights that are already really low
-- when you remove some amount x from the total weight W, you could
   put some of it back evenly among the experts.

(Homework question on analyzing this in more detail -- do easier case
of deterministic alg on hwk, but same idea applies to randomized alg too).

(A potential project: might be fun to try various algs on people for
simple game (or not so simple game), say with 100 reps, and see how
they do.)


Winnow Algorithm
----------------
Now turn to an alg that is something like a combination of the WM
alg and the Perceptron alg.

Look at in this context:  Say we're trying to learn an OR function.
"This image is a picture of a person if it has this feature or that
feature or that feature".  We saw an algorithm: "list all features and
cross off bad ones on negative examples" that can make n mistakes.
But, what if most features are irrelevant.  What if the target is an
OR of r relevant features where r is a lot smaller than n.  Can we get
a better bound in that case?

Winnow will give us a bound of O(r log n) mistakes.

Algorithm:(a simple version)
\begin{enumerate}
\item Initialize the weights $w_1, \ldots, w_n$ of the variables to 1.

\item Given an example $x = \{x_1, \ldots, x_n\}$, output 1 if 
	$$w_1x_1 + w_2x_2 + \ldots + w_nx_n \geq n$$
and output 0 otherwise.

\item If the algorithm makes a mistake:
\begin{enumerate}
\item  If the algorithm predicts negative on a positive example, then
for each $x_i$ equal to 1, double the value of $w_i$.

\item
If the algorithm predicts positive on a negative example, then
for each $x_i$ equal to 1, cut the value of $w_i$ in half.
\end{enumerate}
\item Goto 2.
\end{enumerate}
\end{description}

\begin{theorem}
The Winnow Algorithm learns the class of disjunctions in the Mistake
Bound model, making at most $2 + 3r(1 + \lg n)$ mistakes when the target
concept is a disjunction of $r$ variables.
\end{theorem}

{\em Proof. } Let us first bound the number of mistakes that will be
made on positive examples.  Any mistake made on a positive example
must double at least one of the weights in the target function (the
{\em relevant} weights), and a mistake made on a negative example will
{\em not} halve any of these weights, by definition of a disjunction.
Furthermore, each of these weights can be doubled at most $1 + \lg n$
times, since only weights that are less than $n$ can ever be doubled.
Therefore, Winnow makes at most $r(1 + \lg n)$ mistakes on positive
examples.

Now we bound the number of mistakes made on negative examples.  The
total weight summed over all the variables is initially $n$.  Each
mistake made on a positive example increases the total weight by at
most $n$ (since before doubling, we must have had $w_1x_1 + \ldots
w_nx_n < n$).  On the other hand, each mistake made on a negative
example decreases the total weight by at least $n/2$ (since before
halving, we must have had $w_1x_1 + \ldots + w_nx_n \geq n$).  The
total weight never drops below zero.  Therefore, the number of
mistakes made on negative examples is at most twice the number of
mistakes made on positive examples, plus 2.  That is, $2 + 2r(1 + \lg
n)$.  Adding this to the bound on the number of mistakes on positive
examples yields the theorem.  \qed

Can also look at case where data is not completely consistent.