15-681 Machine Learning.  Lecture 9/24

Recap: End of last time, talking about on-line learning and the problem
of predicting from expert advice.

- Goal: be not too much worse than the best expert.  Could just sample,
  take statistics, and pick the best, but...
  -- we don't want to wait.
  -- what if there is no fixed distribution?

If one expert is doing much better than the rest, then we want to go
with it.  If several are good, then majority voting sounds best.

Nice algorithm that achieves both: the Weighted Majority Algorithm.

  Assign each expert a weight of 1 to start.
  Predict according to a (weighted) majority vote.
  Penalize experts that make a mistake by multiplying their weights by 1/2.

Proved: # mistakes M < 2.41*(m + lg(n)), where n = number of experts and
m = # mistakes of the best expert so far.

Uses:
- an alternative to selecting among several hypotheses.
- can combine several learning algorithms.
- the lg(n) means you can add in lots of experts / hypotheses /
  learning algorithms without much penalty.

Today: applications and variations.  The algorithm is very flexible.

***Calendar application***

Motivation:
1. many of ID3's rules are simple.
2. lots of features available.
3. time dependence (get to this later).

Idea:
- Create one expert for every pair of features.  So, have 561 experts.
- Each expert performs a simple lookup-table rule.  Given a pair of
  values for its two features, the expert predicts the most common
  outcome out of the last 4 times (say) that that pair of values
  appeared.

  E.g., say we're predicting day of week.  If the features are "seminar
  type" and "names of attendees", and the values are nil (meaning it's
  not a seminar) and Caruana (meaning it's a meeting with Rich Caruana),
  then we look at the last 4 meetings with Rich Caruana and take the
  most common day.

- Note: in doing this, we have moved the "time dependence" inside the
  experts.  This gives us more flexibility in dealing with the
  recency/frequency tradeoff.
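One of these lookup-table experts could be sketched as follows (a minimal sketch; the class and method names are mine, not from the actual calendar system):

```python
from collections import Counter, defaultdict, deque

class PairExpert:
    """One expert per pair of features.  For each pair of values it has
    seen, it remembers only the last k outcomes (k = 4 in the lecture)
    and predicts the most common one -- so the "time dependence" lives
    inside the expert."""

    def __init__(self, feat_a, feat_b, k=4):
        self.feat_a, self.feat_b = feat_a, feat_b
        # Maps a (value_a, value_b) pair to a bounded history of outcomes.
        self.history = defaultdict(lambda: deque(maxlen=k))

    def predict(self, example, default=None):
        outcomes = self.history[(example[self.feat_a], example[self.feat_b])]
        if not outcomes:
            return default          # never seen this value pair
        return Counter(outcomes).most_common(1)[0][0]

    def update(self, example, outcome):
        self.history[(example[self.feat_a], example[self.feat_b])].append(outcome)
```

So the expert for ("seminar type", "names of attendees") with values (nil, Caruana) looks up the last 4 meetings with Rich Caruana, no matter how long ago they occurred.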
  E.g., even if the last meeting with person X was 200 days ago, we can
  still remember it.

  As a general rule of thumb, the simpler the learning algorithm (e.g.,
  a lookup table is simpler than a decision tree is simpler than a
  neural net), the more flexible you can make it for issues like this.

Results: about 5-10% better accuracy than ID3.  On the higher end when
using all the features (not just the ones selected for ID3).

============================================================================

The algorithm is built for the goal of: "not too much worse than the
best expert so far".  What if one expert is best for a while, then that
expert starts doing badly and another becomes best?  E.g., some weather
predictors do better in fall, some in winter, etc.  (What might be some
other situations like that?)  Might want the stronger goal: "not too
much worse than the best expert LATELY".

1. Notice that the weight on an expert is a function of the total
   number of mistakes it has made so far, so it doesn't distinguish
   recent versus old mistakes.  Any ideas on modifying the algorithm?

   Here's one: modify the algorithm to say that if weight w_i < W/(10n),
   then don't cut it in half.  (The number 10 is arbitrary; any number
   greater than 2 would be fine.)

   Claim: say you have a long sequence of examples.  In any contiguous
   block, the total number of mistakes is not too much worse than the
   best expert IN THAT BLOCK.

   Let's analyze it.  Let W_init = total weight at the start of the
   block.  How much does W drop when a mistake is made?
   - at least W/2 is on the portion that made a mistake.
   - at most W/10 of that is on the portion that can't be halved
     (at most n experts, each with weight below W/(10n)).
   So, at least 4/10 of the weight gets halved
   ==> remove at least 1/5 of the total weight
   ==> after M mistakes, total weight < W_init*(4/5)^M.

   The weight on the best expert starts at at least W_init/(20n), since
   halving never takes a weight below W/(20n).  So, if it makes m
   mistakes in the block, its weight ends at at least:
   W_init/(20n) * (1/2)^m.

   So,  W_init/(20n) * (1/2)^m  <  W_init*(4/5)^M
   ==>  (5/4)^M  <  2^m * 20n
   ==>  M  <  3.11 * [m + lg(20n)]

2. Going back to the original algorithm, the constant 2.41 is kind of
   big.
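The weight-floor variant analyzed above can be sketched as follows (a minimal sketch for binary 0/1 predictions; the function names are mine):

```python
def wm_predict(weights, predictions):
    """Deterministic weighted-majority vote over 0/1 predictions."""
    weight_for_1 = sum(w for w, p in zip(weights, predictions) if p == 1)
    return 1 if weight_for_1 >= sum(weights) / 2 else 0

def wm_update_with_floor(weights, predictions, outcome):
    """Halve each mistaken expert's weight, EXCEPT experts whose weight
    has already fallen below W/(10n).  As a result no weight ever falls
    below W/(20n), so an expert that starts doing well again can regain
    influence quickly."""
    W, n = sum(weights), len(weights)
    floor = W / (10 * n)
    return [w if p == outcome or w < floor else w / 2
            for w, p in zip(weights, predictions)]
```

With two experts where one is always wrong, the loser's weight bottoms out near W/(20n) instead of shrinking toward zero -- exactly the property the block analysis uses.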
   As we pointed out last time, in practice it does much better.  But
   here is a modification that reduces that worst-case bound.  (In
   experiments I've seen, these modifications don't affect error, but
   they do make the algorithm usable in more general settings...)

   Modification 1: replace 1/2 by beta.
   Modification 2: use the weights as probabilities.

   I.e., the algorithm is:

     Assign each expert a weight of 1 to start.
     Predict outcome X with probability proportional to the total
     weight on that outcome.  (Equivalent to: select expert i with
     probability w_i/W.)
     Penalize experts that make a mistake by multiplying their weight
     by beta.

   Intuitively, why is the worst case better?  [When the weights are
   split nearly 50/50, the deterministic algorithm is guaranteed to
   make the mistake; the randomized one is wrong only about half the
   time.]

   New bound: the expected number of mistakes M is bounded by:

            m ln(1/beta) + ln(n)
     M  <=  --------------------
                 (1 - beta)

   E.g., beta = 1/2 gives M <= 1.39m + 2 ln(n);
         beta = 3/4 gives M <= 1.15m + 4 ln(n).

Application: say the time between seeing an example and making your
prediction must be short, but you have more time before the next
example arrives.  Now you only need to examine one expert to make a
prediction (though you still need to examine all of them to update the
weights).

Application: can use in more general settings where the experts are
suggesting "strategies" or "actions" that can't be combined so well.
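The randomized version and its bound can be sketched as follows (a sketch under the binary-outcome setting above; function names are mine):

```python
import math
import random

def rwm_predict(weights, predictions, rng=random):
    """Select expert i with probability w_i / W and follow it.
    (Equivalent to predicting each outcome with probability
    proportional to the total weight voting for it.)"""
    r = rng.random() * sum(weights)
    for w, p in zip(weights, predictions):
        r -= w
        if r <= 0:
            return p
    return predictions[-1]

def rwm_update(weights, predictions, outcome, beta=0.5):
    """Multiply each mistaken expert's weight by beta."""
    return [w * beta if p != outcome else w
            for w, p in zip(weights, predictions)]

def rwm_bound(m, n, beta):
    """Bound on the expected number of mistakes:
    (m*ln(1/beta) + ln(n)) / (1 - beta)."""
    return (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
```

Note the coefficient on m is ln(1/beta)/(1-beta), which tends to 1 as beta -> 1: by tuning beta you can get arbitrarily close to the best expert, at the cost of a larger coefficient on the ln(n) term.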