15-859(B) Machine Learning Theory                                   04/02/08

* Markov Decision Processes and Reinforcement Learning II
=========================================================================

===Recap===

The MDP model is:
- a set of states
- a set of actions
- a probabilistic transition function: let's use Pr_{s,a}(s') to denote the
  probability of transitioning to state s' if we do action a from state s.
- a probabilistic reward function: R(s,a) is a probability distribution over
  the rewards you get from doing action a in state s.

The assumption is that you know your current state.

There are different things we might want to optimize; we will consider
expected discounted reward.  Given gamma in (0,1), we want to maximize

   E[r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*r_3 + ...],

where r_t is the reward at time t.  This has the nice property that the
optimal strategy is a policy (a mapping of states to actions).

Q-values
========
DEFINITION: Q(s,a) = expected discounted reward for performing action a from
state s, and then behaving optimally from then on.

DEFINITION: V(s) = max_a Q(s,a).  V(s) = value of state s.

So, V(s) = max_a [ E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s') ].

We then saw we can use this as a legal *definition* of V(s) since there is
only one solution.  [Actually, just to flesh out what may be an obvious
point: the optimal policy simultaneously maximizes V(s) for all states s.
That's because if you start at state s and go to s', then you want to behave
from then on in the same way as if you had started at s', because the
gamma^t discounting is time-invariant.]

Given the transition and reward functions, we saw we can solve for Q-values
through either Dynamic Programming (called "value iteration") or Linear
Programming.  The number of iterations of DP needed depends on the
"half-life" 1/(1-gamma), so we would prefer LP if, say, gamma = 1 - 1/n^10.
(A small value-iteration sketch appears below, after the Q-learning
section.)

Another popular method is called "policy iteration": here you guess some
initial policy pi_0, which converts the MDP into a Markov chain.  You can
then solve for the values V^{pi_0} using linear algebra, then compute the
greedy policy pi_1 with respect to these values, then solve for V^{pi_1},
etc.  In practice this converges quickly, but the only theoretical bounds
known are that it converges at least as fast as value iteration.

Q-learning
==========
Deterministic version (if the world is deterministic):
- Suppose we are in state s, do action a, go to s', and receive reward r.
  Then set:
     hat{Q}(s,a) = r + gamma * hat{V}(s'),
  where hat{V}(s') = max_{a'} hat{Q}(s',a').
  [Analogous to running the Bellman-Ford algorithm in an asynchronous way.]

More generally, in a probabilistic environment:
- On the t-th update of hat{Q}(s,a):
     hat{Q}(s,a) = (1-alpha_t)*hat{Q}(s,a) + alpha_t*[r + gamma*hat{V}(s')].

We ended by proving the following theorem about the deterministic version.
Define a "Q-interval" to be an interval of time in which every state-action
pair (s,a) is tried at least once.  Let
Delta_i = max_{s,a} |hat{Q}(s,a) - Q(s,a)| after the ith Q-interval.  Then
after the ith Q-interval, Delta_i <= Delta_0 * gamma^i.

Proof by induction.  The base case (the very beginning) is OK.  After an
update in interval i of hat{Q}(s,a) (let's say that action a takes you from
s to s') we have:

   |hat{Q}(s,a) - Q(s,a)|
      = |(r + gamma*max_{a'} hat{Q}(s',a')) - (r + gamma*max_{a'} Q(s',a'))|
      = gamma * |max_{a'} hat{Q}(s',a') - max_{a'} Q(s',a')|
     <= gamma * max_{a'} |hat{Q}(s',a') - Q(s',a')|
     <= gamma * Delta_{i-1}  <=  gamma * gamma^{i-1} * Delta_0.

For the probabilistic version, you will need that sum_t alpha_t = infinity
but sum_t (alpha_t)^2 is finite.
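To make the update above concrete, here is a minimal tabular Q-learning
sketch in Python.  The environment interface (env.reset() and env.step(a)
returning the next state and reward), the epsilon-greedy exploration rule,
and the fixed learning rate are illustrative assumptions, not part of the
lecture.

   import random
   import numpy as np

   def q_learning(env, num_states, num_actions, gamma=0.9, alpha=0.1,
                  eps=0.1, num_steps=10000):
       # Tabular hat{Q}; a fixed alpha is used here for simplicity, whereas
       # the convergence condition above asks for a schedule alpha_t with
       # sum_t alpha_t = infinity and sum_t (alpha_t)^2 finite.
       Q = np.zeros((num_states, num_actions))
       s = env.reset()                      # assumed environment interface
       for t in range(num_steps):
           if random.random() < eps:        # occasionally explore...
               a = random.randrange(num_actions)
           else:                            # ...but usually act greedily
               a = int(np.argmax(Q[s]))
           s_next, r = env.step(a)          # observe reward r and next state s'
           # hat{Q}(s,a) <- (1-alpha)*hat{Q}(s,a) + alpha*[r + gamma*hat{V}(s')]
           Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
           s = s_next
       return Q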
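For comparison with the planning setting discussed earlier (transition and
reward functions known), here is a minimal value-iteration sketch.  It
assumes the model is given as NumPy arrays P[s,a,s'] = Pr_{s,a}(s') and
R[s,a] = E[R(s,a)]; the array layout and stopping tolerance are illustrative
choices.

   import numpy as np

   def value_iteration(P, R, gamma=0.9, tol=1e-8):
       # P[s, a, s2] = Pr_{s,a}(s2);  R[s, a] = E[R(s,a)]
       num_states, num_actions, _ = P.shape
       V = np.zeros(num_states)
       while True:
           # Q(s,a) = E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s')
           Q = R + gamma * (P @ V)              # shape (num_states, num_actions)
           V_new = Q.max(axis=1)                # V(s) = max_a Q(s,a)
           if np.max(np.abs(V_new - V)) < tol:
               return V_new, Q.argmax(axis=1)   # values and a greedy policy
           V = V_new

The number of sweeps needed to reach a given accuracy grows with the
"half-life" 1/(1-gamma) mentioned above.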
Related Algorithms
==================
SARSA: "on policy" versus "off policy" updates.  Replace "hat{V}(s')" in the
Q-learning update with "hat{Q}(s',a')", where a' is the action actually
taken from s'.  So long as you usually choose the actions you believe to be
best (argmax_{a'} hat{Q}(s',a')), this will be similar to the Q-learning
update.  [Question: what is needed for the above proof to go through?]

Also related is the TD(lambda) algorithm, where you pass experience backward
in time discounted by some lambda, rather than just passing it back one
level.  TD is for *evaluating* a given policy, but you can also do this with
SARSA.  Algorithmically, you do it with a vector of multipliers over states
called an "eligibility trace": at each step, multiply each entry by
lambda*gamma (to discount), but add 1 to the current state.  (A small sketch
follows.)
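As an illustration of the eligibility-trace bookkeeping just described, here
is a minimal tabular TD(lambda) sketch for evaluating a fixed policy.  The
trajectory format (a list of (s, r, s') transitions generated by the policy
being evaluated), the step size, and the value of lambda are illustrative
assumptions.

   import numpy as np

   def td_lambda_evaluate(episodes, num_states, gamma=0.9, lam=0.8, alpha=0.05):
       # Estimate V^{pi} for the fixed policy that generated `episodes`.
       V = np.zeros(num_states)
       for episode in episodes:                 # episode = [(s, r, s_next), ...]
           e = np.zeros(num_states)             # eligibility trace over states
           for (s, r, s_next) in episode:
               delta = r + gamma * V[s_next] - V[s]   # one-step TD error
               e *= gamma * lam                 # discount every state's credit...
               e[s] += 1.0                      # ...but add 1 to the current state
               V += alpha * delta * e           # pass the error backward in time
       return V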
One more issue to address
=========================
Say you find approximate values V' and use the greedy policy \pi with
respect to V'.  Is \pi necessarily an approximately optimal policy?
Specifically, say that max_s |V(s) - V'(s)| = epsilon, and let
Delta = max_s (V(s) - V^{\pi}(s)).  We want to show that Delta is not too
much larger than epsilon.

Proof: Let s be a state of maximum difference Delta between V(s) and
V^{\pi}(s).  Say that the optimal action from s is "a" but \pi says to do
"b".

   V^{\pi}(s) = E[R(s,b)] + gamma * sum_{s'} Pr_{s,b}(s') V^{\pi}(s').

Now, what if we follow \pi for just one step and then behave optimally from
then on?  At best this helps by gamma*Delta, since the quantities inside the
sum go up by at most Delta.  [In fact, this is an easy way to see that
actions "a" and "b" must be different.]  So, this is still worse than
optimal at state s by at least Delta - gamma*Delta.  I.e.,

   Delta*(1-gamma) <= [E[R(s,a)] + gamma*sum_{s'} Pr_{s,a}(s')*V(s')]
                       - [E[R(s,b)] + gamma*sum_{s'} Pr_{s,b}(s')*V(s')].

But since "b" looked better according to V' (that's how \pi was defined), it
must be the case that

   0 >= [E[R(s,a)] + gamma*sum_{s'} Pr_{s,a}(s')*V'(s')]
         - [E[R(s,b)] + gamma*sum_{s'} Pr_{s,b}(s')*V'(s')].

So, subtracting the second right-hand side from the first, and using the
fact that |V(s') - V'(s')| <= epsilon, we get

   Delta*(1-gamma) <= gamma*[sum_{s'} Pr_{s,a}(s')*epsilon]
                       + gamma*[sum_{s'} Pr_{s,b}(s')*epsilon],

which is at most 2*gamma*epsilon.  So, Delta <= 2*epsilon*gamma/(1-gamma).
The gap gets worse as gamma gets closer to 1, but we can at least say that
it is enough to make epsilon smaller than the suboptimality we are willing
to tolerate by a factor of roughly 1/(1-gamma).

Combining with function approximation / concept learning
========================================================
What if the state space is too big to represent explicitly?  E.g., each
state is just described by some set of features.
- Neat idea: try to train a learning algorithm that, given a (state, action)
  pair, outputs its Q-value.  E.g., it could be a neural net or whatever.
- How to train it, since we don't have labeled data?  Use Q-learning!  Say
  we're in state s, do action a, get reward r, and go to s'.  Use
  (1-alpha)*hat{Q}(s,a) + alpha*(r + gamma*hat{V}(s')) as the label.  You
  can think of this like training up an evaluation function for chess by
  trying to make it be self-consistent.  (A small sketch with a linear
  approximator appears at the end of these notes.)
- What's particularly interesting is that we are effectively using two
  notions of states being "similar" here.  One is defined by the
  representation of the learning algorithm (e.g., if you have a simple
  linear weighted combination of features, then two states are similar if
  they aren't too far apart in the important directions in the feature
  space).  The other is defined by the transition function.
- Will this really work?  The TD-Gammon backgammon player learned a policy
  for backgammon using this approach; it is the best computer backgammon
  player in the world, and comparable to or better than the best human
  players.  On the other hand, theory says this approach isn't guaranteed to
  work: even if your predictor is powerful enough to fit the value function,
  you can get into a bad feedback loop.  There is work by Geoff Gordon on
  conditions that ensure things will go well, but they are pretty strong
  conditions.

=========================
2-player MDPs: what if you have two players A and B, who alternate moves?
One state is labeled "win for A" (and we can also have a state called "win
for B" if we like).  There is no discounting.  The goal of A is to maximize
the probability of entering the "win for A" state.
- What could we do if we defined the game to be a win for B if it lasts more
  than T steps?  (One natural approach is sketched below.)
- What if we don't?  It turns out there is no known poly-time algorithm for
  determining the optimal policy for A (in fact, it is believed not to be in
  P).  However, the decision question "does A have a strategy that
  guarantees at least probability p of winning?" is in NP intersect Co-NP.
  Can anyone see why it's in NP?  Co-NP?

These are sometimes called "simple stochastic games".
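On the first question above (declaring the game a win for B if it lasts more
than T steps): one natural approach is backward induction over the T
remaining moves, since the truncated game has a well-defined optimal value
at every depth.  Below is a sketch under several assumptions: the two
players' transition probabilities are given as arrays P_A[s,a,s'] (A to
move) and P_B[s,b,s'] (B to move), and the "win for A" state is absorbing
and marked by the 0/1 vector win_A.  The names and array layout are
illustrative, not from the lecture.

   import numpy as np

   def truncated_game_value(P_A, P_B, win_A, T):
       # Backward induction for the T-step truncated game.  Returns, for each
       # state with A to move, A's best probability of reaching "win for A"
       # within T moves against a worst-case B.
       win_A = np.asarray(win_A, dtype=float)
       vA = win_A.copy()                    # 0 moves left, A to move
       vB = win_A.copy()                    # 0 moves left, B to move
       for _ in range(T):
           # A maximizes the win probability; B minimizes it.
           vA_new = np.maximum(win_A, (P_A @ vB).max(axis=1))
           vB_new = np.maximum(win_A, (P_B @ vA).min(axis=1))
           vA, vB = vA_new, vB_new
       return vA

With the horizon fixed, this computation is polynomial in the number of
states, actions, and T.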
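Finally, returning to the function-approximation idea above: here is a
minimal sketch of one Q-learning-style update for a linear approximator
hat{Q}(s,a) = w . phi(s,a), where the bootstrapped value plays the role of
the label.  The feature map phi, the action set, and the step size are
illustrative assumptions; the smoothing by alpha that the notes fold into
the label shows up here as the gradient step size.

   import numpy as np

   def fa_q_update(w, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
       # One update of a linear approximator hat{Q}(s,a) = w . phi(s,a),
       # where phi(state, action) returns a feature vector (assumed interface).
       target = r + gamma * max(w @ phi(s_next, a2) for a2 in actions)
       prediction = w @ phi(s, a)
       # nudge the prediction toward the bootstrapped "label"; returns new weights
       return w + alpha * (target - prediction) * phi(s, a)

The same recipe applies to richer predictors (e.g., a neural net), which is
the setting of the TD-Gammon discussion above.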