15-859(B) Machine Learning Theory            03/31/08

* Markov Decision Processes and Reinforcement Learning
=========================================================================
Today's topic: Markov Decision Processes (MDPs) and reinforcement learning.

General scenario: much like the DFA problem: we are an agent in some
environment, we have observations, we perform actions, they take us to
a new state of the world.  But also, we'll add a new ingredient to
that setup, which is that from time to time we get a cookie (a
reward), and our goal is going to be not so much to model the
environment as to find a strategy that gets us lots of cookies.  

We're going to make some assumptions that make the problem easier, and
some that make the problem harder, compared to the DFA problem.
Specifically, we'll assume:

- All states look different: you can always tell, from your current
  observation, what state you're in.  [makes the problem easier]
  Also, reward depends only on current state and action.

- But, transitions are probabilistic [makes the problem harder]

- Goal is to learn a good strategy for collecting a lot of reward,
  rather than necessarily to make a model. [makes the problem a bit
  different]

Formally, a Markov Decision Process consists of 
  - a set of states
  - a set of actions
  - a probabilistic transition function: f(s,a) is a probability
      distribution over states, and
  - a probabilistic reward function: R(s,a) is a probability
      distribtion over rewards you get from doing action a in state s.
Assumption is that you know your current state.

Simple example: grid-world with walls.  can go left, right, up, or
down.  Entering top-right corner gives you a reward of 100 and then
takes you to a random state.  Perhaps if you're not currently "hugging" a
wall, then each action has some chance of taking you to a random neighbor. 

Nice features of MDPs:
----------------------
* Like DFA model, it's appealing as a model of agent trying to figure
out how to act in the world.  Incorporates notion that actions can be
good or bad not only because of immediate reward/penalty, but because
of future rewards/penalties they can bring.  E.g., game playing
(though there you often have huge number of states), or a robot that
needs to do a task that takes several steps to carry out.

* learning algorithms will be very natural: propagate rewards backward
through state space: if you do an action from state s and get a reward
of 100, you might give value of 90 to the state s' you were in right
before s, since you can get from s' to s.  A lot like notion of
"conditioning" in psychology.  That's why this is called reinforcement
learning.

* Probabilities allow you to reasonably handle situations like: you
told your robot to go forward 1 foot, and instead it went 1.5 feet and
turned a bit to the right.  Or you ran out of battery power and got
moved back to the start.  Or someone just randomly came along and
picked you up and put you down somewhere else.

* Probabilities give to some extent a way of handling states that look the
same, by merging them into one state with probabilistic transitions going out. 

Limitations
-----------
* States that look the same can be a real problem, even with probabilities.  
E.g., imagine robot that needs to go into Wean 4623.  Right now in
corridor.  Door might be locked or unlocked: that's two states of the
world that look the same.  Once we get there, we find door is locked.
MDP model would have robot keep going back into corridor and trying again.  

* markov assumption not quite right: e.g., if your robot's wheels can
get stuck, an MDP could model this as saying the go-forward action has
a 2% chance of keeping you in the same state.  But really, if that
occured, you expect the probability it doesn't work the next time to
have gone up by a lot.  (could also be viewed as same as previous problem).

"POMDPs" capture probabilistic transitions *and* lack of full
observability, but there is much less in terms of positive results
about them, so we are going to stick with MDPs.

What do we want to optimize?
============================
We haven't exactly specified what we want our learned strategy to optimize.
Some natural choices are:
  - expected average reward (reward per time step)
  - expected reward in first t time steps (t given to us).
  - expected discounted reward: sum of rewards but giving less weight
    to rewards in the future.  Specifically, a typical way of doing this:
      r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 ... 
    where gamma < 1 is given to us, and r_t is reward in tth step.

Most common is the 3rd one.  Problem with average reward is it can let
your robot dilly-dally: two strategies that differ only in the first
1000 time steps are equivalent.  Fixing time-horizon t can be very
sensitive to exact horizon.  Discounting forces you to always do
something reasonable.

Technical question: why discount by gamma^t and not, say, 1/t^2?
People will give statements like "it's like an interest rate" but
the real reason people use gamma^t is that it simplifies the problem:
the optimal strategy is now TIME-INDEPENDENT: it's just a mapping from
states to actions.  In other words, if action a is the best choice
given that we are at state s at time 0, then we want to do action a
*every* time we reach state s.  That's because the suffix of the sum
starting from time t looks exactly like the sum itself, scaled by gamma^t.
This kind of strategy is called a POLICY. 

Q-values
========
DEFINITION: Q(s,a) = expected discounted reward for performing action
a from state s, and then behaving optimally from then on.

DEFINITION: V(s) = max_a Q(s,a).  V(s) = value of state s.

  The way that a procedure called Q-learning is going to work is to
  learn approximations to these Q values, as opposed to explicitly
  learning a mapping from states to actions.  Once we have Q-values,
  we can then pick actions accordingly.

Nice relationship among Q-values:

     Q(s,a) = E[R(s,a)] + gamma * sum_s' Pr(s')*V(s').

where Pr(s') is the probability that s' is the next state when you do
action a from state s.  Plugging in the definition of V(s), 
we can either write the Q's in terms of themselves, or the V's in
terms of themselves.  E.g., 
  V(s) = max_a [ E[R(s,a)] + gamma * sum_s' Pr(s')V(s') ]

In fact, we can use this as a legal definition of Q(s,a) and V(s).  In
particular, there is only one solution.  Can anyone see why?

  Proof: one nice way to see this is: let V be the true value of each
  state according to our original definition, and suppose you modify
  V, changing each entry by at most some Delta, and plug it into the
  RHS of the above equation.  Then, notice you'll be off by at most
  Delta*gamma on the left-hand-side.  I.e., if I had some other
  solution that was off by at most Delta, then it would also have to
  be off by at most gamma*Delta.

What if we are *given* the transition and reward functions.  How to
solve for Q values?  Two natural ways.

Method #1: Dynamic Programming.  We just use the above observation.
  Start with some arbitrary guesses for the values V, and then run them
  through the above equation to get an updated guess.  Specifically,

     V_i(s) = max_a [ E[R(s,a)] + gamma * sum_s' Pr(s')V_{i-1}(s') ]

  In fact, if we initialize V_0(s) = 0, then V_t(s) represents the
  maximum discounted reward we can get if the game ends in t steps.
  Note: this will converge exponentially quickly to the correct values.  If 
  initially off by Delta, the in t steps you're off by at most Delta*gamma^t.

Method #2: Linear Programming.  Replace "max" with ">=" and then
  minimize sum_s V(s) subject to those constraints.  Since the
  quantities are all monotonically related in the inequalities, the
  minimum sum will occur when, for each s, one of its constraints is at
  equality. 

Q-learning
==========
Idea in Q-learning is that if we *don't* have a model and only get
rewards  as we go, we can still do DP ("Bellman backups") in an
asynchronous way.  Let hat{Q}(s) be our estimated Q value for state s. 

Deterministic version (if world is deterministic):
- initialize all hat{Q} to 0
- Suppose we are in state s, do action a, go to s' and receive reward
  r.  Then set:  
        hat{Q}(s,a) = r + gamma * hat{V}(s')
  where
        hat{V}(s') = max_{a'} hat{Q}(s',a').

  [Analogous to running Bellman-Ford shortest path algorithm in an
  asynchronous way.  Say your goal is to determine the shortest path
  to a given goal state.  You are currently in a state s where you
  know a path of length 50, but you travel an edge of length 10 and
  get to a state s' where you know of a path of length 30.  So, you
  update s to being distance at most 40]

More generally, in a probabilistic environment:

- On t-th update of hat{Q}(s,a):
      hat{Q}(s,a) = (1-alpha_t)hat{Q}(s,a) + alpha_t[r + gamma*hat{V}(s')]

- Idea: dampen the randomness.

  Use alpha_t = 1/t or similar.  With alpha_t = 1/t, you get a fair
  average of all the rewards received for doing action a in state s
  (just unroll the sum).  If you make more slowly decreasing, you
  favor your more recent experience.


Convergence of Q-learning (deterministic version): 

- Say we start with some hat{Q} values.  Define 
      Delta_0 = max_{s,a} |hat{Q}(s,a) - Q(s,a)|.

- Let's define a "Q-interval" to be an interval of time in which every
  <s,a> is tried at least once.

Claim is that after ith Q-interval, we've done at least as well as i
steps of dynamic programming.  Specifically, let Delta_i =
max_{s,a}|hat{Q}(s,a)-Q(s,a)| after the ith Q-interval.

Claim: after the ith Q-interval, Delta_i <= Delta_0*gamma^i. 
In addition, during the ith Q-interval, this maximum difference will
be at most Delta_0 * gamma^{i-1}. 

Prove by induction.  Base case (the very beginning) is OK.  After an
update in interval i of hat{Q}(s,a) (let's say that action a takes you
from s to s') we have: 
  |hat{Q}(s,a) - Q(s,a)| = |(r + gamma max_{a'} hat{Q}(s',a')) - 
			    (r + gamma max_{a'}Q(s',a')|
                         = gamma |max_{a'}hat{Q}(s',a') - max_{a'}Q(s',a')|
                        <= gamma max_{a'} |hat{Q}(s',a') - Q(s',a')|
                        <= gamma * gamma^{i-1}*Delta_0.

For probabilistic version, you will need that sum_t alpha_t = infinity
but sum_t (alpha_t)^2 is finite.

Various issues
--------------
- We've just talked about how to update the estimates given the
  actions chosen, but how to choose the action?  Typical thing: with
  prob 1-epsilon choose the action you think is best, with prob
  epsilon choose a random action.  Another thing you can do is if you
  start all the hat{Q}-values too high, then this will naturally cause
  the algorithm to select actions it hasn't tried much.  There's more
  sophisticated work on analyzing action choices as a function of
  properties (like mixing rates) of the environment.

Combining with function approximation / concept learning
========================================================

What if the state space is too big to represent explicitly?  E.g.,
each state is just described by some set of features. 

- Neat idea: try to train a learning algorithm that given a <state,action> 
  pair outputs its Q-value.  E.g., could be a neural-net or whatever.

- How to train, since we don't have labeled data?  Use Q-learning!
  Say we're in state s, do action a, get reward r, and go to s'.  Use
  (1-alpha)hat(Q)(s,a) + alpha (r + gamma * hat{V}(s')) as the label.
  Can think of this like training up an evaluation function for chess
  by trying to make it be self-consistent.

- Will this really work? TD-gammon backgammon player learned a
  policy for backgammon play using this approach and is the best
  computer backgammon player in the world, and comparable to or better
  than best human players.  On the other hand, theory says this
  approach isn't guaranteed to work: even if your predictor is
  powerful enough to fit the value function, you can get into a bad
  feedback loop.  Work by Geoff Gordon on conditions that ensure
  things will go well.  But they are pretty strong conditions. 

Other material if time: Say you find approximate values V' and use the
greedy policy wrt V'.  Why is that necessarily an apx optimal policy?