15-859(B) Machine Learning Theory 04/10/07
* Markov Decision Processes and Reinforcement Learning II
=========================================================================
===Recap===
MDP model is:
- a set of states
- a set of actions
- a probabilistic transition function: Let's use Pr_{s,a}(s') to
denote prob of transitioning to state s' if do action a from state s.
- a probabilistic reward function: R(s,a) is a probability
distribution over rewards you get from doing action a in state s.
Assumption is that you know your current state.
Like DFA model, MDP is appealing as a model of agent trying to figure
out how to act in the world. Incorporates notion that actions can be
good or bad not only because of immediate reward/penalty, but because
of future rewards/penalties they can bring. Probabilities also allow
you to reasonably handle unexpected events.
[POMDP aside:
What about the case when you don't completely know your current state?
One default thing to do is just fold that into the probabilities. At
some level, a lot of the randomness we experience is just that there
are properties of the world we can't measure.
On the other hand, if you have a model of the world and can't tell
exactly what state you are in (the POMDP setting) then the
"right" way to handle this is to explicitly model your current
location in the MDP as a probability cloud, called a "belief state".
E.g., maybe there is a long hallway and you can't exactly tell where
you are in it. Some actions widen the cloud and some (like actions
designed to pinpoint your location) can reduce it.
The problem from a planning perspective is that the optimal action
doesn't have to be the optimal action from any of the components.
E.g., Sebastian Thrun's heaven/hell example. [2k states. Actions:
forward, turn around, look up]. To do this right, you want to view
this as an MDP on belief states. These are continuous but you can
show the optimal value function is piecewise linear on them. Of
course, the state space blows up a lot. There's been a lot of work
by Littman and others on algorithms for improving performance in practice.
Sometimes you can deal with uncertainties by explicitly defining new states
corresponding to information you have. E.g., if there is a door that
might be locked or unlocked, just explicitly model your state of
knowledge in your state description.
]
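The belief-state update in the aside can be sketched with Bayes' rule. This is a minimal illustration, not a POMDP solver; the function and table names (belief_update, trans, obs) are hypothetical:

```python
# Sketch of a belief-state update (Bayes' rule); table layout and all names
# here are hypothetical, for illustration.
# trans[(s, a)] = {s2: prob of moving to s2 on action a from s}
# obs[s2]       = {o: prob of seeing observation o in state s2}
# New belief after doing a, seeing o:
#   b'(s2) proportional to obs[s2][o] * sum_s trans[(s, a)][s2] * b(s)
def belief_update(b, a, o, trans, obs):
    b2 = {}
    for s2 in obs:
        predicted = sum(trans[(s, a)].get(s2, 0.0) * p for s, p in b.items())
        b2[s2] = obs[s2].get(o, 0.0) * predicted
    z = sum(b2.values())  # assumes the observation has nonzero probability
    return {s2: mass / z for s2, mass in b2.items()}
```

E.g., in a two-cell hallway where one cell is next to a door, seeing "door" concentrates the belief on that cell; an uninformative observation model leaves the cloud unchanged, matching the "some actions widen the cloud, some shrink it" picture above.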
Different things we might want to optimize. We will consider expected
discounted reward. Given gamma in (0,1), want to maximize E[r_0 +
gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 ...] where r_t is reward
at time t. This has the nice property that the optimal strategy is a
policy (a mapping of states to actions).
Q-values
========
DEFINITION: Q(s,a) = expected discounted reward for performing action
a from state s, and then behaving optimally from then on.
DEFINITION: V(s) = max_a Q(s,a). V(s) = value of state s.
So, V(s) = max_a [ E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s')V(s') ]
We then saw we can use this as a legal *definition* of V(s) since
there is only one solution.
[Actually, just to flesh out what may be an obvious point: the optimal
policy simultaneously maximizes V(s) for all states s. That's because
if you start at state s and go to s' then you want to now behave in the
same way as if you had started at s', because the gamma^t discounting
is time-invariant]
Given the transition and reward functions, we saw we can solve for
Q-values through either Dynamic Programming (called "value iteration")
or Linear Programming. # iterations of DP needed depends on
"half-life" 1/(1-gamma), so would prefer LP if gamma = 1 - 1/n^10.
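As a concrete sketch, here is value iteration on a tiny made-up MDP (the states, actions, rewards, and variable names below are all invented for illustration; the notes themselves give no specific example):

```python
# A tiny hypothetical MDP: states 0 and 1, actions 'a' and 'b'.
# P[(s, act)] is a list of (next_state, prob); R[(s, act)] is expected reward.
P = {(0, 'a'): [(0, 0.5), (1, 0.5)], (0, 'b'): [(1, 1.0)],
     (1, 'a'): [(0, 1.0)],           (1, 'b'): [(1, 1.0)]}
R = {(0, 'a'): 1.0, (0, 'b'): 0.0, (1, 'a'): 0.0, (1, 'b'): 2.0}
gamma = 0.9

def value_iteration(P, R, gamma, iters=1000):
    # Repeatedly apply the Bellman backup
    #   V(s) = max_a [ E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s') ].
    states = {s for (s, _) in P}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                    for a in ('a', 'b') if (s, a) in P)
             for s in states}
    return V
```

With gamma = 0.9 the half-life is 10, so a few hundred iterations get the values to machine precision on this example; for gamma much closer to 1 the iteration count blows up, which is exactly the case where LP looks better.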
Another popular method is called "policy iteration": Here you guess
some initial policy pi_0, which converts the MDP into a Markov chain. Then
can solve for the values V^{pi_0} using linear algebra. Then compute
greedy policy pi_1 with respect to these values. Then solve for
V^{pi_1}, etc. In practice this converges quickly, but the only known
theoretical bounds say that it converges at least as fast as value
iteration.
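Policy iteration can be sketched as follows, on a made-up two-state MDP (the data and all names are hypothetical); the evaluation step solves the linear system (I - gamma*P_pi) V = R_pi exactly, as in the "linear algebra" step above:

```python
# Hypothetical toy MDP, same layout as before:
# P[(s, a)] = list of (next_state, prob); R[(s, a)] = expected reward.
P = {(0, 'a'): [(0, 0.5), (1, 0.5)], (0, 'b'): [(1, 1.0)],
     (1, 'a'): [(0, 1.0)],           (1, 'b'): [(1, 1.0)]}
R = {(0, 'a'): 1.0, (0, 'b'): 0.0, (1, 'a'): 0.0, (1, 'b'): 2.0}
gamma = 0.9
states = sorted({s for (s, _) in P})
actions = ('a', 'b')

def solve(A, b):
    # Gaussian elimination with partial pivoting (plain lists, no numpy).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def policy_iteration(pi):
    while True:
        # Evaluate pi exactly: solve (I - gamma * P_pi) V = R_pi.
        A = [[(1.0 if s == s2 else 0.0) -
              gamma * dict(P[(s, pi[s])]).get(s2, 0.0) for s2 in states]
             for s in states]
        b = [R[(s, pi[s])] for s in states]
        V = dict(zip(states, solve(A, b)))
        # Compute the greedy policy with respect to these values.
        new_pi = {s: max(actions, key=lambda a: R[(s, a)] +
                         gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
                  for s in states}
        if new_pi == pi:       # greedy policy unchanged => optimal
            return pi, V
        pi = new_pi
```

On this toy example one improvement step already reaches the optimal policy, consistent with the observation that policy iteration tends to converge fast in practice.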
Q-learning
==========
Deterministic version (if world is deterministic):
- Suppose we are in state s, do action a, go to s' and receive reward
r. Then set:
hat{Q}(s,a) = r + gamma * hat{V}(s')
where
hat{V}(s') = max_{a'} hat{Q}(s',a').
[Analogous to running Bellman-Ford algorithm in an asynchronous way.]
More generally, in a probabilistic environment:
- On t-th update of hat{Q}(s,a):
hat{Q}(s,a) = (1-alpha_t)hat{Q}(s,a) + alpha_t[r + gamma*hat{V}(s')]
We ended by proving the following theorem about the deterministic
version:
Define a "Q-interval" to be an interval of time in which every state-action pair (s,a)
is tried at least once. Let Delta_i = max_{s,a}|hat{Q}(s,a)-Q(s,a)|
after the ith Q-interval. Then after the ith Q-interval, Delta_i <=
Delta_0*gamma^i.
For probabilistic version, you will need that sum_t alpha_t = infinity
but sum_t (alpha_t)^2 is finite.
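The deterministic update above can be sketched as follows; the two-state world (the step function and its rewards) is invented for illustration, and a probabilistic world would instead mix in the new estimate with weight alpha_t:

```python
import random

# Hypothetical deterministic world: two states, two actions.
# step(s, a) -> (next_state, reward).
def step(s, a):
    if s == 0:
        return (1, 1.0) if a == 'a' else (0, 0.0)
    else:
        return (0, 0.0) if a == 'a' else (1, 2.0)

gamma = 0.9

def q_learning(steps=20000, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ('a', 'b')}
    s = 0
    for _ in range(steps):
        a = random.choice(('a', 'b'))   # explore uniformly at random
        s2, r = step(s, a)
        # Deterministic-world update: hat{Q}(s,a) = r + gamma * hat{V}(s'),
        # where hat{V}(s') = max_{a'} hat{Q}(s',a').
        Q[(s, a)] = r + gamma * max(Q[(s2, b)] for b in ('a', 'b'))
        s = s2
    return Q
```

With uniform random exploration every pair gets tried constantly, so many Q-intervals elapse and, per the theorem, the error shrinks by a factor gamma per interval.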
Also, related TD(lambda) algorithm where you pass experience
backward in time discounted by some lambda, rather than just passing
it back one level.
One more issue we need to address
=================================
Say you find approximate values V' and use the greedy policy \pi wrt V'.
Is \pi necessarily an approximately optimal policy?
Specifically, say that max_s |V(s)-V'(s)| = epsilon.
Let Delta = max_s (V(s) - V^{\pi}(s)).
Want to show that Delta is not too much larger than epsilon.
Proof: Let s be a state of maximum difference Delta between V(s) and
V^{\pi}(s). Say that the optimal action from s is "a" but \pi says to
do "b".
V^{\pi}(s) = E[R(s,b)] + gamma * sum_{s'}[Pr_{s,b}(s') V^{\pi}(s')]
Now, what if we follow pi for just one step and then behave optimally
from then on. At best this helps by gamma*Delta since the quantities
inside the sum go up by at most Delta. [In fact, this is an easy way
to see that actions "a" and "b" must be different.] So, this is still
worse than optimal at state s by at least Delta - gamma*Delta. I.e.,
Delta(1-gamma) <= [E[R(s,a)] + gamma*sum_{s'}Pr_{s,a}(s')V(s')] -
[E[R(s,b)] + gamma*sum_{s'}Pr_{s,b}(s')V(s')].
But since "b" looked better according to V' (that's how \pi was
defined), it must be the case that
0 >= [E[R(s,a)] + gamma*sum_{s'}Pr_{s,a}(s')V'(s')] -
[E[R(s,b)] + gamma*sum_{s'}Pr_{s,b}(s')V'(s')].
So, subtracting the second RHS from the 1st RHS, and using the fact
that |V(s') - V'(s')| <= epsilon, we get:
Delta(1-gamma) <= gamma*[sum_{s'}Pr_{s,a}(s')*epsilon] +
gamma*[sum_{s'}Pr_{s,b}(s')*epsilon],
which is at most 2*gamma*epsilon.
So, Delta <= 2*epsilon*gamma / (1 - gamma).
So the gap gets worse as gamma gets closer to 1, but we can at least
say that it's enough to make epsilon small compared to 1-gamma (i.e.,
small relative to 1 over the half-life 1/(1-gamma)).
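As a numeric sanity check of the bound, here is a small sketch that computes V by value iteration on a made-up MDP, perturbs it into a V', forms the greedy policy wrt V', evaluates that policy, and compares Delta to 2*epsilon*gamma/(1-gamma). All data and names are invented for illustration:

```python
# Hypothetical toy MDP (same layout as the earlier sketches).
P = {(0, 'a'): [(0, 0.5), (1, 0.5)], (0, 'b'): [(1, 1.0)],
     (1, 'a'): [(0, 1.0)],           (1, 'b'): [(1, 1.0)]}
R = {(0, 'a'): 1.0, (0, 'b'): 0.0, (1, 'a'): 0.0, (1, 'b'): 2.0}
gamma = 0.9
states, actions = (0, 1), ('a', 'b')

def backup(V, s, a):
    # One-step lookahead value: E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s').
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])

def optimal_values(iters=2000):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(backup(V, s, a) for a in actions) for s in states}
    return V

def policy_values(pi, iters=2000):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: backup(V, s, pi[s]) for s in states}
    return V

def check_bound(Vprime):
    V = optimal_values()
    eps = max(abs(V[s] - Vprime[s]) for s in states)       # max_s |V(s)-V'(s)|
    pi = {s: max(actions, key=lambda a: backup(Vprime, s, a))  # greedy wrt V'
          for s in states}
    delta = max(V[s] - policy_values(pi)[s] for s in states)
    return delta, 2 * eps * gamma / (1 - gamma)
```

On this example a perturbation of 0.5 flips the greedy action at one state and costs a Delta of about 0.18, comfortably inside the bound of 9, which illustrates that the bound is safe but can be loose.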
Combining with function approximation / concept learning
========================================================
What if the state space is too big to represent explicitly? E.g.,
each state is just described by some set of features.
- Neat idea: try to train a learning algorithm that, given a (state, action)
pair, outputs its Q-value. E.g., could be a neural net or whatever.
- How to train, since we don't have labeled data? Use Q-learning!
Say we're in state s, do action a, get reward r, and go to s'. Use
(1-alpha)hat{Q}(s,a) + alpha*(r + gamma * hat{V}(s')) as the label.
Can think of this like training up an evaluation function for chess
by trying to make it be self-consistent.
- What's particularly interesting here is we are effectively using two
notions of states being "similar" here. One is defined by the
representation of the learning algorithm (e.g., if you have a simple
linear weighted combination of features, then two states are similar
if they aren't too far apart in the important directions in the
feature space). The other is defined by the transition function.
- Will this really work? TD-gammon backgammon player learned a
policy for backgammon play using this approach and is the best
computer backgammon player in the world, and comparable to or better
than best human players. On the other hand, theory says this
approach isn't guaranteed to work: even if your predictor is
powerful enough to fit the value function, you can get into a bad
feedback loop. Work by Geoff Gordon on conditions that ensure
things will go well. But they are pretty strong conditions.
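A minimal sketch of the idea with a linear approximator (TD-gammon itself used a neural net; the feature vectors and all names below are invented): represent hat{Q}(s,a) = w . phi(s,a) and take a gradient step toward the Q-learning label after each experience.

```python
# Q-learning with a linear function approximator -- a hypothetical sketch.
# phi(s, a) is a feature vector for the (state, action) pair;
# hat{Q}(s, a) = w . phi(s, a).
def q_hat(w, phi):
    return sum(wi * xi for wi, xi in zip(w, phi))

def td_update(w, phi_sa, r, phis_next, gamma=0.9, lr=0.05):
    # phis_next = feature vectors of all (s', a') pairs available at s'
    # (empty if s' is terminal).  Label = r + gamma * max_{a'} hat{Q}(s', a').
    target = (r + gamma * max(q_hat(w, p) for p in phis_next)) if phis_next else r
    err = target - q_hat(w, phi_sa)
    # One gradient step on squared error (target - w.phi)^2.
    return [wi + lr * err * xi for wi, xi in zip(w, phi_sa)]
```

Note the self-consistency flavor: the "label" is built from the current weights themselves, which is exactly where the bad feedback loops mentioned above can come from when the representation and the transition structure disagree about which states are similar.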