15-859(B) Machine Learning Theory 04/10/07 * Markov Decision Processes and Reinforcement Learning II ========================================================================= ===Recap=== MDP model is: - a set of states - a set of actions - a probabilistic transition function: Let's use Pr_{s,a}(s') to denote prob of transitioning to state s' if do action a from state s. - a probabilistic reward function: R(s,a) is a probability distribtion over rewards you get from doing action a in state s. Assumption is that you know your current state. Like DFA model, MDP is appealing as a model of agent trying to figure out how to act in the world. Incorporates notion that actions can be good or bad not only because of immediate reward/penalty, but because of future rewards/penalties they can bring. Probabilities also allow you to reasonably handle unexpected events. [POMDP aside: What about the case when you don't completely know your current state? One default thing to do is just fold that into the probabilities. At some level, a lot of randomness experienced is just that there are properties of the world that we just can't measure. On the other hand, if you have a model of the world and can't tell exactly what state you are in (the POMDP setting) then the "right" way to handle this is to explicitly model your current location in the MDP as a probability cloud, called a "belief state". E.g., maybe there is a long hallway and you can't exactly tell where you are in it. Some actions widen the cloud and some (like actions designed to pinpoint your location) can reduce it. The problem from a planning perspective is that the optimal action doesn't have to be the optimal action from any of the components. E.g., Sebastian Thrun's heaven/hell example. [2k states. Actions: forward, turn around, look up]. To do this right, you want to view this as an MDP on belief states. These are continuous but you can show the optimal value function is piecewise linear on them. Of course, the state space blows up a lot. There's been a lot of work by Littman and others on algorithms for improving performance in practice. Sometimes you can deal with uncertainties by explictly defining new states corresponding to information you have. E.g., if there is a door that might be locked or unlocked, just explicitly model your state of knowledge in your state description. ] Different things we might want to optimize. We will consider expected discounted reward. Given gamma in (0,1), want to maximize E[r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 ...] where r_t is reward at time t. This has the nice property that the optimal strategy is a policy (a mapping of states to actions). Q-values ======== DEFINITION: Q(s,a) = expected discounted reward for performing action a from state s, and then behaving optimally from then on. DEFINITION: V(s) = max_a Q(s,a). V(s) = value of state s. So, V(s) = max_a [ E[R(s,a)] + gamma * sum_s' Pr(s')V(s') ] We then saw we can use this as a legal *definition* of V(s) since there is only one solution. [Actually, just to flesh out what may be an obvious point: the optimal policy simultaneously maximizes V(s) for all states s. That's because if you start at state s and go to s' then you want to now behave in the same way as if you had started at s', because the gamma^t discounting is time-invariant] Given the transition and reward functions, we saw we can solve for Q-values through either Dynamic Programming (called "value iteration") or Linear Programming. # iterations of DP needed depends on "half-life" 1/(1-gamma), so would prefer LP if gamma = 1 - 1/n^10. Another popular method is called "policy iteration": Here you guess some initial policy pi_0 which converts this to a markov chain. Then can solve for the values V^{pi_0} using linear algebra. Then compute greedy policy pi_1 with respect to these values. Then solve for V^{pi_1}, etc. In practice this converges quickly but only theoretical bounds known are that it converges at least as fast as value iteration. Q-learning ========== Deterministic version (if world is deterministic): - Suppose we are in state s, do action a, go to s' and receive reward r. Then set: hat{Q}(s,a) = r + gamma * hat{V}(s') where hat{V}(s') = max_{a'} hat{Q}(s',a'). [Analogous to running Bellman-Ford algorithm in an asynchronous way.] More generally, in a probabilistic environment: - On t-th update of hat{Q}(s,a): hat{Q}(s,a) = (1-alpha_t)hat{Q}(s,a) + alpha_t[r + gamma*hat{V}(s')] We ended by proving the following theorem about the deterministic version: Define a "Q-interval" to be an interval of time in which every is tried at least once. Let Delta_i = max_{s,a}|hat{Q}(s,a)-Q(s,a)| after the ith Q-interval. Then after the ith Q-interval, Delta_i <= Delta_0*gamma^i. For probabilistic version, you will need that sum_t alpha_t = infinity but sum_t (alpha_t)^2 is finite. Also, related TD(lambda) algorithm where you pass experience backward in time discounted by some lambda, rather than just passing it back one level. One more issue we need to address ================================= Say you find approximate values V' and use the greedy policy \pi wrt V'. Is \pi necessarily an approximately optimal policy? Specifically, say that max_s |V(s)-V'(s)| = epsilon. Let Delta = max_s (V(s) - V^{\pi}(s)). Want to show that Delta is not too much larger than epsilon. Proof: Let s be a state of maximum difference Delta between V(s) and V^{\pi}(s). Say that the optimal action from s is "a" but \pi says to do "b". V^{\pi}(s) = E[R(s,b)] + gamma * sum_{s'}[Pr_{s,b}(s') V^{\pi}(s')] Now, what if we follow pi for just one step and then behave optimally from then on. At best this helps by gamma*Delta since the quantities inside the sum go up by at most Delta. [In fact, this is an easy way to see that actions "a" and "b" must be different.] So, this is still worse than optimal at state s by at least Delta - gamma*Delta. I.e., Delta(1-gamma) <= [E[R(s,a)] + gamma*sum_{s'}Pr_{s,a}(s')V(s')] - [E[R(s,b)] + gamma*sum_{s'}Pr_{s,b}(s')V(s')]. But since "b" looked better according to V' (that's how \pi was defined), it must be the case that 0 >= [E[R(s,a)] + gamma*sum_{s'}Pr_{s,a}(s')V'(s')] - [E[R(s,b)] + gamma*sum_{s'}Pr_{s,b}(s')V'(s')]. So, subtracting the second RHS from the 1st RHS, and using the fact that |V(s') - V'(s')| <= epsilon, we get: Delta(1-gamma) <= gamma*[sum_{s'}Pr_{s,a}(s')*epsilon] + gamma*[sum_{s'}Pr_{s,b}(s')*epsilon], which is at most 2*gamma*epsilon. So, Delta <= 2*epsilon*gamma / (1 - gamma). So the gap gets worse as gamma gets closer to 1, but we can at least say that it's enough to get epsilon to be small compared to 1/(1-gamma). Combining with function approximation / concept learning ======================================================== What if the state space is too big to represent explicitly? E.g., each state is just described by some set of features. - Neat idea: try to train a learning algorithm that given a pair outputs its Q-value. E.g., could be a neural-net or whatever. - How to train, since we don't have labeled data? Use Q-learning! Say we're in state s, do action a, get reward r, and go to s'. Use (1-alpha)hat(Q)(s,a) + alpha (r + gamma * hat{V}(s')) as the label. Can think of this like training up an evaluation function for chess by trying to make it be self-consistent. - What's particularly interesting here is we are effectively using two notions of states being "similar" here. One is defined by the representation of the learning algorithm (e.g., if you have a simple linear weighted combination of features, then two states are similar if they aren't too far apart in the important directions in the feature space). The other is defined by the transition function. - Will this really work? TD-gammon backgammon player learned a policy for backgammon play using this approach and is the best computer backgammon player in the world, and comparable to or better than best human players. On the other hand, theory says this approach isn't guaranteed to work: even if your predictor is powerful enough to fit the value function, you can get into a bad feedback loop. Work by Geoff Gordon on conditions that ensure things will go well. But they are pretty strong conditions.