15-859(B) Machine Learning Theory                                   04/02/08

* Markov Decision Processes and Reinforcement Learning II
=========================================================================

===Recap===

The MDP model is:
- a set of states
- a set of actions
- a probabilistic transition function: let's use Pr_{s,a}(s') to denote the
  probability of transitioning to state s' if we do action a from state s.
- a probabilistic reward function: R(s,a) is a probability distribution over
  the rewards you get from doing action a in state s.

The assumption is that you know your current state.

There are different things we might want to optimize; we will consider
expected discounted reward.  Given gamma in (0,1), we want to maximize

   E[r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*r_3 + ...],

where r_t is the reward at time t.  This has the nice property that the
optimal strategy is a policy (a mapping of states to actions).

Q-values
========
DEFINITION: Q(s,a) = expected discounted reward for performing action a from
state s, and then behaving optimally from then on.

DEFINITION: V(s) = max_a Q(s,a).  V(s) = value of state s.

So, V(s) = max_a [ E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s') ].

We then saw we can use this as a legal *definition* of V(s) since there is
only one solution.  [Actually, just to flesh out what may be an obvious
point: the optimal policy simultaneously maximizes V(s) for all states s.
That's because if you start at state s and go to s', then you want to behave
from then on in the same way as if you had started at s', because the
gamma^t discounting is time-invariant.]

Given the transition and reward functions, we saw we can solve for Q-values
through either Dynamic Programming (called "value iteration") or Linear
Programming.  The number of iterations of DP needed depends on the
"half-life" 1/(1-gamma), so we would prefer LP if, say, gamma = 1 - 1/n^10.
(A small value-iteration sketch appears below, after the Q-learning
section.)

Another popular method is called "policy iteration": here you guess some
initial policy pi_0, which converts the MDP into a Markov chain.  You can
then solve for the values V^{pi_0} using linear algebra, then compute the
greedy policy pi_1 with respect to these values, then solve for V^{pi_1},
etc.  In practice this converges quickly, but the only theoretical bounds
known are that it converges at least as fast as value iteration.

Q-learning
==========
Deterministic version (if the world is deterministic):
- Suppose we are in state s, do action a, go to s', and receive reward r.
  Then set:
     hat{Q}(s,a) = r + gamma * hat{V}(s'),
  where hat{V}(s') = max_{a'} hat{Q}(s',a').
  [Analogous to running the Bellman-Ford algorithm in an asynchronous way.]

More generally, in a probabilistic environment:
- On the t-th update of hat{Q}(s,a):
     hat{Q}(s,a) = (1-alpha_t)*hat{Q}(s,a) + alpha_t*[r + gamma*hat{V}(s')].

We ended by proving the following theorem about the deterministic version.
Define a "Q-interval" to be an interval of time in which every state-action
pair (s,a) is tried at least once.  Let
Delta_i = max_{s,a} |hat{Q}(s,a) - Q(s,a)| after the ith Q-interval.  Then
after the ith Q-interval, Delta_i <= Delta_0 * gamma^i.

Proof by induction.  The base case (the very beginning) is OK.  After an
update in interval i of hat{Q}(s,a) (let's say that action a takes you from
s to s') we have:

   |hat{Q}(s,a) - Q(s,a)|
      = |(r + gamma*max_{a'} hat{Q}(s',a')) - (r + gamma*max_{a'} Q(s',a'))|
      = gamma * |max_{a'} hat{Q}(s',a') - max_{a'} Q(s',a')|
     <= gamma * max_{a'} |hat{Q}(s',a') - Q(s',a')|
     <= gamma * Delta_{i-1}  <=  gamma * gamma^{i-1} * Delta_0.

For the probabilistic version, you will need that sum_t alpha_t = infinity
but sum_t (alpha_t)^2 is finite.
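To make the update above concrete, here is a minimal tabular Q-learning
sketch in Python.  The environment interface (env.reset() and env.step(a)
returning the next state and reward), the epsilon-greedy exploration rule,
and the fixed learning rate are illustrative assumptions, not part of the
lecture.

   import random
   import numpy as np

   def q_learning(env, num_states, num_actions, gamma=0.9, alpha=0.1,
                  eps=0.1, num_steps=10000):
       # Tabular hat{Q}; a fixed alpha is used here for simplicity, whereas
       # the convergence condition above asks for a schedule alpha_t with
       # sum_t alpha_t = infinity and sum_t (alpha_t)^2 finite.
       Q = np.zeros((num_states, num_actions))
       s = env.reset()                      # assumed environment interface
       for t in range(num_steps):
           if random.random() < eps:        # occasionally explore...
               a = random.randrange(num_actions)
           else:                            # ...but usually act greedily
               a = int(np.argmax(Q[s]))
           s_next, r = env.step(a)          # observe reward r and next state s'
           # hat{Q}(s,a) <- (1-alpha)*hat{Q}(s,a) + alpha*[r + gamma*hat{V}(s')]
           Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
           s = s_next
       return Q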
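For comparison with the planning setting discussed earlier (transition and
reward functions known), here is a minimal value-iteration sketch.  It
assumes the model is given as NumPy arrays P[s,a,s'] = Pr_{s,a}(s') and
R[s,a] = E[R(s,a)]; the array layout and stopping tolerance are illustrative
choices.

   import numpy as np

   def value_iteration(P, R, gamma=0.9, tol=1e-8):
       # P[s, a, s2] = Pr_{s,a}(s2);  R[s, a] = E[R(s,a)]
       num_states, num_actions, _ = P.shape
       V = np.zeros(num_states)
       while True:
           # Q(s,a) = E[R(s,a)] + gamma * sum_{s'} Pr_{s,a}(s') V(s')
           Q = R + gamma * (P @ V)              # shape (num_states, num_actions)
           V_new = Q.max(axis=1)                # V(s) = max_a Q(s,a)
           if np.max(np.abs(V_new - V)) < tol:
               return V_new, Q.argmax(axis=1)   # values and a greedy policy
           V = V_new

The number of sweeps needed to reach a given accuracy grows with the
"half-life" 1/(1-gamma) mentioned above.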
Related Algorithms
==================
SARSA: "on policy" versus "off policy" updates.  Replace "hat{V}(s')" in the
Q-learning update with "hat{Q}(s',a')", where a' is the action actually
taken from s'.  So long as you usually choose the actions you believe to be
best (argmax_{a'} hat{Q}(s',a')), this will be similar to the Q-learning
update.  [Question: what is needed for the above proof to go through?]

Also related is the TD(lambda) algorithm, where you pass experience backward
in time discounted by some lambda, rather than just passing it back one
level.  TD is for *evaluating* a given policy, but you can also do this with
SARSA.  Algorithmically, you do it with a vector of multipliers over states
called an "eligibility trace": at each step, multiply each entry by
lambda*gamma (to discount), but add 1 to the current state.  (A small sketch
follows.)
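As an illustration of the eligibility-trace bookkeeping just described, here
is a minimal tabular TD(lambda) sketch for evaluating a fixed policy.  The
trajectory format (a list of (s, r, s') transitions generated by the policy
being evaluated), the step size, and the value of lambda are illustrative
assumptions.

   import numpy as np

   def td_lambda_evaluate(episodes, num_states, gamma=0.9, lam=0.8, alpha=0.05):
       # Estimate V^{pi} for the fixed policy that generated `episodes`.
       V = np.zeros(num_states)
       for episode in episodes:                 # episode = [(s, r, s_next), ...]
           e = np.zeros(num_states)             # eligibility trace over states
           for (s, r, s_next) in episode:
               delta = r + gamma * V[s_next] - V[s]   # one-step TD error
               e *= gamma * lam                 # discount every state's credit...
               e[s] += 1.0                      # ...but add 1 to the current state
               V += alpha * delta * e           # pass the error backward in time
       return V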
One more issue to address
=========================
Say you find approximate values V' and use the greedy policy \pi with
respect to V'.  Is \pi necessarily an approximately optimal policy?
Specifically, say that max_s |V(s) - V'(s)| = epsilon, and let
Delta = max_s (V(s) - V^{\pi}(s)).  We want to show that Delta is not too
much larger than epsilon.

Proof: Let s be a state of maximum difference Delta between V(s) and
V^{\pi}(s).  Say that the optimal action from s is "a" but \pi says to do
"b".

   V^{\pi}(s) = E[R(s,b)] + gamma * sum_{s'} Pr_{s,b}(s') V^{\pi}(s').

Now, what if we follow \pi for just one step and then behave optimally from
then on?  At best this helps by gamma*Delta, since the quantities inside the
sum go up by at most Delta.  [In fact, this is an easy way to see that
actions "a" and "b" must be different.]  So, this is still worse than
optimal at state s by at least Delta - gamma*Delta.  I.e.,

   Delta*(1-gamma) <= [E[R(s,a)] + gamma*sum_{s'} Pr_{s,a}(s')*V(s')]
                       - [E[R(s,b)] + gamma*sum_{s'} Pr_{s,b}(s')*V(s')].

But since "b" looked better according to V' (that's how \pi was defined), it
must be the case that

   0 >= [E[R(s,a)] + gamma*sum_{s'} Pr_{s,a}(s')*V'(s')]
         - [E[R(s,b)] + gamma*sum_{s'} Pr_{s,b}(s')*V'(s')].

So, subtracting the second right-hand side from the first, and using the
fact that |V(s') - V'(s')| <= epsilon, we get

   Delta*(1-gamma) <= gamma*[sum_{s'} Pr_{s,a}(s')*epsilon]
                       + gamma*[sum_{s'} Pr_{s,b}(s')*epsilon],

which is at most 2*gamma*epsilon.  So, Delta <= 2*epsilon*gamma/(1-gamma).
The gap gets worse as gamma gets closer to 1, but we can at least say that
it is enough to make epsilon smaller than the suboptimality we are willing
to tolerate by a factor of roughly 1/(1-gamma).

Combining with function approximation / concept learning
========================================================
What if the state space is too big to represent explicitly?  E.g., each
state is just described by some set of features.
- Neat idea: try to train a learning algorithm that, given a (state, action)
  pair, outputs its Q-value.  E.g., it could be a neural net or whatever.
- How to train it, since we don't have labeled data?  Use Q-learning!  Say
  we're in state s, do action a, get reward r, and go to s'.  Use
  (1-alpha)*hat{Q}(s,a) + alpha*(r + gamma*hat{V}(s')) as the label.  You
  can think of this like training up an evaluation function for chess by
  trying to make it be self-consistent.  (A small sketch with a linear
  approximator appears at the end of these notes.)
- What's particularly interesting is that we are effectively using two
  notions of states being "similar" here.  One is defined by the
  representation of the learning algorithm (e.g., if you have a simple
  linear weighted combination of features, then two states are similar if
  they aren't too far apart in the important directions in the feature
  space).  The other is defined by the transition function.
- Will this really work?  The TD-Gammon backgammon player learned a policy
  for backgammon using this approach; it is the best computer backgammon
  player in the world, and comparable to or better than the best human
  players.  On the other hand, theory says this approach isn't guaranteed to
  work: even if your predictor is powerful enough to fit the value function,
  you can get into a bad feedback loop.  There is work by Geoff Gordon on
  conditions that ensure things will go well, but they are pretty strong
  conditions.

=========================
2-player MDPs: what if you have two players A and B, who alternate moves?
One state is labeled "win for A" (and we can also have a state called "win
for B" if we like).  There is no discounting.  The goal of A is to maximize
the probability of entering the "win for A" state.
- What could we do if we defined the game to be a win for B if it lasts more
  than T steps?  (One natural approach is sketched below.)
- What if we don't?  It turns out there is no known poly-time algorithm for
  determining the optimal policy for A (in fact, it is believed not to be in
  P).  However, the decision question "does A have a strategy that
  guarantees at least probability p of winning?" is in NP intersect Co-NP.
  Can anyone see why it's in NP?  Co-NP?

These are sometimes called "simple stochastic games".
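On the first question above (declaring the game a win for B if it lasts more
than T steps): one natural approach is backward induction over the T
remaining moves, since the truncated game has a well-defined optimal value
at every depth.  Below is a sketch under several assumptions: the two
players' transition probabilities are given as arrays P_A[s,a,s'] (A to
move) and P_B[s,b,s'] (B to move), and the "win for A" state is absorbing
and marked by the 0/1 vector win_A.  The names and array layout are
illustrative, not from the lecture.

   import numpy as np

   def truncated_game_value(P_A, P_B, win_A, T):
       # Backward induction for the T-step truncated game.  Returns, for each
       # state with A to move, A's best probability of reaching "win for A"
       # within T moves against a worst-case B.
       win_A = np.asarray(win_A, dtype=float)
       vA = win_A.copy()                    # 0 moves left, A to move
       vB = win_A.copy()                    # 0 moves left, B to move
       for _ in range(T):
           # A maximizes the win probability; B minimizes it.
           vA_new = np.maximum(win_A, (P_A @ vB).max(axis=1))
           vB_new = np.maximum(win_A, (P_B @ vA).min(axis=1))
           vA, vB = vA_new, vB_new
       return vA

With the horizon fixed, this computation is polynomial in the number of
states, actions, and T.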
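Finally, returning to the function-approximation idea above: here is a
minimal sketch of one Q-learning-style update for a linear approximator
hat{Q}(s,a) = w . phi(s,a), where the bootstrapped value plays the role of
the label.  The feature map phi, the action set, and the step size are
illustrative assumptions; the smoothing by alpha that the notes fold into
the label shows up here as the gradient step size.

   import numpy as np

   def fa_q_update(w, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
       # One update of a linear approximator hat{Q}(s,a) = w . phi(s,a),
       # where phi(state, action) returns a feature vector (assumed interface).
       target = r + gamma * max(w @ phi(s_next, a2) for a2 in actions)
       prediction = w @ phi(s, a)
       # nudge the prediction toward the bootstrapped "label"; returns new weights
       return w + alpha * (target - prediction) * phi(s, a)

The same recipe applies to richer predictors (e.g., a neural net), which is
the setting of the TD-Gammon discussion above.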