15-859(B) Machine Learning Theory 03/31/08 * Markov Decision Processes and Reinforcement Learning ========================================================================= Today's topic: Markov Decision Processes (MDPs) and reinforcement learning. General scenario: much like the DFA problem: we are an agent in some environment, we have observations, we perform actions, they take us to a new state of the world. But also, we'll add a new ingredient to that setup, which is that from time to time we get a cookie (a reward), and our goal is going to be not so much to model the environment as to find a strategy that gets us lots of cookies. We're going to make some assumptions that make the problem easier, and some that make the problem harder, compared to the DFA problem. Specifically, we'll assume: - All states look different: you can always tell, from your current observation, what state you're in. [makes the problem easier] Also, reward depends only on current state and action. - But, transitions are probabilistic [makes the problem harder] - Goal is to learn a good strategy for collecting a lot of reward, rather than necessarily to make a model. [makes the problem a bit different] Formally, a Markov Decision Process consists of - a set of states - a set of actions - a probabilistic transition function: f(s,a) is a probability distribution over states, and - a probabilistic reward function: R(s,a) is a probability distribtion over rewards you get from doing action a in state s. Assumption is that you know your current state. Simple example: grid-world with walls. can go left, right, up, or down. Entering top-right corner gives you a reward of 100 and then takes you to a random state. Perhaps if you're not currently "hugging" a wall, then each action has some chance of taking you to a random neighbor. Nice features of MDPs: ---------------------- * Like DFA model, it's appealing as a model of agent trying to figure out how to act in the world. Incorporates notion that actions can be good or bad not only because of immediate reward/penalty, but because of future rewards/penalties they can bring. E.g., game playing (though there you often have huge number of states), or a robot that needs to do a task that takes several steps to carry out. * learning algorithms will be very natural: propagate rewards backward through state space: if you do an action from state s and get a reward of 100, you might give value of 90 to the state s' you were in right before s, since you can get from s' to s. A lot like notion of "conditioning" in psychology. That's why this is called reinforcement learning. * Probabilities allow you to reasonably handle situations like: you told your robot to go forward 1 foot, and instead it went 1.5 feet and turned a bit to the right. Or you ran out of battery power and got moved back to the start. Or someone just randomly came along and picked you up and put you down somewhere else. * Probabilities give to some extent a way of handling states that look the same, by merging them into one state with probabilistic transitions going out. Limitations ----------- * States that look the same can be a real problem, even with probabilities. E.g., imagine robot that needs to go into Wean 4623. Right now in corridor. Door might be locked or unlocked: that's two states of the world that look the same. Once we get there, we find door is locked. MDP model would have robot keep going back into corridor and trying again. * markov assumption not quite right: e.g., if your robot's wheels can get stuck, an MDP could model this as saying the go-forward action has a 2% chance of keeping you in the same state. But really, if that occured, you expect the probability it doesn't work the next time to have gone up by a lot. (could also be viewed as same as previous problem). "POMDPs" capture probabilistic transitions *and* lack of full observability, but there is much less in terms of positive results about them, so we are going to stick with MDPs. What do we want to optimize? ============================ We haven't exactly specified what we want our learned strategy to optimize. Some natural choices are: - expected average reward (reward per time step) - expected reward in first t time steps (t given to us). - expected discounted reward: sum of rewards but giving less weight to rewards in the future. Specifically, a typical way of doing this: r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 ... where gamma < 1 is given to us, and r_t is reward in tth step. Most common is the 3rd one. Problem with average reward is it can let your robot dilly-dally: two strategies that differ only in the first 1000 time steps are equivalent. Fixing time-horizon t can be very sensitive to exact horizon. Discounting forces you to always do something reasonable. Technical question: why discount by gamma^t and not, say, 1/t^2? People will give statements like "it's like an interest rate" but the real reason people use gamma^t is that it simplifies the problem: the optimal strategy is now TIME-INDEPENDENT: it's just a mapping from states to actions. In other words, if action a is the best choice given that we are at state s at time 0, then we want to do action a *every* time we reach state s. That's because the suffix of the sum starting from time t looks exactly like the sum itself, scaled by gamma^t. This kind of strategy is called a POLICY. Q-values ======== DEFINITION: Q(s,a) = expected discounted reward for performing action a from state s, and then behaving optimally from then on. DEFINITION: V(s) = max_a Q(s,a). V(s) = value of state s. The way that a procedure called Q-learning is going to work is to learn approximations to these Q values, as opposed to explicitly learning a mapping from states to actions. Once we have Q-values, we can then pick actions accordingly. Nice relationship among Q-values: Q(s,a) = E[R(s,a)] + gamma * sum_s' Pr(s')*V(s'). where Pr(s') is the probability that s' is the next state when you do action a from state s. Plugging in the definition of V(s), we can either write the Q's in terms of themselves, or the V's in terms of themselves. E.g., V(s) = max_a [ E[R(s,a)] + gamma * sum_s' Pr(s')V(s') ] In fact, we can use this as a legal definition of Q(s,a) and V(s). In particular, there is only one solution. Can anyone see why? Proof: one nice way to see this is: let V be the true value of each state according to our original definition, and suppose you modify V, changing each entry by at most some Delta, and plug it into the RHS of the above equation. Then, notice you'll be off by at most Delta*gamma on the left-hand-side. I.e., if I had some other solution that was off by at most Delta, then it would also have to be off by at most gamma*Delta. What if we are *given* the transition and reward functions. How to solve for Q values? Two natural ways. Method #1: Dynamic Programming. We just use the above observation. Start with some arbitrary guesses for the values V, and then run them through the above equation to get an updated guess. Specifically, V_i(s) = max_a [ E[R(s,a)] + gamma * sum_s' Pr(s')V_{i-1}(s') ] In fact, if we initialize V_0(s) = 0, then V_t(s) represents the maximum discounted reward we can get if the game ends in t steps. Note: this will converge exponentially quickly to the correct values. If initially off by Delta, the in t steps you're off by at most Delta*gamma^t. Method #2: Linear Programming. Replace "max" with ">=" and then minimize sum_s V(s) subject to those constraints. Since the quantities are all monotonically related in the inequalities, the minimum sum will occur when, for each s, one of its constraints is at equality. Q-learning ========== Idea in Q-learning is that if we *don't* have a model and only get rewards as we go, we can still do DP ("Bellman backups") in an asynchronous way. Let hat{Q}(s) be our estimated Q value for state s. Deterministic version (if world is deterministic): - initialize all hat{Q} to 0 - Suppose we are in state s, do action a, go to s' and receive reward r. Then set: hat{Q}(s,a) = r + gamma * hat{V}(s') where hat{V}(s') = max_{a'} hat{Q}(s',a'). [Analogous to running Bellman-Ford shortest path algorithm in an asynchronous way. Say your goal is to determine the shortest path to a given goal state. You are currently in a state s where you know a path of length 50, but you travel an edge of length 10 and get to a state s' where you know of a path of length 30. So, you update s to being distance at most 40] More generally, in a probabilistic environment: - On t-th update of hat{Q}(s,a): hat{Q}(s,a) = (1-alpha_t)hat{Q}(s,a) + alpha_t[r + gamma*hat{V}(s')] - Idea: dampen the randomness. Use alpha_t = 1/t or similar. With alpha_t = 1/t, you get a fair average of all the rewards received for doing action a in state s (just unroll the sum). If you make more slowly decreasing, you favor your more recent experience. Convergence of Q-learning (deterministic version): - Say we start with some hat{Q} values. Define Delta_0 = max_{s,a} |hat{Q}(s,a) - Q(s,a)|. - Let's define a "Q-interval" to be an interval of time in which every is tried at least once. Claim is that after ith Q-interval, we've done at least as well as i steps of dynamic programming. Specifically, let Delta_i = max_{s,a}|hat{Q}(s,a)-Q(s,a)| after the ith Q-interval. Claim: after the ith Q-interval, Delta_i <= Delta_0*gamma^i. In addition, during the ith Q-interval, this maximum difference will be at most Delta_0 * gamma^{i-1}. Prove by induction. Base case (the very beginning) is OK. After an update in interval i of hat{Q}(s,a) (let's say that action a takes you from s to s') we have: |hat{Q}(s,a) - Q(s,a)| = |(r + gamma max_{a'} hat{Q}(s',a')) - (r + gamma max_{a'}Q(s',a')| = gamma |max_{a'}hat{Q}(s',a') - max_{a'}Q(s',a')| <= gamma max_{a'} |hat{Q}(s',a') - Q(s',a')| <= gamma * gamma^{i-1}*Delta_0. For probabilistic version, you will need that sum_t alpha_t = infinity but sum_t (alpha_t)^2 is finite. Various issues -------------- - We've just talked about how to update the estimates given the actions chosen, but how to choose the action? Typical thing: with prob 1-epsilon choose the action you think is best, with prob epsilon choose a random action. Another thing you can do is if you start all the hat{Q}-values too high, then this will naturally cause the algorithm to select actions it hasn't tried much. There's more sophisticated work on analyzing action choices as a function of properties (like mixing rates) of the environment. Combining with function approximation / concept learning ======================================================== What if the state space is too big to represent explicitly? E.g., each state is just described by some set of features. - Neat idea: try to train a learning algorithm that given a pair outputs its Q-value. E.g., could be a neural-net or whatever. - How to train, since we don't have labeled data? Use Q-learning! Say we're in state s, do action a, get reward r, and go to s'. Use (1-alpha)hat(Q)(s,a) + alpha (r + gamma * hat{V}(s')) as the label. Can think of this like training up an evaluation function for chess by trying to make it be self-consistent. - Will this really work? TD-gammon backgammon player learned a policy for backgammon play using this approach and is the best computer backgammon player in the world, and comparable to or better than best human players. On the other hand, theory says this approach isn't guaranteed to work: even if your predictor is powerful enough to fit the value function, you can get into a bad feedback loop. Work by Geoff Gordon on conditions that ensure things will go well. But they are pretty strong conditions. Other material if time: Say you find approximate values V' and use the greedy policy wrt V'. Why is that necessarily an apx optimal policy?