============================================
Direct Gradient-Based Reinforcement Learning
============================================

Jonathan Baxter
Research School of Information Sciences and Engineering
Australian National University

Many control, scheduling, planning and game-playing tasks can be formulated as reinforcement learning problems, in which an agent chooses actions to take in some environment, aiming to maximize a reward function. Most reinforcement learning methods attempt to learn the values (expected subsequent rewards) of states and then choose actions that maximize these values. This approach has had some empirical success, but there are few useful theoretical guarantees on performance, except in the table-lookup setting. In particular, it is not uncommon for performance to degrade during learning.

This talk describes an alternative approach, which we call direct reinforcement learning. We consider policies defined by parameters: the parameters might define approximate value functions that are used to generate a policy by some form of look-ahead, or they might parameterize the agent's policy directly. Rather than attempting to find accurate value estimates and then using these to generate a policy, we adjust the parameters directly to improve the average reward. Specifically, we compute the gradient of the average reward with respect to the parameters, and then use gradient ascent to generate a new set of parameters with increased average reward.

The talk will present an algorithm for computing accurate approximations to the gradient of the average reward from a single sample path. The algorithm has antecedents in Williams' REINFORCE class of algorithms. We use these gradient estimates in a conjugate-gradient ascent algorithm with a novel line-search routine that relies solely on gradient estimates.
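The single-sample-path estimator mentioned above is not spelled out in the abstract. The sketch below illustrates the general REINFORCE-style idea it alludes to: accumulate a discounted eligibility trace of the score function grad log pi(a|s) along one trajectory and correlate it with subsequent rewards. The toy two-state environment, the softmax parameterization, and all names and constants are illustrative assumptions, not the talk's actual algorithm.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def estimate_gradient(theta, T=50000, beta=0.95, seed=0):
    """Estimate the gradient of the average reward from a single sample
    path, REINFORCE-style: maintain a discounted eligibility trace of
    grad log pi and average its product with the observed rewards.
    theta has one row of action preferences per state (assumed setup)."""
    rng = np.random.default_rng(seed)
    s = 0
    z = np.zeros_like(theta)      # eligibility trace of score functions
    delta = np.zeros_like(theta)  # running average: the gradient estimate
    for t in range(T):
        probs = softmax(theta[s])
        a = rng.choice(2, p=probs)
        # Toy 2-state chain (illustrative): action a moves to state a
        # with prob 0.9; reward is 1 in state 1 and 0 in state 0.
        s_next = a if rng.random() < 0.9 else 1 - a
        r = float(s_next == 1)
        g = np.zeros_like(theta)
        g[s] = np.eye(2)[a] - probs       # grad log pi(a|s) for softmax
        z = beta * z + g                  # discount older credit
        delta += (r * z - delta) / (t + 1)  # incremental average
        s = s_next
    return delta
```

With the parameters at zero (a uniform random policy), the estimate should point toward preferring action 1 in both states, since action 1 leads to the rewarding state. The discount factor beta trades bias for variance, which connects to the mixing-time dependence discussed below.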
We show how the running time of the gradient algorithm is related to the mixing time of the Markov chain underlying the reinforcement learning problem. In a variety of domains, including a communications-network call-admission problem and a dynamic-system control problem, the algorithm rapidly finds optimal or near-optimal solutions.

(Joint work with Peter Bartlett and Lex Weaver)
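The abstract does not specify the gradient-only line search used inside the conjugate-gradient loop. One scheme consistent with the description, sketched below under stated assumptions, brackets a maximum along the search direction using only the sign of the directional derivative (the inner product of a gradient estimate with the direction), then bisects; the function name and step constants are illustrative, not the routine from the talk.

```python
import numpy as np

def gradient_sign_line_search(grad_fn, theta, direction,
                              step0=1.0, grow=2.0, max_iters=20, tol=1e-6):
    """Pick a step size along `direction` using only gradient estimates:
    march forward while the directional derivative stays positive
    (still ascending), then bisect the interval where its sign flips."""
    lo, hi = 0.0, step0
    # Grow the step until the directional derivative becomes non-positive.
    for _ in range(max_iters):
        d = np.dot(grad_fn(theta + hi * direction), direction)
        if d <= 0:
            break
        lo, hi = hi, hi * grow
    else:
        return hi  # never bracketed a maximum; return the largest step tried
    # Bisect the bracket [lo, hi] on the sign of the directional derivative.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.dot(grad_fn(theta + mid * direction), direction) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, for the concave objective f(x) = -(x - 3)^2 with gradient -2(x - 3), searching from 0 along the unit direction locates the maximizing step near 3. Using only gradients matters here because in the reinforcement learning setting function values (average rewards) are themselves noisy estimates, while the gradient estimator above is what the algorithm already computes.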