============================================
Direct Gradient-Based Reinforcement Learning
============================================

Jonathan Baxter
Research School of Information Sciences and Engineering
Australian National University

Many control, scheduling, planning and game-playing tasks can be formulated as reinforcement learning problems, in which an agent chooses actions to take in some environment, aiming to maximize a reward function. Most reinforcement learning methods attempt to learn the values (expected subsequent rewards) of states and then choose actions that maximize these values. This approach has had some empirical success, but there are few useful theoretical guarantees on performance, except in the table-lookup setting. In particular, it is not uncommon for performance to degrade during learning.

This talk describes an alternative approach, which we call direct reinforcement learning. We consider policies defined by parameters: the parameters might define approximate value functions that are used to generate a policy by some form of look-ahead, or they might parameterize the agent's policy directly. Rather than attempting to find accurate value estimates and then using these to generate a policy, we adjust the parameters directly to improve the average reward. Specifically, we compute the gradient of the average reward with respect to the parameters, and then use gradient ascent to generate a new set of parameters with increased average reward.

The talk will present an algorithm for computing accurate approximations to the gradient of the average reward from a single sample path. The algorithm has antecedents in Williams' REINFORCE class of algorithms. We use these gradient estimates in a conjugate-gradient ascent algorithm with a novel line-search routine that relies solely on gradient estimates.
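The single-sample-path estimator mentioned above is not spelled out in the abstract. The sketch below illustrates the general REINFORCE-style idea it alludes to: accumulate a discounted eligibility trace of the score function grad log pi(a|s) along one trajectory and correlate it with subsequent rewards. The toy two-state environment, the softmax parameterization, and all names and constants are illustrative assumptions, not the talk's actual algorithm.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def estimate_gradient(theta, T=50000, beta=0.95, seed=0):
    """Estimate the gradient of the average reward from a single sample
    path, REINFORCE-style: maintain a discounted eligibility trace of
    grad log pi and average its product with the observed rewards.
    theta has one row of action preferences per state (assumed setup)."""
    rng = np.random.default_rng(seed)
    s = 0
    z = np.zeros_like(theta)      # eligibility trace of score functions
    delta = np.zeros_like(theta)  # running average: the gradient estimate
    for t in range(T):
        probs = softmax(theta[s])
        a = rng.choice(2, p=probs)
        # Toy 2-state chain (illustrative): action a moves to state a
        # with prob 0.9; reward is 1 in state 1 and 0 in state 0.
        s_next = a if rng.random() < 0.9 else 1 - a
        r = float(s_next == 1)
        g = np.zeros_like(theta)
        g[s] = np.eye(2)[a] - probs       # grad log pi(a|s) for softmax
        z = beta * z + g                  # discount older credit
        delta += (r * z - delta) / (t + 1)  # incremental average
        s = s_next
    return delta
```

With the parameters at zero (a uniform random policy), the estimate should point toward preferring action 1 in both states, since action 1 leads to the rewarding state. The discount factor beta trades bias for variance, which connects to the mixing-time dependence discussed below.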
We show how the running time of the gradient algorithm is related to the mixing time of the Markov chain underlying the reinforcement learning problem. In a variety of domains, including a communications-network call-admission problem and a dynamic-system control problem, the algorithm rapidly finds optimal or near-optimal solutions.

(Joint work with Peter Bartlett and Lex Weaver)
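The abstract does not specify the gradient-only line search used inside the conjugate-gradient loop. One scheme consistent with the description, sketched below under stated assumptions, brackets a maximum along the search direction using only the sign of the directional derivative (the inner product of a gradient estimate with the direction), then bisects; the function name and step constants are illustrative, not the routine from the talk.

```python
import numpy as np

def gradient_sign_line_search(grad_fn, theta, direction,
                              step0=1.0, grow=2.0, max_iters=20, tol=1e-6):
    """Pick a step size along `direction` using only gradient estimates:
    march forward while the directional derivative stays positive
    (still ascending), then bisect the interval where its sign flips."""
    lo, hi = 0.0, step0
    # Grow the step until the directional derivative becomes non-positive.
    for _ in range(max_iters):
        d = np.dot(grad_fn(theta + hi * direction), direction)
        if d <= 0:
            break
        lo, hi = hi, hi * grow
    else:
        return hi  # never bracketed a maximum; return the largest step tried
    # Bisect the bracket [lo, hi] on the sign of the directional derivative.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.dot(grad_fn(theta + mid * direction), direction) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, for the concave objective f(x) = -(x - 3)^2 with gradient -2(x - 3), searching from 0 along the unit direction locates the maximizing step near 3. Using only gradients matters here because in the reinforcement learning setting function values (average rewards) are themselves noisy estimates, while the gradient estimator above is what the algorithm already computes.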