Incremental Least-Squares Temporal Difference Learning
Michael Bowling, University of Alberta, Edmonton
Online policy evaluation with linear function approximation is a
commonly arising problem in reinforcement learning. Temporal
difference (TD) learning is the traditional approach, which is both
guaranteed to converge and only requires computation that is linear in
the number of features per time step. However, it may necessitate
many trajectory samples to achieve a good approximation.
Least-squares TD learning (LSTD) techniques are a recent alternative
that makes more effecient use of data trajectories, but requires
computation that is quadratic in the number of features per time step.
This computational cost prevents the least-squares approach from being
practical with a large feature space. This talk describes a new
variant of TD learning, called incremental least-squares TD learning,
or iLSTD. It uses trajectory data more efficiently than TD, and with
certain function approximators only requires linear, rather than
quadratic, computation per time step. In addition to demonstrating
the computational and data advantages of iLSTD, this talk will also
show how iLSTD can be efficiently extended with eligibility traces and
present sufficient conditions for convergence.
This is joint work with Alborz Geramifard, Martin Zinkevich, and Rich Sutton.
Michael Bowling is an assistant professor at the University of Alberta
in Edmonton, Canada. He received his Ph.D. in 2003 from Carnegie
Mellon University, where his dissertation focused on multiagent
learning. He is now a recovering robot soccer competitor,
investigating machine learning algorithms for robotics, computer
games, and poker.