Incremental Least-Squares Temporal Difference Learning

Michael Bowling, University of Alberta, Edmonton

Online policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning. Temporal difference (TD) learning is the traditional approach: it is guaranteed to converge and requires only computation linear in the number of features per time step, but it may require many trajectory samples to achieve a good approximation. Least-squares TD learning (LSTD) techniques are a more recent alternative that make more efficient use of trajectory data but require computation quadratic in the number of features per time step. This computational cost makes the least-squares approach impractical with large feature spaces. This talk describes a new variant of TD learning, called incremental least-squares TD learning, or iLSTD. It uses trajectory data more efficiently than TD and, with certain function approximators, requires only linear, rather than quadratic, computation per time step. In addition to demonstrating the computational and data advantages of iLSTD, this talk will show how iLSTD can be efficiently extended with eligibility traces and present sufficient conditions for convergence. This is joint work with Alborz Geramifard, Martin Zinkevich, and Rich Sutton.
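The abstract does not spell out the update rule, so the following is only a rough sketch of the kind of per-step computation iLSTD performs, not the authors' exact algorithm. It maintains the usual LSTD statistics A and b incrementally along with the residual mu = b - A theta, and descends on only a few coordinates of theta per step, so that with sparse features each call costs O(n) rather than the O(n^2) of a full least-squares solve. The function name, step size alpha, and descent count m are illustrative assumptions.

import numpy as np

def ilstd_step(A, b, mu, theta, phi, reward, phi_next,
               gamma=0.9, alpha=0.01, m=1):
    """One hypothetical iLSTD-style update (sketch, not the talk's algorithm).

    A, b   : LSTD statistics, accumulated incrementally
    mu     : residual b - A @ theta, kept consistent after every change
    theta  : current weight vector
    phi, phi_next : (sparse) feature vectors for s and s'
    """
    d = phi - gamma * phi_next          # TD feature difference
    A += np.outer(phi, d)               # A <- A + phi (phi - gamma phi')^T
    b += reward * phi                   # b <- b + r phi
    # The statistics changed, so mu = b - A theta changes by (r - d.theta) phi;
    # with k nonzero features this costs O(k), not O(n^2).
    mu += (reward - d @ theta) * phi
    for _ in range(m):
        j = np.argmax(np.abs(mu))       # coordinate with largest residual
        step = alpha * mu[j]
        theta[j] += step                # move theta_j toward the LSTD solution
        mu -= step * A[:, j]            # restore mu = b - A theta
    return A, b, mu, theta

# Usage: with n features, initialize A, mu as zeros((n, n)) / zeros(n),
# b and theta as zeros(n), then call ilstd_step once per observed transition.

Descending only on the largest-residual coordinates is what keeps the per-step cost linear; a full solve of A theta = b each step would reproduce LSTD's quadratic (or worse) cost.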

Brief Bio

Michael Bowling is an assistant professor at the University of Alberta in Edmonton, Canada. He received his Ph.D. in 2003 from Carnegie Mellon University, where his dissertation focused on multiagent learning. Now a recovering robot soccer competitor, he investigates machine learning algorithms for robotics, computer games, and poker.