Residual Algorithms
Reinforcement learning
algorithms such as Q-learning,
advantage learning,
and value iteration all try to find functions
that satisfy the Bellman equation. For example, in Q-learning,
the goal is to find a function Q(x,u) that satisfies
the Bellman equation:
Q(x,u) = E[ R + gamma * max Q(x',u') ]
where performing action u in state x leads to a transition to
state x' and a reinforcement R. The maximum is taken over
all actions u' that could be performed in state x'.
The expected value is taken over all possible (x',R)
outcomes from performing action u in state x in a
stochastic MDP.
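As a concrete illustration of this backup, here is a minimal sketch in Python of the sampled Bellman target and the standard tabular Q-learning update that nudges Q(x,u) toward it; the step size alpha, the discount gamma, and the 2-D array representation of Q are assumptions of the sketch, not part of the definition above.

    import numpy as np

    def sampled_target(Q, R, x_next, gamma=0.9):
        # Sampled right-hand side of the Bellman equation:
        # R + gamma * max over u' of Q(x', u')
        return R + gamma * np.max(Q[x_next])

    def q_learning_update(Q, x, u, R, x_next, alpha=0.1, gamma=0.9):
        # Move Q(x, u) a small step toward the sampled target (in place).
        Q[x, u] += alpha * (sampled_target(Q, R, x_next, gamma) - Q[x, u])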
At any given time during learning, the Q function that has been learned
so far may not exactly satisfy the Bellman equation. The Bellman
residual is the difference between the two sides of the equation.
This is slightly different from the TD error. The TD error
is the difference between the two sides of the equation without the
expected value, evaluated after just a single transition from
(x,u). So the Bellman residual is the expected value of
the TD error, and the TD error is the Bellman residual plus zero-mean
noise.
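To make the distinction concrete, the sketch below computes the TD error from a single sampled transition and, for comparison, the Bellman residual itself from an explicit transition model P[x,u,x'] and reward function; the model and the reward function are assumptions of the sketch, used only so that the expectation can be computed exactly.

    import numpy as np

    def td_error(Q, x, u, R, x_next, gamma=0.9):
        # Difference between the two sides of the Bellman equation
        # for one sampled transition (x, u) -> (x', R).
        return R + gamma * np.max(Q[x_next]) - Q[x, u]

    def bellman_residual(Q, x, u, P, reward, gamma=0.9):
        # Expected TD error, averaging over all successor states x'
        # with a known transition model P[x, u, x'] and reward(x, u, x').
        expected_target = sum(
            P[x, u, x2] * (reward(x, u, x2) + gamma * np.max(Q[x2]))
            for x2 in range(P.shape[2]))
        return expected_target - Q[x, u]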
One way to learn the Q function is to do gradient descent on e,
which is the mean squared Bellman residual. That is, square the Bellman
residual for each state-action pair and take the mean of them all. The
mean is weighted by how often each state-action pair is visited. Assume
that there is a fixed, stochastic training policy so e
is well defined. Then there are three ways to learn Q:
- The direct method
Do gradient descent on e,
but when calculating the derivatives,
pretend that the weights only affect Q(x,u) but not Q(x',u').
- The residual gradient method
Do gradient descent on e.
- The residual method
Make the weight change recommended by the direct method,
scaled by (1-phi), plus the weight change recommended by the residual
gradient method, scaled by phi, where phi is a constant between 0 and 1.
(All three update rules are sketched in code after this list.)
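Here is a minimal sketch of all three update rules for a single sampled transition, using a linear function approximator Q(x,u) = w . f(x,u) with a toy one-hot feature map; the feature map, the state and action counts, the step size alpha, and the discount gamma are assumptions of the sketch. Setting phi=0 gives the direct method, phi=1 gives the residual gradient method, and values in between give a residual algorithm.

    import numpy as np

    N_STATES, N_ACTIONS = 5, 2

    def f(x, u):
        # Toy one-hot feature map for the linear approximator
        # Q(x, u) = w . f(x, u) (an assumption of this sketch).
        v = np.zeros(N_STATES * N_ACTIONS)
        v[x * N_ACTIONS + u] = 1.0
        return v

    def residual_update(w, x, u, R, x_next, alpha=0.1, gamma=0.9, phi=0.5):
        # phi = 0: direct method; phi = 1: residual gradient method;
        # 0 < phi < 1: residual algorithm.  (In a stochastic MDP an
        # unbiased residual gradient step would need a model or two
        # independent samples of x'; see the note near the end.)
        q = lambda s, a: np.dot(w, f(s, a))
        u_best = max(range(N_ACTIONS), key=lambda a: q(x_next, a))
        delta = R + gamma * q(x_next, u_best) - q(x, u)   # TD error

        dw_direct = alpha * delta * f(x, u)
        dw_res_grad = alpha * delta * (f(x, u) - gamma * f(x_next, u_best))
        return w + (1 - phi) * dw_direct + phi * dw_res_grad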
The direct method is the standard method proposed by Watkins
for learning the Q(x,u) function and used in TD-Gammon
for learning the V(x) function. There are simple cases
where the direct method causes all the weights and values to blow
up. This even happens for linear function approximators that
are general enough to represent all possible value functions.
It even happens when doing on-policy training with a
function approximator and Q-learning.
On the other hand, the residual gradient method is guaranteed to
converge in the same sense that backprop converges. Of course,
this means that residual methods also converge when phi=1. In
fact, convergence can even be guaranteed for phi values less than
1, depending on the problem. There is even an automatic way to
choose appropriate phi values. Simulations so far have indicated
that the direct method tends to learn fast when it converges,
but can fail to converge. The residual gradient method always
converges but is sometimes slow. Residual algorithms with an
appropriate choice of phi are both guaranteed to converge and
have been fast in simulation. The algorithm for automatically
determining phi has consistently found the fastest phi that is
still stable.
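The exact rule appears in the papers below; purely as an illustration of the idea, the sketch here picks, at each update, the smallest phi for which the combined weight change still has a non-negative dot product with the residual gradient step (so the step does not point uphill on e), then adds a small safety margin mu. The margin, the clipping, and the function name are assumptions of this sketch, not the published algorithm.

    import numpy as np

    def choose_phi(dw_direct, dw_res_grad, mu=0.1):
        # Illustrative adaptive phi: keep the fast direct step when it
        # already points (weakly) downhill on e; otherwise blend in just
        # enough of the residual gradient step, plus a margin mu.
        d = np.dot(dw_direct, dw_res_grad)
        if d >= 0:
            return 0.0
        phi_boundary = d / (d - np.dot(dw_res_grad, dw_res_grad))
        return min(1.0, phi_boundary + mu)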
There are certain theoretical cases where residual algorithms
require a model of the MDP: in a stochastic MDP, an unbiased
estimate of the gradient of e requires two independent samples
of the successor state for each update, which a model can supply.
But for most real-world problems,
a model is not needed. If the MDP is so large that no single
state-action pair is likely to be seen twice during training,
then there is no need for a model. This result is not mentioned
in the papers below, but is significant for the usefulness of
residual algorithms.
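For the stochastic case, here is a minimal sketch of how such a model can be used: the TD error is formed from one sampled outcome of (x,u) and the gradient term from a second, independently drawn outcome, giving an unbiased estimate of the gradient of e. The function name, the feature-map argument f, and the outcome tuples are assumptions of the sketch.

    import numpy as np

    def residual_grad_two_sample(w, f, x, u, outcome1, outcome2,
                                 n_actions, alpha=0.1, gamma=0.9):
        # Unbiased residual gradient step for a stochastic MDP: the TD
        # error uses one sampled outcome of (x, u), and the gradient
        # term uses a second, independent outcome (e.g., from a model).
        q = lambda s, a: np.dot(w, f(s, a))
        R1, x1 = outcome1
        R2, x2 = outcome2
        u1 = max(range(n_actions), key=lambda a: q(x1, a))
        u2 = max(range(n_actions), key=lambda a: q(x2, a))
        delta = R1 + gamma * q(x1, u1) - q(x, u)
        return w + alpha * delta * (f(x, u) - gamma * f(x2, u2))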
More Information
- Residual algorithms are defined and simple simulation results are given.
Baird, L. C. (1995).
Residual Algorithms: Reinforcement Learning with Function Approximation.
In Armand Prieditis & Stuart Russell, eds.
Machine Learning: Proceedings of the Twelfth International Conference,
9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.
- Residual algorithms are defined and simple simulation results are given.
Baird, L. C. (1995). Residual Algorithms.
Proceedings of the Workshop on Value Function
Approximation, Machine Learning Conference 1995.
Justin A. Boyan, Andrew W. Moore, Richard S. Sutton, editors.
Technical report CMU-CS-95-206, July 9, 1995.
- An application of residual Q-learning.
Bandera, C., Vico, F. J., Bravo, J. M., Harmon, M. E., and Baird, L. C. (1996).
Residual Q-learning applied to visual attention.
Proceedings of the Thirteenth International
Conference on Machine Learning,
Bari, Italy, 3-6 July.
- Residual advantage learning applied to a game with nonlinear dynamics and a nonlinear
function approximator.
Harmon, M. E., and Baird, L. C. (1996).
Multi-player residual advantage learning with general function approximation.
(Technical Report WL-TR-96-1065).
Wright-Patterson Air Force Base Ohio: Wright Laboratory.
(available from the Defense Technical Information Center,
Cameron Station, Alexandria, VA 22304-6145).
- Residual advantage learning applied to a game with nonlinear dynamics and a nonlinear
function approximator
Harmon, M. E., and Baird, L. C. (1996).
Residual Advantage Learning Applied to a Differential Game.
Proceedings of the International Conference on Neural Networks
(ICNN'96), Washington D.C., 3-6 June.
- Residual advantage learning applied to a game with linear dynamics and a linear
function approximator
Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995).
Reinforcement Learning Applied to a Differential Game.
Adaptive Behavior, MIT Press, 4(1), pp. 3-28.
- Residual advantage learning applied to a game with linear dynamics and a linear
function approximator
Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994).
Advantage Updating Applied to a Differential Game.
Gerald Tesauro et al., eds.
Advances in Neural Information Processing Systems 7.
pp. 353-360. MIT Press, 1995.