Q(x,u) = E[ R + gamma * max_u' Q(x',u') ], where performing action u in state x yields reinforcement R and a transition to the next state x', and the max is taken over the actions u' available in x'.

At any given time during learning, the Q function that has been learned
so far may not exactly satisfy the Bellman equation. The *Bellman
residual* is the difference between the two sides of the equation.
This is slightly different from the *TD error*. The TD error
is the difference between the two sides of the equation without the
expected value, evalutated after just a single transition for
*(x,u)*. So the Bellman residual is the expected value of
the TD error. The TD error is the Bellman residual plus zero-mean
noise.
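
As a small concrete illustration, here is a sketch of both quantities for a tabular Q function on a toy MDP; the transition probabilities `P`, reinforcements `R`, and discount `gamma` are illustrative assumptions, not anything taken from the papers below.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))                            # current Q estimate
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # P[x, u, x'] = transition probability
R = np.zeros((n_states, n_actions, n_states))                  # R[x, u, x'] = reinforcement

def td_error(x, u, r, x_next):
    """TD error for a single observed transition (x, u) -> (r, x_next)."""
    return r + gamma * Q[x_next].max() - Q[x, u]

def bellman_residual(x, u):
    """Expected TD error for (x, u), taken over all possible next states."""
    expected_target = sum(P[x, u, x2] * (R[x, u, x2] + gamma * Q[x2].max())
                          for x2 in range(n_states))
    return expected_target - Q[x, u]
```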

One way to learn the Q function is to do gradient descent on *e*,
which is the mean squared Bellman residual. That is, square the Bellman
residual for each state-action pair and take the mean of them all. The
mean is weighted by how often each state-action pair is visited. Assume
that there is a fixed, stochastic training policy so *e*
is well defined. Then there are three ways to learn Q:

**The direct method**

Do gradient descent on *e*, but when calculating the derivatives, pretend that the weights only affect Q(x,u) and not Q(x',u').

**The residual gradient method**

Do gradient descent on *e*.

**The residual method**

Make the weight change recommended by the direct method times (1-phi), plus the weight change recommended by the residual gradient method times phi, where phi is a constant between 0 and 1.
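
For concreteness, the three weight changes can be sketched for a single observed transition. The snippet below assumes a linear Q function, Q(x,u) = w · f(x,u), with made-up random features; the feature construction, `alpha`, and the particular value of `phi` are my own illustrative choices, not taken from the papers.

```python
import numpy as np

gamma, alpha, phi = 0.9, 0.1, 0.2
n_states, n_actions, n_features = 5, 2, 8
rng = np.random.default_rng(0)
F = rng.standard_normal((n_states, n_actions, n_features))  # f(x, u): fixed random features
w = np.zeros(n_features)                                    # weights of the linear Q function

def q(x, u):
    return w @ F[x, u]

def weight_change(x, u, r, x_next):
    """Residual-algorithm weight change for one observed transition."""
    u_next = int(np.argmax([q(x_next, a) for a in range(n_actions)]))   # greedy next action
    delta = r + gamma * q(x_next, u_next) - q(x, u)                     # TD error
    dw_direct = delta * F[x, u]                               # direct method: target treated as constant
    dw_resid = delta * (F[x, u] - gamma * F[x_next, u_next])  # residual gradient: descent on delta**2
    return alpha * ((1 - phi) * dw_direct + phi * dw_resid)   # residual method: blend of the two

# phi = 0 recovers the direct method; phi = 1 recovers the residual gradient method.
w += weight_change(x=0, u=1, r=1.0, x_next=2)
```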

For general function approximators, the direct method is not guaranteed to converge. The residual gradient method, on the other hand, is guaranteed to converge in the same sense that backprop converges. This means that residual algorithms also converge when phi=1, and convergence can even be guaranteed for phi values less than 1, depending on the problem. There is even an automatic way to choose appropriate phi values. Simulations so far have indicated that the direct method tends to learn quickly when it does converge, but can fail to converge, while the residual gradient method always converges but is sometimes slow. Residual algorithms with an appropriate choice of phi combine guaranteed convergence with learning that has been fast in simulation, and the algorithm for automatically determining phi has consistently found the fastest phi that is still stable.
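
The exact procedure for choosing phi automatically is given in the papers below; the sketch here is only one plausible reading of the idea, under my own assumptions: pick the smallest phi in [0, 1] for which the blended weight change still has a nonnegative dot product with the residual gradient weight change (i.e., still points downhill on *e*), plus a small margin `mu`.

```python
import numpy as np

def choose_phi(dw_direct, dw_resid, mu=0.01):
    """Assumed heuristic: smallest phi keeping the blended update a descent direction on e."""
    a = float(np.dot(dw_direct, dw_resid))
    b = float(np.dot(dw_resid, dw_resid))
    if a >= 0 or b == 0.0:
        return 0.0               # the direct step already agrees with the residual gradient
    phi = a / (a - b)            # smallest phi making the dot product zero
    return min(1.0, phi + mu)    # nudge slightly toward the residual gradient for stability
```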

There are certain theoretical cases where residual algorithms require a model of the MDP (in particular, when state transitions are stochastic, an unbiased residual gradient needs either a model or two independently sampled next states). But for most real-world problems, a model is not needed: if the MDP is so large that no single state-action pair is likely to be seen twice during training, then there is no need for a model. This result is not mentioned in the papers below, but it is significant for the usefulness of residual algorithms.

Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Armand Prieditis & Stuart Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA. - Residual algorithms are defined and simple simulation results are given.

Baird, L. C. (1995). Residual Algorithms. Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference 1995. Justin A. Boyan, Andrew W. Moore, Richard S. Sutton, editors. Technical report CMU-CS-95-206, July 9, 1995. - Residual algorithms are defined and simple simulation results are given.

Bandera, C., Vico, F. J., Bravo, J. M., Harmon, M. E., and Baird, L. C. (1996). Residual Q-learning applied to visual attention. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 3-6 July. - An application of residual Q-learning.

Harmon, M. E., and Baird, L. C. (1996). Multi-player residual advantage learning with general function approximation. (Technical Report WL-TR-96-1065). Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145.) - Residual advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator.

Harmon, M. E., and Baird, L. C. (1996). Residual Advantage Learning Applied to a Differential Game. Proceedings of the International Conference on Neural Networks (ICNN'96), Washington, D.C., 3-6 June. - Residual advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator.

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995). Reinforcement Learning Applied to a Differential Game. Adaptive Behavior, MIT Press, 4(1), pp. 3-28. - Residual advantage learning applied to a game with linear dynamics and a linear function approximator.

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994). Advantage Updating Applied to a Differential Game. In Gerald Tesauro et al., eds., Advances in Neural Information Processing Systems 7, pp. 353-360. MIT Press, 1995. - Residual advantage learning applied to a game with linear dynamics and a linear function approximator.