Residual Algorithms

Reinforcement learning algorithms such as Q-learning, advantage learning, and value iteration all try to find functions that satisfy the Bellman equation. For example, in Q-learning, the goal is to find a function Q(x,u) that satisfies the Bellman equation:
     Q(x,u) = E[ R + gamma * max_u' Q(x',u') ]
where performing action u in state x leads to a transition to state x' and a reinforcement R. The maximum is taken over all actions u' that could be performed in state x'. The expected value is taken over all possible (x',R) outcomes from performing action u in state x in a stochastic MDP.
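
As a concrete illustration, here is a minimal sketch of this backup for a small tabular MDP with known transition probabilities. The arrays P[x,u,x'] (transition probabilities) and R[x,u,x'] (rewards) and the function name bellman_backup are illustrative assumptions, not part of the original description.

     import numpy as np

     def bellman_backup(Q, P, R, gamma):
         """Return E[ R + gamma * max_u' Q(x',u') ] for every (x,u) pair."""
         target = np.zeros_like(Q)
         max_next = Q.max(axis=1)          # max over u' of Q(x',u'), for every successor x'
         n_states, n_actions = Q.shape
         for x in range(n_states):
             for u in range(n_actions):
                 # Expectation over successor states x' of the one-step return.
                 target[x, u] = np.sum(P[x, u] * (R[x, u] + gamma * max_next))
         return target

     # Example on a random 3-state, 2-action MDP:
     rng = np.random.default_rng(0)
     P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[x, u] is a distribution over x'
     R = rng.normal(size=(3, 2, 3))               # R[x, u, x'] is the reward
     Q = np.zeros((3, 2))
     print(np.allclose(Q, bellman_backup(Q, P, R, 0.9)))  # True only if Q satisfies the equation

Q satisfies the Bellman equation exactly when it equals its own backup.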

At any given time during learning, the Q function learned so far may not exactly satisfy the Bellman equation. The Bellman residual is the difference between the two sides of the equation. This is slightly different from the TD error: the TD error is the difference between the two sides of the equation without the expected value, evaluated for a single observed transition from (x,u). So the Bellman residual is the expected value of the TD error, and the TD error is the Bellman residual plus zero-mean noise.
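
The distinction can be made concrete with the same tabular setup sketched above (again, the variable names are illustrative): the TD error uses one sampled transition, while the Bellman residual averages over all possible successors.

     import numpy as np

     def td_error(Q, gamma, x, u, r, x_next):
         # One-sample error, evaluated for a single observed transition (x, u, r, x_next).
         return r + gamma * Q[x_next].max() - Q[x, u]

     def bellman_residual(Q, P, R, gamma, x, u):
         # Expected TD error for the pair (x, u), i.e. the Bellman residual.
         return np.sum(P[x, u] * (R[x, u] + gamma * Q.max(axis=1))) - Q[x, u]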

One way to learn the Q function is to do gradient descent on e, the mean squared Bellman residual: square the Bellman residual for each state-action pair and take the mean, weighted by how often each state-action pair is visited. Assume that there is a fixed, stochastic training policy, so that e is well defined. Then there are three ways to learn Q:

The direct method is the standard method proposed by Watkins for learning the Q(x,u) function, and the one used in TD-Gammon for learning the V(x) function. There are simple cases where the direct method causes all the weights and values to blow up. This happens even for linear function approximators that are general enough to represent all possible value functions, and even when doing on-policy training of Q-learning with a function approximator.

On the other hand, the residual gradient method is guaranteed to converge in the same sense that backprop converges. A residual algorithm blends the two updates with a weighting parameter phi, where phi=0 gives the direct method and phi=1 gives the residual gradient method, so residual algorithms are also guaranteed to converge when phi=1. In fact, convergence can even be guaranteed for phi values less than 1, depending on the problem, and there is an automatic way to choose appropriate phi values. Simulations so far have indicated that the direct method tends to learn fast when it converges, but can fail to converge. The residual gradient method always converges but is sometimes slow. Residual algorithms with an appropriate choice of phi have both guaranteed convergence and have been fast in simulation. The algorithm for automatically determining phi has consistently found the fastest phi that is still stable. All three updates are sketched below for a linear function approximator.
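
The following is a hedged sketch of the three update rules for a linear function approximator Q(x,u) = w . f(x,u), written from the description above; the names w, f, alpha, and actions are illustrative, and the phi convention follows the one stated above (phi=0 direct, phi=1 residual gradient).

     import numpy as np

     def residual_update(w, f, actions, gamma, alpha, phi, x, u, r, x_next):
         # One weight update from a single observed transition (x, u, r, x_next).
         q = lambda s, a: w @ f(s, a)
         # Greedy successor action, playing the role of the max in the Bellman equation.
         u_next = max(actions, key=lambda a: q(x_next, a))
         delta = r + gamma * q(x_next, u_next) - q(x, u)   # TD error for this transition
         # phi = 0: direct method (gradient of Q(x,u) only)
         # phi = 1: residual gradient method (gradient of the full squared residual)
         # 0 < phi < 1: residual algorithm blending the two
         grad = f(x, u) - phi * gamma * f(x_next, u_next)
         return w + alpha * delta * grad

For stochastic MDPs, estimating the residual-gradient term from a single sampled successor state gives a biased gradient estimate, which is the source of the theoretical model requirement discussed next.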

There are certain theoretical cases where residual algorithms require a model of the MDP, but for most real-world problems a model is not needed: if the MDP is so large that no single state-action pair is likely to be seen twice during training, then there is no need for a model. This result is not mentioned in the papers below, but it is significant for the usefulness of residual algorithms.

More Information
