Advantage Learning

Advantage learning is a form of reinforcement learning similar to Q-learning except that it uses advantages rather than Q-values. For a state x and action u, the advantage for that state-action pair A(x,u) is related to the Q value Q(x,u) as:

    A(x,u)=max(Q(x,u')) + (Q(x,u) - max(Q(x,u'))) * k/dt

where the max is taken over all choices of action u'. Notice that the maximum advantage in a given state will equal the maximum Q-value in that state, which will be the value of that state. If k/dt=1 , then all of the advantages are identical to the Q values. So Q-learning is a special case of advantage learning. If k is a constant and dt is the size of a time step, then advantage learning differs from Q-learning for small time steps in that the differences between advantages in a given state are larger than the differences between Q values.

Advantage updating is an older algorithm than advantage learning. In advantage updating, the definition of A(x,u) was slightly different, and it required storing a value function V(x) in addition to the advantage function. Advantage learning is a more recent algorithm that supercedes advantage updating, and requires only that the A(x,u) advantages be stored. The two algorithms have essentially identical behavior, but the later algorithm requires less information to be stored, and is a simpler algorithm, so it is generally recommended.

Advantage learning and Q-learning learn equally quickly when used with a lookup table. Advantage learning can learn many orders of magnitude faster than Q-learning in some cases where a function approximator is used, even a linear function approximator. Specifically, if time steps are "small" in the sense that the state changes a very small amount on each time step, then advantage learning would be expected to learn much faster than Q-learning. Or, for a semi Markov Decision Problems (SMDP), if even one action consistently causes small state changes, that also counts as "small" time steps. In that case, the dt in the equation would be different for each action.

More Information

Advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator.
Harmon, M. E., and Baird, L. C. (1996). Multi-player residual advantage learning with general function approximation. (Techical Report WL-TR-96-1065). Wright-Patterson Air Force Base Ohio: Wright Laboratory. (available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145).
Advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator
Harmon, M. E., and Baird, L. C. (1995). Residual Advantage Learning Applied to a Differential Game. Proceedings of the International Conference on Neural Networks (ICNN'96), Washington D.C., 3-6 June.
Advantage learning applied to a game with linear dynamics and a linear function approximator
Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995). Reinforcement Learning Applied to a Differential Game. Adaptive Behavior, MIT Press, (4)1, pp. 3-28.
Advantage learning applied to a game with linear dynamics and a linear function approximator
Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994). Advantage Updating Applied to a Differential Game. Gerald Tesauro, et. al., eds. Advances in Neural Information Processing Systems 7. pp. 353-360. MIT Press, 1995.
The old advantage updating algorithm is defined and simulation results are given.
Baird, L. C. (1993). Advantage Updating. (Technical Report WL-TR-93-1146). Wright-Patterson Air Force Base Ohio: Wright Laboratory. (available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145).
The old advantage updating algorithm is defined and simulation results are given.
Baird, L. C. (1994). Reinforcement Learning in Continuous Time: Advantage Updating. Proceedings of the International Conference on Neural Networks. Orlando, FL. June.
The old advantage updating algorithm is applied to a 2-player game.
Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994). Advantage Updating Applied to a Differential Game. Gerald Tesauro, et. al., eds. Advances in Neural Information Processing Systems 7. pp. 353-360. MIT Press, 1995.

Back to Glossary Index