Advantage Learning

Advantage learning is a form of reinforcement learning similar to Q-learning except that it uses advantages rather than Q-values. For a state x and action u, the advantage A(x,u) for that state-action pair is related to the Q-value Q(x,u) by:
    A(x,u) = max_u' Q(x,u') + (Q(x,u) - max_u' Q(x,u')) * (k/dt)
where the max is taken over all choices of action u'. Notice that the maximum advantage in a given state equals the maximum Q-value in that state, which is the value of that state. If k/dt = 1, then all of the advantages are identical to the Q-values, so Q-learning is a special case of advantage learning. If k is a constant and dt is the size of a time step, then for small time steps advantage learning differs from Q-learning in that the differences between advantages in a given state are larger than the differences between the corresponding Q-values.
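As a concrete illustration, here is a minimal sketch that converts a small Q-table into the corresponding advantage table using the relation above. The Q-table contents, the values of k and dt, and the use of NumPy are illustrative assumptions, not part of the definition:

    import numpy as np

    # Hypothetical Q-table: rows are states x, columns are actions u.
    Q = np.array([[1.0, 0.8, 0.5],
                  [0.2, 0.9, 0.7]])

    k, dt = 0.5, 0.1            # assumed constants; here k/dt = 5

    # V(x) = max over u' of Q(x,u'), the value of each state.
    V = Q.max(axis=1, keepdims=True)

    # A(x,u) = max_u' Q(x,u') + (Q(x,u) - max_u' Q(x,u')) * (k/dt)
    A = V + (Q - V) * (k / dt)

    # The maximum advantage in each state still equals that state's
    # value, but the gaps between actions are scaled up by k/dt.
    assert np.allclose(A.max(axis=1), V.ravel())
    print(A)

With k/dt = 1 the computed A equals Q exactly, matching the special case noted above.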

Advantage updating is an older algorithm than advantage learning. In advantage updating, the definition of A(x,u) was slightly different, and the algorithm required storing a value function V(x) in addition to the advantage function. Advantage learning is a more recent algorithm that supersedes advantage updating and requires only that the advantages A(x,u) be stored. The two algorithms have essentially identical behavior, but the more recent algorithm stores less information and is simpler, so it is generally recommended.

Advantage learning and Q-learning learn equally quickly when used with a lookup table. When a function approximator is used, even a linear one, advantage learning can learn many orders of magnitude faster than Q-learning in some cases. Specifically, if time steps are "small" in the sense that the state changes by a very small amount on each time step, then advantage learning would be expected to learn much faster than Q-learning. Or, in a semi-Markov Decision Process (SMDP), if even one action consistently causes small state changes, that also counts as "small" time steps; in that case, the dt in the equation would be different for each action.
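To make the update concrete, below is a minimal tabular sketch of one advantage-learning backup, assuming a discounted problem where the sampled Q-value is estimated as r + gamma**dt * V(x'). The function name, the learning rate alpha, gamma, and the lookup-table representation are all illustrative assumptions:

    import numpy as np

    def advantage_learning_update(A, x, u, r, x_next,
                                  alpha=0.1, gamma=0.99, k=0.5, dt=0.1):
        # One backup on a lookup table A[state, action]. In an SMDP,
        # dt could be passed per transition rather than held fixed.
        V_x = A[x].max()                  # max advantage = state value
        V_next = A[x_next].max()
        q_sample = r + gamma**dt * V_next    # sampled estimate of Q(x,u)
        target = V_x + (q_sample - V_x) * (k / dt)
        A[x, u] += alpha * (target - A[x, u])
        return A

    A = np.zeros((5, 3))                  # 5 states, 3 actions (illustrative)
    A = advantage_learning_update(A, x=0, u=1, r=1.0, x_next=2)

With a lookup table the greedy policy is the same as Q-learning's, since the maximizing action in each state is unchanged; the scaling by k/dt matters when the advantages are represented by a function approximator.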
