The work of the two components of AHC can be accomplished in a unified
manner by Watkins' Q-learning algorithm [128, 129].
Q-learning is typically easier to implement. In order to understand
Q-learning, we have to develop some additional notation. Let
$Q^*(s,a)$ be the expected discounted reinforcement of taking action
*a* in state *s*, then continuing by choosing actions optimally. Note
that $V^*(s)$ is the value of *s* assuming the best action is taken
initially, and so $V^*(s) = \max_a Q^*(s,a)$. $Q^*(s,a)$ can hence be
written recursively as

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a'} Q^*(s',a') .$$

Note also that, since $V^*(s) = \max_a Q^*(s,a)$, we have
$\pi^*(s) = \arg\max_a Q^*(s,a)$ as an optimal policy.

Because the *Q* function makes the action explicit, we can estimate
the *Q* values on-line using a method essentially the same as *TD*(0),
but also use them to define the policy, because an action can be
chosen just by taking the one with the maximum *Q* value for the
current state.
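As a concrete illustration, here is a minimal sketch of extracting the greedy policy from a table of *Q* values; the states, actions, and numeric values are assumptions for the example, not taken from the text.

```python
# Hypothetical tabular Q function: maps (state, action) pairs to estimated
# Q values. The entries below are illustrative assumptions.
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}
ACTIONS = ["left", "right"]

def greedy_action(Q, state, actions):
    """Return the action with the maximum Q value in the given state."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "s0", ACTIONS))  # prints "right"
```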

The Q-learning rule is

$$Q(s,a) := Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right) ,$$

where $\langle s, a, r, s' \rangle$ is an experience tuple as described
earlier. If each action is executed in each state an infinite number
of times on an infinite run and $\alpha$ is decayed appropriately, the
*Q* values will converge with probability 1 to $Q^*$
[128, 125, 49]. Q-learning can also
be extended to update states that occurred more than one step
previously, as in [88].
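The update rule above can be sketched in a few lines; the state and action names, step size, and discount factor here are illustrative assumptions.

```python
# A minimal sketch of the tabular Q-learning update. Q is a dict mapping
# (state, action) pairs to value estimates; alpha is the learning rate
# and gamma the discount factor (both chosen arbitrarily here).

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
    for one experience tuple <s, a, r, s'>."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# Example: one update on an all-zero table with alpha=0.5, gamma=0.9
# yields Q(s0, go) = 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5.
Q = {("s0", "go"): 0.0, ("s1", "go"): 0.0}
q_update(Q, "s0", "go", 1.0, "s1", ["go"], alpha=0.5, gamma=0.9)
```

Note that, unlike the AHC update, this rule needs no separate critic: the same table that is being estimated also defines the greedy policy.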

When the *Q* values are nearly converged to their optimal values, it is
appropriate for the agent to act greedily, taking, in each situation,
the action with the highest *Q* value. During learning, however,
there is a difficult exploitation versus exploration trade-off to be
made. There are no good, formally justified approaches to this
problem in the general case; standard practice is to adopt one of the
*ad hoc* methods discussed in section 2.2.
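One such *ad hoc* method is $\epsilon$-greedy selection, sketched below; the exploration rate and its schedule are assumptions, not prescriptions from the text.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, explore by choosing an action uniformly
    at random; otherwise exploit by taking the greedy action. The value
    of epsilon (and any decay schedule for it) is an assumption."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

With `epsilon=0.0` this reduces to acting greedily, which is the appropriate behavior once the *Q* values are nearly converged.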

AHC architectures seem to be more difficult to work with than
Q-learning on a practical level. It can be hard to get the relative
learning rates right in AHC so that the two components converge
together. In addition, Q-learning is *exploration insensitive*:
that is, the *Q* values will converge to the optimal values,
independent of how the agent behaves while the data is being collected
(as long as all state-action pairs are tried often enough). This
means that, although the exploration-exploitation issue must be
addressed in Q-learning, the details of the exploration strategy will
not affect the convergence of the learning algorithm. For these
reasons, Q-learning is the most popular and seems to be the most
effective model-free algorithm for learning from delayed
reinforcement. It does not, however, address any of the issues
involved in generalizing over large state and/or action spaces. In
addition, it may converge quite slowly to a good policy.