


The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm [128, 129]. Q-learning is typically easier to implement. In order to understand Q-learning, we have to develop some additional notation. Let $Q^*(s,a)$ be the expected discounted reinforcement of taking action $a$ in state $s$, then continuing by choosing actions optimally. Note that $V^*(s)$ is the value of $s$ assuming the best action is taken initially, and so $V^*(s) = \max_a Q^*(s,a)$. $Q^*(s,a)$ can hence be written recursively as

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a'} Q^*(s',a') .$$

Note also that, since $V^*(s) = \max_a Q^*(s,a)$, we have $\pi^*(s) = \arg\max_a Q^*(s,a)$ as an optimal policy.
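As a sketch of how the recursive definition and the policy extraction fit together, the recursion can be iterated to a fixed point (Q-value iteration) on a small hypothetical two-state MDP; the transition and reward numbers below are made up for illustration and are not from the text.

```python
# Q-value iteration on a tiny hypothetical two-state MDP.
# T[s][a] is a list of (next_state, probability); R[s][a] is the expected
# reward. Action 1 moves to (or stays in) state 1, which pays reward 1.
gamma = 0.9
S, A = [0, 1], [0, 1]
T = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0},
     1: {0: 0.0, 1: 1.0}}

Q = {s: {a: 0.0 for a in A} for s in S}
for _ in range(200):  # iterate the recursive definition until it converges
    Q = {s: {a: R[s][a] + gamma * sum(p * max(Q[s2].values())
                                      for s2, p in T[s][a])
             for a in A} for s in S}

V = {s: max(Q[s].values()) for s in S}              # V*(s) = max_a Q*(s,a)
pi = {s: max(A, key=lambda a: Q[s][a]) for s in S}  # pi*(s) = argmax_a Q*(s,a)
```

Here the fixed point gives $V^*(1) = 1/(1-\gamma) = 10$ and $V^*(0) = \gamma \cdot 10 = 9$, and the greedy policy chooses action 1 in both states.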

Because the Q function makes the action explicit, we can estimate the Q values on-line using a method essentially the same as TD(0), but also use them to define the policy: an action can be chosen simply by taking the one with the maximum Q value for the current state.

The Q-learning rule is

$$Q(s,a) := Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

where $\langle s, a, r, s' \rangle$ is an experience tuple as described earlier. If each action is executed in each state an infinite number of times on an infinite run and $\alpha$ is decayed appropriately, the Q values will converge with probability 1 to $Q^*$ [128, 125, 49]. Q-learning can also be extended to update states that occurred more than one step previously, as in TD($\lambda$) [88].
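A minimal tabular sketch of this update rule is below; the environment and the experience tuple are hypothetical, so a single hand-made tuple is fed in rather than a real interaction loop.

```python
# Tabular Q-learning update (a sketch; states and actions are just 0 and 1).
alpha, gamma = 0.1, 0.9
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def q_update(s, a, r, s_next):
    """Apply Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in (0, 1))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One hypothetical experience tuple <s, a, r, s'>:
q_update(0, 1, 1.0, 1)   # Q(0,1) moves from 0 toward 1 + 0.9 * 0, i.e. to 0.1
```

Note that only the entry for the state-action pair actually experienced is changed; repeated experience, with $\alpha$ decayed appropriately, is what drives the table toward $Q^*$.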

When the Q values are nearly converged to their optimal values, it is appropriate for the agent to act greedily, taking, in each situation, the action with the highest Q value. During learning, however, there is a difficult exploitation versus exploration trade-off to be made. There are no good, formally justified approaches to this problem in the general case; standard practice is to adopt one of the ad hoc methods discussed in section 2.2.
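One of the simplest ad hoc strategies of the kind referred to above is epsilon-greedy selection: act randomly with probability epsilon, and greedily with respect to the current Q values otherwise. A sketch (the Q-table contents here are made up):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(s, a)])       # exploit

Q = {(0, 0): 0.2, (0, 1): 0.7}
action = epsilon_greedy(Q, 0, [0, 1], epsilon=0.0)  # epsilon=0 is purely greedy
```

Setting epsilon to 0 recovers the pure greedy behavior that is appropriate once the Q values are nearly converged; during learning, epsilon is typically kept positive (and often decayed) so that all state-action pairs continue to be tried.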

AHC architectures seem to be more difficult to work with than Q-learning on a practical level. It can be hard to get the relative learning rates right in AHC so that the two components converge together. In addition, Q-learning is exploration insensitive: that is, the Q values will converge to the optimal values, independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough). This means that, although the exploration-exploitation issue must be addressed in Q-learning, the details of the exploration strategy will not affect the convergence of the learning algorithm. For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy.


Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996