next up previous
Next: Prioritized Sweeping / Queue-Dyna Up: Computing Optimal Policies by Previous: Certainty Equivalent Methods

Dyna

Sutton's Dyna architecture [116, 117] exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model ( tex2html_wrap_inline2178 and tex2html_wrap_inline2180 ), uses experience to adjust the policy, and uses the model to adjust the policy.

Dyna operates in a loop of interaction with the environment. Given an experience tuple tex2html_wrap_inline2182 , it behaves as follows:

The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.

Figure 6 shows a grid world in which in each cell the agent has four actions (N, S, E, W) and transitions are made deterministically to an adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy. Dyna requires about six times more computational effort, however.

   figure365
Figure 6: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere and a discount factor is used.

 

Steps before Backups before
convergence convergence
Q-learning 531,000 531,000
Dyna 62,000 3,055,000
prioritized sweeping 28,000 1,010,000
Table 1: The performance of three algorithms described in the text. All methods used the exploration heuristic of ``optimism in the face of uncertainty'': any state not previously visited was assumed by default to be a goal state. Q-learning used its optimal learning rate parameter for a deterministic maze: tex2html_wrap_inline1718 . Dyna and prioritized sweeping were permitted to take k = 200 backups per transition. For prioritized sweeping, the priority queue often emptied before all backups were used.

 


next up previous
Next: Prioritized Sweeping / Queue-Dyna Up: Computing Optimal Policies by Previous: Certainty Equivalent Methods

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996