Sutton's Dyna architecture [116, 117] exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model ( and ), uses experience to adjust the policy, and uses the model to adjust the policy.

Dyna operates in a loop of interaction with the environment. Given an experience tuple , it behaves as follows:

- Update the model, incrementing statistics for the transition from
*s*to*s*' on action*a*and for receiving reward*r*for taking action*a*in state*s*. The updated models are and . - Update the policy at state
*s*based on the newly updated model using the rulewhich is a version of the value-iteration update for

*Q*values. - Perform
*k*additional updates: choose*k*state-action pairs at random and update them according to the same rule as before: - Choose an action
*a*' to perform in state*s*', based on the*Q*values but perhaps modified by an exploration strategy.

The Dyna algorithm requires about *k* times the computation of
Q-learning per instance, but this is typically vastly less than for
the naive model-based method. A reasonable value of *k* can be
determined based on the relative speeds of computation and of taking
action.

Figure 6 shows a grid world in which in each cell the agent has four actions (N, S, E, W) and transitions are made deterministically to an adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy. Dyna requires about six times more computational effort, however.

**Figure 6:** A 3277-state grid world. This was formulated as a
shortest-path reinforcement-learning problem, which yields the same
result as if a reward of 1 is given at the goal, a reward of zero
elsewhere and a discount factor is used.

Steps before | Backups before | |

convergence | convergence | |

Q-learning | 531,000 | 531,000 |

Dyna | 62,000 | 3,055,000 |

prioritized sweeping | 28,000 | 1,010,000 |

Wed May 1 13:19:13 EDT 1996