Next: Prioritized Sweeping / Queue-Dyna Up: Computing Optimal Policies by Previous: Certainty Equivalent Methods

## Dyna

Sutton's Dyna architecture [116, 117] exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model ( and ), uses experience to adjust the policy, and uses the model to adjust the policy.

Dyna operates in a loop of interaction with the environment. Given an experience tuple , it behaves as follows:

• Update the model, incrementing statistics for the transition from s to s' on action a and for receiving reward r for taking action a in state s. The updated models are and .
• Update the policy at state s based on the newly updated model using the rule

which is a version of the value-iteration update for Q values.

• Perform k additional updates: choose k state-action pairs at random and update them according to the same rule as before:

• Choose an action a' to perform in state s', based on the Q values but perhaps modified by an exploration strategy.

The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.

Figure 6 shows a grid world in which in each cell the agent has four actions (N, S, E, W) and transitions are made deterministically to an adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy. Dyna requires about six times more computational effort, however.

Figure 6: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere and a discount factor is used.

 Steps before Backups before convergence convergence Q-learning 531,000 531,000 Dyna 62,000 3,055,000 prioritized sweeping 28,000 1,010,000

Next: Prioritized Sweeping / Queue-Dyna Up: Computing Optimal Policies by Previous: Certainty Equivalent Methods

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996