Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end; it continues to update random state-action pairs, rather than concentrating on the ``interesting'' parts of the state space. These problems are addressed by prioritized sweeping [83] and Queue-Dyna [87], which are two independently-developed but very similar techniques. We will describe prioritized sweeping in some detail.

The algorithm is similar to Dyna, except that updates are no longer
chosen at random and values are now associated with states (as in
value iteration) instead of state-action pairs (as in Q-learning). To
make appropriate choices, we must store additional information in the
model. Each state remembers its *predecessors*: the states that
have a non-zero transition probability to it under some action. In
addition, each state has a *priority*, initially set to zero.
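The model bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; all container names (`T`, `R`, `predecessors`, `priority`, `record_transition`) are hypothetical:

```python
from collections import defaultdict

# Hypothetical model store for prioritized sweeping. T[(s, a)] maps each
# successor state s2 to its estimated transition probability; R[(s, a)]
# is the estimated immediate reward. predecessors[s2] records every
# (s, a) pair with non-zero probability of reaching s2, and every state
# implicitly starts with priority zero.
T = defaultdict(dict)            # T[(s, a)][s2] -> probability estimate
R = defaultdict(float)           # R[(s, a)]     -> reward estimate
predecessors = defaultdict(set)  # s2 -> {(s, a), ...}
priority = defaultdict(float)    # s  -> current priority (starts at 0)

def record_transition(s, a, r, s2, prob=1.0):
    """Update the model after observing s --a--> s2 with reward r."""
    T[(s, a)][s2] = prob
    R[(s, a)] = r
    predecessors[s2].add((s, a))
```

The `defaultdict` containers make the "initially set to zero" convention automatic: a state's priority is zero until something promotes it.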

Instead of updating *k* random state-action pairs, prioritized
sweeping updates the *k* states with the highest priorities. For each
high-priority state *s*, it works as follows:

- Remember the current value of the state: $V_{\mathrm{old}} = V(s)$.
- Update the state's value:
  $V(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \right)$.
- Set the state's priority back to 0.
- Compute the value change $\Delta = |V_{\mathrm{old}} - V(s)|$.
- Use $\Delta$ to modify the priorities of the predecessors of *s*.

If we have updated the $V$ value for state $s'$ and it has changed by
amount $\Delta$, then the immediate predecessors of $s'$ are informed
of this event. Any state $s$ for which there exists an action $a$
such that $T(s,a,s') \neq 0$ has its priority promoted to
$\Delta \cdot T(s,a,s')$, unless its priority already exceeded
that value.
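The update steps and the predecessor-promotion rule together can be sketched as a single sweep routine. This is an illustrative sketch under assumed conventions, not the published algorithm's code: `T[(s, a)]` maps successors to probability estimates, `R[(s, a)]` is the expected reward, `predecessors` and `priority` are as described above, and a simple scan stands in for the priority queue a real implementation would use:

```python
from collections import defaultdict

GAMMA = 0.9  # illustrative discount factor

def prioritized_sweep(V, T, R, predecessors, priority, k):
    """Run up to k prioritized-sweeping backups on the value table V."""
    for _ in range(k):
        # Pick the state with the highest priority; stop once all are 0.
        # (A real implementation would keep states in a priority queue.)
        s = max(priority, key=priority.get, default=None)
        if s is None or priority[s] == 0.0:
            break
        actions = [a for (state, a) in T if state == s]
        if not actions:              # no modeled actions: nothing to back up
            priority[s] = 0.0
            continue
        v_old = V[s]                 # remember the current value
        V[s] = max(                  # value-iteration backup
            R[(s, a)] + GAMMA * sum(p * V[s2]
                                    for s2, p in T[(s, a)].items())
            for a in actions
        )
        priority[s] = 0.0            # reset this state's priority
        delta = abs(v_old - V[s])    # magnitude of the value change
        # Promote each predecessor's priority to delta * T(pred, a, s),
        # unless it already exceeds that value.
        for (pred, a) in predecessors[s]:
            p = T[(pred, a)].get(s, 0.0)
            priority[pred] = max(priority[pred], delta * p)
```

On a toy chain where a state suddenly acquires a goal reward, seeding that state's priority and calling `prioritized_sweep` propagates the new value backward through its predecessors in order of expected impact, exactly the "surprise"-driven behavior described below.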

The global behavior of this algorithm is that when a real-world transition is ``surprising'' (the agent happens upon a goal state, for instance), then lots of computation is directed to propagate this new information back to relevant predecessor states. When the real-world transition is ``boring'' (the actual result is very similar to the predicted result), then computation continues in the most deserving part of the space.

Running prioritized sweeping on the problem in Figure 6, we see a large improvement over Dyna. The optimal policy is reached in about half the number of steps of experience and one-third the computation that Dyna required (and therefore about 20 times fewer steps and twice the computational effort of Q-learning).
