This implementation of TD(λ) is trajectory-based. For a version of TD(λ) that performs updates after each move, refer to [Sutton1987].
TD(λ)   /* Assumes known world model MDP; F is parametrized by weight vector w. */

  repeat steps 1 and 2 forever:

    1. Using the model and the current evaluation function F, generate a mostly-greedy
       trajectory from a start state to a terminal state:

           s_0 -> s_1 -> ... -> s_T

       Also record the rewards r_1, r_2, ..., r_T, where r_{i+1} is the reward received
       on the transition from s_i to s_{i+1}.

    2. Update the fitter from the trajectory as follows:

       for i := T downto 0, do:

           z_i := 0                                                 if i = T (terminal state)
           z_i := r_{i+1} + γ [ λ z_{i+1} + (1 − λ) F(s_{i+1}) ]    otherwise

           update F's weights by delta rule:
               w := w + α (z_i − F(s_i)) ∇_w F(s_i)

       end
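As a concrete illustration, the following is a minimal Python sketch of step 2 for a linear fitter F(s) = w · φ(s). The feature map phi, the step size alpha, and the default values of γ and λ are illustrative assumptions, not part of the original pseudocode; the trajectory and its rewards are assumed to have been produced by step 1.

import numpy as np

def td_lambda_trajectory_update(w, phi, states, rewards,
                                alpha=0.1, gamma=1.0, lam=0.8):
    """One pass of step 2: update a linear fitter F(s) = w . phi(s) from a single
    trajectory s_0, ..., s_T with rewards r_1, ..., r_T, where rewards[i] is the
    reward for the transition s_i -> s_{i+1}.  Sketch only; names and defaults
    are assumptions."""
    T = len(states) - 1
    z = 0.0                              # lambda-return target; 0 at the terminal state
    for i in range(T, -1, -1):           # for i := T downto 0
        features = phi(states[i])
        value = float(w @ features)      # F(s_i) under the current weights
        # Delta rule: nudge w so that F(s_i) moves toward the target z_i.
        w = w + alpha * (z - value) * features
        if i > 0:
            # Recursive lambda-return target for the predecessor state s_{i-1}:
            #   z_{i-1} = r_i + gamma * ( lam * z_i + (1 - lam) * F(s_i) )
            # (uses the pre-update estimate of F(s_i), one reasonable reading
            # of "the current evaluation function F")
            z = rewards[i - 1] + gamma * (lam * z + (1 - lam) * value)
    return w

# Hypothetical usage on a 5-state chain with one-hot features:
#   phi = lambda s: np.eye(5)[s]
#   w = np.zeros(5)
#   w = td_lambda_trajectory_update(w, phi, states=[0, 1, 2, 3, 4],
#                                   rewards=[0.0, 0.0, 0.0, 1.0])

Note that with λ = 0 the target z_i reduces to the one-step TD target r_{i+1} + γ F(s_{i+1}), while with λ = 1 it becomes the full return actually observed along the remainder of the trajectory.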