Another method for applying reinforcement-learning techniques in large state spaces is modeled on value iteration and Q-learning. Here, a function approximator represents the value function by mapping a state description to a value.
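As a concrete illustration, the scheme can be sketched as Q-learning with a linear function approximator: a weight vector maps state-action features to a Q-value, and each update moves the weights toward the bootstrapped target. This is a minimal sketch, not any specific system from the papers cited below; the feature vectors and hyperparameters are illustrative.

```python
import numpy as np

def q_value(weights, features):
    """Approximate Q(s, a) as a linear combination of state-action features."""
    return float(np.dot(weights, features))

def q_update(weights, features, reward, next_features_per_action,
             alpha=0.1, gamma=0.9):
    """One Q-learning update on the approximator's weights.

    The target is r + gamma * max_a' Q(s', a'); the weights move to
    reduce the squared error between Q(s, a) and that target.
    """
    target = reward + gamma * max(q_value(weights, f)
                                  for f in next_features_per_action)
    error = target - q_value(weights, features)
    return weights + alpha * error * features

# Toy usage: two features, two candidate actions in the next state.
w = np.zeros(2)
phi = np.array([1.0, 0.5])
next_phis = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
w = q_update(w, phi, reward=1.0, next_features_per_action=next_phis)
```

Because the weights are shared across states, each update generalizes: changing the value of one state-action pair changes the values of every pair with overlapping features, which is exactly the behavior that breaks the discrete-case convergence guarantees discussed below.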

Many researchers have experimented with this approach: Boyan and Moore [18] used local memory-based methods in conjunction with value iteration; Lin [59] used backpropagation networks for Q-learning; Watkins [128] used CMAC for Q-learning; Tesauro [118, 120] used backpropagation for learning the value function in backgammon (described in Section 8.1); and Zhang and Dietterich [136] used backpropagation to learn good strategies for job-shop scheduling.

Although there have been some positive examples, in general there are unfortunate interactions between function approximation and the learning rules. In discrete environments there is a guarantee that any operation that updates the value function (according to the Bellman equations) can only reduce the error between the current value function and the optimal value function. This guarantee no longer holds when generalization is used. These issues are discussed by Boyan and Moore [18], who give simple examples of value-function errors growing arbitrarily large when generalization is used with value iteration. Their solution, applicable only to certain classes of problems, discourages such divergence by permitting only updates whose estimated values can be shown to be near-optimal via a battery of Monte-Carlo experiments.

Thrun and Schwartz [123] theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the "max" operator in the definition of the value function.
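The effect can be seen with a small simulation (an illustrative sketch, not Thrun and Schwartz's analysis): even when each individual value estimate carries zero-mean noise, taking the max over several such estimates is biased upward, so bootstrapped targets systematically overestimate.

```python
import random

random.seed(0)

def max_bias(true_values, noise, trials=10000):
    """Average of (max over noisy estimates) minus (max of true values).

    Each estimate gets independent zero-mean uniform noise, so any
    single estimate is unbiased -- but their max is not.
    """
    overshoot = 0.0
    for _ in range(trials):
        noisy = [v + random.uniform(-noise, noise) for v in true_values]
        overshoot += max(noisy) - max(true_values)
    return overshoot / trials

# Five actions with identical true value 0; the measured bias is
# clearly positive even though every individual estimate is unbiased.
bias = max_bias([0.0] * 5, noise=1.0)
```

Generalization error plays the role of the noise here: each Bellman backup that takes a max over approximated values injects a positive bias, and repeated backups compound it.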

Several recent results [42, 126] show how the
appropriate choice of function approximator can guarantee convergence,
though not necessarily to the optimal values. Baird's *residual
gradient* technique [6] provides guaranteed convergence to
locally optimal solutions.

Perhaps the gloominess of these counter-examples is misplaced.
Boyan and Moore [18] report that their counter-examples
*can* be made to work with problem-specific hand-tuning despite the
unreliability of untuned algorithms that provably converge in discrete
domains. Sutton [113] shows how modified versions of Boyan and
Moore's examples can converge successfully. An open question is whether
general principles, ideally supported by theory, can help us understand
when value function approximation will succeed. In Sutton's comparative
experiments with Boyan and Moore's counter-examples, he changes four
aspects of the experiments:

- Small changes to the task specifications.
- A very different kind of function approximator (CMAC [2]) that has weak generalization.
- A different learning algorithm: SARSA [95] instead of value iteration.
- A different training regime. Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories.
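The algorithmic change in the third point can be made concrete. SARSA's update differs from Q-learning and value iteration in that its target uses the action actually taken in the next state rather than a max over actions; the following is a minimal tabular sketch (the dictionary-based representation and parameters are illustrative, not Sutton's experimental setup):

```python
from collections import defaultdict

def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy SARSA update on a table of Q-values.

    The target uses Q(s', a') for the action a' actually taken,
    not max_a' Q(s', a') as in Q-learning.
    """
    target = reward + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q

# Toy usage on a single observed transition (s0, a0, r=1, s1, a1).
q = defaultdict(float)
q = sarsa_update(q, "s0", "a0", 1.0, "s1", "a1")
```

Replacing the max with the on-policy action removes one source of the overestimation discussed above, which is one plausible reason the substitution mattered in Sutton's experiments; sampling along empirical trajectories likewise keeps updates concentrated on states the policy actually visits.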
