Another method to allow reinforcement-learning techniques to be applied in large state spaces is modeled on value iteration and Q-learning. Here, a function approximator is used to represent the value function by mapping a state description to a value.
Many reseachers have experimented with this approach: Boyan and Moore  used local memory-based methods in conjunction with value iteration; Lin  used backpropagation networks for Q-learning; Watkins  used CMAC for Q-learning; Tesauro [118, 120] used backpropagation for learning the value function in backgammon (described in Section 8.1); Zhang and Dietterich  used backpropagation and to learn good strategies for job-shop scheduling.
Although there have been some positive examples, in general there are unfortunate interactions between function approximation and the learning rules. In discrete environments there is a guarantee that any operation that updates the value function (according to the Bellman equations) can only reduce the error between the current value function and the optimal value function. This guarantee no longer holds when generalization is used. These issues are discussed by Boyan and Moore , who give some simple examples of value function errors growing arbitrarily large when generalization is used with value iteration. Their solution to this, applicable only to certain classes of problems, discourages such divergence by only permitting updates whose estimated values can be shown to be near-optimal via a battery of Monte-Carlo experiments.
Thrun and Schwartz  theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the ``max'' operator in the definition of the value function.
Several recent results [42, 126] show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique  provides guaranteed convergence to locally optimal solutions.
Perhaps the gloominess of these counter-examples is misplaced. Boyan and Moore  report that their counter-examples can be made to work with problem-specific hand-tuning despite the unreliability of untuned algorithms that provably converge in discrete domains. Sutton  shows how modified versions of Boyan and Moore's examples can converge successfully. An open question is whether general principles, ideally supported by theory, can help us understand when value function approximation will succeed. In Sutton's comparative experiments with Boyan and Moore's counter-examples, he changes four aspects of the experiments: