Delayed Reward

Next: Adaptive Resolution Models Up: Generalization over Input Previous: Logic-Based Methods

Delayed Reward

Another method to allow reinforcement-learning techniques to be applied in large state spaces is modeled on value iteration and Q-learning. Here, a function approximator is used to represent the value function by mapping a state description to a value.

Many reseachers have experimented with this approach: Boyan and Moore [18] used local memory-based methods in conjunction with value iteration; Lin [59] used backpropagation networks for Q-learning; Watkins [128] used CMAC for Q-learning; Tesauro [118, 120] used backpropagation for learning the value function in backgammon (described in Section 8.1); Zhang and Dietterich [136] used backpropagation and to learn good strategies for job-shop scheduling.

Although there have been some positive examples, in general there are unfortunate interactions between function approximation and the learning rules. In discrete environments there is a guarantee that any operation that updates the value function (according to the Bellman equations) can only reduce the error between the current value function and the optimal value function. This guarantee no longer holds when generalization is used. These issues are discussed by Boyan and Moore [18], who give some simple examples of value function errors growing arbitrarily large when generalization is used with value iteration. Their solution to this, applicable only to certain classes of problems, discourages such divergence by only permitting updates whose estimated values can be shown to be near-optimal via a battery of Monte-Carlo experiments.

Thrun and Schwartz [123] theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the ``max'' operator in the definition of the value function.

Several recent results [42, 126] show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique [6] provides guaranteed convergence to locally optimal solutions.

Perhaps the gloominess of these counter-examples is misplaced. Boyan and Moore [18] report that their counter-examples can be made to work with problem-specific hand-tuning despite the unreliability of untuned algorithms that provably converge in discrete domains. Sutton [113] shows how modified versions of Boyan and Moore's examples can converge successfully. An open question is whether general principles, ideally supported by theory, can help us understand when value function approximation will succeed. In Sutton's comparative experiments with Boyan and Moore's counter-examples, he changes four aspects of the experiments:

Small changes to the task specifications.
A very different kind of function approximator (CMAC [2]) that has weak generalization.
A different learning algorithm: SARSA [95] instead of value iteration.
A different training regime. Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories.

There are intuitive reasons to believe that the fourth factor is particularly important, but more careful research is needed.

Adaptive Resolution Models

Next: Adaptive Resolution Models Up: Generalization over Input Previous: Logic-Based Methods

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996