Next: Adaptive Heuristic Critic and Up: Reinforcement Learning: A Survey Previous: Computational Complexity

Learning an Optimal Policy: Model-free Methods

In the previous section we reviewed methods for obtaining an optimal policy for an MDP assuming that we already had a model. The model consists of knowledge of the state transition probability function T(s,a,s') and the reinforcement function R(s,a). Reinforcement learning is primarily concerned with how to obtain the optimal policy when such a model is not known in advance. The agent must interact with its environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.

At this point, there are two ways to proceed.

Model-free: Learn a controller without learning a model.
Model-based: Learn a model, and use it to derive a controller.

Which approach is better? This is a matter of some debate in the reinforcement-learning community. A number of algorithms have been proposed on both sides. This question also appears in other fields, such as adaptive control, where the dichotomy is between direct and indirect adaptive control.

This section examines model-free learning, and Section 5 examines model-based methods.

The biggest problem facing a reinforcement-learning agent is temporal credit assignment. How do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the "end" and reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, it is difficult to know what the "end" is, and waiting for it might require a great deal of memory. Instead, we will use insights from value iteration to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state. This class of algorithms is known as temporal-difference methods [115]. We will consider two different temporal-difference learning strategies for the discounted infinite-horizon model.
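The idea of adjusting a state's estimated value from the immediate reward and the estimated value of the next state can be sketched as a one-step update. The following is a minimal illustration only, not one of the two strategies this survey goes on to describe: the function name `td0_update`, the learning rate `alpha`, the discount factor `gamma`, and the three-state chain are all assumptions introduced for this example.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Nudge values[state] toward reward + gamma * values[next_state].

    alpha (learning rate) and gamma (discount factor) are illustrative
    defaults, not values taken from the text.
    """
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])
    return values[state]


# Hypothetical chain 0 -> 1 -> 2 with reward 1 on entering state 2.
# State 2 is never updated, so its estimate stays at 0 (an assumption
# standing in for "nothing further happens after state 2").
values = [0.0, 0.0, 0.0]
experience = [(0, 0.0, 1), (1, 1.0, 2)]  # (state, reward, next_state)

for _ in range(200):
    for s, r, s_next in experience:
        td0_update(values, s, r, s_next)
```

No transition probabilities or reward function are consulted here: each update uses only one observed transition. With enough repetitions, the estimate for state 1 approaches the immediate reward, and the estimate for state 0 approaches that reward discounted by `gamma`, so credit flows backward to earlier states without waiting for an "end."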


Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996