In the previous section we reviewed methods for obtaining an optimal policy for an MDP assuming that we already had a model. The model consists of knowledge of the state transition probability function T(s,a,s') and the reinforcement function R(s,a). Reinforcement learning is primarily concerned with how to obtain the optimal policy when such a model is not known in advance. The agent must interact with its environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.
At this point, there are two ways to proceed: model-free methods learn a policy directly from experience, while model-based methods first learn a model of the environment and then use it to derive a policy. This section examines model-free learning; Section 5 examines model-based methods.
The biggest problem facing a reinforcement-learning agent is temporal credit assignment. How do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the ``end'' and reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, however, it is difficult to know what the ``end'' is, and in any case such a strategy might require a great deal of memory. Instead, we will use insights from value iteration to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state. This class of algorithms is known as temporal-difference methods [115]. We will consider two different temporal-difference learning strategies for the discounted infinite-horizon model.
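As a concrete illustration of this idea (a generic sketch, not one of the specific algorithms presented next), the simplest temporal-difference update moves the estimated value $V(s)$ of the state just left toward a target formed from the observed reward $r$ and the discounted estimate of the successor state $s'$:
\[
V(s) \leftarrow V(s) + \alpha \bigl( r + \gamma V(s') - V(s) \bigr) ,
\]
where $\alpha$ is a learning rate and $\gamma$ is the discount factor. Both of the strategies considered below build on updates of this general form.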