Williams [131, 132] studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient ascent on the expected reward and showed how to integrate these rules with backpropagation. This class, called REINFORCE algorithms, includes linear reward-inaction (Section 2.1.3) as a special case.
The generic REINFORCE update for a parameter $w_{ij}$ can be written
\[
\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,\frac{\partial \ln g_i}{\partial w_{ij}},
\]
where $\alpha_{ij}$ is a non-negative factor, $r$ the current reinforcement, $b_{ij}$ a reinforcement baseline, and $g_i$ is the probability density function used to randomly generate actions based on unit activations. Both $\alpha_{ij}$ and $b_{ij}$ can take on different values for each $w_{ij}$; however, when $\alpha_{ij}$ is constant throughout the system, the expected update is exactly in the direction of the expected reward gradient. Otherwise, the update is in the same half-space as the gradient, but not necessarily in the direction of steepest increase.
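To make the update concrete, the following sketch (an illustration, not code from Williams) instantiates it for a single Bernoulli-logistic unit, a standard example in this setting: the unit fires with probability $p = \sigma(w \cdot x)$, and the characteristic eligibility $\partial \ln g / \partial w_j$ works out to $(y - p)x_j$. The helper names \texttt{sample\_action} and \texttt{reinforce\_update} are hypothetical.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)


def sample_action(w, x):
    """Fire the unit stochastically: y = 1 with probability p = sigmoid(w.x)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    y = 1.0 if rng.random() < p else 0.0
    return y, p


def reinforce_update(w, x, y, p, r, b, alpha=0.1):
    """Generic REINFORCE step: w += alpha * (r - b) * d ln g / dw.

    For the Bernoulli-logistic unit the characteristic eligibility
    d ln g / dw_j evaluates to (y - p) * x_j.
    """
    eligibility = (y - p) * x
    return w + alpha * (r - b) * eligibility


# Illustrative use with a stand-in reward signal (reward the unit for firing).
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
y, p = sample_action(w, x)
r = 1.0 if y == 1.0 else 0.0
w = reinforce_update(w, x, y, p, r, b=0.0)
\end{verbatim}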
Williams points out that the choice of baseline, $b_{ij}$, can have a profound effect on the convergence speed of the algorithm.
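One common choice in practice (an assumption here, not one singled out by this section) is to maintain the baseline as an exponential moving average of past reinforcement; the hypothetical helper below shows how such a $b$ could be updated alongside the rule above.
\begin{verbatim}
def update_baseline(b, r, beta=0.9):
    """Maintain b as an exponential moving average of past reinforcement.

    Subtracting a baseline that does not depend on the current action
    leaves the expected update direction unchanged (the expected
    eligibility is zero); a well-chosen baseline mainly lowers the
    variance of the updates, which is what shows up as faster or
    slower convergence.
    """
    return beta * b + (1.0 - beta) * r
\end{verbatim}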