
REINFORCE Algorithms

Williams [131, 132] studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient descent on the expected reward and showed how to integrate these rules with backpropagation. This class, called REINFORCE algorithms, includes linear reward-inaction (Section 2.1.3) as a special case.

The generic REINFORCE update for a parameter $w_{ij}$ can be written

$$\Delta w_{ij} = \alpha_{ij} \, (r - b_{ij}) \, \frac{\partial \ln g_i}{\partial w_{ij}},$$

where $\alpha_{ij}$ is a non-negative factor, $r$ the current reinforcement, $b_{ij}$ a reinforcement baseline, and $g_i$ is the probability density function used to randomly generate actions based on unit activations. Both $\alpha_{ij}$ and $b_{ij}$ can take on different values for each $w_{ij}$; however, when $\alpha_{ij}$ is constant throughout the system, the expected update is exactly in the direction of the expected reward gradient. Otherwise, the update lies in the same half space as the gradient but is not necessarily in the direction of steepest increase.
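As a concrete illustration (not taken from the survey), the update can be sketched for a single Bernoulli-logistic unit, one of the cases Williams analyzes: the unit emits action $a = 1$ with probability $p = \sigma(w \cdot x)$, for which $\partial \ln g / \partial w_j = (a - p)\, x_j$. The function name `reinforce_step` and the immediate-reward interface are illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_step(w, x, alpha, baseline, reward_fn, rng=random):
    """One REINFORCE update for a single Bernoulli-logistic unit.

    The unit emits a = 1 with probability p = sigmoid(w . x), so the
    eligibility is d ln g / d w_j = (a - p) * x_j, and the update is
    Delta w_j = alpha * (r - b) * (a - p) * x_j.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    a = 1 if rng.random() < p else 0          # sample the stochastic action
    r = reward_fn(a)                          # observe immediate reward
    w = [wi + alpha * (r - baseline) * (a - p) * xi
         for wi, xi in zip(w, x)]
    return w, a, r
```

Repeatedly applying this step with a reward of 1 for $a = 1$ and 0 for $a = 0$ drives $p$ toward 1, since the expected update follows the reward gradient.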

Williams points out that the choice of baseline, $b_{ij}$, can have a profound effect on the convergence speed of the algorithm.
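The baseline's effect can be seen in a small Monte Carlo sketch (an illustration under assumed settings, not an experiment from the survey): subtracting the mean reward leaves the expected update unchanged, because $E[\partial \ln g / \partial w] = 0$, while it can sharply reduce the update's variance. The reward offset and sample count below are arbitrary.

```python
import random
import statistics

def gradient_samples(n, baseline, p=0.5, offset=10.0, seed=0):
    """Monte Carlo samples of the REINFORCE estimate (r - b)(a - p)
    for a single-parameter Bernoulli unit, where d ln g / dw = a - p
    and the reward is r = offset + a (a large constant offset plus
    an action-dependent term)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        r = offset + a
        out.append((r - baseline) * (a - p))
    return out

no_base = gradient_samples(20000, baseline=0.0)
with_base = gradient_samples(20000, baseline=10.5)  # mean reward as baseline
# Both estimators share the same expectation, but the baselined one
# has far smaller variance, hence faster, steadier convergence.
```

With no baseline, the large constant offset in the reward inflates the variance of every update; subtracting the mean reward removes that common term without biasing the gradient estimate.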

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996