
ARC

The associative reinforcement comparison (ARC) algorithm [114] is an instance of the AHC architecture for the case of boolean actions, consisting of two feed-forward networks. One learns the value of situations, the other learns a policy. These can be simple linear networks or can have hidden units.

In the simplest case, the entire system learns only to optimize immediate reward. First, let us consider the behavior of the network that learns the policy, a mapping from a vector describing s to a 0 or 1. If the output unit has activation p, then a, the action generated, will be 1 if p + ν > 1/2, where ν is normal noise, and 0 otherwise.
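
A minimal sketch of this action-selection rule, assuming a linear policy network with weight vector w; the noise scale sigma and all names here are illustrative, not fixed by the source:

    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(w, s, sigma=0.1):
        # Activation p of the output unit; a linear network is assumed,
        # though the source allows hidden units as well.
        p = float(w @ s)
        nu = rng.normal(0.0, sigma)  # normal noise, for exploration
        return 1 if p + nu > 0.5 else 0  # a = 1 if p + nu > 1/2, else 0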

The adjustment for the output unit is, in the simplest case,

e = r (a - 1/2) ,

where the first factor is the reward received for taking the most recent action and the second encodes which action was taken. The actions are encoded as 0 and 1, so a - 1/2 always has the same magnitude; if the reward and a - 1/2 have the same sign, then action 1 will be made more likely, otherwise action 0 will be.
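
Read as a weight update for a linear output unit, the rule might look as follows; the learning rate alpha and the per-weight form e * s are assumptions, since the source gives only the error term e:

    import numpy as np

    def update_simple(w, s, a, r, alpha=0.1):
        # Simplest-case adjustment for the output unit: e = r * (a - 1/2).
        e = r * (a - 0.5)
        # Delta-rule-style application to the unit's weights (assumed form).
        return w + alpha * e * np.asarray(s)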

As described, the network will tend to seek actions that give positive reward. To extend this approach to maximize reward, we can compare the reward to some baseline, b. This changes the adjustment to

e = (r - b) (a - 1/2) ,

where b is the output of the second network. The second network is trained in a standard supervised mode to estimate r as a function of the input state s.
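
Combining the two networks, one reinforcement-comparison step might be sketched as below; the linear baseline v trained by a delta rule is one concrete reading of "standard supervised mode", and the learning rates are assumptions:

    import numpy as np

    def arc_step(w, v, s, a, r, alpha=0.1, beta=0.1):
        s = np.asarray(s)
        b = float(v @ s)         # baseline: second network's estimate of r at s
        e = (r - b) * (a - 0.5)  # comparison error for the policy unit
        w = w + alpha * e * s    # policy update (assumed per-weight form)
        v = v + beta * (r - b) * s  # supervised delta-rule update of the baseline
        return w, v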

Variations of this approach have been used in a variety of applications [4, 9, 61, 114].


