From newshub.ccs.yorku.ca!torn!utcsri!rutgers!sun-barr!olivea!uunet!trwacs!erwin Tue Jul 28 09:41:48 EDT 1992
Article 6494 of comp.ai.philosophy:
Xref: newshub.ccs.yorku.ca comp.ai.philosophy:6494 comp.ai.neural-nets:3670 sci.cognitive:228
Path: newshub.ccs.yorku.ca!torn!utcsri!rutgers!sun-barr!olivea!uunet!trwacs!erwin
From: erwin@trwacs.fp.trw.com (Harry Erwin)
Newsgroups: comp.ai.philosophy,comp.ai.neural-nets,sci.cognitive
Subject: Heuristic Dynamic Programming
Keywords: HDP, dynamic programming, reinforcement learning, neurocontrol, adaptive critics
Message-ID: <671@trwacs.fp.trw.com>
Date: 23 Jul 92 12:01:50 GMT
Followup-To: sci.cognitive
Organization: TRW Systems Division, Fairfax VA
Lines: 53

In "Consistency of HDP applied to a simple reinforcement learning problem"
(Neural Networks, Vol. 3, pp. 179-189, 1990), Paul Werbos proposes an
approach to training neural networks that avoids the difficulty of
choosing immediate actions in pursuit of a long-term goal. Werbos'
approach is an adaptive critic model in which the critic network
implements a dynamic programming algorithm used to train the action
network. This note reviews HDP in a biologically relevant context.

At the 1982 Animal Behavior Workshop at the University of Guelph, Ontario,
I showed that the optimal strategy for the generic information collection
task was particularly simple. Using renormalization techniques to
replicate a result in dynamic programming, I showed that the underlying
one-sided game against nature could be played using three behavioral
components:
1. A rule for establishing a probability threshold for determining if the
task should be continued. (This threshold is constant if the risk, gain,
and information collection rates are constant.)
2. A rule for determining the initial estimate of the probability of
success. 
3. A rule for updating the current estimate of the probability of success.
(Typically a Bayesian rule.)
The game is then played by continuously or periodically checking whether
the current estimate of the probability of success has fallen below the
threshold. If it has, the animal should abandon the game; otherwise it
should continue.
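In modern terms, the three components and the stopping check can be
sketched as follows. This is only a sketch: the payoff values, the
observation likelihoods, and the particular threshold rule (abandon once
the expected value of continuing goes negative) are illustrative
assumptions, not values from the Guelph analysis.

```python
def success_threshold(cost_of_losing, gain_of_winning):
    """Component 1: a constant probability threshold.

    Under the assumed payoffs, continuing is worthwhile while
    p*gain - (1-p)*cost >= 0, i.e. while p >= cost / (cost + gain).
    """
    return cost_of_losing / (cost_of_losing + gain_of_winning)


def bayes_update(p, favorable, p_fav_given_success=0.7, p_fav_given_failure=0.3):
    """Component 3: Bayesian update of the success estimate.

    The two likelihoods are assumed values for the sketch.
    """
    if favorable:
        num = p_fav_given_success * p
        den = num + p_fav_given_failure * (1 - p)
    else:
        num = (1 - p_fav_given_success) * p
        den = num + (1 - p_fav_given_failure) * (1 - p)
    return num / den


def play(initial_estimate, observations, cost=1.0, gain=3.0):
    """Component 2 is the caller-supplied initial_estimate.

    Check the current estimate against the constant threshold each
    period; abandon as soon as it falls below, otherwise continue.
    """
    threshold = success_threshold(cost, gain)
    p = initial_estimate
    for obs in observations:
        if p < threshold:
            return "abandon", p
        p = bayes_update(p, obs)
    return "continue", p
```

With cost 1 and gain 3 the threshold is 0.25, so a prior of 0.5 survives
a couple of unfavorable observations before the game is abandoned.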

An HDP implementation of this strategy would assign the responsibility for
establishing the probability threshold to the critic network. The action
network would be assigned the responsibility for determining the current
probability of success. Note that the critic network only considers cost
of losing, gain of winning, probability of winning and losing per
information collection period, and rate of information collection. Thus,
once trained, it can control any action network. The action networks
take the specifics of the situation into account, although a third set
of networks is needed to set the parameters for the critic network.

The resulting structure consists of a set of initialization networks that
respond to the specific situation and pass initial parameters to the
critic network and the selected action network. The selected action
network then is responsible for assessing the current situation against a
constant parameter provided by the critic network. 
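A toy sketch of this division of responsibility follows. The "networks"
here are plain stand-in classes rather than trained neural networks, the
odds-form Bayesian update is one possible choice for the action network,
and the situation encoding (cost, gain, prior) and all numbers are
assumptions made for the sketch.

```python
class CriticNetwork:
    """Hard-wirable: maps (cost, gain) to a constant probability threshold."""

    def threshold(self, cost, gain):
        return cost / (cost + gain)


class ActionNetwork:
    """Tracks the current probability of success for a specific situation."""

    def __init__(self, initial_estimate):
        self.p = initial_estimate

    def observe(self, likelihood_ratio):
        # Bayesian update in odds form: posterior odds = LR * prior odds.
        # likelihood_ratio < 1 means the evidence is unfavorable.
        odds = self.p / (1.0 - self.p) * likelihood_ratio
        self.p = odds / (1.0 + odds)
        return self.p


class InitializationNetwork:
    """Responds to the specific situation and passes initial parameters
    to the critic network and the selected action network."""

    def set_up(self, situation):
        cost, gain, prior = situation  # assumed encoding of the situation
        return CriticNetwork(), ActionNetwork(prior), cost, gain


def engage(situation, observations):
    """Run one engagement: compare the action network's current estimate
    against the constant parameter supplied by the critic network."""
    critic, actor, cost, gain = InitializationNetwork().set_up(situation)
    thresh = critic.threshold(cost, gain)
    for obs in observations:
        if actor.p < thresh:
            return "abandon"
        actor.observe(obs)
    return "continue"
```

Because the critic depends only on the cost/gain structure, the same
critic instance could serve any number of action networks, as noted
above.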

Training is not actually needed for the critic network--it can be
hard-wired. How the other networks are trained is less clear at this
point. It seems clear that some supervisory mechanism is needed to
assess the outcome of the engagement on a global basis, but that
mechanism can also be hard-wired, avoiding the problem of an infinite
regress of training.


-- 
Harry Erwin
Internet: erwin@trwacs.fp.trw.com


