From newshub.ccs.yorku.ca!torn!cs.utexas.edu!usc!wupost!uunet!trwacs!erwin Mon Aug 24 15:40:54 EDT 1992
Article 6625 of comp.ai.philosophy:
Xref: newshub.ccs.yorku.ca comp.ai.philosophy:6625 sci.cognitive:333
Path: newshub.ccs.yorku.ca!torn!cs.utexas.edu!usc!wupost!uunet!trwacs!erwin
From: erwin@trwacs.fp.trw.com (Harry Erwin)
Newsgroups: comp.ai.philosophy,sci.cognitive
Subject: Heuristic Dynamic Programming
Keywords: adaptive critic
Message-ID: <696@trwacs.fp.trw.com>
Date: 16 Aug 92 19:35:03 GMT
Followup-To: sci.cognitive
Organization: TRW Systems Division, Fairfax VA
Lines: 60


Heuristic Dynamic Programming in a Realistic Biological Context
Harry R. Erwin
erwin@trwacs.fp.trw.com
 
As I showed at the 1982 Animal Behavior Workshop in Guelph, Ontario,
the optimum strategy for playing a discrete game against nature involving
information collection is a simple threshold strategy. The player uses 
Bayesian statistics to maintain an estimate of his probability of success, 
and compares that estimate against a threshold at each decision point. 
If the probability of success remains above the threshold, he continues 
the game; otherwise, he quits. The threshold can be calculated by treating 
the game as a problem in dynamic programming (John Bather, pers. comm.,
1983).
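
A minimal Python sketch of the threshold rule described above. The names
(bayes_update, play), the likelihood-pair encoding of observations, and the
numbers are illustrative assumptions of mine, not part of the 1982 result;
the threshold is simply passed in rather than derived by dynamic programming.

```python
def bayes_update(p_success, like_if_success, like_if_failure):
    """Posterior probability of success after one observation (Bayes' rule)."""
    numerator = p_success * like_if_success
    denominator = numerator + (1.0 - p_success) * like_if_failure
    return numerator / denominator


def play(prior, observations, threshold):
    """Apply the threshold rule at each decision point.

    observations is a list of (like_if_success, like_if_failure) pairs,
    a hypothetical encoding of the information collected at each step.
    Returns (decision, final_estimate, steps_taken).
    """
    p = prior
    for step, (l_s, l_f) in enumerate(observations, start=1):
        p = bayes_update(p, l_s, l_f)
        if p <= threshold:          # estimate no longer above threshold: quit
            return "quit", p, step
    return "continue", p, len(observations)


# One discouraging observation pulls the estimate from 0.5 down to 0.2,
# which is below the assumed threshold of 0.3, so the player quits.
decision, estimate, steps = play(0.5, [(0.2, 0.8)], threshold=0.3)
```

The player never needs the full history, only the running estimate and the
threshold, which is what makes the rule cheap to implement.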
 
In a biological context, this strategy lends itself to implementation using 
HDP. The critic network would provide the current threshold value as a local 
goal value, and the action network would compare the current probability 
against that value. If the current probability exceeded the threshold, the 
preferred action would be to continue to collect information; otherwise it 
would be to quit. Note that the critic network responds to the perceived 
payoffs and risks of the game and not to the current situation. Both critic
and action networks would sit upstream of the motor cortex, which would then
treat the pair as a single combined critic network and attempt to reduce fear to 
nominal levels.
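
To illustrate the division of labor, here is a sketch in Python under a
strong simplifying assumption: a single all-or-nothing gamble with a known
gain and loss, so that the threshold is just the break-even probability.
This is not the dynamic program referred to above, and the function names
are hypothetical. It only shows that the critic's output depends on payoffs
and risks, while the action side sees the current estimate.

```python
def critic_threshold(gain, loss):
    """Break-even success probability for one gamble (assumed toy game).

    Solves p * gain - (1 - p) * loss = 0 for p. Note that the current
    situation (the probability estimate) is not an input.
    """
    if gain <= 0 or loss < 0:
        raise ValueError("expected gain > 0 and loss >= 0")
    return loss / (gain + loss)


def action(p_success, threshold):
    """Compare the current estimate against the critic's local goal value."""
    return "continue" if p_success > threshold else "quit"


threshold = critic_threshold(gain=3.0, loss=1.0)   # 0.25 for these payoffs
choice_high = action(0.30, threshold)              # estimate above threshold
choice_low = action(0.20, threshold)               # estimate below threshold
```

Raising the loss or lowering the gain raises the threshold, so the same
estimate can lead to "continue" in one game and "quit" in another.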
 
Current payoffs---\
                   O-- local goal value--------------\
Target category---/ (A)               feedback        \
                                         ---           \
Target condition--\                     |   |           \
                   \                    V   |            \ (D)
Self condition----->O--initial estimate --->O-current est>O
                   / (B)                   / (C)           \
Environment-------/                       /                 \
                                         /                decision
Information collected and processed-----/         (expressed as fear level)
                                                               \ (E)
Motor options-------------------------------------------------->O->motor
                                                                   cortex
 
Note that there are a number of places where training would occur. Subsystem A
needs to learn how to calculate the local goal values corresponding to various
payoffs and intensities of the game (primarily defined by target category).
I suspect most species have this hard-coded in the genome. (The local goal
values are not obvious functions of the inputs!) Subsystem B can be trained
more easily--in mammals, that is part of the role of play and parental 
teaching. Subsystems C and D are probably hard-coded, even in man. Subsystem C 
implements logistic functions, while Subsystem D does a simple comparison.
Subsystem E probably uses fear level to affect the preference functions for
various actions used by the motor controller, although it may select a desired
fear level and output partials to the motor controller instead. (I suspect the
latter is closer to correct, because the corresponding two-person game can't be
handled by outputting a simple fear level, and man does play the two-person game.)
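
One possible reading of Subsystems C and D, sketched in Python. It rests on
a standard fact: Bayes' rule in log-odds form is a logistic function of the
previous log-odds plus the log-likelihood ratio of the new evidence. Whether
that is what Subsystem C actually computes is my assumption, as are the
function names and the binary coding of fear in subsystem_d.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logistic(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def subsystem_c(previous_estimate, log_likelihood_ratio):
    """Fold one piece of evidence into the estimate (the feedback loop)."""
    return logistic(logit(previous_estimate) + log_likelihood_ratio)


def subsystem_d(current_estimate, local_goal_value):
    """Simple comparison; fear coded as 0.0 (continue) or 1.0 (quit).

    The two-level coding is a placeholder, not a claim about real fear levels.
    """
    return 0.0 if current_estimate > local_goal_value else 1.0


# Evidence four times likelier under failure than under success.
estimate = subsystem_c(0.5, math.log(0.2 / 0.8))
fear = subsystem_d(estimate, local_goal_value=0.3)
```

Starting from an initial estimate of 0.5, this evidence moves the estimate
to 0.2, the same value the ordinary form of Bayes' rule gives.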
 
Cheers,
-- 
Harry Erwin
Internet: erwin@trwacs.fp.trw.com



