
Action Selection

Informative action-dependent features can be used to reduce the free variables in the learning task still further at the action-selection stage if the features themselves discriminate situations in which actions should not be used. For example, if whenever $e(s,a) = u_0$, action $a$ is not likely to achieve its expected reward, then the agent can decide to ignore actions with $e(s,a) = u_0$.

Formally, consider $W \subseteq U$ and $B(s) \subseteq A$ with $B(s) = \{a \in A \mid e(s,a) \in W\}$. When in state $s$, the agent then chooses an action from $B(s)$, either randomly when exploring or according to maximum Q-value when exploiting. Any exploration strategy, such as Boltzmann exploration, can be used over the possible actions in $B(s)$. In effect, $W$ acts in TPOT-RL as an action filter which reduces the number of options under consideration at any given time. Of course, exploration at the filter level can be achieved by dynamically adjusting $W$.
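
To make the filtering step concrete, the following is a minimal Python sketch, assuming functions e(s, a) returning the action-dependent feature of action a in state s and q_value(s, a) returning the learned Q-value; these names are illustrative placeholders rather than part of the original formulation. Falling back to a random action when B(s) is empty corresponds to one of the options discussed below.

    import math
    import random

    def select_action(s, A, e, W, q_value, explore=True, temperature=1.0):
        """Choose an action from B(s) = {a in A | e(s, a) in W}."""
        B = [a for a in A if e(s, a) in W]
        if not B:
            # Rare case: no action passes the W filter; fall back to a random action.
            return random.choice(A)
        if explore:
            # Boltzmann exploration over the filtered action set B(s).
            prefs = [math.exp(q_value(s, a) / temperature) for a in B]
            total = sum(prefs)
            return random.choices(B, weights=[p / total for p in prefs])[0]
        # Exploitation: maximum Q-value among the filtered actions.
        return max(B, key=lambda a: q_value(s, a))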

Table 2: The resulting Q-tables when (a) $W = U$, and (b) $|W| = 1$.

For example, Table 2 illustrates the effect of varying |W|. In the rare event that $B(s) = \emptyset$, i.e. $\forall a \in A,\ e(s,a) \notin W$, either a random action can be chosen, or rough Q-value estimates can be stored using sparse training data. This condition becomes rarer as |A| increases. For example, with |U| = 3, |W| = 1, and |A| = 2 as in Table 2(b), 4/9 = 44.4% of feature vectors have no action that passes the W filter. However, with |A| = 8 only 256/6561 = 3.9% of feature vectors have no action that passes the W filter. If |W| = 2 and |A| = 8, only 1 of 6561 feature vectors fails to pass the filter. Thus using W to filter action selection can reduce the number of free variables in the learning problem without significantly reducing the coverage of the learned Q-table.
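
These fractions follow directly from the combinatorics: a feature vector fails the filter only if each of its |A| action-dependent features falls outside W, which happens for $((|U|-|W|)/|U|)^{|A|}$ of the $|U|^{|A|}$ vectors. A short Python check of the figures quoted above (the helper name is illustrative):

    def fraction_unfiltered(u_size, w_size, a_size):
        # Fraction of feature vectors in U^|A| with no action passing the W filter.
        return ((u_size - w_size) / u_size) ** a_size

    print(fraction_unfiltered(3, 1, 2))  # 4/9      ~ 0.444
    print(fraction_unfiltered(3, 1, 8))  # 256/6561 ~ 0.039
    print(fraction_unfiltered(3, 2, 8))  # 1/6561   ~ 0.00015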

By using action-dependent features to create a coarse feature space, and with the help of a reward function based entirely on individual observation of the environment, TPOT-RL enables team learning in a multi-agent, adversarial environment even when agents cannot track state transitions.


