Informative action-dependent features can be used to reduce the free variables in the learning task still further at the action-selection stage if the features themselves discriminate situations in which actions should not be used. For example, if whenever e(s,a) = v, action a is not likely to achieve its expected reward, then the agent can decide to ignore actions with e(s,a) = v.
Formally, consider W ⊆ U and B(s) = {a ∈ A | e(s,a) ∈ W}. When in state s, the agent then chooses an action from B(s), either randomly when exploring or according to maximum Q-value when exploiting. Any exploration strategy, such as Boltzmann exploration, can be used over the possible actions in B(s). In effect, W acts in TPOT-RL as an action filter which reduces the number of options under consideration at any given time. Of course, exploration at the filter level can be achieved by dynamically adjusting W.
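The filtered selection step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names e (the action-dependent feature function), q (the learned Q estimate), and the temperature parameter are assumptions for the sake of the example.

```python
import math
import random

def filtered_action_selection(state, actions, e, W, q, temperature=1.0):
    """Choose an action from B(s) = {a in A | e(s, a) in W} using
    Boltzmann (softmax) exploration over the filtered set.

    e(state, action) returns a feature value in U;
    q(feature_value, action) returns the current Q estimate.
    Names and signatures here are illustrative assumptions.
    """
    # Apply the W filter: keep only actions whose feature value is in W.
    B = [a for a in actions if e(state, a) in W]
    if not B:
        # Rare case B(s) is empty: fall back to a random unfiltered action.
        return random.choice(actions)
    # Boltzmann exploration: sample proportionally to exp(Q / temperature).
    prefs = [math.exp(q(e(state, a), a) / temperature) for a in B]
    total = sum(prefs)
    return random.choices(B, weights=[p / total for p in prefs])[0]
```

Exploiting corresponds to letting the temperature approach zero, which concentrates the distribution on the maximum-Q action in B(s).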
Table 2: The resulting Q-tables when (a) W = U, and (b) |W| = 1.
For example, Table 2 illustrates the effect of varying |W|. In the rare event that B(s) = ∅, i.e. e(s,a) ∉ W for all a ∈ A, either a random action can be chosen, or rough Q-value estimates can be stored using sparse training data. This condition becomes rarer as |A| increases. For example, with |U| = 3, |W| = 1, and |A| = 2 as in Table 2(b), 4/9 = 44.4% of feature vectors have no action that passes the W filter. However, with |A| = 8, only 256/6561 = 3.9% of feature vectors have no action that passes the W filter. If |W| = 2 and |A| = 8, only 1 of the 6561 feature vectors fails to pass the filter. Thus using W to filter action selection can reduce the number of free variables in the learning problem without significantly reducing the coverage of the learned Q-table.
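These coverage figures follow from a simple count: a feature vector has one entry from U per action, and it fails the filter exactly when every entry lies outside W, giving (|U| − |W|)^|A| failing vectors out of |U|^|A| total. A minimal sketch of this calculation (the function name is illustrative):

```python
def unfiltered_fraction(u, w, a):
    """Fraction of feature vectors (one value from U per action) in which
    no action's feature value falls in W, i.e. in which B(s) is empty."""
    return (u - w) ** a / u ** a

# |U| = 3, |W| = 1: 4/9 of vectors fail the filter with 2 actions,
# but only 256/6561 (about 3.9%) with 8 actions.
print(unfiltered_fraction(3, 1, 2))
print(unfiltered_fraction(3, 1, 8))
print(unfiltered_fraction(3, 2, 8))  # 1/6561 with |W| = 2
```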
By using action-dependent features to create a coarse feature space, and with the help of a reward function based entirely on individual observation of the environment, TPOT-RL enables team learning in a multi-agent, adversarial environment even when agents cannot track state transitions.