When the agent's actions do not influence state transitions, the resulting problem becomes one of choosing actions to maximize immediate reward as a function of the agent's current state. These problems bear a resemblance to the bandit problems discussed in Section 2 except that the agent should condition its action selection on the current state. For this reason, this class of problems has been described as associative reinforcement learning.
The algorithms in this section address the problem of learning from immediate boolean reinforcement where the state is vector valued and the action is a boolean vector. Such algorithms can and have been used in the context of a delayed reinforcement, for instance, as the RL component in the AHC architecture described in Section 4.1. They can also be generalized to real-valued reward through reward comparison methods .