The networks described in Section 6.1.1 generalize over state descriptions presented as inputs. They also produce outputs in a discrete, factored representation and thus could be seen as generalizing over actions as well.
In cases such as this, where actions are described combinatorially, it is important to generalize over actions so as to avoid keeping separate statistics for each of the huge number of actions that can be chosen. In continuous action spaces, the need for generalization is even more pronounced.
When estimating Q values using a neural network, it is possible to use either a distinct network for each action, or a network with a distinct output for each action. When the action space is continuous, neither approach is possible. An alternative strategy is to use a single network with both the state and action as input and Q value as the output. Training such a network is not conceptually difficult, but using the network to find the optimal action can be a challenge. One method is to do a local gradient-ascent search on the action in order to find one with high value.
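The local search over actions can be illustrated with a minimal sketch. It assumes a scalar action restricted to an interval and estimates the gradient of Q with respect to the action by central differences, so that any function of (state, action) can be used. The names `q_network` and `ascend_action`, the weights, and the step sizes are all hypothetical choices made for illustration; in practice the gradient would more likely be obtained by backpropagating through the trained network to its action input.

```python
import math

def q_network(state, action):
    """Stand-in for a trained Q network taking (state, action) as input.

    The weights are arbitrary illustrative values, not learned ones.
    """
    hidden = [
        math.tanh(0.8 * state + 1.5 * action - 0.2),
        math.tanh(-0.5 * state - 1.0 * action + 0.4),
    ]
    return 1.0 * hidden[0] + 0.7 * hidden[1]

def ascend_action(q_fn, state, action, low, high,
                  step_size=0.1, n_steps=100, eps=1e-4):
    """Local gradient ascent on the action for a fixed state.

    Returns the best action seen during the search, which may be only a
    local maximum of q_fn(state, .) on [low, high].
    """
    best_action, best_value = action, q_fn(state, action)
    for _ in range(n_steps):
        # Central-difference estimate of dQ/da at the current action.
        grad = (q_fn(state, action + eps) - q_fn(state, action - eps)) / (2 * eps)
        # Step uphill, then clip back into the allowed action range.
        action = min(high, max(low, action + step_size * grad))
        value = q_fn(state, action)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

if __name__ == "__main__":
    a = ascend_action(q_network, state=0.5, action=0.0, low=-1.0, high=1.0)
    print("chosen action:", a, "Q value:", q_network(0.5, a))
```

Because the search is local, the result depends on the starting action; restarting from several initial actions and keeping the best one is a common remedy.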
Gullapalli [43, 44] has developed a "neural" reinforcement-learning unit for use in continuous action spaces. The unit generates actions with a normal distribution; it adjusts the mean and variance based on previous experience. When the chosen actions are not performing well, the variance is high, resulting in exploration of the range of choices. When an action performs well, the mean is moved in that direction and the variance is decreased, resulting in a tendency to generate more action values near the successful one. This method was successfully employed to learn to control a robot arm with many continuous degrees of freedom.
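The behaviour of such a unit can be sketched in a few lines. The code below is a simplified, stateless illustration of the idea described above and is not Gullapalli's published algorithm. It assumes rewards lie in [0, 1], keeps a running estimate of expected reward, sets the standard deviation (rather than the variance) to shrink as that estimate grows, and moves the mean toward actions that do better than expected and away from those that do worse. The class name, the update rules, and all constants are hypothetical.

```python
import random

class GaussianActionUnit:
    """Draws scalar actions from a normal distribution and adapts it."""

    def __init__(self, mean=0.0, max_sigma=1.0, min_sigma=0.01,
                 mean_rate=0.5, reward_rate=0.1, seed=None):
        self.mean = mean
        self.max_sigma = max_sigma
        self.min_sigma = min_sigma
        self.mean_rate = mean_rate
        self.reward_rate = reward_rate
        self.expected_reward = 0.0  # running estimate, assumed in [0, 1]
        self.rng = random.Random(seed)

    def sigma(self):
        # Poor expected reward gives wide exploration; good reward narrows it.
        return max(self.min_sigma,
                   self.max_sigma * (1.0 - self.expected_reward))

    def act(self):
        return self.rng.gauss(self.mean, self.sigma())

    def update(self, action, reward):
        # Positive when the action did better than expected.
        surprise = reward - self.expected_reward
        # Move the mean toward (or away from) the action that was tried.
        self.mean += self.mean_rate * surprise * (action - self.mean)
        # Track expected reward with an exponential moving average.
        self.expected_reward += self.reward_rate * surprise

def demo(target=0.7, steps=500, seed=0):
    """Run the unit on a toy task whose reward peaks at `target`."""
    unit = GaussianActionUnit(seed=seed)
    for _ in range(steps):
        action = unit.act()
        reward = max(0.0, 1.0 - abs(action - target))  # toy reward in [0, 1]
        unit.update(action, reward)
    return unit

if __name__ == "__main__":
    unit = demo()
    print("mean:", unit.mean, "sigma:", unit.sigma())
```

Tying the spread to expected reward is one simple way to obtain the exploration pattern described in the text; a unit for a real control task would also need to condition the mean and the expected reward on the state.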