Reinforcement learning typically works by refining an estimate of expected future reward. In goal-directed tasks, such as the ones investigated here, this is equivalent to progressively improving the estimate of the distance to the goal from each state. This estimate is updated using the best local action, i.e. the one moving the robot, or arm, to the new state with the smallest estimated distance. Early in the learning process, only states close to the goal are likely to have accurate estimates of the true distance. Each time an action is taken, the estimate of distance at the new state is used to update the estimate at the old state. Eventually, this process propagates accurate estimates back from the goal to all other states.
Rather than directly estimating the distance to the goal, the system uses the expected discounted reward for each state, V(s). The influence of a reward, r, is reduced progressively the farther into the future it occurs by using a discount factor γ less than one. In this work, the only reward is for reaching the goal, so the farther a state is from the goal, the smaller its value. The use of an expectation allows the actions to be stochastic: when the robot, or arm, takes a particular action in a particular state, the next state is not always the same.
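As a small illustration (with a hypothetical discount factor of 0.9 and a goal reward of 1, neither value being specified in this excerpt), the discounted value falls off geometrically with distance to the goal:

```python
# Sketch: with a single reward of 1 at the goal and a discount factor
# gamma = 0.9 (both hypothetical values), the value of a state n steps
# from the goal is gamma ** n -- smaller the farther the state is.
gamma = 0.9

def discounted_value(steps_to_goal, goal_reward=1.0):
    """Value of a state whose only reward arrives steps_to_goal steps away."""
    return (gamma ** steps_to_goal) * goal_reward

# Values shrink geometrically: 1.0, 0.9, ~0.81, ~0.729, ...
values = [discounted_value(n) for n in range(4)]
```

This is why the value function can stand in for the (negated) distance to the goal: states farther away receive strictly smaller values.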
To carry out reinforcement learning, this research uses the Q-learning algorithm (Watkins & Dayan 1992). The algorithm assumes the world is a discrete Markov process, so both states and actions are discrete. For each action a in each state s, Q-learning maintains a rolling average of the immediate reward r plus the discounted maximum value of any action a' in the next state s' (see Equation 1). The action selected in each state is usually the one with the highest value. But to encourage exploration of the state space, this paper uses an ε-greedy policy (Sutton 1996), which chooses a random action a fraction ε of the time. The only effect that function composition has on the Q-learning algorithm is that the initial value for each state-action pair is set to some value other than zero.
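The update and the ε-greedy policy can be sketched as follows on a hypothetical one-dimensional corridor task; the corridor itself and the values of α, γ and ε are illustrative choices, not taken from the paper:

```python
import random

# Minimal tabular Q-learning sketch on a hypothetical 1-D corridor:
# states 0..4 with the goal at state 4; actions 0 = left, 1 = right.
# alpha, gamma and epsilon are illustrative values, not from the paper.
random.seed(0)  # reproducibility of the sketch
N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic world: move left/right; reward 1 only on reaching the goal."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

def choose(s):
    """Epsilon-greedy: a random action a fraction epsilon of the time,
    otherwise a highest-valued action (ties broken randomly)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for _ in range(500):                      # learning episodes
    s = 0
    while s != GOAL:
        a = choose(s)
        s2, r = step(s, a)
        # The update of Equation 1: move Q(s,a) toward
        # r + gamma * max_a' Q(s',a'), at rate alpha.
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

After learning, the greedy action in every state is "right", and Q(s, right) approaches γ raised to the number of remaining steps minus one, e.g. Q(3, right) ≈ 1 and Q(0, right) ≈ 0.9³ ≈ 0.729, mirroring the distance-to-goal interpretation above.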
The Q-function over state and action is usually referred to as the action-value function. In this paper, it is the action-value function that is transformed and composed to form a solution to the new task. The value function, discussed in previous sections and shown in the figures, is the maximum of the Q-function over actions. It is used to generate the partition and associated graphs needed to control the process.
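Concretely, the value function is recovered from the action-value function by maximizing over actions; a tiny sketch with made-up Q-values:

```python
# Sketch: the value function V(s) is the maximum of the action-value
# function Q(s, a) over actions a. Q-values here are invented for
# illustration only.
Q = {(0, 'left'): 0.5, (0, 'right'): 0.9,
     (1, 'left'): 0.7, (1, 'right'): 0.3}

V = {s: max(q for (s2, _a), q in Q.items() if s2 == s) for s in (0, 1)}
# It is this V that is partitioned and turned into graphs for control.
```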
Watkins and Dayan (1992) proved that Q-learning will converge to the optimal value given certain constraints on the reward and the learning rate α. The optimal solution is produced by taking the action with the greatest value in any state. So, for goal-directed tasks, a greedy algorithm will take the shortest path to the goal, once learning is complete. The extension to continuous spaces may be done using function approximation. The simplest method, and the one used here, is to divide the state dimensions into intervals. The resulting action-value function has cells representing the average Q-value of taking each action from somewhere within a region of the state space. In off-line learning, where any action in any state can be executed, this representation has been proven to converge (Gordon 1995). In on-line learning, where the current state is determined by the environment, this approach is generally successful, but no proof of its convergence exists.
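A minimal sketch of this interval-based approximation (the bounds and interval counts below are illustrative, not taken from the paper):

```python
# Sketch: divide each continuous state dimension into equal-width
# intervals; a continuous state maps to the tuple of interval indices
# of the cell containing it, and each cell stores one Q-value per action.
def make_discretizer(lows, highs, n_intervals):
    """Return a function mapping a continuous state to a tuple of cell indices."""
    def discretize(state):
        cell = []
        for x, lo, hi, n in zip(state, lows, highs, n_intervals):
            frac = (x - lo) / (hi - lo)
            cell.append(min(n - 1, max(0, int(frac * n))))  # clip to a valid index
        return tuple(cell)
    return discretize

# e.g. a 2-D state space [0, 1] x [-3.14159, 3.14159] with 10 x 8 cells
disc = make_discretizer([0.0, -3.14159], [1.0, 3.14159], [10, 8])
```

All states falling within the same cell then share a single Q-value per action, which produces the averaging behaviour described above.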