This section compares learning curves for function composition and a simple baseline algorithm. Four sets of results are presented; one for each of the two types of related task in each of the two domains. The learning curves represent the average distance to goal as a function of the number of actions taken during learning. The distance is averaged over 64 different start positions, distributed uniformly throughout the state space, and over the different experimental runs. To determine this distance, normal learning is stopped after a fixed number of actions and a copy of the function learned so far is stored. One of the 64 start positions is selected, learning is restarted and the number of actions needed to reach the goal is recorded. If a trial takes 2000 actions and has not yet reached the goal, it is stopped and the distance to goal recorded as 2000. The function is then reinitialized with the stored version and another start state selected. This is repeated 64 times. Then the function is reinitialized once more and normal learning resumed.
The baseline algorithm and the underlying learning algorithm for the function composition system is the basic Q-learning algorithm, using a discrete function approximator as discussed in Section 3.1. The learning rate is set to 0.1, the greedy policy uses an of 0.1 (the best action is selected 90% of the time), the future discount is 0.8 and a reward of 1.0 is received on reaching the goal. Although the state spaces for the different domains represent two quite different things - the robot's location and the angle of the arm's two joints - the actual representation is the same. The state space ranges between for each dimension. A step is in each dimension either separately or together, giving the eight possible actions. The actions are stochastic, a uniformly distributed random value between being added to each dimension of the action. In the robot navigation examples if the robot hits the wall, it is positioned a small distance from the wall along the direction of its last action. This has not been implemented for the robot arm as it is a somewhat more complex calculation. Instead, if a collision with an obstacle occurs the arm is restored to its position before taking the action.
Learning begins at a randomly selected start state and continues until the goal is reached. Then a new start state is selected randomly and the process repeated. This continues until the requisite total number of actions is achieved. Speed up is calculated by dividing the number of learning steps at one specific point on the baseline learning curve by the number of learning steps at an equivalent point on the function composition system's learning curve. The knee of the function composition system's curve is used. This occurs where the low level learning algorithm is initialized with the composed function. This is compared to the approximate position of the knee of the baseline curve.