The first experiment investigates the time taken to correct a learned function when the goal is relocated in the robot navigation domain. There are nine different room configurations, as shown in Figure 26. The number of rooms varies from three to five, and there are four different goal positions. Each room has one or two doorways and one or two paths to the goal. To initialize the case base, a function is learned for each of these configurations with the goal in the position shown by the black square. The rooms are generated randomly, subject to some constraints on the configuration of the rooms and doorways: a room cannot be too small or too narrow, and a doorway cannot be too large. The case base also includes the functions generated for the experiments discussed in Section 4.3. This was necessary to give a sufficient variety of cases to cover most of the new tasks. Even with this addition, not all subgraphs are matched. Constant-valued default functions are used where there is no match. This reduces the speed up significantly, but does not eliminate it altogether.
Once the case base is loaded, the basic Q-learning algorithm is rerun on each room configuration with the goal in the position shown. After 400,000 steps the goal is moved; this point is denoted as time t on the x-axis of Figure 27. The goal is moved to one of the three remaining corners of the state space, a task not included in the case base. Learning then continues for a further 300,000 steps. At fixed intervals, learning is stopped and the average number of steps to reach the goal is recorded. The curves in Figure 27 are the average of 27 experimental runs: three new goal positions for each of the nine room configurations.
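The bookkeeping of this protocol can be sketched as follows. This is only an illustration of how the 27 runs and the averaged curves arise; the evaluation interval is not stated in the text, so the value used here is an assumption, and all function and constant names are hypothetical.

```python
from itertools import product

N_CONFIGS = 9                 # room configurations (Figure 26)
N_NEW_GOALS = 3               # remaining corners the goal can move to
STEPS_BEFORE_MOVE = 400_000   # learning steps before the goal is moved (time t)
STEPS_AFTER_MOVE = 300_000    # learning steps after the goal is moved

def evaluation_points(interval):
    """Step counts after time t at which learning is paused for evaluation.

    The interval is an assumed parameter; the text only says "fixed intervals".
    """
    return list(range(interval, STEPS_AFTER_MOVE + 1, interval))

def average_curves(curves):
    """Pointwise mean of equally long curves, one curve per experimental run."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]

# One run per (room configuration, new goal position) pair.
runs = list(product(range(N_CONFIGS), range(N_NEW_GOALS)))
print(len(runs))  # 27
```

Each plotted point would then be the result of `average_curves` applied to the 27 per-run curves sampled at the same evaluation points.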
The basic Q-learning algorithm, the top curve of Figure 27, performs poorly because, when the goal is moved, the existing function pushes the robot towards the old goal position. A variant of the basic algorithm reinitializes the function to zero everywhere on detecting that the goal has moved. This reinitialized Q-learning, the middle curve, performs much better, but it still has to learn the new task from scratch.
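The difference between the two baselines can be illustrated with a minimal sketch of tabular Q-learning on a toy grid. This is not the robot navigation domain or the function representation used in the experiments: the grid, the reward of 1 at the goal, the learning parameters, and all names are assumptions made for illustration. How the goal move is detected is not described in the text, so the caller triggers the reset explicitly.

```python
import random

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # moves on a toy grid

def make_q(width, height):
    """Q-table with every state-action value initialized to zero."""
    return {(x, y): [0.0] * len(ACTIONS)
            for x in range(width) for y in range(height)}

def move(state, action, width, height):
    """Apply an action, clipping at the grid boundary."""
    dx, dy = ACTIONS[action]
    return (min(max(state[0] + dx, 0), width - 1),
            min(max(state[1] + dy, 0), height - 1))

def random_start(goal, width, height, rng):
    """Pick a random start state different from the goal."""
    while True:
        state = (rng.randrange(width), rng.randrange(height))
        if state != goal:
            return state

def q_learn(q, goal, n_steps, width, height,
            alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning for n_steps, updating q in place."""
    rng = random.Random(seed)
    state = random_start(goal, width, height, rng)
    for _ in range(n_steps):
        values = q[state]
        if rng.random() < epsilon:
            action = rng.randrange(len(ACTIONS))
        else:
            best = max(values)
            action = rng.choice([i for i, v in enumerate(values) if v == best])
        nxt = move(state, action, width, height)
        at_goal = nxt == goal
        # Assumed reward: 1 on reaching the goal, 0 otherwise.
        target = 1.0 if at_goal else gamma * max(q[nxt])
        values[action] += alpha * (target - values[action])
        state = random_start(goal, width, height, rng) if at_goal else nxt
    return q

def reinitialize(q):
    """Reset the function to zero everywhere (the reinitialized variant)."""
    for values in q.values():
        values[:] = [0.0] * len(values)

q = make_q(5, 5)
q_learn(q, goal=(4, 4), n_steps=5000, width=5, height=5)
reinitialize(q)  # omit this call to mimic the non-reinitialized baseline
q_learn(q, goal=(4, 0), n_steps=5000, width=5, height=5)
```

Without the call to `reinitialize`, the values learned for the old goal remain in the table and continue to steer greedy action selection towards it until they are overwritten.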
The function composition system, the lowest curve, performs by far the best. The precise position of the knee of this curve is difficult to determine because of the effect of using default functions. If only those examples using case base functions are considered, the knee point is very sharp at about 3000 steps. The average number of steps to goal at 3000 steps, for all examples, is 40. The non-reinitialized Q-learning fails to reach this value within 300,000 steps, giving a speed up of over 100. The reinitialized Q-learning reaches this value at about 120,000 steps, giving a speed up of about 40. Function composition generally produces accurate solutions. Even if some error is introduced, further Q-learning quickly refines the function towards the asymptotic value of about 17. After about 150,000 steps, normal Q-learning reaches an average value of 24 steps and then slowly refines the solution to reach an average value of 21 after 300,000 steps.
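The speed up figures follow from dividing the number of steps a baseline needs to reach an average of 40 steps to goal by the roughly 3000 steps needed with function composition. A short check of this arithmetic, using the approximate values quoted above (the function name is hypothetical):

```python
def speed_up(baseline_steps, composition_steps):
    """Ratio of learning steps needed to reach the same performance level."""
    return baseline_steps / composition_steps

KNEE_STEPS = 3_000  # approximate knee of the function composition curve

# Reinitialized Q-learning reaches the same value at about 120,000 steps.
print(speed_up(120_000, KNEE_STEPS))  # 40.0

# Non-reinitialized Q-learning has not reached it by 300,000 steps,
# so this ratio is only a lower bound on its speed up.
print(speed_up(300_000, KNEE_STEPS))  # 100.0
```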