This overview continues by showing how the graphs, extracted from the features of the function learned by reinforcement learning, can be used to produce a good approximation to the solution for a new goal position. The left-hand side of Figure 3 shows plane graphs for all the rooms (ignore the dashed lines and circles for now). The node representing the goal is labeled ``G''. A directed edge is added from ``I'' to ``O'' or from ``I'' to ``G'', as appropriate. Associated with each edge is a number representing the distance between the nodes, determined from the value of the function at the doorways. Each individual graph is then merged with its neighbors to produce a graph for the whole problem, shown on the right-hand side of Figure 3; the doorway nodes have been relabeled ``D''. The composite graph represents the whole function, and each individual subgraph represents a particular part of it. This information is stored in a case base: each subgraph is an index, and the corresponding part of the function is the case.
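The case base described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: a subgraph is a set of labeled nodes with coordinates plus weighted edges, and each subgraph is paired with the fragment of the value function it was extracted from.

```python
# A subgraph is node labels with 2-D coordinates plus weighted directed
# edges (weight = distance between nodes, read off the learned function).
room_a = {
    "nodes": {"O": (5.0, 2.0), "G": (9.0, 9.0)},
    "edges": [("O", "G", 6.4)],  # doorway -> goal, with distance
}

# The case base pairs each subgraph (the index) with its function
# fragment (the case); the fragment here is a placeholder callable.
case_base = []

def store_case(subgraph, function_fragment):
    """Add a (subgraph, function-fragment) pair to the case base."""
    case_base.append((subgraph, function_fragment))

# Toy value-function fragment peaking at the goal position (9, 9).
store_case(room_a, lambda x, y: max(0.0, 10.0 - abs(x - 9.0) - abs(y - 9.0)))

print(len(case_base))  # one stored case
```

The pairing is deliberately loose: any structure that lets a subgraph act as a retrieval key would do.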
Now suppose the goal is moved from the top right corner to the top left corner of the state space. Reinforcement learning in its most basic form would have to learn the new function from scratch. In this work, once the new goal position is known, the node representing the goal is simply relocated. The new goal position is shown as the dashed circle in Figure 3, and the edges connecting the doorways to the goal are changed to account for it; the dashed lines representing the new edges replace the arrows in the same subgraph. To produce a new function, the idea is to regress backward from the goal along these edges. For each edge, the small subgraph containing it is extracted and used to index the case base of functions. The retrieved function is transformed and added to the appropriate region of the state space to form the new function.
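The regression-and-retrieval loop can be sketched as follows. The matching criterion here is a deliberate stand-in (closest stored edge distance); the paper's subgraph matcher is richer, and all names and numbers are illustrative.

```python
# Edges of the composite graph, walked backward from the relocated goal:
# goal-room edge first, then successively more distant doorway edges.
composite_edges = [("D1", "G", 4.0), ("D2", "D1", 3.0), ("D3", "D2", 5.0)]

# Toy case base keyed by a stored edge distance; real indexing would
# compare whole subgraphs, not a single number.
case_base = {
    4.2: "goal-room function",
    3.1: "corridor function",
    5.0: "far-room function",
}

def retrieve(edge):
    """Return the case whose stored distance best matches the edge's."""
    _, _, dist = edge
    best = min(case_base, key=lambda d: abs(d - dist))
    return case_base[best]

# Regress backward from the goal, retrieving one case per edge; each
# retrieved function would then be transformed and pasted into its
# region of the state space.
new_function_parts = [retrieve(e) for e in composite_edges]
print(new_function_parts)
```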
In this example, some of the existing subgraphs match the new configuration. The two that do not are the subgraph originally containing the goal and the subgraph now containing the goal. It is certainly possible to exchange these two, using an appropriate transform, but other graphs in the case base may match the new task better. The best match for the subgraph containing the new goal is, in fact, the subgraph for the goal in the original problem. To fit this to the new task, the plane graph is rotated and stretched slightly in the new x direction by changing the coordinates of its nodes (see Figure 4). Then this same transformation is applied to the function. For the room containing the original goal, however, a case obtained when solving another task is a better match. The other three rooms use the functions from the original problem, since changing the goal position has little effect on the actions taken there. In fact, only the height of these functions must be changed, which is simply a multiplication by a value representing the distance to the goal from the ``O'' doorway (this is discussed in detail at the end of Section 3.3). Because the matching of the subgraphs allows some error and asymmetric scaling may be used, the resulting function may not be exact. But as the experiments will demonstrate, the function is often very close, and further reinforcement learning will quickly correct any error.
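The two transformations used above, rotating and stretching the node coordinates and then applying the same change to the function, plus the height rescaling for unchanged rooms, can be sketched as below. This is an assumed formulation (rotate, then stretch; the function is transformed by inverting that map), with illustrative names throughout.

```python
import math

def transform_point(x, y, angle, sx, sy):
    """Rotate a node coordinate by `angle` radians, then stretch by (sx, sy)."""
    xr = x * math.cos(angle) - y * math.sin(angle)
    yr = x * math.sin(angle) + y * math.cos(angle)
    return (xr * sx, yr * sy)

def transform_function(f, angle, sx, sy):
    """Apply the node transform to a function: evaluate f at the
    pre-image of each query point (undo the stretch, then the rotation)."""
    def g(x, y):
        xs, ys = x / sx, y / sy
        xo = xs * math.cos(-angle) - ys * math.sin(-angle)
        yo = xs * math.sin(-angle) + ys * math.cos(-angle)
        return f(xo, yo)
    return g

def scale_height(f, factor):
    """Rescale a room's function by a distance-to-goal value."""
    return lambda x, y: factor * f(x, y)
```

Transforming the function by the inverse of the coordinate map guarantees that a function value stored at a node reappears at that node's new position.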
The new position of the goal must be established before the graph can be modified and function composition can occur. The system is not told that the goal has moved; rather, it discovers this by determining that the goal is no longer at the maximum of the existing function. There is some uncertainty in the exact boundary of the original goal: the robot may reach a state that it believes is part of the original goal region, yet fail to detect the goal even if it has not moved. To be reasonably certain that the goal has in fact moved, this failure is required to occur ten times with no intervening detection of the goal at the maximum.
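The detection rule just described amounts to a counter that resets on any successful detection. A minimal sketch (class and method names are assumptions):

```python
class GoalMovedDetector:
    """Conclude the goal has moved after ten visits to the believed goal
    region with no detection, where any detection resets the count."""
    THRESHOLD = 10

    def __init__(self):
        self.misses = 0

    def observe(self, at_believed_goal, goal_detected):
        """Record one observation; return True once the goal is judged moved."""
        if not at_believed_goal:
            return False
        if goal_detected:
            self.misses = 0  # an intervening detection resets the count
            return False
        self.misses += 1
        return self.misses >= self.THRESHOLD

d = GoalMovedDetector()
results = [d.observe(True, False) for _ in range(10)]
print(results[-1])  # True on the tenth consecutive miss
```

The reset on detection is what tolerates sensing noise at the goal boundary: isolated misses never accumulate to the threshold as long as detections keep occurring.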
The system then composes a search function by assuming that a particular room contains the goal. Search functions, too, are produced by composing previously learned functions; however, for the room assumed to contain the goal, the function is a constant. This does not bias the search toward any particular part of the room and allows some limited learning to encourage exploration. The search function drives the robot into the room from anywhere else in the state space. If the robot fails to find the goal after a fixed number of steps, a new search function is composed with another room assumed to contain the goal. This process is repeated until the goal has been located ten times, which ensures a good estimate of the ``center of mass'' of the goal region. The ``center of mass'' is used as the new position of the goal node in the composite graph. Requiring that the old or new goal position be sampled a fixed number of times has proven effective in the domains discussed in this paper. Nevertheless, it is a somewhat ad hoc procedure and will be addressed in future work, discussed in Section 6.2.
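The bookkeeping of this search loop can be sketched as follows. The search-function composition itself is abstracted into a `try_room` callback; room ordering, the detection points, and all names are illustrative assumptions.

```python
REQUIRED_DETECTIONS = 10  # detections needed before fixing the goal node

def center_of_mass(points):
    """Mean of the detected goal positions."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def search(rooms, try_room):
    """Assume each room in turn contains the goal; `try_room(room)`
    returns the goal-detection points found while searching that room.
    Stop once the goal has been located REQUIRED_DETECTIONS times."""
    detections = []
    for room in rooms:
        detections += try_room(room)
        if len(detections) >= REQUIRED_DETECTIONS:
            return center_of_mass(detections[:REQUIRED_DETECTIONS])
    return None  # goal not located in any room

# Toy run: the goal sits in room 2, detected at slightly scattered points.
points = [(2.0 + 0.1 * i, 8.0) for i in range(10)]
goal = search([0, 1, 2], lambda r: points if r == 2 else [])
print(goal)
```

Averaging the detection points is what makes the noisy goal boundary workable: no single detection is trusted, only their center.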