This overview begins with a very high-level introduction to reinforcement learning and the function it produces. It will show that there are features in this function that can be extracted and converted into a graphical representation.
One of the experimental test beds used in this paper is a simulated robot environment of different configurations of interconnected rooms. The robot must learn to navigate efficiently through these rooms to reach a specified goal from any start location. Figure 1 shows one example with five rooms and the goal in the top right corner. The robot's actions are small steps in any of eight directions, as indicated by the arrows. Here, the location, or state, is simply the robot's x and y coordinates. The thin lines of Figure 1 are the walls of the rooms; the thick lines are the boundary of the state space.
If each action is independent of preceding actions, the task becomes one of learning the best action in any state. The best overall action would be one that takes the robot immediately to the goal, but this is only possible in states close to the goal. Suppose the robot is in a particular state and that the number of steps to goal from each of its neighboring states is known, indicated by the numbered squares surrounding the robot in Figure 1. Then a one-step lookahead procedure would consider each step and select the one that reaches the neighboring state with the shortest distance to goal. In Figure 1 the robot would move to the state 10 steps from goal. If this process is repeated, the robot will take the shortest path to goal. In practice we must, of course, learn such values. This can be done using some type of reinforcement learning (Sutton, 1990; Watkins & Dayan, 1992), which progressively improves estimates of the distance to goal from each state until they converge to the correct values.
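The one-step lookahead procedure can be sketched as follows. This is a minimal illustration, not the paper's learner: the distance-to-goal values are assumed to be already known, and the grid, state encoding, and function names are assumptions made for the example.

```python
import numpy as np

# The robot's possible small steps: the eight compass directions.
STEPS = [(-1, -1), (-1, 0), (-1, 1),
         (0, -1),           (0, 1),
         (1, -1),  (1, 0),  (1, 1)]

def greedy_step(dist, state):
    """One-step lookahead: move to the neighboring state with the
    shortest known distance to goal."""
    r, c = state
    best, best_d = state, dist[r, c]
    for dr, dc in STEPS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < dist.shape[0] and 0 <= nc < dist.shape[1]:
            if dist[nr, nc] < best_d:
                best, best_d = (nr, nc), dist[nr, nc]
    return best

def follow_to_goal(dist, start):
    """Repeating the lookahead traces the shortest path to the goal
    (the state with distance 0)."""
    path = [start]
    while dist[path[-1]] > 0:
        path.append(greedy_step(dist, path[-1]))
    return path
```

With eight-direction moves on an open grid, the distance to goal is the Chebyshev distance, so following the greedy step from a state d steps away yields a path of d moves.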
The function shown in Figure 2 is called the value function. Subsequently, the term function will mean the value function unless otherwise indicated. The function is the result of reinforcement learning on the problem of Figure 1, but instead of representing the actual distance to goal, it is essentially an exponential decay with distance to goal. The reasons for this will be made clear in Section 3.1. The shaded areas represent large gradients in the learned function. Comparing this to the environment shown in Figure 1, it is apparent that these correspond to the walls of the various rooms. These are the strong features discussed in this paper. They exist because of the extra distance the robot must travel around the wall to reach the inside of the next room on the path to the goal. These features are readily apparent to the human eye, so it seems natural to use vision processing techniques to locate them.
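A small example helps show why the learned function decays exponentially with distance to goal. The sketch below (an illustration, not the paper's learning algorithm) runs deterministic value iteration with a reward of 1 at the goal and a discount factor gamma; the values converge to gamma raised to the number of steps to goal. The grid layout, goal position, and gamma are assumptions made for the example.

```python
import numpy as np

GAMMA = 0.9  # discount factor: each extra step to goal multiplies the value by 0.9

def value_iteration(walls, goal, sweeps=50):
    """Deterministic value iteration on a grid with eight-direction moves.
    Converges to V(s) = GAMMA ** d(s), where d(s) is steps to goal."""
    rows, cols = walls.shape
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    v = np.zeros((rows, cols))
    for _ in range(sweeps):
        new_v = np.zeros_like(v)
        for r in range(rows):
            for c in range(cols):
                if walls[r, c]:
                    continue  # no value inside a wall
                if (r, c) == goal:
                    new_v[r, c] = 1.0  # reward received at the goal
                    continue
                best = 0.0
                for dr, dc in steps:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and not walls[nr, nc]:
                        best = max(best, GAMMA * v[nr, nc])
                new_v[r, c] = best
        v = new_v
    return v
```

Because a wall forces the detour described above, states on opposite sides of a wall differ in d(s) by many steps, producing the large jump in value, the strong feature, that the shading in Figure 2 makes visible.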
An edge detection technique called a snake is used to locate these features. The snake produces a polygon, in this instance a rectangle, locating the boundary of each room. The doorways to the room occur where the differential of the function, along the body of the snake, is at a local minimum. The direction of the differential with respect to the edges of the polygon, associated with the walls of the room, determines whether a doorway is an entrance or an exit: a positive gradient into the room indicates an entrance; a positive gradient out of the room indicates an exit. From this information, a plane graph, labeled with an (x,y) coordinate for each node, is constructed. Figure 2 shows one such example for the room at the top left corner of the state space; subsequent graphs will not show the coordinates. Nodes corresponding to the doorways are labeled ``I'' or ``O'' for in and out respectively; their positions on the function are indicated by the dashed arrows.
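The doorway-finding step can be sketched as follows, assuming the snake has already converged to a rectangle and that the magnitude of the function's differential has been sampled at points along its perimeter. Doorways are local minima of that profile, since the sharp jump in value across a wall vanishes at an opening; the sign of the gradient across the wall then labels the doorway. The sampled profile and all names here are illustrative assumptions, not the paper's implementation.

```python
def doorway_indices(profile):
    """Indices where the differential along the snake's body is a strict
    local minimum. The comparison is cyclic, because the snake's
    perimeter is a closed curve."""
    n = len(profile)
    return [i for i in range(n)
            if profile[i] < profile[(i - 1) % n]
            and profile[i] < profile[(i + 1) % n]]

def doorway_label(grad_into_room):
    """Classify a doorway by the gradient across the wall: a positive
    gradient into the room marks an entrance ("I"), a positive gradient
    out of the room marks an exit ("O")."""
    return "I" if grad_into_room > 0 else "O"
```

Each detected index, together with its (x,y) position on the perimeter and its "I"/"O" label, then becomes a node of the plane graph described above.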