This overview begins with a very high-level introduction to reinforcement learning and the function it produces. It then shows that this function contains features which can be extracted and converted into a graphical representation.

One of the experimental test beds used in this paper is a simulated
robot environment of different configurations of interconnected rooms.
The robot must learn to navigate efficiently through these rooms to
reach a specified goal from any start location. Figure
1 shows one example with 5 rooms and the goal in the top
right corner. The robot's actions are small steps in any of eight
directions, as indicated by the arrows. Here, the location, or state,
is simply the robot's *x* and *y* coordinates. The thin lines of
Figure 1 are the walls of the rooms, the thick lines the
boundary of the state space.
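The environment just described can be sketched as a small grid world. This is a minimal illustrative sketch, not the paper's actual simulator: the class name, the grid size, and the wall layout are all assumptions made for the example.

```python
# A minimal sketch of the rooms environment described above. The layout,
# names, and sizes are illustrative assumptions, not the paper's setup.
# States are (x, y) coordinates; walls block movement; the robot may
# take a small step in any of eight directions.

class RoomsEnv:
    # Eight unit steps: E, NE, N, NW, W, SW, S, SE.
    ACTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
               (-1, 0), (-1, -1), (0, -1), (1, -1)]

    def __init__(self, width, height, walls, goal):
        self.width, self.height = width, height
        self.walls = set(walls)   # set of blocked (x, y) cells
        self.goal = goal

    def step(self, state, action):
        dx, dy = self.ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        # Stay put if the move leaves the state space or hits a wall.
        if not (0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height):
            return state
        if nxt in self.walls:
            return state
        return nxt

# Two rooms separated by a vertical wall with a doorway at (2, 2).
env = RoomsEnv(width=5, height=5,
               walls={(2, 0), (2, 1), (2, 3), (2, 4)},
               goal=(4, 4))
print(env.step((1, 2), 0))  # step east through the doorway → (2, 2)
print(env.step((1, 0), 0))  # step east into the wall → stays at (1, 0)
```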

If each action is independent of preceding actions, the task becomes one of learning the best action in any state. The best overall action would be one that takes the robot immediately to the goal, but this is only possible in states close to the goal. Suppose the robot is in a particular state and that the number of steps to goal from each of its neighboring states is known, indicated by the numbered squares surrounding the robot in Figure 1. Then a one-step look-ahead procedure would consider each step and select the one that reaches the neighboring state with the shortest distance to goal. In Figure 1 the robot would move to the state 10 steps from goal. If this process is repeated, the robot will take the shortest path to goal. In practice we must, of course, learn such values. This can be done using some type of reinforcement learning (Sutton, 1990; Watkins & Dayan, 1992), which progressively improves estimates of the distance to goal from each state until they converge to the correct values.
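The one-step look-ahead above can be sketched in a few lines. The distance values are illustrative numbers echoing the numbered squares of Figure 1, not values from the paper.

```python
# A sketch of the one-step look-ahead described above: given estimated
# distances to goal for the reachable neighbors of the current state,
# take the step leading to the neighbor with the smallest distance.
# The states and distances here are illustrative assumptions.

def greedy_step(neighbours, dist_to_goal):
    """neighbours: reachable next states;
    dist_to_goal: dict mapping state -> estimated steps to goal."""
    return min(neighbours, key=lambda s: dist_to_goal[s])

# Neighbors of the robot's state and their distances to goal.
dist = {(0, 1): 12, (1, 0): 11, (1, 1): 10}
print(greedy_step(list(dist), dist))  # → (1, 1), the state 10 steps from goal
```

Repeating this selection from each successive state traces the shortest path, provided the distance estimates are correct.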

The function shown in Figure 2 is called the value
function. Subsequently, the term function will mean the value function
unless otherwise indicated. The function is the result of
reinforcement learning on the problem of Figure 1, but
instead of representing the actual distance to goal, it is
essentially an exponential decay with distance to goal. The reasons
for this will be made clear in Section 3.1. The shaded areas
represent large gradients in the learned function. Comparing this to
the environment shown in Figure 1, it is apparent that
these correspond to the walls of the various rooms. These are the
*strong features* discussed in this paper. They exist because of
the extra distance for the robot to travel around the wall to reach
the inside of the next room on the path to the goal. These features
are visually readily apparent to a human, so it seems natural to use
vision processing techniques to locate them.
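The exponential decay mentioned above can be illustrated with the standard discounted-reward argument; the discount factor and goal reward below are illustrative values, not the paper's (Section 3.1 gives the actual reasons).

```python
# A sketch of why a learned value function decays exponentially with
# distance to goal: with a single reward r received at the goal and a
# discount of gamma per step, a state d steps from goal has value
# V = r * gamma ** d. The values of gamma and r are assumptions.

gamma, r = 0.9, 1.0
for d in [0, 5, 10, 20]:
    print(d, r * gamma ** d)
```

A wall shows up as a sharp drop in this function: crossing it would save the extra steps needed to walk around it, so the value on either side differs by a factor of `gamma ** k` for a detour of `k` steps. These drops are the large gradients shaded in Figure 2.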

An edge detection technique called a snake is used to locate these
features. The snake produces a polygon, in this instance a rectangle,
locating the boundary of each room. The doorways to the room occur
where the differential of the function, along the body of the snake,
is at a local minimum. The direction of the differential with respect
to the edges of the polygon, associated with the walls of the room,
determines whether each doorway is an entrance or an exit. A positive gradient into
the room indicates an entrance; a positive gradient out of the room
indicates an exit. From this information, a plane graph, labeled with
an (*x*,*y*) coordinate for each node, is constructed. Figure
2 shows one such example, for the room at the top left
corner of the state space; subsequent graphs will not show the
coordinates. Nodes corresponding to the doorways are labeled ``I'' or
``O'' for in and out respectively; their positions on the function are
indicated by the dashed arrows.
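The doorway test above can be sketched as follows. This is not the paper's snake implementation; the synthetic value function, the wall position, and the helper names are assumptions made for illustration. The idea is that along the snake's body the differential of the function across a wall is large, except at a doorway, where it is at a local minimum; the sign of that differential distinguishes entrances from exits.

```python
import numpy as np

# Sketch of doorway detection along one edge of the snake's polygon.
# We sample an (assumed) value function just inside and just outside a
# vertical wall, find where the differential across the wall is at a
# minimum (the doorway), and classify it by the gradient's direction.

def classify_doorway(v_inside, v_outside):
    # Positive gradient into the room => entrance ("I");
    # positive gradient out of the room => exit ("O").
    return "I" if v_inside > v_outside else "O"

# Synthetic values on either side of the wall, indexed by position y.
y = np.arange(20)
v_out = np.exp(-0.1 * np.abs(y - 8))   # just outside the wall
v_in = 0.5 * v_out                     # inside: sharp drop across the wall...
v_in[8] = 0.99 * v_out[8]              # ...except at the doorway, y = 8

grad = np.abs(v_in - v_out)            # differential across the wall
door = int(np.argmin(grad))            # local minimum along the snake
print(door, classify_doorway(v_in[door], v_out[door]))  # → 8 O
```

Repeating this test for each edge of the rectangle yields the doorway nodes of the plane graph, each tagged ``I'' or ``O'' and placed at its (*x*,*y*) coordinate.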
