Next: 8 Conclusions Up: Accelerating Reinforcement Learning by Previous: 6.2 Limitations in the

7 Related Work

The most strongly related work is that investigating macro actions in reinforcement learning. Precup, Sutton and Singh (1997;1998) propose a possible semantics for macro actions within the framework of normal reinforcement learning. Singh (1992) uses policies, learned to solve low level problems, as primitives for reinforcement learning at a higher level. Mahadevan and Connell (1992) use reinforcement learning in behavior based robot control. To learn a solution to a new task, all these systems require a definition for each subtask and their interrelationships in solving the compound task. The work presented here gives one way that macro actions can be extracted directly from the system's interaction with its environment, without any such hand-crafted definitions. It also shows how to determine the interrelationships of these macro actions needed to solve the new task. Thrun's research (1994) does identify macro actions, by finding commonalities in multiple tasks. But unlike the research presented here, no mapping of such actions to new tasks is proposed. Hauskrecht et al. (1998) discuss various methods of generating macro actions. Parr (1998) develops algorithms to control the caching of policies that can be used in multiple tasks. But in both cases, they need to be given a partitioning over the state space. It is the automatic generation of just such a partition that has been the focus of much of the work presented in this paper. It may well be that this approach to generating partitions and to determining the interrelationships between partitions of related tasks will prove useful to this other work.

Another group of closely connected work is the various forms of instance based or case based learning that have been used in conjunction with reinforcement learning. They have been used to address a number of issues: (1) the economical representation of the state space, (2) prioritizing states for updating and (3) dealing with hidden state. The first issue is addressed by Peng (1995) and by Tadepalli and Ok (1996) who use learned instances combined with linear regression over a set of neighboring points. Sheppard and Salzberg (1997) also use learned instances, but they are carefully selected by a genetic algorithm. The second issue is addressed by Moore and Atkeson (1993) who keep a queue of ``interesting'' instances, predecessors of those states where learning produces a large change in values. These are updated most frequently to improve the learning rate. The third issue is addressed by McCallum (1995b) who uses trees which expand the state representation to include prior states, removing ambiguity due to hidden states. In further work, McCallum (1995a) uses a single representation to address both the hidden state problem and the general problem of representing a large state space by using a case base of state sequences associated with various trajectories. Unlike this other research, in the work presented here the case is not an example of the value function during learning. Instead, it is the result of a complete learning episode, so the method should be complementary to these other approaches.

This work is also related to case based planning (Veloso & Carbonell, 1993; Hammond, 1990), firstly through the general connection of reinforcement learning and planning. But it is analogous in other ways. When there is a small change to the world, such as the goal being moved, a composite plan is modified by using sub-plans extracted from other composite plans.

Last, but not least, is the connection with object recognition in vision research (Chin & Dyer, 1986; Suetens, Fua, & Hanson, 1992). In the work presented here, many of the methods - if not the final application - has come from that field. The features in the reinforcement learning function are akin to edges in an image. These are located by finding the zero crossing point of the Laplacian as introduced by Marr (1982). In the work presented here, it was proposed that the features largely dictate the form of the function. Mallat and Zhong (1992) have shown that a function can be accurately reconstructed from a record of its steep slopes.

Next: 8 Conclusions Up: Accelerating Reinforcement Learning by Previous: 6.2 Limitations in the

Chris Drummond
Thursday January 31 01:30:31 EST 2002