Plateau-Escaping Macro-Actions

Due to the nature of the relaxed problem used to generate the RPG heuristic there are aspects of the original problem that are not captured. Thus, when the RPG heuristic is used to perform EHC, plateaux are often encountered. On plateaux, the RPG heuristic value of all successor states is the same as, or worse than, the heuristic value of the current state. The nature of the plateaux encountered, and whether EHC is able to find a path to escape from them, is influenced by the properties of the planning domain [Hoffmann 2001].

Ignoring the delete effects of the pickup action in the Gripper domain creates a problem in which, in a given state, it is possible to pick up many balls using one gripper, so long as the gripper is initially available: the delete effect of the action, marking the gripper as no longer available, is removed. The relaxed plan in the initial problem state is pickup all the balls with one gripper, move to the next room, then drop them all. The length of this plan, the heuristic value of the initial state, is $ n + 1 + n$, that is $ 2n + 1$ (where $ n$ is the number of balls). If, in the initial state, a ball is picked up using one of the grippers, the relaxed plan for the resulting state will be to pickup the remaining balls in the other gripper, move to the second room and then drop them all; this has a length of $ (n - 1) + 1 + n$, that is $ 2n$, which is less than the heuristic value of the initial state so this action will be chosen as the one to apply.

The next state, however, is at the start of a plateau. The actions applicable (those for which all the preconditions are satisfied) are either to drop the ball that has been picked up, pickup another ball or move to the next room. The `correct' action would be to pickup another ball: the relaxed plan to the goal state for the resulting state would be to drop one of the balls, pickup all the remaining balls in the newly-freed gripper, move to the next room, and drop all the balls. However, the heuristic value of this state would be $ 1 + (n-2) + 1 + n$, or $ 2n$, the same value as the state in which the action is applied. Moving to the next room would produce a state with the heuristic value of $ 2n$ (move to the initial room, pickup remaining $ (n-1)$ balls, drop all balls in the final room--no move action is required to move back to any room the robot has already visited). Dropping one of the balls would also produce a state with a heuristic value of $ 2n$ (pickup all remaining $ (n-1)$ balls in newly-freed gripper, move to next room, drop all balls). As all successor states have the same RPG heuristic value as their parent state, the heuristic is unable to provide useful guidance as to which action to apply.

With some exhaustive search forward from this point, an improvement in heuristic value can be made in two ways: either move to the next room then drop a ball, or pickup a ball then move to the next room--both of these lead to heuristic values of $ (2n - 1)$. The plateau will, however, be encountered each time the robot is in the first room, holding one ball, and the action choices are either to pickup another ball or move to the next room (or drop a ball). Each time the plateau is encountered, the action sequence to escape the plateau is identical--move-drop or pickup-move (in EHC the actual sequence chosen will depend on the order in which the actions are considered by the planner). Having to discover one of these action sequences by exhaustive search each time the plateau is encountered is a considerable bottleneck in the search process: this is true in general for many domains.

In order to address the overhead caused by recurrent plateaux in the search space, Marvin memoises the action sequences used to escape the previously encountered plateaux; these action sequences are used to form what are called `Plateau-Escaping Macro-Actions'. A macro-action is generated from the action sequence using the code presented in Figure 4. Each step of the action sequence is considered in turn, and an abstract action step is made for it by replacing the entities given as parameters to the action with placeholder identifiers--one for each distinct entity. These placeholder identifiers then form the parameter list of the macro-action; and the recorded abstract action steps dictate the component actions from which the macro-action is built.

Returning to the pickup-move action sequence, the action sequence:

0: pickup robot1 ball2 room1
1: move robot1 room1 room2

would form a macro-action:

pickup-move (?a - robot) (?b - ball) (?c - room) (?d - room)

0: pickup ?a ?b ?c
1: move ?a ?c ?d

This macro-action can then be instantiated by specifying the parameters ?a to ?d, resulting in a sequence of actions. For example, (pickup-move robot1 ball3 room1 room2) would give an action sequence:

0: pickup robot1 ball3 room1
1: move robot1 room1 room2

In Marvin, the preconditions of the steps within the macro-action are not collected to give a single precondition formula for the macro-action. Instead, an instantiated macro-action is said to be applicable in a given state if the first component action of the macro-action is applicable, and subsequent actions are applicable in the relevant resulting states.

Having now built macro-actions from the plateau-escaping action sequences, when the search is later attempting to escape a plateau, these macro-actions are available for application. If the plateau arose due to the same weakness in the heuristic that led to an earlier plateau, then a macro-actions will be able to lead the search to a strictly better state by skipping over the intermediate states. The plateau-escaping macro-actions are only used when the search is attempting to escape a plateaux--this avoids slowing down search when the RPG heuristic is able to provide effective guidance using only single-step actions.

To reduce the number of macro-actions considered, and the blow-up in the size of the explored search space that would otherwise occur, the only macro-actions considered are those containing actions at the first time step that are helpful actions in the current state.

Figure 4: Pseudo-code for building macro-actions from plan segments
\STATE \textbf{Procedure: BuildMacro...\_types and abstract\_steps as a macro-action

Andrew Coles and Amanda Smith 2007-01-09