Evaluation Methodology

In classical planning, a plan is a series of operators. A successful plan is one that, when applied to the initial state, achieves the goal. In probabilistic planning, there are many proposals for plan representations (straight-line plans, plan trees, policy graphs, and triangle tables, for example), but none is considered a widely accepted standard. In addition, even simple plans are challenging to evaluate exactly in a non-deterministic environment, as all possible outcomes need to be checked and results combined (Littman, Goldsmith, & Mundhenk, 1998).

For these reasons, we chose to evaluate planners by simulation. That is, our plan validator was a server, and individual planning algorithms acted as clients. Planners connected to the validator, received an initial state, and returned an operator/action. This dialog continued until a terminating condition was reached at which point the validator evaluated the performance of the planner during that trajectory from initial state to terminating condition. This entire process was repeated several times and results averaged over the multiple runs.

Because this evaluation scheme blurs the distinction between a planner and an executor, it means that computation is no longer a one-time preprocessing cost, but something integrated with action selection itself. Planner quality, therefore, needs to be a combination of expected utility and running time. For simplicity, we set a time threshold and only allowed reward to be gathered until time ran out. This time threshold was known to competitors, whose planners could take it into consideration when deciding how to balance computation and action. Since we did not know whether participants would reuse results from one trajectory to speed planning in the next, we set an overall time limit that applied to the total of all repetitions of the evaluator for a given domain.

Concretely, in our evaluations, participants were presented with twenty previously unseen problems in PPDDL format. To evaluate each problem, participants connected to one of our evaluation servers (at CMU or Rutgers). The server provided the planner with an initial state and the planner selected and returned an action. This dialogue was iterated until a goal was reached, time ran out, or the planner sent a “done” action. The value obtained for the problem was the goal reward, if the goal was reached, minus the sum of the action costs (if any). For each problem, this procedure was repeated 30 times in a maximum of 15 minutes and the results averaged.

There were two types of problems in the evaluation set: reward problems and goal problems. For goal problems, the success percentage determined a participant's score for the problem (no action costs). In reward problems, every action had a fixed cost. The times to completion were recorded, but not explicitly used for ranking. Planners that completed less than 30 runs in 15 minutes were given a score of 0 for the unfinished runs.

In the design of the server, we believed that the time needed for computation in the planner would far outweigh any possible communication delay. However, in preliminary evaluations, participants—especially those halfway across the world—experienced disruptive levels of latency when evaluating their planners by connecting remotely to the server. Before the formal evaluation, we offered participants local accounts at CMU and nearly all availed themselves of this option.

Håkan L. S. Younes