Planning performance can be measured in terms of the speed of solution of the problems and the number and quality of the solutions found. Other metrics might also be identified. There are different planning architectures and heuristic evaluation functions as well as different kinds of domains and problems. The state of the art is represented by specific exemplars of these architectures and heuristic functions and it is interesting to explore the suitability of different architectures for use on different domains, the scaling behaviour of particular approaches, the comparative performance of different planning systems, etc. We address some of these issues in the following sections.
We perform three collections of analyses to ascertain comparative performance based on consideration of raw speed and quality data, the extent to which domain influenced planner behaviour and the scaling behaviour of the planners. The first collection is a comparison between the planners based on their raw competition performance. We analyse the data from the point of view of a consumer interested in ranking the planners according to speed and plan quality. These experiments are aimed at answering coarse level questions of the form: which planner should I buy? In asking questions such as this we are trying to arrive at a general basis for comparison. Of course, our investigation of this question is constrained by the metrics used in the competition. Furthermore, the trade-off one makes between time and quality depends on the context in which a planner might be deployed. Therefore, we cannot combine the judgements of relative speed and relative plan quality performance to determine unequivocally which planner to buy. We take as a basic assumption that potential users of domain-independent planning technology are interested primarily in broad coverage with good behaviour across a wide variety of domains, rather than in restricted coverage with spectacular behaviour in just a few domains. Our raw performance analyses are based mainly on the Wilcoxon rank-sum matched-pairs test (see Appendix C), as explained below. An advantage of this test is that it is non-parametric, meaning that it does not rely on assumptions about the shapes or properties of the underlying population distributions. It is also robust to outliers.
The second collection of experiments is concerned with identifying whether there were domains that were significantly easier (or harder). We perform these experiments at each of the levels of problems used in the competition to determine whether there is agreement amongst the planners about the difficulty of the problems. In part, this assists us in going on to explore scalability issues but it also allows us to determine whether the problem set presented at the competition contained interesting challenges. Our third collection of experiments compares the way in which planners scale on problem sets in which they agree about problem difficulty.