We use sufficiently large samples that the T-distribution, used by the Wilcoxon test, is approximately normal. We therefore quote Z-values to indicate the significance in the differences between the mean performances of the paired planners. We do not compare all pairs of planners: in some cases the superiority of a planner over another is so clear from examination of the raw data that statistical testing is obviated.

The Wilcoxon test tells us when one planner performs consistently better than another and whether this consistency is statistically significant. It does *not* tell us how much better one planner is than another. It could be that there is very little difference in performance -- perhaps the discrepancy can be accounted for by mere implementation differences. The consistency with which the difference occurs determines whether there is a statistically significant discrepancy between the performances of the planners. One planner can perform consistently better than another even though the other wins some of the time. For example, in a comparison between A and B, planner A might win more frequently than B because it is faster at solving a subset of problems in the set even though it is much slower at solving another subset of the problems. Provided the subset won by B is sufficiently large the greater number of wins due to A will not emerge as significant. For example, in a set of 10 problems ranked according to the differences in performance, if the first 7 problems in the ranking are won by A and the last 3 by B, A will obtain a score of 28 and B a score of 27. In this case no significant difference emerges *regardless* of the magnitude of the difference between A and B in the last three problems (see Appendix C). The rank-sum approach has the benefit of allowing large numbers of small wins to outweigh small numbers of large wins. This is desirable because a large win in a few problems is not an indication of overall better performance. We are interested in trying to measure the consistency of domain-independent behaviour and therefore in comparing the consistency of performance across domains given that planners perform differently within domains.

The size of the win need not indicate the complexity of the underlying problem so this does not allow us to make judgements about scalability. Essentially the test reveals consistent patterns in performance across a range of problems. We consider this to be of interest because knowing that one planner is significantly consistently better than another helps us make an objective judgement about which of the two planners is performing better across a varied problem set. If there is variability in the performances so that no consistent pattern emerges, it is very hard -- perhaps impossible -- to make this judgement objectively.

In a few cases, comparison using the Wilcoxon test does not lead to any statistically significant conclusions, but the proportion of wins by one of the planners is higher than is consistent with the null hypothesis of equal performance. This can be tested using a Z-test for a proportion (see Appendix C). Where this test yields a significant result we report it, as described below. This test is less informative than the Wilcoxon test as it does not take into account how the wins are distributed through the problem set.

In performing pairwise comparisons we must deal with cases where a planner did not solve a problem. We assign infinitely bad speed to the planner in these cases, ensuring that maximum possible benefit is given to planners solving problems, even very slowly. This methodology is valid because the Wilcoxon test is based on rank so the effect is simply to push the unsolved cases to the extreme of the ranking. In the case of quality we perform an initial collection of tests using infinitely bad quality for the unsolved problems. It is somewhat difficult to decide what it means to compare a true quality result with the infinitely bad quality assigned when a planner produced no solution. Our conclusion is that this comparison may not be informative enough, so in addition we perform the Wilcoxon test on just those cases where both planners produced a plan. We refer to these cases as *double hits*.