To rank the problems in order of difficulty we use the results obtained from the bootstrapping experiment described in Section 6. Our rankings are level dependent, so we looked at scaling within the four problem levels and recorded the results separately. We do not attempt to combine these results into a single overall conclusion about scaling -- we recognize that different planners scale better at some problem levels than at others, and that no single planner can therefore emerge as scaling best overall.

We only compare two planners if they agreed about difficulty in at least two domains. This gives us, in each case where a comparison is made, a data set of more than 30 points. Where the planners did not agree in at least two domains we conclude that there is insufficient agreement between them about what constitutes problem difficulty for it to be possible to measure their relative scaling behaviours in a meaningful way.

To perform a comparison between two planners we rank the problems in the data set in order of agreed difficulty and then rank the differences between the performances of the planners on these problems. We then explore whether the ranking of the differences is correlated with the ranking of the problems according to their difficulty. We use ranks because we cannot make assumptions about the shapes of the underlying problem distributions or the functions that truly describe the performances of the planners being compared, so our results are robust with respect to these factors.

Given two planners, and , a positive correlation between the rankings of the differences in values between and and the problem difficulty ranking means that the difference in performance between and (that is, performance of minus performance of ) is increasing as the problems become more difficult. If the curve of is increasing faster scales better than . A negative correlation means that scales better than . A zero (or near-zero) correlation means that the scaling behaviour of the two planners is insignificantly different. We use Spearman's rank correlation test (see Appendix C) to identify the critical value required for confidence at the 0.05 level.

We restrict our attention to the planners that solved the most problems overall in the two categories. Thus, the fully-automated planners we compared were FF, LPG, MIPS, VHPOP and SAPA. We consider all pairs of the hand coded planners. We do not perform any cross-category tests as it is evident from the raw data that the hand coded planners exhibit better scaling behaviour than any of the fully-automated planners.