Because we used only those domains in which there was agreement about the relative difficulty of the problems, our conclusions need not be restricted to being domain-dependent. However, we had only a restricted collection of data points at our disposal, so we must be careful about how we generalise from this picture. On the basis of our analyses, we believe we can make some tentative judgements about how planners scale in pairwise comparisons within the four competition levels.
For the fully-automated planners, it can be observed informally that there is a high degree of consistency in the scaling behaviour of each planner across the problem levels in which it competed. Although we cannot draw overall conclusions from the data set with a high level of confidence, we can observe that FF exhibits the best scaling behaviour at the levels in which it competed, and that LPG exhibits the best scaling behaviour at the temporal levels. It should be remembered that we did not perform single-domain comparisons; these might be interesting from the point of view of exploring domain-specific scaling behaviour, but we felt that such results would be interesting curiosities rather than anything that could support general conclusions.
The hand-coded planners also show a high degree of cross-level consistency. It can be observed informally that TLPLAN scales much better than SHOP2 at all levels, whereas it scales only marginally better than TALPLANNER in the STRIPS domains and not significantly better at any other level. TALPLANNER scales better than SHOP2 at all levels in which they both competed. It can be seen that SHOP2 does not scale well relative to its competitors, although it should be remembered that the plans produced by SHOP2 are of superior quality in some domains.
Formally, the tables allow us to draw specific conclusions about the relative scaling behaviours of specific pairs of planners, within specific problem levels, at the 0.05 significance level.
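To illustrate what a pairwise conclusion at the 0.05 level involves, the following sketch applies a simple two-sided sign test to a pair of planners. This is not the test underlying the tables above, and the win counts are entirely hypothetical; it merely shows how a pairwise comparison can be declared significant at the 0.05 level under the null hypothesis that each planner is equally likely to scale better on any given problem.

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test p-value: the probability, under a fair-coin
    null hypothesis, of a win/loss split at least as extreme as the one
    observed (ties are assumed to have been dropped beforehand)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # One tail: P(X >= k) for X ~ Binomial(n, 0.5); double for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical illustration: planner A scales better than planner B on
# 14 of 16 commonly attempted problems in some level.
p = sign_test_p(14, 2)
print(p)           # approx. 0.0042
print(p < 0.05)    # significant at the 0.05 level
```

An even split (for example, 8 wins against 8) yields a p-value of 1.0, so no conclusion about relative scaling could be drawn from such data.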