The MRCs described in Section 7 demonstrate that the planners do agree, as expected, about the relative difficulty of problem instances within most domain/level combinations. In these cases it is possible to explore the domain- and level-specific scaling behaviour of the planners, which is the subject of this section. We cannot explore scaling behaviour across domains because, as discussed in Section 6, there is little across-the-board agreement concerning the relative hardness of the domains, so we would be unlikely to see agreement in multiple judgments across domain boundaries.
The ideal basis on which to explore scaling behaviour would be a collection of problems with a canonical scaling of difficulty, against which the performance of planners could be compared as they scale to progressively harder problems. Unfortunately, many factors contribute to making problems hard and these do not affect planners uniformly. As a result, there is no canonical measurement of problem difficulty in many domains. Instead, we must determine the relative difficulty of problems by using the planners themselves as judges. This means that we can only consider the relative scaling behaviours of planners when the planners agree on the underlying ordering of the difficulty of problems. Thus, we begin by identifying appropriate sets of problems -- those on which a given pair of planners agree about the relative hardness of problems according to our analysis in Section 7 -- and then proceed to compare the way that each of the planners in the pair scales as the problems increase in difficulty. The first stage of the analysis considers only the order that the two planners impose on the problems within a set, while the second stage examines how performance varies between the two planners as they progress from problem to problem.
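The two stages can be illustrated with a minimal sketch. It is an illustration only, not the test procedure used in this paper: it assumes Spearman's rank correlation as the measure of agreement for the first stage, and a simple step-by-step comparison of performance increases for the second. The function names and the timings are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values ascending (1 = easiest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Stage 1 (assumed measure): Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either planner ties on every problem."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def scaling_steps(times_a, times_b):
    """Stage 2 (assumed comparison): order problems by combined rank, then
    return each planner's increase in time from one problem to the next."""
    ra, rb = average_ranks(times_a), average_ranks(times_b)
    order = sorted(range(len(times_a)), key=lambda i: ra[i] + rb[i])
    steps_a = [times_a[j] - times_a[i] for i, j in zip(order, order[1:])]
    steps_b = [times_b[j] - times_b[i] for i, j in zip(order, order[1:])]
    return steps_a, steps_b

# Hypothetical timings in seconds for five problems; not data from the paper.
planner_a = [0.1, 0.4, 1.0, 2.5, 6.0]
planner_b = [0.2, 0.9, 3.0, 11.0, 40.0]

rho = spearman(planner_a, planner_b)
steps_a, steps_b = scaling_steps(planner_a, planner_b)
# Number of steps on which planner A's time grew by less than planner B's.
a_scales_better = sum(sa < sb for sa, sb in zip(steps_a, steps_b))
print(rho, a_scales_better)
```

In this sketch a high value of `rho` indicates that the pair agrees on the ordering, so that the step comparison is meaningful. Whether a count such as `a_scales_better` is significant would still have to be settled by an appropriate statistical test.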
The hypotheses explored in this section are:
Null Hypothesis: Where planners agree about the difficulty of problems for a given domain/level combination, they exhibit the same scaling behaviour.
Alternative Hypothesis: Where planners agree about the difficulty of problems for a given domain/level combination, they demonstrate different scaling behaviours, and the planner with the better scaling performance can be identified from the data set.
This section is concerned with scaling behaviour within problem sets from specific domain/level combinations for which agreement has already been established, as identified in Section 7.