In the experiment used to address the first question, the null hypothesis is that the planner finds all problems at a specific level equally difficult across all domains. To test this we constructed, using a bootstrapping technique, ten thousand samples of twenty values from the collection of all timings obtained from domains at the appropriate level. The values were selected at random from the performances of planners competing in the domains, one value for each of a collection of randomly selected problems. For example, if problem one was chosen from DriverLog, problem two from Depots, problem three from Rovers, problem four from Depots, and so on, then the value associated with problem one would be the timing produced for that problem by a planner selected at random from those that competed in DriverLog. Similarly, the value associated with problem two would be taken from a randomly selected planner that competed in Depots, and so on. For each collection of twenty values we plotted the number of problem instances left to solve against time, as above, and computed the area under the resulting curve. This resulted in a sampling distribution of *level-specific* areas. Using these bootstrap samples we checked whether the area calculated for a particular planner-domain-level combination lies at either extreme of this distribution. If it lies in the bottom 2.5% of the distribution we reject the null hypothesis on the grounds that the planner found problems at that level, in that domain, to be significantly *easy*. If it lies in the top 2.5% of the distribution we reject the null hypothesis and conclude that those problems were significantly *hard* for that planner.
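The bootstrap test described above can be sketched roughly as follows. This is an illustrative sketch only, not the code used in the experiments: the nested `timings` layout, the function names, and the treatment of the curve as a step function over solved problems are all assumptions made for the example.

```python
import random
from bisect import bisect_left, bisect_right


def area_left_to_solve(times):
    """Area under the step curve 'problems left to solve' against time.

    Assumption: every sampled problem was solved, so the curve starts at
    len(times) and drops by one at each solution time.
    """
    area, previous, remaining = 0.0, 0.0, len(times)
    for t in sorted(times):
        area += remaining * (t - previous)
        previous = t
        remaining -= 1
    return area


def bootstrap_areas(timings, n_samples=10_000, sample_size=20, rng=None):
    """Build the sampling distribution of areas for one problem level.

    timings is a hypothetical layout: {domain: {problem: [time per planner]}}.
    Each sampled value is one timing, from a randomly chosen planner, for a
    randomly chosen problem.
    """
    rng = rng or random.Random()
    problems = [(d, p) for d, probs in timings.items() for p in probs]
    areas = []
    for _ in range(n_samples):
        sample = []
        for _ in range(sample_size):
            domain, problem = rng.choice(problems)
            sample.append(rng.choice(timings[domain][problem]))
        areas.append(area_left_to_solve(sample))
    return sorted(areas)


def classify(observed_area, sorted_areas, tail=0.025):
    """Two-tailed check of an observed area against the bootstrap distribution."""
    n = len(sorted_areas)
    at_or_below = bisect_right(sorted_areas, observed_area) / n
    at_or_above = (n - bisect_left(sorted_areas, observed_area)) / n
    if at_or_below <= tail:
        return "easy"   # observed area in the bottom 2.5%
    if at_or_above <= tail:
        return "hard"   # observed area in the top 2.5%
    return "not significant"
```

Under these assumptions, the area for a planner's twenty problems in one domain at one level would be computed with `area_left_to_solve` and passed to `classify` together with the output of `bootstrap_areas` for that level.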

In testing the relative hardness of problem levels within a domain (the second question), we performed similar experiments in which, for each planner, the bootstrap samples were obtained by sampling timings from all problem levels within all domains. This resulted in a new sampling distribution, that of the *level-independent* area statistic. The null hypothesis, that the domain/level combination is not an indicator of difficulty, is tested by checking whether the areas computed for planner-domain-level combinations are extreme with respect to this new sampling distribution.
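The only change for the second question is the pool from which timings are drawn. A minimal sketch of that pooling step is given below, assuming a hypothetical `timings_by_level` mapping of the form `{level: {domain: {problem: [time per planner]}}}`; the names are illustrative. It also assumes that every sampled problem was solved, in which case the area under the step curve of problems left to solve equals the sum of the sampled solution times.

```python
import random


def pool_levels(timings_by_level):
    """Flatten all levels and domains into one list of per-problem timing lists."""
    return [
        times
        for domains in timings_by_level.values()
        for problems in domains.values()
        for times in problems.values()
        if times
    ]


def level_independent_areas(timings_by_level, n_samples=10_000,
                            sample_size=20, rng=None):
    """Sampling distribution of the area statistic over all levels and domains."""
    rng = rng or random.Random()
    pooled = pool_levels(timings_by_level)
    areas = []
    for _ in range(n_samples):
        # One timing per randomly chosen problem, from a randomly chosen planner.
        sample = [rng.choice(rng.choice(pooled)) for _ in range(sample_size)]
        # With all problems solved, the step-curve area is the sum of the times.
        areas.append(sum(sample))
    return sorted(areas)
```

An observed planner-domain-level area would then be compared against the tails of this pooled distribution in the same two-tailed manner as for the level-specific test.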