The results support rejection of the null hypothesis in almost all cases. We can therefore adopt the alternative hypothesis: in many cases, planners do agree on the relative difficulties of problem instances within a given domain/level combination.
During the competition we observed that TALplanner is at a disadvantage relative to the other hand-coded planners, in terms of comparative speed, when running on small problems. This is probably because the Java virtual machine start-up time becomes significant relative to the actual solution time on small instances. The effects of this start-up time are visible in the tables. Note that, in the domain/level combinations in which TALplanner competed (STRIPS, SIMPLETIME and TIME), we see a low level of agreement amongst the hand-coded planners on the small problems (except in the case of the Rovers domain). This is not because TALplanner disagrees with the other planners about the ranking of the actual problems, but because the problems are small enough that the variability in start-up time adds noise to the ranking and obscures the true picture of relative problem difficulty. With the set of large problems this anomaly disappears -- the problems are sufficiently challenging that the Java start-up time becomes insignificant -- and a high level of agreement over ranking is obtained. Interestingly, the hand-coded planners show a consistently high level of agreement about the ranking of Rovers problems. The fact that this does not emerge in the fully-automated set may be due to the larger number of judges in the fully-automated category.
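The notion of agreement among "judges" ranking a common set of items is standardly quantified by Kendall's coefficient of concordance (W), which seems to be the kind of statistic at issue here; the following is a minimal sketch of its computation, assuming untied rankings (the function name `kendall_w` and the example data are illustrative, not taken from the competition results):

```python
def kendall_w(rankings):
    """Kendall's coefficient of concordance for m judges ranking n items.

    rankings: list of m lists, each a permutation of the ranks 1..n
    (no ties). Returns W in [0, 1], where 1 means perfect agreement
    among the judges and values near 0 mean no agreement.
    """
    m = len(rankings)          # number of judges (planners)
    n = len(rankings[0])       # number of items (problem instances)
    # Total rank assigned to each item, summed over all judges.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    # Under no agreement, each rank sum is close to its mean m(n+1)/2.
    mean = m * (n + 1) / 2
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    # Normalise by the maximum possible sum of squared deviations.
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical planners that rank five problems identically:
print(kendall_w([[1, 2, 3, 4, 5]] * 3))  # → 1.0
```

Note how the statistic is insensitive to absolute solution times: only the orderings matter, which is why noise in small-instance timings (such as JVM start-up overhead) can depress W without reflecting genuine disagreement about problem difficulty.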