First, we assessed whether performance might be positively biased toward problems tested during development. Each developer6 was asked to indicate which domains they used during development. We then compared each planner's performance on their development problems (i.e., the development set) to its performance on the problems remaining in the complete test set (rest). We ran 2x2 contingency tests comparing the number of problems solved versus failed in the development and rest sets. We included only the numbers solved and failed in the analysis because timed-out problems made no difference to the results7.
The results of this analysis are summarized in Table 3; Figure 1
graphically displays the ratio of successes to failures for the
development and other problems. All of the planners except C
performed significantly better on their development problems. This
suggests that these planners have been tailored (intentionally or not)
for particular types of problems and that they will tend to do better
on test sets biased accordingly. For example, one of the planners in
our set, STAN, was designed with an emphasis on logistics
problems [Fox Long1999].
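The development-versus-rest comparison can be illustrated with a small sketch. The text says only that 2x2 tests were run, so the Pearson chi-square statistic used here is an assumption; the function name and the counts are hypothetical and are not taken from Table 3.

```python
import math

def chi_square_2x2(dev_solved, dev_failed, rest_solved, rest_failed):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    Assumption: the source does not name the statistic it used.
    Every row and column total must be non-zero.
    """
    table = [[dev_solved, dev_failed], [rest_solved, rest_failed]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if solving were independent of the problem set.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts for one planner (not data from the study).
stat, p_value = chi_square_2x2(dev_solved=40, dev_failed=10,
                               rest_solved=60, rest_failed=90)
dev_ratio = 40 / 10    # successes per failure, development set
rest_ratio = 60 / 90   # successes per failure, remaining problems
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

In this made-up case the success-to-failure ratio is much higher on the development set and the test rejects independence, which is the pattern the text reports for all planners except C.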
We tested values of 5, 10, 20 and 30 for n (30 is half of the domains at our disposal). To give a sense of the variability in size, at , the most problems solved in a trial varied from 11 to 64. To assess the changes in rankings across the trials, we computed rank dominance for all pairs of planners. The rank dominance of planner x over planner y is defined as the number of trials in which x's rank was lower than y's; a trial in which the two tie counts toward neither planner. The 13 planners in our study yield 78 dominance pairings. If the relative ranking of two planners is stable, then one would expect one of them to always dominate the other, i.e., to have the maximum rank dominance of 10.
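The definition of rank dominance can be sketched as follows. The data layout (one dictionary of ranks per trial), the function name, and the three planners with their ranks are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def rank_dominance(ranks_by_trial, x, y):
    """Number of trials in which x's rank is lower than y's.

    Ties add nothing to either direction.
    """
    return sum(1 for ranks in ranks_by_trial if ranks[x] < ranks[y])

# Hypothetical ranks for three planners over 10 trials (not study data).
trials = (
    [{"A": 1, "B": 2, "C": 3}] * 6    # B ranked ahead of C
    + [{"A": 1, "B": 3, "C": 2}] * 3  # C ranked ahead of B
    + [{"A": 1, "B": 2, "C": 2}]      # B and C tie
)

for x, y in combinations(["A", "B", "C"], 2):
    print(x, y, rank_dominance(trials, x, y), rank_dominance(trials, y, x))

# 13 planners give 13 * 12 / 2 = 78 unordered pairs.
n_pairs = len(list(combinations(range(13), 2)))
```

Here A dominates B in all 10 trials, which is the stable case, while B and C split 6 to 3 with one tie.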
Table 4 shows the number of pairs having each value (0-10) of rank dominance for the four values of n. For a given pair, we used the higher of the two counts as the rank dominance for the pair, e.g., if one planner always has a lower rank, then the pair's rank dominance is 10, and if both have five, then it is five. Because of ties, the maximum can be less than five. The data suggest that even when picking half of the domains, the rankings are not completely stable: in 56% of the pairings, one always dominates, but 22% have a 0.3 or greater chance of switching relative ranking. The values degrade as n decreases, with only 27% always dominating for .
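A tally of the kind reported in Table 4 could be computed along these lines. This is a sketch under assumptions: the input format (a pair of directional counts per planner pair), the function names, and the example counts are all hypothetical.

```python
from collections import Counter

def pair_value(x_over_y, y_over_x):
    """Rank dominance of a pair: the higher of the two directional counts."""
    return max(x_over_y, y_over_x)

def tally(pairs, trials=10):
    """Histogram of pair values 0..trials and the share that always dominate."""
    values = [pair_value(a, b) for a, b in pairs]
    counts = Counter(values)
    histogram = [counts[v] for v in range(trials + 1)]
    stable_share = counts[trials] / len(values)
    return histogram, stable_share

# Hypothetical directional counts; (4, 4) has two tied trials,
# so its pair value falls below five.
pairs = [(10, 0), (5, 5), (4, 4), (7, 3), (0, 10)]
histogram, stable_share = tally(pairs)
print(histogram, stable_share)
```

In this example two of the five pairs have a value of 10, so the share that always dominate is 0.4.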