First, we assessed whether performance might be positively biased toward
problems tested during development. Each developer^{6} was asked to indicate which domains they
used during development. We then compared each planner's performance on its
development problems (i.e., the development set) to its performance on the
problems remaining in the complete test set (the rest). We ran tests on 2x2
contingency tables comparing the number of problems solved versus failed in
the development set and in the rest. We included only the numbers solved and
failed in the analysis, as timed-out problems made no difference to the
results^{7}.
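
As an illustration of such a 2x2 comparison, the sketch below computes a Pearson chi-square statistic for one planner. The choice of the chi-square test (one degree of freedom, no continuity correction), the function name `chi_square_2x2`, and the counts are all assumptions made for the example; they are not taken from the study.

```python
import math


def chi_square_2x2(dev_solved, dev_failed, rest_solved, rest_failed):
    """Pearson chi-square test on a 2x2 table (assumed test, 1 degree of freedom).

    Rows are the development set and the rest; columns are solved and failed.
    Every row and column total must be non-zero.
    """
    table = [[dev_solved, dev_failed], [rest_solved, rest_failed]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if set membership and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts for a single planner (not data from the study).
stat, p = chi_square_2x2(dev_solved=40, dev_failed=10, rest_solved=30, rest_failed=30)
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

A small p-value here would indicate that the solved/failed split differs between the development set and the rest for that planner.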

The results of this analysis are summarized in Table 3; Figure 1
graphically displays the ratio of successes to failures for the
development and other problems. All of the planners except C
performed significantly better on their development problems. This
suggests that these planners have been tailored (intentionally or not)
for particular types of problems and that they will tend to do better
on test sets biased accordingly. For example, one of the planners in
our set, STAN, was designed with an emphasis on logistics
problems [Fox Long1999].

We tested values of 5, 10, 20 and 30 for *n* (30 is half of the
domains at our disposal). To give a sense of the variability in size,
at , the most problems solved in a trial varied from 11 to 64. To
assess the changes in rankings across the trials, we computed
*rank dominance* for all pairs of planners; rank dominance is defined
as the number of trials in which planner *x*'s rank was lower than
planner *y*'s (note: ties would count toward neither planner). The
13 planners in our study yield 78 dominance pairings. If the
relative ranking between two planners is stable, then one would expect
one to always dominate the other, i.e., to have a rank dominance of 10,
the number of trials.
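
The definition of rank dominance can be sketched in a few lines. The planner names, ranks, and the helper `rank_dominance` below are hypothetical and serve only to illustrate the counting rule, including the treatment of ties.

```python
from itertools import combinations


def rank_dominance(ranks_by_trial, x, y):
    """Number of trials in which planner x's rank is lower than planner y's.

    Trials in which the two ranks are equal count toward neither planner.
    """
    return sum(1 for trial in ranks_by_trial if trial[x] < trial[y])


# Hypothetical ranks for three planners over three trials (not data from the study).
trials = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 1, "C": 3},  # A and B tie in this trial
]

planners = sorted(trials[0])
dominance = {
    (x, y): (rank_dominance(trials, x, y), rank_dominance(trials, y, x))
    for x, y in combinations(planners, 2)
}

for (x, y), (x_over_y, y_over_x) in dominance.items():
    print(f"{x} vs {y}: {x_over_y} to {y_over_x}")

# 13 planners give 13 * 12 / 2 = 78 unordered pairs.
print(len(list(combinations(range(13), 2))))
```

In this toy example A and B each dominate once because the tied trial is dropped, whereas both A and B dominate C in all three trials.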

Table 4 shows the number of pairs having each
value (0-10) of rank dominance for the four values of *n*.
For a given pair, we used the highest number as the rank dominance
for the pair, e.g., if one always has a lower rank, then the pair's
rank dominance is 10 or if both have five, then it is five. Because of
ties, the maximum can be less than five. The data suggest that even
when picking half of the domains, the rankings are not completely
stable: in 56% of the pairings, one always dominates, but 22% have a
0.3 or greater chance of switching relative ranking. The values
degrade as *n* decreases, with only 27% always dominating for .
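
The tabulation behind such a summary can be sketched as follows. The directional counts are invented for illustration, and reading "a 0.3 or greater chance of switching" as "a pair rank dominance of at most 7 out of 10 trials" is our assumption for the example, not a definition taken from the study.

```python
from collections import Counter

N_TRIALS = 10


def dominance_histogram(directional_counts, n_trials=N_TRIALS):
    """Count the pairs having each value (0..n_trials) of pair rank dominance.

    Each entry of directional_counts is (x over y, y over x) for one pair;
    the pair's rank dominance is the larger of the two numbers.
    """
    counts = Counter(max(a, b) for a, b in directional_counts)
    return [counts.get(value, 0) for value in range(n_trials + 1)]


# Hypothetical directional counts for five pairs (not data from the study).
directional = [(10, 0), (5, 5), (7, 3), (4, 4), (0, 10)]

hist = dominance_histogram(directional)
always_dominating = hist[N_TRIALS] / len(directional)
# Assumed reading: the dominant ordering fails to hold in at least 3 of the 10 trials.
may_switch = sum(hist[:8]) / len(directional)

print(hist)
print(always_dominating, may_switch)
```

The pair `(4, 4)` shows how ties can push a pair's rank dominance below five.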