
Problem Assumption 1: To What Extent Is Performance of General Purpose Planners Biased Toward Particular Problems/Domains?

Although most planners are developed to be general purpose, the competitions and previous studies have shown that different planners excel on different domains and problems. Unfortunately, the community does not yet have a good understanding of why a planner does well on a particular domain. We studied the impact of problem selection on performance in two ways.

First, we assessed whether performance might be positively biased toward problems tested during development. Each developer was asked to indicate which domains they had used during development. We then compared each planner's performance on its development problems (i.e., the development set) to the remaining problems in the complete test set (the rest). For each planner, we ran a 2x2 contingency test comparing the number of problems solved versus failed in the development and rest sets. We included only the numbers solved and failed in the analysis because timed-out problems made no difference to the results.

The results of this analysis are summarized in Table 3; Figure 1 graphically displays the ratio of successes to failures for the development and other problems. All of the planners except C performed significantly better on their development problems. This suggests that these planners have been tailored (intentionally or not) for particular types of problems and that they will tend to do better on test sets biased accordingly. For example, one of the planners in our set, STAN, was designed with an emphasis on logistics problems [Fox & Long, 1999].

Table 3: Results comparing outcome on development versus other problems.

              Development         Rest
    Planner   Sol.   Fail    Sol.    Fail     Stat.      P
    A          48     56      207    1026     51.70   0.001
    B          42     34      226     929     51.27   0.001
    C          30      0      549      16      0.13   0.722
    G          43     35      233     924     49.56   0.001
    H          52      9      234     655     91.41   0.001
    I         113     20      328     920    187.72   0.001
    J         114     24      388     949    157.62   0.001
    K          37     56      203     987     27.82   0.001
    L          63     32      358     846     52.13   0.001
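The paper does not name the specific 2x2 test, but the statistics reported in Table 3 are consistent with a continuity-corrected (Yates) chi-square on the solved/failed counts. A minimal sketch, checked against planner A's counts:

```python
def yates_chi2(a, b, c, d):
    """Continuity-corrected (Yates) chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Planner A: development 48 solved / 56 failed, rest 207 solved / 1026 failed.
print(round(yates_chi2(48, 56, 207, 1026), 2))  # matches Table 3's 51.70
```

The same function reproduces planner C's non-significant 0.13 from its 30/0 versus 549/16 counts.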

The above analysis introduces several biases of its own. The developers tended to give us short lists that probably under-represent what they actually used, and the set of domains used during development is a moving target rather than the fixed set this analysis assumes. The set of problems included in experiments for publication may be different still. Consequently, for the second part of the study, we broadened the question to determine the effect of different subsets of problems on performance. For each of 10 trials, we randomly selected n domains (and their companion problems) to form the problem set. We counted how many of these problems each planner could solve and then ranked the planners by this count. Thus, for each value of n, we obtained 10 planner rankings. We focused on rankings of problems solved for two reasons. First, each domain includes a different number of problems, so the total number of problems varies across the trials. Second, relative ranking gets to the heart of whether one planner might be considered an improvement over another.
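The sampling-and-ranking procedure for one trial can be sketched as follows. The domain and planner data here are hypothetical, and since the study does not specify how ties in problems solved were ordered, equal counts are broken arbitrarily:

```python
import random

def trial_ranking(domains, solved, n, rng):
    """One trial: sample n domains, count how many of their problems each
    planner solves, and rank planners by that count (rank 1 = most solved).

    domains: dict mapping domain name -> list of problem names
    solved:  dict mapping planner name -> set of problems it can solve
    """
    sample = rng.sample(sorted(domains), n)
    problems = [p for d in sample for p in domains[d]]
    counts = {pl: sum(p in solved[pl] for p in problems) for pl in solved}
    ordered = sorted(counts, key=counts.get, reverse=True)  # ties broken arbitrarily
    return {pl: i + 1 for i, pl in enumerate(ordered)}

# Hypothetical data: two domains, two planners.
domains = {"logistics": ["l1", "l2", "l3"], "blocks": ["b1", "b2"]}
solved = {"X": {"l1", "l2", "l3"}, "Y": {"b1"}}
print(trial_ranking(domains, solved, 2, random.Random(0)))  # -> {'X': 1, 'Y': 2}
```

Repeating this 10 times per value of n yields the 10 rankings analyzed below.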

We tested values of 5, 10, 20, and 30 for n (30 is half of the domains at our disposal). To give a sense of the variability in problem-set size, at n = 5, the most problems solved in a trial varied from 11 to 64. To assess the changes in rankings across the trials, we computed rank dominance for all pairs of planners; rank dominance is defined as the number of trials in which planner x's rank was lower (better) than planner y's (ties count toward neither planner). The 13 planners in our study yield 78 dominance pairings. If the relative ranking between two planners were stable, one would expect one planner to always dominate the other, i.e., to have a rank dominance of 10.
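Rank dominance for a pair follows directly from its ten per-trial ranks. A sketch with hypothetical rank lists, taking the larger of the two planners' win counts as the pair's value:

```python
def pair_rank_dominance(ranks_x, ranks_y):
    """Rank dominance of a pair: the larger of the two planners' counts of
    trials won, where a lower rank wins and ties count toward neither."""
    x_wins = sum(rx < ry for rx, ry in zip(ranks_x, ranks_y))
    y_wins = sum(ry < rx for rx, ry in zip(ranks_x, ranks_y))
    return max(x_wins, y_wins)

# Hypothetical ranks over 10 trials:
print(pair_rank_dominance([1] * 10, [2] * 10))      # 10: one planner always dominates
print(pair_rank_dominance([1, 2] * 5, [2, 1] * 5))  # 5: each planner wins half
print(pair_rank_dominance([1] * 6 + [2] * 4,
                          [1] * 6 + [3] * 4))       # 4: six ties hold the maximum below five
```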

Table 4 shows the number of pairs having each value (0-10) of rank dominance for the four values of n. For a given pair, we used the higher of the two counts as the pair's rank dominance; e.g., if one planner always has the lower rank, the pair's rank dominance is 10, and if each wins five trials, it is five. Because ties count toward neither planner, the two counts need not sum to ten, so the pair's rank dominance can be less than five. The data suggest that even when picking half of the domains, the rankings are not completely stable: in 56% of the pairings, one planner always dominates, but 22% have a 0.3 or greater chance of switching relative ranking. The values degrade as n decreases, with only 27% always dominating for n = 5.

Table 4: Rank dominance counts for 10 samples of domains with domain sizes (n) of five through 30.

                        Rank Dominance
     n    0   1   2   3   4   5   6   7   8   9  10   Total Pairs
     5    0   1   2   0   5   7  10   4  10  18  21        78
    10    0   3   0   0   4  10   6   7   5  23  20        78
    20    0   0   0   0   1   3   8   7  11   8  40        78
    30    0   0   0   0   1   1   9   6   9   8  44        78
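The summary percentages quoted above follow directly from the Table 4 counts; for example, taking the n = 30 and n = 5 rows:

```python
# Counts of pairs at each rank-dominance value 0..10, from Table 4.
n30 = [0, 0, 0, 0, 1, 1, 9, 6, 9, 8, 44]
n5 = [0, 1, 2, 0, 5, 7, 10, 4, 10, 18, 21]
pairs = sum(n30)  # 78 dominance pairings

always = 100 * n30[10] / pairs        # one planner always dominates
switchy = 100 * sum(n30[:8]) / pairs  # rank dominance of 7 or less
print(round(always), round(switchy), round(100 * n5[10] / sum(n5)))
# -> 56 22 27
```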
