We have analyzed the effects of experiment design decisions in empirical comparisons of planners and made some recommendations for ameliorating the effects of these decisions. Most of the recommendations are common-sense suggestions for improving the current methodology.
Expanding beyond the current methodology will require at least two substantive changes. First, the field needs to question whether we should be trying to show performance on planning problems in general. A shift from general comparisons to focused comparisons (on problem class or mechanism or on hypothesis testing) could produce significant advances in our understanding of planning.
Second, the benchmark problem sets require attention. Many of the problems should be discarded because they are too simple to show much. The domains are far removed from real applications. It may be time to revisit testbeds. For example, several researchers in robotics have constructed an interactive testbed for comparing motion planning algorithms [Piccinocchi et al. 1997]. The testbed consists of a user interface for defining new problems, a collection of well-known algorithms, and a simulator for testing algorithms on specific problems. Thus, users can design their own problems and compare the performance of various algorithms (including their own) on them via a web site. Such a testbed affords several advantages over the current paradigm of static benchmark problems and developer-conducted comparisons, in particular replicability and extendability of the test set. Alternatively, challenging problem sets can be developed by modifying deployed applications [Wilkins & desJardins 2001, Engelhardt et al. 2001].
In recent years, the planning community has significantly increased the size of planning problems that can be solved in reasonable time and has advanced the state of the art in empirical comparison of our systems. To interpret the results of empirical comparisons and understand how they should motivate further development in planning, the community needs to understand the effects of the empirical methodology itself. The purpose of this paper is to further that understanding and initiate a dialogue about the methodology that should be used.
Acknowledgments

This research was partially supported by a CAREER award from the National Science Foundation, IRI-9624058, and by a grant from the Air Force Office of Scientific Research, F49620-00-1-0144. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. We are most grateful to the reviewers for their careful reading of and well-considered comments on the submitted version; we hope we have done justice to their suggestions.