Since 1998 the international planning community has held a biennial event to support the direct comparison of planning systems on a changing collection of benchmark planning problems. The benefits of this series of events have been significant: over five years, planning systems have been developed that are capable of solving large and complex problems, using richly expressive domain models and meeting advanced demands on the structure and quality of solutions. The competition series has inspired many advances in the planning research community, as well as an increasingly empirical methodology and a growing interest in the application of planners to real problems.
In this paper we describe the structure, objectives and outcomes of the third competition, which took place in Toulouse in 2002. The competition was co-located with the AI Planning and Scheduling (AIPS) conference, at which a brief report of some of the results achieved by the participating planners was presented. We begin by presenting an overview of the main results as presented at the conference, showing the number of problems attempted and solved by each planner and identifying the competition prize-winners. As in previous years, the competition resulted in the collection of a large data set comprising data points for several different domains. A certain comparative understanding can be obtained by examining the data for the individual domains, but conclusions drawn on this basis cannot be generalised across domains. One of the goals of this paper is to try to reveal some insights that cross the boundaries of the domains and allow some general questions to be answered. These include: Which planners exhibit the most consistent, stable performance across domains? What benefit is obtained by exploiting hand-coded control knowledge? Is there any general agreement over what makes a planning problem hard? Are particular planning approaches best suited to particular kinds of problem domains?
The accepted scientific methodology for addressing such questions is to frame precise hypotheses prior to the collection of the data sets, in order to control for any extraneous variables that might distort the reality with respect to these questions. To date this has not been the practice of the planning community with respect to the competitions. In the third competition we proceeded, as in previous years, by collecting data prior to detailed consideration of the specific questions we wished to answer. The community has not yet agreed that the primary role of the competition is to provide a carefully crafted platform for the scientific investigation of planners: indeed, its main roles so far have been to motivate researchers in the field, to identify new research goals and thereby push forward the research horizons, and to publicise progress to the wider community. However, because competitions have winners there is a natural tendency to draw conclusions from the competition data sets about the state of the art. If these conclusions are not scientifically supported they can be misleading and even erroneous. Therefore there is an argument for trying to combine the two objectives, although admittedly there is a tension between them that might make it difficult to combine them successfully.
Because of the way in which the planning competitions are currently conducted, the analyses we describe in this paper are post hoc. We conducted all of our analyses on the data collected during the competition period: we did not run any further experiments after the competition, because the participants felt it was important that the data they submitted during the competition should comprise the evidence on which they were judged. We have identified a number of analyses which we think provide interesting information to the planning community and, in the following sections, we explore each theme in as rigorous a way as possible within the constraints of the data we have at our disposal. It has been difficult to work with a fixed data set that was collected without precise experimental questions in mind, and we were unable to test many of the hypotheses that occurred to us during our analyses because the data was inappropriate or incomplete. However, despite the limitations of the data set, we believe we have been able to pose and answer some important questions about the comparative performances revealed in the competition. We phrase the objectives of our analyses in terms of null and alternative hypotheses, because this is the standard approach when applying statistical tests. Our approach was partly inspired by the earlier work of Howe and Dahlman. Their work raised the standard to which the evaluation of planners should be held, and compels the planning community to decide whether future planning competitions should be conducted in a way that supports the goals of scientific investigation of progress in the field.
The rest of this paper is organised as follows. We begin by discussing the context -- the competition series itself and the general form of the third competition, including the domains, the competitors and the specific challenges raised. We then briefly summarise the results of the competition before embarking on a detailed post hoc analysis of the competition results. This analysis investigates the relative performances of the planners, the relative difficulties of the problem sets used and the relative scaling behaviours of the competitors across domains and levels. We provide an appendix in which we present summaries of the competing planners and details of the domains used.