Having described our overall experimental design, we now turn to the analysis of the results. To begin, we sought to re-establish the claims made in our earlier work. Specifically, we wanted first to reconfirm, using a larger data set, that least-cost flaw selection is an effective technique for reducing the size of the search space generated by a POCL planner. We therefore ran an experiment in which we compared the node counts for the five strategies we had earlier studied--LCFR, DUnf, DUnf-LC, UCPOP, and UCPOP-LC--plus one new one, DUnf-Gen, explained below.

The results of this experiment are shown in Figures and . The former is a log-log scatter plot, showing the performance of each of the six strategies on the 33 problems in the basic set. The problems were sorted by the minimal number of nodes generated on them by any of the six strategies. Thus, the left-hand side of the graph includes the problems that at least one of the six strategies found to be relatively easy, while the right-hand side has the problems that were hard for all six strategies. We omitted problems that none of the six strategies was able to solve. The actual number of nodes generated by each strategy is plotted on the Y-axis, against the minimal number of nodes for that problem on the X-axis. LCFR's performance is highlighted with a line connecting its data points. This graph shows that, in general, LCFR generates small search spaces on this problem set, relative to the other strategies in this class. There were only six problems for which LCFR was not within 10% of the minimum. Three of these are in the Get-Paid/UGet-Paid class of problems--including two of the ``hardest'' problems (UGet-Paid3 and UGet-Paid4). We discuss this class of problems further in Section .

An alternative view of the data is given in Figure , which shows the aggregate performance of the six strategies, i.e., the average of their node-count %-overrun on the basic problems. As can be seen, LCFR has the smallest average %-overrun.
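The aggregate measure can be sketched as follows. The strategy names are from the text, but the node counts and helper names below are hypothetical, assuming that %-overrun on a problem is measured relative to the minimal node count achieved by any strategy on that problem:

```python
# Hypothetical sketch of the aggregate measure: for each problem, a
# strategy's %-overrun is how far its node count exceeds the minimal
# count among all strategies on that problem; we then average the
# overruns per strategy. The node counts here are invented.

node_counts = {
    "LCFR":  {"P1": 100, "P2": 210, "P3": 55},
    "DUnf":  {"P1": 150, "P2": 200, "P3": 80},
    "UCPOP": {"P1": 120, "P2": 400, "P3": 55},
}

problems = ["P1", "P2", "P3"]
minima = {p: min(node_counts[s][p] for s in node_counts) for p in problems}

def avg_pct_overrun(strategy):
    # %-overrun on one problem = 100 * (nodes - min_nodes) / min_nodes
    overruns = [
        100.0 * (node_counts[strategy][p] - minima[p]) / minima[p]
        for p in problems
    ]
    return sum(overruns) / len(overruns)

for s in node_counts:
    print(s, round(avg_pct_overrun(s), 1))
```

With these invented numbers, LCFR is at or near the per-problem minimum everywhere, so its average %-overrun is the smallest, which is the shape of the result reported in the figure.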

Figures and
present similar views of the data for the Tileworld
domain, while Figure gives the data for the Trains problems.
On the Trains domain, these six strategies were able to solve only the
easiest problem (Trains1), so we simply show the actual node counts in
Figure . We have omitted two data points,
because they were so extreme that their inclusion on the graph made it
impossible to see the differences among the other strategies:
LCFR and DUnf-Gen with *S*+*OC*+*UC* node selection took 28,218 and
35,483 nodes, respectively, to solve the problem.

For the Tileworld and Trains problems, we observed the same sorts of
interactions between node and flaw selection strategies as were seen
by Gerevini and Schubert. Specifically, LCFR performs relatively
poorly with *S*+*OC* on the Tileworld problems, and it performs very
poorly with *S*+*OC*+*UC* on the Trains problems.
However, when paired with the other node-selection
strategies, LCFR produces the smallest search spaces of any strategy
in this class.

In sum, LCFR does tend to produce smaller search spaces than the other strategies in this class. But a question remains. LCFR uses a least-cost strategy, and a side effect of this is that it will prefer forced flaws, since forced flaws are low-cost flaws. It is therefore conceivable that LCFR's performance is mostly or even fully due to its preference for forced flaws, and not (or not greatly) influenced by its use of a least-cost strategy for unforced flaws. This same hypothesis could explain why DUnf-LC consistently outperforms DUnf, and why UCPOP-LC consistently outperforms UCPOP.

It was to address this issue that we included DUnf-Gen in our experiment. DUnf-Gen is a simple strategy that prefers forced flaws of any kind, and otherwise uses a LIFO regime. We would expect DUnf-Gen and LCFR to perform similarly, since they frequently make the same decision. Specifically, both will select the same flaw in any node in which there is a forced flaw; they will differ when there are only unforced flaws, with DUnf-Gen selecting the most recently introduced flaw and LCFR selecting a least-cost flaw.
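The distinction between the two strategies can be sketched as a pair of selection rules. The flaw representation below is ours, not the planners': a flaw carries a repair count (its cost) and an introduction order, and we treat a flaw as forced when it has at most one repair, so that LCFR's least-cost preference picks it up automatically:

```python
# Hypothetical sketch of the decision rule that distinguishes DUnf-Gen
# from LCFR. A flaw is "forced" when it has at most one possible
# repair; both strategies select such a flaw first, and they diverge
# only when every remaining flaw is unforced.

from dataclasses import dataclass

@dataclass
class Flaw:
    name: str
    cost: int   # number of possible repairs (refinements)
    age: int    # order of introduction; higher = more recent

    @property
    def forced(self):
        return self.cost <= 1

def select_lcfr(flaws):
    # LCFR: always take a least-cost flaw; preferring forced flaws
    # (cost <= 1) falls out of this rule as a side effect.
    return min(flaws, key=lambda f: f.cost)

def select_dunf_gen(flaws):
    # DUnf-Gen: take a forced flaw if one exists; otherwise fall back
    # to LIFO over the unforced flaws (most recently introduced first).
    forced = [f for f in flaws if f.forced]
    if forced:
        return forced[0]
    return max(flaws, key=lambda f: f.age)

flaws = [Flaw("a", 3, 0), Flaw("b", 1, 1), Flaw("c", 2, 2)]
# While a forced flaw ("b") is present, the two strategies agree.
assert select_lcfr(flaws).name == select_dunf_gen(flaws).name == "b"

unforced = [Flaw("a", 3, 0), Flaw("c", 2, 2), Flaw("d", 4, 3)]
# With only unforced flaws they diverge: LCFR takes the least-cost
# flaw "c", while DUnf-Gen takes the most recent flaw "d".
```

The final example is the case the experiment probes: since the strategies only differ on nodes with no forced flaws, any performance gap between them isolates the effect of least-cost versus LIFO selection among unforced flaws.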

In practice, DUnf-Gen's performance closely mimicked LCFR's.
On the basic problem set it did only marginally worse than LCFR. In
fact, it did marginally *better* when we reversed the order in
which the planner adds the preconditions of each new step to the open
list (see Section ). LCFR did somewhat better than
DUnf-Gen on both the Trains and Tileworld problems, regardless of the
order in which the preconditions were added to the open list, but the
extent to which it did better varied.

Thus, the data is inconclusive about the value of using a least-cost
strategy for unforced flaws. LCFR clearly benefits from selecting
forced flaws early (as a side effect of preferring least-cost flaws),
but it *may* not matter whether it continues to use a least-cost
strategy for the unforced flaws. If indeed it is generally sufficient
to use a least-cost strategy only for forced flaws, then ZLIFO's
performance is somewhat less puzzling, since ZLIFO also prefers forced
flaws. However, the puzzle is not completely resolved. After all,
DUnf-Gen, like ZLIFO, prefers forced flaws and then makes
LIFO-based decisions about unforced flaws, and while its performance
is not clearly inferior to LCFR's, neither is it clearly superior.
Even if the use of LIFO for unforced flaws does not obviously increase
the search space, neither does it appear to decrease it.
