
Experimental Design

To conduct our comparison, we implemented a set of flaw selection strategies in UCPOP v.4. The table below lists the strategies we implemented. Except for LCFR-DSep and DUnf-Gen, which are discussed later, all of the implemented strategies were described in an earlier section.

 
  Strategy     Specification
  UCPOP        {n,s}LIFO / {o}LIFO
  UCPOP-LC     {n,s}LIFO / {o}LC
  DSep         {n}LIFO / {o}LIFO / {s}LIFO
  DSep-LC      {n}LIFO / {o}LC / {s}LIFO
  DUnf         {n,s}_0 LIFO / {n,s}_1 LIFO / {o}LIFO / {n,s}_{2-infinity} LIFO
  DUnf-LC      {n,s}_0 LIFO / {n,s}_1 LIFO / {o}LC / {n,s}_{2-infinity} LIFO
  DUnf-Gen     {n,s,o}_0 LIFO / {n,s,o}_1 LIFO / {n,s,o}LC
  LCFR         {n,o,s}LC
  LCFR-DSep    {n,o}LC / {s}LC
  ZLIFO        {n}LIFO / {o}_0 LIFO / {o}_1 New / {o}_{2-infinity} LIFO / {s}LIFO

  Table: Implemented Flaw Selection Strategies
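
To make the table's notation concrete, the following sketch shows one way a flaw selection specification could be represented and applied. The class and function names are ours, for illustration only, and are not part of UCPOP; the flaw types 'n' (non-separable threats), 's' (separable threats), and 'o' (open conditions) and the LIFO and LC orderings follow the notation introduced earlier.

  from dataclasses import dataclass
  from typing import Callable, Optional, Sequence

  @dataclass
  class Flaw:
      kind: str          # 'n', 's', or 'o'
      refinements: int   # number of refinements (repairs) currently available
      age: int           # insertion order, used for LIFO ordering

  @dataclass
  class Rule:
      kinds: frozenset                 # flaw types covered, e.g. frozenset('ns')
      order: Callable[["Flaw"], int]   # tie-breaking order among matched flaws (smaller first)
      min_count: int = 0               # lower bound on refinement count (the subscript)
      max_count: Optional[int] = None  # None means no upper bound (the "infinity" case)

  def LIFO(flaw):                 # most recently added flaw first
      return -flaw.age

  def LC(flaw):                   # least cost: fewest refinements first
      return flaw.refinements

  # DUnf-LC from the table: {n,s}_0 LIFO / {n,s}_1 LIFO / {o}LC / {n,s}_{2-infinity} LIFO
  DUNF_LC = [
      Rule(frozenset('ns'), LIFO, 0, 0),
      Rule(frozenset('ns'), LIFO, 1, 1),
      Rule(frozenset('o'),  LC),
      Rule(frozenset('ns'), LIFO, 2, None),
  ]

  def select_flaw(strategy: Sequence[Rule], flaws: Sequence[Flaw]) -> Optional[Flaw]:
      """Return the flaw picked by the first rule in the strategy that matches any flaw."""
      for rule in strategy:
          matched = [f for f in flaws
                     if f.kind in rule.kinds
                     and f.refinements >= rule.min_count
                     and (rule.max_count is None or f.refinements <= rule.max_count)]
          if matched:
              return min(matched, key=rule.order)
      return None

Under this reading, a strategy is simply tried rule by rule: every flaw matching an earlier rule is handled before any flaw matching a later one, and ties within a rule are broken by that rule's ordering.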

 

We tested all the strategies on three problem sets, also used in our earlier work [Joslin & Pollack, 1994] and by Gerevini and Schubert (1996):

  1. The Basic Problems, 33 problems taken from the test suite distributed with the UCPOP system. These problems come from a variety of domains, including the blocks world, the Monkeys and Bananas problem, Pednault's (1988) briefcase-and-office problem, and Russell's (1992) tire-changing world.
  2. The Trains Problems, three problems taken from the TRAINS transportation domain [Allen, Schubert, et al., 1995].
  3. The Tileworld Problems, seven problems taken from the Tileworld domain [Pollack & Ringuette, 1990].

We ran each strategy on each problem twice. The first time, we imposed a node limit: 10,000 nodes for the Basic problems and 100,000 nodes for the Trains and Tileworld problems. The second time, we imposed a time limit: 100 seconds for the Basic problems and 1,000 seconds for the Trains and Tileworld problems.
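
For concreteness, the two runs per strategy/problem pair could be organized along the following lines. This is a hypothetical harness sketch; run_planner and the problem-set names are assumptions, not part of our implementation, and only the limits match those stated above.

  NODE_LIMITS = {'basic': 10_000, 'trains': 100_000, 'tileworld': 100_000}
  TIME_LIMITS = {'basic': 100,    'trains': 1_000,   'tileworld': 1_000}   # seconds

  def run_all(strategies, problem_sets, run_planner):
      """Run every strategy on every problem twice: once under a node limit,
      once under a time limit, with the limits depending on the problem set."""
      results = []
      for set_name, problems in problem_sets.items():
          for problem in problems:
              for strategy in strategies:
                  results.append(run_planner(strategy, problem,
                                             node_limit=NODE_LIMITS[set_name]))
                  results.append(run_planner(strategy, problem,
                                             time_limit=TIME_LIMITS[set_name]))
      return results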

Gerevini and Schubert experimented with several different node selection strategies for the Trains and Tileworld domains; to facilitate comparison, we used the same node selection strategies that they did. For the Basic problems, we used S+OC.

In reporting our results, we make use not only of raw counts of nodes generated and computation time taken (in seconds), but also of a measure of how badly a strategy performed on a given problem or set of problems. We call this measure %-overrun and compute it as follows. Let m be the minimum node count on a given problem over all the strategies we tested, and let c be the node count for a particular strategy S. Then

%-overrun(S) = [(c - m) / m] * 100

Thus, for example, if the best strategy on a given problem generated 100 nodes, then a strategy that generated 200 nodes would have a %-overrun of 100 on that problem. The strategy that does best on a given problem has a %-overrun of 0 on that problem. In a later section, we make use of similarly computed %-overruns for computation time.

If a strategy hit the node limit, we set c to the relevant node limit (10,000 or 100,000) to compute its node-count %-overrun. Similarly, if a strategy hit the time limit, we used the relevant time limit (100 or 1,000 seconds) to compute its computation-time %-overrun.
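
In code, the %-overrun computation, including the capping rule for runs that hit a limit, amounts to the following minimal sketch; the function name is ours, not part of the reported experiments.

  def percent_overrun(c, m):
      """%-overrun of a strategy whose cost is c (nodes generated or seconds taken)
      on a problem whose minimum cost over all tested strategies is m.  For a run
      that was cut off, c is set to the relevant node or time limit."""
      return (c - m) / m * 100.0

  # Example from the text: the best strategy generated 100 nodes, another generated 200.
  assert percent_overrun(200, 100) == 100.0
  # A strategy that hit the 10,000-node limit on that problem is charged the
  # full limit, giving a %-overrun of 9,900.
  assert percent_overrun(10_000, 100) == 9_900.0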

Online Appendix A provides the raw data -- node counts and computation times -- for all the experiments we conducted; it also includes the computed %-overruns.

In conducting experiments such as these, one has to set either a node-limit or a time-limit cutoff for each strategy/problem pair. There is always a danger that these cutoffs unfairly bias the data: if the limits are poorly chosen, strategies that fail might instead have succeeded had the limits been increased slightly. We have carefully analyzed our data to help rule out such a bias; details are given in Appendix A.

