A search algorithm implemented on a parallel system requires a balanced division of work between contributing processors to reduce idle time and minimize redundant or wasted effort. One method of dividing up the work in IDA* search is with a parallel window search (PWS), introduced by Powley and Korf . Using PWS, each processor is given a copy of the entire search tree and a unique cost threshold. The processors search the same tree to different thresholds simultaneously. If a processor completes an iteration without finding a solution, it is given a new unique threshold (deeper than any threshold yet searched) and begins a new search pass with the new threshold. When an optimal solution is desired, processors that find a goal node must remain idle until all processors with lower cost thresholds have completed their current iteration. A typical division of work using PWS is illustrated in Figure 1.
One advantage of parallel window search is that the redundant search inherent in IDA* is not performed serially. During each non-initial iteration of IDA*, all of the nodes expanded in the previous iteration are expanded again. Using multiple processors, this redundant work is performed concurrently. A second advantage of parallel window search is the improved time in finding a first solution. If a search space holds many goal nodes, IDA* may find a deep solution much more quickly than an optimal solution. Parallel window search can take advantage of this type of search space. Processors that are searching beyond the optimal threshold may find a solution down the first branch they explore, and can return that solution long before other processors finish their iteration. This may result in superlinear speedup because the serial algorithm conservatively increments the cost threshold and does not look beyond the current threshold.
On the other hand, parallel window search can face a decline in efficiency when the number of processors is significantly greater than the number of iterations required to find an optimal (or a first) solution, causing all remaining processors to sit idle. This situation will occur when many processors are available, yet few iterations are required because the heuristic estimate is fairly accurate.
An alternative parallel search approach relies on distributing the tree among the processors [19,29]. With this approach, the root node of the search space is given to the first processor and other processors are assigned subtrees of that root node as they request work. As an alternative, the distributed tree search algorithm (DTS) employs breadth-first expansion until there are at least as many expanded leaf nodes as available processors. Processors receive unique nodes from the expanding process and are responsible for the entire subtree rooted at the received node. Communication-free versions of this distribution scheme have also been reported [22,30]. In all of these tree distribution approaches, the processors perform IDA* on their unique subtrees simultaneously. All processors search to the same threshold. After all processors have finished a single iteration, they begin a new search pass through the same set of subtrees using a larger threshold. A sample distribution of the search space is shown in Figure 2.
One advantage of this distribution scheme is that no processor is performing wasted work beyond the goal depth. Because the algorithm searches the space completely to one threshold before starting the search to a new threshold, none of the processors is ever searching at a level beyond the level of the optimal solution. It is possible, however, for DTS to perform wasted work at the goal depth. For example, in Figure 2 processor 3 searches nodes at the goal level that would not be searched in a serial search algorithm moving left-to-right through the tree.
A disadvantage of this approach is the fact that processors are often idle. To ensure optimality, a processor that quickly finishes one iteration must wait for all other processors to finish before starting the next iteration. This idle time can make the system very inefficient and reduce the performance of the search application. The efficiency of this approach can be improved by performing load balancing between neighboring processors working on the same iteration.
These described approaches offer unique benefits. Parallel window search is effective when many iterations of IDA* are required, when the tree is so imbalanced that DTS will require excessive load balancing, or when a deep, non-optimal solution is acceptable. On the other hand, dividing the search space among processors can be more effective when the branching factor is very large and the number of IDA* iterations is relatively small. A compromise between these approaches divides the set of processors into clusters . Each cluster is given a unique cost threshold, and the search space is divided between processors within each cluster, as shown in Figure 3. Setting the number of clusters to one simulates distributed tree search, and setting the number of clusters to the number of available processors simulates parallel window search.