Scans and Linear Recurrences

Background: Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. One of the simplest and most useful building blocks for parallel algorithms is the all-prefix-sums operation. Guy Blelloch wrote an overview paper entitled Prefix Sums and Their Applications, that defines the all-prefix-sums operation, shows how to implement it on a PRAM and illustrates many applications of the operation. In addition to being a useful building block, the all-prefix-sums operation is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.

Scans on Vector Multiprocessors: Guy Blelloch, Sid Chatterjee, and Marco Zagha wrote a paper for Supercomputing'90 entitled Scan Primitives for Vector Computers that describes an optimized implementation of a set of scan primitives for a CRAY Y-MP. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_1/2) for each set. We have used the techniques described in the paper for implementing sorting and sparse matrix multiplication algorithms, and for implementing a C Vector Library (CVL) which is used in the current NESL implementation.

Linear Recurrences on Vector Multiprocessors: Guy Blelloch, Sid Chatterjee, and Marco Zagha wrote a paper for IPPS 92 entitled Solving Linear Recurrences with Loop Raking (A revised version will appear in JPDC). The paper presents a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. Our implementations of this technique on a single processor of the Cray C90 (for first- and second-order linear recurrences) are up to 7.3 times faster than the corresponding optimized library routines in SCILIB. On 4 processors, we gain an additional speedup of at least 3.7.

Related work:

Anwar Mohammed Ghuloum's research on automatically detecting and parallelizing scans and reductions.
Scan articles in the SEL-HPC High Performance Computing Archive

Up to the Irregular Algorithms page.

guy.blelloch@cs.cmu.edu. Last updated 26 June 1995