Supercomputing '90, November 1990.

**Abstract:**
This paper describes an optimized implementation of a set of scan (also
called all-prefix-sums) primitives on a single processor of a CRAY Y-MP,
and demonstrates that their use leads to greatly improved performance
for several applications that cannot be vectorized with existing
compiler technology. The algorithm used to implement the scans is based
on an algorithm for parallel computers and is applicable with minor
modifications to any register-based vector computer. On the CRAY Y-MP,
the asymptotic running time of the plus-scan is about 2.25 times that of
a vector add, and is within 20% of optimal. An important aspect of our
implementation is that a set of segmented versions of these scans are
only marginally more expensive than the unsegmented versions. These
segmented versions can be used to execute a scan on multiple data sets
without having to pay the vector startup cost (n_1/2) for each set.

The paper describes a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.

@inproceedings{cray-scan, author = "Siddhartha Chatterjee and Guy~E. Blelloch and Marco Zagha", title = "Scan Primitives for Vector Computers", booktitle = "Proceedings Supercomputing~'90", pages = "666--675", month = nov, year = 1990 }