- A Comparison of Sorting Algorithms for the Connection Machine CM-2
- Radix Sort for Vector Multiprocessors
- Merging on Vector Multiprocessors
- Sorting and Merging in NESL
- Related Work

On a 64K-processor CM-2, our sample sort implementation can sort 32 x 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the previous CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. Our SPAA 91 paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations. This research was started with the "Supersort" project at Thinking Machines Corp: Guy Blelloch, Charles Leiserson, Bruce Maggs, Greg Plaxton, Steven Smith, and Marco Zagha, with technical advice from Steve Heller and Mark Bromley at TMC.

**Radix Sort for Vector Multiprocessors**:
We have implemented an efficient fully vectorized and parallelized
radix sort algorithm for the Cray Y-MP. On one processor of the Y-MP,
our sort is over 5 times faster on large sorting problems than the
optimized library sort provided from Fortran. On eight processors we
achieve an additional speedup of almost 5, yielding a routine over 25
times faster than the library sort. Using this multiprocessor
version, we can sort at a rate of 15 million 64-bit keys per second.

Our Supercomputing'91 paper (by Marco Zagha and Guy Blelloch) describes the parallel radix-sort algorithm, introduces some techniques for mapping parallel algorithms onto vector multiprocessors, discusses our implementation of the radix-sort on the Y-MP, and derives equations to predict and optimize the performance of the sort. The most interesting elements of the implementation are a vectorized and parallelized histogram operation, a vectorized and parallelized scan (all-prefix-sums) operation, and a memory access pattern designed to avoid memory bank conflicts. The algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2.

The NAS benchmark report includes Integer Sort benchmark timings of our Cray Y-MP/C90 implementation, which is currently the fastest reported for any machine. Our implementations were developed on the Cray Y-MP and Cray Y-MP/C90 at the Pittsburgh Supercomputing Center. Our sorting code for the Cray Y-MP family (written in Cray assembly language) is publicly available.

**Merging on Vector Multiprocessors**:
Marco Zagha is currently working on a fast implementation of a PRAM
merging algorithm for the Cray C90. For large input vectors, the
implementation is significantly faster than bitonic merge because the
PRAM algorithm performs linear work while bitonic merge performs
*n*log(*n*) work. Our implementation also outperforms
a serial merge because it makes efficient use of vector instructions and
can take advantage of multiple processors.

**Sorting and Merging in NESL**:
We have coded several
sorting
algorithms (quicksort, bitonic-sort, odd-even-merge-sort) in
NESL.
For example, here is the quicksort implementation:

function qsort(a) = if (#a < 2) then a else let pivot = a[#a/2]; lesser = {e in a| e < pivot}; equal = {e in a| e == pivot}; greater = {e in a| e > pivot}; result = {qsort(v): v in [lesser,greater]}; in result[0] ++ equal ++ result[1] $

- Tom Stricker's paper on iWarp sorting on the iWarp project's papers page
- Sorting articles in the SEL-HPC High Performance Computing Archive

marcoz@cs.cmu.edu. Last updated 26 June 1995