Sorting and Merging

Outline: A Comparison of Sorting Algorithms for the Connection Machine CM-2: We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets.
On a 64K-processor CM-2, our sample sort implementation can sort 32 x 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the previous CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. Our SPAA 91 paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations. This research was started with the "Supersort" project at Thinking Machines Corp: Guy Blelloch, Charles Leiserson, Bruce Maggs, Greg Plaxton, Steven Smith, and Marco Zagha, with technical advice from Steve Heller and Mark Bromley at TMC.

Radix Sort for Vector Multiprocessors: We have implemented an efficient fully vectorized and parallelized radix sort algorithm for the Cray Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided from Fortran. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second.

Our Supercomputing'91 paper (by Marco Zagha and Guy Blelloch) describes the parallel radix-sort algorithm, introduces some techniques for mapping parallel algorithms onto vector multiprocessors, discusses our implementation of the radix-sort on the Y-MP, and derives equations to predict and optimize the performance of the sort. The most interesting elements of the implementation are a vectorized and parallelized histogram operation, a vectorized and parallelized scan (all-prefix-sums) operation, and a memory access pattern designed to avoid memory bank conflicts. The algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2.

The NAS benchmark report includes Integer Sort benchmark timings of our Cray Y-MP/C90 implementation, which is currently the fastest reported for any machine. Our implementations were developed on the Cray Y-MP and Cray Y-MP/C90 at the Pittsburgh Supercomputing Center. Our sorting code for the Cray Y-MP family (written in Cray assembly language) is publicly available.

Merging on Vector Multiprocessors: Marco Zagha is currently working on a fast implementation of a PRAM merging algorithm for the Cray C90. For large input vectors, the implementation is significantly faster than bitonic merge because the PRAM algorithm performs linear work while bitonic merge performs nlog(n) work. Our implementation also outperforms a serial merge because it makes efficient use of vector instructions and can take advantage of multiple processors.

Sorting and Merging in NESL: We have coded several sorting algorithms (quicksort, bitonic-sort, odd-even-merge-sort) in NESL. For example, here is the quicksort implementation:

   function qsort(a) =
   if (#a < 2) then a
   else let pivot   = a[#a/2];
            lesser  = {e in a| e < pivot};
            equal   = {e in a| e == pivot};
            greater = {e in a| e > pivot};
            result  = {qsort(v): v in [lesser,greater]};
        in result[0] ++ equal ++ result[1] $

Related Work: Last updated 26 June 1995