Radix Sort for Vector Multiprocessors

Marco Zagha and Guy E. Blelloch.
Supercomputing '91, November 1991.

61k compressed postscript

Abstract: We have implemented an efficient fully vectorized and parallelized radix sort algorithm on the Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided from Fortran. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second.

This paper describes the parallel radix-sort algorithm, introduces some techniques for mapping parallel algorithms onto vector multiprocessors, discusses our implementation of the radix-sort on the Y-MP, and derives equations to predict and optimize the performance of the sort. The most interesting elements of the implementation are a vectorized and parallelized histogram operation, a vectorized and parallelized scan (all-prefix-sums) operation, and a memory access pattern designed to avoid memory bank conflicts. The algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2.

@inproceedings{cray-sort,
	author = "Marco Zagha and Guy~E. Blelloch",
	title = "Radix Sort for Vector Multiprocessors",
	booktitle = "Proceedings Supercomputing~'91",
	pages = "712--721",
	month = nov,
	year = 1991 }