From AAPC Algorithms to High Performance Permutation Routing and Sorting

Thomas M. Stricker and Jonathan C. Hardwick
Proceedings of the 8th ACM Symposium on Parallel Algorithms and Architectures, June 1996.

46k compressed postscript

Abstract: Several recent papers have proposed or analyzed optimal algorithms to route all-to-all personalized communication (AAPC) over communication networks such as meshes, hypercubes and omega switches. However, the constant factors of these algorithms are often an obscure function of system parameters such as link speed, processor clock rate, and memory access time. In this paper we investigate these architectural factors, showing the impact of the communication style, the network routing table, and most importantly, the local memory system, on AAPC performance and permutation routing on the Cray T3D.

The fast hardware barriers on the T3D permit a straightforward AAPC implementation using routing phases separated by barriers, which improve performance by controlling congestion. However, we found that a practical implementation was difficult, and the resulting AAPC performance was less than expected. After detailed analysis, several corrections were made to the AAPC algorithm and to the machine's routing table, raising the performance from 41% to 74% of the nominal bisection bandwidth of the network.

Most AAPC performance measurements are for permuting large, contiguous blocks of data (i.e., every processor has an array of P contiguous elements to be sent to every other processor). In practice, sorting and true h-h permutation routing (Where h=n/p >> 1, the number of elements per processor) require data elements to be gathered from their source location into a buffer, transferred over the network, and scattered into their final location in a destination array. We obtain an optimal T3D implementation by chaining local and remote memory operations together. We quantify the implementation's efficiency both experimentally and theoretically, using the recently-introduced copy transfer model, and present results for a counting sort based on this AAPC implementation.

@inproceedings{StrickerHardwick96,
	author = "Thomas~M. Stricker and Jonathan~C. Hardwick",
	title = "From AAPC Algorithms to High Performance Permutation 
		Routing and Sorting",
	booktitle = "Proceedings 8th {ACM} Symposium on
		     Parallel Algorithms and Architectures",
	month = jun,
	year = 1996 }