An Architecture for Optimal All-to-All Personalized Communication


Susan Hinrichs, Corey Kosak, David R. O'Hallaron, Thomas M. Stricker
                    and Riichiro Take (1)

               School of Computer Science
               Carnegie Mellon University
             Pittsburgh, PA 15213-3891, USA

       {shinrich,kosak,droh,tomstr,take}@cs.cmu.edu


Abstract                                                               
                                                                       
In all-to-all personalized communication (AAPC), every node of
a parallel system sends a potentially unique packet to every other
node. AAPC is an important primitive operation for modern parallel
compilers, since it is used to redistribute data structures during
parallel computations. As an extremely dense communication pattern,
AAPC causes congestion in many types of networks and therefore
executes very poorly on general purpose, asynchronous message
passing routers.
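
The AAPC pattern defined above can be illustrated with a minimal sketch
that models only the data movement, not the network. The function name
aapc and the (source, destination) packet labels are hypothetical and
chosen for illustration; the diagonal entry, a node's packet to itself,
is assumed to stay local.

```python
def aapc(send):
    """Model the result of an all-to-all personalized exchange.

    send[i][j] is the packet that node i addresses to node j.
    The returned recv[j][i] is the packet node j holds from node i,
    so the exchange amounts to a transpose of the packet matrix.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


# Illustrative 4-node system: each packet is labeled (source, destination).
nodes = 4
send = [[(src, dst) for dst in range(nodes)] for src in range(nodes)]
recv = aapc(send)
```

Every node sends a distinct packet to every other node, so a system of
N nodes moves N * (N - 1) packets through the network in one AAPC step,
which is why the pattern is so dense.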
                                                                       
    We present and evaluate a network architecture that executes all-
to-all communication optimally on a two-dimensional torus. The
router combines optimal partitions of the AAPC step with a self-
synchronizing switching mechanism integrated into a conventional
wormhole router. Optimality is achieved by routing along shortest
paths while fully utilizing all links. A simple hardware addition
for synchronized message switching can guarantee optimal AAPC
routing in many existing network architectures.
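
The link between shortest-path routing and full link utilization can be
made concrete with a counting argument. The sketch below is not taken
from the paper's derivation; it assumes an n x n torus with n >= 3, four
outgoing unidirectional links per node, and one packet crossing a link
per time unit. The names ring_distance and aapc_lower_bound are
hypothetical. Dividing the total number of shortest-path hops by the
number of links gives the fewest time units any schedule could need, and
a schedule meets that bound only if every link is busy in every time
unit.

```python
def ring_distance(a, b, n):
    """Shortest hop count between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def aapc_lower_bound(n):
    """Counting lower bound on AAPC time for an n x n torus (assumes n >= 3).

    Time is measured in packet-transmission units, assuming each
    directed link carries one packet per unit.
    """
    # Hops from one node to all positions along a single ring.
    ring_sum = sum(ring_distance(0, k, n) for k in range(n))
    # For one source, the x and y dimensions each contribute n * ring_sum hops.
    hops_per_source = 2 * n * ring_sum
    total_hops = n * n * hops_per_source
    # Each of the n * n nodes has four outgoing links.
    directed_links = 4 * n * n
    return total_hops / directed_links


bound_8x8 = aapc_lower_bound(8)
```

Under these assumptions the bound works out to n**3 / 8 time units for
even n, for example 64 for an 8x8 torus.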
                                                                       
    The flexible communication agent of the iWarp VLSI component
allowed us to implement an efficient prototype for evaluating the
hardware complexity as well as possible software overheads. The
measured performance on an 8x8 torus exceeded 2 GigaBytes/sec, or
80% of the limit set by the raw speed of the interconnects. We
make a quantitative comparison of the AAPC
router with a conventional message passing system. The potential
gain of such a router for larger parallel programs is illustrated with
the example of a two-dimensional Fast Fourier Transform.
                                                                       
(1) Author's current address: Riichiro Take, Fujitsu Laboratories
Ltd., 1015 Kamikodanaka, Nakaharaku, Kawasaki 211, Japan. email:
riro@flab.fujitsu.co.jp.

Note: Reprinted from the Proceedings of the ACM Symposium on Parallel
Algorithms and Architectures (SPAA '94), June 27-29, 1994, Cape May,
New Jersey, pp. 310-319.