Optimizing Memory System Performance for Communication in Parallel Computers

T. Stricker 1) and T. Gross 1),2)

Abstract

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth.

Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.

In particular, we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied into a buffer in local memory before the transfer (gathering) and out of a buffer after the transfer (scattering). Chained transfers perform gathering, transfer, and scattering in one step: the data elements are read with a non-sequential access pattern and immediately forwarded to the destination. Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.

Most standard message passing libraries (like MPI, PVM, or NX) force the parallelizing compiler (or the programmer) to employ the buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMAs, line-transfer units) now gives the compiler a wider range of options.

----------------------------------------------------------------------
1) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
2) Institut für Computer Systeme, ETH Zürich, CH-8092 Zürich, Switzerland

This research was sponsored in part by the Advanced Research Projects Agency/CSTO, monitored by SPAWAR under contract N00039-93-C-0152. Computational resources were provided in part by the Pittsburgh Supercomputing Center (PSC). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
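
To make the contrast between the two transfer styles concrete, the following C/MPI sketch sends every STRIDE-th element of an array in two ways: first by hand-packing the elements into a contiguous buffer (buffer packing), then by describing the stride to the library with a derived datatype so that an implementation may combine the gather with the transfer. This is a minimal illustration, not code from the paper: the constants N and STRIDE, the function name send_strided, and the choice of MPI are assumptions for the example, and whether an MPI implementation actually chains the gather with the transfer or falls back to internal packing (as the abstract notes most standard libraries do) is implementation dependent. The chained transfers measured in the paper rely on machine-specific hardware support on the Cray T3D and Intel Paragon, which this sketch does not model.

    /* Illustrative sketch only (assumed names and sizes, not the paper's code). */
    #include <mpi.h>
    #include <stdlib.h>

    #define N      4096   /* number of strided elements to send (assumed) */
    #define STRIDE 8      /* distance between consecutive elements, in doubles (assumed) */

    void send_strided(double *src, int dest, MPI_Comm comm)
    {
        /* (a) Buffer packing: gather into a contiguous buffer, then send it. */
        double *buf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            buf[i] = src[i * STRIDE];                /* gather step in local memory */
        MPI_Send(buf, N, MPI_DOUBLE, dest, 0, comm); /* transfer of a contiguous block */
        free(buf);

        /* (b) Strided datatype: describe the access pattern and let the library
         * decide whether to gather on the fly or pack into an internal buffer. */
        MPI_Datatype vec;
        MPI_Type_vector(N, 1, STRIDE, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);
        MPI_Send(src, 1, vec, dest, 1, comm);
        MPI_Type_free(&vec);

        /* The matching receives (N contiguous doubles per message, or a scatter
         * into a strided destination) are omitted for brevity. */
    }

In both variants the data that crosses the network is the same; the difference lies in how many times the elements traverse the local memory system before reaching the network interface, which is exactly the cost the copy-transfer model is meant to capture.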