Porting a Vector Library: a Comparison of MPI, Paris, CMMD and PVM (or, "I'll never have to port CVL again")

INTRODUCTION

In this paper we outline the design and implementation in MPI of a portable parallel vector library, which is used as the basis for implementing nested data-parallel languages. We compare the ease of writing and debugging the MPI code with our experiences writing previous implementations in CM-2 Paris, CM-5 CMMD and PVM. We discuss the features of MPI that helped and hindered the effort. We give initial performance results for the MPI implementation running on the SP-1, Paragon and CM-5, and compare them with machine-specific versions running on the C90, CM-2 and CM-5. Finally, we discuss the design limitations of the vector library and its resulting poor performance on current MPP RISC architectures, and outline our plans to overcome this by using MPI as a compiler target. The library and associated high-level languages are available via FTP.

CVL OVERVIEW

CVL (C Vector Library) supplies a set of vector operations that work on vectors of arbitrary length, and an abstract memory model that is independent of the underlying architecture. CVL was designed so that efficient implementations could be developed for a wide variety of parallel machines, and it is currently used as a back end for the nested data-parallel languages NESL and Proteus. CVL supplies a rich variety of vector operations, including elementwise function application, global operations such as scans and reductions, and various types of permutations. Most CVL routines are defined for both segmented and unsegmented vectors, since segmentation is critical for implementing nested data parallelism.

MPI IMPLEMENTATION

MPI CVL uses a hostless SPMD model of computation. The alternative is "hosted" SIMD, where a host broadcasts CVL instructions to a collection of slave nodes. This model was used for CM-5 CVL, but for loosely-coupled machines such as the SP-1 the broadcasts introduce extra overhead and unnecessary synchronization. Simple CVL instructions are coded as loops across the block-distributed vectors. CVL scans and reductions use their MPI equivalents (see below). General all-to-all permutations are coded using nonblocking send and receive instructions to take advantage of any communication/computation overlap possible on the target machine. To amortize message overhead, messages to the same node are aggregated in a buffer before sending. Tuning for a particular MPI implementation is limited to choosing a size for these buffers.
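
As a concrete illustration of this scheme, here is a minimal C/MPI sketch of an elementwise operation and an unsegmented sum reduction over a block-distributed vector. It is not taken from the CVL source; the function names, argument conventions and the "nlocal" parameter are ours. The elementwise case is a purely local loop over this node's block, while the reduction combines per-node partial results with MPI_Allreduce.

    #include <mpi.h>

    /* Elementwise d = s1 + s2 over this node's block of a
       block-distributed vector: a purely local loop, no communication. */
    void vadd_local(double *d, const double *s1, const double *s2, int nlocal)
    {
        int i;
        for (i = 0; i < nlocal; i++)
            d[i] = s1[i] + s2[i];
    }

    /* Unsegmented sum reduction: reduce the local block, then combine the
       partial sums with MPI_Allreduce so every node holds the result. */
    double vsum(const double *s, int nlocal, MPI_Comm comm)
    {
        double local = 0.0, global;
        int i;
        for (i = 0; i < nlocal; i++)
            local += s[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

A sum scan can follow the same pattern: each node computes prefix sums over its own block, and MPI_Scan over the per-node totals supplies the offset to add to each local element (for +, the node's own total can simply be subtracted from the inclusive result, as discussed under MPI (MIS)FEATURES below).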

COMPARING MPI TO PARIS, CMMD AND PVM

CM-2 CVL is written in C and Paris, a parallel vector instruction set that has direct equivalents for many CVL instructions. CM-5 CVL is written in C and the message-passing library CMMD. CM-5 CVL instructions are coded as loops over data just as in the MPI implementation, but CMMD allows much finer-grained communication than MPI, reducing the need to buffer messages before sending. Both Paris and CMMD supply scans and reductions, including segmented versions of these operations, simplifying the implementation of CVL. However, it is difficult to develop and debug applications using Paris and CMMD, since no workstation implementations exist. A PVM implementation of CVL was started but never completed; it suffered from PVM's lack of high-level collective operations and from the need to use manufacturers' extensions to PVM to get reasonable performance.

MPI (MIS)FEATURES

In general, MPI proved easy to use and lived up to its promise of portability. Particularly useful were the nonblocking sends and the support for scans and reductions. The main MPI misfeature from our perspective is the definition of scans as inclusive rather than exclusive, which necessitates extra communication to generate exclusive scans for operators with no inverse. As noted above, built-in support for segmented scans simplifies the implementation of CVL; since MPI lacks them, they are currently simulated in MPI CVL with user-defined scan operations that manipulate global state.

MPI CVL PERFORMANCE

Initial benchmarks show that MPI CVL running on the SP-1 achieves 2-6 times the per-node performance of CM-5 CVL, depending on the ratio of communication to computation for a particular instruction. The peak performance for "embarrassingly parallel" applications such as line fitting is about 11 MFLOPS per node. [We intend to run MPI CVL on the CM-5 and Paragon in the near future, allowing a comparison of results across platforms and between libraries (MPI vs CMMD).]

FUTURE WORK

The peak performance of current RISC-based MPPs is often limited by their main-memory bandwidth. CVL suffers particularly acutely from this, since each vector instruction is implemented as a separate loop over the data; hence its low peak speed on the SP-1. By comparison, a C/MPI linefit with fused loops achieves 33 MFLOPS per node. We are writing a NESL compiler that will generate C/MPI rather than CVL calls, enabling loop fusion and other optimizations to be applied; a sketch of the kind of fused code involved is given below. We think MPI will be an excellent target library for developing a portable compiler.
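
To make the contrast concrete, the following sketch shows the kind of fused C/MPI code such a compiler might emit for a least-squares line fit (the routine and variable names are illustrative, not taken from the benchmark source). Where a CVL version would make a separate pass over the data for each vector instruction, here the four sums required by the fit are accumulated in a single loop and combined with one MPI_Allreduce.

    #include <mpi.h>

    /* Fused least-squares line fit: one pass over the local block
       accumulates all four sums, and a single MPI_Allreduce combines
       them across nodes. */
    void linefit(const double *x, const double *y, long nlocal, long nglobal,
                 MPI_Comm comm, double *slope, double *intercept)
    {
        double local[4] = {0.0, 0.0, 0.0, 0.0}; /* sum x, sum y, sum x*x, sum x*y */
        double global[4];
        double n = (double)nglobal, xbar, ybar;
        long i;

        for (i = 0; i < nlocal; i++) {
            local[0] += x[i];
            local[1] += y[i];
            local[2] += x[i] * x[i];
            local[3] += x[i] * y[i];
        }
        MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, comm);

        xbar = global[0] / n;
        ybar = global[1] / n;
        *slope     = (global[3] - n * xbar * ybar) / (global[2] - n * xbar * xbar);
        *intercept = ybar - *slope * xbar;
    }

The fused loop touches each element only once, instead of writing and re-reading intermediate vectors between separate CVL calls, which is exactly the memory traffic that limits the library version on bandwidth-bound RISC nodes.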