Porting a Vector Library: a Comparison of MPI, Paris, CMMD and PVM (or, "I'll never have to port CVL again")

INTRODUCTION

In this paper we outline the design and implementation in MPI of a portable parallel vector library, which is used as the basis for implementing nested data-parallel languages. We compare the ease of writing and debugging the MPI code with our experiences writing previous implementations in CM-2 Paris, CM-5 CMMD and PVM. We discuss the features of MPI that helped and hindered the effort. We give initial performance results for the MPI implementation running on the SP-1, Paragon and CM-5, and compare them with machine-specific versions running on the C90, CM-2 and CM-5. Finally, we discuss the design limitations of the vector library and its resulting poor performance on current MPP RISC architectures, and outline our plans to overcome this by using MPI as a compiler target. The library and associated high-level languages are available via FTP.

CVL OVERVIEW

CVL (C Vector Library) supplies a set of vector operations that work on vectors of arbitrary length, and an abstract memory model that is independent of the underlying architecture. CVL was designed so that efficient implementations could be developed for a wide variety of parallel machines, and it is currently used as a back end for the nested data-parallel languages NESL and Proteus. CVL supplies a rich variety of vector operations, including elementwise function application, global operations such as scans and reductions, and various types of permutations. Most CVL routines are defined for both segmented and unsegmented vectors, since segmentation is critical for implementing nested data parallelism.

MPI IMPLEMENTATION

MPI CVL uses a hostless SPMD model of computation. The alternative is "hosted" SIMD, where a host broadcasts CVL instructions to a collection of slave nodes. This model was used for CM-5 CVL, but for loosely-coupled machines such as the SP-1 the broadcasts introduce extra overhead and unnecessary synchronization. Simple CVL instructions are coded as loops across the block-distributed vectors. CVL scans and reductions use their MPI equivalents (see below). General all-to-all permutations are coded using nonblocking send and receive instructions to take advantage of any communication/computation overlap possible on the target machine. To amortize message overhead, messages to the same node are aggregated in a buffer before sending. Tuning for a particular MPI implementation is limited to choosing a size for these buffers.
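
As a concrete illustration of this scheme, here is a minimal C/MPI sketch of an elementwise operation and an unsegmented sum reduction over a block-distributed vector. It is not taken from the CVL source; the function names, argument conventions and the "nlocal" parameter are ours. The elementwise case is a purely local loop over this node's block, while the reduction combines per-node partial results with MPI_Allreduce.

    #include <mpi.h>

    /* Elementwise d = s1 + s2 over this node's block of a
       block-distributed vector: a purely local loop, no communication. */
    void vadd_local(double *d, const double *s1, const double *s2, int nlocal)
    {
        int i;
        for (i = 0; i < nlocal; i++)
            d[i] = s1[i] + s2[i];
    }

    /* Unsegmented sum reduction: reduce the local block, then combine the
       partial sums with MPI_Allreduce so every node holds the result. */
    double vsum(const double *s, int nlocal, MPI_Comm comm)
    {
        double local = 0.0, global;
        int i;
        for (i = 0; i < nlocal; i++)
            local += s[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

A sum scan can follow the same pattern: each node computes prefix sums over its own block, and MPI_Scan over the per-node totals supplies the offset to add to each local element (for +, the node's own total can simply be subtracted from the inclusive result, as discussed under MPI (MIS)FEATURES below).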

COMPARING MPI TO PARIS, CMMD AND PVM

CM-2 CVL is written in C and Paris, a parallel vector instruction set that has direct equivalents for many CVL instructions. CM-5 CVL is written in C and the message-passing library CMMD. CM-5 CVL instructions are coded as loops over data just as in the MPI implementation, but CMMD allows much finer-grained communication than MPI, reducing the need to buffer messages before sending. Both Paris and CMMD supply scans and reductions, including segmented versions of these operations, simplifying the implementation of CVL. However, it is difficult to develop and debug applications using Paris and CMMD, since no workstation implementations exist. A PVM implementation of CVL was started but never completed; it suffered from PVM's lack of high-level collective operations and from the need to use manufacturers' extensions to PVM to get reasonable performance.

MPI (MIS)FEATURES

In general, MPI proved easy to use and lived up to its promise of portability. Particularly useful were the nonblocking sends and the support for scans and reductions. The main MPI misfeature from our perspective is the definition of scans as inclusive rather than exclusive, which necessitates extra communication to generate exclusive scans for operators with no inverse. As noted above, built-in support for segmented scans simplifies the implementation of CVL; since MPI lacks them, they are currently simulated in MPI CVL with user-defined scan operations that manipulate global state.

MPI CVL PERFORMANCE

Initial benchmarks show that MPI CVL running on the SP-1 achieves 2-6 times the per-node performance of CM-5 CVL, depending on the ratio of communication to computation for a particular instruction. The peak performance for "embarrassingly parallel" applications such as line fitting is about 11 MFLOPS per node. [We intend to run MPI CVL on the CM-5 and Paragon in the near future, allowing a comparison of results across platforms and between libraries (MPI vs CMMD).]

FUTURE WORK

The peak performance of current RISC-based MPPs is often limited by their main-memory bandwidth. CVL suffers particularly acutely from this, since each vector instruction is implemented as a separate loop over the data; hence its low peak speed on the SP-1. By comparison, a C/MPI linefit with fused loops achieves 33 MFLOPS per node. We are writing a NESL compiler that will generate C/MPI rather than CVL calls, enabling loop fusion and other optimizations to be applied; a sketch of the kind of fused code involved is given below. We think MPI will be an excellent target library for developing a portable compiler.
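
To make the contrast concrete, the following sketch shows the kind of fused C/MPI code such a compiler might emit for a least-squares line fit (the routine and variable names are illustrative, not taken from the benchmark source). Where a CVL version would make a separate pass over the data for each vector instruction, here the four sums required by the fit are accumulated in a single loop and combined with one MPI_Allreduce.

    #include <mpi.h>

    /* Fused least-squares line fit: one pass over the local block
       accumulates all four sums, and a single MPI_Allreduce combines
       them across nodes. */
    void linefit(const double *x, const double *y, long nlocal, long nglobal,
                 MPI_Comm comm, double *slope, double *intercept)
    {
        double local[4] = {0.0, 0.0, 0.0, 0.0}; /* sum x, sum y, sum x*x, sum x*y */
        double global[4];
        double n = (double)nglobal, xbar, ybar;
        long i;

        for (i = 0; i < nlocal; i++) {
            local[0] += x[i];
            local[1] += y[i];
            local[2] += x[i] * x[i];
            local[3] += x[i] * y[i];
        }
        MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, comm);

        xbar = global[0] / n;
        ybar = global[1] / n;
        *slope     = (global[3] - n * xbar * ybar) / (global[2] - n * xbar * xbar);
        *intercept = ybar - *slope * xbar;
    }

The fused loop touches each element only once, instead of writing and re-reading intermediate vectors between separate CVL calls, which is exactly the memory traffic that limits the library version on bandwidth-bound RISC nodes.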