CMU-CS-93-173, August 1993.

**Abstract:**
In this paper we present a new technique for sparse matrix
multiplication on vector multiprocessors based on the efficient
implementation of a segmented sum operation. We describe how the
segmented sum can be implemented on vector multiprocessors such that it
both fully vectorizes within each processor and parallelizes across
processors. Because of our method's insensitivity to relative row size,
it is better suited than the Ellpack/Itpack or the Jagged Diagonal
algorithms for matrices which have a varying number of non-zero elements
in each row. Furthermore, our approach requires less preprocessing (no
more time than a single sparse matrix-vector multiplication), less
auxiliary storage, and uses a more convenient data representation (an
augmented form of the standard compressed sparse row format).

We have implemented our algorithm (SEGMV) on the Cray Y-MP C90, and have compared its performance with other methods on a variety of sparse matrices from the Harwell-Boeing collection and industrial application codes. Our performance on the test matrices is up to 3 times faster than the Jagged Diagonal algorithm and up to 5 times faster than Ellpack/Itpack method. Our preprocessing time is an order of magnitude faster than for the Jagged Diagonal algorithm. Also, using an assembly language implementation of SEGMV on a 16 processor C90, the NAS Conjugate Gradient benchmark runs at 3.5 gigaflops.

@techreport{cray-mvmult, author = "Guy~E. Blelloch and Michael~A. Heroux and Marco Zagha", title = "Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors", institution = "School of Computer Science, Carnegie Mellon University", number = "CMU-CS-93-173", month = aug, year = 1993}