Spark98:
Sparse Matrix Kernels for
Shared Memory and Message Passing Systems
David O'Hallaron
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
January, 1998
email: droh@cs.cmu.edu
web: www.cs.cmu.edu/~quake/spark98.html
Copyright (C) David O'Hallaron 1998. You are free to use this software
without restriction. If you find that the suite is helpful to you, it
would be very helpful if you sent me a note at droh@cs.cmu.edu letting
me know how you are using it.
--------------------------------------------------------------------
1. Summary
--------------------------------------------------------------------
Spark98 is a collection of 10 sparse kernels for shared memory and
message passing systems. The sparse matrices are induced from a pair
of three-dimensional unstructured finite element simulations of
earthquake ground motion in the San Fernando Valley. The aim of
Spark98 is to provide the systems community with examples of irregular
codes that are realistic and yet are fairly small (about 1000 lines of
code), and easy to use and experiment with. Each kernel is a
program/mesh pair. There are 5 C programs (smv, lmv, rmv, mmv, hmv)
and 2 finite element (sf10 and sf5), for a total of 10 kernels:
+---------------------------------------------------------+
| The Spark98 Kernels |
+------+------------------------+-------------------------+
|prog | sf10 (small mesh) | sf5 (moderate mesh) |
+------+------------------------+-------------------------+
| smv | sequential | sequential |
| lmv | shared memory (lock) | shared memory (lock) |
| rmv | shared memory (reduce) | shared memory (reduce) |
| mmv | message passing (mpi) | message passing (mpi) |
| hmv | hybrid | hybrid |
+------+------------------------+-------------------------+
Smv is a baseline sequential program. Lmv and rmv are simple shared
memory programs based on locks and reductions respectively. Mmv is a
parallel message passing program based on the MPI primitives, and hmv
is a hybrid shared memory program written in an aggressive message
passing style.
Each program computes a sequence of sparse matrix vector product
(SMVP) pairs, y1=K1*x1 followed by y2=K2*x2, where K1 and K2 are
sparse matrices with identical sparsity structures, and y1, y2, x1,
and x2 are dense vectors. Matrices are stored in compressed sparse
row format and are block symmetric. Each nonzero matrix entry is a 3x3
submatrix of doubles. Only the diagonal and the upper triangle
coefficients are actually stored. Similarly each vector entry is a 3x1
subvector.
Each matrix K is represented by three vectors: (1) the coefficient
vector A, (2) Acol, which is defined s.t. A[i] is in column Acol[i],
and (3) Aindex, which is defined s.t. A[Aindex[i]] is the first
nonzero entry of row i.
The meshes determine both the size and nonzero structure of the sparse
matrices used in the SMVP operations. Both sf10 and sf5 are 3D
unstructured finite element meshes developed by the CMU Quake project
(www.cs.cmu.edu/~quake) to simulate earthquake-induced ground motion
in the San Fernando Valley of Southern California. Sf10 is
small-sized mesh (7294 nodes, 97,138 nonzero matrix entries) and sf5
is a moderate-sized mesh (30,169 nodes, 410,923 nonzero matrix
entries).
--------------------------------------------------------------------
2. Programs
--------------------------------------------------------------------
For a particular mesh, each program computes an identical sequence of
SMVP operations, but differs in how aggessively it partitions
references among multiple threads. Each program is written in
ANSI C.
Smv is the baseline sequential program.
Lmv is a parallel shared memory program based on locks. For each SMVP,
there is a single copy of A, x, and y. Each of t threads is allocated
a chunk of contiguous rows of A so that each thread has roughly the
same number of nonzeros. Each thread updates y using L locks to
synchronize the updates. The program can be compiled for either the
Posix threads (pthread) interface or the SGI thread interface. Lmv is
extremely simple to parallelize, but can have poor performance,
depending on the number of locks that are used. Generally, the more
locks the better the performance, however, most systems limit the
number of locks.
Rmv is a parallel shared memory program based on reductions. As with
lmv, work is partitioned so that each thread gets roughly the same
number of nonzero entries in the sparse matrix. Similarly, for each
SMVP, there is a single copy of A and x. However, unlike lmv, each
thread updates its own private copy of y. These private vectors
are then summed to produce the final output vector y. This is
essentially the Fx Do&Merge model. The program can be compiled for
either the pthread or SGI threads interface. Rmv is simple to program
and generally performs better than lmv, at the cost of additional
memory overhead in the form of the privatized output vectors.
Mmv is a parallel message passing program written in MPI. Each
processor updates local copies of A, x, and y. This is followed by a
"full assembly" phase that combines the partially assembled output
vectors. The local copies of A, x, and y are induced from a partition
that is computed off-line using a geometric recursive bisection
partitioning algorithm due to Miller, Teng, Thurston, and Vivasis. Mmv
is extremely aggressive in its partitioning of references and performs
better than either lmv or rmv. But it is also much more difficult to
program.
Hmv is a hybrid shared memory/message passing program that
aggressively partitions references like mmv, but uses shared memory to
access and copy data. This is the most efficient (and hardest to
program) shared memory version, with performance comparable to mmv.
The spark98 programs are contained in the ./src directory:
Binary Source #define (define one of these)
smv mv.c
lmv mv.c either PTHREAD_LOCK or SGI_LOCK
rmv mv.c either PTHREAD_REDUCE or SGI_REDUCE
mmv mmv.c
hmv hmv.c either SGI or PTHREAD
Example makefiles for the DEC AlphaServer, the SGI Power Challenge and
SGI Origin 2000, and the Cray T3E are provided in Makefile.dec,
Makefile.sgi, and Makefile.t3e, repectively.
--------------------------------------------------------------------
3. Input files
--------------------------------------------------------------------
The input to a Spark98 kernel is a "packfile" that is generated by the
Archimedes tool chain (see www.cs.cmu.edu/~quake/archimedes.html). A
packefile characterizes a finite element mesh, its partition into
disjoint sets (subdomains) of elements, the nonzero structure of the
sparse matrix induced by the partition in each subdomain, and the
communication schedule for each subdomain. The input packfiles are
contained in the ./inputs directory:
Packfile Mesh nodes Nonzero matrix entries
tiny.~~.pack 75 735
sf10.~~~~.pack 7,294 97,138
sf5.~~~~.pack 30,169 410,923
Each .~~~~.pack is the packfile for a mesh that has been partitioned
into s subdomains. Packfiles for s = 1, 2, 4, 8, 16, 32, 64, and 128
subdomains are provided.
Tiny is a small test case. It is not part of the Spark98 kernels; it
is included for quick sanity checks. Sf10 and sf5 are models of the
San Fernando Valley in Southern California that were developed early
in the Quake project. Sf10 resolves seismic waves with 10 second
periods; Sf5 resolves waves with 5 second periods.
--------------------------------------------------------------------
4. Sample output files
--------------------------------------------------------------------
Each kernel that inputs a packfile for the same mesh computes
identical results, independent of the number of threads. These results
are contained in the following files in the ./outputs directory:
File Description
sf10.out Reference output for the sf10 packfiles
sf5.out Reference output for the sf5 packfiles
tiny.out Reference output for the tiny packfiles
You can check correctness of your code using the -O flag. The results
are printed as integers, thus results from two different runs can be
compared reliably using diff. For example,
% lmv -t8 -O ../inputs/tiny.1.pack > stuff.out
% diff stuff.out ../outputs/tiny.out
--------------------------------------------------------------------
5. Computational characteristics
--------------------------------------------------------------------
The ./info directory contains information about each partitioned mesh:
tiny.~~~~.count
sf10.~~~~.count
sf5.~~~~.count
Each .~~~~.count file contains a wealth of statistical information
about the corresponding .~~~~.pack file, including the number of nodes,
elements, and edges in the original mesh, the distribution of nodes,
edges, elements, and nonzeros in each subdomain, the number of
messages transferred by each subdomain, message sizes, and total
communication volume.
--------------------------------------------------------------------
6. Documentation
--------------------------------------------------------------------
The companion tech report that describes the Spark98 kernels
in detail can be found in doc/cmu-cs-97-178.ps:
D. O'Hallaron, "Spark98: Sparse matrix kernels for shared memory
and message passsing systems", Tech. Rep. CMU-CS-97-178,
School of Computer Science, Carnegie Mellon University, 1997.
(also available from www.cs.cmu.edu/~quake/papers.html).
A complete characterization of the San Fernando meshes is contained in
doc/cmu-cs-96-141.ps:
D. O'Hallaron and J. Shewchuk, "Properties of a Family of Parallel
Finite Element Simulations", Tech. Rep. CMU-CS-96-141, School of
Computer Science, Carnegie Mellon University, 1996. (also available
from www.cs.cmu.edu/~quake/papers.html).
--------------------------------------------------------------------
7. Compiling and running the kernels
--------------------------------------------------------------------
The programs are meant to run on many different machines, and thus are
written in vanilla C with only basic Unix library calls. See the
makefiles for examples of how to build the programs.
Each program accepts a -h command line argument that provides
details about the other command line arguments. Example:
% smv -h
Smv, lmv, and rmv must use packfiles that are partitioned into exactly
1 subdomain. Example:
% lmv ../inputs/sf5.1.pack (correct)
% lmv ../inputs/sf5.4.pack (incorrect)
Lmv and rmv can run with an arbitrary number of threads, which is
specified on the command line. Example with 8 threads:
% rmv -t8 ../inputs/sf5.1.pack
For mmv and hmv, the number of threads is determined by the number
of subdomains in the packfile. Example with 8 threads:
% hmv ../inputs/sf5.8.pack
The number of iterations is controlled with the -i flag:
% rmv -i10 ../inputs/sf5.1.pack
The programs output some simple performance results on stderr:
% smv: packfile mflops secs mflops/secs
mflops is millions of floating point operations for one SMVP. secs is
the time for one SMVP. mflops/secs is the usual MFLOPS rate.
% lmv: packfile threads locks mflops secs mflops/sec [min/max/%]
min (max) is the least (most) number of matrix nonzeros assigned to
any thread. % is (min/max * 100). This is a sanity check to ensure
that the computation is reasonably well load balanced.
% rmv: packfile threads locks mflops secs [comp/comm/%]
mflops/secs [min/max/%]
comm is the time spent reducing the vectors to a single vector.
comp is the time spent doing everything else. % is (comp/secs)*100.
% mmv: packfile mflops secs [comp/comm/%] mflops/secs [min/max/%]
% hmv: packfile mflops secs [comp/comm/%] mflops/secs [min/max/%]
comm is the time spent assembling the partially assembled output
vectors. comp is the time spent doing everything else. % is
(comp/secs)*100.
--------------------------------------------------------------------
8. Notes
--------------------------------------------------------------------
Note: Mmv and hmv perform slightly more flops than the other programs
because of overlap on the boundaries of the partitions. Unfortunately
it isn't possible to recover at runtime the number of flops in the
baseline sequential case from the information in the partitioned
packfiles. Thus the reported Mflops/sec rates between smv/lmv/rmv and
mmv/hmv are not exactly comparable; mmv and hmv will tend to slightly
overstate their Mflops/sec rates relative to smv, lmv, and rmv.
However, because the number of nodes on the boundaries of subdomains
is small relative to the number of nodes in the interiors of
subdomains, the Mflops/sec numbers are fine for rough qualitative
comparisons.
Note: If you need exactly comparable Mflops/sec rates for mmv and hmv,
these can be computed by hand using the Mflops number reported by smv
and the elapsed time reported by mmv or hmv.
Note: The flops performed by rmv during its reduction are NOT counted,
thus its reported Mflops/sec rates can be compared directly to those
reported by smv and lmv.
~~