In this subsection we describe the computational structure of the parallel applications used in this chapter. This information is useful in later sections for understanding the performance results. The parallel applications we use consist of the entire SPLASH suite [72] plus the LU-decomposition application that we used in earlier studies [61][33]. These applications are representative of algorithms used today in scientific and engineering computing environments. One application (OCEAN) is written in Fortran, and the others are written in C. The Argonne National Laboratory macro package [57] is used to provide synchronization and sharing primitives. Table provides a brief summary of the applications, along with their input data sets, and Table shows some general characteristics of the applications when 16 processors are used (as is the case throughout this chapter).
OCEAN [71] simulates the role of eddy and boundary currents in influencing large-scale ocean movements. It uses Successive Over-Relaxation (SOR) to solve two-dimensional grids of partial differential equations over a number of time steps. These time steps continue until the eddies and mean ocean flow attain a mutual balance. The principal data structures are 25 two-dimensional matrices containing the various values associated with the model's equations. Equal numbers of adjacent columns are statically assigned to each processor. During each time step, the processors iterate over their columns from left to right, performing nearest-neighbor computations with both 5-point and 9-point stencils. Communication occurs when processing the boundary columns, since their nearest neighbors include columns owned by other processors. For our experiments we ran OCEAN with 98x98 grids.
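A minimal sketch of the per-processor work, assuming a single shared 98x98 grid, an illustrative relaxation factor OMEGA, and column bounds col_lo..col_hi that exclude the grid border; the names and the generic 5-point SOR update shown here are ours, not the actual OCEAN code:

    #define N     98                  /* grid dimension, including border */
    #define OMEGA 1.2                 /* relaxation factor (illustrative) */

    extern double grid[N][N];         /* one of OCEAN's shared matrices   */

    void sor_sweep(int col_lo, int col_hi)   /* 1 <= col_lo <= col_hi <= N-2 */
    {
        /* Sweep left to right over this processor's columns.  At the
           boundary columns the stencil reads columns owned by the
           neighboring processors, which is where communication occurs. */
        for (int j = col_lo; j <= col_hi; j++)
            for (int i = 1; i < N - 1; i++) {
                double avg = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
                grid[i][j] += OMEGA * (avg - grid[i][j]);
            }
    }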
LU performs LU-decomposition for dense matrices. The primary data structure in LU is the matrix being decomposed. Working from left to right, a column is used to modify all columns to its right. Once all columns to the left of a column have modified that column, it can be used to modify the remaining columns. Columns are statically assigned to the processors in an interleaved fashion. Each processor waits until a column has been produced, and then uses that column to modify the columns it owns that lie to the right. Once a processor completes a column, it releases any processors waiting for that column. For our experiments we performed LU-decomposition on a 200x200 matrix.
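The following sketch shows the structure of this loop in C, assuming unpivoted decomposition, an interleaved column-to-processor mapping by column index, and a per-column done flag used as a simple event in place of the ANL event primitives; all names are illustrative:

    #define N 200

    extern double a[N][N];            /* matrix being decomposed            */
    extern volatile int done[N];      /* done[k] set when column k is final */

    void lu_worker(int pid, int nprocs)
    {
        for (int k = 0; k < N; k++) {
            if (k % nprocs == pid) {
                /* Owner of column k: all updates from the left have
                   been applied, so scale it and release any waiters. */
                for (int i = k + 1; i < N; i++)
                    a[i][k] /= a[k][k];
                done[k] = 1;
            } else {
                while (!done[k])      /* wait for column k's owner          */
                    ;
            }
            /* Use column k to modify the columns to its right that this
               processor owns. */
            for (int j = k + 1; j < N; j++)
                if (j % nprocs == pid)
                    for (int i = k + 1; i < N; i++)
                        a[i][j] -= a[i][k] * a[k][j];
        }
    }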
MP3D [59] is a 3-dimensional particle simulator. It is used to study the pressure and temperature profiles created as an object flies at high speed through the upper atmosphere. The primary data objects in MP3D are the particles (representing the air molecules), and the space cells (representing the physical space, the boundary conditions, and the flying object). The overall computation of MP3D consists of evaluating the positions and velocities of particles over a sequence of time steps. During each time step, the particles are picked up one at a time and moved according to their velocity vectors. If two particles come close to each other, they may undergo a collision based on a probabilistic model. Collisions with the object and the boundaries are also modeled. The simulator is well suited to parallelization because each particle can be treated independently at each time step. The program is parallelized by statically dividing the particles equally among the processors. The main synchronization consists of barriers between each time step. For our experiments we ran MP3D with 100,000 particles, a 32x6x32 space array, and simulated 5 time steps.
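A sketch of the per-processor move loop for one time step, assuming a statically partitioned particle array and a barrier between time steps; the particle fields and the collision handling shown are simplified placeholders, not the actual MP3D code:

    typedef struct { float x, y, z, vx, vy, vz; } Particle;

    extern Particle particles[];
    extern int      nparticles;

    void mp3d_step(int pid, int nprocs, float dt)
    {
        int chunk = nparticles / nprocs;
        int lo = pid * chunk, hi = lo + chunk;

        for (int i = lo; i < hi; i++) {
            Particle *p = &particles[i];
            p->x += p->vx * dt;        /* move along the velocity vector */
            p->y += p->vy * dt;
            p->z += p->vz * dt;
            /* A full step would also locate p's space cell, apply
               boundary and object reflections, and perform a
               probabilistic collision with a particle in the same cell. */
        }
        /* BARRIER(...);  processors synchronize between time steps */
    }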
CHOLESKY [69] performs sparse Cholesky factorization using a dynamic version of the supernodal fan-out method. The matrix is divided into supernodes (sets of columns with identical non-zero structures), which are further divided into conveniently sized chunks called panels. A panel receives updates from other panels to its left, and when all updates have been received, the panel is placed on a task queue. The processors remove panels from this task queue and perform all of their associated modifications, which in turn causes other panels to be placed on the task queue. The principal data structure is the sparse matrix itself, which is stored in a compressed format similar to that of SPARSPAK [25]. The primary operation that is performed repeatedly is adding a multiple of one column to another column. Contention occurs for the task queue and the modified columns, which are protected by locks. For our experiments we ran bcsstk15, a 3948-by-3948 matrix with 56,934 non-zeroes; its factor contains 647,274 non-zeroes.
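A sketch of the worker loop under this scheme, assuming a lock-protected task queue, a per-panel count of pending updates, and illustrative routines factor_panel() and panel_update() (the latter standing in for the repeated column-addition operation); none of these names come from the actual code:

    #define MAX_TARGETS 64             /* illustrative bound                */

    typedef struct Panel Panel;
    struct Panel {
        int    updates_left;           /* updates still pending from the left */
        int    ntargets;               /* panels to the right it modifies   */
        Panel *target[MAX_TARGETS];
        void  *lock;                   /* protects this panel's columns     */
        /* compressed column data omitted */
    };

    extern Panel *dequeue_panel(void);          /* lock-protected task queue */
    extern void   enqueue_panel(Panel *p);
    extern void   factor_panel(Panel *p);
    extern void   panel_update(Panel *src, Panel *dst);
    extern void   lock_acquire(void *l), lock_release(void *l);

    void cholesky_worker(void)
    {
        Panel *p;
        while ((p = dequeue_panel()) != NULL) { /* NULL when no work remains */
            factor_panel(p);           /* p has received all its updates     */
            for (int t = 0; t < p->ntargets; t++) {
                Panel *q = p->target[t];
                lock_acquire(q->lock);
                panel_update(p, q);    /* add multiples of p's columns to q  */
                int ready = (--q->updates_left == 0);
                lock_release(q->lock);
                if (ready)             /* last update: q can now be factored */
                    enqueue_panel(q);
            }
        }
    }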
LOCUS (our abbreviation for ``LocusRoute'') [68] is a high-quality global router for VLSI standard cells that has been used to design real integrated circuits. The parallelism in LOCUS comes from routing multiple wires concurrently. Each processor repeatedly picks up a new wire from the task queue, explores alternative routes, and places the wire along the best route. The principal data structure is a grid of cells called the cost array, which records the presence of wires and thereby guides the placement of new wires. Although the cost array is accessed and updated concurrently by several processors, it is not protected by locks, since the resulting distortions are considered acceptable. Contention for the task queue is not a problem since each routing task is fairly large-grained. For our experiments we run the largest circuit provided with the application, Primary2.grin, which contains 3817 wires and a 1290-by-20 cost array.
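A sketch of the routing loop, assuming an illustrative task-queue interface next_wire() and route-evaluation routines; the dimensions match the Primary2 cost array, but their orientation and all names here are our own, and, as described above, the shared cost array is read and updated without locks:

    #include <limits.h>

    #define ROWS 1290
    #define COLS 20

    typedef struct { int nroutes; /* wire end points omitted       */ } Wire;
    typedef struct { int cost;    /* cells along the route omitted */ } Route;

    extern int   cost_array[ROWS][COLS];   /* shared, not lock-protected */
    extern Wire *next_wire(void);          /* central task queue         */
    extern Route evaluate_route(Wire *w, int r, int (*cost)[COLS]);
    extern void  place_route(Wire *w, Route *best, int (*cost)[COLS]);

    void route_worker(void)
    {
        Wire *w;
        while ((w = next_wire()) != NULL) {
            Route best; best.cost = INT_MAX;
            for (int r = 0; r < w->nroutes; r++) {
                Route cand = evaluate_route(w, r, cost_array); /* reads cost */
                if (cand.cost < best.cost)
                    best = cand;
            }
            place_route(w, &best, cost_array);  /* updates cost cells in place */
        }
    }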
WATER is adapted from the Perfect Club Benchmarks [20] and performs an N-body molecular dynamics simulation of the forces and potentials in a system of water molecules to predict some physical properties of water in the liquid state. The primary data structure is a large array of records that contains the state of each molecule. Molecules are statically allocated to the processors. During each time step, the processors calculate the interaction of the atoms within each molecule, and of the molecules with each other. Due to symmetry, each processor only computes the interaction between a molecule it owns and half of the other molecules. We run WATER with 512 molecules through 2 time steps.
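The symmetric inter-molecular loop might look as follows in outline, assuming 512 molecules in a shared array, a contiguous block per processor, and an illustrative pair_force() that accumulates forces on both molecules of a pair; the halfway pair that arises with an even molecule count, and the locking needed for concurrent force accumulation, are omitted here:

    #define NMOL 512

    typedef struct { double pos[3], force[3]; /* other state omitted */ } Molecule;

    extern Molecule mol[NMOL];
    extern void pair_force(Molecule *a, Molecule *b);  /* updates both molecules */

    void water_forces(int pid, int nprocs)
    {
        int chunk = NMOL / nprocs;
        int lo = pid * chunk, hi = lo + chunk;

        /* Pair each owned molecule only with the NMOL/2 molecules that
           follow it (modulo NMOL), so each pair is computed only once
           across all processors. */
        for (int i = lo; i < hi; i++)
            for (int k = 1; k <= NMOL / 2; k++)
                pair_force(&mol[i], &mol[(i + k) % NMOL]);
    }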
PTHOR [76] is a parallel logic simulator based on the Chandy-Misra simulation algorithm. Unlike centralized-time algorithms, this algorithm does not rely on a single global time during simulation. The primary data structures associated with the simulator are the logic elements (e.g., AND gates and flip-flops), the nets (wires linking the elements), and the task queues which contain activated elements. Each processor executes the following loop. It removes an activated element from one of its task queues and determines the changes on that element's outputs. It then looks up the net data structure to determine which elements are affected by the output change and schedules the newly activated elements onto task queues. For our experiments we simulated several clock cycles for a small RISC processor consisting of about 5000 elements.
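This per-processor loop can be sketched as follows, assuming illustrative Element and Net structures and queue routines; the Chandy-Misra time-management details are omitted:

    typedef struct Element Element;
    typedef struct Net     Net;

    struct Net {
        int      nfanout;
        Element *fanout[8];        /* elements driven by this net (bound illustrative) */
    };

    struct Element {
        int  nouts;
        Net *out[4];               /* nets driven by this element (bound illustrative) */
        /* element type, input values, local time, etc. omitted */
    };

    extern Element *dequeue_activated(int pid);     /* one of pid's task queues */
    extern void     enqueue_activated(int pid, Element *e);
    extern int      evaluate(Element *e);           /* 1 if outputs changed     */

    void pthor_worker(int pid)
    {
        Element *e;
        while ((e = dequeue_activated(pid)) != NULL) {
            if (!evaluate(e))                      /* outputs unchanged         */
                continue;
            for (int n = 0; n < e->nouts; n++) {   /* follow each affected net  */
                Net *net = e->out[n];
                for (int f = 0; f < net->nfanout; f++)
                    enqueue_activated(pid, net->fanout[f]);  /* newly activated */
            }
        }
    }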
BARNES (the full name is ``Barnes-Hut'') is a hierarchical N-body gravitational simulation where each body is modeled as a point mass and exerts forces on all other bodies in the system. To speed up the inter-body force calculations, a set of bodies that is sufficiently far away is abstracted as a single point mass. To facilitate this clustering, physical space is divided recursively to form an octree, until each cell contains at most one body. This tree structure must be traversed on each time step to account for the movement of bodies. The principal data structure is the octree, which is implemented as an array of bodies and an array of cells that are linked together to form a tree. Bodies are statically assigned to processors for the duration of a time step. During each time step, each processor calculates the forces of all bodies on its subset of bodies, and the bodies are then moved according to those forces. Finally, the tree is regenerated for the next time step. A set of distributed locks provides exclusive access to the tree when needed. For our experiments, we run BARNES with 8192 bodies through 3 time steps.
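The force calculation for a single body can be sketched as a recursive walk over the octree, assuming an illustrative Node that represents either a body (leaf) or a cell with up to eight children and a center of mass; THETA, the helper routines, and the omission of self-interaction checks are all simplifications of ours, not the actual BARNES code:

    #define THETA 0.5                 /* cell-opening criterion (illustrative)  */

    typedef struct { double pos[3], acc[3], mass; } Body;

    typedef struct Node Node;
    struct Node {
        int    is_leaf;               /* leaf: holds a single body              */
        double mass, pos[3], size;    /* total mass, center of mass, cell size  */
        Node  *child[8];
    };

    extern double distance(const double *a, const double *b);
    extern void   add_force(Body *b, double mass, const double *pos);

    void walk(Node *n, Body *b)
    {
        if (n == NULL || n->mass == 0.0)
            return;
        double d = distance(n->pos, b->pos);
        if (n->is_leaf || n->size / d < THETA) {
            /* Far enough away (or a single body): treat the whole
               subtree as one point mass. */
            add_force(b, n->mass, n->pos);
        } else {
            for (int c = 0; c < 8; c++)   /* open the cell and recurse */
                walk(n->child[c], b);
        }
    }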