\section{Program descriptions}

The seven programs chosen for investigation fall into three classes.
Five of the programs, SOR, 2DFFT, T2DFFT, SEQ, and HIST, are kernels
that exhibit the communication patterns discussed in
section~\ref{sec:commpat}.  These kernels are written in Fx, a variant
of the HPF~\cite{HPF} language, an important standard for scientific
parallel programming.  Also written in Fx is AIRSHED, an air quality
modeling application, which represents a ``real'' HPF application.
Finally, DSD, a distributed fault diagnosis system, represents the
class of long-term, low-utilization distributed programs which exhibit
global patterns of communication.


\subsection{Fx kernels}
\label{sec:fxkernels}

\begin{figure}
\centering
\begin{tabular}{|l|l|l|}
\hline
Pattern & Kernel & Description\\
\hline
Neighbor &  SOR & 2D Successive overrelaxation\\
All-to-all & 2DFFT & 2D Data parallel FFT\\
Partition & T2DFFT & 2D Task parallel FFT\\
Broadcast & SEQ & Sequential I/O\\
Tree & HIST & 2D Image histogram\\
\hline
\end{tabular}
\caption{Fx kernels}
\label{fig:fxkernels}
\end{figure}

Five of the Fx programs, SOR, 2DFFT, T2DFFT, SEQ, and HIST, were
chosen to exhibit communication patterns common to SPMD parallel
programs discussed in section~\ref{sec:commpat}.  These kernels are
summarized in figure~\ref{fig:fxkernels}.  For each program, we
discuss the distribution of its data (an $N \times N$ matrix) over
its $P$ processors, the local computation on each processor, and
the global communication it exhibits.

\subsubsection{SOR}
SOR is a successive overrelaxation kernel.  In each step, each element
of an $N \times N$ matrix computes its next value as a function of its
neighboring elements.  In the Fx implementation, the rows of the
matrix are distributed across $P$ processors by blocks: processor 0
owns the first $\frac{N}{P}$ rows, processor 1 the next $\frac{N}{P}$ rows,
etc.  Because of this distribution, at each step, every processor $p$
except for processors $0$ and $P-1$ must exchange a row of data with
processor $p-1$ and processor $p+1$ before computing the next value of
each of the elements it owns.  In every step, each processor performs
$O(\frac{N^2}{P})$ local work and sends an $O(N)$ size message to
processors $p-1$ and $p+1$.  SOR is our example of a {\em neighbor}
communication pattern.
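The per-step communication can be sketched as follows (a Python
illustration of the pattern, not the Fx code; $N$ and $P$ are example
values):

```python
# Sketch (not the Fx source): the neighbor exchange in one SOR step
# for an N x N matrix whose rows are block-distributed over P processors.

N, P = 8, 4

def neighbor_messages(P):
    """Return the (sender, receiver) pairs for one SOR step: every
    processor p exchanges a boundary row with p-1 and p+1."""
    msgs = []
    for p in range(P):
        if p > 0:
            msgs.append((p, p - 1))   # send top boundary row up
        if p < P - 1:
            msgs.append((p, p + 1))   # send bottom boundary row down
    return msgs

msgs = neighbor_messages(P)
# Each message carries one row of N elements: O(N) data.
print(len(msgs))                              # 2*(P-1) = 6 messages per step
print(all(abs(s - r) == 1 for s, r in msgs))  # True: neighbors only
```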

\subsubsection{2DFFT}
2DFFT is a two-dimensional Fast Fourier Transform.  As in SOR, the
$N \times N$ input matrix has its rows block-distributed over the
processors.  In the first step, local one-dimensional FFTs are run
over each row a processor owns.  Next, the matrix is redistributed so
that its columns are block-distributed over the processors.  Finally,
local one-dimensional FFTs are run over each column a processor owns.
Each processor performs $O(\frac{N^2 \log N}{P})$ work and generates an
$O(\left ( \frac{N}{P} \right )^2 )$ size message for
every other processor.  2DFFT is our example of an {\em all-to-all}
communication pattern.
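The redistribution can be sketched as follows (a Python illustration
of the pattern only; $N$ and $P$ are example values):

```python
# Sketch (illustrative, not the Fx code): in the row-to-column
# redistribution of an N x N matrix over P processors, processor p's
# contribution to processor q is the (N/P) x (N/P) block at the
# intersection of p's rows and q's columns.

N, P = 8, 4
b = N // P  # rows (or columns) owned per processor

def block_for(p, q):
    """Rows owned by p, columns owned by q: the message p sends to q."""
    rows = range(p * b, (p + 1) * b)
    cols = range(q * b, (q + 1) * b)
    return [(r, c) for r in rows for c in cols]

# Every processor sends an O((N/P)^2) block to every other processor.
sizes = [len(block_for(p, q)) for p in range(P) for q in range(P) if p != q]
print(set(sizes))   # {4}: each message is (N/P)^2 = 4 elements
print(len(sizes))   # P*(P-1) = 12 messages in the all-to-all
```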

\subsubsection{T2DFFT}
T2DFFT is a pipelined, task parallel 2DFFT.  Half of the processors
perform the local row FFTs and send the resulting matrix to the other
half, which perform the local column FFTs.  A side effect of this
communication is the distribution transpose, so each sending processor
sends an $O(\left ( \frac{N}{P} \right )^2)$ size message to each of
the receiving processors.  Notice that, for the same number of
processors, each processor sends twice as much data in total as in
2DFFT.  Each processor performs $O(\frac{N^2 \log N}{P})$ work.  This
is our example of a {\em partition} communication pattern.
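The partitioned transpose can be sketched as follows (a Python
illustration; $N$ and $P$ are example values):

```python
# Sketch (illustrative, not the Fx code): with P processors, the first
# P/2 run the row FFTs and each sends a block to each of the second
# P/2, which run the column FFTs.

N, P = 8, 4
half = P // 2
b = N // half   # rows (or columns) owned per processor in each half

# Messages flow only from the first half to the second half.
msgs = [(src, half + dst) for src in range(half) for dst in range(half)]
print(len(msgs))   # (P/2)^2 = 4 messages per matrix
print(b * b)       # (2N/P)^2 = 16 elements per message
```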

\subsubsection{SEQ}
SEQ is an example of the kind of {\em broadcast} communication pattern
that results from sequential I/O in Fx programs.  An $N \times N$ matrix
distributed over the processors is initialized element-wise by data
produced on processor 0.  This is implemented by having processor 0 
broadcast each element to each of the other processors, which collect
the elements they need.  This program performs no computation, but
processor 0 sends $N^2$ $O(1)$ size messages to every other processor.
This is our example of a {\em broadcast} communication pattern.
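The broadcast can be sketched as follows (a Python illustration of the
pattern; $N$ and $P$ are example values, and rows are assumed
block-distributed as in the other kernels):

```python
# Sketch (illustrative, not the Fx code): processor 0 broadcasts each
# of the N^2 elements; every other processor keeps only the elements
# that fall in its own block of rows.

N, P = 4, 2
rows_per_proc = N // P

received = {q: [] for q in range(1, P)}
for i in range(N):
    for j in range(N):
        for q in range(1, P):            # one O(1)-size message to each
            if i // rows_per_proc == q:  # q keeps only its own rows
                received[q].append((i, j))

msgs_sent = N * N * (P - 1)
print(msgs_sent)          # 16 O(1)-size messages from processor 0
print(len(received[1]))   # 8 elements kept by processor 1 (its N/P rows)
```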

\subsubsection{HIST}
HIST computes the histogram of the elements of an $N \times N$ input matrix.
The input matrix has its rows distributed over the processors.  Each
processor computes a local histogram vector for the rows it owns.
After this, there are $\log P$ steps; at step $i$, each processor $p$
whose number is an odd multiple of $2^i$ sends its
histogram vector to processor $p - 2^i$, an even multiple of $2^i$,
which merges the incoming histogram vector with its local
histogram vector.  Ultimately, processor 0 has the complete histogram,
which it broadcasts to all the other processors.  This is an example
of a {\em tree} communication pattern.
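The reduction can be sketched as follows (a Python illustration of the
pattern; assumes $P$ is a power of two, with toy histogram data):

```python
# Sketch of a log P tree reduction: at step i, odd multiples of 2^i
# send their local histogram to the even multiple of 2^i below them,
# which merges it.

P, BINS = 8, 4
# One local histogram per processor (toy data: processor p counted
# one element in bin p % BINS).
hists = {p: [1 if b == p % BINS else 0 for b in range(BINS)]
         for p in range(P)}

step = 1  # 2^i
while step < P:
    for p in range(step, P, 2 * step):   # odd multiples of 2^i
        dest = p - step                  # even multiple of 2^i
        hists[dest] = [a + x for a, x in zip(hists[dest], hists[p])]
        del hists[p]                     # p has finished sending
    step *= 2

# Processor 0 now holds the complete histogram and would broadcast it.
print(hists[0])   # [2, 2, 2, 2]: 8 elements, two per bin
```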

\subsection{AIRSHED Simulation}

The multiscale AIRSHED model captures the formation, reaction,
and transport of atmospheric pollutants and related chemical
species~\cite{POLLUTION}.
The AIRSHED application simulates the behavior of the AIRSHED model when
it is applied to $s$ chemical species, distributed over domains containing
$p$ grid points in each of $l$ atmospheric layers~\cite{AIRSHED}.
In our simulation,
$s = 35$ species, $p = 1024$ grid points, and $l = 4$ atmospheric layers.

The program computes in two principal phases: (1) horizontal
transport (using a finite element method with repeated application
of a direct solver), followed by (2) chemistry/vertical transport (using
an iterative, predictor-corrector method).  Input is
an $l \times s \times p$ concentration array $C$.  Initial conditions
are input from disk, and in a preprocessing phase for the horizontal
transport phases to follow, the finite element stiffness matrix for each
layer is assembled and factored.  The atmospheric
conditions captured by the stiffness matrix are assumed to be constant
during the simulation hour, so this step is performed just once per hour.
This is followed by a sequence of $k$ simulation steps
($k = 5$ in the simulation),
where each step consists of a horizontal transport phase, followed by a
chemistry/vertical transport phase, followed by another horizontal
transport phase.  Each horizontal transport phase performs $l \times s$
backsolves, one for each layer and species.  All may be computed
independently.  However, for a given layer $i$, all backsolves use the
same factored matrix $A_i$.  The chemistry/vertical transport phase performs
an independent computation for each of the $p$ grid points.  Output
for the hour is an updated concentration array $C'$, which is the input
to the next hour.

In the implementation,
the array is distributed across $P$ processors by layer: processor 0
owns the first $\frac{l}{P}$ layers, processor 1 owns the next
$\frac{l}{P}$ layers, and so on.
In the first stage, the preprocessing and horizontal transport
operate on the
{\em layer} dimension, so the computation is local and
no communication is involved.
In the second stage, however, the chemistry/vertical
transport operates on the {\em grid} dimension,
and so a transpose on the concentration array $C$ is performed
to distribute the data across the processors by grid point: processor 0
owns the first $\frac{p}{P}$ grid points, processor 1 owns the next
$\frac{p}{P}$ grid points, and so on.
Such a transpose requires that each processor send a message
of size $O(\frac{p \times s \times l}{P^2})$ to every other processor.
Once the chemistry/vertical transport computation is finished,
the reverse transpose is performed in a similar fashion -- each processor
sends a message of size $O(\frac{p \times s \times l}{P^2})$ to each of
the other processors.  This is followed by another horizontal transport
phase.  In summary, each simulation hour is characterized by a
preprocessing computation phase of duration $t_i$,
followed by $k$ back-to-back pairs of {\em all-to-all} traffic
attributed to the distribution transposes, interleaved
with horizontal transport (of duration $t_h$)
and chemistry/vertical transport computation (of duration $t_v$).
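The transpose traffic can be illustrated with a short sketch (Python
for illustration only; the processor count $P = 4$ is an assumed
example, while $s$, $p$, $l$, and $k$ come from the simulation
described above):

```python
# Sketch of the per-hour AIRSHED communication structure.
# s species, p grid points, l layers (from the text); P is an assumed
# example processor count, k the number of steps per hour.

s, p, l = 35, 1024, 4
P, k = 4, 5

# Each transpose is an all-to-all: every processor sends
# O(p*s*l / P^2) elements to every other processor.
msg_elems = (p * s * l) // (P * P)
print(msg_elems)   # 8960 elements per message

# One hour: preprocessing, then k steps, each with two transposes
# (into and out of the chemistry/vertical transport phase).
transposes_per_hour = 2 * k
messages_per_hour = transposes_per_hour * P * (P - 1)
print(transposes_per_hour, messages_per_hour)   # 10 120
```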

\subsection{DSD}

Distributed System Diagnosis (DSD)
allows every workstation (a.k.a.\ node) in a
group to know the fault-state of every other node.  In the
Adaptive DSD algorithm implemented here~\cite{DSM-TR}, diagnosis
continues even if only one node is still good.  Adaptive DSD also
has the nice property that there is no single point of failure:
all nodes share in the testing responsibility.  Adaptive DSD has also
been shown to require the provably minimum amount of testing of all
DSD algorithms.  DSD programs will allow future parallel programs to
detect when a machine has gone down, allowing the tasks assigned to
that machine to be reassigned as needed.

DSD is highly periodic.  Each machine begins testing other
machines once every testing round, whose length is defined in a
configuration file.  For our tests, machines were instructed to start
testing every sixty
seconds.  Each test of a machine requires the exchange of a total of
twelve messages, half of which are ACKs.  These consist of a test
request \& reply, a data request \& reply, and another test request \& reply.
If the machine tests good during both tests, then the data is assumed
to be good.
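The exchange described above can be sketched as follows (a Python
illustration; the message labels are hypothetical names chosen to
mirror the description, not DSD's actual packet format):

```python
# Sketch of the twelve-message exchange used to test one machine:
# three request/reply pairs (test, data, test), with every message
# followed by an ACK.

def test_exchange():
    pairs = ["test", "data", "test"]           # per the description above
    msgs = []
    for kind in pairs:
        for direction in ("request", "reply"):
            msgs.append(f"{kind}-{direction}")
            msgs.append("ack")                 # half the traffic is ACKs
    return msgs

msgs = test_exchange()
print(len(msgs))           # 12 messages per test
print(msgs.count("ack"))   # 6 of them are ACKs
```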

This implementation of DSD exhibits mainly neighbor communication,
but also uses binary tree message passing during ``events'' -- when
machines change state between up and down.
