\section{Program descriptions}

The six Fx~\cite{TASK-PARALLELISM-HPF-JOURNAL} programs chosen for
investigation fall into two classes.  Five of the programs, SOR,
2DFFT, T2DFFT, SEQ, and HIST, are kernels that exhibit the
communication patterns discussed in section~\ref{sec:commpat}.  These
kernels are part of the Fx compiler test
suite. AIRSHED~\cite{AIRSHED,AIRSHED-DESCRIPTION}, an air quality
modeling application, represents a ``real'' scientific application.


\subsection{Fx kernels}
\label{sec:fxkernels}

\begin{figure}
\centering
\begin{tabular}{|l|l|l|}
\hline
Pattern & Kernel & Description\\
\hline
Neighbor &  SOR & 2D Successive overrelaxation\\
All-to-all & 2DFFT & 2D Data parallel FFT\\
Partition & T2DFFT & 2D Task parallel FFT\\
Broadcast & SEQ & Sequential I/O\\
Tree & HIST & 2D Image histogram\\
\hline
\end{tabular}
\caption{Fx kernels}
\label{fig:fxkernels}
\end{figure}

Five of the Fx programs, SOR, 2DFFT, T2DFFT, SEQ, and HIST, were
chosen to exhibit communication patterns common to SPMD parallel
programs discussed in section~\ref{sec:commpat}.  These kernels are
summarized in figure~\ref{fig:fxkernels}.  For each program, we
discuss the distribution of its data (an $N \times N$ matrix) over
its $P$ processors, the local computation on each processor, and
the global communication it exhibits.

\subsubsection{SOR}
SOR is a successive overrelaxation kernel.  In each step, each element
of an $N \times N$ matrix computes its next value as a function of its
neighboring elements.  In the Fx implementation, the rows of the
matrix are distributed across $P$ processors by blocks: processor 0
owns the first $\frac{N}{P}$ rows, processor 1 the next $\frac{N}{P}$ rows,
etc.  Because of this distribution, at each step, every processor $p$
except for processors $0$ and $P-1$ must exchange a row of data with
processor $p-1$ and processor $p+1$ before computing the next value of
each of the elements it owns.  In every step, each processor performs
$O(\frac{N^2}{P})$ local work and sends an $O(N)$ size message to
processors $p-1$ and $p+1$.  SOR is our example of the {\em
neighbor} communication pattern.
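The neighbor exchange and its per-step cost can be sketched as follows.
This is a hypothetical pure-Python simulation, not the Fx code; the
four-point averaging stencil and the small values of $N$ and $P$ are
illustrative assumptions.

```python
# Hypothetical sketch (not the Fx code): one relaxation step, plus a
# count of the O(N)-size boundary-row messages the step requires.
N, P = 8, 4                        # N x N matrix, P "processors"
A = [[float(i * N + j) for j in range(N)] for i in range(N)]
rows = N // P                      # each processor owns a block of rows

def step(A):
    """One relaxation step: each interior element becomes a function
    (here, the average) of its four neighboring elements."""
    B = [row[:] for row in A]
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] +
                              A[i][j-1] + A[i][j+1])
    return B

# Each processor p exchanges one row with p-1 and one with p+1,
# except the first and last processors, which have only one neighbor.
msgs = sum(2 if 0 < q < P - 1 else 1 for q in range(P))
```

Counting both directions, `msgs` comes out to $2(P-1)$ row exchanges
per step, each of size $O(N)$, matching the pattern described above.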

\subsubsection{2DFFT}
2DFFT is a two-dimensional Fast Fourier Transform.  As in SOR, the
$N \times N$ input matrix has its rows block-distributed over the
processors.  In the first step, local one-dimensional FFTs are run
over each row a processor owns.  Next, the matrix is redistributed so
that its columns are block-distributed over the processors.  Finally,
local one-dimensional FFTs are run over each column a processor owns.
Each processor performs $O(\frac{N^2 \log N}{P})$ work and generates an
$O(\left ( \frac{N}{P} \right )^2 )$ size message for
every other processor.  2DFFT is our example of an {\em all-to-all}
communication pattern.
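The traffic generated by the row-block to column-block redistribution
can be sketched with simple bookkeeping.  This hypothetical sketch
counts messages and their sizes only; it performs no FFT arithmetic,
and $N$ and $P$ are illustrative.

```python
# Hypothetical bookkeeping for the redistribution behind 2DFFT's
# all-to-all phase (no FFT math, just message counts and sizes).
N, P = 16, 4
blk = N // P                      # rows (or columns) per processor
# The chunk processor p sends to processor q is the intersection of
# p's row block with q's column block: an (N/P) x (N/P) submatrix.
msgs = {(p, q): blk * blk
        for p in range(P) for q in range(P) if p != q}
```

Every processor sends to every other processor, so there are
$P(P-1)$ messages in all, each of $(N/P)^2$ elements.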

\subsubsection{T2DFFT}
T2DFFT is a pipelined, task parallel 2DFFT.  Half of the processors
perform the local row FFTs and send the resulting matrix to the other
half, which perform the local column FFTs.  A side effect of the
communication is the distribution transpose, so each sending processor
sends an $O(\left ( \frac{N}{P} \right )^2)$ size message to each of
the receiving processors.  Notice that each message is twice as large
as for 2DFFT for the same number of processors.  Each processor
performs $O(\frac{N^2 \log N}{P})$ work.  T2DFFT is our example of a {\em
partition} communication pattern.

\subsubsection{SEQ}
SEQ exemplifies the {\em broadcast} communication pattern that
results from sequential I/O in Fx programs.  An $N \times N$ matrix
distributed over the processors is initialized element-wise with data
produced on processor 0.  This is implemented by having processor 0
broadcast each element to all of the other processors, which retain
only the elements they own.  The program performs no computation, but
processor 0 sends $N^2$ messages of size $O(1)$ to every other
processor.
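The element-wise broadcast can be sketched as follows.  This is a
hypothetical simulation under an assumed row-block distribution; the
values of $N$ and $P$ are illustrative.

```python
# Hypothetical sketch of SEQ's broadcast pattern: processor 0 emits
# one O(1) broadcast per element; only the owner retains the element.
N, P = 8, 4
rows = N // P                     # row-block distribution
kept = {q: 0 for q in range(P)}   # elements retained per processor
sent = 0
for i in range(N):
    for j in range(N):
        sent += 1                 # one O(1)-size broadcast per element
        kept[i // rows] += 1      # only the owning processor keeps it
```

Processor 0 issues $N^2$ broadcasts in total, while each processor
ends up retaining exactly its $\frac{N}{P} \times N$ block.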

\subsubsection{HIST}
HIST computes the histogram of the elements of an $N \times N$ input matrix.
The input matrix has its rows distributed over the processors.  Each
processor computes a local histogram vector for the rows it owns.
After this, there are $\log P$ merge steps: at step $i$, each
processor whose number is an odd multiple of $2^i$ sends its
histogram vector to the processor numbered $2^i$ below it, which is
an even multiple of $2^i$.  The receiving processors merge the
incoming histogram vector with their local histogram vectors.
Ultimately, processor 0 has the complete histogram,
which it broadcasts to all the other processors.  This is an example
of a {\em tree} communication pattern.
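The $\log P$ merge steps can be sketched as a tree reduction.  This is
a hypothetical simulation; $P$, the number of bins, and the uniform
local histograms are illustrative assumptions.

```python
# Hypothetical sketch of the log P tree merge: at step i, processors
# at odd multiples of 2**i send to the even multiple 2**i below them.
P, BINS = 8, 4
hist = {q: [1] * BINS for q in range(P)}    # each local histogram
step = 1                                    # step i uses stride 2**i
while step < P:
    for recv in range(0, P, 2 * step):      # even multiples of 2**i
        send = recv + step                  # odd multiple of 2**i
        if send < P:
            hist[recv] = [a + b for a, b in
                          zip(hist[recv], hist[send])]
    step *= 2
# Processor 0 now holds the complete histogram, which it would then
# broadcast to all the other processors.
```

With each of the $P$ local histograms containing a 1 in every bin,
processor 0 finishes with a $P$ in every bin, as expected.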

\subsection{AIRSHED Simulation}

The multiscale AIRSHED model captures the formation, reaction, and
transport of atmospheric pollutants and related chemical
species~\cite{POLLUTION}.  The goal of a related research project is
to convert this massive application into a portable and scalable
parallel program~\cite{AIRSHED-DESCRIPTION}.  As a part of this work,
AIRSHED is being ported to Fx.  However, at the time of our research,
this port had not been completed.  Instead, we measured an Fx skeleton
of the application which was prepared by the group performing the
actual port.  The skeleton application models both the computation and
communication of the actual application.

AIRSHED simulates the movement and reaction of $s$ chemical species,
distributed over domains containing $p$ grid points in each of $l$
atmospheric layers~\cite{AIRSHED}.  In our simulation, $s = 35$ species,
$p = 1024$ grid points, and $l = 4$ atmospheric layers.  The program
computes in two principal phases:  (1) horizontal transport (using a
finite element method with repeated application of a direct solver),
followed by (2) chemistry/vertical transport (using an iterative,
predictor-corrector method).  Input is an $l \times s \times p$
concentration array $C$.  Initial conditions are input from disk, and in
a preprocessing phase for the horizontal transport phases to follow, the
finite element stiffness matrix for each layer is assembled and
factored.  The atmospheric conditions captured by the stiffness matrix
are assumed to be constant during the simulation hour, so this step is
performed just once per hour.  This is followed by a sequence of $k$
simulation steps ($k = 5$ in the simulation), where each step consists
of a horizontal transport phase, followed by a chemistry/vertical
transport phase, followed by another horizontal transport phase.  Each
horizontal transport phase performs $l \times s$ backsolves, one for
each layer and species.  All may be computed independently; however,
all backsolves for a given layer $i$ use the same factored matrix $A_i$.
The chemistry/vertical transport phase performs an independent
computation for each of the $p$ grid points.  Output for the hour is an
updated concentration array $C'$, which is the input to the next hour.

In the implementation, the array is distributed across $P$ processors by
layer:  processor 0 owns the first $\frac{l}{P}$ layers, processor 1
owns the next $\frac{l}{P}$ layers, and so on.  In the first stage, the
preprocessing and horizontal transport operates on the {\em layer}
dimension, so the computation is local and no communication is involved.
In the second stage, however, the chemistry/vertical transport operates
on the {\em grid} dimension, and so a transpose on the concentration
array $C$ is performed to distribute the data across the processors by
grid point:  processor 0 owns the first $\frac{p}{P}$ grid points,
processor 1 owns the next $\frac{p}{P}$ grid points, and so on.  This
transpose requires that each processor send a message of size
$O(\frac{p \times s \times l}{P^2})$ to every other processor.  Once
the chemistry/vertical transport computation is finished, the reverse
transpose is performed in a similar fashion -- each processor again
sends a message of size $O(\frac{p \times s \times l}{P^2})$ to each
of the other processors.  This is followed by another horizontal
transport phase.  In summary, each simulation hour consists of a
preprocessing phase of duration $t_i$, followed by $k$ back-to-back
pairs of {\em all-to-all} traffic attributable to the distribution
transposes, interleaved with horizontal transport computation (of
duration $t_h$) and chemistry/vertical transport computation (of
duration $t_v$).
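The bookkeeping for one distribution transpose can be sketched as
follows, using the simulation parameters given above ($s = 35$,
$p = 1024$, $l = 4$) and an assumed illustrative processor count.

```python
# Hypothetical bookkeeping for the layer -> grid-point transpose of
# the concentration array; sizes follow the O(p*s*l / P**2) message
# cost stated above.  P = 4 is an illustrative assumption.
s, p, l, P = 35, 1024, 4, 4       # species, grid points, layers, procs
layers_per = l // P               # distribution before the transpose
points_per = p // P               # distribution after the transpose
# Processor a sends processor b its owned layers restricted to b's
# grid points, for all s species.
chunk = layers_per * s * points_per
volume = P * (P - 1) * chunk      # total elements moved per transpose
```

Each of the $P(P-1)$ messages carries $\frac{p \times s \times l}{P^2}$
elements, and each simulation step incurs two such transposes.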







