\section{Airshed}

\subsection{Airshed simulation}

The airshed simulation is significantly more complex than the previous
examples.  The multiscale airshed model captures the formation, reaction,
and transport of atmospheric pollutants and related chemical species.
The airshed application simulates the behavior of the airshed model when
it is applied to $s$ chemical species, distributed over domains containing
$p$ grid points in each of $l$ atmospheric layers.  In our simulation,
$s = 35$ species, $p = 1024$ grid points, and $l = 4$ atmospheric layers.

The program computes in two principal phases: (1) horizontal
transport (using a finite element method with repeated application
of a direct solver), followed by (2) chemistry/vertical transport (using
an iterative, predictor-corrector method).  Input is
an $l \times s \times p$ concentration array $C$.  Initial conditions
are input from disk, and in a preprocessing phase for the horizontal
transport phases to follow, the finite element stiffness matrix for each
layer is assembled and factored.  The atmospheric
conditions captured by the stiffness matrix are assumed to be constant
during the simulation hour, so this step is performed just once per hour.
This is followed by a sequence of $k$ steps ($k = 5$ in the simulation),
where each step consists of a horizontal transport phase, followed by a
chemistry/vertical transport phase, followed by another horizontal
transport phase.  Each horizontal transport phase performs $l \times s$
backsolves, one for each layer and species.  All may be computed
independently.  However, within each layer $j$, all backsolves use the same
factored matrix $A_j$.  The chemistry/vertical transport phase performs
an independent computation for each of the $p$ grid points.  Output
for the hour is an updated concentration array, which is then input
to the next hour.
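
As a rough illustration of the work implied by this structure, the
per-hour operation counts follow directly from the parameters above
(the arithmetic below is ours, not a measured result):
\[
\begin{array}{lcl}
\mbox{matrix factorizations} & = & l = 4, \\
\mbox{backsolves} & = & 2k \times l \times s = 2 \times 5 \times 4 \times 35 = 1400, \\
\mbox{grid-point computations} & = & k \times p = 5 \times 1024 = 5120 .
\end{array}
\]
The concentration array $C$ itself holds
$l \times s \times p = 4 \times 35 \times 1024 = 143{,}360$ entries.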

In the implementation,
the array is distributed across $P$ processors by layer: processor 0
owns the first $\frac{l}{P}$ layers, processor 1 owns the next
$\frac{l}{P}$ layers, and so on.
In the first stage, the preprocessing and horizontal transport
operate along the
{\em layer} dimension, and so the computation
can be performed locally without any communication.
In the second stage, however, the chemistry/vertical
transport operates on the {\em grid} dimension,
and therefore we perform a transpose on the concentration array $C$
to distribute the data across the processors by grid point: processor 0
owns the first $\frac{p}{P}$ grid points, processor 1 owns the next
$\frac{p}{P}$ grid points, and so on.
Such a transpose requires each processor to send a message
of size $O(\frac{p \times s \times l}{P^2})$ to every other processor.
Once the chemistry/vertical transport computation is finished,
a reversed transpose is performed in a similar fashion -- each processor
sends a message of size $O(\frac{p \times s \times l}{P^2})$ to each of
the other processors.  This is followed by another horizontal transport
phase.  In summary, each simulation hour consists of a preprocessing
compute phase of duration $t_i$,
followed by $k$ back-to-back pairs of transposes, interleaved
with horizontal transport computation (of duration $t_h$)
and chemistry/vertical transport computation (of duration $t_v$).
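
To make the transpose volume concrete, consider the following
back-of-the-envelope sketch.  It assumes a perfectly even block
distribution, and it counts array entries rather than bytes because the
size of an entry is not specified here.  Each processor holds
$\frac{l \times s \times p}{P}$ entries and keeps the $\frac{1}{P}$
fraction that it also owns under the new distribution, so
\[
\mbox{entries per message} = \frac{l \times s \times p}{P^2},
\qquad
\mbox{entries sent per processor per transpose}
  = (P - 1) \, \frac{l \times s \times p}{P^2} .
\]
With $l \times s \times p = 4 \times 35 \times 1024 = 143{,}360$ and,
purely as an illustrative assumption, $P = 4$, this gives $8960$ entries
per message and $26{,}880$ entries sent by each processor in every
transpose.  Each simulation hour contains $2k = 10$ such transposes.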

\subsection{Airshed Simulation}

For Airshed, we examined both the aggregate traffic and the traffic
from one of the connections (all connections are symmetric).  Average
bandwidth utilization was found to be only 0.262 Mbps.

\subsubsection{Packet size statistics}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 58 & 1518 & 899 & 693 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 58 & 1518 & 889 & 688 \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet size statistics for AIRSHED}
\label{fig:airshedpacketstat}
\end{figure}

Figure~\ref{fig:airshedpacketstat} shows the minimum, maximum, average,
and standard deviation of packet sizes for the AIRSHED application
(for all connections and for the representative connection).  We observe
that the packet size distribution for the single connection is very
similar to the aggregate packet distribution, which supports the argument
that the traffic from the single connection
is representative of the aggregate traffic.

\subsubsection{Interarrival time statistics}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 0.0 & 23448.6 & 26.8 & 513.3 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 0.0 & 37018.5 & 317.4 & 2353.6 \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet interarrival time statistics for AIRSHED}
\label{fig:airshedintstat}
\end{figure}

Figure~\ref{fig:airshedintstat} shows the minimum, maximum, average,
and standard deviation of packet interarrival times.  Note that
both the maximum and average interarrival times are an order of
magnitude greater than those of the kernel applications.  As in the case
of the kernel applications, the
ratio of maximum to average interarrival time is quite high, which is
characteristic of bursty traffic.
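
For reference, the ratios implied by the values in
Figure~\ref{fig:airshedintstat} can be worked out directly (this is
simple arithmetic on the tabulated numbers):
\[
\frac{23448.6}{26.8} \approx 875 \quad \mbox{(aggregate)},
\qquad
\frac{37018.5}{317.4} \approx 117 \quad \mbox{(connection)} .
\]
In both cases the longest gap between packets is more than two orders
of magnitude larger than the mean gap.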

\subsubsection{Bandwidth}

The average aggregate and per-connection bandwidths for the AIRSHED
application are 32.7 KB/s and 2.7 KB/s, respectively.
Figure~\ref{fig:airshedwinbw}
shows the instantaneous bandwidth averaged over a 10 ms window (over
a 500 sec interval, and a 60 sec interval).  It
is clear that the bandwidth demand is highly periodic, and is periodic
over {\em three} time scales.  The simulation is divided into a sequence
of $h$ simulation-hours ($h = 100$ in the simulation),
each of which involves a sequence of $k$ simulation steps ($k = 5$).
Each simulation hour starts with a preprocessing stage, where the stiffness
matrix is computed.  Once the stiffness matrix is computed,
the program moves on to the simulation steps.  Each step is
characterized by (1) a local horizontal transport computation phase,
(2) a subsequent global {\em all-to-all} communication
due to the distribution transpose, (3) a local chemical/vertical transport
computation phase, and finally (4) a global {\em all-to-all} communication
due to the distribution transpose (in the reverse direction).
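
The reported averages can be cross-checked against one another; the
following is a rough consistency check, not an additional measurement.
Converting the aggregate figure to bits,
\[
32.7~\mbox{KB/s} \times 8~\mbox{bits/byte} \approx 262~\mbox{Kbps}
  \approx 0.262~\mbox{Mbps},
\]
which matches the average utilization quoted earlier.  Dividing the
average aggregate packet size by the average aggregate interarrival time
gives
\[
\frac{899~\mbox{bytes}}{26.8~\mbox{ms}} \approx 33.5 \times 10^{3}~\mbox{bytes/s},
\]
which is close to 32.7 KB/s (the exact agreement depends on whether KB
denotes 1000 or 1024 bytes).  Finally, the ratio of aggregate to
per-connection bandwidth is
\[
\frac{32.7}{2.7} \approx 12.1,
\]
which is what one would expect if the aggregate were spread evenly over
$P(P-1) = 12$ directed connections, that is, $P = 4$.  The value of $P$
is our inference and is not stated explicitly here.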

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\psfig{figure=AIRSHED.all.patch.time.winbw.chop.1000.1500.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.winbw.chop.1000.1500.ps,height=1.75in} \\
(aggregate, 500 seconds) &
(connection, 500 seconds) \\
\psfig{figure=AIRSHED.all.patch.time.winbw.chop.1000.1060.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.winbw.chop.1000.1060.ps,height=1.75in} \\
(aggregate, 60 seconds) &
(connection, 60 seconds) \\
\end{tabular}
\end{center}
\caption{Instantaneous bandwidth of AIRSHED (10~ms averaging interval)}
\label{fig:airshedwinbw}
\end{figure}


A total of
100 bursty periods are observed, corresponding to the 100 simulation hours.
The bandwidth utilization between bursty periods is very low because no
communication is involved during the preprocessing stage at the beginning
of each simulation hour.  Each bursty period can be further divided into
5 pairs of peaks, with each pair of peaks corresponding to one simulation
step.  The time between the two peaks of a pair reflects the time spent in the
chemical/vertical transport computation stage, whereas the slightly shorter
interval between adjacent pairs corresponds to the
time spent in the horizontal transport computation.

Such periodicity becomes very clear when we observe the
power spectra for the AIRSHED simulation (Figure~\ref{fig:airshedpsd}).
There are three peaks (plus harmonics) in the power spectrum at approximately
0.015 Hz (66 sec, corresponding to a simulation hour),
0.2 Hz (5 sec, corresponding to the chemical/vertical transport phase),
and 5 Hz (200 ms, corresponding to the horizontal transport phase).
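
The correspondence between each spectral peak and its period is simply
$T = 1/f$:
\[
\frac{1}{0.015~\mbox{Hz}} \approx 66.7~\mbox{s},
\qquad
\frac{1}{0.2~\mbox{Hz}} = 5~\mbox{s},
\qquad
\frac{1}{5~\mbox{Hz}} = 0.2~\mbox{s} .
\]
As a rough sanity check, $h = 100$ simulation hours at about 66.7~s each
would span roughly 6700~s.  This estimate is ours and assumes that every
simulation hour takes the same amount of time.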

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\psfig{figure=AIRSHED.all.patch.time.psd.0.01.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.psd.0.01.ps,height=1.75in} \\
(aggregate, 0 -- 0.1 Hz) &
(connection, 0 -- 0.1 Hz) \\
\psfig{figure=AIRSHED.all.patch.time.psd.0.1.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.psd.0.1.ps,height=1.75in} \\
(aggregate, 0 -- 1 Hz) &
(connection, 0 -- 1 Hz) \\
\psfig{figure=AIRSHED.all.patch.time.psd.0.20.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.psd.0.20.ps,height=1.75in} \\
(aggregate, 0 -- 20 Hz) &
(connection, 0 -- 20 Hz) \\
\end{tabular}
\end{center}
\caption{Power spectrum of bandwidth of AIRSHED (10~ms averaging interval)}
\label{fig:airshedpsd}
\end{figure}

Finally, Figure~\ref{fig:airsheddbind}
shows the D-BIND characterization of AIRSHED for
both the aggregate traffic and the single-connection traffic.

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\psfig{figure=AIRSHED.all.patch.time.bwp.chop.ps,height=1.75in} &
\psfig{figure=AIRSHED.ba.patch.time.bwp.chop.ps,height=1.75in} \\
(AIRSHED - aggregate) &
(AIRSHED - connection) \\
\end{tabular}
\end{center}
\caption{D-BIND characterization of bandwidth of AIRSHED}
\label{fig:airsheddbind}
\end{figure}

\subsection{Airshed Simulation}

The analysis of the airshed simulation is interesting because
it is a ``real'' application which involves complex manipulation of data
and communication.  Airshed is similar to the kernels described in the
sense that it can be characterized by an alternating sequence of
computation and communication phases.  However, the pattern in the airshed
simulation is more complex in that it cannot be characterized by a single
period.  From Figures~\ref{fig:airshedwinbw} and~\ref{fig:airshedpsd},
we learn that in each
simulation hour, the network is first clear of traffic, and a bursty
period then follows.  A closer look at the bursty period shows that
it is further divided into $k$ pairs of peaks, with the silent period
between the two peaks of a pair differing from the silent
period between adjacent pairs.  Each peak, however, takes up approximately
the same bandwidth because every transpose moves the same amount of data:
each processor sends $P - 1$ messages of size
$O(\frac{l \times s \times p}{P^2})$, or
$O(\frac{l \times s \times p}{P})$ in total.
