\section{Results}
In this section, we describe the traffic characteristics for each of the
six Fx programs.

\subsection{Fx kernels}

For each of the kernels, we examined its aggregate traffic and the
traffic of a representative connection, if there was one.  We define
a connection to be a kernel-specific simplex channel from a source
machine to a destination machine.  Thus for $P=4$, each of the kernels
exhibits 12 connections.  Notice that by considering a connection between
{\em machines} as opposed to between machine-port pairs, we capture all
kernel-specific traffic between a source and destination machine.  This
includes TCP traffic for message passing, UDP traffic between the PVM
daemons, and TCP ACKs for the symmetric channel.  The communication
patterns of HIST and SEQ are not symmetric, so we examine only the
aggregate traffic of these kernels.  T2DFFT's pattern is symmetric about the
partition, so we consider a connection from a machine in the sending half to
a machine in the receiving half.  The other kernels have symmetric
communication patterns, so we choose the connection between
two arbitrary machines.

The traffic of each of the kernels is characterized by its packet
sizes, interarrival times for packets, and bandwidth, both for the
aggregate traffic and the traffic over the representative connection.
We concentrate on characterizing the bandwidth, since it appears to be
the most interesting of the three.
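All of these summary statistics can be computed directly from a
timestamped packet trace.  The following is a minimal sketch (the
three-packet demonstration trace is hypothetical, and the units match
those used in the tables: bytes and milliseconds):

```python
import statistics

def trace_stats(packets):
    # packets: time-sorted (timestamp_sec, size_bytes) pairs.
    # Returns (min, max, avg, sd) tuples for packet size (bytes)
    # and interarrival time (ms).
    sizes = [size for _, size in packets]
    gaps_ms = [(t1 - t0) * 1000.0
               for (t0, _), (t1, _) in zip(packets, packets[1:])]
    def mmas(xs):
        return (min(xs), max(xs),
                statistics.mean(xs), statistics.pstdev(xs))
    return {"size": mmas(sizes), "interarrival_ms": mmas(gaps_ms)}

# Hypothetical trace: two maximal frames followed by a bare ACK.
stats = trace_stats([(0.000, 1518), (0.001, 1518), (0.012, 58)])
print(stats["size"][:2])         # (58, 1518)
print(stats["interarrival_ms"])  # min 1 ms, max 11 ms, avg 6 ms, sd 5 ms
```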

We note here that the graphs presented are not all to the same scale.
The intention is to better highlight the features of each graph.  However,
this does make quick comparisons between graphs more difficult.

\subsubsection{Packet size statistics}


Figure~\ref{fig:fxpacketstat} shows the minimum, maximum, average, and
standard deviation of packet sizes for each of the five kernels.
The first table covers all the connections while the second includes
only packets in a single representative connection.  Although we do
not present histograms here, it is important to remark that for several
of the kernels (2DFFT, HIST, SOR), the distribution of packet sizes is
{\em trimodal}.  This is because these programs send large messages,
which are split into several maximal-size packets plus a single
smaller packet carrying the remainder.  Further, because TCP is used
for the data transfer, there are a significant number of ACK packets.
One would expect T2DFFT to also send large messages and therefore
exhibit a trimodal distribution of packet sizes.  However, a different
PVM mechanism is used to assemble messages in T2DFFT.  As described in
section~\ref{sec:commmech}, PVM internally stores messages as a fragment
list and generates packets for each fragment separately.  Because
of the way messages are assembled in T2DFFT, many fragments result,
explaining the variety of packet sizes.
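This splitting can be sketched as follows, assuming 1518-byte maximal
Ethernet frames carrying 58 bytes of combined headers (values taken
from the extremes in figure~\ref{fig:fxpacketstat}) and roughly one
ACK per data frame; both are simplifying assumptions, since the
actual segmentation is TCP's:

```python
def frame_sizes(message_bytes, mss=1460, hdr=58):
    # A large message splits into full segments that fill maximal
    # 1518-byte frames (hdr + mss) plus one smaller frame for the
    # remainder; header-only ACK frames flow back at the 58-byte
    # minimum, yielding three modes in the frame-size distribution.
    full, rem = divmod(message_bytes, mss)
    data = [hdr + mss] * full + ([hdr + rem] if rem else [])
    acks = [hdr] * len(data)  # crude: one ACK per data frame
    return data, acks

data, acks = frame_sizes(10000)
print(sorted(set(data + acks)))  # three modes: [58, 1298, 1518]
```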

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
SOR    & 58 & 1518 & 473 & 568 \\
2DFFT  & 58 & 1518 & 969 & 678 \\
T2DFFT & 58 & 1518 & 912 & 663 \\
SEQ    & 58 & 90   & 75  & 14  \\
HIST   & 58 & 1518 & 499 & 575 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
SOR    & 58 & 1518 & 577 & 591 \\
2DFFT  & 58 & 1518 & 977 & 667 \\
T2DFFT &134 & 1518 & 1442 & 158 \\
SEQ    & - & -   & -  & -  \\
HIST   & - & - & - & - \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet size statistics for Fx kernels}
\label{fig:fxpacketstat}
\end{figure*}


\subsubsection{Interarrival time statistics}


Figure~\ref{fig:fxintstat} shows the minimum, maximum, average, and
standard deviation of the packet interarrival times for each of the
five programs.  The first table shows the statistics for all the
connections, while the second concentrates on a single representative
connection.  Notice that the ratio of maximum to average interarrival
time for each program is quite high.  This is due to the bursty
nature of the traffic, as we discuss below.

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
SOR    & 0.0 & 1728.7 & 82.1 & 234.9 \\
2DFFT  & 0.0 & 1395.8 & 1.3  & 10.8    \\
T2DFFT & 0.0 & 1301.6 & 1.5  & 14.3   \\
SEQ    & 0.0 & 218.6  & 1.3  & 8.6  \\
HIST   & 0.0 & 449.9  & 16.5 & 45.5 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
SOR    & 0.0 & 1797.0 & 614.2& 590.8 \\
2DFFT  & 0.0 & 2732.6 & 15.1 & 120.5    \\
T2DFFT & 0.0 & 4216.7 & 9.5  & 127.3 \\
SEQ    & -   & -      & -    & -  \\
HIST   & -   & -      & -    & - \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet interarrival time statistics for Fx kernels}
\label{fig:fxintstat}
\end{figure*}

\subsubsection{Bandwidth}


Figure~\ref{fig:fxavgbw} shows the aggregate and per-connection
average bandwidth used over the lifetime of each of the five
programs.  It is somewhat counter-intuitive (and quite
promising!)  that even the most communication intensive Fx programs
such as 2DFFT do not consume all the available bandwidth.  However,
recall that Fx programs synchronize via their global communication
phases, so there are stretches of time where every processor is
computing.  Each of these periods is followed by an intense burst
of traffic, as every processor tries to communicate.

It is important to note that this synchronization is inherent in the
Fx model and is not merely a result of serialization due to
the Ethernet MAC protocol.  In fact, in several new communication
strategies optimized for compiler-generated SPMD programs the global
synchronization is {\em enforced} by a separate barrier synchronization
before each communication phase~\cite{DEPOSIT-ATM,STRICKER-THESIS}.

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|}
\hline
{\bf Program} & {\bf KB/s} \\
\hline
SOR    & 5.6 \\
2DFFT  & 754.8 \\
T2DFFT & 607.1 \\
SEQ    & 58.3 \\
HIST   & 29.6 \\
\hline
\end{tabular}
&
\begin{tabular}{|l|c|}
\hline
{\bf Program} & {\bf KB/s} \\
\hline
SOR    & 0.9 \\
2DFFT  & 63.2 \\
T2DFFT & 148.6 \\
SEQ    & - \\
HIST   & - \\
\hline
\end{tabular} \\
(aggregate) & (connection) \\
\end{tabular}
\end{center}
\caption{Average bandwidth for Fx kernels}
\label{fig:fxavgbw}
\end{figure}

The effect of this inherent synchronization is made clear by examining
figure~\ref{fig:fxwinbw}, which plots the instantaneous bandwidth
averaged over a 10 ms window for each of the kernels.  This was
computed using a sliding 10 ms averaging window which moves a
single packet at a time.  For the SOR, 2DFFT, and T2DFFT kernels, the
aggregate bandwidth is plotted on the left and the bandwidth of the
representative connection on the right.  Since HIST and SEQ have no
representative connection, only their aggregate bandwidths are
plotted.  In each case, we show a ten second span of time, enough to
include several iterations of the kernel.  The complete traces are
between 50 and several hundred seconds long.
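The sliding-window computation can be sketched as follows.  Anchoring
a trailing 10 ms window at each packet arrival is an assumption; the
text states only that the window advances one packet at a time:

```python
from collections import deque

def sliding_window_bw(packets, window=0.010):
    # For each packet arrival, sum the bytes of all packets within
    # the trailing `window` seconds and divide by the window length.
    # packets: time-sorted (timestamp_sec, size_bytes) pairs.
    # Returns (timestamp, bytes_per_sec) pairs.
    win, in_win, out = deque(), 0, []
    for t, size in packets:
        win.append((t, size))
        in_win += size
        while win and win[0][0] < t - window:
            _, old = win.popleft()
            in_win -= old
        out.append((t, in_win / window))
    return out

# Hypothetical burst: three frames in 2 ms, then a long silence.
bw = sliding_window_bw([(0.000, 1518), (0.001, 1518), (0.002, 1518),
                        (0.500, 58)])
print(bw[2][1])  # 3 * 1518 / 0.010 = 455400.0 B/s during the burst
print(bw[3][1])  # 58 / 0.010 = 5800.0 B/s after the gap
```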

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\hardypsfig{SOR.all.patch.time.winbw.chop.10.ps} &
\hardypsfig{SOR.ba.patch.time.winbw.chop.10.ps} \\
(SOR - aggregate) &
(SOR - connection) \\
\hardypsfig{FFT.all.patch.time.winbw.chop.10.ps} &
\hardypsfig{FFT.ba.patch.time.winbw.chop.10.ps} \\
(2DFFT - aggregate) &
(2DFFT - connection) \\
\hardypsfig{TFFT.all.patch.time.winbw.chop.10.ps} &
\hardypsfig{TFFT.ba.patch.time.winbw.chop.10.ps} \\
(T2DFFT - aggregate) &
(T2DFFT - connection) \\
\hardypsfig{SEQ.all.patch.time.winbw.chop.10.ps} &
\hardypsfig{HIST.all.patch.time.winbw.chop.10.ps} \\
(SEQ - aggregate) &
(HIST - aggregate) \\
\end{tabular}
\end{center}
\caption{Instantaneous bandwidth of Fx kernels (10ms averaging interval)}
\label{fig:fxwinbw}
\end{figure*}

The most remarkable attribute of each of the kernels is that the
bandwidth demand is highly periodic.  Consider the 2DFFT.  Both plots
show about five iterations of the kernel.  Notice that there are
substantial portions of time where virtually no bandwidth is used (all
the processors are in a compute phase).  The third and
fourth bursts are short because they are, in fact, a single
communication phase during which the program was descheduled on some
processor.  Because the all-to-all communication schedule is fixed and
synchronous, the communication phase stalled until that processor was
able to send again.

Figure~\ref{fig:fxpsd} shows the corresponding power spectra
(periodograms) of the instantaneous average bandwidth.  The power
spectra show the frequency-domain behavior of the bandwidth and are
very useful for characterizing it, as we will explore in
Section~\ref{sec:discussion_freq}.  In particular, they capture the
periodicity of the bandwidth demands these applications place on the
network.

For these calculations, the entire trace of each kernel was used, not
just the first 10 seconds displayed in figure~\ref{fig:fxwinbw}.
Because a power spectrum computation requires evenly spaced input
data, the input bandwidth was computed over static 10 ms intervals
by including all packets that arrived during each interval.  This is a
close approximation to the sliding window bandwidth, and more feasible
than correctly sampling the sliding window bandwidth data, which would
require a curve fit over a massive amount of data.
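The binning-plus-periodogram procedure can be sketched as follows,
using a hypothetical 2 Hz burst train as input.  The exact windowing
and normalization used for the figures is not specified, so this is
only an approximation:

```python
import numpy as np

def bandwidth_spectrum(packets, interval=0.010):
    # Bin packet bytes into fixed `interval`-second slots (evenly
    # spaced, unlike the sliding window), convert to bytes/sec, and
    # take the squared FFT magnitude as a simple periodogram.
    t_end = max(t for t, _ in packets)
    n_bins = int(t_end / interval) + 1
    bw = np.zeros(n_bins)
    for t, size in packets:
        bw[int(t / interval)] += size
    bw /= interval
    power = np.abs(np.fft.rfft(bw - bw.mean())) ** 2
    freqs = np.fft.rfftfreq(n_bins, d=interval)
    return freqs, power

# Hypothetical burst train: one frame every 0.5 s for 10 s (2 Hz).
pkts = [(k * 0.5 + 0.002, 1518) for k in range(20)]
freqs, power = bandwidth_spectrum(pkts)
peak = freqs[1 + int(np.argmax(power[1:]))]  # skip the DC bin
print(round(peak, 2))  # fundamental near 2 Hz
```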

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\hardypsfig{SOR.all.patch.time.swb.chop.psd.all.ps} &
\hardypsfig{SOR.ba.patch.time.swb.chop.psd.all.ps} \\
(SOR - aggregate) &
(SOR - connection) \\
\hardypsfig{FFT.all.patch.time.swb.chop.psd.5.ps} &
\hardypsfig{FFT.ba.patch.time.swb.chop.psd.10.ps} \\
(2DFFT - aggregate) &
(2DFFT - connection)\\
\hardypsfig{TFFT.all.patch.time.swb.psd.5.ps} &
\hardypsfig{TFFT.ba.patch.time.swb.chop.psd.5.ps} \\
(T2DFFT - aggregate) &
(T2DFFT - connection)\\
\hardypsfig{SEQ.all.patch.time.swb.psd.10.ps} &
\hardypsfig{HIST.all.patch.time.swb.chop.psd.all.ps} \\
(SEQ - aggregate) &
(HIST - aggregate)\\
\end{tabular}
\end{center}
\caption{Power spectrum of bandwidth of Fx kernels (10ms averaging interval)}
\label{fig:fxpsd}
\end{figure*}

Not surprisingly, SEQ, in which processor 0 repeatedly broadcasts a
single word, is extremely periodic, with the 4 Hz harmonic being
the most important.  HIST has a 5 Hz fundamental with linearly
declining harmonics at 10 Hz, 15 Hz, and so on.

SOR and 2DFFT display opposite relationships between the connection
and aggregate power spectra.  For SOR, the connection power spectrum
shows strong periodicity, with a fundamental of about 5 Hz and
interestingly modulated harmonics, but the aggregate power spectrum
shows far less clear periodicity.  For 2DFFT, the relationship is
reversed, though less pronounced, with a clear fundamental of $1/2$ Hz
and exponentially declining harmonics.  There are two explanations for
this.  First, 2DFFT transfers more data per message than SOR
($O(\left( \frac{N}{P} \right)^2)$ versus $O(N)$, $N=512$, $P=4$), so it
has a better chance of being descheduled (as discussed above).  Second, 2DFFT's
communication pattern more closely synchronizes all the processors
than SOR's.  Thus a single SOR connection has a better chance of being
periodic because the sending processor is less likely to be
descheduled.  On the other hand, SOR's aggregate traffic will be less
periodic because the processors are less tightly synchronized.  Notice,
however, that in both cases, the representative connection's power
spectrum {\em does} show considerable periodicity.
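For concreteness, the per-message volumes implied by these asymptotic
sizes work out as follows (8-byte double-precision elements are an
assumption not stated in the text):

```python
# N = 512, P = 4 as in the experiments; 8-byte elements are assumed.
N, P, elem = 512, 4, 8
fft_msg = (N // P) ** 2 * elem  # 2DFFT: (N/P)^2 elements per message
sor_msg = N * elem              # SOR: O(N) boundary elements per message
print(fft_msg, sor_msg, fft_msg // sor_msg)  # 131072 4096 32
```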

T2DFFT's power spectra have the least clear periodicity of all the Fx
kernels.  However, the aggregate spectrum seems slightly cleaner than
the spectrum of the representative connection.  The fact that neither
spectrum is very clean is surprising given the synchronizing nature of
this pattern, the balanced message sizes, and the communication
schedule (shift) used for it.  We believe the problem arises from
PVM's handling of the message as a cluster of fragments.



%\begin{figure*}
%\begin{center}
%\begin{tabular}{cc}
%\hardypsfig{SOR.all.patch.time.bwp.chop.ps} &
%\hardypsfig{SOR.ba.patch.time.bwp.chop.ps} \\
%(SOR - aggregate) &
%(SOR - connection) \\
%\hardypsfig{FFT.all.patch.time.bwp.chop.ps} &
%\hardypsfig{FFT.ba.patch.time.bwp.chop.ps} \\
%(2DFFT - aggregate) &
%(2DFFT - connection)\\
%\hardypsfig{TFFT.all.patch.time.bwp.chop.ps} &
%\hardypsfig{TFFT.ba.patch.time.bwp.chop.ps} \\
%(T2DFFT - aggregate) &
%(T2DFFT - connection)\\
%\hardypsfig{SEQ.all.patch.time.bwp.chop.ps} &
%\hardypsfig{HIST.all.patch.time.bwp.chop.ps} \\
%(SEQ - aggregate) &
%(HIST - aggregate)\\
%\end{tabular}
%\end{center}
%\caption{D-BIND characterization of bandwidth of Fx kernels}
%\label{fig:fxdbind}
%\end{figure*}


%Figure~\ref{fig:fxdbind} shows the Deterministic-Bounding Interval
%Dependent (D-BIND) traffic model characterization~\cite{D-BIND} of
%each for each of the kernels.  The D-BIND model characterizes traffic
%by the maximum bandwidth used over any interval of particular length
%versus interval length.  For the Fx kernels, we examined every
%interval from 10 ms to 10 seconds, with a granularity of 10 ms.  Each
%graph corresponds to the instantaneous average bandwidth graph in
%figure~\ref{fig:fxwinbw}, although it represents the D-BIND
%characterization for the entire trace, not just ten seconds of it.
%Notice that periodicity made evident by these graphs match precisely
%that discovered via the power spectra of figure~\ref{fig:fxpsd},
%further bolstering that the approximation to the sampled instantaneous
%average bandwidth used to compute the spectra is reasonable.  For
%example, the $1/2$ Hz fundamental of 2DFFT is clearly visible.


\subsection{AIRSHED Simulation}

For AIRSHED, we examined both the aggregate traffic as well as the traffic
of one connection.  The format of the data we present mirrors that of
the previous section.

\subsubsection{Packet size statistics}


Figure~\ref{fig:airshedpacketstat} shows the minimum, maximum, average,
and standard deviation of packet sizes for the AIRSHED application
(for all connections and for the representative connection).  We observe
that the packet size distribution for the single connection is very
similar to the aggregate packet distribution, which supports the argument
that the traffic from the single connection
is representative of the aggregate traffic.

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 58 & 1518 & 899 & 693 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Packet Size (Bytes)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 58 & 1518 & 889 & 688 \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet size statistics for AIRSHED}
\label{fig:airshedpacketstat}
\end{figure*}

\subsubsection{Interarrival time statistics}

Figure~\ref{fig:airshedintstat} shows the minimum, maximum, average,
and the standard deviation of packet interarrival times.  Note that
both the maximum and average interarrival times are an order of
magnitude greater than those of the kernel applications.  As with the
kernels, the ratio of maximum to average interarrival time is quite
high, which is
characteristic of bursty traffic.


\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 0.0 & 23448.6 & 26.8 & 513.3 \\
\hline
\end{tabular} &
\begin{tabular}{|l|c|c|c|c|}
\hline
 & \multicolumn{4}{c|}{\bf Interarrival Time (ms)} \\
\cline{2-5}
{\bf Program} & Min & Max & Avg & SD \\
\hline
AIRSHED & 0.0 & 37018.5 & 317.4 & 2353.6 \\
\hline
\end{tabular}\\
(aggregate) &
(connection)\\
\end{tabular}
\end{center}
\caption{Packet interarrival time statistics for AIRSHED}
\label{fig:airshedintstat}
\end{figure*}

\subsubsection{Bandwidth}

The average aggregate and per-connection bandwidths for the AIRSHED
application are 32.7 KB/s and 2.7 KB/s, respectively.
Figure~\ref{fig:airshedwinbw}
shows the instantaneous bandwidth averaged over a 10 ms window (over
a 500 sec interval, and a 60 sec interval).  It
is clear that the bandwidth demand is highly periodic, and is periodic
over {\em three} different time scales.
The simulation is divided into a sequence
of $h$ simulation-hours ($h = 100$ in the simulation),
each of which involves a sequence of $k$ simulation steps ($k = 5$).
Each simulation hour starts with a preprocessing stage, where the stiffness
matrix is computed.  Once the stiffness matrix is computed,
the program moves on to the simulation steps.  Each step is
characterized by (1) a local horizontal transport computation phase,
(2) a global {\em all-to-all} transpose communication,
(3) a local chemical/vertical transport
computation phase, and finally (4) a global {\em all-to-all} transpose
communication in the reverse direction.

\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\hardypsfig{AIRSHED.all.patch.time.winbw.chop.1000.1500.ps} &
\hardypsfig{AIRSHED.ba.patch.time.winbw.chop.1000.1500.ps} \\
(aggregate, 500 seconds) &
(connection, 500 seconds) \\
\hardypsfig{AIRSHED.all.patch.time.winbw.chop.1000.1060.ps} &
\hardypsfig{AIRSHED.ba.patch.time.winbw.chop.1000.1060.ps} \\
(aggregate, 60 seconds) &
(connection, 60 seconds) \\
\end{tabular}
\end{center}
\caption{Instantaneous bandwidth of AIRSHED (10ms averaging interval)}
\label{fig:airshedwinbw}
\end{figure*}


A total of 100 bursty periods are observed, corresponding to the 100
simulation hours.  The bandwidth utilization between each bursty
period is very low because no communication is involved during the
preprocessing stage at the beginning of each simulation hour.  Each
bursty period can be further divided into 5 pairs of peaks, with each
pair of peaks corresponding to one simulation step.  The time between
each pair of peaks reflects the time spent in the chemical/vertical
transport computation stage, whereas the time interval between
adjacent pairs -- which is slightly shorter -- corresponds to the time
spent in the horizontal transport computation.  Such periodicity
becomes very clear when we observe the power spectra for the AIRSHED
simulation (figure~\ref{fig:airshedpsd}).  There are three peaks (plus
their harmonics) in the power spectrum at approximately 0.015 Hz (66
sec, corresponding to a simulation hour), 0.2 Hz (5 sec, corresponding
to the length of the chemical/vertical transport phase), and 5 Hz (200
ms, corresponding to that of the horizontal transport phase),
respectively.  Section \ref{sec:discussion_freq} discusses the use of
power spectra for characterizing the network traffic of these
programs.
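As a sanity check on this mapping between time scales and spectral
peaks, each quoted peak frequency is simply the reciprocal of the
corresponding phase period ($f = 1/T$), with the period values taken
from the text:

```python
# Periods quoted in the text, in seconds.
periods = {"simulation hour": 66.0,
           "chemical/vertical phase": 5.0,
           "horizontal phase": 0.2}
peaks_hz = {name: 1.0 / T for name, T in periods.items()}
for name, f in peaks_hz.items():
    print(f"{name}: {f:.3f} Hz")
# simulation hour: 0.015 Hz; chemical/vertical: 0.200 Hz;
# horizontal: 5.000 Hz
```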


\begin{figure*}
\begin{center}
\begin{tabular}{cc}
\hardypsfig{AIRSHED.all.patch.time.psd.0.01.ps} &
\hardypsfig{AIRSHED.ba.patch.time.psd.0.01.ps} \\
(aggregate, 0 -- 0.1 Hz) &
(connection, 0 -- 0.1 Hz) \\
\hardypsfig{AIRSHED.all.patch.time.psd.0.1.ps} &
\hardypsfig{AIRSHED.ba.patch.time.psd.0.1.ps} \\
(aggregate, 0 -- 1 Hz) &
(connection, 0 -- 1 Hz) \\
\hardypsfig{AIRSHED.all.patch.time.psd.0.20.ps} &
\hardypsfig{AIRSHED.ba.patch.time.psd.0.20.ps} \\
(aggregate, 0 -- 20 Hz) &
(connection, 0 -- 20 Hz) \\
\end{tabular}
\end{center}
\caption{Power spectrum of bandwidth of AIRSHED (10ms averaging interval)}
\label{fig:airshedpsd}
\end{figure*}

%Finally, figure~\ref{fig:airsheddbind}
%shows the D-BIND characterization of AIRSHED for
%both the aggregate traffic and the single-connection traffic.

% \begin{figure*}
%\begin{center}
%\begin{tabular}{cc}
%\hardypsfig{AIRSHED.all.patch.time.bwp.chop.ps} &
%\hardypsfig{AIRSHED.ba.patch.time.bwp.chop.ps} \\
%(AIRSHED - aggregate) &
%(AIRSHED - connection) \\
%\end{tabular}
%\end{center}
%\caption{D-BIND characterization of bandwidth of AIRSHED}
%\label{fig:airsheddbind}
%\end{figure*}

