
\section{Discussion}

\subsection{SPMD parallel programs}

The measurement and analysis of the Fx kernels and the AIRSHED program
point to several important characteristics of SPMD parallel programs.
First, SPMD programs exhibit global, collective communication
patterns that cannot necessarily be characterized by the behavior of a
single connection.  For example, the SEQ (broadcast pattern) and HIST
(tree pattern) kernels are not symmetric --- in SEQ, only the
connections from processor 0 to every other processor (and the
symmetric connections back to processor 0) see traffic.  Further,
characterizing symmetric patterns such as neighbor, all-to-all,
and partition by a single connection ignores the fact that these
patterns differ greatly in the number of connections they use.
For example, each pattern may communicate the same size
message along a connection, but while all-to-all sends such a message
along {\em all} $P(P-1)$ connections, neighbor sends a message along
at most $2P$ connections.  The partition pattern falls in between, at
$\frac{P^2}{4}$ connections for an equal partition into two halves.
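These connection counts can be made concrete with a short sketch.  The
function names are ours, and the partition count simply follows the
text's figure of $\frac{P^2}{4}$ for an equal split into two halves:

```python
# Number of connections that carry traffic in each symmetric pattern,
# for P processors.  These helper functions are illustrative only.

def all_to_all_connections(P):
    # Every processor sends to every other processor.
    return P * (P - 1)

def neighbor_connections(P):
    # Each processor sends to at most two neighbors, so at most
    # 2P connections carry traffic.
    return 2 * P

def partition_connections(P):
    # Equal partition into two halves: P^2 / 4 connections
    # (per the count given in the text), for even P.
    return (P // 2) ** 2
```

For $P = 8$, the three patterns use 56, 16, and 16 connections
respectively, even if the per-connection message size is identical.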

A second important characteristic of SPMD programs is that their
communication phases are synchronized, either explicitly or
implicitly.  This means that the traffic along the active connections
is {\em correlated}, and any traffic model must capture this.  Further,
the stronger the synchronization, the more likely it is that the connections
are {\em in phase.}

Third, the traffic along a connection is bursty with a clear period.
However, unlike with a variable bit rate video source, where the
periodicity is known, but the burst size is variable, a parallel
program's burst size is usually known a priori (in the case of Fx, at
compile-time), but the period of the bursts depends on the number of
processors and the bandwidth the network can provide to the
application during the burst.  Consider a simple example.  Suppose
there is $W$ work during a compute phase, and each processor must send
messages of length $N$ along one connection (for example, a shift
pattern).  Further, suppose the communication pattern is symmetric, so
each of the $P$ processors will attempt to do the same, and that each
processor can send and receive at the same time (a reasonable
assumption with, for example, a HIPPI network.)  If the network can
allocate a burst bandwidth of $B$ for each active connection without
congestion, then the burst length $t_{b}=\frac{N}{B}$ and the burst
interval $t_{bi}=\frac{W}{P}+\frac{N}{B}$. Notice that the burst
interval, which certainly plays into the decision of what $B$ the
network can commit to, is a function of $B$ itself (as well as of
the other commitments the network has made.)  
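The burst parameters of this example can be computed directly; the
numeric values below are arbitrary and only illustrate the formulas:

```python
def burst_parameters(W, N, P, B):
    """Burst length t_b and burst interval t_bi for the shift-pattern
    example above: W units of work per compute phase, messages of
    length N along one connection per processor, P processors, and a
    committed burst bandwidth of B per active connection."""
    t_b = N / B            # burst length:   t_b  = N / B
    t_bi = W / P + N / B   # burst interval: t_bi = W / P + N / B
    return t_b, t_bi

# Example: W = 1000 work units, N = 10, P = 10 processors, B = 5.
t_b, t_bi = burst_parameters(1000, 10, 10, 5)
```

Note how a larger committed bandwidth $B$ shortens both the burst and
the burst interval, which is why the network's commitment decision and
the resulting interval are intertwined.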

At this point it must be pointed out that the parallel program clearly
wants to minimize $t_{bi}$ in order to minimize its total execution
time.  One way it can do this is to increase the number of processors
$P$ it runs on.  However, this has a natural tension with the
bandwidth $B$ that the network can commit to, and, less obviously, the
communication pattern determines how strong that tension is.  Thus
getting the best performance from a parallel program on a network is
essentially an optimization problem, {\em where the number of processors
plays a role.}  We suggest that an SPMD parallel program should characterize
its traffic with three parameters, $[l(),b(),c]$, where $c$ is the
communication pattern, $l$ is a function from the number of processors $P$
to the local computation time $t_{local}$ on each processor, and $b$ is
a function from $P$ to the burst size $N$ along each connection.  In order
to meet the ``guarantee'' of minimizing $t_{bi}$, the network is allowed
to return the number of processors $P$ the program should run on.
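The optimization can be sketched as a toy search over $P$ for the
value that minimizes $t_{bi} = l(P) + b(P)/B(P)$.  The model $B(P)$ of
how the committed per-connection bandwidth shrinks as more connections
become active is purely our assumption, for illustration:

```python
# Toy version of the optimization: search over the processor count P
# for the one that minimizes the burst interval t_bi = l(P) + b(P)/B(P).
# The functions below, in particular the network model B(P), are
# assumptions made for illustration only.

def best_processor_count(l, b, B, P_max):
    best_P, best_tbi = None, None
    for P in range(1, P_max + 1):
        t_bi = l(P) + b(P) / B(P)
        if best_tbi is None or t_bi < best_tbi:
            best_P, best_tbi = P, t_bi
    return best_P, best_tbi

W, N = 1000.0, 100.0
l = lambda P: W / P       # local computation shrinks with P
b = lambda P: N           # fixed burst size per connection
B = lambda P: 50.0 / P    # committed bandwidth falls as P grows

P_opt, t_bi_opt = best_processor_count(l, b, B, 64)
```

With these assumed functions, $t_{bi}(P) = 1000/P + 2P$, so adding
processors first helps and then hurts; the search returns the interior
optimum, which is exactly the tension described above.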

\subsection{AIRSHED simulation}

The analysis of the AIRSHED simulation is interesting because it is a
``real'' application which involves complex manipulation of data and
communication.  AIRSHED is similar to the kernels described above in the
sense that it can be characterized by an alternating sequence of
computation and communication phases.  However, the communication
pattern in the
AIRSHED simulation is more complicated, and it cannot be characterized
by just a single period.  From figures~\ref{fig:airshedwinbw} and
\ref{fig:airshedpsd}, we learn that the AIRSHED simulation, as in the
case of the kernels, displays a bursty, but periodic, traffic pattern.
Each basic period corresponds to one simulation hour, and the fundamental
frequency is clearly shown in the power spectrum (at 0.015 Hz, or 66 sec
per cycle).
A closer look at the bursty period for the connection
shows that the bursty traffic can be further
divided into $k$ pairs of peaks (recall that $k$ is the number of simulation
steps within a simulation hour), with each pair of peaks corresponding to
the traffic due to the pair of distribution transposes within the
same simulation step.
In addition, both the intra-pair and the inter-pair silence
periods are constant across the whole trace, and the corresponding
frequencies of 0.2 Hz and 5 Hz are also clearly visible in the power
spectrum.

Moreover, each peak takes up approximately the same
bandwidth, because the total size of the message (per processor)
is always $O(\frac{l \times s \times p}{P})$.
Hence, AIRSHED is a real-life application in which we have
{\em a priori} knowledge about both the burst size and the pattern of
the bursts, and they are closely related to how the program is structured.
It should be clear that the traffic and the quality-of-service guarantees
for this application are fundamentally different from those of media
applications.
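The spectral peaks reported for the AIRSHED trace translate directly
into the time-domain periods they imply; the short sketch below only
performs the conversion $\mathrm{period} = 1/\mathrm{frequency}$ for
the three frequencies named above:

```python
# Convert the spectral peaks of the AIRSHED trace into the
# time-domain periods they imply (period = 1 / frequency).

def period(freq_hz):
    return 1.0 / freq_hz

fundamental = period(0.015)  # one simulation hour: ~66 s per cycle
silence_a = period(0.2)      # one silence-period spacing: 5 s
silence_b = period(5.0)      # the other silence-period spacing: 0.2 s
```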

\subsection{DSD}

DSD is quite different from the other parallel programs we have 
examined.  It is meant to be used as a utility for other parallel
programs.  Whereas the other programs we have examined are more-or-less
willing to take whatever bandwidth the network is able to provide, DSD
takes pains to reduce both network and CPU usage.  It is therefore quite
interesting to compare it to the other SPMD programs.

DSD relies mainly on neighbor communication, although during an
event (which is defined as a workstation failing or coming back
online) it will use tree communication.  Since it mainly uses neighbor
communication, we would expect it to show some similarities to the
SOR Fx application.  This seems to be most noticeable in the two
programs' respective D-BIND characterizations.  Although at first glance
DSD appears to show texture other than a smooth curve, notice that its
bandwidth scale is much smaller than that for the SOR curves.  Also notice
that the two sets of graphs have been plotted over different time scales,
yet they still show no pattern other than a fairly smooth curve.

Although DSD was designed to use very few resources, it still
relies on network guarantees.  DSD must be offered latency guarantees
for its message passing --- if a message is not ACKed in time, then a
node is assumed to be down.  This sets DSD apart from other applications.
Many applications would like to get a bandwidth guarantee.  DSD, however,
does not use much bandwidth at all, but requires that its messages be
delivered promptly.  Networks must be careful not to block messages
produced by such applications, as this would seriously limit their
utility.  Indeed, it would make the diagnosis of the system state
incorrect.  However, the periodicity of DSD and the bandwidth it uses
are known {\em a priori} (with the exception of messages that result
from events).
This should make the job of providing guarantees much easier on the network,
which sets DSD apart from VBR video and the other parallel programs examined
in this paper.
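The timeout rule described above can be sketched minimally; the
function name, return values, and timings are hypothetical, not an
actual DSD interface:

```python
def diagnose(ack_delay, latency_bound):
    """Hypothetical sketch of the timeout rule: a node is assumed
    down if its ACK never arrived (ack_delay is None) or arrived
    later than the latency bound the network guarantees."""
    if ack_delay is None or ack_delay > latency_bound:
        return "down"
    return "up"
```

The sketch makes the dependence explicit: if the network cannot honor
the latency bound, a late ACK is indistinguishable from a failed node,
and the diagnosis is wrong.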

Programs like DSD also make a tradeoff between data latency and network
utilization.  When an event occurs, we can either attempt to pass the
information to all other nodes as quickly as possible, or we can let
it get passed around at each normal testing interval.  In the former
case, we use more bandwidth when an event occurs.  In the latter, there
is no change in bandwidth used, but it takes much longer for all nodes
to learn of the event.  This tradeoff must also be taken into account
when setting aside bandwidth for such an application.  
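The tradeoff can be made concrete with a toy delay model.  We assume
(our assumption, not the paper's) that normal-interval dissemination
travels one neighbor hop per testing interval around a ring of $P$
nodes, while the event-driven tree broadcast takes roughly
$\log_2 P$ hops:

```python
import math

def flood_delay(P, hop_latency):
    # Event-driven tree broadcast: extra traffic right away, but the
    # event reaches all P nodes in about log2(P) hops.
    return hop_latency * math.ceil(math.log2(P))

def gossip_delay(P, interval):
    # Dissemination at the normal testing interval: the news travels
    # one neighbor hop per interval, so the farthest node may wait
    # up to (P - 1) intervals.
    return (P - 1) * interval
```

For 16 nodes with a one-second testing interval, the gossip approach
can take on the order of 15 intervals while the flood finishes in a
few hop latencies --- the bandwidth spike of the flood buys a large
reduction in dissemination delay.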
