
\section{Discussion}
\label{sec:discussion}
The measurement and analysis of the Fx kernels and the AIRSHED program
point to several important characteristics of the network traffic of
Fx parallel programs.  The most important of these is that their
periodicity is well characterized by their power spectra, and can be
emulated by simplifying the Fourier series implied by the spectra.
Finally, we suggest a negotiation model for QoS which would allow both
the network and the program to co-optimize for performance.

\subsection{Elementary characteristics}

Fx programs exhibit some global, collective communication patterns
which may not necessarily be characterized by the behavior of single
connection.  For example, the SEQ (broadcast pattern) and HIST (tree
pattern) kernels are not symmetric --- in SEQ only the connection from
processor 0 to the every other processor (and the symmetric
connections back to processor 0) see traffic.  Further, characterizing
the symmetric patterns such as neighbor, all-to-all, and partition by
a single connection ignores the fact that these patterns are very
different in the number of connections that are used.  For example,
each of the patterns may communicate the same size message along a
connection, but while all-to-all sends such a message along {\em all}
$P(P-1)$ connections, neighbor sends a message along only at most $2P$
connections.  The partition pattern is in the middle at
$\frac{P^2}{4}$ connections for an equal partition into two halves.

Another important characteristic of Fx programs is that their
communication phases are synchronized, either explicitly or
implicitly.  This means that the traffic along the active connection
is {\em correlated} and any traffic model must capture this.  Further,
the stronger the synchronization, the more likely it is that the connections
are {\em in phase.}

\subsection{Characterizing periodicity}
\label{sec:discussion_freq}
As stated above, the synchronized communication phases of a Fx program
imply that its connections act in phase.  Thus, the power spectra of
Figures~\ref{fig:fxpsd} and~\ref{fig:airshedpsd} fully characterize
the bandwidth demands of the applications discussed in this paper.
Furthermore, it should be realized that the power spectrum is the
square of the Fourier transform of the time-domain instantaneous
average bandwidth.  Since this underlying signal is periodic, the transform is a Fourier series: 
\begin{equation}
X(\omega) = \sum_{k = - \infty }^{ \infty } 2 \pi a_k \delta(\omega - \omega_0)
\end{equation}
where the $a_k$ are the coefficients which can be read off of the
power spectrum graphs.  The time-domain instantaneous bandwidth can
then be reconstructed as:
\begin{equation}
x(t) = \sum_{k=-\infty }^{\infty } a_k e^{jk \omega_0 t}
\end{equation}
While the summation may appear analytically daunting, note that $x(t)$
can be approximated by choosing some number of the ``spike'' $a_k$s
from the spectra (those with the greatest magnitude).  As the number
of spikes chosen increases, the approximation will converge to the
actual signal.

\subsection{QoS negotiation model}

Consider a simple parallel program where each processor generates
periodic bursts along one of its connections (a shift pattern.)
Unlike  variable bit rate video source, where the periodicity is
known, but the burst size is variable, the parallel program's burst
size is usually known a priori (in the case of Fx, at compile-time),
but the period between bursts depends on the number of processors and
the bandwidth the network can provide to the application during the
burst.  Suppose the program performs $W$ work during a compute phase,
and each processor send a message of length $N$. If the network can
allocate a burst bandwidth of $B$ for each active connection without
congestion, then the burst length is $t_{b}=\frac{N}{B}$ and the burst
interval is $t_{bi}=\frac{W}{P}+\frac{N}{B}$. Notice that the burst
interval, which certainly plays into the decision of what $B$ the
network can commit to, is a function of $B$ itself (as well as of the
other commitments the network has made.)

It must pointed out that the parallel program clearly
wants to minimize $t_{bi}$ in order to minimize its total execution
time.  One way it can do this is to increase the number of processors
$P$ it runs on.  However, there is a natural tension with the
bandwidth $B$ that the network can commit to, and, less obviously, the
communication pattern determines how strong that tension is.  Thus
getting the best performance from a parallel program on a network is
essentially an optimization problem, {\em where the number of processors
plays a role.}  We suggest that a SPMD parallel program should characterize
its traffic with three parameters, $[l(),b(),c]$, where $c$ is the
communication pattern, $l$ is a function from the number of processors $P$
to the local computation time $t_{local}$ on each processor, and $b$ is
a function from $P$ to the burst size $N$, along each connection.  In order
to meet the ``guarantee'' of minimizing $t_{bi}$, the network is allowed
to return the number of processors $P$ the program should run on.



