\documentstyle[leqno,twoside,11pt,ltexproc]{article}

\newcommand{\Chem}[0]{{\bf CHEM}}
\newcommand{\Trans}[0]{{\bf TRANS}}
\newcommand{\C}{{\bf C}}

\input{psfig}

\begin{document}
\bibliographystyle{siamproc}

\cleardoublepage
\pagestyle{myheadings}

\title{Environmental Modeling on the\\
C90-T3D Heterogeneous System (DRAFT)\thanks{This research was supported by the National Science Foundation under contract CCR-9217365.}}
\author{Edward Segall\thanks{Department of Computer Science,  Carnegie
Mellon University, Pittsburgh, PA 15213.}~\thanks{Currently with 
Dept. of Computer and Information Sciences, Univ. of
Delaware, Newark, DE 19716.}
\and 
Kip Walker\footnotemark[2]
\and
Peter Steenkiste\footnotemark[2]
\and
Armistead Russell\thanks{Department of Mechanical Engineering,
Carnegie Mellon University, Pittsburgh, PA 15213.}
}
\date{}
\maketitle
\markboth{Segall et al. (DRAFT)}{Environmental Modeling on the C90-T3D Heterogeneous System}
\pagenumbering{arabic}

\begin{abstract}
We describe how the heterogeneous nature of parallelism in the
Urban-to-Regional Multiscale (URM) Airshed air quality model makes it
desirable to execute the model on a heterogeneous parallel
computer. The chemistry/vertical transport phase contains a
significant level of coarse-grain data parallelism, and so can take
advantage of massively parallel processors (MPPs).  In contrast, the
horizontal transport phase has a modest level of data parallelism and
is vectorizable; it maps well to a more traditional shared-memory
supercomputer.  We describe how we optimized communication on the
C90-T3D heterogeneous supercomputer, which consists of a parallel
vector supercomputer and a closely-coupled MPP, specifically for
models like URM. 
\end{abstract}

\section{A heterogeneous URM for C90-T3D}

The Urban-to-Regional Multiscale (URM) Airshed air quality model is
described in a companion paper\cite{segall:scale:epa:local}.  \C, the main
data structure, is described there, as are the main computational
tasks, \Chem\ and \Trans.   Both tasks are parallelizable,
but have different characteristics.  \Chem, the
chemistry/vertical transport phase, has a very high degree of
outer-loop parallelism (equal to the number of chemical species, $nspec$, times the number of horizontal grid points, $nnodes$), while
\Trans, the horizontal transport phase, has more limited outer-loop
parallelism (equal to the number of atmospheric layers, $nlayers$,
times $nspec$). However, inner-loop parallelism, in the form of
vectorizable operations, is also available from \Trans, with vector
lengths of approximately $\sqrt{nnodes}$.  
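The relative magnitudes can be illustrated with a short sketch; the problem dimensions below are hypothetical placeholders for illustration, not values taken from the model:

```python
# Hypothetical problem dimensions (illustrative only, not from the model).
nspec = 35     # number of chemical species (assumed)
nnodes = 2000  # number of horizontal grid points (assumed)
nlayers = 5    # number of atmospheric layers (assumed)

# CHEM: one independent task per (species, grid point) pair.
chem_parallelism = nspec * nnodes
# TRANS: one independent task per (species, layer) pair.
trans_parallelism = nspec * nlayers
# TRANS inner loops vectorize with length about sqrt(nnodes).
vector_length = int(nnodes ** 0.5)

print(chem_parallelism)   # 70000: ample coarse-grain tasks for an MPP
print(trans_parallelism)  # 175: modest outer-loop parallelism
print(vector_length)      # 44: vector lengths suited to a vector machine
```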

As the size of the domain increases, the amount of work performed by
\Trans\ increases faster than the amount performed by \Chem\ (due to
the same $\sqrt{nnodes}$ term). This results in approximately equal
division of work for regional-size domains, which are the targets of
this application\footnote{Incorporating aerosol and aqueous-phase
chemistry will push the balance back towards \Chem.}.  Thus, we need
to scale the two phases roughly equally, even though they have
different characteristics.  This distinction makes the URM model a
good candidate for execution on a heterogeneous platform consisting of
a traditional shared memory vector supercomputer (Cray C90) and a
massively parallel processor (Cray T3D).

Our PVM \cite{pvm:dmcc6} version of the
application\cite{segall:scale:epa:local} is a good starting point for
the C90-T3D implementation since both the C90 and T3D support PVM for
internal communication.  For communication between the C90 and T3D, a
number of options exist:
\begin{itemize}
\item PVM: the most portable option.
\item Parallel File I/O (PFIO): developed by Sanielevici at PSC
(Pittsburgh Supercomputing Center), based on files that are shared by
processes on the C90 and T3D.
\item Distributed Heterogeneous Supercomputing Communication (DHSC)
\cite{dhsc:ipps}: a package developed at PSC for communication between
the C90 and the CM2 or T3D MPPs.
\item Direct shared-memory access between the C90 and T3D.  A
Cray-proprietary interface, this would have been considered only if
none of the others proved suitable.
\end{itemize}

Because \C\ is transposed and redistributed between \Chem\ and \Trans,
data from each C90 process is sent to every T3D node, and
vice versa. Two ways to do this are:
\begin{enumerate}
\item Each sending process sends data directly to each receiving
process; hence, we call this the {\em direct} method. Advantages are
that each datum is transferred exactly once, and that all send and
receive operations can be performed in parallel.  However, a
disadvantage is that the size of each message is quite small.
\item One process on the sending system gathers data from all sending
processes, and sends it as one large message to a single receiving
process, which then scatters (distributes) the data to all receiving
processes.  Each datum is transferred three times, but only one (large)
message is exchanged between the C90 and T3D.  We call this the {\em
funnel} method.
\end{enumerate}
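The message geometry of the two methods can be sketched as follows; the process counts are illustrative assumptions, while the array size is the NE-region figure used later in the paper:

```python
D = 4.7e6         # size of C in bytes (NE region)
Ps, Pd = 10, 100  # assumed process counts on the source and destination

# Direct method: every source process sends to every destination process.
direct_msgs = Ps * Pd          # messages crossing the C90-T3D link
direct_size = D / direct_msgs  # bytes per message

# Funnel method: gather, one big transfer, scatter.
funnel_msgs = 1                # messages crossing the C90-T3D link
funnel_size = D                # bytes in that single message

print(direct_msgs, direct_size)  # 1000 messages of 4700.0 bytes each
print(funnel_msgs, funnel_size)  # 1 message of 4700000.0 bytes
```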

An intermediate solution is the {\em parallel funnel} method, in
which a small number (larger than one) of processes on each system
perform the C90-T3D communication.  To understand the tradeoffs
between these methods, we analyze their communication costs.  The
cost of sending or receiving a message of length $L$ bytes is
approximated by $\alpha + \beta L$, where $\alpha$ is the per-message
cost and $\beta$ is the cost for each byte in the message.  $\alpha$
and $\beta$ depend both on the source and destination (C90-T3D,
T3D-T3D, or C90-C90) and on the communication library used.  In
general, it costs more to communicate between systems than within a
system.  Also, between a given source and destination, it is more
efficient to send a single large message than a stream of short
messages, because $\alpha$ is paid only once.
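The effect of the per-message term can be seen in a minimal sketch of the cost model; the $\alpha$ and $\beta$ values are assumed for illustration only:

```python
def msg_cost(L, alpha, beta):
    """Cost of one message of L bytes under the linear model alpha + beta*L."""
    return alpha + beta * L

# Assumed parameters: 1 ms per-message cost, 100 MB/s per-byte rate.
alpha, beta = 1e-3, 1e-8

# Sending 1 MB as a single message pays alpha once; as 1000 messages
# of 1 KB it pays alpha a thousand times.
one_big = msg_cost(1e6, alpha, beta)
many_small = 1000 * msg_cost(1e3, alpha, beta)

print(round(one_big, 3))     # 0.011 seconds
print(round(many_small, 2))  # 1.01 seconds: per-message overhead dominates
```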

With the direct method, each datum is sent and received only once, but
a large number of messages have to be sent over the C90-T3D link.  If
$P_{s}$ and $P_{d}$ are the number of processes on the source and
destination systems, then a total of $P_{s} * P_{d}$ messages are
sent, each of size $D / (P_{s} * P_{d})$, where $D$ is the size of \C.
$D$ ranges from a few megabytes (MB) now to several tens of MB in the
near future.  Assuming on the order of 10 C90 processors and 100 T3D
processors, we can expect message sizes of a few kilobytes (KB) to a
few tens of KB, potentially too small for efficient C90-T3D
communication.

In the funnel methods, much larger messages are exchanged between
the C90 and T3D, making that communication step more efficient.  The
drawback is that they also require scatter/gather operations inside
the C90 and T3D, which involve small messages. Note, however, that
the cost of this additional communication will typically be lower than
that of C90-T3D communication.  Which method is most efficient will
therefore depend on the specific values of the per-message and
per-byte overheads, as well as on hard-to-predict second-order
effects such as resource contention.
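Combining the cost model with the message geometry gives a rough comparison of the two methods over the C90-T3D link; the link parameters and process counts here are assumptions, and the funnel's intra-system gather/scatter cost and any contention effects are ignored:

```python
def link_cost(n_msgs, bytes_each, alpha, beta):
    """Total cost of n_msgs messages of bytes_each bytes on one link."""
    return n_msgs * (alpha + beta * bytes_each)

D = 4.7e6         # size of C in bytes (NE region)
Ps, Pd = 10, 100  # assumed process counts
# Assumed C90-T3D link parameters: 5 ms per message, 100 MB/s.
alpha, beta = 5e-3, 1e-8

direct = link_cost(Ps * Pd, D / (Ps * Pd), alpha, beta)
funnel = link_cost(1, D, alpha, beta)  # gather/scatter cost excluded

print(round(direct, 3))  # 5.047 s: a thousand message startups dominate
print(round(funnel, 3))  # 0.052 s: alpha is paid only once
```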

\section{Experimental results}

To evaluate the alternatives further, we implemented both the direct
and funnel methods. We
concentrated on the communication requirements for URM, as applied to
the Northeastern U.S. (NE) region; for this region $D$ is 4.7 MB.  By
comparison, for the Los Angeles Basin, $D$ is about 1~MB. We expect
future runs to use larger arrays, e.g.  tens of MB, both because of an
increase in the number of layers and species and because larger
regions will be modeled.  Our goal is to do a 24-hour simulation of
the NE region in 15 minutes.  This will require a round-trip
communication of \C\ between the C90 and T3D in 0.67 seconds or less.
This is a relatively modest requirement: it corresponds to a
one-way sustained bandwidth of under 15 MB/second (a small fraction of
what the hardware is designed to support at its maximum
throughput). The challenge is to achieve this bandwidth end-to-end
(from a distributed source to a distributed destination, with a
transpose along the way), for the array sizes of interest.
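The bandwidth figure follows directly from the numbers in the text, as a quick check shows:

```python
D = 4.7e6          # bytes in C for the NE region
round_trip = 0.67  # allowed round-trip time in seconds

# A round trip moves C in each direction, so the one-way sustained
# bandwidth required is 2*D / round_trip.
bandwidth = 2 * D / round_trip
print(round(bandwidth / 1e6, 1))  # 14.0 MB/s, i.e. under 15 MB/second
```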


\begin{figure}[htb]
\centerline{\psfig{figure=fig/pvm.ps,height=1.8in}
\psfig{figure=fig/pfio.ps,height=1.8in}}
\caption{Round-trip time for direct method using PVM (left) and PFIO (right)}
\label{fig:pvm:pfio}
\end{figure}

Figure \ref{fig:pvm:pfio} shows our results for the direct method,
using PVM and PFIO to communicate between equal numbers of PEs (processing
elements) on the C90 and T3D.  For each curve, the x-axis shows the
amount of data; the y-axis shows the transfer time.  Note that the PVM
communication times are well above our goal of 0.67 seconds.  It is
not possible to fit these measurements directly to the proposed linear
model, apparently due to hidden system complexities (e.g., some parts
of the communication operation are performed in parallel, while others
contend for sequential resources). However, it is apparent from
the steep slopes of the PVM curves that the bandwidth of PVM
communication between the C90 and T3D is much too low for our needs.

PFIO appears fast enough for use with two or four processors, but no
more.  In the curves in Figure \ref{fig:pvm:pfio}, time increases
rapidly with the number of messages and slowly with message size,
suggesting that PFIO's poor performance as the number of PEs increases
is at least in part due to high per-message overhead. It is
encouraging, however, that the curves are quite flat, suggesting that
plenty of bandwidth is available if we can avoid the overhead, i.e., a
parallel funnel may be practical (perhaps using PVM for the
gather/scatter).  We can
conclude, however, that the direct method is not fast
enough using either PFIO or PVM. 

\begin{figure}[htb]
\centerline{\psfig{figure=fig/dhsc1.ps,height=1.5in}
\psfig{figure=fig/dhsc.ps,height=1.5in}}
\caption{Round-trip time for funnel method using DHSC}
\label{fig:dhsc}
\end{figure}

DHSC currently supports communication only between a single C90
process and a single T3D PE, so we can only use it in the funnel
method.  Figure \ref{fig:dhsc} (left) shows the DHSC round-trip time,
as a function of data size, for the $1-1$ case. This provides a lower
bound for the total communication time of the funnel method, and shows
that most of the 0.67 sec. is left for gather/scatter within each
system.  Figure \ref{fig:dhsc} (right) shows multiple C90 processes
communicating with a single C90 process (using PVM for gather/scatter),
which in turn uses DHSC to communicate with a single T3D PE. Even
including the gather/scatter operations on the C90, round-trip time is
under 0.6 sec. for up to 16 processes.  Other measurements have shown
that gather/scatter on the T3D is under 0.1 sec.  As a result, this is
the method we selected for the heterogeneous implementation of the URM
Airshed model.

Based on these results, we implemented the DHSC funnel in the URM
Airshed application, and executed it on the C90-T3D system at PSC,
using five C90 processes and different numbers of T3D PEs.  The model
results have been verified against homogeneous runs.  Detailed
performance analysis requires measurements of the application running
on a dedicated system, which we have not been able to collect.
However, the measurements obtained in the multiuser environment show
that the basic performance measures vary as expected --- computation
time decreases and communication times increase as the number of T3D
processing elements increases.
 

\section{Conclusion}

Using a distributed version of the URM Airshed model, we explored
heterogeneous computing across the C90 and T3D, and learned that
achieving good communication performance between the C90 and T3D is
difficult. Since the main array, \C, is transposed and redistributed
when it is sent between the C90 and T3D, direct communication results
in many small messages and therefore high overhead. After
investigating several options, we settled on a method in which C90-T3D
communication is implemented as a three-step ``funnel'' method consisting of a
gather operation on the sending system, a point-to-point transfer, and a
scatter operation on the receiving system. This method minimized
overhead and thereby maximized effective bandwidth.  It is the only
method we evaluated that met the time budget required for a 15-minute
turnaround of a 24-hour simulation of the NE region.  Further, we believe
the utility of this ``funnel'' technique is not limited to
this specific application or architecture, but should instead be part
of the set of general techniques available to developers of
heterogeneous applications. Therefore, we plan to develop it further
as a reusable software component.

\section*{Acknowledgements}

The authors thank Sean Friel for his programming efforts on this
project, and  acknowledge the use of computational resources
provided by the Pittsburgh Supercomputing Center.

\bibliography{hetero}

\end{document}
