%% This document created by Scientific Word (R) Version 2.0
%% Starting shell: sw20siam.shl


\documentstyle[11pt,sw20siam]{siamltex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%TCIDATA{TCIstyle=Article/art1.lat,siam,sw20siam}

\input tcilatex
\begin{document}

\title{ENVIRONMENTAL MODELING ON THE C90-T3D HETEROGENEOUS SYSTEM\thanks{%
This research was supported by the National Science Foundation under
contract CCR-9217365. The authors thank Sean Friel for his programming
efforts on this project and gratefully acknowledge the use of computational
resources provided by the Pittsburgh Supercomputing Center.}}
\author{Edward Segall\thanks{%
Department of Computer Science, Carnegie Mellon University, 5000 Forbes
Avenue, Pittsburgh, Pennsylvania 15213-3891 ({\tt segall@n3.sp.cs.cmu.edu}).
(now with the Department of Computer and Information Sciences, University of
Delaware, STREET\ ADDRESS, Newark, Delaware 19716.)} \and Kip Walker\thanks{%
Department of Computer Science, Carnegie Mellon University, 5000 Forbes
Avenue, Pittsburgh, Pennsylvania 15213-3891.} \and Peter Steenkiste%
\footnotemark[3]  \and Armistead Russell\thanks{%
Department of Mechanical Engineering, Carnegie Mellon University, STREET
ADDRESS, Pittsburgh, Pennsylvania 15213.}}
\maketitle

\begin{abstract}
We describe how the heterogeneous nature of parallelism in the
Urban-to-Regional Multiscale (URM) Airshed air quality model makes it
desirable to execute the model on a heterogeneous parallel computer. The
chemistry/vertical transport phase contains a significant level of
coarse-grain data parallelism and so can take advantage of a massively
parallel processor (MPP) architecture. In contrast, the horizontal transport
phase has a modest level of data parallelism and is vectorizable; it maps
well to a more traditional shared-memory supercomputer. We describe how we
optimized communication for models like URM on the C90-T3D heterogeneous
supercomputer, which consists of a parallel vector supercomputer closely
coupled to an MPP.
\end{abstract}

\section{A heterogeneous URM for the C90-T3D}

The Urban-to-Regional Multiscale (URM) Airshed air quality model is
described in a companion paper~\cite{segall:scale:epa:local}. {\bf C}, the
main data structure, is described there, as are the main computational
tasks, {\bf CHEM}\ and {\bf TRANS}. Both tasks are parallelizable, but have
different characteristics. {\bf CHEM}, the chemistry/vertical transport
phase, has a very high degree of outer-loop parallelism (equal to the number
of chemical species, $nspec$, times the number of horizontal grid points, $%
nnodes$), while {\bf TRANS}, the horizontal transport phase, has more
limited outer-loop parallelism (equal to the product of atmospheric layers, $%
nlayers$, and $nspec$). However, inner-loop parallelism, in the form of
vectorizable operations, is also available from {\bf TRANS}, with vector
lengths of approximately $\sqrt{nnodes}$.
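
As a rough illustration of these relative scales, consider hypothetical
dimensions (the numbers below are invented for illustration only; they are
not taken from the model):

```python
# Hypothetical problem dimensions -- illustrative only, not from the URM model.
nspec = 35      # chemical species
nnodes = 1600   # horizontal grid points
nlayers = 5     # atmospheric layers

# Degrees of outer-loop parallelism described in the text.
chem_parallelism = nspec * nnodes    # CHEM: species x horizontal grid points
trans_parallelism = nlayers * nspec  # TRANS: layers x species

# TRANS also offers inner-loop (vector) parallelism of about sqrt(nnodes).
trans_vector_len = int(nnodes ** 0.5)

print(chem_parallelism, trans_parallelism, trans_vector_len)  # 56000 175 40
```

Even for these modest dimensions, {\bf CHEM}\ exposes orders of magnitude
more independent coarse-grain work than {\bf TRANS}, which instead relies
on its vectorizable inner loops.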

As the size of the domain increases, the amount of work performed by {\bf %
TRANS}\ increases faster than the amount performed by {\bf CHEM}\ (due to
the same $\sqrt{nnodes}$ term). This results in approximately equal division
of work for regional-size domains, which are the targets of this application%
\footnote{%
Incorporating aerosol and aqueous-phase chemistry will push the balance back
towards {\bf CHEM}.}. Thus, we need to scale the two phases roughly equally,
even though they have different characteristics. This distinction makes the
URM model a good candidate for execution on a heterogeneous platform
consisting of a traditional shared memory vector supercomputer (Cray C90)
and a massively parallel processor (Cray T3D).

Our Parallel Virtual Machine (PVM) \cite{pvm:dmcc6} version of the
application \cite{segall:scale:epa:local} is a good starting point for the
C90-T3D implementation since both the C90 and T3D support PVM for internal
communication. For communication between the C90 and T3D, a number of
options exist:

\begin{itemize}
\item  PVM, which is the most portable option.

\item  Parallel File I/O (PFIO), which was developed by Sanielevici at the
Pittsburgh Supercomputing Center (PSC) and is based on files that are shared
by processes on the C90 and T3D.

\item  Distributed Heterogeneous Supercomputing Communication (DHSC)
\cite{dhsc:ipps}, which is a package developed at PSC for communication
between the C90 and the CM2 or T3D MPPs.

\item  Direct shared-memory (SHMEM) access between the C90 and T3D, which
is a Cray-proprietary interface. This would have been considered only if
none of the others proved suitable.
\end{itemize}

Because {\bf C}\ is transposed and redistributed between {\bf CHEM}\ and 
{\bf TRANS}, data from each C90 process is sent to every T3D node, and
vice versa. This redistribution can be done in two ways:

\begin{enumerate}
\item  Each sending process sends data directly to each receiving process;
hence, we call this the {\em direct} method. Advantages are that each datum
is transferred exactly once, and that all send and receive operations can be
performed in parallel. However, a disadvantage is that the size of each
message is quite small.

\item  One process on the sending system gathers data from all sending
processes, and then sends it as one large message to a single receiving
process, which then scatters (distributes) the data to all receiving
processes. Each datum is transferred three times, but only one (large)
message is exchanged between the C90 and T3D. We call this the {\em funnel}
method.
\end{enumerate}
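
The data movement of the two methods can be sketched in a few lines of
Python (an illustrative sketch only: blocks are abstract tokens, and the
transpose itself is omitted):

```python
def direct(send_blocks, Ps, Pd):
    """Direct method: every sender sends one small message to every receiver."""
    # Ps * Pd messages cross the inter-system link, each of size D / (Ps * Pd).
    return [send_blocks[s][d] for s in range(Ps) for d in range(Pd)]

def funnel(send_blocks, Ps, Pd):
    """Funnel method: gather on the sender side, one big transfer, then scatter."""
    # Gather: one sending process collects all blocks (intra-system messages).
    gathered = [blk for proc in send_blocks for blk in proc]
    # Transfer: all of C crosses the inter-system link as ONE large message.
    # Scatter: one receiving process hands receiver d its Ps blocks.
    return [gathered[d::Pd] for d in range(Pd)]

# Tiny demo: 2 senders, 3 receivers; block (s, d) travels from sender s to
# receiver d.  Each datum crosses the link once (direct), or is handled three
# times but rides in a single inter-system message (funnel).
blocks = [[(s, d) for d in range(3)] for s in range(2)]
print(direct(blocks, 2, 3))
print(funnel(blocks, 2, 3))  # receiver d ends up with [(0, d), (1, d)]
```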

An intermediate solution is the {\em parallel funnel} method, in which a
small number (larger than one) of processes on each system perform the
C90-T3D communication. To understand the tradeoffs between these methods, we
analyze their communication costs. The cost of sending or receiving a
message of length $L$ bytes is approximated by $\alpha +\beta L$, where $%
\alpha $ is the per-message cost and $\beta $ is the cost for each byte in
the message. $\alpha $ and $\beta $ depend both on the source and
destination (C90-T3D, T3D-T3D, or C90-C90) and on the communication library
used. In general, it costs more to communicate between systems than within a
system. Also, between a given source and destination, it is more efficient
to send a single large message than a stream of short messages, because $%
\alpha $ is a one-time cost.
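
A toy instance of this model makes the tradeoff concrete. The values of $%
\alpha $ and $\beta $ below are invented for illustration; only $D$ is
taken from the application:

```python
def transfer_cost(num_messages, total_bytes, alpha, beta):
    """Cost of moving total_bytes split evenly over num_messages messages,
    under the per-message linear model alpha + beta * L."""
    return num_messages * alpha + beta * total_bytes

alpha = 1e-3   # per-message cost in seconds (invented for illustration)
beta = 1e-7    # per-byte cost in seconds, about 10 MB/s (invented)
D = 4.7e6      # bytes: size of C for the NE region (from the text)

# One large message amortizes alpha; a thousand small ones pay it repeatedly.
print(f"1 message:     {transfer_cost(1, D, alpha, beta):.3f} s")
print(f"1000 messages: {transfer_cost(1000, D, alpha, beta):.3f} s")
```

With these (invented) parameters, the stream of small messages costs
roughly three times as much, even though exactly the same bytes cross the
link.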

With the direct method, each datum is sent and received only once, but a
large number of messages have to be sent over the C90-T3D link. If $P_s$ and 
$P_d$ are the number of processes on the source and destination systems,
then a total of $P_s P_d$ messages are sent, each of size $D/(P_s P_d)$,
where $D$ is the size of {\bf C}. $D$ ranges from a few megabytes (MB) at
the present time to several tens of MB in the near future. Assuming on the
order of 10 C90 processors and 100 T3D processors, we can expect message
sizes of a few kilobytes (KB) to a few tens of KB. Such message sizes are
potentially too small for efficient C90-T3D communication.
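
Plugging the numbers quoted above into this count confirms the small
message sizes:

```python
D = 4.7e6         # bytes: size of C for the NE region (from the text)
Ps, Pd = 10, 100  # order-of-magnitude process counts from the text

num_messages = Ps * Pd
message_size = D / num_messages
print(f"{num_messages} messages of {message_size / 1e3:.1f} KB each")
# -> 1000 messages of 4.7 KB each
```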

In the funnel methods, much larger messages are exchanged between the C90
and T3D, making that communication step more efficient. The drawback is that
these also require scatter/gather operations inside the C90 and T3D, which
involve small messages. Note, however, that the cost of this additional
communication will typically be lower than that of C90-T3D communication.
Which method is most efficient will depend on the specific values of the
per-message and per-byte overheads, as well as hard-to-predict second-order
effects such as resource contention.

\section{Experimental results}

To evaluate these alternatives further, we implemented both the direct and
funnel methods. We concentrated on the communication requirements of URM as
applied to the Northeastern U.S. (NE) region, where $D$ is 4.7 MB, compared
to $\sim $1~MB for the Los Angeles Basin. We expect future runs to use
larger arrays, on the order of tens of MB, both because the number of
layers and species will increase and because larger regions will be
modeled. Our goal is to complete a 24-hour simulation of the NE region in
15 minutes. This will require a round-trip communication of {\bf C}\ between
the C90 and T3D in 0.67 seconds or less. This is a relatively modest
requirement since it corresponds to a one-way sustained bandwidth of under
15 MB/second (a small fraction of what the hardware is designed to support
at its maximum throughput). The challenge is to achieve this bandwidth
end-to-end (from a distributed source to a distributed destination, with a
transpose along the way), for the array sizes of interest.\FRAME{ftbpFU}{%
4.3491in}{2.5495in}{0pt}{\Qcb{C90-T3D round-trip time for the direct method
using PVM.}}{}{seg21s.tif}{\special{language "Scientific Word";type
"GRAPHIC";maintain-aspect-ratio TRUE;display "USEDEF";valid_file "F";width
4.3491in;height 2.5495in;depth 0pt;cropleft "0";croptop "1.0012";cropright
"1.0003";cropbottom "0";filename
'D:/SW20/NGEMCOM/SEGALL2/SEG21S.TIF';file-properties "XNPEU";}}\FRAME{ftbpFU%
}{4.1701in}{2.5495in}{0pt}{\Qcb{C90-T3D round-trip time for the direct
method using PFIO.}}{}{seg22s.tif}{\special{language "Scientific Word";type
"GRAPHIC";maintain-aspect-ratio TRUE;display "USEDEF";valid_file "F";width
4.1701in;height 2.5495in;depth 0pt;cropleft "0";croptop "1.0102";cropright
"1.0022";cropbottom "0";filename
'D:/SW20/NGEMCOM/SEGALL2/SEG22S.TIF';file-properties "XNPEU";}}
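
The arithmetic behind this bandwidth requirement can be checked directly
(assuming, as a simplification, that the two directions of the round trip
occur sequentially):

```python
D = 4.7e6       # bytes each way: size of C for the NE region
budget = 0.67   # seconds allowed for one full round trip of C

# Each direction gets half the budget, so the required sustained rate is:
one_way_bw = D / (budget / 2)
print(f"required one-way bandwidth: {one_way_bw / 1e6:.1f} MB/s")
# -> about 14 MB/s, i.e. under the 15 MB/s figure quoted in the text
```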

Figures 1 and 2 show our results for the direct method, using PVM and PFIO,
respectively, to communicate between equal numbers of processing elements
(PEs) on the C90 and T3D. In each figure, the x-axis shows the amount of
data and the y-axis the transfer time. Note that the PVM communication
times are well above our goal of 0.67 seconds. The measurements cannot be
fitted directly to the proposed linear model, apparently because of hidden
system complexities: some parts of the communication operation are
performed in parallel, while others contend for sequential resources.
However, it is apparent from the steep slopes of
the PVM curves that the bandwidth of PVM communication between the C90 and
T3D is much too low for our needs.

PFIO appears fast enough for use with two or four processors, but not
beyond that. In Figure 2, time increases rapidly with the number of
messages and slowly with message size, suggesting that PFIO's poor
performance as the number of PEs increases is due, in part, to high
per-message overhead. It is
encouraging, however, that the curves are quite flat, suggesting that ample
bandwidth is available if overhead can be avoided and a parallel funnel may
be practical (perhaps using PVM for the gather/scatter). However, we may
safely conclude that the direct method is not fast enough using either PFIO
or PVM.\FRAME{ftbpFU}{4.1701in}{2.5495in}{0pt}{\Qcb{C90-T3D round-trip time
for sequential communication using DHSC.}}{}{seg23s.tif}{\special{language
"Scientific Word";type "GRAPHIC";maintain-aspect-ratio TRUE;display
"USEDEF";valid_file "F";width 4.1701in;height 2.5495in;depth 0pt;cropleft
"0";croptop "1.0063";cropright "0.9971";cropbottom "0";filename
'D:/SW20/NGEMCOM/SEGALL2/SEG23S.TIF';file-properties "XNPEU";}}\FRAME{ftbpFU%
}{4.1995in}{2.5287in}{0pt}{\Qcb{C90-T3D round-trip time for the funnel
method using DHSC.}}{}{seg24s.tif}{\special{language "Scientific Word";type
"GRAPHIC";maintain-aspect-ratio TRUE;display "USEDEF";valid_file "F";width
4.1995in;height 2.5287in;depth 0pt;cropleft "0";croptop "1.0009";cropright
"1.0008";cropbottom "0";filename
'D:/SW20/NGEMCOM/SEGALL2/SEG24S.TIF';file-properties "XNPEU";}}

DHSC currently supports communication only between a single C90 process and
a single T3D PE, so we can only use it in the funnel method. Figure 3 shows
the DHSC round-trip time, as a function of data size, for the one-to-one case.
This provides a lower bound for the total communication time of the funnel
method, and shows that most of the 0.67 seconds is left for gather/scatter
within each system. Figure 4 shows multiple C90 processes communicating
with a single C90 process (using PVM for gather/scatter), which, in turn,
uses DHSC to communicate with a single T3D PE. Even including the
gather/scatter operations on the C90, the round-trip time is under 0.6
seconds for up to 16 processes. Other measurements have shown that
gather/scatter on the T3D takes under 0.1 seconds. As a result, this is
the method we selected for the
heterogeneous implementation of the URM Airshed model.

Based on these results, we implemented the DHSC funnel in the URM Airshed
application and executed it on the C90-T3D system at the PSC, using five
C90 processes and varying numbers of T3D PEs. The model results have been
verified against homogeneous runs. Detailed performance analysis requires
measurements of the application running on a dedicated system, which we have
not been able to collect. However, the measurements obtained in the
multiuser environment show that the basic performance metrics vary as
expected --- computation time decreases and communication times increase as
the number of T3D processing elements increases.

\section{Conclusion}

Using a distributed version of the URM Airshed model, we explored
heterogeneous computing across the C90 and T3D, and learned that achieving
good communication performance between the C90 and T3D was difficult. Since
the main array, {\bf C}, was transposed and redistributed when it was sent
between the C90 and T3D, direct communication resulted in many small
messages and therefore high overhead. After investigating several options,
we settled on a three-step ``funnel'' method, in which C90-T3D
communication consists of a gather operation on the sending system, a
point-to-point transfer, and a scatter operation on the receiving system.
This method minimized overhead and thereby maximized effective bandwidth.
It was the only method we evaluated that met the time budget required for
a 15-minute turnaround on a 24-hour simulation of the NE region.
Further, we believe the utility of this ``funnel'' technique is not limited
to this specific application or architecture, but should instead be part of
the set of general techniques available to developers of heterogeneous
applications. Therefore, we plan to develop it further as a reusable
software component.

\begin{thebibliography}{9}
\bibitem{segall:scale:epa:local}  E.~Segall, P.~Steenkiste, N.~Kumar, and
A.~Russell, {\em Scalability of a portable, distributed multiscale air
quality model}. Preceding paper in these proceedings.

\bibitem{pvm:dmcc6}  G.A. Geist and V.S. Sunderam, {\em The {PVM} system:
Supercomputer level concurrent computation on a heterogeneous network of
workstations}, in Proceedings of the Sixth Distributed Memory Computing
Conference, LOCATION, IEEE, April 1991, pp.~258--261.

\bibitem{dhsc:ipps}  J.~Mahdavi, G.~L. Huntoon, and M.~B. Mathis, {\em %
Deployment of a {\sc hippi}-based distributed supercomputing environment
at the Pittsburgh Supercomputing Center}, in International Parallel
Processing Symposium, Los Angeles, IEEE, April 1992, pp. \_\_-\_\_\_.
\end{thebibliography}

\end{document}
