
%% This document created by Scientific Word (R) Version 2.0
%% Starting shell: sw20siam.shl


\documentstyle[11pt,sw20siam]{siamltex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%TCIDATA{TCIstyle=Article/art1.lat,siam,sw20siam}

\input tcilatex
\begin{document}

\title{SCALABILITY OF A PORTABLE, DISTRIBUTED MULTISCALE AIR QUALITY MODEL\thanks{%
This research was supported by the National Science Foundation under
contract CCR-9217365. The authors gratefully acknowledge the use of
computational resources provided by the Pittsburgh Supercomputer Center.}}
\author{Edward Segall\thanks{%
Department of Computer Science, Carnegie Mellon University, 5000 Forbes
Avenue, Pittsburgh, Pennsylvania 15213-3891 ({\tt segall@n3.sp.cs.cmu.edu}).
(now with the Department of Computer and Information Sciences, University of
Delaware, STREET\ ADDRESS, Newark, Delaware ZIP CODE.)} \and Peter Steenkiste%
\thanks{%
Department of Computer Science, Carnegie Mellon University, 5000 Forbes
Avenue, Pittsburgh, Pennsylvania 15213-3891.} \and Naresh Kumar\thanks{%
Department of Mechanical Engineering, Carnegie Mellon University, STREET
ADDRESS, Pittsburgh, Pennsylvania ZIP\ CODE. (Now with Sonoma Technology,
Inc., STREET\ ADDRESS, Santa Rosa, California, ZIP CODE.)} \and Armistead
Russell\thanks{%
Department of Mechanical Engineering, Carnegie Mellon University, STREET
ADDRESS, Pittsburgh, Pennsylvania ZIP\ CODE.}}
\maketitle

\begin{abstract}
We have converted the Urban-to-Regional Multiscale (URM) Airshed Model from
a sequential to a message-based distributed program. In the process, we
learned that multiscale models are inherently harder to parallelize than
uniform-grid models, due to their increased structural complexity.
Consequently, straightforward data parallelism (such as that supported by
Fortran-90) does not yield high parallel speedups for the current
generation of distributed multiscale models. This has two interesting
consequences: (a) even a small parallel speedup leads to a fast time-to-solution
(due to the inherent efficiency of the multiscale approach) and (b) mixed
task and data parallelism can be used to improve the scalability of
multiscale models to achieve even faster solutions.
\end{abstract}

\section{Application overview}

The Urban-to-Regional Multiscale (URM) model (a multiscale grid version of
the CIT Airshed model~\cite{Harley:Russell:McRae:Cass:Seinfeld}) is a
three-dimensional, time-dependent Eulerian model of the transport,
deposition and chemical evolution of atmospheric pollutants. It is based on
the atmospheric diffusion equation 
$$
\frac{\partial c_i}{\partial t}+\nabla \cdot ({\bf u}\,c_i)=\nabla \cdot (
{\bf K}\nabla c_i)+f_i+S_i\eqno(1)
$$
where $c_i$ is the concentration of the $i$th pollutant among $p$ species
$(i=1,\ldots,p)$, ${\bf u}$ is the velocity field, ${\bf K}$ is the
diffusivity tensor, $f_i(c_1,\ldots,c_p)$ is the chemical reaction term,
and $S_i$ is the net source term. The URM solves (1)
using operator splitting, advancing in time as 
$$
c^{n+1}=L_{xy}(\Delta t/2)\,L_{cz}(\Delta t)\,L_{xy}(\Delta t/2)\,c^n\eqno(2)
$$
where $L_{xy}$ is the two-dimensional horizontal transport operator and
$L_{cz}$ is the chemistry and vertical transport operator~\cite{grand:cce}.
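The split update in (2) can be sketched in a few lines. The {\tt step} driver and the scalar toy operators below are illustrative assumptions for exposition, not the URM solvers:

```python
import math

# Sketch of the operator-split time step in equation (2). The toy
# one-variable "operators" below are illustrative stand-ins, not the
# actual URM transport and chemistry solvers.

def step(c, dt, L_xy, L_cz):
    """Advance c by one split step: L_xy(dt/2) L_cz(dt) L_xy(dt/2)."""
    c = L_xy(c, dt / 2)   # half step of horizontal transport
    c = L_cz(c, dt)       # full step of chemistry + vertical transport
    c = L_xy(c, dt / 2)   # second half step of horizontal transport
    return c

# Toy linear operators: exact exponential decay, so splitting is exact here.
decay_a = lambda c, dt: c * math.exp(-0.5 * dt)  # stands in for L_xy
decay_b = lambda c, dt: c * math.exp(-1.0 * dt)  # stands in for L_cz

c1 = step(1.0, 0.1, decay_a, decay_b)
```

Because the toy operators commute, the split step is exact here; for the real transport and chemistry operators the Strang-style symmetric splitting in (2) keeps the splitting error second order in $\Delta t$.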

The Streamline Upwind Petrov-Galerkin (SUPG) finite element method is used
for $L_{xy}$\cite{odman:russell:multiscale}. For $L_{cz}$, the hybrid scheme
of \cite{young:boris:numerical} for stiff systems of ordinary differential
equations is used. The use of a two-dimensional horizontal transport
operator is a direct result of the multiscale grid, which is
computationally much more efficient than a uniform grid because it
requires evaluation of $L_{cz}$ at fewer points.
\begin{figure}[tbp]
\centering
%% graphic: seg11c.tif
\caption{Code structure for the Urban-to-Regional Multiscale model.}
\end{figure}
\begin{figure}[tbp]
\centering
%% graphic: seg12c.tif
\caption{Main data structure for the Urban-to-Regional Multiscale model.}
\end{figure}

\section{Distributed URM: data parallelism}

Figures 1 and 2, respectively, show the code and main data structure of the
sequential URM hourly loop. There are two main computational phases: {\bf %
CHEM}\ (which corresponds to $L_{cz}$) and {\bf TRANS}\ ($L_{xy}$). The loop
first reads the hourly meteorological and emissions input data. Next, for
each time step, {\bf CHEM}\ and {\bf TRANS}\ update the concentration array (%
{\bf CHEM}\ once, {\bf TRANS}\ twice), using emissions or meteorological
data as inputs. Finally, the hourly output data is saved.

The main data structure is a three-dimensional array, {\bf C}\ (called {\sc %
conc} in Figure 1), representing chemical concentrations in the volume being
modeled. Its dimensions are nodes (horizontal grid points), vertical layers,
and chemical species. Note that {\bf C}\ does not contain separate $x$ and $y
$ dimensions. Instead, each node has a sequence number, and is indexed by
that number. Routines {\bf CHEM}\ and {\bf TRANS}\ both access all of {\bf C}%
, but traverse its dimensions in different order. {\bf TRANS}\ contains a
nested loop that runs over layers (outermost), species, and nodes
(innermost), so it performs best if {\bf C}\ is declared with its dimensions
stored (in order of decreasing locality) with nodes first, then species,
then layers. In {\bf CHEM}, the node loop is outermost, and the best storage
order is: species, layers, nodes. Thus, locality of reference is improved by
transposing {\bf C}\ between the two phases.
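The competing traversal orders above can be sketched with a flat array and explicit index maps. The sizes and index helpers below are illustrative assumptions, not the URM data layout:

```python
# Sketch of the layout transpose between CHEM and TRANS, using a flat
# array with Fortran-style (column-major) indexing: the fastest-varying
# dimension comes first. Sizes are illustrative.

NNODES, NSPEC, NLAYERS = 4, 3, 2

# TRANS-friendly layout: nodes vary fastest (nodes is the innermost loop).
def idx_trans(node, spec, layer):
    return node + NNODES * (spec + NSPEC * layer)

# CHEM-friendly layout: species vary fastest, nodes slowest.
def idx_chem(spec, layer, node):
    return spec + NSPEC * (layer + NLAYERS * node)

def transpose_trans_to_chem(c_trans):
    """Repack C so that CHEM's traversal order is contiguous in memory."""
    c_chem = [0.0] * len(c_trans)
    for layer in range(NLAYERS):
        for spec in range(NSPEC):
            for node in range(NNODES):
                c_chem[idx_chem(spec, layer, node)] = \
                    c_trans[idx_trans(node, spec, layer)]
    return c_chem

c_trans = [float(i) for i in range(NNODES * NSPEC * NLAYERS)]
c_chem = transpose_trans_to_chem(c_trans)
```

The repack touches every element once; its cost is linear in the size of {\bf C}, which is why folding it into the interprocess communication (as described below) is attractive.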

Within {\bf CHEM}, iterations of the node loop are independent, allowing
data-parallel execution. Similarly, in {\bf TRANS}, the layer/species loops
contain two-level nested data-parallelism. There is no obvious outer-loop
parallelism within the node loop, because node iterations in {\bf TRANS}\
are not independent.

We have developed a distributed version of URM that exploits outer-loop data
parallelism in {\bf CHEM}\ and {\bf TRANS}~\cite{grand:cce}. It contains a
master process (for I/O and control), and two sets of parallel processes ---
one for {\bf CHEM}\ and one for {\bf TRANS}. The main source of interprocess
communication in this version is the repeated redistribution of {\bf C}\
between {\bf CHEM}\ and {\bf TRANS}. The transpose of {\bf C}, mentioned
above, is implemented as part of this communication.
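The outer-loop decomposition might be sketched as a block partition of the independent node iterations across workers. The chunking helper below is an illustrative assumption, not the PVM code of the distributed URM:

```python
# Sketch of distributing CHEM's independent node iterations over worker
# processes as contiguous, near-equal blocks. Illustrative only; the
# actual distributed URM uses PVM message passing.

def partition(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal blocks."""
    base, extra = divmod(n_items, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` blocks get +1
        blocks.append(range(start, start + size))
        start += size
    return blocks

blocks = partition(10, 3)
```

Since the node iterations in {\bf CHEM}\ are independent, each worker can apply $L_{cz}$ to its block without synchronizing until the redistribution of {\bf C}.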

\section{Performance of the data-parallel implementation}

The data-parallel implementation uses PVM \cite{pvm:dmcc6} for
communication. It is portable and has been successfully executed on a
variety of workstation clusters, the Cray C90, and the Cray C90-T3D
heterogeneous supercomputer \cite{segall:hetero:NGEMCOM:thiscollection}. On
workstation clusters, we have measured speedups of about six on ten
processors using this implementation. We expect the parallel efficiency to
drop below the current $60\%$ as the number of processors increases.
Further, analysis of the time taken by different parts of the program
suggests a 25-fold speedup is about the maximum possible even if many more
processors are used (assuming no change in the size of the domain or solvers
used). In contrast, Dabdub and Seinfeld \cite
{seinfeld:parallelization:paragon} have recently reported speedups of about
30 and 87 for two different transport schemes, respectively (on a
256-processor Intel Touchstone Delta), using a parallel version of the CIT
airshed model with the same inputs.
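The quoted 25-fold ceiling has the form predicted by Amdahl's law. The serial fraction below is an assumed, illustrative value chosen to match that cap, not a measured quantity:

```python
# Amdahl's-law sketch: with serial fraction s, speedup on p processors is
# 1 / (s + (1 - s) / p), bounded above by 1 / s as p grows. The serial
# fraction here is an illustrative assumption matching a 25-fold cap.

def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

s = 0.04                       # assumed serial fraction: 1/s = 25 cap
cap = 1.0 / s                  # asymptotic speedup limit
sp10 = amdahl_speedup(s, 10)   # predicted speedup on ten processors
```

Note that the measured speedup of about six on ten processors sits below this idealized curve, as expected once communication overhead is included.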

One obstacle to scaling URM to hundreds of processors is the relatively low
degree of outer-loop parallelism in $L_{xy}$. It is on the order of
$n_{\rm species}\times n_{\rm layers}$ (less than $200$ for current domains, though this will
increase in the future). In contrast, uniform-grid models multiply this
factor by the size of the $x$ or $y$ dimensions. In compensation, URM
requires significantly less computation to model a given region than do
uniform-grid models\footnote{%
For example, the sequential URM time-to-solution for the Los Angeles basin
is $1/2$ that of the CIT time.}. As a result, parallel time-to-solution is
shorter even though speedup is lower\footnote{%
A 24-hour URM simulation of Los Angeles takes 505 seconds on ten DEC Alpha
3000/400 and 3000/500 workstations (using a DEC Gigaswitch network),
compared to 1300-1800 seconds for the parallel CIT model on a 256-processor
Intel Touchstone Delta. Note that while the Alpha CPUs are several times
faster than the Intel 860 CPUs of the Delta, the Delta is still more
powerful overall than the Alpha cluster.}. This illustrates that when
evaluating the performance of a distributed computation, one should
consider not only speedup but also the algorithms used. Further,
detailed comparison of different models requires a means of equating
algorithms that exhibit different numerical behavior.

\section{Distributed URM -- mixed task and data parallelism}

Other factors also reduce parallel URM speedup, including sequentiality in
data I/O and hourly preprocessing/post-processing, as well as communication
latency between {\bf CHEM}\ and {\bf TRANS}. These are in fact the current
major performance issues, and must be addressed before $L_{xy}$ parallelism
becomes a limitation.

I/O can be parallelized using both data parallelism (each array can be input
by a separate processor) and pipelining (the next hour's data can be input
while the computation for the current hour is being performed). These
techniques are not supported by current data-parallel versions of Fortran.
They are straightforward to implement in a MIMD programming model such as
PVM, and are also supported by the task extensions of Fx (a research HPF
compiler that supports both task and data parallelism \cite{gross94a}).
Post-processing can be similarly pipelined.
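The input pipelining described above can be sketched with a prefetching reader thread; {\tt read\_hour} and {\tt compute\_hour} are illustrative stand-ins for the model's hourly I/O and time-step loop:

```python
import threading
import queue

# Sketch of pipelining hourly input against computation: a reader thread
# fetches hour n+1 while hour n is being processed. The two stage
# functions are illustrative stand-ins, not the URM code.

def read_hour(hour):
    return f"data-{hour}"          # stands in for reading hourly inputs

def compute_hour(data):
    return data.upper()            # stands in for the time-step loop

def run(n_hours):
    q = queue.Queue(maxsize=1)     # holds at most one prefetched hour

    def reader():
        for h in range(n_hours):
            q.put(read_hour(h))    # blocks when one hour is already queued

    threading.Thread(target=reader, daemon=True).start()
    return [compute_hour(q.get()) for _ in range(n_hours)]

results = run(3)
```

The bounded queue keeps the reader exactly one hour ahead, so input latency is hidden without buffering the whole run.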

Critical-path communication time is ultimately a function of platform
selection. Its impact on speedup is strongly dependent on problem size as
larger problems are less sensitive to overhead\footnote{%
For example, in our simulations of the northeastern U.S. region (about five
times the size of the Los Angeles region), communication overhead did not
measurably affect performance on the Alpha cluster.}. Thus, modeling larger
regions results in better scalability; increasing the amount of
computation per communication also improves speedup. Future aerosol and
aqueous-phase solvers will have this effect.

One issue that requires more detailed discussion is the hourly preprocessing
required by the multiscale method. Use of a two-dimensional transport
operator results in only one finite element stiffness matrix per layer.
Thus, during preprocessing, the degree of outer-loop parallelism equals the
number of layers\footnote{%
In contrast, within each layer, a one-dimensional finite-element solver $%
L_{x}$ forms one matrix for every $y$ index value, and $L_{y}$ forms one for
each $x$ value.}. This matrix is formed and factored once per hour. The time
to do this is small compared to the overall sequential time. However,
parallelization of {\bf CHEM}\ and {\bf TRANS}\ reduces their execution
time, and increases the relative significance of preprocessing time. Our
current implementation factors the stiffness matrices for different layers
in parallel. Scalability can be further improved using pipelining, as
described for inputs. This results in a multi-level pipeline --- while hour $%
n$ is in the time step loop, the stiffness matrix for hour $n+1$ may be
factored, and at the same time, the data for hour $n+2$ may be input.
Figure~3 is a schematic illustration of these techniques.
\begin{figure}[tbp]
\centering
%% graphic: seg13c.tif
\caption{Mixed task and data parallelism in the Urban-to-Regional model.}
\end{figure}
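The multi-level pipeline can also be sketched by enumerating which hour occupies each stage at each tick; the scheduling helper below is an illustrative abstraction, not the implementation:

```python
# Sketch of the three-stage hourly pipeline: while hour n is in the time
# step loop, hour n+1's stiffness matrix is being factored and hour n+2's
# data is being input. This enumerates stage occupancy per pipeline tick.

def pipeline_schedule(n_hours):
    """Return a list of (input_hour, factor_hour, compute_hour) per tick;
    None means the stage is idle (pipeline fill or drain)."""
    ticks = []
    for t in range(n_hours + 2):
        inp = t if t < n_hours else None
        fac = t - 1 if 0 <= t - 1 < n_hours else None
        cmp = t - 2 if 0 <= t - 2 < n_hours else None
        ticks.append((inp, fac, cmp))
    return ticks

sched = pipeline_schedule(3)
```

After the two-tick fill, all three stages are busy each tick, so the input and factoring times are hidden behind the time-step loop.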

\section{Conclusion}

We have parallelized the URM airshed model and characterized issues related
to scaling any similar multiscale model to many parallel processors. We have
identified short-term obstacles to scalability that can be addressed through
the use of mixed task and data parallelism. A longer-term issue, the
transport layer/species loop, will require inner-loop as well as outer-loop
parallelism for high scalability --- an issue we have not yet addressed.
Larger problem domains and new, more compute-intensive solvers for aqueous
and aerosol-phase chemistry will improve grain size. Most significantly, we
have found that although multiscale models are less scalable than
uniform-grid models, they nonetheless achieve excellent time-to-solution and
appear to be scalable to hundreds of processors.

\begin{thebibliography}{9}
\bibitem{seinfeld:parallelization:paragon}  D.~Dabdub and J.~Seinfeld, {\em %
Air quality modeling on massively parallel computers}, Atmospheric
Environment, {\bf 28} (1994), pp.~1679--1687.

\bibitem{pvm:dmcc6}  G.A. Geist and V.S. Sunderam, {\em The {PVM} system:
Supercomputer level concurrent computation on a heterogeneous network of
workstations}, in Proceedings of the Sixth Distributed Memory Computing
Conference, {IEEE}, April 1991, pp.~258--261.

\bibitem{gross94a}  T.~Gross, D.~O'Hallaron, and J.~Subhlok, {\em Task
parallelism in a {H}igh {P}erformance {F}ortran framework}, {IEEE} Parallel
\& Distributed Technology, {\bf 2} (1994), pp.~16--26.

\bibitem{Harley:Russell:McRae:Cass:Seinfeld}  R.A. Harley, A.G. Russell,
G.J. McRae, G.R. Cass, and J.H. Seinfeld, {\em Photochemical modeling of
the Southern California air quality study}, Envir. Sci. Technol., {\bf 27}
(1993), pp.~378--388.

\bibitem{grand:cce}  N.~Kumar, A.~Russell, E.~Segall, and P.~Steenkiste, 
{\em Parallel and distributed application of an urban-to-regional multiscale
model}. Computers and Chemical Engineering, in press.

\bibitem{odman:russell:multiscale}  M.~T. Odman and A.G. Russell, {\em A
multiscale finite element pollutant transport scheme for urban and regional
modeling}, Atmospheric Environment, {\bf 25A} (1991), pp.~2385--2394.

\bibitem{segall:hetero:NGEMCOM:thiscollection}  E.~Segall, K.~Walker,
P.~Steenkiste, and A.~Russell, {\em Environmental modeling on the C90-T3D
heterogeneous system}. Following paper in these proceedings.

\bibitem{young:boris:numerical}  T.~R. Young and J.~P. Boris, {\em A
numerical technique for solving ordinary differential equations associated
with the chemical kinetics of reactive flow problems}, Journal of
Physical Chemistry, {\bf 81} (1977), pp.~2424--2427.
\end{thebibliography}

\end{document}
