\documentstyle[leqno,twoside,11pt,ltexproc]{article}
\newcommand{\Chem}[0]{{\bf CHEM}}
\newcommand{\Trans}[0]{{\bf TRANS}}
\newcommand{\C}{{\bf C}}

\input{psfig}

\begin{document}
\bibliographystyle{siamproc}
\cleardoublepage
\pagestyle{myheadings}

\title{Chapter 1\\
Portability and Scalability of a\\
Distributed Multiscale Air Quality Model\thanks{This research was supported by the National Science Foundation under contract CCR-9217365.  The authors acknowledge the use of computational resources provided by the
Pittsburgh Supercomputer Center.}}
\author{Edward Segall\thanks{Department of Computer Science,  Carnegie Mellon University, Pittsburgh, PA.}
\and 
Peter Steenkiste\footnotemark[2]
\and
Naresh Kumar\thanks{Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA.}
\and
Armistead Russell\footnotemark[3]
}
\date{}
\maketitle
\markboth{Segall et al.}{Distributed Multiscale Air Quality Model}
\pagenumbering{arabic}

\begin{abstract}
We have converted the Urban-to-Regional Multiscale (URM) Airshed Model
from a sequential to a message-based distributed program.  In the
process, we learned that multiscale models are inherently harder
to parallelize than uniform-grid models, due to their increased
structural complexity.  Consequently, straightforward data parallelism
(such as that supported by Fortran-90) does not lead to high parallel
speedups for the current generation of distributed multiscale models.
This leads to two interesting results: first, even
small parallel speedup leads to a fast time-to-solution (due to
the inherent efficiency of the multiscale approach) and second, 
mixed task and data parallelism can be used to improve the scalability
of multiscale models to achieve even faster solutions.  
\end{abstract}



\section{Application overview}

The Urban-to-Regional Multiscale (URM) model, a multiscale grid
version of the CIT Airshed
model~\cite{Harley:Russell:McRae:Cass:Seinfeld}, is a 3-dimensional,
time-dependent Eulerian model of the transport, deposition and
chemical evolution of atmospheric pollutants.  It is based on the
atmospheric diffusion equation,
$$\frac{\partial c_i}{\partial
t}~+~\nabla \cdot ({\bf{u}}c_i)~=~\nabla \cdot ({\bf{K}} \nabla
c_i)~+~f_i~+~S_i \eqno(1)$$
Here, $c_i$ is the concentration of the $i$th pollutant among $p$
species $(i = 1,\ldots,p)$, ${\bf u}$ describes the velocity field,
${\bf K}$ is the diffusivity tensor, $f_i(c_1, \ldots, c_p)$ is the
chemical reaction term and $S_i$ is the net source term. URM solves
(1) using operator splitting, advancing in time as
$$c^{n+1}~=~L_{xy}~(\Delta t/2)~L_{cz}~(\Delta t)~L_{xy}~(\Delta
t/2)~c^{n} \eqno(2)$$ 
where $L_{xy}$ is the two-dimensional horizontal transport operator
and $L_{cz}$ is the chemistry and vertical transport
operator~\cite{grand:first}.
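The operator-split update (2) can be sketched in a few lines; the
operators passed in below are illustrative placeholders, not the
actual SUPG and hybrid solvers used by URM.

```python
# Sketch of the Strang-split update (2):
#   c^{n+1} = L_xy(dt/2) L_cz(dt) L_xy(dt/2) c^n
# L_xy and L_cz are stand-in callables, not the real URM solvers.

def step(c, dt, L_xy, L_cz):
    """Advance the concentration state c by one time step dt."""
    c = L_xy(c, dt / 2)   # horizontal transport, half step
    c = L_cz(c, dt)       # chemistry + vertical transport, full step
    c = L_xy(c, dt / 2)   # horizontal transport, half step
    return c
```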

The Streamline Upwind Petrov-Galerkin (SUPG) finite element method is
used for $L_{xy}$~\cite{odman:russell:multiscale}.  For $L_{cz}$, the
hybrid scheme of \cite{young:boris:numerical} for stiff systems of
ordinary differential equations is used.  The use of a 2-dimensional
horizontal transport operator is a direct result of using a multiscale
grid, which is significantly more efficient computationally than a
uniform grid because it requires evaluation of $L_{cz}$ at fewer
points.
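The saving in $L_{cz}$ evaluations can be illustrated with a back-of-the-envelope
point count; the domain sizes, fine-grid fraction, and coarsening factor below
are made-up numbers, not URM parameters.

```python
# Hypothetical comparison: a uniform fine grid over the whole domain versus a
# multiscale grid that is fine over only a fraction of it and coarsened by a
# factor r in each horizontal dimension elsewhere.  Fewer grid points means
# fewer evaluations of the expensive chemistry operator L_cz.

def uniform_points(nx, ny):
    """Horizontal grid points on a uniform nx-by-ny fine grid."""
    return nx * ny

def multiscale_points(nx, ny, fine_fraction, r):
    """Horizontal grid points when only fine_fraction of the domain is
    resolved finely and the rest is coarsened by r in each dimension."""
    fine = fine_fraction * nx * ny
    coarse = (1 - fine_fraction) * nx * ny / (r * r)
    return int(fine + coarse)
```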

\begin{figure}[htb]
\centerline{\psfig{figure=fig/structure.ps,height=1.6in}
\hspace{0.3in}
\psfig{figure=fig/phases.ps,height=1.5in}}
\caption{URM code structure (left) and main data structure (right)}
\label{fig:structure}
\end{figure}

\section{Distributed URM --- Data Parallelism}

Figure \ref{fig:structure} shows the structure of the sequential URM
hourly loop.  There are two main computational phases --- \Chem\
(which corresponds to $L_{cz}$) and \Trans\ ($L_{xy}$). The loop first
reads the hourly meteorological and emissions input data.  Next, for
each time step, \Chem\ and \Trans\ update the concentration array
(\Chem\ once, \Trans\ twice), using emissions or meteorological data
as inputs.  Finally, the hourly output data is saved.

The main data structure is a 3-dimensional array, \C\ (called {\it
conc} in Figure \ref{fig:structure}), representing chemical
concentrations in the volume being modeled.  Its dimensions are nodes
(horizontal grid points), vertical layers, and chemical species. Note
that \C\ does not contain separate $x$ and $y$ dimensions. Instead,
each node has a sequence number, and is indexed by that number.
\Chem\ and \Trans\ both access all of \C, but traverse its dimensions
in different orders. \Trans\ contains a nested loop that runs over
layers (outermost), species, and nodes (innermost), so it performs
best if \C\ is declared with its dimensions stored (in order of
decreasing locality) with nodes first, then species, then layers.  In
\Chem, the node loop is outermost, and the best storage order is:
species, layers, nodes. Thus, locality of reference is improved by
transposing \C\ between the two phases.
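The layout change between the two phases amounts to a 3-dimensional
transpose. A minimal sketch using nested Python lists (with the
innermost index being the fastest-varying one, so nodes are innermost
for \Trans\ and species innermost for \Chem):

```python
# Reorder C from a TRANS-friendly layout, indexed [layer][species][node]
# (nodes fastest-varying), to a CHEM-friendly layout, indexed
# [node][layer][species] (species fastest-varying, node loop outermost).
# Dimension sizes are illustrative; the real code works on Fortran arrays.

def transpose_for_chem(c_trans, n_layers, n_species, n_nodes):
    """Return a copy of C with dimensions reordered for the CHEM phase."""
    return [[[c_trans[l][s][n] for s in range(n_species)]
             for l in range(n_layers)]
            for n in range(n_nodes)]
```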

Within \Chem, iterations of the node loop are independent, allowing
data-parallel execution. Similarly, in \Trans, the layer/species loops
contain two-level nested data-parallelism.  There is no obvious
outer-loop parallelism within the node loop, because node iterations
in \Trans\ are not independent. 

We have developed a distributed version of URM that exploits
outer-loop data parallelism in \Chem\ and \Trans \cite{grand:first}.
It contains a master process (for I/O and control), and two sets of
parallel processes --- one for \Chem\ and one for \Trans. The main
source of interprocess communication in this version is the repeated
redistribution of \C\ between \Chem\ and \Trans.  The transpose of \C,
mentioned above, is implemented as part of this communication.
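The two data decompositions can be sketched with a simple block
partitioner: \Chem\ workers each receive a contiguous block of node
indices, and \Trans\ workers a share of the layer/species iterations.
The worker counts below are hypothetical; in the actual code the
redistribution of \C\ is carried out with PVM messages.

```python
# Split the iteration space range(n_items) into n_workers contiguous,
# near-equal blocks -- the standard block decomposition used for both the
# CHEM node loop and the TRANS layer/species loops in this sketch.

def block_partition(n_items, n_workers):
    """Return a list of n_workers ranges covering range(n_items)."""
    base, extra = divmod(n_items, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        blocks.append(range(start, start + size))
        start += size
    return blocks
```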

\section{Performance of the Data-Parallel Implementation} 

The data-parallel implementation uses PVM \cite{pvm:dmcc6} for
communication.  It is quite portable: it has been successfully run on
a variety of workstation clusters, on the Cray C90, and on the Cray
C90-T3D heterogeneous supercomputer~\cite{segall:hetero:epa:local}.  On
workstation clusters, we have measured speedups of about six on ten
processors using this implementation. We expect the parallel
efficiency to drop below the current $60\%$ as the number of
processors scales up.  Further, analysis of the time taken by
different parts of the program suggests a 25-fold speedup is about the
maximum possible even if many more processors are used (assuming no
change to the size of the domain or to the solvers used).  In
contrast, Dabdub and Seinfeld \cite{seinfeld:parallelization:paragon}
have recently reported speedups of about 30 and 87 for two different
transport schemes (on a 256-processor Intel Touchstone Delta) in a
parallel version of the CIT Airshed model (using the same inputs).

One obstacle to scaling URM to hundreds of processors is the
relatively low degree of outer-loop parallelism in $L_{xy}$.  It is on
the order of $n_{species} \times n_{layers}$ (less than $200$ for
current domains, though this will increase in the future). In
contrast, uniform-grid
models multiply this factor by the size of the $x$ or $y$ dimensions.
In compensation, URM requires significantly less computation to model
a given region than do uniform-grid models\footnote{For example, the
sequential URM time-to-solution for the LA basin is $1/2$ that of the
CIT time.}. As a result, parallel time-to-solution is shorter even
though speedup is lower\footnote{A 24-hour URM simulation of LA takes
505 sec. on ten DEC Alpha 3000/400 and 3000/500 workstations (using a
DEC Gigaswitch network), compared to 1300-1800 sec. for the parallel
CIT model on a 256-processor Intel Touchstone Delta.  Note: while the
Alpha CPUs are several times faster than the Intel 860 CPUs of the
Delta, the Delta is still more powerful overall than the Alpha
cluster.}.  This illustrates that when evaluating the performance of a
distributed computation, one should consider not only speedup but
also the algorithms used. Further, a detailed comparison of different
models requires a means of equating algorithms that exhibit different
numerical behavior.

\section{Distributed URM --- Mixed Task and Data Parallelism}

Other factors also reduce parallel URM speedup, including
sequentiality in data I/O and hourly preprocessing/post-processing, as
well as communication latency between \Chem\ and \Trans. These are in
fact the current major performance issues, and must be addressed
before $L_{xy}$ parallelism becomes a limitation.

I/O can be parallelized using both data parallelism (each array can be
input by a separate processor) and pipelining (the next hour's data
can be input while the computation for the current hour is being
performed).  These techniques are not supported by current
data-parallel versions of Fortran.  They are straightforward to
implement in a MIMD programming model such as PVM, and are also
supported by the task extensions of Fx (a research HPF compiler that
supports both task and data parallelism\cite{gross94a}).
Post-processing can be similarly pipelined.
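The input pipelining described above can be sketched with a helper
thread standing in for a separate input process; {\tt read\_hour} and
{\tt compute\_hour} are placeholders for the actual hourly I/O and the
time step loop.

```python
# Pipelined hourly input: while hour h is computed, a helper thread reads
# the data for hour h+1.  read_hour and compute_hour are placeholder
# callables, not the real URM I/O and solver phases.
import threading

def run_hours(read_hour, compute_hour, n_hours):
    """Overlap input for hour h+1 with the computation for hour h."""
    data = read_hour(0)              # prime the pipeline
    for h in range(n_hours):
        nxt, t = {}, None
        if h + 1 < n_hours:
            t = threading.Thread(
                target=lambda: nxt.update(d=read_hour(h + 1)))
            t.start()                # read next hour in the background
        compute_hour(h, data)        # compute current hour
        if t is not None:
            t.join()                 # wait for the prefetched input
            data = nxt["d"]
```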

Critical-path communication time is ultimately a function of platform
selection. Its impact on speedup is strongly dependent on problem size
--- larger problems are less sensitive to overhead\footnote{For
example, in our simulations of the Northeast U.S. region (about five
times the size of the LA region), communication overhead did not
measurably affect performance on the Alpha cluster.}.  Thus, modeling
larger regions results in better scalability. Increasing the amount of
computation per communication also improves speedup. Future aerosol
and aqueous-phase solvers will have this effect. 

One issue, the hourly preprocessing required by the multiscale method,
requires more detailed discussion.  The use of a two-dimensional transport
operator results in only one finite element stiffness matrix per
layer; thus, during preprocessing, outer-loop parallelism is equal to
only the number of layers\footnote{In contrast, within each layer, a
one-dimensional finite-element solver $L_{x}$ forms one matrix for
every $y$ index value, and $L_{y}$ forms one for each $x$ value.}.
This matrix is formed and factored once per hour.  The time to do this
is small compared to the overall sequential time; however,
parallelization of \Chem\ and \Trans\ reduces their execution time,
increasing the relative significance of preprocessing time.  Our
current implementation factors the stiffness matrices for different
layers in parallel.  Scalability can be further improved using
pipelining, as described for inputs.  This results in a multi-level
pipeline --- while hour $n$ is in the time step loop, the stiffness
matrix for hour $n+1$ may be factored, and at the same time, the
data for hour $n+1$ may be input.  Figure~\ref{fig:mixed}
illustrates these techniques.
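The per-layer parallelism in preprocessing can be sketched as follows;
{\tt factor} is a placeholder for the actual hourly stiffness-matrix
factorization, and the thread pool stands in for the parallel
processes.

```python
# Factor one stiffness matrix per layer in parallel.  The available
# parallelism is bounded by the number of layers, as noted in the text.
# factor is a placeholder callable, not a real factorization routine.
from concurrent.futures import ThreadPoolExecutor

def factor_layers(matrices, factor, n_workers):
    """Apply factor() to each layer's matrix concurrently, preserving
    the layer order in the result."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(factor, matrices))
```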

\begin{figure}[htb]
\centerline{\psfig{figure=fig/hetero.eps,height=2in}}
\caption{Mixed Task and Data Parallelism in the URM Model}
\label{fig:mixed}
\end{figure}


\section{Conclusion}

We have parallelized the URM Airshed model and characterized issues
related to scaling any similar multiscale model to many parallel
processors. We have identified short-term obstacles to scalability
that can be addressed through the use of mixed task and data
parallelism.  A longer-term issue, the transport layer/species loop,
will require inner-loop as well as outer-loop parallelism for high
scalability --- an issue we have not yet addressed. Larger problem
domains and new, more compute-intensive solvers for aqueous and
aerosol-phase chemistry will improve grain size.  Most significantly,
we have found that although multiscale models are less scalable than
uniform-grid models, they nonetheless achieve excellent
times-to-solution and appear scalable to hundreds of processors.

\bibliography{t3d}

\end{document}


