\documentstyle[leqno,twoside,11pt,ltexproc]{article}
\newcommand{\Chem}[0]{{\bf chem}}
\newcommand{\Trans}[0]{{\bf trans}}
\newcommand{\C}{{\bf conc}}

\input{psfig}

\begin{document}
\bibliographystyle{siamproc}
\cleardoublepage
\pagestyle{myheadings}

\title{Scalability Issues in Multiscale Air Quality Models\thanks{This research was supported by the National Science Foundation under contract CCR-9217365.  The authors acknowledge the use of computational resources provided by the
Pittsburgh Supercomputer Center.}}
\author{Edward Segall\thanks{Department of Computer Science,  Carnegie
Mellon University, Pittsburgh, PA.}~\thanks{Currently with 
Dept. of Computer and Information Sciences, University of
Delaware, Newark, DE.}
\and 
Peter Steenkiste\footnotemark[2]
\and
Naresh Kumar\thanks{Department of Mechanical Engineering, Carnegie
Mellon University, Pittsburgh, PA.}~\thanks{Currently with Sonoma
Technology, Inc., Santa Rosa, CA.}
\and
Armistead Russell\footnotemark[4]
}
\date{}
\maketitle
\markboth{Segall et al.}{Scalability Issues in Multiscale...} 
\pagenumbering{arabic}

\section{Introduction}
In parallelizing the Urban-to-Regional Multiscale (URM) Airshed Model,
we have learned that multiscale models are inherently harder to
parallelize than uniform-grid models. Straightforward data parallelism
(e.g., Fortran 90) does not lead to high parallel speedups with
current-generation solvers.  However, the efficiency of the multiscale
approach provides good time-to-solution even with moderate
speedups. Further, proper data distribution and mixed task and data
parallelism can be used to achieve higher speedups where required.
Finally, the different characteristics of the chemistry and transport
solvers can be exploited to take advantage of heterogeneous
distributed computing platforms.

\section{Application overview}

URM is a multiscale grid version of the CIT Airshed
model~\cite{Harley:Russell:McRae:Cass:Seinfeld}.  It solves the
atmospheric diffusion equation \cite{grand:cce} using operator
splitting, advancing the concentration field in time as
\begin{equation}
c^{n+1} = L_{xy}(\Delta t/2)\,L_{cz}(\Delta t)\,L_{xy}(\Delta t/2)\,c^{n}
\end{equation}
where $L_{xy}$ is the horizontal transport operator
and $L_{cz}$ is the chemistry/vertical transport operator.  The
Streamline Upwind Petrov-Galerkin (SUPG) finite element method is used
for $L_{xy}$~\cite{odman:russell:multiscale}.  The use of a
two-dimensional horizontal transport operator is a direct consequence
of the multiscale grid, which is computationally more efficient than a
uniform grid because it requires evaluating $L_{cz}$ at fewer
points. For $L_{cz}$, the hybrid scheme of
\cite{young:boris:numerical} for stiff systems of ordinary
differential equations is used.
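
As a sketch of this operator-split advance (in Python, with \verb|L_xy| and \verb|L_cz| standing in as hypothetical solver callables rather than URM's actual routines), a single step wraps a full chemistry/vertical transport step between two horizontal transport half-steps:

```python
def advance(conc, dt, L_xy, L_cz):
    """One operator-split time step:
    c^{n+1} = L_xy(dt/2) L_cz(dt) L_xy(dt/2) c^n.

    `L_xy` and `L_cz` are hypothetical solver callables taking a
    concentration field and a time increment, returning the updated field.
    """
    conc = L_xy(conc, dt / 2)   # horizontal transport, half step
    conc = L_cz(conc, dt)       # chemistry/vertical transport, full step
    conc = L_xy(conc, dt / 2)   # horizontal transport, half step
    return conc
```

The symmetric (Strang-type) ordering of the half-steps is what gives the split scheme second-order accuracy in time.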


\begin{figure}[htb]
\centerline{\psfig{figure=fig/structure.ps,height=1.6in}
\hspace{0.3in}
\psfig{figure=fig/phases.ps,height=1.5in}}
\caption{URM code structure (left) and main data structure (right)}
\label{fig:structure}
\end{figure}

\section{Distributed URM --- Data Parallelism}

Fig.~\ref{fig:structure} shows the structure of the sequential URM
hourly loop.  \Chem\ corresponds to $L_{cz}$ and \Trans\ to
$L_{xy}$. The loop first reads hourly meteorological and emissions
input data.  Next, for each time step, \Chem\ and \Trans\ update the
concentration array, \C. Finally, hourly output data is saved.
Within \Chem, node loop iterations are independent, allowing
data-parallel execution. Within \Trans, the layer/species loops
contain two-level nested data-parallelism.  Node iterations in \Trans\
are not independent. Consequently, parallelism in \Chem\ and \Trans\
requires orthogonal data distributions
(Fig.~\ref{fig:structure}).
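
The orthogonality of the two distributions can be illustrated with simple owner maps (these mapping functions are illustrative stand-ins, not URM code): \Chem\ distributes grid nodes across processors, while \Trans\ distributes (layer, species) pairs, so redistributing \C\ between phases moves each entry from its \Chem\ owner to its \Trans\ owner.

```python
def chem_owner(node, n_procs):
    # Chem: node-loop iterations are independent, so nodes are spread
    # across processors (here cyclically, which also helps load balance).
    return node % n_procs

def trans_owner(layer, species, n_layers, n_procs):
    # Trans: the layer/species loops are independent, so each processor
    # owns complete horizontal fields for a subset of (layer, species)
    # pairs; node iterations within a field stay on one processor.
    return (species * n_layers + layer) % n_procs
```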

We have developed a distributed version of URM that exploits
outer-loop data parallelism in \Chem\ and \Trans~\cite{grand:cce}.
It uses PVM~\cite{pvm:dmcc6} for communication.  The main source of
interprocess communication is the repeated redistribution of \C\
between \Chem\ and \Trans.  It has been run successfully on a variety
of workstation clusters, the Cray C90, and the Cray C90--T3D
heterogeneous supercomputer~\cite{segall:hetero:NGEMCOM}.  It has
achieved speedups $\geq 6.0$ on ten workstations. Analysis
suggests that 25 is about the maximum possible speedup regardless of
the number of processors used (assuming no change to the size of the
domain or to the solvers used).  In contrast, Dabdub and Seinfeld
\cite{seinfeld:parallelization:paragon} have recently reported
speedups of about 30 and 87 for two different transport schemes (on a
256-processor Intel Touchstone Delta) in a parallel version of the CIT
Airshed model, using the same inputs.
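
The ceiling of about 25 is consistent, under a simple Amdahl-style model, with an effectively serial fraction of roughly 4\%; that fraction is an illustrative assumption here, not a measured figure from our analysis:

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup with a fixed serial fraction s on p processors,
    speedup = 1 / (s + (1 - s)/p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# With s = 0.04 (illustrative), speedup approaches 1/0.04 = 25 as the
# number of processors grows, no matter how many are added.
```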

Part of the reason for this difference is the relatively low degree of
outer-loop parallelism in $L_{xy}$, compared to that available from
separate $L_{x}$ and $L_{y}$ operators.  In compensation, multiscale
methods require significantly less computation to model a given
region, so time-to-solution is shorter even though speedup is
lower\footnote{A 24-hour URM simulation of LA takes 505 sec. on ten
DEC Alpha 3000/400 and 3000/500 workstations (using a DEC Gigaswitch
network), compared to 1300--1800 sec. for the parallel CIT model on a
256-processor Intel Touchstone Delta.  Note: while the Alpha CPUs are
several times faster than the Intel 860 CPUs of the Delta, the Delta
is still more powerful overall than the Alpha cluster.}.  This
illustrates that when evaluating the performance of a distributed
computation, one should consider not only the speedup but also the
algorithms used. Further, detailed comparison of different models
requires a means of equating algorithms that exhibit different
numerical behavior.

Significantly, we observed in this work that a cyclic distribution of
nodes to processors improves load balance by breaking up spatial
locality in the load distribution.
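
The effect can be sketched with synthetic per-node chemistry costs (the load values below are hypothetical): when the load is spatially clustered, as over an urban plume, a blocked assignment overloads the processors owning that region, while a cyclic assignment spreads it evenly.

```python
def block_partition(loads, p):
    """Total load per processor under a blocked (contiguous) node assignment."""
    n = len(loads)
    size = -(-n // p)  # ceil(n / p) nodes per processor
    return [sum(loads[i * size:(i + 1) * size]) for i in range(p)]

def cyclic_partition(loads, p):
    """Total load per processor when node i goes to processor i mod p."""
    return [sum(loads[i::p]) for i in range(p)]

# Hypothetical loads: one heavily loaded region, one light region.
loads = [10] * 8 + [1] * 8
# max(block_partition(loads, 4)) == 40, max(cyclic_partition(loads, 4)) == 22
```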

An interesting opportunity results from the observation that $L_{xy}$
contains moderate outer-loop and vector parallelism, whereas $L_{cz}$
contains massive outer-loop parallelism and little vector
parallelism. In~\cite{segall:hetero:NGEMCOM}, we discuss tradeoffs
involved in a heterogeneous implementation that places each phase on
an architecture best suited to its solution.

\section{Distributed URM --- Mixed Task and Data Parallelism}

Other speedup-reducing factors include sequentiality in data I/O,
hourly pre- and post-processing, and communication latency between
\Chem\ and \Trans. For example, $L_{xy}$ uses only one finite element
stiffness matrix per layer (per hour), so during preprocessing the
outer-loop parallelism is limited to the number of layers. Although
preprocessing time is small compared to the overall sequential time,
parallelization of \Chem\ and \Trans\ increases its relative
significance.  These factors are currently more significant than
$L_{xy}$ parallelism. They can be addressed by combining task with
data parallelism (Fig.~\ref{fig:mixed}).  The basic ideas are
summarized here --- see~\cite{segall:scale:NGEMCOM} for more details.

I/O can be parallelized using data parallelism (input each array on a
separate processor) plus pipelining (input data for hour $n+1$ while
computing the time step for hour $n$). For preprocessing, our current
implementation factors the stiffness matrices for the different layers
in parallel; pipelining can also be used, as for inputs.  Critical-path
communication time is ultimately a function of platform selection.
Its impact is strongly dependent on problem size and the 
computation/communication ratio (grain size) of the solvers used.
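
The pipelined input scheme can be sketched with a one-slot queue acting as a double buffer; \verb|read_hour| and \verb|compute_hour| are hypothetical stand-ins for URM's hourly input and time-stepping phases, not its actual interfaces.

```python
import threading
import queue

def pipelined_run(read_hour, compute_hour, n_hours):
    """Overlap input with computation: a reader thread fetches the data for
    hour n+1 while the main thread computes the time steps for hour n."""
    buf = queue.Queue(maxsize=1)   # one hour of lookahead (double buffering)

    def reader():
        for h in range(n_hours):
            buf.put(read_hour(h))  # blocks if the lookahead slot is full

    t = threading.Thread(target=reader)
    t.start()
    results = [compute_hour(h, buf.get()) for h in range(n_hours)]
    t.join()
    return results
```

In the distributed code the queue would be replaced by message passing between an input processor and the compute processors, but the overlap structure is the same.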
\begin{figure}[htb]
\centerline{\psfig{figure=fig/hetero.eps,height=2in}}
\caption{Mixed Task and Data Parallelism in the URM Model}
\label{fig:mixed}
\end{figure}


\section{Future}

Larger problem domains and more compute-intensive aqueous and
aerosol-phase chemistry solvers will improve efficiency by increasing
grain size. Data parallelism provides good efficiency for multiscale
models on tens of processors now. Adding task parallelism will
increase this to hundreds in the near future.  For very high
scalability (desirable for extremely large problem domains), the
transport layer/species loop may require inner-loop as well as
outer-loop parallelism, which we have not yet addressed.

\bibliography{AMS}

\end{document}


