\begin{flushleft}
{\bf 3. Parallelization Approach}
\end{flushleft}

The overview of the model structure in the previous section shows the
major loops and arrays of the URM model.  This section describes in
more detail what parallelism is available from this structure, and how
it is implemented.


\begin{flushleft}
{\bf 3.1 Parallelism in URM}
\end{flushleft}

In the URM model most of the computation time is spent in the CHMVRT
modules, followed by the HORIZ modules.  The relative time spent in
CHMVRT vs. HORIZ depends on the region size. For a small simulation
(such as the Los Angeles region), about 70--75\% of the CPU time is
spent in CHMVRT, and HORIZ takes about 25\% of the CPU time in the
current implementation. For a region the size of the Northeastern
United States, the proportion is closer to 50-50. The remaining CPU
time is spent in input/output and other miscellaneous operations.

CHMVRT  performs  integration  of  the  chemistry   and  the  vertical
transport equation  on  a  column-by-column  basis.  Here, ``column''
refers to a horizontal grid cell extending in  the vertical direction.
As the application of $L_{cz}$ to each column is independent of the others, 
the degree of  outer-loop parallelism  in CHMVRT is equal to the
number of horizontal grid cells. For a typical regional
simulation this number is in the range of 3000--5000.
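
As a minimal sketch of this column-level parallelism (in Python rather than
the model's FORTRAN; \verb|integrate_column| is a hypothetical stand-in for
applying $L_{cz}$, not URM code):

```python
from concurrent.futures import ThreadPoolExecutor

def integrate_column(column):
    # Stand-in for applying L_cz to one column: chemistry plus vertical
    # transport; here a hypothetical first-order decay of each layer value.
    return [c * 0.5 for c in column]

def chmvrt_parallel(columns, workers=4):
    # Columns are mutually independent, so a parallel map (a DOALL loop
    # over horizontal grid cells) exposes all of the available parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(integrate_column, columns))
```

For example, \verb|chmvrt_parallel([[1.0, 2.0], [3.0, 4.0]])| returns
\verb|[[0.5, 1.0], [1.5, 2.0]]|; each inner list plays the role of one column.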

For each atmospheric layer, HORIZ performs horizontal transport of all
chemical species. In a  given layer, different species are transported
independently. Thus, the degree of outer-loop parallelism in HORIZ is
equal to the number of vertical layers times the number of species,
typically several hundred (a typical simulation has about 50 species
and 10 layers).

The URM model
is very similar in its structure to other regional air quality models,
so  the  parallelism  explored  here  should be  applicable  to  other
regional  models.  Parallelization of  these  regional  models,  which
generally  use  one   2-dimensional   horizontal  transport   operator
$L_{xy}$,  is   more  challenging  than   for  models  that  use   two
1-dimensional operators $L_{x}$  and  $L_{y}$  because  in the  latter
case,  many  instances  of  $L_{x}$  and  $L_{y}$ correspond  to  each
instance   of  $L_{xy}$.    All  $L_{x}$   instances   may  be  solved
simultaneously, as may all  $L_{y}$ instances, providing a large degree
of parallelism to  the application of 1-dimensional operators.  Hence,
there  is   a  time-to-solution tradeoff  between   the  computational   efficiency  of
2-dimensional  multiscale operators  and  the  outer-loop  parallelism
available from one-dimensional uniform grid operators.

The parallelism in both HORIZ  and CHMVRT is data parallelism.  In the case of
CHMVRT,  the parallelism can easily be exploited using a DOALL loop across the
grid cells.  The nature of the parallelism in HORIZ is more  complicated, as can
be explained  with the help of Figure  7.2, which shows the loop structure
within this subroutine.  The outermost loop runs over the vertical
layers (NVRT) of the model.  Once per hour, before performing the actual
advective transport, the model reads new meteorology data for each layer $l$
and assembles from it the stiffness matrix $K_l$. It then factors all the $K_l$
into  LDU (Lower-Diagonal-Upper) triangular form\footnote{The L and U components
are sparse, due to the local connectivity of the grid.}.  For each time step,
the LDU factors are used to  perform advective transport on the concentration
vector for each species.  
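
The resulting amortization, assembling and factoring each $K_l$ once per hour
and then reusing the factors for every species and time step, can be sketched
as follows (Python with trivial stand-in routines, purely to count operations;
none of this is the URM code itself):

```python
calls = {"factor": 0, "solve": 0}

def assemble_K(met, layer):
    return (met, layer)            # stand-in for stiffness-matrix assembly

def ldu_factor(K):
    calls["factor"] += 1
    return K                       # stand-in for LDU factorization

def ldu_solve(factors, conc):
    calls["solve"] += 1
    return conc                    # stand-in for back-substitution

def horiz(n_layers, n_species, n_hours, steps_per_hour):
    conc = [[0.0] * n_species for _ in range(n_layers)]
    for hour in range(n_hours):
        met = hour                 # new meteorology arrives once per hour
        factors = [ldu_factor(assemble_K(met, l)) for l in range(n_layers)]
        for _ in range(steps_per_hour):
            for l in range(n_layers):
                for s in range(n_species):
                    conc[l][s] = ldu_solve(factors[l], conc[l][s])

horiz(n_layers=10, n_species=50, n_hours=2, steps_per_hour=6)
```

With 10 layers, 50 species, 2 hours, and 6 steps per hour, 20 factorizations
support 6000 solves, which is why the direct method pays off.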

During  a given time  step, advective  transport  can be performed  in
parallel  for  all  species  and  all  layers.   Unlike  CHMVRT,  this
parallelism  is based on a nested  loop (layers and species); further,
one of the  parameters in  the  computation (the  factored  matrix) is
different for each layer. Nested parallelism such as this is not supported
by the simple parallel DO loops found in FORTRAN 90; effective
implementation requires hand parallelization or a language that supports
this loop structure.
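
One simple hand-parallelization strategy is to flatten the loop nest into a
flat list of (layer, species) tasks; a sketch, assuming each task later looks
up the factored matrix of its own layer:

```python
from itertools import product

def transport_tasks(n_layers, n_species):
    # Flatten the nested layer/species loops into one flat list of
    # (layer, species) pairs; a flat worker pool can then exploit the
    # full layers-times-species degree of parallelism.  Each task still
    # needs the factored matrix belonging to its layer index.
    return list(product(range(n_layers), range(n_species)))
```

For the typical sizes above, \verb|len(transport_tasks(10, 50))| is 500,
matching the several-hundredfold parallelism quoted for HORIZ.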

Matrix assembly and factorization are also data parallel: each layer
is independent of the others. However, the degree of parallelism here
is only equal to the number of layers, compared with layers $\times$
species for the advective transport.  This makes it harder to
parallelize the transport phase effectively because, depending on the
number of processors used, the LDU factors might have to be replicated
on several processors.  Computation of the LDU factors can be
sequential, partially distributed (one processor per layer), or fully
distributed (on every transport processor). Fully distributed
computation is currently preferable because it reduces communication
(the meteorology data is more compact than the resultant factored
matrix), but it causes some redundant computation when the number of
processors is larger than the number of layers.
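
The amount of redundant work in the fully distributed scheme is easy to
quantify under a simple block assignment of processors to layers (the
assignment rule below is illustrative, not necessarily the one URM uses):

```python
def layer_of(proc, n_procs, n_layers):
    # Block-assign transport processors to layers; with more processors
    # than layers, several processors share a layer and each factors that
    # layer's matrix locally (the redundant computation mentioned above).
    return proc * n_layers // n_procs

def redundant_factorizations(n_procs, n_layers):
    assigned = {layer_of(p, n_procs, n_layers) for p in range(n_procs)}
    return max(0, n_procs - len(assigned))
```

For instance, with 25 processors and 10 layers,
\verb|redundant_factorizations(25, 10)| is 15, while with 10 or fewer
processors no factorization is duplicated.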

\begin{flushleft}
{\bf 3.2 Implementation using PVM}
\end{flushleft}

The original program has been functionally partitioned into different
modules. In the simplest PVM implementation, there are three: a host
module (which controls execution), a horizontal transport module, and
a chemistry module.  The horizontal transport and chemistry modules
are themselves data-parallel, being partitioned via domain
decomposition.

The host module reads the simulation parameters and options from a
command file, spawns the transport and chemistry tasks, and
distributes the initial data.  It also handles all input and output,
input preprocessing, and inter-task communication and synchronization.

The sequential  chemistry routine was parallelized by partitioning the
outer (column) loop.  This module is SPMD (Single Program, Multiple
Data) rather than SIMD (Single Instruction, Multiple Data) because the
number   of  iterations  per   time  step   of   the  inner
(convergence) loop varies independently for each column.   The  number
of  columns  assigned to  each  process is  selected at  runtime  to
facilitate  load balancing.  Once per simulated hour,  each  chemistry
processor  receives information  on temperature, chemistry  time  step
size,  pollutant emissions, and  other hourly-varying inputs from  the
host  processor.   Once per time  step,  each  process  receives its
portion  of the concentration array from the  host, operates on  it to
generate the new concentration values for its subdomain, and sends the
results back to the host process.

The horizontal transport module is parallelized  across the layer loop
and  the  chemical  species  loop.   Each  task  does  the   transport
computation for one or more  species within a single  layer.  Once per
hour, each transport process receives the stiffness matrix for that hour
from  the host.  For each time step within a given hour, it solves
for the new  concentration vector, based  on  the stiffness matrix and
forcing vector (which incorporates the previous concentration vector).
Because multiple solutions are computed (one for each species and time
step) before the stiffness matrix  changes,  a direct method  based on
LDU  decomposition  and  back-substitution is used.
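
The factor-once, solve-many pattern can be illustrated on a small tridiagonal
system (the real $K_l$ is a general sparse matrix arising from the
2-dimensional grid; a tridiagonal matrix merely keeps the sketch short):

```python
def lu_factor_tridiag(a, b, c):
    # One-time LDU-style factorization of a tridiagonal matrix
    # (a: sub-, b: main, c: super-diagonal).  Illustrative stand-in for
    # factoring the sparse stiffness matrix K_l once per hour.
    n = len(b)
    l = [0.0] * n
    d = list(b)
    for i in range(1, n):
        l[i] = a[i] / d[i - 1]
        d[i] = b[i] - l[i] * c[i - 1]
    return l, d

def lu_solve_tridiag(l, d, c, rhs):
    # Forward elimination and back-substitution, reused for every
    # species and every time step within the hour.
    n = len(d)
    y = [0.0] * n
    y[0] = rhs[0]
    for i in range(1, n):
        y[i] = rhs[i] - l[i] * y[i - 1]
    x = [0.0] * n
    x[n - 1] = y[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - c[i] * x[i + 1]) / d[i]
    return x

# Factor once, then reuse for many right-hand sides (one per species/step):
l, d = lu_factor_tridiag([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0])
x1 = lu_solve_tridiag(l, d, [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])  # x = [1, 1, 1]
x2 = lu_solve_tridiag(l, d, [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])  # x = [1, 2, 3]
```

The cost of the factorization is paid once; each subsequent solve is a cheap
pair of triangular sweeps.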

Beyond the ``basic'' distributed model described above, several further
optimizations were implemented:
\begin{itemize}
\item Fully distributed parallelization of matrix assembly and LDU
factorization (incorporated into HORIZ).

\item Prefetching the hourly data onto the host node (in the basic
distributed version, the data is stored on a remote AFS (Andrew File
System) file server).

\item
In the basic distributed URM model, each node operates on the same
number of grid points.  Semi-dynamic load balancing was added to CHMVRT:
the columns are partitioned nonuniformly, based on the cost of
convergence at each grid point.
\end{itemize}
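
One way such a cost-based partitioning could be realized is a greedy heuristic
that always gives the next most expensive column to the least-loaded processor
(a sketch; \verb|balance_columns| and its cost inputs are hypothetical, not
the URM implementation):

```python
import heapq

def balance_columns(costs, n_procs):
    # Greedy longest-processing-time heuristic: take columns in order of
    # decreasing estimated convergence cost and assign each one to the
    # currently least-loaded processor, so that slowly converging columns
    # do not pile up on a single node.
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    parts = [[] for _ in range(n_procs)]
    for col in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, p = heapq.heappop(heap)
        parts[p].append(col)
        heapq.heappush(heap, (load + costs[col], p))
    return parts
```

For example, \verb|balance_columns([5.0, 4.0, 3.0, 3.0, 3.0], 2)| yields
per-processor loads of 8 and 10, whereas a fixed equal-count split ignores
the cost differences entirely.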
