\begin{flushleft}
{\bf 5. Discussion}
\end{flushleft}

The results in the previous section show that the
speed-up obtained for the northeastern United States is about 3.7
using 5 processors and 6.3 using 10
processors.  According to Amdahl's law [{\it Amdahl}, 1967], the ideal
speed-up, {\it s}, that can be achieved using {\it N} processors in parallel
is given  by $$s~=~\frac{1}{(1-p)~+~p/N},\eqno(3)$$ where  {\it  p} is
the fraction of total time spent on the parts of the program that have
been parallelized.  In the present application, the parallelized
portions of the program take about 96\% of the total CPU time for this
input set.  Using
Equation (3), the ideal speed-up is 4.31 for 5 processors and 7.35 for
10 processors.  Thus, the actual performance is about 86\% of the
ideal speed-up in both cases.  There are two reasons for this reduced speed-up.
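
As a check on these figures, the substitution can be written out
explicitly.  The subscripted symbols $s_5$ and $s_{10}$ are introduced
here only for illustration and denote the ideal speed-up for 5 and 10
processors.  With $p = 0.96$, Equation (3) gives
$$s_5~=~\frac{1}{0.04~+~0.96/5}~=~\frac{1}{0.232}~\approx~4.31,\qquad
s_{10}~=~\frac{1}{0.04~+~0.96/10}~=~\frac{1}{0.136}~\approx~7.35,$$
so the measured speed-ups of 3.7 and 6.3 correspond to
$3.7/4.31 \approx 0.86$ and $6.3/7.35 \approx 0.86$ of the ideal
values.  The same equation bounds the speed-up by $1/(1-p) = 25$ as
{\it N} grows, assuming that {\it p} remains at 0.96.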

First,  there is  a  high  communication  cost  associated  with  this
application  since the concentration array has to be communicated back
and forth between the  chemistry module, the transport module, and the
host  module.  Between the  horizontal transport
and   the  chemistry  computations,  the  URM  model   transposes  the
concentration  matrix,  since  the  two  phases  access  the  data  in
orthogonal directions.  At present, both the communication between the two
phases and the transpose are done centrally through the host
processor.  A more scalable alternative would be to perform the
transpose in a distributed fashion using all-to-all communication
among the processors.

The second factor reducing the observed speed-up is the load
imbalance that occurs in the chemistry phase.  The semi-dynamic
load-balancing optimization improves the performance but does not
completely solve the problem, as the time taken by the chemistry
integrator changes non-uniformly over time.
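
The cost of this imbalance can be made concrete with a simple
illustrative estimate, which is a sketch under stated assumptions
rather than a measurement from the model.  The symbols $t_i$ and $e$
are introduced here for illustration only.  Suppose processor $i$
spends time $t_i$ in the chemistry phase of a given step, and assume
that all processors must finish before the next phase can proceed.
The phase then takes $\max_i t_i$ rather than the perfectly balanced
time $\frac{1}{N}\sum_i t_i$, so the load-balance efficiency for that
step is
$$e~=~\frac{\frac{1}{N}\sum_{i=1}^{N} t_i}{\max_{i}\, t_i}~\le~1,$$
with equality only when all $t_i$ are equal.  Because the $t_i$ change
from step to step, a partition that is held fixed over several steps
(assumed here to be the case for a semi-dynamic scheme) can keep $e$
close to, but generally not at, 1.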
