\begin{flushleft}
{\bf 4. Performance Results}
\end{flushleft}

One of the primary objectives of this study was to explore how a
parallel implementation could improve model execution time on a
variety of platforms while preserving code portability. This section
discusses the performance of the distributed URM model on several
parallel and distributed systems.

\newpage
\begin{flushleft}
{\bf 4.1 Experimental setup}
\end{flushleft}

Our experimental setup consists of a variety of computational
platforms.  The application is organized so that execution can be
reconfigured from one platform to another with only one or two
commands.  The application was run on many platforms, but results are
presented only for the following representative systems:
\begin{itemize}
\item The Pittsburgh Supercomputing Center (PSC) Alpha cluster: 14
DEC$^\copyright$ Alpha systems (8 AXP 500 and 6 AXP 400), 12  of which
are connected by a DEC$^\copyright$ Gigaswitch.

\item The CMU Alpha cluster: 10 DEC$^\copyright$ Alpha systems (all AXP 400) connected by an Ethernet.

\item The PSC Cray 90: a 16 processor shared memory system.
\end{itemize}

All Alpha systems run DEC/OSF1 as the operating system, the Cray runs
UNICOS, and PVM version 3.2 is used for communication.

All these systems are (nearly) homogeneous systems (all processors are
equivalent, though the AXP 500 and AXP 400 differ slightly in clock
rate), but the characteristics of their interconnects are very
different, ranging from shared memory to a slow Ethernet.

\begin{flushleft}
{\bf 4.2 Measurements}
\end{flushleft}

Table  1 shows  the performance  comparisons for the  distributed  URM
model, as described in Section 3.   The  model has been applied to the
Los Angeles Region, and the simulation is performed for a 24-hour time
period.  The first set of measurements  shows  the execution times for
the sequential URM  application (i.e., not for the distributed version
running  on a single processor).   It is observed that the  AXP 400 is
about 12\% slower than the AXP 500 on this application.

The second set of measurements shows the execution times on the PSC
Alpha cluster.  For the basic parallelization, the speed-up seen is
about 3 on 5 processors and close to 5 on 10 (heterogeneous)
processors (2784/604 = 4.6 if the average performance of a single node
in the cluster is used).  Two different optimizations were applied to
the application next.  The first one is  the  parallelization  of  the
factorization in  the horizontal transport.  This does not pay off for
the 5-node  system, but  improves performance  by about 30 seconds for
the 10-node case.  This can be explained by the fact that the  10-node
case creates a communication bottleneck by sending the factored matrix
to  the 9  other  nodes, while  doing the factorization  on  all nodes
eliminates  this.  In  the 5-node case,  only  half that  much time is
spent on communication, so the payoff is significantly lower.

The second optimization tested is to apply load balancing to the
chemistry phase.  The load imbalance arises from the non-linearity of
the atmospheric chemistry: initially, the same number of columns is
sent to each chemistry processor, but different processors take
different CPU times to reach convergence, and this cost also varies
from hour to hour.  One simple solution to this problem is to make the
number of columns given to each processor inversely proportional to
the computational cost of convergence observed in previous runs,
instead of assigning the same number every time.  This
reduces the execution time by another 70 seconds in  the 10-node case,
resulting in  a  speed-up  of  5.5.  For the  5-node  case, using this
optimization results in an overall speed-up of 3.4.

The next  measurements  are  for  the  Alpha  cluster  at  CMU,  which
currently uses an Ethernet for  communication.  Performance of the CMU
cluster is worse than that of the PSC cluster, which uses a Gigaswitch
(a faster communication network).  For the 5-node case, the execution
time is about 80 seconds greater for the CMU  cluster, whereas for the
10-node case it is about 300 seconds greater.  Since inter-processor
communication becomes a greater bottleneck as the number of processors
grows, this shows how a slower network can adversely impact
performance.

On the CMU cluster, we also explored performing I/O on local disks by
prefetching data from AFS (the Andrew File System).  Moving either the
outputs or the inputs to AFS directly adds about 100 seconds, i.e.,
AFS is very slow in the current setup.  As a result, even this simple
form of prefetching pays off.  It should be mentioned that all the PSC
results are for the case in which all data is read from or written to
the local disk on the host processor.

The  final set of measurements is for  the Cray.   These measurements
were  taken  before  any  optimizations were  added  to  the code. The
speed-up observed  on the Cray is  similar to that obtained using the
PSC cluster.

Table 2 shows the performance comparisons for the case when the URM
model was applied to the northeastern United States.  The number of
computational grid nodes in this case is about five times that of the
previous case, in which the model was applied to the Los Angeles
domain.  The distributed application in this case included the
optimizations described previously, i.e., parallelization of the LU
factorization and load balancing for the chemistry phase.  The
speed-up observed is about 3.7 for the 5-node case and about 6.3 for
the 10-node case.  These numbers are higher than those observed for
the Los Angeles case because the grain size of the problem is larger
by a factor of 5 for the Northeast case.
