\begin{flushleft}
{\bf 6. Further Parallel Speedup}
\end{flushleft}

The speedup graphs for parallel LDU factorization and semi-static
load balancing show that, at the 10-processor level, speedup is still
increasing at a promising rate (except for the Ethernet runs). Thus,
when a fast network is used, the current approach appears to be
effective well beyond 10 processors. However, efficiency stands at
60\% (a speedup of about 6 on 10 processors) and, for the LA region,
is visibly dropping at this point. We would like to improve efficiency
so that the parallel URM model can be profitably extended to tens of
processors using the current solvers. This will have two benefits.
First, it will enable significantly faster execution of the current
code (which is already quite fast: its wall-clock speed is comparable
to that of other codes that exhibit much higher parallel speedups).
Second, improved efficiency will enable near-term versions of the URM
model, which will incorporate much more compute-intensive chemical
solvers, to scale to many hundreds of processors with high efficiency
and proportionate speedup.
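
The efficiency figure quoted above is the usual ratio of speedup to
processor count (the symbols below are standard notation, not taken
from this paper):
\[
E(P) \;=\; \frac{S(P)}{P} \;=\; \frac{T_1}{P\,T_P},
\qquad E(10) \approx \frac{6}{10} = 60\%,
\]
where $T_1$ and $T_P$ denote the one-processor and $P$-processor
wall-clock times, respectively.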

Approaches we are investigating at this time include:

\begin{itemize}
\item{Fully parallel point-to-point communication}
\item{Pipelined, parallel execution of data input, preprocessing
(including LDU factorization), postprocessing, and output}
\item{Load balancing}
\item{Copy reduction, message reordering, and other performance tuning techniques}
\item{Pipelined back-substitution in the transport phase}
\end{itemize}

New versions of the parallel URM model incorporating many of these
optimizations are nearing completion and promise significantly higher
speedup in the near future.
