\documentstyle[times,epsf]{article}
%\documentstyle{article}

\textheight =  9.0in
\topmargin = -.5 in 
%\oddsidemargin = -.25 in
%\evensidemargin = -.25 in
\oddsidemargin = 0 in
\evensidemargin = 0 in
\textwidth=6.5in

\def\AAPAIR{AAPAIR}
\def\AABLK{AABLK}
\def\DMRLE{DMRLE}
\def\DMRLEC{DMRLEC}


\title{Fast Message Assembly Using Compact Address Relations}

\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron}

\begin{document}

\bibliographystyle{acm}

\begin{abstract}

The performance of message assembly and disassembly is an important
part of the performance of distributed data structures on parallel
computers.  Message assembly involves both address generation and data
copying.  For high level parallel languages, address generation is
typically done by on-line address computation which is optimized using
compile-time knowledge of the data distribution.  We attack the
problem of how to achieve fast message assembly {\em without}
compile-time knowledge --- for example, in run--time libraries.  Our
approach is address relation reuse --- we precompute the mapping
between sender addresses and receiver addresses (the address relation)
and store it.  Message assembly is then performed using the stored
address relation and the cost of the initial address computation is
amortized over multiple assembly steps.

We present an analytical model which predicts the threshold above
which address relation reuse is effective, how quickly it breaks even
with on-line address computation, and the ultimate speedup.  A
significant observation is that the threshold is typically very low,
meaning that address relation reuse is effective in a wide variety of
situations.  We verify the predictions on the Intel iWarp and the
Intel Paragon.

The address relation representation we model ignores the structure of
the address relation.  However, in practice, address relations induced
by operations on distributed data structures are likely to have
considerable regularity.  We highlight the regularities of address
relations induced by redistributions of regular block and cyclic
distributed arrays, such as those in High Performance Fortran.  Four
representative redistributions are used to develop three compact
address relation representations that exploit these regularities to
shrink storage costs and increase message assembly throughput, while
remaining able to encode arbitrary relations.  Our final
representation reduces space requirements by three to four orders of
magnitude.  Our innovation is not in simple data compression schemes,
but in their application to address relations.  Another important
distinction is that the compact representations are used directly in
message assembly and disassembly --- relations are {\em not}
decompressed before use.

We measure the message assembly and disassembly throughput of the new
representations for the representative redistributions on six
machines: the DEC 3000/400 (Alpha), the Intel Paragon (i860), the IBM
RS/6000 Model 250 (PowerPC), the IBM SP-2 (POWER2), a Gateway Pentium
90 system, and a Micronics Pentium 133 system.  Except on the Pentium
133, we find that the compact representations achieve nearly the copy
bandwidth possible for the memory access patterns encoded.  In
essence, we have identified a technique for generalized copy
operations that has near ideal throughput for regular copies.

\end{abstract}

\begin{center}
\begin{minipage}[t]{5.5in}
\footnotesize
This research was sponsored in part by the Advanced Research Projects
Agency/CSTO monitored by SPAWAR under contract N00039-93-C-0152, in
part by the National Science Foundation under Grant ASC-9318163, and
in part by a grant from the Intel Corporation.

The authors' email addresses are: pdinda@cs.cmu.edu and
droh@cs.cmu.edu.  
\end{minipage}
\end{center}


\maketitle

%\setlength{\baselinestretch}{2} 

\section{Introduction}

It is a challenge to build fast but flexible libraries for distributed
data structures on parallel computers.  An important part of this
challenge is achieving fast message assembly (and disassembly) in the
absence of compile-time knowledge.  Our goal is to support message
assembly with memory access patterns resulting from arbitrary
distributions while achieving nearly the copy bandwidth for patterns
that arise from common distributions.

\subsection{Background}

Parallel programs on distributed memory multicomputers manipulate
distributed data structures.  Operations on distributed data
structures result in messages being sent between the processors.  The
fixed startup cost of sending a message is amortized by including many
data items in the message.  For most communication systems, this
requires that the data items be gathered to a contiguous buffer by the
processor.  The performance of this {\em message assembly} step is an
important component of the overall performance of message passing.
This is also true of the symmetric {\em message disassembly} step
performed on the receiver to scatter the data items.  In the
following, we concentrate on the message assembly step and discuss
message disassembly only when the issues are different.

\subsection{On-line address computation}

When writing message assembly routines, an important consideration is
address generation --- how the addresses of the data items to be
copied will be generated.  One approach is {\em on-line address
computation}, computing addresses as needed in the assembly routine.
However, even for regular block--cyclic distributed arrays such as
those supported by High Performance Fortran~\cite{HPF}, computing the
addresses on-line can be complex.  The approach taken by the Fx
parallel Fortran compiler~\cite{gross94a} and other compilers is to
generate specialized, optimized message assembly routines using
compile time knowledge.  For example, if the compiler knows both
arrays in an array assignment statement are block distributed, the
general block--cyclic message assembly loop nest can be drastically
simplified~\cite{stichnoth93a}.

\subsection{Address relation reuse}

An alternative approach to address generation is the
inspector/executor method introduced in CHAOS~\cite{saltz91}.  Instead
of computing the mapping of sender addresses to receiver addresses
on-line during message assembly, the mapping is precomputed (the
inspector) and stored. This {\em address relation} is then used to
provide addresses whenever the message assembly is performed (the
executor).  We refer to this scheme as {\em address relation reuse},
as opposed to on-line address computation.  The receiver also
precomputes the address relation and uses it in message disassembly.
The reason for storing both sender and receiver addresses is to
support other communication styles, such as direct
deposit~\cite{ICS-DECOUPLE} where receiver addresses are needed on the
sender.

\subsection{Modelling}

In previous work~\cite{AR-CACHING}, we modelled the performance of
address relation reuse and on-line address computation in the context
of the Fx compiler~\cite{gross94a} in order to better understand the
tradeoff between these two methods.  A shortened form of the analysis
appears in this paper. The address relation representation we model is
a simple array of sender, receiver address pairs.  We assumed
direct deposit message passing~\cite{ICS-DECOUPLE} because this is the
communication style Fx targets.  The models predict, given machine
parameters, the regimes where address relation reuse can improve the
average message assembly latency and how many times a relation must be
reused to break even with computing it on-line.  The predictions were
validated on the Intel iWarp and Intel Paragon.  Surprisingly, even
with this simple representation, we discover that if on-line address
computation in the assembly loop involves more than a few instructions
(1 on the iWarp, 16 on the Paragon), address relation reuse can do
better.  Further, beyond this threshold, the cost of precomputing the
address relation can be quickly amortized.

\subsection{Run-time libraries}

Using our model, compile--time knowledge (for example, data
distributions and loop bounds) that determines the cost of address
computation can be used to choose between the two message assembly
methods.  In this paper we attack the problem of how to achieve fast
message assembly {\em without} compile--time knowledge.  Our primary
motivation is run--time libraries: we want to build fast user
libraries that support complex distributed data structures.  Our
context is a portable run--time library for redistributing distributed
arrays.  Array transposes are also included in the library.  Our goal
is to support message assembly with memory access patterns arising
from arbitrary distributions while achieving nearly the copy bandwidth
for patterns that arise from common distributions such as regular
block--cyclic.  The approach we take is to always use the address
relation reuse method for message assembly.


\subsection{Compact representations}

Unfortunately, the simple address relation representation requires
memory proportional to the number of tuples in the relation.  Further,
because the message assembly loop requires one extra load (two for
direct deposit) for each data item copied, it is significantly slower
than it could be for relations induced by common regular block--cyclic
array distributions.  We attack both problems by introducing three
compact representations, \AABLK, \DMRLE, and \DMRLEC, which
significantly reduce storage costs and improve assembly throughput.
These representations apply generalizations of run-length encoding and
dictionary substitution~\cite{DATA-COMPRESSION-OVERVIEW} to storing
address relations.  Our innovation is not in simple data compression
schemes, but in their application to address relations.  Another
important distinction is that the compact representations are used
directly in message assembly and disassembly --- relations are {\em
not} decompressed before use.  The final form,
\DMRLEC, compacts relations induced by the common cases, including
transposes, by three to four orders of magnitude, making address
relation reuse a practical approach for these common cases.

\subsection{Measurements}

We measure the message assembly and disassembly throughput of address
relation reuse using all four address relation representations on six
machines for four representative redistributions of two dimensional
arrays.  All measurements are of a portable library for array
redistribution.  Because this library is intended to be used with
conventional message passing systems such as PVM~\cite{sund90},
NX~\cite{pierce94}, or MPI~\cite{walker94}, message assembly and
disassembly is different from the direct deposit scheme we model.
In particular, receiver addresses are not packed into the message as
with direct deposit.  The machines we test are the DEC 3000/400 (133
MHz Alpha 21064), the Intel Paragon (50 MHz i860XP), the IBM RS/6000
Model 250 (80 MHz PowerPC 601), the IBM SP-2 (66 MHz POWER2), a
Gateway 90 MHz Pentium system, and a Micronics 133 MHz Pentium system.
These machines are either parallel supercomputer nodes or
workstations that are likely to see use in a high speed cluster
environment.

Except on the Pentium systems, we find that \DMRLEC ~achieves almost
the same throughput as a stride-one copy loop when blocks are
assembled, and as a strided copy loop for strided access patterns.
other representations are about as fast, except for pathological
cases where their throughput plummets.  On the Pentium, only the
simplest compact representation, \AABLK, is effective --- which we
believe is due to the extremely limited register set of this
processor.

These results demonstrate that message assembly with address relation
reuse using compact address relations can support arbitrary copy
patterns while still achieving optimal assembly throughput for common
patterns.

\subsection{Overview}

The paper is divided into six sections.  In Section~\ref{sec:models},
we present analytical models to predict the tradeoff between message
assembly with address relation reuse and message assembly with on-line
address computation for direct deposit message passing.
Section~\ref{sec:represent} describes the address relations induced by
redistribution of regular block--cyclic distributed arrays and selects
four representative redistributions.  In Section~\ref{sec:compact},
these redistributions are used to develop compact address relation
representations.  Section~\ref{sec:measure} presents measurements of
the throughput of message assembly and disassembly using the compact
representations on the representative redistributions for conventional
message passing systems.  Section~\ref{sec:discuss} discusses other
useful properties of address relation reuse.  Finally,
Section~\ref{sec:conclude} concludes with some open questions.

%
% Superscalar plateau?
%

% \section{Address Relation Reuse}
% 
% \begin{figure}
% %\epsfig
% \caption{Levels of an operation on distributed data structure}
% \label{fig:hpfassign}
% \end{figure}
% 
% \begin{figure}
% %\epsfig
% \caption{Decoupling address computation and message assembly}
% \label{fig:decouple}
% \end{figure}
% 
% 
% Parallel programs for distributed memory multicomputers, whether
% written in a high level parallel language such as HPF or Fx, or in a
% sequential language with message passing, operate on data structures
% distributed over many processors.  These operations can be examined 
%at
% many levels.  The language level defines what data structures are
% possible, how they can be distributed, and the operations that can 
%be
% performed on them.  For a particular program, the programmer or
% compiler chooses the appropriate data structures, distributions, and
% operations to accomplish his goals.  These choices interact to 
%produce
% some global communication pattern between the processors.  Between
% each pair of processors, communication involves mapping sender 
%address
% to receiver addresses and transfer data according to this mapping.  
%We
% refer to the act of computing this mapping as {\em address
% computation} and to the mapping itself as an {\em address relation}.
% 
% 
% For example, consider figure \ref{fig:hpfassign}.  The innocuous
% language level HPF array assignment \verb.B=A. is between two
% dimensional arrays with different data distributions (block
% distributed columns and rows for B and A, respectively).  This 
%induces
% an all-to-all communication operation.  Between each pair of
% processors, an address relation is computed mapping blocks of $k$
% addresses from a sender stride of $Pk$ to a receiver stride of $k$.
% To communicate data according to the relation, the strided data is
% gathered at the sender either into a buffer or directly to the wire
% and scattered at the receiver.
% 
% Typically, for regular data distributions and access patterns, 
%address
% computation is done in--line with {\em message assembly} (gathering
% the data) on the sending processor, and {\em message disassembly}
% (scattering the data) on the receiving processor.  Compile--time
% knowledge (or programmer knowledge of the strengths and weaknesses 
%of
% the compiler) is used to reduce the cost of address
% computation~\cite{jim}.  
% 
% We {\em decouple} address computation and message
% assembly/disassembly.  As shown in figure~\ref{fig:decouple}, 
%address
% computation generates an address relation which is stored in memory
% and then reused for assembling and disassembling messages whenever 
%the
% associated communnication is done.  The address relation can be
% represented in many different ways to minimize memory use or to
% maximize performance.  The power of this technique is that message
% assembly/disassembly code need only be optimized for each
% representation and architecture, but not for different data
% distributions.  
% 





\section{Analytic models}
\label{sec:models}
\label{sec:simple}

In this section, we model address relation reuse and on-line address
computation in the context of direct deposit message passing.  After
explaining this communication style, we develop expressions for the
average latency for assembly with on-line address computation and with
address relation reuse as a function of machine parameters.  The
address relation representation we use is a simple list of sender
address, receiver address tuples.  We combine our expressions to
produce predictions for the {\em threshold}: the minimum computation
at which reuse is faster than on-line address computation, the {\em
breakeven point}: how many times a relation must be reused before the
initial address computation is amortized, and the {\em speedup}: the
maximum improvement possible.  We test these values on the iWarp and
Paragon and find close agreement.  An important observation is that
the threshold values are very low, on the order of one instruction on
the iWarp and 16 on the Paragon.  An expanded form of the analysis of
this section, as well as measurements of an HPF~\cite{HPF} library
that uses direct deposit appeared in \cite{AR-CACHING}.  Please note
that the measurements of Section~\ref{sec:measure} are for
conventional message passing.


\subsection{Direct deposit message passing}

In a conventional communication system, such as PVM~\cite{sund90},
NX~\cite{pierce94}, or MPI~\cite{walker94}, address computation is
done both on the sender and the receiver.  The message is assembled
according to the address relation computed by the sender, transported
to the receiver, and disassembled according to the address relation
computed by the receiver.  Obviously, the same total order must be
placed on both address relations.  The message contains only the
actual data.  In direct deposit message passing~\cite{ICS-DECOUPLE},
only the sender performs address computation.  The receiver addresses
are embedded in the message.  In addition to reducing the end-to-end
latency, this approach also allows for more easily separating
synchronization from data transfer, allowing a compiler to amortize
synchronization over many data transfers.  For example, this approach
has led to factor of two speedups over library functions for
distributed array transposes on the Cray T3D~\cite{ISCA-MEMORY}.  The
message format that we examine is called {\em address-data-pair} since
each data word is accompanied by the address at which it will be
deposited.

\subsection{Algorithm and machine properties}

Clearly, the time spent computing the address relation depends on the
algorithm and its implementation.  We model address computation as
instruction issue rate limited and as delivering the address relation
in an on-line manner.  We assume that address computation is
integer-based and done in registers, so we characterize it by $n_g$,
the number of instructions executed per element of the relation.
Additionally, we must know $r_i$, the number of integer instructions
the machine can issue per second.  Since memory accesses are at the
heart of message assembly and address relation reuse, we characterize
the memory system by four values: $r_{rc}$ and $r_{wc}$, the read and
write bandwidths for unit--stride accesses, respectively; and $r_{rr}$
and $r_{wr}$, the read and write bandwidths for random memory
accesses.

\subsection{On-line address computation}

The time for message assembly using on-line address computation has
two components: $t_g$, the time to compute the address relation;
and $t_{au}$, the message assembly time:
\begin{displaymath}
t_{on-line} = t_g + t_{au}
\end{displaymath}
The time to compute the address relation is largely determined by the
complexity of the algorithm used, its implementation, and the instruction
execution rate of the machine.  We characterize the algorithm in terms
of $n_g$, the number of integer instructions executed per tuple of the
address relation.  Thus for an $n$ word message, the time required is
\begin{displaymath}
t_g = \frac{n n_g}{r_i}
\end{displaymath}
The address computation code presents us with sender, receiver address
tuples $(s,r)$ in registers in an on-line manner.  For each of these
tuples, we read the data word at address $s$, write $r$ to the message
buffer, then write the data word to the message buffer.  Note that the
writes to the message buffer are to consecutive addresses, so their
performance can be captured by $r_{wc}$, the unit--stride write
bandwidth of the machine.  On the other hand, the sender side
addresses of the tuples may not be contiguous, so we capture their
performance with $r_{rr}$, the random read bandwidth of the machine.
Furthermore, we capture the loop overhead as a number of instructions,
$n_{au}$.  Thus the time to copy the data is
\begin{displaymath}
t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + 
\frac{n_{au}}{r_i}
          \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.  If
the compiler software-pipelines the assembly loop, we should be close
to the above expression.  Substituting into the expression for
$t_{on-line}$, we get an average assembly latency of 
\begin{equation}
 t_{on-line} = n \left ( \frac{n_g+n_{au}}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
\label{equ:uncached}
\end{equation}


\subsection{Address relation reuse}

The average latency for message assembly with address relation reuse 
has three components: $t_g$, the time to compute the address 
relation; 
$t_o$, the time to store the address relation in memory, and 
$t_{ac}$, the message assembly time using the stored address 
relation.  
The average latency is
\begin{displaymath}
 t_{reuse} = \frac{t_g + t_o}{n_i} + t_{ac}
\end{displaymath}
Note that the first two components are amortized over the number of
times the communication is performed, $n_i$.  $t_g$ is the same as for
the on-line scheme and was determined in the preceding section.  The
address relation representation we examine is a simple array of sender
address, receiver address tuples.  As each tuple of the address
relation is generated, we write it to the array.  Thus storing the
address relation involves two contiguous writes for each in-register
tuple produced by the address relation computation.  In addition, we
execute $n_o$ loop overhead instructions, so
\begin{displaymath}
 t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
\end{displaymath}
When assembling the message from the stored address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the receiver address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, thus the time to
assemble a message with $n$ data words is
\begin{displaymath}
 t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes, but we continue the assumption of the
preceding section.  Substituting into the expression for $t_{reuse}$,
we see that an $n$ word communication between two nodes requires time
\begin{equation}
 t_{reuse} = n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} +
	\frac{n_{ac}}{r_i} + \frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
	\frac{2}{r_{wc}} \right )
\label{equ:cached}
\end{equation}

\subsection{Effects of reuse}

Equations~\ref{equ:uncached} and~\ref{equ:cached} provide insight
into the following questions: For a message assembly that occurs many
times at run--time, can address relation reuse ever outperform on-line
address computation?  If so, how many times must the relation be
reused before the reuse strategy wins?  Finally, what is the ultimate
speedup of the reuse strategy over the on-line strategy?

Reuse can only be effective if we can assemble the message faster
using the stored relation than by computing it on-line, i.e., 
\begin{displaymath}
t_{ac} < t_g + t_{au}
\end{displaymath}
or
\begin{equation}
% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
\label{equ:necessary}
\end{equation}
The right hand side of equation~\ref{equ:necessary} is a 
machine--specific threshold below which address relation reuse is 
ineffective.  Notice that it is largely determined by the ratio of 
instruction issue rate and memory read bandwidth.  
If~\ref{equ:necessary} is true, then we can also predict how many 
times the address relation must be reused in order to break even with 
the on-line scheme.  This break even point is found by setting 
$t_{reuse} = t_{on-line}$ and solving for $n_i$.  In doing this we 
find that the break even point is
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
\right \rceil 
\label{equ:crossover}
\end{equation}
It is important to point out that as $n_g$ moves away from the
threshold, the break even point decreases sharply because the $n_g$
terms in the fraction begin to dominate. Finally, the ultimate speedup
of the reuse strategy over the on-line strategy is
\begin{equation}
\label{equ:speedup}
\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
              + r_ir_{rc}(r_{wc}+2r_{rr})}
             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr})}
\end{equation}
The speedup is linear in $n_g$, and so is bounded only by the
complexity of the address computation.


\subsection{Model validation}

In this section we validate the models from Section~\ref{sec:simple}
for message assembly on an iWarp system and on a Paragon.  To test the
predictive power of the models, we use (\ref{equ:crossover}) to
predict the number of iterations necessary for address relation reuse
to break even with on-line address computation as a function of $n_g$,
and we use (\ref{equ:speedup}) to predict the ultimate speedup of the
reuse strategy over the on-line strategy as a function of $n_g$.  In
each case, the address relation that is generated is the identity
relation, and $n_g$, the number of instructions to compute the
relation, is varied by inserting nops into the code stream.  The
predictions are compared to actual measurements on the hardware.

\subsubsection*{iWarp}

On the iWarp~\cite{iwarp} system, $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$
words/second, and a node can execute $r_i=2 \times 10^7$
instructions/second.  The memory system parameters were derived
directly from the iWarp specification.  Specifically, the iWarp clock
speed is 20 MHz and all memory operations complete in two cycles.
There is no cache and we assume no dual dispatch of loads and stores.

The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
estimated {\em statically} by simply counting the number of
instructions in each loop, or {\em dynamically} by accounting for
bypass stalls that arise in the loops. The static estimates for iWarp
are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.
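Substituting these values into Equation~\ref{equ:necessary} makes the
iWarp threshold concrete.  With the dynamic estimates,
\begin{displaymath}
n_g > \frac{2 r_i}{r_{rc}} + n_{ac} - n_{au}
    = \frac{2 (2 \times 10^7)}{10^7} + 3 - 7 = 0,
\end{displaymath}
so reuse can outperform on-line computation for any $n_g \geq 1$.
The static estimates give a threshold of $4 + 2 - 5 = 1$, which is
why the static prediction cannot break even at $n_g = 1$.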

Figure~\ref{fig:iw}(a) shows the measured and predicted number of
iterations needed on iWarp for the message assembly latency of the
reuse scheme to break even with that of the on-line scheme for
different $n_g$.  The predictions made using the dynamic measurements
are exact.  The predictions made using the static measurements are
pessimistic for small $n_g$, but quickly converge to the actual
numbers as $n_g$ increases.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1        &         13     &        INF        &       13 \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &          3        &        3\\
\hline
\end{tabular} 
&
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1       &         1.05    &         1.00     &         1.07 \\
   2       &         1.12    &         1.09     &         1.15 \\
   3       &         1.22    &         1.16     &         1.23 \\
   4       &         1.28    &         1.25     &         1.32 \\
   5       &         1.35    &         1.33     &         1.39 \\
   6       &         1.43    &         1.41     &         1.47 \\
   7       &         1.49    &         1.49     &         1.54 \\
   8       &         1.59    &         1.59     &         1.61 \\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{Model validation on iWarp:  (a) Iterations required to break 
even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message 
assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:iw}
\end{figure}

Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
the reuse scheme over the on-line scheme on iWarp. As in
Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for
all $n_g$ and the static predictions are pessimistic for small $n_g$.
However, it is important to note that even with static measurements,
the predictions only become inaccurate when $n_g$ approaches the
threshold.  Another important point is that the purpose of
Figure~\ref{fig:iw}(b) is to help validate the models when $n_g$ is
small, and not to highlight impressive speedups. In fact, the speedups
grow arbitrarily large as $n_g$ increases (see
Equation~\ref{equ:speedup}).



\subsubsection*{Paragon}

A Paragon node contains a 50 MHz pipelined, multiple issue RISC
processor~\cite{PARAGON-XPS,I860-PR}.  We estimated the instruction
issue rate by noting that we do not use the floating point pipeline.
Therefore we can issue at most $r_i=5 \times 10^7$
instructions/second.  We directly measured the memory parameters and
found $r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.  Only
a static measurement (instruction counting) of the implementation
properties was performed, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.
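These parameters make the threshold of Equation~\ref{equ:necessary}
concrete for the Paragon:
\begin{displaymath}
n_g > \frac{2 r_i}{r_{rc}} + n_{ac} - n_{au}
    = \frac{2 (5 \times 10^7)}{9 \times 10^6} + 11 - 6 \approx 16.1,
\end{displaymath}
so the static model predicts no break even until $n_g$ exceeds 16,
consistent with the threshold of 16 instructions noted earlier.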

Figure~\ref{fig:86}(a) shows, for different $n_g$, the measured and
predicted number of iterations needed for message assembly using the
reuse scheme to break even with assembly using the on-line scheme.
The predictions are pessimistic, but quickly converge to the actual
values with increasing $n_g$.  Notice that $n_g$ must be much larger
to permit break even on the Paragon than on the iWarp.  This is not
surprising given that the Paragon's instruction issue rate is much
higher than its memory bandwidth, while the iWarp's issue rate is well
matched to its memory bandwidth.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline

    14     &        NONE    &        NONE\\
    15     &        NONE    &        NONE\\
    16     &        19      &        NONE\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
&
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline
    14     &        NONE   &         NONE\\
    15     &        NONE   &         NONE\\
    16     &       1.03     &         NONE\\
    17     &       1.06     &         1.01\\
    18     &       1.09     &         1.02\\
    19     &       1.12     &         1.03\\
    20     &       1.15     &         1.04\\
    21     &       1.17     &         1.05\\
    22     &       1.20     &         1.06\\
    23     &       1.22     &         1.07\\
    24     &       1.24     &         1.08\\
    25     &       1.26     &         1.09\\
\hline
\end{tabular}\\
(a) & (b)
\end{tabular}
\caption{Model validation on Paragon: (a) Iterations to break even 
(Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly 
(Eqn. \protect{\ref{equ:speedup}})}
\label{fig:86}
\end{figure}

Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
the reuse scheme over the on-line scheme on the Paragon.  Notice that
speedup does not increase as quickly with $n_g$ as on the iWarp
(Figure~\ref{fig:iw}(b)).  This is the result of the lower memory
bandwidth relative to instruction issue rate on the Paragon.  The
predictions are not as accurate as for the iWarp, reflecting the fact
that the Paragon's memory system is more complex.  However, the
predictions are pessimistic and quickly converge to the correct values.


\section{Representative redistributions}
\label{sec:represent}

\begin{figure}
\centerline{\epsfxsize 3in \epsfbox{example.eps}}
\caption{Redistribution of regular block--cyclic distributed arrays}
\label{fig:redist}
\end{figure}

The address relation representation modeled in Section
\ref{sec:models} ignores the structure of the address relation.  However,
in practice, address relations induced by operations on distributed
data structures are likely to have considerable regularity.  In this
section, we describe the regularities of the address relations induced
by redistribution of regular block--cyclic distributed arrays such as
those in High Performance Fortran~\cite{HPF}.  The address relations
induced by redistributions between the particularly common BLOCK($P$)
and CYCLIC($P$) distributions are placed into four equivalence classes
and a representative is chosen for each class.  These representative
redistributions will be used to develop better address relation
representations in Section~\ref{sec:compact} and for throughput
measurements in Section~\ref{sec:measure}.


Each dimension of a regular block--cyclic distributed array has some
size $N$ and is distributed over some number of processors $P$ with
some fixed block size $M$.  Conceptually, the dimension is split into
$\lceil N/M \rceil$ blocks of size $M$ ($\leq M$ for the final block)
which are then distributed round robin over the $P$ processors.  The
most common regular block--cyclic distributions are BLOCK($P$), where
there are only $\lceil N/M \rceil = P$ blocks and CYCLIC($P$), where
$M=1$.  Each dimension of an array can be independently distributed.
Our notation differs slightly from HPF notation in order to include
the number of processors.  In particular, our CYCLIC($P$) corresponds
to an HPF CYCLIC(1) over $P$ processors.

The idea of array redistribution is to change the location (processor
and address) of elements of the array so as to change its
distribution.  Because the redistribution may be onto a disjoint group
of processors, we refer to ``source'' and ``destination''
distributions and processors.  Figure~\ref{fig:redist} shows a source
distribution of (BLOCK(2),*): the shaded array elements are those that
processor 0 owns.  The ``*'' indicates a non-distributed dimension.
The same figure also shows a destination distribution of
(CYCLIC(2),*).  To redistribute the array, for each pair of source and
destination processors, we compute the intersection of the elements
the processors own (which is highlighted in the figure) and map the
source processor address of each element in the intersection to its
address in the destination processor.  Data is gathered from the
source processor addresses to a contiguous buffer, communicated, and
scattered to the destination processor addresses.

For one-dimensional BLOCK($P$) and CYCLIC($P$) arrays, only simple
stride-$j$ to stride-$k$ mappings can occur in an address relation
induced by redistribution.  BLOCK($P$) to BLOCK($P$) redistributions
result in unit strides on both the source and destination processors,
while the combination of CYCLIC($P$) and BLOCK($P$) results in a unit
stride on one processor and a non-unit stride on the other.
CYCLIC($P$) to CYCLIC($P$) redistributions result in non-unit strides
on both processors, as do certain forms of transposes.  With more than
one
dimension, the access pattern is the Cartesian product of the access
patterns generated for each matching dimension~\cite{stichnoth93a}.
This can also be seen in Figure~\ref{fig:redist}.  The basic mapping
is stride-2 to stride-1 repeated four times.  This pattern is then
repeated stride-8 to stride-8 16 times.

We place array redistributions of BLOCK($P$) and CYCLIC($P$)
distributed arrays into four equivalence classes based on whether
their basic source and destination reference patterns are stride-1 or
strided.  The four two-dimensional array redistributions (of a
$1024\times 1024$ array) in Figure~\ref{fig:repdist} are
representatives of the four classes.  Each distribution is over four
processors.  We include a data transpose operation instead of a
CYCLIC($P$) to CYCLIC($P$) redistribution to demonstrate that our
technique extends to transposes as well.  

\begin{figure}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{Redistribution} & \multicolumn{2}{c|}{Assembly} & 
            \multicolumn{2}{c|}{Disassembly} & \multicolumn{2}{c|}{Tuples}  \\ 
\hline
Source & Destination & Read & Write & Read & Write & (0 to all) & (0 to 0)  \\
\hline
(BLOCK(4),*)  & (*,BLOCK(4)) & 1 & 1 & 1 & 1 & 262144 & 65536\\
(BLOCK(4),*)  & (CYCLIC(4),*) & 4 & 1 & 1 & 1 & 262144 & 65536\\
(CYCLIC(4),*) & (BLOCK(4),*) & 1 & 1 & 1 & 4 & 262144 & 65536\\
(*,CYCLIC(4)) & (CYCLIC(4),*)$^T$ & 4 & 1 & 1 & 1024 & 262144 & 65536\\
\hline
\end{tabular}
\end{center}
\caption{Representative redistributions of $1024 \times 1024$ distributed array and basic memory access strides (``T'' represents a transpose)}
\label{fig:repdist}

\end{figure}

\section{Compact address relations}
\label{sec:compact}

In this section, we use the four representative array redistributions
chosen in Section~\ref{sec:represent} (Figure~\ref{fig:repdist}) to
drive the development of good compact address relation representations
that can exploit the regularity of relations induced by
redistributions of regular block--cyclic arrays while remaining able
to represent arbitrary relations.

We begin with \AAPAIR, the representation used in
Section~\ref{sec:models} and incrementally extend the set of patterns
we take advantage of by introducing \AABLK, \DMRLE, and \DMRLEC~---
\AAPAIR$\subset$\AABLK$\subset$\DMRLE$\subset$\DMRLEC.  For each
representation, we place a total order $\leq$ on the tuples of the
address relation: $(a,b) \leq (c,d) \Leftrightarrow (a < c) \vee
((a = c) \wedge (b \leq d))$.  That is, tuples are ordered by source
processor address first, then by destination processor address.

Each representation is designed so that messages can be efficiently
assembled and disassembled directly from it --- the idea is that we
repeatedly load some {\em key} from the stored relation, expand it to a
{\em symbol} according to the representation, and use the symbol to
gather or scatter some number of data elements.  We do not consider
representations that must be wholly expanded before use.  Our innovation
is not in developing simple data compression techniques, but in
applying them to address relations and message assembly.
Figure~\ref{fig:sizes} shows the total space required for address
relations on source processor 0 for each of the representative
redistributions of Figure~\ref{fig:repdist}.  Note that the chart has
a log scale.  The sizes are in bytes and measured on the DEC 3000/400.
Relative sizes are the same on 32-bit platforms, but the absolute
sizes are halved.

\begin{figure}
\centerline{\epsfxsize 3in \epsfbox{sizes.eps}}
\caption{Total size of all address relations on sending processor 
zero in each representation}
\label{fig:sizes}
\end{figure}


\subsection{\AAPAIR}

The \AAPAIR~representation is simply the array of (sender address,
receiver address) pairs used in Section~\ref{sec:models}.  It takes
advantage of no patterns and always requires storage space proportional
to the number of data items transferred.  The keys and symbols for
\AAPAIR~are the tuples themselves.  The acronym \AAPAIR~expands to
``Address, Address PAIR.''

\subsection{\AABLK}

The \AABLK~representation run-length
encodes~\cite{DATA-COMPRESSION-OVERVIEW} blocks of stride-1 to
stride-1 mappings.  The symbols generated are of the form $(a,b,l)$,
which means that addresses $a$ through $a+l-1$ on the sender map to
address $b$ through $b+l-1$ on the receiver.  The keys are simply the
symbols. Notice that this is sufficient for the first representative
redistribution, but not the others.  The acronym
\AABLK~ expands to ``Address, Address BLocK.''

\subsection{\DMRLE}

\DMRLE~generalizes \AABLK~to map blocks of addresses with a sender
stride $j$ to a receiver stride $k$. To encode the relation in \DMRLE,
we form the difference map sequence $\langle(dx_{i},dy_{i})\rangle$
from the sequence of tuples $\langle(x_{i},y_{i})\rangle$:
\begin{displaymath}
(dx_i, dy_i) = \left \{ 
\begin{array}{ll}
(x_0,y_0) & i=0 \\
(x_i-x_{i-1},y_i-y_{i-1}) & i > 0
\end{array}
\right .
\end{displaymath}
The difference map is then run-length encoded to produce symbols of
the form $(dx,dy,l)$ which are stored as keys.  Notice that this
representation significantly reduces space requirements for all four
redistributions of Figure~\ref{fig:repdist} because it encodes any
stride-$j$ to stride-$k$ mapping efficiently.  The acronym \DMRLE~
expands to ``Difference Map, Run-Length Encoded.''

\subsection{\DMRLEC}
\label{sec:dmrlec}

The \DMRLE~ representation efficiently encodes the basic stride-$j$ to
stride-$k$ mapping, but for multidimensional arrays this will be
repeated many times.  For a redistribution of a $D$-dimensional array,
each of whose dimensions is of size $N$, the mapping (i.e., the symbol)
for dimension $d$ will be repeated $O(N^{D-d-1})$ times.  Clearly, a
Huffman encoding~\cite{HUFFMAN-CODES} of the symbols would greatly
reduce their storage cost.  However, Huffman codes are of variable
length, so keys could span word boundaries and be expensive to decode
during message assembly.  Instead, we adopt a dictionary
substitution encoding~\cite{DATA-COMPRESSION-OVERVIEW} of the symbols
as fixed length keys.  The key length is the minimum power of two bits
necessary to represent all the unique symbols.  This scheme is
efficient for message assembly because keys cannot span word
boundaries and there is always the same number of keys per word.

To encode in \DMRLEC, the relation is first encoded in \DMRLE~and all
unique $(dx,dy,l)$ symbols are found via hashing.  These symbols are
stored in a table of size $2^k$, where $k$ is the smallest power of
two such that the table holds all unique symbols.  Each symbol is then
encoded as a $k$-bit key representing its offset in the table.  As can
be seen from Figure~\ref{fig:sizes}, this technique reduces the
storage cost of the relations induced by the representative
redistributions by two orders of magnitude over \DMRLE.
Overall, \DMRLEC~is three to four orders of magnitude more space
efficient than \AAPAIR.  The acronym \DMRLEC~ expands to ``Difference
Map, Run-Length Encoded, Compressed.''


\section{Throughput Measurements}
\label{sec:measure}

In this section, we present measurements of the message assembly and
disassembly throughput possible with the address relation
representations introduced in Section~\ref{sec:compact}. The
measurements are of a portable library designed to be used with
conventional message passing systems.  Because of this, the content of
the message is purely data, as opposed to addresses and data as with
direct deposit message passing, modeled in Section~\ref{sec:models}.
The address relations used are those induced between source processor
zero and destination processor zero for each of the four
representative block--cyclic redistributions of
Figure~\ref{fig:repdist}.  The reason for measuring zero to zero is to
guarantee that the largest address relation is used, although the
address relations between each pair of processors are of identical
size for the representative redistributions.  The distributed array
contains $1024 \times 1024$ 8-byte doubles, so the zero to zero
relation assembly and disassembly steps copy 512 KBytes.  A larger
$2048 \times 2048$ array (2048 KBytes assembled/disassembled) was also
tested with only minor changes in measured throughput.

The identical C code was compiled and measured on the DEC
3000/400~\cite{DEC-3000}, the Intel Paragon~\cite{I860-PR}, the IBM
RS/6000 250~\cite{RS6K-250,PPC601-MICRO}, the IBM SP-2~\cite{IBMSP-2},
a Gateway corp. Pentium~\cite{PENTIUM-OVERVIEW,PENTIUM-MICRO} 90, and a
Micron corp. Pentium~\cite{PENTIUM-OVERVIEW,PENTIUM-MICRO} 133.  We did
not test the code on the Intel iWarp due to memory constraints.  The
details of each testing environment are given in
Figure~\ref{fig:machines}.  
\begin{figure}
\begin{center}
\begin{tabular}{|l|l|l|l|}
\hline
Machine & Processor &  Compiler & Operating System\\
\hline
DEC 3000/400 & Alpha (21064 @ 133 MHz) & GNU C 2.6.3, DEC CC 3.11 
    & OSF/1 2.0 \\
Intel Paragon & i860 (i860XP @ 50 MHz) & Intel C R4.5 & OSF/1 1.0.4 \\
IBM RS/6000 250 & PowerPC (601 @ 80 MHz) & IBM XLC 1.3.0.4 & AIX 3.2 \\
IBM SP-2 & POWER2 (66 MHz) & IBM XLC 1.3.0.33 & AIX 3 \\
Gateway Pentium 90 & i386 (Pentium @ 90 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
Micron Pentium 133 & i386 (Pentium @ 133 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
\hline
\end{tabular}
\end{center}
\caption{Machines tested}
\label{fig:machines}
\end{figure}
Figures~\ref{fig:alpha} through \ref{fig:p133} show the results.
Each figure shows assembly throughput on the left and disassembly
throughput on the right, and all charts use the same absolute
throughput scale.  Except for the Pentium machines, only the
throughput of the \AAPAIR~(as a baseline) and \DMRLEC~representations
is shown, because the story is the same for these machines: \DMRLEC~is
always the fastest compact representation.  The situation is more
complex on the Pentium machines, so the throughput of all four
representations is shown.  Superimposed on each graph are annotated
lines that represent the throughput of the library \verb.memcpy().
routine and the throughput of C-coded strided copies of doubles.  A
line annotated ``stride-$k$'' on an assembly graph indicates the
throughput of a copy loop that copies from a stride of $k$ doubles to
a stride of 1, while on a disassembly graph it indicates the
throughput of a copy loop that copies from a stride of 1 to a stride
of $k$ doubles.

The principal observation is that, for most of the machines, message
assembly and disassembly using the \DMRLEC~representation operates at
rates that are at or close to the copy throughput possible for their
memory reference patterns.  The remainder of this section analyzes
each machine in turn.

{\em The final paper will also show stride-1024 measurements for
comparison to the disassembly throughput of the transpose.}

\subsection{DEC 3000/400}

On the DEC 3000/400~\cite{DEC-3000}, \DMRLEC~gives ideal assembly and
disassembly throughput for each of the testcases, as can be seen in
Figure~\ref{fig:alpha}.  \DMRLEC~ benefits from the separate
load/store pipeline in the Alpha
21064~\cite{ALPHA-MICRO,DEC-ALPHA-21064}: Even on a load stall,
integer instructions associated with \DMRLEC~can continue to issue.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{alpha_assem.eps} & \epsfxsize 2.5in 
\epsfbox{alpha_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR and \DMRLEC~ on DEC 3000/400}
\label{fig:alpha}
\end{figure}


\subsection{Intel Paragon}

On the Intel Paragon~\cite{PARAGON-XPS,I860-PR} \DMRLEC~gives near ideal
assembly throughput for all the testcases as can be seen from
Figure~\ref{fig:i860}.  The disassembly results are somewhat
unexpected, however.  Although the first testcase (stride-1 writes)
operates with ideal throughput, the third testcase (stride-4 writes)
actually does slightly better than a stride-4 copy loop.  We
attribute this to different code generation on the part of the
compiler for the disassembly loop and the stride-4 copy loop.  As can
be seen from
the factor of two difference between a compiler generated stride-1
copy loop and the library \verb.memcpy(). routine, there is
considerable room for improving on the stride-$k$ copy loops generated
by the compiler.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{i860_assem.eps} & \epsfxsize 2.5in 
\epsfbox{i860_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using
\AAPAIR~and \DMRLEC~on Paragon}
\label{fig:i860}
\end{figure}

{\em The final paper will include a stride-1024 measurement for
disassembly.}


\subsection{IBM RS/6000 250}


The RS/6000 250~\cite{RS6K-250} achieves near ideal assembly and
disassembly throughput with \DMRLEC, as can be seen in
Figure~\ref{fig:ppc}.  The IBM XLC compiler has software-pipelined
each of the assembly, disassembly, and strided copy loops.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{ppc_assem.eps} & \epsfxsize 2.5in 
\epsfbox{ppc_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR~and
\DMRLEC~on IBM RS/6000 Model 250}
\label{fig:ppc}
\end{figure}


\subsection{IBM SP-2}

The throughput measurements of message assembly and disassembly using
\DMRLEC~on the IBM SP-2~\cite{IBMSP-2,PWR2} are also close to ideal,
tracking the throughput of the strided copy loops closely.  The IBM
XLC compiler has software-pipelined assembly, disassembly, and each of
the strided copy loops.  Because the POWER2~\cite{PWR2}, like the
Alpha, can continue to issue integer instructions while a load miss
is being handled, instructions associated with \DMRLEC~are free.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{pwr2_assem.eps} & \epsfxsize 2.5in 
\epsfbox{pwr2_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR~
and \DMRLEC~on IBM SP-2}
\label{fig:pwr2}
\end{figure}



\subsection{Gateway Pentium 90}
\label{sec:p90}

The performance of the Pentium~\cite{PENTIUM-OVERVIEW,PENTIUM-MICRO}
systems is more complex than that of the other machines, as can be
seen in Figure~\ref{fig:p90}.  The assembly and disassembly throughput
of \DMRLEC~is lower than \DMRLE~on this system.  For the first
testcase, \AABLK~is much better than either of the other compact
representations.  We believe that three factors contribute to this
result.  First, the GNU C compiler does not support Pentium-specific
optimizations, so no instruction scheduling is done to improve the
multiple-issue rate.  Second, the Pentium is both register-starved and
non-orthogonal.  A significant number of instructions in the
message assembly loop do nothing but move values associated with
the compact representations to and from the registers that particular
operations require.  In other cases, memory operands are used where
registers would appear with a larger register set.  Finally, although
the latency of memory operands is shortened by the on-chip cache, the
cache is not actually dual-ported, but banked, so cache
references can become serialized.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p90_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p90_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR,
\AABLK, \DMRLE, and \DMRLEC~on Gateway Pentium 90 system}
\label{fig:p90}
\end{figure}

We note that even with this complexity, \DMRLEC~always does better
than \AAPAIR~on the Pentium.  


\subsection{Micron Pentium 133}

As can be seen from Figure~\ref{fig:p133}, the performance of this
Pentium~\cite{PENTIUM-OVERVIEW,PENTIUM-MICRO} is complex in the same way
as the Gateway system's (Section~\ref{sec:p90}), and our explanation is
the same as in that section.  Message assembly and disassembly appear to
be instruction issue limited for \DMRLEC~on both machines: Because the
ratio of memory copy bandwidth (as measured by \verb.memcpy().) to
instruction issue rate is higher on the Micron, it is not surprising
that it performs comparatively worse in message assembly and
disassembly using \DMRLEC.
\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p133_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p133_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR,
\AABLK, \DMRLE, and \DMRLEC~on Micron Pentium 133 system}
\label{fig:p133}
\end{figure}

Note that even here \DMRLEC~does better than \AAPAIR.



\section{Discussion}
\label{sec:discuss}

The preceding sections have demonstrated that message assembly with
address relation reuse is effective in a wide variety of situations
and introduced address relation representations that significantly
reduce the space required to store address relations induced by common
redistributions.  The message assembly and disassembly throughputs of
these representations were measured on a wide variety of machines and
found to be nearly as high as the copy bandwidth.  In this section we
describe other advantages of message assembly with address relation
reuse over on-line address computation, arguing that address relations
are a powerful interface to communication systems. Additionally, we
note that even with the compact address relations discussed in this
paper, an address relation caching mechanism is still useful.

By taking address computation out of the critical path of data
transfer, the address relation reuse approach decouples address
computation and message assembly.  One advantage of this is that it
makes it easier to achieve reasonable communication performance
without a large number of special--case optimizations in the compiler
and run--time library.  Instead of optimizing each address computation
algorithm for each architecture, only message assembly from address
relations needs to be optimized for each architecture in order for all
data transfers to benefit.

Another advantage to the address relation reuse approach is that the
entire relation is available to be optimized at a new level of the
software hierarchy. For example, by ordering the relation by
destination processor addresses first, we can optimize for reference
locality on the destination processor instead of on the source
processor.  As shown in~\cite{memoryperform}, the choice is
architecture--dependent.

Address relations are an ideal interface to the communication system
because they are implementation neutral.  This permits communication
systems to be highly specialized for the available hardware while
still presenting the same interface to the programmer or compiler.
However, to best utilize the hardware, it may be necessary to change
the representation of the address relation before using it.  For
example, the relation may be transformed into a DMA gather map on the
sender to make best use of a network adapter such as the one described
in~\cite{GNECTAR-IFIP}.  Because transforming the relation may be
expensive, it is desirable for the communication system to cache the
(transformed) relation instead of the application.  

Even with the compact address relation representations described
in this paper, it is still useful to be able to bound the amount of
memory used to store address relations.  Address relations can still
have huge space requirements if they do not have regularity to
exploit.  This leads naturally to a caching approach: when the
allotted memory is fully in use and a new relation must be stored, a
victim relation is chosen and flushed from memory to make room.  User
or compiler-supplied hints would be helpful to minimize cache
thrashing.


\section{Conclusions and open issues}
\label{sec:conclude}

We have shown that on a wide variety of machines, message assembly and
disassembly with address relation reuse using the \DMRLEC~address
relation representation can support arbitrary address relations while
operating at near copy throughput for address relations induced by
redistributions of regular block-cyclic distributed arrays.  A side
effect is that the storage requirement for these relations is
drastically reduced to a practical level.  

There are many areas for future exploration.  First, there is the
question of how well the address relation representations we
introduced will work on more complex distributions and operations.  We
are currently implementing support for more distribution types in the
library that was measured and will be able to measure the performance
of the compact representations on them.  A second area is developing
and implementing more compact and general-purpose address relation
representations.  We will implement a
Huffman-encoded~\cite{HUFFMAN-CODES} \DMRLE~representation as
mentioned in Section~\ref{sec:dmrlec}.  We are also exploring using
LZW~\cite{LZ,LZW} or a variant thereof for further compression.  Early
experiments with the \verb.gzip. compressor (which is LZ77-based) show
dramatic compression, but we are not confident that such
dictionary-based decompression can be made to run at copy-bandwidth
speeds.

We are also interested in how much the performance of message assembly
and disassembly depends on single-node compilers.  Many parallel
compilers generate sequential source code and rely on the single-node
compiler to produce good results.  However, as can be seen from the
factor of two difference between a compiler--generated copy loop and
the library \verb.memcpy(). routine on the Paragon
(Figure~\ref{fig:i860}), this reliance may be unwarranted.  If this is
true, then message assembly with address relation reuse is even more
advantageous because the assembly routine for each representation
could be low--level coded for each architecture, while low--level
coding would be impractical for the on-line address computation
approach due to the many variants of assembly routines.  Instead, a
parallel compiler could precompute address relations at compile--time,
and rely on low--level coded library message assembly routines to
achieve near copy throughput.



\subsection*{ACKNOWLEDGEMENTS}

We would like to thank Thomas Stricker for illuminating the finer
points of deposit model communication, James Stichnoth for his work
and explanations of array assignments between regular block-cyclic
distributed arrays, and Ed Segall for providing the impetus to look
at run--time approaches to fast message assembly.


\small
\bibliography{/afs/cs/project/iwarp/member/droh/bib/refs,/afs/cs/project/iwarp/member/fx/bib/texbib,local}

\end{document}
