\documentstyle[times,epsf]{article}
%\documentstyle{article}

\textheight =  9.0in
\topmargin = -.5 in 
\oddsidemargin = -.25 in
\evensidemargin = -.25 in
\textwidth=7in

\def\AAPAIR{AAPAIR}
\def\AABLK{AABLK}
\def\DMRLE{DMRLE}
\def\DMRLEC{DMRLEC}


\title{Fast Message Assembly Using Compact Address Relations}

\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron}

\begin{document}

\begin{abstract}

blah blah

\end{abstract}


\maketitle

%\setlength{\baselinestretch}{2} 

\section{Introduction}

Parallel programs on distributed memory multicomputers manipulate 
distributed data structures.  Operations on distributed data 
structures result in messages being sent between the processors.  The 
fixed startup cost of sending a message is amortized by including many 
data items in the message.  For most communication libraries, this 
requires that the data items be gathered to a contiguous buffer.  The 
performance of this {\em message assembly} step is an important 
component of the overall performance of message passing.  This is also 
true of the symmetric {\em message disassembly} step performed on the 
receiver to scatter the data items.  In the following, we concentrate 
on the message assembly step and discuss message disassembly only when 
the issues are different.

When writing message assembly routines, an important consideration is 
address generation --- how the addresses of the data items to copy 
will be generated.  One approach is {\em on-line address 
computation}: computing addresses as needed in the assembly routine.  
However, even for regular block--cyclic distributed arrays such as 
those supported by High Performance Fortran~\cite{HPF}, computing the 
addresses on-line can be complex.  The approach taken by the Fx 
parallel Fortran compiler~\cite{FX} and other compilers is to generate 
specialized, optimized message assembly routines using compile time 
knowledge.  For example, if the compiler knows that both arrays in an 
array assignment statement are block distributed, the general 
block--cyclic assembly loop nest can be drastically simplified~\cite{stichnot}.


In this paper we attack the problem of how to achieve fast message 
assembly {\em without} compile--time knowledge.  Our primary 
motivation is run--time libraries: we want to build fast user 
libraries that support complex distributed data structures.  Our 
context is a run--time library that performs generalized distributed 
copies on distributed arrays.  A generalized distributed copy can 
perform any mapping of global indices of the source array to global 
indices of the destination array.  Redistributions and transposes are 
examples of the kinds of mappings that are commonly performed.  Our 
goal is to support message assembly with memory access patterns 
arising from arbitrary mappings while achieving nearly the copy 
bandwidth for patterns that arise from common mappings on regular 
block--cyclicly distributed arrays.


Our approach is similar to the inspector/executor scheme used for 
irregular collective communication in CHAOS~\cite{saltz91}.  When a 
copy operation is first performed, for each pair of processors, we 
precompute the mapping of sender addresses to receiver addresses.  
This {\em address relation} is stored in memory and used to provide 
addresses whenever the message assembly is performed.  We refer to 
this scheme as {\em address relation reuse}, as opposed to on-line 
address computation.  The receiver also precomputes the address relation 
and uses it in message disassembly.  The reason for storing both 
sender and receiver addresses is to support other communication 
styles, such as direct deposit~\cite{stricker95} where receiver 
addresses are needed on the sender.

To understand the tradeoff between address relation reuse and on-line 
address computation, we model the performance of each scheme, using a 
simple address relation representation: an array of sender, receiver 
address pairs.  We assume direct deposit model communication because 
this lets us concentrate on message assembly.  Using the models, we 
predict, given machine parameters, the regimes where address relation 
reuse can improve message assembly latency and how many times a 
relation must be reused to break even with computing it on-line.  The 
predictions are validated on the Intel iWarp and Intel Paragon.  
Surprisingly, even with this simple representation, we discover that 
if on-line address computation in the assembly loop involves more than 
a few instructions (2 on the iWarp, 16 on the Paragon), address 
relation reuse can do better.  Further, beyond this threshold, the 
cost of precomputing the address relation can be quickly amortized.

Unfortunately, the simple address relation representation requires 
memory proportional to the number of tuples in the relation.  Further, 
because the message assembly loop requires two extra loads for each data item 
copied, it is significantly slower than it could be for relations 
induced by common regular block--cyclic array distributions.  We 
attack both problems by introducing three compact representations, 
\AABLK, \DMRLE, and \DMRLEC, which significantly reduce storage costs and improve assembly 
throughput.  The representations are designed to be directly used in 
message assembly and disassembly --- relations are {\em not} 
decompressed before use.  The final form, \DMRLEC, compacts relations 
induced by the common cases, including transposes, by three to four 
orders of magnitude, making address relation reuse a practical 
approach for these common cases.

We measure the message assembly and disassembly throughput of address 
relation reuse using all four address relation representations on six 
machines for several common mappings between regular block and cyclic 
distributed two dimensional arrays.  The mappings we test are 
redistribution and transpose.  All measurements are of compiled C 
code.  The machines we test are the DEC 3000/400 (133 MHz Alpha 
21064), a 90 MHz Pentium system, a 133 MHz Pentium system, the Intel 
Paragon (50 MHz i860XP), the IBM RS/6000 Model 250 (80 MHz PowerPC 
601), and the IBM SP-2 (66 MHz POWER2.) These machines are either 
parallel supercomputer nodes, or the workstations that are likely to 
see use in a high speed cluster environment.

Except on the Pentium systems, we find that \DMRLEC achieves the same 
throughput as a stride-one copy loop when blocks are assembled, and as 
a strided copy loop for strided access patterns.  The other 
representations are about as fast, except for pathological cases, 
where their throughput plummets.  On the Pentium, only the simplest
compact representation, \AABLK, is effective --- which we believe is due
to the extremely limited register set of this processor.  

These results demonstrate that message assembly with address relation 
reuse using compact address relations can support arbitrary copy 
patterns while still achieving optimal assembly throughput for common 
patterns.


%
% Superscalar plateau?
%

% \section{Address Relation Reuse}
% 
% \begin{figure}
% %\epsfig
% \caption{Levels of an operation on distributed data structure}
% \label{fig:hpfassign}
% \end{figure}
% 
% \begin{figure}
% %\epsfig
% \caption{Decoupling address computation and message assembly}
% \label{fig:decouple}
% \end{figure}
% 
% 
% Parallel programs for distributed memory multicomputers, whether
% written in a high level parallel language such as HPF or Fx, or in a
% sequential language with message passing, operate on data structures
% distributed over many processors.  These operations can be examined at
% many levels.  The language level defines what data structures are
% possible, how they can be distributed, and the operations that can be
% performed on them.  For a particular program, the programmer or
% compiler chooses the appropriate data structures, distributions, and
% operations to accomplish his goals.  These choices interact to produce
% some global communication pattern between the processors.  Between
% each pair of processors, communication involves mapping sender address
% to receiver addresses and transfer data according to this mapping.  We
% refer to the act of computing this mapping as {\em address
% computation} and to the mapping itself as an {\em address relation}.
% 
% 
% For example, consider figure \ref{fig:hpfassign}.  The innocuous
% language level HPF array assignment \verb.B=A. is between two
% dimensional arrays with different data distributions (block
% distributed columns and rows for B and A, respectively).  This induces
% an all-to-all communication operation.  Between each pair of
% processors, an address relation is computed mapping blocks of $k$
% addresses from a sender stride of $Pk$ to a receiver stride of $k$.
% To communicate data according to the relation, the strided data is
% gathered at the sender either into a buffer or directly to the wire
% and scattered at the receiver.
% 
% Typically, for regular data distributions and access patterns, address
% computation is done in--line with {\em message assembly} (gathering
% the data) on the sending processor, and {\em message disassembly}
% (scattering the data) on the receiving processor.  Compile--time
% knowledge (or programmer knowledge of the strengths and weaknesses of
% the compiler) is used to reduce the cost of address
% computation~\cite{jim}.  
% 
% We {\em decouple} address computation and message
% assembly/disassembly.  As shown in figure~\ref{fig:decouple}, address
% computation generates an address relation which is stored in memory
% and then reused for assembling and disassembling messages whenever the
% associated communnication is done.  The address relation can be
% represented in many different ways to minimize memory use or to
% maximize performance.  The power of this technique is that message
% assembly/disassembly code need only be optimized for each
% representation and architecture, but not for different data
% distributions.  
% 






\section{Analytic models}
\label{sec:models}

In this section, we model address relation reuse and on-line address 
computation in the context of direct deposit model communication.  
After explaining this communication style, we develop expressions for 
the average latency for assembly with on-line address computation and 
with address relation reuse as a function of machine parameters.  The 
address relation representation we use is a simple list of sender 
address, receiver address tuples.  We combine our expressions to 
produce predictions for the {\em threshold}: the minimum computation 
at which reuse is faster than on-line address computation, the {\em 
breakeven}: how many times a relation must be reused before the 
initial address computation is amortized, and the {\em speedup}: the 
maximum improvement possible.  We test these values on the iWarp and 
Paragon and find close agreement.  An important observation is that 
the threshold values are very low, on the order of one instruction on 
the iWarp and 16 on the Paragon.


\subsection{Direct deposit model communication}

In a conventional communication system, such as PVM~\cite{sund90},
NX~\cite{pierce94}, or MPI~\cite{walker94}, address computation is
done both on the sender and the receiver.  The message is assembled
according to the address relation computed by the sender, transported
to the receiver, and disassembled according to the address relation
computed by the receiver.  Obviously, the same total order must be
placed on both address relations.  The message contains only the
actual data.  In direct deposit model communication~\cite{stricker95},
only the sender performs address computation.  The receiver addresses
are embedded in the message.  In addition to reducing the end-to-end
latency, this approach also allows for more easily separating
synchronization from data transfer, allowing a compiler to amortize
synchronization over many data transfers.  The message format that
we examine is called {\em address-data-pair} since each data word
is accompanied by the address at which it will be deposited.
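The address-data-pair format can be sketched as follows.  This is our
own illustration (the struct and function names are hypothetical, not
from any particular library): the sender gathers each data word and
pairs it with the receiver address at which it will be deposited, and
the receiver deposits each word directly with no address computation.

```c
#include <assert.h>
#include <stddef.h>

/* One message word in the address-data-pair format: the data word
   travels together with the receiver address where it is deposited. */
typedef struct {
    size_t addr;   /* receiver address */
    double val;    /* data word */
} adp_word;

/* Sender: gather data[s[i]] and pair it with receiver address r[i]. */
static void assemble_adp(const double *data, const size_t *s,
                         const size_t *r, size_t n, adp_word *msg)
{
    for (size_t i = 0; i < n; i++) {
        msg[i].addr = r[i];
        msg[i].val  = data[s[i]];
    }
}

/* Receiver: deposit each data word directly at its carried address. */
static void disassemble_adp(const adp_word *msg, size_t n, double *mem)
{
    for (size_t i = 0; i < n; i++)
        mem[msg[i].addr] = msg[i].val;
}
```

Because the receiver loop touches no address relation at all, only the
sender pays the address computation cost, at the price of doubling the
message size.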

\subsection{Analytic models}
\label{sec:simple}

In this section, we develop expressions for the average latency of 
message assembly with on-line address computation and with address 
relation reuse.  We combine the expressions to predict the threshold 
at which the reuse approach becomes effective, the breakeven point at 
which the reuse is as fast as on-line address computation, and the 
final speedup of the reuse approach.

\subsubsection*{Algorithm and machine properties}

Clearly, the time spent computing the address relation depends on the
algorithm and its implementation.  We model address computation as
instruction issue rate limited and as delivering the address relation
in an on-line manner.  We assume that address computation is
integer-based and done in registers, so we characterize it by $n_g$,
the number of instructions executed per element of the relation.
Additionally, we must know $r_i$, the number of integer instructions
the machine can issue per second.  Since memory accesses are at the
heart of message assembly and address relation reuse, we characterize
the memory system by four values: $r_{rc}$ and $r_{wc}$, the read and
write bandwidths for unit--stride accesses, respectively; and $r_{rr}$
and $r_{wr}$, the read and write bandwidths for random memory
accesses.

\subsubsection*{on-line address computation}

The time for message assembly using on-line address computation has
two components: $t_g$, the time to compute the address relation;
and $t_{au}$, the message assembly time:
\begin{displaymath}
t_{on-line} = t_g + t_{au}
\end{displaymath}
The time to compute the address relation is largely determined by the
complexity of the algorithm used, its implementation, and the instruction
execution rate of the machine.  We characterize the algorithm in terms
of $n_g$, the number of integer instructions executed per tuple of the
address relation.  Thus for an $n$ word message, the time required is
\begin{displaymath}
t_g = \frac{n n_g}{r_i}
\end{displaymath}
The address computation code presents us with sender, receiver address
tuples $(s,r)$ in registers in an on-line manner.  For each of these
tuples, we read the data word at address $s$, write $r$ to the message
buffer, then write the data word to the message buffer.  Note that the
writes to the message buffer are to consecutive addresses, so their
performance can be captured by $r_{wc}$, the unit--stride write
bandwidth of the machine.  On the other hand, the sender side
addresses of the tuples may not be contiguous, so we capture their
performance with $r_{rr}$, the random read bandwidth of the machine.
Furthermore, we capture the loop overhead as a number of instructions,
$n_{au}$.  Thus the time to assemble the message is
\begin{displaymath}
t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{n_{au}}{r_i}
          \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.  If
the compiler software-pipelines the assembly loop, we should be close
to the above expression.  Substituting into the expression for
$t_{on-line}$, we get an average assembly latency of 
\begin{equation}
 t_{on-line} = n \left ( \frac{n_g+n_{au}}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
\label{equ:uncached}
\end{equation}
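The loop being modeled can be sketched as follows, using a hypothetical
stride-$k$ gather as the on-line address computation.  Each iteration
computes the tuple $(s,r)$ in registers ($n_g$ instructions), performs
one random read of the data word, and issues two unit-stride writes
into the message buffers (kept separate here for clarity; the modeled
code interleaves address and data in one buffer).

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of on-line address computation fused with message assembly.
   The stride-k gather stands in for an arbitrary address computation. */
static void assemble_online(const double *src, size_t stride, size_t n,
                            size_t dst_base,
                            size_t *msg_addr, double *msg_data)
{
    for (size_t i = 0; i < n; i++) {
        size_t s = i * stride;     /* sender address, computed on-line */
        size_t r = dst_base + i;   /* receiver address, computed on-line */
        msg_addr[i] = r;           /* unit-stride write (r_wc) */
        msg_data[i] = src[s];      /* random read (r_rr) + unit-stride write */
    }
}
```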


\subsubsection*{Address relation reuse}

The average latency for message assembly with address relation reuse 
has three components: $t_g$, the time to compute the address relation; 
$t_o$, the time to store the address relation in memory, and 
$t_{ac}$, the message assembly time using the stored address relation.  
The average latency is
\begin{displaymath}
 t_{reuse} = \frac{t_g + t_o}{n_i} + t_{ac}
\end{displaymath}
Note that the first two components are amortized over the number of
times the communication is performed, $n_i$.  $t_g$ is the same as for
the on-line scheme and was determined in the preceding section.  The
address relation representation we examine is a simple array of sender
address, receiver address tuples.  As each tuple of the address
relation is generated, we write it to the array.  Thus storing the
address relation involves two contiguous writes for each in-register
tuple produced by the address relation computation.  In addition, we
execute $n_o$ loop overhead instructions, so
\begin{displaymath}
 t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
\end{displaymath}
When assembling the message from the stored address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the receiver address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, thus the time to
assemble a message with $n$ data words is
\begin{displaymath}
 t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes, but we continue the assumption of the
preceding section.  Substituting into the expression for $t_{reuse}$,
we see that an $n$ word communication between two nodes requires time
\begin{equation}
 t_{reuse} = n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} +
	\frac{n_{ac}}{r_i} + \frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
	\frac{2}{r_{wc}} \right )
\label{equ:cached}
\end{equation}
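The two phases of the reuse scheme with this representation can be
sketched as follows (function names are ours).  The first loop is the
one-time cost $t_g + t_o$: the address computation (again a
hypothetical stride-$k$ gather) writes each tuple to the stored
relation.  The second loop is the per-communication cost $t_{ac}$: two
unit-stride reads of the tuple, one random read of the data, and two
unit-stride writes into the message buffers.

```c
#include <assert.h>
#include <stddef.h>

/* One-time phase: compute the address relation and store it as
   parallel arrays of sender and receiver addresses (AAPAIR-style). */
static void store_relation(size_t *rel_s, size_t *rel_r, size_t stride,
                           size_t dst_base, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        rel_s[i] = i * stride;
        rel_r[i] = dst_base + i;
    }
}

/* Per-communication phase: assemble the message from the stored
   relation; no address computation instructions remain in the loop. */
static void assemble_from_relation(const double *src,
                                   const size_t *rel_s, const size_t *rel_r,
                                   size_t n,
                                   size_t *msg_addr, double *msg_data)
{
    for (size_t i = 0; i < n; i++) {
        msg_addr[i] = rel_r[i];        /* unit-stride read + write */
        msg_data[i] = src[rel_s[i]];   /* unit-stride read, random read, write */
    }
}
```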

\subsubsection*{Effects of reuse}

Equations~\ref{equ:uncached} and~\ref{equ:cached} provide insight 
into the following questions: For a message assembly that occurs many 
times at run--time, can address relation reuse ever outperform on-line 
address computation?  If so, how many times must the relation be 
reused before the reuse strategy wins?  Finally, what is the ultimate 
speedup of the reuse strategy over the on-line strategy?

Reuse can only be effective if we can assemble the message faster
using the stored relation than by computing it on-line, i.e., 
\begin{displaymath}
t_{ac} < t_g + t_{au}
\end{displaymath}
or
\begin{equation}
% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
\label{equ:necessary}
\end{equation}
The right hand side of equation~\ref{equ:necessary} is a 
machine--specific threshold below which address relation reuse is 
ineffective.  Notice that it is largely determined by the ratio of 
instruction issue rate and memory read bandwidth.  
If equation~\ref{equ:necessary} holds, then we can also predict how many 
times the address relation must be reused in order to break even with 
the on-line scheme.  This break even point is found by setting 
$t_{reuse} = t_{on-line}$ and solving for $n_i$.  In doing this we 
find that the break even point is
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
\right \rceil 
\label{equ:crossover}
\end{equation}
It is important to point out that as $n_g$ moves away from the
threshold, the break even point decreases sharply because the $n_g$
terms in the fraction begin to dominate.

Finally, the ultimate speedup of the reuse strategy over the
on-line strategy is 
\begin{equation}
\label{equ:speedup}
\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
              + r_ir_{rc}(r_{wc}+2r_{rr})}
             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr})}
\end{equation}
The speedup is linear in $n_g$, so it is bounded only by the complexity
of the address computation.


\subsection{Model validation}

In this section we validate the models from Section~\ref{sec:simple} 
for message assembly on an iWarp system and on a Paragon.  To test the 
predictive power of the models, we use (\ref{equ:crossover}) to 
predict the number of iterations necessary for address relation reuse 
to break even with on-line address computation as a function of $n_g$, 
and we use (\ref{equ:speedup}) to predict the ultimate speedup of the 
reuse strategy over the on-line strategy as a function of $n_g$.  
In each case, the address relation that is generated is the identity 
relation, and $n_g$, the number of instructions to compute the 
relation, is varied by inserting nops into the code stream.  The 
predictions are compared to actual measurements on the hardware.

\subsubsection*{iWarp}

On the iWarp~\cite{iwarp} system, $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$
words/second, and a node can execute $r_i=2 \times 10^7$
instructions/second.  The memory system parameters were derived
directly from the iWarp specification.  Specifically, the iWarp clock
speed is 20 MHz and all memory operations complete in two cycles.
There is no cache and we assume no dual dispatch of loads and stores.

The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
estimated {\em statically} by simply counting the number of
instructions in each loop, or {\em dynamically} by accounting for
bypass stalls that arise in the loops. The static estimates for iWarp
are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.

Figure~\ref{fig:iw}(a) shows the measured and predicted number of
iterations needed on iWarp for the message assembly latency of the
reuse scheme to break even with that of the on-line scheme for
different $n_g$.  The predictions made using the dynamic measurements
are exact.  The predictions made using the static measurements are
pessimistic for small $n_g$, but quickly converge to the actual
numbers as $n_g$ increases.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1        &         13     &        INF        &       13            \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &          3        &        3\\
\hline
\end{tabular} 
&
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1       &         1.05    &         1.00     &         1.07 \\
   2       &         1.12    &         1.09     &         1.15 \\
   3       &         1.22    &         1.16     &         1.23 \\
   4       &         1.28    &         1.25     &         1.32 \\
   5       &         1.35    &         1.33     &         1.39 \\
   6       &         1.43    &         1.41     &         1.47 \\
   7       &         1.49    &         1.49     &         1.54 \\
   8       &         1.59    &         1.59     &         1.61 \\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{Model validation on iWarp:  (a) Iterations required to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:iw}
\end{figure}

Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
the reuse scheme over the on-line scheme on iWarp. As in
Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for
all $n_g$ and the static predictions are pessimistic for small $n_g$.
However, it is important to note that even with static measurements,
the predictions only become inaccurate when $n_g$ approaches the
threshold.  Another important point is that the purpose of
Figure~\ref{fig:iw}(b) is to help validate the models when $n_g$ is
small, and not to highlight impressive speedups. In fact, the speedups
grow arbitrarily large as $n_g$ increases (see Equation~\ref{equ:speedup}.)



\subsubsection*{Paragon}

A Paragon node contains a 50 MHz pipelined, multiple issue RISC
processor~\cite{PARAGON-XPS,I860-PR}.  We estimated the instruction
issue rate by noting that we do not use the floating point pipeline.
Therefore we can issue at most $r_i=5 \times 10^7$
instructions/second.  We directly measured the memory parameters and
found $r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.  Only
a static measurement (instruction counting) of the implementation
properties was performed, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.

Figure~\ref{fig:86}(a) shows, for different $n_g$, the measured and
predicted number of iterations needed for message assembly using the
reuse scheme to break even with assembly using the on-line scheme.
The predictions are pessimistic, but quickly converge to the actual
values with increasing $n_g$.  Notice that $n_g$ must be much larger
to permit break even on the Paragon than on the iWarp.  This is not
surprising given that the Paragon's instruction issue rate is much
higher than its memory bandwidth, while the iWarp's issue rate is well
matched to its memory bandwidth.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline

    14     &        NONE    &        NONE\\
    15     &        NONE    &        NONE\\
    16     &        19      &        NONE\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
&
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline
    14     &        NONE   &         NONE\\
    15     &        NONE   &         NONE\\
    16     &       1.03     &         NONE\\
    17     &       1.06     &         1.01\\
    18     &       1.09     &         1.02\\
    19     &       1.12     &         1.03\\
    20     &       1.15     &         1.04\\
    21     &       1.17     &         1.05\\
    22     &       1.20     &         1.06\\
    23     &       1.22     &         1.07\\
    24     &       1.24     &         1.08\\
    25     &       1.26     &         1.09\\
\hline
\end{tabular}\\
(a) & (b)
\end{tabular}
\caption{Model validation on Paragon: (a) Iterations to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:86}
\end{figure}

Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
the reuse scheme over the on-line scheme on the Paragon.  Notice
that speedup does not increase as quickly with $n_g$ as on the iWarp
(Figure~\ref{fig:iw}(b).)  This is the result of the lower memory
bandwidth relative to instruction issue rate on the Paragon.  The
predictions are not as accurate as for the iWarp, reflecting the fact
the Paragon's memory system is more complex.  However, the predictions
are pessimistic and quickly converge to the correct values.


\section{Compact address relations}

The address relation representation modeled in 
Section~\ref{sec:models} is a simple array of sender, receiver address 
pairs.  This representation is impractical because it requires memory 
proportional to the number of data items that are transferred.  
Further, because one extra load (two for direct deposit) is performed 
in the message assembly loop for each data item, it is impossible to 
assemble messages at copy loop throughput.  

For relations induced by redistributions of regular block--cyclic
distributed arrays, we can do better.  These relations exhibit 
repeating patterns that we can take advantage of to compactly 
represent them and to maximize message assembly throughput.  

In this section, we choose four representative array redistributions 
and use them to drive the development of a good compact representation.
We develop three representations, \AABLK, \DMRLE, and \DMRLEC, which
recognize an increasingly broad range of patterns.  \DMRLEC is clearly
the most space efficient of the three representations.  

\subsection{Representative redistributions}

\begin{figure}
\centerline{\epsfxsize 3in \epsfbox{example.eps}}
\caption{Redistribution of regular block--cyclic distributed arrays}
\label{fig:redist}
\end{figure}

Each dimension of a regular block--cyclic distributed array has a size 
$N$ and is distributed over some number of processors $P$ with a fixed 
block size $M$.  Conceptually, the dimension is split into $\lceil N/M 
\rceil$ blocks of size $M$ ($\leq M$ for the final block) which are 
hen distributed round robbin over the $P$ processors.  The Cartesian 
product of the distributions of all the dimensions gives the 
distribution of the whole array.  To redistribute the array, for each 
pair of source and destination processors, we compute the intersection 
of the elements that the two processors own and map the source 
processor address of each element in the intersection to its address 
on the destination processor.  See figure~\ref{fig:redist}.
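Under this layout, the owning processor and local offset of a global
index follow directly from the block number.  The sketch below shows
the standard block--cyclic formulas for one dimension, included here
only for concreteness.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of global index i in a dimension distributed block-cyclicly
   with block size M over P processors: blocks go round robin. */
static size_t owner(size_t i, size_t M, size_t P)
{
    return (i / M) % P;
}

/* Local index of global index i on its owning processor: the local
   block number times the block size, plus the offset within the block. */
static size_t local_index(size_t i, size_t M, size_t P)
{
    size_t block = i / M;            /* global block number */
    return (block / P) * M + i % M;  /* local block base + offset */
}
```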

The most common regular block--cyclic distributions are BLOCK, where 
there are exactly $\lceil N/M \rceil = P$ blocks, and CYCLIC, where 
$M=1$.  For a single dimension, only simple stride-$j$ 
to stride-$k$ mappings can occur in an address relation induced by 
redistributions.  BLOCK to BLOCK redistributions result in unit 
strides on both the source and destination while the combination of 
CYCLIC and BLOCK results in a unit stride on one side and a non unit 
stride on the other.  CYCLIC to CYCLIC redistributions result in non 
unit strides on both sides, as do certain forms of transposes.  The 
four two dimensional array redistributions of figure \ref{fig:repdist} 
are representative of these patterns.  We include a data transpose 
operation instead of a CYCLIC to CYCLIC redistribution to demonstrate
that our technique extends to transposes as well.  We will use these 
representative redistribution operations to motivate our schemes for 
compact address relation representations.

\begin{figure}
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Redistribution & Assembly Stride & Disassembly Stride \\
\hline
(BLOCK,*)  to (*,BLOCK) & 1 & 1 \\
(CYCLIC,*) to (BLOCK,*) & 1 & k \\
(BLOCK,*) to (CYCLIC,*) & k & 1 \\
(*,CYCLIC) to (CYCLIC,*) Transpose & k & k \\
\hline
\end{tabular}
\end{center}
\caption{Representative redistributions}
\label{fig:repdist}
\end{figure}

\subsection{Hierarchy}

\begin{figure}
\centerline{\epsfxsize 3in \epsfbox{sizes.eps}}
\caption{Total size of all address relations on sending processor zero 
in each representation}
\label{fig:sizes}
\end{figure}

\subsubsection*{\AAPAIR}

The \AAPAIR representation is simply the array of sender, receiver 
address pairs used in Section~\ref{sec:models}.  It requires storage 
space proportional to the number of data items transferred.

\subsubsection*{\AABLK}

For the \AABLK representation, we place a total order on the tuples of
the address relation: $(a,b) < (c,d)$ iff $a<c$, or $a=c$ and $b<d$.  
This means that the tuples are ordered by sender side address first, 
then by receiver side address.  The \AABLK representation run length 
encodes blocks of stride-1 to stride-1 mappings.  The symbols generated
are of the form $(a,b,l)$, which means that addresses $a$ through
$a+l-1$ on the sender map to addresses $b$ through $b+l-1$ on the 
receiver.
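A minimal sketch of \AABLK encoding, assuming the tuples arrive
already sorted in the total order above (the names and array-based
interface are our own illustration):

```c
#include <assert.h>
#include <stddef.h>

/* One AABLK symbol: addresses a..a+l-1 on the sender map to
   addresses b..b+l-1 on the receiver. */
typedef struct { size_t a, b, l; } aablk_sym;

/* Run-length encode stride-1 to stride-1 blocks from sorted tuples
   (s[i], r[i]); returns the number of symbols produced. */
static size_t aablk_encode(const size_t *s, const size_t *r, size_t n,
                           aablk_sym *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (k > 0 && s[i] == out[k-1].a + out[k-1].l
                  && r[i] == out[k-1].b + out[k-1].l) {
            out[k-1].l++;                        /* extend current run */
        } else {
            out[k] = (aablk_sym){ s[i], r[i], 1 };  /* start new run */
            k++;
        }
    }
    return k;
}
```

A BLOCK to BLOCK relation of $n$ contiguous tuples collapses to a
single symbol, while a worst-case relation with no stride-1 runs
degenerates to one symbol per tuple.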

\subsubsection*{\DMRLE}

\DMRLE generalizes \AABLK to map blocks of addresses with a sender 
stride $j$ to a receiver stride $k$.  We begin with the totally 
ordered relation as described in the previous section.  Then we form
a difference map $(dx_{i},dy_{i})$ from the sequence of tuples 
$(x_{i},y_{i})$:
\begin{displaymath}
x_{-1}=y_{-1}=0, \qquad
dx_{i}=x_{i}-x_{i-1}, \qquad
dy_{i}=y_{i}-y_{i-1}
\end{displaymath}
The difference map is then run length encoded to produce symbols of
the form $(dx,dy,l)$.  
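The two steps, forming the difference map and run length encoding it,
can be fused into one scan.  A sketch in C with illustrative names
({\tt Pair}, {\tt Sym}, {\tt dmrle\_encode}):

```c
#include <assert.h>
#include <stddef.h>

/* One (sender, receiver) address pair, sorted by the total order. */
typedef struct { long x, y; } Pair;

/* One DMRLE symbol: l consecutive tuples whose sender addresses step
   by dx and whose receiver addresses step by dy. */
typedef struct { long dx, dy, l; } Sym;

/* Difference-map run-length encoding.  out must have room for n
   symbols in the worst case; returns the number written. */
size_t dmrle_encode(const Pair *p, size_t n, Sym *out)
{
    size_t m = 0;
    long px = 0, py = 0;        /* x_{-1} = y_{-1} = 0 */
    for (size_t i = 0; i < n; i++) {
        long dx = p[i].x - px;
        long dy = p[i].y - py;
        px = p[i].x;
        py = p[i].y;
        if (m > 0 && out[m-1].dx == dx && out[m-1].dy == dy)
            out[m-1].l++;                 /* same strides: extend run */
        else
            out[m++] = (Sym){ dx, dy, 1 };   /* strides changed */
    }
    return m;
}
```

Note that because $x_{-1}=y_{-1}=0$, the first tuple usually produces
its own symbol encoding the starting offsets.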

\subsubsection*{\DMRLEC}


\section{Performance of address relation compression}

\subsection{Tradeoff}

\subsection{Metrics}

\subsection{Measurements}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{alpha_assem.eps} & \epsfxsize 2.5in \epsfbox{alpha_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using 
\AAPAIR and \DMRLE on DEC 3000/400}
\label{fig:alpha}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{i860_assem.eps} & \epsfxsize 2.5in \epsfbox{i860_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using
\AAPAIR and \DMRLE on Paragon}
\label{fig:i860}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{ppc_assem.eps} & \epsfxsize 2.5in \epsfbox{ppc_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using \AAPAIR and
\DMRLE on IBM RS/6000 Model 250}
\label{fig:ppc}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{pwr2_assem.eps} & \epsfxsize 2.5in \epsfbox{pwr2_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using \AAPAIR
and \DMRLEC on IBM SP-2}
\label{fig:pwr2}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p90_assem.eps} & \epsfxsize 2.5in \epsfbox{p90_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using \AAPAIR,
\AABLK, \DMRLE, and \DMRLEC on Gateway Pentium 90 system}
\label{fig:p90}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p133_assem.eps} & \epsfxsize 2.5in \epsfbox{p133_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message assembly (a) and disassembly (b) using \AAPAIR,
\AABLK, \DMRLE, and \DMRLEC on Micron Pentium 133 system}
\label{fig:p133}
\end{figure}


\subsection{Discussion}

%\section{Superscalar plateau}

\section{Discussion}

In this section, we comment on other issues in address relation
caching.  We begin by noting the benefits of decoupling address
computation and message assembly in the communication system.  Next,
we discuss where to make the decision to cache an address
computation, pointing out that compilers play a key role.
Continuing, we comment on cache data structures and replacement
policy.  Finally, we point out that the cache provides an opportunity
for optimizations.


\subsection{Decoupling address computation and message assembly}

By taking address computation out of the critical path of data
transfer, address relation caching decouples address computation and
message assembly.  One advantage of this is that it makes it easier to
achieve reasonable communication performance without a large number of
special--case optimizations in the compiler and run--time library.
Instead of optimizing each address computation algorithm for each
architecture, only message assembly from cache needs to be optimized
for each architecture in order for all data transfers to benefit.

Address relations are an ideal interface to the communication system
because they are implementation neutral.  This permits communication
systems to be highly specialized for the available hardware while
still presenting the same interface to the programmer or compiler.
However, to best utilize the hardware, it may be necessary to change
the representation of the address relation before using it.  For
example, the relation may be transformed into a DMA gather map on the
sender to make best use of a network adapter such as the one described
in~\cite{GNECTAR-IFIP}.  Because transforming the relation may be
expensive, it is desirable for the communication system to cache the
(transformed) relation instead of the application.  


\subsection{Deciding to cache}

Our analytical model and HPF results show that caching can sometimes
be ineffective and that the number of times a cached address relation
must be reused to break even depends on the complexity of address
generation.  Given this, where should the decision to cache or not be
made?  One option is to make this decision in the compiler.  This is
almost certainly the right place if the distributions and machine
characteristics and the number of times the communication is performed
are known at compile--time.  If this is not the case, the decision can
be made at run--time, where at least the distributions and machine
characteristics are known.  However, the number of repetitions may be
data dependent and unknown until after the communication has been
performed.  To deal with this case, the library could keep track of
the number of times a particular communication is performed uncached,
and switch to a cached approach after some number of repetitions.
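This run--time policy can be sketched as a per-communication record
that counts uncached executions and switches over at a threshold.
The record type, function, and break--even constant below are all
hypothetical, chosen for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-communication bookkeeping for the run-time caching decision. */
typedef struct {
    int uses;       /* times this communication has run uncached */
    bool cached;    /* whether an address relation is now cached */
} CommRecord;

#define CACHE_THRESHOLD 3   /* hypothetical break-even repetition count */

/* Returns true when the cached path should be used.  On the call that
   crosses the threshold, the caller would compute and store the
   address relation before assembling from it. */
bool should_use_cache(CommRecord *r)
{
    if (r->cached)
        return true;
    if (++r->uses >= CACHE_THRESHOLD) {
        r->cached = true;
        return true;
    }
    return false;   /* assemble with on-line address computation */
}
```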

\subsection{Cache data structures}

There is a continuum of cache data structures that ranges from
individually storing each tuple of the address relation in memory to
simply performing the original address computation.  The cache data
structure should be chosen to minimize the time spent accessing the
cache given memory constraints.  The structure should be implementation
dependent in order to make best use of available hardware.


\subsection{Replacement policy}

The cache replacement policy should make use of compile--time
knowledge.  The HPF library described above uses a strictly
direct--mapped policy at run--time, but the cache entry tags are
selected by the compiler, which knows the run--time policy.  This
means the compiler can ensure that useful address relations remain in
the cache.  It is unclear what to do when there is insufficient memory
to hold even one address relation in the cache.
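A direct--mapped policy of this kind reduces to a tag comparison per
lookup.  A sketch in C, assuming compiler-assigned tags starting at 1
(tag 0 marks an empty slot); the names and table size are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 8   /* hypothetical cache size */

typedef struct { int tag; void *relation; } Slot;

static Slot cache[NSLOTS];   /* zero-initialized: all slots empty */

/* Return the cached relation for tag, or NULL on a miss.  On a miss
   the caller computes the relation and installs it. */
void *cache_lookup(int tag)
{
    Slot *s = &cache[tag % NSLOTS];
    return (s->tag == tag) ? s->relation : NULL;
}

/* Install a relation, evicting whatever occupied the slot.  The
   compiler chooses tags so that relations it wants kept do not
   collide in the same slot. */
void cache_install(int tag, void *relation)
{
    Slot *s = &cache[tag % NSLOTS];
    s->tag = tag;
    s->relation = relation;
}
```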

\subsection{Cache optimizations}

The cache is an intermediate layer between address computation and
message assembly.  As such, it presents opportunities for
optimization.  For example, since the entire address relation is known
to the cache, the cache can reorder the relation for memory locality
on the sender, receiver, or both.  As shown in~\cite{memoryperform},
the choice is architecture--dependent.  If the distributions, machine
characteristics, and repetition count are known at compile--time, it
is reasonable to perform address generation at this point,
``pre-cache'' the results, and always assemble from cache when running
the program.

\section{Related work}

 A similar technique is used to
support irregular data distributions and access patterns in
CHAOS/PARTI~\cite{saltz91} --- in that system, an ``inspector'' is
used to compute all communication for a parallel loop and an
``executor'' performs the loop, including the communication.  Our
main contribution is to demonstrate 

\section{Conclusion}

\end{document}
