\input{caption}
\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}
}}
\makeatother
\newcommand{\comment}[1]{}

\documentstyle[times,epsf,11pt]{article}

\textheight =  9.0in
\topmargin = -.5 in 
%\oddsidemargin = -.25 in
%\evensidemargin = -.25 in
\oddsidemargin = 0 in
\evensidemargin = 0 in
\textwidth=6.5in

\def\AAPAIR{AAPAIR}
\def\AABLK{AABLK}
\def\DMRLE{DMRLE}
\def\DMRLEC{DMRLEC}


\begin{document}
\bibliographystyle{acm}

\title{
\begin{flushleft}
\vspace{-0.9in}
\scriptsize {\em 
Sigmetrics96 Draft
}
\end{flushleft}
\vspace{0.4in}
\Large\bf Fast Message Assembly Using Compact Address Relations}
\author{\mbox{Paper \#xxx}}
%\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron}

\date{}
\maketitle 
\thispagestyle{empty}

\begin{abstract}
abstract
\end{abstract}

\comment{
\unmarkedfootnote{
Supported in part by the Advanced Research Projects
Agency/CSTO monitored by SPAWAR under contract N00039-93-C-0152, in
part by the National Science Foundation under Grant ASC-9318163, and
in part by a grant from the Intel Corporation.
Authors' email addresses: pdinda@cs.cmu.edu, droh@cs.cmu.edu.
}
}

\section{Introduction}
\label{sec:intro}

Many parallel programs use a message passing system to transfer data
items from the memory of a source node to the memory of a destination
node.  These programs typically amortize the fixed overhead of
transferring a message by packing a collection of data items into a
contiguous message buffer, which the message passing system then
transfers over the network as a single message.  When the message
arrives at the destination, the message is copied from the network to
a buffer, which the user program then unpacks to the appropriate
memory locations.

%
% pack/unpack versus scatter/gather - perhaps p/u is better
%

The process of packing a message buffer on the source node is called {\em
message assembly}, and the symmetric process on the destination
node is known as {\em message disassembly}.  In general, the data items being
assembled and disassembled have arbitrary memory addresses, although
in many cases the addresses follow some pattern, such as a constant
stride or equally spaced contiguous blocks.

The performance of the message assembly/disassembly steps can have a 
significant impact on the overall communication performance of a 
program.  For example, aggressively hand--optimized assembly and 
disassembly loops for simple contiguous and strided accesses account 
for almost 50\% of the total communication time in a large parallel 2D 
fast Fourier transform running on 128 nodes of a CRAY 
T3D~\cite{ICS-DECOUPLE}.  Of course, for computers with 
weak communication networks (e.g., a workstation cluster on a 10Mb 
Ethernet) time on the wire will dominate.  But in many parallel 
systems, the memory system tends to be the bottleneck rather than the 
communication system~\cite{ISCA-MEMORY}, and for these systems 
assembly/disassembly time is important.  Further, the range
of systems for which message assembly/disassembly performance is
important is increasing with the advent of high speed 
networks~\cite{GNECTAR-IFIP}.

Another consideration is that the running times of different
algorithms for message assembly and disassembly can vary
significantly.  For example, the running time of the assembly step for
a High Performance Fortran (HPF)~\cite{hpf93} block--cyclic assignment
statement can vary by several orders of magnitude, depending on the
nesting of the assembly loops~\cite{gupta94}.  As another example,
because of asymmetry in the Cray T3D memory system, a transpose
operation implemented with contiguous reads and strided writes runs
twice as fast as the equivalent operation implemented with strided
reads and contiguous writes~\cite{ISCA-MEMORY}. The point is that in
general, message assembly and disassembly are non--trivial to implement
efficiently, and even when implemented efficiently can represent a
significant fraction of the total communication time.

The general problem addressed here is how to perform fast message
assembly and disassembly in the context of a run--time communication
library.  The library should be fast in the sense that data items
should be transferred between their original locations and the message
buffer at or near the copy throughput of the memory system.
Furthermore, since the assembly and disassembly routines are in a
library, they can rely only on run--time information.
%
% copy throughput of the memory system for the memory reference 
% pattern
%

This paper describes an approach for fast run--time message
assembly/disassembly and provides some initial characterization of its
performance on real systems. The approach is based on the notion of
decoding precomputed {\em address relations} that describe the mapping
of addresses on the sending nodes to addresses on the receiving nodes.
The run--time model is a classic inspector/executor~\cite{saltz91}.
For each data transfer between a source and a destination, the
appropriate address relation is precomputed and stored in some
encoding.  The stored encoding of the address relation is
interpreted by the message assembly and disassembly routines each time
data is transferred.  As with any inspector/executor scheme, the aim
is to exploit the fact that many parallel programs repeatedly perform
the same data transfers. The focus of this paper is on the performance
of the executor (i.e., the assembly/disassembly step).

%
% This is too abrupt.
%
%After an overview of address relations and their encodings in
%Section~\ref{sec:relations}--\ref{sec:compact}, 

We begin with an overview of address relations and approaches to 
generating them in Section~\ref{sec:relations}.  The next section, 
Section~\ref{sec:represent}, characterizes the address relations 
induced by redistributions of block--cyclic arrays and points out
the regularities these relations exhibit.  Section~\ref{sec:compact} 
introduces address relation encodings that are general, but exploit
the regularities we describe.  At this point, we pursue two main
questions: (1) Can decoding precomputed address relations improve the
performance of message assembly and disassembly?  (2) If so, can we
store the relations compactly and still interpret them efficiently
during the assembly/disassembly steps?

In Section~\ref{sec:models} we consider a simple but
memory--inefficient encoding of address relations (i.e., an index
array). For this encoding we derive a model that predicts, given
measurable machine parameters, the regimes where decoding address
relations, rather than recomputing them each time, can improve the
average message assembly latency.  For those cases where decoding the
relations is effective, we extend the model to predict the breakeven
point (i.e., how many times an address relation must be reused to
amortize the cost of precomputing it) and the upper bound on the
speedup.  Validation on the Intel iWarp~\cite{iwarpoverview,iwarpcomm}
and Paragon~\cite{I860-PR} systems suggests that the models have good
predictive power. Surprisingly, the results indicate that even with
the simple index array encoding, if recomputing an address in the
assembly loop involves more than a few instructions (1 on the iWarp,
16 on the Paragon), then decoding a precomputed address relation can
do better.  Further, beyond this threshold, the cost of precomputing
the address relation can be quickly amortized.

Next, in Section~\ref{sec:measure}, we investigate the performance of
message assembly/disassembly routines based on more compact encodings
of address relations. Since we do not yet have good analytic models for
these encodings, we rely for the moment on measurements of typical
address referencing patterns (HPF whole--array block--cyclic
redistributions) running on some common systems: DEC
Alpha~\cite{DEC-3000}, Intel Paragon, IBM
RS/6000~\cite{RS6000-ARCHITECTURE,PPC601-MICRO}, IBM SP/2~\cite{IBM-SP1}, Gateway
Pentium 90~\cite{PENTIUM-OVERVIEW,PENTIUM-MICRO}, and Micron Pentium
133. The encouraging conclusion is that in most cases, assembly and
disassembly routines based on a highly compact encoding can run at or
near the copy throughput of the memory system.


Finally, Section~\ref{sec:discuss} discusses other implications of
decoding precomputed address relations as well as some open questions
that still need to be addressed before the ultimate usefulness of our
approach can be understood.

\section{Decoding address relations}
\label{sec:relations}

The transfer of data items from the memory of a {\em source} node to
the memory of a {\em destination} node can be characterized with a
binary relation $R$, which is called an {\em address relation}. The
relation $R$ is a set of {\em tuples} $(s_i, d_i) \in R$, where $s_i$
is an address on the source, called a {\em source address}, and $d_i$
is an address on the destination, called a {\em destination address}.

There are two constraints on the elements of an address relation: (1)
source and destination addresses point to valid memory locations, and
(2) destination addresses are unique.\footnote{ In general, it would
be possible to relax this constraint by associating a binary
associative combining operation with each address relation to handle
non--unique destination addresses. However we will assume unique
destination addresses in this paper.} Thus, address relations can
describe arbitrary permutations such as transpose operations, as well
as replication of source data items on the destination, but as 
defined here, they cannot describe operations like reductions or
scans.  For example, the address relation $\{ (1,4), (2,3)\}$
describes a data transfer that takes the pair of elements at locations
1 and 2 on the source, swaps them, and stores them in locations 3 and
4 on the destination.  The relation $\{ (1,1), (1,2)\}$ describes a
data transfer that copies the data item at location 1 on the
source to locations 1 and 2 on the destination.

For a particular data transfer, the message assembly and disassembly
routines must somehow generate the appropriate address relation.
Where the addresses are computed is determined by the message passing
model.  With a {\em send/receive} model, the source node computes the
source addresses and the destination node computes the destination
addresses.  With a {\em put} model, the source node computes the
entire address relation and passes the destination addresses along
with the data.  With a {\em get} model, the destination node computes
the entire address relation and passes the source addresses along with
the data.

%
% At some point, we should mention that the library attempts to support
% all three models by simply providing the facility to compute an
% address relation.  For send/recv, sender and receiver addresses
% are computed on both the sender and the receiver.
%
% This is probably not the right place.
%

One approach for generating the address relation is to recompute the
addresses on the fly in the assembly routine. For example, at a high
level an assembly routine to transfer $n$ data items using a {\em
send/receive} model has the following form:
\begin{verbatim}
    loop i = 1 to n	
        recompute source address s(i)
        buffer[i] = memory[s(i)]
    endloop	
\end{verbatim}
Even for fairly regular access patterns, such as those induced by HPF
whole array assignment statements, recomputing the addresses each time
can be complex.  Parallelizing compilers can help a great deal by
exploiting compile--time knowledge to generate specialized assembly
loops for each data transfer~\cite{stichnoth94,IRREGULAR-MALAGA}.
However if the data access pattern is irregular (and thus unknown to
the compiler), or if a parallelizing compiler is not available, then a
general--purpose run--time library assembly routine is necessary, and
there is a limit to the number of special--case optimizations that
such a library routine can exploit.  Of course, the compiler is also
limited to some number of optimizations, but it is practical to
support a far larger number because optimizations can be selected at
compile--time instead of run--time.


We are exploring an alternative approach that allows a
general--purpose message assembly routine in a run--time library to
run efficiently without any special--case run--time or compile--time
optimizations. The idea is that instead of recomputing the address
relation from scratch during message assembly, the relation is
precomputed and {\em encoded} (i.e., stored in memory in some
form), either at startup time or the first time a particular data
transfer is encountered.  This takes the following form:
\begin{verbatim}
    loop i = 1 to n	
        compute source address s(i) and receiver address r(i)
        encode s(i), r(i) into relation
    endloop	
\end{verbatim}
The encoded address relation is then {\em
decoded} inside the message assembly loop to provide addresses.
The message assembly routine to transfer $n$ data items
using a {\em send/receive} model would have the following form:
\begin{verbatim}
    loop i = 1 to n	
        decode relation to generate source address s(i)
        buffer[i] = memory[s(i)]
    endloop	
\end{verbatim}
Notice that the two assembly loops have a similar form. The intuition
behind the decoding approach is that it should be faster to decode a
precomputed form of the address relation than to recompute it from
scratch each time. In some sense, recomputing address relations from
scratch inside the assembly loop can be thought of as decoding a
null-encoded relation.
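For concreteness, the following Python sketch (purely illustrative;
the names are ours, not those of any library described here)
contrasts the two strategies for a simple stride--$k$ gather.  The
encoded relation is a plain index array of source addresses:
\begin{verbatim}
# Illustrative sketch, not the library's actual code.

def assemble_recompute(memory, n, stride):
    # Recompute each source address inside the assembly loop.
    buffer = []
    for i in range(n):
        s = i * stride              # recompute source address s(i)
        buffer.append(memory[s])
    return buffer

def encode_relation(n, stride):
    # Inspector: precompute the source addresses once.
    return [i * stride for i in range(n)]

def assemble_decode(memory, relation):
    # Executor: decode the stored relation; no address
    # arithmetic remains in the assembly loop.
    return [memory[s] for s in relation]
\end{verbatim}
For a transfer performed $n_i$ times, the cost of the inspector is
paid once, while only the executor loop runs on every transfer.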

The encoding time depends on the relation itself and its encoding in
memory. In the next section we derive a representative set of
relations that capture the basic memory access patterns induced by
regular HPF block--cyclic array assignment statements. We then use
these relations to derive a family of encodings that provide the basis
for the analytic models in Section~\ref{sec:models} and the empirical
results in Section~\ref{sec:measure}.

\section{Representative address relations}
\label{sec:represent}

In general, the elements of an address relation are random.  However in
many cases the address relations induced by operations on distributed
data structures have considerable regularity. For example, consider
the redistribution of a distributed array in an HPF
program.  Each dimension of a regular block--cyclic
distributed array has some size $N$ and is distributed over some
number of nodes $P$ with some fixed block size $M$.
Conceptually, the dimension is split into $\lceil N/M \rceil$ blocks
of size $M$ ($\leq M$ for the final block) which are then distributed
round robin over the $P$ nodes.  The most common regular
block--cyclic distributions are BLOCK($P$), where there are only
$\lceil N/M \rceil = P$ blocks, and CYCLIC($P$), where $M=1$.  Each
dimension of an array can be independently distributed.  Our notation
differs slightly from HPF notation in order to include the number of
nodes.  In particular, our CYCLIC($P$) corresponds to an HPF
CYCLIC(1) over $P$ nodes.

The idea of array redistribution is to change the location (node
and address) of elements of the array so as to change its
distribution.  Because the redistribution may be onto a disjoint group
of nodes, we refer to {\em source} and {\em destination}
distributions and nodes.  Figure~\ref{fig:redist} shows a source
distribution of a 2--dimensional array with a (BLOCK(2),*)
distribution.  The shaded array elements are those that node 0
owns.  The ``*'' indicates a non--distributed dimension.  The same
figure also shows a destination distribution of (CYCLIC(2),*).  To
redistribute the array, for each pair of source and destination
nodes, we compute the intersection of the elements the nodes
own (which is highlighted in the figure) and map the source node
address of each element in the intersection to its address in the
destination node.  Data is gathered from the source node
addresses to a contiguous buffer, communicated, and scattered to the
destination node addresses.

\begin{figure}[htbp]
\centerline{\epsfxsize 3in \epsfbox{example.eps}}
\caption{Redistributing regular block--cyclic distributed arrays}
\label{fig:redist}
\end{figure}

For 1--dimensional BLOCK($P$) and CYCLIC($P$) arrays, only simple
stride--$j$ to stride--$k$ mappings can occur in an address relation
induced by redistribution.  BLOCK($P$) to
BLOCK($P$) redistributions result in unit strides on both the source
and destination nodes while the combination of CYCLIC($P$) and
BLOCK($P$) results in a unit stride on either the source or
destination node and a non--unit stride on the other.  CYCLIC($P$)
to CYCLIC($P$) redistributions result in non--unit strides on both
nodes, as do certain forms of transpose operations.  With more
than one dimension, the access pattern is the Cartesian product of the
access patterns generated for each matching
dimension~\cite{stichnoth93a}.  This can also be seen in
Figure~\ref{fig:redist}.  The basic mapping is stride--2 to stride--1
repeated four times.  This pattern is then repeated stride--8 to
stride--8 16 times.
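The stride structure of these relations can be made concrete with a
small sketch.  The following Python function (illustrative only; the
interface is ours) enumerates the address relation for a
1--dimensional BLOCK($P$) to CYCLIC($P$) redistribution restricted to
a single source/destination node pair.  As claimed above, the result
has a non--unit stride on the source side and a unit stride on the
destination side:
\begin{verbatim}
from math import ceil

def block_to_cyclic_relation(N, P, src_node, dst_node):
    # Tuples (s, d) for a 1-d BLOCK(P) -> CYCLIC(P)
    # redistribution, for one source/destination node pair.
    M = ceil(N / P)        # block size of the source distribution
    tuples = []
    for g in range(N):     # global element index
        if g // M == src_node and g % P == dst_node:
            s = g % M      # local address under BLOCK(P)
            d = g // P     # local address under CYCLIC(P)
            tuples.append((s, d))
    return tuples
\end{verbatim}
For example, with $N=16$ and $P=2$, node 0 to node 0 yields the
tuples (0,0), (2,1), (4,2), (6,3): a stride--2 source pattern mapped
to a stride--1 destination pattern.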

%We place array redistributions of BLOCK($P$) and CYCLIC($P$)
%distributed arrays into four equivalence classes based on whether
%their basic stride-$k$ source and destination reference patterns have
%$k=1$ or $k>1$.
%
% This is confusing since the above paragraph talks about stride-k
% as the receiver stride.
%
We place array redistributions of BLOCK($P$) and CYCLIC($P$)
distributed arrays into four equivalence classes based on whether
their source and destination reference patterns have unit or nonunit
strides. The four two--dimensional array redistributions (of a
$1024\times 1024$ array) of Figure~\ref{fig:repdist} are
representatives of the four classes.  Each distribution is over four
nodes.  We include a data transpose operation instead of a
CYCLIC($P$) to CYCLIC($P$) redistribution to demonstrate that our
technique extends to transposes as well.

\begin{figure}[htbp]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{Redistribution} & \multicolumn{2}{c|}{Assembly} & 
            \multicolumn{2}{c|}{Disassembly} & \multicolumn{2}{c|}{Tuples}  \\ 
\hline
Source & Destination & Read & Write & Read & Write & (0 to all) & (0 to 0)  \\
\hline
(BLOCK(4),*)  & (*,BLOCK(4)) & 1 & 1 & 1 & 1 & 262144 & 65536\\
(BLOCK(4),*)  & (CYCLIC(4),*) & 4 & 1 & 1 & 1 & 262144 & 65536\\
(CYCLIC(4),*) & (BLOCK(4),*) & 1 & 1 & 1 & 4 & 262144 & 65536\\
(*,CYCLIC(4)) & (CYCLIC(4),*)$^T$ & 4 & 1 & 1 & 1024 & 262144 & 65536\\
\hline
\end{tabular}
\end{center}
\caption{Representative redistributions of $1024 \times 1024$ distributed array and basic memory access strides (``T'' represents a transpose)}
\label{fig:repdist}

\end{figure}

\section{Encodings of address relations}
\label{sec:compact}

In this section, we use the four representative address relations from
Section~\ref{sec:represent} (Figure~\ref{fig:repdist}) to drive the
development of compact address relation encodings that can exploit the
regularity of relations induced by redistributions of regular
block--cyclic arrays, while retaining the expressive power to encode
arbitrary relations.

We begin with the \AAPAIR~ encoding, which is a simple index
array that exploits no regularity in the relation, and then introduce
a series of encodings (\AABLK, \DMRLE, and \DMRLEC) that
incrementally exploit more regularity in the relations without loss of
expressive power. For example, \DMRLEC~ exploits more regularity in the
relations than \DMRLE, which exploits more regularity than
\AABLK, which exploits more regularity than the \AAPAIR~ base case.
For each encoding, we place a total order $\leq$ on the tuples
of the address relation:
\begin{displaymath}
(a,b) \leq (c,d) \Leftrightarrow (a < c) \vee \left ( (a = c) \wedge (b \leq d) \right ).
\end{displaymath}
In other words, the tuples are ordered by source node address first,
then by destination node address.  Each encoding is designed so that
it can be used efficiently to assemble and disassemble messages. The
idea is that we repeatedly load some {\em key} from a stored relation,
expand it to a {\em symbol} according to the encoding, and then use
the symbol to gather or scatter some number of data elements.  We do
not consider encodings that are wholly expanded before use.  The
innovation here is not in developing simple data compression
techniques, but in applying these techniques to address relations and
using them during message assembly and disassembly.

Figure~\ref{fig:sizes} shows the total space required for address
relations on source node 0 for each of the representative relations of
Figure~\ref{fig:repdist}.  The sizes are in bytes and measured on the
DEC 3000/400.  Relative sizes are the same on 32-bit platforms, but
the absolute sizes are halved.  Note that the chart has a log scale.
The difference between the \AAPAIR~ and \DMRLEC~ encodings is over
three orders of magnitude. Further, the absolute size of the
\DMRLEC--encoded relations is roughly 1K bytes, which is quite small.

\begin{figure}[htbp]
\centerline{\epsfxsize 3in \epsfbox{sizes.eps}}
\caption{Total size of all address relations on sending node 
zero in each encoding}
\label{fig:sizes}
\end{figure}

The \AAPAIR~encoding (Address, Address PAIR) is simply an array
of source-destination address pairs.  It takes advantage of no memory
access patterns and always requires storage space proportional to the
number of data items transferred.  The keys and symbols for \AAPAIR~are
the tuples themselves.

The \AABLK~encoding (Address, Address BLocK) run-length
encodes~\cite{DATA-COMPRESSION-OVERVIEW} blocks of stride--1 to
stride--1 mappings.  The symbols generated are of the form $(a,b,l)$,
which means that addresses $a$ through $a+l-1$ on the source map to
address $b$ through $b+l-1$ on the destination.  The keys are simply the
symbols. Notice that this is sufficient for the first representative
redistribution in Figure~\ref{fig:repdist}, but not the others.
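As an illustration (a Python sketch under our own naming, not the
library's implementation), \AABLK~ encoding and decoding can be
expressed as follows:
\begin{verbatim}
def encode_aablk(tuples):
    # Run-length encode maximal stride-1 to stride-1 blocks of
    # (source, destination) tuples into (a, b, l) symbols.
    symbols = []
    for s, d in tuples:
        if symbols:
            a, b, l = symbols[-1]
            if s == a + l and d == b + l:   # extends current block
                symbols[-1] = (a, b, l + 1)
                continue
        symbols.append((s, d, 1))           # start a new block
    return symbols

def decode_aablk(symbols):
    # Expand each (a, b, l) symbol back into l tuples.
    return [(a + i, b + i) for a, b, l in symbols for i in range(l)]
\end{verbatim}
A contiguous block of $l$ tuples collapses to a single symbol, so a
pure BLOCK to BLOCK redistribution needs only one symbol per
intersection.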

The \DMRLE~ encoding (Difference Map, Run-Length Encoded)
generalizes \AABLK~ to map blocks of addresses with a stride--$j$ source
to a stride-$k$ destination. To encode the relation in \DMRLE, we form
the difference map sequence $\langle(dx_{i},dy_{i})\rangle$ from the
sequence of tuples $\langle(x_{i},y_{i})\rangle$:
\begin{displaymath}
(dx_i, dy_i) = \left \{ 
\begin{array}{ll}
(x_0,y_0) & i=0 \\
(x_i-x_{i-1},y_i-y_{i-1}) & i > 0
\end{array}
\right .
\end{displaymath}
The difference map is then run-length encoded to produce symbols of
the form $(dx,dy,l)$ which are stored as keys.  Notice that this
encoding significantly reduces space requirements for all four
redistributions of Figure~\ref{fig:repdist} because it encodes any
stride--$j$ to stride--$k$ mapping efficiently.  
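A Python sketch of this encoding (illustrative; the names are ours)
follows directly from the two definitions above:
\begin{verbatim}
def encode_dmrle(tuples):
    # Difference map: the first tuple is kept absolute; later
    # tuples become deltas from their predecessors.
    diffs = [tuples[0]] + [(x - px, y - py)
                           for (px, py), (x, y)
                           in zip(tuples, tuples[1:])]
    # Run-length encode the difference map into (dx, dy, l).
    symbols = []
    for dx, dy in diffs:
        if symbols and symbols[-1][:2] == (dx, dy):
            dx0, dy0, l = symbols[-1]
            symbols[-1] = (dx0, dy0, l + 1)
        else:
            symbols.append((dx, dy, 1))
    return symbols

def decode_dmrle(symbols):
    # Prefix-sum the expanded difference map to recover tuples.
    x = y = 0
    tuples = []
    for dx, dy, l in symbols:
        for _ in range(l):
            x, y = x + dx, y + dy
            tuples.append((x, y))
    return tuples
\end{verbatim}
A stride--$j$ to stride--$k$ block of any length collapses to a
single $(j,k,l)$ symbol.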

The \DMRLE~ encoding efficiently encodes the basic stride--$j$
to stride--$k$ mapping, but for multidimensional arrays this will be
repeated many times.  For a redistribution of a $D$-dimensional array,
each of whose dimensions is of size $N$, the mapping (i.e., the
symbol) for dimension $d$ will be repeated $O(N^{D-d-1})$ times.
Clearly, a Huffman-encoding~\cite{HUFFMAN-CODES} of the symbols would
greatly reduce their storage cost.  However, Huffman codes are of
variable length, so keys could span word boundaries and be expensive
to decode during message assembly.  Instead, we adopt a dictionary
substitution encoding~\cite{DATA-COMPRESSION-OVERVIEW} of the symbols
as fixed length keys.  The key length is the minimum power of two bits
necessary to represent all the unique symbols.  This scheme is
efficient for message assembly because keys cannot span word
boundaries and there is always the same number of keys per word.

To encode in \DMRLEC~(Difference Map, Run-Length Encoded, Compressed)
the relation is first encoded in \DMRLE ~and all unique $(dx,dy,l)$
symbols are found via hashing.  These symbols are stored in a table of
size $2^k$ for the minimum $k$ possible.  Each symbol is encoded as a
$k$-bit key representing its offset in the table.  As can be seen
from Figure~\ref{fig:sizes} this technique reduces the storage cost of
the encoded relations by two orders of
magnitude over \DMRLE.  Overall, \DMRLEC~is three to four orders of
magnitude more space efficient than \AAPAIR.
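The dictionary substitution step can be sketched in Python as
follows (illustrative only; the packing of keys into words is
omitted for clarity):
\begin{verbatim}
def encode_dmrlec(symbols):
    # Replace each (dx, dy, l) symbol with a fixed-width key
    # indexing a table of the unique symbols.  The key width is
    # rounded up to a power of two bits so that keys never span
    # word boundaries.
    table = list(dict.fromkeys(symbols))  # unique, first-seen order
    bits = max(1, (len(table) - 1).bit_length())
    while bits & (bits - 1):              # round up to power of two
        bits += 1
    index = {sym: i for i, sym in enumerate(table)}
    return table, bits, [index[sym] for sym in symbols]

def decode_dmrlec(table, keys):
    # Expand each key back into its symbol by table lookup.
    return [table[k] for k in keys]
\end{verbatim}
Because the per--dimension symbols of a multidimensional
redistribution repeat many times, the key stream is far smaller than
the symbol stream it replaces.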

\section{Message assembly performance models}
\label{sec:models}

This section introduces some models that characterize the tradeoffs 
between decoding precomputed address relations or recomputing them on 
the fly (which we will refer to as decoding vs recomputing).  In 
particular, the models predict {\em threshold}, {\em break--even}, and 
{\em speedup}.  {\em Threshold} is the minimum number of 
instructions at which decoding can produce tuples in the relation 
faster than recomputing.  In other words, the threshold tells 
us if decoding the relation will ever be faster than recomputing it.  
For those cases where threshold is defined, {\em break--even} tells us 
the number of times a relation must be reused before the one--time 
cost of encoding the relation is amortized, and {\em speedup} is the 
maximum improvement that is possible.  

The models are designed for a {\em put} message passing model, where 
the source node generates the entire relation.  Only the \AAPAIR~ 
encoding is modeled because its cost is independent of the regularity
of the relation.
Validation on both the Intel iWarp and Intel Paragon systems finds 
close agreement between the predicted and actual values.  An important 
observation is that the threshold values are very low, on the order of 
one instruction on the iWarp and 16 on the Paragon.  In other words, 
if it is not possible to produce a tuple every instruction cycle on 
the iWarp, or every 16 instruction cycles on the Paragon, then 
decoding wins over recomputing.

\subsection{Algorithm and machine properties}

Given some address relation, $n_g$ is the average number of machine
instructions required to generate a tuple in the relation (either when
encoding or recomputing the relation).  We model the generation of
address relation tuples as instruction--issue--rate limited and as
providing the address relation in an on--line manner.  Further, we
assume that tuple generation is integer--based and performed in
registers.

Additionally, we must know $r_i$, the number of integer instructions
the machine can issue per second.  Since memory accesses are at the
heart of message assembly, we characterize the memory system by four
values: $r_{rc}$ and $r_{wc}$, the read and write bandwidths for
unit--stride accesses, respectively; and $r_{rr}$ and $r_{wr}$, the
read and write bandwidths for random memory accesses. These
parameters, along with the others in the models, are listed in
Figure~\ref{fig:parms}.
\begin{figure}[htbp]
\centering

\begin{tabular}{|l|l|}
\hline
\multicolumn{2}{|c|}{Problem parameters}\\ 
\hline	
\hline
$n$      & number of data words in the message \\
$n_i$    & number of times the data transfer is performed\\
\hline
\hline
\multicolumn{2}{|c|}{Algorithm parameters}\\ 
\hline	
\hline
$n_{ac}$ & number of loop overhead instructions during decoding\\
$n_{au}$ & number of loop overhead instructions during recomputing\\
$n_g$    & number of instructions required to compute a tuple \\
$n_o$    & number of loop overhead instructions during encoding\\
\hline
\hline
\multicolumn{2}{|c|}{Machine parameters}\\ 
\hline	
\hline
$r_{i}$  & integer instructions/sec the machine can issue\\
$r_{rc}$ & read bandwidth for unit--stride memory accesses\\
$r_{rr}$ & read bandwidth for random memory accesses\\
$r_{wc}$ & write bandwidth for unit--stride memory accesses\\
$r_{wr}$ & write bandwidth for random memory accesses\\
\hline
\end{tabular}

\caption{Summary of model parameters}
\label{fig:parms}
\end{figure}
 
\subsection{Model for recomputing address relations}

The time to recompute an address relation during message assembly
has two components: $t_g$, the time to compute the address relation;
and $t_{au}$, the message assembly time:
\begin{displaymath}
t_{recompute} = t_g + t_{au}
\end{displaymath}
The time to compute the address relation is largely determined by the
complexity of the algorithm used, its implementation, and the instruction
execution rate of the machine.  We characterize the algorithm in terms
of $n_g$, the number of integer instructions executed per tuple of the
address relation.  Thus for an $n$ word message, the time required is
\begin{displaymath}
t_g = \frac{n n_g}{r_i}
\end{displaymath}
The address computation code generates source/destination address
tuples $(s,r)$ in registers in an on-line manner.  For each of these
tuples, we read the data word at address $s$, write $r$ to the message
buffer, then write the data word to the message buffer.  Note that the
writes to the message buffer are to consecutive addresses, so their
performance can be captured by $r_{wc}$, the unit--stride write
bandwidth of the machine.  On the other hand, the source side
addresses of the tuples may not be contiguous, so we capture their
performance with $r_{rr}$, the random read bandwidth of the machine.
Furthermore, we capture the loop overhead as a number of instructions,
$n_{au}$.  Thus the time to copy the data is
\begin{displaymath}
t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + 
\frac{n_{au}}{r_i}
          \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.  If
the compiler software--pipelines the assembly loop, we should be close
to the above expression.  Substituting into the expression for
$t_{recompute}$, we get an average assembly latency of 
\begin{equation}
 t_{recompute} = n \left ( \frac{n_g+n_{au}}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
\label{equ:uncached}
\end{equation}

\subsection{Model for decoding address relations}

The average latency for decoding precomputed address relations has
three components: $t_g$, the time to compute the address relation;
$t_o$, the time to store the address relation in memory, and $t_{ac}$,
the message assembly time using the stored address relation.  The
average latency is
\begin{displaymath}
 t_{decode} = \frac{t_g + t_o}{n_i} + t_{ac}
\end{displaymath}
Note that the first two components are amortized over the number of
times the communication is performed, $n_i$.  The term $t_g$ is the
same as in the recomputing case and was determined in the preceding
section.  We assume the \AAPAIR~ encoding
from Section~\ref{sec:compact}.  As each tuple of the address relation
is generated, we write it to memory.  Thus storing the address
relation involves two contiguous writes for each in-register tuple
produced by the address relation computation.  In addition, we execute
$n_o$ loop overhead instructions, so
\begin{displaymath}
 t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
\end{displaymath}
When assembling the message from the stored address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the destination address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, so the time to
assemble a message with $n$ data words is
\begin{displaymath}
 t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes, but we continue the assumption of the
preceding section.  Substituting into the expression for $t_{decode}$,
we see that an $n$ word communication between two nodes requires time
\begin{equation}
 t_{decode} = n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} +
	\frac{n_{ac}}{r_i} + \frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
	\frac{2}{r_{wc}} \right )
\label{equ:cached}
\end{equation}
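The store and assembly loops modeled above can be sketched in C.  This is
an illustrative sketch only: the tuple layout and the names
\verb.aapair_t., \verb.store_relation., and \verb.assemble. are ours, the
address computation shown is the identity relation (the one used in the
validation experiments below), and the message buffer is shown as two
parallel arrays for type clarity rather than as one interleaved buffer.

```c
#include <stddef.h>

/* Sketch of an AAPAIR-style tuple: a (source, destination) address
   pair, stored with two contiguous writes.  Layout is illustrative. */
typedef struct { size_t src; size_t dst; } aapair_t;

/* Store phase (cost t_o): as each tuple is generated -- here by the
   identity relation, costing n_g instructions per tuple -- write it
   to memory with two contiguous writes. */
void store_relation(aapair_t *rel, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        rel[i].src = i;            /* contiguous write */
        rel[i].dst = i;            /* contiguous write */
    }
}

/* Assembly phase (cost t_ac): per tuple, two contiguous reads of the
   tuple, one random read of the data word, and two writes of the
   (destination address, data word) pair into the message buffer. */
void assemble(const aapair_t *rel, size_t n, const double *data,
              size_t *msg_dst, double *msg_val)
{
    for (size_t i = 0; i < n; i++) {
        size_t s = rel[i].src;     /* contiguous read */
        size_t d = rel[i].dst;     /* contiguous read */
        msg_val[i] = data[s];      /* random read, then write */
        msg_dst[i] = d;            /* write */
    }
}
```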

\subsection{Threshold, break--even, and speedup}

Equations~\ref{equ:uncached} and~\ref{equ:cached} can help us answer
the following questions: For a message assembly step that occurs many
times at run--time, can decoding a precomputed address relation ever
outperform simply recomputing it?  If so, how many times must the
relation be reused before the decoding strategy wins?  Finally, what
is the ultimate speedup of the decoding strategy over the recomputing
strategy?

Decoding can only be effective if we can assemble the message faster
using the precomputed relation than by recomputing it on the fly. In
other words, we require that
\begin{displaymath}
t_{ac} < t_g + t_{au}
\end{displaymath}
or
\begin{equation}
% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
\label{equ:necessary}
\end{equation}
The right hand side of equation~\ref{equ:necessary} is a
machine--specific {\em threshold} below which decoding precomputed
address relations is ineffective. Notice that the threshold is largely
determined by the ratio of the instruction issue rate to the memory
read bandwidth.
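As a concrete check, the right-hand side of (\ref{equ:necessary}) can be
evaluated for the machine parameters used in the validation below (the
function name is ours).  With the dynamic iWarp estimates the threshold
is exactly zero, so any $n_g \geq 1$ clears it; with the static Paragon
estimates it is roughly 16 instructions per tuple.

```c
/* Right-hand side of the threshold inequality: decoding a stored
   relation can only win if n_g exceeds this value. */
double threshold(double r_i, double r_rc, double n_ac, double n_au)
{
    return 2.0 * r_i / r_rc + n_ac - n_au;
}
```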

If inequality~\ref{equ:necessary} holds, then we can also predict how
many times the address relation must be decoded in order to break even
with recomputing on the fly.  This {\em break--even} point
is found by setting $t_{decode} = t_{recompute}$ and solving for
$n_i$.  In doing this we find that the break--even point is
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
\right \rceil 
\label{equ:crossover}
\end{equation}
It is important to point out that as $n_g$ moves away from the
threshold, the break--even point decreases sharply because the $n_g$
terms begin to dominate. 

Finally, the ultimate speedup of decoding versus recomputing is
\begin{equation}
\label{equ:speedup}
\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
              + r_ir_{rc}(r_{wc}+2r_{rr})}
             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr}) + 2r_ir_{rr}r_{wc}}
\end{equation}
The speedup is linear in $n_g$, and is thus bounded only by the
complexity of the address computation.
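Equivalently, the speedup can be computed directly from the per--word
latencies of the two strategies, which is how the following helper (name
ours) evaluates it; it assumes, per the model of the preceding section,
that the recompute--on--the--fly loop performs one random read, two
contiguous writes, and $n_{au}$ overhead instructions per data word.

```c
/* Ultimate speedup (t_g + t_au) / t_ac of decoding over recomputing,
   from the per-word latencies of the two message assembly loops. */
double speedup(double n_g, double n_au, double n_ac,
               double r_i, double r_rc, double r_rr, double r_wc)
{
    double t_recompute = (n_g + n_au) / r_i + 1.0 / r_rr + 2.0 / r_wc;
    double t_decode = n_ac / r_i + 2.0 / r_rc + 1.0 / r_rr + 2.0 / r_wc;
    return t_recompute / t_decode;
}
```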

\subsection{Validation on iWarp and Paragon}

This section validates the break--even and speedup models using
measurements from message assembly loops running on the Intel iWarp and
Paragon systems.  For each system, we determine the threshold, use
(\ref{equ:crossover}) to predict the number of iterations necessary
for the decoding approach to break even with the recomputing approach, as
a function of $n_g$, and use (\ref{equ:speedup}) to predict the
ultimate speedup of the decoding approach over the recomputing
approach, also as a function of $n_g$.  In each case, the address
relation is the identity relation, the encoding scheme is \AAPAIR, and
$n_g$, the number of instructions to compute the relation, is varied
by inserting nops into the code stream.  The predictions are then
compared to actual measurements on the hardware.

For the iWarp~\cite{iwarpoverview,iwarpcomm} system,
$r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$ words/second, and a node can
execute $r_i=2 \times 10^7$ instructions/second.  Specifically, the
iWarp clock speed is 20 MHz and all memory operations complete in two
cycles.  There is no cache and we assume no dual dispatch of loads and
stores.

The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
estimated {\em statically} by simply counting the number of
instructions in each loop, or {\em dynamically} by accounting for
bypass stalls that arise in the loops. The static estimates for iWarp
are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.

Figure~\ref{fig:iw}(a) shows the measured and predicted number of
iterations needed on iWarp for the message assembly latency of the
decoding scheme to break even with that of the recomputing scheme for
different $n_g$.  Entries with ``-'' denote that the value is
undefined (i.e., $n_g$ is below the threshold). The predictions made
using the dynamic measurements are exact.  The predictions made using
the static measurements are pessimistic for small $n_g$, but quickly
converge to the actual numbers as $n_g$ increases. Notice also
that the predicted threshold (using the dynamic estimates) is
only one instruction cycle!

\begin{figure}[htbp]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|c|} 
\hline
\multicolumn{4}{|c|}{Break-even}\\ 
\hline
$n_g$       & Actual         & Predicted         & Predicted \\
            &                & (static)          &  (dynamic) \\    
\hline
   1        &         13     &        -         &        13 \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &          3        &        3\\
\hline
\end{tabular} 
&
\begin{tabular}{|c|c|c|c|} 
\hline
\multicolumn{4}{|c|}{Speedup}\\ 
\hline
$n_g$       & Actual         & Predicted         & Predicted \\
            &                & (static)          &  (dynamic) \\    
\hline
   1       &         1.05    &         1.00     &         1.07 \\
   2       &         1.12    &         1.09     &         1.15 \\
   3       &         1.22    &         1.16     &         1.23 \\
   4       &         1.28    &         1.25     &         1.32 \\
   5       &         1.35    &         1.33     &         1.39 \\
   6       &         1.43    &         1.41     &         1.47 \\
   7       &         1.49    &         1.49     &         1.54 \\
   8       &         1.59    &         1.59     &         1.61 \\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{(a) Break--even (Eqn. \protect{\ref{equ:crossover}})
and 
(b) speedup (Eqn. \protect{\ref{equ:speedup}}) on iWarp
}
\label{fig:iw}
\end{figure}

Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
the decoding scheme over the recomputing scheme on iWarp. As in
Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for
all $n_g$ and the static predictions are pessimistic for small $n_g$.
However, it is important to note that even with static measurements,
the predictions only become inaccurate when $n_g$ approaches the
threshold.  Another important point is that the purpose of
Figure~\ref{fig:iw}(b) is to help validate the models when $n_g$ is
small, and not to highlight impressive speedups. In fact, the speedups
grow arbitrarily large as $n_g$ increases (see
Equation~\ref{equ:speedup}).

A Paragon node contains a 50 MHz pipelined, multiple--issue RISC
processor.  We can estimate the instruction
issue rate by noting that we do not use the floating point pipeline.
Therefore we can issue at most $r_i=5 \times 10^7$
instructions/second.  We directly measured the memory parameters and
found $r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.  Only
a static measurement (instruction counting) of the implementation
properties was performed, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.

Figure~\ref{fig:86}(a) shows, for different $n_g$, the measured and
predicted number of iterations needed for decoding to break even with
recomputing.  The predictions are pessimistic, but quickly converge to
the actual values with increasing $n_g$.  Notice that $n_g$ must be
much larger to permit break--even on the Paragon than on the iWarp.  The
predicted threshold is 16 instructions/tuple and the actual
threshold is 15 instructions/tuple.  This is not surprising given
that the Paragon's instruction issue rate is much higher than its
memory bandwidth, while the iWarp's issue rate is well matched to its
memory bandwidth.

\begin{figure}[htbp]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|} 
\hline
\multicolumn{3}{|c|}{Break--even}\\ 
\hline
  $n_g$ & Actual & Predicted (static) \\
\hline

    14     &        -    &        -\\
    15     &        -    &        -\\
    16     &        19      &        -\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
&
\begin{tabular}{|c|c|c|} 
\hline
\multicolumn{3}{|c|}{Speedup}\\ 
\hline
  $n_g$ & Actual & Predicted (static) \\
\hline
    14     &        -   &         -\\
    15     &        -   &         -\\
    16     &       1.03     &         -\\
    17     &       1.06     &         1.01\\
    18     &       1.09     &         1.02\\
    19     &       1.12     &         1.03\\
    20     &       1.15     &         1.04\\
    21     &       1.17     &         1.05\\
    22     &       1.20     &         1.06\\
    23     &       1.22     &         1.07\\
    24     &       1.24     &         1.08\\
    25     &       1.26     &         1.09\\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{(a) Break--even (Eqn. \protect{\ref{equ:crossover}})
and 
(b) speedup (Eqn. \protect{\ref{equ:speedup}}) on Paragon
}
\label{fig:86}
\end{figure}

Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
decoding over recomputing on the Paragon. Notice that the speedup does
not increase as quickly with $n_g$ as on the iWarp
(Figure~\ref{fig:iw}(b)).  This is the result of the lower memory
bandwidth relative to instruction issue rate on the Paragon.  The
predictions are not as accurate as for the iWarp, reflecting the fact
that the Paragon's memory system is more complex.  However, the predictions
are pessimistic and quickly converge to the correct values.

The significant lesson we can learn from the models developed in this 
section is that the threshold for the decoding approach is quite low 
(1 instruction/tuple on the iWarp and 16 instructions/tuple on the 
Paragon).  In practice, the time to recompute the relation on the fly 
for simple and regular access patterns, such as those induced by 
BLOCK($P$) and CYCLIC($P$) distributions, falls beneath this 
threshold, and communication based on either approach has roughly the 
same performance.  However, as the access pattern becomes even 
slightly less regular, communication performance using the decoding 
approach becomes attractive.  For example, we have measured the time 
to redistribute block--cyclic HPF arrays (including message 
assembly/disassembly and the actual data transfers) and found that the 
decoding approach was a factor of two faster than the recomputing 
approach~\cite{dinda95}.

\section{Performance using compact encodings}
\label{sec:measure}

The performance models that we developed for the \AAPAIR--encoded
relations in the previous section shed some light on the conditions
under which decoding the relation makes sense. However, the \AAPAIR~encoding
is quite memory--inefficient, requiring space proportional to the
number of data items being assembled.  In this section we investigate
the performance of message assembly and disassembly based on the more
compact \AABLK, \DMRLE, and \DMRLEC~ encodings.

Performance for the more compact encodings is difficult to model 
analytically since it depends heavily on the access patterns in the 
relation, unlike the case with the \AAPAIR ~encoding.  Therefore, we 
will rely on measurements for some representative relations running on 
a variety of common computer systems.  The performance measure in all 
cases is the absolute copy throughput (in MBytes/sec) of the assembly 
and disassembly routines.  By copy throughput, we mean the number of 
MBytes that are packed into (unpacked out of) the message buffer each 
second during message assembly (disassembly).
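To see why the access pattern matters, consider a hypothetical
run--length tuple of the kind the compact encodings exploit (an
illustration of the idea only, not the actual \DMRLE~or \DMRLEC~format):
one tuple describes a whole strided run of address pairs, so the
tuple--decoding overhead is amortized over the run length, and
throughput rises with the regularity of the relation.

```c
#include <stddef.h>

/* Hypothetical run-length tuple: describes `count` address pairs
   (src + i*src_stride, dst + i*dst_stride).  Illustrative only. */
typedef struct {
    size_t src, dst;
    size_t src_stride, dst_stride;
    size_t count;
} run_t;

/* Decode one run: the cost of reading the tuple is paid once and
   amortized over `count` data words. */
size_t decode_run(const run_t *r, const double *data, double *msg)
{
    for (size_t i = 0; i < r->count; i++)
        msg[r->dst + i * r->dst_stride] = data[r->src + i * r->src_stride];
    return r->count;
}
```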

%
% Actually, library is designed for all three comm models,
% but if we explain that here, it'll be confusing as to why we
% choose send/recv (or better yet ``no addresses in buffer'')
% model instead of put, as we modelled.
%
The measurements are taken from a portable library, entirely written 
in C and designed to be used with a conventional message passing system 
based on the {\em send/receive} model.  The address relations used are 
those induced between source node zero and destination node zero for 
each of the four representative block--cyclic redistributions of 
Figure~\ref{fig:repdist}.  The reason for measuring node zero to node 
zero is to guarantee that the largest address relation is used, 
although the address relations between each pair of nodes are of 
identical size for the representative redistributions.  The 
distributed array contains $1024 \times 1024$ 8-byte doubles, so the 
zero to zero relation assembly and disassembly steps copy 512 KBytes.  
A larger $2048 \times 2048$ array (2048 KBytes assembled/disassembled) 
was also tested with only minor changes in measured copy throughput.

The identical C code was compiled and measured on the DEC 3000/400,
the Intel Paragon, the IBM RS/6000 250, the IBM SP-2, a Gateway
Pentium 90, and a Micron Pentium 133.  We did not test the code on the
Intel iWarp due to memory constraints.  The details of each testing
environment are given in Figure~\ref{fig:machines}.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{|l|l|l|l|}
\hline
Machine & Node &  Compiler & Operating System\\
\hline
DEC 3000/400 & Alpha (21064 @ 133 MHz) & GNU C 2.6.3, DEC CC 3.11 
    & OSF/1 2.0 \\
Intel Paragon & i860 (i860XP @ 50 MHz) & Intel C R4.5 & OSF/1 1.0.4 \\
IBM RS/6000 250 & PowerPC (601 @ 80 MHz) & IBM XLC 1.3.0.4 & AIX 3.2 \\
IBM SP-2 & POWER2 (66 MHz) & IBM XLC 1.3.0.33 & AIX 3 \\
Gateway Pentium 90 & i386 (Pentium @ 90 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
Micron Pentium 133 & i386 (Pentium @ 133 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
\hline
\end{tabular}
\end{center}
\caption{Machines tested}
\label{fig:machines}
\end{figure}
Figures \ref{fig:alpha} through \ref{fig:p133} show the results.  Each
figure shows the assembly throughput on the left and the disassembly
throughput on the right.  Each chart measures the absolute throughput
on the same scale, so that results can be compared across machines.  
Except for the Pentium machines, only the throughput of the
\AAPAIR~(as a baseline) and \DMRLEC~encodings are shown.  This
is because the story is the same for these machines: \DMRLEC~is always
the fastest compact encoding.  The situation is more complex on the
Pentium machines, so the throughputs of all four encodings are shown.
Superimposed on each graph are annotated lines that represent the
throughput of the library \verb.memcpy(). routine and the throughput
of C-coded strided copies of doubles.  A line annotated
``stride--$k$'' on an assembly graph indicates the throughput of a
copy loop that copies from a stride of $k$ doubles to a stride of 1,
while on a disassembly graph it indicates the throughput of a copy loop
that copies from a stride of 1 to a stride of $k$
doubles.\footnote{The final paper will also show stride--1024
measurements in each graph for comparison to the disassembly
throughput of the transpose.}
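The stride--$k$ baselines are produced by copy loops of the following
shape (a sketch of the benchmark loop; the function name is ours, and
the timing harness is omitted):

```c
#include <stddef.h>

/* Assembly-side baseline: gather n doubles at stride k into a
   contiguous buffer ("stride-k" on an assembly graph).  Swapping the
   strides of src and dst gives the disassembly-side scatter loop. */
void stride_copy(const double *src, size_t k, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * k];
}
```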

The principal observation is that, for most of the machines, message
assembly and disassembly using the \DMRLEC~encoding operates at
rates that are at or close to the copy throughput possible for their
memory reference patterns.  The remainder of this section analyzes
each machine in turn.

On the DEC 3000/400, \DMRLEC~gives ideal assembly and disassembly
throughputs for each of the test cases, as can be seen in
Figure~\ref{fig:alpha}.  \DMRLEC~benefits from the separate
load/store pipeline in the Alpha 21064
processor~\cite{ALPHA-MICRO,DEC-ALPHA-21064}.  Even on a load stall,
integer instructions associated with \DMRLEC~can continue to issue.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{alpha_assem.eps} & \epsfxsize 2.5in 
\epsfbox{alpha_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput (MBytes/sec) of (a) message assembly and (b) message 
disassembly using  \AAPAIR~ and \DMRLEC~ on DEC 3000/400}
\label{fig:alpha}
\end{figure}

On the Paragon, \DMRLEC~gives near ideal assembly throughput for all
the test cases, as can be seen from Figure~\ref{fig:i860}.  The
disassembly results are somewhat unexpected, however.  Although the
first test case (stride--1 writes) operates with ideal throughput, the
third test case (stride--4 writes) actually does slightly better than a
stride--4 copy loop.  We attribute this to different code generation
by the compiler for the disassembly loop and the stride--4
copy loop.  As can be seen by the factor of two difference between a
compiler generated stride--1 copy loop and the library \verb.memcpy().
routine, there is considerable room for improving on the stride--$k$
copy loops generated by the Paragon node compiler.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{i860_assem.eps} & \epsfxsize 2.5in 
\epsfbox{i860_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput of message (a) assembly and (b) disassembly using
\AAPAIR~and \DMRLEC~on Paragon}
\label{fig:i860}
\end{figure}


The RS/6000 250 also achieves near ideal assembly and
disassembly throughput with \DMRLEC, as can be seen in
Figure~\ref{fig:ppc}.  The IBM XLC compiler has software-pipelined
each of the assembly, disassembly, and strided copy loops.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{ppc_assem.eps} & \epsfxsize 2.5in 
\epsfbox{ppc_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput (MBytes/sec) of (a) message assembly and 
(b) message disassembly using \AAPAIR~and \DMRLEC~on IBM RS/6000 Model 250}
\label{fig:ppc}
\end{figure}

The throughput measurements of message assembly and disassembly using
\DMRLEC~on the IBM SP-2 are also close to ideal, tracking the 
throughput of the strided copy loops closely.  The IBM
XLC compiler has software--pipelined assembly, disassembly, and each of
the strided copy loops.  Because the POWER2 processor, like the
Alpha, can continue to issue integer instructions while a load miss is
being handled, instructions associated with \DMRLEC~are free.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{pwr2_assem.eps} & \epsfxsize 2.5in 
\epsfbox{pwr2_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput (MBytes/sec) of (a) message assembly and 
(b) message disassembly using \AAPAIR~ and \DMRLEC~on IBM SP-2}
\label{fig:pwr2}
\end{figure}

The performance of the Pentium systems is more complex than that of
the other machines, as can be seen in Figure~\ref{fig:p90}.  The
assembly and disassembly throughput of \DMRLEC~is lower than that of
\DMRLE~on this system.  For the first test case, \AABLK~is much better than
either of the other compact encodings.  We believe that three factors
contribute to this result.  First, the GNU C compiler does not support
Pentium optimizations, so clearly, no scheduling is done to improve
the multiple issue rate.  Second, the Pentium is both register-starved
and non-orthogonal.  There is a significant number of instructions in
the message assembly loop that do nothing but move values associated
with the compact encodings to and from the appropriate registers for
certain operations.  In other cases, memory operands are used where
registers would appear with a larger register set.  Finally, although
the latency of memory operands is shortened by the on--chip cache, the
cache is not actually dual ported, but banked, so cache references can
become serialized. Note that even with this complexity, \DMRLEC~always
does better than \AAPAIR~on the Pentium.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p90_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p90_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput (MBytes/sec) of (a) message assembly and (b) 
message disassembly using \AAPAIR, \AABLK, \DMRLE, and \DMRLEC~ on 
Gateway Pentium 90 system}
\label{fig:p90}
\end{figure}


As can be seen from Figure~\ref{fig:p133}, the performance of the
Micron Pentium 133 is complex in the same way as that of the Gateway
system (Figure~\ref{fig:p90}).  Message assembly and disassembly appear
to be instruction issue limited for \DMRLEC~on both machines: because the
ratio of memory copy bandwidth (as measured by \verb.memcpy().) to
instruction issue rate is higher on the Micron, it is not surprising
that it performs comparatively worse in message assembly and
disassembly using \DMRLEC.
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{cc}
\epsfxsize 2.5in \epsfbox{p133_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p133_disassem.eps} \\
(a) & (b) \\
\end{tabular}
\end{center}
\caption{Throughput (MBytes/sec) of (a) message assembly and (b) message
disassembly using \AAPAIR, \AABLK, \DMRLE, and \DMRLEC~on 
Micron Pentium 133 system}
\label{fig:p133}
\end{figure}

\section{Discussion}
\label{sec:discuss}

The preceding sections demonstrated that the decoding approach to message 
assembly can outperform the recomputing approach for many relations.  
We introduced address relation encodings that significantly reduce the 
space required to store common address relations.  The message 
assembly and disassembly throughputs of these encodings were measured 
on several machines and found to be on par with the copy bandwidth.  
In this section we describe other advantages of the decoding approach, 
arguing that stored, encoded address relations are a powerful 
interface to communication systems.  Additionally, we note that even 
with the compact address relations discussed in this paper, an address 
relation caching strategy is still useful.

By taking address computation out of the critical path of data 
transfer, the decoding approach makes message assembly throughput 
depend on the address relation and the memory access pattern it 
encodes instead of how the address relation is computed.  One 
advantage of this is that it makes it easier to achieve reasonable 
communication performance without a large number of special--case 
optimizations in the compiler and run--time library.  Instead of 
optimizing each address computation algorithm for each architecture, 
only message assembly for each address relation encoding needs to be 
optimized for each architecture in order for all data transfers to 
benefit.

Another advantage to the decoding approach is that the entire relation 
is available to be optimized at a new level of the software hierarchy.  
For example, by ordering the relation by destination node addresses 
first, we can optimize for reference locality on the destination node 
instead of on the source node.  As shown in~\cite{ISCA-MEMORY}, the 
choice is architecture--dependent.

Stored, encoded address relations appear to be a good interface to the 
communication system because they are implementation neutral.  This 
permits communication systems to be highly specialized for the 
available hardware while still presenting the same interface to the 
programmer or compiler.  However, to best utilize the hardware, it may 
be necessary to change the encoding of the address relation before 
using it.  For example, the relation may be transformed into a DMA 
gather map on the source to make best use of a network adapter such as 
the one described in~\cite{GNECTAR-IFIP}.  It is unclear whether this
is the optimal approach, since the main processor could presumably
support better encodings than the network adapter.

Even with the compact address relation encodings described
in this paper, it is still useful to be able to bound the amount of
memory used to store address relations.  Address relations can still
have huge space requirements if they do not have regularity to
exploit.  This leads naturally to a caching approach: when the
allotted amount of memory is in use and a new relation needs to be 
stored, a victim relation must be chosen and flushed from memory to
make room.  User or compiler-supplied hints would be helpful to
minimize cache thrashing.
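A minimal sketch of such a cache's victim selection (all names are ours,
and the least--recently--used policy is only one plausible choice; the
hints mentioned above could override it):

```c
#include <stddef.h>

#define NSLOTS 4

/* One cached, encoded address relation. */
typedef struct {
    int    valid;      /* slot in use? */
    size_t bytes;      /* encoded size of the stored relation */
    long   last_used;  /* logical timestamp for LRU victim selection */
} slot_t;

/* Pick the slot to flush when a new relation must be stored: a free
   slot if one exists, otherwise the least recently used relation. */
int choose_victim(const slot_t slots[NSLOTS])
{
    int victim = -1;
    long oldest = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (!slots[i].valid)
            return i;              /* free slot: nothing to flush */
        if (victim < 0 || slots[i].last_used < oldest) {
            victim = i;
            oldest = slots[i].last_used;
        }
    }
    return victim;
}
```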


\section{Conclusions and open issues}
\label{sec:conclude}

We have shown that on a wide variety of machines, message assembly and
disassembly with address relation reuse using the \DMRLEC~address
relation encoding can support arbitrary address relations while
operating at near copy throughput for address relations induced by
redistributions of regular block--cyclic distributed arrays.  A side
effect is that the storage requirement for these relations is
drastically reduced to a practical level.  

There are many areas for future exploration.  First, there is the 
question of how well the address relation encodings we introduced will 
work on more complex distributions and operations.  We are currently 
implementing support for more distribution types in the library that 
was measured and will be able to measure the performance of the 
compact encodings on them.  A second area is developing and 
implementing more compact and general-purpose address relation 
encodings.  We will implement a Huffman-encoded~\cite{HUFFMAN-CODES} 
\DMRLE~encoding as mentioned in Section~\ref{sec:compact}.  We are 
also exploring using LZW~\cite{LZ,LZW} or a variant thereof for further 
compression.  Early experiments with the \verb.gzip. Lempel--Ziv--based 
compressor show dramatic compression, but we are not confident that 
LZW-based decompression can be made to run at copy-bandwidth speeds.

We are also interested in how much the performance of message assembly 
and disassembly depends on single-node compilers.  Many parallel 
compilers generate sequential source code and rely on the single-node 
compiler to produce good results.  However, as can be seen from the 
factor of two difference between a compiler--generated copy loop and 
the library \verb.memcpy().  routine on the Paragon 
(Figure~\ref{fig:i860}), this reliance may be unwarranted.  If this is 
true, then message assembly with address relation reuse is even more 
advantageous because the assembly routine for each encoding could be 
low--level coded for each architecture, while low--level coding would 
be impractical for the on-line address computation approach due to the 
many variants of assembly routines.  Instead, a parallel compiler 
could precompute address relations at compile--time, and rely on 
low--level coded library message assembly routines to achieve near 
copy throughput.




\comment{
\subsection*{ACKNOWLEDGEMENTS}

We would like to thank Thomas Stricker for illuminating the finer
points of deposit model communication, James Stichnoth for his work
and explanations of array assignments between regular block-cyclic
distributed arrays, and Ed Segall for providing the impetus to look
at run--time approaches to fast message assembly.
}

\small
\bibliography{/afs/cs/project/iwarp/member/droh/bib/refs}

\end{document}
