\input{caption}
\makeatletter\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}
}}
\makeatother
\newcommand{\comment}[1]{}

%\documentstyle[times,epsf,11pt]{article}
\documentstyle[times,epsf,twocolumn]{article}

\pagestyle{empty}
\setlength {\topmargin}{-.5in}
\setlength {\headheight}{0in}
\setlength {\footheight}{0in}
\setlength {\footskip}{1.25in}
\setlength {\textheight}{9.00in}


\setlength {\oddsidemargin}{-.25in}
\setlength {\textwidth}{7.00in}
\setlength {\columnsep}{.33in}
\setlength {\parskip}{8pt plus 2pt minus 1pt}
\setlength {\parindent}{4ex}



\def\_{\rule{.3em}{.15ex}}  % Get underscore by typing \_.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%  Line Spacing (e.g., \ls{1} for single, \ls{2} for double, even \ls{1.5})
%%

\newcommand{\ls}[1]
   {\dimen0=\fontdimen6\the\font
    \lineskip=#1\dimen0
    \advance\lineskip.5\fontdimen5\the\font
    \advance\lineskip-\dimen0
    \lineskiplimit=.9\lineskip
    \baselineskip=\lineskip
    \advance\baselineskip\dimen0
    \normallineskip\lineskip
    \normallineskiplimit\lineskiplimit
    \normalbaselineskip\baselineskip
    \ignorespaces
   }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\ls{1.19}

\comment{
\textheight =  9.0in
\topmargin = -.5 in 
%\oddsidemargin = -.25 in
%\evensidemargin = -.25 in
\oddsidemargin = 0 in
\evensidemargin = 0 in
\textwidth=6.5in
}

\def\AAPAIR{AAPAIR}
\def\AABLK{AABLK}
\def\DMRLE{DMRLE}
\def\DMRLEC{DMRLEC}

\renewcommand{\dbltopfraction}{.6}
\renewcommand{\topfraction}{.6}
\renewcommand{\textfraction}{.4}
\renewcommand{\floatpagefraction}{.8}
%\renewcommand{\dbltopnumber}{1}
\renewcommand{\dblfloatpagefraction}{.8}


\begin{document}
\bibliographystyle{acm}

\title{
%\begin{flushleft}
%\vspace{-0.9in}
%\scriptsize {\em 
%Sigmetrics96 Draft
%}
%\end{flushleft}
%\vspace{0.4in}
%\Large\bf 
Fast Message Assembly Using Compact Address Relations
}
%\author{\mbox{Paper \#537.98}}
\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron \\
Carnegie Mellon University\\
5000 Forbes Avenue \\
Pittsburgh, PA 15213 \\
\verb!(pdinda,droh)@cs.cmu.edu! \\
}

\date{}
\maketitle 

\unmarkedfootnote{
Supported in part by the Advanced Research Projects
Agency/CSTO monitored by SPAWAR under contract N00039-93-C-0152, in
part by the National Science Foundation under Grant ASC-9318163, and
in part by a grant from the Intel Corporation.
}
{\footnotetext[0]{
Permission to make digital or hard copies of part or all of 
this work for personal or classroom use is granted without 
fee provided that copies are not made or distributed for 
profit or commercial advantage and that copies bear this 
notice and the full citation on the first page. Copyrights 
for components of this work owned by others than ACM must be 
honored. Abstracting with credit is permitted. To copy 
otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee. 

\vspace{0.3cm}
\noindent
SIGMETRICS 96-5/96 Philadelphia, PA, USA \\
\copyright 1996 ACM  
}}
   
\Large
{\bf Abstract}
\normalsize

Message assembly and disassembly represent a significant fraction of
 total communication time in many parallel systems.  We introduce a
run--time approach for fast message assembly and disassembly.
The approach is based on generating addresses by decoding a
precomputed and compactly stored address relation that describes the
mapping of addresses on the source node to addresses on the
destination node.  The main result is that relations induced by
redistributions of regular block--cyclic distributed arrays can be
encoded in an extremely compact form that facilitates high throughput
message assembly and disassembly.  We measure the throughput of
decoding--based message assembly and disassembly on several systems
and find performance on par with copy throughput.


\section{Introduction}
\label{sec:intro}

Many parallel programs perform operations on distributed data
structures by using a message passing system to transfer data items
from noncontiguous addresses in the memory of a source node to
noncontiguous addresses in the memory of a destination node.  These
programs typically amortize the fixed overhead of transferring a
message by packing a collection of data items into a contiguous
message buffer, which the message passing system then transfers over
the network as a single message.  When the message arrives at the
destination, the message is copied from the network to a buffer, which
the user program then unpacks to the appropriate memory locations.  We
refer to the process of packing the data as {\em message assembly} and
to unpacking as {\em message disassembly}.

On machines with fast communication systems, the performance of the
message assembly and disassembly steps can have a significant impact
on the overall communication performance of a program.  In such
systems, the memory system tends to be the bottleneck rather than the
communication system~\cite{STRICKER-OPT-MEM-PERF-COMM-INPROCEEDINGS}.
The number of such systems is increasing with the advent of high speed
networks~\cite{STEENKISTE-HOST-INTERFACE-DESIGN-JOURNAL}.  Further,
even on machines with slow communication systems, overly slow message
assembly and disassembly can lead to disappointing overall
performance.

The message assembly and disassembly operations needed to implement
distributed data structures such as distributed arrays are nontrivial
and the running times of different algorithms can vary significantly.
For example, the running time of the assembly step for a High
Performance Fortran (HPF)~\cite{hpf93} block--cyclic assignment
statement can vary by several orders of magnitude, depending on the
nesting of the assembly loops~\cite{gupta94}.  As another example,
because of asymmetry in the Cray T3D memory system, a transpose
operation implemented with contiguous reads and strided writes runs
twice as fast as the equivalent operation implemented with strided
reads and contiguous writes~\cite{STRICKER-OPT-MEM-PERF-COMM-INPROCEEDINGS}.


\comment{
The general problem addressed here is how to perform fast message
assembly and disassembly in the context of a run--time communication
library.  The library should be fast in the sense that data items
should be transferred between their original locations and the message
buffer at or near the copy throughput of the memory system for the
memory reference pattern. Furthermore, since the assembly and
disassembly routines are in a library, they can rely only on run--time
information.
}

This paper describes a purely run--time,
inspector/executor~\cite{saltz91,AGRAWAL-CHAOS-BLOCK-STRUCT-JOURNAL}
approach for message assembly/disassembly and provides an initial
characterization of its performance on real systems.  In our approach,
when a data transfer is first encountered, the inspector precomputes
the mapping of source node addresses to destination node addresses and
stores this {\em address relation} in some encoding.  The executor
actually performs the transfer by reading and decoding the stored
address relation to generate the addresses needed to assemble the
message on the source node and disassemble it on the destination node.
The intuition is that reading and decoding the address relation is
cheaper than recomputing it on the fly and that by repeatedly using a
stored relation, we can amortize the initial cost of computing and
encoding it.

Our innovation is to encode the address relation in new, compact ways
that facilitate fast message assembly/disassembly executors --- they
are ``fast'' in the sense that data transfer rates to and from the
message buffer approach the copy throughput of the memory system for
the reference pattern.  In previous
work~\cite{AR-CACHING-LCR-INPROCEEDINGS} and in the extended version
of this paper~\cite{FAST-ASSEMBLY-ADDRESS-RELATIONS-TECHREPORT}, we
derive and validate a model that predicts, for a simple index array
encoding, that the regime where the inspector/executor approach is
faster than recomputing address relations on the fly is quite large,
and even extends to cases where computing a member of the relation
requires only a few instructions.  Here, we concentrate on the
executor and develop encodings that exploit the attributes of
relations induced by redistributions of HPF regular, block--cyclically
distributed arrays to achieve low storage costs and fast message
assembly and disassembly.

\comment{
 Our approach is based on
the notion of decoding a precomputed {\em address relation} that
describes the mapping of addresses on the source node to addresses on
the destination node.  The run--time model is a classic
inspector/executor --- our innovation is to encode the address
relation in new, compact ways.  For each data transfer between a
source and a destination, the appropriate address relation is
precomputed and stored in some encoding.  The stored encoding of the
address relation is decoded by the message assembly and disassembly
routines each time data is transferred.  As with any
inspector/executor scheme, the aim is to exploit the fact that many
parallel programs repeatedly perform the same data transfers. The
focus of this paper is on the performance of the executor (i.e., the
assembly or disassembly step.)
}

We begin with an overview of address relations and approaches to
generating them in Section~\ref{sec:relations}.  The next section,
Section~\ref{sec:represent}, characterizes the address relations
induced by redistributions of block--cyclic arrays, points out the
regularities these relations exhibit, and describes four representative
redistributions.  Section~\ref{sec:compact} introduces address
relation encodings that can encode arbitrary relations, but exploit
the regularities we describe.
%Want this to be one paragraph
\comment{
In Section~\ref{sec:models} we consider a simple but
memory--inefficient encoding of address relations (i.e., an index
array). For this encoding we derive a model that predicts, given
measurable machine parameters, the regimes where decoding address
relations, rather than recomputing them each time, can improve the
average message assembly latency.  For those cases where decoding the
relations is effective, we extend the model to predict the break--even
point (i.e.  how many times does an address relation have to be reused
to amortize the cost of precomputing it) and the upper bound on the
speedup.  Validation on the Intel iWarp~\cite{iwarpoverview,iwarpcomm}
and Paragon~\cite{PARAGON-XPS-MANUAL} systems suggests that the models
have good predictive power. Surprisingly, the results indicate that
even with the simple index array encoding, if recomputing an address
in the assembly loop involves more than a few instructions (1 on the
iWarp, 16 on the Paragon), then decoding a precomputed address
relation can do better.  Further, beyond this threshold, the cost of
precomputing the address relation can be quickly amortized.  }
%One paragraph
Next, in Section~\ref{sec:measure}, we measure the performance of
message assembly/disassembly routines based on the different compact
encodings of address relations, on each of the four
representative redistributions of Section~\ref{sec:represent}.
Results are given for several common systems: DEC
Alpha~\cite{DEC-3000-JOURNAL}, Intel
Paragon~\cite{PARAGON-XPS-MANUAL}, IBM
RS/6000~\cite{RS6000-ARCHITECTURE-JOURNAL}, IBM SP/2~\cite{IBM-SP1}, Gateway
Pentium 90~\cite{PENTIUM-OVERVIEW-INPROCEEDINGS}, and Micron Pentium
133. The encouraging conclusion is that in most cases, assembly and
disassembly routines based on a highly compact encoding can run at or
near the copy throughput of the memory system for the memory reference
pattern of the relation.  Finally, Section~\ref{sec:discuss} discusses
other implications of decoding precomputed address relations as well
as some open questions that still need to be addressed before the
ultimate usefulness of our approach can be understood.

\section{Generating address relations}
\label{sec:relations}

The transfer of data items from the memory of a {\em source} node to
the memory of a {\em destination} node can be characterized with a
binary relation $R$, which is called an {\em address relation}. The
relation $R$ is a set of {\em tuples} $(s_i, d_i) \in R$, where $s_i$
is an address on the source, called a {\em source address}, and $d_i$
is an address on the destination, called a {\em destination address}.

There are two constraints on the elements of an address relation: (1)
source and destination addresses point to valid memory locations, and
(2) destination addresses are unique.\footnote{ In general, it would
be possible to relax this constraint by associating a binary
associative combining operation with each address relation to handle
non--unique destination addresses. However, we will assume unique
destination addresses in this paper.} Thus, address relations can
describe arbitrary permutations such as transpose operations, as well
as replication of source data items on the destination, but as 
defined here, they cannot describe operations like reductions or
scans.  
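The uniqueness constraint on destination addresses can be stated
operationally (a C sketch with our own \verb!tuple_t! type, shown only
for illustration; replicating a source item to several destinations is
allowed, but two tuples with the same destination are not):

```c
#include <stdbool.h>
#include <stddef.h>

/* One tuple (s_i, d_i) of an address relation (our own type name). */
typedef struct { size_t s, d; } tuple_t;

/* Check constraint (2): destination addresses are unique.
   O(n^2) for clarity, not performance. */
bool destinations_unique(const tuple_t *rel, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (rel[i].d == rel[j].d)
                return false;
    return true;
}
```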
\comment{For example, the address relation $\{ (1,4), (2,3)\}$
describes a data transfer that takes the pair of elements at locations
1 and 2 on the source, swaps them, and stores them in locations 3 and
4 on the destination.  The relation $\{ (1,1), (1,2)\}$ describes a
data transfer that copies the data item at location 1 on the
source to locations 1 and 2 on the destination.
}


For a particular data transfer, the message assembly and disassembly
routines must somehow generate the appropriate address relation.
Where the addresses are computed is determined by the message passing
model.  With a {\em send/receive} model, the source node computes the
source addresses and the destination node computes the destination
addresses.  With a {\em put} model, the source node computes the
entire address relation and passes the destination addresses along
with the data.  With a {\em get} model, the destination node computes
the entire address relation and sends the source addresses to
the source, which replies with the data.

%
%  Pseudocode instead of verbatim would save space
%  Also, boxes around would be nice.
%

One approach for generating the address relation is to recompute the
addresses on the fly in the assembly routine. For example, at a high
level an assembly routine to transfer $n$ data items using a {\em
send/receive} model has the following form:
\small
\begin{verbatim}
    loop i = 1 to n	
        recompute source address s(i)
        buffer[i] = memory[s(i)]
    endloop	
\end{verbatim}
\normalsize
Even for fairly regular access patterns, such as those induced by HPF
whole array assignment statements, recomputing the addresses each time
can be complex.  Parallelizing compilers can help a great deal by
exploiting compile--time knowledge to generate specialized assembly
loops for each data transfer~\cite{stichnoth94,IRREGULAR-MALAGA}.
However, if the data access pattern is irregular (and thus unknown to
the compiler), or if a parallelizing compiler is not available, then a
general--purpose run--time library assembly routine is necessary, and
there is a limit to the number of special--case optimizations that
such a library routine can exploit.  Of course, a parallelizing
compiler is also limited to some number of optimizations, but it is
practical to support a far larger number because optimizations can be
selected at compile--time instead of run--time.

We are exploring an alternative approach, based on
inspector/executor~\cite{saltz91,AGRAWAL-CHAOS-BLOCK-STRUCT-JOURNAL},
that allows a general--purpose message assembly routine in a run--time
library to run efficiently without any special--case run--time or
compile--time optimizations. The idea is that instead of recomputing
the address relation from scratch during message assembly, the
relation is precomputed and {\em encoded} (i.e., stored in memory in
some form), either at startup time or the first time a particular data
transfer is encountered.  This inspector phase takes the following
form:
\small
\begin{verbatim}
    loop i = 1 to n	
        compute source address s(i)
          and destination address d(i)
        encode s(i), d(i) into relation
    endloop	
\end{verbatim}
\normalsize
The encoded address relation is then {\em decoded} inside the message
assembly loop to provide addresses.  The message assembly routine (the
executor) to transfer $n$ data items using a {\em send/receive} model
would have the following form:
\small
\begin{verbatim}
    loop i = 1 to n	
        decode relation to generate 
          source address s(i)
        buffer[i] = memory[s(i)]
    endloop	
\end{verbatim}
\normalsize
Depending on the message passing model (send/receive, put, or get),
the assembly and disassembly loops will have a different form, but a
library can support all three forms by simply providing the facility
to compute an address relation.

Notice that the two assembly loops above have a similar form. The
intuition behind the decoding approach is that it should be faster to
decode a precomputed form of the address relation than to recompute it
on the fly each time. 
%In some sense, recomputing address relations
%from scratch inside the assembly loop can be thought of as decoding a
%null--encoded relation.
The decoding time depends on the relation itself and its encoding in
memory. 
\comment{
In the next section we derive a representative set of
relations that capture the basic memory access patterns induced by
regular HPF block--cyclic array assignment statements. We then use
these relations to derive a family of encodings that provide the basis
%for the analytic models in Section~\ref{sec:models} and 
for the empirical results in Section~\ref{sec:measure}.
}

\section{Representative address relations}
\label{sec:represent}

%
% Arb, except that it must be onto
%
In general, the elements of an address relation are arbitrary.
However, in many cases the address relations induced by operations on
distributed data structures have considerable regularity. For example,
consider the redistribution of a distributed array in an HPF program.
Each dimension of a regular block--cyclic distributed array has some
size $N$ and is distributed over some number of nodes $P$ with some
fixed block size $M$.  Conceptually, the dimension is split into
$\lceil N/M \rceil$ blocks of size $M$ (the final block has size
$N - M (\lceil N/M \rceil -1) \leq M$), which are then distributed
round robin over the $P$ nodes.  The most common regular block--cyclic
distributions are BLOCK, where there are only $\lceil N/M \rceil = P$
blocks, and CYCLIC, where $M=1$.  Each dimension of an array can be independently
distributed.
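For concreteness, the owner node and local address offset of a global
index under one dimension of such a distribution can be computed
directly (a C sketch; the function names are ours and not part of any
library):

```c
/* Owner node of global index i for one dimension of a regular
   block-cyclic distribution: blocks of size M are dealt round robin
   over P nodes (function names are ours). */
int bc_owner(int i, int M, int P) {
    return (i / M) % P;            /* global block number, mod nodes */
}

/* Local index of global element i on its owner node. */
int bc_local(int i, int M, int P) {
    int block = i / M;             /* global block number */
    int local_block = block / P;   /* owner's blocks preceding it */
    return local_block * M + i % M;
}
```

With $M=1$ this reduces to CYCLIC (owner $= i \bmod P$), and with
$\lceil N/M \rceil = P$ to BLOCK.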

The idea of array redistribution is to change the location (node and
address) of elements of the array so as to alter its distribution.
Because the redistribution may be onto a disjoint group of nodes, we
refer to {\em source} and {\em destination} distributions and nodes.
Figure~\ref{fig:redist} shows a source distribution of a
2--dimensional array with a (BLOCK,*) distribution (over two
processors).  The shaded array elements are those that node 0 owns.
The ``*'' indicates a non--distributed dimension.  The same figure
also shows a destination distribution of (CYCLIC,*) (also over two
processors).  To redistribute the array, for each pair of source and
destination nodes, we compute the intersection of the elements the
nodes own (which is highlighted in the figure) and map the source node
address of each element in the intersection to its address in the
destination node.  Data is gathered from the source node addresses to
a contiguous buffer, communicated, and scattered to the destination
node addresses.

\begin{figure}[tbph]
\centerline{\epsfxsize 2.5in \epsfbox{example.eps}}
\caption{Redistributing regular block--cyclic arrays}
\label{fig:redist}
\end{figure}

For 1--dimensional BLOCK and CYCLIC arrays, only simple stride--$j$
to stride--$k$ mappings can occur in an
address relation induced by redistribution.  BLOCK to BLOCK
redistributions result in unit strides on both the source and
destination nodes while the combination of CYCLIC and BLOCK results in
a unit stride on either the source or destination node and a non--unit
stride on the other.  CYCLIC to CYCLIC redistributions result in
non--unit strides on both nodes, as do certain forms of transpose
operations.  With more than one dimension, the access pattern is the
Cartesian product of the access patterns generated for each matching
dimension~\cite{stichnoth93a}.  This can also be seen in
Figure~\ref{fig:redist}.  The basic mapping is stride--2 to stride--1,
repeated four times.  This pattern is itself then repeated 16 times,
stride--8 to stride--8.

We place array redistributions of BLOCK and CYCLIC distributed arrays
into four equivalence classes based on whether their basic source and
destination reference patterns have unit or non--unit strides. The four
two--dimensional redistributions of a $1024\times 1024$ array in
Figure~\ref{fig:repdist}, each over four nodes, are representatives of
the four classes.  We include a
data transpose operation instead of a CYCLIC to CYCLIC redistribution
to demonstrate that our technique extends to transposes as well.

\begin{figure*}[tbph]
\begin{center}
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{Redistribution} & \multicolumn{2}{c|}{Assembly} & 
            \multicolumn{2}{c|}{Disassembly} & \multicolumn{2}{c|}{Tuples}  \\ 
\hline
Source & Destination & Read & Write & Read & Write & (0 to all) & (0 to 0)  \\
\hline
(BLOCK,*)  & (*,BLOCK) & 1 & 1 & 1 & 1 & 262144 & 65536\\
(BLOCK,*)  & (CYCLIC,*) & 4 & 1 & 1 & 1 & 262144 & 65536\\
(CYCLIC,*) & (BLOCK,*) & 1 & 1 & 1 & 4 & 262144 & 65536\\
(*,CYCLIC) & (CYCLIC,*)$^T$ & 4 & 1 & 1 & 1024 & 262144 & 65536\\
\hline
\end{tabular}
\normalsize
\end{center}
\caption{Representative redistributions of $1024 \times 1024$ distributed array (four processors), basic memory access strides, and number of tuples.  (``T'' represents a transpose)}
\label{fig:repdist}

\end{figure*}

\section{Encoding address relations}
\label{sec:compact}

In this section, we use the four representative address relations from
Figure~\ref{fig:repdist} to drive the development of compact address
relation encodings that can exploit the regularity of relations
induced by redistributions of regular block--cyclic arrays, while
retaining the expressive power to encode arbitrary relations.

We begin with the \AAPAIR~ encoding, which is a simple index
array that exploits no regularity in the relation, and then introduce
a series of encodings (\AABLK, \DMRLE, and \DMRLEC) that
incrementally exploit more regularity in the relations without loss of
expressive power. 
%For example, \DMRLEC~ exploits more regularity in the
%relations than \DMRLE, which exploits more regularity than
%\AABLK, which exploits more regularity than the \AAPAIR~ base case.
For each encoding, we place a total order $\leq$ on the tuples
of the address relation:
\begin{displaymath}
(s_i,d_i) \leq (s_j,d_j) \Leftrightarrow (s_i <  s_j) \vee ((s_i = s_j) \wedge (d_i
\leq d_j)).
\end{displaymath}
In other words, the tuples are ordered by source node address first,
then by destination node address.  The encodings we describe are
also readily adaptable to ordering by destination addresses first.
Because decoding generates addresses in this order, the
destination--first order may be preferable on machines that perform
strided stores poorly. Each encoding is designed so that
it can be used efficiently to assemble and disassemble messages. The
idea is that we repeatedly load some {\em key} from a stored relation,
expand it to a {\em symbol} according to the encoding, and then use
the symbol to gather or scatter some number of data elements.  We do
not consider encodings that are wholly expanded before use.  The
innovation here is not in developing simple data compression
techniques, but in applying these techniques to address relations and
using them during message assembly and disassembly.
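The total order above is an ordinary lexicographic comparison, which
can be expressed as a qsort--style comparator (a C sketch; the struct
layout is ours):

```c
#include <stddef.h>

typedef struct { size_t s, d; } tuple_t;

/* Lexicographic order on tuples: source address first, then
   destination address, matching the displayed total order. */
int tuple_cmp(const void *pa, const void *pb) {
    const tuple_t *a = pa, *b = pb;
    if (a->s != b->s) return (a->s < b->s) ? -1 : 1;
    if (a->d != b->d) return (a->d < b->d) ? -1 : 1;
    return 0;
}
```

Sorting the inspector's tuples with this comparator yields the order
assumed by all the encodings below.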

Figure~\ref{fig:sizes} shows the total space required for address
relations on source node 0 for each of the representative relations of
Figure~\ref{fig:repdist}.  The sizes are in bytes and measured on the
DEC 3000/400, which is a 64--bit platform.  Relative sizes are the
same on 32--bit platforms, but the absolute sizes are halved.  Note
that the chart has a log scale.  The difference between the \AAPAIR~
and \DMRLEC~ encodings is over three orders of magnitude. Further, the
absolute size of the \DMRLEC--encoded relations is roughly 1~Kbyte,
which is quite small.

\begin{figure}[htbp]
\centerline{\epsfxsize 2.5in \epsfbox{sizes.eps}}
\caption{Total size of all address relations on sending node 
zero in each encoding}
\label{fig:sizes}
\end{figure}

The \AAPAIR~encoding (Address, Address PAIR) is simply an array
of source-destination address pairs.  It takes advantage of no memory
access patterns and always requires storage space proportional to the
number of data items transferred.  The keys and symbols for \AAPAIR~are
the tuples themselves.
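As a concrete illustration, the \AAPAIR~executors are a gather (for
assembly) and a scatter (for disassembly) driven by the stored pairs
(a C sketch; the types and interface are ours, with a double--word
payload assumed):

```c
#include <stddef.h>

/* One stored (source, destination) pair (our own layout). */
typedef struct { size_t s, d; } aapair_t;

/* Assembly executor: gather n data items from memory into a
   contiguous message buffer, following the stored pairs in order. */
void aapair_assemble(const aapair_t *rel, size_t n,
                     const double *memory, double *buffer) {
    for (size_t i = 0; i < n; i++)
        buffer[i] = memory[rel[i].s];   /* "decode" = load stored s(i) */
}

/* Disassembly executor: scatter the buffer to the destinations. */
void aapair_disassemble(const aapair_t *rel, size_t n,
                        const double *buffer, double *memory) {
    for (size_t i = 0; i < n; i++)
        memory[rel[i].d] = buffer[i];
}
```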

The \AABLK~encoding (Address, Address BLocK) run-length
encodes~\cite{DATA-COMPRESSION-OVERVIEW-TECHREPORT} blocks of stride--1 to
stride--1 mappings.  The symbols generated are of the form $(s,d,l)$,
which means that addresses $s$ through $s+l-1$ on the source map to
address $d$ through $d+l-1$ on the destination.  The keys are simply the
symbols. Notice that this is sufficient for the first representative
redistribution in Figure~\ref{fig:repdist}, but not the others.
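The \AABLK~inspector can be sketched as a single pass that folds each
maximal stride--1/stride--1 run of the ordered tuples into one
$(s,d,l)$ symbol (our own C sketch; the output array is assumed large
enough):

```c
#include <stddef.h>

/* One AABLK symbol (s, d, l): l contiguous words starting at source
   address s map to l contiguous words starting at destination d. */
typedef struct { size_t s, d, l; } aablk_t;

/* Run-length encode stride-1/stride-1 blocks of the ordered tuples
   (src[i], dst[i]).  Worst case: n symbols.  Returns symbol count. */
size_t aablk_encode(const size_t *src, const size_t *dst, size_t n,
                    aablk_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (k > 0 &&
            src[i] == out[k-1].s + out[k-1].l &&
            dst[i] == out[k-1].d + out[k-1].l) {
            out[k-1].l++;                      /* extend current run */
        } else {
            out[k].s = src[i];
            out[k].d = dst[i];
            out[k].l = 1;
            k++;
        }
    }
    return k;
}
```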

The \DMRLE~ encoding (Difference Map, Run-Length Encoded)
generalizes \AABLK~ to map blocks of addresses with a stride--$j$ source
to a stride--$k$ destination. To encode the relation in \DMRLE, we form
the difference map sequence $\langle(ds_{i},dd_{i})\rangle$ from the
ordered sequence of tuples $\langle(s_{i},d_{i})\rangle$:
\begin{displaymath}
(ds_i, dd_i) = \left \{ 
\begin{array}{ll}
(s_0,d_0) & i=0 \\
(s_i-s_{i-1},d_i-d_{i-1}) & i > 0
\end{array}
\right .
\end{displaymath}
The difference map is then run-length encoded to produce symbols of
the form $(ds,dd,l)$ which are stored as keys.  Notice that this
encoding significantly reduces space requirements for the three
redistributions where \AABLK~ fails, and performs only marginally worse
for the fourth.   
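A \DMRLE~inspector can be sketched as follows: take successive
differences of the ordered tuples, then run--length encode equal
$(ds,dd)$ pairs, keeping the first (absolute) tuple as its own symbol
(a C sketch under our own conventions; the output array is assumed
large enough):

```c
#include <stddef.h>

/* One DMRLE symbol (ds, dd, l): the difference pair (ds, dd) repeats
   l times.  The first symbol holds the absolute base (s_0, d_0). */
typedef struct { long ds, dd; size_t l; } dmrle_t;

/* Form the difference map of the ordered tuples and run-length
   encode it.  Returns the number of symbols. */
size_t dmrle_encode(const long *src, const long *dst, size_t n,
                    dmrle_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        long ds = (i == 0) ? src[0] : src[i] - src[i-1];
        long dd = (i == 0) ? dst[0] : dst[i] - dst[i-1];
        /* i > 1 keeps the absolute base from absorbing a run */
        if (k > 0 && i > 1 && out[k-1].ds == ds && out[k-1].dd == dd)
            out[k-1].l++;
        else {
            out[k].ds = ds;
            out[k].dd = dd;
            out[k].l = 1;
            k++;
        }
    }
    return k;
}
```

For a stride--4 to stride--1 pattern this yields just two symbols, the
base and one run, regardless of the run's length.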

The \DMRLE~ encoding efficiently encodes the basic stride--$j$ to
stride--$k$ mapping, but for multidimensional arrays this will be
repeated many times.  Because the symbols for inner
dimensions occur more frequently than those for outer dimensions, a
Huffman encoding~\cite{HUFFMAN-CODES-JOURNAL} of the symbols could
further reduce their storage cost.  However, Huffman codes are of
variable length, so keys could span word boundaries and be expensive
to decode during message assembly.  Instead, we adopt a dictionary
substitution encoding~\cite{DATA-COMPRESSION-OVERVIEW-TECHREPORT} of
the symbols as fixed--length keys.  The key length is the smallest
power--of--two number of bits needed to index all the unique symbols.  This
scheme is efficient for message assembly because keys cannot span word
boundaries and there is always the same number of keys per word.

To encode in \DMRLEC~(Difference Map, Run-Length Encoded, Compressed),
the relation is first encoded in \DMRLE~ and all unique $(ds,dd,l)$
symbols are found via hashing.  These symbols are stored in a table of
size $2^k$ for the minimum $k$ possible.  Each symbol is encoded as a
$k$-bit key representing its offset in the table.  As can be seen
from Figure~\ref{fig:sizes}, this technique reduces the storage cost of
the relations induced by the representative redistributions by two orders of
magnitude over \DMRLE.  Overall, \DMRLEC~is three to four orders of
magnitude more space efficient than \AAPAIR.
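The key--width rule can be sketched as follows (our own code, not the
implementation measured here): round the number of bits needed to
index the symbol table up to a power of two, so that fixed--length
keys pack evenly into words and never span word boundaries.

```c
/* DMRLEC key-width rule (a sketch): smallest power-of-two number of
   bits sufficient to index 'nsyms' unique symbols. */
unsigned dmrlec_key_bits(unsigned long nsyms) {
    unsigned bits = 1;
    while ((1UL << bits) < nsyms)    /* bits needed to index table */
        bits++;
    while (bits & (bits - 1))        /* round up to a power of two */
        bits++;
    return bits;
}
```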

%\section{Message assembly performance models}
%\label{sec:models}

%This section introduces some models that characterize the tradeoffs
%between decoding precomputed address relations or recomputing them on
%the fly (which we will refer to as decoding vs. recomputing).  In
%particular, the models predict {\em threshold}, {\em break--even}, and
%{\em speed--up}.  {\em Threshold} is the minimum number of
%instructions at which decoding can produce tuples in the relation
%faster than recomputing.  In other words the threshold tells us if,
%for a particular relation, decoding the relation will ever be faster
%than recomputing it.  For those cases where threshold is defined, {\em
%break--even} tells us the number of times a relation must be reused
%before the one--time cost of encoding the relation is amortized, and
%{\em speedup} is the maximum improvement that is possible.

%The models are designed for a {\em put} message passing model, where 
%the source node generates the entire relation.  Only the \AAPAIR~ 
%encoding is used because it is independent of the relation.
%Validation on both the Intel iWarp and Intel Paragon systems finds 
%close agreement between the predicted and actual values.  An important 
%observation is that the threshold values are very low, on the order of 
%one instruction on the iWarp and 16 on the Paragon.  In other words, 
%if it is not possible to produce a tuple every instruction cycle on 
%the iWarp, or every 16 instruction cycles on the Paragon, then 
%decoding \AAPAIR~ wins over precomputing.

%\subsection{Algorithm and machine properties}

%Given some address relation, $n_g$ is the average number of machine
%instructions required to generate a tuple in the relation (either when
%encoding or recomputing the relation).  We model the generation of
%address relation tuples as instruction--issue--rate limited and as
%providing the address relation in an on--line manner.  Further, we
%assume that tuple generation is integer--based and performed in
%registers.

%Additionally, we must know $r_i$, the number of integer instructions
%the machine can issue per second.  Since memory accesses are at the
%heart of message assembly, we characterize the memory system by four
%values: $r_{rc}$ and $r_{wc}$, the read and write bandwidths for
%unit--stride accesses, respectively; and $r_{rr}$ and $r_{wr}$, the
%read and write bandwidths for random memory accesses. These
%parameters, along with the others in the models, are listed in
%Figure~\ref{fig:parms}.
%\begin{figure}[htbp]
%\centering

%\begin{tabular}{|l|l|}
%\hline
%\multicolumn{2}{|c|}{Problem parameters}\\ 
%\hline	
%\hline
%$n$      & data words in the message \\
%$n_i$    & times the data transfer is performed\\
%\hline
%\hline
%\multicolumn{2}{|c|}{Algorithm parameters}\\ 
%\hline	
%\hline
%$n_{ac}$ & loop overhead instructions during decoding\\
%$n_{au}$ & loop overhead instructions during recomputing\\
%$n_g$    & instructions required to compute a tuple \\
%$n_o$    & loop overhead instructions during encoding\\
%\hline
%\hline
%\multicolumn{2}{|c|}{Machine parameters}\\ 
%\hline	
%\hline
%$r_{i}$  & integer instructions/sec the machine can issue\\
%$r_{rc}$ & read bandwidth for unit--stride memory accesses\\
%$r_{rr}$ & read bandwidth for random memory accesses\\
%$r_{wc}$ & write bandwidth for unit--stride memory accesses\\
%$r_{wr}$ & write bandwidth for random memory accesses\\
%\hline
%\end{tabular}

%\caption{Summary of model parameters}
%\label{fig:parms}
%\end{figure}
 
%\subsection{Model for recomputing address relations}

%The time to recompute an address relation during message assembly
%has two components: $t_g$, the time to compute the address relation;
%and $t_{au}$, the message assembly time:
%\begin{displaymath}
%t_{recompute} = t_g + t_{au}
%\end{displaymath}
%The time to compute the address relation is largely determined by the
%complexity of algorithm used, its implementation, and the instruction
%execution rate of the machine.  We characterize the algorithm in terms
%of $n_g$, the number of integer instructions executed per tuple of the
%address relation.  Thus for an $n$ word message, the time required is
%\begin{displaymath}
%t_g = \frac{n n_g}{r_i}
%\end{displaymath}
%The address computation code generates source/destination address
%tuples $(s,r)$ in registers in an on-line manner.  For each of these
%tuples, we read the data word at address $s$, write $r$ to the message
%buffer, then write the data word to the message buffer.  Note that the
%writes to the message buffer are to consecutive addresses, so their
%performance can be captured by $r_{wc}$, the unit--stride write
%bandwidth of the machine.  On the other hand, the source side
%addresses of the tuples may not be contiguous, so we capture their
%performance with $r_{rr}$, the random read bandwidth of the machine.
%Furthermore, we capture the loop overhead as a number of instructions,
%$n_{au}$.  Thus the time to copy the data is
%\begin{displaymath}
%t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + 
%\frac{n_{au}}{r_i}
%          \right ) 
%\end{displaymath}
%Note that this expression does not take into account cache thrashing
%between the reads and writes.  An implementation will probably
%optimize access patterns by reading several data items at a time.  If
%the compiler software--pipelines the assembly loop, we should be close
%to the above expression.  Substituting into the expression for
%$t_{recompute}$, we get an average assembly latency of 
%\begin{equation}
% t_{recompute} = n \left ( \frac{n_g+n_{au}}{r_i} +
% \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
%\label{equ:uncached}
%\end{equation}

%\subsection{Model for decoding address relations}

%The average latency for decoding precomputed address relations has
%three components: $t_g$, the time to compute the address relation;
%$t_o$, the time to store the address relation in memory, and $t_{ac}$,
%the message assembly time using the stored address relation.  The
%average latency is
%\begin{displaymath}
% t_{decode} = \frac{t_g + t_o}{n_i} + t_{ac}
%\end{displaymath}
%Note that the first two components are amortized over the number of
%times the communication is performed, $n_i$.  $t_g$ is the same as for
%the case where we recomputed the relation on the fly and was
%determined in the preceding section.  We assume the \AAPAIR~ encoding
%from Section~\ref{sec:compact}.  As each tuple of the address relation
%is generated, we write it to memory.  Thus storing the address
%relation involves two contiguous writes for each in-register tuple
%produced by the address relation computation.  In addition, we execute
%$n_o$ loop overhead instructions, so
%\begin{displaymath}
% t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
%\end{displaymath}
%When assembling the message from the stored address relation, for each
%tuple in the message, we first perform two consecutive reads to read a
%tuple of the address relation.  Then, we perform a random read
%to read the data word.  Finally, we perform two consecutive writes to
%write the data word and the destination address into the message buffer.
%We also execute $n_{ac}$ loop overhead instructions, thus the time to
%assemble a message with $n$ data words is
%\begin{displaymath}
% t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
%\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
%\end{displaymath}
%Note that this expression does not take into account cache thrashing
%between the reads and writes, but we continue the assumption of the
%preceding section.  Substituting into the expression for $t_{decode}$,
%we see that an $n$ word communication between two nodes requires time
%\begin{equation}
% t_{decode} = n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} +
%	\frac{n_{ac}}{r_i} + \frac{2}{r_{rc}} \\
%       + \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
%\label{equ:cached}
%\end{equation}

%\subsection{Threshold, break--even, and speedup}

%Equations \ref{equ:uncached} and \ref{equ:cached} can help us answer
%the following questions: For a message assembly step that occurs many
%times at run--time, can decoding a precomputed address relation ever
%outperform simply recomputing it?  If so, how many times must the
%relation be reused before the decoding strategy wins?  Finally, what
%is the ultimate speedup of the decoding strategy over the recomputing
%strategy?

%Decoding can only be effective if we can assemble the message faster
%using the precomputed relation than by recomputing it on the fly. In
%other words, we require that
%\begin{displaymath}
%t_{ac} < t_g + t_{au}
%\end{displaymath}
%or
%\begin{equation}
%% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
%n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
%\label{equ:necessary}
%\end{equation}
%The right hand side of Equation (\ref{equ:necessary}) is a
%machine--specific {\em threshold} below which decoding precomputed
%address relations is ineffective. Notice that threshold is largely
%determined by the ratio of instruction issue rate and memory read
%bandwidth.

%If (\ref{equ:necessary}) is true, then we can also predict
%how many times the address relation must be decoded in order to
%break--even with recomputing on the fly.  This {\em break--even} point
%is found by setting $t_{decode} = t_{recompute}$ and solving for
%$n_i$.  In doing this we find that the break--even point is
%\begin{equation}
% n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
%           \frac{(n_g+n_o)r_{wc} +
%           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
%\right \rceil 
%\label{equ:crossover}
%\end{equation}
%It is important to point out that as $n_g$ moves away from the
%threshold, the break--even point decreases sharply because the $n_g$
%terms begin to dominate. 

%Finally, the ultimate speedup of decoding versus recomputing is
%\begin{equation}
%\label{equ:speedup}
%\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
%              + r_ir_{rc}(r_{wc}+2r_{rr})}
%             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr})}
%\end{equation}
%The speedup is linear in $n_g$, so is bounded only by the complexity 
%of the address computation.

%\subsection{Validation on iWarp and Paragon}

%This section validates the break--even and speedup models using
%measurements from message assembly loops running on the Intel iWarp and
%Paragon system.  For each system, the threshold is defined, so we use
%(\ref{equ:crossover}) to predict the number of iterations necessary
%for the decoding approach to break--even with the recomputing approach, as
%a function of $n_g$, and we use (\ref{equ:speedup}) to predict the
%ultimate speedup of the decoding approach over the recomputing
%approach, also as a function of $n_g$.  In each case, the address
%relation is the identity relation, the encoding scheme is \AAPAIR, and
%$n_g$, the number of instructions to compute the relation, is varied
%by inserting nops into the code stream.  The predictions are then
%compared to actual measurements on the hardware.

%For the iWarp system, $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$ words/second,
%and a node can execute $r_i=2 \times 10^7$ instructions/second.
%Specifically, the iWarp clock speed is 20 MHz and all memory
%operations complete in two cycles.  There is no cache and we assume no
%dual dispatch of loads and stores.

%The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
%estimated {\em statically} by simply counting the number of
%instructions in each loop, or {\em dynamically} by accounting for
%bypass stalls that arise in the loops. The static estimates for iWarp
%are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
%$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.

%Figure~\ref{fig:iw}(a) shows the measured and predicted number of
%iterations needed on iWarp for the message assembly latency of the
%decoding scheme to break even with that of the recomputing scheme for
%different $n_g$.  Entries with ``-'' denote that the value is
%undefined (i.e., $n_g$ is below the threshold). The predictions made
%using the dynamic measurements are exact.  The predictions made using
%the static measurements are pessimistic for small $n_g$, but quickly
%converge to the actual numbers as $n_g$ increases. Notice also
%that the predicted threshold (using the dynamic estimates) is
%only one instruction cycle!

%\begin{figure*}[htbp]
%\centering
%\small
%\begin{tabular}{cc}
%\begin{tabular}{|c|c|c|c|} 
%\hline
%\multicolumn{4}{|c|}{Break-even}\\ 
%\hline
%$n_g$       & Actual         & Predicted         & Predicted \\
%            &                & (static)          &  (dynamic) \\    
%\hline
%   1        &         13     &        -         &        13 \\
%   2        &         7      &         12        &        7\\
%   3        &         5      &          7        &        5\\
%   4        &         4      &          5        &        4\\
%   5        &         4      &          4        &        4\\
%   6        &         3      &          4        &        3\\
%   7        &         3      &          3        &        3\\
%   8        &         3      &          3        &        3\\
%\hline
%\end{tabular} 
%&
%\begin{tabular}{|c|c|c|c|} 
%\hline
%\multicolumn{4}{|c|}{Speedup}\\ 
%\hline
%$n_g$       & Actual         & Predicted         & Predicted \\
%            &                & (static)          &  (dynamic) \\    
%\hline
%   1       &         1.05    &         1.00     &         1.07 \\
%   2       &         1.12    &         1.09     &         1.15 \\
%   3       &         1.22    &         1.16     &         1.23 \\
%   4       &         1.28    &         1.25     &         1.32 \\
%   5       &         1.35    &         1.33     &         1.39 \\
%   6       &         1.43    &         1.41     &         1.47 \\
%   7       &         1.49    &         1.49     &         1.54 \\
%   8       &         1.59    &         1.59     &         1.61 \\
%\hline
%\end{tabular}\\
%(a) & (b) \\
%\end{tabular}
%\caption{(a) Break--even (Eqn. \protect{\ref{equ:crossover}})
%and 
%(b) speedup (Eqn. \protect{\ref{equ:speedup}}) on iWarp
%}
%\label{fig:iw}
%\end{figure*}

%Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
%the decoding scheme over the recomputing scheme on iWarp. As in
%Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for
%all $n_g$ and the static predictions are pessimistic for small $n_g$.
%However, it is important to note that even with static measurements,
%the predictions only become inaccurate when $n_g$ approaches the
%threshold.  Another important point is that the purpose of
%Figure~\ref{fig:iw}(b) is to help validate the models when $n_g$ is
%small, and not to highlight impressive speedups. In fact, the speedups
%grow arbitrarily large as $n_g$ increases (see
%Equation~\ref{equ:speedup}.)

%A Paragon node contains a 50 MHz pipelined, multiple issue RISC node.
%We can estimate the instruction issue rate by noting that we do not
%use the floating point pipeline.  Therefore we can issue at most
%$r_i=5 \times 10^7$ instructions/second.  We directly measured the
%memory parameters and found $r_{rc}=9 \times 10^6$ words/second,
%$r_{wc}=8.2 \times 10^6$ words/second, and $r_{rr}=r_{wr}=0.9 \times
%10^6$ words/second.  Only a static measurement (instruction counting)
%of the implementation properties was performed, which found
%$n_{au}=6$, $n_o=7$, and $n_{ac}=11$.

%Figure~\ref{fig:86}(a) shows, for different $n_g$, the measured and
%predicted number of iterations needed for decoding to break even with
%recomputing.  The predictions are pessimistic, but quickly converge to
%the actual values with increasing $n_g$.  Notice that $n_g$ must be
%much larger to permit break--even on the Paragon than on the iWarp.  The
%predicted threshold is 17 instructions/tuple and the actual
%threshold is 16 instructions/tuple.  This is not surprising given
%that the Paragon's instruction issue rate is much higher than its
%memory bandwidth, while the iWarp's issue rate is well matched to its
%memory bandwidth.

%\begin{figure}[htbp]
%\centering
%\small
%\begin{tabular}{cc}
%\begin{tabular}{|c|c|c|} 
%\hline
%\multicolumn{3}{|c|}{Break--even}\\ 
%\hline
%  $n_g$ & Actual & Predicted \\
%        &        & (static) \\
%\hline

%    14     &        -    &        -\\
%    15     &        -    &        -\\
%    16     &        19      &        -\\
%    17     &        11      &        41    \\     
%    18     &         9      &        20\\
%    19     &         7      &        14\\
%    20     &         6      &        11\\
%    21     &         5      &         9\\
%    22     &         5      &         7\\
%    23     &         4      &         7\\
%    24     &         4      &         6\\
%    25     &         4      &         5\\
%\hline
%\end{tabular}
%&
%\begin{tabular}{|c|c|c|} 
%\hline
%\multicolumn{3}{|c|}{Speedup}\\ 
%\hline
%  $n_g$ & Actual & Predicted \\
%        &        & (static) \\
%\hline
%    14     &        -   &         -\\
%    15     &        -   &         -\\
%    16     &       1.03     &         -\\
%    17     &       1.06     &         1.01\\
%    18     &       1.09     &         1.02\\
%    19     &       1.12     &         1.03\\
%    20     &       1.15     &         1.04\\
%    21     &       1.17     &         1.05\\
%    22     &       1.20     &         1.06\\
%    23     &       1.22     &         1.07\\
%    24     &       1.24     &         1.08\\
%    25     &       1.26     &         1.09\\
%\hline
%\end{tabular}\\
%(a) & (b) \\
%\end{tabular}
%\caption{(a) Break--even (Eqn. \protect{\ref{equ:crossover}})
%and 
%(b) speedup (Eqn. \protect{\ref{equ:speedup}}) on Paragon
%}
%\label{fig:86}
%\end{figure}

%Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
%decoding over recomputing on the Paragon. Notice that speedup does not
%increase as quickly with $n_g$ as on the iWarp
%(Figure~\ref{fig:iw}(b).)  This is the result of the lower memory
%bandwidth relative to instruction issue rate on the Paragon.  The
%predictions are not as accurate as for the iWarp, reflecting the fact
%the Paragon's memory system is more complex.  However, the predictions
%are pessimistic and quickly converge to the correct values.

%The significant lesson we can learn from the models developed in this 
%section is that the threshold for the decoding approach is quite low 
%(1 instruction/tuple on the iWarp and 16 instructions/tuple on the 
%Paragon).  In practice, the time to recompute the relation on the fly 
%for simple and regular access patterns, such as those induced by 
%BLOCK($P$) and CYCLIC($P$) distributions, falls beneath this 
%threshold and communication based on either approach has roughly the 
%same performance.  However, as the access pattern becomes even 
%slightly less regular, communication performance using the decoding 
%approach becomes attractive.  For example, we have measured the time 
%to redistribute block--cyclic HPF arrays (including message 
%assembly/disassembly and the actual data transfers) and found that the 
%decoding approach was a factor of two faster than the recomputing 
%approach\footnote{A reference is not given here in order to 
%preserve anonymity. It will be included in the final paper}.


\section{Measured performance}
\label{sec:measure}

We have developed analytical models for predicting the performance of
the \AAPAIR
~encoding~\cite{AR-CACHING-LCR-INPROCEEDINGS,FAST-ASSEMBLY-ADDRESS-RELATIONS-TECHREPORT}.
These models show that the inspector/executor approach using \AAPAIR~
outperforms recomputing the relation on the fly in a surprising number
of instances.
%However, the \AAPAIR~encoding is quite memory--inefficient, requiring
%space proportional to the number of data items being assembled.  

Performance for the more compact encodings is difficult to model
analytically since, unlike with the \AAPAIR~encoding, it depends
heavily on the access patterns in the relation.  Instead, we directly
measured the performance of message assembly and disassembly on six
different machines.  For each machine, each encoding (\AAPAIR, \AABLK,
\DMRLE, and \DMRLEC) was measured on each of the four representative 
redistributions of Figure~\ref{fig:repdist}.  The performance measure
in all cases is the absolute copy throughput (in MBytes/sec) of the
assembly and disassembly routines.  By copy throughput, we mean the
number of MBytes that are packed into (unpacked out of) the message
buffer each second during message assembly (disassembly).  

%In this
%section we investigate the performance of message assembly and
%disassembly based on the more compact \AABLK, \DMRLE, and \DMRLEC~
%encodings.

The measurements are taken from {\bf ART} --- {\bf A}ddress {\bf R}elation
{\bf T}oolbox --- a portable library, entirely written in C and
designed to be used with a conventional message passing system based
on the {\em send/receive} model.  The address relations used are those
induced between source node zero and destination node zero for each of
the four representative block--cyclic redistributions of
Figure~\ref{fig:repdist}.  
%The reason for measuring node zero to node
%zero transfers is to guarantee that the largest address relation is
%used.
%, although the address relations between each pair of nodes are of
%identical size for the representative redistributions.  
The distributed array contains $1024
\times 1024$ 8-byte doubles, so the zero to zero relation assembly and
disassembly steps copy 512 KBytes.  A larger $2048 \times 2048$ array
(2048 KBytes assembled/disassembled) was also tested with only minor
changes in measured copy throughput.

The identical C code was compiled and measured on the DEC
3000/400~\cite{DEC-3000-JOURNAL}, the Intel
Paragon~\cite{PARAGON-XPS-MANUAL}, the IBM
RS/6000~\cite{RS6000-ARCHITECTURE-JOURNAL} 250, the IBM
SP-2~\cite{IBM-SP1}, and two
Pentium~\cite{PENTIUM-OVERVIEW-INPROCEEDINGS,PENTIUM-MICRO-JOURNAL} systems:
a Gateway Pentium 90, and a Micron Pentium 133.  The details of each
testing environment are given in Figure~\ref{fig:machines}.
\begin{figure*}[tbph]
\begin{center}
\small
\begin{tabular}{|l|l|l|l|}
\hline
Machine & Node &  Compiler & Operating System\\
\hline
DEC 3000/400 & Alpha (21064 @ 133 MHz) & GNU C 2.6.3, DEC CC 3.11 
    & OSF/1 2.0 \\
Intel Paragon & i860 (i860XP @ 50 MHz) & Intel C R4.5 & OSF/1 1.0.4 \\
IBM RS/6000 250 & PowerPC (601 @ 80 MHz) & IBM XLC 1.3.0.4 & AIX 3.2 \\
IBM SP-2 & POWER2 (66 MHz) & IBM XLC 1.3.0.33 & AIX 3 \\
Gateway Pentium 90 & i386 (Pentium @ 90 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
Micron Pentium 133 & i386 (Pentium @ 133 MHz) & GNU C 2.4.5 & NetBSD 1.0A \\
\hline
\end{tabular}
\normalsize
\end{center}
\caption{Testing environments}
\label{fig:machines}
\end{figure*}
Figure \ref{fig:results} shows the results for the first four
machines, while Figure \ref{fig:results-pentium} shows the results for
the Pentium--based machines.  Each row shows the results for a single
machine.  Message assembly throughput is shown on the left and
disassembly throughput on the right.  Superimposed on each
graph are annotated lines that represent the throughput of the library
\verb.memcpy(). routine and the throughput of C-coded strided copies
of doubles.  A line annotated ``stride--$k$'' on an assembly graph
indicates the throughput of a copy loop that copies from a stride of
$k$ doubles to a stride of 1, while on a disassembly graph it
indicates the throughput of a copy loop that copies from a stride of 1
to a stride of $k$ doubles.
%\footnote{The final paper will also show
%stride--1024 measurements in each graph for comparison to the
%disassembly throughput of the transpose.} 
Each chart plots absolute throughput on the same scale, so that
results can be compared across machines.  Except for the Pentium
machines (Figure~\ref{fig:results-pentium}), only the throughput of
the
\AAPAIR~(as a baseline) and
\DMRLEC~encodings are shown.  This is because the story is the same
for these machines: \DMRLEC~is always the fastest compact encoding;
\DMRLE~ has similar performance, and \AABLK~ is the slowest for all
but the first redistribution. The situation is more complex on the
Pentium machines, so the throughputs of all four encodings are shown.


\begin{figure*}[pthb!]
\begin{center}
\small
\begin{tabular}{cc}
Alpha - Assemble & Alpha - Disassemble \\
\epsfxsize 2.5in \epsfbox{alpha_assem.eps} & \epsfxsize 2.5in 
\epsfbox{alpha_disassem.eps} \\
Paragon - Assemble & Paragon - Disassemble \\
\epsfxsize 2.5in \epsfbox{i860_assem.eps} & \epsfxsize 2.5in 
\epsfbox{i860_disassem.eps} \\
RS/6000 - Assemble & RS/6000 - Disassemble \\
\epsfxsize 2.5in \epsfbox{ppc_assem.eps} & \epsfxsize 2.5in 
\epsfbox{ppc_disassem.eps} \\
SP-2 - Assemble & SP-2 - Disassemble \\
\epsfxsize 2.5in \epsfbox{pwr2_assem.eps} & \epsfxsize 2.5in 
\epsfbox{pwr2_disassem.eps} \\
\end{tabular}
\normalsize
\end{center}
\caption{Throughput (MBytes/sec) of message assembly and message 
disassembly using  \AAPAIR~ and \DMRLEC~ on DEC Alpha 3000/400, Intel Paragon,
IBM RS/6000, and IBM SP-2}
\label{fig:results}
\end{figure*}



The principal observation is that, on most of the machines, message
assembly and disassembly using the \DMRLEC~encoding operate at or
close to the copy throughput possible for their memory reference
patterns.  The remainder of this section analyzes each machine in
turn.

On the DEC 3000/400, \DMRLEC~gives ideal assembly and disassembly
throughputs for all the test cases.  \DMRLEC~ benefits from the
separate load/store pipeline in the Alpha 21064
processor~\cite{ALPHA-MICRO-JOURNAL,DEC-ALPHA-21064-JOURNAL}. Even on
a load stall, integer instructions associated with \DMRLEC~can
continue to issue.
%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{alpha_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{alpha_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput (MBytes/sec) of (a) message assembly and (b) message 
%disassembly using  \AAPAIR~ and \DMRLEC~ on DEC 3000/400}
%\label{fig:alpha}
%\end{figure*}


On a Paragon node (i860~\cite{I860-PR-MANUAL} processor),
\DMRLEC~gives near ideal assembly throughput for all the test cases, as
shown in Figure~\ref{fig:results}.  The disassembly results
are somewhat unexpected, however.  Although the first test case
(stride--1 writes) operates with ideal throughput, the third test case
(stride--4 writes) actually does slightly better than a stride--4 copy
loop.  We attribute this to different code generation by the compiler
for the disassembly loop and the stride--4 copy loop.  As
can be seen by the factor of two difference between a compiler
generated stride--1 copy loop and the library \verb.memcpy().
routine, there is considerable room for improving on the stride--$k$
copy loops generated by the Paragon node compiler.
%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{i860_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{i860_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput of message (a) assembly and (b) disassembly using
%\AAPAIR~and \DMRLEC~on Paragon}
%\label{fig:i860}
%\end{figure*}


The RS/6000 250 (PowerPC 601~\cite{PPC601-MICRO-JOURNAL}) also
achieves near ideal assembly and disassembly throughput with \DMRLEC.
The IBM XLC compiler has software--pipelined each of the assembly,
disassembly, and strided copy loops.
%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{ppc_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{ppc_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput (MBytes/sec) of (a) message assembly and 
%(b) message disassembly using \AAPAIR~and \DMRLEC~on IBM RS/6000 Model 250}
%\label{fig:ppc}
%\end{figure*}

The throughput measurements of message assembly and disassembly using
\DMRLEC~on the IBM SP-2 are also close to ideal, tracking the 
throughput of the strided copy loops closely.  The IBM XLC compiler
has software--pipelined assembly, disassembly, and each of the strided
copy loops.  Because the
POWER2~\cite{POWER2-ARCH-INPROCEEDINGS,POWER2-PROCESSOR-INPROCEEDINGS}
processor, like the Alpha, can continue to issue integer instructions
while a load miss is being handled, instructions associated with
\DMRLEC~are free.
% We outperform memcpy() by about 5% and we are unsure why.

%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{pwr2_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{pwr2_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput (MBytes/sec) of (a) message assembly and 
%(b) message disassembly using \AAPAIR~ and \DMRLEC~on IBM SP-2}
%\label{fig:pwr2}
%\end{figure*}

The performance of the Pentium systems is more complex than that of
the other machines, as can be seen in
Figure~\ref{fig:results-pentium}. On both machines, the assembly and
disassembly throughput of \DMRLEC~is lower than \DMRLE.  On the faster
Micron machine, only the first test case achieves copy throughput for
message assembly, and then only with the simple \AABLK~encoding.
Message assembly and disassembly appear to
be instruction issue limited for \DMRLEC~on both machines.  Because the
ratio of memory copy bandwidth (as measured by \verb.memcpy().) to
instruction issue rate is higher on the Micron, it is not surprising
that it performs comparatively worse in message assembly and
disassembly using \DMRLEC.

We believe that three factors contribute to this behavior.  First, the
GNU C compiler does not support Pentium optimizations, so no
scheduling is done to improve the multiple issue rate.  Second, the
Pentium is both register starved and non--orthogonal.  There are a
significant number of instructions in the message assembly loop that
do nothing but move values associated with the compact encodings to
and from the appropriate registers for certain operations.  In other
cases, memory operands are used where a machine with a larger register
set would use registers.  Finally, although the latency of memory operands
is shortened by the on--chip cache, the cache is not actually dual
ported, but banked, so cache references can become
serialized.\footnote{We also compiled the library with Microsoft's
Visual C++ 2.2, which includes Pentium-specific optimizations, but saw
no difference in performance during informal tests.  This leads us to
believe that the architecture or memory system is the overriding
factor in the lackluster performance of the Pentiums.} Note that even
with this complexity, \DMRLEC~always performs better than
\AAPAIR~on the Pentium.




%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{p90_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{p90_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput (MBytes/sec) of (a) message assembly and (b) 
%message disassembly using \AAPAIR, \AABLK, \DMRLE, and \DMRLEC~ on 
%Gateway Pentium 90 system}
%\label{fig:p90}
%\end{figure*}
%\begin{figure*}[htbp]
%\begin{center}
%\begin{tabular}{cc}
%\epsfxsize 2.5in \epsfbox{p133_assem.eps} & \epsfxsize 2.5in 
%\epsfbox{p133_disassem.eps} \\
%(a) & (b) \\
%\end{tabular}
%\end{center}
%\caption{Throughput (MBytes/sec) of (a) message assembly and (b) message
%disassembly using \AAPAIR, \AABLK, \DMRLE, and \DMRLEC~on 
%Micron Pentium 133 system}
%\label{fig:p133}
%\end{figure*}



%\begin{figure*}[thbp!]
\begin{figure*}
\small
\begin{center}
\begin{tabular}{cc}
Pentium 90 - Assemble & Pentium 90 - Disassemble \\
\epsfxsize 2.5in \epsfbox{p90_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p90_disassem.eps} \\
Pentium 133 - Assemble & Pentium 133 - Disassemble \\
\epsfxsize 2.5in \epsfbox{p133_assem.eps} & \epsfxsize 2.5in 
\epsfbox{p133_disassem.eps} \\
\end{tabular}
\end{center}
\normalsize
\caption{Throughput (MBytes/sec) of message assembly and message
disassembly using \AAPAIR, \AABLK, \DMRLE, and \DMRLEC~on 
Gateway Pentium 90 and Micron Pentium 133 systems}
\label{fig:results-pentium}
\end{figure*}

\section{Discussion}
\label{sec:discuss}

%The preceding sections demonstrated that the decoding approach to
%message assembly can outperform the recomputing approach for many
%relations.  We introduced address relation encodings that
%significantly reduce the space required to store common address
%relations.  The message assembly and disassembly throughputs of these
%encodings were measured on several machines and found to be on par
%with the copy bandwidth.  
%In this section we describe other attractive attributes of the
%decoding approach, arguing that stored, encoded address relations are
%a powerful interface to communication systems.  Additionally, we note
%that even with compact address relations discussed in this paper, an
%address relation caching strategy could still be useful.


By taking address computation out of the critical path of data
transfer, the decoding approach makes message assembly throughput
depend on the address relation and the memory access pattern it
encodes, rather than on how the address relation is computed.  One
advantage of this is that it becomes easier to achieve reasonable
communication performance without a large number of special--case
optimizations in the compiler and run--time library.  Instead of
optimizing each address computation algorithm for each architecture,
only message assembly for each address relation encoding needs to be
optimized for each architecture in order for all data transfers to
benefit.

Another interesting property of the decoding approach is that the
entire relation is available to be optimized at a new level of the
software hierarchy.  For example, by ordering the relation by
destination node addresses first, we can optimize for reference
locality on the destination node instead of on the source node.  As
shown in~\cite{STRICKER-OPT-MEM-PERF-COMM-INPROCEEDINGS}, the choice
is architecture--dependent and can have a significant impact on
performance.

Stored encoded address relations also appear to be a good interface
from user programs to the communication system because they are
implementation neutral.  This permits communication systems to be
highly specialized for the available hardware while still presenting
the same interface to the programmer or compiler.  However, to best
utilize the hardware, it may be necessary to change the encoding of
the address relation before using it.  For example, the relation may
be transformed into a DMA gather map on the source to make best use of
a network adapter such as the one described in~\cite{GNECTAR-IFIP}.
It is unclear whether this is the optimal approach, since the main
processor could presumably support better encodings than the network
adaptor. However, it is an intriguing possibility.

%
% This is new
%
%
Address relations could also be a powerful abstraction in the
interface {\em between} the layers of a communication system.
Consider a TCP/IP implementation in which each layer merely adds the
tuples necessary to prepend its header, instead of recopying the packet.
The data link layer could then simply gather the contents of the
packet directly to the wire.  With appropriate user-level semantics,
such a communication system could avoid all data copies.

Even with the compact address relation encodings described in this
paper, it is still useful to be able to bound the amount of memory
used to store address relations.  Address relations can still have
huge space requirements if they do not have regularity to exploit.
This leads naturally to a caching approach.  When the allotted
amount of memory is in use and a new relation needs to be stored, a
victim relation must be chosen and flushed from memory to make room.
User-- or compiler--supplied hints would help minimize cache
thrashing. For example, if the run--time system were told that a group
of relations were associated with the same collective communication
operation, then it might make sense to flush and restore these
relations as a group.  Another possibility is to publicize the
characteristics of the address relation cache so that compilers and
programmers can make reasonable choices, as with a memory cache.
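The group-flush hint described above can be sketched as follows. The
cache layout and names are hypothetical, intended only to show how a
group tag lets the run--time system evict all relations belonging to
one collective operation together rather than thrashing on them one at
a time.

```c
/* Hypothetical sketch of a bounded address-relation cache: relations
 * carry a group id (e.g., one collective communication operation),
 * and a victim group is flushed as a unit to make room. */
#include <stdlib.h>

typedef struct {
    int    group;      /* hint: relations in one collective share a group */
    size_t bytes;      /* size of the encoded relation */
    void  *encoding;   /* opaque encoded form (NULL slot = free) */
} cached_rel_t;

typedef struct {
    cached_rel_t *slots;
    size_t        nslots;
    size_t        used, limit;   /* bytes in use / allotted memory */
} rel_cache_t;

/* Flush every cached relation belonging to the victim group and
 * return the number of bytes freed. */
size_t cache_flush_group(rel_cache_t *c, int victim_group) {
    size_t freed = 0;
    for (size_t i = 0; i < c->nslots; i++) {
        if (c->slots[i].encoding && c->slots[i].group == victim_group) {
            free(c->slots[i].encoding);
            c->slots[i].encoding = NULL;
            freed += c->slots[i].bytes;
        }
    }
    c->used -= freed;
    return freed;
}
```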

Since address relations can be decoded at near copy throughputs on
most systems, it might be interesting to investigate using precomputed
address relations as the basis for a generalized form of
\verb.memcpy(). on uniprocessors.  Example applications for such a
library routine include bit block transfers (bit--blit) and transfers
of regions in graphical interfaces and games.
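A minimal sketch of such a generalized routine appears below. The
encoding shown (a single strided run descriptor) is a hypothetical
illustration, not one of the encodings from this paper, but it is
enough to express a rectangular region transfer, and its inner loop is
a plain \verb.memcpy()., which is what keeps decoding at near copy
throughput for long runs.

```c
/* Hypothetical sketch of a generalized memcpy driven by a precomputed
 * relation: (src start, dst start, run length, src stride, dst
 * stride, run count).  A rectangular bit--blit is one such relation:
 * each image row is a contiguous run, and the strides are the source
 * and destination row pitches. */
#include <string.h>

typedef struct {
    size_t src, dst;          /* starting offsets */
    size_t run;               /* bytes per contiguous run */
    size_t sstride, dstride;  /* distance between consecutive runs */
    size_t count;             /* number of runs */
} strided_rel_t;

/* Decode the relation: one memcpy per run. */
void gmemcpy(char *dbuf, const char *sbuf, const strided_rel_t *r) {
    for (size_t i = 0; i < r->count; i++)
        memcpy(dbuf + r->dst + i * r->dstride,
               sbuf + r->src + i * r->sstride,
               r->run);
}
```

For example, copying a 2x2-byte region out of a 4-byte-wide source
image into a tightly packed 2-byte-wide destination is the relation
\{src=1, dst=0, run=2, sstride=4, dstride=2, count=2\}.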


\section{Conclusions and open issues}
\label{sec:conclude}

This paper introduced the notion of an address relation for
describing data transfers between the nodes of a parallel system, and
it described how compact encoded forms of these relations can
serve as the basis for fast message assembly and disassembly.
%Analytic models and measurements on real systems suggest two main
%conclusions.  First, the threshold at which decoding precomputed
%address relations during message assembly and disassembly becomes
%effective is quite low, on the order of units or tens of instructions.
%Second, 
The main result is that
message assembly and disassembly based on decoding compact
address relations typically perform at near copy throughput for
the relations induced by redistributions of regular block--cyclic
distributed arrays.  A side effect is that the storage requirements for
these relations are drastically reduced to a practical level.

While the paper shows encouraging results for a carefully constrained
(but very useful) set of relations, it does not address the
fundamental question of how to characterize machines and address
relations so that we can decide whether to decode or recompute, and if
we decide to decode, what encoding of the relation to use. This issue
will become increasingly important for relations with irregular memory
reference patterns, since the potential benefit of decoding would seem
to be greatest for the irregular cases. It will be necessary to
understand how to encode and decode relations that are more irregular
than the relations described in this paper, as well as to understand
the performance impact of these encodings.  One motivating application
for these irregular relations is sparse matrix--vector multiplication
over irregular finite--element meshes.

Another avenue to explore is the design of more aggressive encoding
schemes such as the Huffman--based~\cite{HUFFMAN-CODES-JOURNAL}
\DMRLE~encoding mentioned in Section~\ref{sec:compact}. Another
possibility is to use LZW~\cite{LZ-JOURNAL,LZW-JOURNAL} or a variant
thereof for further compression.\footnote{Early experiments with the
\verb.gzip. LZ77--based compressor show dramatic compression, but we
do not yet know if LZW--based decoding can be made to run at copy
throughput speeds.}

We are also interested in how much the performance of message assembly
and disassembly depends on single--node compilers.  Many parallelizing
compilers generate sequential source code and rely on the single--node
compiler to produce good results.  However, as can be seen from the
factor of two difference between a compiler--generated copy loop and
the library \verb.memcpy(). routine on the Paragon
(Figure~\ref{fig:results}), this reliance may be unwarranted.  If so,
then the decoding approach becomes more attractive since the assembly
routine for each encoding could be low--level coded for each
architecture. Such low--level coding would probably be impractical
for the recomputing approach due to the many variants of assembly
routines.  Instead, a parallelizing compiler could precompute address
relations at compile--time and rely on a low--level coded library of
message assembly routines to achieve near copy throughput. In general
though, the appropriate interaction between a parallelizing compiler
and a run--time message assembly library based on decoding precomputed
address relations is not yet understood.


\comment{
\subsection*{ACKNOWLEDGEMENTS}

We would like to thank Thomas Stricker for illuminating the finer
points of deposit model communication, James Stichnoth for his work
and explanations of array assignments between regular block-cyclic
distributed arrays, and Ed Segall for providing the impetus to look
at run--time approaches to fast message assembly.
}



\small
\bibliography{/afs/cs/project/iwarp/member/pdinda/BIB/texbib,/afs/cs/project/iwarp/member/droh/bib/refs}

\end{document}
 