%\documentstyle[twocolumn]{article}
\documentstyle[epsf,11pt]{article}
\setlength{\topmargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textheight}{9in}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\marginparsep}{0in}
\setlength{\marginparwidth}{0in}

\title{Address Relation Caching in Deposit Model Communication}
\author{Peter A. Dinda\footnote{pdinda@cs.cmu.edu} \\
        School of Computer Science \\ Carnegie-Mellon University \\
	5000 Forbes Ave\\ Pittsburgh, PA 15217}


\begin{document}


\maketitle

\input{psfig}

\abstract{
When two processes of a parallel program communicate, an important
element of the communication time is the time spent assembling and
disassembling the message.  Message assembly/disassembly involves
address computation in addition to memory copies.  If this computation
is significant, and the communication is repeated, it may be desirable
to amortize the computation time by caching the computation's results.
However, assembling a message from the cached results requires using
some memory bandwidth that could otherwise be used for the copy.

We present a fine-grain, explanatory model for simple caching in
deposit model communication (where the relation between sender and
receiver addresses is computed only on the sender).  The model
predicts how many times a communication must be repeated for the
average communication time to decline to that of an uncached
implementation, as well as the actual improvement in message assembly
over an uncached implementation.  The model shows that the
effectiveness of caching depends on CPU speed, memory bandwidth and
the complexity of the address computation.  We verify the model on the
iWarp and the Paragon.  We find that for both machines, caching can
improve the performance of message assembly even when only a small
number of (computational) instructions would otherwise be executed per
data item.

To show the practical benefit of address relation caching, we examine
the performance of an HPF distributed array communication library that
can be configured to use caching.  In some cases, caching can double
the performance of the library.  Finally, we discuss other benefits 
to caching and several open issues.

}


\section{Introduction}

Communication systems typically support simple send and receive
primitives to transport the contents of a contiguous chunk of memory
(a message buffer) between two processors -- i.e., ``point-to-point
communication.''  The performance of such a system is usually given as
latency and bandwidth between a pair of nodes as functions of the
message size.  Unfortunately, many applications do not keep the data
they wish to communicate in convenient contiguous chunks, so a message
must be assembled on the sender and disassembled on the receiver.
This requires copying the data, and computing which data items are to
be sent as well as their ultimate locations on the receiver.  In this
paper, we examine deposit model communication \cite{depmodel}, where
only the sender performs this computation, and the message contains
addressing information.

Although data copying can ruin performance \cite{stricker}, the
computation overhead of communication may also greatly affect
performance.  The natural question is whether caching can be used to
amortize this overhead if the communication is repeated.  Often, the
data at the same addresses are communicated several times.  For
example, an HPF\cite{HPF} array assignment on statically distributed
arrays may be repeated several times in a loop.  The computation
involved in executing the assignment statement could be amortized by
moving it outside of the loop.  However, it may be the case that the
amount of computation involved or the number of times the
communication is repeated is so low as to make caching unhelpful.
Clearly, we must understand what characterizes the performance of
both uncached and cached communication in order to determine when
caching is effective.


We find that while caching does eliminate computation, it does so by
using some of the memory bandwidth that would otherwise be available
purely for the copying of data items.  As a result, caching may
never outperform uncached communication, regardless of the number of
times the communication is performed.  If
caching can outperform uncached communication, the number of
repetitions required depends on the cache structure, the amount of
computation done, and machine characteristics such as memory
bandwidths and instruction rates.  Caching also represents an
opportunity to optimize memory access patterns and other features of
the communication which may not be inherent in the algorithm used to
compute which data items are to be communicated.

In the following, we begin by describing deposit model communication,
where all computation is performed on the sender and the message
contains addressing information.  Next, we derive and compare models
for point-to-point communication for the case where the message format
and cache structure are as simple as possible.  These models predict
the number of times a communication must be repeated for a cached
implementation to break even with an uncached implementation and the
improvement in message assembly time due to caching.  We verify these
predictions on the iWarp and Paragon.  Next, we present measurements
that show the practical benefit of caching in an HPF communication
library.  Finally, we comment on optimizations that can be performed
with caching, and on open issues in the implementation of the cache.



\section{Deposit Model Communication}

In conventional, buffered communication, such as that provided by PVM
or NX, the sender computes a binary relation $S$ between local
addresses and addresses offset into the {\em message} that is
transported between it and the receiver.  If $(s,m) \in S$ then the
data item at local address $s$ is copied into the message at offset
$m$.  The receiver computes a binary relation $R$ between addresses
offset into the message and local addresses.  If $(m,r) \in R$, then
the data item at offset $m$ in the message is copied to local address
$r$.  No actual address information is contained in the message.

In deposit model style communication, the sender computes a binary
relation $SR$ between local addresses at the sender and local
addresses at the receiver -- if $(s,r) \in SR$ then the data item at
address $s$ will be transported to address $r$ at the receiver.  We
refer to $SR$ as the {\em address relation}.  The sender uses the
address relation to {\em assemble} a message that conceptually
consists of 2-tuples $(r,d)$ where data item $d$ (read from address
$s$) is to be written to, or ``deposited'' at address $r$ on the
receiver.  The receiver runs a simple ``deposit handler'' which
receives the message and {\em disassembles} the message, writing each
$d$ to its corresponding local address $r$.  If the message actually
contains 2-tuples, it is in the ``address-data pair'' format.  Often,
the format of the message can be optimized so that the overhead of
including addresses in the message is very low.  One such format is
referred to as ``address-data block.'' In this format, the message
consists of 3-tuples $(r,l,D)$, where $D$ consists of $l$ data items
that are deposited at consecutive addresses starting at $r$.
Obviously, other optimizations are also possible, depending on the
properties of the address relation.  There are several other
attributes of pure deposit model communication that are ignored here.
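To make the formats concrete, a receiver's deposit handler for the
address-data block format might look like the following minimal C
sketch.  The flat word layout and all names here are assumptions for
illustration, not the wire format of any particular system.

```c
#include <stddef.h>

/* Deposit handler sketch for the "address-data block" format: the
   message is a flat sequence of 3-tuples (r, l, D), where the l data
   words of D are deposited at consecutive local addresses starting
   at r.  Layout and names are hypothetical. */
void deposit_blocks(const long *msg, size_t msg_words, long *local)
{
    size_t i = 0;
    while (i + 1 < msg_words) {
        long r = msg[i++];           /* start address on the receiver */
        long l = msg[i++];           /* run length in words */
        for (long k = 0; k < l; k++)
            local[r + k] = msg[i++]; /* deposit the data words */
    }
}
```

For an address relation with long contiguous runs, this format
amortizes the per-tuple addressing overhead down to two extra words
per run.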

\section{Modeling Point-to-Point Communication}
\label{simple}

\begin{table}
\centering
\begin{tabular}{ll}
	$t_{uncached}$ & Point-to-point communication time w/o caching  \\
	$t_{cached}$ & Point-to-point communication time with caching  \\
	$t_g$ & Address relation computation time  \\
	$t_{au}$ & Message assembly time w/o caching   \\
	$t_{ac}$ & Message assembly time with caching  \\
	$t_c$ & Message communication time  \\
	$t_d$ & Message disassembly time  \\
	$t_o$ & Cache update time  \\
\end{tabular}
\caption{Elements of point-to-point communication times}
\end{table}

In a point-to-point communication, a single sender sends a single
message to a single receiver.  After explaining which parameters of
the address computation algorithm and the machine we find important,
we begin by considering the simplest message format and cache
structure and build expressions for the average latency for both
uncached and cached communication.  We compare these expressions to
derive some insight as to the effect of caching.  

\subsection{Algorithm Properties}

\begin{table}
\centering
\begin{tabular}{ll}
	$n$   & Number of data words in message \\
	$n_i$ & Number of times communication is done  \\
	$n_g$ & Instructions/tuple to compute address relation \\
        $n_o$ & Instructions/tuple to store tuple in cache \\
        $n_{au}$ & Instructions/tuple to do uncached assembly \\
	$n_{ac}$ & Instructions/tuple to do cached assembly \\
        $n_d$ & Instructions/tuple to do disassembly
\end{tabular}
\caption{Algorithm and implementation parameters}
\label{algimp}
\end{table}

Clearly, the time spent computing the address relation depends on the
algorithm and its implementation.  We model computing the address
relation as instruction execution limited and as delivering the
address relation in an on-line manner.  The single property of the
implementation we are interested in is $n_g$, the number of
instructions executed per element of the relation.


\subsection{Machine Properties}

\begin{table}
\centering
\begin{tabular}{ll}
	$r_i$ & Instructions/second of machine  \\
	$r_{rc}$ & Words/Second of unit memory reads  \\
	$r_{rr}$ & Words/Second of random memory reads  \\
	$r_{wc}$ & Words/second of unit memory writes  \\
	$r_{wr}$ & Words/second of random memory writes  \\
	$r_c$ & Words/second of node-node communication \\
\end{tabular}
\caption{Machine parameters}
\label{machprop}
\end{table}

Since computing the address relation is instruction execution limited,
we must know $r_i$, the number of instructions the machine can execute
per second.  We also need to know $r_c$, the number of words per second
that can be communicated between two nodes.  Additionally, we need to
know something about the memory system because of the copies and cache
access that will be done.  We break down the performance of memory
accesses into unit stride accesses and random accesses.  We expect to
read and write memory with a unit stride at rates $r_{rc}$ and $r_{wc}$,
respectively.  Similarly, we expect to read and write memory at random
addresses at rates $r_{rr}$ and $r_{wr}$, respectively.

\subsection{A Simple Model For Uncached Communication}

The time for an uncached point-to-point communication has four
components: $t_g$, the time to compute the address relation; $t_{au}$,
the message assembly time; $t_c$, the message communication time; and
$t_d$, the message disassembly time at the receiver.  The total time
for the communication is
\\
$$ t_{uncached} = t_g + t_{au} + t_c + t_d $$
\\
In this simple model, we assume that the message format is
address-data pair.

The time to compute the address relation is largely determined by the 
complexity of the algorithm used, its implementation, and the instruction 
execution rate of the machine.  Thus for an $n$ word message, the time 
required is
\\
$$t_g = \frac{n n_g}{r_i}$$

The code that computes the address relation presents tuples $(s,r)$ in
registers in an on-line manner.  For each of these tuples, we read the
data word at address $s$, write $r$ to the message buffer, then write
the data word to the message buffer.  Note that the writes to the
message buffer are to consecutive addresses, so their performance can
be captured by $r_{wc}$, the contiguous write performance of the
machine.  On the other hand, the sender side addresses of the tuples
may not be contiguous, so we capture their performance with $r_{rr}$,
the random read bandwidth of the machine.  Furthermore, we capture the
loop overhead as a number of instructions, $n_{au}$.  Thus the time to
assemble the message is
\\
$$t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{n_{au}}{r_i}
          \right ) $$
\\
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.
This will bring it close to the above expression.
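The memory operations counted above correspond to an assembly loop of
roughly the following shape (a C sketch only; the address computation
is stood in for by the identity relation, and the names are
hypothetical):

```c
#include <stddef.h>

/* Uncached assembly, address-data pair format: for each tuple (s, r)
   produced on line by the address computation, write r and then the
   data word at s into the message buffer. */
size_t assemble_uncached(const long *src, long *msg, size_t n)
{
    size_t i, m = 0;
    for (i = 0; i < n; i++) {
        size_t s = i, r = i; /* stand-in for n_g instructions of
                                address computation */
        msg[m++] = (long)r;  /* contiguous write -> r_wc */
        msg[m++] = src[s];   /* random read -> r_rr, then a
                                contiguous write -> r_wc */
    }
    return m;                /* message length: 2n words */
}
```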

After the message has been assembled, we use the communication
hardware to transport it to the receiver.  The message communication
time is
\\
$$ t_c = \frac{2 n}{r_c} $$
\\
Note that $r_c$ is really a function of $2n$, but we assume that $n$
is sufficiently large that we can use a fixed value for $r_c$.
Further, this value is the same with and without caching and becomes
irrelevant in comparing them.

On the receiving node, each word of the message is read consecutively
and each address-data pair results in one random write to memory.  
Additionally, $n_d$ loop overhead instructions are executed for a total
disassembly time of 
\\
$$ t_d = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{wr}} +\frac{n_d}{r_i}
       \right ) $$  
\\
Note that this expression does not take into account cache thrashing
between the reads and writes. Again, an implementation will probably
improve performance by writing several data items at a time, bringing it
close to the expression.
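A deposit handler realizing this cost profile can be sketched as
follows (assumed names; address-data pair format as above):

```c
#include <stddef.h>

/* Disassembly of an address-data pair message: each pair (r, d)
   costs two contiguous reads from the message buffer and one random
   write of d to local address r. */
void disassemble(const long *msg, size_t msg_words, long *local)
{
    size_t i;
    for (i = 0; i + 1 < msg_words; i += 2) {
        long r = msg[i];     /* contiguous read -> r_rc */
        long d = msg[i + 1]; /* contiguous read -> r_rc */
        local[r] = d;        /* random write -> r_wr */
    }
}
```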

Substituting into the expression for $t_{uncached}$, we see that an
$n$ word communication between two nodes requires time
\\
$$ t_{uncached} = n \left ( \frac{n_g+n_{au}+n_d}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{2}{r_c} +
 \frac{2}{r_{rc}} + \frac{1}{r_{wr}} \right )$$



\subsection{A Simple Model For Cached Communication}

The time for a cached point-to-point communication has five components: 
$t_g$, the time to compute the address relation; $t_o$, the time to store 
the address relation in the cache; $t_{ac}$, the message assembly time from 
the cached address relation; $t_c$, the message communication time; and 
$t_d$, the message disassembly time at the receiver.  The average time for 
the communication is
\\
$$ t_{cached} = \frac{t_g + t_o}{n_i} + t_{ac} + t_c + t_d $$
\\
Note that the first two components are amortized over the number of times 
the communication is performed, $n_i$.  $t_g$, $t_c$, and $t_d$ are the 
same between the uncached and cached schemes and were determined in the
preceding sections.  We continue the assumption that the message format 
is address-data pair.  Further, we choose the simplest possible cache 
data structure -- an array of 2-tuples that represents the relation.

For the simple model, as each tuple of the address relation is
generated, we write it to an array.  Thus storing the address relation
in the cache involves two contiguous writes for each in-register tuple
produced by the address relation computation.  In addition, we execute
$n_o$ loop overhead instructions, so 
\\
$$ t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )$$

When assembling the message from the cached address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the cached address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the receiver address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, thus the time to
assemble a message with $n$ data words is
\\
$$ t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) $$
\\
Note that this expression does not take into account cache thrashing
between the reads and writes, but, again, a good implementation will
perform several data word reads at a time and thus approach this
performance.
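The cache insertion and cached assembly loops counted above can be
sketched together as below (a C sketch under the same assumptions;
the names are hypothetical):

```c
#include <stddef.h>

/* Simple cache: a flat array of 2-tuples (s, r).  Storing a tuple
   costs two contiguous writes (t_o). */
size_t store_tuple(long *cache, size_t pos, long s, long r)
{
    cache[pos++] = s;   /* contiguous write -> r_wc */
    cache[pos++] = r;   /* contiguous write -> r_wc */
    return pos;
}

/* Cached assembly (t_ac): per tuple, two contiguous reads of the
   cached tuple, one random read of the data word, and two contiguous
   writes into the message buffer. */
size_t assemble_cached(const long *cache, size_t n,
                       const long *src, long *msg)
{
    size_t i, m = 0;
    for (i = 0; i < n; i++) {
        long s = cache[2 * i];     /* contiguous read -> r_rc */
        long r = cache[2 * i + 1]; /* contiguous read -> r_rc */
        msg[m++] = r;              /* contiguous write -> r_wc */
        msg[m++] = src[s];         /* random read -> r_rr,
                                      contiguous write -> r_wc */
    }
    return m;
}
```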

Substituting into the expression for $t_{cached}$, we see that an 
$n$ word communication between two nodes requires time
\\
$$ t_{cached} = n \left (\frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} + 
	\frac{n_{ac}+n_d}{r_i} +
             \frac{2}{r_{rc}} + \frac{1}{r_{rr}} + \frac{2}{r_{wc}} + 
             \frac{2}{r_c} + \frac{2}{r_{rc}} + \frac{1}{r_{wr}} \right )$$


\subsection{Evaluating Caching Using the Simple Model}

In the cached implementation we are essentially trading memory bandwidth 
for computation.  Caching can only be effective if we can assemble the
message using the cache faster than without the cache:
\\
$$ t_{ac} < t_g + t_{au}$$
\\
or
\\
$$ r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}$$
\\
If this holds true, then the number of times the cached information
must be used in order to outperform the uncached implementation can be
determined by setting $t_{cached} = t_{uncached}$ and solving for
$n_i$.  In doing this we find that the break even point for caching is
\\
$$ n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right )
\right \rceil $$
\\
Finally, we find that the speedup (fractional improvement) in message assembly time is
\\
$$ \frac{r_{rr}r_{wc}}{r_{rc}} \left ( \frac{(n_g + n_{au}
-n_{ac})r_{rc} -2r_i}{(n_g+n_{au})r_{rr}r_{wc}+(r_{wc}+2r_{rr})r_i} \right ) $$


\section{Model Validation}

\begin{table}
\centering
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Actual & Predicted (Static) & Predicted (Dynamic) \\
\hline
   1        &         13     &        INF        &       13            \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &         3         &       3\\
\hline
\end{tabular}
\caption{Iterations required to break even with 
uncached message assembly on iWarp }
\label{iwbe}
\end{table}

\begin{table}
\centering
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Actual & Predicted (Static) & Predicted (Dynamic) \\
\hline
   1       &          5\%    &          0\%     &          7\% \\
   2       &         11\%    &          8\%     &         13\% \\
   3       &         18\%    &         14\%     &         19\% \\
   4       &         22\%    &         20\%     &         24\% \\
   5       &         26\%    &         25\%     &         28\% \\
   6       &         30\%    &         29\%     &         32\% \\
   7       &         33\%    &         33\%     &         35\% \\
   8       &         36\%    &         37\%     &         38\% \\
\hline
\end{tabular}
\caption{Speedup over uncached message assembly on iWarp}
\label{iwsu}
\end{table}

\begin{table}
\centering
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Actual & Predicted (Static) \\
\hline

    14     &        NONE    &        NONE\\
    15     &        NONE    &        NONE\\
    16     &        19      &        NONE\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
\caption{Iterations required to break even 
with uncached message assembly on Paragon}
\label{86be}
\end{table}

\begin{table}
\centering
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Actual & Predicted (Static) \\
\hline
    14     &        NONE   &         NONE\\
    15     &        NONE   &         NONE\\
    16     &        3\%     &         NONE\\
    17     &        6\%     &         1\%\\
    18     &        9\%     &         2\%\\
    19     &       12\%     &         3\%\\
    20     &       15\%     &         4\%\\
    21     &       17\%     &         5\%\\
    22     &       20\%     &         6\%\\
    23     &       22\%     &         7\%\\
    24     &       24\%     &         8\%\\
    25     &       26\%     &         9\%\\
\hline
\end{tabular}
\caption{Speedup over uncached message assembly on Paragon}
\label{86su}
\end{table}

The simple model was validated on the iWarp and the Paragon.  The model
was used to predict the number of iterations necessary for the average
cached message assembly time to break even with the uncached message
assembly as a function of $n_g$.  The speedup in cached message
assembly time as a function of $n_g$ was also predicted.  These
numbers were then measured and compared to the predicted values.  The
address relation computed is the identity relation.  The number of
instructions to compute this relation was varied by inserting nops
into the code stream.  See tables \ref{iwbe}, \ref{iwsu}, \ref{86be},
\ref{86su}.

An iWarp node \cite{iwarp} contains a 20 MHz non-pipelined LIW
processor.  All memory operations complete in two cycles as there is
no cache and the main memory is SRAM.  We assume no dual dispatch of
loads and stores, so we have $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$
words/second.  iWarp can execute $r_i=2 \times 10^7$
instructions/second.  The algorithm and implementation properties (see
table \ref{algimp}) were measured in two ways.  First, a static
measurement of the number of non-memory instructions in the uncached
assembly, cache insert, and cached assembly loops was done.  These
static measurements found $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. These
numbers underestimate the amount of time spent in these instructions
because, although iWarp is not pipelined, RAW hazards do exist that
are not handled by data forwarding.  The instructions were also
measured accounting for stalls (as nops) produced by such hazards.
These dynamic measurements found $n_{au}=7$, $n_o=8$, and $n_{ac}=3$.
Table \ref{iwbe} shows the number of iterations needed to break even
with uncached assembly for different $n_g$.  Notice that the
predictions made using the dynamic measurements are almost exactly on
target.  The predictions made using the static measurements are
pessimistic.  The same is true for the predicted speedup over uncached
assembly, table \ref{iwsu}.  However, it is important to note that
even using static measurements, the predictions only become inaccurate
when $n_g$ approaches the minimum number of instructions where the
cached implementation can outperform the uncached implementation.
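As a cross-check, the break-even expression from section \ref{simple}
can be evaluated mechanically.  The C sketch below, fed the iWarp
dynamic parameters above ($n_{au}=7$, $n_o=8$, $n_{ac}=3$,
$r_i=2\times 10^7$, $r_{rc}=r_{wc}=10^7$), reproduces the ``Predicted
(Dynamic)'' column of table \ref{iwbe} -- e.g., 13 for $n_g=1$ and 7
for $n_g=2$:

```c
/* Evaluate the simple model's break-even iteration count
   n_i = ceil( (r_rc/r_wc) * ((n_g+n_o) r_wc + 2 r_i)
                           / ((n_g+n_au-n_ac) r_rc - 2 r_i) )
   in exact integer arithmetic.  Returns -1 when the denominator is
   nonpositive, i.e., caching never breaks even (the INF entries). */
long break_even(long n_g, long n_au, long n_o, long n_ac,
                long r_i, long r_rc, long r_wc)
{
    long long denom = (long long)(n_g + n_au - n_ac) * r_rc - 2LL * r_i;
    if (denom <= 0)
        return -1;
    long long num = (long long)(n_g + n_o) * r_wc + 2LL * r_i;
    long long a = (long long)r_rc * num;
    long long b = (long long)r_wc * denom;
    return (long)((a + b - 1) / b);   /* ceiling division */
}
```

With the static measurements ($n_{au}=5$, $n_o=6$, $n_{ac}=2$) the
same function returns $-1$ (INF) for $n_g=1$ and 12 for $n_g=2$,
matching the static column.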

A Paragon \cite{paragon} contains a 50 MHz pipelined, multiple issue
i860 \cite{i860} RISC processor.  We have $r_i=5 \times 10^7$
instructions/second because we do not use the floating point pipeline.
Data loads and stores complete through a 4-way set-associative cache.
We measured the memory performance of a Paragon node and found
$r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.
Only a static
measurement of the algorithm and implementation properties (see table
\ref{algimp}) was done, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.  Tables \ref{86be} and \ref{86su} show that this produces
conservative predictions.  However, it is important to again note that
the predictions only become inaccurate when $n_g$ approaches the
minimum number of instructions for which caching can outperform the
uncached implementation.


\section{Address Relation Caching for HPF Communications}



We implemented a library for communicating HPF-distributed arrays
between data parallel tasks.  The processors over which the source and
destination arrays are distributed are disjoint and the distribution
of the arrays may be different.  For example, on the source, the
array's rows may be block-distributed over 4 processors, while on the
destination, its columns may be cyclic-distributed over 8 processors.
The library is called on every source processor with the source and
target distributions (general block-cyclic) along each dimension of
the array.  On each source processor, the address relation between the
local chunk of the array and the local chunks on each of the
destination processors is computed using an algorithm due to
\cite{stichnoth}.  

As each tuple of the address relation is computed, the library can
either insert that tuple into a cache, or use it directly in the
assembly of the message.  In other words, the inner loop of address
computation either adds tuples to the cache, or uses them to
assemble a message.  The cache data structure used is a simple linked
list of address-address blocks (runs of addresses that are contiguous on
the source and destination processor) for each destination processor.  
Coalescing is done to minimize the number of such blocks used.  Message
assembly from the cache is done in the obvious way.
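The cache structure and its coalescing step might be sketched as
follows (illustrative only; this is not the library's actual code,
and the names are assumptions):

```c
#include <stdlib.h>

/* One address-address block: a run of len words contiguous on both
   the sender and the receiver. */
struct aablock {
    long s_start;           /* run start address on the sender */
    long r_start;           /* run start address on the receiver */
    long len;               /* run length in words */
    struct aablock *next;
};

/* Insert a tuple (s, r), coalescing it into the most recent block
   when it extends both runs by exactly one word. */
struct aablock *add_pair(struct aablock *head, long s, long r)
{
    if (head && head->s_start + head->len == s
             && head->r_start + head->len == r) {
        head->len++;                  /* extend the current run */
        return head;
    }
    struct aablock *b = malloc(sizeof *b);
    b->s_start = s;
    b->r_start = r;
    b->len = 1;
    b->next = head;                   /* newest run at the list head */
    return b;
}
```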

We measured the performance of this library, with and without caching,
for a variety of array distributions on the Intel Paragon.  Because
the library is optimized for certain distributions (block and cyclic),
we find, as expected, that some communications that involve these
distributions are not improved by caching.  However, for other
communications involving these distributions, we find that using
caching can incrementally improve performance by 20\% or so. For
distributions (general block-cyclic) that are not optimized by the
library, we find that caching can improve performance by as much as a
factor of two.  Finally, we note that even when caching is ineffective,
it worsens performance only by 20\% or so.

\begin{figure}
\centerline{\epsfxsize=3in \epsfbox{bbpgon.epsf}}
\caption{Average communication time for block to block on Paragon}
\label{bbpgon}
\end{figure}

\begin{figure}
\centerline{\epsfxsize=3in \epsfbox{rcpgon.epsf}}
\caption{Average communication time for row-block to column-block on Paragon}
\label{rcpgon}
\end{figure}


Figure \ref{bbpgon} shows a communication where caching is
ineffective.  In this case, we are communicating from a block
distribution to a block distribution.  We see that the average
communication time for the cached communication after a large number
of iterations is about 20\% higher than that of the uncached
communication; this is the unamortized overhead of using the cached
approach.  Figure \ref{rcpgon} shows a communication where caching is
marginally effective.  Here, we send a two-dimensional array
whose rows are distributed by blocks to a destination where its
columns are distributed by blocks -- a distribution transpose.  Because
of the additional computation involved, as well as the small number of
address-address blocks necessary to represent the address relation,
caching proves to be a marginal win.  Ultimately, it improves the
performance of the communication by about 20\%.  In Figure
\ref{bcpgon} we see a communication between a general block-cyclic
distribution with a small block size to a general block-cyclic
distribution with a large block size.  The library does not have an
optimization for this case, and, in fact, the inner loop of the
computation only generates one tuple each time it is entered.  Despite
this, there is a fair amount of contiguity in the address relation,
so there are few address-address blocks in the cache.  As a result, the
communication is ultimately a factor of two faster by doing message
assembly using the cache.  

\begin{figure}
\centerline{\epsfxsize=3in \epsfbox{bcs2lpgon.epsf}}
\caption{BC(small) to BC(large) on Paragon}
\label{bcpgon}
\end{figure}

\section{Other Issues in Caching}

\subsection{Irregular Distributions}

The previous section examined the effect of caching in the
communication of HPF-distributed arrays.  The address computation
associated with such regular distributions can often be drastically
optimized.  For example, the library described handles block-block,
block-cyclic, cyclic-block, and cyclic-cyclic distributions as special
cases, maximizing the amount of time spent in a tight inner loop.
However, we noted that even in the optimized cases, caching can often
improve performance.  Caching should be much more effective in
irregular distributions, where the array is partitioned among the
processors such that an index array (which may itself be distributed)
is needed to find where a particular element resides.  Indeed, the
reusable ``schedules'' generated by CHAOS/PARTI \cite{commoptparti}
perform just this function.  In this case, because the index array may
be distributed, performing a communication without caching could
generate additional communication to determine where the needed
element of the index array is.

\subsection{Optimization Opportunity}

Because all the tuples are known before message assembly begins,
assembling the message from the cached address relation  presents
opportunities for optimizing memory reference patterns by changing the
order of the computed tuples.  For example, in the HPF library
described above, coalescing is performed so that the relation is
represented as runs of addresses that are contiguous on the sender and
the receiver.  The runs are ordered by their starting addresses on the
sender.  This scheme balances memory locality on the sender and the
receiver.  For general block-cyclic distributions, this is important
because using the tuples in the order presented by the address
computation would result in strided memory access both on the sender
and the receiver. The cache could also optimize the memory reference
pattern solely on the sender, if strided reads were more expensive
than strided writes, or on the receiver, if the opposite was true.
Further, the cache could generate prefetch hints to improve message
assembly performance.
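For instance, with the relation cached as runs, the reordering
described above amounts to a sort on the sender-side start addresses
(a sketch; the struct and names are illustrative):

```c
#include <stdlib.h>

/* A cached run, contiguous on both the sender and the receiver. */
struct run { long s_start, r_start, len; };

static int by_sender(const void *a, const void *b)
{
    long sa = ((const struct run *)a)->s_start;
    long sb = ((const struct run *)b)->s_start;
    return (sa > sb) - (sa < sb);
}

/* Order runs by their starting address on the sender so that message
   assembly reads sender memory in increasing address order. */
void order_runs(struct run *runs, size_t nruns)
{
    qsort(runs, nruns, sizeof runs[0], by_sender);
}
```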

\subsection{Cache Structures}

The cache data structure should minimize the amount of memory the
address relation requires, and the time spent accessing the cache
during message assembly.  It is important to note that the cache
access time also includes computation to reconstruct the relation from
the data actually stored in the cache.  There is a continuum of
combinations of storage and computation that ranges from individually
storing each tuple of the address relation in memory to simply
performing the original address computation.  The choice of cache
structure should find the point on that continuum which minimizes the
time spent accessing the cache given memory constraints.

\subsection{Cache Management}

One question that is unresolved here is how to manage the cache given
a finite amount of memory -- i.e., what is the cache replacement policy?
The HPF library described above has a finite sized cache which is
managed by having the compiler or user tag each communication.  Each
(tagged) element of the cache is the set of address relations (one per
destination processor) necessary to perform the communication.  If a
new communication has the same tag as an element already in the cache,
it replaces that element.  Of course, it could happen that the
allotted memory in the cache is insufficient to store even one of
these elements, or even a single address relation.  What to do in
those circumstances is unclear.

\subsection{Cache Location}

The simple model that was developed in section \ref{simple} assumed
that the processor performed message assembly/disassembly.  The HPF
library discussed did just that.  Assembly and disassembly could be
avoided if the network interface supported scatter/gather DMA.
However, note that the overhead of computing the address relation
remains.  If caching is supported, we could naturally convert cached
address relations into scatter/gather maps for the DMA controller
after optimizing their memory access patterns.  In fact, if the
controller is especially powerful, it may be reasonable to implement
cache management on the controller.


\section{Conclusion}

We derived a simple model for address relation caching in deposit
model communication and verified it on the iWarp and the Paragon.  We
find that cache overhead is quite low even for an inefficient caching
scheme - if address computation involves more than a few instructions
per data item, caching can improve performance if the cached data is
reused.  We showed that this holds true in practice with measurements
of an HPF library that uses caching.  Finally, we noted that caching
becomes more effective with irregular distributions, and that caching
provides an opportunity to optimize for the available hardware.


\begin{thebibliography}{foo}

\bibitem[1]{depmodel}Deposit Model Reference

\bibitem[2]{stricker}Reference to Copy Performance

\bibitem[3]{HPF}High Performance Fortran Forum, {\em High Performance
Fortran Language Specification}, Technical Report CRPC-TR92225,
Center for Research in Parallel Computation, Rice University, May 1993.

\bibitem[4]{iwarp}S. Borkar, et al., {\em iWarp: An Integrated
Solution to High-speed Parallel Computing}, Proceedings of
Supercomputing '88.

\bibitem[5]{paragon}Intel Corporation, {\em Paragon User's Guide}

\bibitem[6]{i860}Intel Corporation, {\em i860 Microprocessor Family
Programmer's Reference Manual}

\bibitem[7]{stichnoth}J. Stichnoth, D. O'Hallaron, and T. Gross,
{\em Generating Communication for Array Statements: Design,
Implementation, and Evaluation}, Journal of Parallel and Distributed
Computing, vol. 21, no. 1, April 1994, pp. 150--159.

\bibitem[8]{commoptparti}R. Das, M. Uysal, J. Saltz, and Y. Hwang,
{\em Communication Optimizations for Irregular Scientific Computations on
Distributed Memory Architectures}, Journal of Parallel and Distributed
Computing, vol. 22, no. 3, September 1994, pp. 462--478.


\end{thebibliography}

\end{document}
