%\documentstyle[times,acmproc]{article}
%\documentstyle[epsf,11pt]{article}
%\documentstyle[epsfig,11pt]{article}
\documentstyle[editedbk,epsf,psfonts]{kluwerbk}
%\setlength{\topmargin}{0in}
%\setlength{\headheight}{0in}
%\setlength{\headsep}{0in}
%\setlength{\textheight}{9in}
%
%\setlength{\oddsidemargin}{0in}
%\setlength{\evensidemargin}{0in}
%\setlength{\textwidth}{6.5in}
%\setlength{\marginparsep}{0in}
%\setlength{\marginparwidth}{0in}


\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother

\begin{document}

\TOCauthor{Peter A. Dinda, David R. O'Hallaron}

\articleauthor{Peter A. Dinda and
David R. O'Hallaron}

\articleaffil{School of Computer Science,
              Carnegie Mellon University,
              Pittsburgh, PA 15213}

\articletitle{The Performance Impact of Address Relation Caching}


\shortenedtitle{Impact of Address Relation Caching}

\bibliographystyle{plain}

\begin{abstract}



An important portion of the end--to--end latency of a data transfer
is spent in address computation, which determines a relation between
sender and receiver addresses.  In deposit model communication, this
computation happens only on the sender, and some of its results are
embedded in the message.  Conventionally, address computation takes
place on--line, as the message is assembled.  If the amount of address
computation is significant and the communication is repeated, it may
make sense to remove address computation from the critical path by
caching its results.  However, assembling a message using the cache
consumes additional memory bandwidth.

We present a fine grain analytic model for simple address relation
caching in deposit model communication.  The model predicts how many
times a communication must be repeated in order for the average
end--to--end latency of an implementation that caches to break even
with that of one that does not.  The model also predicts speedup and
the regimes in which a caching implementation never breaks even.  The
model shows that the effectiveness of caching depends on CPU speed,
memory bandwidth, and the complexity of the address computation.  We
verify the model on the iWarp and the Paragon and find that, on both
machines, caching can improve performance even when address
computation is quite simple (one instruction per data word on the
iWarp and 16 instructions per data word on the Paragon.)

To show the practical benefit of address relation caching, we examine
the performance of an HPF distributed array communication library that
can be configured to use caching.  In some cases, caching can double
the performance of the library.  Finally, we discuss other benefits to
caching and several open issues.

\begin{center}
\begin{minipage}[t]{4.5in}
\footnotesize
This research was sponsored in part by the Advanced Research Projects
Agency/CSTO monitored by SPAWAR under contract N00039-93-C-0152, in
part by the National Science Foundation under Grant ASC-9318163, and
in part by a grant from the Intel Corporation.

The authors' email addresses are: pdinda@cs.cmu.edu and
droh@cs.cmu.edu.  
\end{minipage}
\end{center}

\end{abstract}


\section{Introduction}
\label{sec:intro}

A basic communication operation for applications running on
distributed memory machines is a {\em point-to-point communication},
where the {\em sender} transfers data, in the form of a {\em message},
from various locations in its memory to various locations in the
memory of the {\em receiver}.  Because each communication represents
overhead to the application, it is important to perform this operation
as quickly as possible. This task is complicated by the fact that
applications do not always keep the data they need to communicate in
convenient contiguous memory locations.

In general, point-to-point communication with conventional message
passing systems like PVM~\cite{sund90}, NX~\cite{pierce94}, and
MPI~\cite{walker94} involves three basic steps on the sender: (1) {\em
address computation}, which determines the local addresses of the data
items to be transferred, (2) {\em message assembly}, which copies the
data items into a contiguous buffer, and (3) {\em message transfer},
which copies the message from the buffer to the wire.  The receiver
performs a symmetric set of steps. The performance measure of interest
is {\em end--to--end latency}, which is the time to perform all three
steps on both the sender and receiver.

A different and typically more efficient approach, known as {\em
deposit model message passing}~\cite{stichnoth94,stricker95}, combines
the address computation steps on the sender and the receiver into a
single step on the sender. The result of the address computation step
on the sender is an {\em address relation} that associates addresses
on the sender with addresses on the receiver.  The sender performs the
same steps as above, but passes the receiver addresses along with the
data.  The receiver uses these addresses to scatter the data to the
correct memory locations.  Other properties of the deposit model make
it possible for sophisticated implementations to assemble and
disassemble messages directly to and from the wire, possibly via
specialized hardware.
In this paper, we concentrate on a simple implementation that stages
data in memory at both the sender and receiver.

In any implementation of deposit model message passing, and in
conventional message passing systems, address computation is in the
critical path of data transfer.  Address relation caching decouples
address computation and data transfer by doing address computation
off-line, storing the address relation in memory, and doing message
assembly using the stored address relation.  A primary motivation is
to amortize the address computation cost of communications that are
repeated (for example, an HPF array assignment statement in a loop.)
A similar approach is used by the CHAOS system~\cite{saltz91} to
improve the performance of irregular collective communications.  

To determine the effectiveness of address relation caching, we develop
a simple analytic model for predicting the performance tradeoffs
between cached and uncached point--to--point communications for a
simple representation of the address relation.  The model shows that
there are regimes where caching is always ineffective, as well as
regimes where caching becomes effective after the communication has
been repeated a sufficient number of times.  For these latter regimes,
the model predicts the number of times the communication must be
repeated in order for caching to break even, as well as speedup.
Speedup is bounded only by the complexity of the address relation
computation.  

We validate the simple model on both the Intel Paragon and Intel iWarp
systems.  The model's predictions closely match our measurements.
Because of iWarp's higher ratio of memory bandwidth to instruction
issue rate, caching becomes effective with far simpler address
computations than on the Paragon.  However, in both cases, even if
only a small number of instructions are executed per tuple of the
address relation, caching is effective.  

To demonstrate the practical benefit of caching, we compare the cached
and uncached performance of an HPF~\cite{HPF} communication library on
the Paragon.  We find that caching becomes more effective as the
distribution of the data becomes more irregular, and thus the address
relations become harder to compute.  

We conclude with discussion about other issues and benefits of caching.




\section{Deposit model communication}

In a conventional communication system, such as PVM~\cite{sund90},
NX~\cite{pierce94}, or MPI~\cite{walker94}, the sender computes a
binary relation $S$ between local addresses and addresses offset into
the {\em message} that is transported between it and the receiver.  If
$(s,m) \in S$ then the data item at local address $s$ is copied into
the message at offset $m$.  The receiver computes a binary relation
$R$ between addresses offset into the message and local addresses.  If
$(m,r) \in R$, then the data item at offset $m$ in the message is
copied to local address $r$.  No actual address information is
contained in the message.

In deposit model communication, the sender computes a binary relation
$SR$ between local addresses at the sender and local addresses at the
receiver -- if $(s,r) \in SR$ then the data item at address $s$ will
be transported to address $r$ at the receiver.  We refer to $SR$ as
the {\em address relation}.  The sender uses the address relation to
{\em assemble} a message that conceptually consists of 2-tuples
$(r,d)$, where data item $d$ (read from address $s$) is to be written
to, or deposited at, address $r$ on the receiver.  The receiver
runs a simple {\em deposit engine}, which receives the message and {\em
disassembles} it, writing each $d$ to its corresponding local
address $r$.  If the message actually contains 2-tuples, it is in the
{\em address-data pair} format.  Often, the format of the message can be
optimized so that the overhead of including addresses in the message
is very low.  One such format is referred to as {\em address-data
block.} In this format, the message consists of 3-tuples $(r,l,D)$,
where $D$ consists of $l$ data items that are deposited at consecutive
addresses starting at $r$.  Obviously, other optimizations are also
possible, depending on the properties of the address relation.  

Strictly speaking, including addresses in the message is a variant of
deposit called {\em direct deposit.}  Other properties of the deposit
model make it possible for sophisticated implementations to eliminate
in-memory staging of the message, possibly by using specialized
hardware.  We concentrate on a simple implementation that stages the
message in memory and uses a simple address-data pair message format.
Further details of deposit model communication can be found in
\cite{stricker95}.
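As a concrete illustration, the address-data pair format described
above can be sketched in a few lines of Python.  This is a minimal
sketch under our own naming ({\tt assemble}, {\tt deposit}, {\tt SR});
it is not the paper's implementation, which operates on raw memory
rather than Python lists.

```python
# Sketch of deposit model communication with the address-data pair
# message format.  Names and data layout are illustrative only.

def assemble(sender_mem, SR):
    """For each (s, r) in the address relation SR, read the data item at
    sender address s and pair it with the receiver address r."""
    return [(r, sender_mem[s]) for (s, r) in SR]

def deposit(receiver_mem, message):
    """Deposit engine: disassemble the message, writing each data item d
    to its corresponding local address r."""
    for (r, d) in message:
        receiver_mem[r] = d

# Transfer three non-contiguous words from sender to receiver.
sender_mem = [10, 20, 30, 40, 50]
receiver_mem = [0] * 5
SR = [(0, 4), (2, 1), (4, 3)]      # (sender address, receiver address)
deposit(receiver_mem, assemble(sender_mem, SR))
# receiver_mem is now [0, 30, 0, 50, 10]
```

The address-data block format differs only in that runs of consecutive
receiver addresses are collapsed into $(r,l,D)$ 3-tuples.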


\section{Analytic models}
\label{sec:simple}

This section presents models for the average end--to--end latency of a
point-to-point communication using uncached and cached implementations
of deposit model message passing.  We consider the simplest (and least
efficient) message format and cache structure and derive expressions
for the average latency for both uncached and cached communication.
We compare these expressions to provide some insight into the effect
of caching.  

%As an aid to reader, we have summarized the meaning of
%the symbols that follow in several figures.  Figure \ref{elts}
%contains the symbols used for the coarse grain elements of the latency
%expressions.
%\begin{figure}
%\centering
%\begin{tabular}{ll}
%	$t_{uncached}$ & Point-to-point communication time w/o caching  \\
%	$t_{cached}$ & Point-to-point communication time with caching  \\
%	$t_g$ & Address relation computation time  \\
%	$t_{au}$ & Message assembly time w/o caching   \\
%	$t_{ac}$ & Message assembly time with caching  \\
%	$t_c$ & Message communication time  \\
%	$t_d$ & Message disassembly time  \\
%	$t_o$ & Cache update time  \\
%\end{tabular}
%\caption{Elements of point-to-point communication times}
%\label{fig:elts}
%\end{figure}
%Figure \ref{algimp}
%contains the implementation parameters of the cache and address
%computation algorithm that we will use.  
%Finally, figure \ref{machprop}
%contains the machines parameters that we will use.

\subsection{Algorithm properties}

%\begin{figure}
%\centering
%\begin{tabular}{ll}
%	$n$   & Number of data words in message \\
%	$n_i$ & Number of times communication is done  \\
%	$n_g$ & Instructions/tuple to compute address relation \\
%        $n_o$ & Instructions/tuple to store tuple in cache \\
%        $n_{au}$ & Instructions/tuple to do uncached assembly \\
%	$n_{ac}$ & Instructions/tuple to do cached assembly \\
%        $n_d$ & Instructions/tuple to do disassembly
%\end{tabular}
%\caption{Algorithm and cache implementation parameters}
%\label{fig:algimp}
%\end{figure}

Clearly, the time spent computing the address relation depends on the
algorithm and its implementation.  We model address relation
computation as limited by instruction execution and as delivering the
relation in an on-line manner.  Because we assume that all computation
is done in registers, we concentrate on the single parameter $n_g$,
the number of instructions executed per element of the relation.


\subsection{Machine properties}

%\begin{figure}
%\centering
%\begin{tabular}{ll}
%	$r_i$ & Instructions/second of machine  \\
%	$r_{rc}$ & Words/second of unit memory reads  \\
%	$r_{rr}$ & Words/second of random memory reads  \\
%	$r_{wc}$ & Words/second of unit memory writes  \\
%	$r_{wr}$ & Words/second of random memory writes  \\
%	$r_c$ & Words/second of node-node communication \\
%\end{tabular}
%\caption{Machine parameters}
%\label{fig:machprop}
%\end{figure}

Since computing the address relation is instruction execution limited,
we must know $r_i$, the number of instructions the machine can execute
per second.  We also need to know $r_c$, the communication bandwidth
between the two nodes.  The memory system is parameterized by four
values.  The first two values, $r_{rc}$ and $r_{wc}$, are the read and
write bandwidths for unit--stride accesses, respectively.  The other
two values, $r_{rr}$ and $r_{wr}$, are the read and write bandwidths
for random memory accesses.  
%These machine properties are summarized
%in figure \ref{machprop}.

\subsection{Uncached communication}

The time for an uncached point-to-point communication has four
components: $t_g$, the time to compute the address relation; $t_{au}$,
the message assembly time; $t_c$, the message communication time; and
$t_d$, the message disassembly time at the receiver.  The total time
for the communication is
\begin{displaymath}
t_{uncached} = t_g + t_{au} + t_c + t_d
\end{displaymath}
In this simple model, we assume that the message format is
address-data pair.

The time to compute the address relation is largely determined by the
complexity of the algorithm used, its implementation, and the
instruction execution rate of the machine.  Thus, for an $n$ word
message, the time required is
\begin{displaymath}
t_g = \frac{n n_g}{r_i}
\end{displaymath}

The code that computes the address relation presents tuples $(s,r)$ in
registers in an on-line manner.  For each of these tuples, we read the
data word at address $s$, write $r$ to the message buffer, then write
the data word to the message buffer.  Note that the writes to the
message buffer are to consecutive addresses, so their performance can
be captured by $r_{wc}$, the contiguous write bandwidth of the
machine.  On the other hand, the sender side addresses of the tuples
may not be contiguous, so we capture their performance with $r_{rr}$,
the random read bandwidth of the machine.  Furthermore, we capture the
loop overhead as a number of instructions, $n_{au}$.  Thus the time to
assemble the message is
\begin{displaymath}
t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{n_{au}}{r_i}
          \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.
This will bring it close to the above expression.

After the message has been assembled, we use the communication
hardware to transport it to the receiver.  The message communication
time is
\begin{displaymath}
 t_c = \frac{2 n}{r_c} 
\end{displaymath}
Note that $r_c$ is really a function of $2n$, but we assume that $n$
is sufficiently large that we can use a fixed value for $r_c$.
Further, this value is the same with and without caching, so it is
irrelevant when comparing them.

On the receiving node, each word of the message is read consecutively
and each address-data pair results in one random write to memory.  
Additionally, $n_d$ loop overhead instructions are executed for a total
disassembly time of 
\begin{displaymath}
 t_d = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{wr}} +\frac{n_d}{r_i}
       \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes. Again, an implementation will probably
improve performance by writing several data items at a time, bringing it
close to the expression.

Substituting into the expression for $t_{uncached}$, we see that an
$n$ word communication between two nodes requires time
\begin{equation}
 t_{uncached} = n \left ( \frac{n_g+n_{au}+n_d}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{2}{r_c} +
 \frac{2}{r_{rc}} + \frac{1}{r_{wr}} \right )
\label{equ:uncached}
\end{equation}


\subsection{Cached communication}

The average time for a cached point-to-point communication has five
components: $t_g$, the time to compute the address relation; $t_o$,
the time to store the address relation in the cache; $t_{ac}$, the
message assembly time from the cached address relation; $t_c$, the
message communication time; and $t_d$, the message disassembly time at
the receiver.  The average time for the communication is
\begin{displaymath}
 t_{cached} = \frac{t_g + t_o}{n_i} + t_{ac} + t_c + t_d 
\end{displaymath}
Note that the first two components are amortized over the number of
times the communication is performed, $n_i$.  The quantities $t_g$,
$t_c$, and $t_d$ are the same in the uncached and cached schemes and
were determined in the preceding section.  We continue to assume that
the message format is address-data pair.  Further, we choose the
simplest possible cache data structure: an array of 2-tuples that
represent the relation.

For the simple model, as each tuple of the address relation is
generated, we write it to an array.  Thus storing the address relation
in the cache involves two contiguous writes for each in-register tuple
produced by the address relation computation.  In addition, we execute
$n_o$ loop overhead instructions, so 
\begin{displaymath}
 t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
\end{displaymath}

When assembling the message from the cached address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the cached address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the receiver address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, thus the time to
assemble a message with $n$ data words is
\begin{displaymath}
 t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes, but, again, a good implementation will
perform several data word reads at a time and thus approach this
performance.
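The cached path can be sketched in the same style as before: tuples
produced on-line are first written to an array (the cache), and
subsequent assemblies read tuples back from that array instead of
recomputing them.  This is again a hedged sketch with illustrative
names, not the measured implementation.

```python
# Sketch of simple address relation caching.  The cache is an array of
# (s, r) 2-tuples; all names are illustrative only.

def compute_relation(n):
    """Stand-in for the on-line address computation (here the identity
    relation, as used in the validation experiments)."""
    for i in range(n):
        yield (i, i)

def fill_cache(n):
    """Store each generated tuple in the cache array (the cost t_o:
    two contiguous writes plus n_o loop instructions per tuple)."""
    return list(compute_relation(n))

def assemble_from_cache(sender_mem, cache):
    """Cached assembly (the cost t_ac): per tuple, two contiguous reads
    from the cache, one random read of the data word, and two contiguous
    writes into the message buffer."""
    return [(r, sender_mem[s]) for (s, r) in cache]

cache = fill_cache(4)              # paid once, amortized over n_i uses
msg = assemble_from_cache([7, 8, 9, 10], cache)
# msg == [(0, 7), (1, 8), (2, 9), (3, 10)]
```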

Substituting into the expression for $t_{cached}$, we see that an 
$n$ word communication between two nodes requires time
\begin{eqnarray}
 t_{cached} & = & n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} + 
	\frac{n_{ac}+n_d}{r_i} + \frac{2}{r_{rc}} \right . \nonumber \\
            &   & \left . + \frac{1}{r_{rr}} +  \frac{2}{r_{wc}} + 
             \frac{2}{r_c} + \frac{2}{r_{rc}} + \frac{1}{r_{wr}} \right )
\label{equ:cached}
\end{eqnarray}

\subsection{Effects of caching}

Equations~\ref{equ:uncached} and~\ref{equ:cached} provide insight
into the following questions: For a point--to--point communication
that occurs multiple times at run--time, can a cached implementation
ever outperform an uncached implementation?  If so, how many times
must the communication be repeated before the cached implementation
wins?  For a large number of iterations, what is the speedup in
message assembly time of the cached implementation over the uncached
implementation?

Caching can only be effective if we can assemble the message faster
using the cache than without using the cache, i.e., 
\begin{displaymath}
t_{ac} < t_g + t_{au}
\end{displaymath}
or
\begin{equation}
% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
\label{equ:necessary}
\end{equation}
The right hand side of equation~\ref{equ:necessary} is a
machine--specific threshold below which caching is ineffective.
Notice that it is largely determined by the ratio of the instruction
issue rate to the memory read bandwidth.
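Using the machine and implementation parameters reported in the
validation sections below (reused here ahead of their formal
introduction), the threshold is easy to evaluate.  A sketch:

```python
# Machine-specific caching threshold: caching can pay off only when
#   n_g > 2*r_i/r_rc + n_ac - n_au.
# Parameter values are those reported in the validation sections.

def threshold(r_i, r_rc, n_ac, n_au):
    return 2 * r_i / r_rc + n_ac - n_au

# iWarp (dynamic estimates): r_i = 2e7, r_rc = 1e7, n_ac = 3, n_au = 7
t_iwarp = threshold(2e7, 1e7, 3, 7)     # 0.0: any n_g >= 1 can break even

# Paragon (static estimates): r_i = 5e7, r_rc = 9e6, n_ac = 11, n_au = 6
t_paragon = threshold(5e7, 9e6, 11, 6)  # about 16.1: n_g >= 17 is required
```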

If inequality~\ref{equ:necessary} holds, then we can also predict how many
times the cached address relation must be used in order to fully amortize
the cost of caching the relation.  This break even point is found by
setting $t_{cached} = t_{uncached}$ and solving for $n_i$, which gives
$n_i = (t_g + t_o)/(t_g + t_{au} - t_{ac})$; substituting the component
expressions and rounding up yields
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
\right \rceil 
\label{equ:crossover}
\end{equation}
It is important to point out that as $n_g$ moves away from the
threshold, the break even point decreases sharply because the $n_g$
terms in the fraction begin to dominate.
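Equation~\ref{equ:crossover} is easy to evaluate numerically.  The
sketch below uses the iWarp dynamic estimates reported in the
validation section ($r_i = 2\times10^7$, $r_{rc}=r_{wc}=10^7$,
$n_{au}=7$, $n_o=8$, $n_{ac}=3$); the resulting break even points
match the Pred-Dyn column of Figure~\ref{fig:iw}(a).

```python
import math

# Break even point (equation crossover): iterations needed before the
# average cached latency matches the uncached latency.

def break_even(n_g, r_i, r_rc, r_wc, n_au, n_o, n_ac):
    denom = (n_g + n_au - n_ac) * r_rc - 2 * r_i
    if denom <= 0:
        return None                # below the threshold: never breaks even
    return math.ceil((r_rc / r_wc) * ((n_g + n_o) * r_wc + 2 * r_i) / denom)

# iWarp dynamic estimates from the validation section.
iwarp = dict(r_i=2e7, r_rc=1e7, r_wc=1e7, n_au=7, n_o=8, n_ac=3)
b1 = break_even(1, **iwarp)        # 13 iterations
b2 = break_even(2, **iwarp)        # 7 iterations
```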

Finally, the speedup in message assembly time for the cached
implementation over the uncached implementation is
\begin{equation}
\label{equ:speedup}
\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
              + r_ir_{rc}(r_{wc}+2r_{rr})}
             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr})
              + 2r_ir_{rr}r_{wc}}
\end{equation}
The speedup is linear in $n_g$, so it is bounded only by the complexity
of the address computation.
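As a check, the assembly speedup can be computed directly from the
per-word component times $t_g + t_{au}$ and $t_{ac}$ derived above.
With the iWarp dynamic estimates reported later ($r_i = 2\times10^7$,
all memory rates $10^7$ words/second, $n_{au}=7$, $n_{ac}=3$), the
sketch below gives about 1.077 for $n_g = 1$, consistent with the
Pred-Dyn entry of 1.07 in Figure~\ref{fig:iw}(b).

```python
# Assembly speedup computed from the per-word latency components:
#   (t_g + t_au) / t_ac.

def assembly_speedup(n_g, r_i, r_rr, r_rc, r_wc, n_au, n_ac):
    uncached = n_g / r_i + 1 / r_rr + 2 / r_wc + n_au / r_i  # t_g + t_au
    cached = 2 / r_rc + 1 / r_rr + 2 / r_wc + n_ac / r_i     # t_ac
    return uncached / cached

# iWarp dynamic estimates from the validation section.
s = assembly_speedup(1, r_i=2e7, r_rr=1e7, r_rc=1e7, r_wc=1e7,
                     n_au=7, n_ac=3)   # about 1.077 for n_g = 1
```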


\section{Model validation}


In this section we validate the models from Section~\ref{sec:simple}
for point--to--point communications on an iWarp system and on a
Paragon.  To test the predictive power of the models, we use
(\ref{equ:crossover}) to predict the number of iterations necessary
for the average cached message assembly time to break even with the
uncached message assembly time as a function of $n_g$, and we use
(\ref{equ:speedup}) to predict the speedup of cached over uncached
message assembly, also as a function of $n_g$.  In each case, the
address relation that is generated is the identity relation, and
$n_g$, the number of instructions to compute the relation, is varied
by inserting nops into the code stream. The predictions are compared
to actual measured running times on the hardware.

\subsection{iWarp}

On the iWarp~\cite{iwarp} system, $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$
words/second, and a node can execute $r_i=2 \times 10^7$
instructions/second.  The memory system parameters were 
derived directly from the iWarp specification.  Specifically, the iWarp clock
speed is 20 MHz and all memory operations complete in two cycles.  There
is no cache and we assume no dual dispatch of loads and stores.

The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
estimated {\em statically} by simply counting the number of
instructions in each loop, or {\em dynamically} by accounting for
bypass stalls that arise in the loops. The static estimates for iWarp
are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.

Figure~\ref{fig:iw}(a) shows the measured and predicted number of
iterations needed on iWarp for cached assembly to break even with
uncached assembly for different $n_g$.  The predictions made using the
dynamic estimates are exact.  The predictions made using the static
estimates are pessimistic for small $n_g$, but quickly converge to the
actual numbers as $n_g$ increases.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1        &         13     &        INF        &       13            \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &          3        &        3\\
\hline
\end{tabular} 
&
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1       &         1.05    &         1.00     &         1.07 \\
   2       &         1.12    &         1.09     &         1.15 \\
   3       &         1.22    &         1.16     &         1.23 \\
   4       &         1.28    &         1.25     &         1.32 \\
   5       &         1.35    &         1.33     &         1.39 \\
   6       &         1.43    &         1.41     &         1.47 \\
   7       &         1.49    &         1.49     &         1.54 \\
   8       &         1.59    &         1.59     &         1.61 \\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{Model validation on iWarp:  (a) Iterations required to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:iw}
\end{figure}

Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
cached over uncached message assembly on iWarp. As in
Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for all
$n_g$ and the static predictions are pessimistic for small $n_g$.
However, it is important to note that even with static estimates,
the predictions only become inaccurate when $n_g$ approaches the
minimum number of instructions where the cached implementation can
outperform the uncached implementation. Another important point is
that the purpose of Figure~\ref{fig:iw}(b) is to help validate the
models when $n_g$ is small, and not to highlight impressive
speedups. In fact, the speedups grow arbitrarily large as $n_g$
increases (see Equation~\ref{equ:speedup}.)




\subsection{Paragon}

A Paragon node contains a 50 MHz pipelined, multiple issue RISC
processor~\cite{PARAGON-XPS,I860-PR}.  We estimated the instruction
issue rate by noting that we do not use the floating point pipeline.
Therefore we can issue at most $r_i=5 \times 10^7$
instructions/second.  We directly measured the memory parameters and
found $r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.  Only
a static measurement (instruction counting) of the implementation
properties was performed, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.

Figure~\ref{fig:86}(a) shows the measured and predicted number of
iterations needed for cached assembly to break even with uncached
assembly for different $n_g$.  The predictions are pessimistic, but
quickly converge to the actual values with increasing $n_g$.  Notice
that $n_g$ must be much larger to permit break even on the Paragon than
on the iWarp.  This is not surprising given that the Paragon's
instruction issue rate is much higher than its memory bandwidth, while the
iWarp's issue rate is well matched to its memory bandwidth.
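Evaluating equation~\ref{equ:crossover} with the Paragon static
estimates above reproduces the Pred-Stat column of
Figure~\ref{fig:86}(a), including the regime where caching never
breaks even.  A sketch:

```python
import math

# Break even point (equation crossover) with the Paragon static estimates.

def break_even(n_g, r_i, r_rc, r_wc, n_au, n_o, n_ac):
    denom = (n_g + n_au - n_ac) * r_rc - 2 * r_i
    if denom <= 0:
        return None                # caching never breaks even (NONE)
    return math.ceil((r_rc / r_wc) * ((n_g + n_o) * r_wc + 2 * r_i) / denom)

paragon = dict(r_i=5e7, r_rc=9e6, r_wc=8.2e6, n_au=6, n_o=7, n_ac=11)
p16 = break_even(16, **paragon)    # None: below the threshold
p17 = break_even(17, **paragon)    # 41 iterations
p18 = break_even(18, **paragon)    # 20 iterations
```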

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline

    14     &        NONE    &        NONE\\
    15     &        NONE    &        NONE\\
    16     &        19      &        NONE\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
&
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline
    14     &        NONE   &         NONE\\
    15     &        NONE   &         NONE\\
    16     &       1.03     &         NONE\\
    17     &       1.06     &         1.01\\
    18     &       1.09     &         1.02\\
    19     &       1.12     &         1.03\\
    20     &       1.15     &         1.04\\
    21     &       1.17     &         1.05\\
    22     &       1.20     &         1.06\\
    23     &       1.22     &         1.07\\
    24     &       1.24     &         1.08\\
    25     &       1.26     &         1.09\\
\hline
\end{tabular}\\
(a) & (b)
\end{tabular}
\caption{Model validation on Paragon: (a) Iterations to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:86}
\end{figure}

Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
cached message assembly over uncached message assembly on the Paragon.
Notice that speedup does not increase as quickly with $n_g$ as on the
iWarp (Figure~\ref{fig:iw}(b).)  This is the result of the lower
memory bandwidth relative to instruction issue rate on the Paragon.
The predictions are not as accurate as for the iWarp, reflecting the
fact that the Paragon's memory system is more complex.  However, the
predictions are pessimistic and converge to the correct values.  



\section{Caching in HPF}
\label{sec:hpf}

This section investigates the impact of caching on the performance of
array assignments in task parallel HPF programs.  In a task parallel
array assignment statement, the source and destination arrays are
distributed over disjoint sets of processors. The test cases presented
are written in Fx~\cite{gross94a} (a dialect of HPF that integrates
task and data parallelism) and compiled and run on an Intel Paragon.
Communication is performed with an enhanced version of the Fx
run--time library that supports cached and uncached message assembly.
The appropriate library function is invoked on each source node, with
the distributions of the source and destination arrays passed in as
arguments. Address computation is performed using the algorithm
in~\cite{stichnoth93a}.  The library can either cache the results of
this computation, or use them to directly assemble a message.

We measured the performance of the library, with and without caching,
for a variety of test cases.  As shown in Figure~\ref{fig:tasks}, each
\begin{figure}[t]
\centerline{\epsfysize .75in \epsfbox{tasks.eps}}
\caption{Experimental setup}
\label{fig:tasks}
\end{figure}
test case consists of two data parallel tasks that repeatedly assign a
$512 \times 512$ {\em source array} $S$, distributed over 16 nodes in
the {\em source task}, to a {\em destination array} $D$, distributed
over a disjoint set of 16 nodes in the {\em destination task}.  Each
test case is characterized by the distributions of the source and
destination arrays.
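The distributions that characterize each test case follow the usual
HPF index mappings.  A minimal sketch, in Python with 0--based indices
(a convention assumed here for illustration), of the owner computation
for BLOCK and CYCLIC(k) distributions of a one--dimensional index
space:

```python
def block_owner(j, n, p):
    """Owner of global index j under a BLOCK distribution of an
    n-element dimension over p processors."""
    b = -(-n // p)          # ceiling(n / p): the block size
    return j // b

def cyclic_owner(j, k, p):
    """Owner of global index j under a CYCLIC(k) distribution over
    p processors: blocks of k indices are dealt out round-robin."""
    return (j // k) % p
```

For the $512 \times 512$ arrays over 16 nodes used in the test cases,
a BLOCK distribution gives each node a block of 32 indices, while
CYCLIC(5) deals out blocks of 5 indices round--robin.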

The results indicate that the effectiveness of caching depends on the
source and destination distributions, and thus on the complexity of
address computation. In some cases caching is worse, in other cases it
is marginally better, and in other cases it is significantly better.
Three of the Paragon test cases (Figure~\ref{fig:hpfpgon}) illustrate
this point quite clearly:

\noindent
{\bf Case 1:} $\mbox{S(:,BLOCK)} \rightarrow \mbox{D(:,BLOCK)}$.  In
this case, the distributions of the source and destination arrays are
identical. The address computation for a block--to--block assignment
of this sort is a common case that is highly optimized in the library.
Because address computation is very inexpensive, caching is not
helpful in this case. In fact, although it is hard to see in the
figure, even after 1000 iterations the average end--to--end latency of
the cached version remains marginally higher than that of the uncached
version; cached message assembly never breaks even in this case.

\begin{figure}[t]
\centerline{
\begin{tabular}{cc}
\epsfxsize=2.25in \epsfbox{bbpgon.epsf} &
\epsfxsize=2.25in \epsfbox{rcpgon.epsf} \\
$\mbox{S(:,BLOCK)} \rightarrow \mbox{D(:,BLOCK)}$ &
$\mbox{S(BLOCK,:)} \rightarrow \mbox{D(:,BLOCK)}$ \\
{\bf Case 1} & {\bf Case 2} \\
\end{tabular}
}
\centerline {
\begin{tabular}{c}
\epsfxsize=2.25in \epsfbox{bcs2lpgon.epsf} \\
$\mbox{S(:,CYCLIC(5))} \rightarrow \mbox{D(:,CYCLIC(20))}$ \\
{\bf Case 3} \\
\end{tabular}
}
\caption{Effectiveness of caching for HPF on Paragon.  Each graph
plots the average end--to--end latency versus the number of iterations
for cached and uncached message assembly.}
\label{fig:hpfpgon}
\end{figure}


\noindent
{\bf Case 2:} $\mbox{S(BLOCK,:)} \rightarrow \mbox{D(:,BLOCK)}$.  This
example assigns a source array that is distributed by rows to a
destination array that is distributed by columns.  Because of the
additional computation involved (compared to Case 1), as well as the
small number of address--address blocks necessary to represent the
address relation, caching proves to be a marginal win.  Ultimately, it
improves average latency by about 20\%.

\noindent
{\bf Case 3:} $\mbox{S(:,CYCLIC(5))} \rightarrow
\mbox{D(:,CYCLIC(20))}$.  This example assigns a source array with a
small block size of 5 to a destination array with a block size of 20.
The library does not have an optimization for this case, and, in fact,
the inner loop of the address computation only generates one tuple
each time it is entered.  Despite this, there is a fair amount of
contiguity in the address relation, so there are few address--address
blocks in the cache.  As a result, the communication is ultimately a
factor of two faster if message assembly is performed from the cache.

In general, as distributions become more irregular, address
computation becomes more expensive because it becomes increasingly
difficult for the communication library to provide special--case
optimizations.  In these cases, such as Case 3 of
Figure~\ref{fig:hpfpgon}, caching can have a significant impact.


\section{Discussion}

In this section, we comment on other issues in address relation
caching.  We begin by noting the benefits of decoupling address
computation and message assembly in the communication system.  Next,
we discuss where to make the decision to cache an address
computation, pointing out that compilers play a key role.
Continuing, we comment on cache data structures and replacement
policy.  Finally, we point out that the cache provides an opportunity
for optimizations.


\subsection{Decoupling address computation and message assembly}

By taking address computation out of the critical path of data
transfer, address relation caching decouples address computation and
message assembly.  One advantage of this is that it makes it easier to
achieve reasonable communication performance without a large number of
special--case optimizations in the compiler and run--time library.
Instead of optimizing each address computation algorithm for each
architecture, only message assembly from cache needs to be optimized
for each architecture in order for all data transfers to benefit.
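For instance, if the cached relation is stored as (sender offset,
receiver offset, length) runs, a representation assumed here purely
for illustration, message assembly from the cache reduces to a single
gather loop, which is the one routine worth hand--tuning per machine:

```python
def assemble_from_cache(src, runs):
    """Gather the outgoing message from the source buffer using
    cached (src_offset, dst_offset, length) runs.  The destination
    offsets travel with the message for the receiver's deposit."""
    msg = bytearray()
    for src_off, _dst_off, length in runs:
        msg += src[src_off:src_off + length]   # one contiguous copy per run
    return bytes(msg)
```

Every address computation algorithm that emits runs in this form then
benefits from a single machine--specific implementation of this loop.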

Address relations are an ideal interface to the communication system
because they are implementation neutral.  This permits communication
systems to be highly specialized for the available hardware while
still presenting the same interface to the programmer or compiler.
However, to best utilize the hardware, it may be necessary to change
the representation of the address relation before using it.  For
example, the relation may be transformed into a DMA gather map on the
sender to make best use of a network adapter such as the one described
in~\cite{GNECTAR-IFIP}.  Because transforming the relation may be
expensive, it is desirable for the communication system, rather than
the application, to cache the (transformed) relation.


\subsection{Deciding to cache}

Our analytical model and HPF results show that caching can sometimes
be ineffective and that the number of times a cached address relation
must be reused to break even depends on the complexity of address
generation.  Given this, where should the decision to cache or not be
made?  One option is to make this decision in the compiler.  This is
almost certainly the right place if the distributions, the machine
characteristics, and the number of times the communication is performed
are known at compile--time.  If this is not the case, the decision can
be made at run--time, where at least the distributions and machine
characteristics are known.  However, the number of repetitions may be
data dependent and unknown until after the communication has been
performed.  To deal with this case, the library could keep track of
the number of times a particular communication is performed uncached,
and switch to a cached approach after some number of repetitions.
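The switching heuristic just described can be sketched as follows; the
class, the tag--keyed interface, and the threshold value are
hypothetical, not the Fx library's actual interface:

```python
class AdaptiveCache:
    """Perform a communication uncached until it has repeated enough
    times, then compute and cache its address relation."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = {}     # communication tag -> times performed
        self.cache = {}      # communication tag -> cached relation

    def relation_for(self, tag, compute_relation):
        if tag in self.cache:
            return self.cache[tag]           # off the critical path
        self.counts[tag] = self.counts.get(tag, 0) + 1
        relation = compute_relation()        # still on the critical path
        if self.counts[tag] >= self.threshold:
            self.cache[tag] = relation       # later calls skip computation
        return relation
```

With a threshold of three, the address computation is paid three
times; every later repetition assembles directly from the cache.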

\subsection{Cache data structures}

There is a continuum of cache data structures that ranges from
individually storing each tuple of the address relation in memory to
simply performing the original address computation.  The cache data
structure should be chosen to minimize the time spent accessing the
cache given memory constraints.  The structure should be
implementation--dependent in order to make best use of available
hardware.


\subsection{Replacement policy}

The cache replacement policy should make use of compile--time
knowledge.  The HPF library described above uses a strictly
direct--mapped policy at run--time, but the cache entry tags are
selected by the compiler, which knows the run--time policy.  This
means the compiler can ensure that useful address relations remain in
the cache.  What to do when there is insufficient memory to hold even
one address relation in cache is unclear.
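A minimal sketch of such a direct--mapped cache (the slot count and
method names are hypothetical): because the compiler selects the tags,
it can choose them so that relations it needs live at the same time
never share a slot.

```python
class DirectMappedCache:
    """Direct-mapped address relation cache keyed by compiler-chosen
    tags: each tag maps to exactly one slot, tag modulo slot count."""

    def __init__(self, nslots=8):
        self.slots = [None] * nslots   # each slot: (tag, relation) or None

    def lookup(self, tag):
        entry = self.slots[tag % len(self.slots)]
        return entry[1] if entry and entry[0] == tag else None

    def insert(self, tag, relation):
        # a new relation simply evicts whatever shares its slot
        self.slots[tag % len(self.slots)] = (tag, relation)
```

Two tags conflict exactly when they are congruent modulo the slot
count, so a compiler aware of the run--time policy can assign
non--conflicting tags to the relations it wants kept resident.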

\subsection{Cache optimizations}

The cache is an intermediate layer between address computation and
message assembly.  As such, it presents opportunities for
optimization.  For example, since the entire address relation is known
to the cache, the cache can reorder it for memory locality on the sender,
receiver, or both.  As shown in~\cite{memoryperform}, the choice is
architecture--dependent. If the distributions, machine
characteristics, and repetition count are known at compile--time, it
is reasonable to perform address generation at this point,
``pre-cache'' the results, and always assemble from cache when running
the program.
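One such optimization, sketched below under the assumption that the
address computation emits individual (sender address, receiver
address) pairs, is coalescing those pairs into runs that are
contiguous on both sides and ordering the runs by sender address:

```python
def coalesce(pairs):
    """Coalesce (src_addr, dst_addr) pairs into (src, dst, length)
    runs that are contiguous on both sender and receiver, ordered by
    sender address."""
    runs = []
    for src, dst in sorted(pairs):
        if runs and runs[-1][0] + runs[-1][2] == src \
                and runs[-1][1] + runs[-1][2] == dst:
            runs[-1][2] += 1                  # extend the current run
        else:
            runs.append([src, dst, 1])        # start a new run
    return [tuple(r) for r in runs]
```

The fewer and longer the resulting runs, the closer message assembly
comes to a handful of contiguous copies.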



\section{Conclusions}

Address relation caching at the communication system level is an
effective strategy for facilitating high bandwidth, low latency data
transfer.  Because address computation is removed from the critical
path, its cost no longer determines communication performance.  Our
results show that address relation caching is effective even when the
cost of address computation is very low, on the order of a few
instructions.  Further, the overhead of caching can generally be
amortized quickly if the communication is repeated.  Although our
analysis was in the context of deposit model communication, address
relation caching is not limited to this model.

\section*{ACKNOWLEDGEMENTS}

We would like to thank Thomas Stricker for illuminating the finer
points of deposit model communication, James Stichnoth for endless
patience in explaining the details of performing HPF array assignment
statements, and Ed Segall for helping to debug the HPF library.

  





\small
\bibliography{/afs/cs/project/iwarp/member/droh/bib/refs,/afs/cs/project/iwarp/member/fx/bib/texbib,local}

\end{document}
