\documentclass{article}
\usepackage{times}

\textheight =  9.0in
\topmargin = -.5 in 
\oddsidemargin = -.25 in
\evensidemargin = -.25 in
\textwidth=7in


\title{Address Relation Caching -- SIGMETRICS}

\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron}

\begin{document}

\maketitle

\begin{abstract}

blah blah

\end{abstract}

%\setlength{\baselinestretch}{2} 

\section{Introduction}

Motivation

Introduce address relations here.

We begin by examining address relations in the context of message
assembly for interprocessor direct deposit model communication induced
by high level parallel languages, particularly Fx, a variant of High
Performance Fortran.  Such message assembly involves address
computation as well as data copying.  By replacing the address
computation in the assembly loop with scanning the precomputed address
relation, we hope to improve the performance of message assembly.
Initially, we consider a very simple (and inefficient) representation
of address relations --- an array of tuples.  For this representation,
we develop a fine grain analytical model to predict, given machine
parameters, the regimes where address relation reuse can improve
message assembly performance.  The model also predicts how many times
a relation must be reused to break even with doing the address
computation in the assembly loop.  The model is validated on the Intel
iWarp and Intel Paragon.  Surprisingly, even with this simple
representation, we discover that if the address computation in the
assembly loop involves more than a few instructions (2 on the iWarp,
16 on the Paragon), address relation reuse can improve performance.
Further, beyond this threshold, the cost of precomputing the address
relation can be quickly amortized.

Next, we attack the storage requirements of address relations by
introducing three compressed representations that significantly reduce
storage costs.  The final form, DMRLEC, compacts relations induced by
common HPF distributed array assignments, including index
permutations, by three to four orders of magnitude.  The
representations are designed to be directly used in message assembly
and disassembly --- relations are {\em not} decompressed before use.
We examine the performance of all four address relations
representations on sixteen common HPF assignment statements on six
different machines using three metrics.  The vehicle for these tests
is ART --- the Address Relation Toolbox --- a portable address
relation framework written in C.  All measurements are of compiled
code --- most HPF compilers generate sequential source code that is
then compiled by a node compiler.  The machines we test are the DEC
3000/400 (133 MHz Alpha 21064), a 90 MHz Pentium system, a 133 MHz
Pentium system, the Intel Paragon (50 MHz i860XP), the IBM RS/6000
Model 250 (80 MHz PowerPC 601), and the IBM SP-2 (66 MHz POWER2).
These machines are either parallel supercomputer nodes, or the
workstations that are likely to see use in a high speed cluster
environment.  

Our first performance metric is {\em economy} --- the message assembly
throughput bought per byte of storage cost.  We then decompose economy
into {\em size}, an absolute storage cost metric, and {\em speed},
throughput as a percentage of stride-one copy loop throughput.  On all
the machines we tested, we find the same story with regard to economy
and size.  DMRLEC typically wins by orders of magnitude.  The result
of the speed measurements is harder to characterize. Except on the
Pentium systems, DMRLEC has the best throughput, typically running
just shy of the stride-one copy loop throughput when blocks are
assembled and at the strided copy loop throughput for strided access
patterns.  The other representations are about as fast, except for
pathological cases, where their throughput plummets.  In
non-pathological situations, DMRLEC is the fastest, followed closely by
AABLK, then DMRLE.  AAPAIR never exceeds 50\% of the stride-one copy
loop throughput.  On the Pentium, the situation is reversed: in
non-pathological cases, AABLK is the fastest, with DMRLE a distant second
and DMRLEC even slower.  We believe this is due to register starvation
on the Pentium.

DMRLEC is highly effective on scalar as well as superscalar machines
because it requires very few extra instructions inside the innermost
loop.  On superscalar machines with load/store functional units and
other characteristics, we believe there is room for significantly more
compression than even DMRLEC without affecting message assembly and
disassembly throughput.  Such a superscalar machine is starved for
instruction level parallelism in a copy loop --- a hunger that can be
satisfied with compression-related instructions.  We identify a
``superscalar plateau'' --- the number of additional register-register
instructions that can be embedded in a stride-one copy loop without
affecting its throughput --- and measure it on the Alpha, (Maybe) POWER2,
and the Pentium.  The existence of the plateau on the Pentium is odd
given the chip's performance with DMRLEC, but can be explained by
register starvation.


\section{Address Relation Reuse}

\begin{figure}
%\epsfig
\caption{Levels of an operation on a distributed data structure}
\label{fig:hpfassign}
\end{figure}

\begin{figure}
%\epsfig
\caption{Decoupling address computation and message assembly}
\label{fig:decouple}
\end{figure}


Parallel programs for distributed memory multicomputers, whether
written in a high level parallel language such as HPF or Fx, or in a
sequential language with message passing, operate on data structures
distributed over many processors.  These operations can be examined at
many levels.  The language level defines what data structures are
possible, how they can be distributed, and the operations that can be
performed on them.  For a particular program, the programmer or
compiler chooses the appropriate data structures, distributions, and
operations to accomplish these goals.  These choices interact to produce
some global communication pattern between the processors.  Between
each pair of processors, communication involves mapping sender addresses
to receiver addresses and transferring data according to this mapping.  We
refer to the act of computing this mapping as {\em address
computation} and to the mapping itself as an {\em address relation}.


For example, consider figure \ref{fig:hpfassign}.  The innocuous
language level HPF array assignment \verb.B=A. is between two
dimensional arrays with different data distributions (block
distributed columns and rows for B and A, respectively).  This induces
an all-to-all communication operation.  Between each pair of
processors, an address relation is computed mapping blocks of $k$
addresses from a sender stride of $Pk$ to a receiver stride of $k$.
To communicate data according to the relation, the strided data is
gathered at the sender either into a buffer or directly to the wire
and scattered at the receiver.
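To make the address relation concrete, the following C sketch (our own
illustration; the function name and the flat tuple-array layout are
assumptions, not part of any particular compiler) enumerates the
$(sender, receiver)$ tuples of this relation for $nb$ blocks of $k$
words with sender stride $Pk$ and receiver stride $k$:

```c
/* Hypothetical sketch: enumerate the (sender, receiver) address
 * tuples for the B=A example: nb blocks of k words each, read at a
 * sender stride of P*k and deposited at a receiver stride of k.
 * The relation is stored as a flat array of tuples:
 * rel[2*t] = sender address, rel[2*t+1] = receiver address. */
void gen_block_relation(long *rel, long nb, long k, long P)
{
    long t = 0;
    for (long b = 0; b < nb; b++) {
        for (long i = 0; i < k; i++) {
            rel[2*t]     = b * P * k + i;  /* sender address   */
            rel[2*t + 1] = b * k + i;      /* receiver address */
            t++;
        }
    }
}
```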

Typically, for regular data distributions and access patterns, address
computation is done inline with {\em message assembly} (gathering
the data) on the sending processor, and {\em message disassembly}
(scattering the data) on the receiving processor.  Compile--time
knowledge (or programmer knowledge of the strengths and weaknesses of
the compiler) is used to reduce the cost of address
computation~\cite{jim}.  

We {\em decouple} address computation and message
assembly/disassembly.  As shown in figure~\ref{fig:decouple}, address
computation generates an address relation which is stored in memory
and then reused for assembling and disassembling messages whenever the
associated communication is performed.  The address relation can be
represented in many different ways to minimize memory use or to
maximize performance.  The power of this technique is that message
assembly/disassembly code need only be optimized for each
representation and architecture, but not for different data
distributions.  







\section{Modelling decoupled address computation and reuse}

In this section, we model decoupled address computation and address
relation reuse in the context of direct deposit model communication.
After explaining this communication style, we develop expressions for
the average latency for assembly with inline address computation and
for assembly with decoupled address computation and address relation
reuse as a function of machine parameters.  The address relation
representation we use is a simple list of sender address, receiver
address tuples.  We combine our expressions to produce predictions for
the {\em threshold}: the minimum computation at which the decoupled
scheme is faster than inline address computation, the {\em breakeven}:
how many times a relation must be reused before the initial address
computation is amortized, and the {\em speedup}: the maximum
improvement in performance possible.  We test these values on the
iWarp and Paragon and find close agreement.  An important
observation is that the threshold values are very low, on the order
of one instruction on the iWarp and 16 on the Paragon.


\subsection{Direct deposit model communication}

In a conventional communication system, such as PVM~\cite{sund90},
NX~\cite{pierce94}, or MPI~\cite{walker94}, address computation is
done both on the sender and the receiver.  The message is assembled
according to the address relation computed by the sender, transported
to the receiver, and disassembled according to the address relation
computed by the receiver.  Obviously, the same total order must be
placed on both address relations.  The message contains only the
actual data.  In direct deposit model communication~\cite{stricker95},
only the sender performs address computation.  The receiver addresses
are embedded in the message.  In addition to reducing the end-to-end
latency, this approach also makes it easier to separate
synchronization from data transfer, allowing a compiler to amortize
synchronization over many data transfers.  The message format that
we examine is called {\em address-data-pair} since each data word
is accompanied by the address at which it will be deposited.
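A minimal C sketch of this format and of the receiver's scatter
(the type and function names are our own, not from any particular
communication system):

```c
/* Hedged sketch of the address-data-pair message format: each data
 * word travels with the receiver address at which it is to be
 * deposited.  Type and field names are assumptions for illustration. */
typedef struct {
    long addr;  /* receiver address, supplied by the sender */
    long data;  /* the data word itself */
} adp_word;

/* The receiver deposits each pair directly, performing no address
 * computation of its own. */
void deposit(long *mem, const adp_word *msg, long n)
{
    for (long i = 0; i < n; i++)
        mem[msg[i].addr] = msg[i].data;
}
```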

\subsection{Analytic models}
\label{sec:simple}

In this section, we develop expressions for the average latency
of message assembly with inline address computation and with
decoupled address computation and address relation reuse.  We
combine the expressions to predict the threshold at which the
decoupled approach becomes effective, the breakeven point at which
the address computation is fully amortized, and the final speedup
due to the decoupled approach.

\subsubsection*{Algorithm and machine properties}

Clearly, the time spent computing the address relation depends on the
algorithm and its implementation.  We model address computation as
instruction issue rate limited and as delivering the address relation
in an on-line manner.  We assume that address computation is
integer-based and done in registers, so we characterize it by $n_g$,
the number of instructions executed per element of the relation.
Additionally, we must know $r_i$, the number of integer instructions
the machine can issue per second.  Since memory accesses are at the
heart of message assembly and address relation reuse, we characterize
the memory system by four values: $r_{rc}$ and $r_{wc}$, the read and
write bandwidths for unit--stride accesses, respectively; and $r_{rr}$
and $r_{wr}$, the read and write bandwidths for random memory
accesses.

\subsubsection*{Inline address computation}

The time for message assembly using inline address computation has
two components: $t_g$, the time to compute the address relation;
and $t_{au}$, the message assembly time:
\begin{displaymath}
t_{inline} = t_g + t_{au}
\end{displaymath}
The time to compute the address relation is largely determined by the
complexity of the algorithm used, its implementation, and the instruction
execution rate of the machine.  We characterize the algorithm in terms
of $n_g$, the number of integer instructions executed per tuple of the
address relation.  Thus for an $n$ word message, the time required is
\begin{displaymath}
t_g = \frac{n n_g}{r_i}
\end{displaymath}
The address computation code presents us with sender, receiver address
tuples $(s,r)$ in registers in an on-line manner.  For each of these
tuples, we read the data word at address $s$, write $r$ to the message
buffer, then write the data word to the message buffer.  Note that the
writes to the message buffer are to consecutive addresses, so their
performance can be captured by $r_{wc}$, the unit--stride write
bandwidth of the machine.  On the other hand, the sender side
addresses of the tuples may not be contiguous, so we capture their
performance with $r_{rr}$, the random read bandwidth of the machine.
Furthermore, we capture the loop overhead as a number of instructions,
$n_{au}$.  Thus the time to assemble the message is
\begin{displaymath}
t_{au} = n \left (\frac{1}{r_{rr}} + \frac{2}{r_{wc}} + \frac{n_{au}}{r_i}
          \right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes.  An implementation will probably
optimize access patterns by reading several data items at a time.  If
the compiler software-pipelines the assembly loop, we should be close
to the above expression.  Substituting into the expression for
$t_{inline}$, we get an average assembly latency of 
\begin{equation}
 t_{inline} = n \left ( \frac{n_g+n_{au}}{r_i} +
 \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right )
\label{equ:uncached}
\end{equation}
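The inline loop can be sketched in C as follows.  This is our own
illustration; the identity relation stands in for the $n_g$
instructions of address computation (as in the validation experiments
later in the paper), where a compiler would emit distribution-specific
code:

```c
/* Hedged sketch of inline message assembly into address-data-pair
 * format.  The identity relation stands in for the n_g instructions
 * of address computation. */
void assemble_inline(const long *data, long *msg, long n)
{
    long j = 0;
    for (long i = 0; i < n; i++) {
        long s = i;          /* address computation: sender address   */
        long r = i;          /* address computation: receiver address */
        msg[j++] = r;        /* consecutive write of the address      */
        msg[j++] = data[s];  /* random read, then consecutive write   */
    }
}
```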


\subsubsection*{Decoupled address computation}

The average latency for message assembly with decoupled address
computation and address relation reuse has three components: $t_g$,
the time to compute the address relation; $t_o$, the time to store the
address relation in the cache, and $t_{ac}$, the message assembly time
using the stored address relation.  The average latency is 
\begin{displaymath}
 t_{decoupled} = \frac{t_g + t_o}{n_i} + t_{ac}
\end{displaymath}
Note that the first two components are amortized over the number of
times the communication is performed, $n_i$.  $t_g$ is the same as for
the inline scheme and was determined in the preceding section.  The
address relation representation we examine is a simple array of sender
address, receiver address tuples.  As each tuple of the address
relation is generated, we write it to the array.  Thus storing the
address relation involves two contiguous writes for each in-register
tuple produced by the address relation computation.  In addition, we
execute $n_o$ loop overhead instructions, so
\begin{displaymath}
 t_o = n \left ( \frac{2}{r_{wc}} + \frac{n_o}{r_i} \right )
\end{displaymath}
When assembling the message from the cached address relation, for each
tuple in the message, we first perform two consecutive reads to read a
tuple of the cached address relation.  Then, we perform a random read
to read the data word.  Finally, we perform two consecutive writes to
write the data word and the receiver address into the message buffer.
We also execute $n_{ac}$ loop overhead instructions, thus the time to
assemble a message with $n$ data words is
\begin{displaymath}
 t_{ac} = n \left (\frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
\frac{2}{r_{wc}} +\frac{n_{ac}}{r_i}\right ) 
\end{displaymath}
Note that this expression does not take into account cache thrashing
between the reads and writes, but we continue the assumption of the
preceding section.  Substituting into the expression for $t_{decoupled}$,
we see that an $n$ word communication between two nodes requires time
\begin{equation}
 t_{decoupled} = n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} +
	\frac{n_{ac}}{r_i} + \frac{2}{r_{rc}} + \frac{1}{r_{rr}} +
	\frac{2}{r_{wc}} \right )
\label{equ:cached}
\end{equation}
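The corresponding assembly loop, sketched in C (our own illustration;
names are assumptions), replaces the $n_g$ instructions of address
computation with two consecutive reads of the stored tuple:

```c
/* Hedged sketch of message assembly from a stored address relation,
 * represented as a flat array of (sender, receiver) tuples.  The
 * address computation is gone from the loop; only the two consecutive
 * reads of each tuple remain. */
void assemble_decoupled(const long *rel, const long *data,
                        long *msg, long n)
{
    long j = 0;
    for (long i = 0; i < n; i++) {
        long s = rel[2*i];      /* consecutive read: sender address    */
        long r = rel[2*i + 1];  /* consecutive read: receiver address  */
        msg[j++] = r;           /* consecutive write of the address    */
        msg[j++] = data[s];     /* random read, then consecutive write */
    }
}
```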

\subsubsection{Effects of decoupling}

Equations~\ref{equ:uncached} and~\ref{equ:cached} provide insight
into the following questions: For a message assembly that occurs many
times at run--time, can decoupled address computation and address
relation reuse ever outperform inline address computation?  If so, how
many times must the relation be reused before the decoupled strategy
wins?  Finally, what is the ultimate speedup of the decoupled strategy
over the inline strategy?

Caching can only be effective if we can assemble the message faster
using the cache than without using the cache, i.e., 
\begin{displaymath}
t_{ac} < t_g + t_{au}
\end{displaymath}
or
\begin{equation}
% r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
n_g > \frac{2r_i}{r_{rc}} + n_{ac}-n_{au}
\label{equ:necessary}
\end{equation}
The right hand side of equation~\ref{equ:necessary} is a
machine--specific threshold below which decoupled address computation
and address relation reuse is ineffective.  Notice that it is largely
determined by the ratio of the instruction issue rate to the memory read
bandwidth.  If inequality~(\ref{equ:necessary}) holds, then we can also predict
how many times the address relation must be used in order to break
even with the inline scheme.  This break even point is found by
setting $t_{decoupled} = t_{inline}$ and solving for $n_i$.  In doing
this we find that the break even point is
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right ) 
\right \rceil 
\label{equ:crossover}
\end{equation}
It is important to point out that as $n_g$ moves away from the
threshold, the break even point decreases sharply because the $n_g$
terms in the fraction begin to dominate.

Finally, the ultimate speedup of the decoupled strategy over the
inline strategy is 
\begin{equation}
\label{equ:speedup}
\frac{t_g+t_{au}}{t_{ac}} = \frac{r_{rc}r_{wc}r_{rr}(n_g+n_{au}) 
              + r_ir_{rc}(r_{wc}+2r_{rr})}
             {r_{rc}r_{wc}r_{rr}n_{ac} + r_ir_{rc}(r_{wc}+2r_{rr})
              + 2r_ir_{rr}r_{wc}}
\end{equation}
The speedup is linear in $n_g$, so it is bounded only by the complexity of
the address computation.


\subsection{Model validation}

In this section we validate the models from Section~\ref{sec:simple}
for message assembly on an iWarp system and on a Paragon.  To test the
predictive power of the models, we use (\ref{equ:crossover}) to
predict the number of iterations necessary for decoupled address
computation and address relation reuse to break even with inline
address computation as a function of $n_g$, and we use
(\ref{equ:speedup}) to predict the ultimate speedup of the decoupled
strategy over the inline strategy as a function of $n_g$.  In each
case, the address relation that is generated is the identity relation,
and $n_g$, the number of instructions to compute the relation, is
varied by inserting nops into the code stream. The predictions are
compared to actual measurements on the hardware.

\subsubsection{iWarp}

On the iWarp~\cite{iwarp} system, $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$
words/second, and a node can execute $r_i=2 \times 10^7$
instructions/second.  The memory system parameters were derived
directly from the iWarp specification.  Specifically, the iWarp clock
speed is 20 MHz and all memory operations complete in two cycles.
There is no cache and we assume no dual dispatch of loads and stores.

The implementation parameters ($n_{au}$, $n_o$, and $n_{ac}$) can be
estimated {\em statically} by simply counting the number of
instructions in each loop, or {\em dynamically} by accounting for
bypass stalls that arise in the loops. The static estimates for iWarp
are $n_{au}=5$, $n_o=6$, and $n_{ac}=2$. The dynamic estimates are
$n_{au}=7$, $n_o=8$, and $n_{ac}=3$.
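As a quick sanity check, substituting the static estimates and the
iWarp machine parameters into equation~\ref{equ:uncached} and into the
large-$n_i$ limit of equation~\ref{equ:cached} at, say, $n_g=4$ gives
per--word latencies of
\begin{displaymath}
\frac{t_{inline}}{n} = \frac{4+5}{2\times 10^7} + \frac{1}{10^7} +
\frac{2}{10^7} = 7.5\times 10^{-7}\ \mathrm{s}, \qquad
\frac{t_{ac}}{n} = \frac{2}{2\times 10^7} + \frac{2}{10^7} +
\frac{1}{10^7} + \frac{2}{10^7} = 6.0\times 10^{-7}\ \mathrm{s},
\end{displaymath}
for an ultimate speedup of $7.5/6.0 = 1.25$, the value that appears in
the Pred-Stat column of Figure~\ref{fig:iw}(b).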

Figure~\ref{fig:iw}(a) shows the measured and predicted number of
iterations needed on iWarp for the message assembly latency of the
decoupled scheme to break even with that of the inline scheme for
different $n_g$.  The predictions made using the dynamic measurements
are exact.  The predictions made using the static measurements are
pessimistic for small $n_g$, but quickly converge to the actual
numbers as $n_g$ increases.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1        &         13     &        INF        &       13            \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &          3        &        3\\
\hline
\end{tabular} 
&
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat & Pred-Dyn \\
\hline
   1       &         1.05    &         1.00     &         1.07 \\
   2       &         1.12    &         1.09     &         1.15 \\
   3       &         1.22    &         1.16     &         1.23 \\
   4       &         1.28    &         1.25     &         1.32 \\
   5       &         1.35    &         1.33     &         1.39 \\
   6       &         1.43    &         1.41     &         1.47 \\
   7       &         1.49    &         1.49     &         1.54 \\
   8       &         1.59    &         1.59     &         1.61 \\
\hline
\end{tabular}\\
(a) & (b) \\
\end{tabular}
\caption{Model validation on iWarp:  (a) Iterations required to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:iw}
\end{figure}

Figure~\ref{fig:iw}(b) compares the actual and predicted speedup of
the decoupled scheme over the inline scheme on iWarp. As in
Figure~\ref{fig:iw}(a), the dynamic predictions are quite close for
all $n_g$ and the static predictions are pessimistic for small $n_g$.
However, it is important to note that even with static measurements,
the predictions only become inaccurate when $n_g$ approaches the
threshold.  Another important point is that the purpose of
Figure~\ref{fig:iw}(b) is to help validate the models when $n_g$ is
small, and not to highlight impressive speedups. In fact, the speedups
grow arbitrarily large as $n_g$ increases (see Equation~\ref{equ:speedup}).



\subsubsection{Paragon}

A Paragon node contains a 50 MHz pipelined, multiple issue RISC
processor~\cite{PARAGON-XPS,I860-PR}.  We estimated the instruction
issue rate by noting that we do not use the floating point pipeline.
Therefore we can issue at most $r_i=5 \times 10^7$
instructions/second.  We directly measured the memory parameters and
found $r_{rc}=9 \times 10^6$ words/second, $r_{wc}=8.2 \times 10^6$
words/second, and $r_{rr}=r_{wr}=0.9 \times 10^6$ words/second.  Only
a static measurement (instruction counting) of the implementation
properties was performed, which found $n_{au}=6$, $n_o=7$, and
$n_{ac}=11$.

Figure~\ref{fig:86}(a) shows, for different $n_g$, the measured and
predicted number of iterations needed for message assembly using the
decoupled scheme to break even with assembly using the inline scheme.
The predictions are pessimistic, but quickly converge to the actual
values with increasing $n_g$.  Notice that $n_g$ must be much larger
to permit break even on the Paragon than on the iWarp.  This is not
surprising given that the Paragon's instruction issue rate is much
higher than its memory bandwidth, while the iWarp's issue rate is well
matched to its memory bandwidth.

\begin{figure}[t]
\centering
\small
\begin{tabular}{cc}
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline

    14     &        NONE    &        NONE\\
    15     &        NONE    &        NONE\\
    16     &        19      &        NONE\\
    17     &        11      &        41    \\     
    18     &         9      &        20\\
    19     &         7      &        14\\
    20     &         6      &        11\\
    21     &         5      &         9\\
    22     &         5      &         7\\
    23     &         4      &         7\\
    24     &         4      &         6\\
    25     &         4      &         5\\
\hline
\end{tabular}
&
\begin{tabular}{|c|c|c|} 
\hline
  $n_g$ & Act & Pred-Stat \\
\hline
    14     &        NONE   &         NONE\\
    15     &        NONE   &         NONE\\
    16     &       1.03     &         NONE\\
    17     &       1.06     &         1.01\\
    18     &       1.09     &         1.02\\
    19     &       1.12     &         1.03\\
    20     &       1.15     &         1.04\\
    21     &       1.17     &         1.05\\
    22     &       1.20     &         1.06\\
    23     &       1.22     &         1.07\\
    24     &       1.24     &         1.08\\
    25     &       1.26     &         1.09\\
\hline
\end{tabular}\\
(a) & (b)
\end{tabular}
\caption{Model validation on Paragon: (a) Iterations to break even (Eqn. \protect{\ref{equ:crossover}}), (b) Speedup in message assembly (Eqn. \protect{\ref{equ:speedup}})}
\label{fig:86}
\end{figure}

Figure~\ref{fig:86}(b) compares the actual and predicted speedup of
the decoupled scheme over the inline scheme on the Paragon.  Notice
that speedup does not increase as quickly with $n_g$ as on the iWarp
(Figure~\ref{fig:iw}(b)).  This is the result of the lower memory
bandwidth relative to instruction issue rate on the Paragon.  The
predictions are not as accurate as for the iWarp, reflecting the fact
that the Paragon's memory system is more complex.  However, the predictions
are pessimistic and quickly converge to the correct values.


\section{Address relation compression}

\subsection{Hierarchy}


\section{Performance of address relation compression}

\subsection{Tradeoff}

\subsection{Metrics}

\subsection{Measurements}

\subsection{Discussion}

\section{Superscalar plateau}

\section{Discussion}

In this section, we comment on other issues in address relation
caching.  We begin by noting the benefits of decoupling address
computation and message assembly in the communication system.  Next,
we discuss where to make the decision to cache an address
computation, pointing out that compilers play a key role.
Continuing, we comment on cache data structures and replacement
policy.  Finally, we point out that the cache provides an opportunity
for optimizations.


\subsection{Decoupling address computation and message assembly}

By taking address computation out of the critical path of data
transfer, address relation caching decouples address computation and
message assembly.  One advantage of this is that it makes it easier to
achieve reasonable communication performance without a large number of
special--case optimizations in the compiler and run--time library.
Instead of optimizing each address computation algorithm for each
architecture, only message assembly from cache needs to be optimized
for each architecture in order for all data transfers to benefit.

Address relations are an ideal interface to the communication system
because they are implementation neutral.  This permits communication
systems to be highly specialized for the available hardware while
still presenting the same interface to the programmer or compiler.
However, to best utilize the hardware, it may be necessary to change
the representation of the address relation before using it.  For
example, the relation may be transformed into a DMA gather map on the
sender to make best use of a network adapter such as the one described
in~\cite{GNECTAR-IFIP}.  Because transforming the relation may be
expensive, it is desirable for the communication system to cache the
(transformed) relation instead of the application.  


\subsection{Deciding to cache}

Our analytical model and HPF results show that caching can sometimes
be ineffective and that the number of times a cached address relation
must be reused to break even depends on the complexity of address
generation.  Given this, where should the decision to cache or not be
made?  One option is to make this decision in the compiler.  This is
almost certainly the right place if the distributions and machine
characteristics and the number of times the communication is performed
are known at compile--time.  If this is not the case, the decision can
be made at run--time, where at least the distributions and machine
characteristics are known.  However, the number of repetitions may be
data dependent and unknown until after the communication has been
performed.  To deal with this case, the library could keep track of
the number of times a particular communication is performed uncached,
and switch to a cached approach after some number of repetitions.
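One hedged sketch of such a run--time policy in C (our own
illustration; the type, names, and threshold mechanism are
assumptions, not ART's actual interface):

```c
/* Hypothetical run-time caching policy: assemble inline until a
 * communication has repeated `threshold` times, then compute and
 * store the address relation and assemble from it thereafter. */
typedef struct {
    long count;      /* times this communication has been performed */
    long threshold;  /* repetitions before switching to the cache   */
    int  cached;     /* has the relation been computed and stored?  */
} reuse_tracker;

int should_use_cache(reuse_tracker *t)
{
    if (t->cached)
        return 1;
    if (++t->count >= t->threshold) {
        t->cached = 1;  /* compute and store the relation now */
        return 1;
    }
    return 0;           /* keep assembling inline for now */
}
```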

\subsection{Cache data structures}

There is a continuum of cache data structures that ranges from
individually storing each tuple of the address relation in memory to
simply performing the original address computation.  The cache data
structure should be chosen to minimize the time spent accessing the
cache given memory constraints.  The structure should be implementation
dependent in order to make best use of available hardware.


\subsection{Replacement policy}

The cache replacement policy should make use of compile--time
knowledge.  The HPF library described above uses a strictly
direct--mapped policy at run--time, but the cache entry tags are
selected by the compiler, which knows the run--time policy.  This
means the compiler can assure that useful address relations remain in
the cache.  What to do when there is insufficient memory to hold even
one address relation in cache is unclear.

\subsection{Cache optimizations}

The cache is an intermediate layer between address computation and
message assembly.  As such, it presents opportunities for
optimization.  For example, since the entire address relation is known
to the cache, it could optimize it for memory locality on the sender,
receiver, or both.  As shown in~\cite{memoryperform}, the choice is
architecture--dependent. If the distributions, machine
characteristics, and repetition count are known at compile--time, it
is reasonable to perform address generation at this point,
``pre-cache'' the results, and always assemble from cache when running
the program.

\section{Related work}

 A similar technique is used to
support irregular data distributions and access patterns in
CHAOS/PARTI~\cite{saltz91} --- in that system, an ``inspector'' is
used to compute all communication for a parallel loop and an
``executor'' performs the loop, including the communication.  Our
main contribution is to demonstrate 

\section{Conclusion}

\end{document}
