%\documentstyle[twocolumn]{article}
%\documentstyle[epsf,11pt]{article}
\documentstyle[epsf,twocolumn]{article}
\setlength{\topmargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textheight}{9in}

\setlength{\oddsidemargin}{-.25in}
\setlength{\evensidemargin}{-.25in}
\setlength{\textwidth}{7in}
\setlength{\marginparsep}{0in}
\setlength{\marginparwidth}{0in}

\title{Address Relation Caching in Deposit Model Communication \\
        Extended Abstract}
\author{Peter A. Dinda and David R. O'Hallaron \\
        School of Computer Science \\ Carnegie Mellon University \\
	5000 Forbes Ave \\ Pittsburgh, PA 15213}


\begin{document}

\input{psfig}

\maketitle

\section{Introduction}

Typical communication systems for parallel computers transport the
contents of a contiguous chunk of memory (a message buffer) between
two processors.  Unfortunately, many applications do not keep the data
they wish to communicate in convenient contiguous chunks, so a message
must be assembled on the sender and disassembled on the receiver.
This requires copying the data, and computing which data items are to
be sent as well as their ultimate locations on the receiver.  In this
paper, we examine deposit model communication \cite{depmodel}, where
this {\em address computation} is performed only by the sender, and
the destination addresses are embedded in the message.
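To make the address-data pair format concrete, the following is a minimal C sketch of sender-side assembly and receiver-side disassembly under the deposit model. The names (\verb|pair_t|, \verb|assemble|, \verb|disassemble|) and the use of word offsets in place of raw addresses are illustrative assumptions, not the implementation described here.

```c
#include <stddef.h>

/* One element of the message: a data word paired with its
 * destination address (here: a word offset on the receiver). */
typedef struct {
    unsigned long dst_off;   /* target offset on the receiver */
    double        data;      /* the data word itself */
} pair_t;

/* Sender side: walk the address relation (src_idx -> dst_off),
 * gathering data words and embedding their destination addresses.
 * All address computation happens here, on the sender. */
static size_t assemble(const double *src, const size_t *src_idx,
                       const unsigned long *dst_off, size_t n,
                       pair_t *msg)
{
    for (size_t i = 0; i < n; i++) {
        msg[i].dst_off = dst_off[i];       /* address half of the pair */
        msg[i].data    = src[src_idx[i]];  /* gather the data word */
    }
    return n;  /* message length in pairs */
}

/* Receiver side: deposit each word at its embedded address.
 * No address computation is needed on this side. */
static void disassemble(const pair_t *msg, size_t n, double *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[msg[i].dst_off] = msg[i].data;
}
```

The receiver loop is pure data movement, which is what lets all address computation (and hence all caching) live on the sender.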

Although data copying can ruin performance \cite{stricker}, the address 
computation may also greatly affect performance.  Caching can be used to 
amortize this overhead if the communication is repeated, which it 
frequently is, for example, in HPF programs.  We present simple models of
cached and uncached communication that take into account address 
computation.  The models are based on machine and address computation 
parameters.  Using them, we show that while caching is sometimes ineffective 
or even detrimental to performance, it can improve performance even when 
the amount of address computation is low.  If caching is effective, the 
models show how many times a communication must be repeated for the cached 
performance to break even with the uncached performance, as well as the 
speedup in message assembly due to caching.  We verify these predictions 
for the iWarp and the Paragon.

We show the practical benefit of caching in an HPF \cite{HPF} communication 
library.  For some distributions, caching can improve performance by a 
factor of two.  Finally, we comment on optimizations that can be performed 
with caching, and on open issues in the implementation of the cache.

\section{Modeling Point-to-Point Communication}
\label{simple}

In this section, we present models for the average end-to-end latency
of a point-to-point communication for uncached and (simple) cached
implementations.  Because of space constraints, we have omitted the
derivation of the models and only present the final expressions.  
From these expressions, we derive expressions for when caching is
effective, how many times the cached information must be used before
the cached implementation breaks even with the uncached implementation,
and the speedup in message assembly.  

\subsection{A Simple Model For Uncached Communication}

The time for an uncached point-to-point communication has four
components: $t_g$, the time to compute the address relation; $t_{au}$,
the message assembly time; $t_c$, the message communication time; and
$t_d$, the message disassembly time at the receiver.  The total time
for the communication is
\begin{equation}
t_{uncached} = t_g + t_{au} + t_c + t_d
\end{equation}
In this simple model, we assume that the message format is address-data 
pairs: each data word in the message is paired with its target 
address on the receiver.  The final expression for the uncached average 
end-to-end latency for a communication of $n$ data words is
\begin{eqnarray}
 t_{uncached} & = & n \left ( \frac{n_g+n_{au}+n_d}{r_i} 
                    + \frac{1}{r_{rr}} + \frac{2}{r_{wc}} \right . \nonumber \\
              &   & + \left . \frac{2}{r_c} + \frac{2}{r_{rc}} + \frac{1}{r_{wr}}
                       \right )
\end{eqnarray}
where $n_g$ is the number of instructions executed in address computation 
per data word, $n_{au}$ is the number of loop overhead instructions per 
data word in message assembly, and $n_d$ is the number of loop overhead 
instructions per data word in message disassembly.  The other parameters 
are machine properties.  $r_{rc}$ and $r_{wc}$ are the stride-1 memory read 
and write bandwidths of the machine in words/second, respectively.  
$r_{rr}$ and $r_{wr}$ are the read and write bandwidths, respectively, of 
the machine given a random access pattern in words/second.  Finally, $r_i$ 
is the instruction execution rate of the machine in instructions/second, 
while $r_c$ is the bandwidth of a communication link in words/second.
\label{simpleuc}
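For concreteness, the per-word uncached latency (equation (2) divided by $n$) can be transcribed directly into code. The sketch below is our transcription of the model, with each term commented; any parameter values used with it are illustrative, not measurements.

```c
/* Per-word uncached end-to-end latency, equation (2) divided by n.
 * Rates are in words/s (or instructions/s for r_i); the n_* counts
 * are instructions per data word. */
double t_uncached_per_word(double n_g, double n_au, double n_d,
                           double r_i, double r_rr, double r_wc,
                           double r_c, double r_rc, double r_wr)
{
    return (n_g + n_au + n_d) / r_i  /* address comp. + loop overheads */
         + 1.0 / r_rr                /* random read of the data word   */
         + 2.0 / r_wc                /* write address-data pair        */
         + 2.0 / r_c                 /* send two words over the link   */
         + 2.0 / r_rc                /* receiver reads the pair        */
         + 1.0 / r_wr;               /* random write at target address */
}
```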


\subsection{A Simple Model For Cached Communication}

The time for a cached point-to-point communication has five components: 
$t_g$, the time to compute the address relation; $t_o$, the time to store 
the address relation in the cache; $t_{ac}$, the message assembly time from 
the cached address relation; $t_c$, the message communication time; and 
$t_d$, the message disassembly time at the receiver.  The average time for 
the communication is
\begin{equation}
 t_{cached} = \frac{t_g + t_o}{n_i} + t_{ac} + t_c + t_d 
\end{equation}
Note that the first two components are amortized over the number of times 
the communication is performed, $n_i$.  $t_g$, $t_c$, and $t_d$ are the 
same between the uncached and cached schemes.  We retain the assumption 
that the message format is address-data pairs.  Further, we choose the 
simplest possible cache data structure: an array of 2-tuples that 
represent the relation.  The final expression for the cached average 
end-to-end latency for a communication of $n$ data words after $n_i$
iterations is
\begin{eqnarray}
 t_{cached} & = & n \left ( \frac{(n_g+n_o)/r_i + 2/r_{wc}}{n_i} + 
	\frac{n_{ac}+n_d}{r_i} \right . \nonumber \\            
            &   & \left . + \frac{2}{r_{rc}} + \frac{1}{r_{rr}} + 
             \frac{2}{r_{wc}} \right . \nonumber \\
            &   & \left . + 
             \frac{2}{r_c} + \frac{2}{r_{rc}} + \frac{1}{r_{wr}} \right )
\end{eqnarray}
where $n_g$ is the number of instructions executed in address computation 
per data word, $n_o$ is the number of instructions executed in inserting a 
tuple of the address relation into the cache, $n_{ac}$ is the number of 
loop overhead instructions per data word in message assembly using the 
cache, and $n_d$ is the number of loop overhead instructions per data word 
in message disassembly.  The other parameters are the machine properties
explained in section \ref{simpleuc}.

\subsection{Evaluating Caching Using the Simple Model}

In the cached implementation we are essentially trading memory bandwidth 
for computation.  Caching can only be effective if we can assemble the
message using the cache faster than without the cache:
\begin{equation}
 t_{ac} < t_g + t_{au}
\end{equation}
or
\begin{equation}
 r_{rc} > \frac{2r_i}{n_g+n_{au}-n_{ac}}
\end{equation}
If this holds true, then the number of times the cached information
must be used in order to outperform the uncached implementation can be
determined by setting $t_{cached} = t_{uncached}$ and solving for
$n_i$.  In doing this we find that the break-even point for caching is
\begin{equation}
 n_i = \left \lceil \frac{r_{rc}}{r_{wc}} \left (
           \frac{(n_g+n_o)r_{wc} +
           2r_i}{(n_g+n_{au}-n_{ac})r_{rc}-2r_i} \right )
\right \rceil 
\end{equation}
Finally, we find that the speedup in message assembly time over the
uncached implementation is
\begin{equation}
\frac{r_{rr}r_{wc}}{r_{rc}} \left ( \frac{(n_g + n_{au}
-n_{ac})r_{rc} -2r_i}{(n_g+n_{au})r_{rr}r_{wc}+(r_{wc}+2r_{rr})r_i} \right ) 
\end{equation}

\section{Model Validation}

\begin{figure}
\centering
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act. & Pred. (Static) & Pred. (Dynamic) \\
\hline
   1        &         13     &        $\infty$   &       13            \\
   2        &         7      &         12        &        7\\
   3        &         5      &          7        &        5\\
   4        &         4      &          5        &        4\\
   5        &         4      &          4        &        4\\
   6        &         3      &          4        &        3\\
   7        &         3      &          3        &        3\\
   8        &         3      &         3         &       3\\
\hline
\end{tabular}
\caption{Iterations required to break even with 
uncached message assembly on iWarp }
\label{iwbe}
\end{figure}

\begin{figure}
\centering
\begin{tabular}{|c|c|c|c|} 
\hline
  $n_g$ & Act. & Pred. (Static) & Pred. (Dynamic) \\
\hline
   1       &          5\%    &          0\%     &          7\% \\
   2       &         11\%    &          8\%     &         13\% \\
   3       &         18\%    &         14\%     &         19\% \\
   4       &         22\%    &         20\%     &         24\% \\
   5       &         26\%    &         25\%     &         28\% \\
   6       &         30\%    &         29\%     &         32\% \\
   7       &         33\%    &         33\%     &         35\% \\
   8       &         36\%    &         37\%     &         38\% \\
\hline
\end{tabular}
\caption{Speedup over uncached message assembly on iWarp}
\label{iwsu}
\end{figure}

The model was validated on the iWarp and the Paragon \cite{paragon}.  Because of space 
considerations, we present only the iWarp results here and omit discussion 
of how the various machine and algorithm properties were measured.  The 
model was used to predict the number of iterations necessary for the 
average cached message assembly time to break even with the uncached 
message assembly as a function of $n_g$.  The speedup in cached message 
assembly time as a function of $n_g$ was also predicted.  These numbers 
were then measured and compared to the predicted values.  The address 
relation computed is the identity relation.  The number of instructions to 
compute this relation was varied by inserting nops into the code stream.

On the iWarp \cite{iwarp} we have $r_{rc}=r_{rr}=r_{wc}=r_{wr}=10^7$ 
words/second.  iWarp can execute $r_i=2 \times 10^7$ instructions/second.  
The algorithm and implementation properties were measured both
statically, by simply counting the number of instructions in each loop,
and dynamically, by accounting for stalls that arise in the loops.  The
static measurements found $n_{au}=5$, $n_o=6$, and $n_{ac}=2$, while the
dynamic measurements found $n_{au}=7$, $n_o=8$, and 
$n_{ac}=3$.  Figure \ref{iwbe} shows the measured and predicted number of 
iterations needed to break even with uncached assembly for different $n_g$.  
Notice that the predictions made using the dynamic measurements are almost 
exactly on target.  The predictions made using the static measurements are 
pessimistic for small $n_g$.  The same is true for the predicted speedup 
over uncached assembly, detailed in figure \ref{iwsu}.  However, it is 
important to note that even using static measurements, the predictions only 
become inaccurate when $n_g$ approaches the minimum number of instructions 
where the cached implementation can outperform the uncached implementation.

\section{Address Relation Caching for HPF Communications}

We implemented a library for communicating HPF-distributed arrays between 
data parallel tasks.  The processors over which the source and destination 
arrays are distributed are disjoint and the distribution of the arrays may 
be different.  For example, on the source processors, the array's rows may 
be block-distributed over 4 processors, while on the destination 
processors, its columns may be cyclic-distributed over 8 processors.  The 
library is called on every source processor with the source and target 
distributions (general block-cyclic) along each dimension of the array.  On 
each source processor, the address relation between the local chunk of the 
array and the local chunks on each of the destination processors is 
computed using an algorithm due to \cite{stichnoth}.  The library can
either cache the results of the computation, or use them to directly
assemble a message.  

We measured the performance of this library, with and without caching, for 
a variety of array distributions on the Intel Paragon and found that the 
effectiveness of caching depends on the source and destination 
distributions (and therefore on the complexity of the address computation).  
For example, Figure \ref{bbpgon} shows a communication where caching is 
ineffective.  In this case, we are communicating from a block distribution 
to a block distribution, which is treated as a special case.  We see that 
the average communication time for the cached communication after a large 
number of iterations is about 20\% higher than that of the uncached 
communication, which is the unamortized overhead of using the cached 
approach.  Figure \ref{rcpgon} shows a communication where caching is 
marginally effective.  Here, we send a two dimensional array whose rows are 
distributed by blocks to a destination where its columns are distributed by 
blocks -- a distribution transpose.  Because of the additional computation 
involved, as well as the small number of address-address blocks necessary 
to represent the address relation, caching proves to be a marginal win.  
Ultimately, it improves the performance of the communication by about 
20\%.  In Figure \ref{bcpgon} we see a communication from a general 
block-cyclic distribution with a small block size to a general block-cyclic 
distribution with a large block size.  The library does not have an 
optimization for this case, and, in fact, the inner loop of the computation 
only generates one tuple each time it is entered.  Despite this, there is a 
fair amount of contiguity in the address relation, so there are few 
address-address blocks in the cache.  As a result, the communication is 
ultimately a factor of two faster by doing message assembly using the 
cache.
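The compactness argument above rests on coalescing contiguous runs of the address relation into address-address blocks. The following is a minimal sketch of that coalescing step; the type and function names are our illustration, not the library's actual data structures.

```c
#include <stddef.h>

/* An address-address block: len consecutive source words starting at
 * src map to len consecutive destination words starting at dst. */
typedef struct {
    size_t src;   /* first source offset of the run      */
    size_t dst;   /* first destination offset of the run */
    size_t len;   /* number of consecutive words         */
} aablock_t;

/* Coalesce an address relation, given as parallel (src, dst) offset
 * arrays, into address-address blocks.  Returns the number of blocks
 * written to out (out must have room for n blocks in the worst case).
 * The fewer blocks the relation needs, the cheaper cached assembly is. */
static size_t coalesce(const size_t *src, const size_t *dst, size_t n,
                       aablock_t *out)
{
    if (n == 0)
        return 0;
    size_t m = 0;
    out[0] = (aablock_t){ src[0], dst[0], 1 };
    for (size_t i = 1; i < n; i++) {
        aablock_t *b = &out[m];
        /* Extend the current block only if BOTH sides advance
         * contiguously; otherwise start a new block. */
        if (src[i] == b->src + b->len && dst[i] == b->dst + b->len)
            b->len++;
        else
            out[++m] = (aablock_t){ src[i], dst[i], 1 };
    }
    return m + 1;
}
```

A highly contiguous relation collapses to a handful of blocks, which is why caching wins by a factor of two in the block-cyclic case even though the inner loop of the address computation emits only one tuple per entry.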


\begin{figure}
\centerline{\epsfxsize=2.5in \epsfbox{bbpgon.epsf}}
\caption{Average communication time for block to block on Paragon}
\label{bbpgon}
\end{figure}

\begin{figure}
\centerline{\epsfxsize=2.5in \epsfbox{rcpgon.epsf}}
\caption{Average communication time for row-block to column-block on Paragon}
\label{rcpgon}
\end{figure}   



\begin{figure}
\centerline{\epsfxsize=2.5in \epsfbox{bcs2lpgon.epsf}}
\caption{BC(small) to BC(large) on Paragon}
\label{bcpgon}
\end{figure}

\section{Other Issues in Caching}

We argue that as distributions become more irregular, address computation
becomes more time-consuming, and caching becomes more effective.  If 
caching is used, various optimizations to improve memory locality can be
performed to benefit {\em all} address computations; essentially, with
caching, address computation algorithms can concentrate on address
computation and avoid machine-specific optimizations.  There is a
continuum of possible cache data structures to choose from.  Integrating
caching into a message passing interface would allow the full use
of scatter/gather DMA hardware to avoid copies.



\begin{thebibliography}{foo}

\bibitem[1]{depmodel}Deposit Model Reference

\bibitem[2]{stricker}Reference to Copy Performance

\bibitem[3]{HPF}High Performance Fortran Forum, {\em High Performance
Fortran Language Specification}, Technical Report CRPC-TR92225,
Center for Research in Parallel Computation, Rice University, May 1993.

\bibitem[4]{iwarp}S. Borkar, et al., {\em iWarp: An Integrated
Solution to High-Speed Parallel Computing}, Proceedings of
Supercomputing '88.

\bibitem[5]{paragon}Intel Corporation, {\em Paragon User's Guide}

\bibitem[6]{i860}Intel Corporation, {\em i860 Microprocessor Family
Programmer's Reference Manual}

\bibitem[7]{stichnoth}J. Stichnoth, D. O'Hallaron, and T. Gross,
{\em Generating Communication for Array Statements: Design,
Implementation, and Evaluation}, Journal of Parallel and Distributed
Computing, vol. 21, no. 1, April 1994, pp. 150--159.

\bibitem[8]{commoptparti}R. Das, M. Uysal, J. Saltz, and Y. Hwang,
{\em Communication Optimizations for Irregular Scientific Computations on
Distributed Memory Architectures}, Journal of Parallel and Distributed
Computing, vol. 22, no. 3, September 1994, pp. 462--478.


\end{thebibliography}

\end{document}

