%\documentstyle[twocolumn]{article}
\documentstyle[11pt]{article}
\setlength{\topmargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textheight}{9in}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\marginparsep}{0in}
\setlength{\marginparwidth}{0in}

\title{Modeling Caching in Deposit Model Communication}
\author{Peter A. Dinda}


\begin{document}

\maketitle

\abstract{
We present a model for predicting the performance of caching the results of 
address relation computation in deposit model communication.  The 
effectiveness of caching the relation depends on node hardware (CPU speed, 
memory bandwidth, and concurrency between the CPU and the NIC), the 
performance of the communication network, the complexity of computing the 
relation, and the representational efficiency of the data structure used by 
the cache.  Generally, if a relation is slow to compute, but can be stored 
efficiently, then caching is quite effective.  On the other extreme, if 
representational efficiency is low or the relation can be computed quickly, 
caching is less effective.  Further, if computing the relation and 
assembling a message can be fully overlapped with actual communication, 
then caching is ineffective.  Surprisingly, the model also shows that there 
are cases where caching is {\em never} effective, regardless of the number 
of times a communication is performed.  Finally, caching may provide 
additional opportunity to optimize memory access patterns and may make it 
possible to simplify the algorithms used to compute the address relation.

We back up our model by presenting performance results for a communication 
library that transports HPF-style distributed arrays between data parallel
tasks.  Results are presented for two very different platforms:  A NOW 
environment of ethernet-connected DEC Alphas running PVM, and the Intel 
Paragon using NX.
}


\section{Introduction}

\section{Deposit Model Communication}

In conventional, mailbox \cite{Petersen} communication, such as that 
provided by PVM or NX, the sender computes a binary relation $S$ between 
local addresses and addresses offset into the {\em message} that is 
transported between it and the receiver.  If $(s,m) \in S$ then the data 
item at local address $s$ is copied into the message at offset $m$.  The 
receiver computes a binary relation $R$ between addresses offset into the 
message and local addresses.  If $(m,r) \in R$, then the data item at 
offset $m$ in the message is copied to local address $r$.  No actual 
address information is contained in the message.

In deposit model style communication, the sender computes a binary relation 
$SR$ between local addresses at the sender and local addresses at the 
receiver -- if $(s,r) \in SR$ then the data item at address $s$ will be 
transported to address $r$ at the receiver.  We refer to $SR$ as the {\em 
address relation}.  The sender uses the address relation to {\em assemble} 
a message that conceptually consists of 2-tuples $(r,d)$ where data item 
$d$ (read from address $s$) is to be ``deposited'' at address $r$ on the 
receiver.  This message format is referred to as ``address-data pair.'' The 
receiver runs a simple ``deposit handler'' which receives the message and 
{\em disassembles} the message, writing each $d$ to its corresponding local 
address $r$.  The format of the message can often be optimized so that the 
overhead of including addresses in the message is very low.  One such 
format is referred to as ``address-data block.'' In this format, the message 
consists of 3-tuples $(r,l,D)$, where $D$ consists of $l$ data items that 
are deposited at consecutive addresses starting at $r$.  Obviously, other 
optimizations are also possible, depending on the properties of the address 
relation.  There are several other attributes of pure deposit model 
communication that are ignored here.
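The address-data pair scheme described above can be sketched as follows.
This is a minimal illustrative sketch, not the library implementation;
memory on each node is modeled as a Python dict from address to data word.

```python
# Illustrative sketch of deposit model communication with the
# address-data pair message format (not the paper's implementation).

def assemble(sr, sender_mem):
    """Assemble a message of (r, d) 2-tuples from the address relation SR."""
    return [(r, sender_mem[s]) for (s, r) in sr]

def deposit_handler(message, receiver_mem):
    """Disassemble: deposit each data word d at its receiver address r."""
    for (r, d) in message:
        receiver_mem[r] = d

# Transport the words at sender addresses 0, 1, 2 to receiver
# addresses 10, 11, 12.
sr = [(0, 10), (1, 11), (2, 12)]
sender = {0: 'a', 1: 'b', 2: 'c'}
receiver = {}
deposit_handler(assemble(sr, sender), receiver)
# receiver == {10: 'a', 11: 'b', 12: 'c'}
```

Note that the message carries receiver addresses but no sender addresses,
matching the description above.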

\section{Point-to-Point Communication}


We model an $n$ word point-to-point communication without and with caching.  
Without caching, the time to perform a point-to-point communication is 
always
\\
$$t_{uncached} = t_g + t_{au} + t_c + t_d$$ 
\\
where $t_g$ is the time spent computing the address relation, $t_{au}$ is 
the message assembly time, $t_c$ is the time spent communicating the 
message, and $t_d$ is the message disassembly time on the receiver.  

With caching, the average time to perform a point-to-point communication is
\\
$$t_{cached} = (t_g + t_o)/n_i + t_{ac} + t_c + t_d$$ 
\\
The two components in the fraction, $t_g$, the time spent computing the 
address relation, and $t_o$, the time spent updating the cache, are 
incurred only the first time the communication is performed.  We amortize 
these times over $n_i$, the number of times the communication is performed.  The 
last three components are $t_{ac}$, the time to assemble the message from 
the cached address relation; $t_c$, the time to perform the actual 
communication; and $t_d$, the time to disassemble the message on the 
receiver.  The times spent computing the address relation, performing the 
actual communication, and disassembling the message are the same for the 
cached and uncached cases.

In the remainder of this section, we first discuss the relevant properties 
of the algorithm used to compute the address relation.  Then, 
we present the machine parameters we use.  Next, we elaborate the above 
equations for an implementation where the message format and the cached 
address relation are in address-data pair form.  Finally, we extend the 
model for message and cache formats that have greater representational 
efficiency.


\subsection{Algorithm Properties}

Clearly, the time spent computing the address relation depends on the 
algorithm and its implementation.  We model computing the address relation as 
instruction execution limited and as delivering the address relation in an
on-line manner.  The single property of the implementation we are 
interested in is $n_g$, the number of instructions executed per element of
the relation.  

For example, we have implemented an algorithm for HPF distributed array 
assignment statements in two different ways.  The first implementation is 
in an optimizing compiler and generates a $d+2$ deep loopnest for $d$ 
distributed dimensions in the worst case.  For common HPF distributions 
(Block, Cyclic), this can often be reduced to only a $d$ deep nest.  
Further, $d$ rarely exceeds two.  The operations performed by the loop nest, 
as well as the computation of the loop bounds, can be performed entirely in 
registers.  The algorithm was also implemented as a run-time system, using 
recursion.  Because the recursion is never more than $d$ calls deep, it too 
executes almost entirely out of registers.
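As a purely hypothetical illustration of what an on-line address relation
generator for a 1-D Block distribution might look like (this is not the
compiler-generated loopnest or the recursive run-time system described
above), addresses can be modeled as (processor, local offset) pairs:

```python
# Hypothetical sketch: enumerate the address relation for a 1-D
# assignment A = B where B is Block(b_src) and A is Block(b_dst).

def block_owner(i, b):
    """Owner processor and local offset of global index i under Block(b)."""
    return i // b, i % b

def address_relation(n, b_src, b_dst):
    """Yield ((src_proc, s), (dst_proc, r)) tuples for A[i] = B[i]."""
    for i in range(n):
        yield block_owner(i, b_src), block_owner(i, b_dst)

# B is Block(4), A is Block(2): global index 5 moves from processor 1,
# offset 1 on the sender to processor 2, offset 1 on the receiver.
rel = dict(address_relation(8, 4, 2))
# rel[(1, 1)] == (2, 1)
```

Each tuple is produced by a constant number of integer operations, which is
what the single parameter $n_g$ abstracts.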

\subsection{Machine Properties}

Our machine has the following properties:
\begin{enumerate}
\item Executes $r_i$ instructions per second,
\item Performs stride-1 memory writes at $r_{wc}$ words/second,
\item Performs random memory writes at $r_{wr}$ words/second,
\item Performs stride-1 memory reads at $r_{rc}$ words/second,
\item Performs random memory reads at $r_{rr}$ words/second, and
\item Communicates with another node at $r_c$ words/second.
\end{enumerate}
These are all easily measurable.
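The read-rate parameters, for instance, might be estimated with a
microbenchmark in the following spirit.  This is only a sketch; a real
measurement would use a compiled microbenchmark with arrays larger than the
memory cache, not interpreted Python.

```python
# Sketch of measuring stride-1 vs. random read rates (illustrative only).
import random
import time

N = 200_000
src = list(range(N))
perm = list(range(N))
random.shuffle(perm)

t0 = time.perf_counter()
seq = [src[i] for i in range(N)]    # stride-1 reads: estimates r_rc
t1 = time.perf_counter()
rnd = [src[i] for i in perm]        # random reads: estimates r_rr
t2 = time.perf_counter()

r_rc = N / (t1 - t0)    # words/second, stride-1 reads
r_rr = N / (t2 - t1)    # words/second, random reads
```

The remaining parameters ($r_{wc}$, $r_{wr}$, $r_i$, $r_c$) can be measured
analogously with write loops, instruction loops, and message round trips.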

\subsection{Simple Model For Uncached Communication}

The time for an uncached point-to-point communication has four components:
$t_g$, the time to compute the address relation; $t_{au}$, the message 
assembly time; $t_c$, the message communication time; and $t_d$, the 
message disassembly time at the receiver.  The total time for the 
communication is 
\\
$$ t_{uncached} = t_g + t_{au} + t_c + t_d $$
\\
In this simple model, we assume that the message format is 
address-data-pair.

\subsubsection{Computing the Address Relation}

We argue that the address computation time is largely determined by the 
complexity of algorithm used, its implementation, and the instruction 
execution rate of the machine.  Thus for an $n$ word message, the time 
required is
\\
$$t_g = n n_g / r_i$$

\subsubsection{Assembling the Message}

The code that computes the address relation presents us with tuples $(s,r)$ 
in registers in an on-line manner.  For each of these tuples, we read the 
data word at address $s$, write $r$ to the message buffer, then write the 
data word to the message buffer.  Note that the writes to the message 
buffer are to consecutive addresses, so their performance can be captured 
by $r_{wc}$, the contiguous write performance of the machine.  On the 
other hand, the sender side addresses of the tuples may be all over the 
map, so we capture their performance with $r_{rr}$, the random read 
bandwidth of the machine.  Thus the time to assemble the message is
\\
$$t_{au} = n (1/r_{rr} + 2/r_{wc}) $$

Note that this expression does not take into account cache thrashing 
between the reads and writes.  

\subsubsection{Communicating the Message}

The message communication time is 
\\
$$ t_c = 2 n / r_c $$
\\
Note that $r_c$ is really a function of $2n$, but we assume that $n$ is 
sufficiently large that we can use a fixed value for $r_c$.  Further, 
this value is the same without and with caching and becomes 
irrelevant in comparing them.  

\subsubsection{Disassembling the Message}

On the receiving node, each word of the message is read consecutively and 
each address-data pair results in one random write to memory.  Thus the 
time to disassemble the message is 
\\
$$ t_d = n (2/r_{rc} + 1/r_{wr}) $$  

Note that this expression does not take into account cache thrashing between
the reads and writes.

\subsubsection{Putting It Together: Simple Uncached Communication}

Substituting into the expression for $t_{uncached}$, we see that an 
$n$ word communication between two nodes requires time
\\
$$ t_{uncached} = n ( n_g/r_i + 1/r_{rr} + 2/r_{wc} + 2/r_c + 2/r_{rc} + 
1/r_{wr})$$

\subsection{A Simple Model For Cached Communication}

The time for a cached point-to-point communication has five components: 
$t_g$, the time to compute the address relation; $t_o$, the time to store 
the address relation in the cache, $t_{ac}$, the message assembly time from 
the cached address relation; $t_c$, the message communication time; and 
$t_d$, the message disassembly time at the receiver.  The average time for 
the communication is
\\
$$ t_{cached} = (t_g + t_o)/n_i + t_{ac} + t_c + t_d $$
\\
Note that the first two components are amortized over the number of times 
the communication is performed, $n_i$.  $t_g$, $t_c$, and $t_d$ are the 
same between the uncached and cached schemes and were determined in the
preceding sections.  We continue the assumption that the message format 
is address-data-pair.  Further, we choose the simplest possible cache 
data structure -- an array of 2-tuples that represents the relation.

\subsubsection{Storing the Address Relation}

For the simple model, storing the address relation in the cache involves 
two contiguous writes for each in-register tuple produced by the address 
relation computation:
\\
$$ t_o = n (2/r_{wc}) $$


\subsubsection{Assembling the Message}

When assembling the message from the cached address relation, for each 
tuple in the message, we first perform two consecutive reads to read a tuple of 
the cached address relation.  Then, we perform a random read to read the 
data word.  Finally, we perform two consecutive writes to write the data 
word and the receiver address into the message buffer.  Thus the time to 
assemble a message with $n$ data words is 
\\
$$ t_{ac} = n (2/r_{rc} + 1/r_{rr} + 2/r_{wc}) $$

Note that this expression does not take into account cache thrashing 
between the reads and writes.
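The simple cached scheme can be sketched as follows.  As before, this is an
illustrative sketch rather than the library implementation, with sender
memory modeled as a dict; the comments note which model term each access
pattern contributes to.

```python
# Sketch of the simple cached scheme: the cache is a flat array of
# (s, r) 2-tuples stored the first time the relation is computed.

def update_cache(sr_tuples):
    """t_o: store the relation as a flat list of (s, r) pairs
    (two contiguous writes per tuple)."""
    return list(sr_tuples)

def assemble_from_cache(cache, sender_mem):
    """t_ac: per data word, two contiguous reads (the cached tuple),
    one random read (the data), and two contiguous writes (address
    and data into the message buffer)."""
    return [(r, sender_mem[s]) for (s, r) in cache]

cache = update_cache([(0, 10), (1, 11)])
msg = assemble_from_cache(cache, {0: 'x', 1: 'y'})
# msg == [(10, 'x'), (11, 'y')]
```

Subsequent communications skip the address relation computation entirely and
reuse `cache`.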

\subsubsection{Putting it Together: Simple Cached Communication}

Substituting into the expression for $t_{cached}$, we see that an 
$n$ word communication between two nodes requires time
\\
$$ t_{cached} = n ((n_g/r_i + 2/r_{wc})/n_i + 2/r_{rc} + 1/r_{rr} + 2/r_{wc} + 2/r_c + 2/r_{rc} + 
1/r_{wr})$$

\subsection{Evaluating Caching Using the Simple Model}

In the cached implementation we are essentially trading memory bandwidth 
for computation.  Caching can only be effective if 
\\
$$ t_{ac} < t_g + t_{au}$$
\\
or, expanding both sides,
$$ 2/r_{rc} + 1/r_{rr} + 2/r_{wc} < n_g/r_i + 1/r_{rr} + 2/r_{wc}$$
which, after canceling the common terms, reduces to $2/r_{rc} < n_g/r_i$, or
$$ r_{rc} > 2r_i/n_g $$

If this holds true, then the number of times the cached information must 
be used in order to outperform the uncached implementation is determined by 
setting $t_{cached} = t_{uncached}$ and solving for $n_i$.  In doing this 
we find that the break even point for caching is 
\\
$$ n_i = \lfloor (r_{rc}/r_{wc})((n_gr_{wc} + 2r_i)/(n_gr_{rc}-2r_i)) 
\rfloor $$
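To make the break-even expression concrete, the simple model can be
evaluated numerically.  The parameter values below are invented for
illustration only; they are not measurements of either platform.

```python
# Numeric illustration of the simple model with hypothetical
# (invented) machine parameters.
from math import floor

r_i  = 100e6   # instructions/second
r_rc = 25e6    # stride-1 reads, words/second
r_rr = 5e6     # random reads, words/second
r_wc = 20e6    # stride-1 writes, words/second
r_wr = 4e6     # random writes, words/second
r_c  = 2e6     # node-to-node communication, words/second
n_g  = 10      # instructions per tuple of the address relation

def t_uncached_per_word():
    return n_g/r_i + 1/r_rr + 2/r_wc + 2/r_c + 2/r_rc + 1/r_wr

def t_cached_per_word(n_i):
    return ((n_g/r_i + 2/r_wc)/n_i
            + 2/r_rc + 1/r_rr + 2/r_wc + 2/r_c + 2/r_rc + 1/r_wr)

# Caching can only pay off if r_rc > 2*r_i/n_g (here 25e6 > 20e6).
assert r_rc > 2*r_i/n_g

# Break-even point from the closed-form expression:
n_break = floor((r_rc/r_wc) * (n_g*r_wc + 2*r_i) / (n_g*r_rc - 2*r_i))
# n_break == 10: with these invented parameters, caching wins once
# the communication repeats more than 10 times.
```

Note how close the race is: the cached scheme saves only the $n_g/r_i$ term
per word while paying $2/r_{rc}$ for cache reads, so a fast memory system or
a complex relation is needed for caching to win at all.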

\subsection{Parameters For Extended Models}

The models for point-to-point communications described above have several 
simplifications that may be unrealistic.  First, they assume that the 
message is in the simple address-data-pair format and that the cache data 
structure is a simple array of address-address pairs.  While this format is 
the best that can be done for a random address relation, we argue that most 
useful address relations can be represented in more efficient ways.  For 
example, the address-data-block message format combined with a 
run-length-encoded cache format greatly reduces the message size and the 
number of times the cache is accessed for most HPF array assignment 
statements.  
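One possible run-length encoding of this kind can be sketched as follows.
This is an illustrative encoding, not necessarily the library's cache or
message layout; with $2$ overhead words $(r,l)$ per block, $\alpha$ is
simply $2\cdot(\mbox{number of blocks})/n$.

```python
# Illustrative run-length encoding of (r, d) pairs into
# address-data-block 3-tuples (r, l, D).

def rle_blocks(pairs):
    """Merge runs of consecutive receiver addresses into [r, l, [d...]]
    blocks; pairs is a list of (r, d) in send order."""
    blocks = []
    for r, d in pairs:
        if blocks and blocks[-1][0] + blocks[-1][1] == r:
            blocks[-1][1] += 1        # extend the current run
            blocks[-1][2].append(d)
        else:
            blocks.append([r, 1, [d]])
    return blocks

pairs = [(10, 'a'), (11, 'b'), (12, 'c'), (13, 'd'), (20, 'e')]
blocks = rle_blocks(pairs)
# blocks == [[10, 4, ['a', 'b', 'c', 'd']], [20, 1, ['e']]]
alpha = 2 * len(blocks) / len(pairs)   # 2 overhead words (r, l) per block
# alpha == 0.8 here; long runs drive alpha toward 0
```

For the contiguous sections typical of Block-distributed assignments, runs
are long and $\alpha$ approaches zero; for a random relation the format
degenerates to one block per word and $\alpha=2$.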

In this section, we introduce several additional parameters which are 
intended to model the effect of various optimizations.  Of these, the 
most important are $\alpha$ and $\beta$.  $\alpha$ represents 
the effect of an optimized message format while $\beta$ represents the 
effect of an optimized cached address relation during the assembly of the 
message.  Additionally, we introduce the parameters $n_o$, $\gamma$, and 
$\delta$, which represent the computation, memory reads, and memory 
writes associated with updating an optimized cache structure.  These new 
parameters depend on the implementation, the address relation, 
and the order in which the address relation is computed.  The dependence on 
the address relation can make prediction difficult.  However, 
classification of a communication may be sufficient to estimate the 
parameters.  For example, for an HPF distributed array assignment 
statement \verb.A=B., our implementation examines the distribution along 
each pair of corresponding dimensions to determine whether an optimization 
can be used for that pair of dimensions.  Since a dimension can only be one 
of block, cyclic, or general block-cyclic, one of nine schemes is used when 
that dimension is considered.  By testing the performance of the 
implementation for these nine cases, we can measure $\alpha$ and $\beta$ 
(not to mention $n_g$) for each of these classes.  


\subsection{Extended Models For Uncached and Cached Communication}

\begin{table}
\caption{Elements of point-to-point communication times}
\begin{tabular}{ll}
	$t_{uncached}$ & Point-to-point communication time w/o caching  \\
	$t_{cached}$ & Point-to-point communication time with caching  \\
	$t_g$ & Address relation computation time  \\
	$t_{au}$ & Message assembly time w/o caching   \\
	$t_{ac}$ & Message assembly time with caching  \\
	$t_c$ & Message communication time  \\
	$t_d$ & Message disassembly time  \\
	$t_o$ & Cache update time  \\
\end{tabular}
\end{table}

\begin{table}
\caption{Machine parameters}
\begin{tabular}{ll}
	$r_i$ & Instructions/second of machine  \\
	$r_{rc}$ & Words/Second of stride-1 memory reads  \\
	$r_{rr}$ & Words/Second of random memory reads  \\
	$r_{wc}$ & Words/second of stride-1 memory writes  \\
	$r_{wr}$ & Words/second of random memory writes  \\
	$r_c$ & Words/second of node-node communication \\
\end{tabular}
\end{table}

	
\begin{table}
\caption{Algorithm and implementation dependencies}
\begin{tabular}{ll}
	$n$   & Number of data words in message \\
	$n_i$ & Number of times communication is done  \\
	$n_g$ & Instructions/tuple to compute address relation \\
\end{tabular}
\end{table}

\begin{table}
	\caption{Additional parameters for extended point-to-point model}
	\protect\label{addlparam}
	\begin{tabular}{ll}
		$\alpha$ & Message overhead per data word  \\
		$\beta$ & Cache reads per data word  \\
	   $n_o$  & Instructions/tuple to update cache   \\
		$\gamma$ & Cache reads per tuple of address relation  \\
		$\delta$ & Cache writes per tuple of address relation \\
	\end{tabular}
\end{table}

$\alpha$ represents the additional amount of space required in the 
message for addressing overhead per word of data in the message.  Thus we 
find that the communication time for an $n$ word message is
\\
$$t_c = n(1+\alpha)/r_c$$
\\
and the disassembly time at the receiver is
\\
$$t_d = n((1+\alpha)/r_{rc}+1/r_{wr})$$
\\
By comparison, we see that the simple model has $\alpha=1$.   The address 
computation time does not depend on the new parameters and thus remains
\\
$$t_g=n n_g/r_i$$
\\
The uncached assembly time depends only on the message format, because tuples 
are generated on-line by directly computing the address relation and 
never stored.  The uncached assembly time is
\\
$$t_{au} = n(1/r_{rr} + (1+\alpha)/r_{wc})$$

Notice that for $\alpha=1$ the uncached communication time for an $n$ word message,
\\
$$t_{uncached} = n(n_g/r_i + 1/r_{rr} + (1+\alpha)/r_{wc} + (1+\alpha)/r_c 
+ (1+\alpha)/r_{rc} + 1/r_{wr}) $$
\\
collapses to that for the simple address-data-pair message format.

Modeling $t_o$, the time to update the cache from the results of the 
address computation, is difficult.  In the simple model, adding a tuple of 
the address relation to the cache only required two (contiguous) writes to 
an array, thus we had that $t_o=n(2/r_{wc})$ for an $n$ word message.  In 
general, adding a tuple to the cache will require a certain number of 
reads, some computation, and a certain number of writes to update the data 
structure.  Let $n_o$ be the number of instructions executed per tuple, 
$\gamma$ the number of reads per tuple, and $\delta$ the number of 
writes per tuple.  We make the simplifying assumption that the cache data structure is 
(memory) cache optimized, so that the read and write performance can be 
characterized with the stride-1 contiguous read and write numbers, $r_{rc}$ 
and $r_{wc}$, respectively.  Then the time to add the elements to the cache 
becomes
\\
$$t_o = n (n_o/r_i + \gamma/r_{rc} + \delta / r_{wc})$$
\\
In the simple address-data-pair case, $n_o\approx 0$, $\gamma=0$, and 
$\delta=2$.  

When assembling the message using the cached address relation, we model the 
time spent accessing the cache as being limited by the contiguous read 
performance of the machine.  We feel this is reasonable because a final 
stage of the cache update time $t_o$ can always be spent converting between 
a data structure optimized for additions to one optimized for use during 
assembly.  Because of this possible change of data structure, we must 
introduce an additional term, $\beta$, which represents the number of words 
read from the cache per data word.  Then the time to assemble the message 
is
\\
$$t_{ac}=n(\beta/r_{rc} + 1/r_{rr} + (1+\alpha)/r_{wc})$$
\\
Notice that this expression combines $\beta$ and $\alpha$ -- the portion 
of the time spent reading from the cache depends on the cache data 
structure while the portion spent writing to the message depends on the 
message format.   

The total cached communication time for an $n$ word message is
\\
$$t_{cached} = n((n_g/r_i + (n_o/r_i + \gamma/r_{rc} + \delta/r_{wc}))/n_i + 
\beta/r_{rc} + 1/r_{rr} + (1+\alpha)/r_{wc} + (1+\alpha)/r_c + 
(1+\alpha)/r_{rc}+ 1/r_{wr})$$

\subsection{Evaluating Caching in Light of the Extended Models}

As before, we note that for an implementation with caching to ever 
outperform one without, it must be true that
\\
$$t_{ac} < t_g + t_{au}$$
\\
which means, as for the simple model, that
\\
$$r_{rc} > \beta r_i / n_g $$
\\
or, more interestingly, that
\\
$$\beta < n_g r_{rc}/r_i $$
\\
The latter expression tells us that the representational efficiency 
of the cached address relation determines whether caching is effective or 
not.  Surprisingly, the efficiency of the message format does not come 
into play.  

If caching can win, then the breakeven point is determined by setting
$t_{cached}=t_{uncached}$ and solving for $n_i$, the number of times the 
communication is performed.  We find that the break-even point is where
\\
$$n_i = \lfloor ((n_g+n_o)r_{rc} + \gamma r_i + \delta r_i r_{rc}/r_{wc}) /
             (n_g r_{rc} - \beta r_i) \rfloor $$
\\
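The break-even expression can be checked by carrying out the substitution
explicitly; the following worked expansion uses only the definitions above.

```latex
% Setting t_cached = t_uncached and canceling the terms common to both
% sides (the communication, disassembly, random-read, and message-write
% terms) leaves
\frac{t_g + t_o}{n_i} = t_g + t_{au} - t_{ac}
                      = n\left(\frac{n_g}{r_i} - \frac{\beta}{r_{rc}}\right)
% Substituting t_g + t_o = n((n_g + n_o)/r_i + \gamma/r_{rc} + \delta/r_{wc})
% and solving for n_i gives
n_i = \frac{(n_g + n_o)/r_i + \gamma/r_{rc} + \delta/r_{wc}}
           {n_g/r_i - \beta/r_{rc}}
% Multiplying numerator and denominator by r_i r_{rc} yields the
% expression above.
```

Note that the denominator is positive exactly when $\beta < n_g r_{rc}/r_i$,
matching the effectiveness condition.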



\section{One-to-many Communication and Caching}

So far, we have developed a model to predict the average performance of a single 
point-to-point communication without and with caching.  If the hardware 
does not support concurrency between actual communication (``getting it 
on the wire'') and computation, then the total time for
depositing to $n_p$ processors $n_i$ times is 
\\
$$ n_i\sum_{i=0}^{n_p-1}t_{(pp,i)} $$
\\
where $t_{(pp,i)}$ is the point-to-point time, $t_{cached}$ or 
$t_{uncached}$, for the message sent to processor $i$.  This time can 
be optimized by choosing whichever of cached or uncached communication is 
best for each individual point-to-point time. 

If the hardware supports communication concurrently with computation, 
then some complexity arises.  In the following, we assume that the number 
of data words and the amount of computation and memory references is the 
same for each target processor -- that 
$t_{(pp,0)}=t_{(pp,1)}=\dots=t_{(pp,n_p-1)}$ and the components of each of 
the $t_{(pp,i)}$ are equal.

Because of the concurrency between the communication and computation, we 
assemble the message for processor $i$ while sending the message for 
processor $i-1$.  Suppose that the time to assemble a message is less 
than the time to send one; then the total time spent performing the 
communication is overwhelmingly dominated by the communication time.  
Without caching, we find that the total time required is
\\
$$ t_g+t_{au}+n_in_pt_c+t_d$$
\\
whereas with caching we require
\\
$$ n_p(t_g+t_o) + t_{ac} + n_in_pt_c+t_d$$
\\
Therefore, caching will only make sense if $t_g < (t_{au}-t_{ac})/(n_p-1)$, 
which is unlikely.  See figure \ref{commbig}. Even if this is found to be 
true, caching is unlikely to be of much help.

If the time to assemble a message is greater than that to send a message, 
then the total time to perform the communication is dominated by the 
assembly time.  Without caching, the total time required is
\\
$$ n_in_p(t_g+t_{au})+t_c+t_d $$
\\
whereas with caching the total time is
\\
$$ n_p(t_g+t_o)+n_in_pt_{ac} + t_c +t_d$$
\\
In this case, caching makes sense if $t_{ac}<t_g+t_{au}$ as for the 
point-to-point case.  See figure \ref{commsmall}.  In general, we can say 
that for caching to ever be effective for one-to-many communication, in 
addition to $t_{ac}<t_g+t_{au}$, it must be that $t_{ac}<t_c$.  In the 
extended model, this means that 
\\
$$ \beta/r_{rc} + 1/r_{rr} + (1+\alpha)/r_{wc} < (1+\alpha)/r_c$$
\\
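This final condition is easy to evaluate numerically.  The rates and format
parameters below are invented for illustration only; they are not
measurements of either platform.

```python
# Checking the one-to-many overlap condition t_ac < t_c with
# hypothetical (invented) rates and format parameters.
r_rc, r_rr, r_wc, r_c = 25e6, 5e6, 20e6, 2e6   # words/second
alpha, beta = 0.25, 2.0

t_ac_per_word = beta/r_rc + 1/r_rr + (1 + alpha)/r_wc  # cached assembly
t_c_per_word  = (1 + alpha)/r_c                        # communication

# With a network this slow relative to memory, assembly hides behind
# communication, so caching can still pay off in the one-to-many case.
assert t_ac_per_word < t_c_per_word
```

On a platform where $r_c$ approaches memory bandwidth, the inequality flips
and assembly can no longer hide behind communication, so caching loses its
benefit for one-to-many communication regardless of $n_i$.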

\end{document}
