\input epsf
%\input{psadobe}
\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother
\documentstyle[fleqn]{art-nocolon}
\include{psfig}
%\include{caption}


\textheight =  9.0in
\textwidth = 7.0in              % assume output has this much centered

\topmargin = -.5 in 
\oddsidemargin = -.25 in
\evensidemargin = -.25 in

\renewcommand{\floatpagefraction}{1}
\renewcommand{\topfraction}{1}
\renewcommand{\bottomfraction}{1}
\renewcommand{\textfraction}{0}
\renewcommand{\baselinestretch}{1}
\renewcommand{\arraystretch}{1.5}


\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newcommand{\ignore}[1]{}

\renewcommand{\to}{\rightarrow}
\newcommand{\A}{{\cal A}}
\begin{document}
\bibliographystyle{acm}

%\load{\footnotesize}{\sc}
\title{ 
%\begin{flushleft}
%\vspace{-1.0in}
%\centerline{\scriptsize\em (Extended abstract submitted for PLDI 94)}
%\normalsize { }
%\end{flushleft}
%\vspace{0.4in}
Language and Runtime Support for Network Parallel Computing\\
}
\author{}
\date{   {email:\tt \{\}@cs.cmu.edu} \vspace*{.2in} \\
      School of Computer Science \\
         Carnegie Mellon University \\
         Pittsburgh PA 15213 \vspace*{.2in}}
%        Corresponding Author: Jaspal Subhlok \\
%        Phone: (412) 268 7893}


\maketitle
\thispagestyle{empty}
\abstract{
\normalsize
%Fast networks have become  very important 
%for high performance computing as many applications need hardware
%and software resources that are distributed  across networks. 
The
latencies and bandwidths available on networks have improved
dramatically, making it possible to distribute applications on a
heterogeneous network and execute them efficiently. However, programming
support for network parallel applications is 
%virtually non-existent.
rudimentary, especially in the context of standard languages.
This paper describes how the Fx compiler at Carnegie Mellon
for a variant of  High Performance Fortran is being extended
to support network parallel computing. Task parallelism is used
to distribute and manage computations across the sequential
and parallel machines of a network. We introduce a network communication
interface  which makes it possible to efficiently
compile the programs for heterogeneous networks even in the 
absence of full runtime information, which is often the 
case in network parallel computing. We describe how Fx
programs are compiled onto the communication interface and discuss
the issues in implementing the interface. The result is that
a programmer is able to write a single program that executes
efficiently across the network, even in the presence of architectural
heterogeneity and load imbalance.
}

%\voffset = -.5 in
\normalsize

\setcounter{page}{1}
\section{Introduction}\label{intro}
With the advent of standard parallel languages like High Performance
Fortran (HPF), parallelizing compilers have become the most convenient
and powerful way of developing programs for parallel machines. These
standard languages make it easy to run the same program efficiently on
a wide variety of sequential and parallel machines.

At the same time, recent advances in standard network protocols (e.g.,
HiPPI and ATM) have made it possible to link these different machines
together and build network parallel applications. The low latency and
high bandwidths of the new protocols offer the possibility of high
efficiency, and the standard languages offer the benefit of a common
programming environment.

However, standard parallel languages offer rudimentary support, at
best, for network communication. The low-level details of the network
protocols are usually exposed, making the
construction of an efficient network parallel application a difficult and
error-prone task in which the applications programmer must become a
network specialist. 


The goal of our research is to make it possible to program
network applications easily and efficiently. Compiling 
for a variety of machines across a network presents several
new challenges. This paper shows how parallelism
inside a single machine and across a set of machines can be
expressed in a uniform framework. We also present the design
of a runtime system that enables efficient compilation of
programs for network parallel computing.

The programming model we use combines task and data 
parallelism. Procedures in an application are data parallel 
and each procedure is mapped to a homogeneous set of nodes,
which may be different nodes of a large parallel machine
or a cluster of workstations. Different procedures are placed
on different sets of nodes on the network and the compiler
manages the communication between them.

%For several reasons,
The most challenging part of compilation for
a network is generating efficient communication. The most efficient
communication interface across a network is typically different from the 
communication interface between the nodes of a parallel machine, hence
the compiler has to manage multiple communication paradigms.
More important, the network environment is expected to be 
dynamic. Runtime load balancing is important even for applications
with a static computation structure since the external
load on different machines  can vary. Hence the compiler has
limited information about the runtime environment during 
code generation.

For generating efficient communication in a dynamic runtime
environment, it is important to distinguish between the
aspects of the communication that are constant for the
duration of the execution and those that are varying and 
are known only at runtime. In our model, the program is
partitioned into modules and each module is mapped to a
set of processors. The communication steps between modules
are determined at compile time, while the actual low level communication
calls may depend on the  runtime environment.

Our solution to generating efficient communication involves
defining an inter-module communication interface. This 
interface provides mechanisms for communication between
modules that are mapped onto groups of nodes. The compiler
generates code for this communication interface and the underlying
implementation of the interface ensures that appropriate low
level communication calls are generated.

We begin by motivating the need for network parallel applications with
a few examples. Then we describe our approach, which we call Net-Fx, as
it is based on our variant of High Performance Fortran, Fx.

\section{Motivation for Network Parallel Computing}

Network parallel programming makes it possible to build larger and
more powerful
application systems than can be built without it, but this is not the
only or even the
best motivation. We are developing applications in which the network
parallelism is an inherent feature, chiefly due to the need for
heterogeneity, regardless of the size of the computer system used. One
of these applications is in real-world modeling.

\subsection{Network Parallel Computation for Real-World Modeling}

High performance computers and their programming tools have
traditionally been applied to problems in scientific modeling:
simulation, analysis, and the like. We are applying these same systems
and tools to problems in real-world modeling, by which we mean the
three-dimensional shape modeling of objects and environments in the
real world by direct observations using computer vision techniques. In
so doing we hope to replace complex and specialized sensory systems
based on technology like laser range-finding by much simpler, more
flexible and scalable, but computationally more expensive, systems
like stereo vision.

In these applications the sensor system plays a key role. In some
situations, e.g., imaging moving objects, it is necessary to capture
imagery at high speeds, i.e., video rate (30 Hz) or faster. Capturing
data at this rate is complicated by the need to synchronously access
several cameras (recently the most significant advances in stereo
vision accuracy and reliability have come from the introduction of
multiple cameras).  In other situations, e.g., imaging building
interiors, speed of image capture is not so important but the imaging
system must be portable.

\begin{figure}
\centerline{\epsfxsize=4in \epsfbox{netargos.eps}}
\caption{The envisioned network parallel system for real-world modeling:
a detachable camera unit that can be attached to a network of
workstations or to portable computers.}
\label{fig:netargos}
\end{figure}

We believe these conflicting requirements can be met through the
application of modern networking technology. The system we envision is
illustrated in Figure \ref{fig:netargos}. The camera system is a
detachable unit, and can be attached either to a large number of
workstations, giving a high frame capture rate, or to a smaller number
of portable workstations or PCs, giving portability at the expense of
speed. Once the images are captured (and the portable computer system
is reattached to the network, if necessary), they are transformed into
stereo vision data through the application of parallel stereo vision
techniques we have invented and coded in Fx; because of Fx's
architecture independence we can run these programs on the
workstations themselves or on powerful tightly-coupled high
performance computers.  The resulting range imagery is then
transformed into a three-dimensional model, again in parallel using
Fx, and the resulting model can then be stored to disk or displayed
remotely, enabling sophisticated three-dimensional videoconferencing.

\section{Programming model for Net-Fx}
The basic programming model provides support for task and
data parallelism and is not changed from Fx~\cite{SSOG93,GrOS94}.
Support for data parallelism is similar to that provided in
High Performance Fortran and is described in~\cite{StOG94,YWSO93}.
Task parallelism is used to map data parallel subroutines onto
different, possibly heterogeneous, groups of processors. We outline
how task parallelism is expressed in Net-Fx and illustrate how it
is used to exploit coarse grain parallelism.

\subsection{Task parallelism}
Task parallelism is expressed solely using compiler directives and no new
language extensions are introduced.  To simplify the
implementation, the current version of Fx relies on the user to
identify the side effects of the task-subroutines and to specify
them. Directives are also used to guide the compiler in mapping 
the program, which is important for performance and also to ensure
that tasks execute on computers for which compiled versions are
available.
We describe the directives  that are used to express task
parallelism and illustrate their use. We use the FFT-Hist
kernel as an example program. This program takes a sequence
of arrays as input, and for each input
array, it performs a 2D FFT by computing a set of 1D FFTs, first
on the columns and then on the rows, followed by histogram analysis.

\subsubsection*{Parallel sections}
Calls to task-subroutines are permitted only in special code regions
called {\em parallel sections}, denoted by a {\tt begin parallel/end
parallel} pair. For example, the parallel section for the FFT-Hist
example has the following form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
c\$ \>{\bf begin parallel} \\
\>     do i = 1,m \\
\>\>     call colffts(A)\\
c\$\>\>  {\em input/output and mapping directives}\\ \vspace{3mm}
\>\>     call rowffts(A)\\
c\$\>\>  {\em input/output and mapping directives}\\ \vspace{3mm}
\>\>     call hist(A)\\
c\$\>\>  {\em input/output and mapping directives}\\
\>     enddo\\
c\$\>   {\bf end parallel}
\end{tabbing}
\rm
The code inside a parallel section can only contain loops and
subroutine calls.  These restrictions are necessary to make it
possible to manage shared data and shared resources (including
nodes) efficiently. Every task-subroutine call inside a parallel
section corresponds to a (possibly data parallel)
{\em task} that can execute in parallel with other tasks
subject to dependence constraints.

A parallel section introduces a data parallel thread for
each task-subroutine inside it.
Outside the parallel section, the program
executes as a single data parallel thread.

\subsubsection*{Input/output directives}

The user includes {\tt input} and {\tt output} directives to
define the side-effects of a task-subroutine, i.e., the data space
that the subroutine accesses and modifies.  Every variable whose value
at the call site may potentially be used by the called subroutine must
be added to the input parameter list of the task-subroutine.
Similarly, every variable whose value may be modified by the called
subroutine must be included in the output parameter list. The variables
in the input and output parameter lists can be  distributed or replicated
arrays, or scalars.

For example, the input and output directives for the call to {\tt
rowffts} have the form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
\>\>     call rowffts(A)\\
c\$\>\>  {\bf input} (A),  {\bf output} (A)\\
c\$\>\>  {\em mapping directives}\\
\end{tabbing}
\rm
This tells the compiler that the subroutine {\tt rowffts}
can potentially use values of, and write to, the parameter array {\tt A}.
As another example, the input and output directives
for the call to {\tt colffts} have the form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
\>\>     call colffts(A)\\
c\$\>\>  {\bf output} (A)\\
c\$\>\>  {\em mapping directives}\\
\end{tabbing}
\rm

This tells the compiler that subroutine {\tt colffts} does not use the value
of any parameter that is passed but can potentially write to array {\tt A}.

\subsubsection*{Mapping directives}
Labeling task-subroutine calls and supplying input/output information
is sufficient for the compiler to generate a correct parallel
program. However, the performance of the program largely depends
on how the task-subroutines are mapped onto the available computation
resources on the network. The possible mappings are constrained by 
the availability of data parallel implementations of different
task-subroutines for different architectures, memory requirements of
the task-subroutines, and physical location of devices like framebuffers.
Ideally Net-Fx should be able to automatically map the program for
correct execution and best possible performance, but that is
a hard problem and Net-Fx is not able to do it except in some simple
situations~\cite{SuVo95}. In general, the programmer has to supply 
the mapping information.

The mapping information  specifies where each of the task-subroutines
should execute. The details of the mapping directives are somewhat
cumbersome and machine specific, so we do not describe them here.
For example, when mapping a task-subroutine
to a network of workstations, it may be necessary to specify the names
of the workstations to be used as a node, but when mapping to a part
of a massively parallel machine like the Intel Paragon, it may be
sufficient to specify the number of nodes to be used.

\subsection{Programming coarse grain parallelism}
Net-Fx allows the programmer to use fine grain data parallelism as well as
coarse grain task parallelism. We illustrate how the task
parallelism directives described above are used to exploit
common forms of coarse grain parallelism.

\subsubsection*{Task pipelines}
In image and signal processing programs, the main computation often consists
of repeatedly executing an acyclic task graph. A simple example of such
a program is the  FFT-Hist example program, which we now present
with directives:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
c\$ \>{\bf begin parallel} \\
\>     do i = 1,m \\
\>\>     call colffts(A)\\
c\$\>\>  {\bf output} (A)\\
\>\>     call rowffts(A)\\
c\$\>\>  {\bf input} (A),  {\bf output} (A)\\
\>\>     call hist(A)\\
c\$\>\>  {\bf input} (A)\\
\>     enddo\\
c\$\>   {\bf end parallel}
\end{tabbing}
\rm

In FFT-Hist, all data dependences in the program are forward and
loop independent. Task-subroutine {\em colffts} reads input, computes and
sends the output data set to {\em rowffts}, which receives
the data set, computes, and sends the output data set to
{\em hist}, which computes the final output. This allows 
an earlier stage, say {\em colffts} to process a new data
set while {\em rowffts} is processing the previous data
set, allowing coarse grain pipeline parallelism. In  related
work, we have argued that such parallelism can be exploited
efficiently and profitably for many applications including
narrowband tracking radar and multibaseline
stereo~\cite{SOGD94}.
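The overlap between stages can be sketched with a minimal thread-and-queue
pipeline. This is only an illustration of the execution pattern, not the Fx
implementation; the stage bodies are placeholder computations standing in
for {\em colffts}, {\em rowffts}, and {\em hist}:

```python
import queue
import threading

def make_stage(fn, inq, outq):
    """One pipeline stage: apply fn to each data set from inq,
    forwarding results (and the shutdown sentinel) to outq."""
    def run():
        while True:
            item = inq.get()
            if item is None:              # sentinel: propagate and stop
                if outq is not None:
                    outq.put(None)
                break
            result = fn(item)
            if outq is not None:
                outq.put(result)
    return threading.Thread(target=run)

results = []
q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    make_stage(lambda a: [x * 2 for x in a], q1, q2),  # stands in for colffts
    make_stage(lambda a: [x + 1 for x in a], q2, q3),  # stands in for rowffts
    make_stage(results.append, q3, None),              # stands in for hist
]
for t in threads:
    t.start()
for data_set in ([1, 2], [3, 4]):   # the sequence of input arrays
    q1.put(data_set)
q1.put(None)                        # end of input
for t in threads:
    t.join()
```

Because each stage runs in its own thread and the queues decouple them, an
earlier stage can accept a new data set while later stages are still
processing previous ones, which is exactly the coarse grain pipeline
parallelism described above.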

\subsubsection*{Communicating tasks}
In many scientific applications, different subroutines can
execute in parallel, but need to transfer or exchange data
between invocations. For example, two routines {\em fproca}
and {\em fprocb} may work on different parts of a data structure
and need to exchange boundary elements between successive
invocations. A task parallel program that expresses this
computation is as follows:

\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
c\$ \>{\bf begin parallel} \\
\>     do i = 1,m \\
\>\>     call fproca(A, boundA, boundBx)\\
c\$\>\>  {\bf output} (boundA), {\bf input} (boundBx)\\
\>\>     call fprocb(B, boundB, boundAx)\\
c\$\>\>  {\bf output} (boundB),  {\bf input} (boundAx)\\
\>\>     call ftransfer(boundA, boundB, boundAx, boundBx)\\
c\$\>\>  {\bf input} (boundA, boundB) {\bf output} (boundAx, boundBx)\\
\>     enddo\\
c\$\>   {\bf end parallel}
\end{tabbing}
\rm

The task dependence graph for this computation is
shown in Figure (TOBEMADE). 
There is no direct dependence between
task-subroutines {\em fproca} and {\em fprocb}, but they
have a loop carried dependence on {\em ftransfer} due to variables
{\em boundAx} and {\em boundBx} respectively. Task-subroutine
{\em ftransfer} has a loop independent dependence on 
{\em fproca} and {\em fprocb}, due to variables
{\em boundA} and {\em boundB}. Thus, the iterations of
{\em fproca} and {\em fprocb} execute in parallel and
effectively exchange boundary elements using
{\em ftransfer}. Note that the results obtained by
task parallel execution are consistent with sequential execution.

\subsection{Compilation}
The compiler uses a data flow algorithm to analyze the dependences
caused by input/output variables and precisely determines the 
data movement between tasks. The basic paradigm is that the
results obtained must always be consistent with sequential execution.
It then uses the mapping information
to group task-subroutines into {\em modules}. All task-subroutines
inside a module are mapped to the same set of processor nodes.
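The flavor of this analysis can be sketched as follows. This is our
schematic reconstruction from the input/output directive semantics, not
the actual Fx algorithm, and it covers only loop-independent dependences
within one iteration:

```python
def task_dependences(calls):
    """For a sequence of (name, inputs, outputs) task-subroutine calls,
    return the data-movement edges (producer, variable, consumer)
    implied by sequential semantics: each read must see the value
    left by the most recent write."""
    last_writer = {}
    edges = []
    for name, inputs, outputs in calls:
        for v in inputs:
            if v in last_writer:
                edges.append((last_writer[v], v, name))
        for v in outputs:
            last_writer[v] = name
    return edges

# The FFT-Hist example, as (name, inputs, outputs) triples:
fft_hist = [("colffts", [], ["A"]),
            ("rowffts", ["A"], ["A"]),
            ("hist",    ["A"], [])]
```

For FFT-Hist this yields the two pipeline edges on {\tt A}, from
{\em colffts} to {\em rowffts} and from {\em rowffts} to {\em hist}.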

The compiler generates calls to the runtime system for communication
between modules, and may also generate distribution mapping
routines that the runtime system uses for communication. Inter-module
communication
management is done jointly by the compiler and the runtime system and 
the details are discussed in the next section. The involvement of the
runtime system is necessary since the compiler is aware of the 
communication pattern between modules, but may not be aware of
the mapping of the modules. This is particularly important if
the  mappings can change dynamically, or if some of the modules
contain routines not generated by the Net-Fx compiler.



\section{Run-time Support for Tasking}

The Net-Fx run-time system is responsible for efficiently transporting data 
between modules.  This task is complicated by a number of requirements, 
including dynamic distributions, asynchrony, compile-time customizability, 
and node and network heterogeneity.  In order to meet these requirements, 
the run-time system is specified as three subsystems with well defined 
interfaces between them and the Net-Fx executable.

\subsection{Terms and Abstractions}

A {\em task} is a data parallel subroutine.  One or more tasks are
grouped into a {\em module}, where they execute consecutively.  A
module runs on a set of processes mapped to a {\em processor group},
which is a set of processors that share some common attributes.  There
is a {\em dependence} between two modules if a task in one produces a
data set for a task in the other.  The compiler resolves this
dependence by generating a send call for the producer module and a
receive call for the receiver module.  Every module essentially
repeats three phases.  First, it receives data sets from each
module it is dependent on.  Next, it executes its tasks.  Finally,
it sends data sets to each module that depends on it.
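A module's three-phase execution can be sketched schematically; the names
{\tt recv\_from} and {\tt send\_to} are placeholders for the collective
receive and send calls, not the actual run-time interface:

```python
def run_module(tasks, producers, consumers, recv_from, send_to, steps):
    """Schematic main loop of a module.  producers and consumers are
    the module ids this module depends on and feeds, respectively."""
    for _ in range(steps):
        data = {m: recv_from(m) for m in producers}   # phase 1: receive
        for task in tasks:                            # phase 2: execute
            data = task(data)
        for m in consumers:                           # phase 3: send
            send_to(m, data)

# A toy run: one upstream module (id 0), one downstream (id 2).
inbox = {0: [10, 20]}                # data sets "arriving" from module 0
sent = []
run_module(tasks=[lambda d: {k: v + 1 for k, v in d.items()}],
           producers=[0], consumers=[2],
           recv_from=lambda m: inbox[m].pop(0),
           send_to=lambda m, d: sent.append((m, d)),
           steps=2)
```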

The send and receive calls are collective and non-blocking, and
present the abstraction of a ``module-to-module'' communication of a
distributed data set between uniquely numbered modules.  Below this
abstraction, on both the sending and the receiving module, we think of
the elements of the data set being distributed across consecutively
numbered processes.  This lower level abstraction is internal to the
run-time system.  How the elements map to processes is defined by a
distribution type and a distribution.  A {\em distribution type}
identifies a class of similar distributions, while a {\em
distribution} identifies a specific member of the class.  For example,
the distribution type $HPF\_ARRAY$ includes the distribution
$(*,BLOCK)$.  The combination of the sender and receiver distribution
types identifies a method for mapping between the two distributions, a
process that is called {\em distribution computation}.  The result of
distribution computation is an {\em address relation} which maps
memory addresses in a sending process to memory addresses in a
receiving process.
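A minimal distribution computation can be sketched for one-dimensional
arrays. We use BLOCK and CYCLIC as the two distribution types here (the
example type $HPF\_ARRAY$ above is two-dimensional, but the idea is the
same): the address relation pairs a local offset on each sending process
with a local offset on each receiving process.

```python
def block_owner(i, n, p):
    """Owner process and local offset of global index i under an
    HPF-style BLOCK distribution of n elements over p processes."""
    b = -(-n // p)                    # ceil(n / p): block size
    return i // b, i % b

def cyclic_owner(i, p):
    """Owner process and local offset under a CYCLIC distribution."""
    return i % p, i // p

def address_relation(n, p_send, p_recv):
    """Distribution computation for a BLOCK -> CYCLIC remap: for each
    (sender, receiver) process pair, the (local_src, local_dst)
    address pairs that must be transferred between them."""
    rel = {}
    for i in range(n):
        sp, so = block_owner(i, n, p_send)
        rp, ro = cyclic_owner(i, p_recv)
        rel.setdefault((sp, rp), []).append((so, ro))
    return rel
```

The result is one list of address pairs per (sender, receiver) process
pair, which is the form in which the communication subsystem can consume
the relation.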

The distribution of a data set may be {\em static} or {\em dynamic}.
A static distribution never changes while a dynamic one may change.  A
distribution type never changes.  For example, a load balanced
distribution type that involves data movement will certainly result in
a dynamic distribution.  The mapping of a module's processes to a
processor group may also be static or dynamic.  The process of finding
the current values of distributions or module mappings is referred
to as {\em binding} and the results are {\em bindings}.  Bindings are
static or dynamic depending on the distribution or module mappings
they represent.

\subsection{Subsystems and Interfaces}

\begin{figure}
\centerline{\epsfxsize=5in \epsfbox{Int1.eps}}
\caption{The Net-Fx run-time system: subsystems, interfaces, and relation 
to the Net-Fx executable.  Each arrow represents an interface and points
from the caller of the interface to the callee.  Note that some callees
have several callers.}
\label{fig:interface}
\end{figure}

This section summarizes the Net-Fx run-time system's subsystems and the 
interfaces between them and the Net-Fx executable.  Figure 
\ref{fig:interface} graphically shows these relationships.  The run-time system is 
primarily responsible for transporting distributed data structures between 
modules.  This communication must avoid synchronizing the source and 
destination modules, but at the same time, it must support changing data 
distribution and module bindings.  The details of binding, distribution 
computation, and of actual data transport over a likely complex and 
irregular network are hidden from the Net-Fx executable below two simple 
interfaces.  The first of these is the {\em intertask communication 
interface}, which consists of simple, high-level, non-blocking collective 
send and receive calls for transporting whole distributed data structures 
between modules.  The second is the {\em registry interface}, which 
allows the Net-Fx node program to describe the characteristics of relevant 
modules, data distribution types, and custom distribution computation 
functions to the run-time system.
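The registry interface can be sketched as follows; the method names are
our invention for illustration, since the concrete calls are not given
here:

```python
class Registry:
    """Sketch of the registry: the node program records module
    bindings, distribution types, and distribution computation
    functions, which the run-time system later looks up."""
    def __init__(self):
        self.modules = {}     # module id -> (mapping, 'static' | 'dynamic')
        self.dist_funcs = {}  # (send_type, recv_type) -> function

    def register_module(self, mid, mapping, kind="static"):
        self.modules[mid] = (mapping, kind)

    def register_dist_func(self, send_type, recv_type, fn):
        self.dist_funcs[(send_type, recv_type)] = fn

    def bind_dist_func(self, send_type, recv_type):
        """Bind a pair of distribution types to their computation
        function (a local lookup when the binding is static)."""
        return self.dist_funcs[(send_type, recv_type)]
```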

Within the run-time system itself, the tasks of distribution computation, 
maintaining bindings, and communication are segregated into subsystems with 
well defined interfaces between them.  This makes it possible to 
independently extend each subsystem.  Although we expect that the binding 
subsystem will rarely change, the number of distribution types needed will 
grow as Net-Fx is used to coordinate non-Fx modules.  Further, to 
achieve high performance, it will be necessary to specialize the 
communication system for different networks.  By clearly defining the 
interfaces between subsystems, we make this specialization possible and 
easy.  Indeed, all the subsystems are ``plug-and-play.''

The {\em communication subsystem} is the heart of the run-time system and 
exports the previously discussed intertask communication interface to the 
Net-Fx program.  It makes use of the {\em binding subsystem} to determine 
current module and distribution bindings.  In return it supplies the 
binding subsystem with a simple {\em binding communication interface} in 
case communication is necessary.  If the communication system determines 
that distribution computation must be done (for example, a dynamic 
distribution may have changed), it is performed by the {\em distribution 
help subsystem}.  This subsystem serves two purposes.  First, it dispatches 
the appropriate distribution computation function for the sender and 
receiver distribution types.  Second, it conforms the address relation 
returned by that function to the format desired by the communication 
system.  (The communication system can request that a message be assembled 
instead.) The subsystem uses the binding interface to bind the two 
distribution types to the distribution computation function.  This function 
may be one of the {\em distribution help library functions} linked to every 
Net-Fx executable, or a compiler-generated {\em custom distribution help 
function}.  In either case, the function conforms to the {\em standard 
distribution computation interface}.

Once address relations have been computed (or messages assembled),  the
communication system can use the module bindings to identify the 
actual processors to communicate with and perform the data transfer.   On
the receiving module, essentially the same steps are followed, with 
obvious changes.

\subsection{An Example}

\begin{figure}
        \centerline{\epsfxsize=5in \epsfbox{useint.eps}}
	\caption{Example of the steps followed for an intertask send using a 
	custom distribution help function.}
	\protect\label{fig:steps}
\end{figure}

In this section, we show how an intertask send using a custom distribution 
help function is performed.  Figure \ref{fig:steps} shows the steps 
graphically.  On the right of the figure are the actual calls performed by 
the node program and internally for the intertask send.  There is a 
dependence from module $M1$ to module $M2$, which the compiler has 
satisfied by generating a send from $M1$ to $M2$ (and a matching receive 
in $M2$, which is not discussed here). The compiler also produced a custom 
distribution computation function $distfunc1$ for this communication.  At 
runtime, $M1$'s first step is to use the registry interface to register the 
initial values of all relevant bindings.  The module bindings for $M1$ and 
$M2$ are registered as static, so their values will never change.  Also 
registered as static are the distribution types $DT1$ and $DT2$ on $M1$ 
and $M2$, respectively.  Finally, it registers $distfunc1$ as the 
distribution computation function for $DT1$ and $DT2$.    

The second step (which may be repeated many times) is to invoke a send
of the distributed data structure $A$ to $M2$.  The intertask send call 
includes the target module number, $M2$, the first element of $A$ that 
is local to the process, the data type of $A$'s elements (REAL), the 
sender and receiver distribution types, $DT1$ and $DT2$, and the current 
sender and receiver distributions, $DISTA1$ and $DISTA2$, respectively.   
Now in the context of the intertask send call, the third step is to bind 
the module $M2$ to its current physical mapping, and the distribution
types and distributions to their current values.  Both of these 
steps are simple in this case because the bindings are static: they 
amount to local lookups.   In the fourth step, the communication subsystem 
calls the distribution help subsystem to build an address relation 
between each pair of sender and receiver processes $x$ and $y$.   The 
distribution help subsystem accomplishes this by binding the two 
distribution types, $DT1$ and $DT2$, to their corresponding distribution 
computation function, $distfunc1$ (step 5) and calling the function for 
each pair (step 6).   The relations that are returned are conformed to 
the format the communication subsystem wanted and returned to it.   
Finally, the relations are used to perform the actual communications in an
environment-dependent manner.  The communication subsystem may decide to
keep the address relations handy to avoid steps 4-6 the next time the 
intertask send occurs.  Indeed, because the distributions are static, the 
relations could be kept indefinitely.
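The caching decision in the last step can be sketched as a memo keyed by
the send's bindings; {\tt compute\_relation} stands in for steps 4-6 and
{\tt transport} for the final data transfer (both names are ours, not
part of the actual interface):

```python
def make_intertask_send(compute_relation, transport):
    """Sketch of an intertask send that keeps address relations handy:
    when all relevant bindings are static, steps 4-6 run only once."""
    cache = {}
    def send(target, data, send_dist, recv_dist, static=True):
        key = (target, send_dist, recv_dist)
        if static and key in cache:
            relation = cache[key]                  # reuse: skip steps 4-6
        else:
            relation = compute_relation(send_dist, recv_dist)
            if static:
                cache[key] = relation
        transport(target, relation, data)
    return send

computed = []
send = make_intertask_send(
    compute_relation=lambda a, b: computed.append((a, b)) or (a, b),
    transport=lambda target, rel, data: None)
send("M2", [1.0], "DISTA1", "DISTA2")
send("M2", [2.0], "DISTA1", "DISTA2")   # cached: no recomputation
```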


\section{Status and implementation}


\bibliography{/afs/cs/project/iwarp/member/jass/bib/all}

\end{document}
