\input epsf
%\input{psadobe}
\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother
\documentstyle[fleqn]{art-nocolon}
\input{psfig}
%\include{caption}


\textheight =  9.0in
\textwidth = 7.0in              % assume output has this much centered

\topmargin = -.5 in 
\oddsidemargin = -.25 in
\evensidemargin = -.25 in

\renewcommand{\floatpagefraction}{1}
\renewcommand{\topfraction}{1}
\renewcommand{\bottomfraction}{1}
\renewcommand{\textfraction}{0}
\renewcommand{\baselinestretch}{1}
\renewcommand{\arraystretch}{1.5}


\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newcommand{\ignore}[1]{}

\renewcommand{\to}{\rightarrow}
\newcommand{\A}{{\cal A}}
\begin{document}
\bibliographystyle{acm}

%\load{\footnotesize}{\sc}
\title{ 
%\begin{flushleft}
%\vspace{-1.0in}
%\centerline{\scriptsize\em (Extended abstract submitted for PLDI 94)}
%\normalsize { }
%\end{flushleft}
%\vspace{0.4in}
Language and Runtime Support for Network Parallel Computing\\
}
\author{}
\date{   {email:\tt \{\}@cs.cmu.edu} \vspace*{.2in} \\
      School of Computer Science \\
         Carnegie Mellon University \\
         Pittsburgh PA 15213 \vspace*{.2in}}
%        Corresponding Author: Jaspal Subhlok \\
%        Phone: (412) 268 7893}


\maketitle
\thispagestyle{empty}
\abstract{
\normalsize
The
latencies and bandwidths available on networks have improved
dramatically, making it possible to distribute applications across a
heterogeneous network and execute them efficiently. However, programming
support for network parallel applications is
rudimentary, especially in the context of standard languages.
This paper describes how the Fx compiler at Carnegie Mellon
for a variant of  High Performance Fortran is being extended
to support network parallel computing. Task parallelism is used
to distribute and manage computations across the sequential
and parallel machines of a network. We introduce a network communication
interface that makes it possible to efficiently
compile programs for heterogeneous networks even in the 
absence of full runtime information, which is often the 
case in network parallel computing. We describe how Fx
programs are compiled onto the communication interface and discuss
the issues in implementing the interface. The result is that
a programmer is able to write a single program that executes
efficiently across the network, even in the presence of architectural
heterogeneity and load imbalance.
}

%\voffset = -.5 in
\normalsize

\setcounter{page}{1}
\section{Introduction}\label{intro}
With the advent of standard parallel languages like High Performance
Fortran (HPF), parallelizing compilers have become the most convenient
and powerful way of developing programs for parallel machines. These
standard languages make it easy to run the same program efficiently on
a wide variety of sequential and parallel machines.

At the same time, recent advances in standard network protocols (e.g.,
HiPPI and ATM) have made it possible to link these different machines
together and build network parallel applications. The low latency and
high bandwidths of the new protocols offer the possibility of high
efficiency, and the standard languages offer the benefit of a common
programming environment.

However, standard parallel languages offer rudimentary support, at
best, for network communication. The low-level details of the network
protocols are usually exposed, making the
construction of an efficient network parallel application a difficult and
error-prone task in which the applications programmer must become a
network specialist. 


The goal of our research is to make it possible to program
network parallel applications easily and efficiently. Compiling 
for a variety of machines across a network presents several
new challenges. This paper shows how parallelism
inside a single machine and across a set of machines can be
expressed in a uniform framework. We also present the design
of a runtime system that enables efficient compilation of
programs for network parallel computing.

The programming model we use combines task and data 
parallelism. Procedures in an application are data parallel,
and each procedure is mapped to a homogeneous set of nodes,
which may be different nodes of a large parallel machine
or a cluster of workstations. Different procedures are placed
on different sets of nodes on the network, and the compiler
manages the communication between them.

%For several reasons,
The most challenging part of compilation for
a network is generating efficient communication. The most efficient
communication interface across a network is typically different from the 
communication interface between the nodes of a parallel machine, hence
the compiler has to manage multiple communication paradigms.
More importantly, the network environment is expected to be 
dynamic. Runtime load balancing is important even for applications
with a static computation structure since the external
load on different machines  can vary. Hence the compiler has
limited information about the runtime environment during 
code generation.

For generating efficient communication in a dynamic runtime
environment, it is important to distinguish between the
aspects of the communication that are constant for the
duration of the execution and those that vary and 
are known only at runtime. In our model, the program is
partitioned into modules and each module is mapped to a
set of processors. The communication steps between modules
are determined at compile time, while the actual low-level communication
calls may depend on the runtime environment.

Our solution to generating efficient communication involves
defining an intermodule communication interface. This 
interface provides mechanisms for communication between
modules that are mapped onto groups of nodes. The compiler
generates code for this communication interface and the underlying
implementation of the interface ensures that appropriate low
level communication calls are generated.

We begin by motivating the need for network parallel applications with
a few examples. Then we describe our approach, which we call Net-Fx, as
it is based on our variant of High Performance Fortran, Fx.

\section{Motivation for Network Parallel Computing}

Network parallel programming makes it possible to build larger and more 
powerful application systems than can be built without it, but this is not 
the only or even the best motivation.  We are developing applications in 
which the network parallelism is an inherent feature, chiefly due to the 
need for heterogeneity, regardless of the size of the computer system used.  
These applications include real-world modeling and air quality modeling.

\subsection{Network Parallel Computation for Real-World Modeling}

High performance computers and their programming tools have
traditionally been applied to problems in scientific modeling:
simulation, analysis, and the like. We are applying these same systems
and tools to problems in real-world modeling, by which we mean the
three-dimensional shape modeling of objects and environments in the
real world by direct observations using computer vision techniques. In
so doing we hope to replace complex and specialized sensory systems
based on technology like laser range-finding by much simpler, more
flexible and scalable, but computationally more expensive, systems
like stereo vision.

In these applications the sensor system plays a key role. In some
situations, e.g., imaging moving objects, it is necessary to capture
imagery at high speeds, i.e., video rate (30 Hz) or faster. Capturing
data at this rate is complicated by the need to synchronously access
several cameras (recently, the most significant advances in stereo
vision accuracy and reliability have come from the introduction of
multiple cameras).  In other situations, e.g., imaging building
interiors, speed of image capture is not so important but the imaging
system must be portable.

\begin{figure}
\centerline{\epsfxsize=4in \epsfbox{netargos.eps}}
\caption{The envisioned network parallel system for real-world modeling,
with a detachable camera unit that can be attached to networked
workstations or to portable machines.}
\label{fig:netargos}
\end{figure}

We believe these conflicting requirements can be met through the
application of modern networking technology. The system we envision is
illustrated in Figure \ref{fig:netargos}. The camera system is a
detachable unit, and can be attached either to a large number of
workstations, giving a high frame capture rate, or to a smaller number
of portable workstations or PCs, giving portability at the expense of
speed. Once the images are captured (and the portable computer system
is reattached to the network, if necessary), they are transformed into
stereo vision data through the application of parallel stereo vision
techniques we have invented and coded in Fx; because of Fx's
architecture independence we can run these programs on the
workstations themselves or on powerful tightly-coupled high
performance computers.  The resulting range imagery is then
transformed into a three-dimensional model, again in parallel using
Fx, and the resulting model can then be stored to disk or displayed
remotely, enabling sophisticated three-dimensional videoconferencing.

\subsection{Air Quality Modeling}

The air quality modeling application \cite{cmutaskparallelsuite} also 
motivates task parallelism on networks.  This application essentially has 
three data parallel steps, repeated indefinitely.  The first step, reading 
input data from storage and preprocessing it, is I/O intensive.  The second 
step, computing chemical interaction in the atmosphere, exhibits massive 
parallelism and little communication.  The third step, computing the 
movement of chemicals via the wind, has significant parallelism and 
communication.  Clearly, each of these steps is well suited to a different 
machine.  For example, an Fx implementation of the chemistry step ran as 
fast on four Ethernet-connected DEC Alpha workstations as on 32 nodes of an 
Intel Paragon! Net-Fx will make it possible to run each step of the 
application on an appropriate machine without dealing with maddening 
details of their interaction.

\section{Programming model for Net-Fx}
The basic programming model provides support for task and
data parallelism and is not changed from Fx~\cite{SSOG93,GrOS94}.
Support for data parallelism is similar to that provided in
High Performance Fortran and is described in~\cite{StOG94,YWSO93}.
Task parallelism is used to map data parallel subroutines onto
different, possibly heterogeneous, groups of processors. We outline
how task parallelism is expressed in Net-Fx and illustrate how it
is used to exploit coarse grain parallelism.


\subsection{Task parallelism}

We have not introduced any new language features and rely entirely on
compiler directives for expressing task parallelism.  To simplify the
implementation, the current version of Fx relies on the user to
identify and specify the side effects of the task-subroutines.
Directives are also used to guide the compiler in making
performance related decisions like program mapping.  In this section,
we describe the directives and hints that are used to express task
parallelism and illustrate their use for the  FFT-Hist example.

\subsubsection*{Parallel sections}
Calls to task-subroutines are permitted only in special code regions
called {\em parallel sections}, denoted by a {\tt begin parallel/end
parallel} pair. For example, the parallel section for the FFT-Hist
example has the following form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
c\$ \>{\bf begin parallel} \\
\>     do i = 1,m \\
\>\>     call colffts(A)\\
c\$\>\>  {\em input/output and mapping directives}\\ \vspace{3mm}
\>\>     call rowffts(A)\\
c\$\>\>  {\em input/output and mapping directives}\\ \vspace{3mm}
\>\>     call hist(A)\\
c\$\>\>  {\em input/output and mapping directives}\\
\>     enddo\\
c\$\>   {\bf end parallel}
\end{tabbing}
\rm
The code inside a parallel section can only contain loops and
subroutine calls.  These restrictions are necessary to make it
possible to manage shared data and shared resources (including
nodes) efficiently at compile time.

A parallel section corresponds to a mapping of task-subroutines to
nodes. The corresponding mapping outside the parallel section
is a simple data parallel mapping, where every routine is mapped
to all nodes. The current implementation does not allow
nesting of parallel sections.

\subsubsection*{Input/output directives}

The user includes {\tt input} and {\tt output} hints to
define the side-effects of a task-subroutine, i.e., the data space
that the subroutine accesses and modifies.  Every variable whose value
at the call site may potentially be used by the called subroutine must
be added to the input parameter list of the task-subroutine.
Similarly, every variable whose value may be modified by the called
subroutine must be included in the output parameter list.  A variable
in the input or output parameter list can be a scalar, an array, or an
array section.  An array section must be a legal Fortran~90 array
section, with the additional restriction that all the bounds and step
sizes must be constant.

For example, the input and output directives for the call to {\tt
rowffts} have the form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
\>\>     call rowffts(A)\\
c\$\>\>  {\bf input} (A),  {\bf output} (A)\\
c\$\>\>  {\em mapping directives}\\
\end{tabbing}
\rm
This tells the compiler that the subroutine {\tt rowffts}
can potentially use the values of, and write to, the parameter array {\tt A}.
As another example, the input and output directives
for the call to {\tt colffts} have the form:
\tt
\begin{tabbing}
tttttttt\=ttt\=ttt\= ttt\=ttt\= \kill
\>\>     call colffts(A)\\
c\$\>\>  {\bf output} (A)\\
c\$\>\>  {\em mapping directives}\\
\end{tabbing}
\rm

This tells the compiler that subroutine {\tt colffts} does not use the value
of any parameter that is passed but  can potentially write to array {\tt A}
(which is set to values read from a sensor by {\tt colffts}).

\subsubsection*{Mapping directives}

\subsection{Compiling for networks}

\section{Run-time Support for Tasking}

The Net-Fx run-time system is responsible for efficiently transporting data 
sets between modules.  This task is complicated by a number of 
requirements, including dynamic distribution and module bindings, 
asynchrony, the desire for compile-time customizability, and node and 
network heterogeneity.  The interface exported by the run-time system 
presents the abstraction of high-level module-to-module communication of 
distributed data sets via collective non-blocking send and receive 
operations.  Compile-time customizability is achieved by a facility for 
registering compiler-generated distribution functions with the 
run-time system.  This mechanism also provides a path for extending the 
range of supported distribution types.  The run-time system itself consists 
of three subsystems with well defined interfaces between them.  More than a 
good design practice, this allows us to independently specialize subsystems 
for particular environments in order to deal with heterogeneity.

In this section, we first define the terms and abstractions we use.  
Next, we describe the interface exported by the run-time system.  This is 
followed by an explanation of the run-time system's internal structure 
and interfaces.  Finally, we show how the interfaces are traversed during 
the execution of a simple send operation.


\subsection{Terms and Abstractions}

A {\em task} is a data parallel subroutine.  One or more tasks are
grouped into a {\em module}, where they execute consecutively.  A
module runs on a set of processes mapped to a {\em processor group},
which is a set of processors that share some common attributes.  There
is a {\em dependence} between two modules if a task in one produces a
data set for a task in the other.  The compiler resolves this
dependence by generating a send call for the producer module and a
receive call for the consumer module.  Every module essentially
repeats three phases.  First, it receives data sets from each
module it is dependent on.  Next, it executes its tasks.  Finally,
it sends data sets to each module that depends on it.

The elements of a data set are partitioned across the processes of a module 
according to some rule.  This rule is identified by a distribution type and 
a distribution.  A {\em distribution type} identifies a class of similar 
rules, while a {\em distribution} identifies a specific member of the 
class.  For example, the distribution type of $HPF_ARRAY$ includes the 
distribution $(*,BLOCK)$.  Two distribution types identify a method for 
mapping between their member distributions, a process called {\em 
distribution computation}.  Given two distributions of the appropriate type 
and a process pair, the method, or {\em distribution function}, returns an 
{\em offset relation} between offsets into the chunk of elements owned by 
each process.  Given the address of the initial element owned by each 
process and the machine architectures they run on, the offset relation can 
be trivially converted into an {\em address relation}, which maps memory 
addresses on one machine to memory addresses on the other.  Alternatively, 
the distribution function can directly assemble and disassemble message 
buffers.
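To make distribution computation concrete, the following sketch enumerates an offset relation between a one-dimensional $BLOCK$ sender distribution and a $CYCLIC$ receiver distribution. It is a brute-force illustration only (the function names and the Python rendering are ours, not part of the Net-Fx library); a real distribution function would compute the relation in closed form.

```python
# Illustrative sketch, not the Net-Fx library: brute-force computation of
# an offset relation between a 1-D BLOCK sender and a CYCLIC receiver.

def block_owner(i, n, p):
    """Map global index i to (process, local offset) under BLOCK over p processes."""
    b = (n + p - 1) // p                 # block size per process
    return i // b, i % b

def cyclic_owner(i, n, q):
    """Map global index i to (process, local offset) under CYCLIC over q processes."""
    return i % q, i // q

def offset_relation(n, p, q, sender, receiver):
    """All (sender offset, receiver offset) pairs for one process pair."""
    rel = []
    for i in range(n):                   # walk the global index space
        sp, so = block_owner(i, n, p)
        rp, ro = cyclic_owner(i, n, q)
        if sp == sender and rp == receiver:
            rel.append((so, ro))
    return rel
```

For example, with 8 elements, two sender processes, and two receiver processes, process pair $(0,0)$ shares global indices 0 and 2, giving the offset relation $\{(0,0),(2,1)\}$. Because the relation is expressed purely in local offsets, it is architecture-independent, as noted above.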

The distribution of a data set may be {\em static} or {\em dynamic}.  A 
static distribution never changes while a dynamic one may change.  A data 
set's distribution type never changes.  For example, a load balanced distribution 
type that involves data movement will certainly result in a dynamic 
distribution.  The mapping of a module's processes to a processor group may 
also be static or dynamic.  The process of finding the current values of 
distributions or module mappings is referred to as {\em binding} and the 
results are {\em bindings}.  Bindings are static or dynamic depending on 
the distribution or module mappings they represent.

The run-time system presents the abstraction of ``module-to-module'' 
communication of a distributed data set between uniquely numbered modules.  
This is a natural extension of point-to-point message passing where the 
endpoints are groups of processes and the data sets are distributed.  
Within the run-time system, the next lower level of abstraction is a 
mapping of each element of a data set to an element offset from an initial 
address on each process of a set of consecutively numbered processes.  This 
is the abstraction used during distribution computation.  Notice that this 
abstraction makes distribution computation and the resulting offset 
relation architecture-independent.   The abstractions play a substantial 
role in making the components of the run-time system interchangeable.

\subsection{Interface}

The interface between the Net-Fx executable and the run-time system has 
three components as shown in Figure \ref{fig:interface}.  The first is the 
{\em intermodule communication interface}, which consists of simple, 
high-level, non-blocking collective send and receive calls for transporting 
whole distributed data sets between modules.  It is required that if more 
than one data set is transferred between a pair of modules, the order of the 
sends and receives on the modules be the same.  Of course, the module must 
take care that all receives complete before executing its tasks.  The 
arguments of the send and receive calls are
\begin{itemize}
\item Target (send) or source (receive) module 
\item Address of the first element local to the calling process
\item Data type of each element
\item Sender distribution type and distribution
\item Receiver distribution type and distribution
\item Hint (implementation specific)
\end{itemize}
There is a single call to wait for all outstanding receives to complete.  
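A minimal stub of this interface, written in Python purely for illustration, might look as follows. The argument names mirror the list above, but the class and method names are hypothetical, and a real implementation would transfer data rather than record calls.

```python
from dataclasses import dataclass, field

# Hypothetical stub of the intermodule communication interface; it only
# records the non-blocking calls so the matching-order rule is visible.

@dataclass
class IntermoduleComm:
    pending: list = field(default_factory=list)

    def send(self, target, addr, etype, sdtype, sdist, rdtype, rdist, hint=None):
        # Non-blocking collective send of a whole distributed data set.
        self.pending.append(("send", target, etype))

    def recv(self, source, addr, etype, sdtype, sdist, rdtype, rdist, hint=None):
        # Non-blocking collective receive; matched by call order per module pair.
        self.pending.append(("recv", source, etype))

    def wait_all(self):
        # The single call that waits for all outstanding operations to complete.
        done, self.pending = self.pending, []
        return done
```

Note that the distribution arguments appear on both sides of the transfer: each call carries the sender and receiver distribution types and distributions, so either endpoint can drive distribution computation.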

Before performing sends and receives, the Net-Fx program must describe the 
characteristics of relevant modules, distribution types, and custom 
distribution functions to the run-time system.  This is done through the 
{\em registry interface}.  The Net-Fx program explicitly registers module 
mappings or distribution types as dynamic or static.  All dynamic mappings 
or types must be registered.  In the simplest case, where module mappings 
are static, and only static, library-supported distribution types are used, 
only the initial module mappings are registered and predefined constants are 
used for distribution types.  To use a predefined distribution type that is 
dynamic, the program must register a dynamic instance of it.  Dynamic 
bindings are updated implicitly by examining send and receive calls.

For several reasons, a program may want to register its own distribution 
types.  One reason is extensibility --- the distribution type may not be 
supported by the library.  Another is efficiency --- the compiler may have 
been able to generate a specialized distribution function for a particular 
communication.  In either case, the program must also register custom 
distribution functions for any communication that involves the new 
distribution type.  For example, suppose module 1 communicates from a new 
distribution type $DTX$ to predefined types $DT2$ on module 2 and $DT3$ 
on module 3.  All three modules must register distribution type $DTX$.
Module 1 must register new distribution functions $\lambda (DTX,DT2)$ and 
$\lambda (DTX,DT3)$, module 2 must register $\lambda (DTX,DT2)$, and module 
3 must register $\lambda (DTX,DT3)$.  All distribution functions must 
conform to a common calling convention, the {\em standard distribution
function interface}.  Each distribution function must be able to both
return offset relations or assemble/disassemble message buffers.  The 
run-time system decides which functionality will be used.
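The pair-keyed lookup implied by this scheme can be sketched as follows (Python for illustration; the class and method names are hypothetical stand-ins for the registry interface):

```python
# Hypothetical registry sketch: distribution functions are looked up by
# the (sender type, receiver type) pair, as in the DTX example above.

class Registry:
    def __init__(self):
        self.dist_funcs = {}   # (sender type, receiver type) -> function
        self.modules = {}      # module id -> mapping information

    def register_module(self, mid, mapping, dynamic=False):
        # Register a module's processor mapping as static or dynamic.
        self.modules[mid] = {"mapping": mapping, "dynamic": dynamic}

    def register_dist_func(self, stype, rtype, func):
        # Register a library or compiler-generated custom distribution function.
        self.dist_funcs[(stype, rtype)] = func

    def bind(self, stype, rtype):
        # Bind a distribution type pair to its distribution function.
        return self.dist_funcs[(stype, rtype)]
```

In the example above, module 1 would call the registration twice, once for $(DTX,DT2)$ and once for $(DTX,DT3)$, while modules 2 and 3 each register only the pair they participate in.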

\subsection{Internal Structure}

\begin{figure}
\centerline{\epsfxsize=5in \epsfbox{Int1.eps}}
\caption{The Net-Fx run-time system: subsystems, interfaces, and
relation to the Net-Fx executable.  Each arrow represents an interface
and points from the caller of the interface to the callee.}
\label{fig:interface}
\end{figure}


Recall that the run-time system presents the abstraction of ``module-to-module'' 
communication of a distributed data set between uniquely numbered modules.  
This requires distribution computation, binding maintenance, and efficient 
communication over heterogeneous networks.  These tasks are segregated into 
three subsystems (distribution, binding, and communication) with well 
defined interfaces between them.  More than just good design practice, the 
ability to independently develop and enhance each subsystem is vital for 
extensibility and good performance.  Clearly, the number of distribution 
types will grow as Net-Fx is used to coordinate non-Fx modules.  Further, 
compile-time optimization of distribution computation is important for 
performance \cite{stichnot94}.  There are a variety of mechanisms that 
could be used to support dynamic bindings and it is unclear that any one is 
best for all environments.  Finally, to achieve high performance, it will 
be necessary to specialize the communication system for different networks.  
By clearly defining the interfaces between subsystems, we make this 
specialization possible and easy.  Indeed, all the subsystems are 
``plug-and-play.''

The internals of the run-time system are shown in Figure 
\ref{fig:interface}.  The {\em communication subsystem} is the heart of the 
run-time system and exports the previously discussed intermodule 
communication interface to the Net-Fx program.  It makes use of the {\em 
binding subsystem} to determine current module and distribution bindings.  
In return it supplies the binding subsystem with a simple {\em binding 
communication interface} in case intermodule communication is necessary.  
If the communication subsystem determines that distribution computation 
must be done (for example, a dynamic distribution may have changed), it 
calls on the {\em distribution subsystem} via the {\em distribution 
interface}.  This subsystem serves two purposes.  First, it dispatches the 
appropriate distribution function for the sender and receiver distribution 
types.  Second, it conforms the offset relation returned by that function 
to the format desired by the communication system and converts it to an 
address relation.  (The communication system can request that a message be 
assembled or disassembled instead.) The subsystem uses the binding 
interface to bind the two distribution types to the distribution function.  
This function may be one of the {\em library distribution functions} linked 
to every Net-Fx executable, or a compiler-generated {\em custom 
distribution function}.  In either case, the function must conform to the 
{\em standard distribution function interface}.

Once address relations have been computed (or messages assembled), the 
communication system can use the module bindings to identify the actual 
processors to communicate with and perform the actual data transfer in a 
network- and architecture-dependent manner.  On the receiving module, 
essentially the same steps are followed, with obvious changes.

\subsection{Example}

\begin{figure}
        \centerline{\epsfxsize=5in \epsfbox{useint.eps}}
	\caption{Example of the steps followed for an intertask send using a 
	custom distribution function.}
	\protect\label{fig:steps}
\end{figure}

For clarity, we show how an intertask send using a custom distribution 
function is performed.  Figure \ref{fig:steps} shows the steps graphically.  
On the right are the actual calls performed by the node program and 
internally for the intertask send.  Notice that the calls are numbered to 
correspond with the interfaces used and with the steps discussed in this 
section.  There is a dependence from module $M1$ to module $M2$, which the 
compiler has satisfied by generating a send from $M1$ to $M2$ (and a 
matching receive in $M2$, which is not discussed here).  The compiler also 
produced a custom distribution function $DISTFUNC$ for this communication.  
At runtime, $M1$'s first step is to use the registry interface to register 
the initial values of all relevant bindings.  The module bindings for $M1$ 
and $M2$ are registered as static, so their values will never change.  Also 
registered as static are the distribution types $DT1$ and $DT2$.  Finally, 
$M1$ registers $DISTFUNC$ as the distribution function for $DT1$ and $DT2$.

The second step (which may be repeated many times) is to invoke a send of 
the distributed data set $A$ to $M2$.  The intermodule send call includes 
the target module number, $M2$, the first element of $A$ that is local to 
the process, the data type of $A$'s elements (REAL), the sender and 
receiver distribution types, $DT1$ and $DT2$, and the current sender and 
receiver distributions, $DISTA1$ and $DISTA2$, respectively.  Now in the 
context of the intermodule send call, the third step is to bind the two 
modules to their current physical mappings, and the distribution types and 
distributions to their current distributions.  Both of these steps are 
simple in this case because the bindings are static --- they amount to 
local lookups.  In the fourth step, the communication subsystem calls the 
distribution subsystem to build an address relation between each pair of 
sender and receiver processes $P1$ and $P2$.  The distribution subsystem 
accomplishes this by binding the two distribution types, $DT1$ and $DT2$, 
to their corresponding distribution function, $DISTFUNC$ (step 5) and 
calling the function for each pair (step 6).  The offset relations returned 
are conformed to the format the communication subsystem expects and 
converted to address relations.  Finally, the relations are used to perform 
the actual communication in an environment-dependent manner.  The 
communication subsystem may decide to keep the address relations handy to 
avoid steps 4-6 the next time the intertask send occurs.  Indeed, because 
the distributions are static, the relations could be kept indefinitely.
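The six steps can be summarized in an illustrative script (Python; every name is a hypothetical stand-in for the entities in the figure, and the custom distribution function is reduced to an identity relation over four elements):

```python
# Illustrative walkthrough of steps 1-6 for a static configuration;
# all names (registry, DISTFUNC, DT1, ...) are hypothetical stand-ins.

registry = {"modules": {}, "funcs": {}}

def distfunc(dist1, dist2, p1, p2):
    # Step 6: the custom distribution function returns an offset relation.
    # Identity relation over four elements, for illustration only.
    return [(k, k) for k in range(4)]

# Step 1: register the static bindings.
registry["modules"]["M1"] = ["node0", "node1"]
registry["modules"]["M2"] = ["node2", "node3"]
registry["funcs"][("DT1", "DT2")] = distfunc

# Step 2: the send names the target module, types, and current distributions.
send_args = ("M2", "DT1", "DISTA1", "DT2", "DISTA2")

# Step 3: bind the target module to its physical mapping (a local lookup).
targets = registry["modules"][send_args[0]]

# Steps 4-5: bind the distribution type pair to its distribution function.
f = registry["funcs"][(send_args[1], send_args[3])]

# Step 6: call it per sender/receiver process pair to get offset relations.
relations = {(p1, p2): f("DISTA1", "DISTA2", p1, p2)
             for p1 in range(2) for p2 in range(len(targets))}
```

Because every binding here is static, the computed relations could be cached and reused across iterations, exactly as the text observes.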


\section{Status and implementation}

The Fx compiler currently supports task parallelism on the Intel iWarp, 
Intel Paragon, and on DEC Alpha workstations running PVM.  Compiler 
analysis of parallel sections (although not mapping directive support or 
code emission) is the same for each target.  The run-time system used on 
the Paragon and the Alphas was initially developed for the Alphas with an 
eye towards portability and serves as a prototype for the system discussed 
above.  The prototype essentially has three components: a component that 
computes offset relations for any two (static) HPF distributions, a 
component that conforms these relations to a particular format and caches 
them, and a component that does communication using the relations.  Porting 
this run-time system to the Paragon (and achieving good performance) 
required rewriting only the communication component.  Further details of 
the library are available in \cite{Dinda95}.  


\section{Other Work}

Run-time systems that support communicating data parallel tasks are a fairly 
new research area.  It seems to be agreed that high-level communication 
support is the most important component.  Opus \cite{Hains95} supports 
communication between tasks using intermediaries called shared data 
abstractions (SDAs).  An SDA is essentially an independent object class 
whose private data is distributed over some set of threads and whose 
methods are protected with monitor semantics.  For example, one could 
define a queue SDA whose elements are block-distributed vectors.  To send a 
vector to another task, the sender would first send it to an SDA queue object 
(which requires a distribution computation) using the enqueue method.  The receiving 
task would invoke the dequeue method, which would cause the SDA queue object to 
send it a data set (again requiring a distribution computation).  SDAs 
can support many more tasking paradigms than our scheme, but can double the 
communication volume and result in unnecessary synchronization between 
tasks due to monitor semantics.

Closer to our scheme is the library described in \cite{Avalini95}.  This 
library supports communication between data parallel tasks with collective 
sends and receives.  Sends (and receives) are split between an 
initialization step, which performs distribution computation and builds 
communication schedules, and an execution step, which actually performs the 
send.  Initializations synchronize sending and receiving tasks, but can be 
amortized over many executions.  Only static module bindings and static HPF 
distributions are supported.  All distribution computation is done with 
general-purpose library routines.


\bibliography{/afs/cs/project/iwarp/member/jass/bib/all}

\end{document}
