\documentstyle{article}
\setlength{\topmargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textheight}{9in}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\marginparsep}{0in}
\setlength{\marginparwidth}{0in}



\begin{document}

\section{Run-time Support for Tasking}

\subsection{Motivation}

In a Net-Fx program, data parallel tasks running in submodules produce
and consume distributed data structures such as HPF arrays or DOME
load-balanced vectors.  Communicating distributed data structures
between submodules can involve quite complex {\em distribution
computation} performed by code generated by a compiler or built into a
run-time system.  Traditionally, this code has used low level
communication primitives supplied by an oblivious communication system
to actually transport data.  This works fine when the network are
simple and highly regular, such as the network of a parallel
supercomputer.  However, the network that Net-Fx orchestrated tasks
will use to communicate may have an arbitrary topology, links with
different speeds, and many different protocol stacks.  Further, the
network may have other, unrelated traffic.  A common operation may be
sending the data in a DOME vector distributed over several
workstations over a general use HIPPI network to an HPF array
distributed over the nodes of an Intel Paragon.  This added complexity
forces a rethinking of how to achieve high performance communication
between submodules.

\subsection{Overview}

The Net-Fx run-time system is primarily responsible for transporting
distributed data structures between submodules.  This communication
must avoid synchronizing the source and destination submodules, but at
the same time, it must support changing data distribution and
submodule bindings.  The details of these bindings, distribution
computation, and of actual data transport over a likely complex and
irregular network are hidden below simple, high-level, non-blocking
collective send and receive calls for whole distributed data
structures.

Within the run-time system itself, the tasks of distribution
computation, maintaining bindings, and communication are segregated
into subsystems with well defined interfaces between them.  This makes
it possible to orthogonally extend each subsystem.  Although we expect
that only a small set of binding semantics will have to be supported,
the number of distribution types needed will grow as Net-Fx is used to
coordinate non-Fx submodules.  Further, to achieve high performance,
it will be necessary to specialize the communication system for
different networks.  By clearly defining the interfaces between
subsystems, we make this specialization possible and easy.

The communication subsystem is the heart of the run-time system and
exports high-level send and receive calls to the Net-Fx program.  Well
defined interfaces connect it to the binding subsystem and the
distribution help subsystem.  The binding subsystem maps abstract
submodule and data structure identifiers into their current physical
realization and distributions, respectively.  The distribution help
subsystem performs distribution computation to convert between the
source and destination distributions and distribution types.  Because
some distribution types may not be supported directly by the
distribution help subsystem, or because a faster support function
exists, the Net-Fx program can register support functions for use by
the distribution help subsystem.  Both the registration interface and
the calling convention for such support functions are standardized so
that they are not tied to any implementation of the run-time system.

\subsection{Submodule and Data Distribution Bindings}

The run-time system must maintain bindings from submodule identifiers
to their current physical realization, and bindings from data
distribution identifiers to their current distribution descriptors.  A
Net-Fx program consists of some number of submodules, numbered $0
\dots N_s-1$.  Submodule $m$ is composed of some number of processes
numbered $0 \dots N_m-1$ which execute on the processors of a
processor group.  A processor group is a set of processors that share
common attributes.  For example, an Intel Paragon or a workstation
cluster may be a processor group.  A submodule binding is found via
the function
\begin{equation}
submodid \times process \rightarrow procgrp \times processor \times localid
\end{equation}
which binds a submodule identifier and process in that submodule to
the processor group, processor, and local process id of the
corresponding process.  The communication system uses the submodule
bindings to decide how to communicate between the processes in the
source and target submodules. If the distribution help subsystem is
used, it is invoked with the distributions and distribution types on
the source and target submodules, which are found via the function
\begin{equation}
submodid \times disttype \rightarrow distribution
\end{equation}
which binds a submodule and a distribution type identifier to that a
distribution.  

Both kinds of bindings can take one of three forms: simple, static, or
dynamic.  {\em Simple bindings} exist to simplify the use of legacy
codes as submodules.  With a simple submodule binding, the run-time
system knows only of a single {\em proxy process} for the submodule.
If a submodule binding is simple, its data distribution bindings must
also be simple --- they must all bind to non-distributed arrays.  For
example, a PVM master/slave program could be used as a submodule by
making the master the proxy process.  Note that with simple bindings
there may be processes which the run-time system (and compiler) are
unaware of (the slave processes in the example.)  On the other hand,
{\em static bindings} are known at compile time, fully specify the
submodule or data distribution, and do not change over the course of
executing the Net-Fx program.  Similarily, {\em dynamic bindings} are
initially supplied at compile time, are full specifications, but may
change during the course of execution, at the granularity of the task
procedure call.  Notice that dynamic bindings are at odds with the
goal of minimizing synchronization between submodules.

Maintaining these bindings in the run-time system serves three
functions.  First, it permits the run-time communications interface
can present an abstract view of communication of distributed data
structures between submodules.  Second, because of the submodule
binding, distribution computations are independent of the actual
machine configuration.  Finally, support for changing dynamic bindings
can be made implementation dependent.


\subsection{Programming Interface}

In this section, we examine the programming interface as it used
during the execution of a Net-Fx program.  The interface exported to
the Net-Fx node program from the communication system is discussed in
sections \ref{init}, \ref{comm}, and \ref{bindchg}.  Section
\ref{disthelp} discusses the interface the distribution help system
provides to the communication system.  Finally, sections \ref{custfunc},
\ref{custreg} discuss how custom distribution help functions are called
and registered.

\subsubsection{Initialization}
\label{init}
When a Net-Fx program begins, it specifies all initial submodule and
dynamic data distribution bindings to the run-time system.  Notice
that for each form of binding, initial values are available at compile
time.  Each process of the program initializes the run-time system by
supplying the initial bindings for {\em all} submodules and dynamic
data distributions.  

We also overload the data distribution type to support custom
distribution help functions.  However, a data distribution can be both 
dynamic and have a custom distribution help function.  


\subsubsection{Communication}
\label{comm}
Communication between submodules is accomplished via collective
\verb.t_send. and \verb.t_recv. calls.  These functions are called by 
every process of a submodule, except in the case of a simply bound
submodule, in which case only the proxy process invokes them.  The
\verb.t_send.  and \verb.t_recv. functions have the following prototypes:
\begin{center}
\begin{verbatim}
int t_send(unsigned target_submodule, 
           void     *local_chunk,
           unsigned data_type,
           void     *sender_distribution,
           unsigned sender_distribution_type,
           void     *receiver_distribution,
           unsigned receiver_distribution_type,
           void     *hint,
           unsigned hint_type)
 
int t_recv(unsigned source_submodule, 
           void     *local_chunk,
           unsigned data_type,
           void     *sender_distribution,
           unsigned sender_distribution_type,
           void     *receiver_distribution,
           unsigned receiver_distribution_type,
           void     *hint,
           unsigned hint_type)
\end{verbatim}
\end{center}
The target and source submodule numbers signify which submodule will
be sent to and received from, respectively.  \verb.Local_chunk. points
to the first data item of the distributed data structure owned by the
calling processor.  The array elements are of type \verb.data_type..
Some data types numbers are reserved.  

Notice that both calls include the source and target distributions.


The communication system




\subsubsection{Registering Custom Distribution Help Functions}


\subsection{Scheme}
Clearly, enabling high performance communication of distributed data 
structures over complex networks is difficult, so we propose a divide and 
conquer approach, dividing the task between the {\em communication system} 
and the {\em distribution help system}.  The job of the communication 
system is to provide high performance collective communication between the 
submodules.  The job of the distribution help system is to perform the 
necessary distribution computation to determine what to actually send and 
to whom.  

The communication system provides an {\em intertask communication API} to 
the Net-Fx program.  This interface consists of non-blocking collective 
send and receive calls, which include the source and target data 
distributions.  A send and receive is invoked on each process of the 
sending and receiving submodules, respectively.  The communication system 
can either interpret the distributions itself, or pass them to the 
distribution help system through the standard {\em distribution help 
interface}.  On the source submodule, the communication system can request 
that a message be assembled for a process in the destination submodule, or 
an {\em index relation} between the locations of data elements local to the 
source processor and their locations on the destination processor be 
computed and returned.   The communication system on the destination 
submodule has symmetric options.  

Some distribution computation functions may be built into the distribution 
help system, but we have found that specializing functions for particular 
distributions via compile-time optimizations can greatly improve 
performance.  On another note, new distribution types may need to be 
supported without adding that support directly to the distribution help 
system.  For these reasons we permit custom distribution computation 
functions linked to the Net-Fx executable to be {\em registered} for use
by the distribution help system.  All such function must adhere to a 
single interface.  For clarity, consider an example:  Before
performing a communication for which it has a custom distribution 
computation function, the Net-Fx program will register that function.
This establishes a binding between the distributions the function supports
and the function itself.  The program then invokes a send or receive,
passing those distribution as arguments.  The communication system will
then call the distribution help system with the distributions.  At this
point, it will be recognized that there exists a custom function for
the distributions and it will be invoked.  





\end{document}

