\input epsf
%\input{psadobe}
\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother
\documentstyle[fleqn]{art-nocolon}
\include{psfig}
%\include{caption}


\textheight =  9.0in
\textwidth = 7.0in              % assume output has this much centered

\topmargin = -.5 in 
\oddsidemargin = -.25 in
\evensidemargin = -.25 in

\renewcommand{\floatpagefraction}{1}
\renewcommand{\topfraction}{1}
\renewcommand{\bottomfraction}{1}
\renewcommand{\textfraction}{0}
\renewcommand{\baselinestretch}{1}
\renewcommand{\arraystretch}{1.5}


\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newcommand{\ignore}[1]{}

\renewcommand{\to}{\rightarrow}
\newcommand{\A}{{\cal A}}
\begin{document}
\bibliographystyle{acm}

%\load{\footnotesize}{\sc}
\title{ 
%\begin{flushleft}
%\vspace{-1.0in}
%\centerline{\scriptsize\em (Extended abstract submitted for PLDI 94)}
%\normalsize { }
%\end{flushleft}
%\vspace{0.4in}
Language and Runtime Support for Network Parallel Computing\\
}
\author{}
\date{   {email:\tt \{\}@cs.cmu.edu} \vspace*{.2in} \\
      School of Computer Science \\
         Carnegie Mellon University \\
         Pittsburgh PA 15213 \vspace*{.2in}}
%        Corresponding Author: Jaspal Subhlok \\
%        Phone: (412) 268 7893}


\maketitle
\thispagestyle{empty}
\abstract{
\normalsize
%Fast networks have become  very important 
%for high performance computing as many applications need hardware
%and software resources that are distributed  across networks. 
The
latencies and bandwidths available on networks have improved
dramatically, making it possible to distribute applications on a
heterogeneous network and execute them efficiently. However, programming
support for network parallel applications is 
%virtually non-existent.
rudimentary, especially in the context of standard languages.
This paper describes how the Fx compiler at Carnegie Mellon
for a variant of  High Performance Fortran is being extended
to support network parallel computing. Task parallelism is used
to distribute and manage computations across the sequential
and parallel machines of a network. We introduce a network communication
interface  which makes it possible to efficiently
compile the programs for heterogeneous networks even in the 
absence of full runtime information, which is often the 
case in network parallel computing. We describe how Fx
programs are compiled onto the communication interface and discuss
the issues in implementing the interface. The result is that
a programmer is able to write a single program that executes
efficiently across the network, even in the presence of architectural
heterogeneity and load imbalance.
}

%\voffset = -.5 in
\normalsize

\setcounter{page}{1}
\section{Introduction}\label{intro}
With the advent of standard parallel languages like High Performance
Fortran (HPF), parallelizing compilers have become the most convenient
and powerful way of developing programs for parallel machines. These
standard languages make it easy to run the same program efficiently on
a wide variety of sequential and parallel machines.

At the same time, recent advances in standard network protocols (e.g.,
HiPPI and ATM) have made it possible to link these different machines
together and build network parallel applications. The low latency and
high bandwidths of the new protocols offer the possibility of high
efficiency, and the standard languages offer the benefit of a common
programming environment.

However, standard parallel languages offer rudimentary support, at
best, for network communication. The low-level details of the network
protocols are usually exposed, making the
construction of an efficient network parallel application a difficult and
error-prone task in which the applications programmer must become a
network specialist. 

%Parallelizing compilers have become the primary way
%of programming parallel machines. This is made possible by 
%advances in compilation technology for parallel computers,
%and by the emergence of standard, portable  parallel languages, 
%in particular High Performance Fortran (HPF). This mode of parallel
%programming frees the programmer from the the details of
%communication between processors, and is critical
%for widespread use of parallel machines.
%
%Recently, the latency and bandwidth of networks have improved dramatically
%and standard protocols for high speed communication
%(i.e. HiPPI and ATM) are now available.
%%The result is that it is practical to  map an
%%application across a variety of sequential and parallel 
%%machines and still achieve good performance. Many real applications
%%need hardware and software resources distributed across networks.
%%e.g. ...
%%
%These networks provide the hardware resources to
%develop efficient network parallel applications, 
%and portable, standard languages make it possible to map the same program efficiently
%across a wide variety of sequential and parallel machines,
%but software support
%for network parallel computing is rudimentary. 
%The programmer has no help in managing the heterogeneity
%of the underlying system: different nodes in the network may have widely
%varying computation and communication capabilities, and even different ways
%of representing data. The result is that
%%Using a parallelizing compiler like Fx or Fortran D, parts of a parallel
%%application can be developed for different computers with
%%relative ease. However, 
%building the full application
%%requires
%%explicit communication across networks, which
%is a difficult
%and error prone task that forces the applications programmer to become a
%network specialist.

The goal of our research is to make it possible to program
network applications easily and efficiently. Compiling 
for a variety of machines across a network presents several
new challenges. This paper shows how parallelism
inside a single machine and across a set of machines can be
expressed in a uniform framework. We also present the design
of a runtime system that enables efficient compilation of
programs for network parallel computing.

The programming model we use combines task and data 
parallelism. Procedures in an application are data parallel 
and each procedure is mapped to a homogeneous set of nodes,
which may be different nodes of a large parallel machine
or a cluster of workstations. Different procedures are placed
on different sets of nodes on the network and the compiler
manages the communication between them.

%For several reasons,
The most challenging part of compilation for
a network is generating efficient communication. The most efficient
communication interface across a network is typically different from the 
communication interface between the nodes of a parallel machine, so
the compiler has to manage multiple communication paradigms.
More important, the network environment is expected to be 
dynamic. Runtime load balancing is important even for applications
with a static computation structure since the external
load on different machines  can vary. Hence the compiler has
limited information about the runtime environment during 
code generation.

For generating efficient communication in a dynamic runtime
environment, it is important to distinguish between the
aspects of the communication that are constant for the
duration of the execution and those that are varying and 
are known only at runtime. In our model, the program is
partitioned into modules and each module is mapped to a
set of processors. The communication steps between modules
are determined at compile time, while the actual low-level communication
calls may depend on the runtime environment.

Our solution to generating efficient communication involves
defining an inter-module communication interface. This 
interface provides mechanisms for communication between
modules which are mapped on groups of nodes. The compiler
generates code for this communication interface and the underlying
implementation of the interface ensures that appropriate low-level
communication calls are generated.

We begin by motivating the need for network parallel applications with
a few examples. Then we describe our approach, which we call NetFx, as
it is based on our variant of High Performance Fortran, Fx.

\section{Motivation for Network Parallel Computing}

Network parallel programming makes it possible to build larger and
more powerful
application systems than can be built without it, but this is not the
only, or even the best, motivation. We are developing applications in which
network parallelism is an inherent feature, chiefly due to the need for
heterogeneity, regardless of the size of the computer system used. One
of these applications is in real-world modeling.

\subsection{Network Parallel Computation for Real-World Modeling}

High performance computers and their programming tools have
traditionally been applied to problems in scientific modeling:
simulation, analysis, and the like. We are applying these same systems
and tools to problems in real-world modeling, by which we mean the
three-dimensional shape modeling of objects and environments in the
real world by direct observations using computer vision techniques. In
so doing we hope to replace complex and specialized sensory systems
based on technology like laser range-finding by much simpler, more
flexible and scalable, but computationally more expensive, systems
like stereo vision.

In these applications the sensor system plays a key role. In some
situations, e.g., imaging moving objects, it is necessary to capture
imagery at high speeds, i.e., video rate (30 Hz) or faster. Capturing
data at this rate is complicated by the need to synchronously access
several cameras (recently the most significant advances in stereo
vision accuracy and reliability have come from the introduction of
multiple cameras).  In other situations, e.g., imaging building
interiors, speed of image capture is not so important but the imaging
system must be portable.

We believe these conflicting requirements can be met through the
application of modern networking technology. The system we envision is
illustrated in Figure xx. The camera system is a detachable unit, and
can be attached either to a large number of workstations, giving a
high frame capture rate, or to a smaller number of portable
workstations or PCs, giving portability at the expense of speed. Once
the images are captured (and the portable computer system is
reattached to the network, if necessary), they are transformed into
stereo vision data through the application of parallel stereo vision
techniques we have invented and coded in Fx; because of Fx's
architecture independence we can run these programs on the
workstations themselves or on powerful tightly-coupled high
performance computers.  The resulting range imagery is then
transformed into a three-dimensional model, again in parallel using
Fx, and the resulting model can then be stored to disk or displayed
remotely, enabling sophisticated three-dimensional videoconferencing.

\section{Background and overview of NetFx}

Fx supports task parallelism, which allows the exploitation
of parallelism between procedure calls.

\section{Language}
\subsection{Task parallelism}
\subsection{Compiling for networks}

\section{Run-time Support for Tasking}

The Net-Fx run-time system is responsible for efficiently transporting data 
between submodules.  This task is complicated by a number of requirements, 
including dynamic distributions, asynchrony, compile-time customizability, 
and node and network heterogeneity.  In order to meet these requirements, 
the run-time system is specified as three subsystems with well-defined 
interfaces between them and the Net-Fx executable.

\subsection{Terms and Abstractions}

A {\em task} is a call to a data parallel subroutine.  One or more tasks 
are grouped into a {\em submodule} where they execute consecutively.  One or 
more submodules are grouped into a {\em module} where they execute 
concurrently.   The processes of a module are mapped onto a {\em 
processor group},  a set of processors that share some common 
attributes.  For the purposes of this section, it is useful to think of the
mapping of the processes of a submodule to a processor group, which is, 
of course, just a subset of the module to processor group mapping. 

There is a dependence between two submodules if a task in one produces a 
data set for a task in the other.  The compiler resolves this dependence by 
generating a send call for the producer submodule and a receive call for 
the consumer submodule.  A submodule essentially repeats three phases.  First, 
it receives data sets from each submodule it is dependent on.  Next, it 
executes its tasks.  Finally, it sends data sets to each submodule that 
depends on it.  
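The three-phase cycle can be sketched in C. The names below ({\tt recv\_from}, {\tt send\_to}, {\tt submodule\_step}) are ours for illustration only and are not part of the Net-Fx interface; the two helpers simply record a trace so the phase ordering is visible.

```c
/* Sketch of the three-phase submodule cycle; all names here are
   illustrative, not actual Net-Fx calls.  recv_from/send_to just
   record a trace so the ordering can be checked. */
#include <stddef.h>

typedef void (*Task)(void);

static int    trace[16];
static size_t ntrace;

static void recv_from(int sm) { trace[ntrace++] = -sm; }  /* phase 1 */
static void send_to(int sm)   { trace[ntrace++] =  sm; }  /* phase 3 */

/* One iteration: receive a data set from every producer, execute the
   tasks consecutively, then send a data set to every consumer. */
static void submodule_step(const int *producers, size_t np,
                           Task *tasks, size_t nt,
                           const int *consumers, size_t nc)
{
    size_t i;
    for (i = 0; i < np; i++) recv_from(producers[i]);
    for (i = 0; i < nt; i++) tasks[i]();
    for (i = 0; i < nc; i++) send_to(consumers[i]);
}
```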

The send and receive calls are collective and non-blocking, and present the 
abstraction of a ``point-to-point'' communication of a data item between 
uniquely numbered submodules.  Below this abstraction, on both the sending 
and the receiving submodule, we think of the elements of the data item 
as being distributed across consecutively numbered processes.  This lower 
level abstraction is internal to the run-time system.  How the elements map 
to processes is defined by a distribution type and distribution.  A {\em 
distribution type} identifies a class of similar distributions, while a 
{\em distribution} identifies a specific member of the class.  For example, 
the distribution type $HPF\_ARRAY$ includes the distribution $(*,BLOCK)$.  
The combination of the sender and receiver distribution types identifies a 
method for mapping between the two distributions, a process that is called 
{\em distribution computation}.   The result of distribution computation is
an address relation which maps memory addresses in a sending process to
memory addresses in a receiving process.
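As a small illustration of distribution computation, consider a one-dimensional {\tt BLOCK} distribution (the distributed axis of the $(*,BLOCK)$ example above, flattened to one dimension). The C fragment below, with names of our own choosing, computes the owning process and local offset of a global element index; pairing the sender-side and receiver-side answers for each element is one way to build the address relation just described.

```c
/* Illustrative fragment of distribution computation for a 1-D BLOCK
   distribution.  Names (Loc, block_owner) are ours, not Net-Fx's. */
typedef struct { int proc; int offset; } Loc;

/* Owner process and local offset of global element `global`, for
   `nelems` elements block-distributed over `nprocs` processes. */
static Loc block_owner(int global, int nelems, int nprocs)
{
    int blk = (nelems + nprocs - 1) / nprocs;   /* ceiling block size */
    Loc l;
    l.proc   = global / blk;
    l.offset = global % blk;
    return l;
}
```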

The distribution of a data item may be {\em static} or {\em dynamic}.  A 
static distribution never changes, while a dynamic one may change.  A 
distribution type never changes.  For example, a load balancing scheme that 
moves data will certainly result in a dynamic distribution.  The mapping of 
a submodule's processes to a processor group may also be static or 
dynamic.  The process of finding the current values of distributions or
submodule mappings is referred to as {\em binding} and the results are 
{\em bindings}.  Bindings are static or dynamic depending on the 
distribution or submodule mappings they represent.
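The static/dynamic distinction can be sketched as a toy binder in C. The names and the caching policy below are our assumptions: a static binding is resolved once and then answered from a local cache, while a dynamic binding is re-fetched on every use.

```c
/* Toy sketch of binding: static bindings are looked up once and then
   served locally; dynamic bindings are re-fetched every time.  The
   lookup function stands in for whatever mechanism (local table or
   message exchange) actually resolves the current value. */
typedef enum { STATIC_BIND, DYNAMIC_BIND } BindKind;

typedef struct {
    BindKind kind;
    int      cached;         /* last value fetched            */
    int      have_cache;     /* nonzero after the first fetch */
    int    (*lookup)(void);  /* resolves the current value    */
} Binding;

static int nlookups;
static int sample_lookup(void) { return ++nlookups; }

static int bind_value(Binding *b)
{
    if (b->kind == STATIC_BIND && b->have_cache)
        return b->cached;          /* static: a local lookup        */
    b->cached = b->lookup();       /* dynamic (or first use): fetch */
    b->have_cache = 1;
    return b->cached;
}
```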

\subsection{Subsystems and Interfaces}

\begin{figure}
%interfaces/subsystems
\caption{The Net-Fx run-time system: subsystems, interfaces, and relation 
to the Net-Fx executable.  Each arrow represents an interface and points
from the caller of the interface to the callee.  Note that some callees 
have several callers.}
\label{fig:interface}
\end{figure}

This section summarizes the Net-Fx run-time system's subsystems and the 
interfaces between them and the Net-Fx executable.  Figure 
\ref{fig:interface} graphically shows these relationships.  The run-time system is 
primarily responsible for transporting distributed data structures between 
submodules.  This communication must avoid synchronizing the source and 
destination submodules, but at the same time, it must support changing data 
distribution and submodule bindings.  The details of binding, distribution 
computation, and of actual data transport over a likely complex and 
irregular network are hidden from the Net-Fx executable below two simple 
interfaces.  The first of these is the {\em intertask communication 
interface}, which consists of simple, high-level, non-blocking collective 
send and receive calls for transporting whole distributed data structures 
between submodules.  The second is the {\em registry interface}, which 
allows the Net-Fx node program to describe the characteristics of relevant 
submodules, data distribution types, and custom distribution computation 
functions to the run-time system.
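To make the two interfaces concrete, one possible C rendering is sketched below. Every identifier and parameter is our assumption; the argument list of the send simply mirrors the information enumerated in the example later in this section, and the bodies are empty stubs that only show the shape of the calls.

```c
/* Hedged sketch of the two run-time interfaces; all identifiers are
   assumptions, and the bodies are stubs. */

/* Intertask communication interface: non-blocking collective calls
   that move a whole distributed data structure between submodules. */
int it_send(int dest_submodule, void *first_local_elem, int elem_type,
            int send_dist_type, int recv_dist_type,
            void *send_dist, void *recv_dist)
{
    (void)dest_submodule; (void)first_local_elem; (void)elem_type;
    (void)send_dist_type; (void)recv_dist_type;
    (void)send_dist; (void)recv_dist;
    return 0;                 /* stub: would enqueue the transfer */
}

/* Registry interface: describe submodules, distribution types, and
   custom distribution computation functions to the run-time system. */
int reg_submodule(int submodule, int is_static)
{
    (void)submodule; (void)is_static;
    return 0;                 /* stub */
}

int reg_dist_func(int dt_send, int dt_recv, int (*fn)(int, int))
{
    (void)dt_send; (void)dt_recv; (void)fn;
    return 0;                 /* stub */
}
```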

Within the run-time system itself, the tasks of distribution computation, 
maintaining bindings, and communication are segregated into subsystems with 
well-defined interfaces between them.  This makes it possible to 
independently extend each subsystem.  Although we expect that the binding 
subsystem will rarely change, the number of distribution types needed will 
grow as Net-Fx is used to coordinate non-Fx submodules.  Further, to 
achieve high performance, it will be necessary to specialize the 
communication system for different networks.  By clearly defining the 
interfaces between subsystems, we make this specialization possible and 
easy.  Indeed, all the subsystems are ``plug-and-play.''

The {\em communication subsystem} is the heart of the run-time system and 
exports the previously discussed intertask communication interface to the 
Net-Fx program.  It makes use of the {\em binding subsystem} to determine 
current submodule and distribution bindings.  In return it supplies the 
binding subsystem with a simple {\em binding communication interface} in 
case communication is necessary.  If the communication system determines 
that distribution computation must be done (for example, a dynamic 
distribution may have changed), it is performed by the {\em distribution 
help subsystem}.  This subsystem serves two purposes.  First, it dispatches 
the appropriate distribution computation function for the sender and 
receiver distribution types.  Second, it conforms the address relation 
returned by that function to the format desired by the communication 
system.  (The communication system can request that a message be assembled 
instead.) The subsystem uses the binding interface to bind the two 
distribution types to the distribution computation function.  This function 
may be one of the {\em distribution help library functions} linked to every 
Net-Fx executable, or a compiler-generated {\em custom distribution help 
function}.  In either case, the function conforms to the {\em standard 
distribution computation interface}.
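The dispatch step performed by the distribution help subsystem can be sketched as a table lookup keyed by the pair of distribution types. The table layout and all names below are illustrative assumptions, not the actual subsystem's data structures.

```c
/* Sketch of distribution-help dispatch: bind a (sender type, receiver
   type) pair to its registered distribution computation function. */
#include <stddef.h>

typedef int (*DistFunc)(int sender_proc, int recver_proc);

typedef struct { int dt_send, dt_recv; DistFunc fn; } DistEntry;

static DistFunc dist_dispatch(const DistEntry *tab, size_t n,
                              int dt_send, int dt_recv)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tab[i].dt_send == dt_send && tab[i].dt_recv == dt_recv)
            return tab[i].fn;
    return NULL;        /* unregistered pair: an error in real use */
}

/* A trivial stand-in for a custom distribution computation function. */
static int sample_distfunc(int s, int r) { return s == r; }
```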

Once address relations have been computed (or messages assembled),  the
communication system can use the submodule bindings to identify the 
actual processors to communicate with and perform the data transfer.   On
the receiving submodule, essentially the same steps are followed, with 
obvious changes.

\subsection{An Example}

\begin{figure}
%	\centerline{\BoxedEPSF{}}
	\caption{Example of the steps followed for an intertask send using a 
	custom distribution help function.}
	\protect\label{fig:steps}
\end{figure}

In this section, we show how an intertask send using a custom distribution 
help function is performed.  Figure \ref{fig:steps} shows the steps 
graphically.  On the right of the figure are the calls performed by 
the node program and those performed internally for the intertask send.  There is a 
dependence from submodule $M1$ to submodule $M2$, which the compiler has 
satisfied by generating a send from $M1$ to $M2$ (and a matching receive 
in $M2$, which is not discussed here).  The compiler also produced a custom 
distribution computation function $distfunc1$ for this communication.  At 
runtime, $M1$'s first step is to use the registry interface to register the 
initial values of all relevant bindings.  The module bindings for $M1$ and 
$M2$ are registered as static, so their values will never change.  Also 
registered as static are the distribution types $DT1$ and $DT2$ on $M1$ 
and $M2$, respectively.  Finally, it registers $distfunc1$ as the 
distribution computation function for $DT1$ and $DT2$.    

The second step (which may be repeated many times) is to invoke a send
of the distributed data structure $A$ to $M2$.  The intertask send call 
includes the target submodule number, $M2$, the first element of $A$ that 
is local to the process, the data type of $A$'s elements (REAL), the 
sender and receiver distribution types, $DT1$ and $DT2$, and the current 
sender and receiver distributions, $DISTA1$ and $DISTA2$, respectively.   
Now in the context of the intertask send call, the third step is to bind 
the submodule $M2$ to its current physical mapping, and the distribution
types and distributions to their current values.  Both of these 
steps are simple in this case because the bindings are static: they 
amount to local lookups.   In the fourth step, the communication subsystem 
calls the distribution help subsystem to build an address relation 
between each pair of sender and receiver processes $x$ and $y$.   The 
distribution help subsystem accomplishes this by binding the two 
distribution types, $DT1$ and $DT2$, to their corresponding distribution 
computation function, $distfunc1$ (step 5) and calling the function for 
each pair (step 6).   The relations that are returned are conformed to 
the format the communication subsystem wanted and returned to it.   
Finally, the relations are used to perform the actual communications in an 
environment-dependent manner.  The communication subsystem may decide to
keep the address relations on hand to avoid steps 4--6 the next time the 
intertask send occurs.  Indeed, because the distributions are static, the 
relations could be kept indefinitely.
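The caching decision at the end of the example can be sketched as follows, under our own names and with a plain integer standing in for the real address relation: when both distributions are static, the relation from steps 4-6 is reusable indefinitely; otherwise it must be recomputed on each send.

```c
/* Sketch of address-relation caching: when the distributions involved
   are static, the relation computed in steps 4-6 can be kept and
   reused on later intertask sends. */
typedef struct {
    int is_static;   /* both distributions static?              */
    int valid;       /* relation already computed?              */
    int relation;    /* stands in for the real address relation */
} RelCache;

static int ncomputes;   /* counts distribution computations */

static int get_relation(RelCache *c)
{
    if (c->is_static && c->valid)
        return c->relation;     /* reuse: steps 4-6 skipped */
    c->relation = ++ncomputes;  /* recompute via distribution help */
    c->valid = 1;
    return c->relation;
}
```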


\section{Status and implementation}

 

\bibliography{/afs/cs/project/iwarp/member/jass/bib/all}

\end{document}
