Dear Jonathan,

Thank you for your interest in the HyperC language. Please find
enclosed a technical overview of the language, as well as
availability and price information.

Any feedback you may have after reading this would be welcome.


>>>> BEGIN

HYPERC: a data parallel language designed for efficient
        compilation across SIMD & MIMD architectures.


1. PROBLEM

To avoid rewriting programs for parallel computers, programmers
need a portable, target independent language. But portability
cannot be achieved at the expense of efficiency. Because it is
unlikely that parallelizing compilers will generate efficient
code across varied platforms anytime soon, another approach is
needed. The HyperC language was developed to address this problem
by relying on explicitness to express the parallelism inherent
in an application.


2. DESIGN OBJECTIVES

The HyperC language was designed to provide the programmer with
high level constructs so that he/she is able to easily express the
parallelism inherent in an application (i.e. by coding, in a
logical fashion, the result of the design process). At the same
time, these high level constructs have been carefully selected to
allow the compiler to generate efficient code by providing it
with the freedom to perform needed optimizations, whether the
target architecture is SIMD or MIMD.

A key feature of the language is that it explicitly forces the
programmer to express the parallelism of an application in such
a way that:
 1) the amount of local computations is increased,
 2) communication requirements are minimized, and
 3) data dependencies are reduced, to permit overlapping of
    computations with communications.
We believe these to be the real issues that must be solved
in order to harness the power available from massively parallel
computers. The HyperC language provides a cohesive solution
to these problems (the training program we have developed covers
this in depth).


3. DATA PARALLEL MODEL

In many applications, the solution is conveniently described by
the execution of similar operations (ex: finite difference
computations) on each element of a large data set (ex: seismic
recordings). This leads to "data parallelism" where the
parallelism is achieved through the simultaneous execution of
the same operation across a set of data. One of the main
advantages of this type of parallelism is that it scales as the
size of the problem is increased.

From a programmerUs point of view, computations are expressed
in terms of operations performed upon each data element by each
processor. Control is strictly serial: there is only one flow of
control; each processor executes instructions in lockstep mode.
Processor interaction is done through expressions (which are
deterministic), rather than through explicit messages.
Consequently, programs in the data parallel model are easier to
write: the single thread of control simplifies understanding,
enables debugging by inserting traditional breakpoints, and
eliminates race conditions and deadlocks.

The HyperC language is a Data Parallel programming language,
constructed as an extension to the C language. The C philosophy
is kept intact: HyperC can be viewed as a portable macro assembler
for parallel systems; if needed, one need not leave the language
to develop efficient communication device drivers (by disabling
the virtualization process described below).


4. COLLECTIONS

Data sets upon which computations are expressed are called
"collections". Specifically, a collection describes the number
of elements it contains and the topological organization of
these elements with respect to each other (mesh, graph, etc.).
Computations are then expressed in terms of operations performed
upon each data element by each processor. All primitive operations
(add, multiply, minimum, etc.) are overloaded to express
elementwise operations between two data sets. Given a collection,
parallel variables can be instantiated by specifying the object
(unsigned short, double, structure, etc.) each data element will
hold.

Collections can be regular (vector, matrix, etc.), irregular
(graph), statically or dynamically allocated.


5. ARCHITECTURAL MODEL

To allow the programmer to clearly predict what the behavior of
programs will be, the HyperC language  provides an architectural
view of the underlying hardware on which programs are supposed
to run. This model describes an architecture consisting of:
 1) A scalar processor that is concerned with the execution of
    scalar operations and the sequencing of parallel operations,
 2) A parallel domain that associates each collection with a
    parallel machine which has one processor per element and where
    the interconnection between processors is given by the
    topology of the collection.
Objects in the scalar domain are located in the memory of the
scalar processor. Objects in the parallel domain are distributed
over the local memories of all processing elements of a given
parallel machine.


6. VIRTUALIZATION

The solution to a problem should be expressed only once. Otherwise
programs will need to be rewritten, and constraints defined by
the target architecture (number of physical processors,
interconnection network, etc.) will have to be dealt with. The
HyperC language offers a virtual point of view that results in a
unique solution to every problem: if a parallel variable requires
N elements, then a parallel machine is created with N virtual
processors.


7. HyperC PROGRAMS, SYNCHRONIZATION & DEBUGGING

HyperC programs are built by combining the code that is associated
with each virtual machine defined in the system. Each parallel
machine runs independently, possessing its own associated code and
executing it when told to by the scalar processor.

From the user's point of view, the programming model is SIMD
regardless of the actual execution model (SIMD or SPMD): each
processor on a parallel machine synchronously executes the same
program. This approach defines a global state for the system that
can be examined through the insertion of breakpoints in the
program. A global state is essential because it associates clear
and deterministic semantics with each construct in the language;
this in turn allows programmers to easily understand and predict
the behavior of their programs, thereby increasing overall
productivity.


8. PARALLEL FLOW CONTROL

With the SIMD synchronous programming model, it is the
responsibility of the scalar processor to synchronize tasks by
means of the traditional control structures. But to provide
efficient implementation on asynchronous architectures and a high
level of abstraction when dealing with collections, the flow of
control on individual processors must be able to vary. This is
accomplished through the concept of processor activity. The
conditional execution of parallel statements is performed by the
use of a "where" statement that determines the activity of each
processor.

The traditional control structures (while, for, do, etc.) have
been extended in the parallel domain: whilesomewhere, forwhere,
dowhere, etc. These extensions are necessary to allow the compiler
to perform efficient compilations across varied target
architectures: SIMD, MIMD, heterogeneous workstation networks.
In addition to providing a higher level of abstraction, a key
feature associated with such constructs is to allow the compiler
to eliminate many unnecessary and costly global synchronizations
between all processors.


9. FUNCTIONS

The prototype of functions specifies the type of arguments that
are passed to a function but does not reveal anything about the
type of computations performed by the function. This is
unfortunate because local computations should always be preferred
over data parallel computations which involve either
synchronization barriers or communications. To help the compiler
generate code that is as local as possible, directives have been
added to indicate the locality of computations when such property
can be asserted on a particular function. Functions without any
side effects, such as the ones defined in the "math.h" include
file, are all local.

In addition, functions can be overloaded for programming
convenience, and generic arguments can be defined.


10. COMMUNICATIONS & LINKS

HyperC computations make parallel variables interact. These
interactions are of 2 types:
 1) Local interactions between corresponding elements of a given
    collection, and
 2) Remote interactions between non corresponding elements of a
    given collection or between elements of variables belonging to
    separate collections.
Due to the distribution of data over all physical processors, remote
interactions between elements correspond, in many cases, to
physical communications. Because accesses to the memory of another
processor can be several orders of magnitude slower than local
memory accesses, it is important to minimize the number of remote
interactions. Achieving this objective guarantees a consistent
level of performance across varied platforms, but requires the
explicit specification of communications: it warns the programmer
of communication requirements and tell the compiler specifically
when communications may take place.

The HyperC language defines a general mapping mechanism which
allows virtual processors to send or receive values to each other.
The language introduces the concept of a "link" which is a
pointer to a virtual processor inside a collection. Links can
be combined into parallel variables to express simultaneous
data transfers between virtual processors. The concept of
strings of characters is extended to strings of links to
support irregular data sets and efficient neighborhood (local)
operations. One of the main features of links is that they allow
the compiler to compile communication patterns for greater
efficiency.


11. REGULAR COMMUNICATIONS

Many communication patterns (such as getting the value of a
neighbor in a grid topology) exhibit regularity as they express
accesses relative to a virtual processor location. A local
neighborhood operator (the "dot" operator) allows the
specification of such regular patterns (for which most
architectures provide efficient hardware support).


12. OVERLAPPING OF COMPUTATIONS WITH COMMUNICATIONS

In many applications, operations are expressed relative to each
virtual processor: a typical example is the local averaging of
the values held by the neighborhood of a virtual processor (this
concept is the same regardless of the topology of a collection -
it could be a regular image or an irregular graph). The order
in which such local computations are performed seldom affects
the result being computed. This is a case where such reduced
data dependencies can be used to overlap computations with
communications. To allow a compiler to effectively perform such
optimization, a high level construct must be provided to allow
the programmer to express the semantics of the computations, and
to provide the freedom to the compiler necessary to balance the
load between computations and communications as needed by the
particular target architecture.

Such overlapping is specified by means of an "along" construct
which usually stands for: "along" each neighbor, perform the
following computation.


13. COMMUNICATION LIBRARY

Some particular communication patterns must be viewed as atomic
operations in the parallel domain either because most parallel
architectures provide efficient implementations or because their
functionality is a convenient one for the programmer. Such
communication patterns are supported through library calls;
spreads and reductions are the most common ones. Other types are
also supported, such as segmented scans, tournaments (which
prevent problems related to collisions of sends) and buffered
sends.


14. POINTERS

The C language defines the concept of a pointer, which is a local
reference to a variable in main memory (processor's address
space). The HyperC language maintains the semantics associated
with pointers: a pointer is always a local reference. It is not
possible to use a pointer to point to the memory of another
virtual processor; the reason is that dereferencing such a pointer
would hide communications (which is contrary to the philosophy of
the language which is to make all communications explicit).
Consequently, there are only two types of pointers in the
parallel domain: a scalar pointer to a parallel variable and a
parallel pointer where each data element is a local pointer.


15. DATA PARALLEL I/O

When a data set is distributed over many virtual processors,
particular care must be taken when the data set needs to be read
or written (for example, from and to a disk, or to the screen for
display purposes). The HyperC language defines a unique I/O
number for each data element in a collection; this number, which
defines the order in which the data element will be read from or
written to a serial device, ensures a clear semantics for data
parallel I/O operations. Consequently, all I/O functions defined
by the C language are extended to the parallel domain in a
systematic way.


16. PORTABILITY

To facilitate the incremental parallelization of a program, a
programmer must be able to easily and quickly import entire
libraries (such as X11) for which a lot of work has already been
done, and for which most functions are control ones (such as
opening and closing a window) that do not require parallelization.
To make the port of such a code easy, one must be able to:
 1) Reuse include files needed by the software without modifying
    them.
 2) Easily parallelize data parallel functions.
In both cases, the HyperC language provides mechanisms to perform
such tasks in a straightforward fashion: entire "include" files
can be properly tagged to allow the compiler to use function
prototypes and variable declarations in the most efficient ways,
and the process of parallelizing functions can be automated once
parallel arguments have been defined in the data parallel
overloaded version of the original function.


17. Examples


17.1. DRAWING A DISK

The program below draws a disk inside a window.

#include <hyperc.h>
#include <graphic.h>
int main(int argc, char *argv[])
{
  /* Declare a collection. */
  collection [256,256]image;
  /* Instantiate image parallel variables. */
  image int x, y, disk;
  /* Disk attributes. */
  int center_x = 100, center_y = 100, radius = 60;
  /* Window. */
  int window;
  /* Open window and initialize colormap. */
  window = HcOpenDisplay(512,512);
  HcSetCmap(window, BlackAndWhite, 0, 0, -1);
  /* Determine the virtual address of each virtual processor. */
  x = HcCoord(image,0);
  y = HcCoord(image,1);
  /* Position the disk. */
  x -= center_x;
  y -= center_y;
  /* Draw the disk. */
  where ((x * x + y * y) < (radius * radius)) disk = 255;
  elsewhere disk = 0;
  /* Display the disk inside the display window. */
  HcFlash(window, disk, 0, 0, 1, 1);
}


17.2. HISTOGRAMMING

The program below reads an image from a file, displays it,
computes a histogram of the pixel values (number of occurences
of each value in the image), and displays the histogram as a bar graph.

#include <hyperc.h>
#include <graphic.h>
#include <support.h> /* to read an image. */
int main(int argc, char *argv[])
{	
  /* Define a collection of type image. Its topology is
     undefined. The topology is set by the function that
     loads the image */	
  collection image;
  /* Instantiate a pointer to an image parallel variable. */
  image char *packed image_src;	
  /* Instantiate a handle to a X11 display window. */
  int window;
	
  /* Load image from a file. Set the topology of an image. */	
  image_src = image_read_binary(&image, "../lib/lena.pic");
	
  /* Open display window, initialize colormap, display image. */
  window = HcOpenDisplay(HcDimof(image,0)+256,
                         100 >? HcDimof(image,1));
  HcSetCmap(window, BlackAndWhite, 0, 0, -1);
  HcFlash(window, *image_src, 0, 0, 1, 1);

  {
    /* Instantiate the histogram collection: 1-D topology. */    
    collection [256]histogram;
    /* Instantiate histogram parallel variables. */    
    histogram int bins, x;
    /* A scalar variable. */    
    int max;
	
    /* Histogram image (this statement does everything):
       use the image as a parallel address to accumulate
       (with the reduction operator "+<-") 1 in each histogram
       bin. */
    bins = 0;
    [[*image_src]]bins +<- (image int)1;

    /* Compute the maximum value in the histogram. */
    max = >?<-bins;
    /* Normalize histogram between 0 and 100. */
    if (max > 0) bins = bins * 100 / max;

    /* Display histogram by plotting vertical bars. */	
    x = HcDimof(image,1) + HcCoord(histogram,0);
    HcShowVector(window, x, 101, x, 100-bins, (int)255);
  }
}


17.3. MATRIX MULTIPLICATION

The program below multiplies two square images. It first
rearranges elements in each matrix so the multiplication can
be done using elementwise (local) computations and regular
communications.

#include <hyperc.h>
#include <stdio.h>
#include <support.h>

collection generic int
matrix_multiplication(generic int A, generic int B)
{
  /* Matrix variables. */
  generic int A_rotate_x, A_rotate_y, B_rotate_x, B_rotate_y;
  generic int product;
  /* Control flow. */
  int M;
  /* Verify that both matrices are square. */
  product = 0;
  if (HcDimof(generic, 0) != HcDimof(generic, 0))
    return(product);
  /* Initialize initial rotation mappings. */
  A_rotate_x = (HcCoord(generic,0)+HcCoord(generic,1))
               % HcDimof(generic,0);
  A_rotate_y = HcCoord(generic,1);
  B_rotate_x = HcCoord(generic,0);
  B_rotate_y = (HcCoord(generic,0)+HcCoord(generic,1))
               % HcDimof(generic,1);
  /* Rotate A & B matrices all elements at once. */
  A = [[A_rotate_x,A_rotate_y]]A;
  B = [[B_rotate_x,B_rotate_y]]B;
  /* Compute product iteratively. */
  product = 0;
  for (M = 0; M < HcDimof(generic,0); M++)
  {
    /* Multiply elementwise matrices A & B. */
    product += A * B;
    /* Shift left A. */
    A = [[.+1,.]]A;
    /* Shift up B. */
    B = [[.,.+1]]B;
  }
  return(product);
}

int main(int argc, char *argv[])
{
  /* matrix variables. */
  collection [3,3]matrix;
  matrix int A, B, product;
  /* Initialize A & B matrices. */
  A = random(A); A >>= 12; A &= 0xFF;
  B = random(B); B >>= 12; B &= 0xFF;
  /* matrix multiply  */
  product = matrix_multiplication(A, B);
}


17.4. LAPLACIAN CONVOLUTION

The program below convolves an image by the laplacian operator.
It uses a string of links and the along statement for efficiency.

#include <hyperc.h>
#include <stdio.h>

local int random(int);

/********************************************** convo_laplacian */

collection generic int convo_laplacian(generic int data_in)
{
  /* Convolution variables. */
  generic int data_inside;
  generic int data_out;
  /* Neighborhood. */
  generic link generic neighborhood[6];
  generic int weights[6];
  /* Control flow. */
  int X;

  /* Initialize inside variable (border = 1). */
  data_inside = 1;
  where (HcCoord(generic,0) <= 0) data_inside = 0;
  where (HcCoord(generic,0) >= (HcDimof(generic,0)-1))
    data_inside = 0;
  where (HcCoord(generic,1) <= 0) data_inside = 0;
  where (HcCoord(generic,1) >= (HcDimof(generic,1)-1))
    data_inside = 0;

  /* Initialize neighborhood. */
  neighborhood[0] = [.,.]generic;
  neighborhood[1] = [.,.+1]generic;
  neighborhood[2] = [.,.-1]generic;
  neighborhood[3] = [.+1,.]generic;
  neighborhood[4] = [.-1,.]generic;
  neighborhood[5] = 0;
  /* Initialize weights. */
  weights[0] = 4;
  weights[1] = -1;
  weights[2] = -1;
  weights[3] = -1;
  weights[4] = -1;

  /*
  ** Convolve image with laplacian operator:
  ** 4*CENTER-NORTH-SOUTH-EAST-WEST.
  */
  data_out = 0;
  where (data_inside)
  {
    along(neighborhood)
      data_out += weights[.] * [neighborhood]data_in;
  }
  return(data_out);
}

/********************************************************* main */

int main(int argc, char *argv[])
{
  /* Image variables. */
  collection [100,100]image;
  image int data; 

  /* Initialize image. */
  data = random(data); data >>= 12; data &= 0xFF;

  /* Convolve image. */
  data = convo_laplacian(data);
}


18. AVAILABILITY

The compiler is currently available on SUN, SGI, DEC & HP
workstations.

A PVM version for heterogeneous systems is currently being
developed; evaluation versions will be available by year end (93).


19. LICENSE FEES

The software package includes the compiler, the X11 display tool
and several application source code programs (sorting, delaunay
triangulation, etc.).

Each license allows the use of the compiler and related tools
to be used on one workstation (identified by its host I.D.)
by any number of users.

The price per license decreases depending on the number of
licenses purchased simultaneously (on the same purchase order) as
follows:

               Regular     Gov/Univ
1st              $1750        $1225
2nd .. 5th       $1135         $680
6th .. 10th       $740         $370
11th+             $480         $240

UNTIL SUPERCOMPUTING'93 (where we will be demonstrating our new
product - booth 303), we are offering a promotion whereby the
ABOVE LICENSE FEES ARE REDUCED BY 25%.

Do not miss out on this opportunity to evaluate this new language!


20. FOR FURTHER INFORMATION & ORDERING

Christian Fortunel
Fortunel Systems, Inc. - Ste. 311-5
1135 Kildaire Farm Road
Cary,  NC 27511	

Tel: 919-319-1624
Fax: 919-319-1749
E-mail: fortunel@vnet.net

<<<< END

Christian Fortunel C/O HyperParallel Technologies