Dear Jonathan, Thank you for your interest in the HyperC language. Please find enclosed a technical overview of the language, as well as availability and price information. Any feedback you may have after reading this would be welcome. >>>> BEGIN HYPERC: a data parallel language designed for efficient compilation across SIMD & MIMD architectures. 1. PROBLEM To avoid rewriting programs for parallel computers, programmers need a portable, target independent language. But portability cannot be achieved at the expense of efficiency. Because it is unlikely that parallelizing compilers will generate efficient code across varied platforms anytime soon, another approach is needed. The HyperC language was developed to address this problem by relying on explicitness to express the parallelism inherent in an application. 2. DESIGN OBJECTIVES The HyperC language was designed to provide the programmer with high level constructs so that he/she is able to easily express the parallelism inherent in an application (i.e. by coding, in a logical fashion, the result of the design process). At the same time, these high level constructs have been carefully selected to allow the compiler to generate efficient code by providing it with the freedom to perform needed optimizations, whether the target architecture is SIMD or MIMD. A key feature of the language is that it explicitly forces the programmer to express the parallelism of an application in such a way that: 1) the amount of local computations is increased, 2) communication requirements are minimized, and 3) data dependencies are reduced, to permit overlapping of computations with communications. We believe these to be the real issues that must be solved in order to harness the power available from massively parallel computers. The HyperC language provides a cohesive solution to these problems (the training program we have developed covers this in depth). 3. DATA PARALLEL MODEL In many applications, the solution is conveniently described by the execution of similar operations (ex: finite difference computations) on each element of a large data set (ex: seismic recordings). This leads to "data parallelism" where the parallelism is achieved through the simultaneous execution of the same operation across a set of data. One of the main advantages of this type of parallelism is that it scales as the size of the problem is increased. From a programmerUs point of view, computations are expressed in terms of operations performed upon each data element by each processor. Control is strictly serial: there is only one flow of control; each processor executes instructions in lockstep mode. Processor interaction is done through expressions (which are deterministic), rather than through explicit messages. Consequently, programs in the data parallel model are easier to write: the single thread of control simplifies understanding, enables debugging by inserting traditional breakpoints, and eliminates race conditions and deadlocks. The HyperC language is a Data Parallel programming language, constructed as an extension to the C language. The C philosophy is kept intact: HyperC can be viewed as a portable macro assembler for parallel systems; if needed, one need not leave the language to develop efficient communication device drivers (by disabling the virtualization process described below). 4. COLLECTIONS Data sets upon which computations are expressed are called "collections". Specifically, a collection describes the number of elements it contains and the topological organization of these elements with respect to each other (mesh, graph, etc.). Computations are then expressed in terms of operations performed upon each data element by each processor. All primitive operations (add, multiply, minimum, etc.) are overloaded to express elementwise operations between two data sets. Given a collection, parallel variables can be instantiated by specifying the object (unsigned short, double, structure, etc.) each data element will hold. Collections can be regular (vector, matrix, etc.), irregular (graph), statically or dynamically allocated. 5. ARCHITECTURAL MODEL To allow the programmer to clearly predict what the behavior of programs will be, the HyperC language provides an architectural view of the underlying hardware on which programs are supposed to run. This model describes an architecture consisting of: 1) A scalar processor that is concerned with the execution of scalar operations and the sequencing of parallel operations, 2) A parallel domain that associates each collection with a parallel machine which has one processor per element and where the interconnection between processors is given by the topology of the collection. Objects in the scalar domain are located in the memory of the scalar processor. Objects in the parallel domain are distributed over the local memories of all processing elements of a given parallel machine. 6. VIRTUALIZATION The solution to a problem should be expressed only once. Otherwise programs will need to be rewritten, and constraints defined by the target architecture (number of physical processors, interconnection network, etc.) will have to be dealt with. The HyperC language offers a virtual point of view that results in a unique solution to every problem: if a parallel variable requires N elements, then a parallel machine is created with N virtual processors. 7. HyperC PROGRAMS, SYNCHRONIZATION & DEBUGGING HyperC programs are built by combining the code that is associated with each virtual machine defined in the system. Each parallel machine runs independently, possessing its own associated code and executing it when told to by the scalar processor. From the user's point of view, the programming model is SIMD regardless of the actual execution model (SIMD or SPMD): each processor on a parallel machine synchronously executes the same program. This approach defines a global state for the system that can be examined through the insertion of breakpoints in the program. A global state is essential because it associates clear and deterministic semantics with each construct in the language; this in turn allows programmers to easily understand and predict the behavior of their programs, thereby increasing overall productivity. 8. PARALLEL FLOW CONTROL With the SIMD synchronous programming model, it is the responsibility of the scalar processor to synchronize tasks by means of the traditional control structures. But to provide efficient implementation on asynchronous architectures and a high level of abstraction when dealing with collections, the flow of control on individual processors must be able to vary. This is accomplished through the concept of processor activity. The conditional execution of parallel statements is performed by the use of a "where" statement that determines the activity of each processor. The traditional control structures (while, for, do, etc.) have been extended in the parallel domain: whilesomewhere, forwhere, dowhere, etc. These extensions are necessary to allow the compiler to perform efficient compilations across varied target architectures: SIMD, MIMD, heterogeneous workstation networks. In addition to providing a higher level of abstraction, a key feature associated with such constructs is to allow the compiler to eliminate many unnecessary and costly global synchronizations between all processors. 9. FUNCTIONS The prototype of functions specifies the type of arguments that are passed to a function but does not reveal anything about the type of computations performed by the function. This is unfortunate because local computations should always be preferred over data parallel computations which involve either synchronization barriers or communications. To help the compiler generate code that is as local as possible, directives have been added to indicate the locality of computations when such property can be asserted on a particular function. Functions without any side effects, such as the ones defined in the "math.h" include file, are all local. In addition, functions can be overloaded for programming convenience, and generic arguments can be defined. 10. COMMUNICATIONS & LINKS HyperC computations make parallel variables interact. These interactions are of 2 types: 1) Local interactions between corresponding elements of a given collection, and 2) Remote interactions between non corresponding elements of a given collection or between elements of variables belonging to separate collections. Due to the distribution of data over all physical processors, remote interactions between elements correspond, in many cases, to physical communications. Because accesses to the memory of another processor can be several orders of magnitude slower than local memory accesses, it is important to minimize the number of remote interactions. Achieving this objective guarantees a consistent level of performance across varied platforms, but requires the explicit specification of communications: it warns the programmer of communication requirements and tell the compiler specifically when communications may take place. The HyperC language defines a general mapping mechanism which allows virtual processors to send or receive values to each other. The language introduces the concept of a "link" which is a pointer to a virtual processor inside a collection. Links can be combined into parallel variables to express simultaneous data transfers between virtual processors. The concept of strings of characters is extended to strings of links to support irregular data sets and efficient neighborhood (local) operations. One of the main features of links is that they allow the compiler to compile communication patterns for greater efficiency. 11. REGULAR COMMUNICATIONS Many communication patterns (such as getting the value of a neighbor in a grid topology) exhibit regularity as they express accesses relative to a virtual processor location. A local neighborhood operator (the "dot" operator) allows the specification of such regular patterns (for which most architectures provide efficient hardware support). 12. OVERLAPPING OF COMPUTATIONS WITH COMMUNICATIONS In many applications, operations are expressed relative to each virtual processor: a typical example is the local averaging of the values held by the neighborhood of a virtual processor (this concept is the same regardless of the topology of a collection - it could be a regular image or an irregular graph). The order in which such local computations are performed seldom affects the result being computed. This is a case where such reduced data dependencies can be used to overlap computations with communications. To allow a compiler to effectively perform such optimization, a high level construct must be provided to allow the programmer to express the semantics of the computations, and to provide the freedom to the compiler necessary to balance the load between computations and communications as needed by the particular target architecture. Such overlapping is specified by means of an "along" construct which usually stands for: "along" each neighbor, perform the following computation. 13. COMMUNICATION LIBRARY Some particular communication patterns must be viewed as atomic operations in the parallel domain either because most parallel architectures provide efficient implementations or because their functionality is a convenient one for the programmer. Such communication patterns are supported through library calls; spreads and reductions are the most common ones. Other types are also supported, such as segmented scans, tournaments (which prevent problems related to collisions of sends) and buffered sends. 14. POINTERS The C language defines the concept of a pointer, which is a local reference to a variable in main memory (processor's address space). The HyperC language maintains the semantics associated with pointers: a pointer is always a local reference. It is not possible to use a pointer to point to the memory of another virtual processor; the reason is that dereferencing such a pointer would hide communications (which is contrary to the philosophy of the language which is to make all communications explicit). Consequently, there are only two types of pointers in the parallel domain: a scalar pointer to a parallel variable and a parallel pointer where each data element is a local pointer. 15. DATA PARALLEL I/O When a data set is distributed over many virtual processors, particular care must be taken when the data set needs to be read or written (for example, from and to a disk, or to the screen for display purposes). The HyperC language defines a unique I/O number for each data element in a collection; this number, which defines the order in which the data element will be read from or written to a serial device, ensures a clear semantics for data parallel I/O operations. Consequently, all I/O functions defined by the C language are extended to the parallel domain in a systematic way. 16. PORTABILITY To facilitate the incremental parallelization of a program, a programmer must be able to easily and quickly import entire libraries (such as X11) for which a lot of work has already been done, and for which most functions are control ones (such as opening and closing a window) that do not require parallelization. To make the port of such a code easy, one must be able to: 1) Reuse include files needed by the software without modifying them. 2) Easily parallelize data parallel functions. In both cases, the HyperC language provides mechanisms to perform such tasks in a straightforward fashion: entire "include" files can be properly tagged to allow the compiler to use function prototypes and variable declarations in the most efficient ways, and the process of parallelizing functions can be automated once parallel arguments have been defined in the data parallel overloaded version of the original function. 17. Examples 17.1. DRAWING A DISK The program below draws a disk inside a window. #include #include int main(int argc, char *argv[]) { /* Declare a collection. */ collection [256,256]image; /* Instantiate image parallel variables. */ image int x, y, disk; /* Disk attributes. */ int center_x = 100, center_y = 100, radius = 60; /* Window. */ int window; /* Open window and initialize colormap. */ window = HcOpenDisplay(512,512); HcSetCmap(window, BlackAndWhite, 0, 0, -1); /* Determine the virtual address of each virtual processor. */ x = HcCoord(image,0); y = HcCoord(image,1); /* Position the disk. */ x -= center_x; y -= center_y; /* Draw the disk. */ where ((x * x + y * y) < (radius * radius)) disk = 255; elsewhere disk = 0; /* Display the disk inside the display window. */ HcFlash(window, disk, 0, 0, 1, 1); } 17.2. HISTOGRAMMING The program below reads an image from a file, displays it, computes a histogram of the pixel values (number of occurences of each value in the image), and displays the histogram as a bar graph. #include #include #include /* to read an image. */ int main(int argc, char *argv[]) { /* Define a collection of type image. Its topology is undefined. The topology is set by the function that loads the image */ collection image; /* Instantiate a pointer to an image parallel variable. */ image char *packed image_src; /* Instantiate a handle to a X11 display window. */ int window; /* Load image from a file. Set the topology of an image. */ image_src = image_read_binary(&image, "../lib/lena.pic"); /* Open display window, initialize colormap, display image. */ window = HcOpenDisplay(HcDimof(image,0)+256, 100 >? HcDimof(image,1)); HcSetCmap(window, BlackAndWhite, 0, 0, -1); HcFlash(window, *image_src, 0, 0, 1, 1); { /* Instantiate the histogram collection: 1-D topology. */ collection [256]histogram; /* Instantiate histogram parallel variables. */ histogram int bins, x; /* A scalar variable. */ int max; /* Histogram image (this statement does everything): use the image as a parallel address to accumulate (with the reduction operator "+<-") 1 in each histogram bin. */ bins = 0; [[*image_src]]bins +<- (image int)1; /* Compute the maximum value in the histogram. */ max = >?<-bins; /* Normalize histogram between 0 and 100. */ if (max > 0) bins = bins * 100 / max; /* Display histogram by plotting vertical bars. */ x = HcDimof(image,1) + HcCoord(histogram,0); HcShowVector(window, x, 101, x, 100-bins, (int)255); } } 17.3. MATRIX MULTIPLICATION The program below multiplies two square images. It first rearranges elements in each matrix so the multiplication can be done using elementwise (local) computations and regular communications. #include #include #include collection generic int matrix_multiplication(generic int A, generic int B) { /* Matrix variables. */ generic int A_rotate_x, A_rotate_y, B_rotate_x, B_rotate_y; generic int product; /* Control flow. */ int M; /* Verify that both matrices are square. */ product = 0; if (HcDimof(generic, 0) != HcDimof(generic, 0)) return(product); /* Initialize initial rotation mappings. */ A_rotate_x = (HcCoord(generic,0)+HcCoord(generic,1)) % HcDimof(generic,0); A_rotate_y = HcCoord(generic,1); B_rotate_x = HcCoord(generic,0); B_rotate_y = (HcCoord(generic,0)+HcCoord(generic,1)) % HcDimof(generic,1); /* Rotate A & B matrices all elements at once. */ A = [[A_rotate_x,A_rotate_y]]A; B = [[B_rotate_x,B_rotate_y]]B; /* Compute product iteratively. */ product = 0; for (M = 0; M < HcDimof(generic,0); M++) { /* Multiply elementwise matrices A & B. */ product += A * B; /* Shift left A. */ A = [[.+1,.]]A; /* Shift up B. */ B = [[.,.+1]]B; } return(product); } int main(int argc, char *argv[]) { /* matrix variables. */ collection [3,3]matrix; matrix int A, B, product; /* Initialize A & B matrices. */ A = random(A); A >>= 12; A &= 0xFF; B = random(B); B >>= 12; B &= 0xFF; /* matrix multiply */ product = matrix_multiplication(A, B); } 17.4. LAPLACIAN CONVOLUTION The program below convolves an image by the laplacian operator. It uses a string of links and the along statement for efficiency. #include #include local int random(int); /********************************************** convo_laplacian */ collection generic int convo_laplacian(generic int data_in) { /* Convolution variables. */ generic int data_inside; generic int data_out; /* Neighborhood. */ generic link generic neighborhood[6]; generic int weights[6]; /* Control flow. */ int X; /* Initialize inside variable (border = 1). */ data_inside = 1; where (HcCoord(generic,0) <= 0) data_inside = 0; where (HcCoord(generic,0) >= (HcDimof(generic,0)-1)) data_inside = 0; where (HcCoord(generic,1) <= 0) data_inside = 0; where (HcCoord(generic,1) >= (HcDimof(generic,1)-1)) data_inside = 0; /* Initialize neighborhood. */ neighborhood[0] = [.,.]generic; neighborhood[1] = [.,.+1]generic; neighborhood[2] = [.,.-1]generic; neighborhood[3] = [.+1,.]generic; neighborhood[4] = [.-1,.]generic; neighborhood[5] = 0; /* Initialize weights. */ weights[0] = 4; weights[1] = -1; weights[2] = -1; weights[3] = -1; weights[4] = -1; /* ** Convolve image with laplacian operator: ** 4*CENTER-NORTH-SOUTH-EAST-WEST. */ data_out = 0; where (data_inside) { along(neighborhood) data_out += weights[.] * [neighborhood]data_in; } return(data_out); } /********************************************************* main */ int main(int argc, char *argv[]) { /* Image variables. */ collection [100,100]image; image int data; /* Initialize image. */ data = random(data); data >>= 12; data &= 0xFF; /* Convolve image. */ data = convo_laplacian(data); } 18. AVAILABILITY The compiler is currently available on SUN, SGI, DEC & HP workstations. A PVM version for heterogeneous systems is currently being developed; evaluation versions will be available by year end (93). 19. LICENSE FEES The software package includes the compiler, the X11 display tool and several application source code programs (sorting, delaunay triangulation, etc.). Each license allows the use of the compiler and related tools to be used on one workstation (identified by its host I.D.) by any number of users. The price per license decreases depending on the number of licenses purchased simultaneously (on the same purchase order) as follows: Regular Gov/Univ 1st $1750 $1225 2nd .. 5th $1135 $680 6th .. 10th $740 $370 11th+ $480 $240 UNTIL SUPERCOMPUTING'93 (where we will be demonstrating our new product - booth 303), we are offering a promotion whereby the ABOVE LICENSE FEES ARE REDUCED BY 25%. Do not miss out on this opportunity to evaluate this new language! 20. FOR FURTHER INFORMATION & ORDERING Christian Fortunel Fortunel Systems, Inc. - Ste. 311-5 1135 Kildaire Farm Road Cary, NC 27511 Tel: 919-319-1624 Fax: 919-319-1749 E-mail: fortunel@vnet.net <<<< END Christian Fortunel C/O HyperParallel Technologies