Fx Overview

David R. O'Hallaron
School of Computer Science
Carnegie Mellon University

Fx is a parallelizing Fortran compiler, based on a dialect of High Performance Fortran (HPF), that supports sensor-based and scientific computations. Fx has been ported to the Intel iWarp, IBM SP/2, Intel Paragon, Cray T3D, and networks of workstations. Fx pioneered (1) the development of real applications for HPF, (2) the integration of task and data parallelism, and (3) direct deposit message passing. This report discusses Fx applications, the performance gains from task and data parallelism, and the performance gains from direct deposit message passing.

1. Example applications

Fx has been used to implement a number of sensor-based and scientific applications. For example, Figure 1 shows the performance of a spotlight synthetic aperture radar code on the Intel Paragon. The Fx code was adapted by Peter Lieu at Carnegie Mellon from original F77 code provided by Sandia National Laboratories. The performance of the single-node Fx code is comparable to that of the original F77 code, and the speedup as a function of the number of Paragon nodes is nearly linear.

(a) Output image with four targets. (b) Measured performance on Paragon.

Figure 1: Spotlight synthetic aperture radar.

Figure 2 shows the performance of an earthquake ground motion simulation. The simulation, based on the method of boundary elements, was adapted from an original F77 program by Yoshiaki Hisada while he was visiting Carnegie Mellon from the Southern California Earthquake Center. Again, speedup as a function of the number of nodes is almost linear.

(a) Frequency response of 3D synthetic basin. (b) Performance of ground motion simulation on Paragon.

Figure 2: Earthquake ground motion simulation.

For additional information on Fx applications:

2. Integrating task and data parallelism in Fx

If a program consists of one or more functions that do not scale well because of internal communication, or if it operates on small data sets (typical of sensor-based computations), then exploiting a mix of task and data parallelism can have a significant impact on performance.
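To see why, consider a simple cost model (this model is ours, for illustration only; it is not taken from the report). Suppose a program has two equal stages with total work $W$, and each stage incurs a communication overhead $c(P)$ that grows with the number of nodes $P$:

```latex
% Purely data parallel: both stages run one after the other on all P nodes.
T_{\mathrm{dp}}(P) = \frac{W}{P} + 2\,c(P)

% Mixed mapping: the two stages run as a pipeline of two tasks,
% each on P/2 nodes; steady-state time per result is
T_{\mathrm{tp}}(P) = \frac{W}{P} + c\!\left(\frac{P}{2}\right)
```

The mixed mapping wins whenever $2\,c(P) > c(P/2)$, which holds whenever overhead grows with node count. The effect is strongest for small data sets, where $c(P)$ quickly dominates $W/P$.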

Figure 3 shows how Fx integrates task and data parallelism while still retaining sequential semantics. Tasks are subroutines (called task-subroutines) within special code regions called parallel sections. For each task-subroutine, the programmer uses directives to tell the compiler whether each parameter is an input, an output, or both, and how many processors the task-subroutine should run on. The compiler can either ignore the directives and generate ordinary data parallel code, or use them to generate task parallel code.

Figure 3: Approach: integrating task and data parallelism in Fx
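The shape of such a program might look like the sketch below. This is only an illustration: the directive spellings, and the task-subroutines fft2d and detect, are our own placeholders, not the exact Fx syntax.

```fortran
c     Illustrative sketch of a parallel section with two
c     task-subroutines forming a pipeline. Directive spellings
c     are assumptions, not verbatim Fx syntax.
      program pipeline
      complex a(64,64), b(64,64)
c$    begin parallel
      do i = 1, nframes
         call read_frame(a)
c$       input(a)
c$       output(b)
c$       processors(8)
         call fft2d(a, b)
c$       input(b)
c$       processors(8)
         call detect(b)
      end do
c$    end parallel
      end
```

Because the directives only annotate an ordinary sequential loop, ignoring them yields a correct data parallel program, while honoring them lets the compiler map fft2d and detect onto disjoint processor groups and pipeline successive frames between them.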

Figure 4 shows the speedup on iWarp of mixed Fx task and data parallel programs over purely data parallel Fx programs.

Figure 4: Performance benefit of mixed task and data parallel over purely data parallel mappings on iWarp

A suite of small, yet realistic F77 programs that can benefit from a mix of task and data parallelism is available to the public. For additional information on task parallelism in Fx:

3. Direct deposit message passing

Fx pioneered direct deposit message passing for efficient communication between processing nodes. The central idea behind direct deposit message passing is to decouple communication from synchronization. Figure 5 shows the basic approach.

Figure 5: Direct deposit message passing
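The idea can be sketched with the Cray T3D's SHMEM primitives (SHMEM_PUT and BARRIER are real T3D library calls, but this sketch is an analogy for the technique, not the code the Fx compiler actually generates; the off, base, and len tables are hypothetical precomputed address maps):

```fortran
c     Sketch of direct deposit: each node puts its data straight
c     into the destination's final memory location, with no receive
c     call and no intermediate buffering on the receiver.
      real src(1024), dst(1024)
c     ... precompute, for each destination node j, the source
c     offset base(j), destination offset off(j), and length len(j) ...
      do j = 0, npes-1
         call shmem_put(dst(off(j)), src(base(j)), len(j), j)
      end do
c     synchronization is decoupled from communication:
c     a single barrier after all deposits have been issued
      call barrier()
```

Since every word lands directly in its final location, the receiver does no unpacking or copying, and the cost of synchronization is paid once per communication phase rather than once per message.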

Figure 6 shows the per-node communication performance of a transpose operation on the Cray T3D. Direct deposit message passing is compared to the vendor-supplied PVM interface, the vendor-supplied transpose intrinsic, and the transpose code generated by the CRAFT compiler. The chief result is a factor of two gain over the vendor-supplied transpose intrinsic.

Figure 6: Per-node communication performance for transpose operation on the Cray T3D.

Figure 7 shows the aggregate communication performance of the T3D transpose operation. Again, direct deposit message passing shows a factor of two gain over the vendor-supplied intrinsic.

Figure 7: Aggregate communication performance for transpose operation on the Cray T3D.

A fast transpose library, based on direct deposit, has been developed for the T3D at Pittsburgh Supercomputing Center. For more information on direct deposit message passing, and generating communication code in general:

fx-compiler@cs.cmu.edu 3/96