Publications of James M. Stichnoth

J. Stichnoth. "Generating Code for High-Level Operations through Code Composition." PhD thesis, August, 1997.

A traditional compiler translates each expression or statement in a high-level language into a sequence of lower-level target statements (e.g., operations in an intermediate representation, or machine instructions), in a manner fixed by the compiler writer. The output is then subject to further optimization. This compilation strategy is called custom code generation, as the compiler generates custom code for each input construct.

An alternative strategy is to generate a call to a runtime library for each high-level language construct. This approach is attractive if the source language contains complex, powerful constructs, like the distributed array assignment statement in High Performance Fortran (HPF). The decision between custom code generation and use of a runtime library involves tradeoffs between efficiency (performance of the generated code), maintainability (ease of developing and maintaining the algorithm), and generality (implementation of the general case, rather than merely a simplified canonical case).

I introduce a new compilation strategy, high-level code composition, which combines the advantages of custom code generation and runtime libraries. The compilation of each construct is controlled by code templates, which contain both target code to be generated and compile-time control instructions that specify how the templates are composed together. The templates are external to the compiler, making them easy to write and modify. The composition system executes the control code at compile time, automatically generating and optimizing the code to be executed at run time.

In this dissertation, I motivate and explore the language and implementation issues of code composition. I describe Catacomb, my implementation of a composition system, which integrates with a high-level compiler to generate custom C code. I describe the challenges in enabling Catacomb to automatically generate the best code without sacrificing a clean user model. In addition, I develop a framework for the HPF array assignment, allowing an arbitrary array assignment algorithm to be coupled with an arbitrary communication architecture to form a complete implementation. This framework leads to the first compilation system that can incorporate and compare several array assignment algorithms on several communication architectures, for the fully general array assignment statement.

J. Stichnoth, D. O'Hallaron, and T. Gross. "Generating Communication for Array Statements: Design, Implementation, and Evaluation." Journal of Parallel and Distributed Computing, 21(1):150-159, April 1994.

Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data parallel program for a private memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern private memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node private memory iWarp system, using a number of different distributions.

J. Stichnoth, G.-Y. Lueh, and M. Cierniak. Support for Garbage Collection at Every Instruction in a Java Compiler. To appear in Proceedings of PLDI'99, Atlanta, Georgia, May, 1999. ACM.
A. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. Stichnoth. " Fast, Effective Code Generation in a Just-In-Time Java Compiler." In Proceedings of PLDI'98, pages 280-290, Montreal, Quebec, June, 1998. ACM.

A "Just-In-Time" (JIT) Java compiler produces native code from Java byte code instructions during program execution. As such, compilation speed is more important in a Java JIT compiler than in a traditional compiler, requiring optimization algorithms to be lightweight and effective. We present the structure of a Java JIT compiler for the Intel Architecture, describe the lightweight implementation of JIT compiler optimizations (e.g., common subexpression elimination, register allocation, and elimination of array bounds checking), and evaluate the performance benefits and tradeoffs of the optimizations. This JIT compiler has been shipped with version 2.5 of Intel's VTune for Java product.

J. Stichnoth and T. Gross. "Code Composition as an Implementation Language for Compilers." In Conference on Domain-Specific Languages, pages 119-131, Santa Barbara, CA, October, 1997.

Code composition is an effective technique for a compiler to implement complex high-level operations. The developer (i.e., the language designer or compiler writer) provides building blocks consisting of sequences of code written in, e.g., C, that are combined by a composition system to generate the code for such a high-level operation. The {\em composition system} can include optimizations not commonly found in compilers; e.g., it can specialize the code sequences based on loop nesting depth or procedure parameters. We describe a composition system, Catacomb, and illustrate its use for mapping array operations onto a parallel system.

L. Wang, J. Stichnoth, and S. Chatterjee. "Runtime Performance of Parallel Array Assignment: An Empirical Study." In Proceedings of Supercomputing '96, Pittsburgh, PA, November 1996.

Generating code for the array assignment statement of High Performance Fortran (HPF) in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the run-time performance of the code these algorithms generate. We classify these algorithms into several families, identify several issues of interest in the generated code, and present experimental performance data for the various algorithms. We demonstrate that the code generated for block-cyclic distributions runs almost as efficiently as that generated for block or cyclic distributions.

T. Stricker, J. Stichnoth, D. O'Hallaron, S. Hinrichs, and T. Gross. "Decoupling Synchronization and Data Transfer in Message Passing Systems of Parallel Computers." In Proceedings of the 9th International Conference on Supercomputing, pages 1-10, Barcelona, July 1995. ACM.

Synchronization is an important issue for the design of a scalable parallel computer, and some systems include special hardware support for control messages or barriers. The cost of synchronization has a high impact on the design of the message passing (communication) services. In this paper, we investigate three different communication libraries that are tailored toward the synchronization services available: (1) a version of generic send-receive message passing (PVM), which relies on traditional flow control and buffering to synchronize the data transfers; (2) message passing with pulling, i.e. a message is transferred only when the recipient is ready and requests it (as, e.g., used in NX for large messages); and (3) the decoupled direct deposit message passing that uses separate, global synchronization to ensure that nodes send messages only when the message data can be deposited directly into the final destination in the memory of the remote recipient. Measurements of these three styles on a Cray T3D demonstrate the benefits of the decoupled message passing with direct deposit. The performance advantage of this style is made possible by (1) preemptive synchronization to avoid unnecessary copies of the data, (2) high-speed barrier synchronization, and (3) improved congestion control in the network. The designers of the communication system of future parallel computers are therefore strongly encouraged to provide good synchronization facilities in addition to high throughput data transfers to support high performance message passing.

J. Subhlok, J. Stichnoth, D. O'Hallaron, and T. Gross. "Exploiting Task and Data Parallelism on a Multicomputer." In Proceedings of the 4th PPoPP, pages 13-22, San Diego, CA, May 1993.

For many applications, achieving good performance on a private memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on only one of data parallelism and task parallelism. Therefore, to achieve the desired results, the programmer must separately program the data and task parallelism. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.

J. Stichnoth and T. Gross. "A Communication Backend for Parallel Language Compilers." In Proceedings of the 8th Workshop on Languages and Compilers for Parallel Computing, pages 224-236, Columbus, Ohio, August 1995. Springer Verlag.

Generating good communication code is an important issue for all compilers targeting parallel or distributed systems. However, different compilers for the same parallel system usually implement the communication generation routines (e.g., message buffer packing) independently and from scratch. As a result, these compilers either pursue a simple approach (calling a standard runtime library), which does not do justice to the capabilities of the system, or they incur high development costs. This paper describes a way to separate the communication issues from other compilation aspects (e.g., determining the distribution of data and computation). This organization places the responsibility for communication issues with the {\it communication backend}, and this backend can be shared by different compilers. It produces code that is customized for each communication step, based on the exact data distribution and the characteristics of the target parallel system. This approach has several advantages: (1) The communication backend can be shared by multiple compilers, e.g., for different parallel languages. (2) The communication backend provides a way to integrate regular and irregular communication, e.g., as required to deal with irregular codes. (3) Retargeting of a parallel compiler is simplified, since the communication backend deals with the interface to communication (and the single-node compiler). (4) The communication backend can optimize the code, e.g., by constant folding and constant propagation. Code produced by the communication backend is always at least as fast as library code, but the customization has the potential to significantly improve performance depending on what information is known at compile time.

T. Gross, S. Hinrichs, G. Lueh, D. O'Hallaron, J. Stichnoth, and J. Subhlok. "Compiling Task and Data Parallel Programs for iWarp." In Proceedings of the Second Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, pages 32-35, Boulder, CO, September 1992. SIGPLAN Notices 28(1), Jan 93.

No abstract; the paper is a 4-page extended abstract.

B. Yang, J. Webb, J. Stichnoth, D. O'Hallaron, and T. Gross, "Do&Merge: Integrating parallel loops and reductions." In Proc. Sixth Workshop on Languages and Compilers for Parallel Computing, volume 768 of Lecture Notes in Computer Science, pages 169-183, Portland, OR, August 1993. Springer Verlag.

Many computations perform operations that match this pattern: first, a loop iterates over an input array, producing an array of (partial) results. The loop iterations are independent of each other and can be done in parallel. Second, a reduction operation combines the elements of the partial result array to produce the single final result. We call these two steps a Do&Merge computation. The most common way to effectively parallelize such a computation is for the programmer to apply a DOALL operation across the input array, and then to apply a reduction operator to the partial results. We show that combining the Do phase and the Merge phase into a single Do&Merge computation can lead to improved execution time and memory usage. In this paper we describe a simple and efficient construct (called the Pdo loop) that is included in an experimental HPF-like compiler for private-memory parallel systems.

J. M. Stichnoth. "Efficient Compilation of Array Statements for Private Memory Systems." Technical Report CMU-CS-93-109, School of Computer Science, Carnegie Mellon University, February 1993.

One of the core constructs of High Performance Fortran (HPF) is the array-slice assignment statement, combined with the rich choice of data distribution options available to the programmer. On a private memory multicomputer, the HPF compiler writer faces the difficult task of automatically generating the necessary communication for assignment statements involving arrays with arbitrary block-cyclic data distributions. In this paper we present a framework for representing array slices and block-cyclic distributions, and we derive efficient algorithms for sending and receiving the necessary data for array-slice assignment statements. The algorithms include a memory-efficient method of managing the layout of the distributed arrays in each processor's local memory. We also provide a means of converting the user's TEMPLATE, ALIGN, and DISTRIBUTE statements into a convenient array ownership descriptor. In addition, we present several optimizations for common distributions and easily-recognized communication patterns. The work presented makes minimal assumptions regarding the processor architecture, the communication architecture, or the underlying language being compiled.

P. Dinda, T. Gross, D. O'Hallaron, E. Segall, J. Stichnoth, J. Subhlok, J. Webb, and B. Yang, "The CMU task parallel program suite." Technical Report CMU-CS-94-131, School of Computer Science, Carnegie Mellon University, March, 1994.
The idea of exploiting both task and data parallelism in programs is appealing. However, identifying realistic, yet manageable example programs that can benefit from such a mix of task and data parallelism is a major problem for researchers. We address this problem by describing a suite of five applications from the domains of scientific, signal, and image processing that are of reasonable size, are representative of real codes, and can benefit from exploiting task and data parallelism. The suite includes fast Fourier transforms, narrowband tracking radar, multibaseline stereo imaging, and airshed simulation. Complete source code for each example program is available from the authors.