My current research centers on compiler-directed program restructuring techniques to improve the cache locality of array accesses in loops. Specifically, I am studying the use of array restructuring to optimize for spatial locality and, in parallel execution, to reduce false sharing.
In an earlier project, I also worked on improving the performance and applicability of runtime parallelization.
My advisor is Prof. John Zahorjan. A list of my publications on these and other subjects is also available.
My current research focuses on compiler-directed program restructuring techniques to improve cache locality. Specifically, I am studying the use of array restructuring to optimize array accesses in loops. My work combines the development of algorithms within a formal framework, implementation of a prototype compiler, and extensive experimentation with benchmark loops. It shows how array restructuring can be applied automatically and efficiently to a wide class of applications.
Array restructuring is an approach to enhancing the locality of array accesses in loops. These accesses are targeted because they account for a major portion of the memory traffic in many array-based scientific computations. Moreover, since they are typically executed many times, the effort spent on optimizing a few of them in the program text can yield huge benefits in execution performance.
Under the array restructuring approach, the compiler analyzes how each array is accessed and lays out the array to suit that access pattern. As a simple example, if a two-dimensional array is accessed by rows, the compiler may decide to store it in row-major order, whereas if it is accessed by columns, the compiler would choose column-major storage. In contrast, traditional compilers fix the storage order for all arrays, forcing performance-conscious programmers to write their code so that the access pattern matches the fixed layout as far as possible.
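To make the contrast concrete, here is a minimal C++ sketch (illustrative only; the array names, the size, and the reduction are my own assumptions, not code from the prototype):

    const int N = 1024;
    double a[N][N];     // row-major, the C/C++ default
    double a_t[N][N];   // hypothetical restructured copy: a_t[j][i] holds a[i][j]

    // Column-wise traversal of a row-major array: consecutive inner-loop
    // iterations touch elements N doubles apart, so almost every access
    // starts a new cache line.
    double sum_columnwise() {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];          // stride N
        return sum;
    }

    // If the compiler instead stores the array transposed, the same
    // traversal walks memory with stride 1 and uses every word of each
    // cache line it fetches.
    double sum_restructured() {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a_t[j][i];        // stride 1
        return sum;
    }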
My research in array restructuring is motivated in part by the observation that array restructuring in many ways complements loop restructuring, an alternative approach that changes the execution order of loop iterations rather than the storage order of array elements, yet has received much less attention. For example, array restructuring can easily be applied to complicated loops that defeat many automatic loop restructuring techniques. Also, array restructuring can improve spatial locality without jeopardizing temporal locality, whereas loop restructuring affects both types of locality. However, while loop restructuring has been widely studied, relatively little is known about how to apply array restructuring automatically and efficiently.
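A hypothetical loop (my own example, not drawn from the benchmarks) illustrates this complementarity:

    const int N = 1024;
    double a[N][N];     // original, row-major
    double a_t[N][N];   // transposed copy: a_t[q][p] holds a[p][q]

    // Iteration (j, i) reads a[i-1][j+1], which the later iteration
    // (j+1, i-1) overwrites: the dependence distance is (+1, -1) in
    // (j, i) order. Interchanging the loops would reverse the order of
    // that read and that write, so interchange is illegal here and the
    // stride-N inner-loop accesses remain.
    void update_original() {
        for (int j = 1; j < N - 1; j++)
            for (int i = 1; i < N; i++)
                a[i][j] = a[i - 1][j + 1] + 1.0;       // stride N per i step
    }

    // Transposing the array fixes the stride without reordering a single
    // iteration, so legality is never in question.
    void update_restructured() {
        for (int j = 1; j < N - 1; j++)
            for (int i = 1; i < N; i++)
                a_t[j][i] = a_t[j + 1][i - 1] + 1.0;   // stride 1 per i step
    }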
My research shows how array restructuring can be applied automatically and efficiently to a wide class of programs. This not only provides a new set of techniques to complement existing loop restructuring techniques, but also produces insights and experience that will, I believe, contribute to an integrated approach combining the strengths of the two. Specifically, my work makes four contributions: a framework to represent array transformations, algorithms to automate array restructuring within this framework, a prototype compiler to implement these algorithms, and experiments to evaluate their effectiveness.
I have formulated a framework to represent a general class of array transformations. In this framework, an array to be restructured (the "original array") is replaced by another array (the "restructured array") that contains the same elements but in a different order. Correspondence between elements of these two arrays is defined by an invertible linear transformation of their index vectors. In other words, instead of using an index vector to find an element in the original array, we apply a linear transformation to that vector and use the result to find the corresponding element in the restructured array. It may appear that the extra transformation imposes a significant overhead, but in fact this is not the case, for the following reason. Traditionally the memory address of an array element is a linear function of the indices. This condition is the basis of most compiler optimizations for reducing indexing overhead. Applying an extra linear transformation to the index vectors does not invalidate this condition and therefore does not entail any extra indexing overhead --- a property essential for the efficiency and thus viability of array restructuring.
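As a minimal illustration of this property (the notation below is mine, not the framework's), take the transpose of a two-dimensional array, expressed as the invertible linear map T:

    const int N = 1024;   // assumed array dimension

    // T maps an index vector (i, j) to (j, i):
    //
    //     T = | 0  1 |,    T (i, j)^T = (j, i)^T,    T is invertible.
    //         | 1  0 |
    //
    // Offset of element (i, j) in the original row-major layout,
    // a linear function of i and j, as addresses traditionally are.
    long offset_original(int i, int j)     { return (long)i * N + j; }

    // Offset in the restructured array: the index vector is transformed
    // by T and then flattened. The composition of two linear maps is
    // again linear, so the offset is still linear in i and j, and
    // strength reduction and related indexing optimizations apply
    // exactly as before.
    long offset_restructured(int i, int j) { return (long)j * N + i; }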
I have developed algorithms within this framework for the key steps in array restructuring. In the common case where array indices are linear functions of loop variables, these algorithms rely on simple linear algebraic techniques; they have also been adapted to handle more general access patterns.
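The reports referenced below spell out the actual algorithms; the following sketch, with every concrete choice assumed purely for illustration, only conveys the linear-algebraic flavor of the common case:

    // Suppose a loop nest with j innermost contains the access a[i+j][j].
    // The index vector is a linear function of the loop variables:
    //
    //     index(i, j) = A (i, j)^T,    A = | 1  1 |
    //                                      | 0  1 |
    //
    // One inner-loop step changes the index by A (0, 1)^T = (1, 1)^T,
    // a stride of N+1 elements in row-major storage. Choosing the layout
    // transformation T = A^{-1} = | 1 -1 |
    //                             | 0  1 |
    // maps index (i+j, j) to (i, j), so an inner-loop step changes only
    // the last index. T is invertible, so distinct elements never collide.
    long off_original(int i, int j, int N)     { return (long)(i + j) * N + j; }  // stride N+1
    long off_restructured(int i, int j, int N) { return (long)i * N + j; }        // stride 1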
I have implemented a prototype compiler to perform array restructuring automatically. It is based on the SUIF compiler from Stanford University. SUIF itself comprises a number of compiler passes over an intermediate program representation. My implementation consists of an array restructuring pass (about 9,000 lines of C++) added after SUIF's own optimization passes, and a runtime library (about 2,000 lines of C and C++).
I have performed a series of experiments using the NASA7 kernels from the SPEC benchmarks and other loops from the literature. The experiments were designed to study how array restructuring affects performance for a range of problem sizes as well as how it compares and interacts with various loop restructuring techniques. They were carried out on four different platforms representing two types of machines: single-processor workstations (Alpha-based DEC 3000 and PowerPC-based IBM RS/6000) and shared-memory multiprocessors (R8000-based SGI Power Challenge and Kendall Square Research's KSR2, which has a proprietary processor).
Experimental results have been encouraging. On a DEC 3000 workstation, array restructuring decreased the execution time of most of the loops by 40-50% (over 80% in some cases) and increased it in none. The average improvement was 53% (or 31% when loops to which the compiler did not apply array restructuring at all are included). This held across a wide range of problem sizes. Results for the IBM RS/6000 were similar. On both platforms, performance improved because array restructuring led to better spatial locality. For the same reason, performance on the KSR2 and the SGI Power Challenge showed similar improvements for execution on any number of processors. Moreover, in several cases where false sharing had existed, array restructuring removed this performance bottleneck, producing performance benefits that increased with the number of processors.
Experiments also showed that in both applicability and performance, array restructuring complemented many common loop restructuring techniques, including those performed by a production-quality optimizing compiler from SGI. Array restructuring was successfully applied to loops that these techniques could not automatically transform, and it achieved comparable, often better, performance where both were applicable. In the few cases where it did not perform as well, simple forms of loop restructuring would have sufficed, again suggesting that loop and array restructuring are complementary.
More details can be found in a technical report; a more concise version is also available.
Runtime parallelization is a two-step strategy for parallelizing loops that may contain substantial parallelism but cannot be parallelized at compile time because of insufficient dependence information. For such a loop, the compiler generates two code fragments: the inspector and the executor. At run time, the inspector computes a parallel schedule of the iterations based on dependence information unavailable at compile time; the executor then performs the iterations in parallel according to this schedule.
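As a minimal sketch of the strategy (the loop, the names, and the scheduling policy are my own illustrative assumptions, not the techniques from the papers below), consider a loop in which iteration i computes x[i] from x[dep[i]], where the index array dep is unknown until run time and dep[i] <= i, so that only flow dependences arise:

    #include <vector>
    #include <algorithm>

    // Inspector: place each iteration in a "wavefront" strictly later
    // than the iteration it reads from. Iterations within one wavefront
    // are mutually independent and may run in parallel.
    std::vector<std::vector<int>> inspect(const std::vector<int>& dep) {
        int n = (int)dep.size();
        std::vector<int> wave(n);
        int nwaves = 0;
        for (int i = 0; i < n; i++) {
            wave[i] = (dep[i] < i) ? wave[dep[i]] + 1 : 0;
            nwaves = std::max(nwaves, wave[i] + 1);
        }
        std::vector<std::vector<int>> schedule(nwaves);
        for (int i = 0; i < n; i++)
            schedule[wave[i]].push_back(i);
        return schedule;
    }

    // Executor: run the wavefronts in order; within a wavefront the
    // iterations execute in parallel (OpenMP shown for illustration).
    void execute(std::vector<double>& x, const std::vector<int>& dep,
                 const std::vector<std::vector<int>>& schedule) {
        for (const auto& wavefront : schedule) {
            #pragma omp parallel for
            for (long k = 0; k < (long)wavefront.size(); k++) {
                int i = wavefront[k];
                x[i] = x[dep[i]] + 1.0;   // stand-in for the real loop body
            }
        }
    }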
My research in runtime parallelization has touched on both the inspector and the executor. I have proposed two ways to speed up the inspector; this work appeared in the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming and is also available as technical report 92-12-05. I have also studied various forms of the executor to improve its performance and extend its applicability to complex dependence patterns (reported in technical report 95-01-08). My experiments on the KSR1, a shared-address-space multiprocessor, showed that false sharing and poor spatial locality can seriously degrade executor performance. I have proposed and experimentally evaluated two simple techniques that address these problems by restructuring arrays according to the parallel execution schedule (reported in technical report 94-02-01).