In this chapter we present our compiler algorithm for prefetching dense-matrix codes. These applications are a top priority since they consume large numbers of cycles on supercomputers today, they have poor caching behavior, and yet they have regular enough access patterns that prefetching has a reasonable chance of success. Therefore we start by handling these affine array accesses (i.e. where the index functions are affine expressions of the surrounding loop variables), and later build upon this core algorithm in subsequent chapters to handle other important cases such as indirect references and multiprocessor codes.
This chapter is organized in five sections. The first section discusses some key concepts for compiler-based prefetching, including the need to avoid unnecessary overhead. In light of these goals, Section provides an overview of our compiler algorithm, which includes both an analysis phase and a scheduling phase. Details of these two phases are presented in Sections and , respectively. The analysis phase uses locality analysis to predict which references should be prefetched, and the scheduling phase first uses loop splitting to isolate those dynamic miss instances, and then uses software-pipelining to schedule prefetches the proper amount of time in advance. Finally, Section makes our algorithm concrete by showing the output for some example code, and also discusses issues that arose when implementing the algorithm in the SUIF compiler, such as whether prefetch insertion should occur before or after scalar optimization. The success of this algorithm will be evaluated later in Chapter .