Microprocessor-based systems are increasingly becoming the workhorse for all scientific and engineering computation. With numerical processing capabilities that already rival older generations of supercomputers, the microprocessors used in these systems will continue to improve dramatically due to ever-increasing clock rates and the exploitation of instruction-level parallelism. In contrast to the vector-based machines that have long dominated high-performance computing, these new scalar systems are considerably more cost-effective since they contain commercial microprocessors that are mass-produced for the large general-purpose computing market. In addition, these commodity microprocessors can be used to build large-scale multiprocessors capable of aggregate peak rates surpassing those of current vector machines.
Unfortunately, a high computation bandwidth is meaningless unless it is matched by a similarly powerful memory subsystem. Although microprocessor speeds have been increasing dramatically, the speed of memory has not kept pace. As illustrated in Figure , the speed of commercial microprocessors has been doubling roughly every three years, while the speed of commodity DRAM has improved by little more than 50% over the past decade. Part of the reason for this is that there is a direct tradeoff between capacity and speed, and the highest priority in improving DRAM has been increasing capacity. The result is that, from the perspective of the processor, memory is getting slower at a dramatic rate. This will affect all computer systems, making it increasingly difficult to achieve high processor efficiencies. The latency problem is magnified in large-scale multiprocessors, where sheer physical dimensions result in large latencies to remote memory locations.
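As a rough worked example of the trend described above, the two growth rates can be compounded over a ten-year span. This is only a back-of-the-envelope sketch: it assumes both trends are smooth exponentials, treats "little more than 50%" as a factor of exactly 1.5, and uses the symbols S_cpu and S_dram (introduced here for illustration, not taken from the original text) for the speedup of each component over the decade.

```latex
\begin{align*}
S_{\text{cpu}}  &= 2^{10/3} \approx 10.1
  && \text{(doubling every three years, for ten years)} \\
S_{\text{dram}} &\approx 1.5
  && \text{(assumed: ``little more than 50\%'' per decade)} \\
\frac{S_{\text{cpu}}}{S_{\text{dram}}} &\approx \frac{10.1}{1.5} \approx 6.7
  && \text{(growth of the processor-memory gap)}
\end{align*}
```

Under these assumptions, a memory access that cost a given number of processor cycles at the start of the decade would cost roughly six to seven times as many cycles at the end of it.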
To deal with memory latency, most computer systems today rely on their cache hierarchy to reduce the effective memory access time. While the effectiveness of caches has been well established for general-purpose code, their effectiveness for scientific and engineering applications has not. One manifestation of this is that several of the scalar machines designed for scientific computation did not use caches.
This thesis investigates a technique called software-controlled prefetching, which mitigates the impact of long cache miss penalties, thereby helping to unlock the full potential of microprocessor-based systems. The remainder of this chapter provides further motivation for improving cache performance, discusses software-controlled prefetching in light of other techniques for coping with memory latency, presents our research goals, and summarizes related work. We conclude this chapter with a list of the major contributions of this thesis and an overview of the remaining chapters.