We now focus on how each of these techniques, particularly prefetching, fits into the ``big picture'' of coping with memory latency. As a quick summary, Table presents the benefits and requirements of each technique. Given these techniques, we would like to apply them in the following order.
First, the latency should be reduced as much as possible through caching and locality optimizations. Reducing latency is preferable to tolerating latency, since it actually reduces the demand for main memory bandwidth, which can be crucial. Caches provide the foundation for all of these latency-hiding techniques, and locality optimizations are also attractive since they require no additional hardware support.
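To make the notion of a locality optimization concrete, the following sketch contrasts two traversals of the same matrix. The function names, matrix size, and tile size are illustrative choices, not taken from the text; the point is only that reordering the accesses, with no hardware support, lets each cache line be consumed while it is still resident.

```c
#include <stddef.h>

#define N    256  /* illustrative matrix dimension */
#define TILE 32   /* illustrative tile size; tuned to cache capacity in practice */

/* Column-order traversal of a row-major array: consecutive accesses are
   N elements apart, so spatial locality is poor. */
long sum_column_order(int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* Tiled traversal: each TILE x TILE block is consumed while its cache
   lines are still resident, improving reuse.  The result is identical;
   only the access order, and hence the cache behavior, changes. */
long sum_tiled(int a[N][N]) {
    long sum = 0;
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    sum += a[i][j];
    return sum;
}
```

Loop tiling is only one such transformation; loop interchange and fusion serve the same goal of increasing reuse before data is evicted.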
After reducing latency, we then want to tolerate any remaining latency, starting with the least expensive techniques before resorting to more costly ones. The first step is buffering and pipelining accesses, which is an effective means of hiding write latency and requires only a lockup-free cache, a requirement common to all latency-tolerating techniques. To address read latency as well, the choices are either prefetching or multithreading. Software-controlled prefetching appears to be the more desirable solution since it requires significantly less hardware support than either hardware-controlled prefetching or multithreading, and, perhaps more importantly, it can improve the performance of a single thread of execution, rather than requiring multiple concurrent threads as multithreading does. If software-controlled prefetching cannot effectively hide read latency, the final step would be multithreading.
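A minimal sketch of what software-controlled prefetching looks like at the source level, assuming a compiler intrinsic such as the GCC/Clang `__builtin_prefetch` (the intrinsic and the prefetch distance of 16 elements are illustrative assumptions, not details from the text): the prefetch is issued far enough ahead of the demand reference that the data arrives before the loop reaches it.

```c
#include <stddef.h>

#define PREFETCH_DIST 16  /* elements ahead; a machine-dependent tuning knob */

/* Array reduction with software-issued prefetches.  Each iteration
   requests the element PREFETCH_DIST ahead so its cache miss overlaps
   with useful computation instead of stalling the processor. */
long sum_with_prefetch(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* args: address, 0 = read access, low temporal locality hint */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

Because the prefetch is a non-binding hint, the code remains correct even if the hardware drops it; only performance is at stake, which is what makes the required hardware support so modest.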
Thus the open question of just how effective software-controlled prefetching can be in practice is a key factor in deciding whether processor architectures should support prefetching, multithreading, or both. Addressing this open question is one of the goals of this dissertation, as we discuss further in the next section.