The two main techniques for tolerating read latency as well as write latency are prefetching and multithreading. The key to tolerating read latency is to split apart the request for data and the use of that data, while finding enough parallelism to keep the processor busy in between. The distinction between prefetching and multithreading is that prefetching finds the parallelism within a single thread of execution, while multithreading exploits parallelism across multiple threads. To hide the latency within a single thread, the request for the data (i.e., the prefetch request) must be moved sufficiently far ahead of the use of the data in the execution stream. This effectively requires the ability to predict ahead of time what data will be needed. In contrast, the multithreading approach splits read transactions by swapping out the currently executing thread when it suffers a cache miss, executing other concurrent threads for the duration of the miss to keep the processor busy, and finally resuming the initial thread once the memory access completes. Prefetching will be discussed in this section, while multithreading will be discussed in more detail in Section .
Figure is a simple illustration of how prefetching improves performance. In the case without prefetching (shown on the left), the processor stalls when it attempts to load two locations (A and B) that are not present in the cache. If prefetches for A and B can be issued far enough in advance in the instruction stream (as shown on the right), then the memory accesses for both locations will have completed before the loads are executed, and hence the processor will not stall. The key observation here is that prefetching not only allows memory accesses to be overlapped with computation, but also allows memory accesses to be overlapped with other memory accesses (i.e., the accesses can be pipelined).
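The scenario with locations A and B can be sketched in C as follows. This is only an illustration under stated assumptions: it uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in for a prefetch instruction, and the function name `use_a_and_b` and the `work` array (representing independent computation) are hypothetical. Whether the loop is long enough to cover the miss latency depends on the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Issue prefetches for A and B, do independent work, then use A and B.
 * Both prefetches are outstanding at the same time, so the two memory
 * accesses overlap with each other as well as with the loop below. */
double use_a_and_b(const double *a, const double *b,
                   const double *work, size_t n)
{
    assert(a != NULL && b != NULL);

    __builtin_prefetch(a, 0, 3);   /* 0 = prefetch for read, 3 = high locality */
    __builtin_prefetch(b, 0, 3);

    double acc = 0.0;
    for (size_t i = 0; i < n; i++) /* computation that does not need A or B */
        acc += work[i];

    /* If the prefetches completed in time, these loads hit in the cache. */
    return acc + *a + *b;
}
```

The prefetches do not change the result; they only affect when the data arrives in the cache.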
Prefetches on a scalar machine are analogous to vector memory accesses on a vector machine. In both cases, memory accesses are overlapped with computation and other accesses. Furthermore, similar to vector registers, prefetching allows caches in scalar machines to be managed by software. A major difference is that while vector machines can only operate on vectors in a pipelined manner, scalar machines can execute arbitrary sets of scalar operations well.
Prefetching can occur in many different forms. One common type of prefetching occurs whenever cache lines are longer than a single word. In these cases, additional words are brought into the cache on each cache miss. This is most useful when there is abundant spatial locality, such as when iterating across an array in a unit-stride manner. However, increasing the cache line size is not the most effective form of prefetching, since memory bandwidth is wasted whenever useless data is brought into the cache. In addition, long cache lines can aggravate miss rates in shared-memory multiprocessors by causing unnecessary amounts of false sharing. As we have seen already in Figures and , a significant amount of latency remains despite the prefetching benefit of multi-word cache lines.
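The trade-off between spatial locality and wasted bandwidth can be made concrete with a small model. The sketch below assumes an idealized cache with cold misses only (no capacity or conflict misses) and a walk that starts at word 0; the function names `cold_misses` and `useful_fraction` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines fetched by a walk that touches words
 * 0, stride, 2*stride, ... for n_accesses accesses. */
size_t cold_misses(size_t n_accesses, size_t stride_words, size_t line_words)
{
    assert(line_words > 0 && stride_words > 0);
    size_t misses = 0;
    size_t last_line = SIZE_MAX;            /* sentinel: nothing fetched yet */
    for (size_t i = 0; i < n_accesses; i++) {
        size_t line = (i * stride_words) / line_words;
        if (line != last_line) {            /* addresses only increase, so a */
            misses++;                       /* new line index is a cold miss */
            last_line = line;
        }
    }
    return misses;
}

/* Fraction of the fetched words that the walk actually uses. */
double useful_fraction(size_t n_accesses, size_t stride_words, size_t line_words)
{
    size_t misses = cold_misses(n_accesses, stride_words, line_words);
    if (misses == 0)
        return 0.0;
    return (double)n_accesses / (double)(misses * line_words);
}
```

Under this model, 64 unit-stride accesses with 4-word lines miss 16 times and use every fetched word, while the same 64 accesses at a stride of 4 words miss every time and use only a quarter of the words brought in.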
Another form of prefetching could occur with non-blocking loads. With a non-blocking load, rather than stalling when the load is executed, any stalls are postponed until the load result is actually needed. If the loads could be moved far enough ahead of the uses of their data, non-blocking loads could therefore serve as a form of prefetching. However, there are two important limitations on how far loads can be moved ahead of their uses. First, there is the problem of running out of registers. If the compiler attempts to extend register lifetimes to hundreds of cycles, it will run out of registers very quickly. Second, there is the problem of maintaining program correctness. For example, a load cannot be moved ahead of a store unless it is certain that they are to different locations. Since memory disambiguation is a difficult problem for the compiler to solve (particularly in codes with indirect references), this is likely to be a serious limitation.
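The correctness limitation can be seen in a two-line example. In the sketch below (function names are hypothetical), the second version hoists the load above the store by hand, which is the transformation a compiler would like to perform. The two versions agree only when `p` and `q` refer to different locations.

```c
#include <assert.h>
#include <stddef.h>

/* Original program order: the store executes before the load. */
int store_then_load(int *p, const int *q, int v)
{
    assert(p != NULL && q != NULL);
    *p = v;
    return *q + 1;
}

/* Load moved ahead of the store. The value stays live in a register
 * (t) across the store, and the result is wrong if p and q alias. */
int load_hoisted(int *p, const int *q, int v)
{
    assert(p != NULL && q != NULL);
    int t = *q;
    *p = v;
    return t + 1;
}
```

With distinct locations both functions return the same value. When both pointers refer to the same `int` holding 1 and `v` is 10, the original returns 11 while the hoisted version returns 2, which is why the compiler must prove the addresses differ before reordering.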
Some elaborate prefetching schemes that are strictly hardware-based have also been proposed. We will discuss those schemes only briefly now, and will examine them in greater detail later in Section . Perhaps the most sophisticated of these techniques is the one proposed by Baer and Chen. With this scheme, the processor maintains a history table to keep track of the types of reference patterns it is seeing. If it detects a pattern of constant-stride access behavior for a particular instruction, it will attempt to prefetch ahead for that reference in the future. This prefetching occurs through a "lookahead PC", which walks ahead of the actual PC using branch prediction. The lookahead PC is used to look up these future instructions in the history table to see whether they should be prefetched. Another scheme, proposed by Lee, attempted to decode future instructions using a lookahead buffer to detect memory references. One advantage of strictly hardware-based schemes is that they do not incur any instruction overhead, unlike software-controlled prefetching (which we will discuss next). However, their disadvantages include the fact that they are limited to prefetching constant-stride accesses, they are limited by branch prediction (which is less than perfect), and they may entail a significant hardware cost.
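The history-table idea can be modeled in software roughly as follows. This is a deliberately simplified sketch, not Baer and Chen's actual design: the table size, the direct-mapped indexing, the single confirmation bit, and all names (`stride_table`, `stride_table_observe`) are assumptions made for illustration, and the lookahead PC is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_ENTRIES 64            /* assumed size, chosen arbitrarily */

typedef struct {
    uintptr_t pc;                   /* tag: address of the load instruction */
    uintptr_t last_addr;            /* last data address it referenced */
    intptr_t  stride;               /* last observed address difference */
    bool      valid;
    bool      confirmed;            /* same nonzero stride seen twice in a row */
} stride_entry;

typedef struct {
    stride_entry e[TABLE_ENTRIES];  /* must start zero-initialized */
} stride_table;

/* Record one memory reference made by the instruction at `pc`.
 * Returns true and sets *prefetch_addr when a constant stride has
 * been confirmed, i.e. when the next address looks predictable. */
bool stride_table_observe(stride_table *t, uintptr_t pc, uintptr_t addr,
                          uintptr_t *prefetch_addr)
{
    assert(t != NULL && prefetch_addr != NULL);
    stride_entry *e = &t->e[pc % TABLE_ENTRIES];

    if (!e->valid || e->pc != pc) {         /* first sighting: allocate entry */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->valid = true;
        e->confirmed = false;
        return false;
    }

    intptr_t stride = (intptr_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;

    if (e->confirmed) {
        *prefetch_addr = addr + (uintptr_t)stride;
        return true;
    }
    return false;
}
```

For a load that touches addresses 1000, 1008, and 1016, the third observation confirms a stride of 8 and proposes address 1024. An irregular reference afterwards clears the confirmation, which reflects the restriction to constant-stride patterns noted above.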
Finally, with software-controlled prefetching, explicit prefetch instructions are executed by the processor to move data into the cache. The format of these instructions resembles a normal load instruction, but without a register specifier (since the data is only placed in the cache). Prefetch instructions also differ from normal load instructions in that they are non-blocking and they do not take memory exceptions. The non-blocking aspect allows them to be overlapped with computation, and the fact that they do not take exceptions is useful because it permits more speculative prefetching strategies (e.g., dereferencing pointers before it is certain that they point to legal addresses). The challenges of software-controlled prefetching are that some sophistication is needed to insert the prefetches into the code, and that the new prefetch instructions incur some execution overhead. The advantages of software-controlled prefetching are that only a small amount of hardware support is necessary, and that a broader class of reference patterns can be covered than simply constant-stride accesses (e.g., indirect references, such as in sparse-matrix code).
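An indirect reference of the kind found in sparse-matrix code might be prefetched as in the following sketch. It assumes the GCC/Clang builtin `__builtin_prefetch` as the prefetch instruction; the function name `gather_sum` and the value of `PREFETCH_DISTANCE` are hypothetical, and a real distance would have to be tuned to the memory latency and the work per iteration.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* assumed value: iterations to run ahead */

/* Sum values[index[i]] for i in [0, n), prefetching the indirect
 * target a fixed number of iterations before it is used. */
double gather_sum(const double *values, const size_t *index, size_t n)
{
    assert(n == 0 || (values != NULL && index != NULL));
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Reading index[i + PREFETCH_DISTANCE] is an ordinary load and
         * must stay in bounds; the prefetch itself does not fault. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 3);

        total += values[index[i]];
    }
    return total;
}
```

The extra comparison, index load, and prefetch in every iteration are an instance of the execution overhead mentioned above, while the data-dependent addresses are a pattern that a constant-stride detector would not cover.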