Although relaxed consistency models are effective at eliminating write latency, they do not address the problem of read latency. While prefetching is one technique for hiding read latency, another is for the processor to support multiple hardware contexts [85][73][39][36][3] (also known as multithreading). As we mentioned earlier in Section , multithreading has two advantages over prefetching. First, it can handle arbitrarily complex access patterns, even cases where it is impossible to predict the accesses ahead of time (and therefore prefetching cannot succeed). This is because multithreading simply reacts to misses once they occur, rather than attempting to predict them; it tolerates latency by overlapping the miss latency of one context with the computation of the other concurrent contexts. The second advantage of multithreading is that it requires no software support (assuming the code is already parallelized), which, as we mentioned in the previous section, matters only if the user is unwilling or unable to recompile old code. Multithreading has three limitations: (i) it relies on additional concurrency within an application, which may not exist; (ii) some amount of time is lost when switching between contexts; and (iii) minimizing the context-switching overhead requires a significant amount of hardware support. In this section, we evaluate multithreading and explore its interactions with software-controlled prefetching.
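To make the switch-on-miss behavior concrete, the following sketch models a processor that runs each context until it takes a cache miss, services the miss in the background, and round-robins to the next ready context. It is only an illustration of the policy described above; the constants (4 contexts, 50-cycle run lengths, 100-cycle miss latency, 4-cycle switch) are assumed values for illustration, not parameters of our simulator.

```c
#include <stdio.h>

/* Illustrative sketch of the switch-on-miss policy: run a context until it
 * misses, let the miss be serviced in the background, and round-robin to the
 * next ready context.  All constants are assumed values for illustration. */

#define NCTX       4     /* hardware contexts per processor   */
#define RUN_LEN   50     /* cycles of work between misses     */
#define MISS_LAT 100     /* cache-miss latency in cycles      */
#define SWITCH     4     /* context-switch overhead in cycles */

int main(void)
{
    long ready_at[NCTX] = { 0 };   /* cycle at which each context's miss completes */
    long clock = 0, busy = 0;
    int cur = 0;

    for (int step = 0; step < 1000; step++) {
        int next = -1;
        for (int i = 0; i < NCTX; i++) {       /* find the next ready context */
            int c = (cur + i) % NCTX;
            if (ready_at[c] <= clock) { next = c; break; }
        }
        if (next >= 0) {
            clock += RUN_LEN;                  /* compute until a cache miss     */
            busy  += RUN_LEN;
            ready_at[next] = clock + MISS_LAT; /* miss overlaps with other work  */
            clock += SWITCH;                   /* pay the switch overhead        */
            cur = (next + 1) % NCTX;
        } else {
            clock++;                           /* every context is stalled: idle */
        }
    }
    printf("processor utilization: %.2f\n", (double)busy / clock);
    return 0;
}
```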
The performance improvement offered by multithreading depends on several factors. First, there is the number of contexts. With more contexts available, the processor is less likely to run out of ready-to-run contexts. However, the number of contexts is constrained by hardware costs and by the parallelism available in the application. Previous studies have shown that, given processor caches, the interval between long-latency operations (i.e., cache misses) becomes fairly large, allowing just a handful of contexts to hide most of the latency [85]. The second factor is the context switch overhead. If the overhead is a sizable fraction of the typical run length (the time between misses), a significant fraction of time may be wasted switching contexts; shorter context switch times, however, require a more complex processor. Third, performance depends on application behavior. Applications with clustered misses and irregular miss latencies make it difficult to completely overlap the computation of one context with the memory accesses of other contexts, so multithreaded processors achieve lower processor utilization on these programs than on applications with more regular miss behavior. Finally, the multiple contexts themselves affect the performance of the memory subsystem. The different contexts share a single processor cache and can interfere with each other, both constructively (by effectively prefetching another context's working set) and destructively (by displacing another context's working set). Also, as with release consistency and prefetching, multithreading loads the memory system more heavily, and thus latencies may increase.
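These factors can be combined into a simple back-of-the-envelope model (a sketch only, not the simulation methodology used in this study): with n contexts, run length r, miss latency l, and switch overhead c, the processor either has enough contexts to cover a miss, in which case only the switch overhead is lost, or it idles waiting for misses to complete. The run length and latency values below are assumed for illustration.

```c
#include <stdio.h>

/* Back-of-the-envelope utilization model for a multithreaded processor.
 *   n = number of hardware contexts     r = run length between misses (cycles)
 *   l = cache-miss latency (cycles)     c = context-switch overhead (cycles)
 * If the n contexts supply enough work to cover one context's miss latency,
 * the processor saturates and only the switch overhead is lost; otherwise it
 * idles waiting for misses to complete. */
static double utilization(int n, double r, double l, double c)
{
    if (n * (r + c) >= r + l)
        return r / (r + c);        /* saturated: limited by switch overhead */
    return n * r / (r + l);        /* unsaturated: limited by miss latency  */
}

int main(void)
{
    /* Assumed 50-cycle run lengths and 100-cycle latency, with the two
     * switch overheads considered in this study.                         */
    for (int n = 1; n <= 4; n *= 2)
        printf("%d context(s): util = %.2f (4-cycle switch), %.2f (16-cycle switch)\n",
               n, utilization(n, 50, 100, 4), utilization(n, 50, 100, 16));
    return 0;
}
```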
In this study, we use processors with two and four contexts. We do not consider more contexts per processor because 16 four-context processors already require 64 parallel threads, and some of our applications do not achieve very good speedup with that many threads. We use two different context switch overheads: 4 and 16 cycles. A 4-cycle overhead corresponds to flushing and reloading a short RISC pipeline when switching to the new instruction stream, while a 16-cycle overhead corresponds to a less aggressive implementation. Our study also includes additional buffers to avoid thrashing and deadlock when two contexts try to read distinct memory lines that map to the same cache line. All of these experiments assume a release consistency (RC) model.
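For reference, the sketch below simply enumerates the multithreading configurations implied by these parameters and the total number of parallel threads each requires; it restates the values given in the text rather than describing the simulator itself.

```c
#include <stdio.h>

/* Enumerates the multithreading configurations evaluated in this section:
 * 16 processors, 2 or 4 contexts per processor, and a 4- or 16-cycle
 * context-switch overhead, all under release consistency. */
int main(void)
{
    const int processors    = 16;
    const int contexts[]    = { 2, 4 };
    const int switch_cost[] = { 4, 16 };

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("%d contexts/processor, %2d-cycle switch -> %d threads total\n",
                   contexts[i], switch_cost[j], processors * contexts[i]);
    return 0;
}
```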