In contrast to locality optimizations and relaxed consistency models, which are clearly complementary to prefetching, the interaction between multithreading and prefetching is more complex. This is partly because both techniques pursue the same goal: hiding read latency. Consequently, if either technique is highly successful, there is little benefit to adding the other. For example, we observed that with four contexts (usually sufficient to hide most of the latency), adding prefetching provides little benefit, and in fact often degrades performance. On the other hand, we saw cases where prefetching outperforms multithreading because the applications do not scale well to large numbers of processes. Perhaps the most surprising of these cases is PTHOR: although its highly irregular access patterns made prefetches very difficult to insert, it performs even worse with multithreading because it contains little additional task-level parallelism.
Prefetching and multithreading appear to be complementary when there are only two contexts: in all but two cases, prefetching improved with the addition of a second context, and in all but one case, two contexts improved with the addition of prefetching. In such cases, prefetching boosts the effectiveness of the small number of contexts by increasing the hit rate and thus the interval between context switches. At the same time, multithreading improves prefetching performance by hiding the latency of misses that are not prefetched. Therefore, if hardware costs dictate that only a very small number of contexts can be supported (too few to fully hide memory latency), the most attractive solution may be to combine multithreading and prefetching.
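The intuition behind this complementarity can be captured with a back-of-the-envelope model. The sketch below is purely illustrative: the miss rate, miss latency, context-switch cost, and prefetch coverage are hypothetical parameters chosen for clarity, not measurements from the study, and the model idealizes a context switch as fully hiding the remaining miss latency.

```python
def stall_per_reference(miss_rate, miss_latency, contexts,
                        switch_cost, prefetch_coverage):
    """Toy estimate of processor stall cycles per memory reference.

    Illustrative model only; all parameters are assumptions, not
    measurements. Prefetched misses are assumed to stall zero cycles.
    """
    # Only misses not covered by prefetching stall the processor.
    residual_misses = miss_rate * (1.0 - prefetch_coverage)
    if contexts > 1:
        # With another ready context, a miss costs only the switch
        # overhead (idealized: the other thread hides the remainder).
        cost_per_miss = min(miss_latency, switch_cost)
    else:
        cost_per_miss = miss_latency
    return residual_misses * cost_per_miss

# Hypothetical parameters: 10% miss rate, 100-cycle miss, 10-cycle switch.
base     = stall_per_reference(0.10, 100, 1, 10, 0.0)  # neither technique
pf_only  = stall_per_reference(0.10, 100, 1, 10, 0.5)  # prefetch half the misses
mt_only  = stall_per_reference(0.10, 100, 2, 10, 0.0)  # second context, no prefetch
combined = stall_per_reference(0.10, 100, 2, 10, 0.5)  # both together
```

Under these assumed numbers the combination stalls less than either technique alone: prefetching reduces how often a switch is needed (lengthening the run between switches), while the second context cheapens the misses that remain, which is exactly the interaction described above.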