Techniques for coping with memory latency are essential to achieve high
processor utilization. Such techniques will become increasingly important
in the future as the gap between processor and memory speeds continues to
widen. As we discussed earlier, the best approach to dealing with
latency is first to reduce it as much as possible, through techniques
such as caching and locality optimizations, and then to tolerate
whatever latency remains. The latency of writes can be hidden by
buffering and pipelining the write accesses, which is accomplished in
shared-memory multiprocessors through relaxed consistency models.
Read latency, however, can be addressed effectively only through
prefetching or multithreading. Of these two techniques, software-controlled
prefetching appears to be more attractive because it can speed up a single
thread of execution, and because it requires much simpler hardware than
multithreading. This dissertation has addressed the open question of
how effective software-controlled prefetching can be in practice. We have
addressed this question by proposing and implementing a new algorithm for
inserting prefetches into array-based scientific and engineering codes.
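To make this concrete, the sketch below shows the kind of transformation
such an algorithm performs on a simple array loop. It is a minimal
illustration, assuming a GCC-style __builtin_prefetch intrinsic and a
placeholder prefetch distance; the compiler described in this dissertation
inserts prefetch instructions automatically rather than requiring
source-level annotations like these.

    #include <stddef.h>

    /* Illustrative prefetch distance: roughly the memory latency
     * divided by the work per loop iteration.  16 is a placeholder. */
    #define PREFETCH_DIST 16

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Non-binding prefetch of the element needed
             * PREFETCH_DIST iterations from now, overlapping the
             * fetch with the intervening computation.  Prefetches
             * are hints and do not fault, so running slightly past
             * the end of the array is harmless in practice. */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
            sum += a[i];
        }
        return sum;
    }

Issuing the prefetch unconditionally keeps the loop body simple, at the
cost of a few wasted prefetches past the end of the array; the loop
splitting performed by the compiler algorithm removes even that waste.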
The key results of this dissertation are the following:
- Software-controlled prefetching can be quite effective at
tolerating memory latency in scientific and engineering applications on
both uniprocessor and large-scale multiprocessor architectures. In all
but a few cases, 50% to 90% of the original memory stall time is
eliminated, which translates into improvements in overall performance of
over 45% for a majority of the applications we studied. In several
cases, overall performance improved by a factor of two.
- The compiler can do a very good job of inserting prefetches
into code automatically, and it can cover a wide domain of
scientific and engineering applications. Locality analysis is
successful at predicting exactly which references should be
prefetched, loop splitting techniques help minimize prefetching
overhead, and software pipelining is effective at scheduling
prefetches to hide memory latency (see the sketch following this
list). We demonstrated the effectiveness of our
algorithm through a full compiler implementation and detailed performance
studies. The success of the compiler algorithm is encouraging, since it
relieves the programmer from the burden of inserting prefetches manually.
- Prefetching is complementary to other latency-hiding techniques,
including locality optimizations and relaxed consistency models.
Locality optimizations complement prefetching by reducing the number
of cache misses (thus reducing the resulting prefetching overhead), and
prefetching hides much of the remaining latency. Similarly, relaxed
consistency models complement prefetching by completely hiding write
latency, and prefetching addresses the remaining read latency.
- Latency-hiding techniques requiring expensive hardware support
(e.g., hardware-controlled prefetching, multithreading) do not appear
to be necessary for the classes of applications considered in this
study.
- Since prefetching can improve performance only if additional
bandwidth is available in the memory subsystem, it is essential that
the hardware provide this additional bandwidth through techniques
such as lockup-free caches. We observed that supporting up to four
outstanding cache misses can improve performance substantially.
Providing sufficient memory bandwidth should therefore be a primary
focus of hardware design.
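The sketch promised above illustrates the loop splitting and software
pipelining steps of the compiler algorithm, applied to the same
summation loop shown earlier. A prologue issues the initial prefetches,
a steady-state loop overlaps prefetching with computation, and an
epilogue finishes the final iterations without prefetching past the end
of the array. The names and the prefetch distance are illustrative, not
actual compiler output.

    #include <stddef.h>

    #define PREFETCH_DIST 16   /* illustrative software-pipeline depth */

    double sum_array_split(const double *a, size_t n)
    {
        double sum = 0.0;
        size_t i = 0;
        size_t steady_end = (n > PREFETCH_DIST) ? n - PREFETCH_DIST : 0;

        /* Prologue: issue prefetches for the first PREFETCH_DIST
         * iterations before any computation begins. */
        for (size_t j = 0; j < PREFETCH_DIST && j < n; j++)
            __builtin_prefetch(&a[j], 0, 3);

        /* Steady state: compute iteration i while prefetching the
         * data needed PREFETCH_DIST iterations later.  Splitting
         * the loop removes the bounds check that would otherwise
         * guard every prefetch. */
        for (; i < steady_end; i++) {
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
            sum += a[i];
        }

        /* Epilogue: the final iterations need no prefetches. */
        for (; i < n; i++)
            sum += a[i];

        return sum;
    }

A production version would also exploit spatial locality by issuing one
prefetch per cache line rather than per element, which is precisely the
kind of refinement that the locality analysis and prefetch predicates
described above provide.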