Techniques for coping with memory latency are essential to achieve high
processor utilization. Such techniques will become increasingly important
in the future as the gap between processor and memory speeds continues to
widen. As we discussed earlier, the best approach to dealing with
latency is first to reduce it as much as possible, through techniques
such as caching and locality optimizations, and then to tolerate
whatever latency remains. The latency of writes can be hidden by
buffering and pipelining the write accesses, which is accomplished in
shared-memory multiprocessors through relaxed consistency models.
Read latency, however, can be addressed effectively only through
prefetching or multithreading. Of these two techniques, software-controlled
prefetching appears to be more attractive because it can speed up a single
thread of execution, and because it requires much simpler hardware than
multithreading. This dissertation has addressed the open question of
how effective software-controlled prefetching can be in practice. We have
addressed this question by proposing and implementing a new algorithm for
inserting prefetches into array-based scientific and engineering codes.
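To make this concrete, the sketch below shows the kind of transformation
such an algorithm performs on a simple array loop. It is a minimal
illustration, assuming a GCC-style __builtin_prefetch intrinsic and a
placeholder prefetch distance; the compiler described in this dissertation
inserts prefetch instructions automatically rather than requiring
source-level annotations like these.

    #include <stddef.h>

    /* Illustrative prefetch distance: roughly the memory latency
     * divided by the work per loop iteration.  16 is a placeholder. */
    #define PREFETCH_DIST 16

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Non-binding prefetch of the element needed
             * PREFETCH_DIST iterations from now, overlapping the
             * fetch with the intervening computation.  Prefetches
             * are hints and do not fault, so running slightly past
             * the end of the array is harmless in practice. */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
            sum += a[i];
        }
        return sum;
    }

Issuing the prefetch unconditionally keeps the loop body simple, at the
cost of a few wasted prefetches past the end of the array; the loop
splitting performed by the compiler algorithm removes even that waste.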
The key results of this dissertation are the following:
- Software-controlled prefetching can be quite effective at
tolerating memory latency in scientific and engineering applications on
both uniprocessor and large-scale multiprocessor architectures. In all
but a few cases, 50% to 90% of the original memory stall time is
eliminated, which translates into improvements in overall performance of
over 45% for a majority of the applications we studied. In several
cases, overall performance improved by a factor of two.
- The compiler can do a very good job of inserting prefetches
into code automatically, and it can cover a wide domain of
scientific and engineering applications. Locality analysis is
successful at predicting exactly which references should be
prefetched, loop splitting techniques help minimize prefetching
overhead, and software pipelining is effective at scheduling
prefetches to hide memory latency (see the sketch following this
list). We demonstrated the effectiveness of our
algorithm through a full compiler implementation and detailed performance
studies. The success of the compiler algorithm is encouraging, since it
relieves the programmer from the burden of inserting prefetches manually.
- Prefetching is complementary to other latency-hiding techniques,
including locality optimizations and relaxed consistency models.
Locality optimizations complement prefetching by reducing the number
of cache misses (thus reducing the resulting prefetching overhead), and
prefetching hides much of the remaining latency. Similarly, relaxed
consistency models complement prefetching by completely hiding write
latency, and prefetching addresses the remaining read latency.
- Latency-hiding techniques requiring expensive hardware support
(e.g., hardware-controlled prefetching, multithreading) do not appear
to be necessary for the classes of applications considered in this
study.
- Since prefetching can improve performance only if additional
bandwidth is available in the memory subsystem, it is essential that
the hardware provide this additional bandwidth through techniques
such as lockup-free caches. We observed that supporting up to four
outstanding cache misses can improve performance substantially.
Providing sufficient memory bandwidth should therefore be a primary
focus of hardware design.
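The sketch promised above illustrates the loop splitting and software
pipelining steps of the compiler algorithm, applied to the same
summation loop shown earlier. A prologue issues the initial prefetches,
a steady-state loop overlaps prefetching with computation, and an
epilogue finishes the final iterations without prefetching past the end
of the array. The names and the prefetch distance are illustrative, not
actual compiler output.

    #include <stddef.h>

    #define PREFETCH_DIST 16   /* illustrative software-pipeline depth */

    double sum_array_split(const double *a, size_t n)
    {
        double sum = 0.0;
        size_t i = 0;
        size_t steady_end = (n > PREFETCH_DIST) ? n - PREFETCH_DIST : 0;

        /* Prologue: issue prefetches for the first PREFETCH_DIST
         * iterations before any computation begins. */
        for (size_t j = 0; j < PREFETCH_DIST && j < n; j++)
            __builtin_prefetch(&a[j], 0, 3);

        /* Steady state: compute iteration i while prefetching the
         * data needed PREFETCH_DIST iterations later.  Splitting
         * the loop removes the bounds check that would otherwise
         * guard every prefetch. */
        for (; i < steady_end; i++) {
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
            sum += a[i];
        }

        /* Epilogue: the final iterations need no prefetches. */
        for (; i < n; i++)
            sum += a[i];

        return sum;
    }

A production version would also exploit spatial locality by issuing one
prefetch per cache line rather than per element, which is precisely the
kind of refinement that the locality analysis and prefetch predicates
described above provide.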