In this chapter, we explored three major sets of architectural issues
related to prefetching: basic architectural support for prefetching,
possible enhancements to achieve even larger gains through prefetching,
and alternative latency-hiding techniques that require hardware support.
We now briefly summarize each section below.
The key issues in providing basic architectural support for prefetching
are the following:
- Unlike normal loads, prefetches are non-binding, non-blocking, and non-excepting. By giving prefetch instructions
unique opcodes, the bits normally used to specify a destination register
can be used instead for prefetching hints (e.g., exclusive-mode vs. shared-mode prefetches).
- While deciding when to drop prefetches is a complex issue, the best
policy appears to be to drop a prefetch when it misses in the TLB, but
not when the prefetch issue buffer is full (given selective prefetching).
- A prefetch access should check the caches on its way to memory, and
the fetched data should be placed directly in the primary cache.
- The main hardware support necessary for prefetching is a
lockup-free cache. Supporting up to four outstanding misses is useful,
and buffering more than four outstanding prefetch requests offers only
a limited advantage.
The following are the major issues in achieving even larger gains through
prefetching:
- Profiling feedback can potentially improve performance, but has
some important drawbacks. A more attractive technique for exploiting
dynamic information may be to generate adaptive code. Both of these
techniques will benefit from user-visible hardware miss counters.
- Providing associativity through either set-associative caches or
victim caches is an important step toward dealing with cache conflicts.
Neither of these solutions is as attractive as eliminating the problem
altogether.
- To avoid excessive prefetching overhead, the compiler must be
careful to avoid register spilling. To reduce overheads further, block prefetches may be useful when spatial locality exists.
Finally, prefetching compares with other latency-hiding techniques as follows:
- Software-controlled prefetching appears to be superior to
hardware-controlled prefetching, since the software approach results in
better coverage and requires less hardware support.
- Prefetching and relaxed consistency models are complementary.
Relaxed consistency models eliminate write latency, and prefetching
addresses the remaining read latency.
- The interaction between prefetching and multithreading is
complex; the two techniques appear to be complementary only with very
small numbers of contexts.