We now discuss some specific changes to the hardware that are necessary to
support prefetching. First, the processor obviously must be able to
decode and process the new prefetch instructions, as described in Section
. The main complications are ensuring that these instructions are
non-blocking and that they are harmlessly dropped whenever the prefetch
address is invalid.
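These two properties survive in modern instruction sets: a hardware prefetch hint neither stalls the pipeline nor faults on a bad address. As a minimal illustration (not part of the architecture described here), GCC and Clang expose such instructions through `__builtin_prefetch`; the prefetch distance of 16 elements below is an arbitrary illustrative choice, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead.
 * __builtin_prefetch compiles to a non-blocking prefetch instruction;
 * near the end of the loop the prefetched address runs past the array,
 * and the hardware harmlessly drops the request rather than faulting,
 * mirroring the semantics described above. (Computing the out-of-range
 * address is formally out-of-bounds pointer arithmetic in C; it is
 * shown here only to illustrate the non-faulting behavior.) */
static long sum_with_prefetch(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], /* rw = read */ 0,
                           /* temporal locality hint */ 3);
        sum += a[i];
    }
    return sum;
}
```

Because the prefetch is only a hint, the loop computes exactly the same result with or without it; it can affect performance but never correctness.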
It is important to realize, however, that hardware support for prefetching
does not end with just adding prefetch instructions to the instruction set. It
is essential that the bandwidth of the memory hierarchy be increased to support
the extra demand imposed by prefetching. An important step toward increasing
memory hierarchy bandwidth is allowing multiple outstanding cache misses, which
is referred to as having a lockup-free cache [45]. This added bandwidth makes it possible to hide latency by
overlapping memory accesses with other memory accesses, not just
computation.
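A lockup-free cache tracks each in-flight miss in a miss status holding register (MSHR): a primary miss allocates an MSHR and sends a request to memory, a secondary miss to the same line merges with the existing entry, and only when every MSHR is busy must a reference stall. The following sketch illustrates that bookkeeping; the structure names, the four-entry MSHR file, and the 64-byte line size are illustrative assumptions, not parameters of the architecture studied here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4                  /* max outstanding misses (assumed) */
#define LINE_MASK (~(uint64_t)63)    /* 64-byte cache lines (assumed) */

typedef struct {
    bool     valid;      /* is this entry tracking an in-flight miss? */
    uint64_t line_addr;  /* block address of the outstanding miss */
} Mshr;

typedef struct {
    Mshr mshr[NUM_MSHRS];
} Cache;

/* Record a cache miss. Returns:
 *   0 = secondary miss, merged with an outstanding request;
 *   1 = primary miss, new MSHR allocated (request sent to memory);
 *  -1 = all MSHRs busy: a blocking cache would stall on every miss,
 *       whereas a lockup-free cache stalls only in this case. */
int miss_issue(Cache *c, uint64_t addr) {
    uint64_t line = addr & LINE_MASK;
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (c->mshr[i].valid && c->mshr[i].line_addr == line)
            return 0;                    /* merge with in-flight miss */
        if (!c->mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                       /* structural stall */
    c->mshr[free_slot].valid = true;
    c->mshr[free_slot].line_addr = line;
    return 1;
}

/* Memory returned the line: retire the matching MSHR. */
void miss_complete(Cache *c, uint64_t addr) {
    uint64_t line = addr & LINE_MASK;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (c->mshr[i].valid && c->mshr[i].line_addr == line) {
            c->mshr[i].valid = false;
            return;
        }
    }
}
```

The merging case is what lets a prefetch followed by the load it covers generate only one memory request, and the number of MSHRs bounds how many misses can be overlapped at once.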
This subsection is organized as follows. We begin by discussing issues associated with implementing a lockup-free cache, and relate them to our uniprocessor architecture. We then evaluate the performance tradeoffs for two key parameters in the lockup-free cache design: the number of outstanding misses, and the number of prefetch issue buffer entries. Finally, we compare the performance of separate write and prefetch issue buffers with the performance of a combined buffer.