We now discuss some specific changes to the hardware that are necessary to support prefetching. First, the processor obviously must be able to decode and process the new prefetch instructions, as described in Section . The main complications are ensuring that prefetch instructions are non-blocking and that they are dropped harmlessly whenever the prefetch address is invalid.
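The two requirements above can be illustrated with a toy software model. This is only an illustrative sketch, not the hardware described in this work; the class name, the page-validity check, and the page size are all hypothetical, and real hardware would perform the translation check in the TLB.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class PrefetchUnit:
    """Toy model of the two rules: a prefetch never stalls the
    processor, and an invalid prefetch address is silently dropped
    rather than raising a fault."""

    def __init__(self, valid_pages):
        self.valid_pages = valid_pages  # hypothetical set of mapped page numbers
        self.issued = []                # addresses handed to the memory system

    def prefetch(self, addr):
        """Non-blocking: always returns immediately; never raises."""
        if addr // PAGE_SIZE not in self.valid_pages:
            return False                # invalid address: dropped harmlessly
        self.issued.append(addr)        # otherwise the cache fill proceeds in background
        return True

pu = PrefetchUnit(valid_pages={0, 1})
pu.prefetch(0x100)    # page 0 is mapped: prefetch is issued
pu.prefetch(0x9000)   # page 9 is unmapped: dropped, no exception
```

The key point is that the invalid case returns normally instead of trapping, so the compiler can issue prefetches speculatively without risking spurious faults.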
It is important to realize, however, that hardware support for prefetching does not end with just adding prefetch instructions to the instruction set. It is essential that the bandwidth of the memory hierarchy be increased to support the extra demand imposed by prefetching. An important step toward increasing memory hierarchy bandwidth is allowing multiple outstanding cache misses, which is referred to as having a lockup-free cache [45]. This added bandwidth makes it possible to hide latency by overlapping memory accesses with other memory accesses, not just computation.
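The mechanism behind a lockup-free cache can be sketched as follows. This is a minimal model in the spirit of the miss-status registers of Kroft's design [45], not the actual cache studied here; the class and method names, the merging policy, and the register count are assumptions for illustration.

```python
class LockupFreeCache:
    """Toy lockup-free cache: up to max_outstanding cache misses may be
    in flight at once, so later accesses need not stall behind an
    earlier miss, and misses overlap with one another."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = {}       # line address -> requests waiting on that fill

    def access_miss(self, line_addr, req_id):
        """Returns True if the miss is accepted without stalling."""
        if line_addr in self.outstanding:
            self.outstanding[line_addr].append(req_id)  # merge with in-flight miss
            return True
        if len(self.outstanding) >= self.max_outstanding:
            return False            # all miss registers busy: this access stalls
        self.outstanding[line_addr] = [req_id]          # start a new fill
        return True

    def fill_complete(self, line_addr):
        """Memory returned the line; free the miss register."""
        return self.outstanding.pop(line_addr, [])

cache = LockupFreeCache(max_outstanding=2)
cache.access_miss(0x040, "load A")                  # first miss: fill begins
cache.access_miss(0x080, "load B")                  # second miss overlaps the first
stalled = not cache.access_miss(0x0C0, "load C")    # both registers busy: must stall
```

A blocking cache corresponds to `max_outstanding = 1`: every miss would stall all subsequent accesses, which is exactly what defeats the overlap that prefetching relies on.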
This subsection is organized as follows. We begin by discussing issues associated with implementing a lockup-free cache, and relate them to our uniprocessor architecture. We then evaluate the performance tradeoffs for two key parameters in the lockup-free cache design: the number of outstanding misses, and the number of prefetch issue buffer entries. Finally, we compare the performance of separate write and prefetch issue buffers with the performance of a combined buffer.