For our experiments in Chapter , we assume that if the processor attempts to issue a prefetch while the prefetch issue buffer is full, it stalls until an entry becomes available. We now consider what happens if prefetches are instead dropped when the prefetch issue buffer is full.
In the architectural model presented so far, the memory subsystem has a finite (16-entry) prefetch issue buffer to hold outstanding prefetch requests. In our model, a prefetch is inserted into the buffer only if it misses in the primary cache and there is not already an outstanding prefetch for the same cache line. A prefetch is removed from the issue buffer as soon as it completes (i.e. the buffer is not a FIFO queue); completions may be reordered due to variations in miss latencies. Despite these optimizations, the buffer may still fill up if the processor issues prefetches faster than the memory subsystem can service them.
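The buffer policy described above can be sketched as follows. This is an illustrative model only, not code from the simulator; the class and method names, the line size, and the exception are all assumptions made for the sketch.

```python
class BufferFullError(Exception):
    """Raised when no issue-buffer entry is available (policy decided elsewhere)."""
    pass

class PrefetchIssueBuffer:
    """Sketch of the 16-entry prefetch issue buffer described in the text.

    A prefetch occupies an entry only if it misses in the primary cache and
    no prefetch for the same cache line is already outstanding. Entries are
    freed on completion, in arbitrary order (the buffer is not a FIFO queue).
    """

    def __init__(self, capacity=16, line_size=32):
        self.capacity = capacity
        self.line_size = line_size
        self.outstanding = set()  # cache-line addresses of in-flight prefetches

    def line_of(self, addr):
        return addr // self.line_size

    def try_issue(self, addr, in_primary_cache):
        """Return True if the prefetch is accepted into the buffer."""
        line = self.line_of(addr)
        if in_primary_cache(addr):
            return False  # already cached: prefetch is unnecessary
        if line in self.outstanding:
            return False  # prefetch for this line already outstanding
        if len(self.outstanding) >= self.capacity:
            raise BufferFullError()  # caller decides: stall or drop
        self.outstanding.add(line)
        return True

    def complete(self, addr):
        # Completion frees the entry immediately, regardless of issue order.
        self.outstanding.discard(self.line_of(addr))
```

Note that the duplicate-suppression check is what makes redundant prefetches cheap: a second prefetch of an in-flight line consumes no buffer entry.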
Once the prefetch issue buffer is full, the processor is unable to issue further prefetches. At that point the processor can either stall until a buffer entry becomes available, or else drop the prefetch and continue executing. Intuitive arguments can be made for either approach. On the one hand, if the data is needed in the future and is not presently in the cache (since only prefetches that miss go into the buffer), it may appear cheaper to stall now until a single entry is free than to suffer an entire cache miss sometime in the future. On the other hand, since a prefetch is only a performance hint, perhaps it is better to continue executing useful instructions.
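The two policies differ only in how they handle a full buffer, as the following sketch makes explicit. The function and parameter names are illustrative, not part of the simulator; `wait_for_free_entry` stands in for whatever mechanism stalls the processor until a completion frees an entry.

```python
def issue_prefetch(outstanding, line, capacity, policy, wait_for_free_entry):
    """Attempt to issue a prefetch for `line` into a buffer of `capacity` entries.

    `outstanding` is the set of cache lines with in-flight prefetches.
    `policy` is "stall" or "drop"; it matters only when the buffer is full.
    """
    if line in outstanding:
        return "redundant"          # duplicate suppressed; no entry consumed
    if len(outstanding) >= capacity:
        if policy == "stall":
            # Stall policy: block until a completion frees an entry, then issue.
            wait_for_free_entry(outstanding)
            outstanding.add(line)
            return "issued-after-stall"
        # Drop policy: the prefetch is only a hint; keep executing. The line
        # may still be prefetched again later, or it may miss when referenced.
        return "dropped"
    outstanding.add(line)
    return "issued"
```

The drop policy trades a possible future cache miss for guaranteed forward progress now, which is precisely the trade-off the experiments below evaluate.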
To understand this issue, we ran each of our uniprocessor benchmarks again using a model where prefetches are dropped, rather than stalling the processor, when the prefetch issue buffer is full. We ran this model for both the indiscriminate and selective prefetching algorithms. Figure shows the cases where this affected performance. We begin by focusing on the indiscriminate algorithm, and then turn to the selective algorithm.
For all seven cases where the performance of the indiscriminate algorithm changed (shown in Figure (a)), the performance improved by dropping prefetches. The improvement is dramatic in the two cases that had previously stalled the most due to full buffers (CFFT2D and CG). There are two reasons why the performance improves substantially for the indiscriminate prefetching algorithm. The first reason is that dropping prefetches increases the chances that future prefetches will find open slots in the prefetch issue buffer. The second is that since the indiscriminate algorithm has a larger number of redundant (i.e. unnecessary) prefetches, dropping a prefetch does not necessarily lead to a cache miss. It is possible that the algorithm will issue a prefetch of the same line before the line is referenced. Dropping prefetches has the effect of sacrificing some amount of coverage (and therefore memory stall reduction) for the sake of reducing prefetch issue overhead. This effect is most clearly illustrated in the case of CG (see Figure (a)), where memory stall time doubles for the indiscriminate algorithm once prefetches are dropped.
The selective prefetching algorithm, in contrast, did not improve from dropping prefetches, since it suffered very little from full prefetch issue buffers in the first place. In fact, in the three cases shown in Figure (b), the selective algorithm performed slightly worse when prefetches were dropped. The reason is that because selective prefetching eliminates many of the redundant prefetches, a dropped prefetch is more likely to translate into a subsequent cache miss. However, as we have already seen in Figure , the selective algorithm tends to suffer very little from full issue buffers, and therefore performs well in either case.