
Dropping Prefetches vs. Stalling

In the architectural model presented so far, the memory subsystem has a finite (16-entry) prefetch issue buffer to hold outstanding prefetch requests. Our model includes a few hardware optimizations to help mitigate the negative effects of the finite buffer size. In particular, a prefetch is inserted into the buffer only if it misses in the primary cache and there is not already an outstanding prefetch for the same cache line. Also, a prefetch is removed from the issue buffer as soon as it completes (i.e., the buffer is not a FIFO queue). In spite of these optimizations, however, the buffer may still fill up if the processor issues prefetches faster than the memory subsystem can service them.
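The filtering rules just described can be summarized in a small sketch. This is only an illustration, not the simulator used for the experiments: the class name, method names, and status strings are hypothetical, and the cache-hit test is passed in as a flag rather than modeled.

```python
class PrefetchIssueBuffer:
    """Toy model of a finite prefetch issue buffer (hypothetical names)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.outstanding = set()  # cache lines with a prefetch in flight

    def try_insert(self, line, hits_primary_cache):
        """Apply the filtering rules and report what happened to the prefetch."""
        if hits_primary_cache:
            return "filtered_hit"        # data already cached: no entry needed
        if line in self.outstanding:
            return "filtered_duplicate"  # same line is already being fetched
        if len(self.outstanding) >= self.capacity:
            return "full"                # caller must either stall or drop
        self.outstanding.add(line)
        return "inserted"

    def complete(self, line):
        """Free the entry as soon as its prefetch completes, in any order."""
        self.outstanding.discard(line)
```

Because entries are kept in a set and freed individually, a slow prefetch does not block the release of entries for prefetches that complete after it, which is the sense in which the buffer is not a FIFO queue.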

Once the prefetch issue buffer is full, the processor is unable to issue further prefetches. The model we have used so far stalls the processor until a buffer entry becomes available. An alternative is simply to drop the prefetch and continue executing. Intuitive arguments can be made to support either approach. On one hand, if the data is needed in the future and is not presently in the cache (since only prefetches that miss go into the buffer), it may appear to be cheaper to stall now until a single entry is free than to suffer an entire cache miss sometime in the future. On the other hand, since a prefetch is only a performance hint, perhaps it is better to continue executing useful instructions.
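To make the two policies concrete, the following toy simulation contrasts them. It is a sketch under strong assumptions that are not taken from the experiments: the processor attempts one prefetch per cycle, every prefetch misses in the cache and targets a distinct line, and each prefetch completes a fixed `latency` cycles after entering the buffer. The function name and parameters are hypothetical. The sketch counts only issue-side cycles and dropped prefetches; it does not charge for the cache misses that dropped prefetches may cause later.

```python
from collections import deque


def simulate(num_prefetches, policy, capacity=16, latency=40):
    """Return (cycles, dropped) for a burst of prefetches under one policy."""
    assert policy in ("stall", "drop")
    # Completion times of outstanding prefetches. With a fixed latency they
    # complete in issue order, so the oldest entry is always at the left.
    in_flight = deque()
    cycle = 0
    dropped = 0
    for _ in range(num_prefetches):
        # Free every entry whose prefetch has already completed.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
        if len(in_flight) >= capacity:
            if policy == "drop":
                dropped += 1  # discard the hint and keep executing
                cycle += 1
                continue
            # Stall until the oldest outstanding prefetch completes.
            cycle = in_flight.popleft()
        in_flight.append(cycle + latency)
        cycle += 1
    return cycle, dropped
```

For example, comparing `simulate(100, "stall", capacity=4, latency=10)` with `simulate(100, "drop", capacity=4, latency=10)` shows the trade-off in miniature: stalling issues every prefetch but takes more cycles, while dropping finishes in 100 cycles at the cost of lost prefetches. When the buffer never fills (for instance `capacity=16, latency=10`), the two policies behave identically.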

To understand this issue, we ran each of our benchmarks again using a model in which prefetches are dropped, rather than the processor being stalled, when the prefetch issue buffer is full. The results of this experiment are shown in Figure 7. Comparing this with Figure 4, we see that there is a difference in performance for seven of the cases (CFFT2D, CHOLSKY, BTRIX, GMTRY, VPENTA, TOMCATV, and CG). In each of these cases, the performance of the indiscriminate prefetching algorithm is improved by dropping prefetches. The improvement is dramatic in the two cases that had previously stalled the most due to full buffers (CFFT2D and CG). The selective prefetching algorithm, however, does not improve from dropping prefetches, since it suffered very little from full prefetch issue buffers in the first place. In fact, in three of the cases (CHOLSKY, BTRIX, and GMTRY), the selective algorithm performs slightly worse when prefetches are dropped. Dropping prefetches has the effect of sacrificing some amount of coverage (and therefore memory stall reduction) for the sake of reducing prefetch issue overhead. This effect is most clearly illustrated in the case of CG (compare the I bars in Figures 4 and 7), where memory stall time doubles for the indiscriminate algorithm once prefetches are dropped.

There are two reasons why performance improves substantially for the indiscriminate prefetching algorithm. The first is that dropping prefetches increases the chances that future prefetches will find open slots in the prefetch issue buffer. The second is that, since the indiscriminate algorithm issues a larger number of redundant (i.e., unnecessary) prefetches, dropping a prefetch does not necessarily lead to a cache miss: the algorithm may issue another prefetch of the same line before the line is referenced. Since selective prefetching has eliminated much of this redundancy, dropping a prefetch is more likely to translate into a subsequent cache miss. However, as we have already seen in Figure 4, the selective algorithm tends to suffer very little from full issue buffers, and therefore performs well in either case.


Robert French