Given that each prefetch brings a single line of data into the cache, the minimum amount of prefetch overhead (assuming perfect miss coverage) is one additional instruction for each cache miss. To reduce the overhead below this, the prefetch instructions must specify larger amounts of data to be fetched. For example, a prefetch could be defined to fetch multiple consecutive cache lines, rather than just a single cache line. We will refer to these multi-line prefetches as ``block prefetches''.
Block prefetching is advantageous when there is spatial locality. In such cases a single block prefetch supplants the multiple prefetches for the individual cache lines. For example, if a block prefetch fetches four cache lines, it can eliminate up to 75% of the prefetch instruction overhead, since one prefetch instruction replaces four. Although block prefetches may be helpful when there is spatial locality, they may hurt performance in the absence of spatial locality by displacing useful data and wasting memory bandwidth. This negative effect has been observed in previous studies where additional consecutive cache lines were automatically prefetched by the hardware. Another potential downside is that bringing the additional lines into the cache earlier increases their chance of being displaced before use.
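The 75% figure follows from a simple count. As a sketch, let $B$ denote the number of cache lines fetched by one block prefetch (the symbol $B$ is introduced here for illustration and does not appear in the original text). Under perfect spatial locality, one instruction replaces $B$ single-line prefetches, so the fraction of prefetch instructions eliminated is:

```latex
\[
  \frac{B - 1}{B} \;=\; 1 - \frac{1}{B},
  \qquad\text{so for } B = 4:\quad \frac{3}{4} = 75\%.
\]
```

This is an upper bound: it is reached only if every one of the $B$ lines in each block would otherwise have needed its own prefetch.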
Therefore, rather than defining all prefetches to be block prefetches, a more attractive approach is for the compiler to select between single-line prefetches and block prefetches based on whether the associated references enjoy spatial locality. Figure illustrated how both types of prefetches could be encoded in the instruction set architecture through the prefetching hint field. Incorporating block prefetches into the compiler algorithm in this manner is straightforward: block prefetches are used whenever there is spatial locality, and single-line prefetches are used otherwise. When it schedules block prefetches, the compiler adjusts the line size parameter to match the block size, which in turn modifies the modulo factors in the prefetch predicates involving spatial locality. Note that as block prefetching increases the number of iterations between prefetches, it will eventually become more attractive to use strip mining (as described earlier in Section ) rather than unrolling to do the loop splitting, since the negative effects of loop unrolling dominate once a loop is unrolled too many times.
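The effect of the line size parameter on the modulo factor of a prefetch predicate can be sketched in C. This is an illustration only, not the compiler's actual output. The 32-byte line size, the four-line block, the `prefetch_stub` function (which stands in for a prefetch instruction and merely counts calls), and the `distance` parameter are all assumptions made for the example. The predicate is left as an explicit test inside the loop, before any loop splitting by unrolling or strip mining.

```c
#include <stddef.h>

/* Hypothetical parameters, chosen only for illustration. */
enum { LINE_BYTES = 32, LINES_PER_BLOCK = 4 };

/* Counts how many prefetch "instructions" were issued. */
static size_t prefetches_issued = 0;

/* Stand-in for a prefetch instruction with a hint selecting how many
 * consecutive lines to fetch; here it only counts invocations. */
static void prefetch_stub(const void *addr, int lines)
{
    (void)addr;
    (void)lines;
    prefetches_issued++;
}

/* Sums a[0..n) and issues one prefetch whenever the predicate
 * (i % period == 0) holds. The modulo factor `period` is the number of
 * array elements covered by one prefetch, so it grows with the number
 * of lines fetched per prefetch. `distance` is the (assumed) number of
 * elements to prefetch ahead. */
static double sum_with_prefetch(const double *a, size_t n,
                                int lines_per_prefetch, size_t distance)
{
    size_t period = (size_t)lines_per_prefetch * LINE_BYTES / sizeof(double);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Prefetch predicate: true once per line (or once per block). */
        if (i % period == 0 && i + distance < n)
            prefetch_stub(&a[i + distance], lines_per_prefetch);
        sum += a[i];
    }
    return sum;
}
```

Calling `sum_with_prefetch` with `lines_per_prefetch` set to 1 gives a modulo factor of `LINE_BYTES / sizeof(double)` elements. Setting it to `LINES_PER_BLOCK` multiplies that factor by four, so the predicate fires a quarter as often over the same array.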
Block prefetching is a straightforward extension of normal prefetching. The main hardware complexity of supporting it is that the cache controller must handle single requests that fetch multiple lines. While block prefetches may help to reduce instruction overhead, they can do so by at most a factor equal to the ratio of the prefetch block size to the normal cache line size. Given that large block sizes may hurt performance by causing additional primary cache conflicts, this ratio is likely to be relatively small.