 
  
  
  
 
Given that each prefetch brings a single line of data into the cache, the minimum amount of prefetch overhead (assuming perfect miss coverage) is one additional instruction for each cache miss. To reduce the overhead below this, the prefetch instructions must specify larger amounts of data to be fetched. For example, a prefetch could be defined to fetch multiple consecutive cache lines, rather than just a single cache line. We will refer to these multi-line prefetches as ``block prefetches''.
Block prefetching is advantageous when there is spatial locality. In such cases a single block prefetch replaces the multiple prefetches that would otherwise be issued for the individual cache lines. For example, if a block prefetch fetches four cache lines, it can potentially eliminate up to 75% of the prefetch instruction overhead, since one instruction does the work of four. Although block prefetches may be helpful when there is spatial locality, they may hurt performance in the absence of spatial locality by displacing useful data and wasting memory bandwidth. This negative effect has been observed in previous studies where additional consecutive cache lines were automatically prefetched by the hardware [64]. Another potential downside is that bringing the additional lines into the cache earlier increases their chance of being displaced before use.
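The 75% figure follows from simple counting, which the following C sketch makes explicit. It is an illustration only: the function names `prefetch_instructions` and `overhead_eliminated` are hypothetical, and the model assumes perfect spatial locality, so that every line in each block is actually used.

```c
#include <assert.h>
#include <stddef.h>

/* Number of prefetch instructions needed to cover `lines` consecutive
 * cache lines when each prefetch fetches `lines_per_block` lines.
 * lines_per_block == 1 models an ordinary single-line prefetch.
 * The division rounds up so that a partial final block is still counted. */
static size_t prefetch_instructions(size_t lines, size_t lines_per_block)
{
    assert(lines_per_block > 0);
    return (lines + lines_per_block - 1) / lines_per_block;
}

/* Fraction of prefetch instructions eliminated by block prefetching,
 * relative to issuing one single-line prefetch per cache line. */
static double overhead_eliminated(size_t lines, size_t lines_per_block)
{
    assert(lines > 0);
    size_t single = prefetch_instructions(lines, 1);
    size_t block = prefetch_instructions(lines, lines_per_block);
    return 1.0 - (double)block / (double)single;
}
```

Under these assumptions, covering 1024 lines with four-line blocks takes 256 prefetch instructions instead of 1024, which is the 75% reduction quoted above. In general the best case is a fraction of 1 - 1/k for blocks of k lines.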
Therefore, rather than defining all prefetches to be block prefetches, a more attractive approach is for the compiler to select intelligently between single-line prefetches and block prefetches, based on whether the associated references enjoy spatial locality. An earlier figure illustrated how both types of prefetches could be encoded in the instruction set architecture through the prefetching hint field. Incorporating block prefetches into the compiler algorithm in this manner is straightforward: block prefetches are used whenever there is spatial locality, and single-line prefetches are used otherwise. When it schedules block prefetches, the compiler adjusts the line size parameter to match the block size, which changes the modulo factors in the prefetch predicates involving spatial locality accordingly. Note that as block prefetching increases the number of iterations between prefetches, it will eventually become more attractive to use strip mining [64] (as described in an earlier section) rather than unrolling to do the loop splitting, since the negative effects of loop unrolling dominate once loops are unrolled too many times.
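As a rough illustration of how the larger modulo factor leads to strip mining, consider the C sketch below. Everything concrete in it is an assumption made for the example rather than something taken from the text: the 32-byte line size, the four-line block, the prefetch distance, and the helper `block_prefetch`, which merely stands in for a single block prefetch instruction by issuing one `__builtin_prefetch` (a GCC/Clang builtin) per line. The prolog and epilog that a software-pipelined schedule would normally have are omitted.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LINE_BYTES = 32,     /* assumed cache line size in bytes */
    BLOCK_LINES = 4,     /* assumed number of lines per block prefetch */
    PREFETCH_AHEAD = 64  /* assumed prefetch distance, in array elements */
};

/* Hypothetical stand-in for one block prefetch instruction: touches up to
 * BLOCK_LINES consecutive lines starting at p, never going past limit. */
static void block_prefetch(const char *p, const char *limit)
{
    assert(p <= limit);
    size_t avail = (size_t)(limit - p);
    for (size_t i = 0; i < BLOCK_LINES; i++) {
        size_t off = i * LINE_BYTES;
        if (off >= avail)
            break;
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(p + off, 0, 3); /* read access, high locality */
#endif
    }
}

/* Sums a[0..n-1]. With single-line prefetches the predicate would be
 * (i mod 4) == 0 for 8-byte elements; with four-line blocks it becomes
 * (i mod 16) == 0. Instead of unrolling 16 times, the loop is strip
 * mined: the outer loop issues one prefetch per 16-element strip. */
double sum_strip_mined(const double *a, size_t n)
{
    const size_t strip = (size_t)BLOCK_LINES * LINE_BYTES / sizeof(double);
    double sum = 0.0;

    for (size_t i = 0; i < n; i += strip) {
        if (i + PREFETCH_AHEAD < n)
            block_prefetch((const char *)&a[i + PREFETCH_AHEAD],
                           (const char *)(a + n));

        size_t end = (i + strip < n) ? i + strip : n;
        for (size_t j = i; j < end; j++)
            sum += a[j];
    }
    return sum;
}
```

The inner loop body stays a single statement regardless of the block size, whereas the unrolled alternative would replicate it 16 times here and more for larger blocks.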
Block prefetching is a straightforward extension of normal prefetching. The main hardware complexity of supporting it is that the cache controller must handle single requests that fetch multiple lines. While block prefetches may help to reduce instruction overhead, they can do so by at most the ratio of the larger prefetch block size to the normal cache line size. Given that large block sizes may hurt performance by causing additional primary cache conflicts, this ratio is likely to be relatively small.
 
  
  
 