The total instruction overhead depends not only on how frequently prefetches are issued, but also on how many instructions are necessary to issue each prefetch. Ideally only a single instruction would be needed for each prefetch: the prefetch instruction itself. However, Table shows that several additional instructions are often needed to generate prefetch addresses. In some cases, as many as 20 extra instructions were added for each prefetch. From our experience with the SUIF compiler, the primary cause of these large overheads is register spilling. Register spilling occurs whenever the register allocator runs out of registers and must therefore ``spill'' values, saving them to memory and restoring them later. Once register spilling occurs, the instruction count within a loop body can increase dramatically.
One of the keys to avoiding register spilling is to reuse addressing registers between prefetches and loads and stores whenever possible. This works because prefetch addresses are often separated by a constant distance from load or store addresses. This difference can be folded into the constant offset field of the ``base-plus-offset'' addressing mode, so the prefetch can be issued without consuming any additional registers, as we illustrated earlier in Figure . The importance of this optimization was demonstrated during our early experience with the compiler, before the scalar optimizer implemented it: the instruction overhead in this early code was quite large because of the large amount of register spilling. Once the scalar optimizer began to exploit these common base registers, the overheads were dramatically reduced in many cases.
The second potential cause of register spilling is the loop unrolling that our algorithm performs to exploit spatial locality. Once a loop is unrolled, the compiler can optimize across the several replicated copies of the loop body. In most cases this allows the compiler to reduce the overall instruction count for the following reasons: (i) branch instructions can be eliminated between unrolled iterations; (ii) it is more likely that ``hazard'' slots after multi-cycle operations can be filled with independent instructions; and (iii) register allocation can be optimized across the unrolled iterations, possibly eliminating loads and stores. One such example is MXM, where the savings through loop unrolling resulted in fewer instructions in the code with prefetching than in the original code (see Figure ). On the other hand, the potential downside of loop unrolling is creating too many intermediate values for the register allocator to handle, resulting in register spilling. This occurred in the uniprocessor version of OCEAN, where each prefetch ended up costing 18 instructions. This number is so high because register spilling occurs for a large number of values in the loop, and these additional instructions are averaged over only a small number of prefetches. The result of this spilling is that OCEAN's instruction overhead is noticeably large despite the fact that misses occur infrequently (only once every 46 instructions). One way for the compiler to avoid these register-spilling problems is to perform loop unrolling with greater awareness of its effect on register pressure. In our algorithm, we suppressed loop unrolling only to avoid code explosion, and did not take register pressure into consideration.
Although these techniques for avoiding register spilling are strictly compiler-based, we now discuss ways to further reduce instruction overhead that also require architectural support.