The goal of loop splitting is to isolate the cache miss instances while introducing as little instruction overhead as possible. To quantify the advantage of loop splitting, we implemented the naive alternative for isolating cache miss instances: placing conditional statements inside the loops. The figure shows that for 5 of the 13 benchmarks (MXM, BTRIX, VPENTA, CG, and MG), the performance advantage of loop splitting is greater than 25%.
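The contrast between the two approaches can be sketched as follows. This is an illustrative model, not the compiler's actual output: `prefetch` here is a hypothetical stand-in that merely counts issued prefetches, and a 4-element cache line is assumed so that every fourth reference is the miss instance.

```c
#include <assert.h>

/* Hypothetical stand-in for a prefetch instruction: it only counts
 * issues so the two variants can be compared. */
static int prefetches, conditionals;
static void prefetch(const double *p) { (void)p; prefetches++; }

/* Naive isolation: a conditional inside the loop is tested on every
 * iteration, even though only one in four iterations issues a
 * prefetch (assuming a 4-element cache line). */
double sum_conditional(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        conditionals++;                     /* overhead paid every iteration */
        if (i % 4 == 0) prefetch(&a[i + 4]);
        s += a[i];
    }
    return s;
}

/* Loop splitting: unroll by the prefetch period so the miss instance
 * always sits at a fixed position in the body and the conditional
 * disappears entirely. Assumes n is a multiple of 4. */
double sum_split(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i += 4) {
        prefetch(&a[i + 4]);                /* one prefetch per line, no test */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    return s;
}
```

Both loops issue the same prefetches and compute the same sum; the split version simply avoids paying for a conditional on the three non-miss iterations out of every four.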
A good measure of the success of loop splitting is the instruction overhead per prefetch issued. Ideally, isolating the cache miss instances will not increase the instruction overhead. One of the advantages of having implemented the prefetching schemes in the compiler is that we can quantify this instruction overhead directly. Previous studies have only been able to estimate it.
The table shows the number of instructions required to issue each prefetch. For the indiscriminate prefetching scheme, the overhead per prefetch ranges from 1 to 4 instructions. This is well within the bounds of what one would intuitively expect: one of the instructions is the prefetch itself, and the rest are for address calculation. For the selective prefetching scheme, we show overhead both with respect to the original code and with respect to code where the loop splitting transformations have been performed but no prefetch instructions are inserted. Loop splitting generally increases the overhead per prefetch. In a few cases (MXM, IS, and CG), the overhead with respect to the original code is actually negative, due to savings from loop unrolling. In two cases (OCEAN and MG), the overhead has become quite large: more than 17 instructions per prefetch. In the case of OCEAN, the loop bodies are quite large, and the combination of loop unrolling and software pipelining forces the compiler to spill registers. Because the penalty for these spills is averaged over just the prefetches, it can become quite high. In the case of MG, the number of prefetches has been drastically reduced (by a factor of 21), so averaging all the loop and transformation overheads over only a small number of prefetches results in a high instruction-per-prefetch overhead. In most cases, however, the overhead per prefetch remains low: 5 or fewer instructions with respect to the original code for 9 of the 13 benchmarks.
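To see how unrolling can drive the net overhead negative, consider a rough instruction-count model. The figures below are illustrative assumptions, not measurements: 2 loop-bookkeeping instructions (increment and branch) per iteration, 1 instruction per prefetch, and an unroll factor of 4.

```c
#include <assert.h>

/* Rough instruction-count model. Assumes 2 bookkeeping instructions
 * per loop iteration and 1 instruction per prefetch; both figures
 * are illustrative, not measured. */
typedef struct { long total; long prefetches; } counts;

/* Original loop: n iterations of `body` instructions plus bookkeeping,
 * no prefetching. */
static counts original(long n, long body) {
    counts c = { n * (body + 2), 0 };
    return c;
}

/* Selective prefetching after unrolling by `u`: the body is replicated
 * u times, but bookkeeping is paid once per u iterations, and one
 * prefetch is issued per unrolled iteration. Assumes u divides n. */
static counts unrolled_prefetch(long n, long body, long u) {
    long iters = n / u;
    counts c = { iters * (u * body + 2 + 1), iters };
    return c;
}
```

With, say, 400 iterations and a 6-instruction body, the unrolled prefetching loop executes fewer total instructions than the original, so the overhead per prefetch comes out negative: the bookkeeping amortized away by unrolling more than pays for the prefetch instructions, matching the behavior reported for MXM, IS, and CG.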
Finally, after predicting and isolating the dynamic miss instances, the last step in our algorithm is to schedule prefetches ahead of the references using software pipelining.
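The software-pipelined schedule can be sketched as below. This is a simplified model with hypothetical names: `prefetch` is again a counting stand-in, the prefetch distance `d` (iterations of lead time chosen to cover the miss latency) is a parameter, and for clarity every element is prefetched rather than one per cache line.

```c
#include <assert.h>

/* Counting stand-in for a prefetch instruction. */
static int issued;
static void prefetch(const double *p) { (void)p; issued++; }

/* Software-pipelined prefetching sketch with prefetch distance d:
 * a prolog issues the first d prefetches before any computation,
 * the steady-state loop prefetches d iterations ahead of the data
 * it is consuming, and an epilog finishes the last d iterations
 * with no prefetches left to issue. Assumes 0 < d < n. */
double sum_pipelined(const double *a, int n, int d) {
    double s = 0.0;
    for (int i = 0; i < d; i++)             /* prolog: prime the cache */
        prefetch(&a[i]);
    for (int i = 0; i < n - d; i++) {       /* steady state */
        prefetch(&a[i + d]);                /* stays d iterations ahead */
        s += a[i];
    }
    for (int i = n - d; i < n; i++)         /* epilog: compute only */
        s += a[i];
    return s;
}
```

Every element is prefetched exactly once and exactly d iterations before it is referenced, which is what gives the prefetch time to complete before the demand access arrives.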