Ideally, only iterations satisfying the prefetch predicate should issue prefetch instructions. A naive way to implement this is to enclose the prefetch instructions inside an IF statement with the prefetch predicate as the condition. However, such a statement in the innermost loop can be costly (probably at least as expensive as issuing prefetches all the time), and thus defeats the purpose of reducing the prefetch overhead. We can eliminate this overhead by decomposing the loops into different sections so that the predicates for all instances for the same section evaluate to the same value. This process is known as loop splitting.
Table shows the loop splitting transformations for the various prefetch predicates. If the prefetch predicate is either ``True'' or ``False'', then no transformation is necessary since we either prefetch all the time or not at all. An ``i = 0'' predicate resulting from temporal locality requires the first iteration of the loop to be isolated through peeling. The generic schema for peeling a loop is illustrated in Figure . Finally, the ``(i mod ) = 0'' predicate resulting from spatial locality requires the loop to be either unrolled or strip-mined  to isolate one in every iterations. Figures and show the generic schemas for unrolling and strip-mining a loop, respectively. In general, the unrolling transformation is preferable for small or moderate values of . However, as becomes large, perhaps due to large cache line sizes, the overhead of replicating the loop body will eventually become more expensive than the extra loop control overhead involved in strip-mining. For the implementation of the compiler algorithm used in this dissertation, however, only the unrolling transformation was considered for spatial locality, since our cache lines are moderately sized (16 and 32 bytes).
For nested loops, these loop splitting transformations can be applied recursively to handle each prefetch predicate. However, peeling and unrolling multiple levels of loops can potentially expand the code by a significant amount. This may reduce the effectiveness of the instruction cache; also, existing optimizing compilers are often ineffective for large procedure bodies. Our algorithm keeps track of how large a loop is growing. We suppress peeling or unrolling when the loop becomes too large. For temporal locality, if the loop is too large to peel, we simply drop the prefetches. For spatial locality, when the loop becomes too large to unroll, we introduce a conditional statement. This is a reasonable choice because when the loop body becomes this large, the cost of a conditional statement is relatively small.
Now that our compiler knows what to prefetch and has isolated those instances in the code with minimal overhead, the final step is to schedule the prefetches the proper amount of time in advance to hide the memory latency through software pipelining.