Figure summarizes the major steps of our prefetching algorithm. The example code and resulting data locality are shown in Figures (a) and (b), respectively. Figure (c) shows the result of the analysis phase of our algorithm, where we have constructed predicates indicating when all three array references should be prefetched. Finally, Figure (d) shows the resulting code with prefetching, which is generated by the scheduling phase of our algorithm as follows.
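The example code of Figure (a) is not reproduced here; as a point of reference, a loop nest with the same reference pattern (the array names A and B and the bound N are hypothetical, chosen to match the references discussed below) might look like the following sketch, where B[j+1] has temporal locality across i iterations and A[i][j] has spatial locality within a cache line:

```c
#include <assert.h>

enum { N = 8 };                       /* hypothetical problem size */
double A[N][N], B[N + 1];

/* Fill B with known values so the transformed versions below can be
   checked against this reference loop. */
void init(void)
{
    for (int k = 0; k <= N; k++)
        B[k] = (double)k;
}

/* Original (unprefetched) loop nest: B[j+1] is reused on every i
   iteration (temporal locality); consecutive A[i][j] elements share a
   cache line (spatial locality). */
void compute(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[j] + B[j + 1];
}
```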
First, our algorithm performs loop splitting based on the prefetch predicates. The ``i = 0'' predicate resulting from the temporal locality of the B[j+1] reference causes the compiler to peel the i loop. As we see in Figure (d), the prefetches for the B matrix occur only in the peel of the i loop. Also, to isolate the spatial locality of the A[i][j] reference, the ``(j mod 2) = 0'' predicate causes the j loop to be unrolled by a factor of two, both in the peel and in the main iterations of the i loop. As a result, Figure (d) shows only one prefetch of the A matrix for every two copies of the loop body.
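The split loops can be sketched as follows, reusing the hypothetical A, B, and N from above. Since the paper's prefetches are non-binding (dropping one never changes the result), a no-op macro stands in for the prefetch instruction here; this is an illustrative sketch, not the compiler's actual output:

```c
#include <assert.h>

enum { N = 8 };                       /* hypothetical size; assumed even */
double A[N][N], B[N + 1];

/* Stand-in for a non-binding prefetch instruction: since a prefetch may
   be dropped without changing the result, a no-op models it safely. */
#define PREFETCH(addr) ((void)(addr))

void compute_split(void)
{
    /* Peel of i = 0: the only place the B matrix is prefetched. */
    for (int j = 0; j < N; j += 2) {  /* unrolled by 2 */
        PREFETCH(&A[0][j]);           /* one line covers A[0][j], A[0][j+1] */
        PREFETCH(&B[j + 1]);
        A[0][j]     = B[j]     + B[j + 1];
        A[0][j + 1] = B[j + 1] + B[j + 2];
    }
    /* Main iterations i >= 1: B is now cached, so only A is prefetched,
       once per two copies of the loop body. */
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j += 2) {
            PREFETCH(&A[i][j]);
            A[i][j]     = B[j]     + B[j + 1];
            A[i][j + 1] = B[j + 1] + B[j + 2];
        }
}
```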
Once the miss instances have been isolated through loop splitting, the compiler then software pipelines the prefetches as follows. Using the algorithm in Figure , our compiler determines that the shortest path through the loop body is 36 instructions long. Given that the memory latency in this example is 100 cycles, equation () yields that three iterations are sufficient to hide the latency. Once this iteration count is determined, the code transformation is mechanical. In Figure (d) we see the resulting prolog, steady-state, and epilog loops in both the peel and the main iterations of the i loop. Thus the code in Figure (d) issues all of the useful prefetches early enough to overlap the memory accesses with computation on other data. (This is a source-level representation of the actual code generated by our compiler for this case.)
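For the main i iterations (where only A is prefetched), the prolog/steady-state/epilog structure might look like the sketch below, again with the hypothetical A, B, N, and no-op PREFETCH from the earlier sketches. The prefetch distance of three iterations follows from ⌈100/36⌉ = 3; because the j loop is unrolled by two, staying three iterations ahead means staying six values of j ahead:

```c
#include <assert.h>

enum { N = 16 };                       /* hypothetical size; assumed even */
enum { AHEAD = 3 };                    /* ceil(100-cycle latency / 36-instruction body) */
double A[N][N], B[N + 1];

#define PREFETCH(addr) ((void)(addr))  /* non-binding prefetch: no-op is safe */

/* Software-pipelined inner loop for one main iteration i >= 1. */
void inner_pipelined(int i)
{
    int j;
    /* Prolog: issue the prefetches for the first AHEAD iterations. */
    for (j = 0; j < 2 * AHEAD && j < N; j += 2)
        PREFETCH(&A[i][j]);
    /* Steady state: compute iteration j while prefetching the data
       needed AHEAD iterations later. */
    for (j = 0; j + 2 * AHEAD < N; j += 2) {
        PREFETCH(&A[i][j + 2 * AHEAD]);
        A[i][j]     = B[j]     + B[j + 1];
        A[i][j + 1] = B[j + 1] + B[j + 2];
    }
    /* Epilog: the last AHEAD iterations, whose data is already in flight,
       compute without issuing further prefetches. */
    for (; j < N; j += 2) {
        A[i][j]     = B[j]     + B[j + 1];
        A[i][j + 1] = B[j + 1] + B[j + 2];
    }
}
```

Note that because the prefetches are non-binding, this transformation preserves the original loop's results exactly; only the timing of the memory accesses changes.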