Software pipelining  is a technique that allows us to hide memory latency by overlapping the prefetches for a future iteration with the computation of the current iteration. While this transformation is straightforward, the key parameter is how far in advance the prefetches should be scheduled. On the one hand, prefetches must be issued early enough to hide memory latency. But on the other hand, if prefetches are issued too early, the data may be flushed from the cache before they can be used.
We choose the number of iterations to be the unit of time scheduling in our algorithm. The number of iterations to prefetch ahead is
where is the expected memory latency and is the length of the shortest path through the loop body. In our algorithm, is a parameter given to the compiler, which we set to be the largest expected memory latency, including contention. The parameter is computed by the compiler for each loop nest, using the Shortest_Path algorithm shown in Figure . The interesting cases in this algorithm are: (i) conditional statements, where we choose the shorter of the ``then'' and ``else'' paths; (ii) loops, where we use the iteration count if it is known-otherwise, we assume at least a single iteration is executed; and (iii) procedure calls, where we use the length of the procedure body, unless there is recursion. We take the ceiling of the ratio to ensure that all of the latency is hidden.
Figure shows a simple example of how software pipelining would be used to schedule prefetches. Assuming equation () yields that 5 iterations are sufficient to hide the memory latency for the loop in Figure (a), we begin in Figure (b) with a prolog that issues the first several prefetches. Next, a steady state loop is executed where both prefetches and computation occur. Finally, an epilog loop completes the last several iterations of computation.
Since our scheduling quantum is an iteration, this scheme prefetches a data item at least one iteration before it is used. If a single iteration of the loop can fetch so much data that the prefetched data may be replaced (i.e. the loop is outside the localized iteration space), we suppress issuing the prefetch.