Software pipelining [65, 48] is a
technique that allows us to hide memory latency by overlapping the
prefetches for a future iteration with the computation of the current
iteration. While this transformation is straightforward, the key parameter
is how far in advance the prefetches should be scheduled. On the one hand,
prefetches must be issued early enough to hide memory latency. But on the
other hand, if prefetches are issued *too early*, the data may be
flushed from the cache before it is used.

We choose the loop iteration as the unit of time in our scheduling algorithm. The number of iterations to prefetch ahead is

⌈ l / s ⌉

where l is the expected memory latency and s is the length of the shortest
path through the loop body. In our algorithm, l is a parameter given to the
compiler, which we set to be the largest expected memory latency, including
contention. The parameter s is computed by the compiler for each loop nest,
using the `Shortest_Path` algorithm shown in
Figure . The interesting cases in this algorithm
are: (i) conditional statements, where we choose the shorter of the ``*then*'' and ``*else*'' paths; (ii) loops, where we use the iteration count
if it is known; otherwise, we assume at least a single iteration is executed;
and (iii) procedure calls, where we use the length of the procedure body,
unless there is recursion. We take the ceiling of the ratio to ensure that all
of the latency is hidden.
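The three cases above can be sketched as follows. The tuple-based node encoding, the cost units, and the handling of the recursive case are illustrative assumptions, not the compiler's actual intermediate representation:

```python
import math

def shortest_path(node):
    """Length of the shortest path through `node`, in abstract cycles.

    Hypothetical IR encoding (an assumption for this sketch):
      ('stmt', cost)             -- straight-line statement
      ('seq', [children])        -- sequence of statements
      ('if', then, else_)        -- conditional statement
      ('loop', count, body)      -- count is None when unknown
      ('call', body, recursive)  -- procedure call
    """
    kind = node[0]
    if kind == 'stmt':
        return node[1]
    if kind == 'seq':
        return sum(shortest_path(c) for c in node[1])
    if kind == 'if':
        # (i) conditionals: take the shorter of the then/else paths
        return min(shortest_path(node[1]), shortest_path(node[2]))
    if kind == 'loop':
        # (ii) loops: use the iteration count if known,
        # otherwise assume at least one iteration executes
        count = node[1] if node[1] is not None else 1
        return count * shortest_path(node[2])
    if kind == 'call':
        # (iii) calls: length of the procedure body, unless recursive;
        # we assume a nominal cost of 1 for the recursive case here
        body, recursive = node[1], node[2]
        return 1 if recursive else shortest_path(body)
    raise ValueError(f"unknown node kind: {kind}")

def prefetch_distance(latency, loop_body):
    # ceiling of l/s, so that all of the latency is hidden
    return math.ceil(latency / shortest_path(loop_body))
```

For example, a loop body consisting of a 2-cycle statement followed by a conditional whose arms cost 1 and 5 cycles has a shortest path of 3 cycles; with a 100-cycle latency, the compiler would prefetch ⌈100/3⌉ = 34 iterations ahead.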

Figure shows a simple example of how
software pipelining would be used to schedule prefetches. Assuming equation
() yields that 5 iterations are sufficient to hide the
memory latency for the loop in Figure (a),
we begin in Figure (b) with a *prolog*
that issues the first several prefetches. Next, a *steady state*
loop is executed where both prefetches and computation occur. Finally,
an *epilog* loop completes the last several iterations of computation.
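Concretely, the prolog/steady-state/epilog structure for a simple reduction loop might look like the sketch below. The five-iteration prefetch distance matches the example above; the `prefetch` callback is a stand-in for a non-binding hardware prefetch instruction and simply records which element was requested:

```python
PREFETCH_DISTANCE = 5  # iterations ahead, from the ceiling of l/s

def pipelined_sum(a, prefetch):
    """Sum an array with prefetches software-pipelined ahead of use."""
    n = len(a)
    total = 0
    # prolog: issue the first several prefetches, with no computation
    for i in range(min(PREFETCH_DISTANCE, n)):
        prefetch(i)
    # steady state: each iteration prefetches for a future iteration
    # while computing on data prefetched PREFETCH_DISTANCE iterations ago
    for i in range(n - PREFETCH_DISTANCE):
        prefetch(i + PREFETCH_DISTANCE)
        total += a[i]
    # epilog: the last iterations compute only; their data was
    # already prefetched during the steady state (or the prolog)
    for i in range(max(n - PREFETCH_DISTANCE, 0), n):
        total += a[i]
    return total
```

Recording the prefetch stream confirms that every element is requested exactly once, and always PREFETCH_DISTANCE iterations before it is consumed.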

Since our scheduling quantum is an iteration, this scheme prefetches a data item at least one iteration before it is used. If a single iteration of the loop can fetch so much data that the prefetched data may be replaced before use (i.e., the loop is outside the localized iteration space), we suppress issuing the prefetch.
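A rough sketch of this suppression test follows; the single-number cache-capacity model and the parameter names are simplifying assumptions for illustration, not the paper's locality analysis:

```python
def should_prefetch(bytes_touched_per_iteration, effective_cache_bytes):
    """Decide whether prefetching is worthwhile for a loop.

    If one iteration of the loop touches more data than the cache can
    hold, data prefetched a full iteration early is likely to be
    replaced before it is used, so the prefetch is suppressed.
    """
    return bytes_touched_per_iteration < effective_cache_bytes
```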
