Prefetches must be issued early enough to hide memory latency. They
must not be issued too early lest the data fetched be flushed out of
the cache before they are used. We choose the number of iterations to
be the unit of time scheduling in our algorithm. The number of
iterations to prefetch ahead is
where
is the prefetch latency and
is the length of the
shortest path through the loop body.
In our example in Figure 2(a), the latency is 100 cycles, the shortest path through the loop body is 36 instructions long, therefore, the j loops are software-pipelined three iterations ahead. Once the iteration count is determined, the code transformation is mechanical.
Since our scheduling quantum is an iteration, this scheme prefetches a data item at least one iteration before it is used. If a single iteration of the loop can fetch so much data that the prefetched data may be replaced, we suppress issuing the prefetch.