The goal of software pipelining is to issue prefetches the proper amount of time in advance, such that the data will be found in the cache when it is actually needed. The number of iterations to prefetch ahead must be carefully chosen (see equation ()), since too few iterations will not provide enough time to hide the latency, but too many iterations may cause the data item to be replaced from the cache before it can be referenced.
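The trade-off above can be made concrete with a small sketch. The referenced equation is not reproduced in this section, so the code assumes the common form: the number of iterations to prefetch ahead is the prefetch latency divided by the length of the shortest path through the loop body, rounded up. The function names and the 36-cycle loop body are hypothetical; only the 300-cycle latency appears in the text.

```python
import math


def prefetch_distance(latency_cycles, shortest_path_cycles):
    """Iterations to prefetch ahead, assuming ceil(latency / loop-body length)."""
    if shortest_path_cycles <= 0:
        raise ValueError("loop body must take at least one cycle")
    return math.ceil(latency_cycles / shortest_path_cycles)


def pipelined_schedule(n_iters, distance):
    """Return the (action, iteration) events of a software-pipelined loop."""
    events = []
    # Prolog: prefetch the data needed by the first `distance` iterations.
    for i in range(min(distance, n_iters)):
        events.append(("prefetch", i))
    # Steady state: prefetch for iteration i + distance, then execute iteration i.
    steady = max(n_iters - distance, 0)
    for i in range(steady):
        events.append(("prefetch", i + distance))
        events.append(("use", i))
    # Epilog: the remaining iterations were already prefetched.
    for i in range(steady, n_iters):
        events.append(("use", i))
    return events


# Hypothetical 36-cycle loop body with the 300-cycle latency parameter.
distance = prefetch_distance(300, 36)
schedule = pipelined_schedule(6, 2)
```

In this sketch every iteration is prefetched exactly once, and the prefetch for iteration i always precedes its use by `distance` iterations. A shorter loop body raises the distance, which is what exposes prefetched lines to replacement before they are referenced.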
To evaluate the effectiveness of our software pipelining algorithm, we present in Figure a breakdown of the impact of prefetching on the original primary cache misses. This breakdown contains three categories: (i) misses that are prefetched and subsequently hit in the primary cache (pf-hit), (ii) misses that are prefetched but remain primary misses (pf-miss), and (iii) misses that are not prefetched (nopf-miss). The effectiveness of the software pipelining algorithm is reflected by the size of the pf-miss category. A large value means that the prefetches are either issued too late, in which case the line has not yet returned to the primary cache by the time it is referenced, or issued too early, in which case the line has already been replaced in the cache before it is referenced.
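The three-way breakdown can be sketched as a simple classifier. This is an illustration of the bookkeeping only, not the instrumentation used for the experiments; the input format (one pair of flags per original primary miss) and the function names are assumptions.

```python
from collections import Counter

CATEGORIES = ("pf-hit", "pf-miss", "nopf-miss")


def classify_misses(original_misses):
    """Return the fraction of original primary misses in each category.

    `original_misses` is an iterable of (was_prefetched, hit_in_primary) pairs,
    one per reference that missed in the primary cache without prefetching.
    """
    counts = Counter()
    for was_prefetched, hit_in_primary in original_misses:
        if not was_prefetched:
            counts["nopf-miss"] += 1
        elif hit_in_primary:
            counts["pf-hit"] += 1
        else:
            counts["pf-miss"] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


def scheduling_failure_rate(breakdown):
    """Share of the prefetched misses that were still not found in the cache."""
    prefetched = breakdown["pf-hit"] + breakdown["pf-miss"]
    return breakdown["pf-miss"] / prefetched if prefetched else 0.0


# Made-up sample: 6 pf-hit, 3 pf-miss, 1 nopf-miss.
sample = [(True, True)] * 6 + [(True, False)] * 3 + [(False, False)]
breakdown = classify_misses(sample)
```

The second helper separates scheduling quality from coverage: it ignores nopf-miss entirely and asks only how often an issued prefetch failed to produce a primary cache hit.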
The results in Figure indicate that the scheduling algorithm is generally effective. The exceptions are CHOLSKY and TOMCATV, where over a third of the prefetched references are not found in the cache. The problem in these cases is that cache conflicts remove prefetched data from the primary cache before it can be referenced. To adjust for this, one might consider decreasing the prefetch latency compile-time parameter (i.e., the latency parameter in equation ()), which was set to 300 cycles for these experiments, so that prefetches are issued fewer iterations ahead. We will evaluate this possibility later in Section . However, we observe that when cache conflicts are the problem, they often occur frequently enough that they cannot be avoided by simply adjusting the software pipelining parameters. Later, in Section , we examine these cases in more detail and evaluate whether increasing the cache associativity can help.
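A toy model illustrates why conflicts can defeat any choice of prefetch distance. The sketch below is not the simulator used for these experiments, and it says nothing about the actual access patterns of CHOLSKY or TOMCATV. It assumes a deliberately worst-case, hypothetical layout in which two arrays map to the same cache sets, it models prefetches as completing instantly, and it uses LRU replacement; the cache size and class and function names are likewise invented for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tiny LRU cache model that tracks line addresses only (no data)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def touch(self, line):
        """Bring `line` into the cache; return True if it was already present."""
        lines = self.sets[line % self.num_sets]
        hit = line in lines
        if hit:
            lines.move_to_end(line)  # mark as most recently used
        else:
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[line] = True
        return hit


def pf_miss_fraction(ways, distance, n_lines=64, capacity_lines=256):
    """Fraction of prefetched lines that are absent when finally referenced."""
    cache = SetAssociativeCache(capacity_lines // ways, ways)
    # Hypothetical layout: the arrays are a multiple of the cache size apart,
    # so line k of A and line k of B always fall in the same set.
    a_base, b_base = 0, 4 * capacity_lines
    misses = uses = 0
    for i in range(-distance, n_lines):  # negative i forms the prolog
        ahead = i + distance
        if ahead < n_lines:
            cache.touch(a_base + ahead)  # prefetch A
            cache.touch(b_base + ahead)  # prefetch B, may evict A's line
        if i >= 0:
            for base in (a_base, b_base):
                uses += 1
                if not cache.touch(base + i):
                    misses += 1
    return misses / uses


direct_mapped = [pf_miss_fraction(1, d) for d in (1, 4, 16)]
two_way = [pf_miss_fraction(2, d) for d in (1, 4, 16)]
```

In this model the direct-mapped cache loses every prefetched line at all three distances, because the conflicting line arrives between the prefetch and the use no matter how far ahead the prefetch is issued, while the two-way cache of the same capacity retains them. Real programs fall between these extremes.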
Even in cases where prefetched data is replaced from the primary cache before it can be referenced, there is still a performance advantage, since the data tends to remain in the secondary cache. Therefore, although the miss latency is not eliminated, it is often reduced from that of a main memory access to that of a secondary cache access. This was shown earlier in Table , where selective prefetching reduces the average miss penalty from 24.8 to 12.3 cycles for CHOLSKY, and from 36.6 to 12.5 cycles for TOMCATV.
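The arithmetic behind this observation can be sketched as follows. The relative reductions are computed directly from the four penalties quoted above. The blended-penalty function is an assumed two-level model (each primary miss is served either by the secondary cache or by main memory), and the latencies passed to it in the example are hypothetical placeholders, since the machine's actual latencies are not stated in this section.

```python
def relative_reduction(before, after):
    """Fractional reduction in average miss penalty."""
    return (before - after) / before


def average_miss_penalty(frac_from_memory, secondary_latency, memory_latency):
    """Assumed model: a primary miss is served by the secondary cache or memory."""
    return ((1.0 - frac_from_memory) * secondary_latency
            + frac_from_memory * memory_latency)


# Penalties in cycles, as quoted in the text.
reductions = {
    "CHOLSKY": relative_reduction(24.8, 12.3),   # about 0.50
    "TOMCATV": relative_reduction(36.6, 12.5),   # about 0.66
}

# Hypothetical latencies (10 and 100 cycles) chosen only to show the trend:
# the penalty approaches the secondary latency as fewer misses reach memory.
mostly_memory = average_miss_penalty(0.5, 10, 100)
mostly_secondary = average_miss_penalty(0.1, 10, 100)
```

Under this model, moving misses from main memory to the secondary cache lowers the average penalty toward the secondary cache latency, which is consistent with the roughly one-half and two-thirds reductions computed from the quoted figures.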