In this section we evaluate the effectiveness of our core prefetching algorithm, which was described earlier in Chapter . We begin with a brief, high-level evaluation of the overall performance of the compiler algorithm. We then examine each of the three key aspects of the algorithm in greater detail (locality analysis, loop splitting, and software pipelining) in Sections , , and , respectively.
The results of our first set of experiments are shown in Figure and Table . Figure shows the overall performance improvement achieved through our selective prefetching algorithm. For each benchmark, the two bars correspond to the cases with no prefetching (N) and with selective prefetching (S). In each bar, the bottom section is the amount of time spent executing instructions (including the instruction overhead of prefetching), and the section above that is the memory stall time. For the prefetching cases, there is also a third component: stall time due to memory overheads caused by prefetching. Specifically, this stall time corresponds to two situations: (1) when the processor attempts to issue a prefetch but the prefetch issue buffer is already full, and (2) when the processor attempts to execute a load or store while the cache tags are busy with a prefetch fill.
As shown in Figure , the overall speedup ranges from 5% to 100%, with 6 of the 13 benchmarks improving by over 45%. The memory stall time is significantly reduced in all cases. Table indicates that this is accomplished by reducing both the primary miss rate and the average primary-miss penalty. The miss penalty is reduced because even if a prefetched line is replaced from the primary cache before it can be referenced, it is still likely to be present in the secondary cache. Also, the miss latency may be partially hidden if the miss occurs while the prefetch access is still in progress. Overall, 33% to 90% of the original memory stall cycles are eliminated.
Having established the benefits of prefetching, we now focus on the costs. Figure shows that the instruction overhead of prefetching causes less than a 15% increase in instruction count in over half of the benchmarks. In fact, in two of those cases (MXM and IS) the number of instructions actually decreased due to savings through loop unrolling. In other cases (CHOLSKY, BTRIX, VPENTA, TOMCATV, OCEAN), the number of instructions increased by 25% to 50%. Finally, the stalls due to prefetching memory overhead are typically small, never exceeding 15% of the original execution time. In each case, the overheads of prefetching are small enough relative to the gains that the net improvement remains large.
These high-level results suggest that our prefetching algorithm is successful at improving performance. To evaluate the algorithm in greater depth, the following three subsections focus on each key aspect of the selective prefetching algorithm in turn. We begin with the first step in our algorithm: using locality analysis to determine which references should be prefetched.