Since software-controlled prefetching has a cost as well as a benefit, care must be taken when inserting prefetches so that the cost does not offset much of the latency-hiding benefit. The first step toward minimizing cost is prefetching selectively, to avoid the pure overhead of unnecessary prefetches. Our results in Sections and demonstrate that selective prefetching can eliminate much of the prefetching overhead, and we discussed ways to improve this analysis further in Section . While the overhead remaining after selective prefetching is typically quite small in comparison with the reduction in memory stall time, there are still a few cases where additional speedups of at least 10% could be achieved if it were possible to eliminate the remaining instruction overhead. In this section we address the second step toward reducing prefetching cost: minimizing the instruction overhead of the prefetches that are issued.
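To make the idea of selective prefetching concrete, the following sketch shows one common way a compiler (or programmer) avoids unnecessary prefetches: issuing only one prefetch per cache line rather than one per iteration. This is an illustrative example only, not the analysis described in this work; the cache-line size, element size, and prefetch distance are assumptions, and `__builtin_prefetch` is a GCC/Clang-specific builtin.

```c
#include <stddef.h>

/* Sketch of selective prefetching over a dense array.
   Assumptions (not from the source text): 64-byte cache lines and
   8-byte doubles, so one prefetch covers 8 consecutive elements.
   Prefetching every iteration would make 7 of every 8 prefetches
   pure overhead; prefetching once per line avoids that cost. */
#define LINE_ELEMS 8   /* assumed: 64-byte line / sizeof(double) */
#define DIST       16  /* assumed prefetch distance, in elements */

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue a prefetch only at cache-line boundaries, DIST
           elements ahead of the current access. */
        if (i % LINE_ELEMS == 0 && i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    return s;
}
```

The guard condition is itself a source of instruction overhead; in practice a compiler would instead unroll the loop by `LINE_ELEMS` so the prefetch appears unconditionally once per unrolled body.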
Before we begin this discussion, let us consider how future trends are likely to affect the relative importance of prefetch instruction overhead. The first relevant trend is that the gap between processor and memory speeds will continue to grow. As this occurs, the cost of even the current level of instruction overhead will diminish relative to the latency-hiding benefit of each useful prefetch. The second important trend is continued improvement in the ability of processors to exploit instruction-level parallelism through techniques such as superscalar processing. Since prefetch instructions can always be executed in parallel with other operations (no other operations depend upon their completion), they should benefit substantially from the exploitation of instruction-level parallelism. Therefore the absolute overhead of processing prefetch instructions is likely to decrease. The combined effect of these two trends is that prefetch instruction overhead should become less significant in the future.
Given these trends, why do we care about prefetch instruction overhead at all? The first reason is that although prefetch instructions can theoretically be executed in parallel with other operations, this results in zero overhead only if otherwise-idle resources are available to execute the prefetches. In practice, the functional units needed to compute prefetch addresses and issue prefetches are also busy handling normal loads and stores. Due to competition for these critical resources, it is unlikely that prefetch instruction overhead will be completely hidden. The second reason is that prefetch instruction overhead is an inherent problem in applications with few instructions between cache misses. For such applications, a difference of only a single instruction per prefetch can produce a large fractional increase in total instruction count. For example, consider CHOLSKY in Figure , where selective prefetching increases the instruction count by roughly 50%. In this case the analysis is nearly perfect: only 9% of the prefetches are unnecessary, and the miss coverage is 97%. The instruction overhead is large because cache misses occur quite frequently (once every 11 instructions) and issuing each prefetch requires several instructions (5, on average). Eliminating only a single instruction per prefetch would decrease the instruction count by roughly 10% in this case.
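The reason a single prefetch can cost several instructions is easiest to see with an irregular access pattern, where the prefetch address must itself be computed. The sketch below is a hypothetical illustration (it does not reproduce CHOLSKY's actual access pattern): prefetching `b[idx[i + DIST]]` requires loading the future index, scaling it, and adding the base address, so each prefetch carries address-computation overhead on top of the prefetch instruction itself. `DIST` is an assumed prefetch distance, and `__builtin_prefetch` is a GCC/Clang-specific builtin.

```c
#include <stddef.h>

#define DIST 8  /* assumed prefetch distance, in iterations */

/* Indirect (gather) traversal: s += b[idx[i]] for each i.
   Each inserted prefetch costs several instructions of its own:
   a load of idx[i + DIST], an index-scaling step, an add of the
   base pointer, and the prefetch itself -- overhead that is paid
   whether or not the prefetched line turns out to be useful. */
double gather_sum(const double *b, const int *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&b[idx[i + DIST]], /*rw=*/0, /*locality=*/1);
        s += b[idx[i]];
    }
    return s;
}
```

When misses are as frequent as once every 11 instructions, a per-prefetch cost of this size is a significant fraction of the total instruction count, which is why shaving even one instruction per prefetch matters for codes like CHOLSKY.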
Therefore, since the instruction overhead of useful prefetches may be a concern in some cases but is probably not a major hindrance in general, we discuss techniques for reducing this overhead only briefly in this section.