The goal of using locality analysis is to reduce overhead by prefetching only those references that cause cache misses. To evaluate whether locality analysis is successful, we performed the following experiment. We modified our compiler to prefetch all instances of affine array references, which corresponds to setting the prefetch predicate for each reference to ``True'' (see Section ). We refer to this algorithm as indiscriminate (as opposed to selective) prefetching. To isolate the impact of locality analysis, we kept all other components of the two algorithms the same. For example, the indiscriminate algorithm uses the same software pipelining technique as selective prefetching to schedule prefetches far enough in advance. However, it has no need for locality analysis or loop splitting.
Ideally the selective prefetch algorithm will achieve the same level of memory stall reduction as indiscriminate prefetching, while decreasing the overheads associated with issuing unnecessary prefetches. The results of this experiment are shown in Figure and Table .
Figure shows that the speedup offered by prefetching selectively rather than indiscriminately ranges from 1%to 100%. In 6 of the 13 cases, the speedup is greater than 20%. As we see in both Figure and Table , most of the benchmarks sacrifice very little in terms of memory stall reduction by prefetching selectively. On the other hand, Figure shows that indiscriminate prefetching suffers more from both increased instruction overhead and stress on the memory subsystem. Overall, selective prefetching is effective. In some cases (CFFT2D and MG), selectiveness even turns prefetching from a performance loss into a performance gain.
We evaluate the selective algorithm in more detail by using the following two concepts. The coverage factor is the fraction of original misses that are prefetched. A prefetch is unnecessary if the line is already in the cache or is currently being fetched. An ideal prefetching scheme would provide a coverage factor of 100%and would generate no unnecessary prefetches.
Figure shows the fraction of unnecessary prefetches and the coverage factor for the two prefetching algorithms. In general, we see the encouraging result that selective prefetching reduces the number of unnecessary prefetches without sacrificing much in terms of coverage factor. Let us consider these numbers in more detail.
The coverage factor of the indiscriminate case is interesting, since it represents the fraction of cache misses that are within the domain of our analysis in most cases. Looking at Figure (b), we notice that in 5 of the 13 cases (CFFT2D, CHOLSKY, BTRIX, GMTRY, and VPENTA) the coverage factor is well over 90%, but in 5 of the other cases (TOMCATV, OCEAN, IS, CG, and MG) it is 50%or lower. In the case of CG, the coverage is only 38%because it is a sparse matrix algorithm and we are only prefetching the dense references. The same is true for IS (integer sort), which contains many indirect array references, despite not being a sparse matrix algorithm. (We will improve the coverage for both of these cases later in Section when we also prefetch indirect references.) MXM (a blocked matrix multiplication kernel) is a surprising case, since all of the important references are obviously affine, yet the coverage factor is only 65%. This is a result of the way we account for prefetches in our experiment; we associate a prefetch only with the very next reference to the same cache line. Suppose the algorithm issues two prefetches for the same line (a likely scenario without locality analysis) followed by references to two consecutive words in that cache line; we say that the first reference is prefetched but not the second. In the case of MXM, cache conflicts between accesses to different arrays can cause the second access to the same cache line to miss. Similar behavior also occurs in TOMCATV, OCEAN, and MG. Finally, in the cases of EMIT and EP, many of the remaining cache misses occur in library routines (which are not compiled by our prefetching compiler) rather than the programs themselves. Furthermore, library routines present a difficult challenge because the surrounding contexts of the call sites (e.g., surrounding loop nests) are not known in advance. Later, in Section , we will suggest how profiling feedback or dynamically-adaptive code may help in such cases.
Figure (a) shows that a large fraction of prefetches issued under the indiscriminate scheme are unnecessary. In all but one case, this fraction ranged from 60%to 95%. These unnecessary prefetches can lead to large instruction overheads (MG) or significant delays due to a saturated memory subsystem (CFFT2D and CG).
A selective algorithm is successful if it can maintain a similar coverage while lowering the number of unnecessary prefetches. Figure (a) shows that in 11 of the 13 cases, the coverage is reduced by less than 10%. Table also supports this by showing that the miss rates have not increased substantially, and the reduction in memory stall cycles is comparable. In the cases where the coverage did go down, the problem is typically due to the presence of cache conflicts. Comparing the percentages of unnecessary prefetches in Figure (a), we see that the improvement from selective prefetching is dramatic in many cases (CFFT2D, CHOLSKY, IS, EP, MG).
The advantage of selective prefetching is summarized by the ratio of indiscriminate to selective prefetches in Table . Prefetching selectively can reduce the number of prefetches by as much as a factor of 21. At the same time, the coverage factor remains competitive. Overall, this selection process appears to be quite successful.
Once we have used locality analysis to predict which dynamic data references should be prefetched, the next step in our algorithm is to isolate those cases through loop splitting.