The goal of using locality analysis is to reduce overhead by
prefetching only those references that cause cache misses. To evaluate
whether locality analysis is successful, we performed the following
experiment. We modified our compiler to prefetch all instances of
affine array references, which corresponds to setting the prefetch
predicate for each reference to ``True'' (see
Section ). We refer to this algorithm as indiscriminate (as opposed to selective) prefetching. To isolate the
impact of locality analysis, we kept all other components of the two algorithms
the same. For example, the indiscriminate algorithm uses the same software
pipelining technique as selective prefetching to schedule prefetches far
enough in advance. However, it has no need for locality analysis or loop
splitting.
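To make the contrast concrete, the sketch below shows roughly what the two algorithms would emit for a simple reduction loop. The loop body, the four-element cache lines, and the eight-iteration prefetch distance are illustrative assumptions rather than actual compiler output, and GCC's __builtin_prefetch stands in for the machine's non-binding prefetch instruction.

    /* Illustrative sketch (not actual compiler output): assumes 8-byte
     * doubles, 32-byte cache lines (4 elements per line), and a prefetch
     * distance of 8 iterations.  Prefetches that run past the end of the
     * array are harmless because prefetch instructions are non-binding.   */

    /* Indiscriminate: the prefetch predicate is always "True", so every
     * iteration issues a prefetch even though only one reference in four
     * can miss.                                                           */
    double sum_indiscriminate(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 8]);  /* 3 of every 4 are unnecessary */
            sum += a[i];
        }
        return sum;
    }

    /* Selective: locality analysis predicts a miss only when i mod 4 == 0,
     * and loop splitting (unrolling by 4) isolates that case so no
     * predicate is tested at run time.  Assumes n is a multiple of 4; a
     * real compiler would also emit cleanup iterations.                   */
    double sum_selective(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i += 4) {
            __builtin_prefetch(&a[i + 8]);  /* one prefetch per cache line */
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        return sum;
    }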
Ideally the selective prefetch algorithm will achieve the same level of
memory stall reduction as indiscriminate prefetching, while decreasing the
overheads associated with issuing unnecessary prefetches. The results of
this experiment are shown in Figure and Table .
Figure shows that the speedup offered by
prefetching selectively rather than indiscriminately ranges from 1% to
100%. In 6 of the 13 cases, the speedup is greater than 20%. As we see in
both Figure and Table ,
most of the benchmarks sacrifice very little in terms of memory stall
reduction by prefetching selectively. On the other hand, Figure
shows that indiscriminate prefetching suffers more from both increased
instruction overhead and greater stress on the memory subsystem. Overall,
selective prefetching is effective. In some cases (CFFT2D and MG), being
selective even turns prefetching from a performance loss into a performance
gain.
We evaluate the selective algorithm in more detail using the following two concepts. The coverage factor is the fraction of original cache misses that are prefetched. A prefetch is unnecessary if the line it fetches is already in the cache or is currently being fetched. An ideal prefetching scheme would achieve a coverage factor of 100% and would generate no unnecessary prefetches.
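Written as ratios over a program's execution (the notation is ours, introduced only to make the comparison precise), the two metrics are:
\[
\mbox{coverage factor} \;=\; \frac{\mbox{original misses that are prefetched}}{\mbox{original misses}},
\qquad
\mbox{fraction of unnecessary prefetches} \;=\; \frac{\mbox{prefetches of lines already cached or in transit}}{\mbox{prefetches issued}} .
\]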
Figure shows the fraction of unnecessary
prefetches and the coverage factor for the two prefetching algorithms. In
general, we see the encouraging result that selective prefetching reduces
the number of unnecessary prefetches without sacrificing much in terms of
coverage factor. Let us consider these numbers in more detail.
The coverage factor in the indiscriminate case is interesting because, in most
cases, it represents the fraction of cache misses that fall within the domain
of our analysis. Looking at Figure (b), we
notice that in 5 of the 13 cases (CFFT2D, CHOLSKY, BTRIX, GMTRY, and
VPENTA) the coverage factor is well over 90%, but in 5 of the other cases
(TOMCATV, OCEAN, IS, CG, and MG) it is 50% or lower. In the case of CG,
the coverage is only 38% because it is a sparse matrix algorithm and we
prefetch only the dense references. IS (integer sort) is similar: although it
is not a sparse matrix algorithm, it contains many indirect array references.
(We will improve the coverage for both of these cases later in
Section when we also prefetch indirect
references.) MXM (a blocked matrix multiplication kernel) is a surprising
case, since all of the important references are obviously affine, yet the
coverage factor is only 65%. This is a result of the way we account for
prefetches in our experiment; we associate a prefetch only with the very
next reference to the same cache line. Suppose the algorithm issues two
prefetches for the same line (a likely scenario without locality analysis)
followed by references to two consecutive words in that cache line; we say
that the first reference is prefetched but not the second. In the case of
MXM, cache conflicts between accesses to different arrays can cause the
second access to the same cache line to miss. Similar behavior also occurs
in TOMCATV, OCEAN, and MG. Finally, in the cases of EMIT and EP, many of
the remaining cache misses occur in library routines (which are not
compiled by our prefetching compiler) rather than in the programs themselves.
Furthermore, library routines present a difficult challenge because the
surrounding contexts of their call sites (e.g., the enclosing loop nests) are
not known in advance. Later, in Section , we
will suggest how profiling feedback or dynamically-adaptive code may help
in such cases.
Figure (a) shows that a large fraction of
prefetches issued under the indiscriminate scheme are unnecessary. In
all but one case, this fraction ranges from 60% to 95%. These
unnecessary prefetches can lead to large instruction overheads (MG) or
significant delays due to a saturated memory subsystem (CFFT2D and
CG).
A selective algorithm is successful if it can maintain a similar coverage
while lowering the number of unnecessary prefetches.
Figure (b) shows that in 11 of the 13 cases, the
coverage factor is reduced by less than 10%. Table
also supports this by showing that the miss rates have not increased
substantially, and the reduction in memory stall cycles is comparable. In
the cases where the coverage did go down, the problem is typically due to
the presence of cache conflicts. Comparing the percentages of unnecessary
prefetches in Figure (a), we see that the
improvement from selective prefetching is dramatic in many cases (CFFT2D,
CHOLSKY, IS, EP, MG).
The advantage of selective prefetching is summarized by the ratio of
indiscriminate to selective prefetches in Table .
Prefetching selectively can reduce the number of prefetches by as much
as a factor of 21. At the same time, the coverage factor remains
competitive. Overall, this selection process appears to be quite
successful.
Once we have used locality analysis to predict which dynamic data references should be prefetched, the next step in our algorithm is to isolate those cases through loop splitting.
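As a brief preview of that step, the sketch below shows the peeling form of loop splitting for a reference with temporal locality. The loop, the array names, and the simplification of issuing one prefetch per element are illustrative assumptions only, not code produced by our compiler.

    #define N 256   /* hypothetical inner-loop bound */

    /* b[j] is reused on every iteration of i (temporal locality), so its
     * prefetch predicate is (i == 0): only the first outer iteration can
     * miss on b.  Peeling that iteration isolates the case without any
     * run-time test.  Prefetches for a[][] and software pipelining of the
     * prefetches are omitted for brevity.                                  */
    void scale_rows(double a[][N], const double *b, int n)
    {
        /* peeled first iteration: the only place prefetches for b are issued */
        for (int j = 0; j < N; j++) {
            __builtin_prefetch(&b[j]);
            a[0][j] *= b[j];
        }
        /* remaining iterations: b is now resident in the cache */
        for (int i = 1; i < n; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= b[j];
    }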