In this section, we study the effects of varying the cache size. Our
motivation is twofold. First, we would like to build additional confidence
in our results, given our cache scaling methodology. Recall from
Section that in order to achieve realistic
problem-size to cache-size ratios, the results presented so far in this
chapter are based on an 8K/64K cache hierarchy, which is scaled down from
the full 64K/256K caches in DASH. By varying the cache size, we can explore
the effects of cache scaling on performance. Second, by running the same
problem on varying cache sizes, we effectively change the mixture of misses
due to replacements (because of the finite cache size) versus misses
due to coherence (because of the communication of shared data). This
allows us to evaluate how well our compiler algorithm prefetches both types
of misses.
The results of our experiments are presented in
Figure , where the performance of each
application is shown on three different sets of cache sizes. The middle
pair of bars are for the scaled cache sizes of 8K/64K that we used
throughout this chapter. The rightmost pair of bars are for the full-sized
64K/256K caches in DASH. The leftmost pair of bars are for a smaller cache
hierarchy of 2K/8K. We begin by focusing on how the code without
prefetching performs across the different cache sizes, and then later
compare this to the behavior of the code with prefetching.
Focusing first on the code without prefetching, we see a variety of
behaviors in Figure . At one extreme is MP3D, where varying
the cache size has virtually no impact on performance. This is because MP3D
sweeps through a very large data set on each time step, and even the
64K/256K caches are not large enough to retain the data (each processor's
particles occupy 200 Kbytes, and the space cell array occupies 288 Kbytes).
This regime, where the data set is much larger than the cache, is expected
to be the norm in real-life runs of MP3D, so our results are indicative of
the performance on real machines. At the other extreme are
applications like LU, CHOLESKY, and LOCUS, where there is a noticeable knee
in performance once a key data structure fits in the cache. For LU, this
knee occurs between the 8K/64K and 64K/256K cache sizes, and corresponds to
when the columns owned by a processor (roughly 20 Kbytes) fit in the
primary cache. For both CHOLESKY and LOCUS, the knee occurs between the
2K/8K and 8K/64K cache sizes. In the case of CHOLESKY, this corresponds to
an entire panel fitting in the primary cache. For LOCUS, it corresponds to
the portion of the cost array where a wire is being routed fitting in the
cache. Therefore we see dramatic performance swings (more than 80%
improvement) as the knee is crossed in three of the five cases without
prefetching.
In contrast, the performance of the code with prefetching varies
more gradually over the range of cache sizes. For example, while the
performance of LU without prefetching improves by 82% when the cache sizes
increase from 8K/64K to 64K/256K, the performance with prefetching changes
by only 7%. Similarly, the performance improvement as the cache sizes are
increased from 2K/8K to 8K/64K is 156% without prefetching vs. 14% with
prefetching for CHOLESKY, and 85% without prefetching vs. 37% with
prefetching for LOCUS. Why is the prefetching performance less sensitive to
cache size? As the cache size is reduced, we see more replacement misses in
the code without prefetching. However, since our algorithm is effective at
prefetching replacement misses (as demonstrated already in
Chapter ), the resulting latency is largely hidden, making the code with
prefetching less sensitive to cache size. In fact, in three of the five
cases (OCEAN, LU, and MP3D), the
performance of code with prefetching on 2K/8K caches was better than code
without prefetching on 64K/256K caches.
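To make this mechanism concrete, the following C sketch (a minimal
illustration under assumed parameters, not our compiler's actual output;
the function name, prefetch distance, and compiler builtin are all
assumptions) shows how a prefetch issued a fixed distance ahead of a
sequential sweep hides replacement misses: when the array is far larger
than the cache, every new line misses regardless of cache size, but the
fetch is overlapped with the computation.

    #include <stddef.h>

    #define PF_DIST 16                /* assumed prefetch distance (elements) */

    /* Sweep over an array that may be far larger than the cache.  Without
       prefetching, each new cache line is a replacement miss; with the
       prefetch issued PF_DIST iterations early, that miss latency is
       overlapped with the additions. */
    double sweep(const double *v, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&v[i + PF_DIST]);
            sum += v[i];
        }
        return sum;
    }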
We now consider how cache size variations affect the relative performance
improvement offered by prefetching. In general,
Figure shows that prefetching improves
performance more with smaller cache sizes. The obvious reason for this is
that smaller caches have more latency to hide due to replacement misses,
and thus there is more room for improvement. For example, prefetching
offers only modest speedups for CHOLESKY and LOCUS on 8K/64K caches, but
provides dramatic speedups on 2K/8K caches (139% and 48%, respectively).
With these smaller caches, the prefetches for the panels in CHOLESKY and
the cost array in LOCUS are almost always useful; with larger caches, these
structures only miss during communication, and therefore the prefetches are
not as beneficial. As evidence of this reduced benefit, the instruction
overhead offsets much of the reduction in memory stalls for both CHOLESKY
and LOCUS with 64K/256K caches.
Can these instruction overheads with large caches be reduced by prefetching
only the coherence misses? For CHOLESKY, the answer is yes, but only with
special techniques for handling procedures. The difficulty in CHOLESKY is
that each access to a panel occurs as a separate call to the same
procedure. To isolate only the coherence misses, we would like to prefetch
a panel only the first time it is accessed by a given processor. The
compiler cannot normally distinguish that first access across procedure
calls, but it could with either inlining or procedure cloning [15]. (Note
that a very similar situation occurs in LU.) For LOCUS, however, the answer
is no: the coherence misses cannot be isolated. The reason is that
coherence misses in LOCUS occur only when another processor routes a wire
in the same portion of the cost array as the given processor. Since these
misses are unpredictable and sporadic, they cannot be isolated, and
therefore we must conservatively prefetch the cost array all the time.
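As an illustration of the cloning approach, consider the C sketch below.
The routine names, the line-size constant, and the prefetch builtin are
purely illustrative assumptions, not the actual CHOLESKY code or our
compiler's output; the point is only that a clone lets the prefetches be
confined to first-access call sites.

    #include <stddef.h>

    #define LINE_DOUBLES 8            /* assumes 64-byte cache lines */

    /* Original routine: invoked once per panel access, so the compiler
       cannot tell from inside it whether the panel is already cached. */
    void update_panel(double *panel, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            panel[i] *= 2.0;          /* stand-in for the real update */
    }

    /* Clone used only at call sites known to be a processor's first
       access to the panel: identical work, but the panel is prefetched
       up front so its coherence misses overlap with the update. */
    void update_panel_first(double *panel, size_t n)
    {
        for (size_t i = 0; i < n; i += LINE_DOUBLES)
            __builtin_prefetch(&panel[i]);
        update_panel(panel, n);
    }

All subsequent accesses call the plain version and therefore pay no
prefetch instruction overhead.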
In summary, we have seen that the performance advantages of prefetching
often remain significant even as the cache size is varied. This robustness
occurs despite the fact that we conservatively assume that data does not
remain in the cache across synchronization statements (see
Section ). Since our algorithm effectively
handles replacement misses, the performance with prefetching is relatively
insensitive to the cache size, unlike code without prefetching, where
performance can suffer greatly once key data structures no longer fit in
the cache. Once the caches are large enough that little latency remains, it
becomes increasingly important to minimize instruction overhead by
prefetching only the coherence misses. However, in some cases it is
impossible to isolate only the coherence misses, since they cannot be
predicted, and therefore the compiler is forced to be conservative. In such
cases, profiling feedback may prove useful, as we will discuss later in
Section .