Although incorporating dynamic feedback information into the compilation process gives the compiler more information to reason with, it has a few shortcomings. First, the feedback process itself is a bit cumbersome, since it requires that the program be compiled twice. Also, depending on the level of detail of the information being collected, the process of executing the code to collect the information may be quite slow. Another difficulty is whether the dynamic profile that was captured is representative of all input data sets. This is a concern because optimizations tailored to one particular input data set may actually hurt performance on other data sets. In some cases it may not be possible to find a single ``representative'' data set, especially if the cache behavior depends critically upon whether the data set does or does not fit in the cache. This is particularly problematic when the problem size is determined at run-time, in which case it may be impossible to generate a single static piece of code that is appropriate for all input data sets. A similar problem arises in procedures such as library routines where it is impossible to analyze the calling arguments in all situations.
Rather than generating a single static piece of code that contains prefetching, another possibility is to generate code that adapts dynamically at run-time. This involves ``specialization'' of the prefetching code. The good news is that there are typically only a small number of different cases to specialize for-usually the data either should or should not be prefetched. Therefore when the compiler is uncertain about which case to generate, it could generate both cases, and choose the appropriate one to execute at run-time. The decision of which version to execute might be based on several different things: (1) the problem size (e.g., loop bounds defined at run-time), (2) data alignment in order to detect pathological cache conflict cases, or (3) hardware miss counters.
For example, Figure shows a case where checking the problem size is useful. The temporal reuse of the A[j] reference in Figure (a) may or may not result in temporal locality, depending on how large n is relative to the cache size. Figure (b) illustrates how the value of n can be checked at run-time to select the appropriate strategy for prefetching A[j].
Another problem that can potentially be detected dynamically is pathological cache interference between array references. When this occurs, the associated prefetches may also be ineffective, so it may be best to avoid the instruction overhead of issuing these useless prefetches. This situation can occur when two array references have similar affine access functions, such as references A[i] and B[i] in Figure (a), and when the difference in base addresses is a multiple of the cache size. Figure (b) illustrates code that adapts to this situation at run-time.
Finally, Figure illustrates how hardware miss counters can be used to dynamically detect whether data is already in the cache. These miss counters keep a running total of primary cache misses. It may be useful to have four miss counters, with each one devoted to counting one of the following types of misses: (i) loads, (ii) stores, (iii) prefetches, and (iv) exclusive-mode prefetches. Keeping separate counts is useful because it provides more information to the software. Also, there is a significant difference between seeing loads and stores hit in the cache (which is a good thing), and seeing prefetches hit in the cache (which is a bad thing). Miss counters should not be too expensive to implement since they only consume a small amount of chip area and should not affect cycle time.
As we see in Figure , the run-time overhead of using miss counters can be minimized as follows. First reset the counter before executing a loop. Then after executing some number of loop iterations, stop and check the counter to see if the data has been in the cache so far. If so, do not prefetch the data for the remaining loop iterations. Otherwise, continue prefetching. Assuming there are a reasonable number of loop iterations (the software can also test for this), the relative overhead of this check should be quite small. One way to detect misses is to start off the first several iterations without prefetching, and then check to see how many load or store misses occur. The drawback of this approach is that it does not eliminate the miss latency of those first several iterations. Perhaps a better approach, as illustrated in Figure (b), is to start off issuing prefetches for the first several iterations, and then test to see whether the prefetches have been hitting in the cache. If they have, assume the data is already in the cache and do not issue further prefetches.
While Figure shows how miss counters can be used within a single-level loop nest, their run-time overhead can be reduced even further by using them across outer loop iterations in multi-level loop nests. For example, Figure demonstrates how miss counters can be used to adapt to temporal locality along an outer loop.
Having described how hardware miss counters can be used, we will now show experimental results for two different benchmarks: BCOPY, and LU. We start with BCOPY, which is a ``block copy'' library routine for copying a block of data from one location to another. Although BCOPY is a very simple routine, it is interesting for two reasons. First, since it is a library routine, the compiler cannot make any assumptions about the input parameters and cannot analyze the call sites for locality. Second, since BCOPY is frequently called by the operating system to move data around, improving BCOPY may have a significant impact on system performance. We rewrote BCOPY by hand to take advantage of hardware miss counters, similar to the code in Figure . We added hardware miss counters to our simulator, and made them visible to the software through special function calls. We used a simple workload to drive BCOPY, since we were mainly interested in testing the ends of spectrum. The workload consisted of a loop which repeatedly called BCOPY with two distinct arrays and a given block size as arguments. Since identical block copies are performed on subsequent iterations, there can potentially be a temporal locality benefit if the block can remain in the cache. Figure shows the performance of BCOPY using various block sizes and loop iteration counts.
As we see in the ``500x10'' case (where BCOPY is called ten times with 500 byte blocks) in Figure , the original code without prefetching suffers a significant amount of miss latency. These misses occur the first time the routine is called, since the data remains in the cache for subsequent calls. As we see from the middle bar, the code which statically prefetches the blocks all the time actually performs worse than the original case, due to the large instruction overhead of the many unnecessary prefetches. The dynamically adaptive code (shown as the righthand bar) performs the best, since it recognizes that the data should be prefetched the first time, but not on subsequent calls. The other two cases in Figure show larger block sizes that do not fit in the cache. In these cases it is best to prefetch all the time, so stopping to check the miss counters is pure overhead. However, we see that this overhead is small enough that it has no noticeable impact on performance. Therefore the adaptive code provides the benefit of eliminating unnecessary prefetch overhead without introducing any significant cost.
The other application we modified to exploit hardware miss counters was LU, which performs parallel LU-decomposition on dense matrices. In LU, the same procedure is called repeatedly to apply a pivot column to other columns. If the columns are large relative to the cache size, then it is best to prefetch the pivot column each time it is referenced. On the other hand, if a column can fit in the cache, then the pivot column will only suffer misses the first time it is referenced. However, since this code is in a separate procedure, and since our compiler does not perform procedure cloning , the only static option is to prefetch the column all the time. Once again, we modified this procedure by hand, similar to BCOPY, to use hardware miss counters. The results of our experiments with LU are shown in Figure .
We ran LU with both large and small problem-size to cache-size ratios. The lefthand side of Figure shows the small ratio, meaning that columns tend to fit in the cache. In this case, the static approach of prefetching the columns all the time suffers from a significant amount of instruction overhead. In contrast, the adaptive code eliminates much of this unnecessary overhead while still hiding much of the memory latency, thereby resulting in the best overall performance. Looking at the righthand side of Figure , where the data size is large relative to the cache, we see the interesting result that the adaptive code still offers the best performance. This is because LU performs a triangular matrix solve, and therefore the last several columns always tend to fit in the cache. The savings of not issuing unnecessary prefetches for these last several columns more than offsets the instruction overhead of the dynamic test in this case. Therefore the adaptive code is clearly the best choice for LU.