Dynamic information can be used at compile-time through feedback from earlier runs. The two types of feedback that may be useful for prefetching are control-flow feedback and memory behavior feedback, which we will briefly discuss.
Control-flow feedback (also known as ``branch profiling'') records how frequently each control path in the program is executed, thereby measuring branch outcome frequencies. Such information is often used in aggressive instruction-scheduling algorithms where it is important to schedule code beyond branches . Control-flow feedback would be useful in our prefetching compiler algorithm (described in Chapter ) when computing the localized iteration space for loop nests that include symbolic loop bounds. In these cases, the number of loop iterations is not a compile-time constant, and therefore it is difficult to predict whether the volume of data referenced by a loop exceeds the effective cache size. Using control-flow feedback, the compiler can take into account the average number of loop iterations of previous runs when making this decision. Control-flow feedback might also be useful for loops that contain conditional statements. For example, the number of iterations to software-pipeline prefetches ahead depends upon the expected number of instructions in a single loop iteration (see Section ). If the loop contains a conditional statement, we currently schedule for the worst case (i.e. the shortest path). However, if the branch outcome almost always favors a much longer path, the compiler algorithm may want to schedule for this longer path instead. Control-flow information is relatively inexpensive to collect, since it simply involves instrumenting the code to count basic blocks.
Feedback on memory behavior helps identify which references in the code are suffering the most from memory latency. This information can be collected at various levels of granularity. For example, the Mtool utility  collects information at the granularity of a single loop nesting. This is useful for identifying which loop nests suffer the most from latency, and which loop nests are insignificant. At the more fine-grained end of the spectrum, the feedback information might consist of precise miss rates for each reference in the program. One of the challenges of collecting any type of memory behavior information is that the behavior of the memory subsystem is usually hidden from the software. Mtool collects its information through either fine-grain timers or by sampling the program counter. Because of the potential distortion introduced by the large overheads of these techniques, it is only possible to collect information at the level of a loop nest. In order to collect individual miss rates in a program, it is necessary to either simulate the memory system (which is rather costly), or else make use of user-visible hardware miss counters. However, few commercial microprocessors currently provide support for such miss counters, and therefore collecting individual miss rates remains a non-trivial task.
Memory behavior feedback may be useful in a number of different situations. First, it may be useful when the localized iteration space (LIS) concept would work, but the LIS has been chosen improperly-perhaps because the loop bounds are unknown, or because the effective cache size is incorrect. In this case, the LIS can be adjusted to better match the observed behavior. Second, memory behavior feedback can help detect cases where cache conflicts result in more misses than locality analysis predicts. Similarly, in a multiprocessor environment, it can help detect cases where coherence activity causes additional misses beyond those predicted by locality analysis. Finally, it will be useful for determining whether to prefetch references outside the scope of locality analysis, such as indirect references.
To experiment with feedback, we used our simulator to collect both control-flow and fine-grain memory behavior feedback information. This information was incorporated into the compiler algorithm as follows. The control-flow information was used to compute the average number of iterations for each loop (also known as the ``trip count''). As we described earlier in this subsection, this count was used by the compiler to compute the localized iteration space whenever the loop bounds were unknown at compile-time.
The memory feedback information was used to annotate each load and store with both the number of times the reference was executed and the number of times it missed in the primary data cache-hence giving the precise miss rate. After performing locality analysis, we compare the predicted miss rate with the observed miss rate. If they agree within a certain acceptable margin, we schedule the code as usual. If they disagree, then we first try to find an explanation for the miss rate that is consistent with the intrinsic data reuse. This is important because to schedule the prefetches properly, we need to know when the misses occur.
For example, consider the two loop nests in Figure , and assume that from memory feedback we know the miss rate for loading A[j][k] to be 25%in both cases. Since the intrinsic data reuse for A[j][k] is quite different between the two cases, the interpretation of the miss rate also differs. For the code in Figure (a), where A[j][[k] has temporal reuse along the i loop (k is loop-invariant), a 25%miss rate corresponds to each load of A[j][k] missing on the first iteration of i and hitting on the remaining three iterations. Therefore the compiler would peel the i loop to schedule prefetches as normal for temporal locality. In contrast, the A[j][k] reference in Figure (b) has spatial rather than temporal reuse, and therefore we would not expect the same dynamic miss behavior. Assuming there are four A[j][k] elements per cache line, the 25%miss rate would correspond to missing on every fourth iteration of the inner loop (k) as cache line boundaries are crossed. So rather than peeling the outer loop, in this case the compiler would unroll the inner loop by a factor of four to schedule the prefetches.
This example in Figure illustrates that the miss rate alone does not necessarily provide enough information to schedule prefetches properly, since fractional miss rates (i.e. between 0%and 100%) are ambiguous in terms of which dynamic references are hits and which are misses. This ambiguity arises because the information relating individual misses to when they occur is lost in the course of summarizing them as a single miss rate. Even after examining the data reuse within a loop nest, the dynamic misses may still be unclear. For example, the 25%miss rate for loading A[j][k] in Figure (a) may correspond to at least two different miss patterns. First, if the A[j][k] references are not already in the cache when the loop is executed, then all of the misses would occur during the first iteration of the i loop, as we described earlier. However, if the A[j][k] references were already in the cache (perhaps because they were referenced in an earlier loop nest), then the misses may have occurred sporadically across all of the i loop iterations, perhaps due to occasional cache conflicts with other references. Prefetches scheduled for the former case would not cover the misses of this latter case. Therefore the compiler is faced with choosing the most likely explanation for the observed miss rates.
The approach we take is to first look for explanations consistent with locality analysis by adjusting the LIS and repartitioning equivalence classes as necessary. For example, if the actual miss rate is lower than expected, the compiler checks whether increasing the size of the LIS (thereby increasing the expected locality) would explain the miss rate. Similarly, it considers decreasing the LIS to explain higher-than-expected miss rates. This is particularly useful when the size of the LIS is unclear from static analysis alone. For example, given an 8 Kbyte primary data cache with 16 byte lines, it would be difficult to predict whether the LIS in Figure (a) included both surrounding loops since the amount of data referenced in a single iteration of the outer loop is very close to the cache capacity. With only static information, the compiler might normally predict that only the inner loop was within the LIS, and hence the expected miss rate of loading A[j][k] would be 100%. However, given the observed miss rate of 25%from feedback, the compiler would increase the LIS to include both loops. Therefore the A[j][k] reference would be predicted to have temporal locality, and its predicted miss rate would match the observed rate of 25%.
The compiler adjusts the partitioning of equivalence classes (sets of references predicted to have group locality) based on feedback as follows. First the compiler checks that the leading reference (i.e. the reference predicted to actually suffer the misses) of each equivalence class is the only one with a significant miss rate. If multiple references in the same equivalence class have significant miss rates, then the equivalence class is divided accordingly. If none of the references in the equivalence class has a significant miss rate (including the leading reference), then that equivalence class is marked as one that should not be prefetched. To determine which references should not be prefetched, the compiler ranks all references in descending order based on their contribution to total misses. A reference is not prefetched if it has a small miss rate and is ranked lower than the references accounting for the first 95%of the misses. This is a better approach than simply using a miss rate threshold, since it takes the frequency of reference into account. In other words, if a reference has a low miss rate but is executed so often that it accounts for most of the total misses, we still want to prefetch it. On the other hand, a reference with a higher miss rate that is rarely executed and therefore makes a negligible contribution to total misses can safely be ignored.
To evaluate the benefit of using feedback, we modified our compiler to automatically make use of control-flow and fine-grain memory feedback information. We simulated each application once without prefetching, using our simulator to automatically collect the feedback information. We then compiled each application a second time, this time using the feedback information. Figure shows the performance of the one case that improved significantly using feedback: OCEAN.
The difficulty for static analysis in OCEAN is that the critical two-level loop nesting is split across separate files-the outer loop is in one file, and the inner loop is inside a procedure call in another file. This loop performs a nearest-neighbor SOR computation, so there is a significant amount of group locality among the references. However, since our compiler does not perform interprocedural analysis across separate files, the prefetching algorithm does not recognize the group locality due to the outer loop, and therefore issues too many prefetches. Once feedback information is available, the compiler immediately recognizes the group locality, thus avoiding the unnecessary prefetches. Interestingly enough, eliminating prefetches actually reduces the memory stall time in this case by eliminating register spilling, since the spilled references were often conflicting with other data references. OCEAN illustrates how feedback may help the compiler overcome the fact that some types of analysis may be too expensive to implement in the compiler.
The main reason why feedback did not improve the other benchmarks is that static analysis alone was already quite successful for these types of codes. Profiling feedback is likely to be more useful in codes that are more difficult to analyze, such as ones containing pointers and recursive data structures. For the scientific codes we studied, however, the benefits of feedback do not appear to be large.