For our experiments so far, the effective cache size has been set to 500 bytes, which is only a small fraction of the actual cache size (8 Kbytes). Recall that the reason for this is to approximate the effects of cache conflicts in a direct-mapped cache. When the effective cache size is set to the full 8 Kbytes, our compiler generates identical code for 7 of the 13 benchmarks. For 5 of the 6 benchmarks that do change, the difference in performance is negligible. The one case that changes significantly is CFFT2D, as shown in Figure (b). In this case, fewer prefetches are issued with the larger effective cache size. However, the prefetches that are eliminated happen to be useful, since they fetch data that is replaced due to cache conflicts. As a result, performance suffers, as we see in Figure (b). (Note that this is in contrast with the effect we see in Figure (a), where issuing more prefetches hurts performance.) In the case of CFFT2D, many critical loops reference 2 Kbytes of data, and these loops happen to suffer from cache conflicts. An effective cache size of 500 bytes produces the desired result in this case.
In general, we observe that the volume of data accessed by a single iteration of a loop tends to fall into one of three categories: (i) a small constant amount (often less than 256 bytes), which is particularly common in inner loops; (ii) a very large constant amount (much larger than 8 Kbytes), which is common when the loop contains an inner loop with constant loop bounds; or (iii) an unknown amount. In all three cases, variations within the reasonable range of effective cache sizes have no effect, since in the first case the loop is definitely localized, in the second case the loop is definitely not localized, and in the third case only the policy on unknown loop bounds matters. Overall, the results appear to be robust with respect to effective cache size.
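The three-way classification above can be sketched as follows. This is a minimal illustration of the policy, not the compiler's actual interface: the function name, the `Locality` enumeration, and the flag controlling the unknown-bounds policy are all hypothetical.

```python
from enum import Enum

class Locality(Enum):
    LOCALIZED = "localized"          # data fits in cache; reuse hits
    NOT_LOCALIZED = "not_localized"  # data exceeds cache; reuse misses
    UNKNOWN = "unknown"              # loop bounds unknown at compile time

def classify_loop(data_volume_bytes, effective_cache_size=500,
                  assume_unknown_localized=False):
    """Classify a loop by the volume of data one iteration accesses.

    data_volume_bytes is None when the loop bounds are unknown at
    compile time; in that case only the unknown-bounds policy matters.
    """
    if data_volume_bytes is None:
        return (Locality.LOCALIZED if assume_unknown_localized
                else Locality.UNKNOWN)
    if data_volume_bytes <= effective_cache_size:
        return Locality.LOCALIZED
    return Locality.NOT_LOCALIZED

# The robustness observation: for the common cases, any effective cache
# size in the reasonable range (here, 500 bytes to 8 Kbytes) yields the
# same classification.
small = 128        # case (i): small constant amount (< 256 bytes)
large = 64 * 1024  # case (ii): very large constant amount (>> 8 Kbytes)
for ecs in (500, 2048, 8192):
    assert classify_loop(small, ecs) == Locality.LOCALIZED
    assert classify_loop(large, ecs) == Locality.NOT_LOCALIZED
```

The loop over candidate effective cache sizes makes the robustness argument concrete: only a loop whose data volume falls between the smallest and largest effective cache sizes considered (such as the 2-Kbyte loops in CFFT2D) can change classification.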