In the case of GMTRY, the locality optimizer is able to ``block'' the critical loop nest. In other words, rather than iterating over entire matrices that are too large to fit in the cache, the code is restructured to iterate over smaller ``blocks'' within the matrices, such that each block does fit in the cache. Since the data within each block is reused many times before the code proceeds to the next block, these reuses result in cache hits (i.e., locality), because the reused data can now be retained by the cache.
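The blocking transformation described above can be sketched on a simple loop nest. The example below is not GMTRY's actual code; it uses a matrix transpose with hypothetical sizes (N, B) purely to illustrate how the restructured loops visit one cache-sized tile at a time.

```c
#include <assert.h>
#include <string.h>

#define N 32   /* matrix dimension (hypothetical, for illustration) */
#define B 8    /* block (tile) size, chosen so a B x B tile fits in cache */

/* Naive transpose: the column-wise accesses to 'out' stride through the
 * entire N x N matrix, so for large N each line is evicted before reuse. */
static void transpose_naive(const double *in, double *out)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j * N + i] = in[i * N + j];
}

/* Blocked transpose: the outer loops walk B x B tiles; the inner loops
 * reuse data within one tile while it is still resident in the cache. */
static void transpose_blocked(const double *in, double *out)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j * N + i] = in[i * N + j];
}
```

Both versions compute the same result; only the iteration order changes, which is why the optimizer can apply blocking without affecting correctness.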
With this locality optimization alone, 90% of the original memory stall time is eliminated. Comparing blocking with prefetching, we see that blocking has better overall performance than prefetching in this case. Although prefetching reduces more of the memory stall cycles, blocking has the advantage of not suffering any of the instruction or memory overhead of prefetching. Comparing the prefetching schemes before and after blocking, we see that blocking improves the performance of both prefetching schemes. One reason is that blocking eliminates the memory overheads associated with prefetching, since less memory bandwidth is consumed. Also, the selective prefetching scheme reduces its overhead by recognizing that blocking has occurred and therefore issuing fewer prefetches.
The best performance overall occurs with the blocking optimization alone. When blocking is combined with indiscriminate prefetching, performance degrades relative to blocking alone due to the instruction overhead of issuing large numbers of unnecessary prefetches. However, when blocking is combined with selective prefetching, the selective algorithm is clever enough to avoid these unnecessary prefetches and therefore does not hurt performance.
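The contrast between indiscriminate and selective prefetch insertion can be sketched by hand. This is not the compiler algorithm from the study; it is a minimal illustration using the GCC/Clang `__builtin_prefetch` intrinsic, where the "selective" version exploits spatial locality by issuing only one prefetch per cache line (assuming a hypothetical 64-byte line holding 8 doubles) instead of one per reference.

```c
#include <stddef.h>

#define LINE_DOUBLES 8  /* assumed 64-byte cache line / 8-byte doubles */
#define AHEAD 16        /* hypothetical prefetch distance, in elements */

/* Indiscriminate: a prefetch accompanies every reference, even those that
 * would hit anyway -- the extra instructions are pure overhead. */
static double sum_indiscriminate(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + AHEAD], 0, 1);  /* one per element */
        s += a[i];
    }
    return s;
}

/* Selective: only the first reference to each cache line can miss, so a
 * single prefetch per line suffices; the loop body is otherwise identical. */
static double sum_selective(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_DOUBLES == 0)                /* one per cache line */
            __builtin_prefetch(&a[i + AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```

Prefetches are hints and never change the computed result, so both functions return the same sum; the difference is purely in instruction overhead, which is exactly the cost that made indiscriminate prefetching lose to blocking alone.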