As in the uniprocessor prefetching algorithm, the goal of locality analysis is to determine which references will hit in the cache so as to reduce prefetching overhead. Locality analysis is slightly different for multiprocessors because it takes explicit synchronization into account when computing the localized iteration space, as we described earlier in Section . To evaluate the effectiveness of locality analysis, we perform an experiment similar to that in Section by comparing indiscriminate prefetching (where array references are prefetched all the time) with selective prefetching (which uses locality analysis). The results of this experiment are shown in Figures and .
Figure shows that prefetching selectively rather than indiscriminately offers speedups ranging from 23%to 39%. As we saw earlier in the uniprocessor results, the reduction in memory stall time is comparable between the two schemes, but the real difference is that selective prefetching has significantly less overhead than indiscriminate prefetching. In two cases (CHOLESKY and LOCUS), prefetching selectively makes the difference between a performance loss and a performance gain.
Figure presents two useful metrics for evaluating locality analysis: the percentage of unnecessary prefetches, and the coverage factor. Comparing both of these metrics between the two schemes, we see that once again selective prefetching eliminates a significant fraction of unnecessary prefetches, while giving up very little in terms of coverage factor. The reduction in total prefetches ranges from a factor of 2.2 to a factor of 7.7.
Looking at the magnitude of unnecessary prefetches remaining after selective prefetching in Figure (a), we see that three applications are below 25%(OCEAN, MP3D, and CHOLESKY), while the other two are above 45%(LU and LOCUS). The problem in both of these latter cases is that they contain references to important data structures which tend to fit in the cache, but which are occassionally modified by other processors, thus resulting in coherence misses. While it is difficult to predict which dynamic references to these structures will suffer misses, it is even more difficult to isolate those miss instances-i.e. techniques such as loop splitting are not applicable in these cases. Since our algorithm is conservative about coherence misses, it prefetches these references all the time, thus resulting in a relatively large fraction of prefetches being unnecessary. We now describe these two cases in more detail.
In LU, the important data structure is the pivot column. A pivot column is produced by a single processor, and is then used repeatedly by the other processors to modify their local columns. This process of applying a pivot column to a local column occurs inside a procedure, which is called once for each local column. The first time a given pivot column is applied to a local column, the pivot column will miss in the cache, since it still resides in the cache of the processor that produced the pivot column. However, on subsequent calls of the same procedure with the same pivot column, the column will hit since it is retained in the cache. Since our prefetching algorithm does not look across procedure boundaries, it cannot recognize the temporal locality of the pivot column, and therefore prefetches it each time the procedure is called. Overcoming this limitation statically would require not only interprocedural analysis, but also the ability to either inline or clone  the procedure body so that prefetches can be scheduled for only the first use of a pivot column. Another possibility is to use dynamic information in the form of hardware miss counters, thus allowing the procedure to adapt dynamically to whether the pivot column is in the cache. As we will see later in Section , this latter method is effective at eliminating unnecessary prefetches in LU. However, even without any of these improvements, we observe that despite the unnecessary prefetching overhead in LU, it achieves the greatest speedup of all the applications (more than twofold) since the advantage of prefetching the pivot columns when they do miss is quite large.
In the case of LOCUS, the important data structure is the cost array, which represents the current placement of wires. Each processor's portion of the cost array tends to fit in its cache, but parts of it are occassionally invalidated as other processors rip up or put down new wires. Since these modifications occur erratically, the compiler cannot predict when they will occur and simply prefetches the cost array each time it is traversed. Therefore a significant fraction of the prefetches are unnecessary, but this is the only way to cover these coherence misses, and it does account for the bulk of the reduction in memory stalls.
LOCUS has the distinction of not only having the highest fraction of unnecessary prefetches, but also of having the lowest coverage factor (75%), as we see in Figure (b). The reason why the coverage factor is not higher (which is also true for CHOLESKY) is that many of the important array references are inside while loops rather than for loops. Since a while loop has no induction variable (either explicit or implicit), neither locality analysis nor software pipelining is applicable.
Finally, we observe that locality analysis is successful at reducing unnecessary prefetches while maintaining a respectable coverage factor despite our conservative assumption that shared data does not remain in the cache across synchronization statements. In LU and LOCUS, the remaining unnecessary prefetches were not caused by adjusting the localized iteration space due to explicit synchronization (as described in Section ). The problem in LU is procedure boundaries, and there is no synchronization in the loops for LOCUS, since the cost array is not protected by locks. In fact, when we recompiled each application and did not take explicit synchronization into account, we got precisely the same performance for each application. In other words, explicit synchronization never changed our locality predictions in significant ways. This occurred for two reasons. First, programmers realize that synchronization is costly and therefore avoid putting it in inner loops. Instead, synchronization tends to happen in outer loops and between loop nests, which makes it less likely to affect loop-level data locality. Second, in some cases the programmers decide that the performance advantage of not locking data structures is worth whatever distortion it causes in the result, as is the case in both MP3D and LOCUS. Therefore although communication takes place, there is no explicit synchronization for our algorithm to use as a hint. However, even if the compiler did realize that coherence misses were occurring in these cases, the only choice it has is to prefetch all the time, since the misses occur erratically. Therefore the desired effect was achieved in our experiments despite the lack of sophisticated analysis of communication patterns.
Overall, prefetching selectively through the use of locality analysis appears to be quite successful both for uniprocessors and multiprocessors. Once locality analysis has predicted which references to prefetch, the next step in our algorithm is scheduling those prefetches so that they are effective at hiding latency.