Ideally, we would like to incorporate the notion of coherence misses into our framework of locality analysis, so that a single model could predict all forms of misses. How can this be accomplished? As you may recall from Section , data locality occurs whenever a cache line is reused, provided that the line has not been ejected from the cache since it was last used. In a uniprocessor architecture, a cache line is ejected from the cache only if it is replaced by another reference that maps into the same cache entry. Our algorithm predicts that such replacements will occur whenever the volume of data accessed between reuses exceeds the effective size of the cache. In a multiprocessor architecture, a cache line can also be ejected from the cache if the line is modified by another processor during the interval between reuses, thereby invalidating it in the given processor's cache (we assume an invalidation-based coherence protocol). Under locality analysis, the ejection of data from the cache is modeled through the localized iteration space, which is the set of loops that can carry data locality. Therefore, to predict coherence misses, we extend the concept of the localized iteration space to account not only for the volume of data accessed, but also for whether lines are being modified by other processors.
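As a concrete illustration of the uniprocessor case, consider the hypothetical loop nest below (the code and its names are ours, purely for exposition, and are not taken from the applications studied here). The reuse of a[j] across iterations of the outer i loop is realized as locality only if the data touched by the inner loop fit within the effective cache size:

\begin{verbatim}
/* Illustrative loop nest (not taken from the benchmarks studied here).
 * Each a[j] is reused on every iteration of the outer i loop, but that
 * reuse becomes cache hits only if the n*sizeof(double) bytes touched by
 * the inner loop fit in the effective cache; otherwise each line is
 * replaced before it can be reused, and the outer loop does not carry
 * locality. */
void sweep(double *a, double *b, int n, int m)
{
    for (int i = 0; i < m; i++)       /* carries reuse of a[0..n-1] */
        for (int j = 0; j < n; j++)
            b[i] += a[j];
}
\end{verbatim}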
Accurately predicting when another processor modifies a given data line is a very difficult problem for the compiler, since it requires a complete understanding of the communication and synchronization patterns of an application, as well as the data addresses being accessed. In cases where the application has been parallelized by the compiler, this may be feasible, since the compiler presumably must understand the communication patterns well to perform the parallelization. However, even in such cases, factors such as dynamic scheduling may make it difficult to predict exactly when the modifications occur relative to other processors, and which lines are being modified. In addition, while it may be tractable to determine when particular data items are shared among processors (i.e., true sharing), it is more difficult to predict the coherence misses that occur only because separate items fall within the same cache line (i.e., false sharing) [81, 19]. Finally, when the compiler is dealing with explicitly-parallelized programs, as is the case in our experiments, it is difficult (if not impossible) for the compiler to extract the communication patterns, since much of this semantic information exists only in the programmer's head.
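To make the distinction concrete, the following small C fragment (the names are ours and purely illustrative) shows how false sharing can arise even though the two processors never access the same word:

\begin{verbatim}
/* Illustrative example of false sharing (names are ours, for exposition
 * only). Each processor updates only its own counter, so there is no true
 * sharing of data, but because both counters typically fall within the
 * same cache line, every write by one processor invalidates the line in
 * the other processor's cache. */
struct counters {
    long count0;    /* written only by processor 0 */
    long count1;    /* written only by processor 1 */
};                  /* both fields usually share one cache line */

void worker0(struct counters *c) { c->count0++; }
void worker1(struct counters *c) { c->count1++; }
\end{verbatim}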
Although the compiler cannot precisely understand the communication patterns in the explicitly-parallelized codes used in our experiments, a useful hint is provided by the explicit synchronization statements inserted by the programmer to ensure that shared data are accessed safely. Assuming that a program is ``properly labeled'' [27] or ``data-race-free'' [2], synchronization statements should exist between the time when one processor modifies data and the time when a second processor reads that data. Therefore, the compiler interprets explicit synchronization as a hint that data communication may be taking place. Ideally, the compiler would know which data are protected by any given synchronization variable, since such information would allow it to reason more precisely about which particular variables may have been modified. However, since such semantic information exists only in the programmer's head, our approach is to conservatively assume the worst: that all shared objects may have been modified, and therefore no locality is carried across explicit synchronization. We incorporate this notion into our localized iteration space model by saying that a loop is not localized if it either (i) accesses too much data (to model replacement misses, as described earlier in Section ), or (ii) contains explicit synchronization (to model coherence misses).
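A minimal sketch of this extended test might look as follows (the function and argument names are hypothetical, not taken from the actual compiler implementation):

\begin{verbatim}
#include <stdbool.h>

/* Hypothetical sketch of the extended localized-iteration-space test:
 * a loop is not localized if it either accesses more data than fits in
 * the effective cache (modeling replacement misses) or contains explicit
 * synchronization (modeling coherence misses). */
bool loop_is_localized(long data_volume_in_bytes,
                       long effective_cache_size_in_bytes,
                       bool contains_explicit_synchronization)
{
    if (data_volume_in_bytes > effective_cache_size_in_bytes)
        return false;   /* models replacement misses */
    if (contains_explicit_synchronization)
        return false;   /* models coherence misses */
    return true;
}
\end{verbatim}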