In this section, we evaluate the benefits of exclusive-mode prefetching, which helps to reduce both the miss latencies and the message traffic associated with writes. Unlike read misses, which directly stall the processor for their entire duration, write misses affect performance more indirectly, since writes can be buffered. A processor stalls while waiting for writes to complete in two situations: (i) when executing a write instruction if the write buffer is full, and (ii) during a read miss if previous writes must complete before the read miss can proceed. The impact of the former effect can be reduced through larger write buffers. Throughout this study, we use 16-entry write buffers, which we have found to be large enough to make the full-buffer effect negligible. The impact of the latter effect depends on whether reads are permitted to bypass writes (as allowed by the release consistency model), and whether the cache permits multiple outstanding accesses (as allowed by a lockup-free cache).
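The two stall situations above can be illustrated with a toy timing model. This is only a minimal sketch and not the simulator used in this study: the function name `simulate`, the trace encoding (`"W"` for a write miss, `"R"` for a read miss, `"C"` for one cycle of computation), the latency values, and the assumption that buffered writes are serviced one at a time in FIFO order are all hypothetical simplifications.

```python
from collections import deque


def simulate(trace, buffer_entries=16, write_latency=100,
             read_latency=100, reads_bypass_writes=True):
    """Toy model of write-related processor stalls (all parameters are assumed).

    trace: sequence of "W" (write miss), "R" (read miss), "C" (one compute cycle).
    Returns a dict of stall cycles by cause plus the total cycle count.
    """
    if buffer_entries < 1:
        raise ValueError("buffer_entries must be at least 1")
    t = 0
    pending = deque()  # completion times of buffered writes, oldest first
    stalls = {"buffer_full": 0, "write_order": 0, "read_miss": 0}
    for op in trace:
        # Retire writes that have completed by the current cycle.
        while pending and pending[0] <= t:
            pending.popleft()
        if op == "W":
            # Situation (i): the write buffer is full, so wait for the oldest write.
            if len(pending) == buffer_entries:
                stalls["buffer_full"] += pending[0] - t
                t = pending.popleft()
            # Assumption: writes are serviced serially, each taking write_latency.
            start = max(t, pending[-1]) if pending else t
            pending.append(start + write_latency)
            t += 1  # one cycle to place the write in the buffer
        elif op == "R":
            # Situation (ii): the read may not bypass earlier writes.
            if pending and not reads_bypass_writes:
                stalls["write_order"] += pending[-1] - t
                t = pending[-1]
                pending.clear()
            stalls["read_miss"] += read_latency
            t += read_latency
        elif op == "C":
            t += 1
        else:
            raise ValueError(f"unknown trace entry: {op!r}")
    stalls["total_cycles"] = t
    return stalls
```

In this sketch, enlarging `buffer_entries` removes only the `buffer_full` component, while the `write_order` component disappears only when `reads_bypass_writes` is true, which mirrors the distinction drawn in the text.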
As we described earlier in Section , our compiler uses an exclusive-mode (rather than a shared-mode) prefetch whenever any member of an equivalence class (i.e., a set of references that share group locality) is a write. This catches the important read-modify-write cases, and potentially eliminates as much as half of the message traffic. Table shows the fraction of prefetches that were exclusive-mode for each of the applications. To evaluate the case where exclusive-mode prefetches are not available, we replace each exclusive-mode prefetch with a normal "shared-mode" prefetch of the same address. Since the multiprocessor architecture we have used so far includes both release consistency (which allows writes to be buffered and reads to bypass pending writes) and lockup-free caches, write latency has no direct impact on performance. Consequently, exclusive-mode prefetching has a negligible performance impact on this architecture. It does, however, reduce the amount of message traffic, as shown in Table . If the architecture were bandwidth-limited (which in our case it is not), then this reduction in message traffic could have a direct payoff in improved performance.
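The selection rule and the source of the traffic savings can be sketched as follows. This is an illustrative sketch only, not the compiler's actual implementation: the `Reference` type and both function names are hypothetical, and the request count assumes a simplified invalidation-based protocol in which a write to a line held in shared state needs one additional ownership request (invalidations and acknowledgments are ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical stand-in for one array reference in an equivalence class."""
    name: str
    is_write: bool


def prefetch_mode(equivalence_class):
    """Use an exclusive-mode prefetch if any member of the class is a write."""
    return "exclusive" if any(r.is_write for r in equivalence_class) else "shared"


def coherence_requests(equivalence_class, exclusive_available=True):
    """Count requests needed to obtain one line for the class (simplified).

    Assumption: a line fetched in shared mode and later written requires a
    second request to gain ownership; an exclusive-mode fetch needs only one.
    """
    has_write = any(r.is_write for r in equivalence_class)
    if not has_write:
        return 1  # a single read request suffices
    if exclusive_available:
        return 1  # one read-exclusive request brings the line in with ownership
    return 2      # shared-mode fetch followed by an ownership request


# Example: a read-modify-write such as A[i][j] = A[i][j] + 1
rmw = [Reference("A[i][j]", is_write=False), Reference("A[i][j]", is_write=True)]
read_only = [Reference("B[i][j]", is_write=False)]
```

Under these assumptions, the read-modify-write class costs two requests with shared-mode prefetching and one with exclusive-mode prefetching, which is the "as much as half" case; read-only classes are unaffected.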
To evaluate the benefit of exclusive-mode prefetching in an architecture where write latency is not already completely hidden, we performed the same experiment on an architecture that uses sequential consistency rather than release consistency. With this stricter consistency model, the processor must stall after every shared access until that access completes. (We will discuss consistency models in greater detail later in Section .) The results of this experiment are shown in Figure . Notice that the memory stall time in Figure has been broken down further into write stall time and read stall time (under the release consistency model assumed so far in this chapter, nearly all of the memory stall time is read stall time).
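The reason write stall time appears only under sequential consistency can be seen in a back-of-the-envelope model. The sketch below is a deliberately idealized assumption, not a description of the simulated architecture: the function name `stall_breakdown`, the latency values, and the premises that prefetches are issued early enough to hide latency completely and that release consistency hides all write latency are hypothetical.

```python
def stall_breakdown(n_updates, consistency, prefetch,
                    read_latency=100, write_latency=100):
    """Idealized stall time for n_updates read-modify-writes to distinct lines.

    consistency: "SC" (stall after every shared access until it completes) or
                 "RC" (writes are buffered and fully hidden; an assumption).
    prefetch:    None, "shared", or "exclusive"; prefetches are assumed to
                 arrive early enough to hide the latency entirely.
    """
    if consistency not in ("SC", "RC"):
        raise ValueError("consistency must be 'SC' or 'RC'")
    if prefetch not in (None, "shared", "exclusive"):
        raise ValueError("prefetch must be None, 'shared', or 'exclusive'")

    # The read misses unless the line was prefetched in either mode.
    read_stall = 0 if prefetch is not None else n_updates * read_latency

    # The write still needs ownership unless the prefetch was exclusive-mode.
    write_needs_ownership = prefetch != "exclusive"
    # Only sequential consistency exposes that latency as processor stall time.
    if write_needs_ownership and consistency == "SC":
        write_stall = n_updates * write_latency
    else:
        write_stall = 0

    return {"read_stall": read_stall, "write_stall": write_stall}
```

In this model, switching from shared-mode to exclusive-mode prefetching changes nothing under `"RC"` but removes the entire write stall component under `"SC"`, which is the qualitative effect the experiment is designed to measure.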
Figure shows that exclusive-mode prefetching can result in dramatic performance improvements in an architecture using sequential consistency: OCEAN and MP3D achieved speedups of 73% and 37%, respectively. The speedups for CHOLESKY and LOCUS were understandably smaller (10% and 3%, respectively) since they make less use of exclusive-mode prefetching, as shown in Table . In the case of LU, the write latency is small to begin with since the processors only write to their local columns, which tend to fit in the secondary caches.
In summary, exclusive-mode prefetching can provide significant performance benefits in architectures that have not already eliminated write stall times through aggressive implementations of weaker consistency models with lockup-free caches. Even if write stall times cannot be further reduced, exclusive-mode prefetching can improve performance somewhat by reducing the traffic associated with cache coherency.