The first issue we examine is the distinction between binding and non-binding prefetches. With a binding prefetch, the data value is ``bound'' at the time the prefetch is performed: the value is placed in a buffer or a register, and that is the value a subsequent access will see. The problem with a binding prefetch is that if another processor modifies the location during the interval between when the prefetch is issued and when the data value is used, the value delivered will be stale. This places significant restrictions on when binding prefetches can be issued.
With a non-binding prefetch, however, the data value is not bound until it is referenced by a subsequent load operation. This is implemented by placing the prefetched data in the hardware-coherent cache, rather than a register or separate buffer. Therefore the prefetched data will remain visible to the coherence mechanism, which ensures that later accesses will always see the correct value (albeit at the potential expense of not finding the data in the cache if an invalidation occurs).
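As a concrete illustration of the hazard described above, the following C sketch emulates a binding prefetch by loading the shared value into a local temporary before entering the critical section. (The pthread-style locking and the function and variable names are our own illustrative choices, not taken from the original.)

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int x;   /* shared counter */

void increment_binding(void)
{
    /* Emulated binding prefetch: the value of x is bound to t here.
       If another thread acquires the lock and increments x between
       this load and the store below, that update is silently lost. */
    int t = x;

    pthread_mutex_lock(&lock);
    x = t + 1;   /* may write back a stale value under contention */
    pthread_mutex_unlock(&lock);
}
```

Note that this sketch behaves correctly when run single-threaded; the lost-update bug appears only under contention, which is precisely why a compiler inserting binding prefetches must reason about all possible interleavings.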
To illustrate the difference between binding and non-binding prefetches, consider the code fragment in Figure . Here a shared variable (x) is modified within a critical section. Since there obviously is not enough time within the critical section to hide the latency of prefetching x, we would like to move the prefetch of x outside the critical section to schedule it far enough in advance. However, with a binding prefetch, this would be illegal, since another processor may succeed in entering the critical section first, thus modifying x. If this occurred, then the value bound at the time of prefetch would be too small, since it would not reflect the more recent increment, and therefore the program would behave incorrectly. With a non-binding prefetch, however, this code sequence would be perfectly legal, since the correct value of x would always be observed inside the critical section.
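A minimal sketch of the kind of code fragment the figure describes, with a non-binding prefetch hoisted above the critical section, might look as follows. (The pthread-style locking and GCC's `__builtin_prefetch` intrinsic are our own illustrative choices, not from the original.)

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int x;   /* shared variable modified in the critical section */

void increment_nonbinding(void)
{
    /* Non-binding prefetch: the line containing x is pulled into the
       coherent cache, but no value is bound yet.  If another processor
       enters the critical section first and writes x, coherence simply
       invalidates the prefetched line; the load below then misses but
       still returns the correct, up-to-date value. */
    __builtin_prefetch(&x, 1 /* prefetch for write */);

    /* ... enough independent work here to hide the prefetch latency ... */

    pthread_mutex_lock(&lock);
    x = x + 1;   /* the value of x is bound only at this load */
    pthread_mutex_unlock(&lock);
}
```

Because the prefetch is purely a performance hint, removing it cannot change the program's result; it can only change how long the load inside the critical section takes.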
Thus the key advantage of non-binding prefetching is that it frees the compiler from the burden of preserving correctness, and instead allows it to focus on the real issue, which is improving performance. Preserving correctness under binding prefetching is a difficult challenge for the compiler, since it requires a full understanding of the communication behavior and any explicit or implicit synchronization that occurs. As a result, compiler algorithms that insert binding prefetches spend most of their effort worrying about whether prefetching is legal, and are often forced to be conservative due to complications such as imperfect memory disambiguation and explicitly parallelized code.
To make significant headway in prefetching for multiprocessors, we must move beyond the distractions of correctness and focus instead on the deeper performance issues. For example, even for the relatively simple code in Figure , it is not clear whether prefetching should be used. On the one hand, if there is significant contention for the lock (e.g., if x is the current bound in a branch-and-bound algorithm), then it is unlikely that the prefetched line will remain in the cache, since another processor is likely to enter the critical section first, thus invalidating the line when it is written. On the other hand, if there is little contention for the lock (e.g., if x is a migratory object such as a circuit element in a logic simulator), then prefetching outside the critical section may result in large performance gains.
Since latencies in multiprocessors can easily be quite large (e.g., over 100 cycles in DASH), the compiler will need all of the flexibility it can get to move prefetches far enough in advance of the references. Therefore the advantages of non-binding prefetches are essential when compiling for multiprocessors.
How does non-binding prefetching affect our compiler algorithm? The good news is that it does not affect it at all. If we wanted to, we could use exactly the same prefetching algorithm we used for the uniprocessor code, ignoring the fact that we are compiling for a multiprocessor, and the code would still get the correct answer. Therefore our uniprocessor algorithm will serve as the starting point for our multiprocessor prefetching algorithm. In the following two subsections, we discuss the two changes that we do make to the algorithm to optimize it for multiprocessor performance.