Memory Consistency Models

Daniel Neill    and    Adam Wierman





The memory consistency model of a shared-memory multiprocessor system determines the order in which memory operations (reads and writes) appear to execute. We focus on three different models: sequential consistency (SC), processor consistency (PC), and release consistency (RC). Each model makes different guarantees as to the results of a sequence of memory operations; as long as these guarantees are upheld, the processors are free to reorder and overlap memory operations in order to improve performance.

The SC model, used in HP and MIPS systems, makes the strongest guarantees. First, memory operations appear to execute one at a time, in a single sequential order. Second, the operations of each individual processor appear to execute in program order. SC's strong guarantees make it the easiest model to program: it works correctly for both of our code examples, without the need for the programmer to include explicit synchronization instructions. However, its inability to perform aggressive reordering results in the lowest performance of the three models.

The PC model, used in Intel systems, allows a read that follows a write (to a different memory address) to execute out of program order. Writes may therefore not be immediately visible to other processors, but when they do become visible, they become visible in program order. PC works for the first of our two code examples but not the second; it achieves neither the programmability of SC nor the performance of RC.

The RC model, used in SPARC, Alpha, and PowerPC systems, is the most relaxed of the three models: it allows all reads and writes (to different addresses) to be performed out of order. RC can aggressively reorder memory instructions, allowing it to achieve the highest performance of the three models. However, it requires the programmer to identify memory conflicts and make sure that they are synchronized; neither of our two code examples works correctly for RC models unless synchronization commands are added.

Thus one major area of research is reducing the performance gap between SC and RC: for straightforward implementations, RC is nearly twice as fast. However, recent hardware optimizations, including coherent caching (per-processor caches with global mechanisms to ensure coherence), non-binding prefetches (speculative movement of data from memory to cache), and multithreading, reduce this performance gap to under 20%.

To further reduce this gap, the SC++ system (Gniady et al.) allows loads and stores to speculatively bypass each other. Since the reordering is done speculatively, the system must detect when sequential consistency is violated and "roll back" to a sequentially consistent processor state in the event of a violation. Gniady et al. show that these rollbacks are infrequent, so the system achieves performance nearly identical to RC while maintaining sequential consistency. However, SC++ requires a very large speculative state for rollbacks, and thus significantly more hardware for its implementation. Compiler optimizations provide a further boost to SC performance: by computing ahead of time which instructions can and cannot be reordered, they reduce the need for speculative reordering. The combination of speculation and compiler optimizations allows SC to achieve nearly the same performance as RC while maintaining a simple parallel semantics.