In this chapter, we turn our attention from uniprocessors to multiprocessors. Multiprocessor architectures are promising because they have the potential of achieving very high computation rates. However, these machines tend to suffer from memory latency even more than uniprocessors, since they have the additional complication of communicating shared data among processors. Also, the physical dimensions of large-scale multiprocessors result in large latencies to remote memory locations. Therefore latency-hiding techniques such as prefetching are essential to exploit the full potential of multiprocessors.
For this study, we focus on a class of machine with the following properties: (i) physically-distributed main memory, (ii) a single address space, and (iii) hardware-coherent caches. These properties are important for achieving high performance while providing ease of programmability for the following reasons. First, main memory is physically distributed across the machine so that each processor continues to have high bandwidth and low latency to its local memory, even as the size of the machine is scaled up. Second, a single address space simplifies the naming of shared objects, and provides an intuitive programming model for the user (the ``shared memory'' model). Finally, hardware-coherent caches help reduce memory latency by allowing shared writable data to be cached and replicated. Examples of such machines include Kendall Square Research's KSR1 , Stanford's DASH , and MIT's Alewife .
One of the keys to achieving high utilization of a multiprocessor is effectively overlapping communication and computation. Message-passing machines allow the programmer to do this by sending explicit non-blocking messages. However, under the shared-memory abstraction, communication occurs when a processor reads a memory location that has been written by another processor. Since processors typically stall on reads, this means that communication is not overlapped with computation. Prefetching, however, provides a mechanism by which shared-memory applications can overlap communication (i.e. memory accesses) with computation.
This chapter is organized as follows. We begin in Section by discussing the prefetching issues that are unique to multiprocessing, and how they affect our compiler algorithm. Next, we describe our framework for performing multiprocessor experiments in Section . Then in Section we present the results of these experiments, including a detailed evaluation of each major component of the compiler algorithm. In Section , we vary the cache size to build additional confidence in our results, and to evaluate the effectiveness of the compiler algorithm in handling different mixtures of replacement misses and coherence misses. Section compares compiler-inserted prefetching with programmer-inserted prefetching, both for cases where the compiler succeeded in inserting prefetches and for cases where it failed to do so. Finally, in Section we summarize the major results of these multiprocessor experiments.