Underlying any realization of a parallel random-access machine (PRAM) is a communication network that conveys information between processors and memory banks. Yet in most PRAM models, communication issues are largely ignored. The basic assumption in these models is that in unit time each processor can simultaneously access one memory location. For truly large parallel computers, however, computer engineers may be hard pressed to implement networks with the communication bandwidth demanded by this assumption, due in part to packaging constraints. The difficulty of building such networks threatens the validity of the PRAM as a predictor of algorithmic performance. This paper introduces a more restricted PRAM model, which we call a distributed random-access machine (DRAM), to reflect an assumption of limited communication bandwidth in the underlying network.
In a communication network, we can measure the cost of communication in terms of the number of messages that must cross a cut of the network, as in  and . Specifically, a cut S of a network is a subset of the nodes of the network. The capacity is the number of wires connecting processors in S with processors in the rest of the network , i.e., the bandwidth of communication between S and . For a set M of messages, we define the load of M on a cut S to be the number of messages in M whose source is in S and whose destination is in or vice versa. The load factor of M on S is
and the load factor of M on the entire network is
The load factor provides a simple lower bound on the time required to deliver a set of messages. For instance, if there are 10 messages to be sent across a cut of capacity 3, the time required to deliver all 10 messages is at least the load factor .
There are two commonly occurring types of message congestion that the load factor measures effectively. One is the ``hot spot'' phenomenon identified by Pfister and Norton . When many processors send messages to a single other processor, large delays can be experienced as messages queue for access to that other processor. In this situation, the load factor on the cut that isolates the single processor is high. The second phenomenon is message congestion due to pinboundedness. In this case, it is the limited bandwidth imposed by the packaging technology that can cause high load factors. For example, the cut of the network that limits communication performance for some set of messages might correspond to the pins on a printed-circuit board or to the cables between two cabinets.
The load-factor lower bound can be met to within a polylogarithmic factor as an upper bound on many networks, including volume and area-universal networks, such as fat-trees [10, 18], as well as the standard universal routing networks, such as the Boolean hypercube . The lower bound is weak on the standard universal routing networks because every cut of these networks is large relative to the number of processors in the smaller side of the cut, but these networks may be more difficult to construct on a large scale because of packaging limitations. Networks for which the load factor lower bound cannot be approached to within a polylogarithmic factor as an upper bound include linear arrays, meshes, and high-diameter networks in general.
In the PRAM model, the issue of communication bandwidth does not arise even though most parallel computers implement remote memory accesses by routing messages through an underlying network. In the PRAM model, a set of memory accesses is presumed to take unit time, reflecting the assumption that all sets of messages can be routed through the network with comparable ease. In the DRAM model, a set of memory accesses takes time equal to the load factor of the set of messages, which reflects the unequal times required to route sets of messages with different load factors.
This paper gives DRAM algorithms that solve many graph problems with efficient communication. Our algorithms can be executed on any of the popular PRAM models because a PRAM can be viewed as a DRAM in which communication costs are ignored. In fact, our DRAM algorithms can all be performed on an exclusive-read, exclusive-write PRAM, and they are nearly as efficient as corresponding concurrent-read, exclusive-write PRAM algorithms in the literature.
The remainder of this paper is organized as follows. Section 2 contains a specification of the DRAM model and the implementation of data structures in the model. The section demonstrates how a DRAM models the congestion produced by techniques such as ``recursive doubling'' that are frequently used in PRAM algorithms. Section 3 defines the notion of a conservative algorithm as a concrete realization of a communication-efficient DRAM algorithm, and gives a ``Shortcut Lemma'' that forms the basis of the conservative algorithms in this paper. Section 4 presents a conservative ``recursive pairing'' technique that can be used to perform many of the same functions as on lists as recursive doubling. Section 5 presents a linear-space, conservative ``tree contraction'' algorithm based on the ideas of Miller and Reif . Section 6 presents treefix computations, which are generalizations of the parallel prefix computation [3, 7, 23] to trees. We show that treefix computations can be performed using the tree contraction algorithm of Section 5. Section 7 gives short, efficient, parallel algorithms for tree and graph problems, most of which are based on treefix computations. Section 8 discusses the relationship between the DRAM model and more traditional PRAM models, as well as the ramifications of using the DRAM model in practical situations.