Underlying any realization of a parallel random-access machine (PRAM)
is a communication network that conveys information between processors
and memory banks. Yet in most PRAM models, communication issues are
largely ignored. The basic assumption in these models is that in unit
time each processor can simultaneously access one memory location.
For truly large parallel computers, however, computer engineers may be
hard pressed to implement networks with the communication bandwidth
demanded by this assumption, due in part to packaging constraints.
The difficulty of building such networks threatens the validity of the
PRAM as a predictor of algorithmic performance. This paper introduces
a more restricted PRAM model, which we call a *distributed
random-access machine* (DRAM), to reflect an assumption of limited
communication bandwidth in the underlying network.

In a communication network, we can measure the cost of communication
in terms of the number of messages that must cross a cut of the
network, as in [10] and [18].
Specifically, a *cut* *S* of a network is a subset of the
nodes of the network. The *capacity* cap(*S*) is the number
of wires connecting processors in *S* with processors in the rest of
the network *S̄*, i.e., the bandwidth of communication
between *S* and *S̄*. For a set *M* of messages, we define
the *load* of *M* on a cut *S*, written load(*M*, *S*), to be the
number of messages in *M* whose source is in *S* and whose destination
is in *S̄*, or vice versa. The *load factor* of *M* on *S* is

    λ(*M*, *S*) = load(*M*, *S*) / cap(*S*),

and the *load factor* of *M* on the entire network is

    λ(*M*) = max over all cuts *S* of λ(*M*, *S*).
The load factor provides a simple lower bound on
the time required to deliver a set of messages. For instance, if
there are *10* messages to be sent across a cut of capacity *3*, the
time required to deliver all *10* messages is at least the load factor
*10/3*.
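As an illustrative sketch (our own code, not part of the paper), the definitions above can be computed directly: a message crosses a cut exactly when its source and destination lie on opposite sides. The names `load`, `load_factor`, and the toy message set are our assumptions.

```python
# Sketch of the cut/load/load-factor definitions (names are ours).

def load(messages, cut):
    """Number of messages crossing the cut: source in cut XOR destination in cut."""
    return sum((src in cut) != (dst in cut) for src, dst in messages)

def load_factor(messages, cut, capacity):
    """Load of the message set divided by the cut's capacity (wire count)."""
    return load(messages, cut) / capacity

# 10 messages from processors 0..9 into a cut {10, 11, 12} that is
# connected to the rest of the network by only 3 wires.
msgs = [(i, 10 + i % 3) for i in range(10)]
cut = {10, 11, 12}
print(load(msgs, cut))            # 10
print(load_factor(msgs, cut, 3))  # 3.333...: delivery takes at least 10/3 steps
```

Since at most 3 messages can cross the cut per step, at least ⌈10/3⌉ = 4 steps are needed, matching the lower bound.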

There are two commonly occurring types of message congestion that the load factor measures effectively. One is the ``hot spot'' phenomenon identified by Pfister and Norton [24]. When many processors send messages to a single other processor, large delays can be experienced as messages queue for access to that other processor. In this situation, the load factor on the cut that isolates the single processor is high. The second phenomenon is message congestion due to pinboundedness. In this case, it is the limited bandwidth imposed by the packaging technology that can cause high load factors. For example, the cut of the network that limits communication performance for some set of messages might correspond to the pins on a printed-circuit board or to the cables between two cabinets.
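To make the hot-spot case concrete, here is a small sketch (our own toy numbers, not from the paper): if the other *n* − 1 processors each send one message to processor 0, and processor 0 is attached to the network by only *d* wires, the load factor on the cut isolating processor 0 is (*n* − 1)/*d*.

```python
# Hot-spot sketch: load factor on the cut isolating a single processor.
# The parameters n (processors) and d (wires to processor 0) are assumed.

def hot_spot_load_factor(n, d):
    messages = [(i, 0) for i in range(1, n)]  # everyone writes to processor 0
    cut = {0}                                 # cut isolating the hot spot
    crossing = sum((s in cut) != (t in cut) for s, t in messages)
    return crossing / d

print(hot_spot_load_factor(64, 4))  # 15.75: messages queue for the 4 wires
```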

The load-factor lower bound can be met to within a polylogarithmic factor as an upper bound on many networks, including volume and area-universal networks, such as fat-trees [10, 18], as well as the standard universal routing networks, such as the Boolean hypercube [29]. The lower bound is weak on the standard universal routing networks because every cut of these networks is large relative to the number of processors in the smaller side of the cut, but these networks may be more difficult to construct on a large scale because of packaging limitations. Networks for which the load factor lower bound cannot be approached to within a polylogarithmic factor as an upper bound include linear arrays, meshes, and high-diameter networks in general.

In the PRAM model, the issue of communication bandwidth does not arise even though most parallel computers implement remote memory accesses by routing messages through an underlying network. In the PRAM model, a set of memory accesses is presumed to take unit time, reflecting the assumption that all sets of messages can be routed through the network with comparable ease. In the DRAM model, a set of memory accesses takes time equal to the load factor of the set of messages, which reflects the unequal times required to route sets of messages with different load factors.
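The cost difference can be sketched on a toy network of our own devising (not one from the paper): a 4-processor linear array with unit-capacity links, where every processor writes to processor 3. A PRAM charges one step for this access set; the DRAM charges its load factor.

```python
# Sketch: DRAM cost of a memory-access set on a linear array 0-1-2-3.
# Each cut is given as (node set, capacity); a contiguous prefix of the
# array crosses exactly one wire. We take the max only over these cuts,
# which suffices for this message set.

def max_load_factor(messages, cuts):
    return max(sum((s in c) != (t in c) for s, t in messages) / cap
               for c, cap in cuts)

cuts = [({0}, 1), ({0, 1}, 1), ({0, 1, 2}, 1)]
msgs = [(0, 3), (1, 3), (2, 3)]   # processors 0..2 all access memory at 3
print(max_load_factor(msgs, cuts))  # 3.0: DRAM charges 3 steps, PRAM charges 1
```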

This paper gives DRAM algorithms that solve many graph problems with efficient communication. Our algorithms can be executed on any of the popular PRAM models because a PRAM can be viewed as a DRAM in which communication costs are ignored. In fact, our DRAM algorithms can all be performed on an exclusive-read, exclusive-write PRAM, and they are nearly as efficient as corresponding concurrent-read, exclusive-write PRAM algorithms in the literature.

The remainder of this paper is organized as follows.
Section 2 contains a specification of the DRAM model and the
implementation of data structures in the model. The section
demonstrates how a DRAM models the congestion produced by techniques
such as ``recursive doubling'' that are frequently used in PRAM
algorithms. Section 3 defines the notion of a *conservative
algorithm* as a concrete realization of a
communication-efficient DRAM algorithm, and gives a ``Shortcut Lemma''
that forms the basis of the conservative algorithms in this paper.
Section 4 presents a conservative ``recursive pairing''
technique that can be used to perform many of the same functions on
lists as recursive doubling. Section 5 presents a
linear-space, conservative ``tree contraction'' algorithm based on the
ideas of Miller and Reif [22]. Section 6
presents *treefix computations,* which are generalizations of the
parallel prefix computation [3, 7, 23] to
trees. We show that treefix computations can be performed using the
tree contraction algorithm of Section 5. Section 7
gives short, efficient, parallel algorithms for tree and graph
problems, most of which are based on treefix computations.
Section 8 discusses the relationship between the DRAM
model and more traditional PRAM models, as well as the ramifications
of using the DRAM model in practical situations.
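For concreteness, the ``recursive doubling'' (pointer-jumping) technique mentioned above, whose communication pattern Section 2 analyzes, can be sketched for list ranking. This is a generic PRAM-style sketch under our own conventions (sequential simulation; the tail points to itself), not code from the paper.

```python
# Sketch of recursive doubling (pointer jumping) for list ranking.
# succ[i] is the successor of node i in the list; the tail has succ[i] == i.
# Each round halves the remaining pointer distances, so O(log n) rounds suffice.

def list_rank(succ):
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = succ[:]
    for _ in range(n.bit_length()):
        # All processors update simultaneously: read old rank/succ, write new.
        rank = [rank[i] + rank[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return rank  # distance from each node to the tail

print(list_rank([1, 2, 3, 3]))  # [3, 2, 1, 0]
```

In each round every node reads from a node twice as far away, and it is exactly this pattern of long-distance reads that can produce high load factors on cuts of a low-bandwidth network.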
