Sixth Annual ACM Symposium on Parallel Algorithms and
Architectures, June 27-29, 1994, Cape May, New Jersey, pp. 104-113.
List Ranking and List Scan on the Cray C-90
Margaret Reid-Miller
List ranking and list scan are two primitive operations used in many
parallel algorithms that use list, trees, and graph data structures.
But vectorizing and parallelizing list ranking is a challenge because
it is highly communication intensive and dynamic. In addition, the
serial algorithm is very simple and has very small constants. In
order to compete, a parallel algorithm must also be simple and have
small constants. A parallel algorithm due to Wyllie is such an
algorithm, but it is not work efficient--its performance degrades for
longer and longer linked lists. In contrast, work efficient PRAM
algorithms developed to date have very large constants. We introduce
a new fully vectorized and parallelized algorithm that both is work
efficient and has small constants. It does not achieve O(log n)
running time, but we contend that work efficiency and small constants
is more important, given that vector and multiprocessor machines are
used for problems that are much larger than the number of processors
and, therefore, the O(log n) time is never achieved in practice. In
particular, to the best of our knowledge, our implementation of list
ranking and list scan on the Cray C-90 is the fastest implementation
to date. In addition, it is the first implementation of which we are
aware that outperforms fast workstations. The success of our
algorithm is due to its relatively large grain size and simplicity of
the inner loops, and the success of the implementation is due to
pipelining reads and writes through vectorization to hide latency,
minimizing load balancing by deriving equations for predicting and
optimizing performance, and avoiding conditional tests except when
load balancing.