List Ranking and List Scan on the CRAY C-90

Margaret Reid-Miller.
Journal of Computer and System Sciences, December 1996 (and Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, June 1994 and CMU-CS-94-101, with Guy E. Blelloch).

JCSS version: 94k compressed postscript
SPAA version: 73k compressed postscript
TR version: 88k compressed postscript

Abstract: Although parallel algorithms using linked lists, trees, and graphs have been studied extensively by the research community, implementations have met with limited success, even for the simplest algorithms. In this paper we present our results of a careful implementation study of parallel list ranking (and the related list-scan operation) and show that it can have substantial speed up over fast workstations.

Obtaining good parallel performance for list ranking is a challenge for two reasons. First, although there are many asymptotic work-efficient algorithms, it is hard to keep the constants comparable to those of the sequential algorithm. In this paper we introduce a new parallel algorithm that both is work efficient and has small constant but sacrifices logarithmic time; it achieves only an O(log^2 n) time. We contend, however, that work efficiency and small constants are more important, given that multiprocessor machines are used for problems that are much larger than the number of processors and, therefore, the optimal O(log n) time is never achieved in practice. Second, list ranking is highly communication intensive and its memory access patterns are dynamic. We show, however, that by using high memory bandwidth multiprocessors, such as Cray C-90 computers, programmed with virtual processors to hide latency we can ameliorate these problems.

To the best of our knowledge, our implementation of list ranking and list scan on the Cray C-90 is the fastest to date and is the first implementation that substantially outperforms fast workstations. The success of our algorithm is due to its moderate grain size and simplicity; the success of the implementation is due to pipelining reads and writes through vectorization to hide latency and optimizing performance by analyzing the expected execution time of the algorithm.

@article{jcss-listscan96
        author = "Margaret Reid-Miller",
        title = "List Ranking and List Scan on the {C}RAY {C}-90",
        journal = "Journal of Computer and System Sciences",
	volume = 53,
	number = 3,
	month = dec,
	year = "1996"}

@inproceedings{spaa-listscan94,
        author = "Margaret Reid-Miller",
        title = "List Ranking and List Scan on the {C}RAY {C}-90",
        booktitle = "Proceedings Symposium on 
                Parallel Algorithms and Architectures",
	address = "Cape May, NJ",
	pages = "104--113",
	month = jun,
	year = "1994" }

@techreport{listscan94,
        author = "Margaret Reid-Miller and Guy E.~Blelloch ",
        title = "List Ranking and List Scan on the {C}RAY {C}-90",
       	institution = "School of Computer Science, Carnegie Mellon 
		       University",
	number = "CMU-CS-94-101",
	month = feb,
	year = "1994" }

mrmiller@cs.cmu.edu Last updated 4 Feb 1998