- Making Parallel Programming Easy and Portable -

For parallel computing to succeed in the mainstream, parallel programming needs to be made as easy as sequential programming, or at least almost as easy. Currently, many problems that can be solved in a couple dozen lines of sequential code require hundreds or sometimes thousands of lines of code to be solved efficiently in parallel. Furthermore, the parallel code is typically much harder to understand, modify, and debug than its sequential counterpart. This has limited parallel programming to experts, and to applications in which performance is absolutely critical. Our contention, shared by many others, is that programming parallel algorithms and applications is not inherently difficult, but is made difficult at present by the languages and tools available for the task.

- Quicksort: A motivational example -

To appreciate that parallelism is not inherently difficult, consider the Quicksort algorithm. Here is pseudocode for the algorithm from Aho, Hopcroft and Ullman's textbook "The Design and Analysis of Computer Algorithms".
  procedure QUICKSORT(S):
  if S contains at most one element then return S
  else
     begin
        choose an element a randomly from S;
        let S_1, S_2 and S_3 be the sequences of elements in S less 
           than, equal to, and greater than a, respectively;
        return (QUICKSORT(S_1) followed by S_2 followed by
                QUICKSORT(S_3))
     end
The authors presented this as a sequential algorithm, and it can be expressed concisely in most sequential languages. The interesting aspect, however, is that the algorithm is in fact highly parallel in nature. Parallelism appears in two places: the elements of S can be partitioned into S_1, S_2, and S_3 in parallel, and the two recursive calls are independent of each other and can run in parallel.

Although Quicksort can be considered a "toy" example, the algorithm is actually quite involved relative to many other parallel applications because it is highly dynamic (one does not know the sizes of the recursive calls ahead of time) and has multiple sources of parallelism (the recursive calls and the subselection for generating S_1).
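The pseudocode above translates almost line for line into a sequential program. As an illustration (not part of the original text), here is a direct Python rendering, with comments marking the two sites where parallelism is available:

```python
import random

def quicksort(s):
    """Quicksort following the Aho, Hopcroft and Ullman pseudocode."""
    if len(s) <= 1:
        return s
    a = random.choice(s)
    # Source of parallelism 1: the three selections are independent,
    # and each selection can itself examine the elements of s in parallel.
    s1 = [e for e in s if e < a]
    s2 = [e for e in s if e == a]
    s3 = [e for e in s if e > a]
    # Source of parallelism 2: the two recursive calls are independent
    # and can run in parallel.
    return quicksort(s1) + s2 + quicksort(s3)
```

Run sequentially, `quicksort([3, 1, 2])` returns `[1, 2, 3]`; the point is that nothing in the algorithm's structure forces the marked steps to run one after another.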

- Expressing Quicksort in Parallel -

If Quicksort is indeed inherently parallel, we should be able to express it in a parallel language as concisely as in a sequential language. Much of the earlier work on parallelizing Quicksort considered only the parallelism from the recursive calls [Hal85], which leads to very little parallelism --- in such a version just partitioning the elements requires O(n) time, so the best speedup we could hope for over a sequential implementation is O(lg n). Here we are interested in a version that expresses both forms of parallelism.

To appreciate the difficulty of programming a "toy" example such as Quicksort, we coded a parallel version of the algorithm that takes advantage of both forms of parallelism in C + MPI (MPI is a message passing interface for programming parallel machines that can be used as a library with C or Fortran). This required 1700 lines of code, including explicit load-balancing: hardly a toy. Here is the code for Quicksort in NESL:
  function QUICKSORT(S) =
  if (#S <= 1) then S
  else
    let a = S[rand(#S)];
        S_1 = {e in S | e < a};
        S_2 = {e in S | e == a};
        S_3 = {e in S | e > a};
        R = {QUICKSORT(v): v in [S_1, S_3]};
    in R[0] ++ S_2 ++ R[1];
In NESL parallelism is expressed with curly brackets. For example, in the above code the expression
  {e in S | e < a}
can be read as "in parallel, for each element e in S, select those e that are less than a". Similarly,
  {QUICKSORT(v): v in [S_1, S_3]}
can be read as "in parallel for each v in the sequence [S_1, S_3], QUICKSORT(v)."
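As a point of comparison (an illustration, not NESL itself), both forms of apply-to-each have direct sequential analogues in Python list comprehensions; the difference is that NESL's semantics additionally permit every iteration to run in parallel:

```python
S = [5, 2, 8, 1, 9]
a = 5

# NESL {e in S | e < a}: in parallel, select every e less than a.
# Sequential Python analogue:
S_1 = [e for e in S if e < a]
S_3 = [e for e in S if e > a]

# NESL {QUICKSORT(v): v in [S_1, S_3]}: apply QUICKSORT to each
# subsequence in parallel. Python's built-in sorted stands in for
# QUICKSORT here.
results = [sorted(v) for v in [S_1, S_3]]
```

With the input above, `S_1` is `[2, 1]` and `results` is `[[1, 2], [8, 9]]`.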

Of course, having concise parallel code does not imply an efficient implementation, and a major effort in our project has gone into compiler techniques.

Our overall goal can therefore be summarized as follows: to be able to write parallel code as concisely and clearly as sequential code, while achieving close to peak performance.
