
## Nested Parallelism

In NESL the elements of a sequence can be any valid data item, including sequences. This rule permits the nesting of sequences to an arbitrary depth. A nested sequence can be written as

```
[[2, 1], [7, 3, 0], [4]]
```
This sequence has type `[[int]]` (a sequence of sequences of integers). Given nested sequences and the rule that any function can be applied in parallel over the elements of a sequence, NESL necessarily supplies the ability to apply a parallel function multiple times in parallel; we call this ability nested parallelism. For example, we could apply the parallel sequence function sum over a nested sequence: `{sum(a) : a in [[2, 1], [7, 3, 0], [4]]}` returns `[3, 10, 4]`. In this expression there is parallelism both within each sum, since the sequence function has a parallel implementation, and across the three instances of sum, since the apply-to-each construct is defined such that all instances can run in parallel.

NESL supplies a handful of functions for moving between levels of nesting. These include flatten, which takes a nested sequence and flattens it by one level. For example, `flatten([[2, 1], [7, 3, 0], [4]])` returns `[2, 1, 7, 3, 0, 4]`. Another useful function is bottop (for bottom and top), which takes a sequence of values and creates a nested sequence of length 2, with all the elements from the bottom half of the input sequence in the first element and all the elements from the top half in the second element (if the length of the sequence is odd, the bottom part gets the extra element). For example, `bottop([2, 1, 7, 3, 0])` returns `[[2, 1, 7], [3, 0]]`.

Table 2: Routines with nested parallelism. Both the inner part and the outer part can be executed in parallel.

Table 3: Some divide-and-conquer algorithms.
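Since flatten and bottop move in opposite directions between nesting levels, applying one after the other recovers the original sequence. A small check (the input values here are chosen purely for illustration):

```
% bottop splits into [bottom half, top half];       %
% flatten concatenates the halves back together.    %
flatten(bottop([2, 1, 7, 3, 0]));

% => [2, 1, 7, 3, 0] %
```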

Table 2 lists several examples of routines that could take advantage of nested parallelism. Nested parallelism also appears in most divide-and-conquer algorithms. A divide-and-conquer algorithm breaks the original data into smaller parts, applies the same algorithm to the subparts, and then merges the results. If the subproblems can be executed in parallel, as is often the case, applying the algorithm to the subparts involves nested parallelism. Table 3 lists several examples.

As an example, consider how the function sum might be implemented:

/afs/cs/project/scandal/nesl/2.6/examples/preduce.cnesl
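The file referenced above is not reproduced on this page. Based on the description that follows, a recursive sum along these lines fits (a sketch, not the distributed source):

```
% Sketch: sum a sequence by recursively halving it with bottop %
% and adding the two partial sums.                             %
function sum(a) =
  if (#a == 1) then a[0]
  else
    let r = {sum(x) : x in bottop(a)};
    in r[0] + r[1]
```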

This code tests whether the length of the input is one, and returns the single element if it is. If the length is not one, it uses bottop to split the sequence into two parts, and then applies itself recursively to each part in parallel. When the parallel calls return, the two results are extracted and added. The code effectively creates a tree of parallel calls which has depth ⌈log₂ n⌉, where n is the length of a, and executes a total of n-1 calls to +.

Figure 2: An implementation of quicksort.

As another, more involved, example, consider a parallel variation of quicksort (see Figure 2). When applied to a sequence s, this version splits the values into three subsets (the elements lesser than, equal to, and greater than the pivot) and calls itself recursively on the lesser and greater subsets. To execute the two recursive calls, the lesser and greater sequences are concatenated into a nested sequence and qsort is applied over the two elements of the nested sequence in parallel. The final line extracts the two results of the recursive calls and appends them together with the equal elements in the correct order.

Figure 3: The quicksort algorithm. Just using parallelism within each block yields a parallel running time at least as great as the number of blocks (O(n)). Just using parallelism from running the blocks in parallel yields a parallel running time at least as great as the largest block (O(n)). By using both forms of parallelism the parallel running time can be reduced to the depth of the tree (expected O(log n)).
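Figure 2 itself is not reproduced on this page. A version matching the description above might read as follows (a sketch written to the description, not a quotation of the figure):

```
% Sketch: three-way-split quicksort. The apply-to-each on    %
% [lesser, greater] runs the two recursive calls in parallel. %
function qsort(s) =
  if (#s < 2) then s
  else
    let pivot   = s[#s/2];
        lesser  = {e in s | e < pivot};
        equal   = {e in s | e == pivot};
        greater = {e in s | e > pivot};
        result  = {qsort(v) : v in [lesser, greater]};
    in result[0] ++ equal ++ result[1]
```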

The recursive invocation of qsort generates a tree of calls that looks something like the tree shown in Figure 3. In this diagram, taking advantage of parallelism within each block as well as across the blocks is critical to getting a fast parallel algorithm. If we were only to take advantage of the parallelism within each quicksort to subselect the two sets (the parallelism within each block), we would do well near the root and badly near the leaves (there are n leaves, which would be processed serially). Conversely, if we were only to take advantage of the parallelism available by running the invocations of quicksort in parallel (the parallelism between blocks but not within a block), we would do well at the leaves and badly at the root (it would take n time to process the root). In both cases the parallel time complexity is O(n), rather than the ideal expected O(log n) we can get using both forms (this is discussed in Section 1.5).

Jonathan Hardwick
Tue Nov 28 13:57:00 EST 1995