Efficient Scheduling for Parallel Memory Hierarchies
Harsha Simhadri
Apr 21, 2010


In dynamic parallelism, the user expresses the full parallelism of an application without concern about how it is mapped onto processors. The running time of an application on a particular machine depends on the cost metrics of the application, the memory hierarchy of the machine and the mapping of the application on to the machine. We consider the case of mapping nested-parallel computations on to a parallel memory hierarchy (modeled as a tree of caches).

To capture the cache cost of nested-parallel computations we introduce a parallel version of the ideal cache model. In this model, computations can be written cache obliviously (no choices are made based on machine

parameters) and analyzed using a single level of cache with parameters $Z$ (cache size) and $L$ (cache line size), and a parameter $\alpha$ specifying the computation's parallelism (for input size $n$, $n^{\alpha}$ represents the number of processors that can be effectively used). For several fundamental algorithms, this cache cost in the parallel ideal cache model is optimal, matching the sequential bounds, with a parallelism $\alpha\rightarrow 1$.

Analogous to a computation's parallelism, we quantify the parallelism of a machine and present an efficient scheduler for settings where the parallelism of the computation exceeds that of the machine. We analyze the number of cache misses caused by this scheduler and bound the run time of a computation under this scheduler in terms of the cache cost of the computation.

This is joint work with Guy Blelloch, Jeremy Fineman and Phillip Gibbons.