This paper studies the data locality of the work-stealing scheduling
algorithm on hardware-controlled shared-memory machines. We present
lower and upper bounds on the number of cache misses using work
stealing, and introduce a locality-guided work-stealing algorithm
along with experimental validation.
As a lower bound, we show that there is a family of multithreaded
computations $G_n$, each member of which requires $\Theta(n)$ total
operations (work), such that under work stealing the total number of
cache misses on one processor is constant, whereas even on two
processors the total number of cache misses is $\Omega(n)$.
For nested-parallel computations, however, we show that on $P$
processors the expected number of additional cache misses, beyond
those of a single-processor execution, is bounded by
$O(C\lceil\frac{m}{s}\rceil\:P\:T_{\infty})$, where $m$ is the
execution time of an instruction that incurs a cache miss, $s$ is the
steal time, $C$ is the size of the cache, and $T_\infty$ is the number
of nodes on the longest chain of dependences. Based on this bound, we
give strong bounds on the total running time of nested-parallel
computations using work stealing.
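As a rough indication of how the miss bound translates into such a
running-time bound, the following back-of-the-envelope sketch (not the
precise theorem) assumes $T_1(C)$ denotes the single-processor
execution time with a cache of size $C$: each of the
$O(C\lceil\frac{m}{s}\rceil\:P\:T_\infty)$ additional misses costs at
most $m$ time, and spreading this extra work over the $P$ processors
suggests
\[
  T_P \;\approx\; \frac{T_1(C)}{P}
  \;+\; O\!\left(m\,C\Bigl\lceil\frac{m}{s}\Bigr\rceil T_\infty\right)
  \;+\; O(T_\infty),
\]
where the last term is the critical-path overhead of the standard
work-stealing analysis.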
For the second part of our results, we present a locality-guided
work-stealing algorithm that improves the data locality of
multithreaded computations by allowing a thread to have an affinity
for a processor. Our initial experiments on iterative data-parallel
applications show that the algorithm matches the performance of static
partitioning under traditional workloads, but improves performance by
up to $50\%$ over static partitioning under multiprogrammed workloads.
Furthermore, locality-guided work stealing improves the performance of
standard work stealing by up to $80\%$.
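To make the affinity mechanism concrete, the C++ sketch below shows
one way a thread's processor affinity could be layered on ordinary
work-stealing deques: each worker keeps a ``mailbox'' of tasks that
prefer it and drains that mailbox before its own deque. The mailbox
structure, routing policy, and names are our own illustrative
assumptions, not the implementation evaluated in the paper.
\begin{verbatim}
// Illustrative sketch: per-worker deque plus a "mailbox" of tasks with an
// affinity for that worker.  A worker drains its mailbox before its deque,
// and steals from others only when both are empty.  The mailbox structure,
// routing policy, and locking are assumptions, not the paper's implementation.
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <utility>
#include <vector>

struct Task {
    std::function<void()> body;
    int affinity = -1;  // preferred worker id, or -1 for "no preference"
};

class Worker {
public:
    Worker(int id, std::vector<Worker*>& all) : id_(id), all_(all) {}

    // Route a new task: tasks preferring another worker go to that worker's
    // mailbox; everything else stays in the local deque.
    void spawn(Task t) {
        if (t.affinity >= 0 && t.affinity != id_) {
            all_[t.affinity]->push_mailbox(std::move(t));
        } else {
            std::lock_guard<std::mutex> lk(m_);
            deque_.push_back(std::move(t));
        }
    }

    // One scheduling step: mailbox first (locality), then own deque (LIFO),
    // then a steal attempt from a random victim (FIFO end of its deque).
    bool run_one(std::mt19937& rng) {
        if (auto t = pop_mailbox()) { t->body(); return true; }
        if (auto t = pop_local())   { t->body(); return true; }
        std::uniform_int_distribution<size_t> pick(0, all_.size() - 1);
        Worker* victim = all_[pick(rng)];
        if (victim == this) return false;
        if (auto t = victim->steal()) { t->body(); return true; }
        return false;
    }

    void push_mailbox(Task t) {
        std::lock_guard<std::mutex> lk(m_);
        mailbox_.push_back(std::move(t));
    }

private:
    std::optional<Task> pop_mailbox() {
        std::lock_guard<std::mutex> lk(m_);
        if (mailbox_.empty()) return std::nullopt;
        Task t = std::move(mailbox_.front()); mailbox_.pop_front(); return t;
    }
    std::optional<Task> pop_local() {
        std::lock_guard<std::mutex> lk(m_);
        if (deque_.empty()) return std::nullopt;
        Task t = std::move(deque_.back()); deque_.pop_back(); return t;
    }
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lk(m_);
        if (deque_.empty()) return std::nullopt;
        Task t = std::move(deque_.front()); deque_.pop_front(); return t;
    }

    int id_;
    std::vector<Worker*>& all_;
    std::deque<Task> deque_;    // ordinary work; the owner pops newest first
    std::deque<Task> mailbox_;  // affinity work; checked before the deque
    std::mutex m_;
};

int main() {
    std::vector<Worker*> all;
    Worker w0(0, all), w1(1, all);
    all = {&w0, &w1};
    // One "iteration" of a data-parallel loop: each chunk prefers a fixed
    // worker, so repeated iterations can reuse that worker's warm cache.
    for (int chunk = 0; chunk < 4; ++chunk)
        w0.spawn({[chunk] { std::printf("chunk %d\n", chunk); }, chunk % 2});
    std::mt19937 rng(42);
    while (w0.run_one(rng) || w1.run_one(rng)) {}
}
\end{verbatim}
In an iterative data-parallel loop, giving each chunk an affinity for
the worker that processed it in the previous iteration lets repeated
iterations reuse warm caches, which is the kind of reuse the
locality-guided scheduler is designed to exploit.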