Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of \emph{deviations}, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads---scheduler time and cache misses---as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is $O(Pd + td)$ in a $P$-processor execution of a computation with span $d$ and $t$ touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is \emph{parsimonious} (one in which processors steal only when their queues are empty); this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead's quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.