This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$ each member of which requires $\Theta(n)$ total operations (work), for which when using work-stealing the total number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C\lceil\frac{m}{s}\rceil\:P\:T_{\infty})$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to $50\%$ over static partitioning under multiprogrammed work loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to $80\%$.