Project Proposal

Daniel Neill  (neill)  and    Adam Wierman (acw)

http://www.cs.cmu.edu/~acw/15740/


On the Benefits of Work Stealing in
 Shared-memory Multiprocessors


Load balancing is one of the key techniques exploited to improve the performance of parallel programs.  In this work we focus on applying load balancing to problems that are susceptible to domain decomposition.  These problems are typically scientific or engineering applications where parallelism is used to exploit the local-based nature of physical problems.  For example, simulations of ocean currents, population movement, disease epidemics, animal migration, stress, and other environments where information requirements fall off with distance.  This domain of problems are  typically approached using static partitioning within some grid computation.  Thus, for our purposes we can imagine a partition of the physical space onto the set of processors in the multiprocessor system.  Then each processor performs a chain of computation based on the current state of the physical space assigned to it.  Typically, the most difficult aspect for the programmer in these situations is to decide how to divide up the physical space so the the computation load is balanced among the processors. 

We present the following model of the above situation.  We model a homogeneous shared-memory multiprocessor system where the service rate of each of $m$ processors is $\mu$.  Further, we model the assignment of tasks to each processor as homogeneous Poisson streams with rate $\lambda$.  This can be justified as follows.  If we consider our realm of applications, physical simulations, we are likely simulating the effect of some stream of events on the physical state.  For instance, earthquake shocks on a building, variable air currents on a wing, layoffs on population movement, an outbreak of disease, or changes in weather on the migration of animals.  Thus, it is reasonable to assume some random distribution on the occurrence of these events, which corresponds to the arrival of computation tasks to our system.  Further, because the goal of the programmer is to balance the load among processors, we can assume that the programmer has assigned the physical space onto processors in a manner where an event is equally likely to result in computation at each of the processors.

As we have mentioned, the goal of the parallel programmer is to balance the load of these computations across the processors.  However, the programmer typically statically chooses a partition of physical space onto the multiprocessor system.  Thus, any load balancing performed by the programmer must be performed at a fairly coarse level - specifically, the programmer can only hope to balance the average loads of each processor.  Optimal performance however can only come by balancing the instantaneous load at the processor.  One important architectural technique commonly employed to attempt to accomplish this instantaneous load balancing is work stealing.

In order to understand the benefits of work stealing, consider the following example.  Suppose that, as the programmer, we have succeeded in balancing the average load among the processors of our system.  However, there will be some cases where, due to a strange combination of events in our simulation, one processor will be forced to do a lot of computation while another is left with very little computation.  A specific example of this is a powerful gust of wind hitting only one portion of an aircraft wing.  In this situation, one processor will be sitting idle while another processor is overloaded.   Work stealing allows the idle processor to perform some of the tasks in the overloaded processor's queue.  This seems like a win-win situation; however some real world constraints complicate the issue.  In particular, there is some cost associated with stealing a job from another processor's queue.  First, there are the communication costs involved in transferring the job - these costs come from both the extra contention for the system bus and the latency of the transfer.  Second is the fact that jobs have some affinity for the queue where they are assigned.  This affinity results from the fact that the applications we are considering have a spatial breakdown of the computations.  It follows that a large amount of the data needed by computations will be cached.  However, when we steal a task from another processor we can no longer take advantage of cached computation.

These two costs beg the questions of (i) when is it appropriate to steal work and (ii) can work stealing provide benefit in a system that is already load balanced.  To answer these questions we will add to our model the capability for processor $i$ to steal work from processor $j$; however when doing so processor $i$ can only process the work at rate $\gamma < \mu$.

As a quick aside, it is important to consider the appropriate metric for performance under these systems.  In contrast to typical performance models, the appropriate metric for this situation is not the response time or delay a customer sees in the queue.  We are interested instead in throughput (number of computations finishing per unit time), which can be measured in terms of the utilization of our resources - specifically the processors.  Thus, the performance of the systems we are modeling is measured by the time average percentage of processing capability being used, where we take into account that a processor is not fully utilized when stealing work because of the costs it must assume.

It is likely that in the absence of extreme communication costs or very strong affinity of jobs for the processors they are assigned to, work stealing will result in some benefit as a result of the fact that it balances the load on a more instantaneous level than the user is able to.  If this holds, it suggests a very interesting question: can a multiprocessor architecture ease the burden on the parallel programmer by allowing work stealing?  In other words, is work stealing a substitute for affinity scheduling? Specifically, if the architecture performs work stealing and the programmer does a reasonable, but not optimal, job of balancing the load, e.g. the arrival rates to the queues differ by at most $\delta$, how will the system performance compare to the performance of a perfectly balanced system?   


Plan of Attack

  1. Convince Seth that the model is reasonable.

  2. Simulate all models.

  3. Attempt to analyze models using Markovian or differential equations techniques.

  4. Validate the analysis with simulation if we are able to obtain analytic results.

  5. Investigate questions mentioned above using analysis and simulation.

Week 1: Convince Seth that our model is reasonable.
Week 2: Plan code for simulation and start investigating analysis.
Week 3: Implement load balanced system in simulation and continue analysis.
Week 4: Implement work stealing system in simulation and finish analysis.
Week 5: Achieve milestone and write introduction.
Week 6: Use simulations and analysis to investigate when it is appropriate to perform work stealing and write description of simulation.
Week 7: Use simulations and analysis to investigate whether work stealing can compensate for imbalanced loads and write description of analysis techniques.
Week 8: Pick up slack and write up results.

The critical path in this schedule is the implementation of the simulations.  We will both work equally on planning, analysis, simulation, and write-up; however we plan to use work stealing in order to instantaneously balance the load.


Milestone

By November 21st we will have formalized our model of work stealing.  In addition we will have completed the implementation of the simulations.  Also, we will have decided whether it is possible analytically evaluate the models, and if so, made progress on the analysis.


Literature

The previous work we have read so far is listed and summarized here: [pdf].


Resources Needed

We plan to use Matlab, for the analysis, and C++, for the simulations, on our workstations.  We have all of the resources we need to conduct this study.


Getting Started

We have almost completed our literature survey.  In addition, we have discussed various models for the system, as well as analysis techniques for each model.  The only constraints on us getting started on this work are that Adam will be attending a conference from 10/18-10/22 and Daniel will be attending a conference from 10/22-10/26.