|
Project Proposal Daniel Neill (neill) and Adam Wierman (acw) http://www.cs.cmu.edu/~acw/15740/ On the Benefits of Work Stealing in Load balancing is one of the key techniques exploited to improve the performance of parallel programs. In this work we focus on applying load balancing to problems that are susceptible to domain decomposition. These problems are typically scientific or engineering applications where parallelism is used to exploit the local-based nature of physical problems. For example, simulations of ocean currents, population movement, disease epidemics, animal migration, stress, and other environments where information requirements fall off with distance. This domain of problems are typically approached using static partitioning within some grid computation. Thus, for our purposes we can imagine a partition of the physical space onto the set of processors in the multiprocessor system. Then each processor performs a chain of computation based on the current state of the physical space assigned to it. Typically, the most difficult aspect for the programmer in these situations is to decide how to divide up the physical space so the the computation load is balanced among the processors. We present the following model of the above situation. We model a homogeneous shared-memory multiprocessor system where the service rate of each of $m$ processors is $\mu$. Further, we model the assignment of tasks to each processor as homogeneous Poisson streams with rate $\lambda$. This can be justified as follows. If we consider our realm of applications, physical simulations, we are likely simulating the effect of some stream of events on the physical state. For instance, earthquake shocks on a building, variable air currents on a wing, layoffs on population movement, an outbreak of disease, or changes in weather on the migration of animals. Thus, it is reasonable to assume some random distribution on the occurrence of these events, which corresponds to the arrival of computation tasks to our system. Further, because the goal of the programmer is to balance the load among processors, we can assume that the programmer has assigned the physical space onto processors in a manner where an event is equally likely to result in computation at each of the processors. As we have mentioned, the goal of the parallel programmer is to balance the load of these computations across the processors. However, the programmer typically statically chooses a partition of physical space onto the multiprocessor system. Thus, any load balancing performed by the programmer must be performed at a fairly coarse level - specifically, the programmer can only hope to balance the average loads of each processor. Optimal performance however can only come by balancing the instantaneous load at the processor. One important architectural technique commonly employed to attempt to accomplish this instantaneous load balancing is work stealing. In order to understand the benefits of work stealing, consider the following example. Suppose that, as the programmer, we have succeeded in balancing the average load among the processors of our system. However, there will be some cases where, due to a strange combination of events in our simulation, one processor will be forced to do a lot of computation while another is left with very little computation. A specific example of this is a powerful gust of wind hitting only one portion of an aircraft wing. In this situation, one processor will be sitting idle while another processor is overloaded. Work stealing allows the idle processor to perform some of the tasks in the overloaded processor's queue. This seems like a win-win situation; however some real world constraints complicate the issue. In particular, there is some cost associated with stealing a job from another processor's queue. First, there are the communication costs involved in transferring the job - these costs come from both the extra contention for the system bus and the latency of the transfer. Second is the fact that jobs have some affinity for the queue where they are assigned. This affinity results from the fact that the applications we are considering have a spatial breakdown of the computations. It follows that a large amount of the data needed by computations will be cached. However, when we steal a task from another processor we can no longer take advantage of cached computation. These two costs beg the questions of (i) when is it appropriate to steal work and (ii) can work stealing provide benefit in a system that is already load balanced. To answer these questions we will add to our model the capability for processor $i$ to steal work from processor $j$; however when doing so processor $i$ can only process the work at rate $\gamma < \mu$. As a quick aside, it is important to consider the appropriate metric for performance under these systems. In contrast to typical performance models, the appropriate metric for this situation is not the response time or delay a customer sees in the queue. We are interested instead in throughput (number of computations finishing per unit time), which can be measured in terms of the utilization of our resources - specifically the processors. Thus, the performance of the systems we are modeling is measured by the time average percentage of processing capability being used, where we take into account that a processor is not fully utilized when stealing work because of the costs it must assume. It is likely that in the absence of extreme communication costs or very strong affinity of jobs for the processors they are assigned to, work stealing will result in some benefit as a result of the fact that it balances the load on a more instantaneous level than the user is able to. If this holds, it suggests a very interesting question: can a multiprocessor architecture ease the burden on the parallel programmer by allowing work stealing? In other words, is work stealing a substitute for affinity scheduling? Specifically, if the architecture performs work stealing and the programmer does a reasonable, but not optimal, job of balancing the load, e.g. the arrival rates to the queues differ by at most $\delta$, how will the system performance compare to the performance of a perfectly balanced system? Plan of Attack
Milestone
Literature
Resources Needed
Getting Started
|