![]() Mor Harchol-Balter
Scheduling in Multiserver Systems: Approaches and Open Problems: Part I & II
PART I: Server farms are ubiquitous in applications ranging fromWeb server farms to high-performance supercomputing systems to call centers. The popularity of the server farm architecture is understandable, as it allows for increased performance, while being cost-effective and easily scalable. Given the prevalence of server farms, it is surprising that even at this late date so little is understood regarding their performance as compared with their single-server counterpart, particularly with respect to scheduling. Part of the problem is that there are at least three disjoint communities studying scheduling in server farms, including the SIGMETRICS community, the INFORMS community, and the SPAA/STOC/FOCS community, all of which have different approaches and goals. One of our goals in this tutorial is to make researchers aware of results in these different communities. Our primary focus is the evaluation of different routing/dispatching policies in server farms. The emphasis will be on intuition, so that the talk is accessible to newcomers as well as old-timers. In surveying the newest results, we will also present some practical open problems. Since server farms are composed of many individual servers, each operating under some scheduling policy, Part I of this tutorial will begin by briefly examining single-server systems, and the effect of scheduling therein. Here we will pay particular attention to the effect of heavy-tailed job size distributions witnessed in computer system environments [6, 1, 11], in determining which scheduling policies are most effective in practice. We will point out several counter-intuitive results, such as the fact that scheduling policies that favor short jobs may actually help long jobs as well [8, 12, 2], and the fact that scheduling results in closed system models can be very different from those in open system models [10]. We will then move on to studying server farm models representative of those used in super-computing and manufacturing. These involve non-preemptive, First-Come-First-Serve (FCFS) scheduling at the individual servers. We will see that the mean response time of such FCFS server farms can vary by orders of magnitude depending on the routing/dispatching policy used for assigning jobs to servers [5]. We will question common wisdoms, like whether load should be balanced among identical servers [9, 3]. We will also discuss the benefits of cycle stealing in such models [7], and what one can do when the size of jobs isn”Ēt known a priori [4]. PART II: In Part II we move on to looking at server farm models that are representative of Web server farms. Here the individual servers all employ Processor-Sharing (PS) service order, which is a preemptive time-sharing scheduling policy. Examples of such server farms include the Cisco Local Director product, the IBM Network Dispatcher, Microsoft SharePoint, and F5 Labs BIG/IP. We first show that the desired routing/dispatching policy for minimizing mean response time in the case of PS server farms can be very different from that for FCFS server farms. We then focus on the Join-the-Shortest-Queue routing policy and discuss some existing approximations in the literature (e.g., [5, 6]) and some new approximations that apply to the case of PS server farms [2]. Finally, we turn to the question of what server farm architectures are optimal for minimizing mean response time. Here we consider server farms where the individual servers employ Shortest-Remaining-Processing-Time (SRPT) scheduling, or there is a central SRPT queue. Such models are very difficult to analyze stochastically. The closest stochastic result only manages a server farm with a central priority queue [3]. Primary work on SRPT server farms is dominated by the STOC/FOCS/SPAA community, which uses competitive ratios as its metric. We will describe server farm architectures that appear to be optimal, but aren't, and discuss their competitive ratios, both for the case of a central queue model [4] and an immediate-dispatch model [1]. These results motivate future research directions for researchers in the stochastic community. |