A summary of research interests and accomplishments 

(April 2004)



Waiting time (delay) is a source of frustration for users who receive service via computer or communication systems.  This frustration can result in lost revenue, e.g., when a customer leaves a commercial web site to shop at a competitor's site.  One obvious way to decrease delay is simply to buy (more expensive) faster machines.  However, we can also decrease delay for free with given resources by making more efficient use of resources and by better scheduling jobs (i.e., by changing the order of jobs to be processed).

Utilizing the full potential computing power of multiserver systems and analyzing their performance are known to be much harder problems than for the case of a single server system.  Despite the ubiquity of multiserver systems, it is not known how we should assign jobs to servers and how we should schedule jobs within each server to minimize the mean delay in multiserver systems.  Also, it is not well understood how we can evaluate various assignment and scheduling policies for multiserver systems.  A primary goal of my research is to provide answers to these questions.


Resource allocation mechanisms for minimizing delay

We have considered various resource allocation mechanisms, which determine (loosely speaking) a mapping between jobs and the server on which they will be run, and have quantitatively analyzed performance benefit and characteristics of these resource allocation mechanisms.  The resource allocation mechanisms that we have considered include the following:

Cycle stealing:

A cycle stealing mechanism is a particular resource allocation mechanism, where a server processes jobs other than its own when the server's own queue is empty (the ``beneficiary'' server is said to steal idle CPU cycles of the ``donor'' server).  We have considered three cycle stealing mechanisms with different objectives and
assumptions.

Cycle stealing for balancing load in a network of workstations:  
This cycle stealing mechanism allows a lightly loaded workstation (donor) to help heavily loaded workstation (beneficiary) when donor has no jobs to do.  This cycle stealing mechanism balances load, decreasing the load of the beneficiary and increasing the load of the donor.  The delay at the beneficiary server is reduced due to the decreased load, while the delay in the donor server ideally stays the same, since the beneficiary only steals idle CPU cycles of the donor.  In particular, we analyze the effect of switching costs on the performance under cycle stealing.  A paper on this research was presented at the ACM SIGMETRICS 2003 [SIGMETRICS03].

Cycle stealing for increasing utilization under restrictive task assignment policies in server farms with distributed queues:
 When service demand has high variability and jobs cannot be preempted, allocating one server (``small'' server) to small jobs and another server (``large'' server) to large jobs (Dedicated policy) in a server farm can significantly decrease the mean delay, since it prevents small jobs waiting behind large jobs.  A disadvantage of Dedicated policy is that it may lead to low utilization of resource (CPU time).  We consider a cycle stealing mechanism that increases utilization by allowing the ``large'' server to process small jobs when there are no large jobs.  This increases utilization, while still preventing small jobs waiting behind large jobs.  In particular, we analyze the performance benefit of cycle stealing. A paper on this research was presented at the ACM SPAA 2003 [SPAA03].

Cycle stealing for increasing utilization under restrictive task assignment policies in server farms with a central queue:
 We consider the above cycle stealing mechanism for increasing utilization also in server farms with a central queue.  A paper on this research was presented at the IEEE ICDCS 2003 [ICDCS03].

Threshold-based policies:

Threshold-based policies are a generalization of cycle stealing mechanisms.  While cycle stealing mechanisms allow a donor to help a beneficiary if and only if the donor queue is empty, threshold-based policies allow a donor not to help a beneficiary even if the donor queue is empty and they allow a donor to help a beneficiary even if the donor has its own jobs.

Threshold-based policy for reducing switching cost in NOW:  In networks of workstations, the donor workstation might not want to switch between its own jobs and the beneficiary jobs so frequently, since switching may require additional time such as context switching time and checkpointing time.  We consider a threshold-based policy that allows a donor server to help a beneficiary less frequently by setting a threshold TB on the beneficiary queue and a threshold TD on the donor queue.  Now, the donor switches to the beneficiary queue only when there are at least TB beneficiary jobs and there are no donor jobs, and the donor switches back to the donor queue only when there are at least TD donor jobs.  In particular, we analyze the performance benefit of the threshold-based policy and characterize the optimal threshold values.  A paper on this research is under submission [CS04].

Threshold-based policy for prioritizing small jobs in servers with affinities:
 Cycle stealing mechanisms for increasing utilization in server farms that we introduced above  allow a ``large'' server to process small jobs only when there are no large jobs.  However, allowing the ``large'' server to process small jobs even in the presence of large jobs can further decrease the mean delay, since processing small jobs before large jobs can decrease the mean delay.  However, biasing too much towards small jobs can starve large jobs, resulting in infinite mean delay of large jobs.  We consider a threshold-based policy that allows a ``large'' server to process small jobs if and only if there are at least TS small jobs in a more general model than a server farm, namely in servers with affinities.  In particular, we characterize the performance of the threshold-based policy as a function of threshold values.  A paper on this research is under submission [ADAPTIVE04].

Threshold-based policies adaptive to fluctuating load in servers with affinities:  
An advantage of threshold-based policy is that it allows us to optimize the overall performance by setting the appropriate threshold value for given environmental parameters such as loads and service demand distributions.  A disadvantage is that the optimal threshold value depends on the environmental parameters.  A threshold-based policy that is optimized at incorrect (mispredicted) settings can provide performance far from the optimal.  We propose new threshold-based policies that are adaptive to environmental changes such as changes in load, while providing near optimal performance at estimated settings.  We will be particularly interested in the case where jobs have affinity towards particular servers (i.e., they can be processed faster on some servers than others). In particular, we propose an adaptive threshold-based policy that is robust against misprediction and fluctuation of load and analyze its performance benefit.  Preliminary results are summarized in a paper under submission [ADAPTIVE04] and we are currently working on extending the results.


Evaluating the effectiveness of resource allocation mechanisms: 

New analysis methods

The difficulty in the performance analysis of distributed computing systems often comes from the large (infinite) state space required to capture the state of the system.  When there are n types of jobs, the state space often grows infinitely in n dimensions, and known approaches either are computationally expensive or have poor accuracy in solution.  We propose new analysis techniques that are widely applicable in the analysis of computer system performance.  We apply these techniques to analyze system performance under various resource allocation mechanisms.

The first technique is recursive dimensionality reduction (RDR).  RDR reduces an n-dimensionally infinite state space into a one dimensionally infinite state space, which we can analyze efficiently via a known algorithmic method.  RDR is used in the performance analysis of various resource allocation mechanisms mentioned above as well as in the performance analysis of priority scheduling in multiserver systems.  Two papers that introduce RDR and apply it to the analysis of priority scheduling in multiserver systems are in preparation [RDR03a, RDR03b].  In fact, [RDR03b] provides the first accurate analysis of priority scheduling in multiserver systems with an arbitrary number of priority classes and an arbitrary number of servers.

In applying RDR, it is crucial to model a general distribution (such as service demand, interarrival times, and durations of busy periods) by phase-type (PH) distributions, which are mixtures and/or convolutions of exponential distributions.  Our second technique allows us to find a minimal PH distribution whose first three moments matches those of a given distribution.  Two papers on this research
were presented at the Performance TOOLS 2003 [TOOLS03a, TOOLS03b].


Other applications


We also apply the method of phases, where general distributions are approximated by phase type distributions, to performance modeling transmission control protocols (TCPs).  We propose a unified framework to analyze the performance of various versions of TCPs.  In this framework, the sending rate of TCP sources is modeled by a Markov chain, and it was used in a framework of fixed point method, which allows us to predict both the throughput of the senders and the loss rate of the network.  Our model overcomes two major limitations of other TCP models.  First, our model is the first model of TCP-Vegas that accurately predicts the operating point of the network; i.e., it predicts both the throughput and the loss rate of TCP sources.  Second, our model is the first model of TCP-Vegas for arbitrary on-off traffic; all previous work on modeling TCP-Vegas is based on the analysis of TCP-Vegas for bulk transfers.  Two papers on this research were presented at the IEEE MASCOTS 2003 [MASCOTS03] and the MAMA 2003 [MAMA03].


Current work


In our prior work, we have considered various resource allocation mechanisms and have analyzed their performance benefit in reducing the impact of high load, job size variability, and switching costs on waiting time.  We are currently working on characterizing the performance benefit of various resource allocation mechanisms under load fluctuations caused by interarrival time variability and correlation.




An older summary of research interests and accomplishments (September 2003)



Takayuki Osogami
Department of Computer Science
Carnegie Mellon University