I am interested in designing and implementing computer systems that use novel resource management and scheduling techniques to meet performance goals.
I believe that efficiently utilizing resources will require building automated performance management tools based on mathematically sound performance analysis models.
I believe that underlying all of these resource management problems is a need for automated configuration of resources.
Constantly re-optimizing system parameters by hand is too difficult and cumbersome for IT operators.
Furthermore, the knowledge and expertise from better performance analysis techniques can be embedded into systems that automatically tune parameters.
I am excited about research problems in automatically managing resources to meet performance goals, and I hope to continue working on these types of problems.
Below are examples of automated performance management systems I have built.
Quality of Service (QoS) support for tail latency SLOs [details]
My main research focus is on how to meet tail latency Service Level Objectives (SLOs) in shared storage and networks.
The problem of long tail latencies is pervasive in datacenter environments, and many companies and researchers are trying to better control latency.
Congestion is one of the main sources of tail latency in shared environments.
Our IOFlow (SOSP 2013) paper introduces a QoS architecture for controlling congestion via rate limiting and prioritization of storage and network I/O.
Our PriorityMeister (SoCC 2014) paper addresses how to automatically configure priorities and rate limits to meet tail latency SLOs using a Deterministic Network Calculus (DNC) analysis.
Our SNC-Meister (SoCC 2016) paper shows significant improvements in admission control when using a probabilistic analysis called Stochastic Network Calculus (SNC) instead of DNC, which is a worst-case analysis.
We are the first to build a computer system based on SNC, and our code is publicly available at: https://github.com/timmyzhu/SNC-Meister.
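To give a flavor of the worst-case reasoning behind a DNC-style analysis, here is a minimal sketch in Python. It uses the textbook single-hop bound for a flow shaped by a token bucket (burst `burst`, sustained rate `rate`) that is served by a rate-latency server (`service_rate`, `service_latency`). It is not the analysis used in PriorityMeister or SNC-Meister; the function names, parameters, and numbers are assumptions chosen for illustration only.

```python
import math


def dnc_delay_bound(burst, rate, service_rate, service_latency):
    """Worst-case delay (seconds) under the textbook single-hop bound.

    Hypothetical helper for illustration. Assumes a token-bucket arrival
    curve (burst in bytes, rate in bytes/s) and a rate-latency service
    curve (service_rate in bytes/s, service_latency in seconds).
    """
    if rate > service_rate:
        # The backlog grows without bound, so no finite delay bound exists.
        return math.inf
    return service_latency + burst / service_rate


def admit(burst, rate, service_rate, service_latency, slo):
    """Admit the flow only if its worst-case delay meets the latency SLO."""
    return dnc_delay_bound(burst, rate, service_rate, service_latency) <= slo


if __name__ == "__main__":
    # Illustrative numbers only: 100 B burst, 50 B/s rate, 200 B/s server.
    print(dnc_delay_bound(100.0, 50.0, 200.0, 0.01))
    print(admit(100.0, 50.0, 200.0, 0.01, slo=1.0))
```

Because the bound assumes every flow bursts at the worst possible moment, an admission check like this can be conservative, which is the motivation for the probabilistic SNC analysis described above.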
Cluster scheduling on heterogeneous resources [details]
I have also worked on cluster scheduling problems and am interested in ways of better scheduling jobs to take advantage of specialized resources.
With heterogeneous resources come new questions in scheduling.
For example, is it better to statically partition specialized resources, or to dynamically schedule across a heterogeneous mixture of resources?
When dynamically scheduling heterogeneous resources, should the scheduler wait for specialized resources to become available in the future or use slower alternative resources that are immediately available?
Our TetriSched (EuroSys 2016) paper introduces a new cluster scheduler that optimizes when and where to run jobs so as to improve performance in heterogeneous clusters.
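The wait-or-run-now question above can be illustrated with a toy comparison of completion times for a single job. This is only a sketch under simplifying assumptions (one job, known runtimes, a known wait for the specialized resource); it is not TetriSched's algorithm, and the function name, parameters, and return values are hypothetical.

```python
def choose_placement(wait_for_fast, fast_runtime, slow_runtime):
    """Pick the option with the earlier completion time for one job.

    Hypothetical helper for illustration. All arguments are in seconds:
    wait_for_fast is the time until the specialized resource frees up,
    fast_runtime is the runtime on it, and slow_runtime is the runtime
    on the slower resource that is available immediately.
    """
    finish_if_wait = wait_for_fast + fast_runtime
    finish_if_run_now = slow_runtime
    if finish_if_wait < finish_if_run_now:
        return "wait_for_fast"
    # On a tie, prefer starting immediately on the slower resource.
    return "run_on_slow_now"


if __name__ == "__main__":
    print(choose_placement(wait_for_fast=10, fast_runtime=30, slow_runtime=60))
    print(choose_placement(wait_for_fast=50, fast_runtime=30, slow_runtime=60))
```

A real scheduler has to make this trade-off jointly across many jobs and resource types, which is why a per-job comparison like this one is only a starting point.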
Autoscaling is a useful technique for adapting resource utilization to load.
In my CacheScale (HotCloud 2012) work, I investigate techniques for elastically scaling memcached resources to reduce costs of cloud web services.
As an alternative to autoscaling memcached servers, our SOFTScale (Middleware 2012) work performs cycle-stealing on memcached servers to help deal with bursts of work during periods of low load.
I have also investigated autoscaling resources to meet deadlines for multi-phase batch jobs during an internship at Google.