I am interested in designing and implementing computer systems that use novel resource management and scheduling techniques to meet performance goals.
I believe that efficiently utilizing resources will require building automated performance management tools based on mathematically sound performance analysis models.
Underlying all of these resource management problems is a need for automated configuration of resources.
It is too difficult and cumbersome for IT operators to constantly optimize system parameters by hand.
Furthermore, the knowledge and expertise from better performance analysis techniques can be embedded into systems that tune parameters automatically.
I am excited about research problems in automatically managing resources to meet performance goals, and I hope to continue working on these types of problems.
Below are examples of automated performance management systems I have built.
Quality of Service (QoS) support for tail latency SLOs
My main research focus is on how to meet tail latency Service Level Objectives (SLOs) in shared storage and networks.
The problem of long tail latencies is pervasive in datacenter environments, and many companies and researchers are trying to better control latency.
Congestion is one of the main sources of tail latency in shared environments.
Our IOFlow paper (SOSP 2013) introduces a QoS architecture for controlling congestion via rate limiting and prioritization of storage and network I/O.
Our PriorityMeister (SoCC 2014) paper addresses how to automatically configure priorities and rate limits to meet tail latency SLOs using a Deterministic Network Calculus (DNC) analysis.
Our SNC-Meister (SoCC 2016) paper shows that admission control can admit significantly more tenants when it uses Stochastic Network Calculus (SNC), a probabilistic analysis, instead of the worst-case analysis of DNC.
We are the first to build a computer system based on SNC, and our code is publicly available at: https://github.com/timmyzhu/SNC-Meister.
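To give a flavor of the worst-case reasoning behind a DNC-style admission decision, here is a minimal sketch. It is not the PriorityMeister or SNC-Meister code. It assumes each tenant is shaped by a token bucket and that all tenants share one FIFO server modeled by a rate and a fixed latency; the names `Tenant`, `worst_case_delay`, and `admit` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tenant:
    rate: float   # token-bucket sustained rate, bytes/second (assumed model)
    burst: float  # token-bucket size, bytes
    slo: float    # latency SLO, seconds


def worst_case_delay(tenants: List[Tenant], service_rate: float,
                     service_latency: float) -> float:
    """Deterministic delay bound for the FIFO aggregate of token-bucket
    flows at a rate-latency server: T + (sum of bursts) / R, valid when
    the total sustained rate does not exceed R."""
    total_rate = sum(t.rate for t in tenants)
    total_burst = sum(t.burst for t in tenants)
    if total_rate > service_rate:
        return float("inf")  # overloaded: no finite worst-case bound
    return service_latency + total_burst / service_rate


def admit(existing: List[Tenant], candidate: Tenant, service_rate: float,
          service_latency: float) -> bool:
    """Admit the candidate only if every tenant's SLO still holds."""
    tenants = existing + [candidate]
    bound = worst_case_delay(tenants, service_rate, service_latency)
    return all(bound <= t.slo for t in tenants)
```

Because the bound assumes every tenant bursts at the same moment, it is conservative. A probabilistic analysis such as SNC replaces this worst case with a bound that holds at a chosen percentile, which is the intuition for why it can admit more tenants.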
Cluster scheduling on heterogeneous resources
I have also worked on cluster scheduling problems and am interested in ways of better scheduling jobs to take advantage of specialized resources.
With heterogeneous resources come new questions in scheduling.
For example, is it beneficial to statically partition specialized resources, or to dynamically schedule a large pool of heterogeneous resources?
When dynamically scheduling a large pool of heterogeneous resources, should the scheduler wait for specialized resources to become available in the future or use slower alternative resources that are immediately available?
Our TetriSched (EuroSys 2016) paper introduces a new cluster scheduler that optimizes when and where to run jobs so as to improve performance in heterogeneous clusters.
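The wait-or-run-now question above can be illustrated for a single job with a small sketch. This is not TetriSched's algorithm, which optimizes over many jobs and resources at once; it is a greedy comparison of estimated completion times, and the `Option` type, its fields, and `choose` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Option:
    name: str
    start_delay: float  # seconds until the resource becomes free (estimate)
    runtime: float      # estimated job runtime on that resource, seconds

    @property
    def completion_time(self) -> float:
        return self.start_delay + self.runtime


def choose(options: Iterable[Option]) -> Option:
    """Pick the placement with the earliest estimated completion time."""
    return min(options, key=lambda o: o.completion_time)
```

For example, waiting 60 seconds for a specialized resource that runs the job in 100 seconds beats starting immediately on a slower resource that needs 300 seconds. If the wait grows to 250 seconds, the slower resource wins.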
Autoscaling is a useful technique for adapting resource utilization to load.
In my CacheScale (HotCloud 2012) work, I investigate techniques for elastically scaling memcached resources to reduce costs of cloud web services.
As an alternative to autoscaling memcached servers, our SOFTScale (Middleware 2012) work performs cycle-stealing on memcached servers to help deal with bursts of work during periods of low load.
I have also investigated autoscaling resources to meet deadlines for multi-phase batch jobs during an internship at Google.
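As a generic illustration of adapting resources to load, the sketch below sizes a server pool from the current request rate. It is not the CacheScale or SOFTScale policy, and it ignores cache-specific effects such as hit rate; the function name `servers_needed`, its parameters, and the default target utilization of 0.8 are assumptions made for this example.

```python
import math


def servers_needed(request_rate: float, per_server_capacity: float,
                   target_utilization: float = 0.8,
                   min_servers: int = 1) -> int:
    """Return the smallest server count that keeps each server at or
    below the target utilization for the given request rate."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = per_server_capacity * target_utilization
    return max(min_servers, math.ceil(request_rate / usable_capacity))
```

An autoscaler would call a function like this periodically with a fresh load estimate and add or remove servers to match the result.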
SNC-Meister: Admitting More Tenants with Tail Latency SLOs
Timothy Zhu, Daniel S. Berger, Mor Harchol-Balter
SoCC 2016 [To appear]
TetriSched: Global Rescheduling with Adaptive Plan-ahead in Dynamic Heterogeneous Clusters
Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A. Kozuch, Mor Harchol-Balter, Gregory R. Ganger
Best student paper award at EuroSys 2016
PriorityMeister: Tail Latency QoS for Shared Networked Storage
Timothy Zhu, Alexey Tumanov, Michael A. Kozuch, Mor Harchol-Balter, Gregory R. Ganger
SoCC 2014
TetriSched: Space-Time Scheduling for Heterogeneous Datacenters
Alexey Tumanov, Timothy Zhu, Michael A. Kozuch, Mor Harchol-Balter, Gregory R. Ganger
CMU PDL Technical Report CMU-PDL-13-112, Dec 2013
IOFlow: A Software-Defined Storage Architecture
Eno Thereska, Hitesh Ballani, Greg O'Shea, Thomas Karagiannis, Antony Rowstron, Tom Talpey, Richard Black, Timothy Zhu
SOSP 2013
SOFTScale: Stealing Opportunistically For Transient Scaling
Anshul Gandhi, Timothy Zhu, Mor Harchol-Balter, Michael A. Kozuch
CMU Technical Report CMU-CS-12-111 (extended version)
Saving Cash by Using Less Cache
Timothy Zhu, Anshul Gandhi, Mor Harchol-Balter, Michael A. Kozuch
HotCloud 2012