Timothy Zhu
Graduate Student
Computer Science Department
Carnegie Mellon University
Office: Gates Hillman Center (GHC) 7010
Email: timothyz (at) cs (dot) cmu (dot) edu
Advisor: Mor Harchol-Balter
Research: I am interested in the performance analysis and design of computer systems. I enjoy building systems and finding practical ways of solving resource management and scheduling problems using mathematically sound techniques.
My main research focus is on meeting tail latency Service Level Objectives (SLOs) in shared storage and networks. Long tail latencies are a pervasive problem in datacenter environments, and many companies and researchers are working to better control latency. Congestion is one of the main sources of tail latency, and I believe that analysis techniques such as Stochastic Network Calculus (SNC) and Deterministic Network Calculus (DNC) are useful tools for determining how to control congestion. Our IOFlow paper (SOSP 2013) introduces a QoS architecture that provides rate control and prioritization of storage and network I/O. Our PriorityMeister paper (SoCC 2014) addresses how to automatically configure priorities and rate limits to meet tail latency SLOs using DNC. I am continuing this line of work as my thesis, and I look forward to exploring other methods of using resources more efficiently to meet performance goals such as tail latency SLOs.
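To give a flavor of how DNC yields latency guarantees: for a flow shaped by a token-bucket arrival curve and a server offering a rate-latency service curve, the worst-case delay has a simple closed form. The sketch below is illustrative only; the function name and parameter values are my own, not from the papers above.

```python
# Hedged sketch: the classic DNC delay bound for a flow with a
# token-bucket arrival curve alpha(t) = b + r*t served under a
# rate-latency service curve beta(t) = R * max(0, t - T).
def dnc_delay_bound(b, r, R, T):
    """Worst-case delay is the maximum horizontal distance between
    alpha and beta, which for these curves is T + b/R, valid when
    the flow's rate r does not exceed the service rate R."""
    if r > R:
        raise ValueError("flow rate exceeds service rate; no finite bound")
    return T + b / R

# Example: 10 KB burst, 1 MB/s flow rate, 10 MB/s server, 2 ms latency term
# gives a worst-case delay of T + b/R = 2 ms + 1 ms = 3 ms.
print(dnc_delay_bound(b=10e3, r=1e6, R=10e6, T=2e-3))
```

Configuring priorities and rate limits to meet SLOs amounts to choosing these curve parameters per workload so that each workload's delay bound stays under its SLO.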
I have also worked on cluster scheduling problems and am interested in better ways of scheduling jobs to take advantage of specialized resources. Heterogeneous resources raise new scheduling questions. For example, is it better to statically partition specialized resources or to dynamically schedule a large pool of heterogeneous resources? And when dynamically scheduling such a pool, should the scheduler wait for a specialized resource to become available in the future or use a slower alternative that is immediately available? We have started to investigate some of these questions in our TetriSched work, but there is still more research to be done in this area.
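The wait-versus-fallback question above has a simple core tradeoff that can be sketched in a few lines. This is a toy decision rule of my own for illustration, not TetriSched's actual algorithm, which reasons about much richer placement choices.

```python
# Hedged sketch of the wait-vs-fallback tradeoff: wait for a
# specialized (fast) resource, or start now on a slower one?
# All names and numbers here are hypothetical.
def choose_placement(wait_for_fast, runtime_fast, runtime_slow):
    """Pick the option with the smaller completion time, assuming
    the wait time and runtimes are known (in practice they are
    estimates, which is part of what makes scheduling hard)."""
    if wait_for_fast + runtime_fast < runtime_slow:
        return "wait-for-specialized"
    return "run-on-general"

# Waiting 30 s for a machine that runs the job in 60 s beats
# starting immediately on machines that would take 300 s.
print(choose_placement(30, 60, 300))  # -> wait-for-specialized
```

Even this toy version shows why estimates matter: if the wait time or runtimes are mispredicted, the scheduler can easily pick the slower option.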
I believe that underlying all of these resource management problems is a need for automated configuration of resources. It is too difficult and cumbersome for operators to constantly optimize system parameters. Furthermore, the knowledge and expertise from better performance analysis techniques can be embedded into systems that automatically tune parameters. In some of my earlier work, I looked at tuning the number of VMs: in our HotCloud 2012 work, I investigated techniques for elastically scaling memcached resources to reduce the costs of cloud web services, and during an internship at Google, I investigated auto-scaling resources to meet deadlines for multi-phase batch jobs. More recently, I have been looking at tuning QoS parameters to meet tail latency SLOs. I am excited about research problems in automatically managing resources to meet performance goals, and I hope to continue working on these types of problems.
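As a minimal illustration of automated tuning, a threshold-based scaling rule captures the basic idea of elastically adjusting the number of servers to load. This sketch is in the spirit of the elastic scaling work above, but the thresholds, names, and scale-down check are illustrative assumptions, not the paper's actual policy.

```python
# Hedged sketch: threshold-based auto-scaling of a server pool.
# Thresholds (low/high utilization) are hypothetical parameters.
def target_servers(current, utilization, low=0.3, high=0.7):
    """Add a server when average utilization is high; remove one only
    when the load would still fit comfortably on the remaining servers."""
    if utilization > high:
        return current + 1
    if (utilization < low and current > 1
            and utilization * current / (current - 1) < high):
        return current - 1
    return current

print(target_servers(4, 0.8))  # overloaded -> 5
print(target_servers(4, 0.2))  # underloaded, load fits on 3 -> 3
print(target_servers(4, 0.5))  # in range -> 4
```

For stateful services like memcached, the real problem is harder than this sketch suggests, since removing a server also evicts its cached data and temporarily raises miss rates.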