|
Mor Harchol-Balter Professor of Computer Science
-
Department of Computer Science
- Carnegie Mellon University
- Pittsburgh, PA 15213
- Office: Gates Hall 7207
- Phone: (412) 268-7893
- Fax: (412) 268-5576
- harchol@cs.cmu.edu
- Exec. Assistant: Nancy Conway (GHC 9229)
-
Finally Here!
-
|
|
Introduction for students:
I am interested in the performance analysis and design of computer
systems, particularly distributed systems. I use analytical models to capture the important characteristics of a
computer system, and then I prove theorems about these models that
allow me to redesign the system to improve its
performance. Here "performance" might denote response time, energy use, throughput, capacity, etc. Most of my research involves inventing new analytical techniques in the area of performance analysis, as well as new algorithms for resource allocation.
Unlike many theoretical computer scientists, my analysis is based on stochastic (probabilistic) models of computer systems. There is no "adversary" sending us worst-case inputs. By contrast, there is a stream of requests, whose arrival times and service demands come from empirically fitted distributions. These distributions might be correlated, and they often exhibit heavy tailed service demands and high variability.
I believe that many conventional wisdoms on which we base computer system
designs are not well understood and sometimes false, leading to
inferior designs. My research revisits very classic questions in system design.
Here are a few examples of commonly-held beliefs that my research challenges:
- Thousands of "load balancing" heuristics do exactly that -- they
aim to balance the load among the existing hosts. But is load balancing
necessarily a good thing?
- Ever notice that the optimal scheduling policy,
Shortest-Remaining-Processing-Time-First (SRPT), is rarely used in
practice? There's a fear that the big jobs will "starve", or be
treated unfairly as compared with Processor-Sharing (PS). But is
this a reality?
- To minimize mean waiting time in a server farm, research suggests that each job should be sent to the server at which it will experience the least wait. That seems good from the job's perspective, but is the greedy strategy best for the system as a whole?
- Given a choice between a single machine with speed s, or n
identical machines each with speed s/n, which would you choose?
Think again ...
- Migrating active jobs is generally considered too expensive.
Killing jobs midway through execution and restarting them from scratch
later is even worse! Says who?
- Cycle stealing (using another machine's idle cycles to do your work) is
a central theme in distributed systems and has been the topic of thousands
of papers. But can we quantify when cycle stealing pays off, as a function of switching costs and threshold parameters?
- Redundancy is the idea of making multiple copies of the same job, where each copy is dispatched to a different queue, and where one only needs to wait for the first copy to complete. When does redundancy help?
Research Specifics:
I work on the design and performance analysis of computer systems including both theory
and implementation:
Theoretical work on Algorithms and Analysis:
Queueing behavior and stochastic analysis of resource allocation in distributed computer
systems. Load sharing policies, routing policies, cycle stealing, replication algorithms,
power-efficient policies, threshold policies, and
multi-class/multi-server prioritization. Fairness metrics in
scheduling. Correlated arrival processes, and analysis under
high-variability workloads. Known for ``All-Can-Win Theorem'', demonstrating that scheduling policies which are biased towards favoring small jobs can also be preferable to large jobs.
Stochastic Analysis & Queueing Techniques: :
Developing new methods for stochastic
analysis. Examples include: (i) Recursive Dimensionality Reduction (RDR), a technique that
allows one to reduce a Markov chain that grows unboundedly in many dimensions to a Markov chain that grows unboundedly in only one dimension, by using the idea of busy period
transitions; (ii) Recursive Renewal Reward (RRR) and Clearing Analysis on Phases (CAP) , techniques that allow one to
obtain closed-form solutions for many one-dimensional infinite repeating Markov chains, including the M/M/k with setup chain; (iii) Exact analysis of Replication Systems: the first exact solution (in product-form) for replication systems, involving any number of servers, and number of classes, and any degree of replication.
Computer systems implementation:
Resource management in data
centers and other distributed systems. Implementation of QoS tail
latency guarantees for flows in networked storage systems, based on
techniques from Deterministic Network Calculus and Stochastic Network
Calculus (PriorityMeister and SNC-Meister). Online scheduling of jobs
with heterogeneous preferences in data centers with heterogeneous
servers (TetriSched). Dynamic power management in multi-tier data centers
(AutoScale). Scheduling of Memory Controllers (ATLAS).
Pricing and queueing optimizations for cloud services. Kernel-level
connection scheduling in Web servers. Solutions for coping with
transient overload in Web servers. QoS for e-commerce applications
involving the backend database. Prioritization mechanisms for OLTP
transactions in database servers. Job scheduling in
supercomputing centers.
Modeling and Workload characterization:
Known for the discovery of Pareto heavy-tailed distribution of UNIX process CPU lifetimes. Statistical characterization of workloads including UNIX processes, web, OLTP, supercomputing, memory workloads, and parallel jobs.
My TENURE RESEARCH STATEMENT from May 2007:
SOME TALKS
- (2015) ISCA Tutorial & ICDCS Keynote:
What Queueing Theory Teaches Us about Computer Systems Design:
talk slides and
abstract .
- (2014) GreenMetrics Keynote:
Power Management in Data Centers: Theory & Practice:
talk slides and
abstract .
- (2009) Surprising Results on Task Assignment for High-Variability Workloads:
talk slides and abstract .
- (2007) ORSIS '07 Keynote, Lunteren '07 Keynote, MASCOTS '08 Keynote, SIPEW '08 Keynote:
Scheduling in Server Farms:
talk slides and abstract .
- (2006) Analysis of Cycle Stealing and Priority Queueing in Multi-Server Systems via Dimensionality Reduction Technique:
talk slides and abstract .
- (2005) Scheduling Your Network Connections: the SYNC project
talk slides and abstract .
- (2006) Open Workload Generators versus Closed: A cautionary tale
Adam's talk slides and abstract .
- (2004) What Analytical Performance Modeling Teaches Us About Computer System Design
talk slides and abstract .
Current PhD STUDENTS
Graduated PhD STUDENTS
PROFESSIONAL SERVICE
KEYNOTE PRESENTATIONS
CanQueue 2016, ACM SIGMETRICS 2016, ICDCS 2015, GreenMetrics 2014, MASCOTS 2008, SIPEW 2008, Lunteren Conference 2007, ORSIS 2007.
ANNUAL TALK on Applying to Ph.D. Programs in Computer Science (pdf) .
TEACHING
- (2014S) 21-127 Concepts of Mathematics (undergraduate)
- (1999,2000,2001,2002,2003,2005,2006,2009,2011,2012,2014,2015,2017 Next in 2019F )
15-857 Analytical Performance Modeling (graduate)
- (2004,2006,2007,2008,2012,2013S&F,2015,2016,2018):
15-359 Probability and Computing (undergraduate)
- NEW FOR SPRING 2019: 15-359 is being renamed as 15-259:
15-259 Probability and Computing (undergraduate)
- (Spring 2007):
15-858A Advanced Stochastic Processes (graduate)
- (Spring 2000, Spring 2001, Fall 2002): 15-441 Computer Networks (undergraduate)
- (2000 through present): SQUALL (Scheduling and QUeueing Around Lunchtime)
|