15-829: Performance Modeling Tools for Computer Systems Researchers (No Math!)

Meets: FRIDAYS 10:10 a.m. - 1:10 p.m., Room: GHC 4301

12 Units

CLASS STARTS SEPTEMBER 9, 2022

www.cs.cmu.edu/~harchol/Tools/class.html

INSTRUCTOR: Prof. Mor Harchol-Balter

TA: Shashank Obla (volunteer)

Click here: ANNOUNCEMENTS/HOMEWORKS

Office Hours:

Tuesday 5:00 p.m. - 6:00 p.m. in Hamerschlag Hall A208 with Shashank
Wednesday 12:00 noon - 1:00 p.m. in GHC 7207 with Mor
Thursday 1:00 p.m. - 2:00 p.m. in GHC 7207 with Mor

DESCRIPTION:

This class is aimed at computer systems PhD students who are already involved in doing systems research, where the goal is to improve the performance of the system. Improving performance could involve reducing response times, providing class-based response time differentiation, improving tail behavior, scheduling to favor certain jobs, reducing loss/drop rate, increasing throughput, increasing revenue, reducing power or other costs, load balancing, etc.
Improving systems performance involves queue management and resource allocation, both major topics in queueing theory. While queueing theory classes traditionally involve heavy mathematics, the goal of this class is to teach systems students performance modeling and queueing theory in a super intuitive manner, without covering proofs, and without requiring a probability background. The focus of the class will be on learning how to translate computer systems performance problems into the appropriate queueing network framework. Each class is divided into two parts. The first half presents a lesson in queueing theory, modeling, simulation, or workload characterization. The second half is devoted to having a student in the class present their own computer systems performance research problem. Together, we will figure out in real time how to model this research problem as a queueing network and solve the problem. In between the two halves, we will share pizza!

Prerequisites:

No prerequisites, other than that you should be a PhD student and should already be working on computer systems research in which you are looking to improve the performance of your system.

Textbook: Performance Modeling and Design of Computer Systems.

Class meets these Fridays:

Sept 9, 16, 23 (skipping Sept 30, but there's a make-up on Oct 28)
Oct 7, 14, (Oct 21 is Fall break), 28 (make-up for missing Sept 30)
Nov 4, 11, 18 (Nov 25 is Thanksgiving break)
Dec 2, 9

Tentative Syllabus of Queueing Topics: (in progress)

[Sept 9: Morning] Vocabulary: Speaking like a queueing theorist (Chpt 2)
- Single-server system
- Translating systems-speak into queueing-speak
- Queueing network
- Response time
- Load (Utilization)
- Throughput
- Maximum allowable arrival rate
- Closed systems versus Open systems
[Sept 9: Afternoon] Distributions and how to generate these for simulation (Chpt 3 plus 4.1)
- Common discrete distributions
- Common continuous distributions
- Computing means and tails
- Generating distributions for simulation via Inverse Transform Method.
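As an illustration of the Inverse Transform Method listed above, here is a minimal Python sketch for the Exponential distribution. The function names (`sample_exponential`, `sample_mean`) are hypothetical and chosen only for this example; the idea is to draw a uniform value u and solve F(x) = u for x.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one Exp(rate) sample using the inverse transform method."""
    u = rng.random()  # uniform on [0, 1)
    # CDF: F(x) = 1 - exp(-rate * x). Solving F(x) = u gives the line below.
    return -math.log(1.0 - u) / rate


def sample_mean(rate, n, seed=0):
    """Average of n samples; should be close to 1 / rate for large n."""
    rng = random.Random(seed)
    return sum(sample_exponential(rate, rng) for _ in range(n)) / n


if __name__ == "__main__":
    print(sample_mean(rate=2.0, n=200_000))  # expected to be near 0.5
```

The same recipe applies to any distribution whose CDF can be inverted in closed form.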
[Sept 16: Morning] Poisson arrival Process (Chpt 4.2 plus Chpt 11)
- Generating distributions for simulation via Accept-Reject Method
- Memorylessness
- Poisson Process definition
- Some Poisson Process properties
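The Accept-Reject Method listed above can be sketched as follows, assuming a target density on a bounded interval and a uniform proposal. The density f(x) = 2x on [0, 1] and all function names are hypothetical choices made for this example only.

```python
import random


def accept_reject(pdf, lo, hi, pdf_max, rng):
    """Sample from `pdf` on [lo, hi], where pdf(x) <= pdf_max everywhere."""
    while True:
        x = rng.uniform(lo, hi)        # candidate from the uniform proposal
        y = rng.uniform(0.0, pdf_max)  # uniform height under the bounding box
        if y <= pdf(x):                # accept if the point lies under the density
            return x


def example_pdf(x):
    """Example target density f(x) = 2x on [0, 1]; its mean is 2/3."""
    return 2.0 * x


def example_mean(n, seed=0):
    rng = random.Random(seed)
    total = sum(accept_reject(example_pdf, 0.0, 1.0, 2.0, rng) for _ in range(n))
    return total / n


if __name__ == "__main__":
    print(example_mean(100_000))  # expected to be near 2/3
```

With this bounding box, each candidate is accepted with probability 1/2, so the method needs about two candidates per sample on average.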
[Sept 16: Afternoon] Event-driven simulation and PASTA (Chpt 14 new PnC book)
- How to run an event-driven simulation of queueing networks.
- The right and wrong ways to measure performance in a simulation.
- The important role that PASTA plays in simulation.
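To make the event-driven simulation topic above concrete, here is a minimal sketch of a single FCFS queue with Poisson arrivals and Exponential service times (an M/M/1 queue). It is not the course's own simulator; the function name `simulate_mm1` and its parameters are assumptions for illustration. The clock jumps from event to event, and response time is averaged over completed jobs.

```python
import random
from collections import deque


def simulate_mm1(lam, mu, num_jobs, seed=0):
    """Return the mean response time over the first `num_jobs` completions."""
    rng = random.Random(seed)
    next_arrival = rng.expovariate(lam)
    next_departure = float("inf")  # no job in service yet
    in_system = deque()            # arrival times; the head is the job in service
    total_response = 0.0
    completed = 0

    while completed < num_jobs:
        if next_arrival <= next_departure:
            # Arrival event: the job joins the queue.
            clock = next_arrival
            in_system.append(clock)
            if len(in_system) == 1:  # server was idle, so start service now
                next_departure = clock + rng.expovariate(mu)
            next_arrival = clock + rng.expovariate(lam)
        else:
            # Departure event: record this job's response time.
            clock = next_departure
            total_response += clock - in_system.popleft()
            completed += 1
            if in_system:
                next_departure = clock + rng.expovariate(mu)
            else:
                next_departure = float("inf")

    return total_response / completed


if __name__ == "__main__":
    # For M/M/1, theory gives E[T] = 1 / (mu - lam), which is 2 here.
    print(simulate_mm1(lam=0.5, mu=1.0, num_jobs=200_000))
```

This sketch includes the initial empty-system transient in its average; a more careful study would discard a warm-up period.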
[Sept 23: Morning + Afternoon] Variance and the M/G/1 queue (Chpt 23)
- What is variance and how to compute it
- Squared coefficient of variation
- M/G/1 formula for E[T]
- Inspection paradox
- What happens when load is low and variability is high
- Applications: double arrival rate and speed
- Applications: Statistical multiplexing vs. Freq-Division Multiplexing
- Applications: Comparison of 3 ways of sharing capacity.
- Applications: Load balancing.
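The syllabus does not spell out the M/G/1 formula for E[T]; the sketch below assumes it is the standard Pollaczek-Khinchine result for an FCFS M/G/1 queue, written in terms of load and the squared coefficient of variation of job size. The function name and parameter names are hypothetical.

```python
def mg1_mean_response_time(lam, mean_size, scv):
    """Mean response time E[T] for an FCFS M/G/1 queue.

    lam       -- arrival rate
    mean_size -- mean job size E[S]
    scv       -- squared coefficient of variation of S, Var(S) / E[S]^2
    """
    rho = lam * mean_size  # load (utilization)
    if not 0.0 <= rho < 1.0:
        raise ValueError("load must satisfy 0 <= rho < 1 for stability")
    # Mean time in queue: rho / (1 - rho) * E[S] * (C^2 + 1) / 2
    mean_queueing_time = rho / (1.0 - rho) * mean_size * (scv + 1.0) / 2.0
    return mean_size + mean_queueing_time


if __name__ == "__main__":
    # Same load, increasing job-size variability.
    for scv in (0.0, 1.0, 25.0):
        print(scv, mg1_mean_response_time(lam=0.5, mean_size=1.0, scv=scv))
```

Holding load fixed, the queueing term grows linearly in the squared coefficient of variation, which is why high job-size variability can hurt even at moderate load.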
[Sept 30: NO CLASS]
[Oct 7: Morning] Heavy-tailed workloads (Chpt 20,24)
- Distribution of compute usage in today's data centers
- Distribution of memory usage in today's data centers
- How these distributions have changed over time
- Heavy-tailed property
- Decreasing failure rate property
- Implications for load balancing
- Implications for scheduling
[Oct 7: Afternoon] Interactive student presentations
- Finish up lecture.
- Timothy Kim research presentation.
[Oct 14: Morning] Load Balancing Policies (Chpt 24.1, Starting Chpt 28,29)
- Load balancing policies -- not based on size
- Load balancing policies -- that use size
- Round-robin versus Random
- Dynamic versus Static
- M/G/k versus Least-Work-Left
- SITA -- Size Interval Task Assignment
- The benefits of *unbalancing* load
- Introduction to Scheduling
- Scheduling definitions: work-conserving, preemption
- Discussion of scheduling policies that do not use size
[Oct 14: Afternoon] Interactive student presentations
- Finish up lecture.
- Tianshu Huang research on Spatial Edge Computing.
[Oct 21: NO CLASS] Mid-Semester break
[Oct 28: Morning] Scheduling (Chpts 28-33)
- Non-preemptive, non-size-based scheduling policies: FCFS, LCFS, Random
- Preemptive, non-size-based scheduling policies: PS, P-LCFS
- Non-preemptive, size-based: priority queues, SJF
- Preemptive, size-based: priority queues, PSJF, SRPT
- Comparison of scheduling policies with respect to response time.
- Scheduling when job sizes are unknown: FB, SERPT, Gittins
- Unfairness and starvation in scheduling
- Scheduling to optimize the tail of response time
- Scheduling when jobs have value and size
- Scheduling when jobs have deadlines
[Oct 28: Afternoon] Interactive student presentations
- Finish up lecture.
[Nov 4: Morning] Closed Systems and What-If analysis (Chpts 6,7)
- Little's Law
- Implications/examples of Little's Law in Open systems
- Closed system terminology
- Implications/examples of Little's Law for Closed systems
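A small worked example of Little's Law in both settings, as a sketch. For an open system, E[N] = lambda * E[T]. For a closed interactive system with N users, throughput X, and mean think time E[Z], applying Little's Law to the whole loop gives E[R] = N / X - E[Z]. The function names and the numbers below are made up for illustration.

```python
def mean_jobs_open(arrival_rate, mean_response_time):
    """Little's Law for an open system: E[N] = lambda * E[T]."""
    return arrival_rate * mean_response_time


def mean_response_closed(num_users, throughput, mean_think_time):
    """Closed interactive system: E[R] = N / X - E[Z]."""
    return num_users / throughput - mean_think_time


if __name__ == "__main__":
    # Open: 3 jobs/sec arriving, 2 sec mean response time.
    print(mean_jobs_open(3.0, 2.0))
    # Closed: 20 users, 4 jobs/sec throughput, 3 sec mean think time.
    print(mean_response_closed(20, 4.0, 3.0))
```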
[Nov 4: Afternoon] One or two interactive student presentations
- Kaiyang Zhao -- Research on memory allocators in operating systems.
[Nov 11: Morning] Closed Systems and What-If analysis, cont. (Chpts 6, 7)
- More Operational Laws
- "What If" analysis for closed systems
- Differences between Open and Closed systems
[Nov 11: Afternoon] One or two interactive student presentations
- Naifeng Zhang -- Research on automatic high-performance code generation.
[Nov 18: Morning] Differences between Open and Closed Systems
- More practice with closed systems
- Doubling server speed: effect in open versus closed
- Load in closed systems
- Mean response time in open versus closed
- Effect of job size variability under open versus closed
- Effect of scheduling in open versus closed
[Nov 18: Afternoon] Two interactive student presentations
- Jekyeom Jeon -- Minimizing garbage collection costs by using ZNS SSDs in a distributed file system.
- Eric (Yuxuan) Zheng -- Research on distributed deep learning systems.
[Nov 25: NO CLASS] Thanksgiving break
[Dec 2: Morning] More Queueing Tools (Chpt 3, 11, 13)
- Minimum of independent random variables.
- Application: Redundancy analysis
- Maximum of independent random variables.
- Application: Fork-Join analysis
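For the minimum and maximum topics above, here is a sketch restricted to independent Exponential random variables, which is an assumption made for this example and may be narrower than what the lecture covers. The minimum of independent Exponentials is Exponential with the summed rate, and the mean of the maximum of n i.i.d. Exp(rate) variables is (1 + 1/2 + ... + 1/n) / rate. Function names are hypothetical.

```python
import random


def mean_min_exponentials(rates):
    """E[min] of independent Exp(rate_i) variables: 1 / sum of rates."""
    return 1.0 / sum(rates)


def mean_max_iid_exponentials(n, rate):
    """E[max] of n i.i.d. Exp(rate) variables: harmonic number H_n / rate."""
    return sum(1.0 / i for i in range(1, n + 1)) / rate


def simulated_mean_max(n, rate, trials, seed=0):
    """Monte Carlo estimate of E[max], as a sanity check on the formula."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(rate) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    # Redundancy-style question: time until the first of 3 copies finishes.
    print(mean_min_exponentials([1.0, 1.0, 1.0]))
    # Fork-join-style question: time until all 3 parallel tasks finish.
    print(mean_max_iid_exponentials(3, 1.0), simulated_mean_max(3, 1.0, 100_000))
```

Under these assumptions, waiting for the first of three copies takes one third of a single task's mean time, while waiting for all three takes 11/6 of it.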
[Dec 2: Afternoon] Two interactive student presentations
- Ziyue Qiu talking on cache management for modern software.
- Shawn Chen -- high-performance software-defined networking control plane
[Dec 9: Morning] Ask me anything
- This class is your chance to ask anything else you want to know in the space of queueing theory. You can also revisit questions about your own research.
- Other topics that I will cover:
  - Evaluating response time tails and 99th percentiles. (Chpts 3, 5, 25, 31-33)
    - Markov's Inequality for Tails
    - Chebyshev's Inequality for Tails
    - Central Limit Theorem Approximation for Sums
    - Examples, including Mor's Capacity Provisioning Idea.
    - 99th Percentiles
  - Setup Times (Chpt 27)
    - Implications for power management
    - Implications for caching/memory management
[Dec 9: Afternoon] One interactive student presentation
- Nikhil Agarwal -- Research on CGRAs and dataflow architectures.

Some Application Areas we will cover:

Meeting QoS Service Level Objectives
Capacity provisioning, work-stealing
Load balancing algorithms
Dynamic power management
Network routing
Scheduling of parallelizable jobs with different speedup functions
Admission control for database systems
Caching to minimize response time
Managing Supercomputing Centers

GRADING:

Short weekly homeworks -- worth 40%. Homeworks will involve simulations and simple derivations, rather than proofs (learning by observing).
In-class presentation(s) -- worth 25%.
Attendance and Participation in presentations of others -- worth 15%.
Writeup (Due December 2) showing how you applied queueing theory to the performance of your system -- worth 20%.
Standard grading scale: 90% - 100% is A; 80% - 89% is B; 70% - 79% is C; and so on, typically with a curve at the end.
No quizzes or tests. Your goal in this class is to improve your own research and that of others in the class.