15-779: Advanced Topics in Machine Learning Systems (LLM Edition)

Course Information

Machine learning (ML) techniques, especially recent advances in large language models and generative AI, have surpassed human predictive performance in a variety of real-world tasks. This success is enabled by the recent development of ML systems (e.g., PyTorch) that provide high-level programming interfaces for easily prototyping ML models on modern hardware platforms. In this course, we will explore the design of modern ML systems by learning how an ML model written in a high-level language is decomposed into low-level kernels and executed across heterogeneous hardware accelerators (e.g., TPUs and GPUs) in a distributed fashion. Topics covered in this course include programming models for expressing ML models, deep learning accelerators, ML compilation, programming techniques for modern GPUs (e.g., H100 and B200), distributed training techniques, auto-parallelization, computation graph optimizations, automated kernel generation, and memory optimizations. The main goal of this course is to provide a comprehensive view of how existing ML systems work. Throughout the course, we will also learn the design principles behind these systems and discuss the challenges and opportunities in building future ML systems for next-generation ML applications and hardware platforms.


  • Instructor: Zhihao Jia
  • Office hours: upon request