15-783: Trustworthy AI: Theory & Practice (Fall 2025)

Instructor: Aditi Raghunathan (raditi@cmu.edu) · TA: Andy Zou (jzou4@cs.cmu.edu) · Tue & Thu 3:30–5:00pm · GHC 4307

Overview

As AI systems become more capable and widely deployed, ensuring their reliability, robustness, and alignment with human intent is critical. This advanced machine learning class is designed for students interested in both theoretical insights and practical implications, bridging research in machine learning, security, and AI alignment to address some of the most pressing challenges in modern AI development. Through a mix of foundational papers and recent advances, the class will investigate recurring themes across security, robustness, and alignment, drawing connections to classical machine learning principles and modern scaling trends. Discussions will emphasize not only what works but also why it works (or fails)—aiming to equip students with the conceptual tools to critically assess current methods and develop principled approaches for trustworthy AI.

Course content

Module 1 — Jailbreaking & Adversarial Attacks

Learning goals: Understand the basic ideas behind adversarial inputs, jailbreaking, and prompt injection; gain hands-on experience attacking leading models

  • Attacks: (prompt) optimization, white box vs black box, transfer, images vs text, multi-turn attacks, agent attacks
  • Defenses: adversarial training, circuit breakers, constitutional classifiers, red-teaming in practice, impossibility results, sample complexity

Module 2 — Privacy & Memorization

Learning goals: Understand differential privacy and formalisms of memorization; learn about measurement and mitigation at scale

  • Privacy: differential privacy (algorithmic foundations, utility-privacy tradeoff, applications to deep learning); non-parametric approaches
  • Memorization: Membership inference and data extraction, mechanics of memorization, unlearning

Module 3 — Reliability in the Wild

Learning goals: Understand why models struggle with shortcuts and under distribution shifts, recognize challenges in AI alignment, and critically examine benchmarks

  • Distribution shifts: accuracy-on-the-line, connections to robustness of alignment (post-training), factuality, context reliance
  • Spurious correlations: inductive bias, robust optimization, reward hacking, goal misgeneralization
  • Guest lecture: ethics and society

Pre-requisites

This class assumes expertise in machine learning and deep learning: concepts such as generalization, regularization, basics of optimization, probability, and linear algebra, as well as experience training deep networks. The class also assumes some familiarity with large language models (transformers, datasets, prompting). There are no official pre-requisites, and we will cover some background material in the lectures, but students must be ready to undertake a significant course project.

Learning resources

There is no official textbook for this class. We will put up lecture slides with pointers to relevant papers and textbook material.

Assessment and course policies

Deadlines

Schedule