15-783 · Trustworthy AI: Theory & Practice (Fall 2025)

Overview

As AI systems become more capable and widely deployed, ensuring their reliability, robustness, and alignment with human intent is critical. This advanced machine learning class is designed for students interested in both theoretical insights and practical implications, bridging research in machine learning, security, and AI alignment to address some of the most pressing challenges in modern AI development. Through a mix of foundational papers and recent advances, the class will investigate recurring themes across security, robustness, and alignment, drawing connections to classical machine learning principles and modern scaling trends. Discussions will emphasize not only what works but also why it works (or fails)—aiming to equip students with the conceptual tools to critically assess current methods and develop principled approaches for trustworthy AI.

Course content

Module 1 — Jailbreaking & Adversarial Attacks

Learning goals: Understand the basic ideas behind adversarial inputs, jailbreaking, and prompt injection; gain hands-on experience attacking leading models

Attacks: (prompt) optimization, white box vs black box, transfer, images vs text, multi-turn attacks, agent attacks
Defenses: adversarial training, circuit breakers, constitutional classifiers, red-teaming in practice, impossibility results, sample complexity

Module 2 — Privacy & Memorization

Learning goals: Understand differential privacy and formalisms of memorization, along with their measurement and mitigation at scale

Privacy: differential privacy (algorithmic foundations, utility-privacy tradeoff, applications to deep learning); non-parametric approaches
Memorization: Membership inference and data extraction, mechanics of memorization, unlearning

Module 3 — Reliability in the Wild

Learning goals: Understand why models struggle with shortcuts and under distribution shifts, recognize challenges in AI alignment, and critically examine benchmarks

Distribution shifts: accuracy-on-the-line, connections to robustness of alignment (post-training), factuality, context reliance
Spurious correlations: inductive bias, robust optimization, reward hacking, goal misgeneralization
Guest lecture: ethics and society

Pre-requisites

This class assumes expertise in machine learning and deep learning: familiarity with concepts such as generalization, regularization, basics of optimization, probability, and linear algebra, plus experience training deep networks. The class also assumes some familiarity with large language models (transformers, datasets, prompting). There are no official pre-requisites for this class, and we will cover some background material in the lectures, but students must be ready to undertake a significant course project.

Learning resources

There is no official textbook for this class. We will put up lecture slides with pointers to relevant papers and textbook material.

Assessment and course policies

Modules 1 and 2 will have one homework each, and we will have one in-class assessment overall. Students will also work on a course project and present papers during the latter part of the course (overlapping with Module 3).
Grading will be based on:
- Homework assignments (35%)
- In-class assessment (20%)
- Project presentation (10%)
- Project + project report (35%)
- Class participation (extra credit, 5%)
Course projects should be carried out in groups of 2 or 3. More instructions will be provided soon.
Use of generative AI is allowed. Students are encouraged to work through assignments independently and use generative AI primarily for coding assistance. All uses of generative AI must be disclosed in assignments and project reports.

Deadlines

Project final report: Dec 12 2025
