Overview
As AI systems become more capable and widely deployed, ensuring their reliability, robustness, and alignment with human intent is critical. This advanced machine learning class is designed for students interested in both theoretical insights and practical implications. It bridges research in machine learning, security, and AI alignment to address some of the most pressing challenges in modern AI development. Through a mix of foundational papers and recent advances, the class will investigate recurring themes across security, robustness, and alignment, drawing connections to classical machine learning principles and modern scaling trends. Discussions will emphasize not only what works but also why it works (or fails), with the aim of equipping students with the conceptual tools to critically assess current methods and develop principled approaches for trustworthy AI.
Course content
Module 1 — Jailbreaking & Adversarial Attacks
Learning goals: Understand the basic ideas behind adversarial inputs, jailbreaking, and prompt injection; gain hands-on experience attacking leading models
- Attacks: (prompt) optimization, white box vs black box, transfer, images vs text, multi-turn attacks, agent attacks
- Defenses: adversarial training, circuit breakers, constitutional classifiers, red-teaming in practice, impossibility results, sample complexity
Module 2 — Privacy & Memorization
Learning goals: Understand differential privacy and formalisms of memorization, along with measurement and mitigation at scale
- Privacy: differential privacy (algorithmic foundations, utility-privacy tradeoff, applications to deep learning); non-parametric approaches
- Memorization: membership inference and data extraction, mechanics of memorization, unlearning
Module 3 — Reliability in the Wild
Learning goals: Understand why models struggle with shortcuts and under distribution shifts, recognize challenges in AI alignment, and critically examine benchmarks
- Distribution shifts: accuracy-on-the-line, connections to robustness of alignment (post-training), factuality, context reliance
- Spurious correlations: inductive bias, robust optimization, reward hacking, goal misgeneralization
- Guest lecture: ethics and society
Pre-requisites
This class assumes a strong background in machine learning and deep learning: concepts such as generalization, regularization, basics of optimization, probability, and linear algebra, as well as experience with training deep networks. The class also assumes some familiarity with large language models (transformers, datasets, prompting). There are no official pre-requisites, and we will cover some background material in the lectures, but students should be ready to undertake a significant course project.
Learning resources
There is no official textbook for this class. We will put up lecture slides with pointers to relevant papers and textbook material.
Assessment and course policies
- Modules 1 and 2 will have one homework each, and we will have one in-class assessment overall. Students will also work on a course project and present papers during the latter part of the course (overlapping with Module 3).
- Grading will be based on:
  - Homework assignments (35%)
  - In-class assessment (20%)
  - Project presentation (10%)
  - Project + project report (35%)
  - Class participation (extra credit, 5%)
- Course projects should be carried out in groups of 2 or 3. More instructions will be provided soon.
- Use of generative AI is allowed. Students are encouraged to work through assignments independently and use generative AI primarily for coding assistance. All uses of generative AI must be disclosed in assignments and project reports.
Deadlines
- Project final report: Dec 12, 2025