I'm Klas and I'm a PhD student in the Accountable Systems Lab at Carnegie Mellon University, advised by Matt Fredrikson. My research concentrates on demystifying deep learning, and understanding its weaknesses and vulnerabilities. I work to improve the security, transparency, and generality of deep neural networks, with a focus on applications in data privacy and computer vision.

GHC 7004 | kleino at cs.cmu.edu | klasleino

Research Interests

My research concentrates on demystifying deep learning, and understanding its weaknesses and vulnerabilities. I work to improve the security, transparency, and generality of deep neural networks, with a focus on applications in data privacy and computer vision. My work fits primarily under the sub-fields of explainable AI and ML security. Explainable AI aims to bring interpretability and transparency to otherwise opaque deep learning methods, giving us a richer understanding of their inner workings. ML security addresses concerns including attacks that compromise data privacy and attacks that fool even state-of-the-art models. Currently, I am most interested in three major themes: explaining black-box neural network behavior, creating a theory of network generalization, and developing robust and private models.

Explaining Black-box Neural Network Behavior

In recent years, deep neural networks have become increasingly powerful at tasks that previously only humans had mastered. Deep learning has become widely used, and while it has many practitioners, its inner workings are far from well understood. As the application of ML has increased, so has the need for algorithmic transparency: the ability to understand why algorithms deployed in the real world make the decisions they do. Much of my work has addressed the problem of determining which aspects of a network influence particular decisions, and of interpreting the influential components once they are identified. Influence can be used to increase model trust, to uncover insights discovered by ML models, and as a building block for debugging arbitrary network behavior.

A Theory of Network Generalization

Despite having the capacity to significantly overfit, or even memorize, the training data, deep neural networks generalize reasonably well in practice. Existing hypotheses have failed to explain why this is the case. In fact, it is not well understood how exactly overfitting manifests in a model. One aspect of my work tries to understand what phenomena give rise to misclassifications, overfitting, and bias in DNNs. Understanding the causes of these problems will also shed light on what leads models to generalize, and may suggest ways of improving generalization. Furthermore, as overfitting presents a threat to the security of a model, understanding overfitting more fundamentally may help protect the privacy of the data involved in training a model, and improve the model's robustness to adversarial manipulation. I develop explanations for these problems that have direct applications to membership inference, misclassification prediction, and bias amplification.

Robust and Private Models

Deep neural networks have seen great success in many domains, with the ability to master complex tasks such as image recognition, text translation, and medical diagnosis. Despite their remarkable abilities, neural networks have several peculiar weaknesses. In particular, there are concerns around the lack of robustness of deep networks to malicious perturbations of their inputs, and around deep networks' tendency to leak private information about their training data. My research sheds light on privacy weaknesses in deep networks, paving the way for the development of training routines that ensure privacy without sacrificing the utility of the resulting model. I am also involved in work towards building models that are not fooled by malicious input perturbations.


Leveraging Model Memorization for Calibrated White-Box Membership Inference

Membership inference (MI) attacks exploit a learned model's lack of generalization to infer whether a given sample was in the model's training set. Known MI attacks generally work by casting the attacker's goal as a supervised learning problem, training an attack model from predictions generated by the target model, or by others like it. However, we find that these attacks do not often provide a meaningful basis for confidently inferring training set membership, as the attack models are not well-calibrated. Moreover, these attacks do not significantly outperform a trivial attack that predicts that a point is a member if and only if the model correctly predicts its label. In this work we present well-calibrated MI attacks that allow the attacker to accurately control the minimum confidence with which positive membership inferences are made. Our attacks take advantage of white-box information about the target model and leverage new insights about how overfitting occurs in deep neural networks; namely, we show how a model's idiosyncratic use of features can provide evidence for membership. Experiments on seven real-world datasets show that our attacks support calibration for high-confidence inferences, while outperforming previous MI attacks in terms of accuracy. Finally, we show that our attacks achieve non-trivial advantage on some models with low generalization error, including those trained with small-epsilon-differential privacy; for large-epsilon (epsilon=16, as reported in some industrial settings), the attack performs comparably to unprotected models.

Klas Leino and Matt Fredrikson. "Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference."
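
The trivial baseline mentioned in the abstract above (guess "member" if and only if the model predicts the sample's label correctly) can be sketched in a few lines. This is only an illustration of that baseline, not the calibrated white-box attack from the paper; the names `naive_membership_attack`, `attack_accuracy`, and `toy_predict` are hypothetical, and the toy model is an assumption made so the example runs on its own.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Sample = Tuple[Hashable, int]  # (input, true label)


def naive_membership_attack(
    predict: Callable[[Hashable], int], samples: Sequence[Sample]
) -> List[bool]:
    """Guess 'member' exactly when the model classifies the sample correctly."""
    return [predict(x) == y for x, y in samples]


def attack_accuracy(guesses: Sequence[bool], is_member: Sequence[bool]) -> float:
    """Fraction of samples whose membership was guessed correctly."""
    if not is_member or len(guesses) != len(is_member):
        raise ValueError("guesses and is_member must be non-empty and equal length")
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)


# Hypothetical toy model: it memorizes its training points and otherwise
# always answers class 0, so it is badly overfit by construction.
train = {(0.0,): 1, (1.0,): 0, (2.0,): 1}


def toy_predict(x: Hashable) -> int:
    return train.get(x, 0)


# Two training points followed by two held-out points.
samples = [((0.0,), 1), ((1.0,), 0), ((3.0,), 1), ((4.0,), 0)]
is_member = [True, True, False, False]

guesses = naive_membership_attack(toy_predict, samples)
accuracy = attack_accuracy(guesses, is_member)
```

On this toy data the baseline labels both training points as members, correctly rejects one held-out point, and wrongly accepts the other because the model happens to classify it correctly. That false positive is the kind of error that makes the baseline hard to use for confident inferences.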

Feature-wise Bias Amplification [ICLR 2019]

We study the phenomenon of bias amplification in classifiers, wherein a machine learning model learns to predict classes with a greater disparity than the underlying ground truth. We demonstrate that bias amplification can arise via an inductive bias in gradient descent methods that results in the overestimation of the importance of moderately-predictive "weak" features if insufficient training data is available. This overestimation gives rise to feature-wise bias amplification — a previously unreported form of bias that can be traced back to the features of a trained model. Through analysis and experiments, we show that while some bias cannot be mitigated without sacrificing accuracy, feature-wise bias amplification can be mitigated through targeted feature selection. We present two new feature selection algorithms for mitigating bias amplification in linear models, and show how they can be adapted to convolutional neural networks efficiently. Our experiments on synthetic and real data demonstrate that these algorithms consistently lead to reduced bias without harming accuracy, in some cases eliminating predictive bias altogether while providing modest gains in accuracy.

Klas Leino, Emily Black, Matt Fredrikson, Shayak Sen, and Anupam Datta. "Feature-Wise Bias Amplification." International Conference on Learning Representations.
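
As a rough illustration of what "predicting classes with a greater disparity than the ground truth" means, the sketch below compares how often a class appears in a model's predictions with how often it appears in the true labels. This is a simplified measure chosen for the example and is an assumption, not necessarily the formal definition used in the paper; the function names and the toy labels are hypothetical.

```python
from typing import Sequence


def class_rate(labels: Sequence[int], cls: int) -> float:
    """Fraction of labels equal to cls."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for label in labels if label == cls) / len(labels)


def bias_amplification(
    y_true: Sequence[int], y_pred: Sequence[int], cls: int
) -> float:
    """Predicted rate of cls minus its ground-truth rate.

    A positive value means the model predicts cls more often than it
    actually occurs, i.e. it exaggerates the existing class imbalance.
    """
    return class_rate(y_pred, cls) - class_rate(y_true, cls)


# Hypothetical data: class 1 makes up 60% of the true labels,
# but the model predicts it 80% of the time.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

amplification = bias_amplification(y_true, y_pred, cls=1)
```

Here `amplification` comes out at roughly 0.2: the model over-predicts the majority class by about 20 percentage points relative to the data.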

Influence-directed Explanations for Convolutional Neural Networks [ITC 2018]

We study the problem of explaining a rich class of behavioral properties of deep neural networks. Distinctively, our influence-directed explanations approach this problem by peering inside the network to identify neurons with high influence on a quantity and distribution of interest, using an axiomatically-justified influence measure, and then providing an interpretation for the concepts these neurons represent. We evaluate our approach by demonstrating a number of its unique capabilities on convolutional neural networks trained on ImageNet. Our evaluation demonstrates that influence-directed explanations (1) identify influential concepts that generalize across instances, (2) can be used to extract the "essence" of what the network learned about a class, and (3) isolate individual features the network uses to make decisions and distinguish related classes.

Klas Leino, Shayak Sen, Anupam Datta, Matt Fredrikson, and Linyi Li. "Influence-Directed Explanations for Deep Convolutional Networks." 2018 IEEE International Test Conference (ITC).
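
The abstract above does not spell out the influence measure, so the following is only a sketch under a stated assumption: that the influence of an internal neuron is the gradient of a quantity of interest with respect to that neuron's activation, averaged over a distribution of interest. The tiny network (a tanh layer followed by a linear read-out), its weights, and all function names are hypothetical and chosen so that the gradient can be written out by hand.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def upper_units(h: Vector, W2: Matrix) -> List[float]:
    """Activations of the layer above the hidden layer: z = tanh(h W2)."""
    n_units = len(W2[0])
    return [
        math.tanh(sum(h[j] * W2[j][k] for j in range(len(h))))
        for k in range(n_units)
    ]


def quantity(h: Vector, W2: Matrix, w3: Vector) -> float:
    """Quantity of interest q(h) = sum_k w3[k] * z_k, e.g. one class score."""
    z = upper_units(h, W2)
    return sum(w3[k] * z[k] for k in range(len(w3)))


def grad_wrt_hidden(h: Vector, W2: Matrix, w3: Vector) -> List[float]:
    """dq/dh_j = sum_k w3[k] * (1 - z_k^2) * W2[j][k], by the chain rule."""
    z = upper_units(h, W2)
    return [
        sum(w3[k] * (1.0 - z[k] ** 2) * W2[j][k] for k in range(len(w3)))
        for j in range(len(h))
    ]


def neuron_influence(
    activations: Sequence[Vector], W2: Matrix, w3: Vector
) -> List[float]:
    """Average gradient of q per hidden neuron over the given samples."""
    if not activations:
        raise ValueError("need at least one sample from the distribution")
    grads = [grad_wrt_hidden(h, W2, w3) for h in activations]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(W2))]


# Hypothetical weights: three hidden neurons feeding two tanh units.
# Neuron 2 has no outgoing weight, so it cannot affect the quantity.
W2 = [[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]
w3 = [1.0, -1.0]

# Hidden activations for a few inputs, standing in for the distribution
# of interest (for example, all instances of one class).
activations = [[0.3, 0.2, 0.5], [1.0, 0.4, 0.0], [0.1, 0.9, 0.2]]

influence = neuron_influence(activations, W2, w3)
```

In this toy setup neuron 0 has positive influence on the quantity, neuron 1 has negative influence, and neuron 2 has none. Ranking neurons by such a score is one way to pick the internal units worth interpreting.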


18-739: Security and Fairness of Deep Learning (Spring 2019)

This course will provide an introduction to deep learning methods with emphasis on understanding and improving their security, privacy, and fairness properties. The course will cover basics of machine learning and introduce popular deep learning methods. It will delve into applications of deep learning methods in security, their susceptibility to adversarial manipulation, and techniques for making deep learning robust to adversarial manipulation. It will cover state-of-the-art methods for explaining black-box deep learning models to enhance their transparency. It will also examine methods for deep learning that are designed to respect individual privacy and fairness. Students will do homework assignments and critique weekly readings. Prior knowledge of machine learning, deep learning, and security concepts is useful but not required.


15-781: Graduate Artificial Intelligence (Fall 2016)

15-122: Principles of Imperative Computation (Fall 2013)

15-122: Principles of Imperative Computation (Fall 2012)
