I'm Klas and I'm a PhD student in the Accountable Systems Lab at Carnegie Mellon University, advised by Matt Fredrikson. My research concentrates on demystifying deep learning and understanding its weaknesses and vulnerabilities. I work to improve the security, transparency, and generality of deep neural networks.

GHC 7004 | kleino [at] cs.cmu.edu | klasleino

Research Interests

My research concentrates on demystifying deep learning and understanding its weaknesses and vulnerabilities. I work to improve the security, transparency, and generality of deep neural networks. My work fits primarily under the sub-fields of explainable AI and ML security. Explainable AI aims to bring interpretability and transparency to otherwise opaque deep learning methods, giving us a richer understanding of their inner workings. ML security addresses threats such as attacks that compromise data privacy and attacks that fool even state-of-the-art models. Currently, I am most interested in three major themes: explaining black-box neural network behavior, creating a theory of network generalization, and developing robust and private models.

Explaining Black-box Neural Network Behavior

In recent years, deep neural networks have become increasingly powerful at tasks that previously only humans had mastered. Deep learning is now widely used, and while it has many practitioners, its inner workings are far from well understood. As the application of ML has increased, so has the need for algorithmic transparency: the ability to understand why algorithms deployed in the real world make the decisions they do. Much of my work has addressed the problem of determining which aspects of a network influence particular decisions, and of interpreting the influential components once they are identified. Influence can be used to increase trust in models, to uncover insights discovered by ML models, and as a building block for debugging arbitrary network behavior.

A Theory of Network Generalization

Despite having the capacity to significantly overfit, or even memorize, their training data, deep neural networks generalize reasonably well in practice. Existing hypotheses have not satisfactorily explained why this is the case; in fact, it is not well understood how exactly overfitting manifests in a model. One aspect of my work seeks to understand which phenomena give rise to misclassifications, overfitting, and bias in DNNs. Understanding the causes of these problems should also shed light on what leads models to generalize, and may suggest ways of improving generalization. Furthermore, because overfitting threatens the security of a model, understanding it more fundamentally may help protect the privacy of the training data and improve the model's robustness to adversarial manipulation. I develop explanations for these problems that have direct applications to membership inference, misclassification prediction, and bias amplification.

Robust and Private Models

Deep neural networks have seen great success in many domains, mastering complex tasks such as image recognition, text translation, and medical diagnosis. Despite their remarkable abilities, neural networks have several peculiar weaknesses. In particular, deep networks often lack robustness to malicious perturbations of their inputs, and they tend to leak private information about their training data. To address the first concern, I work on building models that are provably robust to malicious input perturbations. Some of my research also sheds light on privacy weaknesses in deep networks, paving the way for training routines that ensure privacy without sacrificing the utility of the resulting model.


Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair

Klas Leino, Aymeric Fromherz, Ravi Mangal, Matt Fredrikson, Bryan Parno, Corina Păsăreanu

Neural networks are increasingly being deployed in contexts where safety is a critical concern. In this work, we propose a way to construct neural network classifiers that dynamically repair violations of non-relational safety constraints called safe ordering properties. Safe ordering properties relate requirements on the ordering of a network's output indices to conditions on their input, and are sufficient to express most useful notions of non-relational safety for classifiers. Our approach is based on a novel self-repairing layer, which provably yields safe outputs regardless of the characteristics of its input. We compose this layer with an existing network to construct a self-repairing network (SR-Net), and show that in addition to providing safe outputs, the SR-Net is guaranteed to preserve the accuracy of the original network. Notably, our approach is independent of the size and architecture of the network being repaired, depending only on the specified property and the dimension of the network's output; thus it is scalable to large state-of-the-art networks. We show that our approach can be implemented using vectorized computations that execute efficiently on a GPU, introducing run-time overhead of less than one millisecond on current hardware—even on large, widely-used networks containing hundreds of thousands of neurons and millions of parameters.

  title = {Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair},
  author = {Klas Leino and Aymeric Fromherz and Ravi Mangal and Matt Fredrikson and Bryan Parno and Corina Păsăreanu},
  eprint = {2107.11445},
  archivePrefix = {arXiv},
  year = {2021}
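To make the idea of "repairing" an ordering violation concrete, here is a minimal sketch under stated assumptions. It is not the self-repairing layer from the paper (which is vectorized and handles general safe ordering properties); it is a hypothetical, simplified repair for a single property whose input precondition has already been evaluated to a boolean. The hypothetical `below` map lists, for each class, the classes that must score strictly lower; constraints are assumed to be acyclic, and `graphlib` requires Python 3.9+.

```python
from graphlib import TopologicalSorter


def self_repair(logits, below, precondition_holds, delta=1e-3):
    """Hypothetical repair of one safe-ordering-style property.

    below maps a class index to the set of class indices that must score
    strictly lower whenever the input precondition holds; constraints are
    assumed acyclic (otherwise TopologicalSorter raises CycleError).
    """
    repaired = [float(v) for v in logits]
    if not precondition_holds:
        return repaired  # the property does not apply to this input
    # static_order() visits each class only after every class that must rank
    # below it, so the logits we compare against are already final.
    for a in TopologicalSorter(below).static_order():
        lower = below.get(a, ())
        if not lower:
            continue
        floor = max(repaired[b] for b in lower)
        if repaired[a] <= floor:
            # Raise only a violating logit, and only just past the required floor.
            repaired[a] = floor + delta
    return repaired
```

For example, `self_repair([2.0, 1.0, 0.5], {2: {0}}, True)` raises class 2 just above class 0 and leaves the other logits alone. Because the sketch only changes logits that violate a constraint, an output that already satisfies the property passes through unchanged, loosely mirroring the accuracy-preservation guarantee described above.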

Relaxing Local Robustness [NeurIPS 2021]

Klas Leino, Matt Fredrikson

Certifiable local robustness, which rigorously precludes small-norm adversarial examples, has received significant attention as a means of addressing security concerns in deep learning. However, for some classification problems, local robustness is not a natural objective, even in the presence of adversaries; for example, if an image contains two classes of subjects, the correct label for the image may be considered arbitrary between the two, and thus enforcing strict separation between them is unnecessary. In this work, we introduce two relaxed safety properties for classifiers that address this observation: (1) relaxed top-k robustness, which serves as the analogue of top-k accuracy; and (2) affinity robustness, which specifies which sets of labels must be separated by a robustness margin, and which can be ε-close in ℓp space. We show how to construct models that can be efficiently certified against each relaxed robustness property, and trained with very little overhead relative to standard gradient descent. Finally, we demonstrate experimentally that these relaxed variants of robustness are well-suited to several significant classification problems, leading to lower rejection rates and higher certified accuracies than can be obtained when certifying "standard" local robustness.

  title = {Relaxing Local Robustness},
  author = {Klas Leino and Matt Fredrikson},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  year = {2021}
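As a rough illustration of certifying a set of labels rather than a single label, the sketch below checks a simpler and stronger condition than the relaxed properties in the paper: that the exact top-k label set cannot change anywhere in the ℓ2 ε-ball. It assumes a one-hidden-layer ReLU network and a crude global Lipschitz bound; the function name and network shape are hypothetical, and this is not the paper's construction.

```python
import numpy as np


def certify_top_k_set(x, W1, b1, W2, b2, k, eps):
    """Return the top-k label set if no l2 perturbation of size <= eps can change it, else None.

    Assumes a one-hidden-layer ReLU network y = W2 @ relu(W1 @ x + b1) + b2.
    """
    logits = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
    order = np.argsort(-logits)
    top, rest = order[:k], order[k:]
    # ReLU is 1-Lipschitz, so ||W2[i] - W2[j]||_2 * ||W1||_2 bounds the
    # l2 Lipschitz constant of the margin y_i - y_j.
    spectral = np.linalg.norm(W1, 2)
    for i in top:
        for j in rest:
            bound = np.linalg.norm(W2[i] - W2[j]) * spectral
            if logits[i] - logits[j] <= eps * bound:
                return None  # a label outside the set might overtake one inside it
    return {int(i) for i in top}
```

If every label in the set beats every label outside it by more than ε times the corresponding Lipschitz bound, no point in the ball can reorder the two groups, so the top-k set is constant over the ball.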

Globally-Robust Neural Networks [ICML 2021]

Klas Leino, Zifan Wang, Matt Fredrikson

The threat of adversarial examples has motivated work on training certifiably robust neural networks, to facilitate efficient verification of local robustness at inference time. We formalize a notion of global robustness, which captures the operational properties of on-line local robustness certification while yielding a natural learning objective for robust training. We show that widely-used architectures can be easily adapted to this objective by incorporating efficient global Lipschitz bounds into the network, yielding certifiably-robust models by construction that achieve state-of-the-art verifiable accuracy. Notably, this approach requires significantly less time and memory than recent certifiable training methods, and leads to negligible costs when certifying points on-line; for example, our evaluation shows that it is possible to train a large Tiny-ImageNet model in a matter of hours. We posit that this is possible using inexpensive global bounds (despite prior suggestions that tighter local bounds are needed for good performance) because these models are trained to achieve tighter global bounds. Namely, we prove that the maximum achievable verifiable accuracy for a given dataset is not improved by using a local bound.

  title = {Globally-Robust Neural Networks},
  author = {Klas Leino and Zifan Wang and Matt Fredrikson},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2021}
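The abstract describes incorporating global Lipschitz bounds directly into the network. A minimal NumPy sketch of that idea, assuming a plain ReLU MLP and ℓ2 perturbations: the network emits an extra "bottom" logit equal to the largest competing logit inflated by ε times a global bound on how fast that competitor's margin can shrink, and it rejects an input unless the predicted class beats this bottom logit. The function names are hypothetical, and the bound used here (product of spectral norms) is the simplest one, not necessarily the one from the paper.

```python
import numpy as np


def gloro_logits(x, weights, biases, eps):
    """Logits of a ReLU MLP plus an extra 'bottom' logit for certification at radius eps.

    weights and biases list the layers in order; eps is the l2 radius to certify.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    W_last = weights[-1]
    logits = W_last @ h + biases[-1]
    # Product of spectral norms: a global l2 Lipschitz bound for the
    # penultimate layer (ReLU is 1-Lipschitz).
    lip_hidden = np.prod([np.linalg.norm(W, 2) for W in weights[:-1]])
    top = int(np.argmax(logits))
    # K[j] bounds the Lipschitz constant of the margin y_top - y_j.
    K = np.linalg.norm(W_last - W_last[top], axis=1) * lip_hidden
    others = np.arange(len(logits)) != top
    bottom = np.max(logits[others] + eps * K[others])
    return np.append(logits, bottom)


def predict_or_reject(x, weights, biases, eps):
    """Return the predicted class if it is certified at radius eps, else None (rejection)."""
    out = gloro_logits(x, weights, biases, eps)
    top = int(np.argmax(out[:-1]))
    return top if out[top] > out[-1] else None
```

Certification here costs one extra forward-pass computation, and the spectral norms do not depend on the input, so they could be computed once per model; this is in line with the negligible on-line certification cost the abstract reports.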

Fast Geometric Projections for Local Robustness Certification [ICLR 2021 - Spotlight]

Klas Leino*, Aymeric Fromherz*, Matt Fredrikson, Bryan Parno, Corina Păsăreanu

Local robustness ensures that a model classifies all inputs within an ε-ball consistently, which precludes various forms of adversarial inputs. In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise linear activation functions. The key insight is that such networks partition the input space into a polyhedral complex such that the network is linear inside each polyhedral region; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness. Crucially, we show how these regions can be analyzed using geometric projections instead of expensive constraint solving, thus admitting an efficient, highly-parallel GPU implementation at the price of incompleteness, which can be addressed by falling back on prior approaches. Empirically, we find that incompleteness is not often an issue, and that our method performs one to two orders of magnitude faster than existing robustness-certification techniques based on constraint solving.

  title = {Fast Geometric Projections for Local Robustness Certification},
  author = {Aymeric Fromherz and Klas Leino and Matt Fredrikson and Bryan Parno and Corina Păsăreanu},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2021},
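The key insight above is that a piecewise-linear network is affine inside each activation region, so distances to boundaries can be measured by projecting onto hyperplanes. The sketch below applies this only to the region containing x, for a hypothetical one-hidden-layer ReLU network with nonzero rows in `W1`. It is sound but deliberately incomplete: whenever the ℓ2 ε-ball might leave the region, it returns `None` instead of searching neighboring regions as the full method does.

```python
import numpy as np


def certify_in_region(x, W1, b1, W2, b2, eps):
    """Sound but incomplete l2 robustness check for y = W2 @ relu(W1 @ x + b1) + b2.

    Returns True if certified, or None ("unknown") if the eps-ball may leave
    x's activation region or reach a decision boundary.
    """
    pre = W1 @ x + b1
    # Projection distance from x to each neuron's hyperplane W1[k] @ z + b1[k] = 0.
    act_dist = np.abs(pre) / np.linalg.norm(W1, axis=1)
    if np.any(act_dist <= eps):
        return None  # a full search would go on to examine neighboring regions
    mask = (pre > 0).astype(float)
    # Within the region the network is affine: y = A @ z + c.
    A = W2 @ (mask[:, None] * W1)
    c = W2 @ (mask * b1) + b2
    y = A @ x + c
    top = int(np.argmax(y))
    for j in range(len(y)):
        if j == top:
            continue
        normal = A[top] - A[j]
        norm = np.linalg.norm(normal)
        margin = y[top] - y[j]
        if norm == 0.0:
            if margin <= 0.0:
                return None  # a tie that persists across the whole region
            continue
        # Projection distance from x to the decision hyperplane between top and j.
        if margin / norm <= eps:
            return None
    return True
```

Every step is a handful of matrix-vector products and norms, which is what makes a highly parallel GPU implementation natural; no constraint solver is involved.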

Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference [USENIX 2020]

Klas Leino, Matt Fredrikson

Membership inference (MI) attacks exploit the fact that machine learning algorithms sometimes leak information about their training data through the learned model. In this work, we study membership inference in the white-box setting in order to exploit the internals of a model, which have not been effectively utilized by previous work. Leveraging new insights about how overfitting occurs in deep neural networks, we show how a model's idiosyncratic use of features can provide evidence of membership to white-box attackers, even when the model's black-box behavior appears to generalize well, and demonstrate that this approach outperforms prior black-box methods. Taking the position that an effective attack should be able to provide confident positive inferences, we find that previous attacks often do not provide a meaningful basis for confidently inferring membership, whereas our attack can be effectively calibrated for high precision. Finally, we examine popular defenses against MI attacks, finding that (1) smaller generalization error is not sufficient to prevent attacks on real models, and (2) while differential privacy with small ε reduces the attack's effectiveness, this often comes at a significant cost to the model's accuracy; and for larger values of ε that are sometimes used in practice (e.g., ε = 16), the attack can achieve nearly the same accuracy as on the unprotected model.

  title = {Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference},
  author = {Klas Leino and Matt Fredrikson},
  booktitle = {USENIX Security Symposium},
  year = {2020},
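The abstract emphasizes calibrating an attack for high precision. The sketch below illustrates only that calibration step, generically: given attack scores for a held-out calibration set with known membership (higher score meaning "more likely a member"), it picks the lowest threshold whose precision meets a target. The scores and the function name are hypothetical; this is not the paper's white-box attack, which derives its evidence from the model's internal use of features.

```python
import numpy as np


def calibrate_threshold(member_scores, nonmember_scores, target_precision):
    """Lowest score threshold whose precision on calibration data meets the target.

    Returns None if no threshold reaches the target precision.
    """
    scores = np.concatenate([member_scores, nonmember_scores])
    labels = np.concatenate([np.ones(len(member_scores)), np.zeros(len(nonmember_scores))])
    # np.unique returns sorted values; recall can only fall as the threshold
    # rises, so the first threshold that meets the target maximizes recall.
    for t in np.unique(scores):
        predicted = scores >= t  # never empty, since t is one of the scores
        precision = labels[predicted].mean()
        if precision >= target_precision:
            return float(t)
    return None
```

An attacker would then report "member" only for points scoring at or above the returned threshold, trading recall for confident positive inferences.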

Influence Paths for Characterizing Subject-Verb Number Agreement in LSTM Language Models [ACL 2020]

Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, Anupam Datta

LSTM-based recurrent neural networks are the state of the art for many natural language processing (NLP) tasks. Despite their performance, it is unclear whether, or how, LSTMs learn structural features of natural languages such as subject-verb number agreement in English. Lacking this understanding, the generality of LSTMs on this task and their suitability for related tasks remain uncertain. Further, errors cannot be properly attributed to a lack of structural capability, training data omissions, or other exceptional faults. We introduce influence paths, a causal account of structural properties as carried by paths across gates and neurons of a recurrent neural network. The approach refines the notion of influence (the subject's grammatical number has influence on the grammatical number of the subsequent verb) into a set of gate-level or neuron-level paths. The set localizes and segments the concept (e.g., subject-verb agreement), its constituent elements (e.g., the subject), and related or interfering elements (e.g., attractors). We exemplify the methodology on a widely-studied multi-level LSTM language model, demonstrating how it accounts for subject-verb number agreement. The results offer both a finer and a more complete view of an LSTM's handling of this structural aspect of the English language than prior results based on diagnostic classifiers and ablation.

  title = {Influence Paths for Characterizing Subject-Verb Number Agreement in LSTM Language Models},
  author = {Kaiji Lu and Piotr Mardziel and Klas Leino and Matt Fredrikson and Anupam Datta},
  booktitle = {Association for Computational Linguistics (ACL)},
  year = {2020},
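The core idea of refining an influence into paths can be seen in a much smaller setting. The sketch below is a feed-forward toy, not an LSTM: for a hypothetical network y = W2 relu(W1 x + b1), the chain rule splits the derivative of output `target` with respect to input `source` into one product per hidden neuron, and those path contributions sum to the total derivative. The paper's influence paths instead traverse the gates and neurons of a recurrent network.

```python
import numpy as np


def neuron_path_influences(x, W1, b1, W2, target, source):
    """Contribution of each hidden-neuron path from x[source] to y[target].

    Toy network y = W2 @ relu(W1 @ x + b1); returns one chain-rule product
    per hidden neuron, and these products sum to d y[target] / d x[source].
    """
    pre = W1 @ x + b1
    relu_grad = (pre > 0).astype(float)  # derivative of ReLU at this input
    # Path through neuron k: (dh_k / dx_source) * (dy_target / dh_k).
    return W1[:, source] * relu_grad * W2[target]
```

Paths with large magnitude localize where the influence is carried; an inactive neuron contributes a zero path at this input.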

Feature-wise Bias Amplification [ICLR 2019]

Klas Leino, Emily Black, Matt Fredrikson, Shayak Sen, Anupam Datta

We study the phenomenon of bias amplification in classifiers, wherein a machine learning model learns to predict classes with a greater disparity than the underlying ground truth. We demonstrate that bias amplification can arise via an inductive bias in gradient descent methods that results in the overestimation of the importance of moderately-predictive "weak" features if insufficient training data is available. This overestimation gives rise to feature-wise bias amplification — a previously unreported form of bias that can be traced back to the features of a trained model. Through analysis and experiments, we show that while some bias cannot be mitigated without sacrificing accuracy, feature-wise bias amplification can be mitigated through targeted feature selection. We present two new feature selection algorithms for mitigating bias amplification in linear models, and show how they can be adapted to convolutional neural networks efficiently. Our experiments on synthetic and real data demonstrate that these algorithms consistently lead to reduced bias without harming accuracy, in some cases eliminating predictive bias altogether while providing modest gains in accuracy.

  title = {Feature-Wise Bias Amplification},
  author = {Klas Leino and Emily Black and Matt Fredrikson and Shayak Sen and Anupam Datta},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2019},
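To make "predicting classes with a greater disparity than the ground truth" concrete, here is a hypothetical measure for binary labels: how much more often a classifier predicts the majority class than that class actually occurs. The function name and exact definition are assumptions for illustration; the paper's own definition of bias amplification may differ in detail.

```python
import numpy as np


def bias_amplification(y_true, y_pred):
    """Hypothetical measure for binary 0/1 labels.

    Positive values mean the model predicts the majority class more often
    than it occurs in the ground truth.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    majority = int(np.mean(y_true) >= 0.5)
    true_rate = np.mean(y_true == majority)
    pred_rate = np.mean(y_pred == majority)
    return float(pred_rate - true_rate)
```

For instance, if 75% of the ground-truth labels are 1 but the model predicts 1 for every input, the measure is 0.25.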

Influence-Directed Explanations for Deep Convolutional Networks [ITC 2018]

Klas Leino, Shayak Sen, Anupam Datta, Matt Fredrikson, Linyi Li

We study the problem of explaining a rich class of behavioral properties of deep neural networks. Distinctively, our influence-directed explanations approach this problem by peering inside the network to identify neurons with high influence on a quantity and distribution of interest, using an axiomatically-justified influence measure, and then providing an interpretation for the concepts these neurons represent. We evaluate our approach by demonstrating a number of its unique capabilities on convolutional neural networks trained on ImageNet. Our evaluation demonstrates that influence-directed explanations (1) identify influential concepts that generalize across instances, (2) can be used to extract the "essence" of what the network learned about a class, and (3) isolate individual features the network uses to make decisions and distinguish related classes.

  title = {Influence-Directed Explanations for Deep Convolutional Networks},
  author = {Klas Leino and Shayak Sen and Anupam Datta and Matt Fredrikson and Linyi Li},
  booktitle = {IEEE International Test Conference (ITC)},
  year = {2018},
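Influence-directed explanations measure how much internal neurons influence a quantity of interest over a distribution of interest. A minimal sketch, assuming the quantity of interest is a single logit, the internal slice is the first hidden layer of a hypothetical two-hidden-layer ReLU network, and the distribution of interest is the set of rows in `X`: the influence of each neuron is its gradient averaged over that distribution. The measure in the paper is defined more generally, so treat this as an illustration rather than the method itself.

```python
import numpy as np


def internal_influence(X, W1, b1, W2, b2, W3, target):
    """Average gradient of logit y[target] w.r.t. the first hidden layer, over the rows of X.

    Network: h1 = relu(W1 @ x + b1), h2 = relu(W2 @ h1 + b2), y = W3 @ h2 (+ bias,
    which does not affect the gradient).
    """
    grads = []
    for x in X:
        h1 = np.maximum(W1 @ x + b1, 0.0)
        mask2 = (W2 @ h1 + b2 > 0).astype(float)  # ReLU derivative in layer 2
        # Chain rule: dy[target]/dh1 = W2^T (mask2 * W3[target]).
        grads.append(W2.T @ (mask2 * W3[target]))
    return np.mean(grads, axis=0)
```

Sorting neurons by this score (for example with `np.argsort(-influence)`) yields the candidates whose learned concepts would then be interpreted.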



NeurIPS-21 Demo: Exploring Conceptual Soundness with TruLens

Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Ricardo Shih, Zifan Wang

As machine learning has become increasingly ubiquitous, there has been a growing need to assess the trustworthiness of learned models. One important aspect of model trust is conceptual soundness, i.e., the extent to which a model uses features that are appropriate for its intended task. We present TruLens, a new cross-platform framework for explaining deep network behavior. In our demonstration, we provide an interactive application built on TruLens that we use to explore the conceptual soundness of various pre-trained models. Throughout the presentation, we take the unique perspective that robustness to small-norm adversarial examples is a necessary condition for conceptual soundness; we demonstrate this by comparing explanations on models trained with and without a robust objective. Our demonstration will focus on our end-to-end application, which will be made accessible for the audience to interact with; but we will also provide details on its open-source components, including the TruLens library and the code used to train robust networks.

KDD-21 Tutorial: Machine Learning Explainability and Robustness: Connected at the Hip

Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Zifan Wang

This tutorial examines the synergistic relationship between explainability methods for machine learning and a significant problem related to model quality: robustness against adversarial perturbations. We begin with a broad overview of approaches to explainable AI, before narrowing our focus to post-hoc explanation methods for predictive models. We discuss perspectives on what constitutes a good explanation in various settings, with an emphasis on axiomatic justifications for various explanation methods. In doing so, we will highlight the importance of an explanation method's faithfulness to the target model, as this property allows one to distinguish between explanations that are unintelligible because of the method used to produce them, and cases where a seemingly poor explanation points to model quality issues. Next, we introduce concepts surrounding adversarial robustness, including state-of-the-art adversarial attacks as well as a range of corresponding defenses. Finally, building on the knowledge presented thus far, we present key insights from the recent literature on the connections between explainability and adversarial robustness. We show that many commonly-perceived issues in explanations are actually caused by a lack of robustness. At the same time, we show that a careful study of adversarial examples and robustness can lead to models whose explanations better appeal to human intuition and domain knowledge.

AAAI-21 Tutorial: From Explainability to Model Quality and Back Again

Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Zifan Wang

The goal of this tutorial is to provide a systematic view of the current knowledge relating explainability to several key outstanding concerns regarding the quality of ML models; in particular, robustness, privacy, and fairness. We will discuss the ways in which explainability can inform questions about these aspects of model quality, and how methods for improving them that are emerging from recent research in the AI, Security & Privacy, and Fairness communities can in turn lead to better outcomes for explainability. We aim to make these findings accessible to a general AI audience, including not only researchers who want to further engage with this direction, but also practitioners who stand to benefit from the results, and policy-makers who want to deepen their technical understanding of these important issues.

CMU Courses

18-739: Security and Fairness of Deep Learning (Spring 2019)

15-781: Graduate Artificial Intelligence (Fall 2016)

15-122: Principles of Imperative Computation (Fall 2013)

15-122: Principles of Imperative Computation (Fall 2012)
