Fast Geometric Projections for Local Robustness Certification
Klas Leino*, Aymeric Fromherz*, Matt Fredrikson, Bryan Parno, Corina Păsăreanu
Local robustness ensures that a model classifies all inputs within an epsilon-ball consistently, which precludes various forms of adversarial inputs.
In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise linear activation functions.
The key insight is that such networks partition the input space into a polyhedral complex such that the network is linear inside each polyhedral region;
hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness.
Crucially, we show how these regions can be analyzed using geometric projections instead of expensive constraint solving, thus admitting an efficient, highly parallel GPU implementation. This comes at the price of incompleteness, which can be addressed by falling back on prior approaches.
Empirically, we find that incompleteness is not often an issue, and that our method performs one to two orders of magnitude faster than existing robustness-certification techniques based on constraint solving.
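The abstract does not spell out the procedure, so the following is only a minimal sketch of the basic geometric fact it relies on: the Euclidean distance from an input to a hyperplane has a closed form, so no constraint solver is needed to test whether that hyperplane comes within ε of the input. The function names, the use of the L2 norm, and the toy values are assumptions for illustration and are not taken from the paper. The sketch ignores whether the projected point actually lies inside the polyhedral region in question.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def distance_to_hyperplane(x, w, b):
    """Euclidean distance from x to the hyperplane {z : w.z + b = 0}."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

def project_onto_hyperplane(x, w, b):
    """Orthogonal projection of x onto the hyperplane {z : w.z + b = 0}."""
    scale = (dot(w, x) + b) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]

def boundary_within_radius(x, w, b, eps):
    """True if the hyperplane comes within distance eps of x (hypothetical helper)."""
    return distance_to_hyperplane(x, w, b) <= eps

# Toy example (assumed values): the hyperplane is the vertical axis x1 = 0.
x = [1.0, 1.0]
w = [1.0, 0.0]
b = 0.0
print(distance_to_hyperplane(x, w, b))       # distance from x to the hyperplane
print(project_onto_hyperplane(x, w, b))      # closest point on the hyperplane
print(boundary_within_radius(x, w, b, 0.5))  # radius smaller than the distance
print(boundary_within_radius(x, w, b, 1.5))  # radius larger than the distance
```

A check like this is cheap and independent per hyperplane, which is consistent with the abstract's point that the analysis parallelizes well.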
Leveraging Model Memorization for Calibrated White-Box Membership Inference [USENIX 2020]
Klas Leino, Matt Fredrikson
Membership inference (MI) attacks exploit the fact that machine learning algorithms sometimes leak information about their training data through the learned model.
In this work, we study membership inference in the white-box setting in order to exploit the internals of a model, which have not been effectively utilized by previous work.
Leveraging new insights about how overfitting occurs in deep neural networks, we show how a model's idiosyncratic use of features can provide evidence of membership to white-box attackers – even when the model's black-box behavior appears to generalize well – and demonstrate that this approach outperforms prior black-box methods.
Taking the position that an effective attack should have the ability to provide confident positive inferences, we find that previous attacks do not often provide a meaningful basis for confidently inferring membership, whereas our attack can be effectively calibrated for high precision.
Finally, we examine popular defenses against MI attacks, finding that
(1) smaller generalization error is not sufficient to prevent attacks on real models, and
(2) while small-ε-differential privacy reduces the attack's effectiveness, this often comes at a significant cost to the model's accuracy; and for larger ε that are sometimes used in practice (e.g., ε = 16), the attack can achieve nearly the same accuracy as on the unprotected model.
Influence Paths for Characterizing Subject-Verb Number Agreement in LSTM Language Models [ACL 2020]
Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, Anupam Datta
LSTM-based recurrent neural networks are the state of the art for many natural language processing (NLP) tasks. Despite their performance, it is unclear whether, or how, LSTMs learn structural features of natural languages such as subject-verb number agreement in English. Lacking this understanding, the generality of LSTMs on this task and their suitability for related tasks remain uncertain. Further, errors cannot be properly attributed to a lack of structural capability, training data omissions, or other exceptional faults. We introduce influence paths, a causal account of structural properties as carried by paths across gates and neurons of a recurrent neural network. The approach refines the notion of influence (the subject's grammatical number has influence on the grammatical number of the subsequent verb) into a set of gate-level or neuron-level paths. The set localizes and segments the concept (e.g., subject-verb agreement), its constituent elements (e.g., the subject), and related or interfering elements (e.g., attractors). We exemplify the methodology on a widely studied multi-level LSTM language model, demonstrating how it accounts for subject-verb number agreement. The results offer both a finer and a more complete view of an LSTM's handling of this structural aspect of the English language than prior results based on diagnostic classifiers and ablation.
Feature-wise Bias Amplification [ICLR 2019]
Klas Leino, Emily Black, Matt Fredrikson, Shayak Sen, Anupam Datta
We study the phenomenon of bias amplification in classifiers, wherein a machine learning model learns to predict classes with a greater disparity than the underlying ground truth. We demonstrate that bias amplification can arise via an inductive bias in gradient descent methods that results in the overestimation of the importance of moderately predictive "weak" features if insufficient training data is available. This overestimation gives rise to feature-wise bias amplification, a previously unreported form of bias that can be traced back to the features of a trained model. Through analysis and experiments, we show that while some bias cannot be mitigated without sacrificing accuracy, feature-wise bias amplification can be mitigated through targeted feature selection. We present two new feature selection algorithms for mitigating bias amplification in linear models, and show how they can be adapted to convolutional neural networks efficiently. Our experiments on synthetic and real data demonstrate that these algorithms consistently lead to reduced bias without harming accuracy, in some cases eliminating predictive bias altogether while providing modest gains in accuracy.
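To make the notion of "greater disparity than the underlying ground truth" concrete, here is a minimal sketch of one simple way it could be quantified: compare how often a class is predicted with how often it occurs in the true labels. This particular measure, the function names, and the toy labels are assumptions for illustration. The abstract does not give the paper's formal definition, and this is not one of its feature selection algorithms.

```python
def class_rate(labels, cls):
    """Fraction of labels equal to cls."""
    return sum(1 for y in labels if y == cls) / len(labels)

def bias_amplification(true_labels, predicted_labels, cls):
    """Predicted rate of cls minus its true rate (hypothetical measure).

    A positive value means the model predicts cls more often than it
    actually occurs, i.e. the class imbalance is amplified toward cls.
    """
    return class_rate(predicted_labels, cls) - class_rate(true_labels, cls)

# Toy data (assumed): class 1 occurs in half the ground truth,
# but the model predicts it for three of the four points.
true_labels = [1, 1, 0, 0]
predicted_labels = [1, 1, 1, 0]
print(bias_amplification(true_labels, predicted_labels, cls=1))
```

Under this measure, a value of zero for every class would mean the predicted class distribution matches the ground-truth distribution.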
Influence-directed Explanations for Convolutional Neural Networks [ITC 2018]
Klas Leino, Shayak Sen, Anupam Datta, Matt Fredrikson
We study the problem of explaining a rich class of behavioral properties of deep neural networks. Distinctively, our influence-directed explanations approach this problem by peering inside the network to identify neurons with high influence on a quantity and distribution of interest, using an axiomatically justified influence measure, and then providing an interpretation for the concepts these neurons represent. We evaluate our approach by demonstrating a number of its unique capabilities on convolutional neural networks trained on ImageNet. Our evaluation demonstrates that influence-directed explanations (1) identify influential concepts that generalize across instances, (2) can be used to extract the "essence" of what the network learned about a class, and (3) isolate individual features the network uses to make decisions and distinguish related classes.