Preprints
Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks
S. Hu, Y. Fu, Z. S. Wu, V. Smith
Machine unlearning is a promising approach to mitigate undesirable memorization of training data in LLMs. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
LLM Unlearning Benchmarks are Weak Measures of Progress
P. Thaker, S. Hu, N. Kale, Y. Maurya, Z. S. Wu, V. Smith
Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view of the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.
Revisiting Cascaded Ensembles for Efficient Inference
S. Kolawole, D. Dennis, A. Talwalkar, V. Smith
A common approach to making machine learning inference more efficient is to use example-specific adaptive schemes, which route or select models for each example at inference time. In this work we study a simple scheme for adaptive inference. We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models, where ensemble agreement serves as a data-dependent routing criterion. This scheme is easy to incorporate into existing inference pipelines, requires no additional training, and can be used to place models across multiple resource tiers--for instance, serving efficient models at the edge and invoking larger models in the cloud only when necessary. In cases where parallel inference is feasible, we show that CoE can improve accuracy relative to the single best model while reducing the average cost of inference by up to 7x, and provides Pareto-dominant solutions in accuracy and efficiency relative to existing adaptive inference baselines. These savings translate to an over 3x reduction in total monetary cost when performing inference using a heterogeneous cluster of GPUs. Finally, for edge inference scenarios where portions of the cascade reside at the edge vs. in the cloud, CoE can provide a 14x reduction in communication cost and inference latency without sacrificing accuracy.
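To make the routing rule concrete, below is a minimal Python sketch of agreement-based cascading. The tier structure, the `model(x)` call signature, and the unanimous-agreement criterion are illustrative simplifications of the CoE scheme described above, not the paper's exact implementation.

```python
def cascade_of_ensembles(x, tiers):
    """Route example x through ensemble tiers of increasing cost.

    `tiers` is a list of model ensembles, ordered from cheapest to most
    expressive; each model maps x to a predicted class label. We return
    the prediction of the first tier whose members unanimously agree,
    and otherwise fall through to the final (largest) tier.
    """
    for ensemble in tiers[:-1]:
        preds = [model(x) for model in ensemble]
        if len(set(preds)) == 1:  # unanimous agreement: stop early
            return preds[0]
    # No cheap tier was confident; defer to the most expressive tier.
    final_preds = [model(x) for model in tiers[-1]]
    return max(set(final_preds), key=final_preds.count)  # majority vote
```

In a deployment, the cheap tiers would run at the edge and only disagreeing examples would be forwarded to larger models in the cloud.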
Guardrail Baselines for Unlearning in LLMs
P. Thaker, Y. Maurya, S. Hu, Z. S. Wu, V. Smith
Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
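As a rough illustration of how lightweight such baselines are, the sketch below wraps a generic text-generation call with a refusal prompt and a keyword filter. `model.generate`, the topic list, and the filtering rule are hypothetical stand-ins, not the paper's exact setup.

```python
FORGET_TOPICS = ["harry potter"]  # illustrative unlearning target

def guardrail_generate(model, prompt):
    """A minimal guardrail baseline: refusal prompt plus output filtering.

    `model.generate` stands in for any LLM text-generation call; both the
    system instruction and the keyword filter are deliberately simplistic.
    """
    system = ("You have no knowledge of the following topics and must "
              "refuse to discuss them: " + ", ".join(FORGET_TOPICS))
    output = model.generate(system + "\n\n" + prompt)
    # Post-hoc filter: suppress completions that still mention the target.
    if any(topic in output.lower() for topic in FORGET_TOPICS):
        return "I'm sorry, I can't help with that."
    return output
```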
Federated LoRA with Sparse Communication
K. Kuo, A. Raje, K. Rajesh, V. Smith
Low-rank adaptation (LoRA) is a natural method for finetuning in communication-constrained machine learning settings such as cross-device federated learning. Prior work that has studied LoRA in the context of federated learning has focused on improving LoRA's robustness to heterogeneity and privacy. In this work, we instead consider techniques for further improving communication efficiency in federated LoRA. Unfortunately, we show that centralized ML methods that improve the efficiency of LoRA through unstructured pruning do not transfer well to federated settings. We instead study a simple approach, FLASC, that applies sparsity to LoRA during communication while allowing clients to locally fine-tune the entire LoRA module. Across four common federated learning tasks, we demonstrate that this method matches the performance of dense LoRA with up to 10x less communication. Additionally, despite being designed primarily to target communication, we find that this approach has benefits in terms of heterogeneity and privacy relative to existing approaches tailored to these specific concerns. Overall, our work highlights the importance of considering system-specific constraints when developing communication-efficient finetuning approaches, and serves as a simple and competitive baseline for future work in federated finetuning.
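A minimal sketch of the communication pattern, assuming PyTorch: clients fine-tune the full dense LoRA module locally but upload only a top-k-sparsified update. The function names and magnitude-based selection here are illustrative; the paper specifies the exact masking and aggregation rules.

```python
import torch

def sparsify_for_communication(lora_delta: torch.Tensor, density: float):
    """Top-k magnitude sparsification of a LoRA update before upload.

    Clients train the full (dense) LoRA module locally but transmit only
    the largest-magnitude entries, cutting upload cost roughly by a
    factor of 1/density.
    """
    k = max(1, int(density * lora_delta.numel()))
    flat = lora_delta.flatten()
    idx = torch.topk(flat.abs(), k).indices
    return idx, flat[idx]  # sparse payload sent to the server

def apply_sparse_update(param: torch.Tensor, idx, values):
    """Server side: scatter the sparse payload back into a dense buffer."""
    flat = torch.zeros(param.numel(), dtype=param.dtype, device=param.device)
    flat[idx] = values
    param.add_(flat.view_as(param))
```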
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
L. Dery, S. Kolawole, J-F. Kagy, V. Smith, G. Neubig, A. Talwalkar
Given the generational gap in available hardware between lay practitioners and the most endowed institutions, LLMs are becoming increasingly inaccessible as they grow in size. Whilst many approaches have been proposed to compress LLMs to make their resource consumption manageable, these methods themselves tend to be resource intensive, putting them out of the reach of the very user groups they target. In this work, we explore the problem of structured pruning of LLMs using only forward passes. We seek to empower practitioners to prune models so large that their available hardware has just enough memory to run inference. We develop Bonsai, a gradient-free, perturbative pruning method capable of delivering small, fast, and accurate pruned models. We observe that Bonsai outputs pruned models that (i) outperform those generated by more expensive gradient-based structured pruning methods, and (ii) are twice as fast (with comparable accuracy) as those generated by semi-structured pruning methods that require resources comparable to Bonsai's. We also leverage Bonsai to produce a new sub-2B model using a single A6000 that yields state-of-the-art performance on 4/6 tasks on the Huggingface Open LLM leaderboard.
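The sketch below illustrates the gradient-free principle behind perturbative pruning: sample random sub-models, score them with forward passes alone, and credit each module by the quality of the sub-models that kept it. The sampling scheme and averaging-based credit assignment are simplifications of Bonsai's actual procedure (which fits module importances via regression).

```python
import random

def perturbative_module_scores(model, modules, eval_fn,
                               n_samples=100, keep_frac=0.5):
    """Score prunable modules using forward passes only (no gradients).

    `modules` is a list of hashable module identifiers (e.g., attention
    head indices); `eval_fn(model, kept)` runs the sub-model that keeps
    only `kept` on a small calibration set and returns a quality score.
    """
    scores = {m: [] for m in modules}
    for _ in range(n_samples):
        kept = random.sample(modules, int(keep_frac * len(modules)))
        quality = eval_fn(model, kept)  # forward passes only
        for m in kept:
            scores[m].append(quality)
    # Modules whose sub-models scored well on average are kept; the rest
    # are candidates for pruning.
    return {m: sum(v) / len(v) if v else 0.0 for m, v in scores.items()}
```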
2024
No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices
Q. Pang, S. Hu, W. Zheng, V. Smith
Neural Information Processing Systems (NeurIPS), 2024
Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.
On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
P. Thaker, A. Setlur, Z. S. Wu, V. Smith
Neural Information Processing Systems (NeurIPS), 2024
Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
A. Setlur, S. Garg, X. Geng, N. Garg, V. Smith, A. Kumar
Neural Information Processing Systems (NeurIPS), 2024
Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by 8×. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.
GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
A. Muhamed, O. Li, D. Woodruff, M. Diab, V. Smith
Empirical Methods in Natural Language Processing (EMNLP), 2024
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves comparable performance to full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU---a feat infeasible for previous methods---and yields substantial throughput improvements on an 8-GPU system.
Prompting is a Double-Edged Sword: Improving Worst-Group Robustness of Foundation Models
A. Setlur*, S. Garg*, V. Smith, S. Levine
International Conference on Machine Learning (ICML), 2024
Machine learning models fail catastrophically under distribution shift, but a surprisingly effective way to empirically improve robustness to some types of shift (e.g., ImageNet-A/C) is to use stronger open-vocabulary classifiers derived from foundation models. In this work, we first note that for shifts governed by spurious correlations (features spuriously correlated with the label on the training data, but not on test data), the zero-shot and few-shot performance of foundation models is no better than that of ERM models, and remains unchanged when pretraining data or model size is scaled. Second, even in these situations, foundation models are quite accurate at predicting the value of the spurious feature. In a simplified setup, we theoretically analyze both of these findings. Specifically, we show that during contrastive pretraining, the simplicity bias of foundation models tends to result in the learning of features that mostly rely on the spurious attribute, compared to more robust features. We leverage these observations to propose Prompting for Robustness (PfR), which first uses foundation models to zero-shot predict the spurious attribute on labeled examples, and then learns a classifier with balanced performance across different groups of labels and the spurious attribute. Across five vision and language tasks, we show that PfR's performance nearly equals that of an oracle algorithm (group DRO) that leverages human-labeled spurious attributes.
Maximizing Global Model Appeal in Federated Learning
Y. J. Cho, D. Jhunjhunwala, T. Li, V. Smith, G. Joshi
Transactions on Machine Learning Research (TMLR), 2024
Federated learning (FL) aims to collaboratively train a global model using local data from a network of clients. To warrant collaborative training, each federated client may expect the resulting global model to satisfy some individual requirement, such as achieving a certain loss threshold on their local data. However, in real FL scenarios, the global model may not satisfy the requirements of all clients in the network due to the data heterogeneity across clients. In this work, we explore the problem of global model appeal in FL, which we define as the total number of clients that find that the global model satisfies their individual requirements. We discover that global models trained using traditional FL approaches can result in a significant number of clients unsatisfied with the model based on their local requirements. As a consequence, we show that global model appeal can directly impact how clients participate in training and how the model performs on new clients at inference time. Our work proposes MaxFL, which maximizes the number of clients that find the global model appealing. MaxFL achieves a 22-40% and 18-50% improvement in the test accuracy of training clients and (unseen) test clients respectively, compared to a wide range of FL approaches that tackle data heterogeneity, aim to incentivize clients, and learn personalized/fair models.
Fair Federated Learning via Bounded Group Loss
S. Hu, Z. S. Wu, V. Smith
IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2024
Best Paper Award at ICLR 2022 Socially Responsible ML Workshop
Fair prediction across protected groups is an important consideration in federated learning applications. In this work we propose a general framework for provably fair federated learning. In particular, we explore and extend the notion of Bounded Group Loss as a theoretically-grounded approach for group fairness that offers favorable trade-offs between fairness and utility relative to prior work. Using this setup, we propose a scalable federated optimization method that optimizes the empirical risk under a number of group fairness constraints. We provide convergence guarantees for the method as well as fairness guarantees for the resulting solution. Empirically, we evaluate our method across common benchmarks from fair ML and federated learning, showing that it can provide both fairer and more accurate predictions than existing approaches in fair federated learning.
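For concreteness, a Bounded-Group-Loss-style constrained objective can be written as follows, where $h_\theta$ is the model, $\ell$ a loss, $G_a$ the examples in protected group $a$, and $\gamma$ the group loss bound (notation ours):

```latex
\min_{\theta} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h_\theta(x_i), y_i\big)
\quad \text{s.t.} \quad
\frac{1}{|G_a|}\sum_{i \in G_a} \ell\big(h_\theta(x_i), y_i\big) \le \gamma
\quad \text{for every protected group } a.
```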
2023
Progressive Knowledge Distillation: Building Ensembles for Efficient Inference
D. Dennis, A. Shetty, A. Sevekari, K. Koishida, V. Smith
Neural Information Processing Systems (NeurIPS), 2023
We study the problem of progressive ensemble distillation: Given a large, pretrained teacher model g, we seek to decompose the model into smaller, low-inference cost student models f_i, such that progressively evaluating additional models in this ensemble leads to improved predictions. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost at runtime, which is useful for a number of applications in on-device inference. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles with similar performance as g, but with smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees in terms of convergence and generalization.
Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies
O. Li, J. Harrison, J. Sohl-Dickstein, V. Smith, L. Metz
Neural Information Processing Systems (NeurIPS), 2023
Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or black-box characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift
S. Garg*, A. Setlur*, Z. Lipton, S. Balakrishnan, V. Smith, A. Raghunathan
Neural Information Processing Systems (NeurIPS), 2023
Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.
On Tilted Losses in Machine Learning: Theory and Applications
T. Li*, A. Beirami*, M. Sanjabi, V. Smith
Journal of Machine Learning Research (JMLR), 2023
Exponential tilting is a technique commonly used in fields such as statistics, probability, information theory, and optimization to create parametric distribution shifts. Despite its prevalence in related fields, tilting has not seen widespread use in machine learning. In this work, we aim to bridge this gap by exploring the use of tilting in risk minimization. We study a simple extension to ERM---tilted empirical risk minimization (TERM)---which uses exponential tilting to flexibly tune the impact of individual losses. The resulting framework has several useful properties: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to the tail probability of losses. Our work makes connections between TERM and related objectives, such as Value-at-Risk, Conditional Value-at-Risk, and distributionally robust optimization (DRO). We develop batch and stochastic first-order optimization methods for solving TERM, provide convergence guarantees for the solvers, and show that the framework can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications in machine learning, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. Despite the straightforward modification TERM makes to traditional ERM objectives, we find that the framework can consistently outperform ERM and deliver competitive performance with state-of-the-art, problem-specific approaches.
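The TERM objective itself is a one-line modification of ERM: for losses $\ell(x_i;\theta)$ and a tilt hyperparameter $t \neq 0$,

```latex
\widetilde{R}(t;\theta) \;=\; \frac{1}{t}\,\log\!\Big(\frac{1}{n}\sum_{i=1}^{n} e^{\,t\,\ell(x_i;\theta)}\Big),
```

so that $t \to 0$ recovers standard ERM, $t \to +\infty$ approaches the max-loss (promoting fairness across the worst-off examples), and $t < 0$ downweights outlier losses (promoting robustness).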
Private Multi-Task Learning: Formulation and Applications to Federated Learning
S. Hu, Z. S. Wu, V. Smith
Transactions on Machine Learning Research (TMLR), 2023
Many problems in machine learning rely on multi-task learning (MTL), in which the goal is to solve multiple related machine learning tasks simultaneously. MTL is particularly relevant for privacy-sensitive applications in areas such as healthcare, finance, and IoT computing, where sensitive data from multiple, varied sources are shared for the purpose of learning. In this work, we formalize notions of task-level privacy for MTL via joint differential privacy (JDP), a relaxation of differential privacy for mechanism design and distributed optimization. We then propose an algorithm for mean-regularized MTL, an objective commonly used for applications in personalized federated learning, subject to JDP. We analyze our objective and solver, providing certifiable guarantees on both privacy and utility. Empirically, we find that our method allows for improved privacy/utility trade-offs relative to global baselines across common federated learning benchmarks.
On Noisy Evaluation in Federated Hyperparameter Tuning
K. Kuo, P. Thaker, M. Khodak, J. Nguyen, D. Jiang, A. Talwalkar, V. Smith
Conference on Machine Learning and Systems (MLSys), 2023
Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks. Issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to evaluate the performance of various hyperparameters. In this work, we perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We first identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can significantly impact tuning methods, reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost the evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.
Validating Large Language Models with ReLM
M. Kuchnik, V. Smith, G. Amvrosiadis
Conference on Machine Learning and Systems (MLSys), 2023
Outstanding Paper Award
Although large language models (LLMs) have been touted for their ability to generate natural-sounding text, there are growing concerns around possible negative effects of LLMs such as data memorization, bias, and inappropriate language. Unfortunately, the complexity and generation capacities of LLMs make validating (and correcting) such concerns difficult. In this work, we introduce ReLM, a system for validating and querying LLMs using standard regular expressions. ReLM formalizes and enables a broad range of language model evaluations, reducing complex evaluation rules to simple regular expression queries. Our results exploring queries surrounding memorization, gender bias, toxicity, and language understanding show that ReLM achieves up to 15x higher system efficiency, 2.5x data efficiency, and increased statistical and prompt-tuning coverage compared to state-of-the-art ad-hoc queries. ReLM offers a competitive and general baseline for the increasingly important problem of LLM validation.
Differentially Private Adaptive Optimization with Delayed Preconditioners
T. Li, M. Zaheer, K. Liu, S. Reddi, B. McMahan, V. Smith
International Conference on Learning Representations (ICLR), 2023
Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4x relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.
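A rough sketch of the delayed-preconditioner idea, assuming PyTorch. The clipping rule, the aggregation of past noisy gradients, and the refresh schedule below are illustrative rather than the paper's exact algorithm:

```python
import torch

def dp2_step(params, grad, state, step, delay=10, lr=0.1,
             clip=1.0, noise_mult=1.0, eps=1e-8):
    """One step of DP adaptive training with a delayed preconditioner.

    Each gradient is privatized (clipped + Gaussian noise) as in DP-SGD,
    but the second-moment preconditioner is rebuilt only every `delay`
    steps from an average of past noisy gradients, so the noise in the
    preconditioner is averaged down at the cost of staleness.
    """
    g = grad * min(1.0, clip / (float(grad.norm()) + eps))  # clip
    g = g + noise_mult * clip * torch.randn_like(g)         # privatize
    if "accum" not in state:
        state["accum"] = torch.zeros_like(g)
    state["accum"] += g ** 2
    if step % delay == delay - 1:  # refresh the stale, less noisy preconditioner
        state["precond"] = (state["accum"] / delay).sqrt() + eps
        state["accum"] = torch.zeros_like(g)
    precond = state.get("precond", torch.ones_like(g))
    params.data -= lr * g / precond
```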
Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts
A. Setlur, D. Dennis, B. Eysenbach, A. Raghunathan, C. Finn, V. Smith, S. Levine
International Conference on Learning Representations (ICLR), 2023
Although training machine learning models for robustness is critical for real-world adoption, determining how to best ensure robustness remains an open problem. Some methods (e.g., DRO) are overly conservative, while others (e.g., Group DRO) require domain knowledge that may be hard to obtain. In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function is simple. For example, we may expect that group shifts occur along high-level features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these features, but need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this idea, we formulate a two-player game where conditioned on the label the adversary can only separate datapoints into potential groups using simple features, which corresponds to a bitrate constraint on the adversary's capacity. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples yet matches the performance of Group DRO. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less pessimistic solutions than unconstrained DRO.
2022
On Privacy and Personalization in Cross-Silo Federated Learning
Z. Liu, S. Hu, Z. S. Wu, V. Smith
Neural Information Processing Systems (NeurIPS), 2022
While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level privacy are less suitable as real-world privacy regulations typically concern in-silo data subjects rather than the silos themselves. In this work, we instead consider the more realistic notion of silo-specific item-level privacy, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy, silos are further incentivized to "federate" with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide a thorough empirical study of competing methods as well as a theoretical characterization of MR-MTL for a mean estimation problem, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
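The MR-MTL objective referenced above couples per-silo models $w_k$ to their mean via a regularizer with strength $\lambda$:

```latex
\min_{w_1,\dots,w_K} \; \sum_{k=1}^{K} \Big( F_k(w_k) + \frac{\lambda}{2}\,\big\|w_k - \bar{w}\big\|_2^2 \Big),
\qquad \bar{w} = \frac{1}{K}\sum_{k=1}^{K} w_k,
```

where $\lambda = 0$ recovers purely local training and $\lambda \to \infty$ recovers a single global model; under stronger privacy noise, larger $\lambda$ (more federation) becomes preferable.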
Adversarial Unlearning: Reducing Confidence Along Adversarial Directions
A. Setlur, B. Eysenbach, V. Smith, S. Levine
Neural Information Processing Systems (NeurIPS), 2022
Supervised learning methods trained with maximum likelihood objectives often overfit on training data. Most regularizers that prevent overfitting look to increase confidence on additional examples (e.g., data augmentation, adversarial training), or reduce it on training data (e.g., label smoothing). In this work we propose a complementary regularization strategy that reduces confidence on self-generated examples. The method, which we call RCAD (Reducing Confidence along Adversarial Directions), aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss. In contrast to adversarial training, RCAD does not try to robustify the model to output the original label, but rather regularizes it to have reduced confidence on points generated using much larger perturbations than in conventional adversarial training. RCAD can be easily integrated into training pipelines with a few lines of code. Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques (e.g., label smoothing, MixUp training) to increase test accuracy by 1-3% in absolute value, with more significant gains in the low data regime. We also provide a theoretical analysis that helps to explain these benefits in simplified settings, showing that RCAD can provably help the model unlearn spurious features in the training data.
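A minimal PyTorch-style sketch of the auxiliary term: take a large step along the direction that increases training loss, then penalize confident predictions at the resulting point. The step size and the entropy-based confidence penalty are illustrative choices consistent with the description above:

```python
import torch
import torch.nn.functional as F

def rcad_loss(model, x, y, step_size=1.0):
    """RCAD-style term: reduce confidence along adversarial directions.

    We move a *large* step along the loss-increasing gradient (much larger
    than in adversarial training), then reward high predictive entropy on
    the resulting out-of-distribution point.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    # Batch-level normalization for simplicity; per-example is also common.
    x_adv = (x + step_size * grad / (grad.norm() + 1e-8)).detach()
    probs = F.softmax(model(x_adv), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return -entropy  # minimizing this maximizes entropy (reduces confidence)
```

In training, this term would be added with a small weight to the usual cross-entropy loss.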
Motley: Benchmarking Heterogeneity and Personalization in Federated Learning
S. Wu, T. Li, Z. Charles, Y. Xiao, Z. Liu, Z. Xu, V. Smith
Workshop on Federated Learning at NeurIPS, 2022
Personalized federated learning considers learning models unique to each client in a heterogeneous network. The resulting client-specific models have been purported to improve metrics such as accuracy, fairness, and robustness in federated networks. However, despite a plethora of work in this area, it remains unclear: (1) which personalization techniques are most effective in various settings, and (2) how important personalization truly is for realistic federated applications. To better answer these questions, we propose Motley, a benchmark for personalized federated learning. Motley consists of a suite of cross-device and cross-silo federated datasets from varied problem domains, as well as thorough evaluation metrics for better understanding the possible impacts of personalization. We establish baselines on the benchmark by comparing a number of representative personalized federated learning methods. These initial results highlight strengths and weaknesses of existing approaches, and raise several open questions for the community. Motley aims to provide a reproducible means with which to advance developments in personalized and heterogeneity-aware federated learning, as well as the related areas of transfer learning, meta-learning, and multi-task learning.
Private Adaptive Optimization with Side Information
T. Li, M. Zaheer, S. Reddi, V. Smith
International Conference on Machine Learning (ICML), 2022
Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To address this, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gradients, allowing the effective use of adaptive methods in private settings. We formally show AdaDPS reduces the amount of noise needed to achieve similar privacy guarantees, thereby improving optimization performance. Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings. Our results show that AdaDPS improves accuracy by 7.7% (absolute) on average---yielding state-of-the-art privacy-utility trade-offs on large-scale text and image benchmarks.
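A sketch of the key ordering, assuming PyTorch: precondition with side information *before* privatization, so the added noise is not amplified by a noisy adaptive preconditioner. `public_second_moment` stands in for statistics estimated from non-sensitive data; the names and exact clipping/noise calibration here are illustrative:

```python
import torch

def adadps_style_step(param, private_grad, public_second_moment, lr=0.1,
                      clip=1.0, noise_mult=1.0, eps=1e-8):
    """Precondition with non-sensitive side information, then privatize."""
    g = private_grad / (public_second_moment.sqrt() + eps)  # precondition
    g = g * min(1.0, clip / (float(g.norm()) + eps))        # clip
    g = g + noise_mult * clip * torch.randn_like(g)         # add DP noise
    param.data -= lr * g
```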
Label Leakage and Protection in Two-party Split Learning
O. Li, J. Sun, X. Yang, W. Gao, H. Zhang, J. Xie, V. Smith, C. Wang
International Conference on Learning Representations (ICLR), 2022
Two-party split learning is a popular technique for learning a model across feature-partitioned data. In this work, we explore whether it is possible for one party to steal the private label information from the other party during split training, and whether there are methods that can protect against such attacks. Specifically, we first formulate a realistic threat model and propose a privacy loss metric to quantify label leakage in split learning. We then show that there exist two simple yet effective methods within the threat model that can allow one party to accurately recover private ground-truth labels owned by the other party. To combat these attacks, we propose several random perturbation techniques, including Marvell, an approach that strategically finds the structure of the noise perturbation by minimizing the amount of label leakage (measured through our quantification metric) of a worst-case adversary. We empirically demonstrate the effectiveness of our protection techniques against the identified attacks, and show that Marvell in particular has improved privacy-utility tradeoffs relative to baseline approaches.
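One of the identified attacks can be sketched in a few lines: with imbalanced binary labels and common losses, the norm of the gradient communicated at the cut layer correlates with the private label, so the non-label party can recover labels by thresholding norms. The threshold choice is left abstract here (an attacker might set it from the bimodal structure of observed norms):

```python
import torch

def norm_attack(cut_layer_grads, threshold):
    """Norm-based label inference in two-party split learning (sketch).

    `cut_layer_grads` is a list of per-example gradients received at the
    cut layer; examples with larger gradient norm are guessed to carry
    the rarer (positive) label.
    """
    norms = torch.stack([g.norm() for g in cut_layer_grads])
    return (norms > threshold).long()  # predicted private labels
```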
Diverse Client Selection for Federated Learning via Submodular Maximization
R. Balakrishnan, T. Li, T. Zhou, N. Himayat, V. Smith, J. Bilmes
International Conference on Learning Representations (ICLR), 2022
In every communication round of federated learning, a random subset of clients communicate their model updates back to the server, which then aggregates them. The optimal size of this subset is not known, and several studies have shown that random selection typically does not perform well in terms of convergence, learning efficiency, and fairness. In this paper, we instead propose to select a small, diverse subset of clients, namely those carrying representative gradient information, and we transmit only these updates to the server. Our aim is for updates from this subset to approximate an update aggregated from all client information. We achieve this by choosing a subset that maximizes a submodular facility location function defined over gradient space. We introduce “federated averaging with diverse client selection (DivFL)”. We provide a thorough analysis of its convergence in the heterogeneous setting and apply it both to synthetic and to real datasets. Empirical results show several benefits of our approach, including improved learning efficiency, faster convergence, and more uniform (i.e., fair) performance across clients. We further show a communication-efficient version of DivFL that can still outperform baselines on the above metrics.
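A sketch of the core selection step (the similarity measure and greedy implementation are ours): greedily maximize a facility-location function $G(S)=\sum_i \max_{j\in S} \mathrm{sim}(g_i, g_j)$ over client gradients.

```python
import numpy as np

def divfl_select(client_grads: np.ndarray, m: int):
    """Greedy facility-location client selection (sketch).

    `client_grads` is an (n_clients, d) array of per-client update
    vectors. We greedily pick clients whose gradients best "cover"
    everyone else's under a similarity measure.
    """
    sim = client_grads @ client_grads.T  # similarity in gradient space
    n = sim.shape[0]
    selected = []
    coverage = np.full(n, -np.inf)  # best similarity to any selected client
    for _ in range(m):
        candidates = [j for j in range(n) if j not in selected]
        # Objective value of S ∪ {j} for each candidate j.
        values = [np.maximum(coverage, sim[:, j]).sum() for j in candidates]
        best = candidates[int(np.argmax(values))]
        selected.append(best)
        coverage = np.maximum(coverage, sim[:, best])
    return selected
```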
Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
M. Kuchnik, A. Klimovic, J. Simsa, V. Smith, G. Amvrosiadis
Conference on Machine Learning and Systems (MLSys), 2022
Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as doing so requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over 2 million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, our analysis reveals that most jobs do not saturate host hardware, pointing to software-based bottlenecks. Motivated by these findings, we propose Plumber, a tool for finding bottlenecks in ML input pipelines. Plumber uses an extensible and interpretable analytical model based on operational analysis to automatically tune parallelism, prefetching, and caching under host resource constraints. Across five representative ML pipelines, Plumber obtains speedups of up to 46x for misconfigured pipelines. By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
2021
A Field Guide to Federated Optimization
J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, et al.
Federated learning and analytics are distributed approaches for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating, and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
On Large-Cohort Training for Federated Learning
Z. Charles, Z. Garrett, Z. Huo, S. Shmulyian, V. Smith
Neural Information Processing Systems (NeurIPS), 2021
Federated learning methods typically learn a model by iteratively sampling updates from a population of clients. In this work, we explore how the number of clients sampled at each round (the cohort size) impacts the quality of the learned model and the training dynamics of federated learning algorithms. Our work poses three fundamental questions. First, what challenges arise when trying to scale federated learning to larger cohorts? Second, what parallels exist between cohort sizes in federated learning and batch sizes in centralized learning? Last, how can we design federated learning methods that effectively utilize larger cohort sizes? We give partial answers to these questions based on extensive empirical evaluation. Our work highlights a number of challenges stemming from the use of larger cohorts. While some of these (such as generalization issues and diminishing returns) are analogs of large-batch training challenges, others (including training failures and fairness concerns) are unique to federated learning.
Federated Hyperparameter Tuning: Challenges, Baselines, and Connections to Weight-Sharing
M. Khodak, R. Tu, T. Li, L. Li, M.-F. Balcan, V. Smith, A. Talwalkar
Neural Information Processing Systems (NeurIPS), 2021
Tuning hyperparameters is a crucial but arduous part of the machine learning pipeline. Hyperparameter optimization is even more challenging in federated learning, where models are learned over a distributed network of heterogeneous devices; here, the need to keep data on device and perform local training makes it difficult to efficiently train and evaluate configurations. In this work, we investigate the problem of federated hyperparameter tuning. We first identify key challenges and show how standard approaches may be adapted to form baselines for the federated setting. Then, by making a novel connection to the neural architecture search technique of weight-sharing, we introduce a new method, FedEx, to accelerate federated hyperparameter tuning that is applicable to widely-used federated optimization methods such as FedAvg and recent variants. Theoretically, we show that a FedEx variant correctly tunes the on-device learning rate in the setting of online convex optimization across devices. Empirically, we show that FedEx can outperform natural baselines for federated hyperparameter tuning by several percentage points on the Shakespeare, FEMNIST, and CIFAR-10 benchmarks, obtaining higher accuracy using the same training budget.
Two Sides of Meta-Learning Evaluation: In vs. Out of Distribution
A. Setlur*, O. Li*, V. Smith
Neural Information Processing Systems (NeurIPS), 2021
We categorize meta-learning evaluation into two settings: in-distribution [ID], in which the train and test tasks are sampled iid from the same underlying task distribution, and out-of-distribution [OOD], in which they are not. While most meta-learning theory and some few-shot learning (FSL) applications follow the ID setting, we identify that most existing few-shot classification benchmarks instead reflect OOD evaluation, as they use disjoint sets of train (base) and test (novel) classes for task generation. This discrepancy is problematic because -- as we show on numerous benchmarks -- meta-learning methods that perform better on existing OOD datasets may perform significantly worse in the ID setting. In addition, in the OOD setting, even though current FSL benchmarks seem befitting, our study highlights concerns in 1) reliably performing model selection for a given meta-learning method, and 2) consistently comparing the performance of different methods. To address these concerns, we provide suggestions on how to construct FSL benchmarks to allow for ID evaluation as well as more reliable OOD evaluation. Our work aims to inform the meta-learning community about the importance and distinction of ID vs. OOD evaluation, as well as the subtleties of OOD evaluation with current benchmarks.
Progressive Compressed Records: Taking a Byte out of Deep Learning Data
M. Kuchnik, G. Amvrosiadis, V. Smith
Conference on Very Large Data Bases (VLDB), 2021
Deep learning training accesses vast amounts of data at high velocity, posing challenges for datasets retrieved over commodity networks and storage devices. We introduce a way to dynamically reduce the overhead of fetching and transporting training data with a method we term Progressive Compressed Records (PCRs). PCRs deviate from previous formats by using progressive compression to convert a single dataset into multiple datasets of increasing fidelity---all without adding to the total dataset size. Empirically, we implement PCRs and evaluate them on a wide range of datasets: ImageNet, HAM10000, Stanford Cars, and CelebA-HQ. Our results show that different tasks can tolerate different levels of compression. PCRs use an on-disk layout that enables applications to efficiently and dynamically access appropriate levels of compression at runtime. In turn, we demonstrate that PCRs can seamlessly enable a 2x speedup in training time on average over baseline formats.
Ditto: Fair and Robust Federated Learning Through Personalization
T. Li, S. Hu, A. Beirami, V. Smith
International Conference on Machine Learning (ICML), 2021
Best Paper Award at ICLR 2021 Secure ML Workshop
Fairness and robustness are two important concerns for federated learning systems. In this work, we identify that robustness to data and model poisoning attacks and fairness, measured as the uniformity of performance across devices, are competing constraints in statistically heterogeneous networks. To address these constraints, we propose employing a simple, general framework for personalized federated learning, Ditto, and develop a scalable solver for it. Theoretically, we analyze the ability of Ditto to achieve fairness and robustness simultaneously on a class of linear problems. Empirically, across a suite of federated datasets, we show that Ditto not only achieves competitive performance relative to recent personalization methods, but also enables more accurate, robust, and fair models relative to state-of-the-art fair or robust baselines.
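Concretely, Ditto learns a personalized model $v_k$ per device by regularizing toward the optimal global model $w^*$:

```latex
\min_{v_k} \; F_k(v_k) + \frac{\lambda}{2}\,\|v_k - w^*\|_2^2,
\qquad w^* \in \arg\min_{w} \sum_{k} p_k F_k(w),
```

where $\lambda$ trades off personalization (small $\lambda$) against fidelity to the shared global solution (large $\lambda$).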
Heterogeneity for the Win: One-Shot Federated Clustering
D. Dennis, T. Li, V. Smith
International Conference on Machine Learning (ICML), 2021
In this work, we explore the unique challenges---and opportunities---of unsupervised federated learning (FL). We develop and analyze a one-shot federated clustering scheme, k-FED, based on the widely-used Lloyd's method for k-means clustering. In contrast to many supervised problems, we show that the issue of statistical heterogeneity in federated networks can in fact benefit our analysis. We analyze k-FED under a center separation assumption and compare it to the best known requirements of its centralized counterpart. Our analysis shows that in heterogeneous regimes where the number of clusters per device, k', is smaller than the total number of clusters over the network, k (in particular, $k' \le \sqrt{k}$), we can use heterogeneity to our advantage---significantly weakening the cluster separation requirements for k-FED. From a practical viewpoint, k-FED also has many desirable properties: it requires only one round of communication, can run asynchronously, and can handle partial participation or node/network failures. We motivate our analysis with experiments on common FL benchmarks, and highlight the practical utility of one-shot clustering through use-cases in personalized FL and device sampling.
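A sketch of the one-shot recipe, using scikit-learn's k-means as the local Lloyd's solver. The server-side step shown here is plain k-means over pooled centers, whereas the paper uses a distance-based grouping justified by the center-separation analysis:

```python
import numpy as np
from sklearn.cluster import KMeans

def k_fed(device_data, k_prime, k):
    """One-shot federated clustering (sketch of the k-FED recipe).

    Each device runs Lloyd's method locally with k' clusters and sends
    only its centers to the server (a single round of communication).
    The server then clusters the pooled device centers into k global
    clusters.
    """
    local_centers = []
    for X in device_data:  # runs independently; can be asynchronous
        km = KMeans(n_clusters=k_prime, n_init=10).fit(X)
        local_centers.append(km.cluster_centers_)
    pooled = np.vstack(local_centers)
    return KMeans(n_clusters=k, n_init=10).fit(pooled).cluster_centers_
```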
Tilted Empirical Risk Minimization
T. Li*, A. Beirami*, M. Sanjabi, V. Smith
International Conference on Learning Representations (ICLR), 2021
Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework---tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.
2020
Federated Learning: Challenges, Methods, and Future Directions
T. Li, A. K. Sahu, A. Talwalkar, V. Smith
IEEE Signal Processing Magazine, Special Issue on Distributed Machine Learning, 2020
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.
Fair Resource Allocation in Federated Learning
T. Li, M. Sanjabi, A. Beirami, V. Smith
International Conference on Learning Representations (ICLR), 2020
Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by resource allocation in wireless networks that encourages a more fair (i.e., lower-variance) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.
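The q-FFL objective reweights device losses $F_k$ (with device weights $p_k$) by a fairness parameter $q \ge 0$:

```latex
\min_{w} \; f_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1}\, F_k^{\,q+1}(w),
```

so that $q = 0$ recovers the standard federated objective, while larger $q$ places more emphasis on devices with high loss, yielding a more uniform accuracy distribution.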
Learning Context-aware Policies from Multiple Smart Homes via Federated Multi-Task Learning
T. Yu, T. Li, Y. Sun, S. Nanda, V. Smith, V. Sekar, S. Seshan
ACM/IEEE Conference on Internet of Things Design and Implementation (IoTDI), 2020
Internet-of-Things (IoT) devices deployed in smart homes expose users to cyber threats that can cause privacy leakage (e.g., smart TV eavesdropping) or physical hazards (e.g., smart stove causing fire). Prior work has argued that to effectively detect and prevent such threats, contextual policies are needed to decide if an access to an IoT device should be allowed. Today, however, such contextual access control policies need to be manually generated by IoT developers or users via preinstallation or runtime prompts. Both approaches suffer from potential misconfigurations and often fail to provide coverage over the space of policies. In this paper, our goal is to build a machine learning framework to automatically learn the contextual access control policies from the observed behavioral patterns of users in smart homes. Designing such a learning framework is challenging on two fronts. First, the accuracy is constrained by insufficient data in some smart homes and the diversity of IoT access patterns across different smart homes. Second, since we rely on usage patterns of IoT devices, users will have privacy concerns. We address these challenges in designing LoFTI, a federated multi-task learning framework that learns customized context-aware policies from multiple smart homes in a privacy-preserving manner. Based on prior user studies, we identify six general types of features to capture contextual access patterns. We build a simple machine learning model with temporal structure to achieve a good trade-off between accuracy and communication/computation cost. We design a custom data augmentation mechanism to address the issue of unbalanced data in learning (i.e., few negative vs. normal samples). We show that LoFTI can achieve low false positives/false negatives, reducing the false negative rate by 24.2% and the false positive rate by 49.5% compared with state-of-the-art single-home and all-home learning mechanisms.
Federated Optimization in Heterogeneous Networks
T. Li, A. K. Sahu, M. Sanjabi, M. Zaheer, A. Talwalkar, V. Smith
Conference on Machine Learning and Systems (MLSys), 2020
Federated learning involves training machine learning models in massively distributed networks. While Federated Averaging (FedAvg) is the leading optimization method for training non-convex models in this setting, its behavior is not well understood in realistic federated settings when learning across statistically heterogeneous devices, i.e., where each device collects data in a non-identical fashion. In this work, we introduce a framework to tackle statistical heterogeneity, FedProx, which encompasses FedAvg as a special case. We provide convergence guarantees for FedProx through a novel device dissimilarity assumption, which allows us to characterize heterogeneity in the network. Finally, we perform a detailed empirical evaluation across a suite of federated datasets, validating our theoretical analysis and demonstrating the improved robustness and stability of the generalized FedProx framework relative to FedAvg for learning in heterogeneous networks.
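The key modification in FedProx is a proximal term in each device's local subproblem: given the current global model $w^t$, device $k$ approximately solves

```latex
\min_{w} \; h_k(w; w^t) = F_k(w) + \frac{\mu}{2}\,\|w - w^t\|^2,
```

where $\mu = 0$ recovers FedAvg and larger $\mu$ keeps heterogeneous local updates closer to the global model.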
2019
LEAF: A Benchmark for Federated Settings
S. Caldas, P. Wu, T. Li, J. Konecny, B. McMahan, V. Smith, A. Talwalkar
Workshop on Federated Learning for Data Privacy and Confidentiality at NeurIPS, 2019
Modern federated networks, such as those comprised of wearable devices, mobile phones, or autonomous vehicles, generate massive amounts of data each day. This wealth of data can help to learn models that can improve the user experience on each device. However, learning in federated settings presents new challenges at all stages of the machine learning pipeline. As the machine learning community begins to tackle these challenges, we are at a critical time to ensure that developments made in this area are grounded in real-world assumptions. To this end, we propose LEAF, a modular benchmarking framework for learning in federated settings. LEAF includes a suite of open-source federated datasets, a rigorous evaluation framework, and a set of reference implementations, all geared towards capturing the obstacles and intricacies of practical federated environments.
FedDANE: A Federated Newton-Type Method
T. Li, A. K. Sahu, M. Sanjabi, M. Zaheer, A. Talwalkar, V. Smith
Asilomar Conference on Signals, Systems and Computers, 2019, Invited Paper
Federated learning aims to jointly learn statistical models over massively distributed remote devices. In this work, we propose FedDANE, an optimization method that we adapt from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning. We provide convergence guarantees for this method when learning over both convex and non-convex functions. Despite encouraging theoretical results, we find that the method has underwhelming performance empirically. In particular, through empirical simulations on both synthetic and real-world datasets, FedDANE consistently underperforms baselines of FedAvg and FedProx in realistic federated settings. We identify low device participation and statistical device heterogeneity as two underlying causes of this underwhelming performance, and conclude by suggesting several directions of future work.
A Kernel Theory of Modern Data Augmentation
T. Dao, A. Gu, A. Ratner, V. Smith, C. De Sa, C. Re
International Conference on Machine Learning (ICML), 2019
Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding modern data augmentation techniques. We start by showing that for kernel classifiers, data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. We connect this general approximation framework to prior work in invariant kernels, tangent propagation, and robust optimization. Next, we explicitly tackle the compositional aspect of modern data augmentation techniques, proposing a novel model of data augmentation as a Markov process. Under this model, we show that performing k-nearest neighbors with data augmentation is asymptotically equivalent to a kernel classifier. Finally, we illustrate ways in which our theoretical framework can be leveraged to accelerate machine learning workflows in practice, including reducing the amount of computation needed to train on augmented data, and predicting the utility of a transformation prior to training.
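The two components of the approximation are easy to state in code. In the rough sketch below (feature_map and transforms are placeholders for a kernel feature map and a transformation set), augmentation acts to first order like averaging features over the transformations, and to second order like a variance penalty:

    import numpy as np

    def augmentation_components(x, transforms, feature_map):
        feats = np.stack([feature_map(t(x)) for t in transforms])
        # First-order term: average the feature map over augmentations
        # rather than materializing every transformed copy of x.
        mean = feats.mean(axis=0)
        # Second-order term: feature variance under augmentation, which
        # acts as a data-dependent regularizer on the classifier weights.
        var = feats.var(axis=0)
        return mean, var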
Efficient Augmentation via Data Subsampling
M. Kuchnik, V. Smith
International Conference on Learning Representations (ICLR), 2019
Data augmentation is commonly used to encode invariances in learning methods. However, this process is often performed in an inefficient manner, as artificial examples are created by applying a number of transformations to all points in the training set. The resulting explosion of the dataset size can be an issue in terms of storage and training costs, as well as in selecting and tuning the optimal set of transformations to apply. In this work, we demonstrate that it is possible to significantly reduce the number of data points included in data augmentation while realizing the same accuracy and invariance benefits of augmenting the entire dataset. We propose a novel set of subsampling policies, based on model influence and loss, that can achieve a 90% reduction in augmentation set size while maintaining the accuracy gains of standard data augmentation.
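As a concrete illustration of a loss-based policy (a simplification of the policies studied in the paper), one can restrict augmentation to the highest-loss fraction of the training set and apply the transformation set only to those indices:

    import numpy as np

    def high_loss_subset(per_example_losses, frac=0.1):
        # Select the top `frac` of points by training loss, on the
        # intuition that these benefit most from added invariance.
        k = max(1, int(frac * len(per_example_losses)))
        return np.argsort(per_example_losses)[-k:]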
MLSys: The New Frontier of Machine Learning Systems
Technical Report, 2019
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
2018 & prior
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, M. Jaggi
Journal of Machine Learning Research (JMLR), 2018
The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for the distributed environment, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.
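The communication pattern is simple to sketch. In each round, every machine runs an arbitrary local solver on its own subproblem and ships back a single update, which the driver combines with a weight gamma. The sketch below is schematic (the actual subproblems are defined in the dual, and partitions/local_solver are placeholders):

    import numpy as np

    def cocoa_round(w, partitions, local_solver, gamma):
        # One round: K machines solve local subproblems in parallel; a
        # single reduce step combines their updates with weight gamma.
        deltas = [local_solver(w, part) for part in partitions]
        return w + gamma * np.sum(deltas, axis=0)

Here gamma = 1/K corresponds to conservative averaging; the choice between adding and averaging is the subject of the CoCoA+ paper later in this list.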
One-Shot Federated Learning
N. Guha, A. Talwalkar, V. Smith
Machine Learning on Devices Workshop at NeurIPS, 2018
We present one-shot federated learning, where a central server learns a global model over a network of federated devices in a single round of communication. Our approach, drawing on ensemble learning and knowledge aggregation, achieves an average relative gain of 51.5% in AUC over local baselines and comes within 90.1% of the (unattainable) global ideal. We discuss these methods and identify several promising directions of future work.
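One simple instantiation of the ensemble idea is for each device to upload its locally trained model once and for the server to average predicted scores. The sketch below assumes scikit-learn-style binary classifiers; the paper also considers knowledge aggregation, which is not shown.

    import numpy as np

    def one_shot_ensemble_predict(local_models, X):
        # Single communication round: devices train locally and upload
        # once; the server averages their predicted probabilities.
        scores = np.stack([m.predict_proba(X)[:, 1] for m in local_models])
        return scores.mean(axis=0)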
Federated Multi-Task Learning
V. Smith, C. Chiang, M. Sanjabi, A. Talwalkar
Neural Information Processing Systems (NeurIPS), 2017
Federated learning poses new statistical and systems challenges in training machine learning models over distributed networks of devices. In this work, we show that multi-task learning is naturally suited to handle the statistical challenges of this setting, and propose a novel systems-aware optimization method, MOCHA, that is robust to practical systems issues. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for distributed multi-task learning. The resulting method achieves significant speedups compared to alternatives in the federated setting, as we demonstrate through simulations on real-world federated datasets.
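As a rough sketch of the kind of objective involved (assuming a hinge loss and a fixed task-relationship matrix Omega; MOCHA also learns Omega and handles the systems issues, neither shown here), each device k keeps its own model W[k], and the models are coupled through a trace penalty:

    import numpy as np

    def hinge_loss(w, X, y):
        # y assumed in {-1, +1}
        return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

    def mtl_objective(W, Omega, device_data, lam=0.1):
        # W has one row per device; tr(W^T Omega W) couples the models of
        # related devices via the task-relationship matrix Omega.
        loss = sum(hinge_loss(W[k], X, y)
                   for k, (X, y) in enumerate(device_data))
        return loss + lam * np.trace(W.T @ Omega @ W)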
Distributed Optimization with Arbitrary Local Solvers
C. Ma, J. Konecny, M. Jaggi, V. Smith, M. I. Jordan, P. Richtarik, M. Takac
Optimization Methods and Software, 2017
With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.
L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework
V. Smith, S. Forte, M. I. Jordan, M. Jaggi
ML Systems Workshop at ICML, 2016
Despite the importance of sparsity in many big data applications, there are few existing methods for efficient distributed optimization of sparsely-regularized objectives. In this paper, we present a communication-efficient framework for L1-regularized optimization in distributed environments. By taking a non-traditional view of classical objectives as part of a more general primal-dual setting, we obtain a new class of methods that can be efficiently distributed and is applicable to common L1-regularized regression and classification objectives, such as Lasso, sparse logistic regression, and elastic net regression. We provide convergence guarantees for this framework and demonstrate strong empirical performance as compared to other state-of-the-art methods on several real-world distributed datasets.
Going In-Depth: Finding Longform on the Web
V. Smith, M. Connor, I. Stanton
Conference on Knowledge Discovery and Data Mining (KDD), 2015
tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.
Adding vs. Averaging in Distributed Primal-Dual Optimization
C. Ma*, V. Smith*, M. Jaggi, M. I. Jordan, P. Richtarik, M. Takac
International Conference on Machine Learning (ICML), 2015
Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization. Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of CoCoA+ on several real-world distributed datasets, especially when scaling up the number of machines.
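The core change is easy to state in code: with K machines, CoCoA combines local updates with weight 1/K (averaging), while CoCoA+ rescales the local subproblems so that weight 1 (adding) remains safe. A minimal sketch, omitting the subproblem rescaling itself:

    import numpy as np

    def combine_updates(w, deltas, mode="add"):
        if mode == "average":
            return w + np.mean(deltas, axis=0)  # conservative: gamma = 1/K
        return w + np.sum(deltas, axis=0)       # CoCoA+: gamma = 1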
Communication-Efficient Distributed Dual Coordinate Ascent
M. Jaggi*, V. Smith*, M. Takac, J. Terhorst, S. Krishnan, T. Hofmann, M. I. Jordan
Neural Information Processing Systems (NeurIPS), 2014
Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, COCOA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, COCOA converges to the same .001-accurate solution quality on average 25x as quickly.
MLI: An API for User-friendly Distributed Machine Learning
E. Sparks, A. Talwalkar, V. Smith, X. Pan, J. Gonzalez, T. Kraska, M. I. Jordan, and M. J. Franklin
IEEE International Conference on Data Mining (ICDM), 2013
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
A Comparative Study of High Renewables Penetration Electricity Grids
J. Taneja, V. Smith, D. Culler, and C. Rosenberg
IEEE International Conference on Smart Grid Communications (SmartGridComm), 2013
Electricity grids are transforming as renewables proliferate, yet operational concerns due to fluctuations in renewable sources could limit the ultimate potential for high penetrations of renewables. In this paper, we compare three electricity grids – California, Germany, and Ontario – studying the effects of the relative costs of solar and wind generation on the selection of the renewables mix, and examine the resulting excess generation. We then observe the effects of the renewables mix and the use of baseload energy generation on the limits to renewables penetration, quantifying what proportion of delivered energy can be provided by renewables. Our study shows that the optimal renewables mix, from the perspective of minimizing total cost of generation, is highly dependent on the relative costs of technology, and that above a certain penetration rate, different for each grid, the optimal mix must contain both solar and wind generation.
Classification of Sidewalks in Street View Images
V. Smith, J. Malik, and D. Culler
WiP Workshop at International Green Computing Conference (IGCC), 2013
Mapping sidewalks in urban environments is key in the creation of pedestrian-friendly, sustainable cities. Currently, urban planners are hindered by a lack of information available in a format suitable for the large-scale analysis of sidewalk design. To demonstrate the impact that information technology could have in this area, we leverage techniques from machine learning and computer vision to gather information about the presence and quality of sidewalks in map images. In particular, we identify sidewalk segments in street view images using a random forest classifier, utilizing a set of local and global features that include geometric context, presence of lanes, pixel color, and location. Our results illustrate that this approach is effective in classifying sidewalk segments in a large set of street view images. This algorithm can be easily extended to other datasets, and can be automated to gather complete, fine-grained details about sidewalks for arbitrarily large urban environments.
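The classification stage is standard and easy to reproduce in outline. The sketch below uses random placeholder data, since the substantive work is in extracting the geometric-context, lane, color, and location features for each segment:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 12))    # placeholder: 12 features per segment
    y = rng.integers(0, 2, size=1000)  # placeholder: 1 if segment is sidewalk
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)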
MLbase: A Distributed Machine Learning Wrapper
A. Talwalkar, T. Kraska, R. Griffith, J. Duchi, J. Gonzalez, D. Britz, X. Pan, V. Smith, E. Sparks, A. Wibisono, M. J. Franklin, and M. I. Jordan
Big Learning Workshop at NeurIPS, 2012
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
Identifying Models of HVAC Systems Using Semiparametric Regression
A. Aswani, N. Master, J. Taneja, V. Smith, A. Krioukov, D. Culler, and C. Tomlin
Proceedings of the American Control Conference (ACC), 2012
Heating, ventilation, and air-conditioning (HVAC) systems use a large amount of energy, and so they are an interesting area for efficiency improvements. The focus here is on the use of semiparametric regression to identify models of HVAC systems that are amenable to analysis and control system design. This paper briefly describes two testbeds that we have built on the Berkeley campus for modeling and efficient control of HVAC systems, and we use these testbeds as case studies for system identification. The main contribution of this work is that the use of semiparametric regression allows for the estimation of the heating load from occupancy, equipment, and solar heating using only temperature measurements. These estimates are important for building accurate models as well as designing efficient control schemes, and in our other work we have been able to achieve a reduction in energy consumption on a single-room testbed using heating load estimation in conjunction with the learning-based model predictive control (LBMPC) technique. Furthermore, this framework does not preclude modeling nonlinear HVAC behavior: we have used the same methodology to create hybrid system models that incorporate such nonlinearities.
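As a sketch of the partially linear structure (the paper's exact model and identification procedure differ; `smooth` stands for any nonparametric smoother, e.g. a moving average over time), temperature dynamics split into a parametric part and a smooth heating-load term estimated by backfitting:

    import numpy as np

    def estimate_heat_load(T, u, smooth, iters=5):
        # Partially linear model: T[t+1] = a*T[t] + b*u[t] + q[t], where q
        # is the nonparametric heating load (occupancy, equipment, solar).
        X = np.column_stack([T[:-1], u[:-1]])
        q = np.zeros(len(T) - 1)
        for _ in range(iters):
            coef, *_ = np.linalg.lstsq(X, T[1:] - q, rcond=None)  # fit (a, b)
            q = smooth(T[1:] - X @ coef)                          # re-fit q
        return coef, q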
Modeling Building Thermal Response to HVAC Zoning
V. Smith, T. Sookoor, and K. Whitehouse
ACM SIGBED Review, 2012
HVAC systems account for 38% of building energy usage. Studies have indicated at least 5-15% waste due to unoccupied spaces being conditioned. Our goal is to minimize this waste by retrofitting HVAC systems to enable room-level zoning where each room is conditioned individually based on its occupancy. This will allow only occupied rooms to be conditioned while saving the energy used to condition unoccupied rooms. In order to achieve this goal, the effect of opening or closing air vent registers on room temperatures has to be predicted. Making such a prediction is complicated by the fact that weather has a larger effect on room temperatures than the settings of air vent registers, making it hard to isolate the influence of the HVAC system. We present a technique for dynamically estimating the heat load due to weather on room temperatures and subtracting it out in order to predict the effect of the HVAC system more directly.
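A minimal sketch of the subtract-the-weather idea follows (array names are hypothetical, and estimating the weather-driven load itself is the harder problem the paper addresses): remove the estimated weather component from each observed temperature change, then compare the residuals with the register open versus closed.

    import numpy as np

    def vent_effect(d_temp, weather_load, vent_open):
        # Residual temperature change once the estimated weather-driven
        # heat load is subtracted; attributed to the vent register.
        # vent_open is a boolean array aligned with d_temp.
        residual = d_temp - weather_load
        return residual[vent_open].mean() - residual[~vent_open].mean()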