I’m a Ph.D. candidate in the Computer Science Department (CSD) at Carnegie Mellon University (CMU), where I’m fortunate to be advised by Prof. Justine Sherry. My research area is Networking, and I’m part of the Systems, Networking, and Performance Lab (SNAP) at CMU; I’m secondarily affiliated with CONIX, CyLab, and CMU’s Parallel Data Lab (PDL). My research interests broadly lie at the intersection of systems and performance modeling (i.e. building high-performance networked systems and proving theoretical properties about their performance and stability).
Prior to starting graduate school, I completed my undergraduate degree (B.A.Sc.) in Computer Engineering at the University of Toronto, Canada, in 2018. Despite my apparent penchant for cold, capricious weather, I spent most of my life in Mumbai, India. In my spare time, I enjoy reading (fiction, philosophy), imbibing copious amounts of espresso, and playing chess, golf, and soccer.
Abstract Clos topologies have a long history in networking. In telephone networks and in network flow problems, they have been shown to emulate the performance properties of a 'macro-switch': a single, infinite-capacity, and non-blocking interconnect between sources and destinations. In particular, individual flow demands can be guaranteed and aggregate network throughput is easily maximized. The above properties made Clos topologies an attractive choice for modern data-centers and hence they are widely deployed today. At the same time, it is well-known that the above properties are not guaranteed to hold in data-centers due to fundamental differences between data-centers, telephone networks, and network flow problems. In this paper, we provide the first mathematical bounds characterizing the gap in throughput and flow demand allocations between practical data-center Clos deployments and the macro-switch abstraction for these networks. Our findings cast a spotlight on the tradeoffs between routing, ensuring max-min fairness for flows, and maximizing throughput in data-centers, and call into question some common assumptions about the design and evaluation of data-center routing and scheduling.
BibTeX @inproceedings{ferreira-podc24, title={Impossibility Results for Data-Center Routing with Congestion Control and Unsplittable Flows}, author={Ferreira, Miguel Alves and Atre, Nirav and Sherry, Justine and Sobrinho, Jo{\~a}o Lu{\'\i}s}, booktitle={Proceedings of the 2024 ACM Symposium on Principles of Distributed Computing}, pages={}, year={2024} }
Abstract The need for fairness, strong isolation, and fine-grained control over network traffic in multi-tenant cloud settings has engendered a rich literature on packet scheduling in switches and programmable hardware. Recent proposals for hardware scheduling primitives (PIFO, PIEO, BMW-Tree, etc.) have enabled run-time programmable packet schedulers, considerably expanding the suite of scheduling policies that can be applied to network traffic. However, no existing solution can be practically deployed on modern switches and NICs because it either does not scale to the number of elements required by these devices or fails to deliver good throughput, thus requiring an impractical number of replicas. In this work, we ask: is it possible to achieve priority packet scheduling at line-rate while supporting a large number of flows? Our key insight is to leverage a scheduling primitive used previously in software -- called Hierarchical Find First Set -- and port this to a highly pipeline-parallel hardware design. We present the architecture and implementation of Bitmapped Bucket Queue (BBQ), a hardware-based integer priority queue that supports a wide range of scheduling policies (via a PIFO-like abstraction). BBQ, for the first time, supports hundreds of thousands of concurrent flows while guaranteeing 100Gbps line rate (148.8 Mpps) on FPGAs and 1Tbps (1,488 Mpps) line rate on ASICs. We demonstrate this by implementing BBQ on a commodity FPGA where it is capable of supporting 100K+ flows and 32K+ priorities at 300MHz, 3X the packet rate of similar hardware priority queue designs. On ASIC, we can synthesize 100K elements at 3.1 GHz using a 7nm process.
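To give a flavor of the primitive BBQ builds on, here is a minimal software sketch of a hierarchical find-first-set integer priority queue. It assumes a two-level 64x64 bitmap (4,096 priorities) purely for illustration; it is not BBQ's pipelined FPGA/ASIC design, and the names and sizes are my own.

```cpp
// Sketch of a Hierarchical Find-First-Set (HFFS) integer priority queue.
// Two bitmap levels summarize bucket occupancy, so finding the
// highest-priority non-empty bucket takes two find-first-set operations.
#include <array>
#include <cassert>
#include <cstdint>
#include <deque>
#include <iostream>

struct HffsQueue {
    static constexpr int kL1 = 64, kL2 = 64;         // 64 * 64 = 4096 priority buckets
    uint64_t l1 = 0;                                  // bit i: is any bucket in group i non-empty?
    std::array<uint64_t, kL1> l2{};                   // bit j of word i: is bucket i*64+j non-empty?
    std::array<std::deque<int>, kL1 * kL2> buckets;   // FIFO of packet ids per priority

    void push(int priority, int packet_id) {
        assert(priority >= 0 && priority < kL1 * kL2);
        buckets[priority].push_back(packet_id);
        l2[priority / kL2] |= 1ull << (priority % kL2);
        l1 |= 1ull << (priority / kL2);
    }

    // Pop from the lowest-numbered (i.e., highest-priority) non-empty bucket.
    bool pop(int& priority, int& packet_id) {
        if (l1 == 0) return false;
        int g = __builtin_ctzll(l1);                  // find-first-set, level 1 (GCC/Clang builtin)
        int b = __builtin_ctzll(l2[g]);               // find-first-set, level 2
        priority = g * kL2 + b;
        packet_id = buckets[priority].front();
        buckets[priority].pop_front();
        if (buckets[priority].empty()) {              // clear occupancy bits upward
            l2[g] &= ~(1ull << b);
            if (l2[g] == 0) l1 &= ~(1ull << g);
        }
        return true;
    }
};

int main() {
    HffsQueue q;
    q.push(1000, 1); q.push(7, 2); q.push(7, 3);
    int prio, id;
    while (q.pop(prio, id)) std::cout << "prio " << prio << " pkt " << id << "\n";
}
```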
BibTeX @inproceedings {atre-nsdi24, author = {Nirav Atre and Hugo Sadok and Justine Sherry}, title = {BBQ: A Fast and Scalable Integer Priority Queue for Hardware Packet Scheduling}, booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)}, year = {2024}, isbn = {978-1-939133-39-7}, address = {Santa Clara, CA}, pages = {455--475}, url = {https://www.usenix.org/conference/nsdi24/presentation/atre}, publisher = {USENIX Association}, month = apr }
Abstract Today, most communication between the NIC and software involves exchanging fixed-size packet buffers. This packetized interface was designed for an era when NICs implemented few offloads and software implemented the logic for translating between application data and packets. However, both NICs and networked software have evolved: modern NICs implement hardware offloads, e.g., TSO, LRO, and serialization offloads that can more efficiently translate between application data and packets. Furthermore, modern software increasingly batches network I/O to reduce overheads. These changes have led to a mismatch between the packetized interface, which assumes that the NIC and software exchange fixed-size buffers, and the features provided by modern NICs and used by modern software. This incongruence between interface and data adds software complexity and I/O overheads, which in turn limits communication performance. This paper proposes Ensō, a new streaming NIC-to-software interface designed to better support how NICs and software interact today. At its core, Ensō eschews fixed-size buffers, and instead structures communication as a stream that can be used to send arbitrary data sizes. We show that this change reduces software overheads, reduces PCIe bandwidth requirements, and leads to fewer cache misses. These improvements allow an Ensō-based NIC to saturate a 100 Gbps link with minimum-sized packets (forwarding at 148.8 Mpps) using a single core, improve throughput for high-performance network applications by 1.5–6X, and reduce latency by up to 43%.
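The sketch below illustrates the interface shift the paper argues for, using a hypothetical `Stream` type of my own (not Ensō's actual API): the consumer sees one contiguous byte range of arbitrary size per call, rather than an array of fixed-size packet buffers.

```cpp
// Toy contrast with a packetized receive loop: the "NIC" (simulated here)
// appends bytes to a contiguous region, and software consumes whatever has
// arrived as a single range, whatever its size.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <utility>
#include <vector>

struct Stream {
    std::vector<uint8_t> ring;   // stand-in for a NIC-written memory region
    size_t consumed = 0;

    // Returns a pointer/length pair covering all bytes not yet consumed.
    std::pair<const uint8_t*, size_t> recv() {
        return {ring.data() + consumed, ring.size() - consumed};
    }
    void done(size_t n) { consumed += n; }   // tell the "NIC" this space can be reused
};

int main() {
    Stream s;
    // Simulate the NIC delivering three back-to-back application messages of
    // different sizes directly into the stream (no per-packet buffers).
    for (const char* msg : {"alpha", "bravo-longer-message", "c"}) {
        s.ring.insert(s.ring.end(), msg, msg + std::strlen(msg));
    }
    auto [data, len] = s.recv();             // one contiguous chunk, arbitrary size
    std::cout << "got " << len << " bytes: ";
    std::cout.write(reinterpret_cast<const char*>(data), static_cast<std::streamsize>(len));
    std::cout << "\n";
    s.done(len);
}
```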
BibTeX @inproceedings {sadok-osdi23, author = {Sadok, Hugo and Atre, Nirav and Zhao, Zhipeng and Berger, Daniel S. and Hoe, James C. and Panda, Aurojit and Sherry, Justine and Wang, Ren}, title = {Ensō: A Streaming Interface for {NIC}-Application Communication}, booktitle = {17th {USENIX} Symposium on Operating Systems Design and Implementation}, year = {2023}, isbn = {}, pages = {}, publisher = {USENIX Association}, month = jul, series = {OSDI~'23} }
Abstract Algorithmic complexity attacks (ACAs) are a class of denial-of-service (DoS) attacks where an attacker uses a small amount of adversarial traffic to induce a large amount of work in the target system, pushing the system into overload and causing it to drop packets from innocent users. ACAs are particularly dangerous because, unlike volumetric DoS attacks, ACAs don't require a significant network bandwidth investment from the attacker. Today, network functions (NFs) on the Internet must be painstakingly designed and engineered on a case-by-case basis to mitigate the debilitating impact of ACAs. Further, the resulting designs tend to be overly conservative in their attack mitigation strategy, limiting the innocent traffic that the NF can serve under common-case operation. In this work, we propose a general framework to make any NF more resilient to ACAs without the limitations of prior approaches. Our framework, SurgeProtector, uses the NF's scheduler to mitigate the impact of ACAs using a very traditional scheduling algorithm: weighted-shortest-job-first (WSJF). To evaluate SurgeProtector, we propose a new metric of vulnerability called the Displacement Factor (DF), which quantifies the 'harm per unit effort' that an adversary can inflict on the system. We provide a novel, adversarial analysis of WSJF and show that any system using this policy has a worst-case DF of only a small constant, whereas traditional schedulers place no upper bound on the DF. Illustrating that SurgeProtector is not only theoretically, but practically robust, we integrate SurgeProtector into an open source intrusion detection system (IDS). Under simulated attack, the SurgeProtector-augmented IDS suffers 90-99% lower innocent traffic loss than the original system.
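A minimal sketch of the WSJF policy at the heart of SurgeProtector, assuming illustrative field names: packets are served in increasing order of (estimated job size) / (packet size), so displacing innocent work forces an attacker to spend proportional bandwidth.

```cpp
// Weighted-shortest-job-first (WSJF) scheduling sketch: a min-heap keyed by
// the job-size-to-packet-size ratio. Field names and units are placeholders.
#include <iostream>
#include <queue>
#include <vector>

struct Packet {
    int id;
    double est_job_ns;    // estimated processing cost of this packet
    double size_bytes;    // wire size of this packet
    double ratio() const { return est_job_ns / size_bytes; }
};

struct ByRatio {          // orders the priority_queue as a min-heap on ratio()
    bool operator()(const Packet& a, const Packet& b) const {
        return a.ratio() > b.ratio();
    }
};

int main() {
    std::priority_queue<Packet, std::vector<Packet>, ByRatio> q;
    q.push({1, 50000.0, 64.0});     // tiny packet, huge job: adversarial-looking
    q.push({2, 2000.0, 1500.0});    // large packet, modest job: innocent-looking
    q.push({3, 500.0, 64.0});
    while (!q.empty()) {            // serve smallest ratio first
        std::cout << "serve packet " << q.top().id
                  << " (ratio " << q.top().ratio() << " ns/byte)\n";
        q.pop();
    }
}
```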
BibTeX @inproceedings{atre-sigcomm22, author = {Atre, Nirav and Sadok, Hugo and Chiang, Erica and Wang, Weina and Sherry, Justine}, title = {SurgeProtector: Mitigating Temporal Algorithmic Complexity Attacks using Adversarial Scheduling}, year = {2022}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the 2022 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM)} }
Abstract Kernel-bypass network APIs, which allow applications to circumvent the kernel and interface directly with the NIC hardware, have recently emerged as one of the main tools for improving application network performance. However, allowing applications to circumvent the kernel makes it impossible to use tools (e.g., tcpdump) or impose policies (e.g., QoS and filters) that need to consider traffic sent by different applications running on a host. This makes maintainability and manageability a challenge for kernel-bypass applications. In response, we propose Kernel On-Path Interposition (KOPI), in which traditional kernel dataplane functionality is retained but implemented in a fully programmable SmartNIC. We hypothesize that KOPI can support the same tools and policies as the kernel stack while retaining the performance benefits of kernel bypass.
BibTeX @inproceedings{sadok-hotos21, author = {Sadok, Hugo and Zhao, Zhipeng and Choung, Valerie and Atre, Nirav and Berger, Daniel S. and Hoe, James C. and Panda, Aurojit and Sherry, Justine}, title = {We Need Kernel Interposition over the Network Dataplane}, year = {2021}, isbn = {}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3458336.3465281}, doi = {10.1145/3458336.3465281}, booktitle = {Proceedings of the Workshop on Hot Topics in Operating Systems}, pages = {152--158}, month = may, series = {HotOS '21} }
Abstract Intrusion Detection and Prevention Systems (IDS/IPS) are among the most demanding stateful network functions. Today's network operators are faced with securing 100Gbps networks with 100K+ concurrent connections by deploying IDS/IPSes to search for 10K+ rules concurrently. In this paper we set an ambitious goal: Can we do all of the above in a single server? Through the Pigasus IDS/IPS, we show that this goal is achievable, perhaps for the first time, by building on recent advances in FPGA-capable SmartNICs. Pigasus' design takes an FPGA-first approach, where the majority of processing, and all state and control flow are managed on the FPGA. However, doing so requires careful design of algorithms and data structures to ensure fast common-case performance while densely utilizing system memory resources. Our experiments with a variety of traces show that Pigasus can support 100Gbps using an average of 5 cores and 1 FPGA, using 38x less power than a CPU-only approach.
BibTeX @inproceedings{zhao-osdi20, author = {Zhao, Zhipeng and Sadok, Hugo and Atre, Nirav and Hoe, James and Sekar, Vyas and Sherry, Justine}, title = {Achieving 100Gbps Intrusion Prevention on a Single Server}, booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20)}, year = {2020}, address = {Banff, Alberta}, url = {https://www.usenix.org/conference/osdi20/presentation/zhao-zhipeng}, publisher = {USENIX Association}, month = nov, }
Abstract Caches are at the heart of latency-sensitive systems. In this paper, we identify a growing challenge for the design of latency-minimizing caches called delayed hits. Delayed hits occur at high throughput, when multiple requests to the same object queue up before an outstanding cache miss is resolved. This effect increases latencies beyond the predictions of traditional caching models and simulations; in fact, caching algorithms are designed as if delayed hits simply didn’t exist. We show that traditional caching strategies – even so-called ‘optimal’ algorithms – can fail to minimize latency in the presence of delayed hits. We design a new, latency-optimal offline caching algorithm called BELATEDLY which reduces average latencies by up to 45% compared to the traditional, hit-rate optimal Belady’s algorithm. Using BELATEDLY as our guide, we show that incorporating an object’s ‘aggregate delay’ into online caching heuristics can improve latencies for practical caching systems by up to 40%. We implement a prototype, Minimum-AggregateDelay (MAD), within a CDN caching node. Using a CDN production trace and backends deployed in different geographic locations, we show that MAD can reduce latencies by 12-18% depending on the backend RTTs.
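A toy illustration (not the paper's MAD implementation) of how delayed hits accumulate latency for a single object, under an assumed fetch latency Z and assumed arrival times: requests that arrive while the miss is outstanding each wait until the fetch completes, and the resulting per-object 'aggregate delay' is the quantity a delayed-hits-aware heuristic can rank on.

```cpp
// Delayed-hit arithmetic for one object: the first request misses and waits
// the full fetch latency Z; later requests arriving before the fetch
// completes are "delayed hits" and each wait the remaining time.
#include <iostream>
#include <vector>

int main() {
    const double Z = 100.0;                               // backend fetch latency (ms), assumed
    std::vector<double> arrivals = {0, 10, 25, 60, 150};  // request times for one object, assumed

    double fetch_done = arrivals[0] + Z;                  // miss issued at t = 0
    double aggregate_delay = Z;                           // the missing request itself waits Z
    for (size_t i = 1; i < arrivals.size(); i++) {
        if (arrivals[i] < fetch_done) {
            // Delayed hit: a classical model would call this a hit, but the
            // request still waits for the outstanding fetch to finish.
            aggregate_delay += fetch_done - arrivals[i];
        }
        // Requests after fetch_done are true hits with ~zero added latency.
    }
    std::cout << "aggregate delay for this object: " << aggregate_delay << " ms\n";
    // A hit-rate-only model charges Z once (100 ms); the true cost here is
    // 305 ms, which is why ranking objects by aggregate delay can cut latency.
}
```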
BibTeX @inproceedings{atre-sigcomm20, author = {Atre, Nirav and Sherry, Justine and Wang, Weina and Berger, Daniel S.}, title = {Caching with Delayed Hits}, year = {2020}, isbn = {9781450379557}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3387514.3405883}, doi = {10.1145/3387514.3405883}, booktitle = {Proceedings of the 2020 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM)}, pages = {495--513}, numpages = {19}, keywords = {Caching, Belatedly, Delayed hits}, location = {Virtual Event, USA}, series = {SIGCOMM '20} }
Abstract Packet scheduling algorithms control the order in which a system serves network packets, which can have significant impact on system performance. Recent work shows that Weighted Shortest Job First (WSJF) -- scheduling packets by the ratio of job size to packet size -- significantly mitigates a system’s susceptibility to algorithmic complexity attacks (ACAs), a particularly dangerous class of Denial-of-Service (DoS) attacks. WSJF relies on knowledge of a packet’s job size, information that is not available a priori. In this work, we explore the theoretical implications of using heuristics for job size estimation. Further, we consider preemption as another technique that may help protect systems when job sizes are unknown. We find that heuristics with certain properties (estimated job-size-to-packet-size ratios increasing monotonically with the actual ratios, step function-based categorization) can protect systems against ACAs with varying guarantees, while preemption alone does not.
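A brief sketch of the kind of estimator this poster analyzes: a step function that maps a packet's true (unknown at runtime) job-size-to-packet-size ratio into a few coarse classes while never decreasing as the true ratio increases. The thresholds are arbitrary placeholders of my own, not values from the work.

```cpp
// Monotone step-function job-size estimator feeding a WSJF-style scheduler:
// a coarser estimate is fine as long as it preserves the ordering property.
#include <array>
#include <cstddef>
#include <iostream>

// Returns a class index that is non-decreasing in the true ratio.
std::size_t step_estimate(double true_ratio_ns_per_byte) {
    const std::array<double, 3> thresholds = {1.0, 10.0, 100.0};  // assumed bin edges
    std::size_t cls = 0;
    while (cls < thresholds.size() && true_ratio_ns_per_byte >= thresholds[cls]) cls++;
    return cls;  // 0 = cheapest class, scheduled first under WSJF
}

int main() {
    for (double r : {0.5, 3.0, 42.0, 500.0})
        std::cout << "true ratio " << r << " ns/byte -> class " << step_estimate(r) << "\n";
}
```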
BibTeX @inproceedings{chiang-sigcomm22-poster, title={Robust Heuristics: Attacks and Defenses on Job Size Estimation for WSJF Systems}, author={Chiang, Erica and Atre, Nirav and Sadok, Hugo and Wang, Weina and Sherry, Justine}, booktitle={Proceedings of the 2022 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM) (Poster Session)}, year={2022} }
Abstract Intrusion Detection and Prevention Systems (IDS/IPSes) are critical components of the service chain for many network deployments. Pigasus is a state-of-the-art, FPGA-based IDS/IPS capable of processing 100Gbps input traffic, while supporting 100K concurrent flows and matching against 10K rules. However, despite its impressive performance, the original version of Pigasus ('Pigasus 1.0') suffers from several shortcomings that hinder its deployment potential. In this demo abstract, we present Pigasus 2.0, a rearchitecture of the Pigasus framework that not only enables efficient, elastic scaling of the datapath to adapt to different traffic workloads, but also provides resilience against algorithmic complexity attacks (ACAs), a potent class of Denial-of-Service (DoS) attacks.
BibTeX @inproceedings{zhao-sigcomm22-poster, title={Pigasus 2.0: Making the Pigasus IDS Robust to Attacks and Different Workloads}, author={Zhao, Zhipeng and Atre, Nirav and Sadok, Hugo and Sahay, Siddharth and Obla, Shashank and Hoe, James C. and Sherry, Justine}, booktitle={Proceedings of the 2022 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM) (Poster Session)}, year={2022} }
Abstract Traditional caching models emphasize hit rate as the principal measure of performance for cache replacement algorithms. However, hit rate alone can be misleading in the presence of a phenomenon known as a delayed hit. Delayed hits occur in high-throughput systems when multiple requests for an object accumulate before the object can be fetched from the backing store. Prior work by Atre et al. has explored the impact of delayed hits in simple caching scenarios, namely single-tier caches with uniform object sizes. In this work we seek to extend that investigation to consider multi-level caches, such as those that might be found in a modern CDN. Furthermore, we extend MAD, the delayed-hits-aware policy proposed by Atre et al., so that it can be deployed in a multi-tier caching system. We evaluate the performance of MAD using a multi-tier cache simulator and an empirical cache configuration based on modern CDNs. Our initial results lead us to believe that delayed hits can still be a prominent factor in the performance of multi-level caches, although their effect may be reduced in comparison to simpler cache configurations.
BibTeX @inproceedings{carleton-sosp2021-poster, title={Delayed Hits in Multi-Level Caches}, author={Carleton, Benjamin}, booktitle={Symposium on Operating Systems Principles (Poster Session)}, year={2021} }