This project presents a new approach for caching in CDNs that uses machine learning to approximate the Belady MIN (oracle) algorithm. To accomplish this complex task, we propose a CDN cache design called Learning Relaxed Belady (LRB) to mimic a Relaxed Belady algorithm, using the concept of Belady boundary. We also propose a metric called good decision ratio to help us make better design decisions. In addition, the paper addresses several challenges to build an end-to-end machine learning caching prototype, including how to gather training data, limit memory overhead, and have lightweight training and prediction.
We have implemented an LRB simulator and a prototype within Apache Traffic Server. Our simulation results with 6 production CDN traces show that LRB reduces WAN traffic compared to a typical production CDN cache design by 4– 25%, and consistently outperform other state-of-the-art methods. Our evaluation of the LRB prototype shows its overhead is modest and it can be deployed on today’s CDN servers.
Recent advances in the field of reinforcement learning promise a general approach to optimize networking systems. This paper argues against the recent trend for generalization by introducing a case study where domain-specific modeling enables the application of lightweight and robust learning techniques.
We study CDN caching systems, which make a good case for optimization as their performance directly affects operational costs, while currently relying on many hand-tuned parameters. In caching, reinforcement learning has been shown to perform suboptimally when compared to simple heuristics. A key challenge is that rewards (cache hits) manifest with large delays, which prevents timely feedback to the learning algorithm and introduces significant complexity.
This paper shows how to significantly simplify this problem by explicitly modeling optimal caching decisions (OPT). While prior work considered deriving OPT impractical, recent theoretical modeling advances change this assumption. Modeling OPT enables even lightweight decision trees to outperform state-of-the-art CDN caching heuristics.
Tail latency is of great importance in user-facing web services. However, maintaining low tail latency is challenging, because a single request to a web application server results in multiple queries to complex, diverse backend services (databases, recommender systems, ad systems, etc.). A request is not complete until all of its queries have completed. We analyze a Microsoft production system and find that backend query latencies vary by more than two orders of magnitude across backends and over time, resulting in high request tail latencies.
We propose a novel solution for maintaining low request tail latency: repurpose existing caches to mitigate the effects of backend latency variability, rather than just caching popular data. Our solution, RobinHood, dynamically reallocates cache resources from the cache-rich (backends which don’t affect request tail latency) to the cache-poor (backends which affect request tail latency). We evaluate RobinHood with production traces on a 50- server cluster with 20 different backend systems. Surprisingly, we find that RobinHood can directly address tail latency even if working sets are much larger than the cache size. In the presence of load spikes, RobinHood meets a 150ms P99 goal 99.7% of the time, whereas the next best policy meets this goal only 70% of the time.
Many recent caching systems aim to improve miss ratios, but there is no good sense among practitioners of how much further miss ratios can be improved. In other words, should the systems community continue working on this problem?
Currently, there is no principled answer to this question. In practice, object sizes often vary by several orders of magnitude, where computing the optimal miss ratio (OPT) is known to be NP-hard. The few known results on caching with variable object sizes provide very weak bounds and are impractical to compute on traces of realistic length.
We propose a new method to compute upper and lower bounds on OPT. Our key insight is to represent caching as a min-cost flow problem, hence we call our method the flow-based offline optimal (FOO). We prove that, under simple independence assumptions, FOO’s bounds become tight as the number of objects goes to infinity. Indeed, FOO’s error over 10M requests of production CDN and storage traces is negligible: at most 0.3%. FOO thus reveals, for the first time, the limits of caching with variable object sizes. While FOO is very accurate, it is computationally impractical on traces with hundreds of millions of requests. We therefore extend FOO to obtain more efficient bounds on OPT, which we call practical flow-based offline optimal (PFOO). We evaluate PFOO on several full production traces and use it to compare OPT to prior online policies. This analysis shows that current caching systems are in fact still far from optimal, suffering 11–43% more cache misses than OPT, whereas the best prior offline bounds suggest that there is essentially no room for improvement.
Most major content providers use content delivery networks (CDNs) to serve web and video content to their users. A CDN is a large distributed system of servers that caches and delivers content to users. The first-level cache in a CDN server is the memory-resident Hot Object Cache (HOC). A major goal of a CDN is to maximize the object hit ratio (OHR) of its HOCs. But, the small size of the HOC, the huge variance in the requested object sizes, and the diversity of request patterns make this goal challenging.
We propose AdaptSize, the first adaptive, size-aware cache admission policy for HOCs that achieves a high OHR, even when object size distributions and request characteristics vary significantly over time. At the core of AdaptSize is a novel Markov cache model that seamlessly adapts the caching parameters to the changing request patterns. Using request traces from one of the largest CDNs in the world, we show that our implementation of AdaptSize achieves significantly higher OHR than widelyused production systems: 30-48% and 47-91% higher OHR than Nginx and Varnish, respectively. AdaptSize also achieves 33-46% higher OHR than state-of-the-art research systems. Further, AdaptSize is more robust to changing request patterns than the traditional tuning approach of hill climbing and shadow queues studied in other contexts.
Meeting tail latency Service Level Objectives (SLOs) in shared cloud networks is known to be an important and challenging problem. The main challenge is determining limits on the multitenancy such that SLOs are met. This requires calculating latency guarantees, which is a difficult problem, especially when tenants exhibit bursty behavior as is common in production environments. Nevertheless, recent papers in the past two years (Silo, QJump, and PriorityMeister) show techniques for calculating latency based on a branch of mathematical modeling called Deterministic Network Calculus (DNC). The DNC theory is designed for adversarial worst-case conditions, which is sometimes necessary, but is often overly conservative. Typical tenants do not require strict worstcase guarantees, but are only looking for SLOs at lower percentiles (e.g., 99th, 99.9th).
This paper describes SNC-Meister, a new admission control system for tail latency SLOs. SNC-Meister improves upon the state-of-the-art DNC-based systems by using a new theory, Stochastic Network Calculus (SNC), which is designed for tail latency percentiles. Focusing on tail latency percentiles, rather than the adversarial worst-case DNC latency, allows SNC-Meister to pack together many more tenants: in experiments with production traces, SNC-Meister supports 75% more tenants than the state-of-the-art. We are the first to bring SNC to practice in a real computer system.
Modern packet-switched communication networks use resource sharing via statistical multiplexing, which requires routers to buffer packets temporarily when they cannot be forwarded immediately. Packet buffering mitigates the risk for packet loss, but contributes to network latency. The Non-Intrusive Delay Differentiation (NIDD) project seeks to offer applications a choice of different delay guarantees without side effects.
Frequency jamming is the fiercest attack tool to disrupt wireless communication and its malicious aspects have received much attention in the literature. Yet, several recent works propose to turn the table and employ so-called friendly jamming for the benefit of a wireless network. For example, recently proposed friendly jamming applications include hiding communication channels, injection attack defense, and access control. This project investigates the practical viability of friendly jamming by applying it in a real-world network.
This project develops a friendly jamming application on customer grade access points (the cheap WRT54GL, skip text and download here) and conducted a three weeks real-world study on the jammer's performance and side-effects on legitimate traffic (the cost of jamming) in a university office environment. Our results provide detailed insights on crucial factors governing the tradeoff between the effectiveness of friendly jamming (we evaluated up to 13 jammers) and its cost.
Adversarial Queueing Theory (AQT) has shown that seemingly innocent traffic injection rates might lead to unbounded queues in packet-switched networks - depending on scheduling strategies as well as topological characteristics. Little attention has been given to quantifying these effects in realistic network configurations. In particular, the existing AQT literature makes two unrealistic assumptions: infinite buffers and perfect synchrony. Because finite buffers inherently limit queue sizes, adversarial effects ultimately lead to packet loss which we address in this work. In addition, we study the effect of imperfect network synchronization under the packet loss metric. Our results, using analysis and simulation, indicate that classical AQT examples appear harmless under realistic assumptions but for a novel class of adversaries considerably higher loss can be observed. We introduce this class by giving examples of two new AQT concepts to construct loss-efficient network adversaries. Our analysis proves the robustness of these new adversaries against randomized de-synchronization effects in terms of variable link delays and nodal processing.