CMU CSD PhD BlogZola2024-08-26T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/atom.xmlModeling BBRv1's Interactions with Loss-Based Congestion Control2024-08-26T00:00:00+00:002024-08-26T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/modeling-bbr/
<p>In 2006, Senator Ted Stevens infamously described the Internet as a “series of tubes”, where “filled tubes” delayed his email. While Senator Stevens was ridiculed at the time for his simplistic understanding of the Internet, his characterization was not entirely wrong:</p>
<blockquote>
<p>“It’s a series of tubes. And if you don’t understand, those tubes can be filled and if they are filled, when you put your message in it, it gets in line and it’s going to be delayed by anyone that puts into that tube enormous amounts of material…”</p>
</blockquote>
<p>The more technical term for what Senator Stevens describes here is congestion. Logical “tubes” are created between two end-points across a network by transport protocols like TCP. These tubes run over physical links with limited capacity which may become filled by competing traffic. If end-points send more data into the network than the capacity of these links, the network becomes overloaded and congested, delaying packets and degrading throughput. If congestion goes unchecked, the network can become so slow it grinds to a halt, a condition called congestion collapse.</p>
<p>Historically, all senders could share an overloaded network because they all used the same congestion control algorithm (CCA). For example, if all senders used a loss-based CCA like TCP Reno, then all senders could share the network both fairly and efficiently, <em>without explicit coordination between the senders.</em> However, over the 30 years since TCP Reno’s deployment, large content providers have proposed and deployed novel CCAs to keep up with the growing demands of a faster Internet, emerging applications like video conferencing, and billions of mobile users. Since the sender must implement congestion control, content providers can deploy any CCA. Consequently, if we are moving away from a homogeneous deployment of one loss-based CCA toward a wild west of many CCAs, the fairness of the Internet is at stake.</p>
<p>Most notable of these new algorithms is TCP BBR, proposed and deployed by Google in 2016, including an open-source implementation in the Linux kernel. When Google published BBR, a surge of researchers began studying the algorithm: could BBR share fairly with the already widely deployed loss-based CCA TCP Cubic? Several studies showed empirically, with experiments in controlled testbeds, that BBR could be very unfair to Cubic, but offered little explanation of why. Consequently, in this work, we ask if we can determine when and why BBR is unfair to Cubic.</p>
<p>In the past, researchers have used modeling to understand a CCA’s behavior. For example, the <a rel="noopener" target="_blank" href="https://www.cs.utexas.edu/%7Elam/395t/papers/Mathis1998.pdf">Mathis equation</a> showed that TCP Reno’s throughput is inversely proportional to the square root of the loss rate. So in this work, we build a similar model to understand a BBR flow’s throughput when competing with other flows.</p>
<p>Our model has an important finding: BBR’s throughput when competing with loss-based algorithms does not depend on the number of competing loss-based flows. Regardless of the number of competing flows, BBR flows will always achieve a fixed fraction of the available throughput.</p>
<h1 id="how-does-bbrv1-work">How does BBRv1 work?</h1>
<p>BBR is designed to be a rate-based algorithm. In contrast to window-based TCP variants (for example Cubic and Reno) which control the amount of in-flight data, BBR’s goal is to figure out what its fair share of the bottleneck link is and to pace the sending of packets at exactly that rate.</p>
<p>A BBR flow determines its sending rate by repeatedly probing for bandwidth (in its ProbeBW state) over 8 round trip times (RTTs). For 6 RTTs, BBR will send at rate \(R\), its current maximum achieved throughput. Then BBR will send at a higher rate to see if it can get better throughput. It does this by increasing its sending rate to \(1.25R\) for 1 RTT and observing the achieved throughput. Finally, it will reduce its sending rate to \(0.75R\) for 1 RTT to drain any excess packets from the queue. If a BBR flow observes a higher achieved throughput, it will adjust its sending rate \(R\) to what it now thinks is its maximum throughput. In total, these steps (ProbeBW) take 8 RTTs. BBR will repeat this 8 RTT cycle over and over again.</p>
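The 8-RTT ProbeBW schedule described above can be sketched as a fixed cycle of pacing gains. This is a simplified illustration of the cycle from the text, not the kernel implementation, and the function names are ours:

```python
# BBRv1's ProbeBW pacing-gain cycle, as described above: 1 RTT probing
# at 1.25x, 1 RTT draining at 0.75x, then 6 RTTs cruising at 1.0x.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def pacing_rate(max_bw_estimate: float, rtt_index: int) -> float:
    """Sending rate during the given RTT of the 8-RTT ProbeBW cycle."""
    gain = PROBE_BW_GAINS[rtt_index % len(PROBE_BW_GAINS)]
    return gain * max_bw_estimate

# Over one full cycle the average gain is 1.0, so in steady state the
# flow's throughput matches its bandwidth estimate R.
rates = [pacing_rate(10.0, i) for i in range(8)]  # R = 10 Mbps
print(rates[0], rates[1], sum(rates) / 8)  # 12.5 7.5 10.0
```

Because the probe and drain RTTs average out, a solo BBR flow paces at its estimated bandwidth overall while still periodically testing for more.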
<p>When one BBR flow is not sharing the bottleneck link with any other flow, repeatedly probing for bandwidth allows BBR to both maximize throughput and minimize delay by figuring out exactly how much bandwidth is available.</p>
<p>But what happens during ProbeBW when BBR competes with Reno or Cubic?</p>
<p>During ProbeBW, by putting additional data into the network when the bottleneck link and queue are already full, BBR will cause packet loss. Because Reno and Cubic are loss-based algorithms, they will reduce their sending rates in response to packet loss.</p>
<p>BBR, on the other hand, will not reduce its rate; instead, it will see that it was able to get better throughput and will increase its sending rate. This causes Reno and Cubic to end up with less bandwidth than BBR.</p>
<p><img src="./BBR_fig1.png" alt="1-bbr-vs-1-cubic" /></p>
<p><em><strong>Fig. 1:</strong> 1 BBR vs. 1 Cubic. (10 Mbps network, 32 x bandwidth delay product queue). During ProbeBW, BBR causes Cubic to back off.</em></p>
<p>During ProbeBW, BBR will put more and more packets into the bottleneck queue, increasing its bandwidth estimate. This process will repeat over and over again. While the Cubic flows back off, BBR will push more and more packets into the queue. We can observe this behavior in Figure 1.</p>
<p>Given what we know about BBR so far, during ProbeBW, BBR should just keep putting more and more packets into the network until Cubic is starved. However, from Figure 1, we see that eventually BBR stops probing for more bandwidth.</p>
<h1 id="under-competition-bbr-is-window-limited">Under competition, BBR is window-limited</h1>
<p>Surprisingly, we find that under competition, BBR’s rate is not determined by its bandwidth estimate but by a window-limit called the <em>in-flight cap</em>.</p>
<p>BBR limits the amount of data in flight to two bandwidth-delay products (BDP) as a safety mechanism against delayed and stretched ACKs. The BDP is the product of BBR’s bandwidth estimate and its estimate of the end-to-end latency.</p>
<p>We find that this in-flight cap is what ultimately dictates what fraction of the link a BBR flow will get when competing with other flows and is what stops BBR during ProbeBW in Figure 1.</p>
<p>Therefore, if we can model the in-flight cap, we can figure out what BBR’s throughput will be when competing with loss-based flows.</p>
<h1 id="a-simple-model-for-bbr-s-throughput">A simple model for BBR’s throughput</h1>
<p>To derive a simple model of BBR’s in-flight cap, and consequently its throughput, we assume that we have 1 BBR flow vs. any number of Cubic flows in a deep-buffered network (for example in Figure 1 we set the buffer size to 32 BDP).</p>
<p><img src="./BBR_fig2.png" alt="variables in simple BBR model" />
<em><strong>Fig. 2:</strong> Variables in simple BBR model.</em></p>
<p>First, we define some variables in the model shown in Figure 2. This illustrates what the bottleneck link and queue might look like. We assume the bottleneck link capacity is \(c\) and the bottleneck queue size is \(q\). If the Cubic flows occupy \(p\) fraction of the queue, we assume that 1 BBR flow occupies the remaining \((1-p)\) fraction of the queue.</p>
<p><img src="./BBR_fig3.png" alt="A simple model for BBR’s queue occupancy/throughput." />
<em><strong>Fig. 3:</strong> A simple model for BBR’s queue occupancy/throughput. This model says 1 BBR flow can get up to half of the available queue/link capacity when competing with any number of Cubic flows.</em></p>
<p>Given this, we can draw Cubic’s queue occupancy vs. BBR’s, as shown in Figure 3. First, the blue line shows that if Cubic occupies a \(p\) fraction of the queue, then BBR must have the remaining \((1-p)\) fraction of the queue in flight.</p>
<p>Next, we need to model what BBR’s bandwidth and RTT estimates will be, so we can also draw the 2 BDP in-flight cap. BBR’s bandwidth estimate is equivalent to BBR’s fraction of the queue (which we have already said is \(1-p\) times the link capacity), so its bandwidth estimate is \((1-p)c\).</p>
<p>BBR determines its RTT estimate by continually tracking the minimum RTT it has seen over the last 10 seconds. Every 10 seconds, BBR enters its ProbeRTT state and lowers its in-flight data to four packets, so it can drain its own packets from the queue and measure the RTT without any self-induced queueing delay. Since BBR drains its packets from the queue, the packets remaining in the queue belong to Cubic. Thus, assuming negligible propagation delay, BBR’s RTT estimate will be Cubic’s queue occupancy divided by the link capacity: \((pq) / c\).</p>
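The 10-second minimum-RTT filter can be sketched as follows. This is a simplified stand-in (the class name is hypothetical, and the real implementation couples this filter with the ProbeRTT state machine):

```python
from collections import deque

class MinRttFilter:
    """Sketch of BBR's min-RTT tracking: the minimum RTT sample
    observed over a sliding 10-second window."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, rtt) pairs

    def update(self, now: float, rtt: float) -> float:
        self.samples.append((now, rtt))
        # Discard samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return min(r for _, r in self.samples)

f = MinRttFilter()
f.update(0.0, 0.050)
est = f.update(5.0, 0.080)    # the 50 ms sample is still in the window
est2 = f.update(12.0, 0.080)  # the 50 ms sample has aged out
print(est, est2)  # 0.05 0.08
```

Once BBR's own packets no longer drain from the queue, every fresh sample includes Cubic's standing queue, so after 10 seconds the filter's output rises to \((pq)/c\).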
<p>Thus, BBR’s in-flight cap is $$2 \cdot \mathrm{BW} \cdot \mathrm{RTT} = 2 \cdot (1-p)c \cdot \frac{pq}{c} = 2q(p - p^2).$$ This quadratic equation for the in-flight cap is the green line in Figure 3.</p>
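The two curves from Figure 3 can be evaluated directly (a small sketch; the function names are ours, not from the paper):

```python
def bbr_inflight_cap(p: float, q: float) -> float:
    """In-flight cap from the simple model:
    2 * BW * RTT = 2 * (1-p)*c * (p*q)/c = 2*q*(p - p**2)."""
    return 2.0 * q * (p - p * p)

def bbr_queue_share(p: float, q: float) -> float:
    """BBR's in-flight data if Cubic holds fraction p of a queue of size q."""
    return (1.0 - p) * q

# The curves intersect where (1-p)q = 2q(p - p^2), i.e. at p = 1/2:
q = 32.0  # queue size in BDPs, as in Figure 1
print(bbr_inflight_cap(0.5, q), bbr_queue_share(0.5, q))  # 16.0 16.0
```

Note that the link capacity \(c\) cancels out of the cap entirely, which is why the model depends only on \(p\) and \(q\).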
<p>Returning to Figure 1, what we are seeing here is BBR moving up this blue line, putting more and more data into the queue until the amount of data it has in-flight meets its in-flight cap. This intersection is illustrated in Figure 3 by the dashed orange line.</p>
<p>This model shows that 1 BBR flow can get up to half of the link when competing with any number of Cubic flows! This is why we see unfairness between 1 BBR flow and 2 or more Cubic flows.</p>
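The half-link result follows directly by intersecting the blue and green lines of Figure 3:

$$ (1-p)q = 2q(p - p^2) \implies 1 - p = 2p(1-p) \implies p = \tfrac{1}{2} \quad (p \neq 1), $$

so Cubic’s fraction at the intersection is \(p = 1/2\), leaving \(1-p = 1/2\) of the queue, and hence of the link, to the single BBR flow, no matter how many Cubic flows make up the other half.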
<h1 id="a-more-robust-model-for-bbr-s-throughput">A more robust model for BBR’s throughput</h1>
<p>Thus far, we have made many simplifying assumptions to make a simple model of BBR’s throughput. Our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3355369.3355604">IMC</a> paper builds a more complete model of BBR’s throughput when competing with any number of loss-based flows, relaxing those assumptions. The robust model is shown in Figure 4.</p>
<p><img src="./BBR_fig4.png" alt="A robust model for BBR’s queue occupancy/throughput." />
<em><strong>Fig. 4:</strong> The robust model for BBR’s throughput/link fraction when competing with loss-based flows. Notably, none of the variables in the model depend on the number of loss-based flows.</em></p>
<p>Figure 4 highlights the variables that impact BBR’s fraction of the link and throughput. Here the queue size is expressed as a multiple of the BDP, \(q = Xcl\), where \(X\) is the multiple, \(c\) is the link capacity, and \(l\) is the link propagation delay. In addition, \(N\) is the number of BBR flows. Notably, none of these variables depends on the loss rate or the number of loss-based flows. Consequently, BBR can be unfair to Cubic when there are more Cubic flows than BBR flows, because the BBR flows’ aggregate fraction of the link is not proportional to the number of flows.</p>
<p>The robust model predicts BBR’s throughput with a median error of 5% when competing against Cubic flows, and 8% when competing against Reno flows.</p>
<p>In summary, this model has two important explanations for why BBR can be unfair to Cubic:</p>
<ol>
<li>While BBR is supposed to be a rate-based algorithm, when competing, BBR is window-limited. As a result, although one of BBR’s goals is to minimize queueing delay, it will fill network buffers when competing with loss-based algorithms.</li>
<li>BBR’s throughput when competing with loss-based algorithms does not depend on the number of competing loss-based flows. As a result, a single BBR flow will grab a fixed fraction of the link regardless of the number of competing flows.</li>
</ol>
<p>Google recently proposed <a rel="noopener" target="_blank" href="https://datatracker.ietf.org/meeting/117/materials/slides-117-ccwg-bbrv3-algorithm-bug-fixes-and-public-internet-deployment-00">BBRv3</a> (preceded by <a rel="noopener" target="_blank" href="https://datatracker.ietf.org/meeting/112/materials/slides-112-iccrg-bbrv2-update-00">BBRv2</a>) which aims to address the fairness issues discussed in this work and other fairness concerns by incorporating loss into BBR’s model of the network. An interesting direction for future work is to study BBRv3’s interactions with loss-based CCAs.</p>
<p>This blog is based on a paper published at IMC 2019: <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu//%7Erware/assets/pdf/ware-imc2019.pdf">Modeling BBR’s Interactions with Loss-Based Congestion Control</a>.</p>
miniCodeProps: a Benchmark for Proving Code Properties2024-08-23T00:00:00+00:002024-08-23T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/mini-code-props/<p>It is nearly inevitable that bugs will appear in a codebase during software development. To catch these bugs before they lead to real-world consequences,
the formal verification community has developed a wide variety of tools for ensuring code correctness. These tools fall
into two main classes: Automated Reasoning and Interactive Theorem Proving. Unfortunately, proving code properties with either of these approaches tends
to require significant effort from human experts. In this blog post, we describe early steps towards using the emerging capabilities of <a rel="noopener" target="_blank" href="https://aws.amazon.com/what-is/large-language-model/">Large Language Models (LLMs)</a>
to automate the labor-intensive portions of the Interactive Theorem Proving paradigm. In particular, we introduce <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2406.11915">miniCodeProps</a>, a benchmark
for automated Interactive Theorem Proving on code properties.</p>
<h1 id="background">Background</h1>
<h2 id="verification-with-automated-reasoning">Verification with Automated Reasoning</h2>
<p>Automated Reasoning tools such as boolean satisfiability solvers (a.k.a. SAT solvers) take formulas in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">a form of propositional logic</a>
as input and use “<a rel="noopener" target="_blank" href="https://cacm.acm.org/research/the-science-of-brute-force/">brute reasoning</a>” to search for
a variable assignment such that the input formula evaluates to True. If no such assignment exists, a <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emheule/publications/p01c15-prf.pdf">proof</a> of this
fact is returned. If we want to verify that the outputs of a function \(y = f(x)\) satisfy property \(p(y)\), we can encode \(\exists x, \lnot p(f(x))\) into conjunctive normal form (CNF) and run a SAT solver.
If \(f\) fails to satisfy \(p\) on any input(s), the solver will return one such input as a satisfying assignment to the formula. Otherwise, the returned proof that the formula
is unsatisfiable is equivalent to a proof that \(\forall x, p(f(x))\), i.e. \(f\) satisfies \(p\) on all possible inputs. </p>
<p>For example, \(f\) may be a function that takes an unsorted list of numbers \(x\) as input and returns \(y\), a new sorted list. If \(p(y)\) mathematically encodes the statement
“the list \(y\) is ordered from least to greatest”, then solving the aforementioned \(\exists x, \lnot p(f(x)) \) is effectively asking the SAT solver to search for input lists
that cause \(f\) to not return a sorted list. In practice, verifying a sorting algorithm requires proving other properties as well. If \(f\) always sets \(y\) to an empty list,
\(p(y)\) will always be True! We follow up with this example in a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#sorting">later section</a>.</p>
<p>Satisfiability Modulo Theory (SMT) solvers allow more complicated variable types such as integers and arrays, as well as clauses that use for-all quantifiers. These extensions make encoding correctness properties
simpler, but make <a rel="noopener" target="_blank" href="https://leodemoura.github.io/files/SMTProofs.pdf">producing proofs of unsatisfiability much more difficult</a>.
While this formulation succeeds in many practical settings, solving SAT and SMT formulas
is an NP-hard problem. In practice, when an input formula takes prohibitively long to solve, a human expert must modify the problem encoding by adding extra information such as
an inductive invariant hint to reduce the search space. Although not the focus of this post, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.17807">recent work</a> has shown that
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08774">OpenAI’s GPT-4 LLM</a> can be used to reliably produce SMT encodings of some problems.</p>
<h2 id="leanExample">Verification with Interactive Theorem Proving</h2>
<p>In contrast to the Automated Reasoning approach, Interactive Theorem Provers (ITPs) were explicitly designed to include a human expert in the proving process. Well known ITP
environments such as <a rel="noopener" target="_blank" href="https://isabelle.in.tum.de/overview.html">Isabelle</a>, <a rel="noopener" target="_blank" href="https://coq.inria.fr/">Coq</a>, and <a rel="noopener" target="_blank" href="https://lean-lang.org/">Lean</a> require that the user specify the
program and the property to be proved in mathematical languages unique to each ITP, which are roughly as expressive
as the SMT language while remaining more human-readable. The user is then presented with the current state of the proof, containing the goal(s) and other relevant information.
The user then adds lines to a growing proof of the property in which each line modifies the goal(s) in a mathematically valid way that is verified by the ITP until
all goals are proven. At first glance, this paradigm seems far inferior to Automated Reasoning in terms of scaling potential because a human expert is an integral part of the
proving process. However, recent advances in LLM capabilities provide hope that Artificial Intelligence (AI) will soon be able to automate proof writing in ITP environments.</p>
<p>Our work uses Lean 4 to state and prove theorems about code. The image below is a sample Lean file containing two functions involving
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Binary_tree">binary trees</a>: <code>tree_size</code> and <code>balanced</code> (section B). Section C contains a property of these functions written
in Lean, namely that when <code>balanced</code> returns true on any tree <code>t</code>, <code>tree_size(t)</code> returns an odd number.
The definition of <code>Odd</code> is part of mathlib (Lean’s library of mathematical statements and theorems), which is imported in section A.
Section D contains the proof of this property. The proof state (section E) displays Lean’s internal proof environment at the location of the cursor, which in this case is
line 20 in section D. The yellow items in section E describe objects available for use in the proof, i.e. <code>p</code> and <code>q</code> are the left and right branches of the input tree.
The proof state also contains facts available for use in the proof, i.e. <code>hb</code> stores the fact that a tree with <code>x</code> as the root and <code>p</code> and <code>q</code> as branches is balanced.
The final line of section E is the “goal” of the proof, which by line 20 has been simplified to showing that adding 1 to the tree sizes of <code>p</code> and <code>q</code> results in an odd number.</p>
<img src="lean_example.png" alt="Example Lean environment showing file context, imports, and proof state" width="1000"/>
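For readers who cannot see the image, the definitions it describes look roughly like the following Lean 4 sketch. The tree type, constructor shapes, and the commented-out theorem are our reconstruction, not the exact benchmark file:

```lean
-- Hypothetical reconstruction of sections A–C of the figure.
inductive MyTree (α : Type) where
  | leaf : MyTree α
  | node : MyTree α → α → MyTree α → MyTree α

def tree_size : MyTree α → Nat
  | .leaf => 1
  | .node p _ q => 1 + tree_size p + tree_size q

def balanced : MyTree α → Bool
  | .leaf => true
  | .node p _ q => tree_size p == tree_size q && balanced p && balanced q

-- The property from section C: a balanced tree has odd size.
-- theorem balanced_size_odd (t : MyTree α) (hb : balanced t) :
--     Odd (tree_size t) := by
--   ... (section D of the figure)
```

Intuitively the property holds because a balanced node has two equal-sized subtrees, so its size is \(1 + 2k\) for some \(k\), which is odd.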
<h1 id="new-applications-of-llms">New Applications of LLMs</h1>
<h2 id="mathematical-theorem-proving">Mathematical theorem proving</h2>
<p>Mathematics research currently relies on extensive peer review to spot errors in new publications. This process can be difficult and time consuming, without any guarantees
on review quality. To address this problem, several well-known mathematicians have begun to
<a rel="noopener" target="_blank" href="https://terrytao.wordpress.com/2023/12/05/a-slightly-longer-lean-4-proof-tour/">formalize parts of their work in Lean</a>. From a computer scientist’s point of view, this context provides
several possible avenues of research, such as generating new formal and informal proofs and translating between the two types. In this post, we focus on formal proof generation.
Recently, LLMs fine-tuned on medium-sized formal math datasets such as Lean’s Mathlib have shown state-of-the-art performance on the formal mathematical proof
benchmark <a rel="noopener" target="_blank" href="https://github.com/openai/miniF2F">miniF2F</a> (see <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.00656">lego-prover</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.10631">llemma</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2210.12283">Draft-Sketch-Prove</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2205.11491">HTPS</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2405.14333">DeepSeek-Prover</a>, and this <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2312.14188">improved data sampling method</a>).</p>
<h2 id="code-generation">Code Generation</h2>
<p>Code generation, modification, and repair have been active areas of research for decades. Recent work has shown <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2212.09420">significant progress</a>
on influential benchmarks such as <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/code-generation-on-humaneval">HumanEval</a> and <a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/mbpp">MBPP</a>.
In principle, the advances in this area are directly applicable to formal theorem proving.
Lean 4 is both a programming language and an ITP, so generating Lean proofs can also be viewed as generating code in the Lean programming language.
Additionally, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2307.02503">several code generation models</a> can generate accompanying natural language explanations of generated code. If a language model can explain
how code works, it may also be able to generate proofs of properties of said code.</p>
<h1 id="challenges-and-mitigations">Challenges and Mitigations</h1>
<p>Although LLMs can prove formal mathematical theorems and explain generated code, the niche of proving program properties with LLMs is underexplored due to several technical challenges.</p>
<h2 id="incorporating-context">Incorporating Context</h2>
<p>Until recently, input size constraints have been a well-known problem for LLMs. In particular, early LLMs could only process between hundreds and thousands of words at a time
due to various architecture choices and constraints. Recent advances have significantly increased the effective allowed input size:
see <a rel="noopener" target="_blank" href="https://agi-sphere.com/context-length/">this post</a> for an introduction to the topic.
In early work on using LLMs for interactive theorem proving, only the proof state (see section E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">image above</a>) was used as input due to input size constraints.
Increases in allowed input length have allowed prompts about code to also include code dependencies and file context, which are both useful to humans when reasoning about ITP proofs.</p>
<h2 id="hallucinations">Hallucinations</h2>
<p>As people use Large Language Models (LLMs) such as ChatGPT more often and for an increasing variety of tasks,
LLM <a rel="noopener" target="_blank" href="https://www.ibm.com/topics/ai-hallucinations">hallucinations</a> have garnered significant attention. Essentially, LLMs sometimes fabricate
information that appears plausible but does not hold up under scrutiny. For example, an LLM might produce a proof that contains logical errors or uses nonexistent lemmas.
Many partial solutions exist for handling hallucinations; some examples include <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08896">self-consistency</a>
and <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval Augmented Generation (RAG)</a>.</p>
<p>ITPs provide a somewhat unique context where hallucinations are caught immediately by the ITP’s internal verifier. When
an LLM produces an invalid proof step, the error message produced by the ITP can also be used to prompt the LLM for an alternate proof step (see <a rel="noopener" target="_blank" href="https://www.arxiv.org/abs/2408.08152">DeepSeek-Prover-V1.5</a>),
similarly to how humans interact with an ITP.</p>
<h2 id="benchmark-availability">Benchmark Availability</h2>
<p>In large part, progress on tasks such as code generation and (in)formal math proofs is driven by reporting progress on widely accepted benchmarks such as <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/code-generation-on-humaneval">HumanEval</a>, <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/math-word-problem-solving-on-math">MATH</a>, and <a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/minif2f">miniF2F</a>.
At the time of writing, no such benchmark exists for proofs of code properties. The main contribution of our work in this field is the creation of <a rel="noopener" target="_blank" href="https://huggingface.co/datasets/elohn/miniCodeProps">miniCodeProps</a>, a new benchmark containing
a variety of programs and corresponding properties to be proven in Lean 4. We intend that miniCodeProps be used to benchmark the capabilities of LLMs to produce correct proofs
of code properties.</p>
<h1 id="benchmark-minicodeprops">Benchmark: miniCodeProps</h1>
<p>miniCodeProps is intended to mirror the utility of miniF2F (a formal mathematical theorem proving benchmark) in the space of proving properties of code.
We describe the way the benchmark was created and our baseline experiments with several techniques from code generation and theorem proving literature.</p>
<h2 id="benchmark-collection">Benchmark Collection</h2>
<p>The programs and associated properties in miniCodeProps were all sourced from the <a rel="noopener" target="_blank" href="https://tip-org.github.io/">Tons of Inductive Problems (TIP)</a> dataset. We selected files from TIP with properties
describing functions defined in TIP, then translated those properties and functions from Haskell to Lean 4. During the translation process, we were required to prove several
lemmas in Lean regarding the termination of the recursive functions being defined. These lemmas are also properties of the functions translated from TIP, and are also included
in the benchmark. Each example in our benchmark contains the text of sections A, B, C, and E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">example</a> above, where section E is the initial proof state.
An automated ITP system succeeds on a given example by producing a correct section D, where correctness is verified by Lean. The next section explores the common methods such ITP
systems use to produce proofs.</p>
<h1 id="methods">Methods</h1>
<p>In this section we will address the following questions: </p>
<ul>
<li>How do you programmatically check generated proofs?</li>
<li>What are current and potential future techniques used to generate proofs using LLMs? </li>
</ul>
<p>There are two main classes of proof generation techniques in current automated ITP literature: </p>
<ol>
<li>Next-Step tactic prediction: one line is generated at a time until the proof is complete</li>
<li>Full-Proof generation: entire candidate proofs are generated until one is found to be correct</li>
</ol>
<p>In practice, researchers designate a computational budget (for example, 8 attempts at Full-Proof generation) and terminate the proof search once this budget is exhausted without a successful proof. We evaluate both approaches on miniCodeProps.</p>
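<p>The budgeted loop described above can be sketched in a few lines. In this illustrative sketch, <code>generate</code> and <code>check</code> are hypothetical stand-ins for an LLM call and the Lean kernel, respectively; neither name comes from our actual implementation.</p>

```python
def prove_with_budget(generate, check, theorem, budget=8):
    """Sample full candidate proofs until one verifies or the budget runs out.

    `generate` stands in for an LLM producing a candidate proof string;
    `check` stands in for the Lean kernel validating it against the theorem.
    """
    for _ in range(budget):
        candidate = generate(theorem)
        if check(theorem, candidate):
            return candidate  # a verified proof
    return None  # budget exhausted without a successful proof

# Toy usage: a "generator" that always suggests `simp`, and a "checker"
# that accepts exactly that tactic.
proof = prove_with_budget(lambda t: "simp", lambda t, p: p == "simp", "thm")
```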
<h2 id="interaction-with-lean">Interaction with Lean</h2>
<p>Interaction with the Lean 4 kernel is necessary for most Next-Step tactic generation and self-refinement Full-Proof generation methods. Our work uses the <a rel="noopener" target="_blank" href="https://github.com/leanprover-community/repl">Lean REPL</a>, a tool that
facilitates backtracking and continuing from specific steps in the proof. Each time Lean code is generated, Lean REPL checks the validity of the line in the context of the definitions
and earlier proof lines. The REPL returns error messages if any invalid steps were taken, or the new proof state containing the list of remaining goals (statements to prove) otherwise.
When the list of goals in the proof state is empty, the original theorem has been proven correct.</p>
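<p>The control flow just described can be sketched as a small helper that classifies a REPL-style reply. The dictionary fields below follow the general shape of Lean REPL responses but are simplified for illustration and should not be read as the exact protocol.</p>

```python
def interpret_reply(reply: dict):
    """Classify a (simplified) Lean-REPL-style reply.

    Returns ("error", msgs) if the step was invalid, ("proved", []) if no
    goals remain, and ("open", goals) otherwise.
    """
    errors = [m["data"] for m in reply.get("messages", [])
              if m.get("severity") == "error"]
    if errors:
        return ("error", errors)
    goals = reply.get("goals", [])
    if not goals:
        return ("proved", [])  # empty goal list: theorem proven
    return ("open", goals)

# Example replies, mimicking the three cases:
ok = interpret_reply({"goals": []})                          # proof complete
step = interpret_reply({"goals": ["⊢ ordered (hsort xs)"]})  # goals remain
bad = interpret_reply({"messages": [{"severity": "error",
                                     "data": "unknown tactic"}]})
```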
<h2 id="next-step-tactic-generation">Next-Step Tactic Generation</h2>
<p>At the beginning of a proof and after each valid line, the Lean kernel generates a proof state, i.e. a collection of the variables and hypotheses defined in the current context
(section E of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">the earlier example</a>).
As Lean is also a programming language, the proof state can also be thought of as a debug trace of the current context (theorems are effectively functions that produce certificates that a property holds! See <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence">Curry-Howard</a>). Tactics are functions that modify the proof state. Common examples include <code>simp</code>, a
broadly useful tactic that simplifies the goal by rewriting with a large library of lemmas, and <code>rw</code>, which rewrites the goal using specific lemmas provided by the user.</p>
<p>The most basic variant of Next-Step tactic generation is a function from a proof state to a set of possible next tactics. There are many ways to extend this idea: for example,
expanding the input to include other relevant Lean code and lemmas, or expanding the output to attach to each candidate tactic a “confidence” score estimating the likelihood that the proof
can be completed using it as the next step. In the earlier example, a successful Next-Step tactic prediction given the proof state in section E would be the line
after the cursor in section D, i.e. <code>unfold balanced at hb</code>. To produce a full proof, the system starts from the initial proof state and repeatedly generates candidate next tactics
until the proof state has an empty list of goals.</p>
<h2 id="full-proof-generation">Full-Proof Generation</h2>
<p>Researchers have discovered that LLMs in some cases exhibit <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2301.00234">In-Context Learning</a>: the ability to generalize patterns
from a small number of examples provided in the prompt. Additionally,
the data that massive LLMs such as GPT-4 are trained on contains examples of proofs in Lean 3, as well as in other proof assistants such as Isabelle and Coq.
Therefore, it is reasonable to expect that LLMs could generate full proofs of a given theorem statement given example pairs of theorem statement and proof.
Concretely, the “theorem statement” used is generally sections A, B, C, and optionally E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">earlier example</a>, while the expected output is an entire valid
proof (section D). In our experiments we ignored initial proof state (section E) as it was mostly redundant with the theorem definition (section C) in our examples.</p>
<p>Deciding on inputs and outputs is a good first step, but it is generally suboptimal to simply send the LLM a list of input-output pairs. The nascent field of Prompt Engineering provides
a variety of approaches to constructing high-performing prompts for language models. One such technique is to tell the language model that it
is an expert, e.g., beginning with “You are an expert in producing Lean 4 proofs tasked with…”. A second common approach is “few-shot prompting”: providing several examples
of the input and desired output in the prompt. Another common approach is <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.17651">self-refinement</a>, i.e., feeding any available output describing the results
of the previous LLM response into the next prompt. </p>
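<p>Combining these ideas, a few-shot prompt can be assembled mechanically. The sketch below is illustrative only: the persona line and formatting are hypothetical, not the exact prompt used in our experiments.</p>

```python
def build_prompt(examples, target_statement):
    """Assemble a few-shot prompt: a persona line, worked (statement, proof)
    example pairs, then the target statement awaiting a proof."""
    parts = ["You are an expert in producing Lean 4 proofs."]
    for statement, proof in examples:
        parts.append(f"Theorem:\n{statement}\nProof:\n{proof}")
    parts.append(f"Theorem:\n{target_statement}\nProof:")
    return "\n\n".join(parts)

demo = build_prompt(
    [("theorem add_zero (n : Nat) : n + 0 = n", "by simp")],
    "theorem zero_add (n : Nat) : 0 + n = n",
)
```

<p>The model's completion after the final “Proof:” is then checked by Lean; self-refinement would append any resulting error messages and ask again.</p>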
<h1 id="minicodeprops-baselines">miniCodeProps Baselines</h1>
<p>We tested several models using Next-Step tactic generation, and GPT-4 for Full-Proof generation. Results can be found in the table below.
<a rel="noopener" target="_blank" href="https://github.com/wellecks/llmstep">LLMStep</a> is a framework for getting Next-Step proof suggestions from arbitrary LLMs in VS Code. We modified it to communicate with Lean REPL
directly and to output confidence scores for each generated next step. We applied the following proof search approach:</p>
<ol>
<li>Given a proof state, generate 10 (next tactic, confidence score) pairs.</li>
<li>Deduplicate and keep the 5 highest-confidence tactics.</li>
<li>Send each tactic to Lean REPL using the current state. For each proof state returned:
<ul>
<li>If the proof state is invalid (i.e., the tactic caused Lean REPL to error), ignore it.</li>
<li>If there are no goals remaining, return the list of steps taken.</li>
<li>If the maximum proof search depth has not been reached, repeat steps 1-3 on the new proof state.</li>
</ul>
</li>
</ol>
<p>The LLMs we used were all fine-tuned to produce Lean tactics from a proof state; ntp-context-1.3b in particular was also fine-tuned to use surrounding file context. Due to computational constraints, we used a proof search depth of 3 in our experiments.</p>
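<p>The search procedure above can be sketched as a depth-limited recursion. Here <code>suggest</code> stands in for the fine-tuned LLM (returning (tactic, confidence) pairs) and <code>apply_tactic</code> for the Lean REPL, which returns the new goal list, or <code>None</code> on error; both names are placeholders for illustration.</p>

```python
def search(state, suggest, apply_tactic, depth=3):
    """Depth-limited proof search over LLM-suggested next steps.

    Keeps the top 5 distinct suggestions by confidence, applies each via
    `apply_tactic` (None = REPL error, [] = no goals left), and recurses.
    Returns the list of tactics forming a proof, or None on failure.
    """
    if depth == 0:
        return None
    top, seen = [], set()
    for tactic, _conf in sorted(suggest(state), key=lambda p: -p[1]):
        if tactic not in seen:
            seen.add(tactic)
            top.append(tactic)
        if len(top) == 5:
            break
    for tactic in top:
        new_state = apply_tactic(state, tactic)
        if new_state is None:      # invalid step: ignore this branch
            continue
        if not new_state:          # empty goal list: proof complete
            return [tactic]
        rest = search(new_state, suggest, apply_tactic, depth - 1)
        if rest is not None:
            return [tactic] + rest
    return None
```

<p>With a toy <code>suggest</code>/<code>apply_tactic</code> pair, the search skips erroring tactics and returns the first tactic sequence that empties the goal list.</p>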
<p>For our Full-Proof generation approach, we constructed a base prompt containing three examples of program property proofs similar to those in miniCodeProps. For each property in miniCodeProps,
we appended the property and accompanying context (function definitions and lemmas) to the base prompt and requested a full proof. We sampled 8 responses from GPT-4 and reported success
if any of them contained a correct proof.</p>
<table><thead><tr><th>Method</th><th align="left">LLM</th><th align="center">Medley (Easy)</th><th align="center">Termination (Med)</th><th align="center">Sorting (Hard)</th></tr></thead><tbody>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/pythia-2.8b">Pythia2.8b</a></td><td align="center">44/86</td><td align="center">1/28</td><td align="center">0/63</td></tr>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/llemma_7b">Llemma7b</a></td><td align="center">46/86</td><td align="center">2/28</td><td align="center">0/63</td></tr>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/l3lab/ntp-mathlib-context-deepseek-coder-1.3b">ntp-context-1.3b</a></td><td align="center">38/86</td><td align="center">0/28</td><td align="center">0/63</td></tr>
<tr><td>Full-Proof</td><td align="left"><a rel="noopener" target="_blank" href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo">GPT-4-turbo</a></td><td align="center">44/86</td><td align="center">1/28</td><td align="center">0/63</td></tr>
</tbody></table>
<p>Our results indicate that proving properties of programs is nontrivial for simple applications of fine-tuned language models and basic few-shot prompting of GPT-4. Further analysis of
the failure modes of these models (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#sorting">Sorting Discussion</a>), as well as more sophisticated (and higher computational budget) approaches to proof search, will likely improve
these results in the near future.
In the following section, we analyze the Sorting component of miniCodeProps and a sample incorrect proof generated by GPT-4.</p>
<h2 id="sorting">Discussion: Sorting</h2>
<p>The Sorting section of miniCodeProps contains a variety of sorting algorithms with associated properties. In particular, after defining 11 sorting algorithms on
lists of natural numbers, TIP defines the following properties for each algorithm:</p>
<ol>
<li>the algorithm returns an ordered list</li>
<li>the list returned by the algorithm has the same number of elements as the input list</li>
<li>the list returned by the algorithm is a permutation of the original list</li>
<li>the algorithm is equivalent to another sorting algorithm (insertion sort)</li>
</ol>
<p>Properties 1 and 4 are deeply connected: there is a unique ordered permutation of any list of natural numbers, although this fact is itself a property of the <code>ordered</code> function used!
The best way to prove property 4 for most algorithms may well be to prove property 1 together with that uniqueness fact, but I argue that any
theorem prover that does so has demonstrated a valuable skill. Property 2 is also strictly easier than property 3, since a permutation of a list necessarily preserves the number of elements. This allows for interesting analysis of future
theorem provers: will they succeed at proving property 2 but struggle with property 3? Or will they prove property 3 directly and use its corollary to immediately prove property 2?</p>
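<p>The fact that property 3 subsumes property 2 can be stated in Lean itself. A minimal sketch, assuming the standard-library <code>List.Perm</code> relation and its <code>length_eq</code> lemma, with an arbitrary <code>sort</code> function standing in for any of the algorithms:</p>

```lean
-- If `sort xs` is a permutation of `xs` (property 3), then it has the
-- same number of elements (property 2), since permutations preserve length.
example (sort : List Nat → List Nat)
    (prop_perm : ∀ xs : List Nat, (sort xs).Perm xs) :
    ∀ xs : List Nat, (sort xs).length = xs.length :=
  fun xs => (prop_perm xs).length_eq
```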
<p>Unfortunately, the approaches we have tested so far have not succeeded at proving any properties of sorting algorithms. However, the ways in which they fail are informative. Below we
have a sample output of GPT-4 attempting to prove that heapsort (<code>hsort</code> below) returns an ordered list. The proof <em>looks</em> mostly reasonable to a Lean user, a common characteristic
of LLM-produced output. Notable problems occur on lines 25, 31, and 40. </p>
<pre data-linenos data-lang="lean" style="background-color:#393939;color:#dedede;" class="language-lean "><code class="language-lean" data-lang="lean"><table><tbody><tr><td>1</td><td><span style="color:#fed6af;">inductive </span><span style="color:#fffd87;">MyHeap </span><span style="color:#fed6af;">where
</span><tr><td>2</td><td><span>| nil : MyHeap
</span><tr><td>3</td><td><span>| node : MyHeap → Nat → MyHeap → MyHeap
</span><tr><td>4</td><td><span>
</span><tr><td>5</td><td><span style="color:#fed6af;">def </span><span style="color:#fffd87;">numElem </span><span>: MyHeap → Nat
</span><tr><td>6</td><td><span>| MyHeap.nil => </span><span style="font-weight:bold;color:#87d6d5;">0
</span><tr><td>7</td><td><span>| MyHeap.node p _x q => </span><span style="font-weight:bold;color:#87d6d5;">1</span><span> + numElem p + numElem q
</span><tr><td>8</td><td><span>
</span><tr><td>9</td><td><span style="color:#fed6af;">lemma </span><span style="color:#fffd87;">numElem_merge_branches_lt </span><span>(p q: MyHeap) (x: Nat): numElem (hmerge p q) < numElem (MyHeap.node p x q) := </span><span style="color:#fed6af;">by
</span><tr><td>10</td><td><span> rw [←merge_elems _ _];
</span><tr><td>11</td><td><span> </span><span style="color:#fed6af;">have</span><span> h': numElem (MyHeap.node p x q) = </span><span style="font-weight:bold;color:#87d6d5;">1</span><span> + numElem p + numElem q; rfl
</span><tr><td>12</td><td><span> rw [h']
</span><tr><td>13</td><td><span> linarith;
</span><tr><td>14</td><td><span>
</span><tr><td>15</td><td><span style="color:#fed6af;">def </span><span style="color:#fffd87;">ordered </span><span>: List Nat -> Bool
</span><tr><td>16</td><td><span>| [] => True
</span><tr><td>17</td><td><span>| [_x] => True
</span><tr><td>18</td><td><span>| x::y::xs => x <= y && ordered (y::xs)
</span><tr><td>19</td><td><span>
</span><tr><td>20</td><td><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> hsort (heapsort) omitted for brevity
</span><tr><td>21</td><td><span>
</span><tr><td>22</td><td><span style="color:#fed6af;">theorem </span><span style="color:#fffd87;">prop_HSortSorts' </span><span>(xs: List Nat) : ordered (hsort xs) == True := </span><span style="color:#fed6af;">by
</span><tr><td>23</td><td><span>unfold hsort
</span><tr><td>24</td><td><span>unfold ordered
</span><tr><td>25</td><td><span>induction xs generalizing MyHeap </span><span style="color:#fed6af;">with
</span><tr><td>26</td><td><span>| nil => unfold toHeap; unfold toList; simp
</span><tr><td>27</td><td><span>| cons x xs' ih =>
</span><tr><td>28</td><td><span> unfold toHeap
</span><tr><td>29</td><td><span> unfold toList
</span><tr><td>30</td><td><span> unfold hmerge
</span><tr><td>31</td><td><span> </span><span style="color:#fed6af;">have</span><span> h := numElem_merge_branches_lt ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific heaps if needed
</span><tr><td>32</td><td><span> </span><span style="color:#fed6af;">have</span><span> m := merge_elems ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific heaps if needed
</span><tr><td>33</td><td><span> specialize ih (xs'.map (</span><span style="color:#fed6af;">fun</span><span> x => MyHeap.node MyHeap.nil x MyHeap.nil))
</span><tr><td>34</td><td><span> induction (toList (toHeap xs')) generalizing (toList (toHeap (x :: xs')))
</span><tr><td>35</td><td><span> | nil =>
</span><tr><td>36</td><td><span> apply ih
</span><tr><td>37</td><td><span> | cons y ys' ysih =>
</span><tr><td>38</td><td><span> unfold toList at ih
</span><tr><td>39</td><td><span> unfold toHeap at ih
</span><tr><td>40</td><td><span> unfold ordered at ih ⊢
</span><tr><td>41</td><td><span> unfold hsort at ih
</span><tr><td>42</td><td><span> </span><span style="color:#fed6af;">have</span><span> h := numElem_merge_branches_lt ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific values if needed
</span><tr><td>43</td><td><span> </span><span style="color:#fed6af;">have</span><span> m := merge_elems ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific values if needed
</span><tr><td>44</td><td><span> case h_1 =>
</span><tr><td>45</td><td><span> unfold ordered
</span><tr><td>46</td><td><span> simp [List.map]
</span><tr><td>47</td><td><span> simp [toList]
</span><tr><td>48</td><td><span> case h_2 =>
</span><tr><td>49</td><td><span> unfold toHeap
</span><tr><td>50</td><td><span> unfold ordered
</span><tr><td>51</td><td><span> simp [toList]
</span><tr><td>52</td><td><span> rfl
</span></tr></tbody></table></code></pre>
<p>On line 25, GPT-4 attempts to generalize <code>MyHeap</code>, a type defined earlier. Induction with generalization
is a common idea in many proofs about recursive programs, but the object generalized is always some term in the proof context, not a type. Generalizing a type is semantically
meaningless, and indeed the Lean kernel throws its first error on this line.</p>
<p>On line 31, GPT-4 again demonstrates interesting but incorrect behavior. <code>numElem_merge_branches_lt</code> is a lemma stating that merging two heaps results in a heap with fewer elements
than a heap with a new value at the root and the two original heaps as children. GPT-4 invokes this lemma but does not provide any arguments, instead writing <code>...</code> (not valid Lean syntax),
seemingly telling the user “I don’t know what should go here, so you fill it in.” However, the <code>h</code> that GPT-4 binds to this invocation is never used in the proof. I interpret this as follows: GPT-4’s model of correct Lean proofs includes invoking lemmas defined in the context, but not the logic necessary to use such lemmas effectively.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Our baselines for miniCodeProps demonstrate that despite recent advances in LLM-powered mathematical theorem proving in ITPs, proving complex code properties remains difficult.
While future work on benchmarking this capability will likely expand outside of TIP to include properties of the wide range of non-inductive functions and data, miniCodeProps represents
a challenging first step. When models are capable of automatically producing proofs in the Sorting category, theorem proving technology will have taken a large step towards
the elusive goal of automatic generation of provably correct code. We hope the theorem proving community finds miniCodeProps useful for improving the capabilities of automated ITP systems.</p>
<h1 id="reference-list">Reference List</h1>
<ul>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2406.11915">miniCodeProps on Arxiv</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/datasets/elohn/miniCodeProps">miniCodeProps benchmark</a></li>
<li><a rel="noopener" target="_blank" href="https://aws.amazon.com/what-is/large-language-model/">What is an LLM blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">CNF explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://cacm.acm.org/research/the-science-of-brute-force/">Brute Reasoning explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emheule/publications/p01c15-prf.pdf">SAT solver unsatisfiability proofs</a></li>
<li><a rel="noopener" target="_blank" href="https://leodemoura.github.io/files/SMTProofs.pdf">SMT solver proofs</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.17807">Clover: Closed-Loop Verifiable Code Generation</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08774">GPT-4 Technical Report</a></li>
<li><a rel="noopener" target="_blank" href="https://lean-lang.org/">Lean homepage</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Binary_tree">Binary Tree explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://terrytao.wordpress.com/2023/12/05/a-slightly-longer-lean-4-proof-tour/">Terrence Tao formalizing parts of his work in Lean</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2312.14188">Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2212.09420">Large Language Models Meet NL2Code: A Survey</a></li>
<li><a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/mbpp">MBPP Dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2307.02503">Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review</a></li>
<li><a rel="noopener" target="_blank" href="https://agi-sphere.com/context-length/">Context Length blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://www.ibm.com/topics/ai-hallucinations">LLM Hallucinations blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08896">SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a></li>
<li><a rel="noopener" target="_blank" href="https://www.arxiv.org/abs/2408.08152">DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search</a></li>
<li><a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/minif2f">miniF2F dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://tip-org.github.io/">Tons of Inductive Problems (TIP) dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://github.com/leanprover-community/repl">Lean REPL</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence">Curry-Howard Correspondence</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2301.00234">A Survey on In-Context Learning</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.17651">Self-Refine: Iterative Refinement with Self-Feedback</a></li>
<li><a rel="noopener" target="_blank" href="https://github.com/wellecks/llmstep">LLMStep</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/pythia-2.8b">Pythia2.8b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/llemma_7b">Llemma7b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/l3lab/ntp-mathlib-context-deepseek-coder-1.3b">ntp-context-1.3b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo">OpenAI documentation for GPT-4-turbo</a></li>
</ul>
Mariposa: the Butterfly Effect in SMT-based Program Verification2024-08-06T00:00:00+00:002024-08-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/mariposa/<p>Satisfiability Modulo Theories (SMT) solvers are powerful tools
that answer logical and mathematical questions.
As an example, let’s say I want to know whether there exist integers
\(a, b, c\) such that \(3a^2 - 2ab - b^2c = 7\).
To ask an SMT solver, I need to write an SMT query, which is in a <a rel="noopener" target="_blank" href="https://smtlib.cs.uiowa.edu/">standardized format</a> for expressing logical problems. In the SMT query below, the <code>declare-fun</code> command creates a variable (i.e., a
function with no arguments), and the <code>assert</code> command states the equation
as a constraint. More generally, an SMT query may contain
multiple assertions, and the <code>check-sat</code> command checks if
the query context, i.e., the <em>conjunction</em> of the
assertions, is satisfiable.</p>
<!-- A slight quirk is that the expressions are in prefix form,
where each operator comes before its operand(s). -->
<pre style="background-color:#393939;color:#dedede;"><code><span>(declare-fun a () Int)
</span><span>(declare-fun b () Int)
</span><span>(declare-fun c () Int)
</span><span>(assert
</span><span> (=
</span><span> (+ (* 3 a a) (* -2 a b) (* -1 b (* c b)))
</span><span> 7
</span><span> )
</span><span>)
</span><span>(check-sat)
</span></code></pre>
<p>The possible answers from the SMT solver can be “Yes”
(satisfiable), “No” (unsatisfiable) or “I don’t know”
(unknown). Suppose the solver responds with “Yes”
(satisfiable) in this case. This is good, because the
question is not so straightforward to me at least, and the
solver gives a definitive answer. What’s more, the solver
provides fairly high assurance about its responses, which
are justified by <strong>precise mathematical reasoning</strong>. For
this example, the solver can also provide a solution, <code>a = 1, b = 2, c = -2</code>, which serves as a checkable witness to
the “Yes” answer.</p>
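<p>The witness is easy to check by hand, or in a few lines of code: substituting <code>a = 1, b = 2, c = -2</code> into the left-hand side recovers 7.</p>

```python
# Substitute the reported witness into 3a^2 - 2ab - b^2*c.
a, b, c = 1, 2, -2
lhs = 3 * a * a - 2 * a * b - b * b * c
print(lhs)  # 3 - 4 + 8 = 7
```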
<p>However, the solver is not perfect, because even a seemingly
benign change to a query can
trip up the SMT solver, causing it to give up. Suppose that
I slightly tweak the formula and ask again:</p>
<!-- $ \exists \, e, f, g \in Z \, | \,
3e^{2} -2ef -e^2g = 7 $
-->
<pre style="background-color:#393939;color:#dedede;"><code><span>(declare-fun e () Int)
</span><span>(declare-fun f () Int)
</span><span>(declare-fun g () Int)
</span><span>(assert
</span><span> (=
</span><span> (+ (* 3 e e) (* -2 e f) (* -1 f (* g f)))
</span><span> 7
</span><span> )
</span><span>)
</span><span>(check-sat)
</span></code></pre>
<p>This time, the following may happen: the solver gives up,
saying “I don’t know” to this new query. Understandably,
this may seem puzzling. As you might have noticed, the two
queries are essentially the same, just with different
variable names.
Is it even a legitimate move for it to give up? Why would the solver give different responses?</p>
<p>Before you get mad at the solver (this is a made-up example
BTW), let me explain why it can unexpectedly fail with a
seemingly innocuous query change. As mentioned earlier, the
SMT solver sticks to precise mathematical reasoning.
Therefore, if a best-effort try doesn’t work out, the solver
is allowed to give up, instead of giving bogus answers.
Moreover, the solver heuristics may not be robust against
superficial modifications to the input, leading to
confusing responses on similar inquiries.</p>
<!-- How hard? Well, some questions can
be NP-hard! In fact, the example above pertains to
[Diophantine
equations](https://en.wikipedia.org/wiki/Diophantine_equation),
which are undecidable in general. Therefore, no program can
correctly answer all such questions. The poor solver has to
resort to heuristics, which may not be robust against
superficial modifications to the input query. -->
<!-- ### Instability of SMT Solving -->
<p>What we have observed in this example is the phenomenon of
<strong>SMT instability</strong>, where trivial changes to the input
query may incur large performance variations (or even
different responses) from the solver. While there are many
applications of the SMT solver, in this blog post, I will focus
on instability in <strong>SMT-based program verification</strong>, where
we ask the solver to prove programs correct. More
concretely, instability manifests as a butterfly effect:
even tiny, superficial changes in the program may lead to
noticeable change in proof performance and even spurious
verification failures.</p>
<!-- spurious proof failures, where a previously proven program
may be (wrongfully) rejected after trivial changes to the
source code. -->
<h1 id="instability-in-smt-based-program-verification">Instability in SMT-based Program Verification</h1>
<p>Please allow me to briefly explain why program verification
is useful, how SMT solvers can help with verification, and
why instability comes up as a concern. If you are already
familiar with the background topic, please feel free to skip
this section. </p>
<p>As programmers, we often make informal claims about our
software. For example, I might say that a filesystem is
crash-safe or that an encryption tool is secure. However, as
many of us can testify, these claims might be
unfounded or even straight-up wrong. Sometimes, the cost of software failure can be catastrophic (e.g., consider <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Ariane_flight_V88">spacecraft</a> or <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Therac-25">medical devices</a>).
Fortunately, formal
verification offers a path to move beyond informal claims and avoid such
disasters.</p>
<p>Formal verification uses proofs to show that the code meets
its specification. In comparison to testing, formal
verification offers a higher level of assurance, since it
reasons about the program’s behavior for <em>all possible
inputs</em>, not just the ones in the test cases. In a
more-or-less <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Predicate_transformer_semantics">standard
algorithm</a>,
program properties can be encoded as logical statements,
often called the verification conditions (VCs). Essentially, the
task of formal verification is to prove that the VCs hold.</p>
<p>In SMT-based program verification, the solver takes as input
the VCs and searches for proofs. As you might have gathered
from the previous example, the SMT solver can reason about
pretty complex logical statements. In this way, the solver
enables a high degree of automation, allowing the developer
to skip manual and tedious proof steps. This methodology has
thus made verification of complex software systems a
reality.</p>
<p>However, SMT-based automation also introduces the problem of
instability. Verified software, similar to regular software,
has an iterative development process. As the developers make
incremental changes to the code, corresponding queries also
change constantly. Even seemingly trivial changes, such as
renaming a variable, would create a different query. As we
have discussed, the solver may not respond consistently to
these changes, leading to confusing verification results and
frustrated developers.</p>
<h1 id="detecting-instability-with-mariposa">Detecting Instability with Mariposa</h1>
<p>Now that we have a basic understanding of instability, let’s
try to quantify it more systematically. I will
introduce the methodology used in <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/mariposa">Mariposa</a>, a tool that we
have built to measure and detect instability. In this blog post, I will
stick to the key intuitions and elide the details. For a
more thorough discussion, I encourage you to check out our
<a rel="noopener" target="_blank" href="https://www.andrew.cmu.edu/user/bparno/papers/mariposa.pdf">paper</a>. At a high level, given an original query \( q \) and
an SMT solver \( s \), Mariposa answers the question: </p>
<blockquote>
<p>Is the query-solver pair \((q, s)\) stable?</p>
</blockquote>
<p></p>
<p>Intuitively, instability means that \( s
\) experiences a mix of successes and failures when we apply seemingly irrelevant mutations to
\( q \). Mariposa detects instability by generating a set
of mutated queries and evaluating the performance of \( s
\) on each mutant. In this section, I will explain what
mutations are used, and how Mariposa decides the stability
status of the query-solver pair.</p>
<h2 id="what-mutations-to-use">What Mutations to Use?</h2>
<p>In Mariposa, a mutation method needs to preserve not only
the semantic meaning but also the syntactic structures of a
query. More precisely, the original query \( q \) and its
mutant \( q’ \) need to be both semantically equivalent
and syntactically isomorphic. </p>
<!-- it seems reasonable to expect similar performance from
the solver on both queries. -->
<ul>
<li>
<p><strong>Semantic Equivalence</strong>. \( q \) and \( q’ \) are semantically equivalent
when there is a bijection between the set of proofs for \( q \)
and those for \( q’ \) . In other words, a proof of \( q \) can be
transformed into a proof of \( q’ \) , and vice versa. </p>
</li>
<li>
<p><strong>Syntactic Isomorphism</strong>. \( q \) and \( q’ \) are
syntactically isomorphic if there exists a bijection between
their symbols (e.g., variables) and commands (e.g.,
<code>assert</code>). In other words, each symbol or command in \( q \) has
a counterpart in \( q’ \), and vice versa. </p>
</li>
</ul>
<p>For our concrete experiments, we are interested in mutations
that also correspond to common development practices.
Specifically, we consider the following three mutation
methods:</p>
<ul>
<li>
<p><strong>Assertion Shuffling</strong>. Reordering of source-level lemmas
or methods is a common practice when developing verified
software. Such reordering roughly corresponds to shuffling
the commands in the query. Since an SMT query is a
conjunction of assertions, the assertion order does not
impact query semantics. Further, shuffling the assertions
guarantees syntactic isomorphism.</p>
</li>
<li>
<p><strong>Symbol Renaming</strong>. It is common to rename source-level
methods, types, or variables, which roughly corresponds to
renaming the symbols in the SMT queries. As long as the
symbol names are used consistently, renaming preserves
semantic equivalence and syntactic isomorphism. </p>
</li>
<li>
<p><strong>Randomness Reseeding</strong>. SMT solvers optionally take as
input a random seed, which is used in some of their
non-deterministic choices. Changing the seed has no effect
on the query’s semantics but is known to affect the solver’s
performance. While technically not a mutation, reseeding has
been used as a proxy for measuring instability, which is why
we have included it here.</p>
</li>
</ul>
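<p>To make the first two mutations concrete, here is a minimal sketch (not Mariposa's actual implementation) of assertion shuffling and symbol renaming applied to a toy query represented as a list of SMT-LIB commands. The renaming uses naive textual substitution; a real implementation would respect token boundaries.</p>

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// A toy SMT-LIB query, represented as one assert command per line.
using Query = std::vector<std::string>;

// Assertion shuffling: permuting the asserts preserves semantics,
// since a query is a conjunction of its assertions.
Query shuffle_assertions(Query q, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(q.begin(), q.end(), rng);
    return q;
}

// Symbol renaming: consistently replacing a symbol name everywhere
// yields a semantically equivalent, syntactically isomorphic query.
// NOTE: naive substring replacement, for illustration only.
Query rename_symbol(Query q, const std::string& from, const std::string& to) {
    for (auto& cmd : q) {
        size_t pos = 0;
        while ((pos = cmd.find(from, pos)) != std::string::npos) {
            cmd.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return q;
}
```

<p>Randomness reseeding needs no query transformation at all: the mutant is the same query run with a different seed passed to the solver.</p>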
<!-- Historically, some verification tools have
attempted to use reseeding to measure instability: Dafny and
F* have options to run the same query multiple times with
different random seeds and report the number of failures
encountered.
-->
<p>As an example, suppose we have a query \( q \) with \(
100 \) assertions. If we exhaustively apply shuffling to
\( q \), we obtain a set of mutated queries consisting of all \(100!
\approx 9 \times 10^{157}\) permutations of \( q \). </p>
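<p>As a quick sanity check on that magnitude, \(\log_{10} n!\) can be computed with the standard library's log-gamma function:</p>

```cpp
#include <cmath>

// log10(n!) via the log-gamma function: ln(n!) = lgamma(n + 1).
double log10_factorial(int n) {
    return std::lgamma(n + 1.0) / std::log(10.0);
}
// log10_factorial(100) is about 157.97, i.e., 100! is roughly 9.3e157.
```
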
<h2 id="is-it-stable-or-not">Is it Stable or Not?</h2>
<p>Whether a query-solver pair \( (q, s) \) is stable or not
depends on how the mutants perform. A natural measure is the
<strong>Mutant Success Rate</strong>, i.e., the percentage of \( q\)’s
mutants that are verified by \( s \). Intuitively, the
success rate, denoted by \(r\), reflects performance
consistency. A low \(r\) indicates consistently poor
results; a high \(r\) indicates consistently good results;
and a moderate \(r\) indicates inconsistent results, i.e.,
instability.</p>
<p>Mariposa thus introduces four stability categories based on \(r\): <strong>unsolvable</strong>, <strong>unstable</strong>, <strong>stable</strong>, and <strong>inconclusive</strong>.<br />
The scheme includes two additional parameters,
\(r_{solvable}\) and \(r_{stable}\), which serve as the
lower and upper bounds of the success rate range for
unstable queries: \( (q, s) \) is <strong>unsolvable</strong> when
\(r < r_{solvable}\), <strong>unstable</strong> when \( r_{solvable} \leq r \leq r_{stable} \),
and <strong>stable</strong> when \(r > r_{stable}\). In our
concrete experiments, we set \(r_{solvable} = 5\% \) and
\(r_{stable} = 95\%\).</p>
<img src="./mariposa_categories.png" alt="intuition of Mariposa Categories" style="width:80%">
<p>The <strong>inconclusive</strong> category is needed because of
statistical testing. Specifically, it is often infeasible to
enumerate all the mutants of a query and obtain the true
success rate. (Think about the \(100!\) mutants or more!)
Therefore, Mariposa uses random sampling to <strong>estimate</strong> the
success rate. When the estimated success rate based on the
sampled mutants is close to the boundaries, the statistical
test may not result in enough confidence to place \( (q, s)
\) in any of the previous three categories, yielding an
inconclusive result.</p>
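<p>Putting the pieces together, the decision scheme can be sketched as follows. This is only the intuition: the real Mariposa decision uses a proper statistical test, which I approximate here by treating a confidence interval that straddles a threshold as the inconclusive case. The thresholds are the ones used in our experiments.</p>

```cpp
#include <string>

// Classify a query-solver pair from an estimated mutant success rate r
// (in %) and the half-width of its confidence interval (in %).
// Sketch only; Mariposa's actual decision procedure is statistical.
std::string classify(double r, double interval) {
    const double r_solvable = 5.0, r_stable = 95.0;
    // If the interval straddles a threshold, we cannot commit to a category.
    if (r - interval < r_solvable && r + interval > r_solvable) return "inconclusive";
    if (r - interval < r_stable && r + interval > r_stable) return "inconclusive";
    if (r < r_solvable) return "unsolvable";
    if (r > r_stable) return "stable";
    return "unstable";
}
```
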
<!-- , which correspond
respectively to the lower and upper bounds of the success
rate range for unstable queries. -->
<!-- The scheme
includes two additional parameters: \\(r_{solvable}\\) and
r stable , which correspond respectively to the lower and
upper bounds of the success rate range for unstable queries. -->
<!-- * **Unsolvable**: \\(r < r_{solvable} \\)
* **Unstable**: \\( r_{solvable} \leq r \leq r_{stable} \\)
* **Stable**: \\( r_{stable} < r \\)
* **Inconclusive**. This category is needed due to a
technicality that is less important for our discussion
here. In short, because Mariposa uses random sampling to
estimate the success rate, sometimes statistical tests do
not result in enough confidence to place \\( (q, s) \\) in
any of the previous three categories. -->
<h1 id="measuring-instability-in-the-wild">Measuring Instability in the Wild</h1>
<p>So far we have discussed Mariposa’s methodology to detect
and quantify instability. How much instability is there in
practice? Let me share some experimental results from
existing program verification projects.</p>
<h2 id="projects-and-queries">Projects and Queries</h2>
<p>The table below lists the projects we experimented on.
Generally speaking: (1) they are all verified system
software, such as storage systems, boot loaders, and
hypervisors; (2) they all involve non-trivial engineering
effort, creating a considerable number of SMT queries; and (3)
they are all published at top venues, with source code and
verification artifacts available online. </p>
<table><thead><tr><th align="left">Project Name</th><th align="right">Source Line Count</th><th align="right">Query Count</th><th align="center">Artifact Solver</th></tr></thead><tbody>
<tr><td align="left">Komodo\(_D \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3132747.3132782">(SOSP’17)</a></td><td align="right">26K</td><td align="right">2,054</td><td align="center">Z3 4.5.0</td></tr>
<tr><td align="left">Komodo\(_S \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3341301.3359641">(SOSP’19)</a></td><td align="right">4K</td><td align="right">773</td><td align="center">Z3 4.2.2</td></tr>
<tr><td align="left">VeriBetrKV\(_D \) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/osdi20/presentation/hance">(OSDI’20)</a></td><td align="right">44K</td><td align="right">5,325</td><td align="center">Z3 4.6.0</td></tr>
<tr><td align="left">VeriBetrKV\(_L \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3527313">(OOPSLA’22)</a></td><td align="right">49K</td><td align="right">5,600</td><td align="center">Z3 4.8.5</td></tr>
<tr><td align="left">Dice\(_F^⋆\) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity21/presentation/tao">(USENIX’21)</a></td><td align="right">25K</td><td align="right">1,536</td><td align="center">Z3 4.8.5</td></tr>
<tr><td align="left">vWasm\(_F \) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya">(USENIX’22)</a></td><td align="right">15K</td><td align="right">1,755</td><td align="center">Z3 4.8.5</td></tr>
</tbody></table>
<!-- | Project Name | Source Line Count | Query Count |
|:------------- | -----------:| -----------:|
| Komodo\\(_D \\) | 26K | 2,054 |
| Komodo\\(_S \\) | 4K | 773 |
| VeriBetrKV\\(_D \\) | 44K | 5,325 |
| VeriBetrKV\\(_L \\) | 49K | 5,600 |
| Dice\\(_F^⋆\\) | 25K | 1,536 |
| vWasm\\(_F \\) | 15K | 1,755 | -->
<h2 id="how-much-instability">How Much Instability?</h2>
<p>For our experiments, we focus on the Z3 SMT solver, with
which the projects were developed. We are interested in both
the current and historical status of SMT stability.
Therefore, in addition to the latest Z3 solver (version
4.12.1, as of this work), we include seven legacy versions of
Z3, with the earliest released in
2015. In particular, for each project we include its
artifact solver, which is the version used in the project’s
artifact.</p>
<p>We run each project-solver pair through Mariposa. For each
original query in a project, Mariposa outputs a stability
category. Therefore, each project-solver pair is associated with
a breakdown of different stability categories, plotted as
stacked bars in Figure 1. Focusing on Z3 4.12.1 (the right-most
group), the unstable proportion is highest in
Komodo\(_D \) (\( 5.0\% \)), and is \(2.6\%\) across all queries. This
might not seem like a significant number, but imagine a
regular software project’s continuous integration (CI) where \(\sim 2\%\) of the test cases fail
randomly: it would be a nightmare! Nevertheless, developers
have to bear such a burden in SMT-based verification.</p>
<!-- In all project-solver pairs, the majority of queries are
stable. However, a non-trivial amount of instability
persists as well. -->
<img src="./stability_status.png" alt="historical stability status of Mariposa projects" style="width:100%">
<figcaption>Figure 1: Overall Stability Status. From bottom to top, each stacked bar shows the proportions of unsolvable (lightly shaded), unstable
(deeply shaded), and inconclusive (uncolored) queries. The remaining portion of the queries (stacking each bar to 100%), not shown, are
stable. The solver version used for each project’s artifact is marked with a star (⋆). </figcaption>
<br>
<p>Now that we know instability is not a rare occurrence, the
next question is: what gives? Well, first off, instability
is a property that is jointly determined by the solver and
the query. Therefore, the causes can roughly be categorized
as solver-related and query-related. Of course, I cannot
possibly be exhaustive here, so let me discuss the
significant ones we have found for each side. </p>
<!-- Before we delve into the details, here is a disclaimer: I
can only cover significant causes -->
<h1 id="debugging-the-solver">“Debugging” the Solver</h1>
<!-- As we zoom out to the full picture, the projects exhibit
different historical trends. The unstable proportion of
vWasm\\(_F\\) and Komodo\\(_S\\) remain consistently small
across the solver versions. On the other hand, some
projects seem "overfitted" to their artifact solver, in that
they become less stable with solver upgrades. Specifically,
Komodo\\(_D \\), VeriBetrKV\\(_D \\), and VeriBetrKV\\(_L
\\) show more instability in newer Z3 versions, with a
**noticeable gap** between Z3 4.8.5 and Z3 4.8.8. -->
<p>As you might have noticed already, in Figure 1, there is a
“gap” between Z3 4.8.5 and 4.8.8, where several projects
suffer from noticeably more instability in the newer
solver. In other words, certain queries used to be stable,
but somehow become unstable with the solver upgrade. Since
the query sets did not change, the solver change must be responsible for the regression. </p>
<p>We perform further experiments to narrow down the Z3 git
commits that may have been the problem. In the six
experiment projects, \(285\) queries are stable under Z3
4.8.5 but unstable under Z3 4.8.8. For each query in this
set, we run <a rel="noopener" target="_blank" href="https://www.git-scm.com/docs/git-bisect">git bisect</a> (which calls Mariposa) to find the
commit to blame, i.e., where the query first becomes
unstable.</p>
<p>There are a total of \(1,453\) commits between the two versions,
among which we identify the two most impactful. Out of
the \(285\) regressed queries, \(115\) (\(40\%\)) are blamed on commit <code>5177cc4</code>;
another \(77\) (\(27\%\)) are blamed on <code>1e770af</code>. The
remaining queries are dispersed across the other commits.</p>
<p>The two commits are small and localized: <code>5177cc4</code> has \( 2
\) changed files with \( 8 \) additions and \( 2 \)
deletions; <code>1e770af</code> has only \( 1 \) changed file with
\( 18 \) additions and \( 3 \) deletions. Both commits
are related to the order of disjunctions in a query’s
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">conjunctive normal form</a>.
<code>1e770af</code>, the earlier of the two, sorts the disjunctions,
while <code>5177cc4</code> adds a new term ordering, updating the
sorted disjunction order. Similar to
conjunction order, disjunction order does not affect the
semantics, but interacts with other solver heuristics. Therefore,
the change of disjunction order can be thought of as “internal mutations”
to the query, exposing more instability.</p>
<!-- The results suggest that the solver's internal heuristics
can have a significant impact on stability.
-->
<h1 id="debugging-the-query">“Debugging” the Query</h1>
<p>The discussion so far is condensed from our work on
Mariposa. However, we have yet to cover the query side of
the problem. To that end, let me share some results in our
follow-up work on query context. As it turns out, the
queries we have studied often contain a large amount of
irrelevant information (assertions): the solver does not
need all of that context to find a proof. In fact, the
presence of irrelevant information can “confuse” the
solver, leading to instability. </p>
<!-- , which can be a major source of
instability -->
<h2 id="most-of-the-context-is-irrelevant">Most of the Context is Irrelevant</h2>
<p>Our experiments in this section analyze each query’s
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Unsatisfiable_core"><strong>core</strong></a>. Recall that an SMT query is a
conjunction of assertions. Upon verification success, the
solver can report a core, which is the subset
of the original assertions used in constructing the proof.
This “slimmed-down” version of the query thus serves as an
oracle of <strong>relevant assertions</strong>, and whatever is excluded from
it can be considered irrelevant.</p>
<p>After acquiring a core, we compare its context to the
original query. Using the assertion count as a proxy for the
“size” of the context, we examine the <strong>Relevance Ratio</strong>: \(
\frac{\# \text{core\ assertions}}{\# \text{original\ assertions}} \times 100\% \). Since a core is a
subset of the original query, the lower this ratio is, the
less context remains, and the more irrelevant context the
original query has.</p>
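<p>The relevance ratio itself is a simple conditional proportion. As a hedged illustration (the assertion counts here are made up, not taken from any project):</p>

```cpp
#include <cmath>

// Relevance ratio: the percentage of original assertions kept by the core.
double relevance_ratio(int core_assertions, int original_assertions) {
    return 100.0 * core_assertions / original_assertions;
}
// e.g., a hypothetical query with 10,000 assertions whose core keeps
// only 6 of them has a relevance ratio of 0.06%.
```
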
<img src="./relevance_ratio.png" alt="small cache" style="width:50%">
<figcaption>Figure 2: Query Relevance Ratios. Further to the left means more irrelevant context. Typically, the vast majority of an original query’s context is irrelevant. </figcaption>
<p>Figure 2 shows the CDFs of the relevance ratios for
different projects. For example, on the left side lies the
line for Dice\(_F^⋆\). The median relevance ratio is \(
0.06\% \), meaning that for a typical query in the
project, only \( 0.06\% \) of the context is relevant.
Note that I have excluded Komodo\(_S\) from this
experiment, as its queries each contain only a single
assertion due to special encoding rules. Among
the remaining projects, typically \( 96.23\% – 99.94\%
\) of the context is irrelevant.</p>
<!-- In
vWasm\\(_F \\), the median is \\( 3.77 \\% \\), which is
almost an of order of magnitude higher than the other
projects. This can be attributed to the manual context
tuning vWasm\\(_F \\) developers, who explicitly documented
the tedious effort. -->
<h2 id="irrelevant-context-harms-stability">Irrelevant Context Harms Stability</h2>
<p>Considering the significant amount of irrelevant context, we
further analyze how that impacts stability, by comparing the
original queries and their core counterparts. Given an
original query \(q\) and its core \(q_c\), we introduce
the following metrics among the possible stability status
transitions. </p>
<ul>
<li><strong>Core Preservation</strong>: given that \( q \) is stable, the probability that \( q_c \) remains stable.</li>
<li><strong>Core Mitigation</strong>: given that \( q \) is unstable, the
probability that \( q_c \) becomes stable.</li>
</ul>
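<p>In code, both metrics are conditional proportions; the check below reproduces the Komodo\(_D\) row of the table that follows (\(1,902/1,914 \approx 99.4\%\) preservation, \(84/93 \approx 90.3\%\) mitigation):</p>

```cpp
#include <cmath>

// P(core stable | original stable), in %.
double core_preservation(int stable_originals, int cores_still_stable) {
    return 100.0 * cores_still_stable / stable_originals;
}

// P(core stable | original unstable), in %.
double core_mitigation(int unstable_originals, int cores_became_stable) {
    return 100.0 * cores_became_stable / unstable_originals;
}
```
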
<p>We use the Mariposa tool with Z3 version 4.12.5 in this
experiment. Below we have listed the number of original
queries and the preservation and mitigation scores. As an
example, in the original Komodo\(_D\) queries, \(1,914\)
are stable and \(93\) are unstable. In its core
counterpart, \(99.4\%\) of the stable queries remain
stable, while \(90.3\%\) of the unstable ones become
stable. The vWasm\(_F\) project is the only case where the
core has no mitigation effect. However, its original
unstable query count is very low to begin with. </p>
<!-- As we noted previously,
vWasm\\(_F\\) also starts off with more relevant context
originally. Therefore, the stability of vWasm\\(_F\\) can be
explained by the manual tuning done by the developers. -->
<table><thead><tr><th align="left">Project Name</th><th align="right">Stable Count</th><th align="right">Core Preservation Count</th><th align="right">Unstable Count</th><th align="right">Core Mitigation Count</th></tr></thead><tbody>
<tr><td align="left">Komodo\(_D \)</td><td align="right">1,914</td><td align="right">1,902 (99.4%)</td><td align="right">93</td><td align="right">84 (90.3%)</td></tr>
<tr><td align="left">VeriBetrKV\(_D \)</td><td align="right">4,983</td><td align="right">4,958 (99.5%)</td><td align="right">172</td><td align="right">111 (64.5%)</td></tr>
<tr><td align="left">VeriBetrKV\(_L \)</td><td align="right">4,999</td><td align="right">4,979 (99.6%)</td><td align="right">256</td><td align="right">214 (83.6%)</td></tr>
<tr><td align="left">Dice\(_F^⋆\)</td><td align="right">1,483</td><td align="right">1,477 (99.6%)</td><td align="right">20</td><td align="right">18 (90.0%)</td></tr>
<tr><td align="left">vWasm\(_F \)</td><td align="right">1,731</td><td align="right">1,726 (99.7%)</td><td align="right">4</td><td align="right">0 (0.0%)</td></tr>
<tr><td align="left">Overall</td><td align="right">15,110</td><td align="right">15,042 (99.5%)</td><td align="right">545</td><td align="right">427 (78.3%)</td></tr>
</tbody></table>
<p>Generally, the core is highly likely to preserve what is
stable. Moreover, across all projects, \(78.3\%\) of the
unstable instances can be mitigated by using the core. In
other words, irrelevant context can be thought of as <strong>a
major factor in instability</strong> on the query side! While this
is far from an end-to-end solution, the result suggests a
promising direction to mitigate instability by pruning
irrelevant assertions, which we are exploring in our ongoing
work. </p>
<h1 id="takeaways">Takeaways</h1>
<p>I will conclude with some TLDRs in case someone has skipped ahead
or wants a quick recap.</p>
<ul>
<li>SMT solvers are immensely useful for program verification,
but they introduce the problem of instability, where
trivial changes to the input query may incur spurious
verification failures.</li>
<li>Mariposa is a tool (and a methodology) to detect and
quantify instability.</li>
<li>Instability is a real concern in existing SMT-based
program verification projects. \(2.6\%\) of the queries
in our study are unstable using Z3 4.12.1.</li>
<li>Tiny changes in the solver heuristics can cause noticeable
regression in stability. Just \( 2 \) tiny commits to Z3
are responsible for \( 67.3\% \) of the regression in
our study.</li>
<li>Irrelevant context in the queries is a major source of
instability. Typically, \( 96.23\% – 99.94\% \) of the
context is irrelevant, and pruning it mitigates \(
78.3\%\) of the instability we observed.</li>
</ul>
<p>Last but not least, I would like to reiterate that
instability is a joint problem of the solver heuristic and
the query property, which will take a joint effort to
address. I hope that the work we have done can help to
improve the stability of SMT-based verification in
future research and practice.</p>
Oblivious Maps for Trusted Execution Environments2024-08-05T00:00:00+00:002024-08-05T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/oblivious-maps/<p>\[
\gdef\lf{\left \lfloor}
\gdef\rf{\right \rfloor}
\]</p>
<p>Imagine using a popular messaging app that includes a contact discovery feature to find which of your phone contacts are already using the service and to get information on how to communicate with them. While convenient, this process raises significant privacy concerns: how can you discover mutual contacts without revealing your entire contact list to the messaging app's server?</p>
<p>In a standard implementation, the app might upload your entire contact list to the server to perform the matching, potentially exposing sensitive information to unauthorized access. To address this issue, we need a solution that allows for secure contact discovery without compromising user privacy.</p>
<p>One approach is to leverage Trusted Execution Environments (TEEs), like Intel SGX, to perform these operations securely on the server. TEEs create isolated environments where code and data can be processed without being accessible to the rest of the system. This means that even if the server's operating system is compromised, the information inside the TEE remains protected.</p>
<p>By implementing an oblivious map inside a TEE, we can ensure that neither the app's server nor potential attackers learn anything about your contact list or which queries you performed. Being oblivious means that no information is revealed from the CPU's memory access patterns, making it an ideal solution for privacy-preserving applications.</p>
<p>This blog post explores our research on ENIGMAP <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>, an efficient external-memory oblivious map designed for secure enclaves, offering significant performance improvements over previous work. ENIGMAP enables privacy-preserving contact discovery and other applications by protecting sensitive data and queries from unauthorized access even from the operating system of the machine where ENIGMAP is running.</p>
<h1 id="background">Background</h1>
<p>Before we can dive into the details of ENIGMAP, we first need to understand a few basic concepts: sorted maps, TEEs, external memory, and oblivious algorithms.</p>
<h2 id="sorted-map">Sorted Map</h2>
<p>In ENIGMAP, our goal is to implement an oblivious sorted map. A sorted map of size \(N\) is a data structure that can store up to \(N\) key-value pairs and efficiently supports the following operations:</p>
<ul>
<li><strong>Get(key) -> value:</strong> Returns the value associated with the key, or None if the key was not set before.</li>
<li><strong>Set(key, value):</strong> Sets or updates the value of the key.</li>
<li><strong>Delete(key):</strong> Removes the key from the map.</li>
<li><strong>RangeQuery(keyMin, keyMax) -> [(key, value)]:</strong> Returns all the key-value pairs in the specified range.</li>
</ul>
<p>A search tree, such as a B+ tree or an AVL tree, is typically used to implement a sorted map. Following previous work <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>, we chose to use an AVL tree. <sup class="footnote-reference"><a href="#whynotbplusorhashmap">1</a></sup></p>
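<p>As a point of reference for the interface semantics (with no obliviousness whatsoever), the four operations map directly onto C++'s <code>std::map</code>, which is itself backed by a balanced search tree:</p>

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A plain, non-oblivious sorted map illustrating the interface above.
class SortedMap {
    std::map<int, std::string> tree;  // balanced (red-black) search tree
public:
    // Get(key): returns false (i.e., "None") if the key was never set.
    bool Get(int key, std::string& value) const {
        auto it = tree.find(key);
        if (it == tree.end()) return false;
        value = it->second;
        return true;
    }
    void Set(int key, const std::string& value) { tree[key] = value; }
    void Delete(int key) { tree.erase(key); }
    // RangeQuery: all pairs with keyMin <= key <= keyMax, in sorted order.
    std::vector<std::pair<int, std::string>> RangeQuery(int keyMin, int keyMax) const {
        std::vector<std::pair<int, std::string>> out;
        for (auto it = tree.lower_bound(keyMin);
             it != tree.end() && it->first <= keyMax; ++it)
            out.emplace_back(it->first, it->second);
        return out;
    }
};
```
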
<h3 id="avl-tree">AVL tree</h3>
<p>An AVL tree of size \(N\) is a binary search tree with at most \(N\) nodes, and the following properties:</p>
<ul>
<li>
<p><strong>binary tree</strong> - a tree where each node has a key, a value, and at most 2 children.</p>
</li>
<li>
<p><strong>search tree</strong> - the key of every node is larger than the key of every node on its left subtree and smaller than the key of every node on its right subtree.</p>
</li>
<li>
<p><strong>AVL invariant</strong> - the height of the two child subtrees of any node differs by at most one.</p>
</li>
</ul>
<p>The maximum height of an AVL tree of size \(N\) is \( 1.44 \log_2 N\), and <code>Search(key)</code>, <code>Insert(key,value)</code> and <code>Delete(key)</code> operations – with their standard semantic meaning – can be implemented <sup class="footnote-reference"><a href="#gowiki">2</a></sup> to only access \( O(\log N) \) nodes by doing a binary search for <code>key</code> on the tree. <sup class="footnote-reference"><a href="#boundedheightimportant">3</a></sup></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/avl.png" alt="An example avl tree" />
<strong>Figure 1</strong> - <em>An example AVL tree. Each node is represented only by its key. To search for the key 26, we would touch the nodes on the path from the root: 42, 20, 27 and 26.</em> </p>
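<p>The \( O(\log N) \) access pattern of <code>Search</code> comes from following a single root-to-leaf path. A minimal sketch, on a hand-built fragment reproducing the lookup path described in the caption (42 → 20 → 27 → 26):</p>

```cpp
#include <vector>

struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
    explicit Node(int k) : key(k) {}
};

// Binary-search-tree lookup, recording every node touched. On a balanced
// (e.g., AVL) tree of size N, this path has O(log N) nodes.
std::vector<int> search_path(Node* root, int key) {
    std::vector<int> path;
    for (Node* cur = root; cur != nullptr;) {
        path.push_back(cur->key);
        if (key == cur->key) break;
        cur = (key < cur->key) ? cur->left : cur->right;
    }
    return path;
}
```
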
<!-- TODO: Add avl tree picture here -->
<p>To implement a map using an AVL tree, <code>Get</code>, <code>Set</code> and <code>Delete</code> translate to the equivalent operations on a binary search tree (<code>Search</code>, <code>Insert</code> and <code>Delete</code>), while <code>RangeQuery</code> can be implemented by searching for <code>keyMin</code> and iterating over the successive entries of the map until we hit <code>keyMax</code>.</p>
<p>Ok. So we know AVL trees can be used to implement a map efficiently. Great! But how can we hide the content of our queries from an attacker who controls the machine where the map is stored? </p>
<!-- Isn\'t this impossible to do efficiently? Doesn't [PIR](@/2024/piano-private-information-retrieval.md) imply this problem requires either large communication or large client storage? -->
<p>Well, this is where Trusted Execution Environments (TEEs) come into play. Rather than trusting standard cryptographic assumptions like Computational Diffie-Hellman or the existence of one-way functions, we instead trust… Intel. </p>
<!--
While AVL trees provide an efficient and well-structured way to manage key-value pairs, implementing them in a secure and privacy-preserving manner requires additional considerations, especially in the context of limited secure memory provided by Trusted Execution Environments (TEEs).
-->
<h2 id="tees-and-external-memory">TEEs and External Memory</h2>
<p><code>Trusted Execution Environments</code> (TEEs), like Intel SGX, provide isolated execution environments for sensitive computations. They ensure that data and code running inside of a secure memory region called an <code>enclave</code> are protected from external tampering and observation, even if the operating system is compromised. However, TEEs come with limited secure memory, which poses a challenge for applications to handle large datasets securely.</p>
<blockquote>
<p>The <strong>“Enclave Assumption”</strong> – code inside an enclave runs under the crash-fault model (it either runs the correct computation or crashes, without unexpected behaviors); all of its memory contents are encrypted and cannot be accessed by any other applications running on the same machine; and the speed of code execution is similar to if it was not inside of an enclave.</p>
</blockquote>
<p></p>
<p>To manage datasets that don't fit in the TEE, data must frequently be swapped between the secure enclave memory - called the Enclave Page Cache (EPC) - and insecure <strong>external memory</strong> (RAM or disk). This swapping process, known as page swapping, can significantly impact performance due to the overhead of moving data in and out of the enclave, context switching, and the need to encrypt and decrypt data during these transfers. In fact, we manually measured the cost of <strong>external memory</strong> accesses (Figure 2) - in SGXv2 copying a page from <strong>external memory</strong> to the EPC is about 47x-80x slower than copying the same amount of data inside of the EPC. </p>
<blockquote>
<p>Optimizing the number of <strong>external memory</strong> page swaps is crucial for enhancing the performance of applications running in TEEs.</p>
</blockquote>
<p></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/intrinsics.png" alt="A graph showing the time of several operations in SGXv2 relative to moving a page in enclave memory. Moving swapping to unprotected RAM is about 47x slower, while swapping to disk is 80x slower." />
<strong>Figure 2</strong> - <em>Cost of PageSwap operation of 4KB relative to a MOV of 4KB inside of enclave protected memory. A page swap is about 47
times more expensive than moving 4KB in memory within the enclave.
The costs are color-coded to show the breakdown; blue is the cost of EWB/OCall.</em> <strong>EWB</strong> <em>- enclave write back (the mechanism used by the operating system to swap enclave pages),</em> <strong>OCall</strong> <em>- using SGX's OCall mechanism so that the enclave application manually swaps pages</em>.</p>
<p></p>
<p>Additionally, all the accesses to the external memory can be seen by the operating system and thus by an attacker running the server. Therefore, these accesses should not reveal any information about the client's queries. This is where oblivious algorithms come into play.</p>
<h2 id="oblivious-algorithms">Oblivious Algorithms</h2>
<blockquote>
<p>An <strong>oblivious algorithm</strong> is an algorithm that doesn't leak any information about its inputs to an attacker that has access to a trace of the algorithm's execution.</p>
</blockquote>
<p></p>
<p>In the context of TEEs there are 3 traces to consider:</p>
<ol>
<li>
<p>The addresses of external memory accesses - when we need to access disk, the operating system can always see which disk pages we accessed without any physical attacks <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[3]</a>.</p>
</li>
<li>
<p>The addresses of every RAM access inside of the TEE protected memory - in SGX, it is still the operating system that manages memory pages; therefore the operating system can know at the page level which addresses were accessed <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[3,4]</a>.</p>
</li>
<li>
<p>The instruction trace - the list of executed instructions - which is visible because the CPU fetches instructions by reading RAM addresses <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[4]</a>.</p>
</li>
</ol>
<p>Algorithms that are oblivious with respect to only 1) are typically called <code>weakly oblivious</code> or <code>external memory oblivious</code>; while algorithms that are oblivious with respect to 1), 2) and 3) are called <code>strongly oblivious</code> or simply <code>oblivious algorithms</code>. </p>
<p>In this blogpost we will focus on <code>strongly oblivious algorithms</code>. Under this notion:</p>
<ol>
<li>
<p>All the traces above are public. Only the traces of the CPU registers and CPU caches are private.</p>
</li>
<li>
<p>The limited enclave-protected memory is encrypted and accessible to the enclave, even though its memory access trace is public<sup class="footnote-reference"><a href="#pagelevel">4</a></sup>.</p>
</li>
<li>
<p>The external memory is public, and therefore the enclave needs to encrypt data before moving it there.</p>
</li>
</ol>
<h2 id="oblivious-algorithms-in-practice">Oblivious Algorithms in practice</h2>
<p>So, what do oblivious algorithms look like in practice? Let's consider the following function:</p>
<!-- The concept of oblivious algorithms is also tied to the concept of constant-time algorithms, as the attacker can also learn information based on the number of memory accesses executed. -->
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">constexpr int</span><span> PASSWORD_SIZE </span><span style="color:#ececec;">=</span><span> </span><span style="font-weight:bold;color:#87d6d5;">16</span><span>; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> must be a compile-time constant for the array sizes below</span><span>
</span><span style="color:#fffb9d;">char</span><span> CORRECT[PASSWORD_SIZE];
</span><span style="color:#fffb9d;">bool </span><span style="color:#fffd87;">check_password_nonoblivious</span><span>(</span><span style="color:#fffb9d;">char </span><span>input[PASSWORD_SIZE]) {
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> i</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; i</span><span style="color:#ececec;"><</span><span>PASSWORD_SIZE; i</span><span style="color:#ececec;">++</span><span>) {
</span><span> </span><span style="color:#fed6af;">if </span><span>(CORRECT[i] </span><span style="color:#ececec;">!=</span><span> input[i]) </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#d6d6ae;">false</span><span>;
</span><span> }
</span><span> </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#d6d6ae;">true</span><span>;
</span><span>}
</span></code></pre>
<p><strong>Listing 1</strong> - <em>A non-oblivious version of the check_password function - based on the number of instructions executed an attacker can learn the size of the common prefix between CORRECT and input</em></p>
<p>In <em>Listing 1</em>, the attacker can infer how many initial characters of the input are correct based on the number of memory accesses that <code>check_password_nonoblivious</code> performs - therefore it is not an oblivious algorithm. To make it oblivious, we can make the number of memory accesses independent of the input:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">bool </span><span style="color:#fffd87;">check_password_oblivious</span><span>(</span><span style="color:#fffb9d;">char </span><span>input[PASSWORD_SIZE]) {
</span><span> </span><span style="color:#fffb9d;">bool</span><span> ret </span><span style="color:#ececec;">= </span><span style="font-weight:bold;color:#d6d6ae;">true</span><span>;
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> i</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; i</span><span style="color:#ececec;"><</span><span>PASSWORD_SIZE; i</span><span style="color:#ececec;">++</span><span>) {
</span><span>  </span><span style="color:#fffb9d;">bool</span><span> condition </span><span style="color:#ececec;">=</span><span> CORRECT[i] </span><span style="color:#ececec;">==</span><span> input[i];
</span><span> ret </span><span style="color:#ececec;">=</span><span> ret </span><span style="color:#ececec;">*</span><span> condition; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> if (!condition) ret = false;
</span><span> }
</span><span> </span><span style="color:#fed6af;">return</span><span> ret;
</span><span>}
</span></code></pre>
<p><strong>Listing 2</strong> - <em>An oblivious version of the check_password function - the memory access trace is now constant.<sup class="footnote-reference"><a href="#andcppshortcircuits">5</a></sup></em></p>
<p>If you are familiar with <em>constant-time cryptography</em> you probably noticed that this oblivious algorithm is in fact also a <a rel="noopener" target="_blank" href="https://www.bearssl.org/constanttime.html">constant-time algorithm</a>. These two notions are closely related - if an attacker knows the number of memory addresses accessed, then the attacker has the information needed for timing attacks. </p>
<p>However, compared to constant-time algorithms - where we only need to care about the computation time being constant - the oblivious notion is stronger - we also need to make sure every single address we access does not leak any information. Even accessing a single array index can leak information about the data being processed:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">access_array_nonoblivious</span><span>(</span><span style="color:#fffb9d;">int </span><span>Array[MAX_SIZE], </span><span style="color:#fffb9d;">int </span><span>i) {
</span><span> </span><span style="color:#fed6af;">return</span><span> Array[i];
</span><span>}
</span></code></pre>
<p><strong>Listing 3</strong> - <em>A non-oblivious array access</em></p>
<p>When we call <code>access_array_nonoblivious</code>, we will access the memory address <code>(Array+i)</code>. If the attacker can see every address we use, then the attacker can learn whether two calls to this function have the same arguments. To protect against this, we could again rely on a constant-time algorithm - scanning the entire array every time we need to access a single index, making use of the x86 conditional move instruction - <code>CMOV(cond, target, value)</code>. </p>
<blockquote>
<p>The <code>CMOV(cond, target, value)</code> instruction assigns <code>value</code> to <code>target</code> if <code>cond</code> is true, but always fetches <code>target</code> and <code>value</code> from memory, resulting in a constant memory trace. </p>
</blockquote>
<p></p>
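<p>As a concrete illustration, a branchless <code>CMOV</code> can be sketched in plain C++. This is a toy: production constant-time code typically uses the actual x86 <code>cmov</code> instruction (or inline assembly) so that the compiler cannot turn the selection back into a branch.</p>

```cpp
#include <cassert>
#include <cstdint>

// Branchless CMOV sketch: the mask, not a branch, decides the result.
inline void CMOV(bool cond, int& target, int value) {
    // mask is all-ones when cond is true, all-zeros when it is false
    uint32_t mask = -static_cast<uint32_t>(cond);
    // both operands are always read; only the mask selects between them
    target = static_cast<int>((static_cast<uint32_t>(target) & ~mask) |
                              (static_cast<uint32_t>(value) & mask));
}
```

<p>Both <code>target</code> and <code>value</code> are always read, so the memory trace does not depend on <code>cond</code>.</p>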
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">access_array_linear_scan</span><span>(</span><span style="color:#fffb9d;">int </span><span>Array[MAX_SIZE], </span><span style="color:#fffb9d;">int </span><span>index) {
</span><span> </span><span style="color:#fffb9d;">int</span><span> ret </span><span style="color:#ececec;">= </span><span style="font-weight:bold;color:#87d6d5;">0</span><span>;
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> j</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; j</span><span style="color:#ececec;"><</span><span>MAX_SIZE; j</span><span style="color:#ececec;">++</span><span>) {
</span><span>    CMOV(j</span><span style="color:#ececec;">==</span><span>index, ret, Array[j]);
</span><span> }
</span><span>  </span><span style="color:#fed6af;">return</span><span> ret;
</span><span>}
</span></code></pre>
<p><strong>Listing 4</strong> - <em>An oblivious array access via linear scan - this takes time<sup class="footnote-reference"><a href="#timemeaning">6</a></sup> \( O( \) MAX_SIZE \() \)</em></p>
<p>While using a linear scan for each array access ensures obliviousness, it is highly inefficient - we need to traverse the whole array every time we want to access a single element. However, this kind of linear scan solution is used in practice; for instance, <a rel="noopener" target="_blank" href="https://signal.org/">Signal</a> used it along with other techniques to offer <a rel="noopener" target="_blank" href="https://signal.org/blog/private-contact-discovery/">Private Contact Discovery</a>.</p>
<p>To address this inefficiency, more sophisticated methods have been developed to hide which index of an array is accessed. Rather than producing a constant memory access trace, these methods produce a random-looking one - PathORAM is one such method.</p>
<h2 id="pathoram">PathORAM</h2>
<p>Path Oblivious RAM (PathORAM) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a> is an efficient protocol designed to hide the access patterns to an array of size \(N\). The key insight for PathORAM is that rather than doing a linear scan to create a constant trace, we keep dynamically moving data in unprotected memory, so that the memory access trace is indistinguishable from that of accessing random positions in the array - so nothing can be inferred about which indexes we are accessing from the memory access trace. </p>
<h3 id="interface-for-pathoram">Interface for PathORAM</h3>
<p>PathORAM provides a straightforward interface:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">Access</span><span>(</span><span style="color:#fffb9d;">bool </span><span>readOrWrite, </span><span style="color:#fffb9d;">int </span><span>addr, </span><span style="color:#fffb9d;">int</span><span style="color:#ececec;">& </span><span>pos, </span><span style="color:#fffb9d;">int</span><span style="color:#ececec;">& </span><span>data)
</span></code></pre>
<blockquote>
<p>Performs a read or write operation (<code>readOrWrite</code>) on the specified address <code>addr</code>, reading from or writing to <code>data</code>. The position <code>pos</code> has information about where the specified address is stored and is updated for <code>addr</code> after each call to access. It is up to the caller to keep track of <code>pos</code> for each address<sup class="footnote-reference"><a href="#readpathoram">7</a></sup>.</p>
</blockquote>
<p></p>
<p>In PathORAM, each address is assigned a random position in {0, ..., N-1} that identifies where that address is stored in public memory (we will see how soon). This position is leaked after each access call; therefore, after each access call, a new random position is generated for that address. (We will explain how to keep track of all the positions for a binary search tree in the next section - <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#keeptrackofpositions">Oblivious Data Structures</a> - for now, just assume there is a way to keep track of the position for each address.)</p>
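<p>The caller-side bookkeeping this interface implies can be sketched as a toy - here <code>Access</code> is just a stand-in that serves a plain array and hands back a fresh random position, and <code>Read</code>, <code>Write</code>, and <code>posMap</code> are our own illustrative names:</p>

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

constexpr int CAPACITY = 8;
static int backingStore[CAPACITY];        // stand-in for the ORAM tree

// Stand-in for PathORAM's Access: same signature shape, no obliviousness.
void Access(bool readOrWrite, int addr, int& pos, int& data) {
    if (readOrWrite) backingStore[addr] = data;   // true = write
    else data = backingStore[addr];               // false = read
    pos = std::rand() % CAPACITY;   // addr is remapped after every call
}

// The caller owns the position map and passes the stored position back in.
std::vector<int> posMap(CAPACITY, 0);

int Read(int addr) {
    int data = 0;
    Access(false, addr, posMap[addr], data);  // posMap[addr] updated in place
    return data;
}

void Write(int addr, int value) {
    Access(true, addr, posMap[addr], value);
}
```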
<h3 id="how-pathoram-works">How PathORAM works</h3>
<p>In PathORAM, each array index is stored as a block: </p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">template </span><span><</span><span style="color:#fffb9d;">typename</span><span> T>
</span><span style="color:#fffb9d;">struct </span><span>Block{
</span><span> </span><span style="color:#fffb9d;">int</span><span> address;
</span><span> </span><span style="color:#fffb9d;">int</span><span> position; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> only used if the block is in the stash
</span><span> T data;
</span><span>}
</span></code></pre>
<p>PathORAM has two major structures:</p>
<ol>
<li>
<p><strong>Stash</strong> - where we keep recently accessed blocks. The stash has a constant maximum size and is accessed using linear scans.</p>
</li>
<li>
<p><strong>ORAM Tree</strong> (Figure 3) - an almost<sup class="footnote-reference"><a href="#acbst">8</a></sup> complete binary tree with \(N\) leaves where each node is called a Bucket and can have up to <code>Z</code> (typically <code>Z=4</code>) blocks of data. Whenever we access this tree, we will leak which nodes are being accessed in the trace.</p>
</li>
</ol>
<p>The <code>position</code> mentioned in the PathORAM interface identifies a unique path from the root to a leaf in this tree. If an address has a given position, then its block can be stored in any bucket on the path corresponding to that position.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/orambasic.png" alt="A visualization of ORAM" />
<strong>Figure 3</strong> - <em>A visualization of the PathORAM tree with <code>Z=4</code>, where path 2 - the path with all the buckets that could contain position 2 - is highlighted.</em></p>
<p>When we construct the ORAM, every address is assigned a random <code>position</code>, and is placed in some bucket in the corresponding path<sup class="footnote-reference"><a href="#readpathoram">7</a></sup>. When the access operation is called, the PathORAM algorithm does the following steps:</p>
<ol>
<li>
<p><strong>Read Path</strong> - Move, from the ORAM tree to the stash, all the blocks on the path identified by the original position of the address.</p>
</li>
<li>
<p><strong>Generate new position</strong> - Generate a new random position for the address we just accessed.</p>
</li>
<li>
<p><strong>Stash Operation</strong> - Do a linear scan over the stash to do the read/write operation on the address we want, and update its position.</p>
</li>
<li>
<p><strong>Stash Eviction</strong> - Pick a random path, and try to obliviously move blocks from the stash to this path - without revealing how many blocks were moved and the locations they were moved to. After this operation, the remaining number of blocks in the stash is below a small constant with very high probability <sup class="footnote-reference"><a href="#infoconstant">9</a></sup>.</p>
</li>
</ol>
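<p>The four steps above can be condensed into a toy, non-secure PathORAM sketch. The sizes, <code>Z</code>, and all helper names are our assumptions for illustration; a real implementation encrypts buckets and keeps the stash scans themselves oblivious.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

constexpr int NLEAVES = 4, Z = 2, NNODES = 2 * NLEAVES - 1;

struct OramBlock { int addr, pos, data; };

std::vector<OramBlock> oramTree[NNODES];       // one bucket (up to Z) per node
std::vector<OramBlock> stash;                  // recently accessed blocks
std::vector<int> positionMap(2 * NLEAVES, 0);  // addr -> current leaf position

std::vector<int> PathNodes(int pos) {          // node indices, leaf up to root
    std::vector<int> path;
    for (int n = NLEAVES - 1 + pos; ; n = (n - 1) / 2) {
        path.push_back(n);
        if (n == 0) break;
    }
    return path;
}

int Access(bool write, int addr, int data) {
    int pos = positionMap[addr];
    for (int n : PathNodes(pos)) {             // 1. Read Path into the stash
        for (OramBlock& b : oramTree[n]) stash.push_back(b);
        oramTree[n].clear();
    }
    int newPos = std::rand() % NLEAVES;        // 2. Generate new position
    int ret = 0;                               // 3. Stash op via linear scan
    bool found = false;
    for (OramBlock& b : stash)
        if (b.addr == addr) {
            found = true;
            b.pos = newPos;
            if (write) b.data = data; else ret = b.data;
        }
    if (!found && write) stash.push_back({addr, newPos, data});
    positionMap[addr] = newPos;
    int evictPos = std::rand() % NLEAVES;      // 4. Evict along a random path,
    for (int n : PathNodes(evictPos)) {        //    deepest bucket first
        for (auto it = stash.begin();
             it != stash.end() && (int)oramTree[n].size() < Z; ) {
            std::vector<int> bp = PathNodes(it->pos);
            if (std::find(bp.begin(), bp.end(), n) != bp.end()) {
                oramTree[n].push_back(*it);    // stays on its assigned path
                it = stash.erase(it);
            } else ++it;
        }
    }
    return ret;
}
```

<p>The invariant to notice is that a block assigned position <code>p</code> is always either in the stash or in some bucket on path <code>p</code> - which is exactly why reading path <code>p</code> in step 1 is guaranteed to find it.</p>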
<p>Provided we can keep track of the positions in some way (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#keeptrackofpositions">section Oblivious Data Structures</a>), we can do the access operation in an ORAM of size \(N\) in \( O(\log N \log\log^2 N) \) (non-private) memory accesses. <sup class="footnote-reference"><a href="#readpathoram">7</a></sup> </p>
<p>Let's now see how we can keep track of the node positions for a binary search tree.</p>
<h2 id="keeptrackofpositions">Oblivious Data Structures</h2>
<p>In order to keep track of the positions of all the nodes in a binary search tree, we can use an insight from the Oblivious Data Structures (ODS) paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[2]</a>:</p>
<blockquote>
<p><strong>We only need to store the position of the root node</strong> - since every operation in a binary search tree always accesses the nodes starting from the root of the binary search tree, we can store in each node the position of its two children directly (See Figure 4).</p>
</blockquote>
<p></p>
<p>This works because, when we want to access a node's children, we can generate the children's new random positions ahead of time and store them in the parent before actually accessing the children. </p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/avloram.png" alt="AVL Tree and how it is mapped to ORAM" />
<strong>Figure 4</strong> - <em>A visualization of the logical AVL tree and how it is mapped to the PathORAM tree. We only need to keep track that node 42 is in position 2. Its children (nodes 20 and 73) have their position information (2 and 7, respectively) stored inside node 42.</em></p>
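<p>The descent rule can be sketched as follows. All names here are hypothetical: <code>FakeOramRead</code> stands in for PathORAM's <code>Access</code>, and a plain <code>std::map</code> plays the role of the ORAM tree, so positions are accepted but ignored.</p>

```cpp
#include <cassert>
#include <cstdlib>
#include <map>

struct NodeRec { int key; int leftAddr, leftPos, rightAddr, rightPos; };
std::map<int, NodeRec> fakeOram;  // addr -> node; addr -1 means "no child"

NodeRec& FakeOramRead(int addr, int /*oldPos*/, int /*newPos*/) {
    return fakeOram[addr];        // a real ORAM would remap addr to newPos
}

bool Lookup(int rootAddr, int rootPos, int key) {
    int addr = rootAddr, pos = rootPos;
    int newPos = std::rand() % 1024;  // only the root's position is remembered
    while (addr != -1) {
        NodeRec& n = FakeOramRead(addr, pos, newPos);
        if (n.key == key) return true;
        // Generate the child's next position and record it in the parent
        // *before* accessing the child.
        int childNewPos = std::rand() % 1024;
        if (key < n.key) {
            addr = n.leftAddr;  pos = n.leftPos;  n.leftPos  = childNewPos;
        } else {
            addr = n.rightAddr; pos = n.rightPos; n.rightPos = childNewPos;
        }
        newPos = childNewPos;
    }
    return false;
}
```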
<p>This insight from ODS was previously implemented by P. Mishra et al. in Oblix <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>, where an oblivious AVL tree was used to develop an oblivious map. <strong>In ENIGMAP, we build on the insight from ODS to develop an oblivious AVL tree, with practical optimizations related to TEEs.</strong></p>
<h1 id="enigmap">ENIGMAP</h1>
<p>Our main contributions with ENIGMAP are:</p>
<ol>
<li>
<p>Identifying external memory accesses as an important cost of oblivious algorithms in TEEs (Figure 2).</p>
</li>
<li>
<p>An asymptotically and concretely faster <em>strongly oblivious</em> sorted map both in number of instructions executed as well as external memory accesses (section <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#mainqueryoptimizations">Main Query Optimizations</a>).</p>
</li>
<li>
<p>A faster initialization algorithm for oblivious maps, making it practical for large database sizes (section <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#fastinitializationalgorithm">Fast Initialization Algorithm</a>).</p>
</li>
</ol>
<p>So, let's take a look at a few of the optimizations done in ENIGMAP!</p>
<h2 id="mainqueryoptimizations">Main Query Optimizations</h2>
<h3 id="optimization-1-locality-friendly-layout-for-the-oram-tree">Optimization 1 - Locality Friendly Layout for the ORAM tree</h3>
<p>To improve the locality of data accesses, ENIGMAP leverages concepts from cache-oblivious algorithms and the van Emde Boas (vEB) layout <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[8]</a>. By organizing the ORAM tree in a cache-efficient manner, we reduce the number of page swaps needed to access a path, significantly improving access times<sup class="footnote-reference"><a href="#btwnotonlyexternalmemory">10</a></sup>.</p>
<p>When we want to access an AVL tree node, we need to call <code>Access</code> on the ORAM to get that node (recall from Figure 4 that each AVL node is stored in the ORAM), and therefore we have to read all the buckets in the path where that node is stored. In external memory, these buckets are stored in pages, which are read atomically but can typically hold \(B\) buckets each. </p>
<p>If we were to store the buckets in heap layout - that is, level by level, left to right - we would have to read \( \log N \) pages, since apart from the first few levels, all the buckets would end up in different pages (see Heap Layout in Figure 5).</p>
<blockquote>
<p>Instead, ENIGMAP uses a locality-friendly layout<sup class="footnote-reference"><a href="#noteembdaboas">11</a></sup> - we find the size of the largest ORAM subtree that fits in a page (its height is \( \lfloor \log_2{B} \rfloor \)) and store each subtree of that size together in a single disk page (see Our Layout in Figure 5). This optimization means we only have to read \( \log{\frac{N}{B}} \) pages per ORAM access.</p>
</blockquote>
<p></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/treepacking.png" alt="alt text" />
<strong>Figure 5</strong> - <em>Comparison of the Heap Layout with the layout used in ENIGMAP, considering B=3. Each triangle/rectangle represents a disk page. Red: pages read while accessing a given path. To read a given path, ENIGMAP reads an optimal number of pages (3 in the example), while the Heap Layout reads one page per bucket (5 in the example).</em></p>
<h3 id="optimization-2-ensuring-integrity-and-freshness-with-low-cost">Optimization 2 - Ensuring integrity and freshness with low cost</h3>
<p>Another key optimization enabled by the locality-friendly layout is that we can ensure integrity and freshness of data in external memory at almost no extra cost. Since we always access ORAM pages along a path from the root, and each friendly-layout subtree corresponds to a disk page, we can build a Merkle tree over the friendly-layout subtrees. Each subtree is encrypted with AES-GCM and stores the nonces of its children subtrees' encryptions. The main application only needs to keep the nonce of the root subtree to ensure integrity and freshness.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/friendlylayout_integrity.png" alt="alt text" />
<strong>Figure 6</strong> - <em>Achieving integrity protection efficiently in external memory on a tree can be done with a Merkle-tree. The fact we have subtrees packed together allows us to have good nonce-size to data-size ratio.</em></p>
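<p>The tag-chaining idea behind this Merkle construction can be illustrated with a toy. Here <code>std::hash</code> stands in for AES-GCM authentication and the names are ours, not ENIGMAP's API: each sealed page binds its payload to its children's tags, so keeping only the root tag authenticates every page on a root-to-leaf read.</p>

```cpp
#include <cassert>
#include <functional>
#include <string>

struct SealedPage { std::string payload; size_t childTag[2]; };

size_t Seal(const SealedPage& p) {  // tag covers payload and both child tags
    std::hash<std::string> h;
    return h(p.payload + "|" + std::to_string(p.childTag[0]) +
             "|" + std::to_string(p.childTag[1]));
}

bool Verify(const SealedPage& p, size_t expectedTag) {
    return Seal(p) == expectedTag;
}
```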
<h3 id="optimization-3-multi-level-caching-batching">Optimization 3 - Multi-level caching + batching</h3>
<p>ENIGMAP employs a multi-level caching scheme to optimize data access:</p>
<ul>
<li>
<p><strong>Page-Level Cache</strong>: Outside the enclave, this cache reduces the frequency of page swaps from disk.<sup class="footnote-reference"><a href="#sgxv2tdxbad">12</a></sup></p>
</li>
<li>
<p><strong>Bucket-Level Cache</strong>: Inside the enclave, it caches frequently accessed data to minimize external-memory calls; specifically, we cache the top levels of the ORAM tree so they always stay inside the enclave.</p>
</li>
<li>
<p><strong>AVL-Tree node Cache</strong>: This cache is specifically designed to optimize searches within the AVL tree. It is implemented by temporarily marking AVL nodes as sticky - these nodes should stay in the stash during eviction, until we mark them as non-sticky. If a node is sticky, we can just access it directly via a linear scan from the stash, without paying the ORAM overhead. We use this optimization in two ways:</p>
<ul>
<li><strong>AVL tree top caching</strong> - the first few AVL levels during search can just be accessed via linear scan.</li>
<li><strong>AVL batched operations</strong> - when we do an AVL tree insertion, we need to do two passes over the same AVL path. In the first pass, we mark all the nodes we will have to access as sticky, so that on the second pass we can access them faster via a linear scan of the stash.</li>
</ul>
</li>
</ul>
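<p>The sticky-marking mechanism can be sketched as a toy with illustrative names; the real stash additionally keeps its scans and eviction pattern oblivious.</p>

```cpp
#include <cassert>
#include <vector>

struct CacheBlock { int addr, data; bool sticky; };
std::vector<CacheBlock> localStash;

void PinSticky(int addr, int data) { localStash.push_back({addr, data, true}); }

void UnpinSticky(int addr) {
    for (CacheBlock& b : localStash)
        if (b.addr == addr) b.sticky = false;
}

bool StashRead(int addr, int& data) {  // full linear scan, as in the stash
    bool found = false;
    for (const CacheBlock& b : localStash)
        if (b.addr == addr) { data = b.data; found = true; }
    return found;
}

void EvictNonSticky() {                // sticky blocks survive eviction
    std::vector<CacheBlock> kept;
    for (const CacheBlock& b : localStash)
        if (b.sticky) kept.push_back(b);
    localStash.swap(kept);
}
```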
<p>We encourage you to read about further optimizations and how each of these optimizations impacts performance in our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>.</p>
<h2 id="fastinitializationalgorithm">Fast Initialization Algorithm</h2>
<p>Imagine we have an array of \(N\) key-value pairs and want to initialize an oblivious map with them. The simplest solution (Naive Solution) would be to start with an empty map and call <code>Set(key, value)</code> once for each key-value pair - this would cost us \(N\) AVL tree insertions. We can do better!</p>
<blockquote>
<p>Instead of doing \(N\) insertions, we construct the AVL tree with all the values at once. </p>
</blockquote>
<p></p>
<p>This works in two phases:</p>
<ul>
<li>
<p><strong><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#phase1">Phase 1</a></strong> - we build the logical AVL tree nodes with the correct random positions assigned.</p>
</li>
<li>
<p><strong><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#phase2">Phase 2</a></strong> - we place the nodes into the ORAM tree obliviously, using a PathORAM initialization algorithm.</p>
</li>
</ul>
<p>In order to better understand our algorithm, we will go over it step by step with an example. For the sake of simplicity, we will represent each key-value pair in the initial array by its key:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part1.png" alt="alt text" /></p>
<h3 id="phase1">Phase 1 - AVL Tree Construction</h3>
<p>In the first phase, our goal is to build the AVL tree, by assigning random positions to all the nodes, and correctly assigning children and storing the children's position on each node such that the AVL tree properties are preserved.</p>
<p>We start by obliviously sorting<sup class="footnote-reference"><a href="#osort">13</a></sup> the array and assigning random positions to each node<sup class="footnote-reference"><a href="#nosortneeded">14</a></sup>:
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part2.png" alt="alt text" /></p>
<p>Notice that any sorted array represents an implicit binary search tree with the AVL property - doing a binary search on the array corresponds to traversing a binary search tree.
So now, we need to correctly store the children positions on each node:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part3.png" alt="alt text" /></p>
<p>To do so, we use the <code>Propagate</code> procedure - <em>Listing 5</em>. For our example, the propagate algorithm for our array should be called as <code>Propagate(arr, 0, 7)</code>.</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">struct </span><span>AddrPos {
</span><span> </span><span style="color:#fffb9d;">int</span><span> addr;
</span><span> </span><span style="color:#fffb9d;">int</span><span> pos;
</span><span>};
</span><span>
</span><span style="color:#fffb9d;">struct </span><span>Node {
</span><span> </span><span style="color:#fffb9d;">int</span><span> key;
</span><span> AddrPos left, right;
</span><span> AddrPos ap; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> exists only during the construction algorithm.
</span><span>};
</span><span>
</span><span>AddrPos Propagate(vectorExternalMemory&lt;Node&gt;&amp; nodes, int left, int right) {
  if (left &gt; right) return NULL_ADDRPOS; // empty range: sentinel for a missing child
  int curr = (left + right) / 2;          // this node is the subtree root
  if (left == right) return nodes[curr].ap; // leaf: no children to update
  nodes.MarkSticky(curr);                 // pin in the stash while we recurse
  nodes[curr].left  = Propagate(nodes, left, curr - 1);
  nodes[curr].right = Propagate(nodes, curr + 1, right);
  nodes.MarkNotSticky(curr);
  return nodes[curr].ap;
}
</span></code></pre>
<p><strong>Listing 5</strong> - <em>Propagate procedure pseudocode</em></p>
<p><code>Propagate</code> will be called once for each node. Every time we access a node, we mark it as sticky so it won't be swapped into external memory, and then recurse on each child to get its position. This means each node will be transferred at most once from external memory, and at any given time at most \(\log N\) nodes are marked as sticky - since that is the maximum tree depth. Therefore, this algorithm incurs at most \(N\) external memory transfers.</p>
<p>Notice that the memory access pattern is oblivious since it depends only on the length of the array, and not on the content of the key-value pairs themselves. </p>
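<p>This is easy to check with a simplified <code>Propagate</code> (no ORAM, no stickiness) that merely records which indices it touches - the trace is a function of <code>(left, right)</code> alone:</p>

```cpp
#include <cassert>
#include <vector>

void PropagateTrace(int left, int right, std::vector<int>& trace) {
    if (left > right) return;       // empty range
    int curr = (left + right) / 2;  // root of this subtree
    trace.push_back(curr);          // record the "access"
    PropagateTrace(left, curr - 1, trace);
    PropagateTrace(curr + 1, right, trace);
}
```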
<p>Apart from the initial sorting, each Phase 1 stage does a linear number of external memory accesses and computation steps. </p>
<h3 id="phase2">Phase 2 - PathORAM Initialization</h3>
<p>The second phase of the algorithm is an ORAM initialization algorithm - we have the contents of all the blocks, as well as a randomly assigned position for each block, and we now want to place them in the ORAM tree without leaking where each block is stored. We encourage you to read about it in our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>.</p>
<p><strong>For now, let's take a look at the performance of ENIGMAP.</strong></p>
<h1 id="results">Results</h1>
<p>In order to evaluate ENIGMAP, we compare the performance of each map operation against two implementations:</p>
<ul>
<li>
<p>Signal's <a rel="noopener" target="_blank" href="https://signal.org/blog/private-contact-discovery/">Linear Scan Solution</a> - It does a linear scan of the whole database for a batch of \( \beta \) queries, indexing each entry of the database against a hashtable built obliviously from the batch of queries.</p>
</li>
<li>
<p>Oblix <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a> - The previous state of the art. It also uses an ODS-based AVL tree.</p>
</li>
</ul>
<blockquote>
<p>Experimental Setup</p>
<ul>
<li><strong>Database Size (N):</strong> We tested with varying sizes, up to 256 million entries.</li>
<li><strong>Batch Size (\( \beta \)):</strong> Because Signal's solution is optimized to work with batches of queries, we introduce the parameter \( \beta \) to define the number of queries in a batch. We used batch sizes of 1, 10, 100, and 1000 queries.</li>
<li><strong>SGX Setting:</strong> In this blogpost we report results for a large EPC size (192GB)<sup class="footnote-reference"><a href="#largeepcbad">15</a></sup>, on a machine with 256GB of non-EPC RAM. Please refer to our paper for other settings, such as a small EPC size (128MB). </li>
<li><strong>Key size:</strong> All the keys in the experiments have 8 bytes each.</li>
</ul>
</blockquote>
<h2 id="get-latency"><code>Get</code> Latency</h2>
<p>We analyzed the performance of doing batches of \( \beta \) <code>Get</code> operations on each map implementation, in terms of latency and throughput.</p>
<blockquote>
<p>At a database size of \(2^{26}\), ENIGMAP achieves a throughput speedup of 2x on <code>Get</code> queries, while maintaining a latency per <code>Get</code> of 0.45ms compared to Signal's 930ms and Oblix's 11ms.</p>
</blockquote>
<blockquote>
<p>At a database size of \(2^{32}\), ENIGMAP achieves a throughput speedup of 130x on <code>Get</code> queries, while maintaining a latency per <code>Get</code> of 2ms compared to Signal's 133000ms.<sup class="footnote-reference"><a href="#nooblixlargeepc">16</a></sup></p>
</blockquote>
<h3 id="asymptotics">Asymptotics</h3>
<!--
In terms of query throughput and latency, we are assymptotically faster than both Oblix and Signal, as we can see from Table 1.
-->
<table><thead><tr><th>Scheme</th><th>Page Swaps</th><th>Compute</th></tr></thead><tbody>
<tr><td>Signal</td><td>\( O\left(\frac{N}{B}\right) \)</td><td>\( O(N + \beta^2) \)</td></tr>
<tr><td>Oblix</td><td>\( O\left(\beta \log^2 N\right) \)</td><td>\( O\left(\beta \log^3 N\right) \)</td></tr>
<tr><td>ENIGMAP</td><td>\( O\left(\beta \log_B N \log N\right) \)</td><td>\( O\left(\beta \log^2 N \log \log N\right) \)</td></tr>
</tbody></table>
<p><em>Table 1 - Cost of a batch of \( \beta \) <code>Get</code> queries, on a map with N elements, page size B (key-value pairs) and EPC of size M (key-value pairs)</em></p>
<h3 id="experimental-latency">Experimental Latency</h3>
<!--
| \\( \beta \\) | Signal (ops/s) | Oblix (ops/s) | ENIGMAP (ops/s) | Signal (latency \\( ms\\)) | Oblix (latency \\( ms\\)) | ENIGMAP (latency \\( ms\\)) |
|--------------------------|------------------|-----------------|-------------------|------------------|-----------------|-------------------|
| 1 | 1.1 | 91.5 | **2200** | 920 | 11 | **0.45** |
| 10 | 10.9 | 91.4 | **2200** | 921 | 109 | **4.58** |
| 100 | 109 | 91.1 | **2200** | 924 | 1096 | **45.8** |
| 1000 | 1086 | 90.6 | **2200** | 930 | 11040 | **458** |
*Table 3 - Throughput of each solution for varying batch sizes at a database size \\(N=2^{26}\\). The best results for each row are shown in bold.*
| \\( \beta \\) | Signal (ops/s) | ENIGMAP (ops/s) | Signal (latency \\( ms\\)) | ENIGMAP (latency \\( ms\\)) |
|--------------------------|------------------|-----------------|-------------------|------------------|
| 1 | 0.008 | **970** | 133507 | **1.03** |
| 10 | 0.008 | **970** | 133502 | **10.3** |
| 100 | 0.008 | **970** | 133522 | **103** |
| 1000 | 0.008 | **970** | 133531 | **1027** |
*Table 4 - Throughput of each solution for varying batch sizes at a database size \\(N=2^{32}\\). The best results for each row are shown in bold.*
-->
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/graph_query.png" alt="alt text" />
<strong>Figure 7</strong> - <em>Comparison of ENIGMAP and Signal on
SGXv2. Enclave memory size is 192GB, RAM size is
256GB. The vertical lines mark when ENIGMAP and Signal
start to incur RAM and disk swaps, respectively. Comparison with Oblix in our paper is done through relative comparison to Signal (refer to Figure 9 in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>)</em></p>
<p><em>Figure 7</em> shows that:</p>
<ul>
<li>In terms of latency (measured at \(\beta=1\)), ENIGMAP always outperforms Signal (and Oblix). </li>
<li>In terms of throughput, for batch sizes of 10, 100 and 1000, ENIGMAP starts to outperform Signal at database sizes of \(2^{17}\), \(2^{22}\), and \(2^{25}\), respectively. Signal's quadratic computation term on the batch size makes it perform worse with batches larger than the ones tested.</li>
</ul>
<p>In our paper, we also analyze the query performance of Insertions and Deletions, and repeat the same experiments under different EPC and external-memory constraints. </p>
<blockquote>
<p>The key takeaway is that for medium and large database sizes (larger than \(2^{25}\) entries), ENIGMAP's throughput always outperforms the linear scan solution, making clear the superiority of ODS for TEEs.</p>
</blockquote>
<h2 id="initialization">Initialization</h2>
<p>We analyzed how long it takes to initialize an oblivious map of \(N\) entries.</p>
<blockquote>
<p>At a database size of \(2^{26}\), ENIGMAP's initialization takes 9.5h, a speedup of 18x compared to Oblix, but much slower than the few minutes Signal needs to create an enclave and write the key-value pairs to enclave memory.</p>
</blockquote>
<!--
In terms of initialization, we are faster than Oblix, but worse than Signal, since their initialization is just copying a single array from outside the enclave to the enclave, as we can see from Table 2.
-->
<h3 id="initialization-assymptotics">Initialization - Asymptotics</h3>
<table><thead><tr><th>Scheme</th><th>Page Swaps</th><th>Compute</th></tr></thead><tbody>
<tr><td>Signal</td><td>\( O\left(\frac{N}{B}\right) \)</td><td>\( O(N) \)</td></tr>
<tr><td>Oblix</td><td>\( O\left(\frac{N}{B} \log^2 N\right) \)</td><td>\( O(N \log N) \)</td></tr>
<tr><td>ENIGMAP</td><td>\( O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right) \)</td><td>\( O(N \log N) \)</td></tr>
</tbody></table>
<p><strong>Table 2</strong> - <em>Cost of initializing a map with N elements, page size B (key-value pairs) and EPC of size M (key-value pairs)</em></p>
<h3 id="initialization-experimental">Initialization - Experimental</h3>
<!-- TODO: change text from fast to ENIGMAP -->
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/graph_init.png" alt="alt text" />
<strong>Figure 8</strong> - <em>Initialization cost of ENIGMAP (Fast), compared to our implementation of Oblix's initialization and to the Naive Initialization - doing \(N\) insertions on the database</em></p>
<p><em>Figure 8</em> shows that ENIGMAP's initialization outperforms Oblix's by 2-18x depending on the database size. This improvement in initialization time is crucial for making ODS practical for larger database sizes. However, at 9.5h for \(N=2^{26}\), ENIGMAP's initialization is still much slower than Signal's linear scan initialization, which only takes a few minutes since it only needs to create a large enclave and copy the key-value pairs there. <sup class="footnote-reference"><a href="#notallhopelost">17</a></sup></p>
<h2 id="results-summary">Results Summary</h2>
<p>ENIGMAP shows significant improvements in query performance, achieving higher throughput and lower latency compared to Signal's linear scan solution and Oblix's ODS-based AVL tree. Specifically:</p>
<ul>
<li>
<p><strong>Query Throughput</strong>: ENIGMAP consistently outperforms both Signal and Oblix. At larger database sizes, ENIGMAP achieves up to 130x speedup over Signal for <code>Get</code> queries. This makes it highly efficient for handling large volumes of queries in practical applications.</p>
</li>
<li>
<p><strong>Query Latency</strong>: ENIGMAP maintains low latency per query, making it suitable for real-time applications. For example, at a database size of \(2^{26}\), ENIGMAP's latency per <code>Get</code> is 0.45ms compared to Signal's 930ms and Oblix's 11ms. This significant reduction in latency ensures quick response times for individual queries.</p>
</li>
</ul>
<p>While ENIGMAP excels in query performance, its initialization is slower than Signal's. Even so, ENIGMAP's higher throughput and lower latency make it highly practical for applications where query performance matters more than initialization time.</p>
<h1 id="finale">Applications, Limitations and Open Problems</h1>
<p>ENIGMAP's Oblivious Sorted Map has a broad range of applications:</p>
<ul>
<li><strong>Secure Databases</strong>: A sorted map can be used to build databases that protect the privacy of user queries. This is especially relevant for sensitive data such as medical records, financial transactions, and personal communication logs.</li>
<li><strong>Private Contact Discovery</strong>: Similar to the use case implemented by Signal, ENIGMAP can help in securely finding mutual contacts without revealing the contact list to the server.</li>
<li><strong>Cloud Computing</strong>: With the increasing reliance on cloud services, ensuring data privacy and security is paramount. ENIGMAP allows users to securely store and query data on untrusted cloud servers while maintaining the confidentiality of their access patterns. In this setting the external memory is no longer RAM or Disk, but the remote cloud server itself.</li>
<li><strong>Multi-party computations (MPC)</strong>: MPC and Fully Homomorphic Encryption (FHE) provide encrypted computations similar to Intel SGX, but rely on strong cryptographic primitives rather than trusting hardware vendors like Intel. In MPC, it is crucial for algorithms to be oblivious to prevent information leakage through data access patterns. Traditionally, maps in MPC have been implemented using linear scans or an online phase of Private Information Retrieval (PIR) or Oblivious RAM (ORAM). ENIGMAP can serve as an efficient implementation of online ORAM in MPC, offering significant performance enhancements.</li>
</ul>
<p>While ENIGMAP presents significant advancements, it also comes with certain limitations to address in future work:</p>
<p><strong>Initialization Time</strong>: ENIGMAP's initialization is slower than simpler solutions like Signal's linear scan, and for large databases it can be a bottleneck. Future work should explore how to minimize the initialization time of ORAMs/ODSs in the TEE setting; there is still room for improvement. In <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[7]</a> we developed a TEE external-memory optimized oblivious sorting algorithm that significantly improves ORAM initialization time; however, even with this optimization, ENIGMAP's initialization is still significantly slower than Signal's. Exploring other types of BSTs instead of AVL trees, or further improving ORAM initialization, could also help.</p>
<p><strong>Memory Overheads</strong>: The use of oblivious algorithms and PathORAM requires additional memory to store metadata, such as positions and encrypted blocks, as well as a linear amount of fake metadata used to store fake blocks, which may be a constraint in memory-limited environments. </p>
<p><strong>Exploring other BST implementations</strong>: In non-oblivious databases, typically B+ trees, AVL trees, skip-lists, or variations of them are used for indices. It would be interesting to explore in depth the tradeoffs between each of these solutions.</p>
<p>Our code <a rel="noopener" target="_blank" href="https://github.com/odslib/odsl">is available on github</a>, as well as ongoing work on <a rel="noopener" target="_blank" href="https://github.com/gty929/oram">more efficient oblivious maps</a>.</p>
<p><strong>Footnotes:</strong></p>
<div class="footnote-definition" id="whynotbplusorhashmap"><sup class="footnote-definition-label">1</sup>
<p>In our research, we considered both AVL trees and B+ trees. We opted for AVL trees in ENIGMAP because previous work had successfully used them, and for the specific problems we were addressing, the key and value sizes were relatively small. If range queries were not required, a hash table could be a faster alternative to a search tree.</p>
</div>
<div class="footnote-definition" id="gowiki"><sup class="footnote-definition-label">2</sup>
<p>To learn how each of the operations are implemented, the <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/AVL_tree">wikipedia page on AVL trees</a> is a great starting point.</p>
</div>
<div class="footnote-definition" id="boundedheightimportant"><sup class="footnote-definition-label">3</sup>
<p>Having a bounded height and number of nodes that are touched during an operation is needed so that the time it takes to do a query doesn't leak information about the query. Since the tree depth is at most \( 1.44 \log_2 N\), we can make every <code>Search</code> operation always touch \( 1.44 \log_2 N\) nodes, potentially accessing fake nodes after we found the <code>key</code> we were looking for.</p>
</div>
<div class="footnote-definition" id="andcppshortcircuits"><sup class="footnote-definition-label">5</sup>
<p>We need to use <code>*</code> instead of <code>&&</code> because in C++ the <code>&&</code> operator short circuits.</p>
</div>
<div class="footnote-definition" id="pagelevel"><sup class="footnote-definition-label">4</sup>
<p>The granularity of the public memory access trace for Intel SGX is typically at the page level (4KB pages). </p>
</div>
<div class="footnote-definition" id="timemeaning"><sup class="footnote-definition-label">6</sup>
<p>Both number of CPU instructions as well as number of memory accesses whose trace is public.</p>
</div>
<div class="footnote-definition" id="acbst"><sup class="footnote-definition-label">8</sup>
<p>An almost complete binary tree with N leaves is a binary tree where all levels except the last are full, and the last level contains its leaves in the leftmost positions only.</p>
</div>
<div class="footnote-definition" id="infoconstant"><sup class="footnote-definition-label">9</sup>
<p>The failure probability is negligible in the stash size - the probability of the stash becoming larger than K after an operation is \( o(2^{-K}) \) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a>.</p>
</div>
<div class="footnote-definition" id="readpathoram"><sup class="footnote-definition-label">7</sup>
<p>I suggest reading the PathORAM paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a> for more details on why the stash size is kept constant, how initialization works, and the recursive ORAM technique used to keep track of the positions of all the addresses.</p>
</div>
<div class="footnote-definition" id="btwnotonlyexternalmemory"><sup class="footnote-definition-label">10</sup>
<p>This locality-friendly layout is useful also in the scenario where we don't have a disk, since it can also translate to RAM pages that don't need to be cached, or it can also make trees with smaller nodes fit in CPU cache lines directly.</p>
</div>
<div class="footnote-definition" id="noteembdaboas"><sup class="footnote-definition-label">11</sup>
<p>This is not the van Emde Boas layout - we don't need to be cache-agnostic, since we know the page size for disk and SGX - and can even experimentally measure it, as you can find in our paper.</p>
</div>
<div class="footnote-definition" id="sgxv2tdxbad"><sup class="footnote-definition-label">12</sup>
<p>This cache is not as useful in SGXv2/TDX, since all the RAM can be used as part of the enclave.</p>
</div>
<div class="footnote-definition" id="osort"><sup class="footnote-definition-label">13</sup>
<p>Oblivious Sorting can efficiently be done on an array of size N in time \(O(N \log^2 N)\) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[7]</a>.</p>
</div>
<div class="footnote-definition" id="nosortneeded"><sup class="footnote-definition-label">14</sup>
<p>If the key-value pairs are already sorted by key, we can instead just verify it using a linear scan.</p>
</div>
<div class="footnote-definition" id="largeepcbad"><sup class="footnote-definition-label">15</sup>
<p>These larger EPC sizes come with weaker security guarantees: EPC memory in RAM no longer has a freshness check, so the hardware TCB is no longer just the CPU but all of the machine's hardware. Industry interest has recently shifted towards larger enclave sizes as they become available in cloud datacenters. The assumption of trusting the hardware is addressed by a “proof of cloud”: the cloud provider signs a statement that it is running the enclave and, since the provider faces a potentially huge economic loss if it lies, developers can trust that the hardware is not being tampered with. Since this is now a utility-based model, with large EPCs, trusting that an SGX enclave follows the Crash-Fault model becomes risk management rather than an expected guarantee.</p>
</div>
<div class="footnote-definition" id="nooblixlargeepc"><sup class="footnote-definition-label">16</sup>
<p>The Oblix paper does not report results for database sizes over \(2^{30}\), but even using the time for \(2^{28}\), we achieve a speedup of at least 53x, which further increases with database size. </p>
</div>
<div class="footnote-definition" id="notallhopelost"><sup class="footnote-definition-label">17</sup>
<p>Not all hope is lost in terms of initialization time; from our ongoing experiments, we believe the initialization time for binary search trees can be further improved if we move away from AVL trees.</p>
</div>
<h1 id="cite">Bibliography</h1>
<ol>
<li>
<p>E. Stefanov, M. Van Dijk, E. Shi, T.-H. H. Chan, C. Fletcher, L. Ren, X. Yu, and S. Devadas. “PathORAM: An Extremely Simple Oblivious RAM Protocol” <em>Journal of the ACM (JACM)</em> 65, 4, Article 18 (August 2018), 26 pages. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3177872">https://doi.org/10.1145/3177872</a></p>
</li>
<li>
<p>X. S. Wang, K. Nayak, C. Liu, T.-H. H. Chan, E. Shi, E. Stefanov, and Y. Huang. “Oblivious Data Structures” <em>Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14)</em>, Association for Computing Machinery, New York, NY, USA, 2014, pp. 215-226. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/2660267.2660314">https://doi.org/10.1145/2660267.2660314</a></p>
</li>
<li>
<p>V. Costan and S. Devadas. “Intel SGX Explained” <em>Cryptology ePrint Archive</em>, Report 2016/086. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2016/086.pdf">https://eprint.iacr.org/2016/086.pdf</a></p>
</li>
<li>
<p>J. V. Bulck, F. Piessens, and R. Strackx. “SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control” <em>Proceedings of the 2nd Workshop on System Software for Trusted Execution (SysTEX’17)</em>, Association for Computing Machinery, New York, NY, USA, Article 4, 1–6. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3152701.3152706">https://doi.org/10.1145/3152701.3152706</a></p>
</li>
<li>
<p>P. Mishra, R. Poddar, J. Chen, A. Chiesa, and R. A. Popa. “Oblix: An Efficient Oblivious Search Index” <em>2018 IEEE Symposium on Security and Privacy (SP)</em>, San Francisco, CA, USA, 2018, pp. 279-296. <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/document/8418609">https://doi.org/10.1109/SP.2018.00045</a></p>
</li>
<li>
<p>A. Tinoco, S. Gao, and E. Shi. “EnigMap: External-Memory Oblivious Map for Secure Enclaves” <em>32nd USENIX Security Symposium (USENIX Security 23)</em>, Anaheim, CA, August 2023, pp. 4033-4050. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity23/presentation/tinoco">https://www.usenix.org/conference/usenixsecurity23/presentation/tinoco</a></p>
</li>
<li>
<p>T. Gu, Y. Wang, B. Chen, A. Tinoco, E. Shi, K. Yi. “Efficient Oblivious Sorting and Shuffling for Hardware Enclaves” <em>Cryptology ePrint Archive</em>, Report 2023/1258. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/1258">https://eprint.iacr.org/2023/1258</a></p>
</li>
<li>
<p>M. A. Bender, E. D. Demaine, and M.
Farach-Colton. “Cache-oblivious b-trees” <em>SIAM J. Comput.</em>, 35(2):341–358, 2005. <a rel="noopener" target="_blank" href="https://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf">https://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf</a></p>
</li>
</ol>
Efficient Anonymous Blocklisting2024-07-31T00:00:00+00:002024-07-31T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/anonymous-blocklisting/<h2 id="tl-dr">TL;DR:</h2>
<ul>
<li>Truly Anonymous Service Providers can offer users genuine privacy, but also allow inappropriate behavior 😳</li>
<li>Anonymous Blocklisting permits blocking ill-behaved users without deanonymizing them. Specifically, the service provider blocks individual posts, and the benign users use zero-knowledge arguments (zk-SNARKs) to prove that they didn’t make said posts, without revealing any information about them 😮</li>
<li>The more posts get blocked, the less efficient the blocklist becomes, eventually making it practically infeasible 😞</li>
<li>SNARKBlock reduces the cost to logarithmic with respect to the size of the blocklist by introducing HICIAP, aggregating all the individual proofs into one efficient proof 🙌</li>
<li>You can now live out your anonymous double life 😎</li>
</ul>
<h1 id="introduction">Introduction </h1>
<p>The goal of this blog post is to teach the reader the fundamentals behind anonymous blocklisting, as well as to introduce a state-of-the-art blocklisting algorithm called SNARKBlock. In the conclusion, I will reflect on open problems for future anonymous blocklisting algorithms.</p>
<h2 id="anonymous-communications-systems">Anonymous Communications Systems</h2>
<p>Anonymous communications systems bring benefits but also harms. Users’ private information has always been vulnerable to irresponsible corporations and identity theft. On top of that, oppressed citizens living under authoritarian governments struggle to maintain their privacy and risk facing prosecution for speaking out, while political journalists have to fight to keep their sources hidden. Anonymous communications systems aim to help these users by keeping their identities private from other observers.
The largest network that allows anonymous communication to date is <a rel="noopener" target="_blank" href="https://svn-archive.torproject.org/svn/projects/design-paper/tor-design.pdf">Tor</a>, which utilizes “onion routing”, encrypting the data multiple times and passing it through a network of different nodes, making it difficult to trace back to the source. Unfortunately, malicious users can take advantage of the gift of anonymity resulting in online bullying and harassment, trolling, and the spread of harmful or illegal content without consequences. The problem is the following:</p>
<p><em>If no one knows who you are, no one can stop you.</em> </p>
<!-- <p></p> -->
<p>How can internet services provide anonymity to users without allowing inappropriate behavior?
Many service providers claim to be anonymous but have often been criticized for storing the user’s information, or metadata that can help identify them.<!--, like [Whisper](https://whisper.sh/) and [Blind](https://www.teamblind.com/). Another example is--> For instance, <a rel="noopener" target="_blank" href="https://www.wikipedia.org/">Wikipedia</a> provides weak anonymity by connecting a user’s personal information to a pseudonym instead of directly storing it. Thus, all of their actions (e.g., page edits) are publicly linked to their profile, and analyzing patterns in editing behavior or content preferences could lead to inferences about their identity.
One existing solution to linked metadata is using “revocable anonymity systems”, which allow for a user to be deanonymized or pseudonymized (having their actions linked) when necessary. For example, imagine if Wikipedia users were completely anonymous (i.e., without a public pseudonym), but if one of your edits is deemed “inappropriate”, your anonymity is stripped and your identity is revealed. This type of system typically relies on a Trusted Third Party aware of the identity of the user and capable of revoking a user’s privacy at their discretion.</p>
<h2 id="anonymous-blocklisting">Anonymous Blocklisting</h2>
<p>Anonymous blocklisting systems come to the rescue to enforce policies on users without deanonymizing them. These systems allow users to authenticate anonymously with a service provider, while service providers can revoke a user’s access without learning any information about their identity or involving a Trusted Third Party. Anonymous blocklisting systems block users from posting again by flagging individual posts instead of their accounts.</p>
<p>A way to realize this is to provide each user with a secret identity, and every one of their posts is secretly linked to that identity. Unlike the example with the Wikipedia users, there is not a public pseudonym connected to them, and their identity remains hidden even from the service provider. Whenever a user wants to post they have to prove that none of the flagged posts are linked to their identity, without leaking any information about it.</p>
<p>A savvy reader (you) can spot an immediate problem; how can we prevent users from making many different accounts? This is a common network service attack called the <a rel="noopener" target="_blank" href="https://www.freehaven.net/anonbib/cache/sybil.pdf">Sybil Attack</a>. In a “normal” system, users have to register through an identity provider (e.g., Google) using some identifier (e.g., their Gmail account). To solve this problem, blocklisting schemes can also utilize identity providers who would maintain a log of “registered users”. Hence, when a user posts, they have to also prove they are registered without revealing it’s them posting, using <del>magic</del> cryptography.</p>
<p>To summarize, anonymous blocklisting systems achieve blocking anonymous users without the need to deanonymize them. They allow users to post anonymously (even to the service provider), while service providers can block individual posts without any of the user’s information getting leaked.</p>
<h1 id="cryptographic-protocols">Cryptographic Protocols</h1>
<p>To delve deeper into the mechanics of anonymous blocklisting schemes, we’ll go over some cryptographic protocols that help ensure the users’ privacy.</p>
<h2 id="zk-snarks">ZK-SNARKs</h2>
<p>The main building block needed to build an anonymous blocklisting scheme is called a <a rel="noopener" target="_blank" href="https://www.di.ens.fr/%7Enitulesc/files/Survey-SNARKs.pdf">zk-SNARK</a>; Zero-Knowledge Succinct Non-Interactive Argument of Knowledge. Even if it is a mouthful, every single property is necessary. Let’s break them down together below. Assume we have two parties denoted as the Prover and the Verifier.</p>
<ul>
<li>Argument of Knowledge: A SNARK is a proof<sup class="footnote-reference"><a href="#1">1</a></sup> where the Prover can prove their possession of some information to the Verifier. Typically, the “information” is the solution (“witness”) to a computational problem that the Verifier could not solve by themselves.<!-- without knowledge of the information they are proving possession of.--></li>
<li>Non-Interactive: The communication between the two parties solely consists of a single proof message sent from the Prover to the Verifier (in more general models, the Verifier and Prover could engage in multiple rounds of interactive communication).</li>
<li>Succinct: The Prover’s message should be small compared to their witness.</li>
<li><a rel="noopener" target="_blank" href="https://people.csail.mit.edu/silvio/Selected%20Scientific%20Papers/Proof%20Systems/The_Knowledge_Complexity_Of_Interactive_Proof_Systems.pdf">Zero-Knowledge</a>: The Prover manages to prove possession of their witness without revealing any information about the witness itself.</li>
</ul>
<p>The crux of the whole protocol is the zero-knowledge property which can accompany a SNARK. To make zero-knowledge more tangible we can revisit one of the most overdone examples in the history of ZK, drawing from the world of <a rel="noopener" target="_blank" href="https://www.imdb.com/title/tt0417299/">Avatar: The Last Airbender</a><sup class="footnote-reference"><a href="#2">2</a></sup>. Imagine our Prover is Aang, and our Verifier is Toph. Aang has two different colored boomerangs (one red and one green) and wants to prove to Toph, who is (color)blind, that they are indeed different colors. However, Aang doesn’t want to let Toph know which boomerang is which color! Thankfully, they came up with the following Zero-Knowledge Protocol: Toph holds the boomerangs behind her back. She briefly displays one of the two boomerangs before hiding it again. She then again chooses one of the two at random, brings it out, and asks “Did I switch the boomerangs?”. Of course, Toph knows if she displayed the same or a different boomerang, and Aang can easily differentiate given their different colors. If Aang lies, he will only succeed 50% of the time. By repeating the protocol multiple times, Toph can be convinced, without actually learning anything about each boomerang’s color!</p>
<!--<img src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdW4wYmRsaDZkcnkzbHNpbzc1OW40djVmbDkxN3B5d2t0azd5cXllciZlcD12MV9naWZzX3NlYXJjaCZjdD1n/4IzOgM1bfOe6k/giphy.gif" width="45%"/>-->
<img src="toph.gif" width="40%"/>
<p>Of course in this example, even though we demonstrate the ZK property, this is not a SNARK since it is interactive. With a ZK-SNARK, Toph and Aang would not need multiple rounds of communication.</p>
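<p>The soundness of this toy protocol is easy to check numerically. Below is a small simulation of my own (not from any paper): an honest prover who can see the colors always answers correctly, while a cheating prover who cannot tell the boomerangs apart survives each round only with probability 1/2, so after \(n\) rounds a cheater passes with probability \(2^{-n}\).</p>

```python
import random

def run_protocol(prover_sees_colors: bool, rounds: int = 20) -> bool:
    """One execution of the boomerang protocol; True if the prover passes."""
    last_shown = random.choice(["red", "green"])
    for _ in range(rounds):
        switched = random.choice([True, False])  # Toph's secret coin flip
        shown = ({"red": "green", "green": "red"}[last_shown]
                 if switched else last_shown)
        if prover_sees_colors:
            answer = (shown != last_shown)       # Aang just compares colors
        else:
            answer = random.choice([True, False])  # a cheater can only guess
        if answer != switched:
            return False                         # caught lying, protocol aborts
        last_shown = shown
    return True

# An honest prover always passes; a cheater passes with probability 2^-20.
```

<p>Note that the verifier learns nothing she didn't already know: she chose whether to switch herself, so the transcript reveals nothing about which boomerang is which color.</p>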
<h2 id="signature-schemes">Signature Schemes</h2>
<p>Another cryptographic protocol required to understand anonymous blocklisting is a <a rel="noopener" target="_blank" href="https://people.csail.mit.edu/rivest/pubs/GMR88.pdf">signature scheme</a>. A digital signature, just like a real-life signature, gives a Prover the ability to sign a message before sending it. Then a Verifier, using some public information, can verify that the Prover was the one who sent that message. Signature Schemes consist of 3 algorithms:</p>
<ul>
<li>Generation Algorithm<!-- \\( Gen(\lambda) \leftarrow (sk,vk) \\)-->: It produces a signature (secret) key <em>sk</em> only known by the signer and a verification key <em>vk</em> public to everyone.</li>
<li>Signing Algorithm<!-- \\(Sig(sk, m) \leftarrow \sigma\\)-->: Given a message <em>m</em> and the secret key <em>sk</em>, it produces the digital signature σ.</li>
<li>Verification Algorithm<!-- \\(Ver(vk, m, \sigma) \leftarrow \{0,1\}\\)-->: Given the public verification key <em>vk</em>, the original message <em>m</em> and the signature σ, it produces 1 if the signature is valid and 0 if it’s not.</li>
</ul>
<p>Signature schemes allow users to authenticate the origin and integrity of a message. Let’s understand their importance through another Avatar example, where Zuko is trying to capture the Avatar<sup class="footnote-reference"><a href="#2">2</a></sup>. Imagine Zuko wants to announce to the world online, and as an extension his dad -the Firelord-, that he caught the Avatar. However, anyone can try and give false information, impersonating Zuko, and claim the Avatar has been caught. To avoid this scenario, Zuko sets up a digital signature scheme; he runs the Generation algorithm and shares the (public) verification key with his dad before his trip. Now, if he catches the Avatar, he can publish his message “I caught the Avatar, and with him, my honor”, along with a signature σ. As a result, the Firelord can run the (public) verification algorithm, which would return true if this message is truly from Zuko, or false, indicating it did not, in fact, come from Zuko.</p>
<img src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExb28zZW5yaHozNjd0NmM5MGdudnZ6dWMwYmxuYjQxamVwdjZhMWlrZyZlcD12MV9naWZzX3NlYXJjaCZjdD1n/kN2THUm8712diLwn0F/giphy.gif" width="50%"/>
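<p>Zuko's setup can be sketched with a minimal Lamport one-time signature scheme, built only from a hash function. This is a toy illustration of my own choosing, not the scheme used by any system discussed in this post, and each key pair must sign at most one message:</p>

```python
import hashlib
import secrets

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def gen():
    # sk: 256 pairs of random secrets; vk: the hash of each secret
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    vk = [(H(a), H(b)) for a, b in sk]
    return sk, vk

def msg_bits(m: bytes):
    digest = H(m)
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(256)]

def sign(sk, m: bytes):
    # reveal one secret of each pair, chosen by the bits of H(m)
    return [sk[i][b] for i, b in enumerate(msg_bits(m))]

def verify(vk, m: bytes, sig) -> bool:
    return all(H(s) == vk[i][b] for i, (s, b) in enumerate(zip(sig, msg_bits(m))))
```

<p>Zuko publishes <em>vk</em> before his trip and later releases the message together with <code>sign(sk, m)</code>; anyone holding <em>vk</em> can verify it, and forging a signature on a different message would require inverting the hash function.</p>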
<h2 id="commitment-schemes">Commitment Schemes</h2>
<p>We will also encounter a <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emblum/research/pdf/coin/">commitment scheme</a>, which allows one to commit to a specific value while keeping it hidden from everyone else. In addition, the committed value can be revealed later without any possibility of alteration. This ensures both confidentiality and integrity, ensuring that the committed value is securely locked and that after revealing it, the original commitment and the revealed value match perfectly. Commitment schemes consist of the following two phases, which take place between a committer and a receiver:</p>
<ul>
<li>The commit phase, during which the committer chooses and commits to a value by producing a “commitment” message.</li>
<li>The reveal phase, during which the committer sends the value to the receiver along with a “decommitment” message (e.g. the randomness used in the commit phase), and the receiver can verify its authenticity.</li>
</ul>
<p>Commitment schemes have to satisfy two properties:</p>
<ul>
<li>Hiding property: Given a commitment, no information about the committed value can be extracted.</li>
<li>Binding Property: The value chosen in the commit phase is the only one that the commitment can decommit to.</li>
</ul>
<p>Assume that Aang and Toph are playing a game where Toph thinks of a number between 1 and 10, and Aang tries to guess it. They want to ensure that Toph cannot change her pick after Aang makes his guess, so they use a commitment scheme. Toph thinks of the number 7 and sends a commitment to Aang. Then Aang makes his guess and says: “I think the number is 5”. Finally, Toph reveals her pick by sending a decommitment message. Due to the properties of the commitment scheme, Aang cannot cheat and get any information about Toph’s number before the reveal phase. At the same time, if Aang guessed 7 correctly, Toph cannot lie about her initial pick claiming it was a different number.</p>
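<p>The guessing game above can be sketched with a simple hash-based commitment (a sketch of my own; modeling the hash as a random oracle, hashing a fresh random nonce together with the value gives both hiding and binding):</p>

```python
import hashlib
import secrets

def commit(value: bytes):
    # Commit phase: Toph hashes a fresh random nonce together with her value.
    nonce = secrets.token_bytes(32)
    return hashlib.sha256(nonce + value).digest(), nonce

def open_commitment(commitment: bytes, value: bytes, nonce: bytes) -> bool:
    # Reveal phase: Aang recomputes the hash and compares.
    return hashlib.sha256(nonce + value).digest() == commitment

# Toph commits to 7; the commitment alone reveals nothing (hiding),
# and she cannot later open it to any other number (binding).
c, nonce = commit(b"7")
assert open_commitment(c, b"7", nonce)
assert not open_commitment(c, b"5", nonce)
```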
<!--Assume that Aang wants to correctly guess how many guards Suki can take down, without letting her know beforehand. However, Suki doesn't trust him, so they use a commitment scheme. Soka guesses 42 and produces and sends a commitment message to Suki. After the fight, Aang reveals his answer, while also sending a decommitment message. Now, if the real number of guards was 43, Soka could not change his answer and still convince Suki he was right. At the same time, Suki cannot get any information about Aang's answer before the reveal phase and change her actions during the fight.
![suki](https://64.media.tumblr.com/5fe3d1ed55146768d4aabbb7a419f132/9b647a9d5c31eada-f0/s250x400/4814c265ff350458f9beb79eb4b7fe62509590ef.gifv)-->
<h2 id="pseudorandom-functions-prfs">Pseudorandom Functions (PRFs)</h2>
<p>The last protocol we will refer to is a <a rel="noopener" target="_blank" href="https://www.wisdom.weizmann.ac.il/%7Eoded/X/ggm.pdf">pseudorandom function</a>, or PRF. A PRF is a function that takes as input a key and a message, and returns a random-looking string: <em>PRF(key, message) = pseudorandom string</em>. A PRF must have the following two properties:</p>
<ul>
<li>It is easy to compute (i.e., in polynomial time).</li>
<li>One cannot distinguish between random strings and the results of a PRF without access to the key.</li>
</ul>
<p>In other words, randomness is expensive; PRFs are cryptography’s efficient way of faking randomness.</p>
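<p>In practice, HMAC over a cryptographic hash is commonly modeled as a PRF; a minimal sketch using Python's standard library:</p>

```python
import hashlib
import hmac
import secrets

def prf(key: bytes, message: bytes) -> bytes:
    # HMAC-SHA256: while the key stays secret, outputs are assumed to be
    # indistinguishable from random 32-byte strings.
    return hmac.new(key, message, hashlib.sha256).digest()

key = secrets.token_bytes(32)
out = prf(key, b"hello")
assert out == prf(key, b"hello")   # deterministic given (key, message)
assert out != prf(key, b"world")   # different messages look unrelated
```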
<h1 id="anonymous-blocklisting-1">Anonymous Blocklisting</h1>
<p>We will now focus on anonymous blocklisting schemes, starting with their necessary properties and continue by doing a survey of state-of-the-art systems <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock</a> and its predecessor <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/1880022.1880033">BLAC</a>.</p>
<p>Overall, in an Anonymous Blocklisting Scheme, there is a blocklist filled with flagged posts. Every user gets a “secret identity” when they register for the service. Every time they want to post, they have to produce a token that is linked to their identity, and then prove both that they are registered and that none of the flagged posts were made by them. To do that, as we will see later, they can use zero-knowledge proofs, attesting to the fact that their identity is valid (produced during the registration) and that it is not connected to any of the posts in the blocklist, without revealing any actual information about their identity. </p>
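<p>To make the flow above concrete, here is a simplified (and deliberately <em>non</em>-zero-knowledge) model of the statement a user proves; real schemes such as BLAC and SNARKBlock prove this kind of predicate inside a zero-knowledge proof, so the secret identity <em>k</em> is never revealed. The function names and the HMAC-based token construction are my own illustration, not the exact construction of either system:</p>

```python
import hashlib
import hmac
import secrets

def make_token(k: bytes, nonce: bytes) -> bytes:
    # Each post carries (nonce, token); the token is a PRF of the user's
    # secret identity k, so it is linked to k but looks random to everyone else.
    return hmac.new(k, nonce, hashlib.sha256).digest()

def blocklist_statement(k: bytes, blocklist) -> bool:
    # The predicate the zero-knowledge proof attests to: none of the
    # blocked (nonce, token) pairs were produced with this user's k.
    return all(make_token(k, nonce) != token for nonce, token in blocklist)

alice, bob = secrets.token_bytes(32), secrets.token_bytes(32)
nonce = secrets.token_bytes(16)
blocklist = [(nonce, make_token(bob, nonce))]   # one of Bob's posts is flagged
assert blocklist_statement(alice, blocklist)    # Alice can still prove the statement
assert not blocklist_statement(bob, blocklist)  # Bob cannot
```

<p>Evaluating this predicate directly would reveal <em>k</em>; the whole point of the zk-SNARK is to convince the service provider that it holds without the prover disclosing <em>k</em> or any of the comparisons.</p>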
<h2 id="definition">Definition</h2>
<p>We will begin with a simplified definition of anonymous blocklisting schemes due to <a rel="noopener" target="_blank" href="https://cacr.uwaterloo.ca/techreports/2010/cacr2010-24.pdf">Henry and Goldberg</a> that can be generalized to most existing blocklisting schemes.</p>
<p>The parties involved are the following:</p>
<ul>
<li>Users: The set of individuals using the service are called users. Every user is assigned a random unique identifier <em>k</em>, which constitutes their identity and remains secret.</li>
<li>Identity Providers: Every user has to connect to an identity provider (e.g., Google) to register for the service and acquire a new and valid identity.</li>
<li>Service Providers: The entity (or entities) providing the service (e.g., Wikipedia).</li>
<li>Revocation Authorities: The authority responsible for flagging content and blocking users. For the context of this blog, we assume the service providers also play the role of the revocation authorities, which is the case for most blocklisting schemes. </li>
</ul>
<p>The protocols that can take place in an anonymous blocklisting scheme are:</p>
<ul>
<li>Registration Protocol: This protocol takes place between a user and an identity provider and happens once so that the user registers for the service. By running this algorithm, the user receives a valid unique identifier. </li>
<li>Token Extraction Protocol: For a user to take an action on the service (e.g., post), they need an authentication token. By running this protocol with their unique identifier, they can obtain a token secretly linked to their identity. This ensures that the token can be used for authentication while preventing anyone from learning anything about the user's identity by merely observing the token.
</li>
<li>Authorization Protocol: In this protocol, the service provider takes as input an authentication token and verifies that the user is eligible to use the service (i.e., not blocked).</li>
<li>Revocation Protocol: This protocol is run by the service provider, taking as input an authentication token and blocking the corresponding user by adding the token to the blocklist.</li>
<li>Reinstatement Protocol: Similarly to the revocation protocol, the service provider can also unban a user by removing their token from the blocklist. </li>
</ul>
<p>The crux of the anonymous blocklisting scheme is ensuring the following three security requirements:</p>
<ul>
<li>
<p>Blocklistability: A user can successfully authenticate to an honest service provider only if they hold a valid identity, issued by an identity provider, that is not on the blocklist. Specifically, it encompasses the following two notions:</p>
<ol>
<li>Verification should succeed only on authentication tokens that are the result of a user correctly executing the established protocols.</li>
<li>Given an authentication token issued to some anonymous user, a service provider can have the user’s access revoked, such that the user cannot post again until all of their banned tokens are removed. </li>
</ol>
</li>
<li>
<p>Anonymity: No information about the user can be linked to an authentication token, which encompasses the following two notions:</p>
<ol>
<li>Given an authentication token from one of two users, it should be infeasible for an attacker to determine which user that authentication token was issued to.</li>
<li>Given two or more authentication tokens, it should be infeasible for an attacker to distinguish whether they came from the same user or from two different users.</li>
</ol>
</li>
<li>
<p>Non-frameability: An honest user cannot be prevented from being authenticated by an honest service provider.</p>
</li>
</ul>
<p>Let’s go back to our example setting<sup class="footnote-reference"><a href="#2">2</a></sup> to better understand the mechanics of such a scheme. Imagine that the officials of the city of Ba Sing Se have set up an anonymous forum for people to share secrets from their everyday lives, like the app Whisper, and Joo Dee is an identity provider. Assume that Aang wants to subscribe to the forum and make posts. He first has to contact Joo Dee to register. As an outcome, he gains a unique identifier <em>k</em>, secret even to Joo Dee. Then, assume he wants to post the message “There is war in Ba Sing Se”. He runs the token extraction protocol using his identifier to get a token α and anonymously sends the message along with the token to the city officials. They run the authorization protocol, verify that the message is not coming from a banned user, and publish it.
Of course, it’s not long before the message gets flagged for harmful content, so the authorities run the revocation protocol on that token. Now, if Aang tries to post again with a new token, it won’t be authorized. However, no one can link his message to him or to any of his futile attempts to post again.</p>
<img src="https://media.tenor.com/SOnmo9jnfQsAAAAM/avatar-the-last-airbender.gif" width="45%"/>
<h2 id="inefficient-constructions">Inefficient Constructions</h2>
<p>As usual in cryptography, things in practice are a little different, and by different, I mean worse. The security requirements explained above are necessary but not sufficient for an anonymous blocklisting scheme that is useful in practice. The size of the blocklist can grow extremely fast depending on the use case.
For example, Wikipedia sees approximately <a rel="noopener" target="_blank" href="https://stats.wikimedia.org/#/en.wikipedia.org/contributing/edits/normal%7Cbar%7C2020-11-04%7E2021-11-24%7C%7Etotal%7Cmonthly">2 edits per second</a> and Reddit around <a rel="noopener" target="_blank" href="https://old.reddit.com/r/blog/comments/k967mm/reddit_in_2020/">64 comments per second</a>. Estimating from event logs from 2020, the ban rate for Wikipedia is around 1%, which would result in approximately 2 thousand bans per day for Wikipedia and around 40 thousand per day for Reddit.</p>
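<p>Treating these as per-day figures (consistent with the 4-million-entry, roughly 100-day estimate used further down for Reddit), a quick sanity check:</p>

```python
SECONDS_PER_DAY = 86_400

# Wikipedia: ~2 edits/second with a ~1% ban rate.
wiki_bans_per_day = 2 * SECONDS_PER_DAY * 0.01
assert round(wiki_bans_per_day) == 1728   # i.e., roughly 2 thousand per day

# Reddit: at ~40 thousand bans/day, the blocklist reaches 4 million entries
# (the size used in the comparison with BLAC) in about 100 days.
assert 4_000_000 / 40_000 == 100
```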
<p>We thus need schemes that are efficient, both for the users and the service provider. </p>
<p>On the user’s side, to be efficient means that authenticating a token and using the service has a predictable runtime and bandwidth so as not to add too much latency to their requests. On the service provider’s side, we need both the authentication and revocation processes to have predictable running times and bandwidth, so that the cost of servicing a user is not too high and the system can keep up with the expected rate of revocations.</p>
<p>Let’s start with a construction inspired by <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/1880022.1880033">BLAC</a> by Tsang et al., the first anonymous blocklisting scheme without a trusted third party, to delve deeper into the mechanics<sup class="footnote-reference"><a href="#3">3</a></sup>.
Consider the blocklist as a set of tokens, where every token is of the form <em>(nonce, PRF(k, nonce))</em> for some random number <em>nonce</em> and someone’s unique identifier <em>k</em>.</p>
<p>The <strong>registration protocol</strong> has to take place once, before the user can access the service. The user randomly chooses their credential <em>k</em>, then computes and sends a commitment <em>com(k)</em> to the identity provider. The identity provider answers with a signature <em>σ</em> on that commitment. </p>
<p>During the <strong>token extraction protocol</strong>, the user randomly chooses a value <em>nonce</em> and computes a token <em>α = (nonce, PRF(k, nonce))</em>, tying the token with their identity.</p>
<p>The <strong>authorization protocol</strong> works as follows: the user sends the service a token <em>α</em>, along with a zk-SNARK proving that: (i) the token is computed correctly and is equal to <em>PRF(k, nonce)</em>, (ii) they have a well-formed commitment <em>com(k)</em> that is signed by an identity provider, and (iii) none of the tokens in the blocklist are related to the user’s identifier, i.e., <em>for all α′ = (nonce′, h) in the blocklist, PRF(k, nonce′) ≠ h</em>. </p>
<p>Then, in the <strong>verification protocol</strong> the service provider checks that the proofs are valid (i.e., the user is not blocked), and only then offers their service.</p>
<p>Finally, for the <strong>revocation protocol</strong>, if the service provider notices harmful content, they add the token accompanying it in the blocklist. Respectively, they can remove it if they decide to unban them by running the <strong>reinstatement protocol</strong>.</p>
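<p>To make the moving parts concrete, here is a toy Python sketch of this construction, with HMAC-SHA256 standing in for the PRF. The commitment, signature, and zk-SNARK layers are omitted: the blocklist check below runs directly on the secret <em>k</em>, whereas in the real scheme the user proves the same predicate in zero knowledge without revealing <em>k</em>.</p>

```python
import hashlib
import hmac
import secrets

def prf(key: bytes, nonce: bytes) -> bytes:
    # HMAC-SHA256 standing in for the PRF.
    return hmac.new(key, nonce, hashlib.sha256).digest()

# Registration: the user picks a secret identifier k.
# (The commitment com(k) and the identity provider's signature are omitted.)
k = secrets.token_bytes(32)

def extract_token(k: bytes) -> tuple[bytes, bytes]:
    # Token extraction: a fresh nonce ties each token to k.
    nonce = secrets.token_bytes(16)
    return (nonce, prf(k, nonce))

blocklist: list[tuple[bytes, bytes]] = []

def not_blocked(k: bytes) -> bool:
    # The predicate the user's zk-SNARK would prove WITHOUT revealing k.
    return all(prf(k, nonce) != h for (nonce, h) in blocklist)

token = extract_token(k)
assert not_blocked(k)        # authorization succeeds: no blocklist entry matches

blocklist.append(token)      # revocation: the service adds the offending token
assert not not_blocked(k)    # every future token from this user now fails
```

<p>Note that revoking one token blocks the whole identity: any later token still shares the same <em>k</em>, so the predicate over the blocklist fails.</p>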
<p>The security of this construction is straightforward to verify:</p>
<ul>
<li>
<p>The scheme satisfies <strong>blocklistability</strong> since, if a user is blocked or tries to use fake credentials, their zk-SNARK won’t verify: there would either be a token in the blocklist connected to their unique identifier <em>k</em>, or they wouldn’t have a valid signature on the commitment of their identifier. Moreover, because of the binding property of the commitment scheme, they cannot connect a different value to the signed commitment <em>σ</em> from the identity provider. At the same time, the service provider can block any user by adding their corresponding token to the blocklist.</p>
</li>
<li>
<p>The scheme is also <strong>anonymous</strong>, since all the information sent to the service provider is through a zk-SNARK, revealing no information about the user. In addition, due to the hiding property of the commitment scheme, the identity provider also never learns anything about the user’s identity.</p>
</li>
<li>
<p>As far as <strong>non-frameability</strong> goes, to prevent an honest user from using the service, one would have to produce a token tied to that user’s unique identifier, which is infeasible given the pseudorandomness of the PRF.</p>
</li>
</ul>
<p>However, there is an immediate efficiency flaw in the above construction: the server has to do work linear in the size of the blocklist to verify a user, since the proof goes over the whole blocklist every time. The proof sizes are also linear in the size of the blocklist. In BLAC, a single proof for a blocklist with 4 million blocks (a size that, according to our previous estimates, Reddit would reach in approximately 100 days) would require a client to upload 549MiB of data. Overall, existing zk-SNARK implementations can only handle pieces of the blocklist efficiently. </p>
<h1 id="snarkblock">SNARKBlock</h1>
<p>Enter <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock</a>, a new anonymous blocklisting scheme by Rosenberg et al. The authors build upon the aforementioned construction and offer proofs that are only logarithmic in the size of the blocklist, while also requiring only logarithmic verification time. </p>
<p>Blocklists are mostly append-only: existing entries rarely change, while the service provider keeps adding new ones. As a result, both the service provider and the users end up recomputing a lot of the same information. More specifically, if a user has calculated a proof for a blocklist with 99 blocked posts, after a new post gets added they have to calculate a new proof covering all 100 posts.
The authors instead break the blocklist into non-overlapping chunks, so that users can reuse their proof computation over the unchanged chunks. They then combine all the distinct chunk proofs into one proof of logarithmic size (in relation to the blocklist size).
In our example, we could separate the blocklist into 10 chunks and only recompute the proof for the last chunk of 10 blocked posts.</p>
<p>There are two immediate problems with the above technique. To begin with, in the original protocol, the proof attesting to the validity of the user’s post took the user’s unique identifier as a witness, ensuring they had not made any of the blocked posts. What happens, though, with proofs over different chunks? Each proof would take its witness anew, and a malicious user could potentially use a different identity for a specific chunk, bypassing the block. </p>
<p>Another less obvious problem is the need for rerandomization over the proofs. Reusing a proof for a specific chunk can reveal information that connects the user with previous posts. There are indeed SNARK proofs that allow rerandomization, like the <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2016/260.pdf">Groth16</a> scheme used in SNARKBlock. Nevertheless, the same cannot be said when presenting multiple proofs with a common hidden input.</p>
<p>Both of these problems are solved with the introduction of HICIAP.</p>
<h2 id="hiciap">HICIAP</h2>
<p>The main contribution of SNARKBlock is a new type of zero-knowledge proof, called HIdden Common Input Aggregate Proofs, or HICIAP (pronounced “high-chop”). With HICIAP, one can aggregate many zk-SNARKs (specifically Groth16 proofs) of the same relation into a single logarithmic-sized proof and show that they all share a common hidden input. At the same time, it is possible to link multiple HICIAP proofs of different relations, showing in a zero-knowledge proof that those inputs are equal. For our setting, this means that we can (i) have different proofs for each chunk of the blocklist that we aggregate to a single proof and (ii) link that proof with the other distinct proofs to make sure the same secret identity was used for all of them.</p>
<p>Let’s see now how SNARKBlock’s protocol differs from the BLAC-inspired inefficient construction. The authors separate the Authentication protocol into <strong>Sync</strong>, which is run by the user offline (i.e., before the authentication has to go through) performing necessary pre-computation, <strong>Attest</strong>, where the user produces and sends the token along with a zk-SNARK to prove eligibility, and <strong>Verify</strong>, where the service provider finally authenticates the user if the SNARK verifies correctly. Overall, the user can gather all the different proofs for each chunk and wrap them in a HICIAP proof, proving they share a common input. Later they can link this HICIAP proof with the rest of the proofs related to honest token extraction and registration.</p>
<p>More specifically, in <strong>Sync</strong>, the user starts by fetching the most recent version of the blocklist and its division into chunks. Then they compute:</p>
<ul>
<li>a proof \( \pi_{chunk_i} \) for each chunk of the blocklist that was altered or updated, proving that the user’s unique identifier is not correlated with any of the blocks in that chunk; i.e., if the user’s identifier is <em>k</em>, then for all blocks <em>α = (nonce′, h)</em> in the chunk, <em>PRF(k, nonce′) ≠ h</em></li>
<li>a proof \( \pi_{isu} \) attesting to having registered, i.e. having a witness for a commitment signed by the identity provider</li>
</ul>
<p>When it’s time to use the service, the <strong>Attest</strong> protocol takes place. The user does the following:</p>
<ul>
<li>computes a proof \(\pi_{token}\) after they have extracted a token α, to prove that the token was computed honestly using their unique identifier <em>k</em>, <em>α = (nonce, PRF(k, nonce))</em></li>
<li>wraps the \( {\pi_{chunk_i}} \) proofs for all chunks, \( \pi_{isu} \) and \(\pi_{token}\) proofs into HICIAP proofs \( \hat{\pi_{chunk}} \), \( \hat{\pi_{isu}} \) and \(\hat{\pi_{token}}\) respectively</li>
<li>produces a proof \( {\pi_{link}} \) that all of the aforementioned HICIAP proofs share the same witness, their unique identifier <em>k</em></li>
<li>sends \( \hat{\pi_{chunk}} \), \( \hat{\pi_{isu}} \), \(\hat{\pi_{token}}\) and \( {\pi_{link}} \) to the service provider.</li>
</ul>
<p>Finally, the service provider checks the validity of the proofs during <strong>Verify</strong>.</p>
<h2 id="efficiency">Efficiency</h2>
<p>As noted before, SNARKBlock is much faster than BLAC, since both the verification time and proof size become logarithmic instead of linear in the size of the blocklist. In BLAC, a blocklist with 4 million bans would require a proof of 549MiB, whereas a SNARKBlock attestation for the same size blocklist is only 130KiB, making it feasible for use without elaborate hardware! But does this automatically make SNARKBlock efficient enough to be used in practice? We also cannot forget the extra cost SNARKBlock introduces: offline synchronization.</p>
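<p>Plugging in the reported figures at 4 million blocklist entries gives a sense of the gap:</p>

```python
KiB, MiB = 1024, 1024 * 1024

blac_proof_bytes = 549 * MiB        # linear in the blocklist size
snarkblock_proof_bytes = 130 * KiB  # logarithmic in the blocklist size

ratio = blac_proof_bytes / snarkblock_proof_bytes
assert ratio > 4000                 # over a 4000x reduction in upload size
```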
<p>Here we can see the authors’ evaluation of different-sized blocklists. These include the synchronization time depending on how much the blocklist was altered, the attestation time (which translates to how much time the user takes to produce a proof), the verification time on the service provider’s side, and the size of the proof, with or without the use of different sized buffers<sup class="footnote-reference"><a href="#4">4</a></sup>.</p>
<figure>
<img src="sync.png" alt="Image 1" style="display: inline-block; width: 45%; margin-right: 1%;">
<img src="attestation.png" alt="Image 2" style="display: inline-block; width: 45%; margin-right: 1%;">
<img src="verification.png" alt="Image 3" style="display: inline-block; width: 45%;">
<img src="proof_size.png" alt="Image 4" style="display: inline-block; width: 45%;">
<figcaption style="text-align: center;"><a href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock's evaluation</a> from the paper: (top left) synchronization time depending on blocklist alterations, (top right) attestation time, (bottom left) verification time, and (bottom right) proof size depending on the blocklist size.</figcaption>
</figure>
<p>More specifically, the top left figure shows the offline computation a client must do as a function of the number of changes to the blocklist. This includes syncing chunks and precomputing a proof that they are registered through an identity provider. We can see that the offline precomputation can take up to a couple of minutes for a large number of additions. Since the user can perform it asynchronously and periodically, it doesn’t introduce any significant overhead.</p>
<p>The top right figure shows the time clients take to attest to non-membership on a blocklist that has recently changed. This is the time it takes for a user to recompute the last chunk proof and link them all together. These results can be interpreted differently considering the different services that SNARKBlock can be used for; if the time to write a message and send it to get posted is smaller than the authentication time (a few seconds here) then the message would get queued. These times seem to be acceptable for forums primarily focused on posting and commenting anonymously. However, the results are impractical for implementations like real-time chat forums, where speed is of the essence and attestation needs to be on the order of milliseconds. </p>
<p>The two bottom graphs show the throughput and proof sizes for server verification. These graphs are in a semi-log scale and do in fact show that SNARKBlock proofs scale logarithmically with the number of elements in the blocklist, both in terms of size and time efficiency on the server’s side.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Anonymous communication systems protect user privacy but face challenges in managing inappropriate behavior. Anonymous blocklisting schemes, powered by advanced cryptographic protocols like zk-SNARKs, enable blocking individual posts without revealing user identities. These schemes use signature and commitment schemes, along with pseudorandom functions, to maintain privacy while ensuring message authenticity. </p>
<p>SNARKBlock addresses inefficiencies in traditional systems by introducing HIdden Common Input Aggregate Proofs (HICIAP), which aggregate multiple proofs into a single efficient proof. This innovation achieves logarithmic proof sizes and verification times in relation to the size of the blocklist, making anonymous blocklisting practical for some large-scale applications, such as social media platforms. </p>
<p>However, further advancements are needed to fully realize a world where anonymous blocklisting schemes are seamlessly deployed and used in everyday applications. Future steps include examining the use of <a rel="noopener" target="_blank" href="https://iacr.org/archive/tcc2008/49480001/49480001.pdf">Incrementally Verifiable Computation</a> (IVC) or recursion techniques in order to recursively combine many proofs into one, and thus further reduce proof sizes and verification times.
Additionally, minimizing the cost of reinstating users without major recomputation is a key challenge that needs addressing to make the schemes more adaptable and user-friendly.
Finally, it is crucial to explore interoperability to ensure that anonymous blocklisting schemes can be seamlessly integrated with existing communication platforms and systems.
By tackling these challenges, we can move closer to using anonymous blocklisting in everyday digital communication.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Technically this is the wrong terminology. The difference between a “proof” and an “argument” in cryptography lies in their soundness definition, which refers to the truthfulness of the protocol: if the statement is false, no Prover can convince a Verifier of the opposite. Proofs have statistical soundness (holds against an unbounded adversary), whereas arguments have only computational soundness (holds against a polynomially bounded adversary). For easier understanding, we can mislabel a SNARK, secure against bounded adversaries, as a “proof”.</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>In the world of Avatar, Aang (who is the Avatar) and Toph are part of a team trying to defend the Earth Kingdom against the Fire Nation, led by the Firelord and his son, Zuko. They end up in the city of Ba Sing Se (where Joo Dee resides) which has an authoritarian government refusing to acknowledge that a war is happening.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup>
<p>While the described scheme has the same general mechanics as BLAC, it is presented in a simplified form that is closer to the SNARKBlock scheme for easier understanding. More details about the protocols can be found in the publications. </p>
</div>
<div class="footnote-definition" id="4"><sup class="footnote-definition-label">4</sup>
<p>Some of the experiments include a buffer. This optimization aims to fix the problem when the blocklist might be updated during the Sync process, resulting in recomputation during the Attest process and thus added latency. So they use a buffer of smaller chunks at the end of the list and a separate HICIAP instance to process them, which increases the number of distinct proofs but reduces the overall attestation time.</p>
</div>
<h1 id="measuring-and-exploiting-network-usable-information">Measuring and Exploiting Network Usable Information (2024-06-21)</h1>
<p>In large cloud service providers such as AWS, the customer provides an attributed graph and would like to perform tasks such as recommendations (i.e., link prediction in a graph) using message-passing methods within restricted budgets.
An attributed graph consists of both the graph structure and the node features.
As one of the message-passing methods, Graph Neural Networks (GNNs) are commonly used for graph tasks by propagating node features through the graph structure.
However, it is possible that not all the information in the provided graph is usable for solving the task.
Training a GNN would thus be a waste of time and resources for the customer.
Therefore, we aim to answer two questions:</p>
<ol>
<li>Given a graph with node features, how can we tell whether utilizing both graph structure and node features will yield better performance than utilizing either of them separately?</li>
<li>How can we know what information in the graph (if any) is usable to solve the tasks, namely, node classification and link prediction? </li>
</ol>
<p>Our goal is to design a metric for measuring how informative the graph structure and node features are for the task at hand, which we call network usable information (NUI).</p>
<p>In this blog post, we introduce how to measure the information in the graph, and to exploit it for solving the graph tasks. This blog post is based on our research paper, “NetInfoF Framework: Measuring and Exploiting Network Usable Information” [1], presented at ICLR 2024.</p>
<h2 id="what-is-an-attributed-graph">What is an attributed graph?</h2>
<figure>
<img src="./figure1.png" alt="attributed graph" width="500"/>
<figcaption>
Figure 1. An example of an attributed social network graph, where the nodes denote the users, and the edges denote whether two users are friends.
</figcaption>
</figure>
<p>A graph is a data structure that includes nodes and edges, representing the connections between nodes.
An attributed graph indicates that each node in the graph has a set of features.
For example, Figure 1 shows the attributed graph for a social network.
The nodes represent the users, illustrated as circles with university acronyms.
The edges indicate whether two users are friends, illustrated as black lines connecting circles.
A node might also contain additional information:</p>
<ol>
<li><strong>Node ID</strong>: illustrated as a thumbnail image of the user along with the user’s text name.</li>
<li><strong>Node attributes/features</strong>: represent whether the user is located in the US and whether the user likes to bike, illustrated as a \( 2 \times 2 \) table. </li>
<li><strong>Node label</strong>: signifies the user’s university, categorized into two groups represented by blue and red colors with acronyms: Carnegie Mellon University (CMU) or National Chiao Tung University (NCTU). </li>
</ol>
<p>In fact, node labels are similar to node features, but they often contain missing values, which we are interested in predicting.</p>
<figure>
<img src="./figure2.png" alt="mathematical representation of graph" width="500"/>
<figcaption>
Figure 2. The mathematical representation of the attributed graph, including an adjacency matrix, node features, and node labels. The red question mark denotes unknown.
</figcaption>
</figure>
<p>In Figure 2, the graph structure can be represented by an adjacency matrix, where 1 denotes the presence of an edge between two nodes, and 0 denotes no edge.
The node features are also represented by a matrix, where each feature is binary in the example, but it can also be continuous.
The node labels are represented by a matrix with one-hot encoding of the class label.</p>
<h2 id="what-are-the-common-graph-tasks">What are the common graph tasks?</h2>
<p>We consider the two common graph tasks:</p>
<ul>
<li><strong>Node Classification</strong>
<ul>
<li><em>Goal:</em> Classify the unlabeled nodes, while some labeled nodes are given.</li>
<li><em>Example:</em> Given a social network with features, can we guess which university Bob goes to, i.e., the label of the gray node in Figure 1?</li>
</ul>
</li>
<li><strong>Link Prediction</strong>
<ul>
<li><em>Goal:</em> Predict the potential additional edges in the graph.</li>
<li><em>Example:</em> Given a social network with features, can we guess whether David and Grace could become friends, i.e., the potential additional red-dash line in Figure 1? </li>
</ul>
</li>
</ul>
<h2 id="what-are-message-passing-methods">What are message-passing methods?</h2>
<!-- | U<sub>[n×r]</sub> | Left singular vectors of adjacency matrix | -->
<!-- | r | Rank for matrix decomposition | -->
<figure>
<img src="./figure3.png" alt="node embedding" width="500"/>
<figcaption>
Figure 3. Illustration of the nodes in a given graph projected into low-dimensional embedding space. The nodes that are more similar in the graph are closer in the embedding space.
</figcaption>
</figure>
<p>Message-passing methods utilize the graph structure to propagate information from the neighbors of a node to the node itself.
Known as sum-product message-passing, belief propagation methods [2, 3] directly perform inference on the graph through several propagation iterations.
Although they are fast and effective because they require neither parameters nor training, belief propagation methods are mainly designed to solve node classification problems based solely on the graph structure and usually do not consider node features.</p>
<p>Another variety of message-passing methods, Graph Neural Networks (GNNs) [4], are a class of deep learning models.
They are commonly used to generate low-dimensional embeddings of nodes to perform graph tasks by learning end-to-end with a training objective.
As shown in Figure 3, the nodes that are better connected in the graph are expected to have closer embeddings in the low-dimensional space.</p>
<p>Some studies remove the non-linear functions in GNNs and still achieve good performance; such models are called linear GNNs [5, 6].
One of the many advantages of linear GNNs is that their node embeddings are available prior to model training.
A comprehensive study on linear GNNs can be found in [6].</p>
<h2 id="measuring-nui-would-a-message-passing-method-work-in-a-given-setting"><em>Measuring NUI:</em> Would a message-passing method work in a given setting?</h2>
<figure>
<img src="./figure4.png" alt="scenarios" width="400"/>
<figcaption>
Figure 4. Scenarios in which the message-passing method may not work well. (a): The graph structure exhibits no network effects. (b): Node features are not correlated with node labels.
</figcaption>
</figure>
<p>Given an attributed graph, how can we measure the network usable information (NUI)? The message-passing method may not work well in the following two conditions:</p>
<ol>
<li><strong>No network effects:</strong> the graph structure is not useful to solve the graph task. In Figure 4(a), since every labeled node has one blue and one red neighbor, we cannot infer the label for Bob by examining its neighbors.</li>
<li><strong>Useless node features:</strong> the node features are not useful to solve the graph task. In Figure 4(b), we can see that whether a user likes to bike is not correlated with the user’s university.</li>
</ol>
<p>If either of these two extreme conditions applies, a message-passing method is likely to fail in inferring the unknown node label, i.e., Bob’s university.
However, it is very likely that the graph information has varying levels of usefulness, ranging between completely useful and completely useless.
For example, in Figure 2, only 1 out of 2 node features is useful: the node feature ‘located in the US’ is useful, while the node feature ‘likes to bike’ is not.</p>
<figure>
<img src="./figure5.png" alt="structural and neighbors' feature embedding" width="800"/>
<figcaption>
Figure 5. Illustration of structural embedding (left) and neighbors' feature embedding (right). (a): SVD is conducted on the adjacency matrix to extract structural embedding. (b): Dimensionality reduction is conducted on the node features. (c): Messages passed from a node's neighbors are aggregated to generate its node embedding.
</figcaption>
</figure>
<p>We focus on analyzing whether GNNs will perform well in a given setting, which is an important problem in industry.
For a large cloud service provider, a customer supplies an attributed graph and asks the provider to solve a graph task (e.g., recommendation) using GNNs within a restricted budget.
However, if the given graph lacks network usable information (NUI), the resources spent on training GNNs are wasted.
Therefore, our method serves as a preliminary tool to determine whether resources should be allocated to training expensive deep models.</p>
<p>A straightforward way is to analyze the node embedding of the given graph generated by GNNs, but this is only available after training, which is expensive and time-consuming.
For this reason, we propose to analyze the derived node embedding in linear GNNs, which can be precomputed before model training.
More specifically, we derive three types of node embedding that can comprehensively represent the information of the given graph, namely:</p>
<ol>
<li><strong>Structural embedding (\( U \)):</strong> for the information of the graph structure. It is extracted by decomposing the adjacency matrix with singular value decomposition (SVD). Intuitively, the left singular vectors U give the information of the node community. For example, in Figure 5(a), U identifies that the first three users belong to the blue community, while the last four belong to the red community. This is useful when the node features are not useful to solve the graph task.</li>
<li><strong>Feature embedding (\( F \)):</strong> for the information of the node features. It consists of the original node features after dimensionality reduction. As shown in Figure 5(b), Principal Component Analysis (PCA) [7] is used as the dimensionality reduction technique. This is useful when there are no network effects, i.e. the graph structure is not useful to solve the graph task.</li>
<li><strong>Neighbors’ feature embedding (\( S \)):</strong> for the information of the features aggregated from the neighbors. As shown in Figure 5(c), the messages are passed and aggregated from the neighbors for two steps. The intuition is that, in addition to the information from the user, the user’s neighbors also provide useful information to the task. Leveraging their information leads to better performance on the graph task. This is useful when both the graph structure and the node features are useful to solve the graph task.</li>
</ol>
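<p>To make the three embeddings concrete, here is a minimal numpy sketch on a toy graph. The graph, features, and dimensions are all made up, and the normalization and aggregation choices are simplifications of the pipeline in [1].</p>

```python
import numpy as np

# Toy attributed graph (all values hypothetical): two triangles, 4-dim features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(6, 4))
k = 2                                     # embedding dimension

# 1) Structural embedding U: truncated SVD of the adjacency matrix.
U_full, sv, _ = np.linalg.svd(A)
U = U_full[:, :k] * sv[:k]

# 2) Feature embedding F: PCA, i.e. SVD of the mean-centered features.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
F = Xc @ Vt[:k].T

# 3) Neighbors' feature embedding S: two rounds of mean aggregation.
P = A / A.sum(axis=1, keepdims=True)      # row-stochastic propagation matrix
S = P @ (P @ F)

print(U.shape, F.shape, S.shape)          # each is (6, 2)
```

<p>Row-normalizing by degree makes each aggregation step a weighted average of neighbor features; other aggregators (e.g., symmetric normalization) would serve equally well in this sketch.</p>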
<p>Once we have embeddings that represent the information of the nodes of the graph, we propose NetInfoF_Score, and link graph information to task performance with the following definition and theorem:</p>
<p><strong>Definition 1</strong> (NetInfoF_Score of \( Y \) given \( X \)).
<em>Given two discrete random variables \( X \) and \( Y \), NetInfoF_Score \( s \) of \( Y \) given \( X \) is defined as:</em>
\[ s = 2^{-H(Y|X)} \]
<em>where \( H(\cdot | \cdot) \) denotes the conditional entropy [8].</em></p>
<p><strong>Theorem 1</strong> (NetInfoF_Score).
<em>Given two discrete random variables \( X \) and \( Y \), NetInfoF_Score \( s \) of \( Y \) given \( X \) lower-bounds the accuracy:</em>
\[ s = 2^{-H(Y|X)} \leq accuracy(Y|X) = \sum_{x \in X}{\max_{y \in Y}{p_{x, y}}} \]
<em>where \( p_{x, y} \) is the joint probability of \( x \) and \( y \).</em></p>
<p>The proof is in [1].
The intuition behind this theorem is that the conditional entropy of \( Y \) (e.g., labels) given \( X \) (e.g., “like biking”) is a strong indicator of how good a predictor \( X \) is for the target \( Y \).
This gives NetInfoF_Score an intuitive interpretation: it is a lower bound on the accuracy. When there is little usable information for the task, the value of NetInfoF_Score is close to that of random guessing.</p>
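<p>Both quantities in Theorem 1 can be estimated from empirical counts, which makes the bound easy to verify on toy data. A small sketch (the two example variables below are made up):</p>

```python
import numpy as np
from collections import Counter

def netinfof_score_and_accuracy(x, y):
    """s = 2^(-H(Y|X)) and acc = sum_x max_y p(x, y), from empirical counts."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    H = 0.0                                    # conditional entropy in bits
    for (xv, yv), c in joint.items():
        H -= (c / n) * np.log2(c / px[xv])     # p(x,y) * log2 p(y|x)
    acc = sum(max(c for (xv, yv), c in joint.items() if xv == x0)
              for x0 in px) / n
    return 2.0 ** (-H), acc

# Perfectly predictive feature: H(Y|X) = 0, so the score meets accuracy at 1.
s, acc = netinfof_score_and_accuracy([0, 0, 1, 1], ['a', 'a', 'b', 'b'])
print(s, acc)   # 1.0 1.0

# Useless feature: the score drops to chance and still lower-bounds accuracy.
s, acc = netinfof_score_and_accuracy([0, 1, 0, 1], ['a', 'a', 'b', 'b'])
print(s, acc)   # 0.5 0.5
assert s <= acc + 1e-9
```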
<figure>
<img src="./figure6.png" alt="Empirical study" width="400"/>
<figcaption>
Figure 6. Our theorem holds, where NetInfoF_Score is always less than or equal to validation accuracy.
</figcaption>
</figure>
<p>In Figure 6, each point represents the accuracy and NetInfoF_Score obtained by solving graph tasks using one type of node embedding.
We find that NetInfoF_Score lower-bounds the accuracy of the given graph task, as expected.
If an embedding has no usable information to solve the given task, NetInfoF_Score gives a score close to random guessing (see lower left corner in Figure 6).
The details of the experiment can be found in [1].</p>
<h2 id="exploiting-nui-how-to-solve-graph-tasks"><em>Exploiting NUI:</em> How to solve graph tasks?</h2>
<p>In this blog, we focus on explaining how to solve node classification. Link prediction is more involved, and the details can be found in [1].</p>
<p>To solve node classification, we concatenate different types of embedding, and the input of the classifier is as follows:
\[ U \parallel F \parallel S , \]
where \( \parallel \) is concatenation, \( U \) is the structural embedding, \( F \) is the feature embedding, and \( S \) is the neighbors’ feature embedding. Among all the choices, we use logistic regression as the node classifier, as it is fast and interpretable.</p>
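<p>This step can be sketched as follows. The embeddings here are synthetic stand-ins for \( U \), \( F \), and \( S \), and the hand-rolled gradient-descent fit is only a stand-in for an off-the-shelf logistic regression solver.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
y = np.repeat([0, 1], n // 2)              # two balanced classes

# Hypothetical stand-ins for the precomputed embeddings U, F, S:
# each carries a class-dependent signal plus Gaussian noise.
U = rng.normal(size=(n, k)) + y[:, None]
F = rng.normal(size=(n, k)) + y[:, None]
S = rng.normal(size=(n, k)) + y[:, None]

Z = np.hstack([U, F, S])                   # the classifier input: U || F || S
Z = np.hstack([Z, np.ones((n, 1))])        # bias column

# Plain logistic regression fitted by gradient descent on the log-loss.
w = np.zeros(Z.shape[1])
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-Z @ w))       # predicted P(y = 1)
    w -= 0.1 * Z.T @ (p - y) / n           # average gradient step

train_acc = np.mean(((Z @ w) > 0) == y)
print("training accuracy:", train_acc)
```

<p>Because the classifier is linear, the learned weights directly indicate which of the three embedding types contributes to the prediction, which is the interpretability argument made above.</p>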
<h2 id="how-well-does-our-proposed-method-perform">How well does our proposed method perform?</h2>
<figure>
<img src="./table1.png" alt="Node classification" width="1000"/>
</figure>
<p>As shown in Table 1, when applied to 12 real-world graphs, NetInfoF outperforms the GNN baselines on node classification in 7 out of 12 datasets.</p>
<figure>
<img src="./figure7.png" alt="NetInfoF_Score on real-world datasets" width="200"/>
<figcaption>
Figure 7. NetInfoF_Score highly correlates to test performance in real-world datasets. Each point denotes the result of one type of embedding in each dataset.
</figcaption>
</figure>
<p>In Figure 7, NetInfoF_Score is highly correlated with test performance on node classification.
This indicates that NetInfoF_Score is a reliable measure for quickly deciding whether to use a GNN on a given graph, without any model training.
Note that although our theorem proves that NetInfoF_Score is a lower bound on <em>training</em> accuracy, it is possible for <em>testing</em> accuracy to be lower than NetInfoF_Score (blue points below the 45-degree line in Figure 7).</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, we investigate the problem of predicting whether a message-passing method will work well on a given graph.
To solve this problem, we introduce NetInfoF, an approach to measure and exploit the network usable information (NUI).
Applied to real-world graphs, NetInfoF not only correctly measures NUI with NetInfoF_Score, but also outperforms the baselines on node classification in 7 out of 12 datasets.</p>
<p>Please find more details in our paper [1].</p>
<h2 id="references">References</h2>
<p>[1] Lee, M. C., Yu, H., Zhang, J., Ioannidis, V. N., Song, X., Adeshina, S., … & Faloutsos, C. NetInfoF Framework: Measuring and Exploiting Network Usable Information. International Conference on Learning Representations (ICLR), 2024.</p>
<p>[2] Koutra, D., Ke, T. Y., Kang, U., Chau, D. H., Pao, H. K. K., & Faloutsos, C. Unifying guilt-by-association approaches: Theorems and fast algorithms. Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD), 2011</p>
<p>[3] Günnemann, W. G. S., Koutra, D., & Faloutsos, C. Linearized and Single-Pass Belief Propagation. VLDB Endowment, 2015.</p>
<p>[4] Kipf, T. N., & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR), 2017.</p>
<p>[5] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. Simplifying Graph Convolutional Networks. PMLR International Conference on Machine Learning (ICML), 2019.</p>
<p>[6] Yoo, J., Lee, M. C., Shekhar, S., & Faloutsos, C. Less is More: SlimG for Accurate, Robust, and Interpretable Graph Mining. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.</p>
<p>[7] Principal component analysis. In Wikipedia. <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Principal_component_analysis">https://en.wikipedia.org/wiki/Principal_component_analysis</a>, 2024.</p>
<p>[8] Conditional entropy. In Wikipedia. <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conditional_entropy">https://en.wikipedia.org/wiki/Conditional_entropy</a>, 2024.</p>
<h1>Piano: Extremely Simple, Single-Server Private Information Retrieval</h1>
<p><em>2024-05-16 · https://www.cs.cmu.edu/~csd-phd-blog/2024/piano-private-information-retrieval/</em></p>
<p>Information retrieval is a pervasive aspect of our digital life, yet the process of retrieval consistently compromises the privacy of the users (i.e., the information requesters). For example, when you access a website, you need to first retrieve the website’s IP address from a DNS (Domain Name Service) server, so that you can talk to the website provider associated with that IP address. Unfortunately, this process discloses your browsing history to the DNS server. Similarly, submitting a query to a search engine exposes your search history to the search service provider. Despite efforts to develop privacy-preserving information retrieval services, those service providers often employ the same retrieval techniques as their non-private counterparts and just promise to keep the users’ records securely or delete them afterward. Nonetheless, these service providers are still susceptible to data breaches.</p>
<p>Is there a way to completely eradicate the leakage of information during information retrieval?
A natural attempt is to encrypt the queries, so the server cannot read the queries in plaintext. However, without seeing the query, how can the server locate the relevant information? It seems like we are now facing a dilemma. </p>
<h1 id="what-is-private-information-retrieval-pir">What is Private Information Retrieval (PIR)?</h1>
<p>Private information retrieval (PIR), first introduced by <a rel="noopener" target="_blank" href="https://www.computer.org/csdl/proceedings-article/focs/1995/71830041/12OmNzYNNfi">Chor et al.</a> back in 1995, is exactly the formal abstraction of “retrieving public information privately”. It is defined as follows.</p>
<blockquote>
<p><strong>Definition (Private Information Retrieval Protocol):</strong> For simplicity, let’s assume there is only one server storing a database consisting of \(n\) integers, denoted by \(X_1,\dots, X_n\). Now assume the client wants to learn \(X_i\). The client and the server then engage in an interactive algorithm, formally referred as a Private Information Retrieval protocol. At the end of the protocol, the client should learn the correct value of \(X_i\). Also, the protocol should provide query privacy: the server should learn nothing about the query index \(i\).</p>
</blockquote>
<p></p>
<p>At first glance, it might seem impossible to design such a protocol. Don’t worry – there is at least a naive protocol satisfying the definition: given any query index \(i\), the client just downloads the whole database and reads \(X_i\) locally. This protocol is perfectly private: the server only knows that the client downloads the database and that is independent of the actual query. Nonetheless, this protocol is not practical – the computation and communication costs per query are both linear in the size of the database.</p>
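<p>The naive protocol is short enough to write out directly (a toy sketch; the database values are made up):</p>

```python
# Naive PIR: the client downloads the entire database and reads X_i locally.
# Perfectly private (the server's view is independent of i), but both the
# communication and the computation are linear in n.
def naive_pir_query(server_db, i):
    downloaded = list(server_db)   # the full-database "transcript"
    return downloaded[i]

db = [10, 20, 30, 40]
assert naive_pir_query(db, 2) == 30
```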
<p>Previously, cryptographers focused on improving the <em>communication cost</em> of PIR. Most proposed PIR schemes rely on <em>Homomorphic Encryption (HE)</em>, a special type of encryption scheme that allows <em>computation on the ciphertexts</em>. HE essentially avoids the dilemma we saw before – the server can perform some form of computation on the encrypted query and somehow locate the necessary information for the client. Existing schemes already achieved \(O(\log n)\) communication per query based on HE (e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3460120.3485381">OnionPIR</a> and <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9833700/">Spiral</a>).</p>
<!---
A typical scheme is as follows. The server represents the database as a $\sqrt{n}\times \sqrt{n}$ matrix, denoted as $M$. Suppose the client is interested in a database value located the $j$-th column of the matrix. The client can homomorphically encrypts an one-hot vector $e_j=(0,\dots,0,1,0\dots,0)^T\in \{0,1\}^{\sqrt{n}}$ where only the $j$-th location is 1. The clients sends the homomorphically encrypted vector $[e_j]$ to the server, and the server computes the matrix-vector multiplication $M[e_j]$ and return the results to the client. The client decrypts the result, which contains exactly the $j$-th column of the matrix, and they gets the desired value. This protocol saves the communication to $O(\sqrt{n})$.
-->
<p>An unsolved issue is the <em>computation cost</em>. Even with the help of HE, the computation cost is still linear in \(n\). Such linear-computation-cost PIRs are not suitable for larger databases. For example, a typical DNS server contains several hundreds of gigabytes of records and it is over-costly if the server has to scan all the records for each query. Unfortunately, <a rel="noopener" target="_blank" href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c3404368a32ab694c862f88cd1b5a3e6208f1bff">Beimel, Ishai and Malkin</a> showed that in the classical PIR model, linear computation is inevitable. The intuition behind their lower bound is actually pretty straightforward – if the server does not touch any particular database entry during the query, the server knows the client’s query cannot be this entry. This at least leaks one bit of information. </p>
<p>To get around this lower bound and achieve sublinear computation cost PIR, we can resort to a powerful idea in computer science – <strong>preprocessing</strong>. Preprocessing PIR allows the client or the server to run some (possibly interactive) protocols before the query phase begins, and store some necessary hints in the client space or the server space. The client or the server then uses those hints to help with the online queries.
<a rel="noopener" target="_blank" href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c3404368a32ab694c862f88cd1b5a3e6208f1bff">Beimel, Ishai and Malkin</a> first showed that preprocessing PIR can indeed achieve \(O(\frac{n}{\log n})\) computation cost per query. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2019/1075.pdf">Corrigan-Gibbs and Kogan</a> (and <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/081.pdf">their follow-up work</a> with Henzinger) further showed that the computation cost can be improved to \(O(\sqrt{n})\) per query. Nonetheless, these schemes are relatively complicated and have remained theoretical. Then, a natural question is:</p>
<blockquote>
<p>Is there a practical, and sublinear computation cost PIR protocol?</p>
</blockquote>
<p></p>
<h1 id="piano-extremely-simple-pir-with-preprocessing">Piano: Extremely Simple PIR with Preprocessing</h1>
<p>We now introduce our work Piano (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/452">Zhou et al., IEEE S&P 2024</a>), an extremely simple and practical PIR scheme with preprocessing. It is easy to implement – the core idea can be implemented within 150 lines of Go code, and it is blazingly fast. Piano only takes 12ms to finish a query on a 100GB database, which is nearly 1000x faster than the previous best solution (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/949">Henzinger et al. 2022</a>)!</p>
<p>Piano starts with the following idea: </p>
<blockquote>
<p>Downloading the whole database for every query seems bad, but what if the client can just download the database once and prepare for many queries?</p>
</blockquote>
<p></p>
<p>That is, the client downloads the database during the preprocessing (importantly, without knowing the following queries), computes and stores some sublinear size hints, and then deletes the database. The client then utilizes those hints to help with online queries.</p>
<p>This idea cannot scale to Internet-size databases (like Google search), for sure. However, for many medium-size databases, the idea is practical if we can build a <strong>streaming preprocessing</strong>. Streaming means that instead of downloading the database at once, the client can download a small portion of the database each time, locally update the hints, and delete that portion of the database from the local storage. One can imagine that this process is similar to watching a YouTube video – you do not need to download the whole video at once, but rather just dynamically fetch a piece of the video 30 seconds ahead. We indeed designed a streaming preprocessing algorithm, and given a database of size around 100 gigabytes, the cost will be similar to watching YouTube videos for several hours. </p>
<h2 id="preprocessing">Preprocessing</h2>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/preprocessing.png" alt="Preprocessing" />
<em>Figure 1: Illustration of the preprocessing for a database with 16 entries.</em></p>
<p>We now get to introduce some details about the preprocessing.
We split the \(n\) indices into \(\sqrt{n}\) chunks of size \(\sqrt{n}\) (see Figure 1 above).
During the preprocessing, the client will store roughly \(\tilde{O}(\sqrt{n})\) “random linear equations”, where the variables are the entries from the database. To generate one linear equation, the client samples one entry from each database chunk, resulting in \(\sqrt{n}\) variables for one equation. The client computes the sum of those variables during the preprocessing. In the end, the client only stores the sum values and the random seeds used to generate the equations. In addition, the client stores around \(\log n\) random entries for each chunk. Those random entries will be called the <em>replacement entries</em>.
Finally, the equations and the replacement entries comprise the client’s hint, which takes \(O(\sqrt{n}\log n)\) storage space. The server does not store any hints.
Since the client does all the preprocessing computation locally, the server cannot observe the preprocessing result. Moreover, as we mentioned before, the preprocessing is done in a streaming manner, so the temporary storage requirement is small: the client only needs to initialize the equations to zeros, download the chunks one by one, and accumulate the sum values as it goes.</p>
<h2 id="handling-a-single-query">Handling a Single Query</h2>
<p>Now let’s see how the preprocessing helps with the online query. We use the example in Figure 1. Assume the client is querying for \(X_7\). The client will first scan the local equations and look for the first equation that contains \(X_7\). Because of the structure of those equations, during the scanning process, the client only needs to regenerate the second index in each equation using the stored random seeds, and match the generated index against “7”. The client will find the equation \(X_1 + X_7 + X_{10} + X_{12}=Y_1\) in the example. Since the client already stores \(Y_1\), which is the sum of these four elements, if the client can further learn the sum of \(X_1 + X_{10} + X_{12}\), it will learn the value of \(X_7\) by a basic subtraction. Unfortunately, the client cannot directly send the indices “\(1, 10, 12\)” to the server and ask the server to return the sum of \(X_1 + X_{10} + X_{12}\), because the server will immediately see that there are no indices from the second chunk and learn that the query must belong to the second chunk. This is an information leakage.</p>
<p>Luckily, we can mitigate this information leakage easily – remember that the client additionally stores some replacement entries in each chunk. So instead of directly removing the query index from the equation, the client <strong>replaces</strong> the query index with a known replacement entry index in the same chunk as the query. In our example, the client can simply replace \(X_7\) with \(X_6\), and send the four indices “\(1, 6, 10, 12\)” to the server. The server should return the sum of \(X_1 + X_6 + X_{10} + X_{12}\) to the client, upon receiving the indices. The client can then compute \(X_7\) as in Figure 2.
Note that this query <strong>leaks no information about the actual query</strong>: the server just sees four random indices and each one of them is just an independently random index within one chunk, given the fact that the server cannot see the preprocessing results. This query is also <strong>efficient</strong>: the client takes \(O(\sqrt{n})\) time to find the equation, the server takes \(O(\sqrt{n})\) time to compute the sum, and the communication cost is just \(O(\sqrt{n})\) (sending the edited equation to the server).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/query.png" alt="Query" />
<em>Figure 2: Illustration of the online query.</em></p>
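<p>The whole single-query flow fits in a short, self-contained sketch. Everything below is a toy model of the scheme: a 16-entry database, plain Python lists instead of PRF-generated seeds, and a top-up loop to guarantee that every index is covered by some equation (in Piano, roughly \( \sqrt{n}\log n \) random equations achieve this with high probability).</p>

```python
import random

random.seed(0)
n, chunk = 16, 4                          # 16 entries, sqrt(n) = 4 chunks of size 4
DB = [random.randrange(1000) for _ in range(n)]   # the server's database

# --- Preprocessing (done by the client; in Piano the DB is streamed) ---
def sample_eq():
    """One random equation: one index per chunk, plus the sum of those entries."""
    idxs = [c * chunk + random.randrange(chunk) for c in range(n // chunk)]
    return idxs, sum(DB[j] for j in idxs)

equations = [sample_eq() for _ in range(8)]
# Toy-only top-up so that every index appears in some equation.
while not all(any(e[0][i // chunk] == i for e in equations) for i in range(n)):
    equations.append(sample_eq())

# Replacement entries: one known (index, value) pair per chunk.
replacements = {}
for c in range(n // chunk):
    j = c * chunk + random.randrange(chunk)
    replacements[c] = (j, DB[j])

# --- Online query for index q ---
q = 7
c_q = q // chunk                              # the chunk containing q
idxs, Y = next(e for e in equations if e[0][c_q] == q)
r_idx, r_val = replacements[c_q]              # known entry in the same chunk
sent = [r_idx if j == q else j for j in idxs] # swap q out before sending

T = sum(DB[j] for j in sent)                  # server: O(sqrt(n)) summation
X_q = Y - (T - r_val)                         # client: recover the answer
assert X_q == DB[q]
```

<p>This sketch omits the machinery needed for multiple queries (equation refresh, backups, and the database permutation); it also ignores the corner case where the sampled replacement index equals \( q \) itself, which a real implementation must handle to preserve privacy.</p>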
<h2 id="multiple-queries">Multiple Queries</h2>
<p>To amortize the cost of the preprocessing, we want to support as many queries as possible. Let’s first assume we need to support \(\sqrt{n}\) random queries, which amortize the preprocessing costs to the same as the query costs, i.e., \(O(\sqrt{n})\) computation and communication cost per query.</p>
<p>The main issue is that we cannot reuse the same preprocessed equation or the same replacement entry to handle two queries. Otherwise, it will cause privacy leakage. We also don’t want to deplete our equations and replacement entries. What should we do?</p>
<p>It is easy to handle the replacements: recall that we have \(\sqrt{n}\) random queries and \(\sqrt{n}\) chunks. With a classical balls-into-bins argument, there will be at most \(O(\log n)\) queries in each chunk with high probability. So preparing \(O(\log n)\) replacements per chunk is enough. </p>
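<p>The balls-into-bins claim is easy to check empirically. The sketch below throws \( \sqrt{n} \) random queries into \( \sqrt{n} \) chunks for \( n = 10^6 \) and confirms that the maximum load stays well under \( \log_2 n \) (the parameters are illustrative):</p>

```python
import math
import random

random.seed(42)
n = 10**6
chunks = math.isqrt(n)            # sqrt(n) = 1000 chunks
load = [0] * chunks
for _ in range(chunks):           # sqrt(n) uniformly random queries
    load[random.randrange(chunks)] += 1

print("max load:", max(load), "  log2(n):", round(math.log2(n)))
```

<p>In fact, the expected maximum load here is only \( \Theta(\log n / \log\log n) \), comfortably below the \( O(\log n) \) budget of replacement entries per chunk.</p>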
<p>It is trickier to handle the equations. A not-so-obvious observation is that we cannot just remove the consumed equation or replace it with a random backup equation, because it will skew the joint distribution of the remaining equations – doing so makes the current query less likely to appear in the remaining equations! The correct strategy is to replace the consumed equation with a random equation conditioned on the current query being included. We modify our preprocessing algorithm and add additional structural backup equations to facilitate this refresh strategy. We omit the details here due to space constraints and refer the interested readers to <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/452">our original paper</a> for the full description.</p>
<!---
To facilitate this refresh strategy, we prepare some special backup equations, as shown in Figure 3. Specifically, we prepare \\(O(\log n)\\) backup equations per chunk, and those backups prepared for the \\(i\\)-th chunk will ignore the entries in the \\(i\\)-th chunk. The reason behind it will be clear if we see the refresh algorithm: after each query \\(X\\) located in the \\(j\\)-th chunk, we will pick a backup equation prepared for chunk \\(j\\), and just complete that backup equation by adding \\(X\\) to it. We then replace the consumed equation with this completed backup. See an example in Figure 4.
![Backup](backup2.png)
*Figure 4: Illustration of the backup strategy. We prepare \\(\sqrt{n}\\) groups of backups, each contains \\(\log n\\) equations. The \\(i\\)-th group ignores the entries in the \\(i\\)-th chunk. Assume the client just queried for \\(X_7\\). So the client picks a backup from the second group, completes it with \\(X_7\\), and replaces the consumed equation with this completed equation.*
--->
<p>Our ultimate goal is to support many adaptive queries and remove the “\(\sqrt{n}\) random queries” restriction. First, to make the queries look random, we can require the server to randomly permute the database upfront and share the permutation seed with the client. As long as the client is not making queries that depend on the permutation, the queries can be viewed as randomly distributed. An experienced reader may notice that a malicious server may not necessarily permute the database correctly. In our paper, we proposed extra steps to ensure that a malicious server only hurts the correctness but not the privacy of the scheme.</p>
<p>Moreover, to support more queries, the simplest way is to redo the preprocessing per \(\sqrt{n}\) queries. We can do better by utilizing a pipelining trick, shown in Figure 5. During the online query phase for the first \(\sqrt{n}\) queries, we simultaneously run the preprocessing for the next batch of \(\sqrt{n}\) queries. So when we finish the first batch of queries, we have already finished the preprocessing for the next batch of queries, and we can immediately start the next query.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/pipeline.png" alt="Pipeline" />
<em>Figure 5: Illustration of the pipelining trick. We run the preprocessing for the next batch simultaneously with the online queries. Then, the whole protocol can have a one-time preprocessing and a continuous online phase.</em></p>
<h2 id="asymptotic-metrics-of-piano">Asymptotic Metrics of Piano</h2>
<p>Piano’s asymptotic behaviors can be summarized as follows. </p>
<blockquote>
<p><strong>Simplified Theorem.</strong> Piano is a PIR protocol with one-time preprocessing, and it supports a polynomially bounded number of queries, while having the following asymptotic behaviors:</p>
<ol>
<li>One-time Preprocessing:
<ul>
<li>\(O(n)\) communication;</li>
<li>\(\tilde{O}(n)\) client computation.</li>
</ul>
</li>
<li>Online Query (per query):
<ul>
<li>\(O(\sqrt{n})\) communication;</li>
<li>\(\tilde{O}(\sqrt{n})\) client computation;</li>
<li>\(O(\sqrt{n})\) server computation.</li>
</ul>
</li>
<li>Storage:
<ul>
<li>\(\tilde{O}(\sqrt{n})\) client storage;</li>
<li>No additional server storage.</li>
</ul>
</li>
</ol>
<p>Here, \(\tilde{O}()\) hides the polylogarithmic terms.</p>
</blockquote>
<p></p>
<p>Notably, Piano achieves nearly optimal time-space tradeoff: <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/081.pdf">Corrigan-Gibbs, Henzinger and Kogan</a> showed that in any preprocessing PIR scheme, if the client stores \(S\) bits after preprocessing and the online query time cost is \(T\), then \(S \times T \ge \Omega(n)\). Piano achieves \(\tilde{O}(\sqrt{n})\) client storage and \(\tilde{O}(\sqrt{n})\) online time, which matches the bound except for a polylogarithmic factor!</p>
<!---
**Client Computation.** The client takes \\(n \log n\\) time to do preprocessing for \\(\sqrt{n}\\) queries. Each online query requires the client to take \\(O(\sqrt{n})\\) expected time to find the equation. So the amortized client cost per query is \\({O}(\sqrt{n}\log n)\\).
**Server Computation.** The server just takes linear time to stream the database for the client during the preprocessing phase and take \\(O(\sqrt{n})\\) time to compute the equation sum for each online query. So the amortized server cost per query is \\({O}(\sqrt{n})\\).
**Communication.** The client streams \\(n\\) integers for \\(\sqrt{n}\\) queries. The client also sends a \\(\sqrt{n}\\)-size equation to the server. The server's response is just a single integer. So the amortized communication cost per query is \\({O}(\sqrt{n})\\).
**Storage.** The client stores no more than \\({O}(\sqrt{n}\log n)\\) equations and \\({O}(\sqrt{n}\log n)\\) replacements. The client only stores the random seeds and the sum value for the equations. So the total client storage requirement is \\({O}(\sqrt{n}\log n)\\). Note that the server has no per-client storage.
--->
<h2 id="empirical-results">Empirical Results</h2>
<p>We tested Piano on a 100GB database containing 1.6 billion 64-byte records. We compared it to the previous state-of-the-art scheme SimplePIR (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/949">Henzinger et al. 2022</a>). As shown in the following table, our scheme has a nearly 1000x improvement in terms of pure computation, and a 120x improvement in the end-to-end latency, while having advantages in communication and storage. Note that the streaming preprocessing only takes 45 minutes (8-thread parallelization).</p>
<table><thead><tr><th align="left"></th><th align="center">Piano</th><th align="center">SimplePIR</th></tr></thead><tbody>
<tr><td align="left">End-to-end Latency</td><td align="center">12ms (computation) + 60ms (network)</td><td align="center">~11s</td></tr>
<tr><td align="left">End-to-end Communication</td><td align="center">220KB</td><td align="center">2.3MB</td></tr>
<tr><td align="left">Storage</td><td align="center">0.8GB</td><td align="center">1.2GB</td></tr>
</tbody></table>
<h1 id="applications-limitations-and-open-problems">Applications, Limitations and Open Problems</h1>
<p>We are now actively exploring potential applications of Piano. As we mentioned earlier, a private search engine is one of the most attractive applications of PIR,
and we are indeed building such an engine based on the combination of an optimized version of Piano and graph-based search algorithms. Our preliminary results show
that this private search engine can handle a static database nearly the size of English Wikipedia.
It achieves search quality comparable to that of non-private search algorithms and provides orders-of-magnitude efficiency improvements over the previous best
solutions, including <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/1438">Tiptoe</a> and <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/154">Coeus</a>.</p>
<p>Some other applications, such as private DNS query, remain more challenging. One of the main technical difficulties is that the database (e.g., the DNS records) is
being updated frequently, which contradicts the assumption of Piano. Two recent works proposed possible solutions for updatable databases (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2024/303.pdf">Lazzaretti and Papamanthou</a>, <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2024/318.pdf">Hoover et al.</a>), but how to design a truly practical updatable PIR scheme remains an open problem.</p>
<p>Another limitation of Piano is the lack of appropriate permission control since the preprocessing of Piano reveals the whole database to the client. Imagine we are designing a PIR protocol for a personal credit score lookup service. It is not acceptable for one individual to learn all others’ credit scores. It remains an interesting open problem to design a PIR scheme with proper permission control, which is required by many real-world applications. </p>
Precise Data Center Traffic Engineering with Constrained Hardware Resources2024-05-10T00:00:00+00:002024-05-10T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/precise-traffic-engineering/<p>Data center networks are similar to large cities:
they are at massive scale, and there exist many available roads/paths between a pair of source and destination.
To move data from one place to another, we transfer data in chunks of bytes known as packets (roughly analogous to vehicles),
which traverse one of the available paths.
Just like what happens during rush hours—i.e., vehicles get congested on a road—packets
also suffer from congestion if too many of them end up taking the same path.
Congestion leads to bad application performance since more time is needed to transfer data.
Therefore, we want to route packets via different paths to avoid congestion.
Since the paths have different capacities (analogous to the number of car lanes), the number of packets
sent onto each path should follow a calculated split ratio, which we refer to as a weight distribution, in order to achieve optimal load balancing.</p>
<p>At path intersections—i.e., where paths split or converge—a
<a rel="noopener" target="_blank" href="https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56990-series">purpose-built network switching chip</a>
exists to forward packets to an assigned path following a calculated weight distribution.
I will describe shortly how a switch achieves this.
However, making packets precisely follow a given weight distribution is difficult because switches have only constrained hardware resources to support it.
When packets are distributed imprecisely, we observe negative outcomes such as load imbalance and congestion.
<strong>A key challenge is the following: how can we achieve precise packet distribution given the constrained hardware resources?</strong>
Our recent work addresses this challenge and achieves very high fidelity in following the distribution.
In this blog post, I will first describe how load balancing is implemented in switch hardware
and then quantify the impact when the weight distributions are not followed precisely.
Finally, I will explain our proposed approach and show how it mitigates the problem.</p>
<h1 id="traffic-engineering-switch-hardware-and-precision-loss">Traffic engineering, switch hardware and precision loss</h1>
<p>There are two types of elements in a data center network: end hosts and switches.
These elements are interconnected by network links in a hierarchical, multi-rooted tree structure, as illustrated in the top right portion of Figure 1.
The end hosts are colored, and switches are labeled as S1, S2, etc.
A pair of end hosts with the same color might want to exchange some amount of bytes, say 100 GB (or simply 100G),
which is called <em>demand</em>. It is simple to find a path to route one demand.
But when there are so many of these demands from various end hosts, routing becomes a complicated optimization problem.
This is because we also need to ensure that the load on each path is balanced.
Hence we rely on a technique called <em>traffic engineering (TE)</em> to solve this problem.
The goal of TE is to route all demands while achieving optimal load balancing.
TE works as follows (shown in the top left part of Figure 1):</p>
<ol>
<li>With a global view of the network, a <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi21/presentation/ferguson">TE controller</a> collects demands from the network.</li>
<li>Next, TE solves the aforementioned optimization problem using some <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a> algorithm on a powerful server platform.</li>
<li>Once the TE controller has found the optimized routing for these demands, it generates a <em>TE solution</em> that contains weight distributions for each demand.
More specifically, the TE solution provides a concrete plan for each demand, specifying which paths to use and what fraction of the demand to allocate to each path.
This TE solution is then forwarded to the switches, which are responsible for implementing the specified distributions.</li>
</ol>
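<p>To make step 2 concrete, here is a minimal Python sketch of the optimization for the special case of parallel paths, where the min-max-utilization optimum is simply to load each path in proportion to its capacity. The capacities below are hypothetical; real TE controllers solve general topologies with linear programming.</p>

```python
def min_max_split(demand, capacities):
    """Split a demand across parallel paths so that every path ends up
    at the same utilization -- the min-max-utilization optimum for
    this special (parallel-path) case."""
    total = sum(capacities)
    return [demand * c / total for c in capacities]

# A 100G demand over two paths with (hypothetical) capacities of
# 310G and 190G is split 62G:38G, putting both links at 20% utilization.
split = min_max_split(100, [310, 190])
```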
<img src="te-system.png" alt="Architecture of a TE system" width="550"/>
<p align="center"><i>Figure 1: <b>Top:</b> A typical TE system in data center networks. <b>Bottom:</b> Weighted traffic distribution in switch hardware.</i></p>
<p>Taking the red host pair in Figure 1 as an example, the TE solution specifies that the 100G red demand should be split as 62G and 38G across 2 ports on switch S3.
(I will explain the strikethroughs in the figure shortly.)
This blog post focuses solely on how S3 implements this desired traffic distribution.
How the TE controller generates the TE solution or why the TE system works in this particular way is beyond the scope of this post.</p>
<p>Now, let us zoom into S3 and try to understand how it works, as depicted by the bottom half of Figure 1.
Switches forward packets using two sets of configuration rules stored in their <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Static_random-access_memory">SRAM</a> memory: destination rules and groups.
A destination rule stores a pre-configured IP address and a pointer to a group.
Upon receiving an incoming packet, the switch compares the destination IP address in the packet header with the IP address in each destination rule.
If the two addresses match, the packet follows the pointer in the rule to the corresponding group, which determines the packet’s next destination.
A <em>group</em> is a data structure containing a set of egress ports.
The packet consults the group, which selects one of the ports uniformly at random as the egress port.
For packets destined for the red host, they are assigned to group G1, while packets bound for other hosts use different groups.</p>
<p>To approximate the desired traffic distribution, ports p1 and p2 are replicated 62 and 38 times within group G1.
This ensures that 62% of the packets will be directed to port p1, with the remaining 38% routed to port p2.
However, a problem arises: these switches have very limited memory.
Meanwhile, entry replication often consumes a significant amount of memory space.
In other words, the groups may not fit into the switch memory.
Unfortunately, a straightforward solution of simply adding more memory proves difficult due to
a concept known as the <a rel="noopener" target="_blank" href="https://www.linkedin.com/pulse/understanding-power-performance-area-ppa-analysis-vlsi-priya-pandey/">power-performance-area tradeoff</a>.
On-chip switch memory cannot be built to simultaneously achieve high read/write speed, large space and energy efficiency.
We thus have to seek other approaches.</p>
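<p>As a minimal sketch of the entry-replication idea above (assuming uniform random port selection per packet, as described; real switches typically hash packet headers so that all packets of one flow stay on one path):</p>

```python
import random
from collections import Counter

# Group G1: ports p1 and p2 replicated 62 and 38 times, consuming
# 100 memory entries to encode the 62:38 weight distribution.
group_g1 = ["p1"] * 62 + ["p2"] * 38

rng = random.Random(0)
counts = Counter(rng.choice(group_g1) for _ in range(100_000))
share_p1 = counts["p1"] / 100_000   # close to 0.62
```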
<p>For demonstration purposes, let us assume that there is only space available for 10 port entries.
To fit into this space, G1’s size needs to be reduced since it currently requires 100 entries.
The straightforward approach is to reduce the weights to 31:19, which is equivalent to 62:38 but smaller.
However, this is still too large.
Another option is to adjust the weights and round them to 3:2.
This is advantageous because G1 now only consumes 5 entries!
The process of reducing groups to a smaller version is called <em>group reduction</em>.
Typically, this is handled by a group reduction algorithm running alongside the TE controller.
It is important to note that the final weights of 3:2 would change the desired traffic distribution from 62G:38G (struckthrough in Figure 1) to 60G:40G (bold).
This creates a difference in link loads (i.e., the absolute volume of bytes placed on a link) between what the TE solution specifies and what is actually implemented on the switch.
This difference on a link is referred to as <em><strong>precision loss</strong></em>.
While precision loss may seem detrimental, it is crucial to understand its impact on real-world networks, which will be discussed in the next section.</p>
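<p>A toy version of group reduction (my own illustrative heuristic, not the production algorithm) scales the weights down to the memory budget and rounds. The rounding, together with the floor of one entry per port, is exactly where precision loss enters; for skewed weights like 99:1, the floor can even make the rounded group overshoot the budget.</p>

```python
from math import gcd

def reduce_group(weights, max_entries):
    """Shrink integer port weights so the group fits in `max_entries`
    table slots. Rounding is where precision loss creeps in."""
    total = sum(weights)
    scaled = [max(1, round(w * max_entries / total)) for w in weights]
    g = gcd(*scaled)                      # drop any common factor
    return [w // g for w in scaled]

original = [62, 38]                   # desired split of the 100G demand
reduced = reduce_group(original, 10)  # -> [3, 2], i.e., a 60G:40G split

# Precision loss on each link: |62G - 60G| = |38G - 40G| = 2G.
```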
<h1 id="precision-loss-in-the-wild">Precision loss in the wild</h1>
<p>A good metric for quantifying the impact of precision loss is <em>link utilization</em>.
Link utilization reflects the aggregated load on a link as a percentage.
If the utilization exceeds the expected value, we know the link is overloaded.
While it is trivial to assess the utilization of one link, how can we evaluate the network as a whole?
The answer is to examine the distribution of link utilization.
Specifically, we are interested in the common case (p50) and tail (p99 and max) link utilization.</p>
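<p>For concreteness, the p50/p99 statistics can be computed with a nearest-rank percentile over per-link utilization samples (the utilization values below are hypothetical):</p>

```python
from math import ceil

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of the samples are at or below it."""
    s = sorted(samples)
    rank = max(1, ceil(p / 100 * len(s)))
    return s[rank - 1]

# Hypothetical link utilizations (as percentages) for a tiny network.
utils = [10, 12, 15, 20, 22, 25, 30, 40, 55, 90]
p50, p99, peak = percentile(utils, 50), percentile(utils, 99), max(utils)
```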
<p>Figure 2 presents a cumulative distribution function (CDF) of the utilization of all links
measured from a large <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3544216.3544265">Google production data center</a>.
There are two curves in the figure: the blue dashed curve reflects the inferred (ideal) link utilization if all weight distributions are faithfully implemented,
while the red solid curve represents the actual link utilization after weight adjustments in group reduction.
The reality deviates significantly from the ideal:
approximately 15% of the total links exhibit utilization higher than the <em>maximum ideal utilization</em>, by up to <em>5 times</em>!
Consequently, due to overloading caused by precision loss, the worst few links experience severe congestion.</p>
<img src="jupiter-lu.png" alt="Link utilization of a Google production network" width="240"/>
<p align="center"><i>Figure 2: Link utilization of a large Google production data center network.</i></p>
<p>While limited memory is the root cause of precision loss, it’s important to understand how limited memory affects precision loss in different ways.
Let’s revisit the groups in Figure 1. The total space required depends on three factors:
(1) the number of groups, (2) the number of ports per group, (3) the port weights.
It turns out that all three of these factors, along with switch heterogeneity, contribute to precision loss.
I will discuss them one by one:</p>
<ol>
<li><em>Number of groups.</em>
As the network scales larger, the number of groups increases,
resulting in less relative memory space per group.</li>
<li><em>Number of ports per group.</em>
The TE solution sometimes requires using many ports in a group to ensure failure resilience.
This also enlarges the group.</li>
<li><em>Skewed port weights.</em>
Some groups contain skewed weight distributions that are hard to reduce.
For example, consider a distribution like 99:1—this requires a total of 100 entries.
Since 1 is the smallest possible weight value, it cannot be reduced further.
To make this group smaller, we have to reduce the weight of 99.
However, the more we reduce it, the more precision loss occurs.</li>
<li><em>Heterogeneity.</em>
Data centers typically comprise switches from various generations, as illustrated in Table 1.
Older-generation switches have one-eighth the memory of newer ones, making it more challenging to accommodate groups.</li>
</ol>
<p align="center"><i>Table 1: Switch hardware profile in a Google production network.</i></p>
<table><thead><tr><th align="center">Switch generation</th><th align="center">Memory limit</th></tr></thead><tbody>
<tr><td align="center">Old generation</td><td align="center">4096 port entries</td></tr>
<tr><td align="center">New generation</td><td align="center">32768 port entries</td></tr>
</tbody></table>
<p>We have just seen the impact of precision loss and its root causes.
The next section will explore how to minimize precision loss.</p>
<h1 id="time-for-some-new-group-reduction-algorithms">Time for some new group reduction algorithms</h1>
<p>Recall that the focus of this work is to accurately map TE solutions to groups in switches.
The key to this task lies in an efficient group reduction algorithm.
The existing algorithm used in Google’s production network—called <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2592798.2592803">WCMP TableFitting</a>—is inadequate in certain scenarios.
We have identified opportunities for improvement, thus proposing two new group reduction algorithms.
Some of you might wonder: why two?
It is because each of the two algorithms focuses on a different aspect:
one algorithm (named <em>Direct Reduction</em>) aims to find the most precise weights possible;
the other algorithm (named <em>Greedy Reduction</em>) strives to achieve faster reduction speed by sacrificing some optimality.</p>
<img src="dedup.png" alt="Example of de-duplication." width="800"/>
<p align="center"><i>Figure 3: Illustration of how group de-duplication works.</i></p>
<img src="alloc.png" alt="Example of demand-aware allocation." width="450"/>
<p align="center"><i>Figure 4: Illustration of how demand-aware resource allocation works.</i></p>
<p>I will omit the technical details about the algorithms. Those interested can take a look at our <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi24/presentation/chen-shawn">NSDI paper</a>.
The full implementation of our group reduction algorithms can also be found <a rel="noopener" target="_blank" href="https://shuoshuc.github.io/FabricEval/">on GitHub</a>.
Here, let me just highlight the core ideas that power both algorithms:</p>
<ul>
<li>
<p>Idea 1: <strong>De-duplication</strong>.
Recall that packets destined for different IP addresses are handled by different groups, as Figure 3 shows.
But what if these “different groups” actually appear identical?
Namely, the groups contain the same set of ports, and the weights for each port are also identical.
This situation is not uncommon, especially after group reduction, where groups can become identical.
In such cases, de-duplicating these identical groups and reusing a single instance allows for a more efficient use of the limited memory space available in switches.</p>
</li>
<li>
<p>Idea 2: <strong>Space allocation</strong>.
Groups naturally handle varying amounts of network traffic (demands), as reflected by the arrow sizes in Figure 4.
Consequently, their contribution to the overall precision loss differs.
Allocating a large chunk of memory space to a group serving minimal traffic would be inefficient (see the demand-unaware allocation example in Figure 4).
Therefore, we require a form of performance isolation.
Our idea is to divide the available memory space into multiple allocations,
with the size of each allocation proportional to the traffic volume of a group.
In other words, each group receives dedicated space based on its traffic volume.
This is illustrated by the demand-aware allocation example in Figure 4.</p>
</li>
</ul>
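<p>The two ideas above can be sketched in a few lines of Python (the data structures and numbers are illustrative, not the paper's actual implementation):</p>

```python
def dedup(groups):
    """Idea 1: destination rules whose groups are identical share one
    table instance. Each group is a tuple of (port, weight) pairs."""
    table, pointers = {}, []
    for g in groups:
        key = tuple(sorted(g))
        if key not in table:
            table[key] = len(table)     # install one shared copy
        pointers.append(table[key])     # destination rule -> group
    return table, pointers

def allocate(traffic, total_entries):
    """Idea 2: give each group table space proportional to the traffic
    it carries, with a floor of 1 entry."""
    total = sum(traffic)
    return [max(1, total_entries * t // total) for t in traffic]

# Two rules map to the same reduced group, a third to another:
groups = [(("p1", 3), ("p2", 2)), (("p1", 3), ("p2", 2)), (("p3", 1),)]
table, pointers = dedup(groups)      # 2 unique groups, pointers [0, 0, 1]

# A 90G group gets 9x the space of a 10G group under a 100-entry budget.
budget = allocate([90, 10], 100)     # [90, 10]
```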
<p>These simple techniques work surprisingly well. In the final section, I’d like to show you some evaluation results.</p>
<h1 id="evaluation">Evaluation</h1>
<p>We evaluate the proposed group reduction algorithms using <a rel="noopener" target="_blank" href="https://shuoshuc.github.io/FabricEval/">FabricEval</a>,
a traffic engineering evaluation framework developed specifically for this project.
Both the proposed algorithms and the current approach (WCMP TableFitting)
are implemented inside FabricEval for side-by-side comparison.</p>
<div style="display:flex">
<div style="padding-left:30px;align-self:center;">
<img src="lu.png" width="260"/>
<p align="center"><i>Figure 5: Common and tail link utilization.</i></p>
</div>
<div style="padding-left:30px;align-self:center;">
<img src="solve-speed.png" width="390"/>
<p align="center"><i>Figure 6: Time to complete group reduction.</i></p>
</div>
</div>
<p>The previous section has mentioned that WCMP TableFitting falls short in certain scenarios.
Here is a set of results showing how the new algorithms reduce precision loss compared to WCMP TableFitting.
Figure 5 displays the common case (p50) link utilization and tail utilization (p99 and max) in a
large-scale production-like data center network. Instead of testing in a real
deployment, we model after a Google production network and evaluate with simulation in FabricEval.
FabricEval allows us to run controlled experiments without interrupting user traffic.
The ideal (blue) bar represents the ideal link utilization without precision loss.
WCMP TableFitting exhibits up to 67% higher link utilization than the ideal case.
Both <em>Direct Reduction</em> and <em>Greedy Reduction</em> show no more than 7% higher link utilization than the ideal.
<strong>Our algorithms reduce precision loss by 10x compared to WCMP TableFitting.</strong></p>
<p>Precision loss is just one aspect of the story. Since these algorithms operate in an online environment, we also care about their speed.
Figure 6 illustrates the time they take to complete reduction for a batch of groups.
It suffices to note that the evaluation covers a range of network scenarios, from common to rare, as listed on the x-axis.
One observation is that reducing a set of groups can take anywhere from a tenth of a second to hundreds of seconds,
depending on their complexity.
<strong>Compared to WCMP TableFitting, <em>Greedy Reduction</em> runs 1-2 orders of magnitude faster.
<em>Direct Reduction</em>, on the other hand, performs similarly to WCMP TableFitting.</strong></p>
<p>Of course, what I have presented above are just a few examples from a comprehensive set of experiments.
More information can be found in our <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi24/presentation/chen-shawn">NSDI paper</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Precision loss is an inherent problem when implementing traffic engineering with
limited switch memory resources. Large-scale heterogeneous data center networks
have exacerbated this problem. We propose two new group reduction algorithms that
can map original weight distributions in the TE solution to groups on the switch with minimal precision loss.
Evaluation results show that our algorithms achieve significant improvements
over the existing approach (WCMP TableFitting).
Our algorithms reduce precision loss by 10x and run up to 10x faster than WCMP TableFitting under various network scenarios.</p>
<p><strong>Acknowledgements.</strong> This work is a collaboration with <a rel="noopener" target="_blank" href="https://keqianghe.github.io/">Keqiang He</a> (Airbnb),
<a rel="noopener" target="_blank" href="https://research.google/people/rui-wang/">Rui Wang</a> (Google),
and my advisors <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Esrini/">Srinivasan Seshan</a> and <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Eprs/">Peter Steenkiste</a>.
I would like to thank Adithya Abraham Philip, Zico Kolter, Akshitha Sriraman,
Miguel Ferreira, Wei Bai, Brian Chang, Yige Hong, Weiyang Wang, Min Yee Teh and Nandita Dukkipati
for their feedback. The CMU authors are sponsored by the U.S. Army Contracting
Command under award number W911QX20D0008.
The views and conclusions contained in this document are those of the author
and should not be interpreted as representing the official policies,
either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.</p>
Understanding Setup Times2024-05-08T00:00:00+00:002024-05-08T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/understanding-setup/<h2 id="why-understand-setup-times"><strong>Why Understand Setup Times?</strong></h2>
<h3 id="setup-times-waiting-frustration"><strong>Setup Times = Waiting + Frustration</strong></h3>
<p>Nobody likes waiting in line.
But some of the most frustrating experiences I’ve ever had waiting are when I get in a super long line, peek around to the front, and see that the server isn’t even ready to serve: they’re still setting up!
It’s terrible, and it happens everywhere:</p>
<ul>
<li>You’re on break and you just want to log in to your favorite game, but “the system is experiencing unusually high load” and the login process is taking forever.</li>
<li>You’re at a cookout and you just want to grab a burger, but the grill is still heating up.</li>
<li>You’re dead-tired from being sick and you just want to grab your antibiotics and go to sleep, but the pharmacist has to go through an excruciatingly long badge-in process.</li>
</ul>
<p>The frustrating part of these situations isn’t really the waiting <em>per se</em>; kids learn to wait their turn in kindergarten.
No, the frustrating part is that you’re waiting and somehow it almost feels unnecessary; why weren’t these servers ready before this huge line formed in the first place?
<strong>Why do we spend so much time waiting for servers to set up?</strong></p>
<h3 id="why-we-wait"><strong>Why we wait</strong></h3>
<p>Of course, the answer to “why do we wait?” is the usual answer: <strong>because not-waiting costs money.</strong>
In basically all of these queueing systems, you could just have all of your servers running all the time.
And if the only thing you cared about was how long people spent waiting, then of course you <em>would</em> just have all of your servers running all the time.
But keeping a server on costs money, even if that server is not actively doing work.
That’s why, in many queueing systems, instead of keeping their servers on all the time, system managers will <em>actively scale</em> the number of servers that are on in a dynamic way.
It turns out that if you do this “dynamic scaling” in the right way, then you can cut down on operating costs <em>a lot.</em>
For example, Google’s version of dynamic scaling, called <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/3342195.3387524">Autopilot</a>, was able to cut average resource waste in half, from 46% to 23%.
And keep in mind that when we say “wasted resources,” we’re not just talking about wasted money; we’re also talking about unnecessary CO2 emissions and increased energy demand.</p>
<h3 id="why-we-only-sometimes-wait"><strong>Why we (only sometimes) wait</strong></h3>
<p>Alright, so then why doesn’t everyone implement the most-extreme version of dynamic scaling they can imagine, always keeping their system an inch away from being understaffed?
Well, the answer is simple: nobody likes waiting, and if your customers have to wait for too long, then they’ll take their business elsewhere.
If you want to keep your customers while also conserving resources, you’ve got to balance <em>waiting</em> with <em>wasting</em> when you design your system.
In the best case scenario, you find a design sitting in that optimal sweet spot, where your system uses just enough resources to be sure that your customers don’t spend too long waiting.
Unfortunately, we’re not even close to being able to find that sweet spot, since we haven’t been able to answer one of the most basic questions in this space: <strong>“How does the average waiting time behave in systems with setup times?”</strong></p>
<h2 id="understanding-setup-what-we-know"><strong>Understanding Setup: What we know</strong></h2>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setupVnosetup.png" alt="bar plot comparing setup to no-setup. The blue bar (no-setup) is much smaller than the green bar (Setup). Sim results with average job length of 1 m.s., load of one half, and setup time of 100 m.s. " /></th></tr></thead><tbody>
<tr><td align="center">Setup times hurt.</td></tr>
<tr><td align="center">Systems <em>without</em> setup have much, much lower delay than systems <em>with</em> setup. Simulation results with average job length of 1 ms, load of 0.5, and setup time of 100 ms.</td></tr>
</tbody></table>
<h3 id="setup-times-hurt"><strong>Setup times hurt.</strong></h3>
<p>At this point you might be wondering whether setup times actually hurt performance <em>that</em> much.
The short answer is: yes, in real systems, setup times hurt.
The longer answer is that the effect of setup times on a system is a complex interaction between 1) the length of a typical setup time, 2) the length of a typical <em>service time</em> (the average length of a job), 3) the total number of servers available, and 4) the arrival rate of jobs.
In real systems, where setup times are <em>hundreds</em> —even <em>thousands</em>— of times larger than service times, the average waiting time of a customer can be <em>almost wholly determined</em> by the system’s setup behavior.</p>
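<p>To see the scale of the effect, here is a minimal single-server simulation (my own sketch, much simpler than the multiserver model discussed below): the server turns off whenever it goes idle and must pay a fixed setup time before serving the next arrival. With the parameters from the table above (1 ms jobs, load 0.5, 100 ms setup), the average wait is dominated by the setup behavior.</p>

```python
import random

def avg_wait(n_jobs, arrival_rate, mean_job, setup, seed=0):
    """Mean queueing delay in a single-server FCFS queue where an idle
    server turns off and pays `setup` before the next service."""
    rng = random.Random(seed)
    t = 0.0         # arrival clock
    free_at = 0.0   # when the server next becomes free
    total_wait = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)        # Poisson arrivals
        if t >= free_at:                          # server idled -> off
            free_at = t + setup                   # pay the setup time
        total_wait += free_at - t                 # time spent in queue
        free_at += rng.expovariate(1 / mean_job)  # exponential service
    return total_wait / n_jobs

no_setup = avg_wait(100_000, 0.5, 1.0, 0.0)      # ~1 ms (M/M/1 at load 0.5)
with_setup = avg_wait(100_000, 0.5, 1.0, 100.0)  # far larger: setup dominates
```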
<table><thead><tr><th align="center"><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-model.png"><img src="setup-model.png" width="250"></a></th></tr></thead><tbody>
<tr><td align="center">A multiserver system with setup.</td></tr>
<tr><td align="center">Jobs (blue rectangles) enter into a central queue, where they wait in FCFS order until they are served by one of <em>k</em> servers (white circles with black outline containing 1 of 3 elements). Servers can be <em>off</em> (red “X”), <em>on</em> (blue rectangle), or <em>in setup</em> (green hourglass).</td></tr>
</tbody></table>
<h3 id="but-understanding-setup-is-hard"><strong>But understanding setup is <em>hard.</em></strong></h3>
<p>That said, understanding <em>how</em> and <em>why</em> setup times destroy queueing performance has taken the better part of a century.
Formal study began with <a rel="noopener" target="_blank" href="https://www.jstor.org/stable/167778">Welch’s</a> paper on “exceptional first services.”
In his study of single-server systems with setup times, he obtained a closed-form expression for not just the average waiting time, but the entire waiting time distribution!
In the multiserver case, little was known until the seminal 2014 paper of <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> on “Recursive Renewal Reward.”
In their breakthrough paper, the authors develop and study a model of multiserver setup,
demonstrating that, given a number of servers \( k \), the average waiting time can be computed by solving a system of \( O(k^2) \) quadratic equations.
Since then, all theoretical work on setup times has studied setup times using their model.</p>
<p>However, even considering the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> paper and all the work it spawned, we still don’t <em>really</em> understand setup.
The model they study is very much an <em>approximate</em> model: it assumes that setup times vary much more than they do in practice, which enables them to adapt a wide array of existing queueing analysis techniques to the setup time setting.
Unfortunately, when setup times are large (and they are often <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9582255">thousands of times larger than the average job size</a>), this assumption causes their model to dramatically underestimate the harm caused by setup times, especially when looking at larger systems.
However, even after acknowledging that this model underestimates the harm due to setup, <strong>most setup researchers continue to use the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> model.</strong></p>
<h2 id="what-makes-setup-hard"><strong>What makes setup hard?</strong></h2>
<p>To fully understand why the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> model is still in use, we first need to discuss what makes setup systems so hard to understand.</p>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-indirect.png" alt="A comic illustrating the indirect nature of setup. In the first panel, the queue starts empty. Then, a job arrives triggering the setup of the server. While the initial job is waiting for the server to turn on, more jobs arrive, all observing the server in setup, the main cause of their delay. However, jobs which arrive after the server is ready still experience additional delay, but do not directly observe why they are delayed." /></th></tr></thead><tbody>
<tr><td align="center">Setup can be invisible.</td></tr>
<tr><td align="center">A comic illustrating the indirect nature of setup. In the first panel, the queue starts empty. Then, a job arrives triggering the setup of the server. While the initial job is waiting for the server to turn on, more jobs arrive, all observing the server in setup —the main cause of their delay. However, jobs which arrive after the server is on still experience additional delay —but do not directly observe <em>why</em> they are delayed.</td></tr>
</tbody></table>
<h3 id="first-reason-the-setup-effect-can-be-invisible"><strong>First reason: The setup effect can be <em>invisible.</em></strong></h3>
<p>There are two reasons why setup is so hard to understand.
First, the harm caused by setup times can be invisible.
For example, if I’m the first person in line when the pharmacist starts badging in, then I can directly see the reason why I’m waiting; I can observe the setup process.
But, while I’m waiting in line, other people will get in line behind me, and when the pharmacist finally finishes badging in, the line might be pretty long.
At that point, everyone in the line knows exactly why we have been waiting for so long: setup times.
But if another customer arrives after the pharmacist has finished badging in, then they’ll have <em>no idea</em> why the line is so long; the harm caused by setup times has become <strong>invisible.</strong></p>
<h3 id="second-reason-servers-interact"><strong>Second reason: Servers <em>interact.</em></strong></h3>
<p>The second hard-to-understand aspect of setup times only emerges when there are multiple servers in play.
Setup gets more complicated in multiserver systems because now the work of one server can influence the setup behavior of another.
For example, suppose there are two pharmacists on hand, but only one is currently badged in and serving customers; the other pharmacist is in the back, filling prescriptions.
If the line gets <em>too</em> long, the pharmacist in the back might think they need to start serving customers, and thus begin the long drawn-out badge-in process.
If, however, the already-serving front pharmacist somehow quickly works through the line, then it might not even make sense for the not-yet-serving back-pharmacist to complete the badge-in process; it might make sense for them to <em>cancel</em> their setup.
Note that something like this would never happen if there was only one pharmacist, since, if there’s only one pharmacist and they’re currently badging in, there’s no way for the line to disappear (assuming everyone doesn’t abandon the line altogether).
More generally, if we scale <em>up</em> when the line is long and scale <em>down</em> when the line is short/empty, then the setup behavior of our servers becomes governed by how quickly the already-on servers are working;
<strong>this interaction between servers makes multiserver setup much, much more complex.</strong></p>
<h2 id="where-we-went-wrong-before"><strong>Where we went wrong before</strong></h2>
<p>The main issue with all the previous research on setup lies in how they dealt with this “interaction” complication.
For context, when studying complicated systems, reality is often way too hard to understand directly.
In order to make progress, researchers need to make simplifying assumptions about various aspects of their system.
Done correctly, these simplifying assumptions can allow us to discard the unnecessary details of a system and draw meaningful conclusions about the parts that actually matter.
That said, if these simplifying assumptions are <em>too unrealistic,</em> then, even if we understand the simplified model well, the conclusions we obtain might end up meaning very little; this is exactly what happens in the previous model of setup.</p>
<h3 id="first-issue-an-unrealistic-but-tractable-model"><strong>First issue: An unrealistic, but tractable model.</strong></h3>
<p>As we discussed, before our work, the state-of-the-art model of multiserver setup made a simplification that turns out to be too unrealistic.
I mean “unrealistic” here in two different ways.
First, the model behavior is unrealistic in the sense that its moment-to-moment behavior doesn’t seem to match what happens in reality.
Without going into too much detail, by assuming that setup times are distributed <em>Exponentially</em>, in the previous model, <strong>the speed of setup ends up <em>scaling</em> with the number of servers in setup.</strong>
For example, if 100 servers are setting up, the first of them finishes setting up ~100 times faster than a lone server would.
In real systems, this couldn’t be further from what actually happens: when you turn on a computer, it goes through a series of steps that take almost the same amount of time, every time.
That said, plenty of useful models contain strange or unrealistic edge cases in their behavior; that, in and of itself, is not enough to prevent a model from being useful.</p>
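<p>To make the scaling artifact concrete: the minimum of <em>n</em> independent Exponential setup times with mean <em>s</em> is itself Exponential with mean <em>s/n</em>, so under the previous model the first of 100 booting servers finishes ~100 times sooner than a lone server would. A quick simulation sketch (the mean setup time of 100 and the trial count are illustrative values, not from the thesis):</p>

```python
import random

# Sketch of the Exponential model's scaling artifact (mean setup time
# s = 100 and the trial count are illustrative, not from the thesis).
def first_setup_completion(n, s=100.0, trials=10_000):
    """Average time until the FIRST of n booting servers finishes,
    when each setup time is an independent Exponential with mean s."""
    total = 0.0
    for _ in range(trials):
        total += min(random.expovariate(1.0 / s) for _ in range(n))
    return total / trials

random.seed(0)
one = first_setup_completion(1)        # ~ s       = 100
hundred = first_setup_completion(100)  # ~ s / 100 = 1
print(one, hundred)
# Under Deterministic setup, both cases would take exactly s = 100.
```

<p>In other words, the Exponential assumption quietly turns a fixed boot-up delay into one that shrinks as more servers boot, which real hardware does not do.</p>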
<h3 id="second-issue-unrealistic-behavior-poor-predictions"><strong>Second issue: Unrealistic behavior => Poor predictions</strong></h3>
<p>However, although we could maybe ignore some unrealistic behavior, the second problem of prior work is much more serious: <strong>previous models <em>vastly</em> underestimate the harm caused by setup times.</strong>
In simulations, we’ve found that the average waiting time predicted by previous models can be orders of magnitude smaller than the true waiting time.
Although the true waiting times and the previous model’s predicted waiting times are (relatively) close when studying small systems, the relative gap between these predictions rapidly widens as we increase the system scale.
Given that we don’t understand exactly how the <em>size</em> of this gap behaves, it’s extremely difficult to know when it’s actually alright to use the previous model.</p>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-comparison-smaller.png" alt="bar plot with no setup (blue bars), previous model (orange bars), Reality (green bars), and Our approximation (light blue bars). The no setup bars are very, very small and the previous model varies widely, but the Our approximation bars are very close to their Reality counterparts. Uses same parameters as previous figure." /></th></tr></thead><tbody>
<tr><td align="center">Our results match reality.</td></tr>
<tr><td align="center">A bar plot comparing the predictions of our approximation to the predictions of the Exponential model. Our approximation is much closer to reality! Simulation results with average job length of 1 ms, load of 0.5, and setup time of 100 ms.</td></tr>
</tbody></table>
<h2 id="how-our-results-change-the-game"><strong>How our results change the game</strong></h2>
<p>Our results change the game in three major ways: Compared to previous work, we 1) study a much more realistic model, 2) prove much stronger theoretical results, and 3) greatly improve upon the practical utility of existing work.
Let’s describe each point in a little more detail.</p>
<h3 id="our-model-is-more-realistic"><strong>Our model is more realistic.</strong></h3>
<p>First, let’s talk about how the setup process in our model is more realistic than in previous models.
As we noted before, there’s a big difference in performance between systems with and without setup times.
But previous models assume that setup times are distributed Exponentially, an assumption which leads them to dramatically underestimate the harm caused by setup times.
In particular, this Exponential assumption makes it so that, when more servers set up, the setup process happens faster.
In contrast, in our model, we make the setup time <em>Deterministic:</em> setup times take the same amount of time, every time.
For example, if booting up a single server takes a minute, then booting up 100 servers will <em>also</em> take a full minute.
In other words, we study setup times as they actually occur in <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9582255">real systems</a>.</p>
<h3 id="our-results-are-stronger"><strong>Our results are stronger.</strong></h3>
<p>Second, our main results are stronger than most (if not all) previous results on multiserver setup.
Our two main results investigate the average waiting time in our new, more realistic model.
In particular, we give both an upper bound and a lower bound on the average wait in our model, and also show that these upper and lower bounds differ by at most a multiplicative factor.
Moreover, the bounds we derive are just explicit closed-form formulae; no additional computation is needed (see my <a rel="noopener" target="_blank" href="https://jalaniw.github.io">thesis document</a> for the details).
While our results are enhanced by our more realistic model, these are also the <strong>first closed-form results ever</strong> for any finite-server system with setup times.</p>
<h3 id="our-results-are-more-practically-useful"><strong>Our results are more practically useful.</strong></h3>
<p>These bounds also give rise to our practical contribution: by combining the analysis of our upper and lower bounds, we also construct an easy-to-compute and extremely accurate approximation to the average waiting time.
And while you’ll need to look at my thesis to see the full formula, this approximation goes beyond the state-of-the-art in three important respects:</p>
<ol>
<li>First, as we’ve already stated, our approximation gives extremely accurate predictions, whereas the previous state-of-the-art model underpredicts waiting times by orders of magnitude.</li>
<li>Second, our approximation is fast/cheap to compute. Since it is a simple rational function of the system parameters, it can be easily computed, even by hand. By contrast, in order to generate a single prediction, the previous state-of-the-art required one to solve a large system of quadratic equations.</li>
<li>Third, our approximation gives intuition. As discussed, since our approximation is a simple formula, one can directly observe <em>how</em> and <em>why</em> the waiting time will increase or decrease in response to some alteration in the system parameters. On the other hand, it’s difficult to anticipate someone getting a lot of intuition from the previous state-of-the-art’s complicated system of equations.</li>
</ol>
<p>In other words, our approximation is <em>better,</em> <em>faster,</em> and <em>stronger.</em> These three aspects of the approximation together make it <em>much, much easier</em> to design multiserver systems with both performance <em>and</em> efficiency in mind.</p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>Given the above, you’re probably chomping at the bit to learn how we managed to prove such uniquely powerful results,
especially given that the model we study has <em>Deterministic</em> setup times, which are notoriously difficult to analyze with existing methods.
Unfortunately, we don’t have time here to properly discuss the method that we developed to analyze our complicated Deterministic model.
However, those interested will be able to find an in-depth description of the <strong>MIST</strong> method in my thesis document, available on my <a rel="noopener" target="_blank" href="https://jalaniw.github.io">website</a> after my defense.
And don’t worry, it’ll definitely be posted by that day — no need to factor in setup times.</p>
T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction
2024-05-05
https://www.cs.cmu.edu/~csd-phd-blog/2024/t2fpv/
<p>As AI technology advances, more and more autonomous robots are being tasked with
navigating among people in shared environments. Such applications span academia
and industry, including sidewalk delivery robots, robotic museum
guides, and automated room service in hotels. In order to have robust,
socially-compliant navigation policies, these robots must be adept at predicting
pedestrian motion in busy environments. In these environments,
the most natural setting for humans and robots is an egocentric, first-person
view (FPV), where a camera is placed on the robot itself as it moves around, as
highlighted in the figure below:</p>
<figure style="text-align:center;">
<img src=./sidewalk_robot.png width="400" alt="Coco Delivery sidewalk robot" title="Coco Delivery sidewalk robot">
<figcaption style="margin-top:10px">A Coco Delivery robot, having to navigate among humans.</figcaption>
</figure>
<p>However, the vast majority of prior work in pedestrian trajectory prediction
instead relies on third-person sensing, where cameras are mounted
on infrastructure (such as rooftops) or on a stationary drone hovering
above the scene in order to record naturalistic behavior of humans.
These bird’s-eye view (BEV) recordings are then processed into datasets
comprising ground-truth examples of how humans navigate among each other, and
used to train downstream trajectory prediction models.
So, <strong>why has this been the standard approach</strong>, and <strong>why is it problematic</strong>?</p>
<h2 id="background">Background</h2>
<p>Pedestrian motion prediction is important as it is used to inform a robot’s
planning module as to other peoples’ intents and likely paths. This is used not
just to avoid collisions, but also to ensure social compliance—that is, moving
in a way so as to not cause disruption, discomfort, or general inconvenience to
humans in the scene.</p>
<p>While heuristic models, such as social forces, have long been utilized, the
advancement of deep learning techniques has dominated the recent
state-of-the-art (SOTA) [8, 9]. In these approaches, a machine learning model is
trained on a dataset of recorded human behavior as a form of static
learning-from-demonstration. These datasets are considered trajectory datasets,
containing at minimum the coordinates over time (or trajectory) of all people
(or “agents”) in a given scene. High quality dataset curation is thus paramount
for high quality model performance.</p>
<p>A majority of existing datasets for this problem utilize a top-down perspective,
such as the ETH/UCY [1] collection of pedestrian datasets, shown below. One
reason for this approach is the ease of annotation. In a BEV perspective,
peoples’ trajectories can be easily tracked in ground-plane coordinates over
time. This is in stark contrast to FPV, where 3D segmentation and detection
algorithms are required to annotate observed pedestrians. Furthermore, using BEV
eliminates much of the occlusion problem, where people may be impossible to
annotate when behind other people, buildings, or objects. This occlusion can
lead to tracking errors, like misassociation or losing somebody’s trajectory
altogether, which can require data imputation to fill the missing values.
Together, these errors and noise from FPV sensing are denoted as “FPV errors”.</p>
<figure style="text-align:center;">
<img src=./hotel_bev.png width="400" alt="Hotel example from ETH/UCY" title="Hotel example from ETH/UCY">
<figcaption style="margin-top:10px">ETH/UCY dataset example in Hotel setting.</figcaption>
</figure>
<p>Another problem that the BEV perspective addresses is ensuring naturalistic
human behavior. To collect data in FPV, either robots or humans themselves have
to wear cameras when navigating among others. This can lead to several undesired
psychological effects, such as the Hawthorne effect, where people’s behavior can
change when they know they’re being observed, as well as the novelty effect,
where humans behave differently when first exposed to new technology [10].
Thus, while FPV datasets for pedestrian trajectory prediction exist, they
both contain FPV errors and offer no guarantees of naturalistic behavior in the
first place.</p>
<p>However, BEV data has a very serious flaw: in nearly all deployed
situations, a robot does <strong>not</strong> have access to top-down, perfect information of
others in the scene. Training a prediction policy which <strong>relies</strong> on having
this information, rather than having to deal with FPV errors, causes prediction
models to be unrealistically effective, leading to a false sense of confidence
in their abilities.</p>
<h2 id="our-approach-trajectories-to-first-person-view-t2fpv">Our Approach: Trajectories to First Person View (T2FPV)</h2>
<p>To address the above limitations, we propose “Trajectories to First Person View”
(T2FPV), a method for constructing an FPV version of data from a trajectory-only
dataset. This process entails starting with a BEV-recorded dataset and then
performing a high-fidelity simulation from each person’s FPV perspective. We use
this approach to generate, annotate, and release a version of the popular
ETH/UCY dataset in this new perspective. Then, we conduct SOTA detection and
tracking therein to get realistic partial perception from each person’s view. In
this setting, we observe the effects of FPV errors, and develop a module to
address them by refining the initial imputation of missing data in an end-to-end
manner with trajectory prediction. This “Correction of FPV Errors” (CoFE)
module decreases prediction displacement errors by more than 10% on average when
compared to all tested imputation and prediction approach combinations.</p>
<h2 id="constructing-an-fpv-dataset">Constructing an FPV Dataset</h2>
<p>We begin by leveraging the SEANavBench [2] simulation environment, consisting of
the five high-fidelity pre-modeled locations, or “folds”, within the ETH/UCY
dataset. We then replay the recorded data by attaching a simulated camera to
each pedestrian, requiring a few simplifying assumptions: pedestrians are all
roughly the same height, are rendered with randomly selected human models, and
gaze in the direction they are moving. We render videos for
each person and output ground truth (GT) annotations at each frame, consisting of the
list of which other people are visible at any given time.</p>
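<p>As a rough illustration of what these per-frame visibility annotations encode, the sketch below marks an agent visible when it lies inside the camera’s horizontal field of view and a maximum range. The function name, the 90-degree FOV, and the 15 m range are hypothetical choices, and occlusion is ignored for brevity:</p>

```python
import math

# Hypothetical visibility test (the 90-degree FOV, 15 m range, and all
# names are illustrative assumptions; occlusion is ignored for brevity).
def visible(cam_pos, cam_heading_rad, agent_pos,
            fov_rad=math.radians(90), max_range=15.0):
    dx, dy = agent_pos[0] - cam_pos[0], agent_pos[1] - cam_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_range:
        return False
    # Smallest angle between the camera heading and the bearing to the agent.
    bearing = math.atan2(dy, dx)
    diff = abs((bearing - cam_heading_rad + math.pi) % (2 * math.pi) - math.pi)
    return diff <= fov_rad / 2

print(visible((0, 0), 0.0, (5, 1)))   # in front, within FOV
print(visible((0, 0), 0.0, (-5, 0)))  # behind the camera
```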
<p>Next, we conduct SOTA detection and tracking on these rendered videos, in order
to emulate realistic perception which a deployed social navigation robot uses.
We employ an off-the-shelf object detector, DD3D [3], as well as a very
effective probabilistic tracker [4]. We modify these components to improve performance
on our specific task, for example by altering the tracker’s matching metric and
adjusting the detector’s feature map thresholds. As is common for ETH/UCY evaluation, we
train one model for each of the five folds as a test set, using the other four
folds as training and validation in a leave-one-out manner. Overall, we find
that this approach performs reasonably well on standard metrics including
Average Precision and Average Multi-Object Tracking Accuracy.</p>
<p>Finally, we put together these outputs into an FPV dataset. We start with the
standard scene segmentation steps on ETH/UCY, where scenes are considered in a
fixed-length, sliding window manner over the original data, and only scenes with
at least two agents present at the same time are kept. These scenes consist of
20 timesteps at 2.5 frames per second, where the first eight are kept as
“history” and the next 12 are considered the “future” to be predicted. We
utilize the Hungarian matching algorithm [11] to associate together the GT set of
visible agents (from our simulation annotations directly) with the observed set
of people from detection and tracking. Where a given BEV scene has <em>N</em> people
moving around at the same time, we thus create <em>N</em> FPV variations of this scene,
from each agent’s perspective. Importantly, <strong>these scenes contain FPV errors!</strong></p>
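<p>The association step amounts to a minimum-cost assignment between the two sets of positions. The post uses the Hungarian algorithm [11]; for the tiny scene below (all coordinates invented), brute force over permutations finds the same optimal matching:</p>

```python
import itertools
import math

# Toy association between ground-truth agents and tracked detections
# (coordinates invented). The post uses the Hungarian algorithm; for a
# tiny scene, brute force over permutations finds the same minimum-cost
# assignment.
def associate(gt_positions, tracked_positions):
    n = len(gt_positions)
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(gt_positions[i], tracked_positions[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))

gt = [(0.0, 0.0), (5.0, 0.0)]
tracks = [(5.1, 0.2), (0.2, -0.1)]
print(associate(gt, tracks))  # [(0, 1), (1, 0)]: GT 0 matches track 1
```

<p>The Hungarian algorithm computes the same optimum in polynomial time, which matters once a scene contains more than a handful of agents.</p>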
<p>This entire process is highlighted in the figure below, showing how a single top-down scene
produces many first-person scenes. The heading titles such as “Sec IV-A” refer to sections in
our reference paper, for further reading [5].</p>
<figure style="text-align:center;">
<img src=./overview.png width="800" alt="T2FPV process overview" title="T2FPV process overview">
<figcaption style="margin-top:10px">T2FPV process overview; real-world top down trajectories are transformed to a first-person version in simulation, with limited (i.e., more realistic) perception.</figcaption>
</figure>
<h2 id="improving-robustness-to-perception-errors-cofe">Improving Robustness to Perception Errors: CoFE</h2>
<p>As discussed above, one key type of FPV error is that of missing observations in
the history portion of a trajectory due to detection and tracking errors.
This requires the imputation of missing data points for most trajectory prediction
methods. Although many prior works leverage simple imputation approaches like
linear interpolation and exponential smoothing, there are more sophisticated,
SOTA deep learning imputation techniques such as NAOMI [6]. However, these
approaches still rely on assumptions that are unrealistic for human motion
prediction: 1) data points are missing in a random manner; and 2) data points
observed around missing values can be trusted. The first assumption does not hold
because data is missing pathologically due to errors in the perception system,
whereas the second fails because surrounding data points also incur positional
estimation errors from perception.</p>
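<p>For reference, the linear-interpolation baseline mentioned above can be sketched as follows; the <code>None</code>-based encoding of missing points is an assumption made for this illustration:</p>

```python
# Linear-interpolation imputation over a 2D history track, with missing
# observations encoded as None (an assumption made for this sketch).
def linear_impute(track):
    track = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for i in range(len(track)):
        if track[i] is None:
            lo = max((k for k in known if k < i), default=None)
            hi = min((k for k in known if k > i), default=None)
            if lo is None:        # nothing observed before: copy forward
                track[i] = track[hi]
            elif hi is None:      # nothing observed after: copy backward
                track[i] = track[lo]
            else:                 # interpolate between nearest neighbors
                t = (i - lo) / (hi - lo)
                track[i] = tuple(a + t * (b - a)
                                 for a, b in zip(track[lo], track[hi]))
    return track

print(linear_impute([(0.0, 0.0), None, None, (3.0, 3.0)]))
# [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
```

<p>Note how this baseline embodies exactly the two assumptions above: gaps are filled as if they occurred at random, and the observed endpoints are trusted completely.</p>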
<p>Therefore, we propose to incorporate a new module to sit in between the
imputation and prediction steps of the pipeline, consisting of a neural network
trained end-to-end (E2E) with the downstream prediction model. This “Correction
of FPV Errors”, or CoFE, module is similar to previous recurrent neural network
(RNN) prediction approaches. The model first takes in an initial guess at
imputation from an upstream method (e.g. NAOMI). Then, it proceeds in an
encoder-decoder manner, where an encoder RNN builds a hidden state
representation of this input sequence. A decoder RNN then sequentially outputs
<strong>refinements</strong> of the trajectory, before passing it on to the trajectory
prediction phase. To encourage this module to perform such refinements, a
simple mean-square error (MSE) loss objective is utilized, between the refined
history track (i.e., the output of the decoder) and the ground truth associated
history. The refined trajectory is used <strong>instead of</strong> the trajectory produced
directly from the detection and tracking modules, to train both the CoFE module
as well as the trajectory prediction model itself in an E2E fashion, along with
the original loss objective of the prediction model. </p>
<p>The full architecture is visualized in the figure below, with a deeper
discussion of each component explained in our paper.</p>
<figure style="text-align:center;">
<img src=./cofe.png width="800" alt="CoFE module architecture" title="CoFE module architecture">
<figcaption style="margin-top:10px">CoFE module architecture.</figcaption>
</figure>
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>We implemented several representative approaches for the ETH/UCY trajectory
prediction task, covering key techniques in human motion prediction:
variational prediction (VRNN [7]), social awareness (A-VRNN [8]), and goal
conditioning (SGNet [9]). For data imputation, we also incorporate three
relevant approaches, including the commonly-used linear interpolation,
exponential smoothing, and the aforementioned NAOMI deep learning method.</p>
<p>We utilized the standard leave-one-out evaluation methodology for ETH/UCY, where
one model is trained for each of the five folds and each imputation and
prediction approach combination. We trained one version of the prediction
approach with our CoFE module and objective, and one version without it.
Finally, we used the standard metrics in the trajectory prediction task of
Average and Final Displacement Errors (ADE and FDE). These measure the L2-distance
between the predicted future path and ground truth for the entire predicted
portion and just the last time step, respectively. The table below summarizes
our results, where the final column refers to the average of ADE / FDE
respectively over the five folds. As shown, all combinations of approaches
performed better with our CoFE module than without, by an average of more than
10%.</p>
<figure style="text-align:center;">
<img src=./results.png width="400" alt="Experiment results" title="Experiment results">
</figure>
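<p>The two metrics are straightforward to compute. In this sketch (with made-up trajectories), ADE averages the per-timestep L2 error while FDE keeps only the final timestep:</p>

```python
import math

# ADE / FDE on a single made-up prediction: ADE averages the L2 error
# over every future timestep; FDE keeps only the last one.
def ade_fde(pred, gt):
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, gt)
print(ade, fde)  # ADE = (0 + 1 + 2) / 3 = 1.0, FDE = 2.0
```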
<p>To gain further insight into the behavior of CoFE, we conducted an ablation
study and various qualitative analyses. In the ablation, we find that the E2E
training is essential for the improved performance, as is the effect of only
refining the missing, imputed data points rather than surrounding observed
points as well. We include an example of the qualitative analysis below:</p>
<figure style="text-align:center;">
<img src=./qualitative.png width="1000" alt="Qualitative example" title="Qualitative example">
<figcaption style="margin-top:10px">Qualitative example of CoFE effectiveness in improving predictions.</figcaption>
</figure>
<p>In this example, NAOMI by itself trusts surrounding points in the data too much,
performing a simple extrapolation. When paired with CoFE, the approach is more
effective at capturing underlying patterns in the data, correcting the FPV
errors and resulting in better prediction.</p>
<h2 id="future-work">Future Work</h2>
<p>While our T2FPV approach and CoFE module are effective, we note here some
potential avenues of future improvements. Although the simulation platform we
used, SEANavBench, is a high-fidelity environment, further effort in improving
its realism would be useful. Realism could be enhanced not just by increasing
the 3D-modeling asset and animation qualities, but also by further improving
alignment between the reproduced scenery and the original locations.
Additionally, for associating detection and tracking trajectories with their
corresponding GT tracks, we relied on Hungarian matching on our tracking output
directly, which incurred a number of identity association errors.
Incorporating recent works on affinity-based techniques for re-tracking
algorithms could be a promising way to help with this problem and even further
reduce FPV errors. One further thread of research is to apply these
techniques to related domains where FPV sensing is required, such as autonomous
driving. While this related field has its own challenges, considering imputation
and prediction together to account for sensing errors could be a promising
direction therein.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In existing work, pedestrian trajectory prediction has mainly been studied in an
unrealistic BEV perspective. In this work, we introduce a more realistic
first-person view trajectory prediction problem where agents need to make
predictions based on partial, imprecise information. We present T2FPV, a method
for generating high-fidelity FPV datasets for pedestrian navigation by
leveraging existing real-world trajectory datasets, and use it to create and
release an FPV version of ETH/UCY. We also propose and evaluate CoFE, a module
that successfully refines imputation of missing data in an end-to-end manner
with trajectory prediction algorithms to reduce FPV errors. Therefore, we argue
that incorporating more realism throughout the perception pipeline is an
important direction to move toward in enabling robots to navigate in the real
world. For more information, please see our paper [5].</p>
<h2 id="references">References</h2>
<p>[1] Pellegrini, Stefano, et al. “You’ll never walk alone: Modeling social behavior for multi-target tracking.” 2009 IEEE 12th international conference on computer vision. IEEE, 2009. <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/document/5459260">https://ieeexplore.ieee.org/document/5459260</a></p>
<p>[2] Tsoi, Nathan, et al. “An approach to deploy interactive robotic simulators on the web for hri experiments: Results in social robot navigation.” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2012.12336">https://arxiv.org/abs/2012.12336</a></p>
<p>[3] Park, Dennis, et al. “Is pseudo-lidar needed for monocular 3d object detection?.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2108.06417">https://arxiv.org/abs/2108.06417</a></p>
<p>[4] Chiu, Hsu-kuang, et al. “Probabilistic 3d multi-object tracking for autonomous driving.” arXiv preprint arXiv:2001.05673 (2020). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2001.05673">https://arxiv.org/abs/2001.05673</a></p>
<p>[5] Stoler, Benjamin, et al. “T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction.” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2209.11294">https://arxiv.org/abs/2209.11294</a></p>
<p>[6] Liu, Yukai, et al. “Naomi: Non-autoregressive multiresolution sequence imputation.” Advances in neural information processing systems 32 (2019). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/1901.10946">https://arxiv.org/abs/1901.10946</a></p>
<p>[7] Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems 28 (2015). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/1506.02216">https://arxiv.org/abs/1506.02216</a></p>
<p>[8] Bertugli, Alessia, et al. “AC-VRNN: Attentive Conditional-VRNN for multi-future trajectory prediction.” Computer Vision and Image Understanding 210 (2021): 103245. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.08307">https://arxiv.org/abs/2005.08307</a></p>
<p>[9] Wang, Chuhua, et al. “Stepwise goal-driven networks for trajectory prediction.” IEEE Robotics and Automation Letters 7.2 (2022): 2716-2723. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2103.14107">https://arxiv.org/abs/2103.14107</a></p>
<p>[10] Irfan, Bahar, et al. “Social psychology and human-robot interaction: An uneasy marriage.” Companion of the 2018 ACM/IEEE international conference on human-robot interaction. 2018. <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3173386.3173389">https://dl.acm.org/doi/abs/10.1145/3173386.3173389</a></p>
<p>[11] Kuhn, Harold W. “The Hungarian method for the assignment problem.” Naval research logistics quarterly 2.1‐2 (1955): 83-97. <a rel="noopener" target="_blank" href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109">https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109</a></p>
Boot: accelerating training data generation for self-driving database management systems
2024-04-24
https://www.cs.cmu.edu/~csd-phd-blog/2024/boot/
<h1 id="tl-dr">TL;DR</h1>
<p>Background:</p>
<ol>
<li>Optimizing a database management system (DBMS) is difficult; the best configuration
depends on time-varying factors like its environment and workload.</li>
<li>Researchers have developed machine learning (ML) models that outperform humans at these tasks.</li>
<li>However, the cost (e.g., training time, dollars) of obtaining these models makes them
impractical for establishing an end-to-end database tuning loop.</li>
<li>To build these ML models, the DBMS must gather training data by collecting telemetry as it
executes a workload.</li>
<li>ML techniques and model training keep getting faster and better. Workload execution has not
changed much, in comparison.
<ul>
<li>Historically, it took weeks to collect data and weeks to train models – both processes were
slow, so we just had to accept that models are hard to get.</li>
<li>Today, weeks to collect data and minutes to train models – workload execution is the main
bottleneck now!</li>
</ul>
</li>
</ol>
<p>Big idea:</p>
<ol>
<li>Training data collection is slow because the DBMS couples it to workload execution.</li>
<li>However, training data collection <strong>does not care about query results</strong>; it fundamentally
differs from workload execution.</li>
<li>Take shortcuts during workload execution for faster training data generation!
<ul>
<li>Workloads are repetitive - avoid executing similar queries twice.</li>
<li>Operators are repetitive - execute less of an operator if there is enough data.</li>
</ul>
</li>
<li>Obtain up to 225x speedups by eliminating repetition at the:
<ul>
<li>Inter-query level (macro-acceleration)</li>
<li>Intra-query level (micro-acceleration)</li>
<li>At a modest cost to model accuracy, but tuning algorithms are surprisingly robust.</li>
</ul>
</li>
</ol>
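<p>The macro-acceleration idea above (skipping queries that repeat an already-seen shape) can be sketched as template-based deduplication; the regex-based fingerprint below is a simplification for illustration, not Boot’s actual mechanism:</p>

```python
import re

# Toy fingerprinting: two queries share a template if they differ only
# in their literal constants (this regex is a simplification, not
# Boot's actual mechanism).
def template(sql):
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    return re.sub(r"\b\d+\b", "?", sql)  # numeric literals

workload = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT * FROM users WHERE id = 42",  # same template: skip it
    "SELECT name FROM users WHERE id = 1",
]
seen, executed = set(), []
for q in workload:
    t = template(q)
    if t not in seen:
        seen.add(t)
        executed.append(q)
print(executed)  # the second id-lookup is never executed
```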
<h1 id="introduction">Introduction</h1>
<blockquote>
<p>Database management systems are hard to configure by hand.<br />
Machine learning models perform better.</p>
</blockquote>
<p></p>
<p>Database management systems (DBMSs) are challenging to optimize correctly because their ideal
configuration depends on their workload, database contents, hardware, and run-time environment,
which all fluctuate over time.
To address this difficulty, researchers have designed methods for automated DBMS configuration, in
<a rel="noopener" target="_blank" href="https://ottertune.com/blog/history-ottertune-research-part1">one case</a> obtaining 20% more
performance than the most skilled human.</p>
<p>The unifying goal of such research is to develop a
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf"><em>self-driving</em> DBMS</a>
that configures, tunes, and optimizes itself automatically.
Given a target objective function (e.g., latency, throughput, cost), a self-driving DBMS aims to
find the best configuration for its objective autonomously.
It <a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2021/p3211-pavlo.pdf">achieves this</a>
by relying on machine learning (ML) models that predict its run-time behavior under
different configurations.
Such models allow the DBMS to evaluate whether a candidate configuration is beneficial without
executing the queries in its workload, which would be too expensive.
For example, the DBMS may need hours to complete a computationally intensive SQL query.
If the DBMS only needs the query’s run-time, it can achieve significant time savings by using an ML
model to predict the query’s latency instead of running it.</p>
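<p>As a toy illustration of this kind of model (every number and the single plan feature below are invented; real behavior models use many features and richer ML techniques), a one-feature least-squares fit can map a plan statistic to an estimated latency without executing the query:</p>

```python
# One-feature least-squares fit from a plan statistic (estimated rows
# scanned) to observed latency; every number here is invented.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rows = [1e3, 1e4, 1e5, 1e6]          # hypothetical telemetry
latency_ms = [2.0, 11.0, 101.0, 1001.0]
slope, intercept = fit_line(rows, latency_ms)
predicted = slope * 5e5 + intercept  # estimate without running the query
print(predicted)
```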
<p>To build its ML models, the DBMS collects
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2022/moddm074-butrovich.pdf"><em>training data</em></a>
comprised of database metadata (e.g., optimizer statistics) and run-time telemetry
(e.g., the latency of an operator in its query plan).
It generates this data by observing itself as it executes a
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2018/mod435-maA.pdf">representative workload</a>,
such as an application trace of SQL queries.
It then constructs its models by applying ML
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2021/ma-sigmod2021.pdf">techniques</a>.</p>
<blockquote>
<p>Obtaining training data for ML models is expensive.<br />
Especially when training data collection is not a one-time cost.</p>
</blockquote>
<p></p>
<p>However, the high cost of obtaining these models makes them impractical for real-world deployments.
In the past, the primary contributors to this cost were training data collection and model
building (<em>collection time</em> and <em>training time</em>, respectively).
But although ML techniques continue to improve and model construction becomes faster, training data
collection speeds have largely remained the same.
Today, collection time accounts for over 99% of the total (e.g., weeks to collect data,
minutes to train models).<br />
<strong>Problem 1: Training data is expensive to collect!</strong></p>
<p>Furthermore, unlike other ML domains that dismiss training data collection as a one-time cost
(e.g., LLM researchers share model weights because their training data does not change as much),
an autonomous DBMS needs training data
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2023/p27-lim.pdf">specific to its workload and configuration</a>.
Reusing models from other deployments is challenging because the database’s contents, hardware,
and system configuration influence its training data labels.
For example, the speed at which the DBMS’s sequential scan operator reads tuples from disk depends
on its hardware and configured buffer pool size.
Additionally, even if the DBMS already has models, they are often invalidated because of workload
drift, schema changes, dataset growth, and more.<br />
<strong>Problem 2: Training data is difficult to reuse!</strong></p>
<p>Consequently, an autonomous DBMS must regularly collect training data from scratch to maintain
its ML models.
Due to the high cost and frequency of training data generation, it often spends more time
collecting training data than improving its configuration.</p>
<h1 id="insight">Insight</h1>
<h2 id="decoupling">Decoupling</h2>
<blockquote>
<p>Training data generation is slow because it is coupled to regular query execution.<br />
This coupling is unnecessary, so we can obtain the same amount of training data much faster.</p>
</blockquote>
<p></p>
<p>Training data generation is slow because it builds off the DBMS’s regular query execution.
To execute a SQL query, the DBMS uses its internal statistics to search for a <em>query plan</em> that
efficiently computes the query’s answer.
A query plan is a tree composed of the DBMS’s <em>operators</em> (e.g., sequential scans read a table
from disk, hash joins combine the output of their children).
The DBMS executes the query plan by sending all <em>tuples</em> (i.e., rows of data) from the children
nodes to their parent nodes, which compute the query’s result.</p>
<p>But training data generation differs from regular query execution because it concerns telemetry
(e.g., timing information), not results (i.e., the tuples returned from a SQL query).
A DBMS <strong>does not need accurate query results</strong> during this process.
Therefore, it can accelerate its training data production if it can infer a workload’s telemetry
without executing the workload to completion.
This inference is possible because a DBMS’s workload is repetitive at the query and operator level.</p>
<h2 id="query-repetition">Query Repetition</h2>
<blockquote>
<p>For most workloads, the DBMS executes the same query template repeatedly.<br />
Many of those executions are redundant for training data generation.</p>
</blockquote>
<p></p>
<p>Most accesses to a DBMS are <a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2018/mod435-maA.pdf">programmatic</a>.
Software (e.g., dashboards, reporting tools, web applications) generates most SQL queries
from similar <em>query templates</em>, differing only in their input parameters.
A query template is a normalized SQL string with its constants removed. For example,</p>
<p> Query: <em>SELECT name FROM pets WHERE kind = ‘dog’ and name = ‘terrier’</em><br />
Template: <em>SELECT name FROM pets WHERE kind = ‘$1’ and name = ‘$2’</em></p>
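<p>As a rough illustration of this normalization (not Boot’s actual implementation), constants in a SQL string can be replaced with numbered placeholders:</p>

```python
import re

def fingerprint(sql: str) -> str:
    """Normalize a SQL string into a template by replacing string
    constants with numbered placeholders ($1, $2, ...)."""
    counter = 0
    def repl(_match):
        nonlocal counter
        counter += 1
        return f"'${counter}'"
    # Replace single-quoted string literals; a fuller implementation
    # would also handle numeric literals and IN-lists.
    return re.sub(r"'[^']*'", repl, sql)

query = "SELECT name FROM pets WHERE kind = 'dog' and name = 'terrier'"
print(fingerprint(query))
# SELECT name FROM pets WHERE kind = '$1' and name = '$2'
```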
<p>Given the same template, a DBMS often generates an identical query plan; some DBMSs even cache
query plans by their corresponding template.
That is, they reuse the same query plan for different instantiations of a query template, leading
to repetition in query execution.</p>
<table><thead><tr><th align="center">Workload</th><th align="center"># Queries</th><th align="center"># Templates</th><th align="center">Repetition</th></tr></thead><tbody>
<tr><td align="center">Admissions</td><td align="center">2564M</td><td align="center">4060</td><td align="center">627k X</td></tr>
<tr><td align="center">BusTracker</td><td align="center">1223M</td><td align="center">334</td><td align="center">3.66M X</td></tr>
<tr><td align="center">MOOC</td><td align="center">95M</td><td align="center">885</td><td align="center">107k X</td></tr>
<tr><td align="center">TPC-H</td><td align="center">20000</td><td align="center">22</td><td align="center">909 X</td></tr>
<tr><td align="center">DSB</td><td align="center">11440</td><td align="center">52</td><td align="center">220 X</td></tr>
<tr><td align="center">Stack</td><td align="center">5000</td><td align="center">25</td><td align="center">200 X</td></tr>
</tbody></table>
<p style="text-align: left;">
<b>Table 1, Query Repetition:</b>
<em>
The number of queries and query templates in the workloads used in recent ML approaches for
database tuning.
</em></p>
<p>We observe that both real-world application traces and synthetic benchmarks exhibit high
repetition – the DBMS executes a small number of query templates hundreds, thousands, or even
millions of times.
This repetition is necessary during regular query execution because the DBMS must return
accurate results.
However, during training data generation, eliminating this repetition is an opportunity to achieve
significant speedups.</p>
<p>What is needed is a way to determine <em>when</em> and <em>why</em> the DBMS should execute a query again so that
it can execute fewer queries during training data generation.
For the purpose of obtaining training data for its ML models,
it should skip all queries that do not contribute valuable training data
(i.e., exhibit new behavior).
The DBMS should only re-execute a query if it produces substantially different telemetry (e.g.,
run-time) from its past parameterizations.</p>
<h2 id="operator-repetition">Operator Repetition</h2>
<blockquote>
<p>Across all queries, the DBMS only uses a small number of operators.<br />
If a query is slow, skip the expensive operators – they already appear elsewhere.</p>
</blockquote>
<p></p>
<p>For every unskipped query that the DBMS must execute, there remain opportunities to identify
and eliminate redundancies in its query plan.
Recall that a query plan is composed of a DBMS’s operators.
However, because a typical DBMS has only a few dozen to a hundred operators, the repetition in
observing a particular operator’s behavior is even more frequent than that of entire queries in
Table 1.
Eliminating this repetition is especially important when the DBMS is exploring new configurations
that turn out to be bad, causing queries to take a long time to complete.
For example, a query that runs slowly because of missing indexes will remain slow for the rest of
its execution.
Yet the importance of that particular query to the training data corpus
is minimal because it spends most of its time performing disk reads in its highly predictable
sequential scan operators.
Hence, it is essential to reduce the time the DBMS spends executing operators after they
have become predictable.</p>
<p>The reason this is possible is that operators are independent of each other; a DBMS operator’s
behavior depends only on its input tuple(s).
Existing research relies on this independence to build ML models for the DBMS.
Our key insight is that integrating such modeling assumptions earlier into the training data
generation process enables early termination in query execution.</p>
<h2 id="aside-the-need-for-models">Aside: The Need for Models</h2>
<p>Exploiting repetition allows us to obtain cheap training data for the DBMS’s models,
but such techniques may not be suitable for replacing the models themselves
(e.g., directly running accelerated queries during database tuning).</p>
<p>We aim to build <em>bootstrap</em> models, which are fast and cheap but not necessarily as precise.
These models allow the DBMS to begin its tuning loop.
However, as that happens, the DBMS can spend more compute to build more complex models (e.g.,
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2403.02286">hierarchical models</a>) on the same training data.
The higher precision of these complex models can improve the quality of its tuning recommendations.</p>
<p>Additionally, most models only require query plans as an input.
They do not require the DBMS to be running on the same machine.
This allows the DBMS to deploy and parallelize its tuning algorithms across different machines
(e.g., with GPUs for faster inference) without paying the high cost of provisioning separate copies
of the DBMS’s hardware and data.</p>
<p>Because keeping modeling as a separate step in the tuning pipeline confers various benefits,
we limit our scope to collecting training data faster.</p>
<h1 id="solution">Solution</h1>
<p>To summarize the discussion above:</p>
<ul>
<li>The DBMS generates training data by executing queries.</li>
<li>Executing queries is a bottleneck for ML-based DBMS automation.</li>
<li>Because training data does not need to compute exact query results, the DBMS can skip or
accelerate query execution by exploiting repetition to synthesize training data.</li>
</ul>
<h2 id="architecture">Architecture</h2>
<p>Given this, we present the Boot framework to accelerate training data generation.
Boot is transparent to the DBMS’s upstream ML components and leverages workload repetition in two
ways to expedite training data collection while minimizing its impact on the accuracy of the ML
models.
First, Boot reduces the number of queries the DBMS executes by recognizing redundant queries
based on their high-level semantics, avoiding re-execution through reusing previously computed
training data (<em>macro-acceleration</em>).
Next, for the queries that the DBMS does execute, Boot modifies their run-time execution behavior
by injecting special operators into their query plans.
These operators (1) dynamically identify redundant computations and then (2) intelligently
short-circuit parts of the plan to expedite their completion (<em>micro-acceleration</em>).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/boot/architecture.png" alt="Figure 1: Architecture." /></p>
<p style="text-align: left;">
<b>Figure 1, Architecture:</b>
<em>
An overview of Boot's internal modules and execution flow.
The Macro-Accelerator (MA) decides whether to execute a query, and the
Micro-Accelerator (µA) accelerates the execution of a specific query.
</em></p>
<style>
.step {
border-radius: 50%;
width: 1.5em;
height: 1.5em;
background: #fff;
border: 2px solid #000000;
color: #FFFFFF !important;
background-color: #C41230;
pointer-events: none;
display: inline-block;
text-align: center;
}
</style>
<p>As Fig 1 shows, Boot integrates into a DBMS using two modules:</p>
<ol>
<li>The <strong>Macro-Accelerator</strong> (MA) sits between the DBMS’s network handler and planner,</li>
<li>The <strong>Micro-Accelerator</strong> (µA) embeds itself in the DBMS’s execution engine.</li>
</ol>
<p>Boot’s design does not modify the DBMS’s interface for training data generation.
Clients still connect to a Boot-enhanced DBMS over standard APIs (e.g., JDBC, ODBC) to execute
their workloads and collect training data.
This compatibility allows Boot to drop into existing modeling pipelines without any code change;
the only effect is that the DBMS produces training data faster.
However, because Boot alters the DBMS’s regular query execution semantics, it is
<a rel="noopener" target="_blank" href="https://www.cidrdb.org/cidr2023/papers/p27-lim.pdf">fundamentally unsuitable</a>
for production environments.
Therefore, we deploy Boot on an offline clone of the production DBMS to avoid application errors.</p>
<pre data-lang="SQL" style="background-color:#393939;color:#dedede;" class="language-SQL "><code class="language-SQL" data-lang="SQL"><span style="font-weight:bold;color:#b7b7b7;">SELECT</span><span style="font-weight:bold;color:#95bff3;"> nation, o_year, </span><span style="font-weight:bold;color:#fffd87;">SUM</span><span style="font-weight:bold;color:#95bff3;">(amount) </span><span style="font-weight:bold;color:#ececec;">as</span><span style="font-weight:bold;color:#95bff3;"> sum_profit
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> (
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">SELECT</span><span style="font-weight:bold;color:#95bff3;"> n_name </span><span style="font-weight:bold;color:#ececec;">as</span><span style="font-weight:bold;color:#95bff3;"> nation, EXTRACT(YEAR </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> o_orderdate) </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> o_year,
</span><span style="font-weight:bold;color:#95bff3;"> l_extendedprice*(</span><span style="font-weight:bold;color:#87d6d5;">1</span><span style="font-weight:bold;color:#ececec;">-</span><span style="font-weight:bold;color:#95bff3;">l_discount)</span><span style="font-weight:bold;color:#ececec;">-</span><span style="font-weight:bold;color:#95bff3;">ps_supplycost*l_quantity </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> amount
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> part, supplier, lineitem, partsupp, orders, nation
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">WHERE</span><span style="font-weight:bold;color:#95bff3;"> s_suppkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_suppkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> ps_suppkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_suppkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> ps_partkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_partkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> p_partkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_partkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> o_orderkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_orderkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> s_nationkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> n_nationkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> p_name </span><span style="font-weight:bold;color:#ececec;">LIKE </span><span style="font-weight:bold;color:#d6d6d680;">'</span><span style="font-weight:bold;color:#d68686;">%[COLOR]%</span><span style="font-weight:bold;color:#d6d6d680;">'
</span><span style="font-weight:bold;color:#95bff3;">) </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> profit </span><span style="font-weight:bold;color:#b7b7b7;">GROUP BY</span><span style="font-weight:bold;color:#95bff3;"> nation,o_year </span><span style="font-weight:bold;color:#b7b7b7;">ORDER BY</span><span style="font-weight:bold;color:#95bff3;"> nation, o_year </span><span style="font-weight:bold;color:#fed6af;">DESC</span><span style="font-weight:bold;color:#95bff3;">;
</span></code></pre>
<p style="text-align: center;">
<b>Listing 1:</b>
<em>TPC-H Q9.</em>
</p>
<p>We now provide an overview of Boot’s macro- and micro-accelerators using the TPC-H Q9 query shown
in Listing 1 as a running example. Executing 1000 iterations of Q9
(with different strings substituted for <code>[COLOR]</code>) at scale-factor (SF) 100 on
PostgreSQL (v15) takes 17 hours. Enabling Boot reduces the time required to 1 minute with minor
degradation in ML model accuracy. We omit technical details for brevity and present only the
high-level intuition in this blog post; we encourage interested readers to check out our paper.</p>
<h2 id="macro-accelerator">Macro-Accelerator</h2>
<p>Boot’s Macro-Accelerator (MA) module inspects each query request as it arrives to determine whether
it should be executed again (i.e., whether executing it would increase the diversity of the
training data gathered thus far).</p>
<blockquote>
<p>Macro-Accelerator = (1) Similarity + (2) Adaptivity<br />
(1) <em>Fingerprinting</em> identifies similar queries<br />
(2) <em>Exponential skipping</em> decides whether to re-execute queries</p>
</blockquote>
<p></p>
<p>Figure 1 shows that <a class="step">1</a> when a SQL query arrives,
the MA fingerprints it and checks whether it has executed the query before.
This fingerprint is computed on raw SQL strings because the MA is placed before query planning
to avoid the overhead of planning unnecessary queries.
A simple example of a fingerprint is for the MA to remove the SQL string’s constants, producing a
query template.
In the Q9 example above, the MA replaces the ‘%[COLOR]%’ in the string with a placeholder.
More complex schemes for fingerprinting may incorporate a query’s parametric behavior or the DBMS’s
current configuration to increase the quality of the training data, which we elaborate on in the
paper.</p>
<p>Next, <a class="step">2</a> the MA looks up the query’s fingerprint in its <em>result cache</em> to
determine whether the DBMS recently executed a similar query.
This cache maps each fingerprint to a record containing (1) the query’s output and (2) the
telemetry produced by the DBMS while executing the query.
The former is necessary because existing workload replay and benchmarking tools assume the
DBMS returns query results with a particular schema (e.g., JSON-formatted plans).
After performing the lookup,</p>
<ul>
<li>If the cache does not contain a matching fingerprint, the Boot framework forwards the request
for the DBMS’s processing as usual.</li>
<li>However, if the MA’s cache contains a match, it then decides whether to skip the query.
<ul>
<li>To skip the query, it returns the cached result and records that it skipped.</li>
<li>To execute the query, it again forwards the request to the DBMS for processing.</li>
</ul>
</li>
</ul>
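<p>The lookup-then-decide flow above can be sketched as follows; the helper names (<code>fingerprint</code>, <code>should_skip</code>, <code>execute</code>) are hypothetical stand-ins for Boot’s internals, not its actual API:</p>

```python
def handle_query(sql, cache, fingerprint, should_skip, execute):
    """Sketch of the MA's decision flow: replay a cached result when the
    skip policy allows it, otherwise forward the query to the DBMS."""
    fp = fingerprint(sql)
    entry = cache.get(fp)
    if entry is not None and should_skip(entry):
        entry["skips"] += 1
        # Skip: return the cached query result and telemetry unchanged.
        return entry["result"], entry["telemetry"]
    # Execute: forward the request to the DBMS and refresh the cache entry.
    result, telemetry = execute(sql)
    cache[fp] = {"result": result, "telemetry": telemetry, "skips": 0}
    return result, telemetry

# Toy usage: a fingerprint that strips everything after WHERE, a policy
# that always skips, and a fake executor.
cache = {}
fp = lambda s: s.split(" WHERE ")[0]
run = lambda s: (("row",), {"runtime_s": 1.0})
handle_query("SELECT 1 FROM t WHERE a = 1", cache, fp, lambda e: True, run)
r, t = handle_query("SELECT 1 FROM t WHERE a = 2", cache, fp, lambda e: True, run)
print(cache["SELECT 1 FROM t"]["skips"])  # 1: the second query was skipped
```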
<p>We now sketch a brief overview of the MA’s policies for skipping query re-execution, leaving
the technical details to our paper.
The high-level idea is similar to exponential backoff, adapted here as <em>exponential
skipping</em>.
Each time the DBMS executes a query, the MA analyzes its resulting telemetry to see whether the
run-time falls within two standard deviations of the fingerprint’s historical mean.</p>
<ul>
<li>If so, the latest query instance is considered similar. The MA exponentially increases the
number of times to skip this fingerprint until the subsequent execution.</li>
<li>Otherwise, the skipping algorithm resets. The MA clears the corresponding cache entry.</li>
</ul>
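<p>One way to implement the “within two standard deviations” check with constant state per fingerprint is Welford’s online algorithm; this is an illustrative sketch, not Boot’s exact bookkeeping:</p>

```python
import math

class RunningStats:
    """Track a fingerprint's run-time mean/stddev incrementally
    (Welford's algorithm), keeping only constant state per template."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def is_similar(self, x, k=2.0):
        """True if x falls within k standard deviations of the mean."""
        if self.n < 2:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) <= k * std

stats = RunningStats()
for runtime in [64.0, 65.0, 66.0, 65.0, 64.0]:
    stats.add(runtime)
print(stats.is_similar(65.5), stats.is_similar(120.0))  # True False
```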
<p>For example, suppose that Q9 always takes a median run-time of 65 seconds and that the MA skips
queries up to a threshold of 100 times.
The number of times that the MA skips Q9 in between executions is given by the sequence
<code>[1, 2, 4, 8, 16, 32, 64, 100, 100, ... ]</code>, dropping the time required for 1000 executions of Q9
from 17 hours to 16 minutes.
However, should a Q9 invocation exhibit new behavior, the skipping sequence is reset to sample
future instances more frequently.
The choice of run-time as a metric is an optimization to avoid storing and comparing against all
executed plans, allowing the MA to maintain only approximately 10 KB of state
(e.g., run-time, number of input rows per operator)
per query template.
As Table 1 shows, a workload typically contains up to a few thousand query templates, so the total
storage overhead of MA is only in the tens of MBs.</p>
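<p>The arithmetic behind this example can be checked with a small simulation of the skipping schedule (the loop below is a simplification of the MA’s actual policy, with no resets):</p>

```python
def skip_schedule(total_invocations, cap=100):
    """Simulate exponential skipping: after each real execution, skip
    the next `skip` invocations of the same fingerprint, doubling
    `skip` up to `cap`. Returns the number of real executions."""
    executed, skip, i = 0, 1, 0
    while i < total_invocations:
        executed += 1       # run this invocation
        i += 1 + skip       # then skip the next `skip` invocations
        skip = min(skip * 2, cap)
    return executed

# Only 16 of 1000 invocations actually execute; at ~65 s each, that is
# roughly a quarter hour of work instead of many hours.
print(skip_schedule(1000))  # 16
```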
<h2 id="micro-accelerator">Micro-Accelerator</h2>
<blockquote>
<p>Micro-Accelerator = (1) Modifying Tuple Flows + (2) Sampling + (3) Scaling<br />
(1) When the flow of tuples stabilizes, the operator is predictable<br />
(2) An operator’s output induces more work, so sample to control the work generated<br />
(3) When stopping an operator early, scale up its telemetry to prevent underestimation</p>
</blockquote>
<p></p>
<p>Each query that the MA module does not skip then goes to the DBMS’s query planner.
Boot’s Micro-Accelerator (µA) injects its special operators into the query plan at this stage.
These injected components wrap a plan’s operators so that the µA can continuously monitor their
run-time execution.
When the µA detects that an operator’s behavior has stabilized (i.e., more training data
from that operator is unnecessary), it sends a message to the corresponding wrapper to alter
the operator’s tuple processing behavior (e.g., reduce the amount of output produced).</p>
<p>The core idea that the µA’s injection exploits is that the DBMS performs query processing at the
granularity of its operators.
For example, the
<a rel="noopener" target="_blank" href="https://15445.courses.cs.cmu.edu/spring2023/notes/12-queryexecution1.pdf">Volcano model</a>
requires every operator to implement a <em>Next()</em> function that returns its next tuple.
Therefore, to support µA’s injection, the DBMS only needs to provide a way to wrap the tuple
production function (e.g., by replacing an operator’s <em>Next()</em> function pointer), either at a
source-code level or through
<a rel="noopener" target="_blank" href="https://archive.fosdem.org/2021/schedule/event/postgresql_extensibility/">extensibility hooks</a>.
Because most DBMSs have existing code paths that provide such functionality for
<a rel="noopener" target="_blank" href="https://www.postgresql.org/docs/current/sql-explain.html">instrumentation</a>,
implementing µA’s injection is relatively straightforward.</p>
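<p>Under the Volcano model, such a wrapper can be sketched as below; the class and the fixed tuple budget are illustrative assumptions, not the paper’s implementation:</p>

```python
class Scan:
    """Stand-in for a DBMS operator exposing a Volcano-style Next()."""
    def __init__(self, rows):
        self.rows = iter(rows)
    def next(self):
        return next(self.rows, None)  # None signals exhaustion

class Wrapper:
    """µA-style wrapper: forwards Next() to the wrapped operator and
    short-circuits it once a tuple budget is exhausted."""
    def __init__(self, op, max_tuples):
        self.op = op
        self.max_tuples = max_tuples
        self.emitted = 0
    def next(self):
        if self.emitted >= self.max_tuples:
            return None  # terminate early: stop creating work for parents
        tup = self.op.next()
        if tup is not None:
            self.emitted += 1
        return tup

# A parent operator pulling from the wrapped scan sees only the budgeted
# fraction of the 1000 underlying tuples.
wrapped = Wrapper(Scan(range(1000)), max_tuples=100)
count = 0
while wrapped.next() is not None:
    count += 1
print(count)  # 100
```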
<p>Figure 1 shows that <a class="step">3</a> the µA encapsulates each of the DBMS’s physical
plan operators (e.g., scans, joins) with a special wrapper operator to control its run-time
behavior dynamically.
For example, this wrapper can sample the wrapped operator’s output to emit only 10% of the tuples
that it would otherwise produce.
It can also terminate an operator’s execution early when the µA detects it should do so,
stopping the operator from creating additional work.</p>
<p>Since µA may short-circuit an operator’s execution, <a class="step">4</a> Boot scales each
operator’s telemetry to approximate the telemetry of its full execution.
For example, a scan operator that processes only 10% of its expected rows may have its measured
timings scaled ten-fold.
This scaling improves ML model accuracy as it guards against underpredicting query execution times.</p>
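<p>For instance, the scaling step might look like the linear extrapolation below (an assumption for illustration; real operator costs need not scale linearly):</p>

```python
def scale_telemetry(measured_seconds, rows_processed, rows_expected):
    """Extrapolate an early-terminated operator's measured time to
    approximate a full execution, guarding against underestimation."""
    if rows_processed == 0:
        return 0.0
    return measured_seconds * (rows_expected / rows_processed)

# A scan stopped after 10% of its expected rows has its timing scaled 10x.
print(scale_telemetry(1.2, 100_000, 1_000_000))  # 12.0
```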
<p>The key advantage of µA’s wrapper-based approach that directly modifies a physical query plan is
that the DBMS is guaranteed to generate the same plan with and without Boot enabled.
Some DBMSs alter the query plan when using other sampling techniques (e.g., SQL-level
<em>TABLESAMPLE</em>), producing plans that differ in shape and performance characteristics.
Because the DBMS is collecting training data to build models of its behavior, the training plans
must resemble an actual deployment’s plans as much as possible.</p>
<p>Lastly, after the DBMS executes the µA-wrapped plan, it forwards the query result and telemetry to
the MA module.
The MA module stores this information in its result cache for future invocations of similar
queries.
Boot’s MA and µA modules are independent: if either component is disabled, the DBMS will
process data with its regular non-accelerated components instead.
However, they enhance each other’s ability to accelerate workload execution.
In the paper, we show that the combined effect of both accelerators obtains up to 225x speedup.</p>
<h1 id="results">Results</h1>
<p>We highlight a few key results on a “standard” benchmarking setup for PostgreSQL (v15), that is:</p>
<ul>
<li>Benchmark 1: <a rel="noopener" target="_blank" href="https://www.tpc.org/tpch/">TPC-H</a>, scale factor 100
<ul>
<li>Represents a workload with uniform data distribution.</li>
</ul>
</li>
<li>Benchmark 2: <a rel="noopener" target="_blank" href="https://aka.ms/dsb/">DSB</a>, scale factor 10
<ul>
<li>Represents a workload with complex data distributions, join patterns, and skew.</li>
</ul>
</li>
<li>Modeling technique: <a rel="noopener" target="_blank" href="https://github.com/autogluon/autogluon">AutoGluon</a>
<ul>
<li>AutoGluon automatically searches over hyperparameters and techniques (e.g., gradient-boosted
trees, random forests, linear models, neural networks) that existing work uses to build models.</li>
<li>We found that AutoGluon is competitive with (and often outperforms) hand-crafted models from
prior work.</li>
</ul>
</li>
</ul>
<p>In this blog post, we compare four configurations:</p>
<ol>
<li>the default DBMS without Boot enabled (<strong>Original</strong>),</li>
<li>the DBMS with only Boot’s macro-accelerator active (<strong>MA</strong>),</li>
<li>the DBMS with only Boot’s micro-accelerator active (<strong>µA</strong>),</li>
<li>the DBMS with both of Boot’s accelerators active (<strong>Combined</strong>).</li>
</ol>
<p>The full paper contains other scenarios, sensitivity analyses, and comparisons to different
techniques.</p>
<h2 id="collection-time">Collection Time</h2>
<blockquote>
<p>Boot achieves up to 225x speedup on the examples in this blog post.<br />
In general, it reduces collection time from weeks to days.</p>
</blockquote>
<p></p>
<p>We first measure the <em>collection time</em> as previously defined; that is, the time the DBMS
takes to generate training data by executing the workload.</p>
<table style="pointer-events: none; box-shadow: none">
<tr>
<td style="border: none"><img src="./tpch_sf_100_runtime_dataset.png" alt="TPC-H Collection Time."></img></td>
<td style="border: none"><img src="./dsb_sf_10_runtime_dataset.png" alt="DSB Collection Time."></img></td>
</tr>
<tr>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 2a: TPC-H</p></td>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 2b: DSB</p></td>
</tr>
<tr>
<td style="border: none; padding: 0" colspan="2">
<b>Figure 2, Collection Time:</b>
<em>
The time to generate training data with different modules of Boot active (lower is better).
</em>
</td>
</tr>
</table>
<p>Figure 2 shows the following:</p>
<ul>
<li>MA obtains speedups of 2.26x on TPC-H and 3.14x on DSB.</li>
<li>µA obtains speedups of 2.22x on TPC-H and 23.2x on DSB.</li>
<li>Combined obtains speedups of 6.64x on TPC-H and 225x on DSB.</li>
</ul>
<p>We describe a broader range of speedups across different scale factors in the paper, with Boot
generally performing better as the dataset size grows.
The takeaway is that Boot reduces collection time from weeks to days, or from days to hours.
This acceleration is significant because it allows an autonomous DBMS to obtain its ML models in a
fraction of the time, minimizing the time it spends without autonomous capabilities.</p>
<p>We also observe from Figure 2 that:</p>
<ul>
<li>The MA and µA obtain different speedups depending on the workload’s complexity and skew.
We analyze speedup sources in the paper and only provide a summary here.
<ul>
<li>The MA executes exponentially fewer queries.
On a histogram of query runtimes, enabling the MA maintains the overall shape but decreases
each bar’s height significantly (i.e., fewer query invocations) because of exponential
skipping.</li>
<li>The µA executes the same number of queries but accelerates individual queries by reducing the
time spent in individual operators.
A bar chart of individual operator timings shows that the µA achieves up to 209x speedup on
physical operators such as index scans, sequential scans, and hash joins.
These speedups are from reducing the number of expensive disk operations and tuples processed.</li>
</ul>
</li>
<li>The MA and µA achieve higher speedups together than they do individually.
<ul>
<li>In the paper, we show that this effect is most pronounced on queries that are long-running or
have instances that time out.</li>
</ul>
</li>
</ul>
<h2 id="absolute-error">Absolute Error</h2>
<blockquote>
<p>Boot results in models with comparable mean absolute error.<br />
Tuning methods are surprisingly robust to high errors.</p>
</blockquote>
<p></p>
<p>We next evaluate the absolute error; that is, the absolute difference between a query’s actual
latency and the corresponding model’s predicted latency. We also measure the mean absolute error
(MAE). We visualize the absolute error across all predictions as a boxplot.</p>
<table style="pointer-events: none; box-shadow: none">
<tr>
<td style="border: none"><img src="./tpch_sf_100_ae_boxplot.png" alt="TPC-H Absolute Error."></img></td>
<td style="border: none"><img src="./dsb_sf_10_ae_boxplot.png" alt="DSB Absolute Error."></img></td>
</tr>
<tr>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 3a: TPC-H</p></td>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 3b: DSB</p></td>
</tr>
<tr>
<td style="border: none; padding: 0" colspan="2">
<b>Figure 3, Absolute Error:</b>
<em>
The absolute error of models that are trained on the individual datasets (lower is better).
The red circle shows sample mean and the whiskers extend to 1.5 interquartile range.
</em>
</td>
</tr>
</table>
<p>Figure 3 shows that compared to the MAE of the Original configuration’s models:</p>
<ul>
<li>MA’s MAE is 2.64x on TPC-H and 1.06x on DSB.</li>
<li>µA’s MAE is 11.0x on TPC-H and 0.999x on DSB.</li>
<li>Combined’s MAE is 11.5x on TPC-H and 1.11x on DSB.</li>
</ul>
<p>Intuitively, these results make sense.</p>
<ul>
<li>The MA introduces less error because it produces telemetry under similar conditions.
It only decides whether to execute a query and does not modify query execution itself.</li>
<li>The µA results in more error because it (1) terminates execution early and (2) scales the
telemetry.</li>
<li>The Combined configuration generally inherits both sources of error.</li>
</ul>
<p>To contextualize these results, we present a scenario from the paper in which we summed the
prediction error for all invocations of TPC-H’s Q1.</p>
<ul>
<li>Invoking Q1 for 1000 iterations took 3.7 hours.</li>
<li>The Original models are off by 308 seconds (5 minutes), whereas the Combined models are off by
2099 seconds (35 minutes).
<ul>
<li>The Combined models have 7x more error. However, they only required four days of
training data generation, whereas the Original models required three weeks.</li>
<li>The final prediction of the Combined models, 3.1 hours when the actual time was 3.7 hours,
remains accurate enough to be useful.</li>
</ul>
</li>
</ul>
<p>At this stage, a natural question is how bad the errors can get before the models are no
longer usable.
<a rel="noopener" target="_blank" href="https://www.vldb.org/pvldb/vol17/p823-zhao.pdf">Recent work</a>
shows that even models with terrible errors (e.g., 10x, 50x) have comparable
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/F-score">F1 scores</a> for the task of index recommendation.
We started our investigation into this line of research after making similar observations in
internal experiments; intuitively, for many tuning tasks, all that matters is getting the
“direction” of tuning right.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Researchers have developed effective ML models for database tuning, but integrating these models
into an end-to-end tuning loop is challenging because their construction requires an expensive and
lengthy training data generation process.
We introduce two acceleration techniques to expedite training data collection by leveraging the
unique characteristics of the training data environment; unlike regular query execution, there is
no need for the DBMS to compute accurate results.
We integrate these techniques into our framework Boot that drops into existing modeling pipelines.
Boot’s sharp reduction in training data collection time makes it well-suited for bootstrapping an
autonomous DBMS’s initial models, minimizing the time spent without self-tuning capabilities.</p>
<p>Our paper is under submission.</p>
Better streaming algorithms for Maximum Directed Cut via 'snapshots'2024-04-11T00:00:00+00:002024-04-11T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/streaming-csps/<p>\[
\gdef\bias{\mathrm{bias}}
\gdef\deg{\mathrm{deg}}
\gdef\indeg{\mathrm{indeg}}
\gdef\outdeg{\mathrm{outdeg}}
\gdef\Snap{\mathrm{Snap}}
\gdef\RSnap{\mathrm{RefSnap}}
\]</p>
<p>In this blog post, I’ll discuss a new algorithm based on two joint papers of mine with Raghuvansh Saxena, Madhu Sudan, and Santhoshini Velusamy (appearing in SODA’23 and FOCS’23). The goal of this algorithm is to “approximate” the value of a graph optimization problem called “maximum directed cut”, or <strong>Max-DICUT</strong> for short, and the algorithm operates in the so-called “streaming model”. After defining these terms, I will describe how we reduce the problem of approximating the <strong>Max-DICUT</strong> value of a directed graph to the problem of estimating a certain matrix, which we call the “snapshot”, associated to a directed graph; finally, I will present some ideas behind streaming algorithms for estimating these snapshot matrices.</p>
<h1 id="introduction">Introduction</h1>
<p>To start, we will define the particular algorithmic model (streaming algorithms) and computational problem (<strong>Max-DICUT</strong>) that we are interested in.</p>
<h2 id="streaming-algorithms">Streaming algorithms</h2>
<p>Motivated by applications to “big data”, in the last two decades, theoretical models of computing on massive inputs have been widely studied. In these models, the algorithm is given <em>limited</em>, <em>partial</em> access to some input object, and is required to produce an output fulfilling some guarantee related to that object. Some classes of models include:</p>
<ul>
<li><em>Property testing</em>, where an algorithm must decide whether a large object either has some property \(P\) or “really doesn’t have \(P\)”<sup class="footnote-reference"><a href="#ppty-tst">1</a></sup> given only a few queries to the object. Depending on the specific model, the algorithm may be able to choose these queries adaptively, or, more restrictively, the queries might just be randomly and independently sampled according to some distribution.</li>
<li><em>Online algorithms</em>, where an algorithm is forced to make progressive decisions about an object while it is revealed piece by piece.</li>
<li><em>Streaming algorithms</em>, where an algorithm is allowed to make a decision about an object after seeing it revealed progressively in a “stream”, but there is a limit on the amount of information that can be stored in memory.</li>
</ul>
<p>This blog post is concerned with streaming algorithms. In this setting, <em>memory space</em> is the most important limited resource. Sometimes, there are even algorithms that pass over a data stream of length \(n\) but maintain their internal state using only \(O(\log n)\) or even fewer bits of memory! One exciting aspect of the streaming model of computation is that space restrictions can often be studied mathematically from the standpoint of information theory, opening an avenue for proving impossibility results.<sup class="footnote-reference"><a href="#contrast">2</a></sup></p>
<p>Numerous algorithmic problems have been studied in the context of streaming algorithms. These include statistical problems, such as finding frequent elements in a list (so-called “heavy hitters”) or estimating properties of the distribution of element frequencies in lists (like so-called “frequency moments”), as well as questions about graphs, such as testing for connectivity, or computing the maximum matching size, where the stream consists of the list of edges. The common denominator between all these problems is that the “usual” algorithms might not be good streaming algorithms, whether because they require too much space or because they require “on-demand” access to the input data.</p>
<h2 id="constraint-satisfaction-problems">Constraint satisfaction problems</h2>
<p>Many “classical” computational problems can be recast into questions in the streaming model. Here, we are interested in one class of problems that have been particularly well-studied classically, namely <em>constraint satisfaction problems</em> (CSPs). These occur often in practice and include many problems one might encounter in introductory algorithms courses, such as <strong>Max-3SAT</strong>, <strong>Max-CUT</strong>, and <strong>Max-\(q\)Coloring</strong>.</p>
<p>CSPs are defined by variables and local constraints over a finite alphabet. More formally, a CSP is defined by:</p>
<ul>
<li>A finite set \(\Sigma\), called an <em>alphabet</em>; in the typical “Boolean” case, \(\Sigma=\{0,1\}\).</li>
<li>A number of <em>variables</em>, \(n\).</li>
<li>A number of local <em>constraints</em>, \(m\), and a list of constraints \(C_1,\ldots,C_m\). Each constraint \(C_j\) is defined by four objects:
<ol>
<li>A number \(k_j \geq 1 \in \mathbb{N}\), called the <em>arity</em>, that determines the number of variables \(C_j\) involves.</li>
<li>A choice of \(k_j\) distinct variables \(i_{j,1},\ldots,i_{j,k_j} \in \{1,\ldots,n\}\).</li>
<li>A <em>predicate</em> (or “goal” function) \(f_j : \Sigma^{k_j} \to \{0,1\}\) for those variables.</li>
<li>A <em>weight</em> \(w_j \geq 0\).</li>
</ol>
</li>
</ul>
<p>The CSP asks us to optimize over potential <em>assignments</em>, which are functions \(x : \{1,\ldots,n\} \to \Sigma\) mapping each variable to an element of \(\Sigma\). In particular, the objective is to maximize<sup class="footnote-reference"><a href="#max">3</a></sup> the number of “satisfied” (or “happy”, if you’d like) constraints, where a constraint \(C_j\) is “satisfied” if the alphabet symbols assigned by \(x\) on the variables \(i_{j,1},\ldots,i_{j,k_j}\) satisfy the predicate \(f_j\). The maximum number of constraints satisfied by any assignment is called the <em>value</em> of the CSP.</p>
<p>Some examples of CSPs are:</p>
<ul>
<li>In <strong>Max-CUT</strong> (a.k.a. “Maximum Cut”), the alphabet is Boolean (\(\Sigma = \{0,1\}\)), and all constraints are binary and use the same predicate: \(f(x,y) = x \oplus y\) (where \(\oplus\) denotes the Boolean XOR operation). I.e., if we apply a constraint to the variables \((i_1,i_2)\), then the constraint is satisfied iff \(x(i_1) \neq x(i_2)\). <strong>Max-\(q\)Coloring</strong> is similar, over a larger alphabet of size \(q\), with the predicate \(f(x,y)=1 \iff x \neq y\).</li>
<li>In <strong>Max-DICUT</strong> (a.k.a. “Maximum Directed Cut”), the alphabet is again Boolean, and the predicate is now \(f(x,y) = x \wedge \neg y\), so that a constraint \((i_1,i_2)\) is satisfied iff \(x(i_1) = 1 \wedge x(i_2) = 0\).</li>
<li>In <strong>Max-3SAT</strong>, the alphabet is also Boolean, all constraints are ternary, and the assorted predicates are all possible disjunctions on literals, such as \(f(x,y,z) = x \vee \neg y \vee z\) or \(f(x,y,z) = \neg x \vee \neg y \vee \neg z\).</li>
</ul>
<p>Both <strong>Max-CUT</strong> and <strong>Max-DICUT</strong> can be described interchangeably in the language of <em>graphs</em>, which might be more familiar. For <strong>Max-CUT</strong>, given an instance on \(n\) variables, we can form a corresponding undirected graph on \(n\) vertices, and add an edge \(i_1 \leftrightarrow i_2\) for each constraint \((i_1,i_2)\) in the instance (with the same weight). Now an assignment assigns each vertex to either \(0\) or \(1\), and an edge is satisfied iff its endpoints are on different sides of the cut. (We can think of an assignment as a “cut” which partitions the vertex-set into two sets: one side corresponding to the variables \(\{i : x(i)=0\}\) and one for \(\{i : x(i)=1\}\).) For <strong>Max-DICUT</strong>, because of the asymmetry, we have to create a <em>directed</em> graph. We add an edge \(i_1 \to i_2\) for each constraint \((i_1,i_2)\), and an edge \(i_1 \to i_2\) is satisfied iff \(i_1\) is assigned to \(1\) and \(i_2\) to \(0\). (Similarly, an assignment is an “ordered” partition of the vertex-set into two sets, i.e., we have a designated “source” set and “target” set and they are not interchangeable.)</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/csps-graphs.png" alt="A table with two columns, marked “CSPs” and “Graphs”, and then three rows, the first with “Max-CUT constraint \(x_3 \oplus x_7\)” and an undirected edge between \(3\) and \(7\), the second with “Max-DICUT constraint \(x_3 \wedge \overline{x_7}\)” and a directed edge from \(3\) to \(7\), and finally a row with “Boolean assignment \(x_1=0,x_2=0,x_3=1,x_4=0\)” and a “graph cut” where \(3\) is on one side and \(1,2,4\) on the other." />
<em>Figure.</em> A “dictionary” between the CSP and graph versions of <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Each variable becomes a vertex, each constraint becomes an edge (directed in <strong>Max-DICUT</strong>, undirected in <strong>Max-CUT</strong>), and a Boolean assignment \(x\) becomes a “cut” of the vertices in the graph.</p>
<p>Note that in these examples, the arity is a small constant (i.e., either \(2\) or \(3\)). What makes CSPs so interesting is that we can “build up” complicated global instances on arbitrarily many variables by applying predicates to “local” sets of a few variables at a time.</p>
<p>For various reasons, we are interested in studying the feasibility of <em>approximating</em> the values of CSPs (and not <em>exactly</em> determining this value). Firstly, the approximability of CSPs by “classical” (i.e., polynomial-time) algorithms is a subject of intense interest, stemming from connections to probabilistically checkable proofs and semidefinite programming. But the theory of classical CSP approximations relies on unproven assumptions like \(\mathbf{P} \neq \mathbf{NP}\). Space-bounded streaming algorithms generally seem very weak compared to polynomial-time algorithms, but this gives us the satisfaction of proving unconditional hardness results — and some CSPs still admit nontrivial streaming approximation algorithms. Secondly, it turns out that exact computation of CSP value is very hard in the streaming setting. Further, exact computation is hardest for <em>dense</em> instances, which is typical for many streaming problems, while approximation is, interestingly, hardest for <em>sparse</em> instances, i.e., for instances with \(O(n)\) constraints. This is because of the following well-known “sparsification lemma”, which reduces computing the value (approximately) for arbitrary instances to computing the value (approximately) for sparse instances:</p>
<p><strong>Lemma (sparsification, informal)</strong>. Let \(\Psi\) be an instance of a CSP with \(n\) variables and \(m\) constraints. Suppose we construct a new instance \(\Psi’\), also on \(n\) variables, but with \(m’ = \Theta(n)\) constraints, by randomly sampling constraints from \(\Psi\). Then with high probability, the values of \(\Psi\) and \(\Psi’\) will be roughly the same.</p>
<p>(To make this lemma formal: For \(\epsilon > 0\), if \(m’ = \Theta(n/\epsilon^2)\), then we get high probability of the values being within an additive \(\pm\epsilon\). In the unweighted case, “randomly sampling constraints” literally means that each constraint is randomly sampled from \(\Psi\)’s constraints. It is possible to generalize to the weighted case assuming the ratio of maximum to minimum weights is bounded.)</p>
<p>Because of this sparsification lemma, in the remainder of this post, we will assume for simplicity that all CSP instances on \(n\) variables have \(\Theta(n)\) constraints. (Note we are assuming they also have \(\Omega(n)\) constraints. The algorithms we describe below will also work for \(o(n)\) constraints, but this case can sometimes be messier.)</p>
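<p>As a quick sanity check on the lemma, here is a small toy experiment (our own illustrative setup, not from the papers): we brute-force the <strong>Max-DICUT</strong> value of a dense random digraph and of a \(\Theta(n)\)-constraint random sample of its edges, and observe that the two values are close.</p>

```python
import random
from itertools import product

def dicut_value(n, edges):
    """Fraction of (unit-weight) directed edges cut by the best assignment."""
    best = 0
    for x in product((0, 1), repeat=n):
        best = max(best, sum(1 for (u, v) in edges if x[u] == 1 and x[v] == 0))
    return best / len(edges)

random.seed(0)
n = 10
# A dense random digraph: each ordered pair is an edge with probability 1/2
dense = [(u, v) for u in range(n) for v in range(n)
         if u != v and random.random() < 0.5]
# "Sparsify": sample Theta(n) constraints uniformly with replacement
sparse = [random.choice(dense) for _ in range(10 * n)]
diff = abs(dicut_value(n, dense) - dicut_value(n, sparse))
print(diff)  # small: the two values agree up to a small additive error
```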
<h1 id="streaming-algorithms-meet-csps-max-cut-and-max-dicut">Streaming algorithms meet CSPs: <strong>Max-CUT</strong> and <strong>Max-DICUT</strong></h1>
<p>It is natural to ask whether streaming algorithms can use the <em>local</em> constraints defining an instance to deduce something about the quality of the best <em>global</em> assignment:</p>
<blockquote>
<p><em>Key question:</em> How much space does a streaming algorithm need to approximate the value of (the best global assignment to) a CSP given a pass over its list of local constraints?</p>
</blockquote>
<p> </p>
<p>This question was first posed at the 2011 Bertinoro workshop on sublinear algorithms (see <a rel="noopener" target="_blank" href="https://sublinear.info/index.php?title=Open_Problems:45">the <code>sublinear.info</code> wiki</a>). In this section, we examine this question through the lens of <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>, which are two of the simplest and most widely studied Boolean, binary CSPs.</p>
<h2 id="streaming-csps-and-max-cut">Streaming CSPs and Max-CUT</h2>
<p>For the rest of this blog post, we adopt the “graph” language for describing <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Thus, in the streaming setting, we are interested in algorithms for <strong>Max-CUT</strong> and <strong>Max-DICUT</strong> where the input is a stream of undirected edges (<strong>Max-CUT</strong>) or directed edges (<strong>Max-DICUT</strong>) from a graph, and the goal is to output an approximation to the value of the graph.</p>
<p>Now, we turn to some prior results about streaming algorithms for <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Recall that streaming algorithms are characterized by the amount of space they use. We will be interested in three “regimes” of space. We define these regimes using “\(O\)-tilde” notation: \(g(n) = \tilde{O}(f(n))\) if \(g(n) = O(f(n) \cdot \log^C n)\) for some constant \(C>0\). The regimes are as follows.</p>
<h3 id="large-space">Large space</h3>
<p>We use “large space” to refer to space between \(\Omega(n)\) and \(\tilde{O}(n)\). This space regime is sufficient to store entire input instances in memory! Thus, we can exactly calculate the value of instances once we see all their constraints, simply by enumerating all possible \(2^n\) global assignments. (Recall that the streaming model places no restrictions on the time usage of algorithms!)</p>
<p>Kapralov and Krachun (STOC’19) showed that for <strong>Max-CUT</strong>, this algorithm is the best possible: no algorithms using less-than-large space can get a \((1/2+\epsilon)\)-approximation for any \(\epsilon>0\). (\(1/2\)-approximation is “trivial” since every <strong>Max-CUT</strong> instance has value at least \(1/2\); indeed, a random assignment has expected value \(1/2\) in any instance.) However, the picture for <strong>Max-DICUT</strong> is much more complicated.</p>
<h3 id="medium-space">Medium space</h3>
<p>We use “medium space” to refer to space between \(\Omega(\sqrt n)\) and \(\tilde{O}(\sqrt n)\). This space regime is important because the “birthday paradox” phenomenon kicks in:</p>
<blockquote>
<p><em>Key fact:</em> Medium space is sufficient to store a set \(S\) of variables large enough that we expect that there are constraints involving at least two variables in \(S\).</p>
</blockquote>
<p> </p>
<p>Indeed, suppose we have an instance \(\Psi\) on \(n\) variables, we pick a random subset \(S \subseteq [n]\) of \(\Theta(\sqrt n)\) variables, and we look at all constraints which involve at least two variables in \(S\). Each constraint has this property with probability roughly \(\Theta((1/\sqrt n)^2) = \Theta(1/n)\), so by linearity of expectation, we expect roughly \(\Theta(1)\) constraints to have this property.</p>
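<p>This back-of-the-envelope calculation is easy to simulate (a toy setup of ours, with unit weights): on a random sparse digraph, a set \(S\) of \(\Theta(\sqrt n)\) vertices captures \(\Theta(1)\) edges on average.</p>

```python
import random

random.seed(1)
n = 10_000
m = 3 * n  # a sparse instance: Theta(n) constraints
edges = [(random.randrange(n), random.randrange(n)) for _ in range(m)]

k = int(n ** 0.5)  # |S| = Theta(sqrt(n))
trials, hits = 100, 0
for _ in range(trials):
    S = set(random.sample(range(n), k))
    hits += sum(1 for (u, v) in edges if u in S and v in S)
avg = hits / trials
print(avg)  # on average about m * (k/n)^2 = 3 edges land inside S
```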
<p>This key fact implied the breakdown of certain lower bound techniques for problems like <strong>Max-DICUT</strong> which worked in less-than-medium space, and it is also the starting point for unlocking improved approximation algorithms for <strong>Max-DICUT</strong> once medium space is available, as we’ll discuss below.</p>
<h3 id="small-space">Small space</h3>
<p>Finally, we use “small space” to refer to space which is \(\tilde{O}(1)\). Surprisingly, a result of Guruswami, Velingker, and Velusamy (APPROX’17, based out of CMU!) showed that even in small space, there <em>are</em> nontrivial algorithms for <strong>Max-DICUT</strong>. Chou, Golovnev, and Velusamy (FOCS’20) gave a variant of this algorithm with better approximation guarantees, which they also showed is optimal in less-than-medium space.<sup class="footnote-reference"><a href="#cgv-ratio">4</a></sup> These algorithms achieve nontrivial approximations in small space by using an important tool from the literature on streaming algorithms: the small-space streaming algorithm, from the seminal work of Indyk (2006), for estimating vector norms.</p>
<h2 id="max-dicut-and-bias"><strong>Max-DICUT</strong> and bias</h2>
<p>The work of Chou <em>et al.</em> left wide open the gap between medium and large space for approximating <strong>Max-DICUT</strong>. That is: Are there medium-space (or even less-than-large-space) algorithms which get better approximations than is possible in less-than-medium space? In the next section, I present our affirmative answer to this question, but first, I will introduce a further quantity we will need, which first showed up in this context in the work of Guruswami <em>et al.</em></p>
<p>Given an instance \(\Psi\) of <strong>Max-DICUT</strong> (a.k.a., a directed graph), and a vertex \(i \in \{1,\ldots,n\}\), let \(\outdeg_\Psi(i)\) denote the total weight of edges \(i \to i’\), \(\indeg_\Psi(i)\) the total weight of edges \(i’ \to i\), and \(\deg_\Psi(i) = \outdeg_\Psi(i) + \indeg_\Psi(i)\) the total weight of edges \(i_1\to i_2\) in which \(i \in \{i_1,i_2\}\). (These are called, respectively, the out-degree, in-degree, and total-degree of \(i\).) If \(\deg_\Psi(i) > 0\), then we define a scalar quantity called the <em>bias</em> of \(i\):
\[ \bias_\Psi(i) := \frac{\outdeg_\Psi(i) - \indeg_\Psi(i)}{\deg_\Psi(i)}. \] Note that \(-1 \leq \bias_\Psi(i) \leq +1\). The quantity \(\bias_\Psi(i)\) captures whether the edges incident to \(i\) are mostly outgoing (\(\bias_\Psi(i) \approx +1\)), mostly incoming (\(\bias_\Psi(i) \approx -1\)), or mixed (\(\bias_\Psi(i) \approx 0\)).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/vertices.png" alt="Three vertices, each incident to eight directed edges. The first vertex is labeled \(\approx +1\) and has mostly outgoing edges. The second vertex is labeled \(\approx 0\) and has a mix of outgoing and incoming edges. The third vertex is labeled \(\approx -1\) and has mostly incoming edges." />
<em>Figure.</em> Visual depictions of three vertices in a directed graph with biases close to \(+1,0,-1\), respectively. Green edges are outgoing and red edges are incoming.</p>
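<p>In code, the bias of every vertex can be read off directly from the edge list. A minimal sketch with unit weights (the helper name is ours):</p>

```python
from collections import defaultdict

def biases(n, edges):
    """bias(i) = (outdeg(i) - indeg(i)) / deg(i), for unit-weight edges."""
    outdeg, indeg = defaultdict(float), defaultdict(float)
    for (u, v) in edges:
        outdeg[u] += 1.0
        indeg[v] += 1.0
    return {i: (outdeg[i] - indeg[i]) / (outdeg[i] + indeg[i])
            for i in range(n) if outdeg[i] + indeg[i] > 0}

# Vertex 0 is a pure "source", 2 a pure "sink", and 1 is mixed
b = biases(3, [(0, 1), (0, 2), (1, 2)])
print(b)  # {0: 1.0, 1: 0.0, 2: -1.0}
```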
<p>This concept of bias, which relies crucially on the asymmetry of the predicate (and therefore has no analogue for <strong>Max-CUT</strong>), is the key to unlocking nontrivial streaming approximation algorithms for <strong>Max-DICUT</strong>. Observe that if e.g. \(\bias_\Psi(i) = -1\), then <em>all</em> edges incident to \(i\) are incoming, and therefore, the optimal assignment for \(\Psi\) should assign \(i\) to \(0\).<sup class="footnote-reference"><a href="#opt-asst">5</a></sup> Indeed, an instance is perfectly satisfiable iff all variables have bias either \(+1\) or \(-1\). What Guruswami <em>et al.</em> showed was that (i) this relationship is “robust”, in that instances with “many large-bias variables” have large value and vice versa, and (ii) whether an instance has “many large-bias variables” can be quantified using small-space streaming algorithms. Chou <em>et al.</em> gave an algorithm with better approximation ratios by strengthening the inequalities in (i).</p>
<p><strong>Remark:</strong> While we will not require this below, we mention that the notion of “many large-bias variables” is formalized by a quantity called the <em>total bias</em> of \(\Psi\), which is simply the sum over \(i\), weighted by \(\deg_\Psi(i)\), of \(|\bias_\Psi(i)|\). By definition, the total bias is equal to \(\sum_{i=1}^n |\outdeg_\Psi(i)-\indeg_\Psi(i)|\), which is simply the \(1\)-norm of the vector associated to \(\Psi\) whose \(i\)-th entry is \(\outdeg_\Psi(i)-\indeg_\Psi(i)\)! So the <strong>Max-DICUT</strong> algorithms of Guruswami <em>et al.</em> and Chou <em>et al.</em> use the small-space \(1\)-norm sketching algorithm of Indyk as a black-box subroutine to estimate the total bias of the input graph.</p>
<h1 id="improved-algorithms-from-snapshot-estimation">Improved algorithms from snapshot estimation</h1>
<p>Finally, we turn to the improved streaming algorithm for <strong>Max-DICUT</strong> from our recent papers in (SODA’23, FOCS’23). Our result is the following:</p>
<blockquote>
<p><strong>Theorem (Saxena, S., Sudan, Velusamy, FOCS’23).</strong> There is a medium-space streaming algorithm for <strong>Max-DICUT</strong> which achieves an approximation ratio \(\alpha\) strictly larger than the ratio \(\beta\) possible in less-than-medium space (and achievable in small space).</p>
</blockquote>
<p> </p>
<p>The various results on streaming approximations for <strong>Max-DICUT</strong> are collected in the following figure:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/ratios.png" alt="A 2D chart. The horizontal axis is labeled “exponent of \(n\)”, the vertical axis “approximation ratio”. There are green points at \((0,1/4)\) labeled “Trivial”, \((0,2/5)\) labeled “GVV’17”, \((0, 4/9)\) labeled “CGV’20”, and \((1,1)\) labeled “Sparsifier”. There are red points at \((1/2,4/9)\) labeled “CGV’20” and \((1,1/2)\) labeled “KK’19”. There is a blue point at \((1/2,0.483)\) labeled “SSSV’23”." />
<em>Figure.</em> A diagram of the known upper and lower bounds on streaming approximations for <strong>Max-DICUT</strong>. The exponents of \(0,1/2,1\) on the \(x\)-axis correspond to the small-, medium-, and large-space regimes; green dots are prior upper bounds, red dots are prior lower bounds, and the blue dot is our new upper bound. Of note, Chou, Golovnev, and Velusamy showed that \(4/9\)-approximations are achievable in small space and optimal in sub-medium space, while Kapralov and Krachun showed that \(1/2\)-approximations are optimal in sub-large space (where in fact arbitrarily good approximations are known). Our new algorithm gives a \(0.483\)-approximation, lying strictly between \(4/9\) and \(1/2\).</p>
<h2 id="the-snapshot-matrix">The snapshot matrix</h2>
<p>To present our algorithm, we first need to define a matrix, which we call the <em>snapshot</em>, associated to any directed graph \(\Psi\). This matrix has the property that a certain linear combination of its entries gives a good approximation to the <strong>Max-DICUT</strong> value of \(\Psi\) (a better approximation than is possible with a less-than-medium space streaming algorithm). Then, the goal of our algorithm becomes simply estimating the snapshot.</p>
<p>The snapshot matrix is simply the following. Recall that the interval \([-1,+1]\) is the space of possible biases of a variable in a <strong>Max-DICUT</strong> instance. Fix a partition \(I_1,\ldots,I_B\) of this interval into a finite number of subintervals. Given this partition, we can partition the (positive-degree) variables in \(\Psi\) into “bias classes”: Each vertex \(i \in \{1,\ldots,n\}\) has bias \(\bias_\Psi(i)\) falling into a unique interval \(I_b\) for some \(b \in \{1,\ldots,B\}\). Edges are also partitioned into bias classes: To an edge \(i_1 \to i_2\) in \(\Psi\) we associate class \((b_1,b_2) \in \{1,\ldots,B\} \times \{1,\ldots,B\}\), where \(b_1\) and \(b_2\) are respectively the classes of \(i_1\) and \(i_2\). The snapshot matrix, which we denote \(\Snap_\Psi \in \mathbb{R}_{\geq 0}^{B \times B}\), is simply the \(B \times B\) matrix which captures the weight of edges in each bias class, i.e., the \((b_1,b_2)\)-th entry is the total weight of edges \(i_1 \to i_2\) with \(\bias_\Psi(i_1) \in I_{b_1}\) and \(\bias_\Psi(i_2) \in I_{b_2}\).</p>
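<p>A minimal sketch of computing the snapshot from an edge list, assuming unit weights and an illustrative three-interval partition of \([-1,+1]\) (the partition, cutpoints, and helper names are ours):</p>

```python
from bisect import bisect_right
from collections import defaultdict

def snapshot(n, edges, cutpoints):
    """snap[b1][b2] = weight of edges whose endpoints fall in bias
    classes b1 and b2. `cutpoints` lists the interior endpoints of the
    partition of [-1, +1], e.g. [-1/3, 1/3] for three classes."""
    outdeg, indeg = defaultdict(float), defaultdict(float)
    for (u, v) in edges:
        outdeg[u] += 1.0
        indeg[v] += 1.0
    bias = {i: (outdeg[i] - indeg[i]) / (outdeg[i] + indeg[i])
            for i in range(n) if outdeg[i] + indeg[i] > 0}
    cls = lambda i: bisect_right(cutpoints, bias[i])  # index b of interval I_b
    B = len(cutpoints) + 1
    snap = [[0.0] * B for _ in range(B)]
    for (u, v) in edges:
        snap[cls(u)][cls(v)] += 1.0
    return snap

# Three bias classes: [-1,-1/3), [-1/3,1/3), [1/3,+1]
snap = snapshot(3, [(0, 1), (0, 2), (1, 2)], [-1 / 3, 1 / 3])
print(snap)  # [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
```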
<h2 id="aside-oblivious-algorithms">Aside: Oblivious algorithms</h2>
<p>At this point, we can “black-box” the notion of snapshot, since our algorithmic goal is now only to estimate the snapshot. However, to give intuition for the snapshot and show why it lets us achieve good approximations for <strong>Max-DICUT</strong>, we first take a detour into describing a simple class of “local” algorithms for <strong>Max-DICUT</strong>. These algorithms, called <em>oblivious algorithms</em>, were introduced by Feige and Jozeph (Algorithmica’17). Again, fix a partition of the space of possible biases \([-1,+1]\) into intervals \(I_1,\ldots,I_B\). For each interval \(I_b\), also fix a probability \(\pi_b\). Now an <em>oblivious algorithm</em> is one which, given an instance \(\Psi\), inspects each variable \(i\) independently and randomly sets it to \(1\) with probability \(\pi_b\), where \(b\) is the class of \(i\), and to \(0\) otherwise. These algorithms are “oblivious” in the sense that they ignore everything about each variable except its bias.</p>
<p>As discussed in the previous section, in <strong>Max-DICUT</strong>, if a variable has bias \(+1\), we always “might as well” assign it to \(1\), and if it has bias \(-1\), we “might as well” assign it to \(0\). Oblivious algorithms flesh out this connection by choosing how to assign <em>every</em> variable based on its bias. For instance, if a variable has bias \(+0.99\), we should still want to assign it to \(1\) (at least with large probability).</p>
<p>Feige and Jozeph showed that for a specific choice of the partition \((I_b)\) and probabilities \((\pi_b)\), the oblivious algorithm gives a good approximation to the overall <strong>Max-DICUT</strong> value. In particular, we observed that the ratio achieved by their oblivious algorithm is strictly better than what Chou <em>et al.</em> showed was possible with a less-than-medium space streaming algorithm. (In a paper of mine at APPROX’23, I generalized this definition and the corresponding algorithmic result to <strong>Max-\(k\)AND</strong> for all \(k \geq 2\).) Thus, to give improved streaming algorithms, it suffices to “simulate” oblivious algorithms (and in particular the oblivious algorithm of Feige and Jozeph).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/fj-sel.gif" alt="A step function, see caption for more details." />
<em>Figure.</em> The specific choice of bias partition \(I\) and probabilities \(\pi\) employed by Feige and Jozeph to achieve a \(0.483\)-approximation for <strong>Max-DICUT</strong>. Here, these two objects are presented together as a single step function, with bias on the horizontal axis and probability on the vertical axis. This choice deterministically rounds vertices with bias \(\geq +1/2\) to \(1\), \(\leq -1/2\) to \(0\), and it performs a (discretized version of a) linear interpolation between these extremes for vertices with bias closer to \(0\).</p>
<p>The key observation is then that to simulate an oblivious algorithm on an instance \(\Psi\), <em>it suffices to only know (or estimate) the snapshot of \(\Psi\)</em>. Indeed, every edge of class \(b_1, b_2\) is satisfied with probability \((\pi_{b_1})(1-\pi_{b_2})\) (the first factor is the probability that the first endpoint is assigned to \(1\), the second the probability that the second endpoint is assigned to \(0\), and these two events are independent). Thus, by linearity of expectation, the expected weight of the constraints satisfied by the oblivious algorithm is</p>
<p>\[ \mathbb{E}\left[\mathsf{Obl}(\Psi) \right] = \sum_{b_1,b_2 = 1}^B (\pi_{b_1})(1-\pi_{b_2}) \cdot \Snap_\Psi(b_1,b_2), \]
where the expectation is over the randomness of the oblivious algorithm’s assignment.</p>
<p>The upshot of this for us is that to estimate the value of an instance \(\Psi\), it suffices to calculate some linear function of this snapshot matrix \(\Snap_\Psi\). Another important consequence of this formula is that it allowed Feige and Jozeph to determine the approximation ratio of any oblivious algorithm using a linear program which minimizes the weight of constraints satisfied over all valid snapshots.<sup class="footnote-reference"><a href="#symmetry">6</a></sup> </p>
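<p>This formula is easy to verify numerically on a toy instance (the graph, class assignments, and probabilities below are illustrative choices of ours): summing \((\pi_{b_1})(1-\pi_{b_2})\) per edge, which is the same as weighting by the entries of the snapshot, matches the expectation computed by brute force over the rounding distribution.</p>

```python
from itertools import product

edges = [(0, 1), (0, 2), (1, 2)]  # vertex 0 is a source, 2 a sink, 1 mixed
pi = {0: 0.0, 1: 0.5, 2: 1.0}    # rounding probability for each bias class
cls = {0: 2, 1: 1, 2: 0}         # bias class of each vertex in this toy graph

# Formula: sum over edges of pi[b1] * (1 - pi[b2])
formula = sum(pi[cls[u]] * (1 - pi[cls[v]]) for (u, v) in edges)

# Direct expectation over the product distribution of the random rounding
direct = 0.0
for x in product((0, 1), repeat=3):
    p = 1.0
    for i in range(3):
        p *= pi[cls[i]] if x[i] == 1 else 1 - pi[cls[i]]
    direct += p * sum(1 for (u, v) in edges if x[u] == 1 and x[v] == 0)
print(formula, direct)  # both 2.0
```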
<h1 id="a-medium-space-algorithm-and-smoothing-the-snapshot">A medium-space algorithm and “smoothing” the snapshot</h1>
<p>At this point, our goal is to use streaming algorithms to estimate a linear function of the entries of the snapshot \(\Snap_\Psi\). To calculate this function up to a (normalized) \(\pm \epsilon\), it suffices to estimate each entry of the snapshot up to \(\pm \epsilon/B^2\). Since \(B\) is a constant, after reparametrizing \(\epsilon\) we seek an algorithm that estimates a given entry of the snapshot up to \(\pm \epsilon\) error.</p>
<p>Recall that the \((b_1,b_2)\)-th entry of the snapshot of \(\Psi\) is the weight of edges in \(\Psi\) with bias class \((b_1,b_2)\), i.e., the weight of edges from bias class \(b_1\) to bias class \(b_2\). To estimate this, we would ideally sample a random set \(E\) of \(T = O(1)\) edges in \(\Psi\), measure the biases of their endpoints, and then use the fraction of edges in the sample with bias class \((b_1,b_2)\) as an estimate for the total fraction of edges with this bias class. But it is not clear how to use a streaming algorithm to randomly sample a small set of edges and measure the biases of their endpoints simultaneously.<sup class="footnote-reference"><a href="#model">7</a></sup> Indeed, this cannot be possible in small space, since we know via Chou <em>et al.</em>’s lower bound that medium space is necessary for improved <strong>Max-DICUT</strong> approximations, and therefore for snapshot estimation! In this final section, we describe how we are able to estimate the snapshot using medium space.</p>
<h2 id="algorithm-for-bounded-degree-graphs">Algorithm for bounded-degree graphs</h2>
<p>First, suppose we were promised that in \(\Psi\), every vertex has degree at most \(D\), and \(D = O(1)\). An algorithm to estimate the \((b_1,b_2)\)-th entry of the snapshot of \(\Psi\) in this case is the following:</p>
<ol>
<li><em>Before the stream</em>, sample a set \(S \subseteq \{1,\ldots,n\}\) of \(k\) random vertices, where \(k\) is a parameter to be chosen later.</li>
<li><em>During the stream</em>, (i) store all edges whose endpoints are both in \(S\), and (ii) measure the bias of each vertex in \(S\).</li>
<li><em>After the stream</em>, take \(E\) to be the set of edges whose endpoints are both in \(S\). Observe that we know the biases of the endpoints of all edges in \(E\), and therefore the bias class of every edge in \(E\). Use the number of edges in \(E\) in bias class \((b_1,b_2)\) to estimate the total number of edges in \(\Psi\) in this bias class.</li>
</ol>
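<p>The three steps can be sketched as a single pass in Python (a toy implementation of ours with unit weights; since each edge survives into \(E\) with probability roughly \((k/n)^2\), the fraction of kept edges in a bias class estimates the corresponding fraction over all edges):</p>

```python
import random
from bisect import bisect_right

def estimate_class_fraction(n, stream, k, target_class, cutpoints):
    """One pass over an edge stream: sample S, record the degrees of
    vertices in S, and keep edges with both endpoints in S. The kept
    edges estimate the fraction of all edges in a given bias class."""
    S = set(random.sample(range(n), k))       # step 1: sample S up front
    kept, out_s, in_s = [], {}, {}
    for (u, v) in stream:                     # step 2: a single pass
        if u in S:
            out_s[u] = out_s.get(u, 0) + 1
        if v in S:
            in_s[v] = in_s.get(v, 0) + 1
        if u in S and v in S:
            kept.append((u, v))               # both endpoints in S
    def cls(i):                               # step 3: classify kept edges
        o, d = out_s.get(i, 0), out_s.get(i, 0) + in_s.get(i, 0)
        return bisect_right(cutpoints, (2 * o - d) / d)  # bias = (2o-d)/d
    if not kept:
        return 0.0
    return sum(1 for (u, v) in kept
               if (cls(u), cls(v)) == target_class) / len(kept)

random.seed(2)
n = 1000
stream = [(i, i + n // 2) for i in range(n // 2)]  # every edge: source -> sink
est = estimate_class_fraction(n, stream, k=200,
                              target_class=(2, 0), cutpoints=[-1 / 3, 1 / 3])
print(est)  # 1.0 (with this seed): every kept edge has bias class (2, 0)
```

<p>Note that the example graph has maximum degree \(1\), consistent with the bounded-degree promise above.</p>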
<p>Observe that the expected number of edges in \(E\) is \(\sim m (k/n)^2\) where \(m\) is the number of edges in \(\Psi\). If \(m = O(n)\), then \(|E| = \Omega(1)\) (in expectation) as long as \(k = \Omega(\sqrt n)\), which is precisely why this algorithm “kicks in” once we have medium space! <sup class="footnote-reference"><a href="#hash">8</a></sup> Once \(S\) is this large, we can indeed show that \(E\) suffices to estimate the snapshot. The proof of correctness of the estimate relies on <em>bounded dependence</em> of \(E\), by which we mean that in the collection of events \(\{e \in E\}_{e \in \Psi}\), each event is independent of all but \(O(1)\) other events. Indeed, observe that since \(\Psi\) has maximum degree \(D\), every edge in \(\Psi\) is incident to \(\leq 2(D-1)\) other edges. (Two edges are <em>incident</em> if they share at least one endpoint.) And for any two edges \(e, e’ \in \Psi\), the events “\(e \in E\)” and “\(e’ \in E\)” are <em>not</em> independent iff \(e\) and \(e’\) are incident.</p>
<h2 id="the-general-case">The general case</h2>
<p>General instances \(\Psi\) need not have bounded maximum degree. This poses a serious challenge for the bounded-degree algorithm we just presented. Consider the case where \(\Psi\) is a “star”, where each edge connects a designated center vertex \(i^*\) to one of the remaining vertices. In this situation, not every vertex is created equal. Indeed, if \(i^* \not\in S\) (which happens asymptotically almost surely), \(E\) will be empty, and therefore we learn nothing about \(\Psi\)’s snapshot.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/star.png" alt="A graph on \(9\) vertices with one high-degree central vertex, and three side vertices marked by a blob. The one edge within the blob is solid, while all other edges are dashed." />
<em>Figure.</em> An example graph with a highlighted subset of vertices \(S\) (green). Only edges with both endpoints in \(S\) are placed in \(E\) — in this case, there is only a single solid edge. All other edges are not in \(E\). There is a high-degree vertex (\(1\)) which we would ideally put in \(S\): since it is adjacent to so many other vertices, adding it to \(S\) would make \(E\) much larger.</p>
<p>To deal with this issue, the algorithm must become substantially more complex. We design the new algorithm to treat vertices of different degrees differently, giving “higher priority” to storing high-degree vertices, and it also captures more information than the above algorithm — in particular, it stores edges that have <em>one</em> endpoint in the “sampled set”, as opposed to both.</p>
<p>Our new algorithm aims to estimate a <em>more detailed</em> object than the snapshot itself, which we call the <em>refined snapshot</em> of \(\Psi\). To define this object, we also choose a partition into intervals \(J_1,\ldots,J_D\) of the space \([0,O(n)]\) of possible degrees. (We only need that each interval has ratio \(O(1)\) between the minimum and maximum degrees it contains. For simplicity, we pick the intervals to be powers of two: \([1,2), [2,4), [4,8),\ldots\).) This lets us define a unique <em>degree class</em> in \(\{1,\ldots,D\}\) for every vertex, and a corresponding degree class in \(\{1,\ldots,D\}^2\) for every edge. Now the refined snapshot is a four-dimensional array \(\RSnap_\Psi \in \mathbb{R}^{D^2 \times B^2}\), whose \((d_1,d_2,b_1,b_2)\)-th entry is the number of edges in \(\Psi\) with degree class \((d_1,d_2)\) and bias class \((b_1,b_2)\).</p>
<p>Now, how do we estimate entries of this refined snapshot, i.e., estimate the number of edges in \(\Psi\) with degree class \((d_1,d_2)\) and bias class \((b_1,b_2)\)? First, we sample a subset \(\Phi_1 \subseteq \Psi\) of \(\Psi\)’s edges, which I’ll call a <em>slice</em>, in the following way:</p>
<ol>
<li>Sample a set \(S_1\) of vertices by including each vertex in \(\{1,\ldots,n\}\) independently w.p. \(p_1\).</li>
<li>Sample a set \(H_1\) of edges in \(\Psi\) by including each edge in \(\Psi\) independently w.p. \(q_1\).</li>
<li>\(\Phi_1\) consists of edges in \(H_1\) with at least one vertex in \(S_1\).</li>
</ol>
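<p>The slice-sampling procedure itself is a single pass, and can be sketched as follows (hypothetical names; a real streaming implementation would decide membership in \(S_1\) and \(H_1\) with hash functions rather than storing random sets):</p>

```python
import random

def sample_slice(stream, n, p, q, seed=0):
    """One-pass slice sampling: S collects each vertex w.p. p, H collects
    each edge w.p. q, and the slice keeps the edges of H that touch S."""
    rng = random.Random(seed)
    S = {v for v in range(n) if rng.random() < p}   # step 1: vertex sample S
    slice_edges = []
    for (u, v) in stream:
        if rng.random() < q:                        # step 2: the edge is in H
            if u in S or v in S:                    # step 3: at least one end in S
                slice_edges.append((u, v))
    return S, slice_edges
```

<p>With \(p = q = 1\) the slice is the whole stream, which makes the code easy to test; the interesting regime is \(p q = \tilde{O}(1/\sqrt n)\), where the slice fits in medium space.</p>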
<p>Here, \(p_1\) and \(q_1\) are two parameters that depend only on the degree class \(d_1\). We claim that a streaming algorithm can sample a slice (this follows from the definitions), and we observe that this slice can be stored in medium space assuming that \(p_1 q_1 = \tilde{O}(1/\sqrt n)\), since \(\Psi\) has \(O(n)\) edges and therefore \(\Phi_1\) has \(O(p_1q_1n)\) edges in expectation. We repeat the above process to produce a second slice \(\Phi_2\), with corresponding parameters \(p_2,q_2\), and then use the slices \(\Phi_1,\Phi_2\) to calculate our estimate of the snapshot.</p>
<p>The choices of \(p_1,q_1,p_2,q_2\) are delicate. Taking \(p_1,q_1\) as an example, if the highest degree in class \(J_{d_1}\) is constant, then we pick \(p_1=\Theta(1/\sqrt n)\) and \(q_1 = 1\), and our algorithm recovers the bounded-degree algorithm above. But in general, \(q_1\) is chosen so that vertices in degree-class \(J_{d_1}\) have expected constant degree in \(H_1\), which allows us to recover similar “bounded dependence” behavior to the bounded-degree algorithm and therefore get concentration in the estimate.</p>
<p>But still, how does the algorithm use the slices \(\Phi_1,\Phi_2\) to estimate the snapshot entry? Let \(W_1\) denote the set of “target” vertices in \(\Psi\) which <em>actually</em> have bias class \(b_1\) and degree class \(d_1\). Similarly, define \(W_2\) as the “target” vertices in bias class \(b_2\) and degree class \(d_2\). The \((d_1,d_2,b_1,b_2)\)-th entry of the snapshot is then simply \(|\Psi \cap (W_1 \times W_2)|\). Let \(V_1 = W_1 \cap S_1\) and \(V_2 = W_2 \cap S_2\). Suppose that the algorithm, in addition to the slices \(\Phi_1,\Phi_2\), received \(V_1,V_2\) as its input. Now note that for any edge \(e = (v_1,v_2) \in \Psi \cap (W_1 \times W_2)\), the event “\(e \in \Phi_1 \cap (V_1 \times V_2) \)” has probability \(p_1 p_2 q_1\), since the events “\(v_1 \in S_1\)”, “\(v_2 \in S_2\)”, and “\(e \in H_1\)” are all independent. We could therefore hope to use \(|\Phi_1 \cap (V_1 \times V_2)|\) to estimate the snapshot entry;<sup class="footnote-reference"><a href="#counting">9</a></sup> indeed, (assuming that \(d_1 < d_2\)) this turns out to be true, and the proof goes by first conditioning on \(H_1\), and then arguing that given \(H_1\), degrees are sufficiently small to imply bounded dependence of which edges are in \(\Phi_1\) over the choice of \(S_1,S_2\).</p>
<p>But unfortunately, the algorithm does not get to see the actual sets \(V_1\) and \(V_2\). Instead, we have to employ certain “proxy” sets \(\hat{V}_1,\hat{V}_2\). To define these sets, observe that in the graph \(H_1\), for every vertex \(v \in \{1,\ldots,n\}\),
\[ \mathbb{E}_{H_1}[\deg_{H_1}(v)] = q_1 \cdot \deg_{\Psi}(v). \] Thus, by just looking at the slice \(\Phi_1\), we can estimate the degree of every vertex in \(S_1\). We can similarly estimate the bias, since
\[ \mathbb{E}_{H_1}[\bias_{H_1}(v)] = \bias_\Psi(v). \] So, given \(\Phi_1\) we can define a set \(\hat{V}_1 \subseteq \{1,\ldots,n\}\) of vertices in \(S_1\) which <em>appear to have</em> bias class \(b_1\) and degree class \(d_1\), based on their estimated degrees and biases in the slice. \(\hat{V}_1\) is an “estimate” for \(V_1\), and similarly we can define \(\hat{V}_2\) “estimating” \(V_2\) using the second slice \(\Phi_2\).</p>
<h3 id="smoothing-the-snapshot">Smoothing the snapshot</h3>
<p>There is an additional complication caused by using “estimated” sets \(\hat{V}_1,\hat{V}_2\) instead of the actual sets \(V_1,V_2\): It is not improbable for there to be “extra” or “missing” vertices in the estimated sets. Suppose, for instance, there is a vertex \(v\) which is in degree class \(d_1+1\), but whose degree is close to the lower limit of the interval \(J_{d_1+1}\). Then \(v\) is by definition not in \(V_1\), but depending on the randomness of \(H_1\), it could end up in \(\hat{V}_1\) with decent probability. This means we actually cannot estimate any particular entry of the refined snapshot with good probability!</p>
<p>To deal with this issue, we slightly modify the underlying problem we are trying to solve: Instead of aiming to directly estimate the refined snapshot, we aim to estimate a “smoothed” version of this snapshot, where the entries “overlap”, in that each entry captures edges whose bias and degree classes fall into certain “windows”. More precisely, for some window-size parameter \(w\), the \((d_1,d_2,b_1,b_2)\)-th entry captures the number of edges whose degree class is in \(\{d_1-w,\ldots,d_1+w\} \times \{d_2-w,\ldots,d_2+w\}\) and bias class is in \(\{b_1-w,\ldots,b_1+w\} \times \{b_2-w,\ldots,b_2+w\}\). Each particular edge will fall into many (\(\sim w^4\)) of these windows, meaning that any errors from mistakenly shifting a vertex into adjacent bias or degree classes are “averaged out” for sufficiently large \(w\). Finally, we show that estimating the “smoothed” snapshot is still sufficient to estimate the <strong>Max-DICUT</strong> value using a continuity argument, essentially because slightly perturbing vertices’ biases cannot modify the <strong>Max-DICUT</strong> value too much.</p>
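<p>The windowed counting is simple to state in code; a minimal sketch, assuming the class vector \((d_1,d_2,b_1,b_2)\) has already been computed for each edge:</p>

```python
def smoothed_entry(edge_classes, center, w):
    """Count edges whose class vector (d1, d2, b1, b2) lies within distance w
    of `center` in every coordinate, i.e. inside the window around `center`."""
    return sum(1 for cls in edge_classes
               if all(abs(c - t) <= w for c, t in zip(cls, center)))
```

<p>An edge whose classes sit just across a boundary still lands in most windows around its true position, which is the averaging effect the proof exploits.</p>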
<h1 id="finale">Conclusion</h1>
<p>Several interesting open questions remain after the above results on streaming algorithms for <strong>Max-DICUT</strong>. Firstly, it would be interesting to extend these results to other CSPs besides <strong>Max-DICUT</strong>. For instance, we know of analogues for oblivious algorithms for <strong>Max-\(k\)AND</strong> for all \(k \geq 2\), but whether there are snapshot estimation algorithms that “implement” these oblivious algorithms in less-than-large space is an open question. Also, there is a yawning gap between medium and large space. Proving any approximation <em>impossibility</em> result, or constructing better approximation algorithms, in the between-medium-and-large space regime would be very exciting. We do mention that the snapshot-based approach cannot give optimal (i.e., ratio-\(1/2\)) approximations for <strong>Max-DICUT</strong> because of another result of Feige and Jozeph, namely, a pair of graphs \(\Psi,\Phi\) which have the same snapshot, but the ratio of their <strong>Max-DICUT</strong> values is strictly less than \(1/2\).</p>
<h1 id="bibliography">Bibliography</h1>
<p>J. Boyland, M. Hwang, T. Prasad, N. Singer, and S. Velusamy, “On sketching approximations for symmetric Boolean CSPs,” in <em>Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques</em>, A. Chakrabarti and C. Swamy, Eds., in LIPIcs, vol. 245. Schloss Dagstuhl — Leibniz-Zentrum für Informatik, Jul. 2022, p. 38:1–38:23. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.38">10.4230/LIPIcs.APPROX/RANDOM.2022.38</a>.</p>
<p>C.-N. Chou, A. Golovnev, and S. Velusamy, “Optimal Streaming Approximations for all Boolean Max-2CSPs and Max-\(k\)SAT,” in <em>IEEE 61st Annual Symposium on Foundations of Computer Science</em>, IEEE Computer Society, Nov. 2020, pp. 330–341. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1109/FOCS46700.2020.00039">10.1109/FOCS46700.2020.00039</a>.</p>
<p>U. Feige and S. Jozeph, “Oblivious Algorithms for the Maximum Directed Cut Problem,” <em>Algorithmica</em>, vol. 71, no. 2, pp. 409–428, Feb. 2015, doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1007/s00453-013-9806-z">10.1007/s00453-013-9806-z</a>.</p>
<p>V. Guruswami, A. Velingker, and S. Velusamy, “Streaming Complexity of Approximating Max 2CSP and Max Acyclic Subgraph,” in <em>Approximation, randomization, and combinatorial optimization. Algorithms and techniques</em>, K. Jansen, J. D. P. Rolim, D. Williamson, and S. S. Vempala, Eds., in LIPIcs, vol. 81. Schloss Dagstuhl — Leibniz-Zentrum für Informatik, Aug. 2017, p. 8:1–8:19. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2017.8">10.4230/LIPIcs.APPROX-RANDOM.2017.8</a>.</p>
<p>P. Indyk, “Stable distributions, pseudorandom generators, embeddings, and data stream computation,” <em>J. ACM</em>, vol. 53, no. 3, pp. 307–323, May 2006, doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1145/1147954.1147955">10.1145/1147954.1147955</a></p>
<p>M. Kapralov, S. Khanna, and M. Sudan, “Streaming lower bounds for approximating MAX-CUT,” in <em>Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms</em>, Society for Industrial and Applied Mathematics, Jan. 2015, pp. 1263–1282. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1137/1.9781611973730.84">10.1137/1.9781611973730.84</a>.</p>
<p>M. Kapralov and D. Krachun, “An optimal space lower bound for approximating MAX-CUT,” in <em>Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing,</em> Association for Computing Machinery, Jun. 2019, pp. 277–288. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3313276.3316364">10.1145/3313276.3316364</a>.</p>
<p>N. G. Singer, “Oblivious algorithms for the Max-\(k\)AND problem,” in <em>Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques</em>, N. Megow and A. D. Smith, Eds., in LIPIcs, vol. 275. May 2023. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2023.15">10.4230/LIPIcs.APPROX/RANDOM.2023.15</a>.</p>
<p>R. R. Saxena, N. G. Singer, M. Sudan, and S. Velusamy, “Streaming complexity of CSPs with randomly ordered constraints,” in <em>Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms</em>, Jan. 2023. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1137/1.9781611977554.ch156">10.1137/1.9781611977554.ch156</a>.</p>
<p>R. R. Saxena, N. Singer, M. Sudan, and S. Velusamy, “Improved streaming algorithms for Maximum Directed Cut via smoothed snapshots,” in <em>IEEE 63rd Annual Symposium on Foundations of Computer Science</em>, IEEE Computing Society, 2023, pp. 855–870. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1109/FOCS57990.2023.00055">10.1109/FOCS57990.2023.00055</a>.</p>
<div class="footnote-definition" id="ppty-tst"><sup class="footnote-definition-label">1</sup>
<p>More precisely, this typically means that the object is “far from” the set of objects having \(P\) in some mathematical sense. For instance, if the objects are graphs and the property \(P\) is the graph property of bipartiteness, “really not having \(P\)” might mean that many edges in the graph must be added or deleted in order to get \(P\) to hold.</p>
</div>
<div class="footnote-definition" id="contrast"><sup class="footnote-definition-label">2</sup>
<p>This is in contrast to more traditional areas of theory, such as time complexity, where many impossibility results are “conditional” on conjectures like \(\mathbf{P} \neq \mathbf{NP}\).</p>
</div>
<div class="footnote-definition" id="max"><sup class="footnote-definition-label">3</sup>
<p>It is also interesting to study <em>minimization</em> versions of CSPs (i.e., trying to minimize the number of <em>unsatisfied</em> constraints), but that is out of scope for this post.</p>
</div>
<div class="footnote-definition" id="cgv-ratio"><sup class="footnote-definition-label">4</sup>
<p>Specifically, Chou <em>et al.</em> showed a sharp threshold in the space needed for \(4/9\)-approximations. The analysis of their algorithm was subsequently simplified in a joint work of mine with Boyland, Hwang, Prasad, and Velusamy (APPROX’22).</p>
</div>
<div class="footnote-definition" id="opt-asst"><sup class="footnote-definition-label">5</sup>
<p>More precisely, there exists an optimal assignment with this property.</p>
</div>
<div class="footnote-definition" id="symmetry"><sup class="footnote-definition-label">6</sup>
<p>This is an oversimplification: The goal is to minimize the <em>approximation ratio</em> (i.e., the value of the oblivious assignment over the value of the optimal assignment). However, Feige and Jozeph observe that under a symmetry assumption for \(\pi\), it suffices to only minimize over instances where (i) the (unnormalized) value of the instance is \(1\) and (ii) the all-\(1\)’s assignment is optimal. Given (i), the algorithm’s ratio on an instance is simply the (unnormalized) expected value of the assignment produced by the oblivious algorithm, and (i) and (ii) together can be implemented as an additional linear constraint in the LP.</p>
</div>
<div class="footnote-definition" id="model"><sup class="footnote-definition-label">7</sup>
<p>This task is easier in some “nonstandard” streaming models. Firstly, suppose we were guaranteed that the edges showed up in the stream in a <em>uniformly random order</em>. Then since the first \(T\) edges in the stream are a random sample of \(\Psi\)’s edges, we could simply use these edges for our set \(E\), and then record the biases of their endpoints over the remainder of the stream. Alternatively, suppose we were allowed <em>two passes</em> over the stream of edges. We could then use the first pass to sample \(T\) random edges \(E\), and use the second pass to measure the biases of their endpoints. Both of these algorithms use small space, since we are only sampling a constant number of edges.</p>
</div>
<div class="footnote-definition" id="hash"><sup class="footnote-definition-label">8</sup>
<p>To avoid having to sample \(S\) upfront and store it, it turns out to be instead sufficient to use a \(4\)-wise independent hash function.</p>
</div>
<div class="footnote-definition" id="counting"><sup class="footnote-definition-label">9</sup>
<p>It turns out to be important for the concentration bounds that we use the slice with <em>smaller</em> degree, e.g., if \(d_1 < d_2\) then we count edges in \(\Phi_1\). In this case, if we instead counted edges in \(\Phi_2\), the expectation would be \(O(p_1 p_2 q_2 m)\), which could be smaller than \(1\) if \(d_2\) is very large.</p>
</div>
<div class="footnote-definition" id="factor"><sup class="footnote-definition-label">10</sup>
<p>More precisely, for all \(\epsilon>0\) these algorithms output some value \(\hat{v}\) satisfying \(\hat{v} \in (1\pm\epsilon) \|\mathbf{v}\|_p\) with high probability, and use \(O(\log n/\epsilon^{O(1)})\) space.</p>
</div>
<div class="footnote-definition" id="hash2"><sup class="footnote-definition-label">11</sup>
<p>See <sup class="footnote-reference"><a href="#hash">8</a></sup>.</p>
</div>
Integrating Static and Data-Driven Resource Analyses for Programs2024-03-16T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/hybrid-resource-analysis/<p>Resource analysis of programs aims to infer their worst-case cost bounds. It has
a number of practical use cases. For example, when executing a client’s program
in cloud computing, a cloud-service provider (e.g., Amazon Web Services or
Microsoft Azure) would like to avoid both over-provisioning resources (which
would reduce profits) and under-provisioning resources (which would violate the
service-level agreement). So the provider would like to estimate the resource
usage of the client’s program in advance, thereby optimizing resource
allocation.</p>
<p>There are two approaches to resource analysis: <em>static analysis</em> and
<em>data-driven analysis</em>. Static analysis infers a cost bound by examining the
source code and reasoning about all theoretically possible behaviors of a
program, including its worst-case behavior. Data-driven analysis
first runs the program on many inputs and records the execution costs. It then
analyzes the cost measurements to infer a cost bound.</p>
<p>Static and data-driven analyses have complementary strengths and weaknesses.
Static analysis is <em>sound</em>: if it returns some candidate cost bound, it is
guaranteed to be a valid upper bound on the actual execution cost. However,
resource analysis for a Turing-complete language is generally undecidable.
Consequently, static analysis is <em>incomplete</em>: no matter how clever the static
analysis is, there always exists a program that the static analysis cannot
handle. In contrast to static analysis, data-driven analysis can infer a cost
bound for any program. However, because data-driven analysis cannot rigorously
reason about the program’s worst-case behavior, data-driven analysis offers no
soundness guarantees of its inferred cost bounds.</p>
<p>In this blog post, we describe how to integrate static and data-driven resource
analyses into <em>hybrid resource analysis</em>. By combining the two complementary
analysis techniques, hybrid resource analysis partially retains their respective
strengths while mitigating their weaknesses. We first introduce static resource
analysis, followed by data-driven resource analysis. We then describe hybrid
resource analysis and demonstrate its advantages over static and data-driven
analyses using an example of a linear-time selection algorithm.</p>
<h1 id="formulation-of-resource-analysis">Formulation of Resource Analysis</h1>
<p>Given a program \(P\), the goal of resource analysis is to infer its
worst-case cost bound. Concretely, it is a function \(\text{cost}_{P} (x)\)
parametric in an input \(x\) (or its size \(\lvert x \rvert\)) to the
program \(P\) such that, for any input \(x\), the value \(\text{cost}_{P}
(x)\) is a correct upper bound of the execution cost of \(P(x)\). The
execution cost of the program \(P\) is defined by a resource metric such as
running time, memory, or energy.</p>
<p>To specify a resource metric of interest, a user inserts an instruction <code>tick q</code>
throughout their code, where <code>q</code> is a (positive or negative) number. The
instruction indicates that <code>q</code> many resources are consumed. For example, if we
are interested in the amount of memory (in bits) consumed by a program, whenever
a 64-bit memory cell is allocated in the source code, we indicate it by
inserting an instruction <code>tick 64</code>.</p>
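<p>The <code>tick</code> construct belongs to the analyzed (functional) language, but its bookkeeping can be mimicked with an ordinary counter; a Python sketch, purely for intuition:</p>

```python
cost = 0  # global resource counter standing in for the tick semantics

def tick(q):
    # Consume q resource units (q may be negative, modeling freed resources).
    global cost
    cost += q

def alloc_cell(value):
    tick(64)          # a 64-bit memory cell costs 64 units under this metric
    return [value]

cells = [alloc_cell(i) for i in range(3)]
# After three allocations the recorded cost is 3 * 64 = 192 units.
```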
<h1 id="static-analysis">Static Analysis</h1>
<p>To automate resource analysis, <em>static resource analysis</em> automatically analyzes
the source code of a program. In this blog post, we focus on <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1926385.1926427">Automatic
Amortized Resource Analysis</a>
(AARA) as a concrete example of state-of-the-art static resource analysis.
Taking as input a functional program \(P\), AARA tries to automatically infer
a polynomial cost bound of the program \(P\).</p>
<h2 id="aara">AARA</h2>
<p>AARA builds on the potential method from <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/10.1137/0606031">amortized
analysis</a>. Every data structure
during program execution is equipped with <em>potential</em>, which we can consider as
fuel for computation. To perform computation, data structures must come with
enough potential to pay for the cost of computation. For example, if we are to
run an instruction <code>tick 64</code>, we must have at least 64 units of potential
available. The remaining potential can be later used to pay for subsequent
computation. In the potential method, our goal is to figure out an appropriate
amount of potential to store in data structures such that, whenever they go
through computation, they have enough potential to pay for it. The overall cost
is then bounded above by the initial potential minus the final potential stored
in the data structures. So the difference between the initial and final
potential serves as a cost bound.</p>
<p>AARA uses types to express the amount of potential stored in data structures.
For illustration, consider a variable \(x\) of the integer-list type \(
L(\mathtt{int}) \). If we want the list \(x\) to come with one unit of
potential per list element, we write
$$
x: L^{1} (\mathtt{int})
$$
where the superscript 1 indicates the amount of potential stored in each list
element. The type \(L^{1} (\mathtt{int})\) is called a <em>resource-annotated
type</em>, and the superscript 1 is called a <em>resource annotation</em>.</p>
<p>Let’s look at a slightly more complicated example. Consider function <code>split</code>
that splits an input list \(x\) into two equal halves. The function traverses
the input \(x\) and processes each list element. If the cost of processing
each element is one, the total computational cost is equal to the input list
length \( \lvert x \rvert\). We express the resource usage of the function
<code>split</code> by
$$
x : L^{1} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{0} (\mathtt{int}) \times L^{0} (\mathtt{int})
$$
On the left-hand side of the turnstile (i.e., \(\vdash\)), we have \(x :
L^{1} (\mathtt{int})\), which means the input \(x\) carries one unit of
potential per element. On the right-hand side of the turnstile, we have
\(\mathtt{split} \; x : L^{0} (\mathtt{int}) \times L^{0} (\mathtt{int})\),
which means the two output lists of function <code>split</code> each carry zero units of
potential per element. This assignment of potential makes sense. In the input
list \(x\), each element initially carries one unit of potential. This
potential is used to pay for the cost of function <code>split</code>, and the
remaining zero units of potential are stored in the two output lists after
splitting. The difference between the input potential (i.e., \(1 \cdot \lvert x
\rvert\)) and output potential (i.e., zero) immediately translates to the cost
bound \(\lvert x \rvert\) of function <code>split</code>.</p>
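<p>To make the accounting concrete, here is a runnable <code>split</code> with a unit tick per element (a Python sketch, not the functional-language original; the alternating split is one arbitrary way to produce two equal halves):</p>

```python
def split(x):
    """Split x into two (almost) equal halves, charging tick 1 per element."""
    cost = 0
    left, right = [], []
    for i, v in enumerate(x):
        cost += 1                      # tick 1: processing one list element
        (left if i % 2 == 0 else right).append(v)
    return (left, right), cost

(y1, y2), cost = split([1, 2, 3, 4, 5])
# cost == 5 == |x|: the input potential 1*|x| minus the zero output potential
# is exactly the cost bound from the typing L^1 -> L^0 x L^0.
```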
<p>Although the function <code>split</code> stores zero potential in the output, we will need
a positive amount of potential in the output if it later undergoes computation
that demands potential. In such a case, we can increase the potential stored in
the input and output of the function <code>split</code>. For example, another valid
assignment of potential is
$$
x : L^{3} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{2} (\mathtt{int}) \times L^{2} (\mathtt{int})
$$
where the input carries three units of potential per list element and the output
carries two units of potential per element, which can be used to pay for subsequent
computation.</p>
<p>Given a program \(P\), how do we infer its resource-annotated type? First, we
assign numeric variables to all inputs and outputs that appear in \(P\)’s
source code. These variables stand for (yet-to-be-determined) resource
annotations, which encode the amounts of potential. For example, in the program
\(\mathtt{split} \; x\), the input and output of the whole program are
assigned variables \(q_0, q_1, q_2 \in \mathbb{R}_{\geq 0}\):
$$
x : L^{q_0} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{q_1} (\mathtt{int}) \times L^{q_2} (\mathtt{int})
$$</p>
<p>We then walk through the source code, collecting linear constraints that relate
the variables assigned to the inputs and outputs. For example, if the program
\(P\) contains instruction <code>tick 64</code>, AARA’s type system imposes a linear
constraint that the input potential must be at least 64 plus the leftover
potential after running <code>tick 64</code>. In the example of \(\mathtt{split} \; x\),
we obtain two linear constraints
$$
q_1 + 1 \leq q_0 \qquad q_2 + 1 \leq q_0
$$</p>
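<p>For <code>split</code>, the constraint system is small enough to inspect by hand; the following sketch checks candidate annotations against the two collected constraints plus non-negativity:</p>

```python
def satisfies(q0, q1, q2):
    # Non-negativity plus the two constraints collected for `split x`:
    # the input potential must cover the unit cost plus each output's potential.
    return min(q0, q1, q2) >= 0 and q1 + 1 <= q0 and q2 + 1 <= q0

# Minimizing the input annotation q0 yields q0 = 1, q1 = q2 = 0, i.e. the
# tight bound |x|; the looser typing (3, 2, 2) from the text is also feasible.
assert satisfies(1, 0, 0) and satisfies(3, 2, 2)
assert not satisfies(0.5, 0, 0)
```

<p>In the real system the objective and constraints go to an off-the-shelf LP solver; this check only illustrates what any solution must satisfy.</p>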
<p>Finally, we solve the linear constraints using a linear-program (LP) solver. If
we obtain a solution, we can extract a resource-annotated type of the program
\(P\).</p>
<h2 id="incompleteness">Incompleteness</h2>
<p>The static resource analysis technique AARA is sound but incomplete. Soundness
means that, if AARA returns a candidate cost bound, it is guaranteed to be a
valid worst-case cost bound. However, AARA is incomplete: even if a program
\(P\) has a polynomial cost bound, AARA can fail to infer it. This happens
when the linear constraints collected during type inference are unsolvable. In
fact, this limitation is not unique to AARA. All static resource analysis
techniques suffer from incompleteness because resource analysis is undecidable in
general.</p>
<p>To illustrate AARA’s incompleteness, let us consider the median-of-medians-based
linear-time selection algorithm. Given an input list \(x\) and an input
integer \(i\), the selection algorithm returns the \(i\)-th smallest element
in the list \(x\).</p>
<p>In this algorithm, we split input list \(x\) into blocks of five elements and
compute each block’s median (e.g., by brute force). We then recursively call the
algorithm on these medians to compute their median \(m\). The median of
medians \(m\) (hence the name of the algorithm) is used to partition the input
list \(x\) into two lists, \(x_1\) and \(x_2\). To prove the linear time
complexity of this algorithm, we must show that the sublists \(x_1\) and
\(x_2\) each contain at most a \(7/10\) fraction of the elements of the list \(x\). Intuitively, the
median of medians \(m\) partitions the list \(x\) <em>evenly</em> (up to some
factor) even in the worst case.</p>
<p>However, AARA cannot reason about the mathematical properties of medians. As a
result, it cannot conclude that the median of medians \(m\) splits the list
\(x\) evenly. Instead, AARA deduces that, in the worst case, the list \(x\)
is split unevenly into a singleton list (i.e., containing only one element) and
the remaining sublist. If the list \(x\) were split this way, the worst-case
time complexity would be exponential. Hence, AARA is unable to infer a
polynomial cost bound for this algorithm even though it has a linear cost bound.
Furthermore, we are not aware of static analysis techniques that can correctly
infer linear cost bounds for the linear-time selection algorithm.</p>
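<p>For reference, the algorithm itself is short; a runnable Python sketch with an explicit cost counter (one unit per element touched at each recursion level) shows the behavior AARA cannot certify:</p>

```python
def select(x, i, counter):
    """Return the i-th smallest element of x (0-indexed), counting one
    cost unit per element touched at each level of recursion."""
    counter[0] += len(x)
    if len(x) <= 5:
        return sorted(x)[i]
    # Median of each block of five, then recurse for the median of medians.
    blocks = [x[j:j + 5] for j in range(0, len(x), 5)]
    medians = [sorted(b)[len(b) // 2] for b in blocks]
    m = select(medians, len(medians) // 2, counter)
    smaller = [v for v in x if v < m]
    larger = [v for v in x if v > m]
    equal = len(x) - len(smaller) - len(larger)
    if i < len(smaller):
        return select(smaller, i, counter)
    elif i < len(smaller) + equal:
        return m
    else:
        return select(larger, i - len(smaller) - equal, counter)

c = [0]
assert select(list(range(100, 0, -1)), 10, c) == 11   # 11th-smallest of 1..100
```

<p>The counter grows linearly in \(\lvert x \rvert\) because each recursive call discards a constant fraction of the list, but proving that requires exactly the \(7/10\)-partition argument that AARA cannot express.</p>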
<h1 id="data-driven-analysis">Data-Driven Analysis</h1>
<p>The second approach to automatic resource analysis is <em>data-driven resource
analysis</em>. It starts with collecting cost measurements of the program \(P\).
Given inputs \( x_1, \ldots, x_n\), we run \( P \; x_i \) for each \(1
\leq i \leq n\) and record its output \( y_{i} \) and execution cost \(c_i
\in \mathbb{R}_{\geq 0}\). This yields a runtime cost dataset \(\mathcal{D}\)
defined as
$$
\mathcal{D} \coloneqq \{ (x_i, y_i, c_i) \mid 1 \leq i \leq n \}
$$
The dataset \(\mathcal{D}\) records output \(y_i\) as well as input
\(x_i\) since we need to know output sizes to calculate a cost bound (i.e.,
the difference between the input and output potential). We then infer a cost
bound of the program \(P\) by analyzing the dataset \(\mathcal{D}\). We have
a variety of choices for data-analysis techniques, ranging from linear
regression to deep learning.</p>
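<p>The data-collection step can be sketched in a few lines of Python, using a toy instrumented <code>split</code> whose cost is its input length (all names here are hypothetical):</p>

```python
import random

def split_with_cost(x):
    # Toy instrumented program: unit cost per element processed.
    return (x[0::2], x[1::2]), len(x)

def collect_dataset(inputs):
    """Run the program on each input, recording (input, output, cost)."""
    data = []
    for x in inputs:
        y, c = split_with_cost(x)
        data.append((x, y, c))
    return data

rng = random.Random(0)
inputs = [[rng.randint(0, 9) for _ in range(n)] for n in (2, 5, 8)]
D = collect_dataset(inputs)
# The recorded costs are the input lengths: [2, 5, 8].
```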
<h2 id="bayesian-resource-analysis">Bayesian Resource Analysis</h2>
<p>This blog post introduces <em>Bayesian resource analysis</em>, where we apply <em>Bayesian
inference</em> to resource analysis. In abstract, the goal of Bayesian inference is
to infer <em>latent variables</em> \(\theta\) (i.e., variables that we want to know
but cannot observe) from <em>observed variables</em> \(D\) (i.e., variables whose
concrete values are available) using Bayes’ rule from probability theory. In
Bayesian resource analysis, latent variables \(\theta\) are resource
annotations (i.e., cost bounds), and observed variables \(D\) are the runtime
cost dataset of the program.</p>
<p>To conduct Bayesian resource analysis, the user first provides a <em>probabilistic
model</em>, which specifies a joint probability distribution \( p (\theta, D) \)
of resource annotations \(\theta\) and dataset \(D\). Next, by Bayes’ rule,
the <em>posterior distribution</em> of the cost bound \(\theta\) conditioned on a
concrete dataset \(\mathcal{D}\) is given by
$$
p (\theta \mid D = \mathcal{D})
= \frac{p (\theta, D = \mathcal{D})}{p (D = \mathcal{D})}
= \frac{p (\theta, D = \mathcal{D})}{\int p (\theta, D = \mathcal{D}) \, \mathrm{d} \theta}
$$
This equation suggests that we can compute the posterior distribution \(p
(\theta \mid D = \mathcal{D})\) by taking the ratio between the joint
distribution \(p (\theta, D = \mathcal{D})\) and the denominator \(\int p
(\theta, D = \mathcal{D}) \, \mathrm{d} \theta\). However, because the
denominator is an integral over the space of the resource annotations
\(\theta\), which may have many dimensions, it is often intractable to compute
the denominator. As a result, we cannot precisely compute the posterior
distribution \(p (\theta \mid D = \mathcal{D})\) by directly applying Bayes’
rule.</p>
<p>Instead, in practice, we run a sampling-based Bayesian inference algorithm,
drawing a large number of samples from the posterior distribution. We then use
these samples, which serve as an approximation of the posterior distribution, to
estimate various properties (e.g., mean, median, variance, etc.) of the
posterior distribution.</p>
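<p>A minimal random-walk Metropolis sampler illustrates the idea for a one-parameter bound \(c_{\text{predict}} = q_0 \lvert x \rvert\). This is a simplified sketch with a flat prior, not the authors’ inference pipeline; the likelihood is zero whenever an observed cost exceeds the predicted bound, so every accepted sample is a sound bound on the observed data:</p>

```python
import math
import random

def log_likelihood(q0, data, sigma=2.0):
    """Each cost c must lie below the predicted bound q0 * size; costs nearer
    the bound are more likely (a simplified truncated-normal model)."""
    total = 0.0
    for size, c in data:
        pred = q0 * size
        if c > pred:
            return -math.inf          # q0 would not be a sound bound
        total += -((c - pred) ** 2) / (2 * sigma ** 2)
    return total

def metropolis(data, steps=2000, seed=0):
    rng = random.Random(seed)
    q0, ll = 2.0, log_likelihood(2.0, data)
    samples = []
    for _ in range(steps):
        prop = q0 + rng.gauss(0, 0.1)         # random-walk proposal
        if prop >= 0:
            pll = log_likelihood(prop, data)
            if pll >= ll or rng.random() < math.exp(pll - ll):
                q0, ll = prop, pll            # accept the proposal
        samples.append(q0)
    return samples

data = [(n, n) for n in (4, 8, 16)]           # observed cost is exactly |x|
samples = metropolis(data)
# Every accepted state satisfies the bound, so min(samples) >= 1.
```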
<p>Figure 1 displays a schematic diagram of Bayesian resource analysis. We perform
Bayesian inference to infer a posterior distribution of cost bounds (blue lines)
from the runtime cost measurements (black dots).</p>
<figure>
<img src="./bayespc.jpg" alt="schematic diagram for Bayesian resource analysis" width="500"/>
<figcaption>
Figure 1. Schematic diagram of Bayesian resource analysis. We perform Bayesian
inference to infer a posterior distribution of cost bounds (blue lines) from the
runtime cost measurements (black dots).
</figcaption>
</figure>
<p>To illustrate Bayesian resource analysis, consider the function <code>split</code> that was
introduced <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#aara">earlier</a>. Its resource-annotated type has the form
$$
x : L^{q_0} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{q_1} (\mathtt{int}) \times L^{q_2} (\mathtt{int})
$$
where the resource annotations \(q_0, q_1, q_2 \in \mathbb{R}_{\geq 0}\) are
to be inferred by Bayesian inference.</p>
<p>Let a dataset \(\mathcal{D}\) of runtime cost
measurements be
$$
\mathcal{D} \coloneqq \{(x_i,(y_{i,1}, y_{i,2}),c_i) \mid 1 \leq i \leq n \}
$$
where \(x_i\) is an input list, \((y_{i,1}, y_{i,2})\) is a pair of
two output lists, and \(c_i\) is the cost of running \(\mathtt{split} \;
x_i\).</p>
<p>The user constructs a probabilistic model \(p (\theta, D)\) based on whatever
domain knowledge<sup class="footnote-reference"><a href="#domain_knowledge">1</a></sup> they have. For example, a probabilistic
model for <code>split</code> can be
$$
\begin{aligned}
q_0, q_1, q_2 & \sim \mathrm{Normal}_{[0, \infty)}(0, 5) \\
c_{i, \text{predict}} & = q_0 \lvert x_i \rvert - q_1 \lvert y_{i,1} \rvert - q_2 \lvert y_{i,2} \rvert & \qquad (i = 1, \ldots, n) \\
c_i & \sim \mathrm{Normal}_{[0, c_{i, \text{predict}}]}(c_{i, \text{predict}}, 2) & (i = 1, \ldots, n)
\end{aligned}
$$
The first line states that the resource annotations \(q_0, q_1, q_2\) follow a
normal distribution truncated to the non-negative region. In the second line,
the predicted costs \(c_{i, \text{predict}}\) are defined as \( q_0 \lvert
x_i \rvert - q_1 \lvert y_{i,1} \rvert - q_2 \lvert y_{i,2} \rvert \),
which is the difference between input and output potential. The third line
states that the observed costs \(c_i\) from the dataset \(\mathcal{D}\)
follow a normal distribution truncated to the interval \([0, c_{i,
\text{predict}}]\). The distribution is truncated because the prediction of a
<em>worst-case</em> cost bound must be larger than or equal to the observed cost.</p>
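To make this concrete, the model above can be sampled with a minimal, stdlib-only Metropolis sketch. Everything here is illustrative: the dataset assumes a hypothetical cost model of one unit of work per input element, and the step size, iteration count, and starting point are arbitrary choices; a real analysis would use a dedicated inference framework.

```python
import math
import random

random.seed(0)

# Hypothetical dataset for `split`: (|x_i|, |y_i1|, |y_i2|, observed cost c_i),
# assuming a cost of one unit of work per input element.
data = [(n, (n + 1) // 2, n // 2, float(n)) for n in (2, 5, 8, 11, 17, 23)]

def log_trunc_normal(x, mu, sigma, lo, hi):
    """Log density of Normal(mu, sigma) truncated to [lo, hi]."""
    if hi <= lo or not (lo <= x <= hi):
        return float("-inf")
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    mass = cdf((hi - mu) / sigma) - cdf((lo - mu) / sigma)
    if mass <= 0.0:
        return float("-inf")
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi)) - math.log(mass)

def log_posterior(q):
    q0, q1, q2 = q
    # Priors: Normal(0, 5) truncated to [0, infinity).
    lp = sum(log_trunc_normal(qi, 0.0, 5.0, 0.0, float("inf")) for qi in q)
    for n, m1, m2, c in data:
        c_pred = q0 * n - q1 * m1 - q2 * m2   # input minus output potential
        lp += log_trunc_normal(c, c_pred, 2.0, 0.0, c_pred)
    return lp

def metropolis(n_iter=2000, step=0.3):
    q = [3.0, 0.0, 0.0]        # feasible start: the bound 3|x| covers every c_i
    lp = log_posterior(q)
    samples = []
    for _ in range(n_iter):
        prop = [qi + random.gauss(0.0, step) for qi in q]
        lp_prop = log_posterior(prop)
        if math.log(random.random() + 1e-300) < lp_prop - lp:
            q, lp = prop, lp_prop
        samples.append(list(q))
    return samples

samples = metropolis()
```

Because infeasible proposals (negative annotations, or a predicted cost below an observed cost) get log density negative infinity, every retained sample is a valid bound for the observed data; summary statistics of the samples then serve as point estimates of the coefficients.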
<h2 id="unsoundness">Unsoundness</h2>
<p>Data-driven analysis can infer a cost bound for any program \(P\), provided
that it terminates. When we construct a dataset \(\mathcal{D}\) of the program
\(P\)’s runtime cost measurements, the program must terminate on all inputs;
otherwise, we will never finish collecting runtime cost measurements. Once we
finish data collection, we statistically infer a polynomial cost bound from the
dataset \(\mathcal{D}\). As the dataset \(\mathcal{D}\) is finite, for any
degree \(d \geq 0\), we always have some degree-\(d\) polynomial bound
\(\text{cost}_P\) that lies above all runtime cost measurements of
\(\mathcal{D}\). Therefore, data-driven analysis always returns an inference
result.</p>
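To see why an inference result always exists, here is a minimal sketch (with hypothetical measurements, and simplified to a single monomial rather than a full polynomial): scaling a degree-\(d\) term until it dominates every observation always succeeds on a finite dataset.

```python
def degree_d_bound(measurements, d):
    """Smallest coefficient a such that a * n**d >= c for every (n, c) observed."""
    return max(c / max(n, 1) ** d for n, c in measurements)

# Hypothetical cost measurements: (input size, observed cost).
obs = [(1, 3.0), (4, 10.0), (9, 21.0), (16, 40.0)]
a1 = degree_d_bound(obs, 1)   # linear bound a1 * n
a2 = degree_d_bound(obs, 2)   # quadratic bound a2 * n**2
```

Any degree works: the resulting curve lies above all measurements by construction, which is exactly why finite data alone cannot distinguish a valid worst-case bound from an unsound one.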
<p>However, data-driven analysis lacks the soundness guarantee<sup class="footnote-reference"><a href="#soundness">2</a></sup> of
inference results. Data-driven analysis examines runtime cost data of the input
program \(P\), rather than its source code. Consequently, data-driven analysis
cannot reason about the theoretically worst-case behavior of the program
\(P\), failing to provide a soundness guarantee.</p>
<p>For illustration, we ran Bayesian resource analysis on the
median-of-medians-based linear-time selection algorithm. To collect cost
measurements, we randomly generated input lists. Figure 2 plots the inferred
posterior distribution of cost bounds. Black dots are runtime cost measurements.
The light-blue shade is the 10-90th percentile range of the posterior
distribution, and the blue line is the median cost bound. The red line is the
true worst-case cost bound.</p>
<figure>
<img src="./posterior_distribution_BayesPC.jpg" alt="posterior distributions of Bayesian resource analysis for the linear-time selection algorithm" width="400"/>
<figcaption>
Figure 2. Inference result of Bayesian resource analysis for the linear-time
selection algorithm. Black dots are runtime cost measurements. The light-blue
shade is the 10-90th percentile range of the cost bounds sampled from the
posterior distribution, and the blue line is the median cost bound. The red line
is the true worst-case cost bound.
</figcaption>
</figure>
<p>Although the inferred cost bounds (light-blue shade) all lie above the observed
costs (black dots), they are unsound worst-case cost bounds since they lie below
the true worst-case cost bound (red line). The true cost bound (red line) is
significantly larger than observed costs (black dots) because, when inputs are
randomly generated, the worst-case behavior of the selection algorithm rarely
arises.</p>
<p>Certainly, we can fix this problem by adjusting the probabilistic model such
that it adds more buffer on top of the maximum observed costs in the dataset
\(\mathcal{D}\). However, it is difficult to tell a priori how much buffer we
should add. On the other hand, static analysis is better suited for reasoning
about the worst-case behavior than data-driven analysis. But if we perform
static analysis on the entire source code of the linear-time selection
algorithm, the analysis fails as described <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#incompleteness">earlier</a>. Can we
perform static analysis on a fragment of the source code and data-driven
analysis on the rest?</p>
<h1 id="hybrid-analysis">Hybrid Analysis</h1>
<p>To overcome the limitations of purely static analysis (e.g., conventional AARA)
and purely data-driven analysis (e.g., Bayesian resource analysis), we integrate
them into a framework called hybrid AARA.</p>
<h2 id="hybrid-aara">Hybrid AARA</h2>
<p>First, a user indicates which part of the source code should be analyzed by
data-driven analysis. For example, if we want expression <code>e</code> to be analyzed by
data-driven analysis, we enclose <code>e</code> with the annotation <code>statistics</code>, resulting
in <code>statistics(e)</code>. The rest of the source code will be analyzed by static
analysis. To construct a dataset \(\mathcal{D}\) of runtime cost measurements,
we run the program \(P\) on many inputs (to \(P\)) and record the inputs,
outputs, and costs of the expression <code>e</code> inside <code>statistics(e)</code>. Here, the input
to the expression <code>e</code> is its evaluation context (i.e., the values of free
variables appearing in <code>e</code>) during the program \(P\)’s execution.</p>
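A rough Python analogue of this data-collection step can be sketched as follows. The `statistics` decorator, the `partition` fragment, and the explicit cost counter are all illustrative inventions, not the actual hybrid AARA tooling; the point is only that the annotated fragment logs its (input, output, cost) triples while the surrounding program runs normally.

```python
# Global log of (input, output, cost) triples for the annotated fragment.
measurements = []

def statistics(f):
    """Sketch of a `statistics(e)`-style annotation: record each evaluation."""
    def wrapped(*args):
        cost = [0]                       # cost counter threaded through the call
        out = f(*args, cost=cost)
        measurements.append((args, out, cost[0]))
        return out
    return wrapped

@statistics
def partition(xs, pivot, cost):
    """Fragment chosen for data-driven analysis: one cost unit per element."""
    lo, hi = [], []
    for x in xs:
        cost[0] += 1
        (lo if x < pivot else hi).append(x)
    return lo, hi

# Running the whole program exercises the fragment in its real context,
# so the recorded inputs reflect the fragment's actual evaluation contexts.
for xs in ([3, 1, 4, 1, 5], [9, 2, 6]):
    partition(xs, 4)
```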
<p>The data-driven analysis of the expression <code>e</code> incorporates its contextual
information at runtime as the dataset \(\mathcal{D}\) captures this contextual
information. For example, suppose <code>statistics(e)</code> appears inside the if-branch
of a conditional expression <code>if ... then ... else ...</code>. If the if-branch
satisfies some invariant (e.g., inside the if-branch, variable <code>x</code> appearing
inside <code>e</code> is even), then all measurements recorded in the dataset
\(\mathcal{D}\) satisfy this invariant. Thus, data-driven analysis does not
analyze the expression <code>e</code> in isolation from its context.</p>
<p>Next, we infer a cost bound in hybrid AARA. Given a program \(P\) containing
<code>statistics(e)</code> for some expression <code>e</code>, we perform data-driven analysis on <code>e</code>
and static analysis on the rest of the source code, and then combine their
inference results. Just like we did for the cost-bound inference of
<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#aara">conventional AARA</a>, we assign variables to inputs and outputs throughout
the source code of the program \(P\), where these variables stand for
yet-to-be-inferred resource annotations. In hybrid AARA, however, we do not
assign variables <em>inside</em> the expression <code>e</code>. That is, the expression <code>e</code> is
treated as a black box whose source code is invisible. Let \(\theta_e\) be the
set of resource annotations in the input and output of the expression
<code>statistics(e)</code>. Also, let \(\theta \supseteq \theta_e\) be a set of all
variables in the program \(P\)’s entire source code.</p>
<p>A key challenge in hybrid AARA is the interface between conventional AARA and
Bayesian resource analysis. Suppose conventional AARA generates a set \(C\) of
linear constraints over the variables \(\theta\). Any solution to the linear
constraints \(C\) is a valid cost bound. Conventional AARA optimizes the
variables \(\theta\) subject to the linear constraints \(C\). This
optimization problem is solved by an LP solver. On the other hand, Bayesian
resource analysis infers a posterior distribution of the variables \(\theta_e
\subseteq \theta\) by running a sampling-based Bayesian inference algorithm.
Thus, conventional AARA and Bayesian resource analysis both involve the
variables \(\theta_e\) in common, but they each use different algorithms for
inference. How do we design their interface?</p>
<p>One idea is to restrict the state space of the sampling algorithm to the
feasible region of the linear constraints \(C\). We first construct a
probabilistic model over all variables \(\theta \supseteq \theta_e\), which
represent resource annotations in the program \(P\)’s source code. We then run
a sampling-based Bayesian inference algorithm over these variables, subject to
the linear constraints \(C\). Thanks to the constraints \(C\), any cost
bound drawn from the posterior distribution is a valid cost bound according to
conventional AARA.</p>
<p>To implement hybrid AARA, we rely on recent advances in the literature of
sampling algorithms. In 2021, a C++ library
<a rel="noopener" target="_blank" href="https://github.com/GeomScale/volesti">volesti</a> started to support
sampling from a user-specified probability distribution subject to arbitrary
linear constraints. This is the first (and so far, only) tool that supports such
sampling. Popular programming languages for Bayesian inference, such as
<a rel="noopener" target="_blank" href="https://mc-stan.org/">Stan</a>, only support box constraints (i.e., upper and
lower bounds) on random variables, but not arbitrary linear constraints that may
involve multiple variables.</p>
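To convey the flavor of constrained sampling, here is a minimal hit-and-run sketch restricted to a polytope {x : Ax <= b}. It targets only the uniform distribution over a toy two-dimensional region and is far simpler than what volesti provides, but it shows the key move: every step stays inside the feasible region of the linear constraints.

```python
import random

random.seed(1)

def hit_and_run(A, b, x0, n_samples=500):
    """Uniform sampling over the polytope {x : A @ x <= b} via hit-and-run."""
    x = list(x0)
    out = []
    for _ in range(n_samples):
        d = [random.gauss(0.0, 1.0) for _ in x]       # random direction
        t_lo, t_hi = float("-inf"), float("inf")
        # Intersect the line x + t*d with every half-space a . y <= b_i.
        for a, bi in zip(A, b):
            ad = sum(ai * di for ai, di in zip(a, d))
            slack = bi - sum(ai * xi for ai, xi in zip(a, x))
            if abs(ad) < 1e-12:
                continue
            if ad > 0.0:
                t_hi = min(t_hi, slack / ad)
            else:
                t_lo = max(t_lo, slack / ad)
        t = random.uniform(t_lo, t_hi)                # uniform on the chord
        x = [xi + t * di for xi, di in zip(x, d)]
        out.append(list(x))
    return out

# Toy feasible region for two annotations: q0 >= 0, q1 >= 0, q0 + q1 <= 1.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]
samples = hit_and_run(A, b, [0.25, 0.25])
```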
<h2 id="evaluation">Evaluation</h2>
<p>We ran hybrid AARA on the linear-time selection algorithm. Inside the selection
algorithm’s source code, the code fragment <code>partition x m</code>, which partitions
list <code>x</code> around the median of medians <code>m</code>, is analyzed by data-driven analysis.
The rest of the source code is analyzed by static analysis. Figure 3 displays
the inference results.</p>
<figure>
<img src="./posterior_distribution_hybrid_BayesPC.jpg" alt="posterior distributions of hybrid resource analysis for the linear-time selection algorithm" width="400"/>
<figcaption>
Figure 3. Inference results of hybrid AARA for the linear-time selection
algorithm. Black dots are runtime cost measurements. The light-blue shade is the
10-90th percentile range of the cost bounds sampled from the posterior
distribution, and the blue line is the median cost bound. The red line is the
true cost bound.
</figcaption>
</figure>
<p>In Figure 3, the 10-90th percentile ranges (light-blue shade) of inferred cost
bounds now contain or lie above the ground truth (red line). This is a
significant improvement over Bayesian resource analysis (Figure 2), where the
inferred cost bounds are below the true worst-case bounds.</p>
<p>More generally, we have evaluated hybrid AARA on a suite of seven benchmarks:
<code>MapAppend</code>, <code>Concat</code>, <code>InsertionSort2</code>, <code>QuickSort</code>, <code>QuickSelection</code>,
<code>MedianOfMedians</code> (this is the linear-time selection algorithm we have seen in
this blog post), and <code>ZAlgorithm</code>. Conventional AARA fails in all benchmarks as
they each contain a code fragment that cannot be analyzed statically by
conventional AARA. Further, in all benchmarks, hybrid AARA infers cost bounds
closer to the true worst-case cost bounds than purely data-driven resource
analysis. Thus, our evaluation demonstrates the benefits of hybrid resource
analysis: hybrid analysis infers more accurate worst-case cost bounds than
purely data-driven analysis, while overcoming the incompleteness of purely
static analysis. The details of the evaluation can be found in our paper <em>Robust
Resource Bounds with Static Analysis and Bayesian Inference</em>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Hybrid resource analysis combines purely static analysis, which offers soundness
guarantees of worst-case cost bounds but is incomplete, and purely data-driven
analysis, which is not sound but can infer a cost bound for any program. By
combining these two complementary analysis techniques, hybrid resource analysis
successfully infers cost bounds that neither purely static analysis nor purely
data-driven analysis can infer. This is demonstrated by the experiment results
of the linear-time selection algorithm.</p>
<p>Hybrid AARA has a limitation that its data-driven analysis only infers resource
annotations, but not other quantities (e.g., depth of recursion). As a result,
hybrid AARA cannot handle some programs such as bubble sort. In bubble sort,
conventional AARA cannot infer the number of recursive calls, but it can still
correctly infer a cost bound of each recursive call. Therefore, ideally, we
would like to infer (i) the number of recursive calls by data-driven analysis
and (ii) the cost of each recursive call by conventional AARA. This requires a
different hybrid analysis technique from hybrid AARA presented in this blog
post, and we plan to investigate it as future work.</p>
<p>Acknowledgement: hybrid AARA is joint work with <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Efsaad/">Feras
Saad</a> and <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Ejanh/">Jan
Hoffmann</a>. Our paper <em>Robust Resource Bounds with
Static Analysis and Bayesian Inference</em> is currently under review, and hopefully
we can share it soon.</p>
<p>Further reading: the <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/640128.604148">original paper on
AARA</a> targets linear cost
bounds. Subsequently, AARA has been extended to <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1007/978-3-642-11957-6_16">univariate polynomial cost
bounds</a> and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1926385.1926427">multivariate
polynomial cost bounds</a>.
<a rel="noopener" target="_blank" href="https://www.raml.co/">Resource-aware ML</a> is an implementation of AARA for
analyzing OCaml programs. Papers on data-driven resource analysis include the
<a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1287624.1287681">trend profiler</a>, <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2254064.2254074">algorithmic
profiler</a>, and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2254064.2254076">input-sensitive
profiler</a>. They are all
concerned with average-case cost bounds, rather than worst-case cost bounds as
in this blog post. Also, these papers all use optimization instead of Bayesian
inference.</p>
<div class="footnote-definition" id="domain_knowledge"><sup class="footnote-definition-label">1</sup>
<p>In contrast to static analysis, data-driven analysis
(including Bayesian resource analysis) always requires the user’s domain
knowledge to construct a statistical model. This dependency on domain
knowledge is inherent in statistics: the inference result depends on our
choice of data-analysis methodologies (e.g., optimization or Bayesian
inference), statistical models, and hyperparameters.</p>
</div>
<div class="footnote-definition" id="soundness"><sup class="footnote-definition-label">2</sup>
<p>Here, by the lack of soundness guarantee, we mean that the cost
bound inferred by data-driven analysis is not guaranteed to be a valid
worst-case cost bound for all possible inputs to the program. Nonetheless,
data-driven analysis is sound with respect to the probabilistic model and
dataset used, as long as we use a correct inference algorithm.</p>
</div>
Baleen: ML admission & prefetching for flash caches2024-01-12T00:00:00+00:002024-01-12T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/baleen-ml-flash-caching/
<h1 id="introduction">Introduction</h1>
<p>Large-scale storage is still dominated by hard disks (HDDs) as they are cost-effective. However, HDDs are limited to ~100 IOs per second. Thus, modern storage systems in datacenters widely rely on flash caches to absorb backend load and reduce the number of HDDs required to satisfy the IO workload.</p>
<p>While flash has orders of magnitude higher IOPS, it suffers from wearout as it is written to.
Flash drive lifetime projections assume relatively low average write rates such as “three drive-writes per day (DWPD)”, meaning 3N TB of writes to an N TB flash drive each day. Flash drives with even lower write endurance (e.g., 1 DWPD) are priced correspondingly lower. Given that traditional cache management policies designed for dynamic random-access memory (DRAM) can incur writes exceeding 100 DWPD, there is a need for smart flash admission policies to filter the right items to be written into cache.</p>
<p>Machine learning (ML) policies have been proposed to improve upon historically popular policies, which include random admission and history-based policies that reject items without sufficient recent usage. However, caching is a challenging problem for ML to get right<a rel="noopener" target="_blank" href="https://pdl.cmu.edu/PDL-FTP/BigLearning/2018MachineLearningCDNcache_HOTNETS.pdf">[3]</a>. Furthermore, systems practitioners desire that policies also be understandable in addition to being performant<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a>.</p>
<p>We decompose the flash caching problem into admission, prefetching, and eviction. This helps us align policy decisions to well-understood supervised ML techniques. We also co-design these components, as we show that a policy can have synergistic or antagonistic effects on other parts of the system.</p>
<p>The Baleen flash cache exploits a new cache residency model (which we call episodes) to improve ML training effectiveness. The episodes model also enables a new useful comparison point (OPT). Baleen focuses on optimizing for an end-to-end metric (HDD disk-head time) that balances IOPS and bandwidth, rather than hit rate. We find that a combination of ML-guided admission and ML-guided prefetching works best in reducing peak backend load.</p>
<p>Baleen reduces HDD peak load by 11.8% over state-of-the-art policies on seven recent real-world storage cluster traces collected over 3 years. This work is under submission.</p>
<h1 id="background-bulk-storage-systems">Background: Bulk storage systems</h1>
<p>Bulk storage systems are relied on by hyperscalers to aggregate persistent
storage needs in data centers including blob storage and data warehouses
(such as HDFS<a rel="noopener" target="_blank" href="https://hadoop.apache.org">[7]</a>). Users might not
even know they are using one, as such systems function quietly behind the scenes
at cloud computing platforms like Amazon Web Services, Google Cloud Platform and
Microsoft Azure. In this paper, we use Meta’s Tectonic<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">[2]</a> as an important
and representative example of a bulk storage system. Many other systems have a
similar design (e.g., Google’s Colossus<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a><a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">[5]</a>, YouTube<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">[6]</a>). In Tectonic, as in other systems<a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">[5]</a><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">[6]</a>, flash caches are used to reduce load on the backing HDDs and meet throughput requirements.</p>
<p>Accesses are made to byte ranges within blocks. Blocks are mapped to a location on backing HDDs, and subdivided into many smaller units called segments that can be individually cached (Tectonic has 8 MB blocks and 128 KB segments). Upon an access, the cache is checked for all segments needed to cover the request byte range. If any are missing, an IO is made to the backing store to fetch them, at which point they can be admitted into the cache.</p>
<p>The storage system has 10,000s of storage nodes independently serving requests. The ratio of backing HDD space : flash cache : DRAM cache is 37,800 : 40 : 1. We focus on the scope of an individual node.</p>
<h2 id="bulk-storage-limited-by-disk-head-time">Bulk storage limited by disk-head time</h2>
<p>At scale, hard disks (HDDs) remain the choice of backing store as they are cheaper by an order of magnitude per GB than SSDs<a rel="noopener" target="_blank" href="https://web.archive.org/web/20221004225419/https://blocksandfiles.com/2020/08/24/10x-enterprise-ssd-price-premium-over-nearline-disk-drives">[8]</a>. Newer HDDs offer increased storage density, resulting in shrinking IO capacity (IOPS and bandwidth) per GB as more GBs are served by the same disk head.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/disk-head-time_vs_access-size_simple.png" alt="Fig 1: Disk-head Time consists of a seek & transfer time. This reflects disk-head times on our testbed." /></p>
<p style="text-align: center;"><em>Fig 1: <b>Disk-head Time</b> consists of a seek & transfer time. This reflects disk-head times on our testbed.</em></p>
<p>Disk-head time on backing HDDs is a premium resource. The mechanical nature of HDDs results in a high, size-independent access time penalty (e.g., 10 ms) for positioning the read/write head. With a transfer cost of, e.g., 5.5 ms per MB and a maximum block size of 8 MB, a request could take 10 to 70 ms. In provisioning bulk storage, peak demand for disk-head time matters most. If the system has insufficient IO capacity, requests queue up, and slowdowns occur. If sustained, clients retry requests and failures occur, affecting user experience. Thus, bulk storage IO requirements are defined by peak load, which in turn affects storage costs.</p>
<h2 id="flash-caches-absorb-backend-load-but-have-limited-write-endurance">Flash caches absorb backend load but have limited write endurance</h2>
<p>Flash caching plays an important role in absorbing backend load, compensating for disk-head time limitations of the underlying HDDs. This setup enables resource-efficient storage for workloads that exceed the throughput requirements of HDDs but which are infeasible to store using flash alone. With the trends towards higher density HDDs and fewer bytes per HDD spindle, flash caches unlock more usable bytes per spindle.</p>
<p>Flash does not have access setup penalties, but does have wearout that translates into long-term average-write-rate limits. SSD manufacturers rate their drives’ endurance in terms of drive-writes per day (DWPD) over their warranty period.</p>
<p>Caching is an especially challenging workload for flash, since items will have widely varying lifetimes, resulting in a usage pattern closer to random I/Os than large sequential writes. Items admitted together may not be evicted at the same time, worsening write amplification. Writing every miss into flash would cause it to wear out prematurely.</p>
<p>Flash caches leverage <strong>admission policies</strong> (APs) to decide if items should be inserted into the flash cache or discarded, and have simple eviction policies (Least Recently Used, First-In First-Out) to minimize write amplification<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a>. Like eviction policies, admission policies weigh the benefit of hits from new items against lost hits from evicted items. They must also weigh the write cost of admitting the new item against other past or future items. Policies have an admission threshold that can be varied to achieve the target flash write rate. We provide some examples.</p>
<ul>
<li><strong>CoinFlip (baseline)</strong> On a miss, segments for an access are either all admitted, or not at all, with probability 𝑝. This simple policy does not need tracking of past items seen.</li>
<li><strong>RejectX (baseline)</strong> rejects a segment the first <em>X</em> times it is seen. Past accesses are tracked using probabilistic data structures similar to Bloom filters. We use X = 1 and vary the window size of past accesses to achieve the desired write rate. Both Meta <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a> and Google <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a> used this prior to switching to more complex policies.</li>
<li><strong>ML admission policies</strong> use offline features to make decisions in addition to online features such as past access counts. A ML model can be trained offline based on a trace (as we do), or online using reinforcement learning.</li>
</ul>
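The two baseline policies can be sketched as follows. This is an illustrative simplification: the window bookkeeping uses a plain list for clarity, whereas production systems use Bloom-filter-like sketches, as noted above.

```python
import random

class CoinFlip:
    """Admit a missed item with fixed probability p (no per-item state)."""
    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def admit(self, block_id):
        return self.rng.random() < self.p

class RejectX:
    """Reject an item the first X times it is seen within a sliding window."""
    def __init__(self, x=1, window=4):
        self.x = x
        self.window = window
        self.history = []          # recent misses; real systems use sketches

    def admit(self, block_id):
        seen = self.history.count(block_id)
        self.history.append(block_id)
        if len(self.history) > self.window:
            self.history.pop(0)
        return seen >= self.x

ap = RejectX(x=1, window=4)
decisions = [ap.admit(b) for b in ["a", "b", "a", "c", "a"]]
# "a" is rejected at first sight, then admitted once re-seen within the window.
```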
<h1 id="baleen-design">Baleen Design</h1>
<h2 id="optimize-for-disk-head-time-not-hits-or-bandwidth">Optimize for Disk-head time, not hits or bandwidth</h2>
<p>We propose that backing store load be measured using disk-head time (DT), which is a throughput metric that balances IOPS and bandwidth.</p>
<p><strong>Definition</strong>: Disk-Head Time (DT) is the cost of serving a single request to the backend. For a single IO that fetches <em>n</em> bytes:</p>
<p>$$ DT(n) = \text{SeekTime} + \text{TransferTime} \times n $$</p>
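Using the example costs from earlier (a 10 ms seek and 5.5 ms per MB transfer, both illustrative figures from the testbed description rather than universal constants), the definition translates directly to code:

```python
SEEK_MS = 10.0             # assumed size-independent positioning penalty
TRANSFER_MS_PER_MB = 5.5   # assumed sequential transfer cost

def disk_head_time_ms(n_mb):
    """DT(n) = SeekTime + TransferTime * n, for a request of n_mb megabytes."""
    return SEEK_MS + TRANSFER_MS_PER_MB * n_mb

# A single 128 KB segment vs. a full 8 MB block:
dt_segment = disk_head_time_ms(0.125)   # seek dominates small requests
dt_block = disk_head_time_ms(8)         # transfer dominates large requests
```

Note how the seek term dominates small requests while the transfer term dominates large ones, which is why disk-head time behaves like a weighted combination of IOPS and bandwidth.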
<p>Policies are then assessed in terms of Disk-Head Time saved, rather than
object-level hits (corresponding to IOPS) or byte-level hits (corresponding to
bandwidth). Disk-Head Time can also be seen as a weighted sum of object-level
hits and byte-level hits.
We use Disk-Head Time to score episodes for OPT (our approximate optimal online
admission policy) and ultimately generate labels for training Baleen’s ML
admission policy. In training Baleen’s prefetcher, we use Disk-Head Time to
assess the benefit of prefetching for a particular episode.</p>
<p>System capacity, such as the number of backend servers, is provisioned to handle peak load in systems that need to meet realtime demand. Therefore, to reduce the backend size, one should minimize peak disk-head time. This introduces the need for scheduling (i.e., when to spend the flash write rate budget) to prioritize the admission of items that contribute to peak disk-head time. As explicitly optimizing for the peak introduces significant complexity, we leave that for future work. For Baleen, we design our methods to minimize average disk-head time, but show that they are successful in reducing peak disk-head time as well.</p>
<h2 id="decomposing-caching-into-admission-prefetching-and-eviction">Decomposing caching into admission, prefetching and eviction</h2>
<p>We define the caching problem as determining which times we should fetch, admit and evict each segment in order to minimize the backend’s DT given a flash write rate limit.</p>
<p>We propose a heuristic decomposition of this problem into three sub-problems: admission, prefetching, and eviction. This makes it easier to reason about the optimal solutions to each sub-problem and the training and behavior of ML solutions for each part.</p>
<p><strong>Admission:</strong> Whether to admit something into cache in anticipation of future hits that reduce disk-head time. We trade off the disk-head time saved against the write rate used from caching an item. We model admission as a binary classifier, where misses are admitted if the output probability exceeds the policy threshold.</p>
<p><strong>Prefetching:</strong> Whether to prefetch extra segments outside the current access range (which was a miss). We trade off disk-head time saved from hits on the first accesses against the additional time spent in cache, and for incorrect prefetches, the disk-head time wasted and the opportunity cost of the wasted flash write rate. We further decompose the prefetching problem into a) deciding what segments to prefetch and b) when to prefetch (whether the expected benefit exceeds the cost, taking into account the possibility of mispredictions).</p>
<p><strong>Eviction:</strong> Which segment in the cache to pick for eviction upon an admission. One can employ existing approaches for non-flash caches, including ML-based policies. We employ a simple eviction policy (in our case, Least Recently Used) as is used in production systems, leaving ML-based flash-aware eviction policies for future work.</p>
<h2 id="introducing-episodes-an-offline-model-for-flash-caching">Introducing episodes: an offline model for flash caching</h2>
<p>We devised an offline model for flash caching for efficient evaluation of flash caching improvements, and to facilitate the training of ML-based policies. This model revolves around episodes, which are defined as:</p>
<p><strong>Definition</strong> An <strong>episode</strong> is a sequence of accesses that would be hits (apart from the first access) if the corresponding item was admitted. It is defined on a block, and may span multiple segments. As shown in Fig 2, an episode’s size is the number of segments needed to cache it, and its timespan is the length of time between the first access of any segment and the last eviction of a segment.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/episode_with_segments.png" alt="Fig 2: Episodes span space (measured in segments) in addition to time. An episode’s size is the smallest number of segments required to be admitted to get all possible hits within an episode. OPT-Range is (1,3) and (2,3) respectively." /></p>
<p style="text-align: center;"><em>Fig 2: Episodes span space (measured in segments) in addition to time. An episode’s size is the smallest number of segments required to be admitted to get all possible hits within an episode. OPT-Range is (1,3) and (2,3) respectively.</em></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/episode.png" alt="Fig 3: An episode is a group of accesses corresponding to a block’s residency. Accesses (in blue) are grouped into two episodes as the interarrival time (in red) exceeds the assumed eviction age." /></p>
<p style="text-align: center;"><em>Fig 3: An episode is a group of accesses corresponding to a block’s residency. Accesses (in blue) are grouped into two episodes as the interarrival time (in red) exceeds the assumed eviction age.</em></p>
<p>We generate episodes by exploiting the model of a LRU (Least Recently Used) cache as evicting items a constant logical time (eviction age) after the last access. In a LRU cache, the eviction age is the logical time between an item’s last access & eviction. As shown in Fig 3, we group accesses into episodes such that all inter-arrival times within episodes are no larger than the assumed eviction age.</p>
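The grouping rule illustrated in Fig 3 can be sketched directly (the timestamps and eviction age below are hypothetical): a new episode begins whenever the gap since the previous access exceeds the assumed eviction age.

```python
def episodes(access_times, eviction_age):
    """Group a block's access timestamps into episodes: start a new episode
    whenever the inter-arrival gap exceeds the assumed eviction age."""
    eps, current = [], []
    for t in sorted(access_times):
        if current and t - current[-1] > eviction_age:
            eps.append(current)
            current = []
        current.append(t)
    if current:
        eps.append(current)
    return eps

# Accesses at these logical times, with an assumed eviction age of 10:
eps = episodes([0, 3, 7, 25, 28, 60], eviction_age=10)
```

Within each episode, only the first access is a compulsory miss; the rest would be hits if the block were admitted at the start of the episode.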
<p>Episodes provide a direct mapping to the costs and benefits associated with an admission, one that corresponds directly to the decisions actually being made by admission policies. These benefits and costs are associated with an item’s entire lifespan in cache, and are not obvious from looking at a stream of individual accesses. Moreover, with flash caching, it is optimal to admit as early as possible in the episode, given that the flash writes required are a fixed cost. By shifting the mental model from interdependent accesses to independent episodes, we can reason about decisions more easily.</p>
<p>By assuming a constant eviction age, decisions on episodes become independent of one another, which also allows them to be made in parallel. The added pressure on cache space from an admission is accounted for via downward pressure on the eviction age. We determine an appropriate eviction age using simulations that measure the average eviction age.</p>
<p>The episode model also allows for an efficient offline analysis of policies via Little’s Law. Given the arrival rate and assumed eviction age, we can estimate the cache size required, and set the eviction age such that the analytical cache size equals the cache size constraint. While this is much more efficient than an online simulation and is useful for exploring a greater range of parameters than is possible with online simulation, the numbers will differ from simulated ones because the cache size constraint is enforced only as a long-term average, not at all times.</p>
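<p>As a rough illustration of this Little’s Law sizing (names and units are assumptions, not the paper’s implementation): the long-term average occupancy equals the admitted byte arrival rate times the residency time (N = λW), so the analytical cache size is just their product, and inverting it yields the eviction age that matches a given cache size constraint:</p>

```python
def analytical_cache_size(admitted_bytes_per_sec, eviction_age_sec):
    # Little's Law: average occupancy = arrival rate x residency time.
    return admitted_bytes_per_sec * eviction_age_sec

def solve_eviction_age(admitted_bytes_per_sec, cache_size_bytes):
    # Invert Little's Law: the eviction age that makes the
    # analytical cache size equal the cache size constraint.
    return cache_size_bytes / admitted_bytes_per_sec
```

<p>For example, admitting 100 MB/s with a 600-second eviction age implies roughly a 60 GB cache, and vice versa.</p>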
<p>Admission policies can be viewed as partitioning these episodes into those admitted and discarded. This can be done via scoring episodes and ranking them by score.</p>
<h1 id="baleen-system-architecture">Baleen System Architecture</h1>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/architecture.png" alt="Fig 4: Baleen System Architecture." /></p>
<p style="text-align: center;"><em>Fig 4: Baleen System Architecture.</em></p>
<p>We describe Baleen’s architecture in terms of what happens at training time and when deployed with a CacheLib<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a> implementation. At training time, episodes are generated and used to train Baleen’s ML admission and ML prefetching policies. At deployment time, the trained models are supplied to CacheLib, which uses them to make decisions on the fly.</p>
<h2 id="opt-approximates-optimal-online-admission-policy">OPT approximates optimal online admission policy</h2>
<p>We devise an offline admission policy, <strong>OPT</strong>, that we train Baleen’s ML policy to imitate. In OPT, first, each block’s accesses are grouped into episodes using an assumed eviction age. Second, all episodes are scored using the equation below and sorted. Last, the maximum number of episodes is admitted such that the total flash writes required do not exceed the write rate budget. During online simulation, accesses are admitted if they belong to episodes that were marked as admitted during the offline process.</p>
<p>$$ Score(Episode)= \frac{DTSaved(Episode)}{Size(Episode)} $$</p>
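<p>The selection step above can be sketched as a greedy, knapsack-style loop (a hedged illustration with assumed field names, not the paper’s code): rank episodes by disk-head time saved per byte written, then admit in order until the flash-write budget is exhausted:</p>

```python
def opt_admit(episodes, write_budget_bytes):
    """episodes: list of (dt_saved, size_bytes) tuples.

    Score = disk-head time saved per byte written to flash;
    admit highest-scoring episodes within the write budget.
    """
    ranked = sorted(episodes, key=lambda e: e[0] / e[1], reverse=True)
    admitted, writes = [], 0
    for ep in ranked:
        if writes + ep[1] <= write_budget_bytes:
            admitted.append(ep)
            writes += ep[1]
    return admitted
```

<p>For instance, with a write budget of 6 bytes, an episode saving 3 units of disk-head time for 1 byte outranks one saving 10 units for 5 bytes, and both fit the budget while a low-scoring 4-byte episode is rejected.</p>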
<h2 id="training-baleen-to-imitate-opt">Training Baleen to imitate OPT</h2>
<p>We use OPT’s decisions as binary labels for training Baleen. Training examples are added for the first k (k = 6) accesses of each episode (to avoid biasing the training set towards popular but easy episodes). Features include offline metadata provided by the bulk storage system (which helps identify which application the request originates from) and online history-based counts (how many hits the object has received in the last 1, 2, 3, 4, 5, and 6 hours).</p>
<h2 id="training-baleen-to-predict-what-and-when-to-prefetch">Training Baleen to predict what and when to prefetch</h2>
<p>By default, on a miss, the smallest IO that covers all missed segments is made, i.e., no prefetching occurs. It is possible to extend this IO and preemptively admit more segments. If done correctly, this reduces the total number of IOs needed and thus reduces Disk-head Time.</p>
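<p>A minimal sketch of this default miss path (hypothetical helper names): the IO issued is the smallest contiguous segment range covering every missed segment, and prefetching simply widens that range to include a predicted one:</p>

```python
def covering_io(missed_segments):
    """Smallest contiguous (start, end) range covering all misses."""
    return (min(missed_segments), max(missed_segments))

def extend_with_prefetch(missed_segments, predicted_range):
    # Union of the required range and a predicted prefetch range,
    # so one IO covers both the misses and the prefetched segments.
    start, end = covering_io(missed_segments)
    return (min(start, predicted_range[0]), max(end, predicted_range[1]))
```

<p>Extending the range this way trades extra bytes read (and written to flash) now against avoiding a second disk IO later, which is the tradeoff ML-Range and ML-When manage below.</p>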
<p>Baleen has two prefetching models: ML-Range and ML-When.</p>
<h2 id="learning-what-to-prefetch-ml-range-learns-from-opt-range">Learning what to prefetch: ML-Range learns from OPT-Range</h2>
<p>OPT-Range is the minimal range of segments that will cover all accesses in an episode. Using the episodes model, we generate OPT-Range for each episode and use these as labels for ML-Range.</p>
<p>ML-Range is an ML model that predicts a range of segments for prefetching. We use the same features as the ML admission model, but add size-related features (access start index, access end index, access size). We train two regression models to predict the episode range start and end. Each episode is represented once in the training data, with only episodes that meet the score cutoff for the target write rate included.</p>
<h2 id="learning-when-to-prefetch-ml-when">Learning when to prefetch: ML-When</h2>
<p>Fetching insufficient segments results in minimal or no Disk-head Time reduction. On the other hand, fetching excess segments results in a high write rate. To balance these tradeoffs, we need to know our confidence in our range prediction.</p>
<p>Mispredictions by the ML admission policy and in ML-Range can easily cause prefetching to hurt instead of help. In reality, the expected benefit will be lower than OPT prefetching and the cost can only be higher. The disk-head time saved from prefetching ML-Range may not be realized. Moreover, prefetching mispredictions are costly in terms of disk-head time consumed to fetch unused segments and the opportunity cost of flash writes used to store them. ML-When aims to address this and exclude episodes that do not have a high probability of benefiting from prefetching.
The exact equations are provided in our paper.</p>
<h1 id="evaluation">Evaluation</h1>
<p>We evaluate Baleen using a testbed and a simulator. We validate both with counters from production deployments. Each node in our testbed has a 400 GB flash drive and two 4 TB HDDs.</p>
<p>We report results on 7 Meta production traces collected in 2019, 2021 and 2023 and take an average across the traces.
These traces show week-long workloads on 7 Tectonic clusters from 3 different years,
with each cluster serving the storage needs of an entire data center<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">[2]</a>.
Each trace represents 7 days of production traffic from a single
cluster (except for Region3, which has 3 days), with traffic sampled at every
node (each cluster has thousands of nodes) and later aggregated into a trace.
The Region1 and Region2 traces were recorded from different clusters over the same 7 days in Oct 2019, while the Region3 trace was recorded from another cluster over 3 days in Sep 2019. Region4 was recorded over 7 days in Oct 2021, and the remaining traces (Region5, Region6, Region7) were collected in Mar 2023.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_wr-34-01.png" alt="Fig 5: Baleen reduces Peak Disk-head Time (DT) by an average of 11.8% over the best non-ML policy (RejectX), and 18.6% over random admission on 7 production traces from Meta under flash write rate constraints." /></p>
<p style="text-align: center;"><em>Fig 5: Baleen reduces Peak Disk-head Time (DT) by an average of 11.8% over the best non-ML policy (RejectX), and 18.6% over random admission on 7 production traces from Meta under flash write rate constraints.</em></p>
<p>Fig 5 shows Baleen reduces Peak DT over RejectX by an average of 11.8% across all traces.
In our paper, we show this ranges from 4.8% to 22.6% across the 7 traces,
with 3 regions deriving most of their gains from prefetching.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_csize.png" alt="Fig 6a: Baleen delivers improvements at higher cache sizes." /></p>
<p style="text-align: center;"><em>6a) Cache Sizes</em></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_wr.png" alt="Fig 6b: Baleen delivers improvements at higher cache sizes." /></p>
<p style="text-align: center;"><em>6b) Write Rates</em><br>
<em>Fig 6: Baleen continues to deliver improvements at higher cache sizes and write rates.</em></p>
<p>Fig 6 shows that the benefits of Baleen are consistent at higher cache sizes and write rates, with Baleen enabling a reduction in cache size by 55% while keeping
the same Peak DT as RejectX, or alternatively a reduction in Peak DT equivalent
to a 4X increase in cache size. As expected, increasing write rate or cache size has diminishing returns in reducing Peak DT. Also, the different admission policies (without prefetching) start to converge, indicating that admission by itself is insufficient to drive further reductions in Peak DT. We provide graphs for all 7 traces in our
paper.</p>
<p>Further results are available in our paper, such as:</p>
<ol>
<li><strong>Prefetching should be selective and in tandem with admission</strong>
We show both ML-Range and ML-When are effective in reducing Peak DT over static baselines, and contribute to Baleen’s robustness across the multiple traces.
We also show that prefetching must be paired with a good admission policy; if not, the same prefetching policy can hurt rather than help.</li>
<li><strong>Optimizing the right metric: Peak DT</strong> We show how optimizing for IO
hit ratio can be misleading, as doing so is optimal for reducing seeks, not
Disk-head Time.</li>
<li><strong>Validation of simulator and testbed.</strong> We validated Baleen on our simulator
against Baleen on our testbed. We took the additional step of showing that our
testbed is consistent with production counters.</li>
<li><strong>Trace analysis</strong> We show distributions for block popularity, interarrival times, and access sizes, as well as the compulsory miss trace, the one-hit-wonder trace (fractions of blocks
with no reuse), and the Admit-All flash write rate.</li>
</ol>
<p>In our paper, we described a few lessons gleaned from 3 years of deploying ML
in production caches at Meta. These lessons were that 1) optimizing the wrong metric is an easy misstep, 2) ML model performance does not always translate to production system performance, 3) the use of DRAM in flash caching should be rethought, and 4) ML-based caching should aim for encapsulation of ML, caching, and storage.
To read more, please see Section 6 (Lessons from deploying ML in production) of our paper.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Baleen is a flash cache that uses ML to guide both prefetching and cache admission, reducing peak storage backend load on real workload traces from Meta. Baleen’s design arose from a number of false steps and lessons learned, and from a cache residency (episodes) formulation that improves training effectiveness, provides an ideal (OPT) target, and exposes the particular value of ML-guided prefetching. As such, Baleen is an important step forward in flash caching for disk-based storage systems.</p>
<p>More details are available in our paper, which <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast24/presentation/wong">has been accepted to FAST 2024</a>. Please
direct any correspondence to <a href="mailto:wonglkd@cmu.edu">Daniel Wong</a>.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>This post is based on the paper <em>Baleen: ML Admission & Prefetching for Flash Caches</em>. I would like to thank my collaborators and the CacheLib and Tectonic teams at Meta: Hao Wu (Meta), Carson Molder (UT Austin), Sathya Gunasekar (Meta), Jimmy Lu (Meta), Snehal Khandkar (Meta), Abhinav Sharma (Meta), Daniel S. Berger (Microsoft Research & University of Washington), Nathan Beckmann (CMU), and Greg Ganger (CMU).
I would also like to thank the reviewers of this post: George Amvrosiadis, Rashmi Vinayak, and Thomas Kim.</p>
<h1 id="references">References</h1>
<ol>
<li>Benjamin Berg, Daniel S Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, et al. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">The CacheLib caching engine: Design and experiences at scale.</a> In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020.</li>
<li>Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, et al. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">Facebook’s Tectonic filesystem: Efficiency from exascale.</a> In 19th USENIX Conference on File and Storage Technologies (FAST 21), 2021</li>
<li>Daniel S Berger. <a rel="noopener" target="_blank" href="https://pdl.cmu.edu/PDL-FTP/BigLearning/2018MachineLearningCDNcache_HOTNETS.pdf">Towards lightweight and robust machine learning for CDN caching.</a> In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets), 2018.</li>
<li>Tzu-Wei Yang, Seth Pollen, Mustafa Uysal, Arif Merchant, and Homer Wolfmeister. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">CacheSack: Admission optimization for google datacenter flash caches.</a> In USENIX Annual Technical Conference (USENIX ATC 22), 2022.</li>
<li>Dean Hildebrand and Denis Serenyi. <a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">Colossus under the hood: a peek into google’s scalable storage system</a>, 2021</li>
<li>Zhenyu Song, Kevin Chen, Nikhil Sarda, Deniz Altınbüken, Eugene Brevdo, Jimmy Coleman, Xiao Ju, Pawel Jurczyk, Richard Schooler, and Ramki Gummadi. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">Halp: Heuristic aided learned preference eviction policy for youtube content delivery network.</a> In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023.</li>
<li>Apache Software Foundation. (2010). <a rel="noopener" target="_blank" href="https://hadoop.apache.org">Hadoop</a>.</li>
<li><a rel="noopener" target="_blank" href="https://web.archive.org/web/20221004225419/https://blocksandfiles.com/2020/08/24/10x-enterprise-ssd-price-premium-over-nearline-disk-drives">Chris Mellor. Enterprise ssds cost ten times more than nearline disk drives.</a> Accessed: 2022-10-04.</li>
</ol>
Mimir: Finding cost-efficient storage configurations in the public cloud2023-12-15T00:00:00+00:002023-12-15T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/mimir/<p>In today’s landscape of diverse public cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, organizations are increasingly turning to cloud computing with pay-as-you-go pricing models. <a rel="noopener" target="_blank" href="https://www.cloudzero.com/blog/cloud-computing-statistics">Many businesses</a> are adopting public cloud services to simplify data center management or leverage the scalability and elasticity offered by these providers.</p>
<p>A pressing question that accompanies this shift to public cloud adoption is how to optimize the overall cost of utilizing cloud resources. While researchers have recently delved into cost optimization for virtual machine (VM) instances used in computational workloads, there has been limited focus on optimizing storage choices. Frequently, companies require high-performance storage clusters to efficiently operate their workloads in public clouds. However, the costs associated with these storage clusters cannot be underestimated, given that VMs and block storage options can strain budgets.</p>
<p><strong>Thus, companies need to carefully select resources for storage clusters to reduce their Total Cost of Ownership.</strong> If organizations opt for only inexpensive resources to minimize costs, their storage clusters may fail to meet performance requirements. Conversely, selecting solely high-performance Virtual Machines and storage types can lead to substantial spending compared to an optimized resource selection approach.</p>
<p><strong>Nonetheless, choosing the cost-efficient set of resources for storage clusters in public clouds remains a challenging task, and no existing system helps with this provisioning decision.</strong> The multitude of available VM and storage types adds complexity. For instance, AWS alone offers over a <a rel="noopener" target="_blank" href="https://www.amazonaws.cn/en/ec2/instance-types/">hundred different instance types</a> and various block storage options, including <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html">locally attached (LocalSSD)</a> and <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html">remotely disaggregated (EBS)</a>. Each resource option comes with <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html">distinct pricing and performance models</a>, and the performance also varies based on workload characteristics. This necessitates a deep understanding of both cloud resource attributes and workload characteristics to make informed selections. If we factor in the potential use of heterogeneous storage cluster configurations, the problem’s search space becomes significantly larger and more intricate.</p>
<p><strong>To address these challenges, we introduce Mimir, a resource auto-selection tool designed to identify the most cost-efficient set of resources for storage clusters in public clouds, all while meeting specified performance requirements.</strong> Our system assesses all available VM types, block storage options, and even combinations of these options (heterogeneous configurations) to determine the optimal solution. <strong>As a result, Mimir can yield storage cluster configurations up to 81% cheaper than those generated by the current state-of-the-art resource auto-selection tools.</strong> In our evaluations, we demonstrate that Mimir can also serve as a resource selector for mixed workloads (comprising multiple workloads with distinct characteristics) and dynamic workloads, efficiently identifying cost-effective cluster configurations within a reasonable time.</p>
<h2 id="challenges-navigating-diverse-resource-options-and-heterogeneity"><strong>Challenges: navigating diverse resource options and heterogeneity</strong></h2>
<h3 id="challenge-1-diverse-storage-options-characteristics"><strong>Challenge 1: diverse storage options’ characteristics</strong></h3>
<p>Public cloud providers have established unique performance characteristics for their block storage options, setting them apart from traditional solutions like SSDs and HDDs. Workload attributes such as access pattern (random/sequential), read ratio, and I/O unit size can exert significant influence on the performance of cloud block storage. Overlooking these factors or assuming that cloud storage behaves analogously to traditional storage can result in erroneous storage cluster configurations. To illustrate this, we present two examples showcasing how workload characteristics impact storage performance.</p>
<table><thead><tr><th align="center"><img src="./storage-io.png" alt="alt text" /></th><th align="center"><img src="./storage-rw.png" alt="alt text" /></th></tr></thead><tbody>
<tr><td align="center">(a) by I/O unit size</td><td align="center">(b) by read ratio</td></tr>
</tbody></table>
<p><strong>Fig. 1:</strong> Performance characteristics of public cloud storage volume types by (a) I/O unit size and (b) workload read ratio. In (a), both volume types have throughput limits defined by AWS (horizontal lines).</p>
<p>In our tests, we employed the <a rel="noopener" target="_blank" href="https://fio.readthedocs.io/en/latest/fio_doc.html">fio benchmark</a> to assess cloud block storage performance on AWS, using three different storage types: local NVMe SSD, remote SSD (gp2), and remote HDD (st1). We varied access patterns, read ratios, and I/O unit sizes. Fig. 1 provides insights into the characteristics of 1 TiB gp2 and 1 TiB st1 volumes, with performance of 3000 IOPS and 40 MiB/s respectively, following the performance model provided by AWS, along with a local SSD attached to i3.xlarge.</p>
<p>Fig. 1(a) shows <em>how performance characteristics vary with I/O unit size and access pattern for each storage type</em>. For gp2, whose performance is defined in IOPS, increasing the I/O unit size results in higher throughput, eventually reaching the maximum limit set by AWS; this holds regardless of the access pattern. In contrast, st1’s performance, defined in MiB/s, should ideally maintain a throughput of 40 MiB/s regardless of the I/O unit size. However, it exhibits reduced throughput for workloads featuring random access patterns and I/O units smaller than 1 MiB, different from the behavior observed with sequential accesses.</p>
<p>In Fig. 1(b), we examine the <em>impact of read ratios on each volume type’s throughput</em>. EBS volumes remain unaffected by the read ratio, as it lies outside their performance models. Conversely, local SSD exhibits considerably higher throughput than EBS and is notably influenced by the read ratio.</p>
<p>As highlighted above, in public clouds, storage performance characteristics differ from traditional storage. For instance, remote SSD throughput remains consistent regardless of read-to-write ratios, while performance of different storage options changes differently for I/O unit size changes. This can confuse users configuring cloud storage clusters, as they may erroneously assume that cloud storage exhibit conventional storage behavior. However, by accurately considering these pricing and performance models, <strong>Mimir can mathematically deduce performance specifications from allocated resources, aiding in cost-efficient cloud storage configurations that meet performance needs</strong>.</p>
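<p>As a simplified illustration of the gp2 behavior in Fig. 1(a) (this omits AWS details such as burst credits and size-dependent baselines, so treat it as a sketch rather than the official AWS model), an IOPS-limited volume delivers roughly the minimum of its IOPS-derived bandwidth and its throughput cap:</p>

```python
def gp2_throughput_mib_s(iops_limit, io_size_kib, max_throughput_mib_s):
    # Bandwidth achievable at the IOPS limit for this I/O unit size.
    iops_bound = iops_limit * io_size_kib / 1024  # MiB/s
    # Throughput saturates at the volume's cap for large I/O units.
    return min(iops_bound, max_throughput_mib_s)
```

<p>This captures the shape in Fig. 1(a): at small I/O units a 3000-IOPS volume is IOPS-bound (e.g., about 47 MiB/s at 16 KiB I/Os), while at large I/O units it hits the per-volume throughput ceiling.</p>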
<p>It is worth noting that local SSD’s highest throughput in Fig. 1 does not always make it the best choice. The throughput of each storage option varies with its configuration; larger gp2 and st1 volumes can outperform local SSD. st1 and gp2 come with lower per-byte costs, making them cost-efficient alternatives when high throughput is not crucial.</p>
<h3 id="challenge-2-heterogeneity-is-important-for-cost-efficiency"><strong>Challenge 2: heterogeneity is important for cost-efficiency</strong></h3>
<p>One easy way of selecting resources for a storage cluster in the public cloud is configuring a homogeneous storage cluster by using a single storage option.
However, we found that there is no single storage option that is the most cost-efficient for every workload, and sometimes, even a mix of storage options is needed to minimize the cost.</p>
<p><img src="./motiv.png" alt="alt text" /></p>
<p><strong>Fig. 2:</strong> No volume type is most cost-efficient for every workload, and a mix of volume types may be the most cost-effective option.</p>
<p>Fig. 2 demonstrates the need to consider various volume types and configurations for selecting a cost-efficient Virtual Storage Cluster (VSC) configuration. For each of the three workloads, it shows the cost for the best VSC configuration under three constraints: using only local SSD volume types, only remote storage (EBS) volume types, and arbitrary mixes of both.</p>
<p>For Workload 1, which demands high storage throughput per GB of data, opting for EBS volume types leads to over-provisioning capacity, making it an expensive choice due to the 3 IOPS per provisioned-GB limit. Conversely, Workload 2, with lower storage throughput requirements, renders local SSD an expensive option due to over-provisioning storage performance. Workload 3 combines varying performance needs, necessitating a mix of storage options to minimize costs.</p>
<p><strong>Therefore, it is crucial to consider a heterogeneous VSC configuration for the cost-efficiency.</strong> However, this introduces complexity, making it impractical to explore the search space using naive methods. So users can use Mimir as a solution to efficiently navigate this complex search space by using dynamic programming and integer-linear programming.</p>
<h2 id="mimir-resource-auto-selector-for-storage-cluster-in-public-clouds"><strong>Mimir: resource auto-selector for storage cluster in public clouds</strong></h2>
<p>To tackle these challenges, we introduce Mimir, a resource auto-selector that identifies the cost-efficient set of VMs and storage volumes for a storage cluster. Mimir takes into account workload characteristics (such as read/write request ratio and data access locality) and user-defined requirements (including request rate and capacity). Next, we will provide an overview of Mimir’s workflow and delve into our main optimization algorithm.</p>
<h3 id="mimir-design-and-workflow"><strong>Mimir design and workflow</strong></h3>
<p>Fig. 3 outlines Mimir’s workflow, which begins by inputting characteristics from multiple workloads requiring cluster storage. Each storage cluster’s workload profiler profiles these attributes, and the Resource Profiler assesses them to determine resource needs for cost-effective cloud operations. This involves resource utilization profiling using micro-benchmarks, considering given data access patterns like request rate, access locality, and read/write ratios. The Resource Predictor uses this resource profiling data to identify efficient container sizes (i.e., storage/network bandwidth, CPU count, memory) for each workload, as Mimir utilizes containers to run multiple storage servers in the same VM with resource isolation. Finally, the VSC Cost Optimizer combines these insights with the public cloud’s cost model to optimize the Virtual Storage Cluster (VSC) configuration for the distributed storage system.</p>
<p><img src="./mimir_design.png" alt="alt text" />
<strong>Fig. 3:</strong> Mimir’s workflow for optimizing the price of public cloud resources. Initially, Mimir profiles the provided workloads, learning the precise resource requirements (such as CPU and memory). Using this trained module and a cost model encompassing public cloud resources, the VSC Cost Optimizer then identifies the most cost-efficient Virtual Storage Cluster (VSC) configuration.</p>
<p>Mimir assumes that users provide or profile the workload characteristics, which the system uses as input for its optimization process. This modular approach makes Mimir adaptable to any storage system capable of profiling sufficient workload information. Next, we provide a brief overview of the optimization algorithm used by Mimir to minimize costs for the given workload characteristics. Further details regarding other components, such as the Resource Profiler and Resource Predictor, can be found in <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3579370.3594776">our paper</a>.</p>
<h3 id="optimization-algorithm-dynamic-programming"><strong>Optimization algorithm: dynamic programming</strong></h3>
<p>The VSC Cost Optimizer addresses the following question: <em>What resource configuration minimizes costs while meeting performance requirements and accommodating storage workload characteristics?</em></p>
<p>In this optimization problem, we identified an optimal substructure property. This means that if Mimir determines the most cost-efficient virtual storage cluster configuration for the given workloads, then any subset of storage servers from the entire cluster (a sub-cluster) must also represent the cost-efficient configuration for the workloads running on that specific sub-cluster.</p>
<p><img src="./dynamic_programming.png" alt="alt text" />
<strong>Fig. 4:</strong> Mimir’s optimization problem has an optimal substructure property. If we find the most cost-efficient configuration for the entire virtual storage cluster, then any sub-cluster of that configuration must also be cost-efficient for the portion of data stored within that sub-cluster.</p>
<p>Figure 4 exemplifies the optimal substructure property. Suppose Machines 1-4 represent the most cost-efficient VSC configuration for a given workload. We contend that any sub-cluster should also be the most cost-efficient for the portion of the workload it handles. To prove this, we use a proof by contradiction. Let’s assume that Machines 1 and 2 are not the most cost-efficient sub-cluster configuration for 3/10 of the workload. This would imply the existence of another sub-cluster (in this case, Machine 5) that’s cheaper than Machines 1 and 2. However, this contradicts the fact that Machines 1-4 (total VSC cost: $28) constitute the most cost-efficient VSC configuration for the entire workload, given that a cheaper configuration involving Machines 3-5 (total VSC cost: $26) exists.</p>
<p>Based on the optimal substructure property, we use dynamic programming to break down a large search space into manageable segments. For a more in-depth understanding of our approach, including how we use mixed-integer programming for the base case and how Mimir integrates other components (e.g., resource profiler, resource predictor, and cost model) into its optimization algorithm, please refer to our paper.</p>
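<p>The recurrence implied by this property can be sketched as follows (hypothetical names; in Mimir the per-machine base cases come from mixed-integer programming rather than a fixed option list): the cheapest cost to serve n workload shards is the cheapest single-machine configuration covering some k shards plus the optimal cost for the remaining n − k shards:</p>

```python
import math

def min_cluster_cost(total_shards, base_options):
    """base_options: list of (shards_served, cost) per machine config.

    dp[n] = cheapest cost to serve n shards, built bottom-up from
    the single-machine base cases (optimal substructure).
    """
    dp = [0.0] + [math.inf] * total_shards
    for n in range(1, total_shards + 1):
        for shards, cost in base_options:
            if shards <= n and dp[n - shards] + cost < dp[n]:
                dp[n] = dp[n - shards] + cost
    return dp[total_shards]
```

<p>For example, with machine configurations serving 1 shard for $10 or 3 shards for $25, serving 4 shards optimally costs $35 (one of each), rather than $40 for four single-shard machines.</p>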
<h2 id="mimir-can-find-up-to-81-cheaper-storage-cluster-over-sota"><strong>Mimir can find up to 81% cheaper storage cluster over SOTA</strong></h2>
<p>We evaluated Mimir using Apache BookKeeper as the distributed storage backend and six different <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast20/presentation/cao-zhichao">Meta’s RocksDB (MR) key-value workloads</a>.
The results of our evaluation demonstrate significant cost savings achieved by Mimir compared to state-of-the-art solutions, showing its ability to consider a wide range of volume types. We compared Mimir to three baseline configurations, each focusing on a limited subset of instance or storage types, in contrast to Mimir’s comprehensive consideration of all instance and block storage types:</p>
<ol>
<li><strong>i3.xlarge-only:</strong> The simplest way to configure a VSC is using a single instance type (storage-optimized instance, i3.xlarge) and determining the number of instances based on the storage server performance.</li>
<li><strong>Mimir-LocalOnly:</strong> Another way is to use only instance types that have local SSDs, including some compute or memory optimized instance types like m5d, c5d, and r5d.</li>
<li><strong>Mimir-EBSonly/OptimusCloud-like:</strong> Yet another way of configuring a VSC is using only EBS volumes, which can persist data independently of the instance status; but if the workload requires high performance, this can be more expensive
than local SSD. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/atc20/presentation/mahgoub">OptimusCloud</a>, the previous work we consider as the state-of-the-art, restricts the volume type to EBS volumes because of their persistent nature, but our results show that this approach is often much more costly.</li>
</ol>
<p><img src="./evaluation.png" alt="alt text" />
<strong>Fig. 5:</strong> The cost-efficiency analysis of the optimization results of the workloads of Meta’s RocksDB key-value workloads. Throughput-intensive workloads (MR-A,C,E,F) prefer local SSD as its storage type. In contrast, other
workloads (MR-B,D) that do not require high throughput prefer EBS volume to local SSD.</p>
<p>Fig. 5 shows the VSC costs of the most cost-efficient VSC configurations under different resource constraints. Overall, Mimir successfully identifies the most cost-efficient VSC configuration for any workload in Fig. 5, and achieves up to 81% cost savings compared to the <em>OptimusCloud-like</em> baseline. We also demonstrate that depending on the workload characteristics, different workloads prefer different storage types to store data cost-efficiently.</p>
<p>For instance, MR-D, a capacity-intensive workload, does not require high performance. Thus, local SSD proves costly as it under-utilizes its storage bandwidth, and gp2’s throughput (3 IOPS per provisioned GiB) suffices for MR-D.</p>
<p>Conversely, MR-F, with the second highest throughput needs among the six workloads, benefits from local SSD, making Mimir-LocalOnly more cost-efficient than Mimir-EBSonly. Interestingly, for MR-F, compute-optimized instance types like c5d are more economical than storage-optimized i3.xlarge. This is because MR-F demands high computing power for its high data request rate. This evaluation implies that not only considering various storage options, but also selecting the right instance type is important.</p>
<p>Our paper also covers additional evaluations, including the optimization overhead and Mimir’s effectiveness for dynamic workloads. For comprehensive details, please refer to our research paper.</p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>Mimir finds the cost-efficient virtual storage cluster configurations for distributed storage backends.
By using provided workload information and performance requirements, Mimir predicts resource requirements and explores the complex, heterogeneous set of block storage offerings to identify the lowest cost VSC configuration that satisfies the customer’s need.
Experiments show that no single allocation type is best for all workloads and that a mix of allocation types is the best choice for some workloads.
Compared to a state-of-the-art approach, Mimir finds the VSC configurations that satisfy requirements at up to 81% lower cost for Meta’s RocksDB workloads.</p>
<p>You can find more detailed information in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3579370.3594776">published paper</a>.</p>
Transfer Learning within a Heterogeneous Graph2023-10-31T00:00:00+00:002023-10-31T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/ktn/<h3 id="can-we-transfer-knowledge-between-different-data-types-using-their-connectivity-information">Can we transfer knowledge between different data types using their connectivity information?</h3>
<p>Ecosystems in industry commonly comprise various data types, differing in modality or feature distribution. <strong>Heterogeneous graphs</strong> (HGs) present these multimodal data systems in a unified view by defining multiple types of nodes and edges — for instance, e-commerce networks with (<em>user, product, review</em>) nodes or video platforms with (<em>channel, user, video, comment</em>) nodes. <strong>Heterogeneous graph neural networks</strong> (HGNNs) learn node embeddings, which summarize each node’s heterogeneous local structure into a vector. Unfortunately, real-world HGs suffer from a <strong>label imbalance</strong> between node types. For instance, publicly available content node types such as product nodes are abundantly labeled, whereas labels for user or account nodes may not be available due to privacy restrictions. As a result, label-scarce node types cannot exploit HGNNs, hampering the broader applicability of HGNNs.</p>
<p>In this blog, we introduce how to pre-train an HGNN model on label-abundant node types and then transfer the model to label-scarce node types using relational information given in HGs. You can find details of the work in our paper “<em>Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks</em>” [1], presented at NeurIPS 2022.</p>
<h2 id="what-is-a-heterogeneous-graph-hg">What is a heterogeneous graph (HG)?</h2>
<figure>
<img src="./figure1.png" alt="e-commerce heterogeneous graph" width="400"/>
<figcaption>Figure 1. E-commerce heterogeneous graph: Can we transfer knowledge from label-abundant node types (e.g., products) to zero-labeled node types (e.g., users) through relational information given in a heterogeneous graph?
</figcaption>
</figure>
<p>An HG is composed of multiple node and edge types. Figure 1 shows an e-commerce network presented as an HG. In e-commerce, “users” purchase “products” and write “reviews”. The HG presents this ecosystem using three node types (“user”, “product”, “review”) and three edge types (“user-buy-product”, “user-write-review”, “review-on-product”). Individual products, users, and reviews are then presented as nodes and their relationships as edges in the HG with the corresponding node/edge types.</p>
<p>In addition to all relational information, HGs are commonly provided with <em>input node attributes</em> that summarize each node’s information. For instance, product nodes could have product images as input node attributes, while review nodes could have review texts as their input attributes. As in the example, input node attributes could have different modalities across different node types. The goal is to predict <em>node labels</em> on each node, such as the category of each product or the category each user is most interested in.</p>
<p>In the following section, we introduce the main challenge we face while training HGNNs to predict labels using input node attributes and relational information from HGs.</p>
<h2 id="heterogeneous-graph-neural-networks-hgnns-and-label-scarcity-issues">Heterogeneous graph neural networks (HGNNs) and label scarcity issues</h2>
<p>HGNNs compute node embeddings that summarize each node’s local graph structures including the node and its neighbor’s input attribute distributions. Node embeddings are then fed into a classifier to predict each node’s label. To train an HGNN model and a classifier to predict labels for a specific node type, we require a good amount of labels for the node type.</p>
<p>A common issue in real-world applications of deep learning is label scarcity. With their diverse node types, HGNNs are even more likely to face this challenge. For instance, publicly available content node types are abundantly labeled, whereas labels for user nodes may not be available due to privacy restrictions. This means that in most standard training settings, HGNN models can only learn to make good inferences for a few label-abundant node types and can usually not make any inferences for the remaining node types, given the absence of any labels for them.</p>
<p>To solve this label scarcity issue, we will use a technique called zero-shot transfer learning that improves the performance of a model on a zero-labeled domain.</p>
<h2 id="transfer-learning-on-heterogeneous-graphs">Transfer Learning on Heterogeneous Graphs</h2>
<p>To improve the performance on a zero-labeled “target” domain, transfer learning exploits the knowledge earned from a related “source” domain, which has adequate labeled data. For instance, transfer learning on heterogeneous graphs first trains an HGNN model on the source domain using their labels, then reuses the HGNN model on the target domain.</p>
<p>In order to apply transfer learning on heterogeneous graphs to solve the label scarcity issue we described above, it is clear the target domain should be the zero-labeled node types. The remaining question is what the source domain should be. Previous works commonly set the source domain as the same type of nodes but located in an external HG, assuming those nodes are abundantly labeled (Figure 2). For instance, the source domain is user nodes in the Yelp review graph, while the target domain is user nodes in the Amazon e-commerce graph. This approach, also known as <em>graph-to-graph transfer learning</em>, pre-trains an HGNN model on the external HG and then runs the model on the original label-scarce HG [2, 3].</p>
<center>
<figure>
<img src="./figure2.png" alt="graph-to-graph transfer learning" width="600"/>
<figcaption>Figure 2. Illustration of graph-to-graph transfer learning on heterogeneous graph.</figcaption>
</figure>
</center>
<p>However, this approach is not applicable in many real-world scenarios for three reasons. First, any external HG that could be used in a graph-to-graph transfer learning setting would almost surely be <em>proprietary</em>, making it hard to access. Second, even if practitioners could obtain access to an external HG, it is unlikely that its <em>distribution</em> would match the target HG well enough to apply transfer learning. Finally, node types suffering from <em>label scarcity</em> are likely to suffer the same issue on other HGs; for instance, user nodes on the external HG are also likely to have scarce labels due to privacy constraints.</p>
<h2 id="our-approach-transfer-learning-between-node-types-within-a-heterogeneous-graph">Our approach: transfer learning between node types within a heterogeneous graph</h2>
<p>To overcome the limitations of using external HGs for transfer learning, we introduce a practical source domain: <em>other node types with abundant labels located in the same HG</em>. Instead of using extra HGs, we transfer knowledge across different types of nodes within a single HG assumed to be fully owned by the practitioners. More specifically, we first pre-train an HGNN model and a classifier on a label-abundant “source” node type. Then, we reuse the models on the zero-labeled “target” node types located in the same HG without additional finetuning. The one requirement for this approach is that the source and target node types share the same label set. This requirement is frequently satisfied in real-world settings. For instance, in e-commerce HGs, product nodes have a label set describing product categories, and user nodes share the same label set describing their favorite shopping categories.</p>
<h2 id="main-technical-challenge">Main technical challenge</h2>
<p>We now describe the main challenge in realizing our approach. We cannot directly reuse the pretrained HGNN and classifier on the target node type as described above, because the HGNN maps the source and target node types into different embedding spaces.</p>
<figure>
<img src="./figure3.png" alt="l2 norm of gradients passed to each module in the HGNN" width="450"/>
<figcaption>
Figure 3. The L2 norm of gradients passed to each module in the HGNN while pretraining on the source node type. Green and Red lines show large amounts of gradients passed to source node type-specific modules, while blue and orange lines show little or no gradients passed to target type-specific modules.
</figcaption>
</figure>
<p>This happens because of one crucial characteristic of HGNNs — HGNNs are composed of modules specialized to each node type and use distinct sets of modules to compute embeddings for each node type. While pretraining an HGNN on the source node type, modules specialized to the source node type are well-trained, whereas modules specialized to the target node type remain untrained or under-trained. In Figure 3, we can observe that the source modules (green and red lines) receive gradients with high L2 norms during pretraining. On the other hand, because of this specialization, the target modules (orange and blue lines) receive little or no gradient. With under-trained modules for the target node type, the pretrained HGNN model outputs poor node embeddings for the target node type and, consequently, performs poorly on the node prediction task.</p>
<h2 id="ktn-trainable-cross-type-transfer-learning-for-hgnns">KTN: Trainable Cross-Type Transfer Learning for HGNNs</h2>
<p>Now, we introduce a method to transform the poor, under-trained embeddings of the target node type to follow the source embedding space. This allows us to reuse the classifier trained on the source node type. To derive the transformation in a principled manner, let us look into how HGNNs compute node embeddings and analyze the relationship between source and target embeddings.</p>
<figure>
<img src="./figure4.png" alt="HGNN structure" width="600"/>
<figcaption>
Figure 4. (left) In HGNNs, the final L-layer node embeddings for both source and target types are computed using the same input, the previous (L-1)-layer’s node embeddings. (right) The L-layer node embeddings of the source type (product, blue) can be represented by the L-layer node embeddings of the target type (user, red) using (L-1)-layer node embeddings as intermediate values.
</figcaption>
</figure>
<p>In each layer, HGNNs aggregate connected nodes’ embeddings from the previous layer to update each node’s embedding. Node embeddings for both source and target node types are updated using the same input: the previous layer’s node embeddings of any connected node types (Figure 4, left). This means that they can be represented in terms of each other, using the previous layer’s embeddings as intermediate values (Figure 4, right).</p>
<p>We prove there is a mapping matrix from the target domain to the source domain, which is defined by HGNN parameters (Theorem 1 in [1]). Based on this theorem, we introduce an auxiliary network, named Knowledge Transfer Networks (KTN), that learns the mapping matrix from scratch during pretraining HGNN on the source domain. At test time, we first compute target embeddings using the pretrained HGNN, then map the target embeddings to the source domain using our trained KTN. Finally, we can reuse the classifier with transformed target embeddings.</p>
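<p>To make the mapping idea concrete, here is a toy, dependency-free Python sketch. In the actual method, the mapping (the KTN) is a trainable network learned jointly while pretraining the HGNN; here we simply fit a 2-D linear map by solving the normal equations on synthetic embeddings, purely for illustration — all names and numbers below are ours:</p>

```python
import random

# Toy KTN-style mapping: fit a linear map W that transforms target-type
# embeddings into the source embedding space, so a classifier trained on
# source embeddings can be reused on mapped target embeddings.
random.seed(0)
d = 2
W_true = [[2.0, 0.5], [-1.0, 3.0]]   # unknown "ground-truth" mapping

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Synthetic embeddings: H_src is where each target embedding *should* land.
H_tgt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(100)]
H_src = [matvec(W_true, x) for x in H_tgt]

# Least squares via the normal equations: W^T = (X^T X)^{-1} X^T S,
# where rows of X are target embeddings and rows of S are source ones.
M = matmul(transpose(H_tgt), H_tgt)                  # 2x2 Gram matrix
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]
Wt = matmul(Minv, matmul(transpose(H_tgt), H_src))
W_fit = transpose(Wt)                                # recovered mapping

# At test time: map a target embedding into the source space, then feed
# the result to the classifier pretrained on the source node type.
mapped = matvec(W_fit, H_tgt[0])
```

<p>The recovered <code>W_fit</code> matches the ground-truth map on this noise-free toy data; the paper’s KTN plays the role of this map, but is learned end-to-end from the HGNN’s parameters rather than fit post hoc.</p>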
<h2 id="experimental-results">Experimental results</h2>
<figure>
<img src="./figure5.png" alt="zero-shot transfer learning results on OAG and Pubmed" width="600"/>
<figcaption>
Figure 5. Zero-shot transfer learning performance measured in NDCG on Open Academic Graph (OAG) and Pubmed datasets. Higher is better. Our proposed method KTN (red bar) shows the highest accuracy among all baselines.
</figcaption>
</figure>
<p>To examine the effectiveness of our proposed KTN, we ran 18 different zero-shot transfer learning tasks on two public heterogeneous graphs, Open Academic Graph [4] and Pubmed [5]. We compare KTN with 8 state-of-the-art transfer learning methods. We show our results in Figure 5. Our proposed method KTN consistently outperforms all baselines on all tasks by up to 73.3%. The naive approach we discussed earlier — reusing the pretrained models directly on the target domain without any transfer learning — is shown as the blue bar. KTN provides relative gains of up to 340% over this naive approach without using any labels from the target domain.</p>
<figure>
<img src="./figure6.png" alt="KTN with 6 different HGNN models" width="450"/>
<figcaption>
Figure 6. KTN can be applied to 6 different HGNN models and improve their zero-shot performance on target domains. Performance is measured in NDCG. Higher is better.
</figcaption>
</figure>
<p>KTN can be applied to almost all HGNN models that have node/edge type-specific modules and improve their zero-shot performance on target domains. In Figure 6, KTN improves accuracy on zero-labeled node types across 6 different HGNN models by up to 960%.</p>
<h2 id="takeaway"><strong>Takeaway</strong></h2>
<p>Various real-world applications can be presented as heterogeneous graphs. Heterogeneous graph neural networks (HGNNs) are an effective technique for summarizing heterogeneous graphs into concise embeddings. However, label scarcity issues on certain types of nodes have prevented the broader application of HGNNs. In this post, we introduced KTN, the first cross-type transfer learning method designed for HGNNs. With KTN, we can fully exploit the rich relational information of heterogeneous graphs with HGNNs on any nodes regardless of their label scarcity.</p>
<p>For more details about KTN, check out our paper [1].</p>
<p>[1] Minji Yoon, John Palowitch, Dustin Zelle, Ziniu Hu, Ruslan Salakhutdinov, Bryan Perozzi. <em>Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks</em>, Neural Information Processing Systems (NeurIPS) 2022.</p>
<p>[2] Tiancheng Huang, Ke Xu, and Donglin Wang. <em>Da-hgt: Domain adaptive heterogeneous graph transformer.</em> arXiv preprint arXiv:2012.05688, 2020.</p>
<p>[3] Shuwen Yang, Guojie Song, Yilun Jin, and Lun Du. <em>Domain adaptive classification on heterogeneous information networks.</em> International Joint Conferences on Artificial Intelligence (IJCAI) 2021.</p>
<p>[4] Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, et al. <em>Oag: Toward linking large-scale heterogeneous entity graphs.</em> In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2019.</p>
<p>[5] Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. <em>Heterogeneous network representation learning: A unified framework with survey and benchmark.</em> IEEE Transactions on Knowledge and Data Engineering, 2020.</p>
FIFO is Better than LRU: the Power of Lazy Promotion and Quick Demotion2023-09-20T00:00:00+00:002023-09-20T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/fifo-lru/<blockquote>
<p><strong>TL;DR:</strong>
Historically, FIFO-based algorithms have been considered less efficient (having higher miss ratios) than LRU-based algorithms.
In this blog, we introduce two techniques, <strong>lazy promotion</strong>, which promotes objects only at eviction time, and <strong>quick demotion</strong>, which evicts most new objects quickly. We will show that</p>
<ul>
<li>The “weak LRUs” suggested by conventional wisdom, e.g., FIFO-Reinsertion, are actually more efficient (having lower miss ratios) than LRU;</li>
<li>Simply evicting most new objects quickly can improve a state-of-the-art algorithm’s efficiency;</li>
<li>Eviction algorithms can be designed like building with LEGOs by adding <strong>lazy promotion</strong> and <strong>quick demotion</strong> on top of FIFO.</li>
</ul>
</blockquote>
<h2 id="introduction">Introduction</h2>
<p>Caching is a well-known and widely deployed technique to speed up data access, reduce repeated computation and data transfer.
A core component of a cache is the eviction algorithm, which chooses the objects stored in the limited cache space.
Two metrics describe the performance of an eviction algorithm: efficiency, measured by the miss ratio, and throughput, measured by the number of requests that can be served per second.</p>
<p>The study of cache eviction algorithms has a long history, with a majority of the work centered around LRU (that is, to evict the least-recently-used object).
Generally, LRU maintains a doubly-linked list, promoting objects to the head of the list upon cache hits and evicting the object at the tail of the list when needed.
<a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3399709">Belady and others found</a> that memory access patterns often exhibit temporal locality — “the most recently used pages were most likely to be reused in the immediate future”. Thus, LRU using <em>recency</em> to promote objects was found to be better than FIFO.</p>
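<p>As a concrete reference point, here is a minimal Python sketch of LRU. The class and method names are ours for illustration only, and <code>OrderedDict</code> stands in for the doubly-linked list plus hash table used in real implementations:</p>

```python
from collections import OrderedDict

# Minimal LRU sketch: a hit promotes the object to the most-recent end
# (eager promotion on every access); a miss on a full cache evicts the
# least-recently-used object at the other end.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def request(self, key) -> bool:
        """Process one request; return True on a hit, False on a miss."""
        if key in self.store:
            self.store.move_to_end(key)      # promote on every hit
            return True
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the LRU object
        self.store[key] = True
        return False

cache = LRUCache(2)
hits = [cache.request(k) for k in ["A", "B", "A", "C", "B"]]
# "A" is promoted by its second request, so "B" is the one evicted
# when "C" arrives.
```

<p>Note that every hit reorders the queue — this is the eager promotion whose pointer-update cost the rest of this post contrasts with FIFO.</p>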
<p>Most eviction algorithms designed to achieve high efficiency start from LRU.
For example, many algorithms, such as <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast-03/arc-self-tuning-low-overhead-replacement-cache">ARC</a>, <a rel="noopener" target="_blank" href="https://research.facebook.com/publications/an-analysis-of-facebook-photo-caching/">SLRU</a>, <a rel="noopener" target="_blank" href="https://www.vldb.org/conf/1994/P439.PDF">2Q</a>, <a rel="noopener" target="_blank" href="https://www.usenix.org/legacy/events/usenix01/full_papers/zhou/zhou.pdf">MQ</a>, and <a rel="noopener" target="_blank" href="https://lwn.net/Articles/856931/">multi-generational LRU</a>, use multiple LRU queues to separate hot and cold objects. Some algorithms, e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/511399.511340?casa_token=x3My6rber5UAAAAA%3A7Gbpkgt2k6RMf95GUwvxrsY0-R-q5EpEN_uXRAfF4loxK2vo9yFtFh6Vo5R-30Vlkv1_3BtwnJiomlw">LIRS</a>, maintain an LRU queue but use different metrics to promote objects. While other algorithms, e.g., <a rel="noopener" target="_blank" href="https://www.computer.org/csdl/journal/tc/2001/12/t1352/13rRUxBJhES">LRFU</a>, <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/301464.301486">EE-LRU</a>, <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/hotstorage18/hotstorage18-paper-vietri.pdf">LeCaR</a>, and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-rodriguez.pdf">CACHEUS</a>, augment LRU’s recency with different metrics. In addition, many recent works, e.g., <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/7056022">Talus</a>, improve LRU’s ability to handle scan and loop requests.</p>
<p>Besides efficiency (miss ratio), there have been fruitful studies on enhancing the cache’s execution performance and thread scalability. Each cache hit in LRU promotes an object to the head of the queue, which requires updating at least six pointers guarded by locks.
These overheads are not acceptable in many deployments that need high performance.
Thus, performance-centric systems often use FIFO-based algorithms to avoid LRU’s overheads.
For example, FIFO-Reinsertion and variants of CLOCK have been developed, which serve as LRU approximations.
<em>It is often perceived that these algorithms trade miss ratio for better throughput and scalability.</em></p>
<p>In this blog, I am going to show that FIFO is in fact better than LRU, not only because of its higher throughput and better scalability, but also because of its higher efficiency (lower miss ratios).</p>
<h2 id="why-fifo-and-what-it-needs">Why FIFO and What it needs</h2>
<p>FIFO has many benefits over LRU.
For example, FIFO has <em>less metadata</em> and requires no metadata update on each cache hit, and thus is <em>faster and more scalable</em> than LRU. In contrast, LRU requires updating six pointers on each cache hit, which is not friendly for modern computer architecture due to random memory accesses. Moreover, FIFO is always the first choice when implementing a flash cache because it does not incur write amplification. Although FIFO has throughput and scalability benefits, it is conventional wisdom that FIFO is less effective (having higher miss ratio) than LRU.</p>
<p align="center">
<figure class="image"><img src="cacheAbs.svg" alt="A cache abstraction" style="width:80%; display: block; margin-left: auto; margin-right: auto;">
<figcaption>A cache can be viewed as a logically ordered queue with four operations: insertion, removal, promotion and demotion. Most eviction algorithms can be viewed as promotion algorithms because they focus on how to promote objects. </figcaption>
</figure>
</p>
<p>To understand the various factors that affect the miss ratio, we introduce a cache abstraction.
A cache can be viewed as a logically total-ordered queue with four operations: <span style="font-family:monaco;">insertion</span>, <span style="font-family:monaco;">removal</span>, <span style="font-family:monaco;">promotion</span>, and <span style="font-family:monaco;">demotion</span>.
Objects in the cache can be compared and ordered based on some metric (e.g., time since the last request), and the eviction algorithm evicts the least valuable object based on the metric.
<span style="font-family:monaco;">Insertion</span> and <span style="font-family:monaco;">removal</span> are user-controlled operations, where <span style="font-family:monaco;">removal</span> can either be directly invoked by the user or indirectly via the use of time-to-live (TTL).
<span style="font-family:monaco;">Promotion</span> and <span style="font-family:monaco;">demotion</span> are internal operations of the cache used to maintain the logical ordering between objects.</p>
<p>We observe that most eviction algorithms use <span style="font-family:monaco;">promotion</span> to update the ordering between objects.
For example, LRU-based algorithms promote objects to the head of the queue on cache hits, which we call <span style="font-family:monaco;">eager promotion</span>.
Meanwhile, <span style="font-family:monaco;">demotion</span> is performed implicitly: when an object is promoted, other objects are passively demoted.
We call this process <span style="font-family:monaco;">passive demotion</span>, a slow process as objects need to traverse through the cache queue before being evicted.
However, we will show that instead of eager promotion and passive demotion, eviction algorithms should use <strong>lazy promotion</strong> and <strong>quick demotion</strong>.</p>
<h2 id="lazy-promotion">Lazy Promotion</h2>
<p>To keep popular objects from being evicted while not incurring much performance overhead, we propose adding <strong>lazy promotion</strong> on top of FIFO (called <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>), which <em>promotes objects only when they are about to be evicted</em>.
<strong>Lazy promotion</strong> aims to retain popular objects with minimal effort.
An example is FIFO-Reinsertion (note that FIFO-Reinsertion, 1-bit CLOCK, and Second Chance are different implementations of the same eviction algorithm): an object is reinserted at eviction time if it has been requested while in the cache.</p>
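<p>A minimal sketch of FIFO-Reinsertion in Python (our own illustrative code, not a production implementation): a hit only flips a bit, and all reordering is deferred to eviction time.</p>

```python
from collections import deque

# FIFO-Reinsertion / 1-bit CLOCK sketch: a hit only sets a visited bit
# (lazy promotion); at eviction time, a visited object is reinserted at
# the head with its bit cleared, and the first unvisited object found at
# the tail is evicted.
class FifoReinsertion:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()    # left = head (newest), right = tail (oldest)
        self.visited = {}       # key -> requested-while-cached bit

    def request(self, key) -> bool:
        """Process one request; return True on a hit, False on a miss."""
        if key in self.visited:
            self.visited[key] = True            # no queue update on a hit
            return True
        while len(self.visited) >= self.capacity:
            victim = self.queue.pop()           # examine the oldest object
            if self.visited[victim]:
                self.visited[victim] = False
                self.queue.appendleft(victim)   # reinsert: lazy promotion
            else:
                del self.visited[victim]        # evict
        self.queue.appendleft(key)
        self.visited[key] = False
        return False

cache = FifoReinsertion(2)
hits = [cache.request(k) for k in ["A", "B", "A", "C", "B", "C"]]
# "A" earns a reinsertion via its second request; never-requested "B"
# is evicted when "C" arrives.
```

<p>Unlike the LRU sketch earlier, a hit here touches only a Boolean field, which is why this design keeps FIFO’s throughput and scalability.</p>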
<p><span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> has several benefits over eager promotion (promoting on every access) used in LRU-based algorithms.
First, <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> inherits FIFO’s throughput and scalability benefits because few metadata operations are needed when an object is requested. For example, FIFO-Reinsertion only needs to update a Boolean field upon the <em>first</em> request to a cached object without locking.
Second, performing promotion at eviction time allows the cache to make better decisions by accumulating more information about the objects, e.g., how many times an object has been requested.</p>
<style>
td,th {
font-size: 96%;
}
</style>
<table><thead><tr><th>Trace</th><th>approx time</th><th align="right">#trace</th><th align="right">cache type</th><th align="right">#req (millions)</th><th align="right">#obj (millions)</th></tr></thead><tbody>
<tr><td>MSR</td><td>2007</td><td align="right">13</td><td align="right">block</td><td align="right">410</td><td align="right">74</td></tr>
<tr><td>FIU</td><td>2008</td><td align="right">9</td><td align="right">block</td><td align="right">514</td><td align="right">20</td></tr>
<tr><td>Cloudphysics</td><td>2015</td><td align="right">106</td><td align="right">block</td><td align="right">2,114</td><td align="right">492</td></tr>
<tr><td>Major CDN</td><td>2018</td><td align="right">219</td><td align="right">object</td><td align="right">3,728</td><td align="right">298</td></tr>
<tr><td>Tencent Photo</td><td>2018</td><td align="right">2</td><td align="right">object</td><td align="right">5,650</td><td align="right">1,038</td></tr>
<tr><td>Wiki CDN</td><td>2019</td><td align="right">3</td><td align="right">object</td><td align="right">2,863</td><td align="right">56</td></tr>
<tr><td>Tencent CBS</td><td>2020</td><td align="right">4030</td><td align="right">block</td><td align="right">33,690</td><td align="right">551</td></tr>
<tr><td>Alibaba</td><td>2020</td><td align="right">652</td><td align="right">block</td><td align="right">19,676</td><td align="right">1702</td></tr>
<tr><td>Twitter</td><td>2020</td><td align="right">54</td><td align="right">KV</td><td align="right">195,441</td><td align="right">10,650</td></tr>
<tr><td>Social Network</td><td>2020</td><td align="right">219</td><td align="right">KV</td><td align="right">549,784</td><td align="right">42,898</td></tr>
</tbody></table>
<p>To understand <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>’s efficiency,
we performed a large-scale simulation study on 5307 production traces from 10 data sources, which include open-source and proprietary datasets collected between 2007 and 2020.
The 10 datasets contain 814 billion (6,386 TB) requests and 55.2 billion (533 TB) objects, and cover different types of caches, including block, key-value (KV), and object caches.
We further divide the traces into block and web (including Memcached and CDN).
We choose small/large cache size as 0.1%/10% of the number of unique objects in the trace.</p>
<p>We compare the miss ratios of LRU with two <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> algorithms:
FIFO-Reinsertion and 2-bit CLOCK.
2-bit CLOCK tracks object frequency up to three, and an object’s frequency decreases by one each time the CLOCK hand scans through it. Objects with frequency zero are evicted.</p>
<p>Common wisdom suggests that these two <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> examples are LRU approximations and will exhibit higher miss ratios than LRU.
However, we found that <strong><span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> often exhibits miss ratios lower than LRU</strong>.</p>
<div style="display: flex; justify-content: space-around;">
<img src="multi_LRU_FIFO_Reinsertion_1.svg" alt="small cache" style="width:40%">
<img src="multi_LRU_FIFO_Reinsertion_3.svg" alt="large cache" style="width:40%">
</div>
<div style="width: 88%; margin: 0 auto;">
Comparison of FIFO-Reinsertion and LRU on 10 datasets with 5307 traces. Left: small cache, right: large cache.
</div>
<div style="display: flex; justify-content: space-around;">
<img src="multi_LRU_Clock-2_1.svg" alt="small cache" style="width:40%">
<img src="multi_LRU_Clock-2_3.svg" alt="large cache" style="width:40%">
</div>
<div style="width: 88%; margin: 0 auto;">
Comparison of 2-bit CLOCK and LRU on 10 datasets with 5307 traces. Left: small cache, right: large cache. A longer bar means the algorithm is more efficient (having lower miss ratios on more traces). Note that we do not consider the overhead of LRU metadata in these evaluations.
</div>
<p>The figure above shows that FIFO-Reinsertion and 2-bit CLOCK are better than LRU on most traces.
Specifically, FIFO-Reinsertion is better than LRU on 9 and 7 of the 10 datasets using a small and large cache size, respectively.
Moreover, on half of the datasets, more than 80% of the traces in each dataset favor FIFO-Reinsertion over LRU at both sizes.
On the two social network datasets, LRU is better than FIFO-Reinsertion (especially at the large cache size). This is because most objects in these two datasets are accessed more than once, and using one bit to track object access is insufficient. Therefore, when increasing the one bit in FIFO-Reinsertion (CLOCK) to two bits (2-bit CLOCK), we observe that the number of traces favoring <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> increases to around 70%.
Overall, 2-bit CLOCK is better than LRU on all datasets at the small cache size and on 9 of the 10 datasets at the large cache size.</p>
<figure class="image">
<img src="LP.svg" alt="Lazy promotion leads to quick demotion" style="width:64%">
<figcaption>FIFO-Reinsertion demotes new objects faster than LRU because objects requested before the new object also push it down the queue.</figcaption>
</figure>
<p>Two reasons contribute to <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>’s high effectiveness.
First, <strong>lazy promotion</strong> often leads to <strong>quick demotion</strong>. For example, under LRU, a newly-inserted object <em>G</em> is pushed down the queue only by (1) new objects and (2) cached objects that are requested after <em>G</em>. However, besides the objects requested after <em>G</em>, the objects requested before <em>G</em> (but have not been promoted, e.g., <em>B</em>, <em>D</em>) also push <em>G</em> down the queue when using FIFO-Reinsertion.
Second, compared to promotion at each request, object ordering in <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> is closer to the insertion order, which we conjecture is better suited for many workloads that exhibit popularity decay — old objects have a lower probability of getting a request.</p>
<p>While <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> surprisingly wins over LRU in miss ratio, it does not outperform state-of-the-art algorithms. We next discuss another building block that bridges this gap.</p>
<h2 id="quick-demotion">Quick Demotion</h2>
<p>Efficient eviction algorithms not only need to keep popular objects in the cache but also need to evict unpopular objects fast. In this section, we show that <strong>quick demotion</strong> (QD) is critical for an efficient eviction algorithm, and it enables FIFO-based algorithms to achieve state-of-the-art efficiency.</p>
<p>Because demotion happens passively in most eviction algorithms, an object typically traverses the entire cache before being evicted. Such a traversal gives each object a good chance to prove its value and stay in the cache.
However, cache workloads often follow a Zipf popularity distribution, with most objects being unpopular.
This is further exacerbated by (1) the scan and loop access patterns in the block cache workloads, and (2) the prevalence of dynamic and short-lived data, the use of versioning in object names, and the use of short TTLs in the web cache workloads.
We believe the <em>opportunity cost of new objects demonstrating their value is often too high</em>: the object being evicted at the tail of the queue may be more valuable than the objects recently inserted.</p>
<figure class="image">
<img src="QD.svg" alt="An example of quick demotion" style="width:64%">
<figcaption>An example of quick demotion: adding a small FIFO to filter most new objects that do not have a request soon after insertion.</figcaption>
</figure>
<p>To illustrate the importance of <strong>quick demotion</strong>, we add a simple QD technique on top of state-of-the-art eviction algorithms.
The QD technique consists of a small probationary FIFO queue storing cached data and a ghost FIFO queue storing metadata of objects evicted from the probationary FIFO queue.
The probationary FIFO queue uses 10% of the cache space and acts as a filter for unpopular objects: objects not requested after insertion are evicted early from the FIFO queue. The main cache runs a state-of-the-art algorithm and uses 90% of the space.
The ghost FIFO queue stores as many entries as the main cache.
Upon a cache miss, the object is written into the probationary FIFO queue unless it is in the ghost FIFO queue, in which case, it is written into the main cache.
When the probationary FIFO queue is full, if the object to evict has been accessed since insertion, it is inserted into the main cache. Otherwise, it is evicted and recorded in the ghost FIFO queue.</p>
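<p>The QD technique just described can be sketched as follows. This is a simplified illustration: the main cache here is a plain LRU as a stand-in for the state-of-the-art algorithms evaluated below, and the 10%/90% split follows the text.</p>

```python
from collections import OrderedDict

class QDCache:
    """Sketch of the quick-demotion filter: a small probationary FIFO
    (10% of space), a ghost FIFO of keys evicted from it, and a main
    cache (a plain LRU here, standing in for ARC/LIRS/LeCaR/etc.)."""

    def __init__(self, capacity):
        self.fifo_cap = max(1, capacity // 10)
        self.main_cap = capacity - self.fifo_cap
        self.fifo = OrderedDict()    # key -> accessed-since-insertion bit
        self.main = OrderedDict()    # LRU order, least recent first
        self.ghost = OrderedDict()   # metadata only, bounded by main_cap

    def request(self, key):
        """Returns True on a cache hit, False on a miss."""
        if key in self.main:
            self.main.move_to_end(key)       # LRU promotion in the main cache
            return True
        if key in self.fifo:
            self.fifo[key] = True            # mark as accessed
            return True
        if key in self.ghost:                # ghost hit: straight to main
            del self.ghost[key]
            self._insert_main(key)
        else:
            self._insert_fifo(key)
        return False

    def _insert_fifo(self, key):
        while len(self.fifo) >= self.fifo_cap:
            victim, accessed = self.fifo.popitem(last=False)
            if accessed:
                self._insert_main(victim)    # proved popular: promote
            else:
                self.ghost[victim] = None    # demoted quickly; remember it
                while len(self.ghost) > self.main_cap:
                    self.ghost.popitem(last=False)
        self.fifo[key] = False

    def _insert_main(self, key):
        while len(self.main) >= self.main_cap:
            self.main.popitem(last=False)    # evict the LRU object
        self.main[key] = None
```

<p>Objects that receive no request while in the probationary FIFO are demoted after traversing only 10% of the cache space, rather than the whole queue.</p>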
<p>We add this FIFO-based QD technique to five state-of-the-art eviction algorithms, ARC, LIRS, CACHEUS, LeCaR, and LHD.
We used the open-source LHD implementation from the authors, implemented the others following the corresponding papers, and cross-checked with open-source implementations.
We evaluated the QD-enhanced and original algorithms on the 5307 traces.
Because the traces have a wide range of miss ratios, we choose to present each algorithm’s miss ratio reduction from FIFO calculated as <em>(mr<sub>FIFO</sub> - mr<sub>algo</sub>) / mr<sub>FIFO</sub></em>. Therefore, higher values are better. </p>
<div style="display: flex; justify-content: space-around;">
<img src="block_1.svg" alt="block cache traces, small cache size" style="width:48%">
<img src="block_3.svg" alt="block cache traces, large cache size" style="width:48%">
</div>
<div style="display: flex; justify-content: space-around;">
<img src="web_1.svg" alt="web cache traces, small cache size" style="width:48%">
<img src="web_3.svg" alt="web cache traces, large cache size" style="width:48%">
</div>
<div style="width: 88%; margin: 0 auto;">
On the block (first row) and web traces (second row), quick demotion can improve most state-of-the-art algorithms' efficiency. Left: small cache, right: large cache.
</div>
<p>The figures above show that the QD-enhanced algorithms further reduce the miss ratio of each state-of-the-art algorithm on almost all percentiles. For example, QD-ARC (QD-enhanced ARC) reduces ARC’s miss ratio by up to 59.8% with a mean reduction of 1.5% across all workloads on the two cache sizes, QD-LIRS reduces LIRS’s miss ratio by up to 49.6% with a mean of 2.2%, and QD-LeCaR reduces LeCaR’s miss ratio by up to 58.8% with a mean of 4.5%.
Note that achieving a large miss ratio reduction on a large number of diverse traces is non-trivial. For example, the best state-of-the-art algorithm, ARC, can only reduce the miss ratio of LRU by 6.2% on average.</p>
<p>The gap between a QD-enhanced algorithm and the original algorithm is wider (1) when the state-of-the-art algorithm is relatively weak, (2) when the cache size is large, and (3) on the web workloads.
First, with a weaker state-of-the-art algorithm, the opportunity for improvement is larger, allowing QD to provide more prominent benefits. For example, QD-LeCaR reduces LeCaR’s miss ratios by 4.5% on average, larger than the reductions for the other state-of-the-art algorithms.
Second, when the cache size is large, unpopular objects spend more time in the cache, and <strong>quick demotion</strong> becomes more valuable.
For example, QD-ARC and ARC have similar miss ratios on the block workloads at the small cache size. But QD-ARC reduces ARC’s miss ratio by 2.3% on average at the large cache size.
However, when the cache size is too large, e.g., 80% of the number of objects in the trace,
adding QD may increase the miss ratio (not shown).
Third, QD provides more benefits on the web workloads than the block workloads, especially when the cache size is small. We conjecture that web workloads have more short-lived data and exhibit stronger popularity decay, which leads to a more urgent need for <strong>quick demotion</strong>.
While <strong>quick demotion</strong> improves the efficiency of most state-of-the-art algorithms, for a small subset of traces, QD may increase the miss ratio when the cache size is small because the probationary FIFO is too small to capture some potentially popular objects.</p>
<p>Although adding the probationary FIFO improves efficiency, it further increases the complexity of the already complicated state-of-the-art algorithms.
To reduce complexity, we add the same QD technique on top of 2-bit CLOCK and call it <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span>.
<span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> uses two FIFO queues to cache data and a ghost FIFO queue to track evicted objects.
It is not hard to see that <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> is simpler than all state-of-the-art algorithms — it requires at most one metadata update on a cache hit and no locking for any cache operation. Therefore, we believe it will be faster and more scalable than all state-of-the-art algorithms.
Besides enjoying all the benefits of simplicity, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> also achieves lower miss ratios than state-of-the-art algorithms.
For example, compared to LIRS and LeCaR, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> reduces miss ratio by 1.6% and 4.3% on average, respectively, across the 5307 traces.
While the goal of this work is not to propose a new eviction algorithm, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> illustrates how we can build simple yet efficient eviction algorithms by adding <strong>quick demotion</strong> and <strong>lazy promotion</strong> techniques to a simple base eviction algorithm such as FIFO.</p>
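<p>As a rough single-threaded illustration of this composition (the real appeal of the design is its lock-free, scalable implementation, which a sketch like this cannot show), QD-LP-FIFO combines the probationary/ghost FIFOs with a 2-bit CLOCK main queue. The decrement-on-scan behavior below is one common CLOCK variant, assumed here for concreteness.</p>

```python
from collections import OrderedDict

class QDLPFifo:
    """Sketch of QD-LP-FIFO: a small probationary FIFO, a ghost FIFO,
    and a main 2-bit CLOCK queue (FIFO with reinsertion, where a hit
    only bumps a 2-bit counter)."""

    def __init__(self, capacity):
        self.fifo_cap = max(1, capacity // 10)
        self.main_cap = capacity - self.fifo_cap
        self.fifo = OrderedDict()   # key -> accessed bit
        self.main = OrderedDict()   # key -> 2-bit frequency counter
        self.ghost = OrderedDict()  # metadata of quickly-demoted keys

    def request(self, key):
        if key in self.main:
            self.main[key] = min(self.main[key] + 1, 3)  # lazy promotion
            return True
        if key in self.fifo:
            self.fifo[key] = True
            return True
        if key in self.ghost:
            del self.ghost[key]
            self._insert_main(key)
        else:
            self._insert_fifo(key)
        return False

    def _insert_fifo(self, key):
        while len(self.fifo) >= self.fifo_cap:
            victim, accessed = self.fifo.popitem(last=False)
            if accessed:
                self._insert_main(victim)
            else:
                self.ghost[victim] = None   # quick demotion
                while len(self.ghost) > self.main_cap:
                    self.ghost.popitem(last=False)
        self.fifo[key] = False

    def _insert_main(self, key):
        while len(self.main) >= self.main_cap:
            victim, count = self.main.popitem(last=False)
            if count > 0:
                self.main[victim] = count - 1  # reinsert at tail, decremented
            # count == 0: evicted; the loop re-checks the size
        self.main[key] = 0
```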
<h2 id="discussion">Discussion</h2>
<p>We have demonstrated reinsertion as an example of LP and the use of a small probationary FIFO queue as an example of QD. However, these are not the only techniques.
For example, reinsertion can leverage different metrics to decide whether the object should be reinserted. Besides reinsertion, several other techniques are often used to reduce promotion and improve scalability, e.g., periodic promotion, batched promotion, promoting old objects only, and promoting with try-lock.
Although these techniques do not fall into our strict definition of <strong>lazy promotion</strong> (promotion on eviction), many of them effectively keep popular objects from being evicted.
On the <strong>quick demotion</strong> side, besides the small probationary FIFO queue, one can leverage other techniques to define and discover unpopular objects such as <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/atc17/atc17-blankstein.pdf">Hyperbolic</a> and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/nsdi18/nsdi18-beckmann.pdf">LHD</a>.
Moreover, admission algorithms, e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/3149371">TinyLFU</a>, Bloom Filter, probabilistic, and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi19-eisenman.pdf">ML-based admission algorithms</a>, can be viewed as a form of QD — though some of them are too aggressive at demotion (rejecting objects from entering the cache).</p>
<p>Note that QD bears similarity with some <a rel="noopener" target="_blank" href="https://wiki.c2.com/?GenerationalGarbageCollection">generational garbage collection algorithms</a>, which separately store short-lived and long-lived data in young-gen and old-gen heaps.
Therefore, ideas from garbage collection may be borrowed to strengthen cache eviction algorithms.</p>
<p>The design of <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> opens the door to designing simple yet efficient cache eviction algorithms by innovating on LP and QD techniques. And we envision future eviction algorithms can be designed like building LEGO — adding <strong>lazy promotion</strong> and <strong>quick demotion</strong> on top of a base eviction algorithm.</p>
Verus: A tool for verified systems code in Rust2023-08-03T00:00:00+00:002023-08-03T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/rust-verification-with-verus/<p>Part of the challenge (and fun) of low-level systems code is in the optimizations it employs:
developers might use manual memory management, they might use bit-packing and bit-twiddling optimizations,
or they might use multi-threading to speed up their code.
When dealing with such things for critical software, though, it can be difficult to ensure their correctness.
This is why my research group is interested in the formal verification of systems software:
ensuring through computer-checked mathematical proofs that software does what it is supposed to,
and ideally not compromising on these optimizations.</p>
<p>For this purpose, we have been developing <a rel="noopener" target="_blank" href="https://github.com/verus-lang/verus">Verus</a>,
a verification tool for <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/stable/book/">the Rust programming language</a>.
Rust is increasingly popular as a systems programming language today,
but we didn’t (just) choose it because of its popularity.
Rather, it turns out that the properties that make it attractive as a systems programming language—most notably,
that it allows manual memory management while simultaneously guaranteeing memory-safety—<em>also</em> make it excellent
in the setting of formal verification: in some ways straightforward,
and in some ways surprising. In this blog post, I’ll explain what these ways are.</p>
<h1 id="verification-mutable-memory-and-rust">Verification, mutable memory, and Rust</h1>
<p>First, we’re interested in proving code to be “correct”. What does that mean exactly?
Let’s get our feet wet in verification with some simple examples and then talk about a challenge that Rust helps us solve.</p>
<h2 id="intro-to-verus">Intro to Verus</h2>
<p>The key idea behind Verus is to check additional properties of programs that Rust doesn’t check on its own.
For example, consider the following valid Rust program, operating over an 8-bit integer.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">double</span><span>(i: </span><span style="color:#fffb9d;">u8</span><span>) -> </span><span style="color:#fffb9d;">u8 </span><span>{
</span><span> </span><span style="color:#fed6af;">return</span><span> i </span><span style="color:#ececec;">* </span><span style="font-weight:bold;color:#87d6d5;">2</span><span>;
</span><span>}
</span></code></pre>
<p>Though it’s a valid program, it (potentially) has a problem: if the argument <code>i</code> is more than 127, then the multiplication will overflow the 8-bit integer.
If you run Verus on it (which you can <a rel="noopener" target="_blank" href="https://play.verus-lang.org/?version=stable&mode=basic&edition=2021&code=use+vstd%3A%3Aprelude%3A%3A*%3B%0A%0Averus%21+%7B%0A%0A++++fn+double%28i%3A+u8%29+-%3E+u8+%7B%0A++++++++return+i+*+2%3B%0A++++%7D%0A++++%0A%7D%0A%0Afn+main%28%29+%7B%7D%0A%0A">try yourself at the Verus playground</a>),
Verus reports this error:</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>error: possible arithmetic underflow/overflow
</span><span> --> /playground/src/main.rs:6:16
</span><span> |
</span><span>6 | return i * 2;
</span><span> | ^^^^^
</span></code></pre>
<p>To remedy this, the programmer can declare their <em>intent</em>: namely, that the <code>double</code> function should never be called with any argument greater than 127.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">double</span><span>(i: </span><span style="color:#fffb9d;">u8</span><span>) -> </span><span style="color:#fffb9d;">u8
</span><span> requires i <= 127
</span><span>{
</span><span> return i </span><span style="color:#ececec;">* </span><span style="font-weight:bold;color:#87d6d5;">2</span><span>;
</span><span>}
</span></code></pre>
<p>The <code>requires</code> clause is not a standard Rust feature, but a feature of Verus: in general, Verus source code comprises both Rust code and extra directives for Verus, such as this
<code>requires</code> clause, also known as a <em>precondition</em>. In any case, Verus now accepts the program (<a rel="noopener" target="_blank" href="https://play.verus-lang.org/?version=stable&mode=basic&edition=2021&code=use+vstd%3A%3Aprelude%3A%3A*%3B%0A%0Averus%21+%7B%0A%0A++++fn+double%28i%3A+u8%29+-%3E+u8%0A++++++++requires+i+%3C%3D+127%0A++++%7B%0A++++++++return+i+*+2%3B%0A++++%7D%0A++++%0A%7D%0A%0Afn+main%28%29+%7B%7D%0A%0A">playground link</a>) because with the new assumption, Verus can determine that this arithmetic operation never overflows.</p>
<p>Furthermore, any time the developer calls <code>double</code> from elsewhere in the program, Verus will check that the call satisfies the precondition.
Keep in mind, also, that this check is done statically, covering all possible executions of the program; it is not a runtime check.</p>
<h2 id="specifications-and-program-correctness">Specifications and program correctness</h2>
<p>With Verus, we are actually interested in correctness criteria that go beyond simple arithmetic bounds checks.
Usually, we are interested in proving that a program’s behavior meets some <em>specification</em>, as in this function that computes the maximum of two integers:</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">max</span><span>(a: </span><span style="color:#fffb9d;">u64</span><span>, b: </span><span style="color:#fffb9d;">u64</span><span>) -> (result: </span><span style="color:#fffb9d;">u64</span><span>)
</span><span> ensures
</span><span> result </span><span style="color:#ececec;">==</span><span> a </span><span style="color:#ececec;">||</span><span> result </span><span style="color:#ececec;">==</span><span> b,
</span><span> result </span><span style="color:#ececec;">>=</span><span> a,
</span><span> result </span><span style="color:#ececec;">>=</span><span> b,
</span><span>{
</span><span> </span><span style="color:#fed6af;">if</span><span> a </span><span style="color:#ececec;">></span><span> b {
</span><span> </span><span style="color:#fed6af;">return</span><span> a;
</span><span> } </span><span style="color:#fed6af;">else </span><span>{
</span><span> </span><span style="color:#fed6af;">return</span><span> b;
</span><span> }
</span><span>}
</span></code></pre>
<p>Again, let’s highlight the Verus-specific parts:
first, we have the <code>ensures</code> clause (also known as a <em>postcondition</em>) serving as the function’s specification, along with the name <code>result</code> on the return type,
which is referenced from said postcondition.
Once again, the body of the <code>max</code> function is the Rust code we are verifying.</p>
<p>The <code>ensures</code> clause denotes a predicate that should hold true at the end of the call to <code>max</code>.
This determines what it means for an implementation of <code>max</code> to be “correct”: it is correct if every execution of its code
returns a result that meets its specification.</p>
<p>So, how does Verus actually check that this property holds?
To do this, Verus (and similar tools) encode the correctness of <code>max</code> as logical formulae called <em>verification conditions</em>:</p>
<p>\[ a > b \implies result = a \implies (result = a \lor result = b) \land (result \ge a) \land (result \ge b) \]</p>
<p>\[ \lnot(a > b) \implies result = b \implies (result = a \lor result = b) \land (result \ge a) \land (result \ge b) \]</p>
<p>These conditions are simplified a bit for presentation, but they are close enough for intuition.
The first of these would be read as, “if \( a > b \) (i.e., the first branch is taken), and if \( result \) is set to the return value \( a \), then the conditions of
the ensures clause hold”. The second condition is similar, but for the <code>else</code> side of the branch.</p>
<p>If we prove the verification conditions are correct, this implies the correctness of the program according to its specification.
To do so, Verus uses an automated theorem prover—in this case, <a rel="noopener" target="_blank" href="https://github.com/Z3Prover/z3">Z3</a>—to prove the verification conditions hold for all
values of <em>a</em>, <em>b</em>, and <em>result</em>. This example is simple enough that Z3 can validate the conditions quickly, though for more complicated examples, the developer may need to write additional proofs
to help it out. If Z3 is unable to prove the conditions, either because they do not hold or because it needs additional help, then Verus outputs an error message like the one from
the previous section.</p>
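<p>To build intuition for what these verification conditions say, we can sanity-check them by brute force over a small domain in Python. This is only an illustration: Z3 proves the conditions symbolically, for <em>all</em> values of the variables, not by enumeration.</p>

```python
def post(a, b, result):
    """The ensures clause of max: the conjunction of its three predicates."""
    return (result == a or result == b) and result >= a and result >= b

def vc1(a, b, result):
    """a > b  ==>  result == a  ==>  postcondition (the first branch)."""
    return (not (a > b)) or (not (result == a)) or post(a, b, result)

def vc2(a, b, result):
    """not (a > b)  ==>  result == b  ==>  postcondition (the else branch)."""
    return (a > b) or (not (result == b)) or post(a, b, result)

# Enumerate a small domain; both conditions hold for every assignment.
N = 16
ok = all(vc1(a, b, r) and vc2(a, b, r)
         for a in range(N) for b in range(N) for r in range(N))
assert ok
```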
<p>Specification-checking is extremely useful for situations where an implementation is optimized and handles low-level details, but we would like to provide a higher-level, mathematically precise specification.
For example:</p>
<ul>
<li>A program uses the bitwise operation <code>(x & (x - 1)) == 0</code> to determine if <code>x</code> is a power-of-2, but uses a more mathematically precise specification, \( \exists b.~ 2^b = x \).</li>
<li>A data-structure implements a hash table or a red-black tree, but has a specification stating that its operations are equivalent to those of a mathematical set.</li>
<li>A replicated data structure with a sophisticated synchronization algorithm uses a specification that it acts indistinguishably from a single copy of the data structure.</li>
</ul>
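<p>The first item in the list above is easy to check by hand. Below is a Python illustration comparing the bit trick to its mathematical specification; note that the trick alone also accepts 0, so the sketch adds an <code>x != 0</code> guard (an assumption of mine, not part of the original example).</p>

```python
def is_pow2_trick(x):
    """The optimized bitwise check; the x != 0 guard is needed because
    the bit trick alone classifies 0 as a power of two."""
    return x != 0 and (x & (x - 1)) == 0

def is_pow2_spec(x):
    """The mathematical specification: exists b such that 2**b == x."""
    b = 0
    while 2 ** b <= x:
        if 2 ** b == x:
            return True
        b += 1
    return False

# Exhaustively compare implementation and spec on a small range; a
# verifier like Verus proves such equivalences for the whole type.
assert all(is_pow2_trick(x) == is_pow2_spec(x) for x in range(1024))
```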
<h2 id="challenge-handling-mutable-memory">Challenge: handling mutable memory</h2>
<p>One such “low-level detail” we often have to reason about is <em>mutable heap state</em>.
To see why this is challenging without Rust’s help, let us set aside Rust for a moment,
and imagine we designed a programming language with general pointer types, like in C.
Consider a simple function that takes two pointers and updates one of them:</p>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> Imagined C-like verification language
</span><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">compute_boolean_not</span><span>(</span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x, </span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x_not)
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">bool</span><span> tmp </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>This program looks straightforward at first, but it actually has a problem: what if <code>x</code> and <code>x_not</code> point to the same memory?
Then <code>*x</code> would be updated when we update <code>*x_not</code>. Therefore, a tool would never be able to prove this code matches its specification—it simply isn’t true.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/compute_boolean_not_graphical.png" alt="Visual representation of the above example" /></p>
<p align="center"><i><b>Left:</b> what the developer imagines happening. <b>Right:</b> what might actually happen.</i></p>
<p>One solution is to specify that the pointers do not <em>alias</em> with each other, i.e., that they don’t point to the same memory location:</p>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> Imagined C-like verification language
</span><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">compute_boolean_not</span><span>(</span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x, </span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x_not)
</span><span> requires x </span><span style="color:#ececec;">!=</span><span> x_not </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> This line has been added
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">bool</span><span> tmp </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>Recall the <code>requires</code> clause here indicates an assumption the function can make at the beginning of its execution.
By making this assumption, Verus can now check that the specification holds, although now every call to <code>compute_boolean_not</code>
will need to uphold this contract.</p>
<p>Unfortunately, adding these “non-aliasing conditions” gets unwieldy fast, as data structures increase both in breadth and depth.
This was our experience when we wrote the first version of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">VeriBetrKV</a>, a key-value store developed in <a rel="noopener" target="_blank" href="https://dafny.org/">Dafny</a>, which has a similar aliasing situation to our C-like language.
Not only were the conditions difficult to write manually, but getting them wrong often led to error messages that were difficult to diagnose.</p>
<h2 id="rust-to-the-rescue">Rust to the rescue</h2>
<p>In Rust, it isn’t common to use general-purpose pointer types. Instead, Rust uses more restricted <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/book/ch04-02-references-and-borrowing.html"><em>reference</em> types</a>. In Rust, the types <code>&T</code> and <code>&mut T</code>
each denote a reference to a value of type <code>T</code>.
In the case of <code>&mut T</code>, which is specifically a <em>mutable</em> reference, the user is able
to modify the value behind the pointer.
Thus, in Rust/Verus, our boolean-negation example would look like this, with the <code>x_not</code> parameter marked as a mutable reference.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">compute_boolean_not</span><span>(x: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">bool</span><span>, x_not: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">mut bool</span><span>)
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">let</span><span> tmp: </span><span style="color:#fffb9d;">bool </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>At the machine code level, these references are just like pointers, but the Rust type system enforces additional properties: namely, a <code>&mut</code> reference to a piece of data can never coexist
with another reference to that data. Rust enforces this property because it is crucial to Rust’s guarantees about memory safety.</p>
<p>However, this property is also a huge boon for software verification. Because the non-aliasing property is checked by Rust’s type system,
the developer no longer has to write the non-aliasing conditions
manually. Furthermore, Rust’s type system is fast and often presents high-quality error messages when the property is violated.</p>
<p>One can think of this as if these non-aliasing conditions are
inserted automatically, so the developer doesn’t have to worry about them, but in fact, the situation is even better: the verification tool can simplify the verification conditions to not include any
notion of pointer addresses in the first place! Indeed, some of my colleagues have <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">published a paper</a> quantifying the gains from this kind of simplification.</p>
<h1 id="are-reference-types-all-we-need">Are reference types all we need?</h1>
<p>The fact that Rust works as a language at all is evidence that reference types are sufficient
<em>most</em> of the time. Unfortunately, most of the time isn’t good enough. The non-aliasing
restriction on references gets in the way of implementing any of the following:</p>
<ul>
<li>Doubly-linked lists</li>
<li>Reference-counted pointers (e.g., Rust’s <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/std/rc/struct.Rc.html"><code>Rc</code></a>, similar to C++’s <code>shared_ptr</code>)</li>
<li>Any manner of concurrent algorithm: locks, message-passing queues, memory allocators, systems with domain-specific logic for avoiding data races</li>
</ul>
<p>The reason these examples are difficult is that Rust’s type system enforces that every object has a unique “owner” (with shared access allowed only through immutable references).
However, these examples seemingly need to violate that restriction:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/dlist.png" alt="Visual representation of a doubly-linked list. Each node has two incoming pointers from its neighbors, and two outgoing pointers to its neighbors." /></p>
<p align="center"><i>In a doubly-linked list, each node has two neighbors which point to it. Thus, these nodes do not have unique owners.</i></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/rc.png" alt="Visual representation of reference-counted smart pointer, Rc. The shared object has multiple reference objects pointing to it." /></p>
<p align="center"><i>When working with reference-counted smart pointers, each object may have multiple reference objects. These objects need to coordinate via the reference count to drop the given object at the appropriate time. This counter does not have a unique owner.</i></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/queue.png" alt="Visual representation of message-passing queue. The producer thread and the consumer thread each have a pointer to a shared queue buffer." /></p>
<p align="center"><i>In a message passing queue, the producer thread and the consumer thread have to share a queue buffer to store in-flight messages. This buffer does not have a unique owner.</i></p>
<p>So how can we tackle these kinds of problems?</p>
<p>For such things, Rust programmers need to use Rust’s notorious <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/stable/book/ch19-01-unsafe-rust.html">“unsafe code”</a>, which opts in to various Rust features whose safe use
the type system is unable to validate. As such, the burden shifts from the
type checker to the programmer to ensure they are used correctly.
Applications like the above are generally considered low-level, and they are often
relegated to time-tested libraries. It’s these kinds of low-level systems, though,
that we are especially interested in verifying! So what do we do?</p>
<h2 id="unsafe-code-in-verus-or-condititionally-safe-code">Unsafe code in Verus, or: “conditionally safe code”</h2>
<p>With Verus, we can recover the ability to implement such things while having a computer-checked guarantee of memory safety.
A Rust feature being “unsafe” really just means that the developer has to uphold a certain contract to use it safely, which Rust cannot check.
It is for this reason that I like to call unsafe code <em>conditionally safe</em>—i.e., it is safe subject to meeting
certain conditions. Rust cannot check these conditions, but Verus <em>can</em>.</p>
<p>Here is a simple example: Rust’s common vector indexing operation performs a bounds-check to ensure there is no memory corruption from an out-of-bounds access.
Therefore, this function is <em>unconditionally</em> “safe” to call, no matter what index the caller provides: even if the caller provides something out-of-bounds, the program might panic and exit, but it will never corrupt memory.
However, there is a lesser-used <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked"><code>get_unchecked</code></a> operation which performs <em>no</em> such bounds check.
Thus, <code>get_unchecked</code> is only safe to call if the index is
<em>already known to be in-bounds</em>, making it unsafe (conditionally safe).
This condition can be codified as a Verus <code>requires</code> clause:</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">unsafe fn </span><span style="color:#fffd87;">get_unchecked</span><span><T>(vec: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">Vec</span><span><T>, idx: </span><span style="color:#fffb9d;">usize</span><span>) -> </span><span style="color:#ececec;">&</span><span>T
</span><span> requires idx < vec</span><span style="text-decoration:underline;font-weight:bold;font-style:italic;color:#ffccee;">.</span><span>len()
</span><span> </span><span style="color:#ececec;">...
</span></code></pre>
<p>Now, Verus will check that the index is in-bounds whenever <code>get_unchecked</code> is called.
Thus, we can regain assurance in code that uses this function, provided that Verus is able to validate the code.</p>
<h2 id="handling-unsafe-ownership">Handling unsafe ownership</h2>
<p>Bounds-checking makes for an easy example, but when we consider programs like the ones
diagrammed above, the situation gets a little more complicated.
Recall that what characterizes these systems is that the objects may be pointed to
from multiple owners, which have to coordinate their access somehow.</p>
<p>As a result, the “conditions” of the conditionally safe operations become
substantially more involved. For example, accessing data through a pointer is only safe if there is no
<em>data race</em>, i.e., another thread trying to access it at the same time. Such a condition seems inherently “non-local” as it involves talking about all threads at once,
and therefore is much harder to check than that of a simple index being in bounds.</p>
<p>However, we have already discussed that Rust’s type system allows us to ensure the unique ownership of data, which then rules out illegal operations such as data races.
Therefore, the kind of “condition” we need to check is already the exact kind of condition that Rust’s type system is designed to ensure.
The problem here is just that these particular data structures do not use the specific types that are designed to ensure this. So how can we apply Rust’s philosophy anyway?</p>
<p>Since the data structures we want to verify use objects that don’t obey Rust’s unique ownership, our trick is to add <em>new</em> objects that <em>do</em>.
However, we don’t want to bog down the program with extra data—that would defeat the point of writing optimized code—so these new objects are merely “conceptual proof objects.”
In verification languages, such objects are often called <em>ghost</em> objects, not because they are spooky, but because they have no influence on the “physical world.” The real data structures in the compiled binary
would be the ones diagrammed above, but Verus treats the program as if the ghost objects were really there when generating its verification conditions.</p>
<p>For example, for a program that uses pointers, Verus programs can use a ghost object that represents “the right to read or write memory from the given location.”
Just like for ordinary (“real”) data, Rust’s type system ensures that ownership of this object is unique. Verus in turn ensures that such an object is
present when the program accesses the data behind the pointer. Combining both results, we can be confident that such an access really is data-race-free.
Even while multiple owners might point to the same piece of data, in the sense of physically having a pointer to it, only one owner at a time can have the <em>right</em> to manipulate that data.</p>
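<p>As a rough plain-Rust analogue of this idea (hypothetical code, not Verus’s actual API): a cell that many parties can reference, paired with a separate token type whose unique ownership gates access. In Verus the token would be ghost and erased before compilation; here it is a zero-sized runtime value.</p>

```rust
use std::cell::UnsafeCell;
use std::marker::PhantomData;

// Many owners may hold a reference (a "pointer") to the cell...
struct SharedCell<T> {
    data: UnsafeCell<T>,
}

// ...but the *right* to touch its contents lives in this token.
// Rust's ownership rules guarantee the token sits in one place at
// a time, so two accesses can never overlap.
struct Permission<T> {
    _marker: PhantomData<T>,
}

impl<T> SharedCell<T> {
    fn new(value: T) -> (SharedCell<T>, Permission<T>) {
        (
            SharedCell { data: UnsafeCell::new(value) },
            Permission { _marker: PhantomData },
        )
    }

    // Writing demands exclusive access to the token.
    fn write(&self, _perm: &mut Permission<T>, value: T) {
        unsafe { *self.data.get() = value }
    }

    fn read(&self, _perm: &Permission<T>) -> T
    where
        T: Copy,
    {
        unsafe { *self.data.get() }
    }
}

fn main() {
    let (cell, mut perm) = SharedCell::new(1);
    cell.write(&mut perm, 5);
    assert_eq!(cell.read(&perm), 5);
}
```

<p>This toy version does not tie a token to one particular cell (any two cells’ tokens are interchangeable), a gap that Verus’s ghost permissions close by tracking which memory location each permission governs.</p>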
<p>To verify a doubly-linked list, then, we would arrange nodes with pointers in the usual way, but in addition to the “real” nodes, we would have an additional collection of ghost objects
that represent the right to access those nodes. By writing additional Verus annotations, we can explain, mathematically, how these ghost objects relate to the structure of the linked list,
and as a result we can use the ghost objects to traverse the list.
For more details, you can see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">our paper</a>, where we present the doubly-linked list in detail.</p>
<h1 id="further-reading">Further reading</h1>
<p>There is currently one paper on Verus available, which introduces Verus and works out
the doubly-linked list example in detail, among others. (If you compare to this blog post,
you may notice Verus’ syntax has evolved a bit since this paper was written.)</p>
<p><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.05491">Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. <em>Verus: Verifying Rust Programs Using Linear Ghost Types.</em> (OOPSLA 2023)</a></p>
<p>Before Verus, we explored this space of verification techniques through a language
we developed called <em>Linear Dafny</em>, an extension of the verification language <a rel="noopener" target="_blank" href="https://dafny.org/">Dafny</a>. Verus incorporates many of our
lessons from Linear Dafny, on which there are several papers.
We first introduced Linear Dafny in this paper on VeriBetrKV, a verified key-value store:</p>
<p><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-hance.pdf">Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. <em>Storage Systems are Distributed Systems (So Verify Them That Way!).</em> (OSDI 2020)</a></p>
<p>Some of my colleagues quantified the utility of Linear Dafny’s type system via direct comparison:</p>
<p><a rel="noopener" target="_blank" href="https://homes.cs.washington.edu/%7Ejlli/papers/oopsla2022.pdf">Jialin Li, Andrea Lattuada, Yi Zhou, Jonathan Cameron, Jon Howell, Bryan Parno, and Chris Hawblitzel. <em>Linear Types for Large-Scale Systems Verification.</em> (OOPSLA 2022)</a></p>
<p>Finally, we explored the combination of ghost objects and ownership types to verify
some sophisticated concurrent systems in a Linear Dafny framework called IronSync:</p>
<p><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi23-hance.pdf">Travis Hance, Yi Zhou, Andrea Lattuada, Reto Achermann, Alex Conway, Ryan Stutsman, Gerd Zellweger, Chris Hawblitzel, Jon Howell, and Bryan Parno. <em>Sharding the State Machine: Automated Modular Reasoning for Complex Concurrent Systems.</em> (OSDI 2023)</a></p>
<h1 id="related-work">Related work</h1>
<p>Verus is far from the only Rust verification tool around.
<a rel="noopener" target="_blank" href="https://plv.mpi-sws.org/rustbelt/popl18/">RustBelt</a> is a framework for verifying unsafe code within a precise mathematical model of the Rust language.
It is notable because it can prove general memory-safety theorems about Rust’s type system, even in the presence of libraries that use unsafe code.
However, it does not take <em>advantage</em> of Rust’s type system for the sake of verification, and it doesn’t target developers writing actual Rust code.</p>
<p>Other tools which, like Verus, target developers include <a rel="noopener" target="_blank" href="https://www.pm.inf.ethz.ch/research/prusti.html">Prusti</a>,
<a rel="noopener" target="_blank" href="https://github.com/model-checking/kani">Kani</a>,
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2206.07185">Aeneas</a>,
and <a rel="noopener" target="_blank" href="https://github.com/xldenis/creusot">Creusot</a>.
Of these, the one most similar to Verus is likely Creusot, which takes advantage of the Rust type system in a similar way to generate simple verification conditions.
Creusot is also notable for its “prophecy encoding” of mutable references, which is more general than Verus’ current mutable reference support.
What distinguishes Verus, by contrast, is its support for these ghost objects and especially their use in concurrency.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Rust’s type system, and similar type systems that enforce unique ownership over data,
are enormously helpful in designing a verification language for low-level code.
Just as Rust guarantees memory safety, thus taking the burden off the developer in the common case,
Verus takes advantage of the same to remove the burden of complex aliasing conditions for verification developers.
More surprisingly, though, we can apply Rust’s type system even for code that initially seems very un-Rust-like, which is common in highly-optimized systems code.
Specifically, by utilizing ghost objects, we recover the ability to use Rust’s ownership system (together with Verus checking the conditions of conditionally safe code) to verify code that the type system could not handle in ordinary Rust.</p>
Provably-Safe Sandboxing with WebAssembly2023-07-25T00:00:00+00:002023-07-25T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/provably-safe-sandboxing-wasm/<blockquote>
<p>What if you could run untrusted code and still be able to sleep at night, safe and sound?</p>
</blockquote>
<p></p>
<p>Disclaimer: our award-winning work <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> can only calm your unsafe-software related fears; we recommend complementing this by additionally checking for monsters under your bed, and leaving a night light on, for any fears of things that go bump in the night.</p>
<figure><a name="fig1"></a><br>
<p><img src="./intra-process-sandboxing.svg" alt="A block diagram, representing intra-process sandboxing. Multiple sandboxes are shown inside a single host process, each of which interact via an API with the runtime. The runtime itself interacts with the kernel via syscalls. Multiple sandboxes can run within a single process, and multiple processes can run on the same OS kernel." /></p>
<figcaption>Figure 1: Intra-process sandboxing</figcaption>
<p><br></figure></p>
<p>Whether you want to include third party libraries in your code, support software plugins, use a smart content delivery network, or just browse the Web, you might need to execute untrusted code, which creates a risk that it will compromise its environment. Intra-process software sandboxing (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#fig1">Figure 1</a>), such as with Software Fault Isolation (SFI), is a useful primitive that allows for safe and lightweight execution of such untrusted code in the same process as its environment. Unfortunately, despite being a well-studied technique with a rich and long history, previous efforts to deploy it in production have failed, due to technical and marketplace hurdles, such as requiring access to original source code, complex binary rewriting, or only being supported by a single vendor.</p>
<p><a rel="noopener" target="_blank" href="https://webassembly.org/">WebAssembly</a> (Wasm) is ideally positioned to provide this crucial primitive and support such applications, since Wasm promises both safety <em>and</em> performance, while serving as a popular compiler target for many high-level languages. As a virtual architecture designed with sandboxing in mind, it has clean, succinct, and well-defined semantics, allowing for safe execution of high-performance code on the Web. However, this same design can also benefit non-Web applications, since the Wasm standard explicitly separates the core Wasm language from the specific API provided to each Wasm module by the runtime or other modules. For example, instead of offering a Web-oriented API, (say) for manipulating the DOM, many runtimes offer the WebAssembly System Interface (WASI) API to run Wasm beyond the Web. All of this has made Wasm an attractive compilation target, and compilers for most popular languages, such as C, C++, Rust, Java, Go, C#, PHP, Python, TypeScript, Zig, and Kotlin, now support it as a target. Thus, a single compiler <em>from</em> Wasm to executable code is sufficient to immediately support sandboxed code execution for all such languages. This makes Wasm an attractive narrow waist to provide high-performance lightweight sandboxing.</p>
<p>However, Wasm’s safety guarantees are only as strong as the implementation that enforces them. While Wasm might seem to immediately provide sandboxing, note that the actual implementation of the compiler from Wasm is a critical part of the trusted computing base (TCB) for the guarantee of sandboxing. In particular, any bug in the compiler could threaten the sandboxing protections, and indeed such bugs have been found in existing runtimes, and would lead to arbitrary code execution by an adversary. For example, using carefully crafted Wasm modules, an attacker could achieve:</p>
<ul>
<li>a memory-out-of-bounds read in Safari/WebKit using a logic bug (CVE-2018-4222),</li>
<li>memory corruption in Chrome/V8 using an integer overflow bug (CVE-2018-6092),</li>
<li>an arbitrary memory read in Chrome/V8 using a parsing bug (CVE-2017-5088),</li>
<li>arbitrary code execution in Safari/WebKit using an integer overflow bug (CVE-2021-30734),</li>
<li>a sandbox escape in both Lucet <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[6]</a> and Wasmtime <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[7]</a> using an optimization bug (CVE-2021-32629),</li>
<li>a memory-out-of-bounds read/write in Wasmtime (CVE-2023-26489),</li>
<li>and many others.</li>
</ul>
<p>A plausible explanation for such disastrous sandbox-compromising bugs, even in code designed with sandboxing as an explicit focus, is that the correct (let alone secure) implementation of high-performance compilers is difficult and remains an active area of research, despite decades of work.</p>
<p><span style="color:rgba(65,120,150,1);font-size:1.3rem;margin:0.5em 1em 0.5em 1em;display:block;">Upon reviewing the design space for executing Wasm code, we identified a crucial gap: Wasm implementations that provide <em>both</em> strong security and high performance. In our work, we thus propose, explore, and implement two distinct techniques, with varying performance and development complexity, which guarantee safe sandboxing using provably-safe compilers.</span> The first draws on traditional formal methods to produce mathematical, machine-checked proofs of safety. The second carefully embeds Wasm semantics in safe Rust code such that the Rust compiler can emit safe executable code with good performance. We describe each of these techniques in the upcoming sections, but additionally refer the interested reader to our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> for further details.</p>
<h2 id="vwasm-a-formally-verified-sandboxing-compiler">vWasm: A Formally Verified Sandboxing Compiler</h2>
<p>The first of our techniques, implemented as an open-source compiler, vWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[2]</a>, achieves provably-safe sandboxing via formal verification. Formal verification of software consists of writing a formal (mathematical) statement of the property we wish to prove about the software, and then writing a formal proof that shows that this statement is true for our software. The proof is machine-checked and thus provides the highest degree of assurance in its correctness. In contrast to techniques such as software testing, fuzzing, and manual reviews, formal verification is able to reason about all execution paths, thereby accounting for any possible input. This means that behaviors like buffer overflows, use-after-frees, etc. are completely ruled out. We describe vWasm’s top-level property, as well as our proof strategy, shortly.</p>
<p>Our choice of verification tool, F* <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[4]</a>, is a general-purpose functional programming language with effects, built for formal verification. Syntactically, it is closest to languages from the ML family (such as OCaml, F#, or SML). It has the full expressive power of dependent types, and has proof automation backed by Z3, an SMT solver. Code written in F* can be extracted to multiple languages, and for vWasm, we use F*’s OCaml extraction. Proofs are written within vWasm as a combination of pre-/post-conditions, extrinsic lemmas, intrinsic dependently-typed values, and layered effects.</p>
<p>vWasm is implemented as a compiler from Wasm to x86-64 (abbreviated as x64 henceforth), but it is designed to keep most of its code and proofs generic with respect to the target architecture. Here, we describe the process of compiling to x64, but the techniques generalize in a straightforward way to other architectures such as ARM. In compiling from Wasm to x64, there are three important conceptual stages: (i) a frontend which compiles Wasm to an architecture-parametric intermediate representation (IR), (ii) a sandboxing pass which acts upon the architecture-parametric IR, and (iii) a printer which outputs the x64 assembly code.</p>
<p>The frontend for the compiler is both untrusted and unverified. This means that one neither needs to trust its correctness for the overall theorem statement to be true, nor does one need to write proofs about it. Note that this is in stark contrast with traditional compiler verification, such as with CompCert <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[5]</a>, where any stage of the compilation must either be trusted or verified. This means that we are free to use any compiler technology for the compiler’s frontend, including arbitrarily complicated optimizations, as long as it outputs code within our architecture-parametric IR. Since compiler optimization is orthogonal to our primary goal, for vWasm’s frontend, we implemented only a simple register allocator and a basic peep-hole optimizer. We leave other optimizations for future work.</p>
<p>On the other end of the compilation pipeline is the x64 assembly printer, which is trusted to be correct. This means it is included in vWasm’s overall TCB, but we note that the printer is largely a straightforward one-to-one translation of our IR to strings, making it fairly simple to audit.</p>
<p>Finally, the sandboxing pass, which lies between the above two, is untrusted but verified to be correct. We define this formally below, but informally, this means that the sandboxing code has been proven (and the proof mechanically checked) to produce safely sandboxed code, given any input. Within the sandboxing pass, all accesses (reads or writes) into the Wasm module’s linear memory, indirect function call table, imports, globals, etc. are proven (sometimes after suitable transformations) to be safe. To prove sandbox safety, we additionally prove that the sandboxing pass also guarantees (a restricted form of) Control-Flow Integrity (CFI) that ensures that any checks performed for sandboxing cannot be bypassed, and thus must be obeyed.</p>
<p>Formally reasoning about the safety of sandboxing requires first defining a machine model, and then defining what sandbox safety is in that model. Our machine model covers the subset of x64 targeted by the compiler. A simplified version of this model can be found in our paper, while the complete model can be found in our open-sourced code. We define the semantics for x64 as small-step semantics, allowing for reasoning about even potentially infinitely running code. Within this machine model, the program state contains an <code>ok</code> field, which is set to the value <code>AllOk</code> if and only if, until that point in execution, nothing invalid has occurred. Crucially, this also means that no accesses outside the memory allocated to the module have occurred. Sandboxing is safe if and only if, informally, starting from any initial <code>AllOk</code> state, executing the sandboxed code for any number of steps leads to an <code>AllOk</code> state.</p>
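<p>As a highly simplified sketch of such a model (the real one is written in F* and covers the targeted subset of x64), hypothetical Rust along these lines captures the <code>ok</code>-flag discipline:</p>

```rust
// Toy machine model in the spirit of vWasm's: the state carries an
// `ok` flag that becomes (and stays) NotOk the moment anything
// invalid occurs; here, an out-of-bounds memory access.
#[derive(Clone, Copy, PartialEq, Debug)]
enum OkState {
    AllOk,
    NotOk,
}

enum Instr {
    Store { addr: usize, val: u8 },
    Nop,
}

struct State {
    mem: Vec<u8>,
    ip: usize,
    ok: OkState,
}

// Small-step semantics: execute a single instruction.
fn step(code: &[Instr], mut s: State) -> State {
    if s.ok != OkState::AllOk || s.ip >= code.len() {
        return s;
    }
    match code[s.ip] {
        Instr::Store { addr, val } => {
            if addr < s.mem.len() {
                s.mem[addr] = val;
            } else {
                s.ok = OkState::NotOk; // escaped the sandbox
            }
        }
        Instr::Nop => {}
    }
    s.ip += 1;
    s
}

// Sandbox safety then says: for *any* number of steps n, starting
// from an AllOk state, the state remains AllOk.
fn eval_steps(n: usize, code: &[Instr], mut s: State) -> State {
    for _ in 0..n {
        s = step(code, s);
    }
    s
}

fn main() {
    let code = vec![
        Instr::Nop,
        Instr::Store { addr: 2, val: 9 },   // in bounds: fine
        Instr::Store { addr: 100, val: 1 }, // out of bounds: trips `ok`
    ];
    let s2 = eval_steps(2, &code, State { mem: vec![0; 4], ip: 0, ok: OkState::AllOk });
    assert_eq!(s2.ok, OkState::AllOk);
    let s3 = eval_steps(3, &code, State { mem: vec![0; 4], ip: 0, ok: OkState::AllOk });
    assert_eq!(s3.ok, OkState::NotOk);
}
```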
<p>Written more formally in F*, but still slightly simplified for easier reading:</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>val sandbox_compile
</span><span> (a:aux) (c:code) (s:erased state): Err code
</span><span> (requires (
</span><span> (s.ok = AllOk) /\
</span><span> (reasonable_size a.sandbox_size s.mem) /\
</span><span> (s.ip `in_code` c) /\ ...))
</span><span> (ensures (fun c' ->
</span><span> forall n. (eval_steps n c' s).ok = AllOk))
</span></code></pre>
<br>
<p>This statement, written as pre- and post-conditions for the sandboxing pass <code>sandbox_compile</code>, shows that any code (<code>c'</code>) output by the sandboxer is formally guaranteed via the machine-checked proof to be safe. The pass takes two arguments <code>a</code> (auxiliary data) and <code>c</code> (the input program), and a computationally-irrelevant argument <code>s</code> (the initial state of the program, which is used for reasoning in our proofs, but that is erased when running the compiler), and returns output code <code>c'</code> under the custom effect <code>Err</code> (which allows the compiler to quit early upon error, for example if it finds a call to a non-existent function). The statement guarantees that as long as the pre-conditions in the requires clause are satisfied, the post-condition in the ensures clause provably holds on the produced output code. The pre-conditions say that the initial state must be safe, have a reasonable sandbox size, and start from a valid location in the code; if these conditions are met, the output code <code>c'</code> will be safe when executed for any number of steps <code>n</code>.</p>
<p>The proofs for this theorem span approximately 3,500 lines of F* code, not including the machine model or any of the supporting framework we built to write this proof. In total, vWasm consists of approximately 15,000 lines of F* code and proofs, and required approximately two person-years of development effort.</p>
<h2 id="rwasm-high-performance-informally-proven-safe-sandboxing">rWasm: High-Performance Informally-Proven-Safe Sandboxing</h2>
<p>Our second technique, implemented as an open-source compiler, rWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[3]</a>, achieves provably-safe sandboxing via a careful embedding of Wasm semantics into safe Rust, such that the Rust compiler can then emit high-performance, safe machine code. This approach provides multiple benefits, such as portability across architectures, performance that is competitive with other unsafe compilers, and the ability to introduce runtime extensions (such as inline reference monitors—IRMs) that can be optimized in-tandem with the executed code.</p>
<p>Our insight for this approach is that the specific property of safe sandboxing is heavily intertwined with memory safety. In particular, code written in a memory-safe language cannot escape the confines of the memory provided to it. Informally, this means that by lifting (potentially unsafe) code to a memory-safe language, and then compiling that lifted code to machine code, the generated machine code must be safely sandboxed, due to the memory safety of the intermediate memory-safe language.</p>
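<p>For intuition, a Wasm store instruction might lift to safe Rust along the following lines (a hypothetical sketch; rWasm’s actual generated code is different): the linear memory becomes a <code>Vec&lt;u8&gt;</code>, and every access is bounds-checked by construction, so the compiled output cannot touch host memory outside the sandbox.</p>

```rust
// Sketch: Wasm linear memory lifted to safe Rust. The memory safety
// of the intermediate Rust gives sandboxing of the output "for free".
struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    fn new(pages: usize) -> LinearMemory {
        // A Wasm page is 64 KiB.
        LinearMemory { bytes: vec![0; pages * 65536] }
    }

    // A Wasm `i32.store8` lifts to a checked write: an out-of-bounds
    // address traps instead of corrupting the host process.
    fn store8(&mut self, addr: u32, val: u8) -> Result<(), &'static str> {
        match self.bytes.get_mut(addr as usize) {
            Some(slot) => {
                *slot = val;
                Ok(())
            }
            None => Err("trap: out-of-bounds memory access"),
        }
    }

    fn load8(&self, addr: u32) -> Result<u8, &'static str> {
        self.bytes
            .get(addr as usize)
            .copied()
            .ok_or("trap: out-of-bounds memory access")
    }
}

fn main() {
    let mut mem = LinearMemory::new(1); // one 64 KiB page
    mem.store8(10, 7).unwrap();
    assert_eq!(mem.load8(10).unwrap(), 7);
    // A wild access traps rather than touching host memory.
    assert!(mem.store8(70_000, 1).is_err());
}
```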
<p>While other memory-safe languages would also suffice to obtain safe sandboxing, we pick Rust as our memory-safe language of choice for rWasm, since it is a non-garbage-collected systems-oriented language, which allows us to obtain predictable performance. While Rust <em>does</em> have a non-memory-safe escape hatch via the <code>unsafe</code> keyword (since certain scenarios, such as writing an operating system, might need more control than directly allowed by the language), as long as this keyword is not used (ensured by the declaration <code>#![forbid(unsafe_code)]</code>), Rust guarantees memory safety. Given the prevalence of Rust in industry, and how seriously the Rust team takes unsoundness bugs, safe Rust is thus battle-tested to be memory safe, even if not (yet) proven to be so. Early efforts towards formalization of Rust and its security guarantees have already begun, such as with the RustBelt and Oxide projects.</p>
<p>We implement all stages of rWasm in safe Rust, but note that none of it needs to be trusted or verified. This means we do not need to depend upon the safety or correctness of any part of rWasm for the safety of the produced executable machine code. Instead, the safety of the produced code simply comes from the lack of any <code>unsafe</code> in the generated Rust code (and that unsafe-free Rust guarantees memory safety, as mentioned before). Contrast this with say, wasm2c, which requires either trusting (in addition to the C compiler itself) the wasm2c compiler, or its generated C code, since C does not guarantee memory safety.</p>
<p>Astute readers will note that sandbox safety in any type-safe language also depends on the language’s runtime libraries. Fortunately, rWasm imports nothing, uses only allocation-related features (for <code>Vec</code>), and even eliminates dependency on the Rust standard library via the <code>#![no_std]</code> directive. As with any sandbox, care is required when exposing an API to sandboxed code (e.g., to avoid APIs enabling sandbox bypasses directly or via confused deputies), but such concerns are orthogonal to sandbox construction.</p>
<h2 id="evaluation">Evaluation</h2>
<p>How do vWasm and rWasm perform in practice? We measure both techniques on a collection of quantitative and qualitative metrics, and while more details can be found in our full paper, we show some selected results here.</p>
<figure><a name="fig2"></a><br>
<p><img src="./execution-time.svg" alt="A graph, plotting normalized slowdown (on a log scale) on the y-axis against the Wasm runtimes on the x-axis. A summary of the graph is in the upcoming text." /></p>
<figcaption>Figure 2: Mean execution time of PolyBench-C benchmarks across the Wasm runtimes, normalized to pure native execution. Interpreters have square brackets; just-in-time (JIT) compilers have braces; the rest are ahead-of-time (AOT) compilers. vWasm* disables sandboxing.</figcaption>
<p><br></figure></p>
<p>Run-time performance is critical for practical adoption in most applications. Hence, we benchmark our compilers and various baselines using the PolyBench-C benchmark suite, which consists of thirty programs and has been a standard benchmark suite for Wasm since its inception. <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#fig2">Figure 2</a> summarizes our results, showing the normalized execution time of the benchmarks on the Wasm runtimes. Each point in the chart is the ratio of the mean time taken to execute the benchmark with the particular runtime vs. the mean time taken to execute by compiling the C code directly to non-sandboxed x64, skipping Wasm entirely.</p>
<p>The results indicate that, unsurprisingly, compiled code strictly outperforms interpreted code for run-time performance. <span style="color:rgba(65,120,150,1);font-size:1.3rem;margin:0.5em 1em 0.5em 1em;display:block;">With respect to our compilers, we see that vWasm consistently outperforms the interpreters on all benchmarks, and that rWasm is competitive even with the compilers which are optimized for speed, and not necessarily safety.</span> We note that the relative performance amongst the compilers can vary drastically based upon the workload (for example, on some of the longer-running programs in the benchmark suite, rWasm is more than twice as fast as WAVM <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[8]</a>, which itself is twice as fast as rWasm on other benchmarks). Looking at vWasm and vWasm* (which is vWasm but with the sandboxing pass disabled), we find that the run time is marginally affected (by only 0.2%), indicating that almost all of the slowdown for vWasm, compared to other compilers, is due to the unverified portion of the compiler, which can be improved without needing to write any new proofs or even impacting existing proofs.</p>
<p>Next, we quantify the development effort needed to implement both vWasm and rWasm. The former took approximately two person-years to develop, including both code and proofs, while the latter took one person-month. This stark contrast is a testament to the daunting amount of work formal verification requires, even with modern, automated tools like F*. It also illustrates the significant benefit of rWasm’s carefully leveraging Rust’s investment in safety.</p>
<p>Finally, provable safety is an important property of a verified sandboxing compiler, but one might wish to prove other properties, such as traditional compiler correctness. Here, vWasm has the upper hand, as this is feasible to do in F*, and we have even structured the compiler to make such proofs possible. In contrast, proving correctness for rWasm would be a challenging task, since one would need to formally model the Rust language, show that rWasm preserves Wasm semantics in compiling to Rust, and then implement a semantics-preserving Rust compiler (or prove <code>rustc</code> as semantics-preserving). The nature of the provable sandboxing property is what puts it into the sweet spot where we obtain it “for free” when compiling to Rust, and we believe there may be other such properties where one can obtain provable guarantees in a similar fashion. However, all these properties are a strict subset of what might be proven for an implementation like vWasm, which is built in a full-blown verification-oriented language.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this work, we have explored two concrete points in the design space for implementing a sandboxing execution environment, with a focus on WebAssembly. We proposed designs for these two points, implemented them as open-source tools, vWasm and rWasm, and evaluated them on a collection of both quantitative and qualitative metrics. We show that run-time performance and provable safety are not in conflict, and indeed rWasm is the first Wasm runtime that is both provably-sandboxed and fast.</p>
<p>We refer the interested reader to our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> and to our open-source tools vWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[2]</a> and rWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[3]</a>.</p>
<hr />
<p>A version of this blogpost was previously posted as an <a rel="noopener" target="_blank" href="https://www.usenix.org/publications/loginonline/provably-safe-multilingual-software-sandboxing-using-webassembly">article in USENIX ;login:</a>.</p>
<hr />
<p><a name="references"></a>
<small>
[1] Provably-Safe Multilingual Software Sandboxing using WebAssembly. Jay Bosamiya, Wen Shih Lim, and Bryan Parno. In Proceedings of the USENIX Security Symposium, August, 2022. Distinguished Paper Award <em>and</em> Internet Defense Prize. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya">https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya</a><br>
[2] vWasm: A formally-verified provably-safe sandboxing Wasm-to-native compiler. <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/vWasm/">https://github.com/secure-foundations/vWasm/</a><br>
[3] rWasm: A cross-platform high-performance provably-safe sandboxing Wasm-to-native compiler. <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/rWasm/">https://github.com/secure-foundations/rWasm/</a><br>
[4] F*: A Proof-Oriented Programming Language. <a rel="noopener" target="_blank" href="https://fstar-lang.org/">https://fstar-lang.org/</a><br>
[5] Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. CompCert - a formally verified optimizing compiler. In Embedded Real Time Software and Systems (ERTS). SEE, 2016.<br>
[6] Announcing Lucet: Fastly’s native WebAssembly compiler and runtime. <a rel="noopener" target="_blank" href="https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime">https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime</a>, March 2019.<br>
[7] Wasmtime: A small and efficient runtime for WebAssembly & WASI. <a rel="noopener" target="_blank" href="https://wasmtime.dev/">https://wasmtime.dev/</a><br>
[8] WAVM: WebAssembly virtual machine. <a rel="noopener" target="_blank" href="https://wavm.github.io/">https://wavm.github.io/</a><br>
</small></p>
Code Conversion in Distributed Storage Systems2023-07-19T00:00:00+00:002023-07-19T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/convertible-codes/<p><a name="fig-intro"></a>
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/conversion_intro.png" alt="Code conversion" />
<em>Figure 1: diagram showing the code conversion process in a distributed storage system.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Today’s society is data-driven, and many of the applications that society relies on require storing ever-increasing amounts of data.
To this end, distributed storage systems, such as cloud storage systems, have become the foundation of data infrastructure.
These large-scale systems are typically run on massive clusters which have thousands to millions of disks, and store amounts of data on the scale of exabytes (\(10^{18}\) bytes).
At this large scale, failures become a common occurrence.
Given the fundamental role that distributed storage systems play in supporting other applications, they must guarantee high levels of reliability despite these failures.
One common way to ensure reliability is through replication.
However, duplicating (or triplicating) the amount of space used, as replication requires, is prohibitively expensive.
Instead, most current large-scale storage systems primarily employ <em>erasure codes</em>.
An erasure code encodes data in a way that makes it resilient against failures with lower overhead than replication.</p>
<p>The level of fault tolerance and the storage space consumed by an erasure code are determined by its parameters.
For example, the popular Reed-Solomon codes (and other traditional maximum-distance separable codes) have two main parameters: code length (\(n\)) and dimension (\(k\)).
These parameters are typically set based on the failure rate of storage devices, the required degree of reliability, and some additional requirements on system performance and storage overhead.</p>
<p>In practice, there are multiple reasons which necessitate changing the parameters of an erasure code for <em>already-encoded data</em>.
The process of transforming the data from the old encoding to the new encoding is known as <em>code conversion</em> (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#fig-intro">figure 1</a>).
One of the reasons for doing code conversions is <em>disk-adaptive redundancy</em> (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2019cluster">Kadekodi et al. 2019</a>):
it has been shown that the failure rate of disks can vary drastically across make/models and over time (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#schroeder2007disk">Schroeder and Gibson, 2007</a>), and that significant savings in storage space (and hence operating costs) can be achieved by tuning the code parameters to the observed failure rates.
For example, on production cluster traces from Google, disk-adaptive redundancy can lead to up to 25% space savings (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2020pacemaker">Kadekodi et al. 2020</a>).
Due to the large scale, this translates to savings of millions of dollars and significant reductions in the carbon footprint.
Another reason for code conversion is changes in the popularity of data.
More popular data is typically encoded with higher redundancy (to support faster reads) and less popular data is encoded with less redundancy (for storage efficiency).
Thus, as popularity of the data changes, it is beneficial to change the parameters of the code used to encode the data (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#xia2015tale">Xia et al. 2015</a>).
In this case, data needs to be redistributed to make use of the new disks’ I/O throughput.</p>
<p>The default approach to code conversion is to read all data, re-encode it, and write it back.
This approach requires a large amount of disk I/O access and bandwidth which, due to inherent physical limitations of hard disk drives, are very scarce resources.
The following figure (from <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2020pacemaker">Kadekodi et al. 2020</a>) shows the fraction of the total cluster I/O used by code conversion in a simulation using real-world cluster traces.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/heart_mono_front.png" alt="Conversion IO in simulated cluster" />
<em>Figure 2: Trace-based simulation of a cluster using the default approach to code conversion. X-axis represents calendar date, left Y-axis represents the total fraction of the IO used by conversion, right Y-axis shows the size of the cluster in terms of the number of disks.</em></p>
<p>Observe that code conversion can use up to 100% of the cluster IO during significant periods of time.
Therefore, the default approach to code conversion can easily overwhelm a storage system, interfering with other important (foreground) tasks such as serving user requests.
In this post, we summarize our work on <strong>convertible codes</strong>, which are erasure codes that can be converted more efficiently than the default approach.
So far the information theory community has extensively studied various aspects of storage codes such as rate, update cost, repair bandwidth, and repair locality.
The conversion problem opens up a new dimension to optimize for when designing storage codes.
There are several open problems in this new design space, with a high potential for real-world impact.</p>
<p>We start by providing some <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#background">background</a> about storage systems, erasure codes, and the way erasure codes are used in storage systems.
Then, we introduce and formally define <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#conversion">code conversion and convertible codes</a>.
Afterwards, we provide a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#access-opt">summary of our results</a> and showcase some examples that show how convertible codes can reduce conversion cost.
Finally, we conclude with some <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#conclusion">open problems</a>.</p>
<h2 id="background">Background on storage systems</h2>
<p>Many modern applications require storing large amounts of data; amounts which far exceed the capacity of a single disk drive.
In such cases, data needs to be distributed across many disks, attached to many different machines.
One immediate problem that emerges in this scenario is that, as the number of components in the system increases, the probability that at least one component fails becomes very high.
Because distributed systems need to keep running despite these failures, reliability is an essential part of their design.</p>
<p>The simplest way of adding reliability to a distributed storage system is to use <em>replication</em>: each piece of data has multiple copies each stored in a different disk, so that if any disk fails, no data is permanently lost.
However, replication significantly increases the total amount of space used by the system, which makes it very expensive to use in large-scale systems.
For example, three-replication (which tolerates up to two failures) is used in some systems, and it uses 200% additional storage.
Storage cost is normally measured as the ratio of the total space used to the size of the original data, and is called <em>storage overhead</em>.
So three-replication incurs a storage overhead of 3.</p>
<p>Given the high cost of replication, nowadays most storage systems use <em>erasure coding</em> instead, which can offer the same (or even higher) reliability guarantees while using much less storage space.
For example, an \([5, 3]\) MDS code (explained in detail in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#erasure-codes">Background on erasure codes</a>) can tolerate up to two failures (same as three-replication) and has a storage overhead of \(\frac{5}{3} = 1.66\), i.e., only 66.6% extra storage.</p>
<p>Storage overhead is one of the main cost metrics for distributed storage systems.
This is because the top costs of running a system come from buying all the necessary hardware and operating it: providing infrastructure, power, cooling, networking, and so on.
Such is the scale of these systems, that even a single-digit percentage reduction in storage overhead is significant.</p>
<p>There are many other costs and metrics apart from storage overhead.
Among them, disk I/O resources come first, because they are important for the performance of the system.
HDDs offer relatively low I/O bandwidth compared to their total storage capacity, so disk I/O is often the bottleneck of the system’s throughput.
Due to the mechanical overheads involved in moving the read head to the right place within a disk, the number of I/O operations (called accesses) is also an important metric.
Similarly, the amount of network I/O operations, CPU, and memory usage are also important.</p>
<h3 id="design">Distributed storage system design</h3>
<p>One of the most well-known types of distributed storage systems is <abbr title="Redundant Array of Inexpensive Disks">RAID</abbr>.
A RAID system typically consists of an array of \(n\) disks with the same capacity attached to a single machine.
Data is encoded with an \([n, k]\) MDS (maximum distance separable) code and for each codeword, each of the \(n\) codeword symbols is placed on a different disk.</p>
<p>Modern distributed storage systems need to scale past a single machine, and thus have a different design from RAID.
An example of such a system is <a rel="noopener" target="_blank" href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a>, which also supports erasure coding.
These systems manage a large number of nodes and disks: sometimes thousands or tens of thousands of them.
As in RAID, data is encoded with an \([n, k]\) MDS code, and each codeword symbol is placed on a different disk, but \(n\) is much smaller than the number of disks (typically \(n \leq 50\)).
The disks where a codeword is placed are chosen by a semi-random placement algorithm, which tries to avoid choosing disks that might fail at the same time (e.g., by choosing disks in different racks).</p>
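A rack-aware placement policy like the one described above can be sketched in a few lines. The following Python sketch is purely illustrative: the rack/disk layout and names are assumptions, not drawn from HDFS or any specific system.

```python
# Hedged sketch of rack-aware semi-random placement: pick n distinct racks
# at random, then one disk within each, so that a whole-rack failure erases
# at most one symbol of any codeword.
import random

def place_codeword(racks, n):
    # racks: dict mapping rack id -> list of disk ids on that rack.
    chosen = random.sample(sorted(racks), n)
    return [random.choice(racks[r]) for r in chosen]

# Illustrative cluster: 8 racks with 12 disks each.
racks = {f"rack{r:02d}": [f"rack{r:02d}/disk{d}" for d in range(12)]
         for r in range(8)}

placement = place_codeword(racks, 5)
assert len({disk.split("/")[0] for disk in placement}) == 5  # distinct racks
```

Real systems add further constraints (disk fullness, network topology, failure-domain hierarchies), but the core idea is the same: spread each codeword across independent failure domains.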
<h3 id="erasure-codes">Background on erasure codes</h3>
<p>While many types of erasure codes exist, in this post we will focus specifically on <em>linear</em> erasure codes with the <em>MDS</em> property, which we explain in the following.
An \([n, k]\) MDS (maximum-distance separable) erasure code takes \(k\) symbols of data, and encodes them into \(n\) code symbols with the property that any \(k\) out of the \(n\) code symbols can recover the original \(k\) data symbols.
Symbols are elements from a specific finite field (denoted \(\mathbb{F}\)).
Many practical applications use the finite field \(\mathrm{GF}(2^8)\), where each symbol is represented as a single byte.
Let \(r \coloneqq n - k\), and let \([i] = \{1,2,\ldots,i\}\).
Mathematically, an \([n, k]\) erasure code over \(\mathbb{F}\) is a function \(\mathcal{C}: \mathbb{F}^{k} \to \mathbb{F}^{n}\).
Elements in the image of \(\mathcal{C}\) are called <em>codewords</em>.</p>
<p>A linear code can be described by the mapping \(\mathbf{m} \to \mathbf{m} \mathbf{G}\), where \(\mathbf{G}\) is the \(k \times n\) <em>generator matrix</em>.
A code is MDS iff for any codeword it is possible to recover the original data after erasing any \(r\) arbitrary symbols.
For linear codes, this is equivalent to the property that the \(k \times k\) matrix formed by the columns corresponding to any \(k\) code symbols is invertible.
In practice, <em>systematic</em> codes are often used, which permit reading data without decoding.
A linear code is said to be <em>systematic</em> if its generator matrix contains a \(k \times k\) identity matrix as a submatrix; for such codes, we refer to the \(k \times r\) submatrix defined by the remaining columns as the <em>parity matrix</em> \(\mathbf{P}\).</p>
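The definitions above can be made concrete with a small example. The sketch below implements a systematic \([5,3]\) MDS code over \(\mathrm{GF}(17)\), matching the storage-overhead example earlier in the post; the specific parity matrix (a Vandermonde-style matrix with evaluation points 1 and 2) is an illustrative choice of ours, not taken from any particular system.

```python
# Hedged sketch: a systematic [5, 3] linear MDS code over GF(17)
# (integers mod 17). Any 3 of the 5 symbols suffice to recover the data.
MOD = 17
K, N = 3, 5
# k x r Vandermonde-style parity matrix, entry (i, j) = theta_j ** i.
PARITY = [[pow(t, i, MOD) for t in (1, 2)] for i in range(K)]

def encode(data):
    # Systematic encoding: data symbols appear verbatim, followed by
    # r = n - k parity symbols computed as data * PARITY.
    par = [sum(d * PARITY[i][j] for i, d in enumerate(data)) % MOD
           for j in range(N - K)]
    return data + par

def decode(codeword, present):
    # Recover the data from any K surviving positions (`present`) by
    # solving a K x K linear system mod 17 with Gauss-Jordan elimination.
    gen = [[1 if c == i else 0 for c in range(K)] + PARITY[i]
           for i in range(K)]                        # generator [I | P]
    M = [[gen[i][c] for i in range(K)] + [codeword[c]] for c in present]
    for col in range(K):
        piv = next(r for r in range(col, K) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], MOD - 2, MOD)         # inverse via Fermat
        M[col] = [v * inv % MOD for v in M[col]]
        for r in range(K):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(M[r][c] - f * M[col][c]) % MOD for c in range(K + 1)]
    return [M[r][K] for r in range(K)]

codeword = encode([7, 11, 2])
assert decode(codeword, [0, 3, 4]) == [7, 11, 2]  # survives two erasures
```

Because the code is systematic, reading intact data requires no decoding at all; the linear-system solve is only needed after erasures.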
<h2 id="conversion">The code conversion problem</h2>
<p>The problem of changing data encoded under an initial code \(\mathcal{C}^I\) to its corresponding encoding under a final code \(\mathcal{C}^F\) is called <em>code conversion</em>.
In this section, we describe <em>convertible codes</em>, which are capable of efficiently changing the erasure code parameters from \([n^I, k^I]\) to \([n^F, k^F]\).
Let us start by showing an example of convertible codes in action.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/convertible_codes_example.png" alt="Code conversion" />
<em>Figure 2: Example of code conversion from a \([7,4]\) MDS code to a \([11,8]\) MDS code.</em></p>
<blockquote>
<p><a name="ex-merge"></a>
<strong>Example 1.</strong> Consider conversion from \([n^I = 7, k^I = 4]\) to \([n^F = 11, k^F = 8]\).
In this example, both codes are systematic (i.e. the data symbols are contained in the codewords), and each box represents a symbol.
Two codewords from the \([7,4]\) code are combined to obtain a single codeword from the \([11,8]\) code.
The first observation we make is that since both codes are systematic, we can simply keep the data symbols where they are (i.e., unchanged) through the conversion (dashed arrows).
Thus, in this case, we only need to define how the new parities are computed.</p>
<p>The default approach to conversion would be to read all of the data symbols \((a_1,\ldots,a_8)\), and use those to compute the new parities \(q_1, q_2, q_3\).
However, it is possible to reduce that cost.
Let the field \(\mathbb{F}\) be the integers modulo 17.
When we define the code, we want to ensure two properties: (1) the initial and final codes are MDS, and (2) the new parities can be computed efficiently.
To ensure this, we build the code from a <em>Vandermonde matrix</em>.
In a Vandermonde matrix, each column is determined by an <em>evaluation point</em> (an element from the field), and row \(i\) corresponds to the evaluation point raised to the power of \(i - 1\).
We can carefully choose the evaluation points to ensure the MDS property holds (it does not suffice to just choose distinct points).
Choosing evaluation points \((\theta_1 = 1, \theta_2 = 2, \theta_3 = 6)\) we have:
\[
\mathbf{P}^F =
\begin{bmatrix}
1 & 1 & 1 \\
\theta_1 & \theta_2 & \theta_3 \\
\theta_1^2 & \theta_2^2 & \theta_3^2 \\
\theta_1^3 & \theta_2^3 & \theta_3^3 \\
\theta_1^4 & \theta_2^4 & \theta_3^4 \\
\theta_1^5 & \theta_2^5 & \theta_3^5 \\
\theta_1^6 & \theta_2^6 & \theta_3^6 \\
\theta_1^7 & \theta_2^7 & \theta_3^7 \\
\end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 6 \\
1 & 4 & 2 \\
1 & 8 & 12 \\
1 & 16 & 4 \\
1 & 15 & 7 \\
1 & 13 & 8 \\
1 & 9 & 14 \\
\end{bmatrix}.
\]
Let \(\mathbf{P}^I\) denote the matrix defined by the first 4 rows of \(\mathbf{P}^F\).
The parities for the initial code are computed using \(\mathbf{P}^I\), and the parities of the final code are computed using \(\mathbf{P}^F\), i.e., the \((p,p^{\prime},q)\) elements in Figure 2 are defined as:
\[
(p_1,p_2,p_3) = (a_1,\ldots,a_4) \mathbf{P}^I \\
(p^{\prime}_1,p^{\prime}_2,p^{\prime}_3) = (a_5,\ldots,a_8) \mathbf{P}^I \\
(q_1,q_2,q_3) = (a_1,\ldots,a_8) \mathbf{P}^F.
\]
It is straightforward (although tedious) to check that the codes defined with these matrices have the MDS property.
During conversion, instead of reading all data we can simply compute the new parities from the old ones by scaling them by the appropriate powers of the chosen evaluation points:
\[
(a_1,\ldots,a_4)\mathbf{P}^I +
(a_5,\ldots,a_8)\mathbf{P}^I
\begin{bmatrix}
\theta_1^4 & 0 & 0 \\
0 & \theta_2^4 & 0 \\
0 & 0 & \theta_3^4 \\
\end{bmatrix} = (a_1,\ldots,a_8)\mathbf{P}^F.
\]
Notice that this is possible due to the Vandermonde structure of the matrices, which allows us to turn \(\mathbf{P}^I\) into the bottom half of \(\mathbf{P}^F\) by scaling each column.
This allows us to compute the final parities by using the existing initial parities, without the need to read the data.</p>
<p>By doing this, we can achieve code conversion by reading (and transferring) just 6 symbols in total.
In comparison, the default approach of read-reencode-write would require reading (and transferring) 8 symbols (i.e., all the original data symbols).</p>
</blockquote>
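The arithmetic in Example 1 can be checked mechanically. The Python sketch below takes the field, evaluation points, and parity matrices directly from the example (the variable names are ours) and verifies that scaling the old parities reproduces exactly what re-encoding from scratch would produce.

```python
# Verify Example 1's merge conversion over GF(17), i.e. integers mod 17.
MOD = 17
THETAS = [1, 2, 6]  # evaluation points from the example

def parity_matrix(rows):
    # Vandermonde parity matrix: row i holds theta_j ** i for each point.
    return [[pow(t, i, MOD) for t in THETAS] for i in range(rows)]

def parities(data, pm):
    # Row vector `data` times matrix `pm`, mod 17.
    return [sum(d * pm[i][j] for i, d in enumerate(data)) % MOD
            for j in range(len(THETAS))]

P_I = parity_matrix(4)  # initial [7, 4] code
P_F = parity_matrix(8)  # final [11, 8] code

data = [3, 14, 1, 5, 9, 2, 6, 10]    # a_1, ..., a_8 (arbitrary message)
p = parities(data[:4], P_I)           # parities of the first initial codeword
pp = parities(data[4:], P_I)          # parities of the second initial codeword

# Conversion: scale the second codeword's parities by theta_j^4 and add,
# instead of re-reading all eight data symbols.
q_converted = [(p[j] + pow(THETAS[j], 4, MOD) * pp[j]) % MOD for j in range(3)]
q_reencoded = parities(data, P_F)     # default read-reencode-write result
assert q_converted == q_reencoded
```

The identity holds for every message, since column \(j\) of \(\mathbf{P}^I\) scaled by \(\theta_j^4\) is exactly the bottom half of column \(j\) of \(\mathbf{P}^F\).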
<h3 id="framework">The convertible codes framework</h3>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/framework_example.png" alt="Diagram of code conversion" />
<em>Figure 3: Abstract depiction of a conversion from an \([n^I,k^I]\) MDS code to a \([n^F,k^F]\) MDS code.</em>
<em>Each box represents a symbol, and the boxes are grouped into codewords.</em>
<em>The top row represents initial codewords and the bottom row represents final codewords.</em>
<em>Some symbols are kept unchanged, and reused in the final codewords (denoted with dashed arrows).</em>
<em>The converter (the node labeled “c”) reads data from some symbols in the initial codewords, and computes the new symbols in the final codewords (denoted with solid arrows).</em></p>
<p>Convertible codes focus on conversions where \(\mathcal{C}^I\) is an \([n^I,k^I]\) code and \(\mathcal{C}^F\) is an \([n^F,k^F]\) code.
In this post, we focus on the case where \(\mathcal{C}^I\) and \(\mathcal{C}^F\) are linear and MDS.
To achieve the change in code dimension from \(k^I\) to \(k^F\) the conversion procedure needs to consider multiple codewords at a time.
Let \(\lambda^I\) be the number of codewords of \(\mathcal{C}^I\) taken as input, and let \(\lambda^F\) be the number of codewords of \(\mathcal{C}^F\) produced as output.
To preserve the amount of data, we must have \(\lambda^I k^I = \lambda^F k^F\).
In particular, we define \(\lambda^I\) and \(\lambda^F\) as the smallest possible integers that satisfy the above equation, i.e.:
\[
\lambda^I \coloneqq \frac{\mathrm{lcm}(k^I,k^F)}{k^I}
\text{ and }
\lambda^F \coloneqq \frac{\mathrm{lcm}(k^I,k^F)}{k^F}.
\]
For example, in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">Example 1</a> we have \(k^I = 4\) and \(k^F = 8\), which means that we consider a total of \(\mathrm{lcm}(k^I,k^F) = 8\) data symbols in total, which at the beginning form \(\lambda^I = 2\) codewords, and at the end form \(\lambda^F = 1\) codeword.</p>
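These \(\lambda\) values are straightforward to compute. A small helper (ours, assuming Python 3.9+ for `math.lcm`):

```python
from math import lcm

def codeword_counts(k_initial, k_final):
    # Smallest integers lambda_i, lambda_f with
    # lambda_i * k_initial == lambda_f * k_final == lcm(k_initial, k_final).
    m = lcm(k_initial, k_final)
    return m // k_initial, m // k_final

assert codeword_counts(4, 8) == (2, 1)    # Example 1: two codewords become one
assert codeword_counts(5, 12) == (12, 5)  # k^I = 5, k^F = 12: lcm is 60
```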
<p>Since multiple codewords are being converted, we also need to specify how data is distributed across different codewords.
This is specified through an <em>initial partition</em> \(\mathcal{P}^I\) and <em>final partition</em> \(\mathcal{P}^F\) of the set \([\mathrm{lcm}(k^I,k^F)]\), which indicate the \(k^I\) data symbols encoded by each initial codeword, and \(k^F\) data symbols encoded by each final codeword.
Let \(\mathbf{m} \in \mathbb{F}^{\mathrm{lcm}(k^I,k^F)}\) be the data to be stored, let \(P \subseteq [\mathrm{lcm}(k^I,k^F)]\) be a subset of indexes, and let \(\mathbf{m}_{P} \in \mathbb{F}^{|P|}\) be the entries of \(\mathbf{m}\) indexed by \(P\).
Then, the set of <em>initial codewords</em> is \(\{\mathcal{C}^I(\mathbf{m}_P) \mid P \in \mathcal{P}^I\}\), and the set of <em>final codewords</em> is \(\{\mathcal{C}^F(\mathbf{m}_P) \mid P \in \mathcal{P}^F\}\).
In the case of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">Example 1</a>, the initial partition is \(\mathcal{P}^I = \{\{1,2,3,4\},\{5,6,7,8\}\}\), and the final partition is \(\mathcal{P}^F = \{\{1,2,3,4,5,6,7,8\}\}\), and thus the initial codewords are \(\{\mathcal{C}^I(a_1,\ldots,a_4), \mathcal{C}^I(a_5,\ldots,a_8)\}\) and the final codeword is \(\mathcal{C}^F(a_1,\ldots,a_8)\).</p>
<p>The <em>conversion procedure</em> takes the initial codewords as input, and outputs the final codewords.
Formally, a convertible code is defined as follows.</p>
<blockquote>
<p><strong>Definition (Convertible code).</strong>
A convertible code over \(\mathbb{F}\) is defined by:</p>
<ol>
<li>a pair of codes \((\mathcal{C}^I, \mathcal{C}^F)\) where \(\mathcal{C}^I\) is an \([n^I, k^I]\) code over \(\mathbb{F}\) and \(\mathcal{C}^F\) is an \([n^F, k^F]\) code over \(\mathbb{F}\);</li>
<li>a pair of partitions \(\mathcal{P}^I, \mathcal{P}^F\) of \([\mathrm{lcm}(k^I, k^F)]\) such that each subset in \(\mathcal{P}^I\) is of size \(k^I\) and each subset in \(\mathcal{P}^F\) is of size \(k^F\); and</li>
<li>a conversion procedure that on input \(\{\mathcal{C}^I(\mathbf{m}_P) \mid P \in \mathcal{P}^I\}\) outputs \(\{\mathcal{C}^F(\mathbf{m}_P) \mid P \in \mathcal{P}^F\}\), for any \(\mathbf{m} \in \mathbb{F}^{\mathrm{lcm}(k^I,k^F)}\).</li>
</ol>
</blockquote>
<h3 id="conversion-procedure">Conversion procedure</h3>
<p>The objective of the conversion procedure is to convert the initial codewords into the final codewords efficiently.
This is modeled with a <em>converter</em> which reads data from some symbols in the initial codewords, and computes new symbols in the final codewords.
As seen in the figure above, not all symbols in the final codewords need to be new; some symbols can be kept unchanged from the initial codewords, which incurs no cost.
Since our objective is to minimize cost, we will focus only on the so-called <em>stable</em> convertible codes, which have \(k^F\) unchanged symbols in each final codeword (which was proven in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi, 2022a</a> to be the maximum possible).</p>
<p>To decide whether a conversion procedure is efficient, we need to measure its cost.
Since each final codeword has exactly \(r^F\) new symbols, the cost of writing the new symbols is fixed, regardless of the conversion procedure.
Therefore, we will focus only on the read costs of conversion.
Two types of cost have been considered in the literature.</p>
<blockquote>
<p><strong>Definition (Access cost).</strong>
The total number of symbols read by the converter.</p>
</blockquote>
<p></p>
<blockquote>
<p><strong>Definition (Conversion bandwidth).</strong>
The total size of the data read by the converter (note that the converter may read only part of a symbol).</p>
</blockquote>
<p></p>
<p>In this post, we will focus only on <em>access cost</em>.</p>
<h3 id="conversion-regimes">Conversion regimes</h3>
<p>To facilitate the study of convertible codes, two special subcases have been identified in the literature.</p>
<blockquote>
<p><strong>Definition (Merge regime).</strong>
Code conversions which merge multiple codewords into a single one, i.e., \(\lambda^I \geq 2\), \(\lambda^F = 1\), and \(k^F = \lambda^I k^I\), with arbitrary \(n^I\) and \(n^F\).</p>
</blockquote>
<p></p>
<blockquote>
<p><strong>Definition (Split regime).</strong>
Code conversions which split a single codeword into multiple ones, i.e., \(\lambda^I = 1\), \(\lambda^F \geq 2\), and \(k^I = \lambda^F k^F\), with arbitrary \(n^I\) and \(n^F\).</p>
</blockquote>
<p></p>
<p>The case where all parameters \((n^I,k^I,n^F,k^F)\) are arbitrary is referred to as the <em>general regime</em>.</p>
<p>The benefit of studying the merge regime and the split regime separately is that in these two subcases one need not worry about defining the partitions \(\mathcal{P}^I\) and \(\mathcal{P}^F\), which simplifies the analysis.
This is because in these two subcases all data gets mapped to the same codeword (either in the initial or final configuration).
Thus, all partitions are equivalent by just relabeling the symbols.</p>
<h2 id="access-opt">Minimizing the access cost of conversion</h2>
<p>The following table shows the known lower bounds for the access cost of conversion.
In this section, we will describe the constructions that achieve each of the non-trivial bounds.</p>
<p><a name="table-access-lb"></a>
<strong>Table.</strong> <em>Summary of known lower bounds on access cost (assuming \(r^F = n^F - k^F \leq \min\{k^I, k^F\}\)).</em></p>
<table><thead><tr><th>Regime</th><th>\(r^I < r^F\)</th><th>\(r^I \geq r^F\)</th></tr></thead><tbody>
<tr><td>Merge</td><td>\( \lambda^I k^I \) <sup class="footnote-reference">(1)</sup></td><td>\( \lambda^I r^F \) <sup class="footnote-reference">(1)</sup></td></tr>
<tr><td>Split</td><td>\( \lambda^F k^F \) <sup class="footnote-reference">(2)</sup></td><td>\( (\lambda^F - 1) k^F + r^F \) <sup class="footnote-reference">(2)</sup></td></tr>
<tr><td>\( k^I = k^F \)</td><td>\( k^I \)</td><td>0</td></tr>
<tr><td>\( k^I \neq k^F\)</td><td>\(\mathrm{lcm}(k^I,k^F)\) <sup class="footnote-reference">(2)</sup></td><td>\(\lambda^I r^F + (\lambda^I \bmod \lambda^F) (k^I - \max\{k^F \bmod k^I, r^F\})\) <sup class="footnote-reference">(2)</sup></td></tr>
</tbody></table>
<p><em><sup class="footnote-reference">(1)</sup>: <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi, 2022a</a>.</em>
<em><sup class="footnote-reference">(2)</sup>: <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2020access">Maturana et al., 2020</a>.</em></p>
<h3 id="merge-regime">Merge regime</h3>
<p>Recall that, in this case, \(\lambda^I\) codewords are merged into a single one.
During conversion, all the data nodes are kept unchanged.
To meet the bound in the table above, the converter can access \(r^F\) symbols from each initial codeword.
As we saw in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">example 1</a>, this is possible by designing the parity matrices in a way that allows the converter to compute the new parities using only the old parities.
This can be done, for example, by using a Vandermonde matrix, although Vandermonde parity matrices are not guaranteed to produce MDS codes.
However, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi (2022a)</a> provide a method for constructing access-optimal codes that are guaranteed to be MDS over large enough field sizes.</p>
<h3 id="split-regime">Split regime</h3>
<p>Achieving the bound in the table above is simple: during conversion, the converter reads the data symbols corresponding to all final codewords except one, along with \(r^F\) initial parity symbols.
Then, the read data symbols are used to compute the corresponding parity symbols, and to remove their interference from the read initial parities.</p>
<blockquote>
<p><strong>Example 2.</strong> <a name="ex-access-split"></a>
Consider the conversion from \([n^I = 11, k^I = 8]\) to \([n^F = 7, k^F = 4]\) over \(\mathrm{GF}(17)\).
Suppose we use the same \(\mathbf{P}^I\) and \(\mathbf{P}^F\) from <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">example 1</a> but swapped.
During conversion, the converter reads \((a_5,\ldots,a_8)\) and the 3 initial parities \((a_1,\ldots,a_8)\mathbf{P}^I\).
The parity symbols of the second final codeword can be computed directly from the data;
the parity symbols of the first final codeword are computed as follows:
\[
(a_1,\ldots,a_8)\mathbf{P}^I -
(a_5,\ldots,a_8)
\begin{bmatrix}
1 & 16 & 4 \\
1 & 15 & 7 \\
1 & 13 & 8 \\
1 & 9 & 14 \\
\end{bmatrix}
=
(a_1,\ldots,a_4)\mathbf{P}^F.
\]
Thus, in total 7 symbols are read, compared to the default approach of reading all 8 data symbols.</p>
</blockquote>
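Example 2's split conversion can be verified the same way. The sketch below uses the matrices from the two examples (initial and final parity matrices swapped, as the example states; variable names are ours) and confirms that subtracting the contribution of the read data symbols from the initial parities yields the first final codeword's parities.

```python
# Verify Example 2's split conversion over GF(17).
MOD = 17
THETAS = [1, 2, 6]

def parity_matrix(rows):
    # Vandermonde parity matrix: row i holds theta_j ** i for each point.
    return [[pow(t, i, MOD) for t in THETAS] for i in range(rows)]

def parities(data, pm):
    # Row vector `data` times matrix `pm`, mod 17.
    return [sum(d * pm[i][j] for i, d in enumerate(data)) % MOD
            for j in range(len(THETAS))]

P_init = parity_matrix(8)   # initial [11, 8] code (Example 1's P^F)
P_final = parity_matrix(4)  # final [7, 4] code (Example 1's P^I)
bottom = P_init[4:]         # rows 5-8: the 4 x 3 matrix shown in the example

data = [3, 14, 1, 5, 9, 2, 6, 10]   # a_1, ..., a_8 (arbitrary message)
init_parities = parities(data, P_init)

# Subtract the read data symbols' (a_5, ..., a_8) interference from the
# initial parities to obtain the first final codeword's parities.
recovered = [(init_parities[j] - parities(data[4:], bottom)[j]) % MOD
             for j in range(3)]
assert recovered == parities(data[:4], P_final)
```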
<h3 id="general-regime">General regime</h3>
<p>In the general regime, partitions need to be specified; <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2020access">Maturana et al., 2020</a> show the optimal way of choosing them.
At a high level, the optimal partition keeps data from the same initial codeword together in the final codeword whenever possible; that way, parity symbols can be used more effectively.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/general_regime_example.png" alt="General regime example" />
<em>Figure 4: Code conversion from \([6,5]\) MDS code to \([13,12]\) MDS code.</em></p>
<blockquote>
<p><strong>Example 3.</strong>
Consider conversion from \([n^I=6, k^I=5]\), to \([n^F=13, k^F=12]\).
Thus, there are a total of \(\mathrm{lcm}(5,12)=60\) data symbols, organized into \(\lambda^I=12\) initial codewords or \(\lambda^F=5\) final codewords.
The parity matrices of the codes are designed as if the final code was \([16,15]\) (which combines 3 codewords into 1).
The conversion procedure splits two of the initial codewords into “intermediate codewords” (which are not materialized, but only used to describe the construction).
Then, two initial codewords are merged along with two data symbols from the intermediate codewords.
The split and merge are executed with the same techniques we showcased for the merge and split regime, and thus only 18 symbols need to be read (marked by a dot in the figure).
Compare this to the default approach of reading all 60 data symbols.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>The code conversion problem adds a new dimension to the design of codes.
This new dimension not only opens a variety of interesting theoretical questions, but has a huge potential for real-world impact in distributed storage systems.
In this post, we only scratched the surface of the code conversion problem:
other work on code conversion has focused on minimizing conversion bandwidth instead of access cost (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022bandwidth">Maturana & Rashmi, 2022</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2023bandwidth">Maturana & Rashmi, 2023</a>) and on codes with better repair properties (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#xia2015tale">Xia et al., 2015</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#wu2022optimal">Wu et al. 2022</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2023locally">Maturana & Rashmi, 2023</a>).
Even when considering these additional works, there are still many open questions in this nascent area of research.</p>
<h2 id="references">References</h2>
The Quantum Physicist's Method of Resource Analysis2023-06-06T00:00:00+00:002023-06-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/quantum-physicists-method/<p>The physicist’s method is a powerful framework for cost analysis that
many a computer scientist will learn at some point in their undergraduate career.
However, its high-level description leaves some practical gaps, especially concerning
how to actually bookkeep its finer details, and these details become important
when trying to build a more explicit accounting framework.
This post explains how to fill in these gaps with
the <em>quantum</em> physicist’s method, a refinement of the physicist’s method
that is robust enough for automatic program analysis, as in
my paper <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>. (Quick disclaimer: There is
no quantum computing in here, despite the name.) To explain the new
bookkeeping devices of the quantum physicist’s method,
this post will first explain the classical physicist’s method
for algorithm analysis, then describe the difficulties it encounters when
adapted to the domain of program analysis, and finally lay out the
solution granted by bookkeeping with the quantum physicist’s method.</p>
<h1 id="the-classical-physicist-s-method">The Classical Physicist’s Method</h1>
<p>To make sense of the physicist’s method (and the later refinements we’ll make to it), it is
good to start by recalling the physical reasoning behind it. Think back to your high school physics class
where you learned about energy. If you drop a 1 kilogram ball from
1 meter above the Earth, and drop an identical ball from the top of a 1 meter high ramp, how do their speeds compare
when they hit the ground?</p>
<p>It might seem like I haven’t given you enough information, but a neat little physical
principle called <em>conservation of energy</em> tells you all you need to know. At the start, both balls
have the same amount<sup class="footnote-reference"><a href="#grav">1</a></sup> of (gravitational) potential energy since they are the same height, same mass, and subject to the same
gravity. And at the end, both balls have none, since their distance from the ground is 0. Because the
total energy is <em>conserved</em>, we know that all that energy must still be around, just in some other form - in this case,
as kinetic energy in the balls’ speeds. So even though both
balls took different routes to the ground, the same energy goes into their speed, and thus the speeds are the same<sup class="footnote-reference"><a href="#speed">2</a></sup>.
Let me emphasize that point: <em>As long as we know energy is conserved,
we can measure expenditure with the difference between starting and ending energy</em>.</p>
<p>Eventually Robert Tarjan and Danny Sleator brought this idea to computer science.
They introduced it to define
<em>amortized</em> algorithm costs (see <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/pdf/10.1137/0606031?casa_token=cR8nppnD8MQAAAAA%3AgK8XhJzUtPvkIVXTHIe299HSRuczuwiYVM74VDBjOMpHDlLcZLIVlziYWpRQMHeuN3lz84b9kIUg&">here</a>).
However, the idea of amortization itself is much older, and comes from the financial industry.
Amortization is used to express a notion of average cost where occasional
spikes of high cost are prepaid over longer periods of time, like paying off a loan
in installments instead of all at once at the due date. However, if we think about this
prepayment as storing extra potential energy for later, the reasoning becomes exactly the
same as reasoning about conservation of energy. Hence, Tarjan and Sleator suggested
calling the approach “the physicist’s method”<sup class="footnote-reference"><a href="#personal">3</a></sup>.</p>
<p>To see how this all comes up in the analysis of algorithms, consider implementing an arbitrarily sized list using
fixed-size arrays<sup class="footnote-reference"><a href="#list">4</a></sup>. In particular, let’s look at the list’s insertion function, and measure its
cost in how many array write accesses it uses. The common case is that list insertion will just be able to directly write a new
element into the next unused slot in the array, for a cost of 1 write. But eventually, the array will be full with no unused slots.
When this happens, our implementation will:</p>
<ol>
<li>allocate a new array double the size of the old one</li>
<li>write all the elements from the old array into the new one</li>
<li>write the new element into the next empty space of the new array</li>
</ol>
<p>If you count them, you’ll find this uncommon case uses a number of array writes equal to the length of the list plus one,
which is a far cry from the common case’s constant cost. Worst-case cost analysis thus makes this algorithm look much more
inefficient than it usually is.</p>
<p>If instead we think through a lens of amortization, we find that insertion is, morally-speaking, a constant cost operation.
Essentially, insertion is cheap enough often enough that prepaying a little extra at each common case
can offset the high cost of the uncommon case. We can see how that looks in the graph below<sup class="footnote-reference"><a href="#graph">5</a></sup>, where the
black spikes of cost
never exceed the red constant-per-step payment.</p>
<p><img src="./amortizedgraph.jpeg" alt="a graph showing a constant-per-step bound over spiky costs" /></p>
<p>To show this formally, we define a suitable <em>potential</em> function \(\Phi\) giving the amount of prepaid potential energy stored in the
program state. Specifically, our desired \(\Phi(A)\) will be equal to twice the number of
filled slots past
the halfway point in the array \(A\).
We can think of this like attaching a 2-charge battery to each individual array cell past the halfway point, so that
we deal with that battery’s energy if and only if we access that array cell.
The amortized cost of an operation \(o\) is then defined as \(\Phi(o(A)) - \Phi(A) + C(A,o)\), which is the difference in potential induced by \(o\) plus \(C(A,o)\) its true cost on the array \(A\).
If we account for this potential energy alongside our normal costs, suddenly the cost profile becomes much smoother:</p>
<ul>
<li>
<p>In the common case, insertion just writes into the next unused slot of \(A\). We still pay the true cost of
\(C(A,\mathsf{insert}) = 1\) for that write, but now we might also need to pay 2 more to “charge the battery”
if the element is past the halfway point in the array. This “battery-charging” is how we pay for the difference in
potential as given by \(\Phi\). In the worst case, the total amortized cost is therefore 3.</p>
</li>
<li>
<p>In the uncommon case, our array \(A\) is now full. Thus, we have stored 2 units of potential with half the elements of \(A\),
which works out to one unit of potential for each element. So, potential can pay for each element’s write into the new array,
with none leftover. The new array itself then has no potential, because it is exactly half full. At this point, stored
potential has exactly covered the cost of all writes and all the state’s potential, accruing a running cost of 0. Finally,
list insertion behaves like its common case again, giving worst-case amortized cost of 3.</p>
</li>
</ul>
<p>Thus, through mediating energy expenditure with potential, we find that
insertion into these array-backed lists takes an amortized constant number of writes. The magic
happened when we prepaid for two extra writes in the common case to “charge the battery”.
Eventually, that prepayment gets spent in the uncommon case to cover the writes into the
new array.</p>
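<p>This accounting is easy to check mechanically. Below is a small Python sketch (names like <code>DynArray</code> are invented for illustration, not taken from any particular library) that instruments the array-backed list and asserts that the amortized cost, i.e. the true cost plus the change in \(\Phi\), never exceeds 3 per insertion:</p>

```python
class DynArray:
    """Array-backed list that counts array writes as its true cost."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.writes = 0  # running true cost: array write accesses

    def potential(self):
        # Phi(A): a 2-charge "battery" on each filled slot past halfway
        return 2 * max(0, self.size - self.capacity // 2)

    def insert(self, x):
        if self.size == self.capacity:
            # uncommon case: double the array and copy every element over
            self.capacity *= 2
            self.writes += self.size
        self.size += 1
        self.writes += 1  # write the new element itself

a = DynArray()
prev_phi, prev_writes = a.potential(), a.writes
for i in range(1000):
    a.insert(i)
    true_cost = a.writes - prev_writes
    amortized = true_cost + a.potential() - prev_phi
    assert amortized <= 3  # the worst-case amortized cost from the analysis
    prev_phi, prev_writes = a.potential(), a.writes
```

<p>Keeping <code>writes</code> as a running counter means each insertion’s true cost can be read off as a difference, mirroring the per-operation cost \(C(A, \mathsf{insert})\).</p>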
<p>Now that you’ve seen an example, we can look at the general case:</p>
<blockquote>
<p>Given:</p>
<ul>
<li>a set of operations \(\mathsf{op} = \mathsf{state} \rightarrow \mathsf{state} \)</li>
<li>a true cost function \(C : \mathsf{state} \times \mathsf{op} \rightarrow \mathbb{R}\)</li>
</ul>
<p>If you can find:</p>
<ul>
<li>a potential function \(\Phi : \mathsf{state} \rightarrow \mathbb{R}_{\geq 0}\) </li>
<li>amortized cost \(a_o\) for each operation \(o\)</li>
</ul>
<p>such that \(\Phi(S) + a_{o_i} \geq \Phi(o_i(S)) + C(S, o_i)\)
for any state \(S\) and operation \(o_i\),</p>
<p>Then for any sequence of \(n\) operations \((o_i)\) and the sequence of states \((S_i)\) that they induce :</p>
<p>\[\sum_{i=0}^{i<n} a_{o_i} + \Phi(S_{0}) - \Phi(S_{n}) \geq \sum_{i=0}^{i<n} C(S_i, o_i)\]</p>
<p>i.e., the total amortized cost plus change in potential covers the total true cost.</p>
</blockquote>
<p></p>
<p>The condition placed on \(\Phi\) and \(a_{o_i}\) is what corresponds to conservation of energy<sup class="footnote-reference"><a href="#technically">6</a></sup>.
The potential in the state \(\Phi(S)\), and the extra energy paid \(a_{o_i}\) are sufficient to
cover the potential stored in the resulting state \(\Phi(o_i(S))\) and the energy
expenditure \(C(S, o_i)\) – no new energy is created. With that condition in place, just like in physics,
we can forget about intermediate states and just focus on the initial and ending states \(S_0\) and \(S_{n}\).
Hence the conclusion of the theorem, that the potential difference between \(\Phi(S_{0})\) and \(\Phi(S_{n})\)
plus all the total supplied extra energy can pay for the total energy expenditure.</p>
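<p>To spell out that final step: summing the conservation condition over steps \(i = 0, \ldots, n-1\) gives</p>

```latex
\sum_{i=0}^{i<n} \Phi(S_i) + \sum_{i=0}^{i<n} a_{o_i}
  \geq \sum_{i=0}^{i<n} \Phi(S_{i+1}) + \sum_{i=0}^{i<n} C(S_i, o_i)
```

<p>and each intermediate potential \(\Phi(S_1), \ldots, \Phi(S_{n-1})\) appears exactly once on both sides, so it cancels, leaving only \(\Phi(S_0)\) on the left and \(\Phi(S_n)\) on the right, which rearranges into the stated bound.</p>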
<p>In the above formalization, you might notice that the form of the potential function \(\Phi\) is left abstract.
The function <em>could</em> be any sort of complicated, non-uniform, ugly function. But it is no coincidence that
the \(\Phi\) we chose in our above example was “nice”. Specifically, this “niceness” amounts to potential being
<em>local</em> – one can think of the state \(S\) as broken up into many pieces (our array cells),
each with their own local amount of potential (our “batteries”).
Then \(\Phi\) just gives the sum of potential stored on these different pieces,
and adjusts the potential on a piece only when that piece is directly operated on.
In fact, this appears to be exactly how Tarjan intended the
bookkeeping for the physicist’s method to be conceptualized:</p>
<blockquote>
<p>In order to keep track of saved or borrowed credits [potential], it is generally convenient to
store them in the data structure. Regions of the structure containing credits are
unusually hard to access or update (the credits saved are there to pay for extra work);
regions containing “debits” are unusually easy to access or update. It is important to
realize that this is only an accounting device; the programs that actually manipulate
the data structure contain no mention of credits or debits.</p>
</blockquote>
<p>– Tarjan in <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/pdf/10.1137/0606031?casa_token=cR8nppnD8MQAAAAA%3AgK8XhJzUtPvkIVXTHIe299HSRuczuwiYVM74VDBjOMpHDlLcZLIVlziYWpRQMHeuN3lz84b9kIUg&"><em>Amortized Computational Complexity</em></a></p>
<p>This local-view of potential has been time-tested, and is basically the only form of potential
you will find in the literature. As such, our goal throughout the rest of this
post will be to keep our definition of potential as local as possible.</p>
<h1 id="building-a-program-analysis">Building a Program Analysis</h1>
<p>To build a program analysis based on the physicist’s method, we first need to
adapt the framework above. This is because some of the assumptions made
above are simply not applicable in our programmatic setting. The differences
are mostly technical, but accounting for them does lead to a slightly
different-looking theorem.</p>
<ol>
<li>
<p>The above framework assumes that operations can be executed in any order.
This makes sense when treating the collection of operations like an
interface – you don’t know what order an external user might call operations, so
your analysis needs to be prepared for anything. However this assumption
is wrong for analyzing a program (like the implementation of such an interface).
The program itself dictates specific sequences of operations, and the
analysis must take this into account to get sensible results<sup class="footnote-reference"><a href="#timesensitive">7</a></sup>.</p>
</li>
<li>
<p>The above framework assumes that extra energy \(a_o\) is
paid out on a per-operation basis.
Again, this makes sense when reasoning about an interface, since an external
user pays for each operation they call. However, when a program executes an operation,
there is no external user to introduce extra energy into the system, so costs
must be paid solely out of the energy supply internal to the program, i.e., the potential
of the state<sup class="footnote-reference"><a href="#pool">8</a></sup>.</p>
</li>
</ol>
<p>After adapting the theorem from the previous section to account for these
differences we are left with something
like the statement below. The main changes are that we consider only certain
sequences of operations, and that we drop amortized costs.</p>
<blockquote>
<p>Given:</p>
<ul>
<li>a set of operations \(\mathsf{op} = \mathsf{state} \rightarrow \mathsf{state} \)</li>
<li>a collection of possible sequences of such operations \(\mathsf{seq}\)</li>
<li>a true cost function \(C : \mathsf{state} \times \mathsf{op} \rightarrow \mathbb{R}\)</li>
</ul>
<p>If you can find:</p>
<ul>
<li>a potential function \(\Phi : \mathsf{state} \rightarrow \mathbb{R}_{\geq 0}\) </li>
</ul>
<p>such that \(\Phi(S_i) \geq \Phi(S_{i+1}) + C(S_i, o_i)\)
across all state sequences induced by \(\mathsf{seq}\)
from any initial state \(S_0\)</p>
<p>Then for any sequence of \(n\) operations \((o_i)\) prefixing \(\mathsf{seq}\)
and the sequence of states \((S_i)\) that they induce:</p>
<p>\[\Phi(S_{0}) - \Phi(S_{n}) \geq \sum_{i=0}^{i<n} C(S_i, o_i)\]</p>
<p>i.e., difference in energy bounds the total cost at every point<sup class="footnote-reference"><a href="#corollary">9</a></sup></p>
</blockquote>
<p></p>
<p>With this framework, our program analysis just needs to find a suitable \(\Phi\).
We are currently only considering a <em>local</em> definition of \(\Phi\), so our
task is really just finding a way of
locally assigning potential
to the parts of each individual data structure at each point in our program.</p>
<p>There might be many ways to find such a local \(\Phi\),
but one simple option is to type the data structures. These
types can then include some annotation indicating how much potential the data structure
stores where, like “list but with 2 units of potential per element”. This tells
you exactly how much potential each piece holds, making it easy to recover a
locally-definable \(\Phi\).</p>
<p>If you run
with this idea, you might eventually get something that looks similar to
the type system called Automatic Amortized Resource Analysis (AARA).
AARA can infer a valid \(\Phi\) through the inference of
potential-carrying types, and is fully automatable (as its name suggests).
See <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/640128.604148">here</a> for AARA’s origin
and <a rel="noopener" target="_blank" href="https://www.raml.co/">here</a> for an up-to-date implementation.</p>
<p>There are also a lot of different ways to approach this problem
apart from AARA. Some approaches are more manual
(like <a rel="noopener" target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-89884-1_19">this</a>
verification framework using separation logic). Some use potential with
other traditional techniques (like <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3408979">this</a>
adaptation of recurrence solving). And some are designed for different
programming environments (like <a rel="noopener" target="_blank" href="https://drops.dagstuhl.de/opus/volltexte/2020/12355/pdf/LIPIcs-FSCD-2020-33.pdf">this</a>
one for client-server interactions). I’m certain there are many more options still,
but the reason I bring up AARA in particular is that,
while all of these approaches <em>could</em> potentially employ the quantum physicist’s method in
the future, AARA is the only one that <em>has</em> (and I’m the one that did it).</p>
<h1 id="trouble-in-paradise">Trouble in Paradise</h1>
<p>This localized-potential approach happens to work rather well in many cases. For instance, AARA
can analyze sorting functions and many list manipulations without issue. Nonetheless, it is not hard to confound this approach.
Consider a simple loading function that populates one of our array-backed lists from one of two other lists.
When called, the load function first executes some code (e.g. <code>shouldComeFromList1</code>) to decide which list the data should
come from, and then inserts it all one element at a time. Here we see what this might look like in pseudo-code<sup class="footnote-reference"><a href="#python">10</a></sup>.</p>
<pre data-lang="python" style="background-color:#393939;color:#dedede;" class="language-python "><code class="language-python" data-lang="python"><span style="color:#fed6af;">def </span><span style="color:#fffd87;">load</span><span>(target, list1, list2):
</span><span> </span><span style="color:#fed6af;">if </span><span>shouldComeFromList1():
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span>list1:
</span><span> target.insert(i)
</span><span> </span><span style="color:#fed6af;">else</span><span>:
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span>list2:
</span><span> target.insert(i)
</span></code></pre>
<p>If we assume that <code>shouldComeFromList1</code> has no array writes, then we only need consider the cost
of insertion. Clearly, only one list’s worth of insertions occurs, and each insertion has an amortized cost of 3,
so \(\Phi\) need only assign 3 units of energy per element to the list selected by <code>shouldComeFromList1</code>.
However, there is in general no way to statically know which list that is – it is <em>undecidable</em>,
even if we had access to the source code for <code>shouldComeFromList1</code>.
This confounds our local method of accounting, since it must store potential in a specific list,
but cannot say which list will end up sourcing the data. We might get around this by having \(\Phi\) yield something
like \(3*\mathsf{max}(|\verb“list1“|, |\verb“list2“|)\) to cover the worst case, but this \(\mathsf{max}\) is not
expressible in a local way - at best, the local approach can overapproximate
\(\mathsf{max}\) with a sum, giving potential of \(3*(|\verb“list1“| + |\verb“list2“|)\), the cost for loading <em>both</em> lists.
And while this bound can only be loose by a constant factor of 2, other examples can loosen the bound to be exponentially worse
(like binary search <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>).</p>
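<p>A quick numeric sanity check (the list lengths here are invented) confirms both that the sum bound is safe and that it carries up to a factor-of-2 slack over the \(\mathsf{max}\) bound:</p>

```python
# Hypothetical lengths; at runtime only one of the two lists gets loaded.
len1, len2 = 1000, 10

tight_bound = 3 * max(len1, len2)  # what nonlocal reasoning would justify
local_bound = 3 * (len1 + len2)    # potential stored on *both* lists

# The sum always covers the max, but can be up to twice as large.
assert tight_bound <= local_bound <= 2 * tight_bound
print(tight_bound, local_bound)  # → 3000 3030
```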
<p>At this point, you might think the bound looseness
is just some weakness on <em>the analysis’s</em> end, where
presumably <em>some</em> localization of the tightest potential exists, but the analysis just can’t figure it out.
However, the situation is actually worse:
We can create an example where <em>no</em> tight localization suffices, even while nonlocal reasoning
makes a tight solution obvious<sup class="footnote-reference"><a href="#Bell">11</a></sup>.</p>
<p>This problem happens especially when measuring the cost of a resource like memory,
since memory is returned after use<sup class="footnote-reference"><a href="#neg">12</a></sup> and can be reused. When a resource is returned, it is as
if additional payment is provided midway through the computation. This <em>could</em> lessen the amount
of resources needed upfront, but only if those resources aren’t needed prior to when the extra
resources are returned. In effect, the cost of resources like memory is measured with
their <em>peak</em> cost, the high water mark of the number of resources in use at one time.
These resources are therefore a bit more complicated than resources that only tick down, like
time. This makes it easy to create a situation with no tight localization of potential, like that below.</p>
<p>To see this problem in action, imagine we have a list of data, and two different
data processing procedures <code>process1</code> and <code>process2</code>. To compare the results of
these procedures, we might write the code below.
How should we account for the <em>memory</em> cost of the comparison, if each of <code>copy</code><sup class="footnote-reference"><a href="#copy">13</a></sup>,
<code>process1</code>, and <code>process2</code> temporarily
use one unit of memory per element in the list?</p>
<pre data-lang="python" style="background-color:#393939;color:#dedede;" class="language-python "><code class="language-python" data-lang="python"><span style="color:#fed6af;">def </span><span style="color:#fffd87;">copy</span><span>(list):
</span><span> ret </span><span style="color:#ececec;">= </span><span>emptyList()
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span style="color:#fffb9d;">list</span><span>:
</span><span> ret.insert(i)
</span><span> </span><span style="color:#fed6af;">return </span><span>ret
</span><span>
</span><span style="color:#fed6af;">def </span><span style="color:#fffd87;">processBoth</span><span>(data):
</span><span> dataCopy </span><span style="color:#ececec;">= </span><span>copy(data)
</span><span> </span><span style="color:#fed6af;">return </span><span>(process1(data), process2(dataCopy))
</span></code></pre>
<p>It seems obvious from the outset that whatever memory <code>copy</code> uses can be
reused for <code>process1</code>, and that <code>process1</code>’s memory in turn can be reused for <code>process2</code>,
since all act on lists of equal length. So, we should only need to allocate \(|\verb“data“|\)
memory units. However, if that is all the memory we have,
accounting for it locally is impossible.</p>
<p>To follow the accounting, let’s step through a call to <code>processBoth</code>. We start with the only
data structure being our input <code>data</code>, so it must contain all the potential.
We proceed to copy <code>data</code> to ready it for each of the processing functions.
This copying procedure temporarily uses all the \(|\verb“data“|\) memory units,
leaving some amount stored on <code>data</code> and some amount stored on <code>dataCopy</code> when
the memory is returned.
Then <code>process1</code> is applied to <code>data</code>, requiring all of
the \(|\verb“data“|\) memory units. Now, because <code>process1</code> doesn’t touch
<code>dataCopy</code>, <code>process1</code> cannot use any of the potential in <code>dataCopy</code>
– this means <code>data</code>
needs to have received all the potential, and none is stored on <code>dataCopy</code>. However,
this is followed by applying <code>process2</code> to <code>dataCopy</code>, which results in mirrored accounting for
potential: all potential should have been returned to <code>dataCopy</code>, with none stored in <code>data</code>!
While we intuitively know that this could be solved by having <code>process1</code> return
potential to <code>dataCopy</code>, there is never a time where <code>process1</code> and <code>dataCopy</code>
are local to the same operation.
Thus, no local allocation of \(|\verb“data“|\) potential suffices.
Just like before, the local
approach can only manage to overapproximate this example by a factor of 2, and can
be exponentially worse in other examples.</p>
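<p>This peak-cost behavior can be made concrete with a toy meter (a sketch only; the helper names are invented, and <code>copy</code>, <code>process1</code>, and <code>process2</code> are abstracted as phases that each borrow and return \(n\) memory units):</p>

```python
class Meter:
    """Tracks currently live memory units and their high-water mark."""

    def __init__(self):
        self.live = 0
        self.peak = 0

    def alloc(self, n):
        self.live += n
        self.peak = max(self.peak, self.live)

    def free(self, n):
        self.live -= n

def phase(m, n):
    # each of copy, process1, process2 temporarily uses n units
    m.alloc(n)
    m.free(n)

def process_both_peak(n):
    m = Meter()
    phase(m, n)  # copy data into dataCopy
    phase(m, n)  # process1 on data, reusing the memory copy returned
    phase(m, n)  # process2 on dataCopy, reusing process1's memory
    return m.peak

# The peak is n, not 3n: each phase reuses what the previous one returned.
print(process_both_peak(100))  # → 100
```

<p>Measuring the peak rather than the total allocation is exactly what separates reusable resources like memory from resources that only tick down, like time.</p>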
<h1 id="the-quantum-physicist-s-method">The Quantum Physicist’s Method</h1>
<p>So far, our situation is rather unfortunate. We have this beautiful framework
from algorithm analysis, but when we naively adapt it to a program analysis we
must sacrifice either the efficacy of the result or the beauty of locality.
However, there is a solution: bookkeeping using the <em>quantum</em> physicist’s method.
To keep things intelligible to non-physicists, this section will focus on
the actual execution of the method, while any quantum
physical parallels that come up will be kept
contained in the footnotes.</p>
<p>The idea behind the quantum physicist’s method is to introduce
the accounting device of “worldviews”. Each individual worldview
\(\phi_j : \mathsf{state} \rightarrow \mathbb{R} \) is
just a normal local accounting of potential like our previous \(\Phi\),
though with the added caveat that they are
allowed to locally assign <em>negative</em> amounts of potential under special
conditions<sup class="footnote-reference"><a href="#detail">14</a></sup>.</p>
<p>Formally, the collection of worldviews satisfies the following
properties for all state sequences induced by \(\mathsf{seq}\)
from any initial state \(S_0\)</p>
<ol>
<li>
<p>\(\forall j. \hspace{4pt} \phi_j(S_i) \geq \phi_j(S_{i+1}) + C(S_i, o_i)\),
i.e., every worldview pays out the usual costs</p>
</li>
<li>
<p>\(\exists j. \hspace{4pt} \forall T\subseteq S_i. \hspace{4pt} \phi_j(T) \geq 0 \),
i.e., some worldview is classically valid, wherein potential is non-negative
everywhere<sup class="footnote-reference"><a href="#whole">15</a></sup></p>
</li>
</ol>
<p>Given these properties, one can prove the following key theorem:</p>
<blockquote>
<p>Theorem: \(\max_j \phi_j\) is a suitable definition
of \(\Phi\) for the classical physicist’s method</p>
</blockquote>
<p></p>
<p>Indeed, the first property meets the bulk of the requirements for a valid
potential function, and the second property ensures that the max potential
is always classically valid.</p>
<p>You might at this point wonder what this new way of finding a potential
function buys us. The answer is that this simple way of combining our
familiar local accounts of potential introduces some powerful <em>nonlocal</em>
flexibility. By allowing different worldviews to tactically “go into debt”,
this method can infer tighter cost bounds than naive local reasoning can usually
supply.</p>
<p>To better understand how the mechanics of these worldviews actually work,
it might help to walk through a situation without so much technical cruft:
Suppose that
Alice and Bob get $5 to share from their parents to spend on candy in a candy store. Alice wants a $3
pack of caramels and Bob wants a $2 chocolate bar. However, Alice’s caramels are in a
vending machine that only takes $5 bills. If Alice keeps $5 to herself, then Bob can’t buy his candy.
But if Bob keeps $2 to himself, then Alice can’t use the vending machine for her candy.
So, what do Alice and Bob do?</p>
<p>The answer is quite simple: Alice buys her caramels with all the money, gets the change back,
and then gives it to Bob –
they both can then get what they want with no extra money needed.
I’m sure I have done the same with my brother plenty of times growing up.
We can bookkeep this the following way:</p>
<table><thead><tr><th align="center"></th><th align="center">start</th><th>Alice buys</th><th>get change</th><th>transfer</th><th>Bob buys</th></tr></thead><tbody>
<tr><td align="center">Alice/Bob money split</td><td align="center">5/0</td><td>0/0</td><td>2/0</td><td>0/2</td><td>0/0</td></tr>
</tbody></table>
<p>And this is exactly what we want, with one small caveat:
The “transfer” operation is actually quite nontrivial to work with. Only
highly specialized programming languages will even have constructs for <em>mentioning</em>
potential, and those that do will be burdened (or burden the programmer) with
figuring out how such constructs can be soundly used. But, by using worldviews for
bookkeeping, this whole problem can be bypassed entirely. We provide such an
account below:</p>
<table><thead><tr><th align="center"></th><th align="center">start</th><th>Alice buys</th><th>get change</th><th>Bob buys</th></tr></thead><tbody>
<tr><td align="center">worldview 1 Alice/Bob money split</td><td align="center">5/0</td><td>0/0</td><td>2/0</td><td>2/-2</td></tr>
<tr><td align="center">worldview 2 Alice/Bob money split</td><td align="center">3/2</td><td>-2/2</td><td>0/2</td><td>0/0</td></tr>
</tbody></table>
<p>With this worldview accounting<sup class="footnote-reference"><a href="#qt">16</a></sup>, we pay the exact same amount out of the same place at each step.
The only difference between the two worldviews is that worldview 1 starts in the allocation of money needed
for Alice to buy her candy, and worldview 2 starts in the allocation needed for Bob to buy his. Then,
we find that the problematic “transfer” occurs where different worldviews become classically valid –
we see that happen at “get change”, since worldview 1 is classically valid at “Alice buys”, and
worldview 2 is classically valid at “Bob buys”. This pattern will hold in general, allowing transfers
to be coded completely implicitly into our analysis.</p>
<p>Using worldviews like this, we can solve both of the problems from the previous section:</p>
<ul>
<li>
<p>To solve the first – the data loading problem – simply start with 2 worldviews: one where <code>list1</code> carries all potential,
and one where <code>list2</code> does. No matter which list pays, there will then be a classically valid worldview.</p>
</li>
<li>
<p>To solve the second – the data processing problem – start with 2 worldviews assigning <code>data</code> all the potential. Then upon copying,
let the worldviews diverge – one leaves all the potential on <code>data</code>, and one moves it all
to <code>dataCopy</code>. The former can be the classically valid one while applying <code>process1</code>, and the latter when applying <code>process2</code>.</p>
</li>
</ul>
<p>In either case the max amount of potential across the worldviews is exactly the tight amount of potential
we wanted assigned.</p>
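<p>The mechanics of this worldview table can be sketched in a few lines of Python (this models only the bookkeeping device, not AARA itself; dollar balances stand in for local potential, and may go negative so long as some worldview stays non-negative):</p>

```python
def pay(worldviews, who, amount):
    """Charge `amount` against `who`'s local balance in every worldview."""
    for w in worldviews:
        w[who] -= amount

def refund(worldviews, who, amount):
    for w in worldviews:
        w[who] += amount

def check(worldviews):
    # Property 2: some worldview must be classically valid,
    # i.e., non-negative everywhere.
    assert any(all(v >= 0 for v in w.values()) for w in worldviews)

# Worldview 1 budgets for Alice's vending machine; worldview 2 for the split.
worldviews = [{"alice": 5, "bob": 0}, {"alice": 3, "bob": 2}]
check(worldviews)

pay(worldviews, "alice", 5)     # Alice feeds $5 into the vending machine
check(worldviews)               # worldview 1 is the valid one here
refund(worldviews, "alice", 2)  # the machine returns $2 in change
check(worldviews)
pay(worldviews, "bob", 2)       # Bob buys his chocolate bar
check(worldviews)               # worldview 2 is the valid one here

# The classical potential is the max total over worldviews; both end at 0.
print(max(sum(w.values()) for w in worldviews))  # → 0
```

<p>Note that money never explicitly moves between Alice and Bob in any single worldview; the “transfer” is implicit in which worldview happens to be classically valid at each step.</p>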
<p>And so, with worldviews in hand, we can salvage the niceness of locality by wrapping a bunch of local accountings
together and letting them make each other more flexible. From such an accounting we can
reconstruct a potential function that satisfies the standard framework for
amortized analysis, just as our new theorem ensures. This
leaves us with a program analysis built off the physicist’s method that can give many tighter
bounds than its predecessors.</p>
<h1 id="wrap-up">Wrap Up</h1>
<p>If you are interested in seeing such an analysis in action,
I’ll point you again to my work extending AARA <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>.
My paper adds the quantum physicist’s method along with some special infrastructure called <em>remainder contexts</em>, and then
uses its new capabilities to be able to automatically reason about memory usage and tree depth. The work also
comes with an implementation, a description of how it was designed, and tables of experiments run with it on
the OCaml standard library <code>Set</code> module. The implementation never gave worse cost bounds than the local approach, and often
gave much better ones. You can check it out and see for yourself!</p>
<div class="footnote-definition" id="grav"><sup class="footnote-definition-label">1</sup>
<p>Specifically, they both have \(1\mathsf{kg} * 9.81\frac{\mathsf{m}}{\mathsf{s}^2} * 1 \mathsf{m} = 9.81 \mathsf{J}\) of energy.</p>
</div>
<div class="footnote-definition" id="speed"><sup class="footnote-definition-label">2</sup>
<p>Specifically, solving for
\(v\) in the conversion between energy and speed
\(9.81\mathsf{J} = \frac 1 2 * 1\mathsf{kg} * (v \frac{\mathsf{m}}{\mathsf{s}})^2 \) gives
\( 4.43\frac{\mathsf{m}}{\mathsf{s}}\) as the speed of the balls at ground level.</p>
</div>
<div class="footnote-definition" id="personal"><sup class="footnote-definition-label">3</sup>
<p>I personally found this analogy with physical reasoning very useful to my
understanding when I was learning about algorithm analysis in undergrad. I’m sure many students feel the same.</p>
</div>
<div class="footnote-definition" id="list"><sup class="footnote-definition-label">4</sup>
<p>This list would be the data structure called a <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Dynamic_array">dynamic array</a>.
It is the <a rel="noopener" target="_blank" href="https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html"><code>ArrayList</code> in Java</a>
and the <a rel="noopener" target="_blank" href="https://www.cplusplus.com/reference/vector/vector/"><code>vector</code> in C++</a>, and probably underlies a lot of other list implementations too.</p>
</div>
<div class="footnote-definition" id="graph"><sup class="footnote-definition-label">5</sup>
<p>Taken from <a rel="noopener" target="_blank" href="https://stackoverflow.com/questions/200384/constant-amortized-time">here</a>.</p>
</div>
<div class="footnote-definition" id="technically"><sup class="footnote-definition-label">6</sup>
<p>Technically speaking, it only corresponds to the non-creation of energy, since we are interested
in an upper-bound on cost. Energy conservation means both non-creation and
non-loss of energy. Adapting the amortized analysis framework to non-loss would result in a lower-bound on cost.</p>
</div>
<div class="footnote-definition" id="timesensitive"><sup class="footnote-definition-label">7</sup>
<p>To help with this order-sensitivity, we will also from
now on consider the program state to have some notion of where it lies in
execution, like a program counter. However, this is just a technical point to
allow \(\Phi\) the flexibility to leverage operation order, and its exact
implementation is not important.</p>
</div>
<div class="footnote-definition" id="pool"><sup class="footnote-definition-label">8</sup>
<p>One might consider that external energy could be introduced at the
very start when a user calls on the program to execute. However, we will just
streamline this initial
payment by treating it as part of the energy assigned
to the initial program state.</p>
</div>
<div class="footnote-definition" id="corollary"><sup class="footnote-definition-label">9</sup>
<p>As a corollary, since the amortized cost payments are gone,
we also find that the potential of the initial
state bounds the peak cost. This is more useful to measure resources like memory.</p>
</div>
<div class="footnote-definition" id="python"><sup class="footnote-definition-label">10</sup>
<p>By pseudo-code I mean Python.</p>
</div>
<div class="footnote-definition" id="Bell"><sup class="footnote-definition-label">11</sup>
<p>For those with a physics background, you might consider this our version of a <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Bell_test">Bell test</a>.
In physics, this is a case proving that <em>local realism</em> is incompatible with
quantum mechanics; in our setting, it is a case proving that
purely local potential is insufficient for a tight cost analysis.</p>
</div>
<div class="footnote-definition" id="neg"><sup class="footnote-definition-label">12</sup>
<p>This return of energy is modeled in our framework simply by letting \(C\) return negative costs.</p>
</div>
<div class="footnote-definition" id="copy"><sup class="footnote-definition-label">13</sup>
<p>Copying is only really needed in this code if <code>process1</code> or <code>process2</code> might mutate the underlying list. However,
the pertinent features of this code pattern also arise in side-effect-free settings during, e.g., tree traversal.
See <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>.</p>
</div>
<div class="footnote-definition" id="overpay"><sup class="footnote-definition-label">17</sup>
<p>Well, technically a worldview could choose to overpay for the cost too.</p>
</div>
<div class="footnote-definition" id="detail"><sup class="footnote-definition-label">14</sup>
<p>This sets up our worldviews to begin looking somewhat like
states in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Quantum_superposition">quantum superposition</a>.
Both are collections of simultaneous classical-looking states, just with negative
values allowed where they usually wouldn’t be. In quantum physics, those values are
probabilities; in our setting, they are potentials.</p>
</div>
<div class="footnote-definition" id="whole"><sup class="footnote-definition-label">15</sup>
<p>While only a technical point here, the consequences of
allowing the accumulation of negative potential in some parts of the
program state does
provide another quantum physical parallel. Two famous no-go theorems of
quantum physics, <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/No-cloning_theorem"><em>no-cloning</em></a>
and <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/No-deleting_theorem"><em>no-deleting</em></a>, mean
that a quantum state cannot simply duplicate or delete one of its pieces. These
same principles are relevant to the program states of the quantum physicist’s method: we cannot
simply duplicate potential when copying a data structure, nor may we simply lose potential
when deleting/ignoring a data structure. Either case could
introduce extra potential, when positive amounts are duplicated or negative amounts
are lost, which would violate conservation.</p>
</div>
<div class="footnote-definition" id="qt"><sup class="footnote-definition-label">16</sup>
<p>We call this particular way of accounting for how to get around the barrier of the vending machine
“resource tunneling”, because it is analogous to
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Quantum_tunnelling">quantum tunneling</a>
around a potential barrier. In quantum physics, this occurs because a
particle’s position (or energy, depending on what you measure) is in a
superposition of many states, a small portion of which allow being on the
other side of the potential barrier; in our setting, it is because potential
is tracked through the collection of worldviews, at least one of which is
sufficient to pay for the potential needed. In either case, there may be no
one state of the collection that can explain the tunneling; no state that,
if tracked individually from the start, could pass the potential barrier.</p>
</div>
<h1>Robustness between the worst and average case</h1>
<p><em>Published 2023-04-21 · <a href="https://www.cs.cmu.edu/~csd-phd-blog/2023/intermediate-robustness/">https://www.cs.cmu.edu/~csd-phd-blog/2023/intermediate-robustness/</a></em></p>
<p>As machine learning systems become increasingly deployed in safety-critical applications, such as autonomous driving and healthcare, we need to ensure these systems are reliable and trustworthy. For example, we might wish to determine whether a car’s camera-based autopilot system can correctly classify the color of a traffic light even in the presence of severe weather conditions, such as snow. Consider that the average snowy day looks something like the following:</p>
<img src="./snow1.png" width="400">
<p>Overall, the visibility is not too bad, and we can guess that these weather conditions do not present too much of an issue for the car’s autopilot system. However, every once in a while, we might get a snowy day that looks more like this:</p>
<img src="./snow2.png" width="400">
<p>The visibility is much worse in this scenario, and these conditions might be more difficult for the car’s autopilot system to safely navigate. However, the traffic light color, as well as most of the objects on the road, can still be identified, and we would hope that the autopilot would be able to operate correctly in these conditions. Finally, very rarely, we might have a snow squall like the following: </p>
<img src="./snow3.png" width="400">
<p>These conditions are so extreme that a human driver would probably need to pull over to the side of the road rather than attempt to drive with so little visibility. Therefore, we probably should not require the autopilot system to be robust to such conditions. Now we ask the question: how should we evaluate the robustness of the car’s autopilot to severe weather conditions? </p>
<p>Existing methods for evaluating the robustness of a machine learning model to perturbed inputs (e.g. images that have been corrupted due to severe weather) are largely based on one of two notions. Average-case robustness measures the model’s average performance on randomly sampled perturbations. In the autopilot example, for instance, we could randomly sample a set of images from all days with recorded snow precipitation, and measure the average accuracy of the traffic light detection on those days. If most of those samples look like the first image shown above, we should expect the system’s average robustness to be pretty good. This notion of robustness, however, doesn’t tell us much about how our autopilot system will operate in more extreme conditions, as depicted in the second and third images. </p>
<p>Alternatively, worst-case robustness, or adversarial robustness, measures the model’s worst-case performance across all possible perturbations. For example, the worst-case performance of the autopilot system might be the result of navigating in the conditions depicted by the third image, displaying the snow squall. In this case, we should expect the system’s worst-case robustness to be pretty bad. But as we mentioned previously, we may not care so much if the system is able to navigate the worst-case conditions shown in the third image. </p>
<p>So then, how do we best measure the robustness of the system to conditions like those shown in the second image, i.e. conditions worse than average, but not the worst possible conditions? In this blog post, we present an alternative method for evaluating the test-time performance of machine learning models that measures robustness <em>between</em> the worst and average case, or <em>intermediate</em> robustness. </p>
<h2 id="a-simple-example-robustness-to-gaussian-noise">A simple example: robustness to Gaussian noise</h2>
<p>To further motivate the notion of intermediate robustness, consider the simple scenario in which we are interested in evaluating the robustness of an image classification model to Gaussian noise applied to the input images. The image below is a sample from the ImageNet validation dataset, which an image classifier trained on the ImageNet training dataset correctly classifies as “pizza”. </p>
<img src="./pizza1.png" width="500">
<p>Given ten thousand random samples of Gaussian noise, the model classifies 97% of these noised images correctly, including the image below. Since the model correctly classifies a large majority of the randomly perturbed images, evaluating according to average-case robustness will place most weight on “easy” noise samples like this image.</p>
<img src="./pizza2.png" width="500">
<p>The image below shows an example of a randomly noised image that the model incorrectly classifies as “soup bowl”. Evaluating according to average-case robustness will not place much weight on these samples that are harder for the model to classify correctly. </p>
<img src="./pizza3.png" width="500">
<p>What if we want to evaluate the model’s robustness on a stricter level than average-case robustness? Evaluating the image classifier according to worst-case robustness doesn’t make much sense for this particular noise distribution, because the worst-case noise could be an arbitrarily large amount of noise with extremely low probability due to the Gaussian distribution being unbounded. A more practical evaluation of robustness would consider something stricter than simply average performance on random perturbations, but not quite as strict as adversarial robustness, which is exactly what our intermediate robustness metric enables.</p>
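To make the average-case notion concrete, here is a minimal sketch of how one might estimate a model's average-case robustness under Gaussian noise. The "classifier", the input, and the noise scale are all illustrative stand-ins, not the actual ImageNet model from the example:

```python
import numpy as np

def average_case_accuracy(predict, x, y, sigma=0.1, n_samples=1000, seed=0):
    """Estimate average-case robustness: accuracy over random Gaussian noise."""
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(n_samples):
        noise = rng.normal(0.0, sigma, size=x.shape)
        correct += int(predict(x + noise) == y)
    return correct / n_samples

# Toy stand-in "classifier": predict label 1 if the mean pixel exceeds 0.5.
predict = lambda x: int(x.mean() > 0.5)
x = np.full(16, 0.6)  # stand-in for an image whose true label is 1
acc = average_case_accuracy(predict, x, y=1)
```

Note that this estimator is all about averages: a rare, catastrophic noise sample barely moves `acc`, which is exactly the limitation motivating the metric in the next section.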
<h2 id="an-intermediate-robustness-metric">An intermediate robustness metric</h2>
<p>We’ll now go into the details of how we formulate an intermediate robustness metric. We start by observing that we can naturally generalize average-case and worst-case robustness under one framework. To show this, we will make use of the mathematical definition of an \( L^p \)-norm of a function \( f \) on a measure space \( X \): \( ||f||_p = (\int_X |f|^p )^{(1/p)} \). However, to differentiate this from the use of \( \ell_p \)-norm balls as perturbation sets in adversarial robustness, we will from here on refer to this definition as the \( q \)-norm of a function. Ultimately, average and worst-case robustness can be generalized by taking the \( q \)-norm of the loss function over the perturbation distribution, where the loss just measures how well the model performs on the perturbed data. The setting of \( q=1 \) results in average-case robustness, whereas the setting of \( q = \infty \) results in worst-case robustness, because by definition the \( L^\infty \)-norm is given by the pointwise maximum of a function. Then, any value of \( q \) between \( 1 \) and \( \infty \) results in <em>intermediate</em> robustness. This is more formally written below:</p>
<blockquote>
<p>Define a neural network \( h \) with parameters \( \theta \), and a loss function \( \ell \) that measures how different the model predictions are from the true label \( y \) given an input \( x \). Consider we are interested in measuring the robustness of this model to perturbations \( \delta \) from some perturbation distribution with density \( \mu \). Now consider the following expectation over the \( q \)-norm of the loss according to this perturbation density,
$$ \mathbf{E}_{x, y \sim \mathcal{D}} \Big[ ||\ell(h_\theta(x+\delta), y)||_{\mu, q} \Big], $$
where the \( q \)-norm of the loss with perturbation density \( \mu \) is the following:
$$ ||\ell(h_\theta(x+\delta), y)||_{\mu, q} = \mathbf{E}_{\delta \sim \mu} [|\ell(h_\theta(x+\delta), y)|^q]^{1/q} = \Big( \int |\ell(h_\theta(x+\delta), y)|^q \mu(\delta)d\delta \Big)^{1/q}.$$
Assuming a smooth, non-negative loss function, this expectation corresponds to the expected loss on random perturbations (average-case) when \( q = 1 \),
$$ || \ell(h_\theta(x+\delta), y) ||_{\mu, 1} = \mathbf{E}_{\delta \sim \mu} [\ell(h_\theta(x+\delta), y)], $$
and the expected maximum loss over all possible perturbations (worst-case) when \( q = \infty \),
$$ || \ell(h_\theta(x+\delta), y) ||_{\mu, \infty} = \text{max}_{\delta \in \text{Supp}(\mu)}[\ell(h_\theta(x+\delta), y)].$$</p>
</blockquote>
<p> </p>
<p>Intuitively, as we increase \( q \), more emphasis will be placed on high loss values, as the losses become more strongly “peaked” due to the exponent of \( q \). And so by increasing \(q \) from \( 1 \) to \( \infty \), we enable a full spectrum of intermediate robustness measurements that are increasingly strict by placing more weight on high loss values. This formulation allows us to evaluate a model’s robustness in a wide range between the two extremes of average and worst case performance.</p>
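The finite-sample version of this quantity is straightforward: given loss values on sampled perturbations, take the empirical \( q \)-norm of the batch. A minimal sketch (the loss values are made up purely to show how increasing \( q \) shifts weight toward the largest loss):

```python
import numpy as np

def q_norm(losses, q):
    """Empirical q-norm of a batch of loss values: (mean |l|^q)^(1/q).
    q=1 recovers the average loss; large q approaches the maximum loss."""
    losses = np.asarray(losses, dtype=float)
    return np.mean(np.abs(losses) ** q) ** (1.0 / q)

losses = [0.1, 0.2, 0.3, 5.0]
q_norm(losses, 1)    # the plain average, 1.4
q_norm(losses, 100)  # ≈ 4.93, close to the maximum loss of 5.0
```

As \( q \) grows, the single large loss of 5.0 dominates the norm, so one rare bad perturbation moves the metric in a way it cannot move the average.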
<h2 id="approximating-the-metric-using-path-sampling">Approximating the metric using path sampling</h2>
<p>In most cases, we have to approximate the metric we just defined, which cannot be calculated exactly because it requires computing a high-dimensional integral over the perturbation space. Ultimately, we approximate the integral using the path sampling method [Gelman and Meng, 1998], but to motivate why this is important, we’ll first give an example of a naive, yet suboptimal, way of estimating the integral.</p>
<h3 id="monte-carlo-estimator">Monte Carlo estimator</h3>
<p>For illustration purposes, let’s consider approximating the integral \( \int_a^b f(x)^q \mu(x)dx \) for an arbitrary function \( f \) and probability density \( \mu \). Recall that the integral of a function can be interpreted as the area below the function’s curve. We could pick a random sample \( x \), evaluate the function \( f(x)^q \) at \( x \), and multiply by \( (b-a) \) to estimate the area. Using just one sample, this will likely underestimate or overestimate the area; but if we instead average the estimates from many samples, the result should eventually converge to something close to the desired integral. This is known as the Monte Carlo estimator, and can be visualized in the plot below for the function \( f(x)^q \) with \( q = 1 \).</p>
<img src="./integral1.png" width="400">
<p>Now let’s see what this plot looks like for \( q=2 \). We see that values of \( x \) for which the value \( f(x) \) is large make a larger contribution to the integral. However, because these values of \( x \) have a lower probability of being sampled, random sampling places a disproportionate amount of weight on estimates from \( x \) with lower values of \( f(x) \).</p>
<img src="./integral2.png" width="400">
<p>As we continue to increase the value of \( q \), as shown in the plot below for \( q=3 \), we can see that Monte Carlo sampling will be increasingly insufficient to approximate this integral well.</p>
<img src="./integral3.png" width="400">
<p>Translating this back to our integral of interest, when the perturbation density is concentrated in a region with low loss values, the Monte Carlo estimator will be less capable of producing an accurate approximation of the integral when we want to evaluate intermediate robustness for larger values of \( q \).</p>
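This failure mode can be reproduced numerically. The sketch below uses a hypothetical integrand \( f(x) = e^x \) with \( \mu = \mathcal{N}(0,1) \), chosen only because the exact value \( \mathbf{E}[e^{qX}] = e^{q^2/2} \) is known in closed form, so the estimator's behavior can be checked directly:

```python
import numpy as np

def mc_estimate(f, sample_mu, q, n, seed=0):
    """Plain Monte Carlo estimate of E_mu[f(x)^q].
    Unbiased, but its variance explodes as q grows, because most of the
    integral mass comes from rarely sampled x with large f(x)."""
    rng = np.random.default_rng(seed)
    x = sample_mu(rng, n)
    return np.mean(f(x) ** q)

f = np.exp
sample_mu = lambda rng, n: rng.normal(0.0, 1.0, size=n)

est_q1 = mc_estimate(f, sample_mu, q=1, n=100_000)  # exact value: e^0.5 ≈ 1.65
est_q5 = mc_estimate(f, sample_mu, q=5, n=100_000)  # exact value: e^12.5 ≈ 2.7e5,
# which the plain estimate typically falls far short of
```

For \( q=1 \) the estimate is reliably close to the truth; for \( q=5 \) the integral is dominated by samples several standard deviations into the tail, which \( 10^5 \) draws almost never hit.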
<h3 id="path-sampling-estimator">Path sampling estimator</h3>
<p>To better approximate the integral for large values of \( q \), we need to more frequently sample perturbations that contribute most to the integral (i.e., those that result in higher loss values). Path sampling is one method that boosts the frequency of these more “important” samples by sampling from a “path” of alternative densities that encourages samples where the integrand is large. </p>
<p>The path sampling estimator of the intermediate robustness metric ultimately takes the form of the geometric mean of the losses given the sampled perturbations from these alternative densities, which are annealed to follow an increasingly “peaked” distribution. Practically, these samples can be drawn using Markov chain Monte Carlo (MCMC) methods. The path sampling estimator is written more formally below:</p>
<blockquote>
<p>Consider the following class of densities,
$$ p(\delta|t) \propto \ell(h_\theta(x+\delta),y)^t \mu(\delta),$$
and construct a path by interpolating \( t^{(i)} \) between 0 and \( q \) and sampling a perturbation \( \delta^{(i)} \) from \( p(\delta|t^{(i)}) \) using MCMC. Then, the path sampling estimator of the intermediate robustness metric is the following geometric mean,
$$ \hat{Z}_\text{Path sampling} := \Big( \prod_{i=1}^m \ell \big( h_\theta(x+\delta^{(i)}), y \big) \Big)^{1/m}.$$</p>
</blockquote>
<h2 id="evaluating-the-intermediate-robustness-of-an-image-classifier">Evaluating the intermediate robustness of an image classifier</h2>
<p>Now that we have introduced a metric for evaluating the intermediate robustness of a model, along with methods for approximating this metric, let’s evaluate the performance of a model at different robustness levels. We’ll see that the intermediate robustness metric interpolates between measurements of average and the worst-case robustness, providing a multitude of additional ways in which we can measure a model’s robustness, and we’ll empirically show the advantage of the path sampling estimator over the Monte Carlo estimator.</p>
<p>Because it is a setting commonly considered in the adversarial (worst-case) robustness literature, consider evaluating the robustness of an image classifier to perturbations \( \delta \) uniformly distributed within the \( \ell_\infty \)-norm ball with radius \( \epsilon \) (i.e. each component of \( \delta \) is uniformly distributed between \( [-\epsilon, \epsilon] \)).</p>
<p>In the figure below, we plot the test-time performance of an image classifier, trained on the CIFAR-10 dataset, using our intermediate robustness metric for different values of \( q \).</p>
<img src="./interpolating.jpeg" width="500">
<p>This figure shows that our proposed intermediate robustness metric does indeed capture the gap between the two existing robustness metrics, effectively interpolating between average-case robustness (\( q=1 \)) and worst-case (adversarial) robustness measurements when increasing the value of \( q \) from left to right.</p>
<p>We can also compare the Monte Carlo and path sampling estimators for different values of \( q \). While both approximation methods produce similar estimates for \( q=1 \), for larger values of \( q \) path sampling yields a higher, more accurate estimate of the intermediate robustness metric, approaching the adversarial loss more closely than Monte Carlo sampling does.</p>
<p>The benefit of the path sampling estimator can be further shown in the figure below, which plots the convergence of the Monte Carlo sampling and path sampling estimates of the intermediate robustness metric given an increasing number of samples.</p>
<table><thead><tr><th align="center">Convergence with \( q=1 \)</th><th align="center">Convergence with \( q=100 \)</th></tr></thead><tbody>
<tr><td align="center"><img src="./convergence-q=1.jpeg" alt="q=1" /></td><td align="center"><img src="./convergence-q=100.jpeg" alt="q=100" /></td></tr>
</tbody></table>
<p>Again, when approximating the robustness metric for \( q=1 \), shown on the left, both estimators converge to the same value with relatively few iterations. However, when approximating the intermediate robustness metric for \( q=100 \), shown on the right, the Monte Carlo sampler results in estimates that are consistently lower and less accurate than those of path sampling, even with a large number of samples. </p>
<h2 id="training-for-different-levels-of-robustness">Training for different levels of robustness</h2>
<p>We can also <em>train</em> machine learning models according to specific levels of robustness by choosing a value of \( q \) and minimizing the intermediate robustness objective. However, training intermediately robust models is computationally challenging because a non-trivial number of perturbation samples is needed to accurately estimate the robustness objective, even when using the path sampling method. While evaluating a model simply requires one pass over the test dataset, training requires multiple passes over the training dataset, resulting in an extremely expensive procedure that effectively multiplies the dataset size by the number of perturbation samples.</p>
<p>Due to this computational complexity, we train an image classifier on the simpler MNIST dataset (using the same perturbation set) to minimize the intermediate robustness objective for different values of \( q \) (approximated using path sampling). We train one model with \( q=1 \), which is just like training with data augmentation (training on randomly sampled perturbations), and we train one model with \( q=100 \), which is somewhere in between training with data augmentation and adversarial training (training on worst-case perturbations).</p>
<p>We evaluate the intermediate and adversarial robustness of each of the final trained models, the results of which can be seen in the figure below.</p>
<table><thead><tr><th align="center">Training with \( q=1 \)</th><th align="center">Training with \( q=100 \)</th></tr></thead><tbody>
<tr><td align="center"><img src="./train_q1.png" alt="q=1" /></td><td align="center"><img src="./train_q100.png" alt="q=100" /></td></tr>
</tbody></table>
<p>While the model trained with \( q=1 \), shown on the left, and the model trained with \( q=100 \), shown on the right, have similar performance when evaluated at less strict robustness levels, \( q=1 \) and \( q=10 \), the model trained with \( q=100 \) is much more robust when comparing the stricter \( q=1000 \) and adversarial robustness measurements.</p>
<p>Ultimately, the main takeaway from training using the proposed intermediate robustness objective is that the choice of \( q \) allows for fine-grained control over the desired level of robustness, rather than being restricted to average-case or worst-case objectives.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We’ve introduced a new robustness metric that allows for evaluating a machine learning model’s intermediate robustness, bridging the gap between evaluating robustness to random perturbations and robustness to worst-case perturbations. This intermediate robustness metric generalizes average-case and worst-case notions of robustness under the same framework as functional \( q \)-norms of the loss function over the perturbation distribution. We introduced a method for approximating this metric using path sampling, which results in a more accurate estimate of the metric compared to naive Monte Carlo sampling when evaluating at robustness levels approaching adversarial robustness. Empirically we showed that by evaluating an image classifier on additive noise perturbations, the proposed intermediate robustness metric enables a broader spectrum of robustness measurements, between the least strict notion of average performance on random perturbations and the most conservative notion of adversarial robustness. Finally, we highlighted the potential ability to train for specific levels of robustness using intermediate-\( q \) robustness as a training objective. For additional details, see our paper <a rel="noopener" target="_blank" href="https://proceedings.neurips.cc/paper/2021/file/ea4c796cccfc3899b5f9ae2874237c20-Paper.pdf">here</a> and code <a rel="noopener" target="_blank" href="https://github.com/locuslab/intermediate_robustness">here</a>.</p>
<h2 id="references">References</h2>
<p>Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.</p>
<p>Charles H. Bennett. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245–268, 1976.</p>
<p>Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, pages 831–860, 1996.</p>
<p>Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This blog post is based on the NeurIPS 2021 paper titled <a rel="noopener" target="_blank" href="https://proceedings.neurips.cc/paper/2021/file/ea4c796cccfc3899b5f9ae2874237c20-Paper.pdf">Robustness between the worst and average case</a>, which was joint work with <a rel="noopener" target="_blank" href="https://annaebair.github.io/">Anna Bair</a>, <a rel="noopener" target="_blank" href="https://www.huan-zhang.com/">Huan Zhang</a>, and <a rel="noopener" target="_blank" href="http://zicokolter.com/">Zico Kolter</a>. This work was supported by a grant from the Bosch Center for Intelligence.</p>
<h1>Classification with Strategically Withheld Data</h1>
<p><em>Published 2023-02-21 · <a href="https://www.cs.cmu.edu/~csd-phd-blog/2023/withheld/">https://www.cs.cmu.edu/~csd-phd-blog/2023/withheld/</a></em></p>
<p><em>This blog post is based on a <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2012.10203.pdf">research paper</a> with the same title, authored by Anilesh Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent Conitzer, published at AAAI 2021.</em></p>
<p><em>TL;DR: We investigate a classification problem where each data point being classified is controlled by an agent who has its own goals or incentives, and may strategically withhold certain features in order to game the classifier and get a more desirable label. We use (an oversimplified version of) college admissions as a running example to illustrate how traditional methods may fail in such settings, as well as how insights from the economic field of mechanism design may help. We then demonstrate a principled method — Incentive-Compatible Logistic Regression — for classification problems with strategically withheld features, which achieves remarkable empirical performance on credit approval data.</em></p>
<p>Applicants to most colleges in the US are required to submit their scores for at least one of the SAT and the ACT.
Applicants usually take one of these two tests — <a rel="noopener" target="_blank" href="https://www.princetonreview.com/college/sat-act">whichever works to their advantage</a>.
However, given the growing competitiveness of college admissions, many applicants now take both tests and then strategically decide whether to <a rel="noopener" target="_blank" href="https://blog.collegevine.com/should-you-submit-your-sat-act-scores/">drop one of the scores</a> (if they think it will hurt their application) or report both.
The key issue here is that it is impossible to distinguish between an applicant who takes both tests but reports only one, and an applicant who takes only one test — for example, because the applicant simply took the one required by their school, the dates for the other test did not work with their schedule, or for other reasons that are not strategic in nature.
Such ambiguity makes it harder for colleges to accurately evaluate applicants, especially since colleges now increasingly <a rel="noopener" target="_blank" href="https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong">rely on machine learning techniques to help make admissions decisions</a>.</p>
<h2 id="what-can-go-wrong">What Can Go Wrong?</h2>
<p>Consider the following simplified scenario: each applicant may naturally (i.e., before they strategically drop one of the scores) have an SAT score, an ACT score, or both.
We also assume these scores are normalized, so they become real numbers between 0 and 1.
Suppose the true competitiveness of an applicant is the average of the scores they naturally have — that is, if an applicant naturally has only one score, then that score is their true competitiveness; if an applicant naturally has both scores, then their true competitiveness is the average of the two scores.
We will use this setup as our running example from now on.
We will not try to “solve” this example problem (later we will see that in some cases, there is no satisfactory solution to the problem), but rather, we will use the example to illustrate the limitations of some classical methods, as well as to motivate the more principled method that we propose.</p>
<p>Now a college wishes to assess each applicant’s competitiveness based on the scores, and admit all applicants whose competitiveness is at least 0.5 (or some threshold chosen by the college).
Assuming all applicants report all scores they naturally have, it is easy to make admissions decisions: the college simply computes each applicant’s average score, and admits that applicant if the average is at least 0.5.
In other words, the college implements a simple <strong>classifier</strong>, which assigns any applicant <strong>label “admitted”</strong> if the average value of their <strong><em>natural</em> features</strong> is at least 0.5.</p>
<p>However, the simple classifier has its problems: after it is used for admissions for a couple of years, applicants may eventually figure out how it works (for example, by talking to past applicants and researching their test scores and application results).
Once applicants know how the decisions are made, they can easily game the system by strategically withholding information.
Consider, for example, an applicant with an SAT score of 0.6 and an ACT score of 0.2.
The applicant would normally be rejected since their true competitiveness is 0.4, which is smaller than the classifier’s threshold, 0.5.
However, knowing how the classifier works, the applicant can withhold the ACT score and report the SAT score only to the college.
Then the classifier would mistakenly believe that the applicant’s competitiveness is 0.6, and admit the applicant.
As a result, the classifier is not accurate anymore when applicants act strategically and try to game it.</p>
<h2 id="how-can-we-fix-it">(How) Can We Fix It?</h2>
<p>Taking into consideration the fact that applicants will eventually figure out how decisions are made, and in response to that, withhold scores strategically to maximize their chances, is it still possible for the college to admit exactly those applicants that the college wants to admit?
The answer is — perhaps not so surprisingly — it <em>depends on the <strong>distribution</strong> of applicants</em>, including how often each score is missing, as well as how the two scores correlate.
To illustrate this dependence, below we discuss two extreme cases.</p>
<p><img src="./examples.png" alt="two extreme cases" /></p>
<p>In one extreme case (illustrated in the left of the figure), every applicant naturally has both scores and the college knows that.
Then, the college’s problem is again simple: the college admits an applicant if and only if that applicant reports both scores, and the average of the two scores is at least 0.5.
This ensures that no applicant would want to withhold a score, because that would lead to automatic rejection.
Moreover, no applicant would be mistakenly rejected because they cannot report both scores, since everyone naturally has both scores.</p>
<p>In another extreme case (illustrated in the right of the figure), there are only two types of applicants: a type-1 applicant naturally has an SAT score of 0.6 and does not have an ACT score; a type-2 applicant naturally has an SAT score of 0.6 and an ACT score of 0.2.
Ideally, the college would like to admit all type-1 applicants (because their competitiveness is 0.6), and reject all type-2 applicants (because their competitiveness is 0.4).
However, this is impossible once applicants respond strategically to the college’s classifier.
For example, if the college admits all type-1 applicants whose SAT score is 0.6 and ACT score is missing, then a type-2 applicant would pretend to be a type-1 applicant by withholding their ACT score, and get admitted too.
On the other hand, to prevent type-2 applicants from getting in by pretending to be type-1 applicants, the college would have to reject all type-1 applicants too, eventually admitting no one.</p>
<h2 id="a-principled-approach-via-mechanism-design">A Principled Approach via Mechanism Design</h2>
<p>The above discussion highlights one fact: when applicants respond strategically, the optimal classifier must depend on the distribution of applicants, even if the college’s criteria for admissions stay the same, and there are no restrictions whatsoever on how many applicants can be admitted.
This is reminiscent of problems in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Mechanism_design">mechanism design</a>.
In a mechanism design problem, a <strong>principal</strong> designs and commits to a decision rule, or a <strong>mechanism</strong> — in the admissions problem discussed above, the principal is the college, and the decision rule is the classifier used for admissions.
Self-interested <strong>agents</strong> (e.g., applicants) then respond to this rule by reporting (possibly nontruthfully) their private information (e.g., their test scores) to the principal.
The mechanism then chooses an <strong>outcome</strong> (e.g., admissions decisions) based on the reported information.
Taking the agents’ strategic behavior into consideration, the principal aims to design a mechanism to maximize their own <strong>utility</strong> (e.g., accuracy of admissions decisions), which generally depends on both the outcome and the agents’ private information.
In fact, in our running example, the college’s problem can be cast directly as a mechanism design problem.
Below we will see how tools from mechanism design can help in solving the college’s classification problem.</p>
<h3 id="incentive-compatibility-and-the-revelation-principle">Incentive Compatibility and the Revelation Principle</h3>
<p>A key notion in mechanism design is <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Incentive_compatibility">incentive compatibility</a>: a mechanism is incentive-compatible if it is always in the agents’ best interest to truthfully report their private information.
Applied to our running example, incentive compatibility means that applicants would never want to withhold a test score that they naturally have.
One reason that incentive compatibility is so important in mechanism design is that it is often <em>without loss of generality</em>: if there is no restriction on the ways in which an agent can (mis)report their private information, then for any (possibly not incentive-compatible) mechanism, there always exists an “incentive-compatible version” of that mechanism which achieves the same effects.
This is famously known as the <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Revelation_principle">revelation principle</a>.
The reason that the revelation principle holds is simple: the principal can adapt any mechanism into an incentive-compatible one by “misreporting for” the agents, in the exact way that the agents would misreport in response to the original mechanism.
We show that a variant of the revelation principle applies to the college’s classification problem (and more generally, to all classification problems with strategically withheld features).
This greatly simplifies the problem, because without loss of generality, we only need to consider classifiers under which applicants have no incentive to withhold any score.
This effectively removes the strategic aspect and leaves a clean classification problem.</p>
<h3 id="incentive-compatible-logistic-regression">Incentive-Compatible Logistic Regression</h3>
<p>Given the revelation principle, we propose a principled method, <strong>incentive-compatible logistic regression</strong>, for classification problems with strategically withheld data.
The idea is simple: we run the classical gradient-based algorithm for logistic regression, <em>but with the search space restricted to classifiers that are incentive-compatible</em>.
The college can then use the resulting model to classify applicants in an incentive-compatible way.
We will see below how this can be done by adding a projection step to the region of incentive-compatible classifiers, after each gradient step.</p>
<p>Recall that in logistic regression, the goal is to learn a set of coefficients \({\beta_i}\), one for each feature \(i\), as well as an intercept \(\beta_0\), such that for each data point \((x, y)\), the predicted label \(\hat{y}\) given by
\[
\hat{y} = \mathbb{I}\left[\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) \ge 0.5\right]
\]
fits the true label \(y\) as well as possible.
Here, \(\sigma\) is the logistic function, defined as
\[
\sigma(t) = 1 / (1 + e^{-t}).
\]
Mapping these notions back to our running example, each data point \((x, y)\) corresponds to an applicant, where each feature \(x_i\) is one of the two scores, and the true label \(y\) is \(1\) (corresponding to “admitted”) if the applicant’s true competitiveness is at least the college’s desired threshold, and \(0\) (corresponding to “rejected”) otherwise.
The classifier computes a predicted label \(\hat{y}\) for each data point, which is the admissions decision for that specific applicant.
Naturally, the college wants \(\hat{y}\) to fit \(y\) as well as possible.</p>
<p>It turns out there is a simple condition for the classifier of the above form to be incentive-compatible.
Without loss of generality, suppose each feature \(x_i\) is always nonnegative.
This is naturally true in our running example, since each feature is a score between \(0\) and \(1\); in general, we can shift the features if they are not nonnegative.
Moreover, if a feature is missing in a data point, then we simply treat that feature as \(0\).
Then a classifier induced by \({\beta_i}\) is incentive-compatible if and only if each \(\beta_i\) is nonnegative.
This is because if some \(\beta_i < 0\), then a data point with feature \(x_i > 0\) will be able to increase their score, \(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right)\), by withholding feature \(x_i\).
Depending on the values of other features, this will sometimes change the predicted label of that data point from \(0\) (i.e., rejected) to \(1\) (i.e., admitted).
In other words, such a classifier cannot be incentive-compatible.
On the other hand, if each \(\beta_i\) is nonnegative, then for any data point, withholding a feature \(x_i\) can never increase the score, so there is no incentive to withhold any feature.</p>
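<p>As a quick numerical sanity check of this characterization (the coefficients below are made up purely for illustration), withholding a negatively weighted feature strictly increases the score:</p>

```python
import math

def sigma(t):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients with a negative weight on the second feature.
beta0, beta = 0.0, [1.0, -1.0]
x = [0.6, 0.2]

score_full = sigma(beta0 + sum(xi * bi for xi, bi in zip(x, beta)))
score_withheld = sigma(beta0 + x[0] * beta[0])  # withhold x_2 (treated as 0)
print(score_withheld > score_full)  # True: withholding raised the score
```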
<p>Given the above characterization, we can simply adapt the gradient-based algorithm for (unconstrained) logistic regression to find a good incentive-compatible classifier.
We initialize the classifier arbitrarily, and repeat the following steps for each data point \((x, y)\) until convergence:</p>
<ul>
<li>
<p><strong>The gradient step</strong>: Let
\[
\beta_0 \gets \beta_0 - \eta_t \cdot \left(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) - y\right).
\]
For each feature \(i\), let
\[
\beta_i \gets \beta_i - \eta_t \cdot \left(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) - y\right) \cdot x_i.
\]
Here, \(\eta_t\) is the learning rate in step \(t\).
This rate normally decreases in \(t\), e.g., \(\eta_t = 1 / \sqrt{t}\).</p>
</li>
<li>
<p><strong>The projection step</strong>: For each feature \(i\), let
\[
\beta_i \gets \max\{0, \beta_i\}.
\]</p>
</li>
</ul>
<p>This can be viewed as an instantiation of the projected gradient descent algorithm: the gradient step is exactly the same as in (unconstrained) logistic regression, and the projection step ensures that the incentive-compatibility constraint is satisfied.</p>
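<p>The two steps above translate directly into code. Below is a minimal pure-Python sketch (the hyperparameters, data encoding, and toy training set are my own choices, not from the paper):</p>

```python
import math
import random

def sigma(t):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def ic_logistic_regression(data, n_features, epochs=300, seed=0):
    """Incentive-compatible logistic regression via projected SGD.

    `data` is a list of (x, y) pairs, where x is a list of nonnegative
    feature values (a withheld/missing feature is encoded as 0) and y is
    0 or 1. Clipping each beta_i to be nonnegative after every gradient
    step keeps the classifier incentive-compatible.
    """
    rng = random.Random(seed)
    beta0, beta, t = 0.0, [0.0] * n_features, 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / math.sqrt(t)  # decreasing learning rate
            err = sigma(beta0 + sum(xi * bi for xi, bi in zip(x, beta))) - y
            beta0 -= eta * err                     # gradient step (intercept)
            beta = [max(0.0, bi - eta * err * xi)  # gradient step + projection
                    for bi, xi in zip(beta, x)]
    return beta0, beta

# Toy training set: label 1 iff the average of the two scores is >= 0.5.
data = [([a / 10, b / 10], int(a + b >= 10))
        for a in range(11) for b in range(11)]
beta0, beta = ic_logistic_regression(data, n_features=2)
```

The learned weights are nonnegative by construction, so no data point can gain by withholding a feature.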
<p>Coming back to our running example, incentive-compatible logistic regression will assign a nonnegative weight to each test score and admit an applicant if the weighted sum of the two scores exceeds some threshold. Note that this does not “solve” the college’s problem in all cases: for example, between the two extreme cases discussed above, incentive-compatible logistic regression would work very well in the first case, but in the second case its performance would not be practically meaningful, simply because the second case is intrinsically hard and no classifier can achieve a reasonable accuracy there.</p>
<h2 id="experimental-results">Experimental Results</h2>
<p>We empirically evaluate incentive-compatible logistic regression on 4 real-world credit approval datasets from the <a rel="noopener" target="_blank" href="http://archive.ics.uci.edu/ml/index.php">UCI ML Repository</a>, based on historical data collected in Australia, Germany, Poland, and Taiwan.
Each data point in each dataset corresponds to a single credit application, with tens of features (3 datasets provide 15-23 features, and the other provides 64), including annual income, employment status, current balance in savings account, etc.
Each data point has a binary label, which is either “approve” (i.e., 1) or “reject” (i.e., 0).
We preprocess the datasets by randomly dropping some features for each data point, thus simulating naturally missing features.
We consider two ways of reporting in our evaluation:</p>
<ul>
<li>
<p><strong>Truthful reporting</strong>: Each data point always reveals all features it naturally has to the classifier.
This is the assumption made by the baseline methods, which we compare against in our evaluation.</p>
</li>
<li>
<p><strong>Strategic reporting</strong>: In response to the classifier, each data point optimally withholds a subset of features to maximize the chance of getting approved (i.e., label 1).
For incentive-compatible logistic regression, strategic reporting is equivalent to truthful reporting.
However, as we will see, the baseline methods perform significantly worse with strategic reporting (which is natural, since they were not designed to be robust against strategic manipulation).</p>
</li>
</ul>
<p>As for the baseline methods, we compare against <strong>logistic regression</strong> (without incentive-compatibility), <strong>neural networks</strong>, and <strong>random forests</strong>.
These are the most popular and accurate methods in credit approval applications.
For more details of the experiments, please see Section 6 of <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2012.10203.pdf">our paper</a>.</p>
<p>The accuracy of each classifier tested on each dataset can be found in the table below.
Note that there are two numbers in each cell: the left one corresponds to the accuracy under truthful reporting, and the right one corresponds to the accuracy under strategic reporting.</p>
<table><thead><tr><th>Classifier</th><th>Australia</th><th>Germany</th><th>Poland</th><th>Taiwan</th></tr></thead><tbody>
<tr><td>Incentive-compatible logistic regression</td><td><strong>0.800</strong> / <strong>0.800</strong></td><td>0.651 / <strong>0.651</strong></td><td>0.698 / <strong>0.698</strong></td><td>0.646 / <strong>0.646</strong></td></tr>
<tr><td>Logistic regression (baseline)</td><td><strong>0.800</strong> / 0.763</td><td><strong>0.652</strong> / 0.580</td><td>0.714 / 0.660</td><td>0.670 / 0.618</td></tr>
<tr><td>Artificial neural networks (baseline)</td><td><strong>0.800</strong> / 0.747</td><td><strong>0.652</strong> / 0.580</td><td><strong>0.719</strong> / 0.636</td><td><strong>0.688</strong> / 0.543</td></tr>
<tr><td>Random forest (baseline)</td><td>0.797 / 0.541</td><td>0.633 / 0.516</td><td>0.709 / 0.522</td><td>0.684 / 0.588</td></tr>
</tbody></table>
<p>Here we make two observations:</p>
<ul>
<li>Under strategic reporting, incentive-compatible logistic regression is consistently much more accurate than all 3 baseline methods.
This highlights the importance of robustness against strategic manipulation by design.</li>
<li>The accuracy of incentive-compatible logistic regression under strategic reporting is often comparable to that of the baseline methods under truthful reporting.
In other words, although strategic manipulation poses challenges in the design of a good classifier, from an information-theoretic perspective, the classification problem does not become much harder.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>We study the problem of classification when each data point can strategically withhold some of its features to obtain a more favorable outcome.
We propose a principled classification method, incentive-compatible logistic regression, which is robust to strategic manipulation.
The new method is tested on real-world datasets, showing that it outperforms out-of-the-box methods that do not account for strategic behavior.
More generally, we draw connections between strategic classification and mechanism design, which may inspire future work in other strategic classification settings.</p>
Designing Data Structures for Collaborative Apps2023-02-17T00:00:00+00:002023-02-17T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/collaborative-data-design/<h1 id="introduction-collaborative-apps-via-crdts">Introduction: Collaborative Apps via CRDTs</h1>
<blockquote>
<p>An extended version of this post appears <a rel="noopener" target="_blank" href="https://mattweidner.com/2022/02/10/collaborative-data-design.html">on my personal site</a>.</p>
</blockquote>
<p></p><br />
<p>Suppose you’re building a collaborative app, along the lines of Google Docs/Sheets/Slides, Figma, Notion, etc., but <em>without a central server</em>. One challenge you’ll face is the actual collaboration: when one user changes the shared state, their changes need to show up for every other user. For example, if multiple users type at the same time in a text field, the result should reflect all of their changes and be consistent (identical for all users).</p>
<p><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type"><strong>Conflict-free Replicated Data Types (CRDTs)</strong></a> provide a solution to this challenge. They are data structures that look like ordinary data structures (maps, sets, text strings, etc.), except that they are collaborative: when one user updates their copy of a CRDT, their changes automatically show up for everyone else. Each user sees their own changes immediately, while under the hood, the CRDT broadcasts a message describing the change to everyone else. Other users see the change once they receive this message.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/message_sending.png" alt="CRDTs broadcast messages to relay changes" /></p>
<p><a name="correctness"></a>Note that multiple users might make changes at the same time, e.g., both typing at once. Since each user sees their own changes immediately, their views of the document will temporarily diverge. However, CRDTs guarantee that once the users receive each others’ messages, they’ll see identical document states again: this is the definition of <strong>CRDT correctness</strong>. Ideally, this state will also be “reasonable”, i.e., it will incorporate both of their edits in the way that the users expect.</p>
<blockquote>
<p>In distributed systems terms, CRDTs are <em>Available</em>, <em>Partition tolerant</em>, and have <em>Strong Eventual Consistency</em>.</p>
</blockquote>
<p></p><br />
<p>CRDTs work even if messages might be arbitrarily delayed, or delivered to different users in different orders. This lets you make collaborative experiences that don’t need a central server, work offline, and/or are end-to-end encrypted (<a rel="noopener" target="_blank" href="https://www.inkandswitch.com/local-first/"><strong>local-first software</strong></a>).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/google_docs_offline.png" alt="Google Docs doesn’t let you type while offline" /></p>
<p align="center"><i>CRDTs allow offline editing, unlike Google Docs.</i></p>
<p>I’m particularly excited by the potential for <strong>open-source collaborative apps</strong> that anyone can distribute or modify, without requiring app-specific hosting.</p>
<h2 id="the-challenge-designing-crdts">The Challenge: Designing CRDTs</h2>
<p>Having read all that, let’s say you choose to use a CRDT for your collaborative app. All you need is a CRDT representing your app’s state, a frontend UI, and a network of your choice (or a way for users to pick the network themselves). But where do you get a CRDT for your specific app?</p>
<p>If you’re lucky, it’s described in a <a rel="noopener" target="_blank" href="https://crdt.tech/papers.html">paper</a>, or even better, implemented in a <a rel="noopener" target="_blank" href="https://crdt.tech/implementations">library</a>. But those tend to have simple or one-size-fits-all data structures: maps, text strings, unstructured JSON, etc. You can usually rearrange your app’s state to make it fit in these CRDTs; and if users make changes at the same time, CRDT correctness guarantees that you’ll get <em>some</em> consistent result. However, it might not be what you or your users expect. Worse, you have little leeway to customize this behavior.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/json_anomaly.png" alt="Anomaly in a published JSON CRDT: In a collaborative todo-list, concurrently deleting an item and marking it done results in a nonsense list item with no text field." /></p>
<p align="center"><i>
In a <a href="https://doi.org/10.1109/TPDS.2017.2697382" target="_blank">published JSON CRDT</a>, when representing a todo-list using items with "title" and "done" fields, you can end up with an item <code>{"done": true}</code> having no "title" field. Image credit: Figure 6 from the paper.
</i></p>
<!--figure: hypothetical user Q&A asking for a change in the conflict-resolution, and you just reply "sorry".-->
<p>This blog post will instead teach you how to design CRDTs from the ground up. I’ll present a few simple CRDTs that are obviously correct, plus ways to compose them together into complicated whole-app CRDTs that are still obviously correct. I’ll also present principles of CRDT design to help guide you through the process. To cap it off, we’ll design a CRDT for a collaborative spreadsheet.</p>
<p><strong>Ultimately, I hope that you will gain not just an understanding of some existing CRDT designs, but also the confidence to tweak them and create your own!</strong></p>
<h2 id="related-work">Related Work</h2>
<p>The CRDTs I describe are based on <a rel="noopener" target="_blank" href="http://dx.doi.org/10.1007/978-3-642-24550-3_29">Shapiro et al. 2011</a> unless noted otherwise. However, the way I describe them, and the design principles and composition techniques, are my own way of thinking about CRDT design. It’s inspired by the way <a rel="noopener" target="_blank" href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/">Figma</a> and <a rel="noopener" target="_blank" href="https://hex.tech/blog/a-pragmatic-approach-to-live-collaboration">Hex</a> describe their collaboration platforms; they likewise support complex apps by composing simple, easy-to-reason-about pieces. Relative to those platforms, I incorporate more academic CRDT designs, enabling more flexible behavior and server-free operation.</p>
<!--I'll describe most CRDTs in terms of an implementation, because I find implementations easier to explain. However, my real goal is to describe their *semantics*: what users see after they perform various operations, possibly concurrently. If you can find alternate implementations that have the same behavior as the ones I describe but are more efficient, then by all means, use those instead. -->
<h1 id="basic-designs">Basic Designs</h1>
<p>I’ll start by going over some basic CRDT designs.</p>
<h2 id="unique-set">Unique Set</h2>
<p>Our foundational CRDT is the <strong>Unique Set</strong>. It is a set in which each added element is considered unique.</p>
<p>Formally, the user-facing operations on the set, and their collaborative implementations, are as follows:</p>
<ul>
<li><code>add(x)</code>: Adds an element <code>e = (t, x)</code> to the set, where <code>t</code> is a <em>unique new tag</em>, used to ensure that <code>(t, x)</code> is unique. To implement this, the adding user generates <code>t</code>, e.g., as a pair (device id, device-specific counter), then serializes <code>(t, x)</code> and broadcasts it to the other users. The receivers deserialize <code>(t, x)</code> and add it to their local copy of the set.</li>
<li><code>delete(e)</code>: Deletes the element <code>e = (t, x)</code> from the set. To implement this, the deleting user serializes <code>t</code> and broadcasts it to the other users. The receivers deserialize <code>t</code> and remove the element with tag <code>t</code> from their local copy, if it has not been deleted already.</li>
</ul>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/message_flow.png" alt="In response to user input, the operator calls “Output message”. The message is then delivered to every user’s “Receive & Update display” function." /></p>
<p align="center"><i>The lifecycle of an <code>add</code> or <code>delete</code> operation.</i></p>
<p>When displaying the set to the user, you ignore the tags and just list out the data values <code>x</code>, keeping in mind that (1) they are not ordered (at least not consistently across different users), and (2) there may be duplicates.</p>
<p><strong>Example:</strong> In a collaborative flash card app, you could represent the deck of cards as a Unique Set, using <code>x</code> to hold the flash card’s value (e.g., its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed. <!--Note that the collaborative state is just the *set* of cards; there is no ordering info. You could perhaps sort them alphabetically in editing mode (to make them consistent), and randomly in practice mode (deliberately inconsistent).--></p>
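<p>Here is a minimal Python sketch of a Unique Set replica (the message format and class name are my own; a real implementation would also handle serialization and the delivery rules discussed next):</p>

```python
import itertools

class UniqueSetReplica:
    """One user's copy of a Unique Set CRDT (illustration only).

    add/delete return the message to broadcast; every replica,
    including the sender, applies messages via `receive`.
    """
    def __init__(self, device_id):
        self.device_id = device_id
        self.counter = itertools.count()  # device-specific counter for tags
        self.elements = {}                # tag -> value

    def add(self, x):
        tag = (self.device_id, next(self.counter))  # unique new tag
        msg = ("add", tag, x)
        self.receive(msg)
        return msg

    def delete(self, tag):
        msg = ("delete", tag, None)
        self.receive(msg)
        return msg

    def receive(self, msg):
        op, tag, x = msg
        if op == "add":
            self.elements[tag] = x
        else:
            self.elements.pop(tag, None)  # no-op if already deleted

    def values(self):
        return sorted(self.elements.values())  # display order is arbitrary

# Two users concurrently add a duplicate flash card, then sync.
alice, bob = UniqueSetReplica("alice"), UniqueSetReplica("bob")
m1, m2 = alice.add("card"), bob.add("card")
alice.receive(m2)
bob.receive(m1)
print(alice.values() == bob.values())  # True: both keep the duplicate
```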
<p><a name="causal-order"></a>When broadcasting messages, we require that they are delivered <em>reliably</em> and <em>in causal order</em>, but it’s okay if they are arbitrarily delayed. (These rules apply to all CRDTs, not just the Unique Set.) Delivery <strong>in causal order</strong> means that if a user sends a message \(m\) after receiving or sending a message \(m^\prime\), then all users delay receiving \(m\) until after receiving \(m^\prime\). This is the strictest ordering we can implement without a central server and without extra round-trips between users, e.g., by using <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Vector_clock">vector clocks</a>.</p>
<p>Messages that aren’t ordered by the causal order are <strong>concurrent</strong>, and different users might receive them in different orders. But for <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#correctness">CRDT correctness</a>, we must ensure that all users end up in the same state regardless, once they have received the same messages.</p>
<p>For the Unique Set, it is obvious that the state of the set, as seen by a specific user, is always the set of elements for which they have received an <code>add</code> message but no <code>delete</code> messages. This holds regardless of the order in which they received concurrent messages. Thus the Unique Set is correct.</p>
<blockquote>
<p>Note that delivery in causal order is important—a <code>delete</code> operation only works if it is received after its corresponding <code>add</code> operation.</p>
</blockquote>
<p></p><br />
<p>We now have our first principle of CRDT design:</p>
<p><a name="principle-1"></a><strong>Principle 1. Use the Unique Set CRDT for operations that “add” or “create” a unique new thing.</strong></p>
<p>Although it is simple, the Unique Set forms the basis for the rest of our CRDTs.</p>
<!-- > **Aside.** Traditionally, one proves CRDT correctness by proving that concurrent messages *commute*---they have the same effect regardless of delivery order ([Shapiro et al. 2011](http://dx.doi.org/10.1007/978-3-642-24550-3_29))---or that the final state is a function of the causally-ordered message history ([Baquero, Almeida, and Shoker 2014](https://doi.org/10.1007/978-3-662-43352-2_11)). However, as long as you stick to the techniques in this blog post, you won't need explicit proofs: everything builds on the Unique Set in ways that trivially preserve CRDT correctness. For example, a deterministic view of a Unique Set is obviously still a CRDT.
<p></p><br /> -->
<h2 id="lists">Lists</h2>
<p>Our next CRDT is a <strong>list CRDT</strong>. It represents a list of elements, with <code>insert</code> and <code>delete</code> operations. For example, you can use a list CRDT of characters to store the text in a collaborative text editor, using <code>insert</code> to type a new character and <code>delete</code> for backspace.</p>
<p>Formally, the operations on a list CRDT are:</p>
<ul>
<li><code>insert(i, x)</code>: Inserts a new element with value <code>x</code> at index <code>i</code>, between the existing elements at indices <code>i</code> and <code>i+1</code>. All later elements (index <code>>= i+1</code>) are shifted one to the right.</li>
<li><code>delete(i)</code>: Deletes the element at index <code>i</code>. All later elements (index <code>>= i+1</code>) are shifted one to the left.</li>
</ul>
<p>We now need to decide on the semantics, i.e., what is the result of various insert and delete operations, possibly concurrent. The fact that insertions are unique suggests using a Unique Set (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>). However, we also have to account for indices and the list order.</p>
<p>One approach would use indices directly: when a user calls <code>insert(i, x)</code>, they send <code>(i, x)</code> to the other users, who use <code>i</code> to insert <code>x</code> at the appropriate location. The challenge is that your intended insertion index might move around as a result of users’ inserting/deleting in front of <code>i</code>.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/ot.png" alt="The gray cat jumped on the table." /></p>
<p align="center"><i>Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.</i></p>
<p>It’s possible to work around this by “transforming” <code>i</code> to account for concurrent edits. That idea leads to <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Operational_transformation"><strong>Operational Transformation (OT)</strong></a>, the earliest-invented approach to collaborative text editing, and the one used in Google Docs and most existing apps. Unfortunately, OT algorithms are quite complicated, leading to numerous <a rel="noopener" target="_blank" href="https://core.ac.uk/download/pdf/54049928.pdf">flawed algorithms</a>. You can reduce complexity by using a central server to manage the document, like Google Docs does, but that precludes decentralized networks, end-to-end encryption, and server-free open-source apps.</p>
<!--Several incorrect attempts at server-free OT were published before the [first correct one](https://core.ac.uk/download/pdf/54049928.pdf) in 2005 (cite, check correctness via citations)---the same year the [first CRDT paper](https://hal.inria.fr/inria-00071240/document) was published. -->
<p>List CRDTs use a different perspective from OT. When you type a character in a text document, you probably don’t think of its position as “index 17” or whatever; instead, its position is at a certain place within the existing text.</p>
<p>“A certain place within the existing text” is vague, but at a minimum, it should be between the characters to the left and right of your insertion point (“on” and “ table” in the example above). Also, unlike an index, this intuitive position is <em>immutable</em>.</p>
<p>This leads to the following implementation. The list’s state is a Unique Set whose values are pairs <code>(p, x)</code>, where <code>x</code> is the actual value (e.g., a character), and <code>p</code> is a <strong>unique immutable position</strong> drawn from some abstract total order. The user-visible state of the list is the list of values <code>x</code> ordered by their positions <code>p</code>. Operations are implemented as:</p>
<ul>
<li><code>insert(i, x)</code>: The inserting user looks up the positions <code>pL</code>, <code>pR</code> of the values to the left and right (indices <code>i</code> and <code>i+1</code>), generates a unique new position <code>p</code> such that <code>pL < p < pR</code>, and calls <code>add((p, x))</code> on the Unique Set. </li>
<li><code>delete(i)</code>: The deleting user finds the element <code>e</code> of the Unique Set at index <code>i</code>, then calls <code>delete(e)</code> on the Unique Set.</li>
</ul>
<p>Of course, we need a way to create the positions <code>p</code>. That’s the hard part—in fact, the hardest part of any CRDT—and I don’t have space to go into it here; you should use an existing algorithm (e.g., <a rel="noopener" target="_blank" href="http://dx.doi.org/10.1016/j.jpdc.2010.12.006">RGA</a>) or implementation (e.g., <a rel="noopener" target="_blank" href="https://docs.yjs.dev/api/shared-types/y.array">Yjs’s <code>Y.Array</code></a>). <!--Generally, solutions involve a tree, sorted by the tree walk on nodes; you create a unique new position in between `pL` and `pR` by adding a new leaf somewhere between `pL` and `pR`, e.g., as a right child of `pL`.--></p>
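<p>That said, to make the structure concrete, here is a toy Python sketch where positions are random floats between the neighbors’ positions, with the tag as a tiebreak. This naive position scheme is merely a stand-in for a real algorithm like RGA, and the indexing convention (the new element ends up at index <code>i</code>) is my own; don’t use this in production:</p>

```python
import random

class ToyListReplica:
    """Toy list CRDT: a Unique Set of (position, value) pairs, shown in
    position order. Random-float positions are for illustration only.
    """
    def __init__(self, device_id, seed=None):
        self.device_id = device_id
        self.counter = 0
        self.rng = random.Random(seed)
        self.elements = {}  # tag -> (position, value)

    def _ordered(self):
        # Sort by position; break (unlikely) float ties by tag.
        return sorted(self.elements.items(), key=lambda kv: (kv[1][0], kv[0]))

    def insert(self, i, x):
        """Insert x so it ends up at index i (left neighbor: index i-1)."""
        items = self._ordered()
        pL = items[i - 1][1][0] if i > 0 else 0.0
        pR = items[i][1][0] if i < len(items) else 1.0
        self.counter += 1
        tag = (self.device_id, self.counter)
        msg = ("add", tag, (self.rng.uniform(pL, pR), x))
        self.receive(msg)
        return msg

    def delete(self, i):
        msg = ("delete", self._ordered()[i][0], None)
        self.receive(msg)
        return msg

    def receive(self, msg):
        op, tag, payload = msg
        if op == "add":
            self.elements[tag] = payload
        else:
            self.elements.pop(tag, None)

    def values(self):
        return [x for _, (_, x) in self._ordered()]

# Alice types "cat"; Bob receives her messages, then they edit concurrently.
a, b = ToyListReplica("alice", seed=1), ToyListReplica("bob", seed=2)
for i, ch in enumerate("cat"):
    b.receive(a.insert(i, ch))
m_a = a.insert(3, "s")  # Alice appends "s"...
m_b = b.insert(0, "x")  # ...while Bob concurrently prepends "x".
a.receive(m_b)
b.receive(m_a)
print("".join(a.values()) == "".join(b.values()))  # True: replicas converge
```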
<p>The important lesson here is that we had to translate indices (the language of normal, non-CRDT lists) into unique immutable positions (what the user intuitively means when they say “insert here”). That leads to our second principle of CRDT design:</p>
<p><a name="principle-2"></a><strong>Principle 2. Express operations in terms of user intention—what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.</strong></p>
<h2 id="registers">Registers</h2>
<p>Our last basic CRDT is the <strong>register</strong>. This is a variable that holds an arbitrary value that can be set and get. If multiple users set the value at the same time, you pick one of them arbitrarily, or perhaps average them together.</p>
<p><strong>Example uses for registers:</strong></p>
<ul>
<li>The font size of a character in a collaborative rich-text editor.</li>
<li>The name of a document.</li>
<li>The color of a specific pixel in a collaborative whiteboard.</li>
<li>Basically, anything where you’re fine with users overwriting each others’ concurrent changes and you don’t want to use a more complicated CRDT.</li>
</ul>
<p>Registers are very useful and suffice for many tasks (e.g., <a rel="noopener" target="_blank" href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/">Figma</a> and <a rel="noopener" target="_blank" href="https://hex.tech/blog/a-pragmatic-approach-to-live-collaboration">Hex</a> use them almost exclusively).</p>
<p>The only operation on a register is <code>set(x)</code>, which sets the value to <code>x</code> (in the absence of concurrent operations). We can’t perform these operations literally, since if two users receive concurrent <code>set</code> operations in different orders, they’ll end up with different values.</p>
<p>However, we can <em>add</em> the value <code>x</code> to a Unique Set, following <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>. The state is now a set of values instead of a single value, but we’ll address that soon. We can also delete old values each time <code>set(x)</code> is called, overwriting them.</p>
<p>Thus the implementation of <code>set(x)</code> becomes:</p>
<ul>
<li>For each element <code>e</code> in the Unique Set, call <code>delete(e)</code> on the Unique Set; then call <code>add(x)</code>.</li>
</ul>
<p>The result is that at any time, the register’s state is the set of all the most recent concurrently-set values.</p>
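<p>As a runnable sketch (illustrative names and message format, not a real library API), the delete-then-add implementation looks like this. Note how two <code>set</code>s that haven’t seen each other delete nothing of each other’s, so both survive as conflicts:</p>

```typescript
// A register built on a Unique Set: set(x) deletes every entry the setter
// has seen, then adds x with a unique tag.
type SetMessage<T> = { deleted: string[]; id: string; value: T };

class MultiValueRegister<T> {
  private entries = new Map<string, T>(); // the Unique Set
  private counter = 0;

  // Local set: apply immediately and return the message to broadcast.
  set(x: T): SetMessage<T> {
    const msg: SetMessage<T> = {
      deleted: Array.from(this.entries.keys()), // everything we've seen
      id: `local-${this.counter++}`, // stand-in for a globally unique tag
      value: x,
    };
    this.apply(msg);
    return msg;
  }

  // Apply a local or remote set message.
  apply(msg: SetMessage<T>): void {
    for (const d of msg.deleted) this.entries.delete(d);
    this.entries.set(msg.id, msg.value);
  }

  // The conflict set: all most recent concurrently-set values.
  conflicts(): T[] {
    return Array.from(this.entries.values());
  }
}
```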
<p>Loops of the form “for each element of a collection, do something” are common in programming. We just saw a way to extend them to CRDTs: “for each element of a Unique Set, do some CRDT operation”. I call this a <strong>causal for-each operation</strong> because it only affects elements that are prior to the for-each operation in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#causal-order">causal order</a>. It’s useful enough that we make it our next principle of CRDT design:</p>
<p><a name="principle-3a"></a><strong>Principle 3a. For operations that do something “for each” element of a collection, one option is to use a <em>causal for-each operation</em> on a Unique Set (or list CRDT).</strong></p>
<p>(Later we will expand on this with <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3b">Principle 3b</a>, which also concerns for-each operations.)</p>
<p>Returning to registers, we still need to handle the fact that our state is a set of values, instead of a specific value.</p>
<p>One option is to accept this as the state, and present all conflicting values to the user. That gives the <strong>Multi-Value Register (MVR)</strong>.</p>
<p><a name="lww-register"></a>Another option is to pick a value arbitrarily but deterministically. E.g., the <strong>Last-Writer Wins (LWW) Register</strong> tags each value with the wall-clock time when it is set, then picks the value with the latest timestamp.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/pixelpusher.png" alt="Grid of pixels, some conflicting (outlined in red). One conflicting pixel has been clicked on, revealing the conflicting choices." /></p>
<p align="center"><i>In <a href="https://medium.com/@pvh/pixelpusher-real-time-peer-to-peer-collaboration-with-react-7c7bc8ecbf74" target="_blank">Pixelpusher</a>, a collaborative pixel art editor, each pixel shows one color by default (LWW Register), but you can click to pop out all conflicting colors (MVR). Image credit: Peter van Hardenberg (<a href="https://miro.medium.com/max/270/1*tXSBtdqf6yBCO6i77VVH1A.png" target="_blank">original</a>).</i></p>
<p>In general, you can define the value getter to be an arbitrary deterministic function of the set of values.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>If the values are colors, you can average their RGB coordinates.</li>
</ul>
<!--figure: illustration-->
<ul>
<li><a name="enable-wins-flag"></a>If the values are booleans, you can choose to prefer <code>true</code> values, i.e., the register’s value is <code>true</code> if its set contains any <code>true</code> values. That gives the <strong>Enable-Wins Flag</strong>.</li>
</ul>
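<p>Each of these getters is a small pure function over the conflict set. A sketch, with hypothetical names (<code>Tagged</code> pairs each surviving value with the wall-clock time it was set):</p>

```typescript
// Deterministic getters over a register's conflict set.
type Tagged<T> = { time: number; value: T };

// Multi-Value Register: expose every conflicting value.
const mvrGet = <T>(conflicts: Tagged<T>[]): T[] =>
  conflicts.map((c) => c.value);

// LWW Register: the value with the latest timestamp wins.
const lwwGet = <T>(conflicts: Tagged<T>[]): T =>
  conflicts.reduce((a, b) => (b.time > a.time ? b : a)).value;

// Enable-Wins Flag: true if any conflicting value is true.
const enableWinsGet = (conflicts: Tagged<boolean>[]): boolean =>
  conflicts.some((c) => c.value);

// Averaging getter, e.g., for one RGB color channel.
const averageGet = (conflicts: Tagged<number>[]): number =>
  conflicts.reduce((sum, c) => sum + c.value, 0) / conflicts.length;
```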
<h1 id="composing-crdts">Composing CRDTs</h1>
<p>We now have enough basic CRDTs to start making more complicated data structures through composition. I’ll describe three techniques: CRDT objects, CRDT-valued maps, and collections of CRDTs.</p>
<h2 id="crdt-objects">CRDT Objects</h2>
<p>The simplest composition technique is to use multiple CRDTs side-by-side. By making them instance fields in a class, you obtain a <strong>CRDT Object</strong>, which is itself a CRDT (trivially correct). The power of CRDT objects comes from using standard OOP techniques, e.g., implementation hiding.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>In a collaborative flash card app, to make individual cards editable, you could represent each card as a CRDT object with two text CRDT (list CRDT of characters) instance fields, one for the front and one for the back.</li>
<li>You can represent the position and size of an image in a collaborative slide editor by using separate registers for the left, top, width, and height. <!--To get a complete image object, you might also add registers for border color/size/style, a text CRDT for the caption, a register for the image source (unless it's immutable, in which case you can use an ordinary, non-CRDT instance field), etc.--></li>
</ul>
<!--- Recall that we defined lists and registers in terms of the Unique Set. We can consider these as CRDT objects as well, even though they just have one instance field (the set). The object lets us delegate operations and reads to the inner set while exposing the API of a list/register.-->
<p>To implement a CRDT object, each time an instance field requests to broadcast a message, the CRDT object broadcasts that message tagged with the field’s name. Receivers then deliver the message to their own instance field with the same name. <!--When nesting CRDT objects, this effectively creates a tree with a [basic CRDT](#basic-designs) at each leaf; each basic CRDT message is sent tagged with its path to the root.--></p>
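<p>A runnable sketch of that routing (the <code>ChildCrdt</code> interface and the JSON wrapping are illustrative, not a real API):</p>

```typescript
// The CRDT object tags each field's broadcast with the field name,
// and dispatches received messages to the field with the same name.
interface ChildCrdt {
  receive(message: string): void;
}

class CrdtObjectRouter {
  constructor(private fields: Map<string, ChildCrdt>) {}

  // Called when the field named `name` wants to broadcast `message`.
  tag(name: string, message: string): string {
    return JSON.stringify({ name, message });
  }

  // Deliver a received message to the matching local field.
  receive(tagged: string): void {
    const { name, message } = JSON.parse(tagged);
    const field = this.fields.get(name);
    if (field) field.receive(message);
  }
}
```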
<h2 id="crdt-valued-map">CRDT-Valued Map</h2>
<p>A CRDT-valued map is like a CRDT object but with potentially infinite instance fields, one for each allowed map key. Every key/value pair is implicitly always present in the map, but values are only explicitly constructed in memory as needed, using a predefined factory method (like Apache Commons’ <a rel="noopener" target="_blank" href="https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/LazyMap.html">LazyMap</a>).</p>
<p><strong>Examples:</strong></p>
<ul>
<li><a id="add-wins-set"></a>Consider a shared notes app in which users can archive notes, then restore them later. To indicate which notes are normal (not archived), we want to store them in a set. A Unique Set won’t work, since the same note can be added (restored) multiple times. Instead, you can use a CRDT-valued map whose keys are the notes and whose values are <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#enable-wins-flag">enable-wins flags</a>; the value of the flag for key <code>note</code> indicates whether <code>note</code> is in the set. This gives the <strong>Add-Wins Set</strong>.</li>
<li><a rel="noopener" target="_blank" href="https://quilljs.com/">Quill</a> lets you easily display and edit rich text in a browser app. In a Quill document, each character has an <code>attributes</code> map, which contains arbitrary key-value pairs describing formatting (e.g., <code>"bold": true</code>). You can model this using a CRDT-valued map with arbitrary keys and <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#lww-register">LWW register</a> values; the value of the register for key <code>attr</code> indicates the current value for <code>attr</code>.</li>
</ul>
<p>A CRDT-valued map is implemented like a CRDT object: each message broadcast by a value CRDT is tagged with its serialized key. Internally, the map stores only the explicitly-constructed key-value pairs; each value is constructed using the factory method the first time it is accessed by the local user or receives a message. However, this is not visible externally—from the outside, the other values still appear present, just in their initial states. (If you want an explicit set of “present” keys, you can track them using an <a href="#add-wins-set">Add-Wins Set</a>.)</p>
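<p>The lazy construction can be sketched as follows (illustrative names, in the spirit of <code>LazyMap</code>):</p>

```typescript
// Every key implicitly maps to a value, but values are only built
// (via the factory) the first time they are accessed.
class LazyCrdtMap<K, V> {
  private values = new Map<string, V>();

  constructor(
    private factory: () => V, // builds a value in its initial state
    private serializeKey: (key: K) => string
  ) {}

  // Returns the value for `key`, constructing it on first access.
  get(key: K): V {
    const k = this.serializeKey(key);
    let value = this.values.get(k);
    if (value === undefined) {
      value = this.factory();
      this.values.set(k, value);
    }
    return value;
  }

  // Only explicitly-constructed values occupy memory.
  constructedCount(): number {
    return this.values.size;
  }
}
```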
<blockquote>
<p>CRDT-valued maps are based on the <a rel="noopener" target="_blank" href="https://docs.riak.com/riak/kv/2.2.3/learn/concepts/crdts/index.html#maps">Riak map</a>.</p>
</blockquote>
<p></p><br />
<h2 id="collections-of-crdts">Collections of CRDTs</h2>
<p>Our above definition of a Unique Set implicitly assumed that the data values <code>x</code> were immutable and serializable (capable of being sent over the network). However, we can also make a <strong>Unique Set of CRDTs</strong>, whose values are dynamically-created CRDTs.</p>
<p>To add a new value CRDT, a user sends a unique new tag and any arguments needed to construct the value. Each recipient passes those arguments to a predefined factory method, then stores the returned CRDT in their copy of the set. When a value CRDT is deleted, it is forgotten and can no longer be used.</p>
<p>Note that unlike in a CRDT-valued map, values are explicitly created (with dynamic constructor arguments) and deleted—the set effectively provides collaborative <code>new</code> and <code>free</code> operations.</p>
<p>We can likewise make a <strong>list of CRDTs</strong>.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>In a shared folder containing multiple collaborative documents, you can define your document CRDT, then use a Unique Set of document CRDTs to model the whole folder. (You can also use a CRDT-valued map from names to documents, but then documents can’t be renamed, and documents “created” concurrently with the same name will end up merged.)</li>
</ul>
<!--- In a todo-list app, you can define a "todo-item CRDT" with fields `text` and `done`, giving the item text and whether it is done. The whole app's state is then a list of todo-item CRDTs.-->
<ul>
<li>Continuing the Quill rich-text example from the previous section, you can model a rich-text document as a list of “rich character CRDTs”, where each “rich character CRDT” consists of an immutable (non-CRDT) character plus the <code>attributes</code> map CRDT. This is sufficient to build <a rel="noopener" target="_blank" href="https://compoventuals-tests.herokuapp.com/host.html?network=ws&container=demos/rich-text/dist/rich_text.html">a simple Google Docs-style app with CRDTs</a> (<a rel="noopener" target="_blank" href="https://github.com/composablesys/collabs/blob/master/demos/apps/rich-text/src/rich_text.ts">source</a>).</li>
</ul>
<h2 id="using-composition">Using Composition</h2>
<p>You can use the above composition techniques and basic CRDTs to design CRDTs for many collaborative apps. Choosing the exact structure, and how operations and user-visible state map onto that structure, is the main challenge.</p>
<p>A good starting point is to design an ordinary (non-CRDT) data model, using ordinary objects, collections, etc., then convert it to a CRDT version. So variables become registers, sets become Unique Sets or Add-Wins Sets, etc. You can then tweak the design as needed to accommodate extra operations or fix weird concurrent behaviors.</p>
<p>To accommodate as many operations as possible while preserving user intention, I recommend:</p>
<p><a name="principle-4"></a><strong>Principle 4. Independent operations (in the user’s mind) should act on independent state.</strong></p>
<p><strong>Examples:</strong></p>
<ul>
<li>As mentioned earlier, you can represent the position and size of an image in a collaborative slide editor by using separate registers for the left, top, width, and height. If you wanted, you could instead use a single register whose value is a tuple (left, top, width, height), but this would violate Principle 4. Indeed, then if one user moved the image while another resized it, one of their changes would overwrite the other, instead of both moving and resizing. <!--Likewise, it would be a mistake to replace (left, top, width, height) with (left, top, right, bottom) (this also violates [Principle 2](#principle-2)).--></li>
<li>Again in a collaborative slide editor, you might initially model the slide list as a list of slide CRDTs. However, this provides no way for users to move slides around in the list, e.g., swap the order of two slides. You could implement a move operation using cut-and-paste, but then slide edits concurrent to a move will be lost, even though they are intuitively independent operations.<br />
<a name="list-with-move"></a>Following Principle 4, you should instead implement move operations by modifying some state independent of the slide itself. You can do this by replacing the <em>list</em> of slides with a <em>Unique Set</em> of objects <code>{ slide, positionReg }</code>, where <code>positionReg</code> is an LWW register indicating the position. To move a slide, you create a unique new position like in a list CRDT, then set the value of <code>positionReg</code> equal to that position. This construction gives the <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3380787.3393677"><strong>list-with-move</strong></a> CRDT.</li>
</ul>
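<p>In the pseudocode style used later for the spreadsheet (classes are implicitly CRDT objects, and <code>createPositionBetween</code> stands in for a list-CRDT position generator), a move operation might look like:</p>

```ts
class ListWithMove<C> {
  state: UniqueSet<{ value: C, positionReg: LWWRegister<ListCRDTPosition> }>;

  move(e: { value: C, positionReg: LWWRegister<ListCRDTPosition> }, i: number) {
    // Create a unique new position between the neighbors at the target
    // index, as in a list CRDT, then overwrite e's position. Concurrent
    // edits to e.value are untouched: they act on independent state.
    const p = createPositionBetween(i - 1, i);
    e.positionReg.set(p);
  }
}
```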
<h1 id="new-concurrent-causal-for-each-operations">New: Concurrent+Causal For-Each Operations</h1>
<p>There’s one more trick I want to show you. Sometimes, when performing a for-each operation on a Unique Set or list CRDT (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>), you don’t just want to affect existing (causally prior) elements. You also want to affect <em>elements that are added/inserted concurrently</em>.</p>
<p>For example:</p>
<ul>
<li>In a rich text editor, if one user bolds a range of text, while concurrently, another user types in the middle of the range, the latter text should also be bolded.
<br />
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/weird_bolding.png" alt="One user bolds a range of text, while concurrently, another user types “ the” in the middle. In the final result, “ the” is also bolded." />
<br />
In other words, the first user’s intended operation is “for each character in the range <em>including ones inserted concurrently</em>, bold it”.</li>
<li>In a collaborative recipe editor, if one user clicks a “double the recipe” button, while concurrently, another user edits an amount, then their edit should also be doubled. Otherwise, the recipe will be out of proportion, and the meal will be ruined!</li>
</ul>
<p>I call such an operation a <strong>concurrent+causal for-each operation</strong>. To accommodate the above examples, I propose the following addendum to <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>:</p>
<p><a name="principle-3b"></a><strong>Principle 3b. For operations that do something “for each” element of a collection, another option is to use a <em>concurrent+causal for-each operation</em> on a Unique Set (or list CRDT).</strong></p>
<p>To implement this, the initiating user first does a causal for-each operation. They then send a message describing how to perform the operation on concurrently added elements. The receivers apply the operation to any concurrently added elements they’ve received already (and haven’t yet deleted), then store the message in a log. Later, each time they receive a new element, they check if it’s concurrent to the stored message; if so, they apply the operation.</p>
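<p>Here is a runnable sketch of that implementation, with causality simplified: each for-each message records the ids of the elements it has already seen, and any element not in that set counts as concurrent. Names are illustrative:</p>

```typescript
// A Unique Set supporting concurrent+causal for-each operations.
type LoggedForEach<T> = { seen: Set<string>; op: (x: T) => T };

class ForEachSet<T> {
  private elements = new Map<string, T>();
  private log: LoggedForEach<T>[] = [];

  receiveAdd(id: string, x: T): void {
    // Apply any logged for-each operations concurrent to this add.
    for (const f of this.log) {
      if (!f.seen.has(id)) x = f.op(x);
    }
    this.elements.set(id, x);
  }

  receiveForEach(op: (x: T) => T): void {
    // Causal part: apply to every element we've already received...
    this.elements.forEach((x, id) => this.elements.set(id, op(x)));
    // ...then log it for elements that arrive concurrently.
    this.log.push({ seen: new Set(this.elements.keys()), op });
  }

  get(id: string): T | undefined {
    return this.elements.get(id);
  }
}
```

<p>A real implementation would also garbage-collect log entries once no concurrent adds can still arrive, which I ignore here.</p>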
<!-- > **Aside.** It would be more general to split Principle 3 into "causal for-each" and "concurrent for-each" operations. However, I haven't yet found a good use-case for a concurrent for-each operation that isn't part of a concurrent+causal for-each.
<p></p><br /> -->
<p>Concurrent+causal for-each operations are novel as far as I’m aware. They are based on <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3408976">a paper</a> I, <a rel="noopener" target="_blank" href="https://heather.miller.am/">Heather Miller</a>, and <a rel="noopener" target="_blank" href="http://christophermeiklejohn.com/">Christopher Meiklejohn</a> wrote last year, about a composition technique we call the <em>semidirect product</em>, which can implement them (albeit in a confusing way). <!--Unfortunately, the paper doesn't make clear what the semidirect product is doing intuitively (since we didn't understand this ourselves!). My current opinion is that concurrent+causal for-each operations are what it's really trying to do; the semidirect product is a special case of an optimized implementation, but written in the confusing traditional style (implementation + proof that concurrent operations commute). --></p>
<!-- > If you do want to use the semidirect product as an optimized implementation, be aware that it is not as general as it could be. E.g., the recipe example can be optimized, but not using the semidirect product. I'll write up a tech report about a more general approach at some point.
<p></p><br /> -->
<!-- Aside: dual view: controller for the for-each part plus oppositely-adjusted state. E.g. for scaling, or reversible list? Perhaps contrast with that approach---ours should be easier, in comparison to e.g. rich-text CRDT using invisible formatting characters (direct construction approach). -->
<!--# Summary: Principles of CRDT Design
also non-principle advice (basic designs, composition techniques)
For easy reference, here are our principles of CRDT design.
[**Principle 1.**](#principle-1) Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.
[**Principle 2.**](#principle-2) Express operations in terms of user intention---what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.
**Principle 3([a](#principle-3a), [b](#principle-3b)).** For operations that do something "for each" element of a collection, use a *causal for-each operation* or a *concurrent+causal for-each operation* on a Unique Set (or list CRDT).
[**Principle 4.**](#principle-4) Independent operations (in the user's mind) should act on independent state.-->
<h1 id="case-study-a-collaborative-spreadsheet">Case Study: A Collaborative Spreadsheet</h1>
<p>Now let’s get practical: we’re going to design a CRDT for a collaborative spreadsheet editor (think Google Sheets).</p>
<p>As practice, try sketching a design yourself before reading any further. The rest of this section describes how I would do it, but don’t worry if you come up with something different—there’s no one right answer! The point of this blog post is to give you the confidence to design and tweak CRDTs like this yourself, not to dictate “the one true spreadsheet CRDT™”.</p>
<h2 id="design-walkthrough">Design Walkthrough</h2>
<p>To start off, consider an individual cell. Fundamentally, it consists of a text string. We could make this a text (list) CRDT, but usually, you don’t edit individual cells collaboratively; instead, you type the new value of the cell, hit enter, and then its value shows up for everyone else. This suggests instead using a register, e.g., an LWW register.</p>
<p>Besides the text content, a cell can have properties like its font size, whether word wrap is enabled, etc. Since changes to these properties are all independent operations, following <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>, they should have independent state. This suggests using a CRDT object to represent the cell, with a different CRDT instance field for each property. In pseudocode (classes are implicitly <a href="#crdt-objects">CRDT objects</a>):</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Cell </span><span>{
</span><span> content</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">string</span><span>>;
</span><span> fontSize</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> wordWrap</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span></code></pre>
<p>The spreadsheet itself is a grid of cells. Each cell is indexed by its location (row, column), suggesting a map from locations to cells. (A 2D list could work too, but then we’d have to put rows and columns on an unequal footing, which might cause trouble later.) Thus let’s use a <code>Cell</code>-CRDT-valued map.</p>
<p>What about the map keys? It’s tempting to use conventional row-column indicators like “A1”, “B3”, etc. However, then we can’t easily insert or delete rows/columns, since doing so renames other cells’ indicators. (We could try making a “rename” operation, but that violates <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-2">Principle 2</a>, since it does not match the user’s original intention: inserting/deleting a different row/column.)</p>
<p>Instead, let’s identify cell locations using pairs (row, column), where “row” means “the line of cells horizontally adjacent to this cell”, independent of that row’s literal location (1, 2, etc.), and likewise for “column”. That is, we create an opaque <code>Row</code> object to represent each row, and likewise for columns, then use pairs <code>(Row, Column)</code> for our map keys.</p>
<p>The word “create” suggests using Unique Sets (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>), although since the rows and columns are ordered, we actually want list CRDTs. Hence our app state looks like:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span>rows: ListCRDT</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListCRDT</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span><span>cells: CRDTValuedMap</span><span style="color:#ececec;"><</span><span style="color:#78cecc80;">[</span><span>row: Row, column: Column</span><span style="color:#78cecc80;">]</span><span>, Cell</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Now you can insert or delete rows and columns by calling the appropriate operations on <code>columns</code> and <code>rows</code>, without affecting the <code>cells</code> map at all. (Due to the lazy nature of the map, we don’t have to explicitly create cells to fill a new row or column; they implicitly already exist.)</p>
<p>Speaking of rows and columns, there’s more we can do here. For example, rows have editable properties like their height, whether they are visible, etc. These properties are independent, so they should have independent states (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>). This suggests making <code>Row</code> into a CRDT object:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Row </span><span>{
</span><span> height</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span></code></pre>
<p>Also, we want to be able to move rows and columns around. We already described how to do this using a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#list-with-move">list-with-move</a>:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">ListWithMove</span><span><</span><span style="color:#d6d6d6;">C</span><span>> {
</span><span> state</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">UniqueSet</span><span><{value</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">C</span><span>, positionReg</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#d6d6d6;">ListCRDTPosition</span><span>>}>;
</span><span>}
</span><span>
</span><span>rows: ListWithMove</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListWithMove</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Next, we can also perform operations on every cell in a row, like changing the font size of every cell. For each such operation, we have three options:</p>
<ol>
<li>Use a causal for-each operation (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>). This will affect all current cells in the row, but not any cells that are created concurrently (when a new column is inserted). E.g., a “clear” operation that sets every cell’s value to <code>""</code>.</li>
<li>Use a concurrent+causal for-each operation (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3b">Principle 3b</a>). This will affect all current cells in the row <em>and</em> any created concurrently. E.g., changing the font size of a whole row.</li>
<li>Use an independent state that affects the row itself, not the cells (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>). E.g., our usage of <code>Row.height</code> for the height of a row.</li>
</ol>
<!-- > **Aside.** Note that the for-each loops loop over every cell in the row, even blank cells that have never been used. This has the downside of making all those cells explicitly exist in the CRDT-valued map, increasing memory usage. We tolerate this since our focus is to pin down the semantics, not give an efficient implementation. Once the semantics are pinned down, though, you are free to optimize the implementation.
<p></p><br /> -->
<!--Lastly, let's take another look at cell contents. Before I said it was just a string, but it's more interesting than that: cells can reference other cells in formulas, e.g., "= A2 + B3". If a column is inserted in front of column A, these references should update to "= B2 + C3", since they intuitively describe a *cell*, not the indicators themselves. So, we should store them using a pair `[row: Row, column: Column]`, like the map keys. The content then becomes an array of tokens, which can be literal strings or cell references:
```ts
class Cell {
content: LWWRegister<(string | [row: Row, column: Column])[]>;
fontSize: LWWRegister<number>;
wordWrap: EnableWinsFlag;
// ...
}
```-->
<h2 id="finished-design">Finished Design</h2>
<p>In summary, the state of our spreadsheet is as follows.</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> ---- CRDT Objects ----
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Row </span><span>{
</span><span> height</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Column </span><span>{
</span><span> width</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Cell </span><span>{
</span><span> content</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">string</span><span>>;
</span><span> fontSize</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> wordWrap</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">ListWithMove</span><span><</span><span style="color:#d6d6d6;">C</span><span>> {
</span><span> state</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">UniqueSet</span><span><{value</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">C</span><span>, positionReg</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#d6d6d6;">ListCRDTPosition</span><span>>}>;
</span><span>}
</span><span>
</span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> ---- App state ----
</span><span>rows: ListWithMove</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListWithMove</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span><span>cells: CRDTValuedMap</span><span style="color:#ececec;"><</span><span style="color:#78cecc80;">[</span><span>row: Row, column: Column</span><span style="color:#78cecc80;">]</span><span>, Cell</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Note that I never explicitly mentioned <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#correctness">CRDT correctness</a>—the claim that all users see the same document state after receiving the same messages. Because we assembled the design from existing CRDTs using composition techniques that preserve CRDT correctness, it is trivially correct. Plus, it should be straightforward to reason out what would happen in various concurrency scenarios.</p>
<p>As exercises, here are some further tweaks you can make to this design, phrased as user requests:</p>
<ol>
<li>“I’d like to have multiple sheets in the same document, accessible by tabs at the bottom of the screen, like in Excel.” <em>Hint (highlight to reveal): <font color="white">Use a list of CRDTs.</font></em></li>
<li>“I’ve noticed that if I change the font size of a cell, while at the same time someone else changes the font size for the whole row, sometimes their change overwrites mine. I’d rather keep my change, since it’s more specific.” <em>Hint: <font color="white">Use a register with a custom getter.</font></em></li>
<li>“I want to reference other cells in formulas, e.g., <code>= A2 + B3</code>. Later, if <code>B3</code> moves to <code>C3</code>, its references should update too.” <em>Hint: <font color="white">Store the reference as something immutable.</font></em></li>
</ol>
<h1 id="conclusion">Conclusion</h1>
<p>I hope you’ve gained an understanding of how CRDTs work, plus perhaps a desire to apply them in your own apps. We covered a lot:</p>
<ul>
<li><strong>Traditional CRDTs:</strong> Unique Set, List/Text, LWW Register, Enable-Wins Flag, Add-Wins Set, CRDT-Valued Map, and List-with-Move.</li>
<li><strong>Novel Operations</strong>: Concurrent+causal for-each operations on a Unique Set or list.</li>
<li><strong>Whole Apps</strong>: Spreadsheet, rich text, and pieces of various other apps.</li>
</ul>
<p>For more info, <a rel="noopener" target="_blank" href="https://crdt.tech/">crdt.tech</a> collects most CRDT resources in one place.</p>
<p>I’ve also started putting these ideas into practice in a library, <a rel="noopener" target="_blank" href="https://www.npmjs.com/package/@collabs/collabs">Collabs</a>. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in <a rel="noopener" target="_blank" href="https://www.youtube.com/watch?v=Exr0iY_D-vw">my Strange Loop talk</a>.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I thank Justine Sherry, Jonathan Aldrich, and Pratik Fegade for reviewing this post and giving helpful feedback. I also thank Heather Miller, Ria Pradeep, and Benito Geordie for numerous CRDT design discussions that led to these ideas.</p>
Time-Traveling Simulation for Security2022-12-06T00:00:00+00:002022-12-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/timetraveling-simulation/<p>Blockchains are a powerful technology which allow decentralized agreement with an immutable history. Since transactions can be added, but not removed, blockchains allow distributed banking as a trustworthy alternative to central banking.
A vast amount of cryptographic research on constructing secure blockchains has led to them being trusted to secure currency worth <a rel="noopener" target="_blank" href="https://coinmarketcap.com/currencies/bitcoin/">hundreds of billions</a> of US dollars.</p>
<p>Recently, blockchains have received attention as an enabler of cryptography rather than just a goal of it. Several works have used blockchains to build a variety of cryptographic tools, including <a rel="noopener" target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-70500-2_18">one-time programs</a> and <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s10623-018-0461-x">time-lock encryption</a>. These tools are impossible to construct without special assumptions. These works model cryptographic protocols as occurring in a world where a blockchain protocol is being executed. The cryptographic protocol is therefore able to perform actions such as reading the state of the blockchain or posting transactions to it. The exact security definitions vary significantly between these approaches.</p>
<p>Time-traveling simulation is a new security model for protocols executed in the presence of a blockchain. Intuitively, time-traveling simulation captures the philosophy that “any extra information an adversary learns in a real execution could have been learned on their own by waiting for the natural passage of time”. Since a blockchain will naturally progress no matter what the adversary does, it provides the notion of time needed to formalize this philosophy. </p>
<p>Time-traveling simulation bypasses many impossibility results, while at the same time yielding an arguably stronger notion of security than prior blockchain-based works. For example, time-traveling simulation enables zero knowledge arguments and secure two-party computation in three messages. It is currently not known how to construct these protocols in three messages with the standard notion of security, without relying on new hardness assumptions. </p>
<p>In this article, we will dive into the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#the-philosophy-of-security">definition of time-traveling simulation</a> and how it <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#comparison-to-other-relaxed-security-notions">compares to other security notions</a>. Additionally, we will explore how it can be used to bypass impossibility results for <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#application-time-traveling-simulators-for-zero-knowledge">three message zero knowledge arguments</a>.</p>
<h1 id="the-philosophy-of-security">The Philosophy of Security</h1>
<p>In modern cryptography, the central philosophy for security is “any extra information an adversary learns in a real execution could have been learned on their own”. In other words, the adversary learns nothing from participating in the real execution, beyond what they were supposed to learn. For example, in a zero knowledge argument, the adversary only learns that a given NP statement is true, without learning a witness for <em>why</em> it is true. This particular notion is actually too strong for many applications, so cryptographers usually consider weakenings of this philosophy with the same spirit. The most common weakening is “any extra information an adversary learns in a real execution could have been learned on their own using a little extra computation”.</p>
<p>These philosophies are captured formally by a mathematical object called a simulator. A simulator’s job is to reproduce whatever knowledge the adversarial verifier would have learned in a real execution of the protocol. However, it must do this without access to the real prover; it only has the adversary’s code. If such a simulator exists, then the adversary could run the simulator on its own. By doing so, it learns everything it would have learned in a real interaction, without interacting with the real prover.</p>
<p>More formally, a simulator (for zero knowledge) takes as input the adversary’s code and the statement being proven, then outputs a transcript of a protocol execution, along with the adversary’s internal state. In the real world, without loss of generality, the adversary outputs the transcript of the protocol execution along with its own internal state. This is before any post-processing. A protocol is zero knowledge if there exists a simulator whose output distribution is indistinguishable from the output distribution of the adversary in the real world. This guarantees that whatever information can be derived from the output of the adversary in the real world is indistinguishable from what can be derived from the simulator. Thus, by running the simulator, the adversary can learn whatever it would have learned in a real execution.</p>
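<p>As a sketch (one common way to formalize the definition just described, not a quote from any particular paper), the zero knowledge condition can be written as follows, where \( \langle V^{*}\rangle \) denotes the adversarial verifier’s code and \( \mathsf{out}_{V^{*}} \) denotes the verifier’s output, i.e., the transcript together with its internal state:</p>

```latex
\exists\, \mathsf{Sim}\ \ \forall\ \text{PPT } V^{*}:\quad
\{\mathsf{Sim}(x, \langle V^{*}\rangle)\}_{x \in L}
\;\approx_{c}\;
\{\mathsf{out}_{V^{*}}\langle P(x, w), V^{*}(x)\rangle\}_{x \in L}
```

<p>Here \( \approx_{c} \) denotes computational indistinguishability, matching the requirement that no efficient distinguisher can tell the simulator’s output from the adversary’s real-world output.</p>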
<p><img src="./simulator-paradigm.png" alt="In the real world, the adversary interacts with someone who knows a secret. In the ideal world, the simulator does not know the secret, and may internally interact with the adversary to produce a realistic looking view." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> The simulator imagines an interaction between the adversarial verifier and an imaginary prover. This interaction is indistinguishable from a real interaction, from the adversary's point of view.</div>
<p>In some sense, a simulator can be viewed as a method for the adversary to fool itself into accepting the truth of a statement without knowing a witness. It is important that the adversary can only fool itself - an adversarial prover should not be able to fool an honest verifier. This requires some asymmetry between the simulator and a real-world adversary. One of the most basic forms of asymmetry is knowledge of the adversary’s code, which allows the simulator to internally run and interact with the adversary. Any adversary knows its own code, but it certainly shouldn’t know anyone else’s!</p>
<p>To relax the security philosophy, the simulator is provided with some form of additional power which represents additional asymmetry between the simulator and a real-world adversary. The more asymmetry, the easier it is to create a simulator without allowing an adversarial prover to convince an honest verifier of a false statement. In general, providing more extra power to the simulator corresponds to a weaker security notion. The adversary can learn whatever the simulator can learn, so a more powerful simulator corresponds to an adversary which can learn more information. The table below compares common relaxations to time-traveling simulation in terms of their philosophies and what extra power is given to the simulator.</p>
<table><thead><tr><th align="left">Security Notion</th><th align="left">Philosophy: <br/>“Any extra information an adversary learns in a real execution could have been learned on their own…”</th><th align="left">Simulator</th></tr></thead><tbody>
<tr><td align="left">Expected PPT (Standard)</td><td align="left">in expected PPT.</td><td align="left">Runs in expected poly time.</td></tr>
<tr><td align="left">Superpolynomial Simulation</td><td align="left">in superpolynomial time.</td><td align="left">Runs in superpolynomial time.</td></tr>
<tr><td align="left">Common Reference String (CRS)</td><td align="left">using the CRS trapdoor.</td><td align="left">Can choose the CRS used by both parties. This allows adding a trapdoor to it.</td></tr>
<tr><td align="left">Majority Simulation</td><td align="left">if they controlled the blockchain.</td><td align="left">Controls the majority of blockchain participants.</td></tr>
<tr><td align="left">Time-Traveling Simulation</td><td align="left">shortly into the future.</td><td align="left">Can look into the future.</td></tr>
</tbody></table>
<h2 id="security-implications-of-time-traveling-simulation">Security Implications of Time-Traveling Simulation</h2>
<p>As mentioned previously, time-traveling simulation captures the philosophy that “any extra information an adversary learns in a real execution could have been learned on their own by waiting for the natural passage of time”. This is realized by allowing the simulator to see a potential future state of the blockchain, which consists of a valid extension by \( F \) blocks. Since such a state will become public information after a short time regardless of what the adversary does, this only reveals information that would have anyway been revealed with the natural passage of time.</p>
<p>Simulator access to a future state allows time-traveling simulation to bypass impossibility results for expected probabilistic polynomial time simulation, which is considered the standard notion of simulation.
A common blockchain property is that a computationally-bounded adversary cannot compute a valid extension by \( F \) blocks faster than the honest parties can extend the chain by, say, \( \sqrt{F} \) blocks. Therefore access to a future state represents additional asymmetry between the simulator and a real adversary.
This additional asymmetry makes it possible for the simulator to “imagine” the adversary’s real-world view in protocols where it otherwise would not have been able to, bypassing the impossibility results for expected PPT simulation (aka standard simulation).</p>
<p><img src="./future-state.png" alt="A blockchain comes equipped with a validity predicate which allows checking whether a state is a valid extension of a previous state. A future state is a valid extension of the current state." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> A blockchain comes equipped with a validity predicate which allows checking whether a state is a valid extension of a previous state. A future state is a valid extension of the current state.</div>
<p>Time-traveling simulation is almost as meaningful as standard simulation when it comes to long-term knowledge.
For example, imagine the task of constructing multi-party computation protocols which are secure against malicious adversaries. A malicious adversary may deviate from the protocol arbitrarily. Another kind of adversary is a semi-honest adversary, which follows the protocol, but may attempt to analyze the transcript later. It is much easier to construct multi-party computation protocols which are secure against semi-honest adversaries. A multi-party computation protocol with semi-honest security can be transformed to have malicious security by using the <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/28395.28420">GMW compiler</a>. To do the transformation, each party proves the statement “I executed the protocol honestly using some input” in zero knowledge. This convinces the other parties that they did indeed behave honestly, but does not reveal an explanation for the honest behavior. Crucially, this means that the zero knowledge argument preserves the privacy of each party’s inputs. Now consider using a zero knowledge argument with time-traveling simulation to instantiate the GMW compiler. Since honest behavior in a non-time-sensitive protocol does not depend on the passage of time, this does not reveal an explanation for the honest behavior. In particular, the inputs of each party are still private.</p>
<p>In contrast, time-traveling simulation may not be suitable for applications which are inherently time sensitive. For example, consider using a zero knowledge argument with time-traveling simulation to prove knowledge of a solution to a time-lock puzzle. A time-lock puzzle can be solved in some set amount of time (for example, a day), but cannot be solved faster than that. Since the simulator has access to a future state from after the time-lock puzzle can be solved, in this situation time-traveling simulation may allow the solution to be leaked today instead of tomorrow.</p>
<h3 id="comparison-to-other-relaxed-security-notions">Comparison to Other Relaxed Security Notions</h3>
<p>Several of these security notions also bypass impossibility results for expected PPT simulation. One way to further compare security notions is comparing how powerful their simulators are. As mentioned previously, a security notion which allows the simulator more power may allow the adversary to learn more information. In many cases, time-traveling simulation gives the simulator less power than other simulation notions, so it corresponds to better security guarantees.</p>
<p><strong>Super-Polynomial Time Simulation.</strong> Time-traveling simulation can be seen as a very restricted form of super-polynomial time or angel-based simulation. Angel-based simulation is similar to super-polynomial time simulation, but restricts the extra computational power to performing one specific task. For example, an angel may break the security of a particular commitment scheme. Both super-polynomial time and angel-based simulators are very powerful and can bypass many impossibility results. However, it can be challenging to argue that the simulator cannot break the security of other primitives. These primitives may only have security against polynomial-time adversaries, so they can be broken using any super-polynomial time computation. Continuing the example of commitments, if the simulator could also break a second commitment scheme, then it cannot guarantee that the second scheme is secure against the real adversary.</p>
<p>In the case of time-traveling simulation, the angel’s task is to quickly compute a potential future state of the blockchain exactly once. It is worth emphasizing the special nature of this task: it is computing something which will be publicly available information in just a short while. As such, whatever security a time-traveling simulator breaks would have been broken soon anyway. For example, regardless of which commitment scheme the parties use, the commitment to their input can never be broken by a time-traveling simulator.</p>
<p><strong>Common Reference String.</strong> Another good point of comparison is the common reference string model, since the blockchain state represents a pre-agreed-upon string. One important difference between a CRS and the way time-traveling simulation uses a blockchain is that the format of a common reference string often depends on the exact protocol being run (for example, a zero knowledge proof or a secure computation protocol). However, a blockchain does not adapt to auxiliary protocols. A second, and perhaps more important difference, is the notion of control. In the CRS model, the simulator has full control over the CRS. A time-traveling simulator, on the other hand, has no actual control over the blockchain, only some extra information about it. This means that a time-traveling simulator can learn less information than a simulator with full control over the blockchain. Since an adversary might be able to learn whatever a simulator can, the security notion is stronger if the simulator only has extra information, instead of full control.</p>
<p><strong>Majority Simulation.</strong> This difference in control over versus knowledge about the blockchain is especially illustrated when comparing time-traveling simulation to majority simulation. Majority simulation is another relaxed security model for protocols executed alongside a blockchain. In majority simulation, the simulator is allowed control over all honest parties which are participating in the progression of the blockchain. Since blockchain security requires the honest parties to be in control of the blockchain, this allows a majority simulator to perform tasks such as pausing or even rewinding the blockchain. Such capabilities should even allow computation of future states of the blockchain, which is the only power given to a time-traveling simulator. </p>
<p>In particular, majority simulation can introduce security vulnerabilities when running two different protocols using the same blockchain. Since the two protocols rely on the security of the blockchain, a simulator with full control over the blockchain can easily break the security of either protocol. Therefore majority simulation does not guarantee that a party which participates in one protocol cannot violate the security of the other protocol. Although it is nontrivial to see, time-traveling simulation can allow multiple carefully designed protocols to use the same blockchain at the same time. </p>
<h1 id="application-time-traveling-simulators-for-zero-knowledge">Application: Time-Traveling Simulators for Zero Knowledge</h1>
<p>Time-traveling simulators allow a particularly simple construction for zero knowledge arguments with three messages. As mentioned previously, constructing zero knowledge arguments with three messages is very difficult under the standard notion of security (expected PPT simulation). <a rel="noopener" target="_blank" href="https://iacr.org/archive/tcc2008/49480068/49480068.pdf">Prior work</a> shows that any security proof for a three message zero knowledge argument must make non-blackbox use of the adversary’s code. However, non-blackbox techniques are notoriously difficult. The only <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3188745.3188870">current construction</a> for three message zero knowledge relies on new cryptographic hardness assumptions.</p>
<p>A zero knowledge argument is, first and foremost, an argument. A prover attempts to convince a verifier that a statement \( x\) is in an NP language \( L\). The prover should not be able to convince the verifier of a false statement; this property is called soundness. The zero knowledge property requires that the argument does not allow the verifier to learn anything about the witness for \( x \in L\). This is formalized using the simulator definition discussed <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#the-philosophy-of-security">above</a>. As a reminder, the simulator must approximate a real view of the argument, except it does not have access to the real prover. In the standard notion of simulation, the simulator is an expected PPT algorithm.</p>
<p>In time-traveling simulation for zero knowledge arguments, the simulator additionally receives a valid extension of the blockchain by \(F\) blocks. Then it must produce the adversary’s view. If left alone, the blockchain will generate extensions of itself which are independent of the statement \(x\) or its witnesses. Therefore the future state which the simulator receives is effectively harmless and contains no information about the witness beyond what is naturally leaked with the passage of time.</p>
<p><img src="./timetraveling-simulator-zk.png" alt="In a real zero knowledge argument execution, the prover knows the witness. A time-traveling simulator for zero knowledge receives a future state of the blockchain instead of the witness." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> In a real zero knowledge argument execution, the prover knows the witness. A time-traveling simulator for zero knowledge receives a future state of the blockchain instead of the witness.</div>
<h2 id="zero-knowledge-in-three-rounds">Zero Knowledge in Three Rounds</h2>
<p>The construction of a three round zero knowledge argument uses a three round witness indistinguishable proof of knowledge (WIPoK). In a WIPoK, a prover convinces a verifier that they “know” a witness for some NP statement. The witness indistinguishability property guarantees that if there are two possible witnesses for the statement, then the verifier cannot tell which one the prover knows. This is a weaker security guarantee than zero knowledge, so it is possible to construct a WIPoK in just three rounds (even without assuming special setup like a CRS or a blockchain).</p>
<p>The construction is as follows. To prove the truth of an NP statement \( x\), the prover and verifier engage in a WIPoK for the statement “I know a witness for \( x\) or I know a blockchain state \(F\) blocks ahead of the current state”. Showing zero knowledge requires constructing a time-traveling simulator, which is initialized with a future state. The simulator acts as a prover in the WIPoK with the adversary, using the future state as its witness. Witness indistinguishability guarantees that an execution using the future state as a witness is indistinguishable from an execution using a witness for \(x \). The latter case is exactly what occurs in a real execution, so the simulator’s output is indistinguishable from a real execution.</p>
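<p>To make the compound statement concrete, here is a minimal Python sketch of the witness relation underlying the WIPoK. The helper names (<code>verify_np_witness</code>, <code>is_valid_extension</code>) are hypothetical placeholders standing in for the NP verifier and the blockchain’s validity predicate, not part of any real API:</p>

```python
def or_relation(x, witness, current_state, F, verify_np_witness, is_valid_extension):
    """Relation for the WIPoK statement: 'I know a witness for x, OR
    I know a valid F-block extension of the current blockchain state.'"""
    kind, value = witness
    if kind == "np":
        # The real prover's branch: an NP witness for x in L.
        return verify_np_witness(x, value)
    if kind == "future":
        # The simulator's branch: a state F blocks ahead of current_state.
        return is_valid_extension(current_state, value, F)
    return False
```

<p>Witness indistinguishability is exactly what guarantees the verifier cannot tell which of the two branches the prover (or simulator) used.</p>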
<p>To show soundness, observe that any adversarial prover must know a witness for the statement. This is either a witness for \( x \) or it is a future state of the blockchain. Since a real adversary cannot possibly know a future state of the blockchain without violating the blockchain’s security, it must know a witness for \( x\). The full argument for soundness requires some additional care in order to use the proof of knowledge property, since the WIPoK is composed in parallel with a blockchain protocol and many security properties break down during parallel composition. See the <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/035.pdf">full paper</a> for details.</p>
Kangaroo: Caching billions of tiny objects on flash2022-05-02T00:00:00+00:002022-05-02T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/kangaroo/<p>Many social-media and Internet-of-Things services have large numbers of tiny objects, each a few hundred bytes or less.
For example, edges in Facebook’s social graph, which are needed to connect friends, posts, and images among other content, <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/osdi20/presentation/berg">average under 100 bytes</a>.
Twitter tweets <a rel="noopener" target="_blank" href="https://techcrunch.com/2018/10/30/twitters-doubling-of-character-count-from-140-to-280-had-little-impact-on-length-of-tweets/">average 33 bytes</a>.</p>
<p>These objects are permanently stored in large-scale databases, object stores, or filesystems.
On top of this permanent storage layer,
popular objects are cached.
Caches allow quicker access to the popular objects and lower load on the storage layer.
A cache’s effectiveness in these systems is primarily measured by the ratio of
the number of requests it cannot fulfill to the total number of requests, or its miss ratio.
As the quantity of data scales, caching layers need to also scale to maintain
their miss ratio, otherwise end-user experiences such as website load times suffer.
However, scaling traditional DRAM caches is prohibitively expensive.
Instead, companies are increasingly using flash
to build larger caches since flash is 100x cheaper per bit than DRAM.</p>
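<p>As a quick illustrative sketch (the request counts below are made up, not production numbers from Facebook or Twitter), the miss ratio and the resulting load on the storage layer can be computed directly:</p>

```python
def miss_ratio(misses, total_requests):
    # Fraction of requests the cache cannot serve; lower is better.
    return misses / total_requests

# e.g., 1 million requests of which 50,000 miss the cache:
ratio = miss_ratio(50_000, 1_000_000)   # 0.05
storage_layer_requests = 50_000          # every miss falls through to storage
```
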
<p>Unfortunately, prior flash caches fall short of efficiently caching tiny objects,
a challenging workload for flash caching.
Prior approaches either increase the cache’s cost through a high indexing overhead
that requires excessive DRAM capacity,
or write too much, rapidly wearing out flash devices.
Thus, with prior designs, flash caching fails to live up to its potential as a cheap, large cache for tiny objects.</p>
<p>Kangaroo is a new flash cache optimized for tiny objects.
It enables efficient caching of tiny objects by requiring only a small
DRAM overhead and a small write overhead for cached objects.
In addition, Kangaroo introduces a new cache eviction policy that uses
minimal DRAM overhead while significantly reducing cache
misses, further reducing load on the storage layer.
Kangaroo is <a rel="noopener" target="_blank" href="https://github.com/saramcallister/Kangaroo">open source</a>
and implemented in <a rel="noopener" target="_blank" href="https://cachelib.org/">CacheLib</a>,
Facebook’s open-source caching engine.</p>
<p>Kangaroo lowers the number of cache misses by 29% over state-of-the-art
flash caching systems under production DRAM and flash constraints on traces
from production social-graph caching workloads at Facebook and Twitter.
These results are also corroborated with a
test deployment of Kangaroo in a shadow production setup at Facebook.
This research was published at <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">SOSP 2021</a> where it won the <a rel="noopener" target="_blank" href="https://sosp2021.mpi-sws.org/awards.html">Best Paper Award</a>.</p>
<h2 id="prior-approaches-too-much-dram-or-too-many-writes">Prior approaches: Too much DRAM or too many writes</h2>
<p>Prior flash caches fall into two main categories: <em>log-structured caches</em> and <em>set-associative caches</em>. Neither of these flash caches can efficiently support tiny objects
because, as explained further below, log-structured caches require prohibitively large
DRAM overheads whereas set-associative caches require prohibitively large write overheads.</p>
<h3 id="log-structured-caches-too-much-dram">Log-structured caches: Too much DRAM</h3>
<p>Log-structured caches use flash as a circular log.
During an insert, objects are first buffered in DRAM and then written to flash
sequentially in large groups.
Since objects can end up anywhere on flash, the cache maintains an in-memory index to find objects.</p>
<p>The advantage of a log-structured design is that it has a low <em>write amplification</em>.
Write amplification is the number of bytes written to flash divided by
the cumulative object size, and it represents the write overhead of a cache.
A write amplification of one is optimal, though often it is higher.
For example, writing a 100-byte object to flash by itself has a write amplification
of ~40x since flash has a minimum write granularity of 4KB.
Flash has a limited number of times it can be rewritten before becoming unusable.
Therefore, this significant write amplification wears out the flash device quickly,
requiring it to be replaced often.
Since a log-structured cache buffers objects in DRAM,
it can wait until it has enough objects to write them to flash efficiently.
Thus, log-structured caches have close-to-optimal write amplification.</p>
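<p>The effect of buffering shows up in a quick back-of-the-envelope sketch, using the 4 KB page granularity and ~100-byte objects from above:</p>

```python
PAGE_BYTES = 4096

def write_amplification(flash_bytes_written, object_bytes):
    # Bytes written to flash divided by the cumulative object size.
    return flash_bytes_written / object_bytes

# Writing one 100-byte object alone still costs a full page write:
solo = write_amplification(PAGE_BYTES, 100)           # ~41x
# Buffering 40 such objects and flushing them in one sequential write:
buffered = write_amplification(PAGE_BYTES, 40 * 100)  # ~1.02x
```
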
<p>However, log-structured caches have a large DRAM overhead when caching tiny objects.
They have to keep an index entry for every on-flash object to enable
finding those objects again on a lookup request.
Since objects are around 100 bytes, there would be roughly 20 billion of them
in a 2 TB flash cache.
Even with the <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi19/presentation/eisenman">lowest overhead in the literature at 30 bits/object</a>,
the cache would require 75 GB just to index the objects on flash.
Since caching on flash is meant to lower costs through removing DRAM,
log-structured caches are inefficient for tiny objects because they require too much DRAM.</p>
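<p>The DRAM math works out as follows, using the numbers above:</p>

```python
cache_bytes = 2 * 10**12        # 2 TB flash cache
object_bytes = 100              # ~100-byte objects
bits_per_index_entry = 30       # lowest per-object overhead in the literature

num_objects = cache_bytes // object_bytes              # 20 billion objects
index_bytes = num_objects * bits_per_index_entry // 8  # 75 GB of DRAM
```
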
<h3 id="set-associative-caches-too-many-writes">Set-associative caches: Too many writes</h3>
<p>Meanwhile, set-associative caches use flash as a large hash table where each flash page is a single <em>set</em>, or hash bucket.
During a lookup request, the cache hashes an object’s key to find its potential set on
flash and reads that flash page to find the object.</p>
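<p>A set-associative cache’s lookup and insert paths can be sketched as follows. This is a simplified in-memory model where each Python dict stands in for a 4 KB flash page; a real cache issues actual page reads and writes:</p>

```python
import hashlib

NUM_SETS = 1024

def set_index(key: str) -> int:
    # Hash the key to pick the one set (flash page) that may hold it.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SETS

class SetAssociativeCache:
    def __init__(self):
        # Each "set" stands in for one 4 KB flash page.
        self.sets = [dict() for _ in range(NUM_SETS)]

    def lookup(self, key):
        # One page read: fetch the set and scan it for the key.
        return self.sets[set_index(key)].get(key)

    def insert(self, key, value):
        # One page write per inserted object -- the source of the ~40x
        # write amplification for ~100-byte objects.
        self.sets[set_index(key)][key] = value
```
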
<p>Since finding objects is based on a hash function, set-associative caches
do not need large amounts of memory to track objects.
Thus, unlike log-structured caches, set-associative caches have a low
enough memory overhead to support large flash caches.</p>
<p>However, these caches write many more bytes than necessary.
When inserting a new object, the cache has to write, at a minimum,
a 4 KB flash page for every object.
If objects are roughly 100 bytes, the cache has a <em>40x</em> write amplification.
Thus, set-associative caches are also inefficient for tiny objects because they
require too many writes.</p>
<h2 id="kangaroo-an-efficient-tiny-object-flash-cache">Kangaroo: An efficient tiny-object flash cache</h2>
<p>Kangaroo caches tiny objects on flash effectively by combining log-structured and set-associative caches to reduce both DRAM and flash-write overheads.
Kangaroo has two main parts: <em>KLog</em>, a small log-structured flash cache, and <em>KSet</em>, a large set-associative flash cache.
At a high level, Kangaroo uses KLog as a staging area for objects so that
writing them to KSet is more efficient.</p>
<h3 id="finding-objects-in-kangaroo">Finding objects in Kangaroo</h3>
<p><img src="../kangaroo-lookup.png" alt="Lookup in Kangaroo" /></p>
<figure-caption>
<p>On a lookup, Kangaroo looks for the object in (1) the DRAM cache, then (2a) KLog’s index and (2b) KLog if the key is in the index, then finally (3a) KSet’s
Bloom filters and (3b) KSet if the Bloom filters indicate the object could be there.
If the object is not found in any of these locations, Kangaroo returns a miss.</p>
</figure-caption>
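<p>The lookup order can be sketched in Python (all names here are illustrative stand-ins for Kangaroo’s components, not CacheLib’s actual API):</p>

```python
def kangaroo_lookup(key, dram_cache, klog_index, klog, kset_bloom, kset):
    # (1) Check the DRAM cache first.
    value = dram_cache.get(key)
    if value is not None:
        return value
    # (2a) Check KLog's in-memory index; (2b) on a hit, read from KLog on flash.
    if key in klog_index:
        return klog.read(klog_index[key])
    # (3a) KSet's per-set Bloom filter avoids needless flash reads;
    # (3b) read the set only if the filter says the key may be present.
    if kset_bloom.may_contain(key):
        return kset.read_set(key)  # may still be None (Bloom false positive)
    return None  # miss
```
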
<h3 id="inserting-objects-in-kangaroo">Inserting objects in Kangaroo</h3>
<p><img src="../kangaroo-insert.png" alt="Insert into Kangaroo" /></p>
<figure-caption>
<p>On an insert, Kangaroo first places the object in (1) the DRAM cache.
This insertion may evict an object from the DRAM cache.
If the object is not admitted to flash, (2a) it is evicted from Kangaroo.
For instance, objects can be evicted at this stage based on a random admission policy,
where each object has a fixed probability of admission to the flash cache.
Otherwise, it is inserted into (2b) KLog’s index and (2c) written to flash in KLog via a buffered write.
When objects are evicted from KLog, they are again subject to an admission policy,
described more in the next section,
and (3a) can be evicted from Kangaroo entirely.
Admitted objects are written to (3b) KSet along with any other objects in KLog
that map to the same set in KSet.</p>
</figure-caption>
<p>One important aspect of the insertion path in Kangaroo that reduces write amplification
is how Kangaroo moves objects from KLog to KSet.
KLog often contains multiple objects mapping to the same set in KSet,
such as the pink and yellow objects in the figure above.
Whenever an object is evicted from KLog, Kangaroo proactively uses KLog’s index to
find any other objects that map to the same set in KSet,
and moves them to KSet as well.
Since writing a set always requires writing 4 KB, regardless of the number of objects inserted, writing multiple new objects instead of just one greatly reduces the write amplification.</p>
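<p>A toy sketch of that move (the hashing and sizes are assumptions, not Kangaroo's actual layout): when one object is evicted from KLog, gather every logged object bound for the same KSet set and flush them in a single 4 KB set write.</p>

```python
# Batched KLog -> KSet move: one 4 KB set write is shared by every
# logged object that maps to the same set.
PAGE_BYTES, OBJ_BYTES, NUM_SETS = 4096, 100, 8  # toy sizes

def flush_with_neighbors(evicted_key, klog_keys):
    """Return the keys written in one set write, and the resulting
    per-object write amplification for that write."""
    target_set = hash(evicted_key) % NUM_SETS
    batch = [k for k in klog_keys if hash(k) % NUM_SETS == target_set]
    write_amp = PAGE_BYTES / (OBJ_BYTES * len(batch))
    return batch, write_amp
```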
<p>Thus, Kangaroo amortizes writes to KSet over multiple objects, decreasing the overall number of bytes written to flash.
Kangaroo accomplishes this amortization with a small KLog (~5% of flash), resulting in only a small DRAM overhead to index KLog’s entire capacity.
Kangaroo thus addresses both the DRAM and flash-write overheads of caching tiny objects on flash.</p>
<h3 id="kangaroo-optimizations">Kangaroo optimizations</h3>
<p>On top of this basic design, Kangaroo introduces additional techniques to increase its effectiveness.
In particular, since Kangaroo is a cache and not a key-value store, it can evict objects to minimize writes.
Kangaroo exploits this opportunity by adding a threshold admission policy that evicts objects from KLog, instead of admitting them to KSet, whenever fewer than n objects would be inserted into a set in KSet.
This admission policy allows Kangaroo to guarantee that the write amplification for moving objects to KSet will be much lower than that of a set-associative cache.</p>
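<p>A minimal sketch of the threshold idea (the threshold and sizes here are placeholders, not Kangaroo's production settings): because a set is only rewritten when at least <code>n</code> objects can be added, the per-object write amplification of KSet writes is capped.</p>

```python
# Threshold admission: only rewrite a 4 KB set when the batch is big enough.
PAGE_BYTES, OBJ_BYTES = 4096, 100  # toy sizes

def admit_to_kset(batch_size: int, n: int) -> bool:
    """Admit a batch of KLog evictions only if it meets the threshold."""
    return batch_size >= n

def worst_case_write_amp(n: int) -> float:
    """Admitted batches contain >= n objects, so this bounds the amplification."""
    return PAGE_BYTES / (n * OBJ_BYTES)

print(worst_case_write_amp(8))  # -> 5.12, versus ~41x for single-object inserts
```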
<p>Kangaroo also introduces RRIParoo, a low DRAM-overhead eviction policy for KSet
based on the processor eviction policy <a rel="noopener" target="_blank" href="https://people.csail.mit.edu/emer/papers/2010.06.isca.rrip.pdf">RRIP</a>.
At a high level, RRIParoo keeps one bit in DRAM per object in KSet
to represent whether an object has been requested since the object was last
written to flash.
When a set is rewritten, this bit is used to update a 3-bit recency value kept on flash per object.
Objects in a set are then ordered by their 3-bit recency value
and Kangaroo evicts the least valuable
objects to make room for objects coming from KLog.
Thus, RRIParoo allows an advanced eviction policy in KSet
while keeping a low DRAM overhead.</p>
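<p>The bookkeeping described above can be sketched as follows. The field widths come from the text; the exact promotion and aging rule below is a simplifying assumption, not RRIParoo's precise policy:</p>

```python
# One DRAM bit per object ("requested since the last set write") refreshes
# a 3-bit recency value stored on flash whenever the set is rewritten.
MAX_RECENCY = 7  # highest 3-bit value = most recently useful

def rewrite_set(flash_recency, dram_bits):
    """Merge DRAM hit bits into the on-flash 3-bit recency values."""
    for key in flash_recency:
        if dram_bits.get(key):
            flash_recency[key] = MAX_RECENCY   # re-referenced: promote
        elif flash_recency[key] > 0:
            flash_recency[key] -= 1            # not referenced: age
        dram_bits[key] = False                 # reset the DRAM bit
    return flash_recency

def eviction_order(flash_recency):
    """Least valuable (lowest recency) objects are evicted first."""
    return sorted(flash_recency, key=flash_recency.get)
```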
<p>Kangaroo provides further optimizations to reduce DRAM overhead and reduce misses, as explained in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">SOSP’21 paper</a>.
Together, these optimizations allow Kangaroo to overcome the limitations of log-structured caches and set-associative caches,
creating a flash cache that delivers on the goal of efficient caching for tiny objects.</p>
<h2 id="kangaroo-outperforms-other-flash-caches">Kangaroo outperforms other flash caches</h2>
<p>We evaluated Kangaroo on a 2 TB flash drive using a production trace from Facebook
under production DRAM and write rate constraints.
We also evaluated CacheLib’s default small object cache (SA), a set-assocative
cache that Facebook uses to serve its social graph,
and an optimistic version of a log-structured cache (LS) with a full in-DRAM
index.</p>
<p><img src="../kangaroo-results.png" alt="Kangaroo vs LS vs SA on production FB trace" /></p>
<figure-caption>
<p>Kangaroo reduces misses compared to LS by 56% and to SA by 29% over the last
2 days of the production FB trace.
LS’s high DRAM overhead means that it cannot index the entire flash drive.
Thus, it has a lower effective capacity, which increases its miss ratio.
SA’s high write amplification means that it has to rate limit its insertions
and greatly over-provision flash to prevent the flash device from
wearing out too quickly.
Kangaroo does not run into these issues and has a better eviction policy,
allowing it to outperform other flash caches.</p>
</figure-caption>
<p>We corroborated these results in a production shadow deployment at Facebook.
In addition, Kangaroo maintains its advantage when operated under different constraints,
such as different write rate limits, more or less available DRAM, different tiny-object workloads, and larger device capacities. More details on these results can be found in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">paper</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Kangaroo is a flash cache for billions of tiny objects that handles a wide range of DRAM and flash-write budgets.
Kangaroo leverages prior log-structured and set-associative designs, together with new techniques, to achieve the best of both designs.
Experiments using a trace from Facebook show DRAM usage close to the best prior DRAM-optimized design,
flash writes close to the best prior write-optimized design,
and miss ratios better than either.
Kangaroo shows that flash caches can support tiny objects,
an adversarial workload for DRAM usage and write amplification,
while maintaining flash’s cost advantage.</p>
<p>For more details about Kangaroo, check out our SOSP <a rel="noopener" target="_blank" href="https://www.youtube.com/watch?v=bJ4rqSrcVqs">presentation</a> and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">paper</a>.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I want to thank my other collaborators on this work:
<a rel="noopener" target="_blank" href="https://bsb20.github.io/">Benjamin Berg</a> (CMU), <a rel="noopener" target="_blank" href="http://cmu.io/%7Ejtutuncu/">Julian Tutuncu-Macias</a> (CMU, now at Goldman Sachs), <a rel="noopener" target="_blank" href="https://jasony.me/">Juncheng Yang</a> (CMU), Sathya Gunasekar (Facebook), Jimmy Lu (Facebook),
<a rel="noopener" target="_blank" href="https://www.microsoft.com/en-us/research/people/daberg/">Daniel Berger</a> (Microsoft Research and University of Washington), <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Ebeckmann/">Nathan Beckmann</a> (CMU), and <a rel="noopener" target="_blank" href="https://users.ece.cmu.edu/%7Eganger/">Greg Ganger</a> (CMU).
I would also like to give a special thanks to the <a rel="noopener" target="_blank" href="https://cachelib.org/">CacheLib</a> team at Facebook
as well as both Facebook and Twitter for sharing traces with us.</p>
Cases2Beds: A Case Study in Actionable Intelligence Highlights2022-01-06T00:00:00+00:002022-01-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/casestobeds/<p><em>This blog post is adapted from the <a rel="noopener" target="_blank" href="https://delphi.cmu.edu/blog/2021/03/10/cases2beds-a-case-study-in-actionable-intelligence/">Delphi blog</a>, originally published on March 10th, 2021. Again, thank you to the Allegheny County Health Department, the DELPHI Group, Chris Scott, and Roni Rosenfeld.</em></p>
<p>One of the <a rel="noopener" target="_blank" href="https://delphi.cmu.edu/">Delphi Group</a>’s goals is to create informative tools for healthcare organizations. Tools are only useful if the insights they provide can inform concrete actions. That is to say these tools must provide actionable intelligence. In early November 2020, as COVID case rates in Allegheny County continued to rise, the Delphi Group partnered with the Allegheny County Health Department (ACHD) to create such tools for investigating if hospitals located in the county would run out of hospital beds for COVID patients <a href="#f1">(Fig. 1)</a>.</p>
<div id="f1"></div>
<p><img src="./WPRDC-1.svg" alt="Image of the hospitalizations due to COVID-19 and new cases from positive PCR tests in Allegheny County. There are rapid upward trends in hospitalizations and positive cases from October 2020 to mid-December 2020. The maximum number of hospitalizations is about 600 and the minimum is less than 50 [in Oct 2020]. The maximum number of positive cases is over 7000 and the minimum is less than 1000 [in Oct 2020]." />
<strong>Fig. 1:</strong> Hospitalizations Due to COVID-19 and New Cases from Positive PCR Tests in Allegheny County (WPRDC Data <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#WPRDCLink">1</a></sup>)</p>
<p>Based on its planning, the ACHD needed at least a week to open emergency COVID facilities. If the emergency space wasn’t open and hospital beds ran out, mortality rates could soar. But, if we didn’t need the facility, that decision would have stretched already thin resources. Many of the hospitals in Allegheny County were in contact, but each hospital system only had visibility into its own facilities. We wanted to offer a more holistic picture of hospital resources for ACHD to assist in its planning.</p>
<h2 id="a-probabilistic-approach">A Probabilistic Approach</h2>
<p>To provide county-level intelligence on hospital bed usage, we developed Cases2Beds<sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#Cases2BedsLink">2</a></sup>.</p>
<p>To extrapolate bed utilization 1-2 weeks into the future, we needed to estimate:</p>
<ol>
<li>The probability that a person who tested positive for COVID-19 would require hospitalization</li>
<li>How many days after testing a person would be hospitalized</li>
<li>How long a person with COVID would stay in the hospital</li>
<li>The current number of positive COVID tests</li>
</ol>
<p>These values vary by demographic factors, most notably age (<a href="#f2">Fig. 2</a>), and to a lesser extent, sex and race.</p>
<div id="f2"></div>
<p><img src="./rates-1.svg" alt="Age Group Comparisons based on the Allegheny County COVID-19 Tableau. The age groups are 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70+, and unspecified. As the age group increases, the percent of those who were tested in that age group and were later hospitalized in that age group increases (the 70+ age group being > 5%)." />
<strong>Fig. 2:</strong> Age Group Comparisons based on the Allegheny County COVID-19 Tableau <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#ACHDDashboardLink">3</a></sup></p>
<p>We used public data from Allegheny County about the number of people tested, test positivity rate, and hospitalization rate, broken down by the aforementioned demographic factors.</p>
<p>We also acquired information for two critical parameters: </p>
<ul>
<li><strong>Offset</strong>: Offset is the number of days between the day of testing (called specimen collection date) and the first day of hospitalization. For example, if the test date were 5 days before hospitalization, the offset would be 5 days. Also, if the test date is the hospital admit date, the offset would be 0 days (or sometimes, if, for example, they are admitted at midnight, -1 or +1 days). Notably, the offset can be negative, meaning a person may have been tested some days or weeks after admission.</li>
<li><strong>Length of Stay</strong>: The length of stay is approximately how many days a person uses a bed in the hospital (± 1 day).</li>
</ul>
<p>Given the hospitalization rate, the offset distribution, and the length of stay distribution, we can simulate multiple futures for any given set of positive cases and their testing dates. Estimating the future given a set of probabilities is a common problem, and one possible approach is a Monte Carlo simulation. This process ultimately shows the expected distribution of the number of beds needed each day.</p>
<p>Monte Carlo simulations involve running a large number of scenarios based on a set of probabilities. The more scenarios run, the more accurate the model tends to be. For example, if you gave 1000 people one dart to throw at a dartboard, even though each throw may not be very good, you’d still be able to get a pretty good idea of where the bull’s eye is after 1000 throws. This is the same principle we applied for Cases2Beds – after many simulations, we had a good idea of how many beds might be needed in the next two weeks.</p>
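<p>The dart-throwing analogy translates directly into code. A simplified sketch of such a simulation, with made-up placeholder rates rather than the PHI-derived parameters ACHD used:</p>

```python
import random
from collections import Counter

def simulate_beds(num_cases, hosp_rate, offset_days, stay_days, trials=1000):
    """For each positive case, flip a hospitalization coin, draw an offset
    (test date to admission) and a length of stay, and tally occupied beds.
    Returns the mean number of beds occupied on each day across trials."""
    totals = Counter()
    for _ in range(trials):
        for _ in range(num_cases):
            if random.random() < hosp_rate:
                start = random.choice(offset_days)
                stay = random.choice(stay_days)
                for day in range(start, start + stay):
                    totals[day] += 1
    return {day: count / trials for day, count in sorted(totals.items())}

# e.g. 100 cases tested today, 5% hospitalized, admitted 2-6 days later,
# staying 3-10 days:
curve = simulate_beds(100, 0.05, range(2, 7), range(3, 11))
```

<p>Running more trials tightens the estimate, exactly as with the 1000 darts: each individual scenario is noisy, but the average converges on the expected bed demand per day.</p>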
<p>Our prototype Monte Carlo simulation was written in Python and had a runtime of a few minutes. However, because the simulation works best with probabilities derived from Protected Health Information (PHI), ACHD needed to run it privately and offline so there would be no data transmission. Thus, any type of web application (which would transmit data to our servers) was ruled out. Even asking ACHD to run our Python software on their machines fell into a grey area. However, Microsoft Excel was easy to use and supported by ACHD. So we converted Cases2Beds into a spreadsheet. </p>
<p>It was relatively straightforward to port the Python application to VBA macros for Microsoft Excel. However, those macros aren’t designed to run large simulations, and the time required to generate a model was far worse, bordering on unusable.</p>
<h2 id="an-alternative-to-monte-carlo-the-analytical-model">An Alternative to Monte Carlo: the Analytical Model</h2>
<p>As an alternative, we developed an analytical model for Microsoft Excel that offered a much faster run time than the full Monte Carlo simulation. The sheet has two tabs of inputs: constant parameters (first tab, static), and case counts (second tab, dynamic). </p>
<p>The analytical model had the same idea as the Monte Carlo simulation. Some fraction of individuals who test positive today will be hospitalized after a varying offset (from test date to admit date) and variable duration (from admit date to discharge date) based on their characteristics (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#app">appendix</a>). Because these parameters can vary by region, anyone can change these values in spreadsheet tab 1.</p>
<p>The characteristics are:</p>
<ol>
<li>Age Group: (Most important) [unspecified, 0-9, 10-19, 20-29 … 70-79, 80+]</li>
<li>Sex: [unspecified, M, F]</li>
<li>Race: [unspecified, Black, White, Hispanic, Asian]</li>
</ol>
<p>And the parameters are:</p>
<ol>
<li>Hospitalization Rate</li>
<li>Offset Distribution Parameter Set: Parameters describing the number of days before someone who tests positive is hospitalized</li>
<li>Duration Distribution Parameter Set: Parameters describing the number of days someone will be in the hospital</li>
</ol>
<p>The second type of input is the daily positive case counts split by their traits. This is the input that the user actively changes on their end.</p>
<p>Behind the scenes, we take these parameters (first input tab) and generate Offset Fractions, which is the probability that a patient with given traits will occupy a bed for a duration k days after the specimen testing date. These Offset Fractions and the daily positive case breakdown (second input) give us the expected mean and variance up to 1 month in the future of the number of patients in the hospital per day based on the cases already seen (for details, see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#app">appendix</a>). This information can be used to generate plots like <a href="#f3">(Fig. 3)</a>. This graph isn’t to suggest that there won’t be any need for beds after February! It is just that based on the cases we know, very few people will be hospitalized for more than a month.</p>
<div id="f3"></div>
<p><img src="./C2B-1.svg" alt="Output of Cases2Beds using historical data until January 21st for Allegheny County Using Public Parameters. In the output of Cases2Beds, we see a peak in mid-December 2020 in the mean number of beds, followed by a stagnation period in mid-January 2021 and a rapid decline until the end of March 2021. The 25-75 Quantile and 5-95 Quantile are highlighted on the graph with the band having the largest width between mid-December 2020 and mid-January 2021. " />
<strong>Fig. 3:</strong> Output of Cases2Beds using historical data until January 21st for Allegheny County Using Public Parameters</p>
<p>If we assume independence between patients, the mean and variance calculations are exact. However, our quantile estimates are based on approximating the sum of independent binary variables, which is inaccurate near the tails. So the accuracy of the more extreme quantiles (95%+) depends on the number of cases present, which in practice makes them less trustworthy.</p>
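<p>A small numeric illustration of this point (toy numbers, not Allegheny County parameters): treating each case as an independent Bernoulli trial gives exact means and variances, while quantiles come from a normal-style approximation that weakens in the tails.</p>

```python
import math

def day_stats(case_groups):
    """case_groups: list of (num_cases, probability_of_occupying_a_bed).
    The day's bed count is a sum of independent Bernoulli variables,
    so its mean and variance are exact."""
    mean = sum(n * p for n, p in case_groups)
    var = sum(n * p * (1 - p) for n, p in case_groups)
    return mean, var

def approx_quantile(mean, var, z):
    """Normal approximation to a quantile; inaccurate near the tails."""
    return mean + z * math.sqrt(var)

# 200 low-risk cases (4% bed probability) plus 50 high-risk cases (10%):
mean, var = day_stats([(200, 0.04), (50, 0.10)])
print(mean)                               # 200*0.04 + 50*0.10 = 13 expected beds
print(approx_quantile(mean, var, 1.645))  # ~95th percentile estimate
```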
<h2 id="cases2beds-in-action">Cases2Beds in Action</h2>
<p>By the end of November 2020, we had a viable prototype Cases2Beds spreadsheet used by ACHD. Over the following months, we made various modifications with their feedback. For example, the ACHD staff did not have time to manually input case numbers. So, we were able to use the granular public data to give them estimates of future hospital utilization without any additional work on their end. </p>
<p>At the peak of bed utilization, hospital systems themselves increased their COVID bed utilization to 6x its October 2020 level. Fortunately, in Allegheny County, we never reached a point where demand for beds exceeded a somewhat elastic supply. In early January 2021, multiple organizations told us that the pandemic’s most acute problem had changed to vaccine distribution, and the number of COVID-19 beds needed dropped. Cases2Beds continues to act as an early warning system if the number of cases rises quickly.</p>
<div id="f4"></div>
<p><img src="./HHS-1.svg" alt="Numbers of staffed COVID beds over time vs. capacity from the HHS Protect Data. There was peak hospital utilization (7-day Average of COVID Adult Beds Used) in mid-December 2020 (over 800 beds avg.) before a steady decline until February 2021 (around 200 beds avg). " />
<strong>Fig. 4:</strong> Numbers of staffed COVID beds over time vs. capacity from the HHS Protect Data <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#HHSLink">5</a></sup>.</p>
<p>We were also able to show the efficacy of the spreadsheet to other health departments and hospitals by generating tailored, public parameters for offset and length of stay from different national public resources, like the Florida line-level COVID dataset <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#FloridaLineLevelLink">4</a></sup>. </p>
<p>Based on these organizations’ feedback that they needed projections more than 2 weeks out, we started to use Cases2Beds as an input to hospital utilization forecasting models. Other inputs to the hospital forecasting model included current hospital bed utilization (from HHS Protect<sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#HHSLink">5</a></sup>), how long current patients are likely to continue to be hospitalized, and how many new cases there will be in the near future. A preliminary evaluation of such a method shows decent predictive power when parameters are tailored to a location.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Cases2Beds was a case study about the realities of research institutions offering actionable intelligence in healthcare. While the Cases2Beds tool demonstrated reasonable predictive power, it was difficult to deploy it in a timely and actionable way. Our most significant challenges were data access and bureaucratic limitations to develop solutions at the granularity needed. </p>
<p>Research institutions can be effective partners to health organizations, but the next set of challenges of this pandemic–or the next–will require quick action. The tools we build now can set the stage for the future. </p>
<p>Thank you to the Allegheny County Health Department (especially Antony Gnalian, Dr. LuAnn Brink, and Dr. Debra Bogen) for their invaluable feedback, efforts, and shared interest in actionable intelligence.</p>
<p>Many members of the Delphi Group, including Sumit Agrawal, Katie Mazaitis, and Phil McGuinness, met regularly with the Allegheny County Health Department, provided data, edited this blog post, and investigated various solutions other than Cases2Beds.</p>
<h2 id="resources">Resources</h2>
<p>Please check out the <a rel="noopener" target="_blank" href="https://github.com/cmu-delphi/cases-to-beds-public">Cases2Beds Github Repo</a></p>
<p><a id="WPRDCLink">1.</a> <a rel="noopener" target="_blank" href="https://data.wprdc.org/dataset/allegheny-county-covid-19-tests-cases-and-deaths">WPRDC Allegheny County COVID dataset</a></p>
<p><a id="Cases2BedsLink">2.</a> <a rel="noopener" target="_blank" href="https://www.cmu.edu/delphi-web/cases2beds-v0.2.3.xlsm">Cases2Beds Worksheet</a></p>
<p><a id="ACHDDashboardLink">3.</a> <a rel="noopener" target="_blank" href="https://tableau.alleghenycounty.us/t/PublicSite/views/AlleghenyCountyCOVID-19Information_15912788131180/Landingpage?iframeSizedToWindow=true&%3Aembed=y&%3AshowAppBanner=false&%3Adisplay_count=no&%3AshowVizHome=no&%3Aorigin=viz_share_link">ACHD COVID-19 Dashboard</a></p>
<p><a id="FloridaLineLevelLink">4.</a> <a rel="noopener" target="_blank" href="https://experience.arcgis.com/experience/96dd742462124fa0b38ddedb9b25e429">Florida line-level COVID dataset</a></p>
<p><a id="HHSLink">5.</a> <a rel="noopener" target="_blank" href="https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u">HHS Protect Hospital Utilization Data</a></p>
<div id="app"></div>
<h2 id="appendix">Appendix</h2>
<p>To generate the Offset Fractions (OF(k|traits)), the probability that a patient with given traits will occupy a bed k days after the specimen testing date, we follow <strong>Alg 1</strong>. For a given set of traits, the Offset Fraction for day k, where k is between -10 and 31, is the sum, over every (offset, duration) pair whose stay covers day k, of the offset probability * duration probability * hospitalization rate. From these Offset Fractions, the mean/var of bed occupancy on a given day is given in <strong>Alg 2</strong>.</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>for o in (-10, 30): #This is the offset
</span><span> for d in (0, 40): #This is the duration of the stay
</span><span> for k in (o, o+d):
</span><span> if (k<31):
</span><span> OF(k|traits) += Offset(o|traits) * Duration(d|traits) * Hospitalization(traits)
</span></code></pre>
<p><strong>Alg 1</strong>: Generate Offset Fractions for a given set of traits</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>for specimen_date and num_cases in case_inputs:
</span><span> for t in (-10, 30):
</span><span> p = OF(t|traits)
</span><span> beds_mean(spec_date + t) += num_cases * p
</span><span> beds_var(spec_date + t) += num_cases*p*(1-p)
</span></code></pre>
<p><strong>Alg 2</strong>: Generate Mean and Variances</p>
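<p>For readers who want to run the algorithms above, here is a self-contained Python rendering of <strong>Alg 1</strong> and <strong>Alg 2</strong> for a single trait group; the offset, duration, and hospitalization numbers in the example are toy placeholders:</p>

```python
from collections import defaultdict

def offset_fractions(offset_pdf, duration_pdf, hosp_rate):
    """Alg 1: OF[k] = P(a positive case occupies a bed k days after testing)."""
    OF = defaultdict(float)
    for o, p_o in offset_pdf.items():        # offset: test date -> admission
        for d, p_d in duration_pdf.items():  # duration: length of stay
            for k in range(o, o + d):        # days the bed is occupied
                if k < 31:
                    OF[k] += p_o * p_d * hosp_rate
    return OF

def bed_moments(case_inputs, OF):
    """Alg 2: per-day mean and variance of occupied beds."""
    mean, var = defaultdict(float), defaultdict(float)
    for spec_date, num_cases in case_inputs:
        for t, p in OF.items():
            mean[spec_date + t] += num_cases * p
            var[spec_date + t] += num_cases * p * (1 - p)
    return mean, var

# Toy example: everyone admitted 2 days after testing, staying 3 days,
# with a 5% hospitalization rate; 100 cases tested on day 0.
OF = offset_fractions({2: 1.0}, {3: 1.0}, hosp_rate=0.05)
mean, var = bed_moments([(0, 100)], OF)
```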
<p><strong>High-level Mathematical Formulation of the Model:</strong> </p>
<p>O<sub>r,l</sub>: The offset value for a given subset of the population r <span>∈</span> R where R := {race}x{gender}x{age group} for a given day l where -10 <span>≤</span> l <span>≤</span> 30. This <strong>pdf</strong> is derived from a piecewise function using segments of exponential distributions characterized by the offset parameters. </p>
<p>D<sub>r,k</sub>: The duration value for a given subset of the population r <span>∈</span> R for a given day k where 0 <span>≤</span> k <span>≤</span> 40. This <strong>pdf</strong> is derived from a piecewise function using segments of exponential distributions characterized by the duration parameters. </p>
<p>h<sub>r</sub>: The hospitalization rate for a given subset of the population r <span>∈</span> R where 0 <span>≤</span> h<sub>r</sub> <span>≤</span> 1 </p>
<p>c<sub>r,d</sub>: The number of cases for a given subset of the population r <span>∈</span> R on a particular specimen collection date d (ex: 5 cases with specimen collected on January 1st 2021).</p>
<p>$$OF_{r, j} = \sum_{l=-10}^{30} \sum_{k=0}^{40} \mathbb{I} ( l \leq j \leq l+k ) O_{r, l} * D_{r, k}*h_r $$
The offset fraction for a given subset of the population r <span>∈</span> R for a given delta j where -10 <span>≤</span> j <span>≤</span> 30.</p>
<p>$$ \mathbb{E}[\beta_i] = \sum_{d \in D}\sum_{r \in R}\sum_{j = -10}^{30} \mathbb{I} ( d+j = i) OF_{r, j}*c_{r, d} $$
The expected number of beds on date i, where i can start 10 days before the first case date and can end 30 days after the last case date (c<sub>r,d</sub>).</p>
Hello World2021-08-16T00:00:00+00:002021-08-16T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2021/helloworld/<h1 id="hello-world">Hello World!</h1>
<h2 id="hello">Hello</h2>
<p>This is the first post being made to the CSD PhD blog, testing out the
system. And so, indeed, hello world!</p>
<p>That’s really all there is to this
post. You don’t need to keep reading. I just have to fill this space so
that the preview of this post is filled up. That way when it renders
we can see what it looks like full of text. So this is just filler text,
explaining what is going on in a meta way. Feel free to ignore it and
just go about your business.</p>
<p>But it seems that you are in fact continuing to read. I wonder why.
Perhaps if I had filled this
section in with <em>Lorem ipsum</em> it would be a better signal that there are
no secrets to be gotten from reading this section. You are still reading though.
Just reading along. This is just a test post,
and here you are, taking all this time to read it. It’s just gonna be
filled with meaningless filler text. Well, that and some markdown
rendering tests. Most of them are coming up in
the next section. And since you keep on reading, you’ll certainly run
into them. That’s probably going to be even more bland to read. It’s
just going to repeat “Hello World” over and over again. But maybe
you just enjoy reading any words at all. You are, after all, still
reading this.</p>
<p>What, did you still think there was going to be some
secret in this section? Well, there’s not. Honestly its just filler text.
I know, there were these whole additional paragraphs, but they’re not special - just testing
the paragraph break rendering. And, yeah, it works. You saw the paragraph break, right?
Or are you just reading on without paying attention? Or, actually, did the site break?
Well, whatever, this test post can’t do anything about it. No, this post is just going to
go on, unread, moldering in a virtual corner. Well, almost unread. You are reading this.
I still don’t know why, but you’ve made it a long way through. Honestly, you could
probably go longer than I care to write for a post as meaningless as this.
Next time, I’m just going to use <em>Lorem ipsum</em> to fill space.</p>
<h2 id="world">World</h2>
<p><em>Hello World</em></p>
<p><strong>Hello World</strong></p>
<p><del>Hello World</del></p>
<p><code>Hello World</code></p>
<p>$$Hello World$$</p>
<blockquote>
<p>Hello World</p>
</blockquote>
<ul>
<li>Hello</li>
<li>World</li>
</ul>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#fed6af;">#include </span><span style="color:#d6d6d680;"><</span><span style="color:#d68686;">stdio.h</span><span style="color:#d6d6d680;">>
</span><span>
</span><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">main</span><span>() {
</span><span> </span><span style="color:#fffd87;">printf</span><span>(</span><span style="color:#d6d6d680;">"</span><span style="color:#d68686;">Hello World!</span><span style="color:#d6d6d680;">"</span><span>);
</span><span> </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#87d6d5;">0</span><span>;
</span><span>}
</span></code></pre>
<table><thead><tr><th align="right">Hello</th><th align="left">World</th></tr></thead><tbody>
<tr><td align="right">Hi</td><td align="left">Universe</td></tr>
<tr><td align="right">Greetings</td><td align="left">Earth</td></tr>
<tr><td align="right">Hey</td><td align="left">Everything</td></tr>
<tr><td align="right">Sup</td><td align="left">Realm</td></tr>
</tbody></table>
<p><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program">Further Reading</a></p>
<h1 id="lorem-ipsum">Lorem Ipsum</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut rutrum nulla luctus tristique porttitor. Curabitur ut nibh non nulla dapibus facilisis. In maximus, nisi bibendum volutpat sagittis, enim ligula vehicula dolor, a dignissim est justo quis lorem. Nulla cursus sagittis magna facilisis imperdiet. Etiam non luctus arcu. Sed vulputate urna urna, sed convallis metus imperdiet et. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Praesent ut ornare nisl, sit amet congue ligula. Ut iaculis euismod dictum. Donec est arcu, porta nec sem vel, euismod mollis eros. Nulla consequat vel magna nec ornare. Pellentesque eu massa vel orci ornare ultrices nec in nunc.</p>
<p>Quisque tellus est, accumsan vitae ullamcorper a, maximus et ante. Mauris odio sem, bibendum fringilla ullamcorper tempor, molestie id dolor. Nulla sed tincidunt sapien. Duis vitae arcu sollicitudin, ullamcorper est vel, varius dolor. Nunc augue erat, congue ut tincidunt id, ornare a libero. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque purus diam, ornare sed suscipit a, euismod in justo.</p>
<p>Aliquam aliquam congue eros vel volutpat. Nunc ullamcorper vitae mi vehicula commodo. Phasellus ultricies a nunc a blandit. Integer tincidunt velit ut metus vehicula, vitae dictum eros sodales. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Cras consectetur suscipit maximus. Integer ut sem fringilla, interdum nulla sed, pretium nisi. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nam lobortis mollis leo, ut condimentum erat hendrerit sit amet. Donec vitae semper risus. Aenean sollicitudin tincidunt laoreet. Quisque velit tellus, vestibulum sed nisi et, pharetra feugiat nunc.</p>
<p>Morbi luctus lobortis orci id aliquam. Pellentesque viverra arcu nunc, sed ultricies lectus molestie quis. Praesent cursus dui elementum purus tempor vehicula. Nulla sed ligula blandit, tristique purus nec, consequat ex. Nunc et consequat ligula, nec vehicula nisi. Integer imperdiet nisl felis, nec porttitor quam maximus quis. Sed commodo lacus eget urna consequat gravida. Proin pellentesque mollis magna, eu consectetur nulla efficitur vitae. Nullam rhoncus faucibus sapien id gravida. Nam maximus pellentesque lorem, auctor vulputate quam porttitor sed. Praesent fringilla id eros sit amet lobortis. Donec ultrices pretium nisl sit amet euismod. Vestibulum consectetur euismod orci non fermentum. Nam congue sapien id interdum malesuada. Sed sit amet rhoncus magna, vel gravida sem. Praesent tincidunt consectetur gravida.</p>
<p>Ut consectetur, ex at sagittis blandit, libero magna dictum velit, nec ullamcorper erat diam nec urna. Curabitur tincidunt nisi risus, non pulvinar ipsum eleifend et. Pellentesque nec dolor non tellus efficitur mattis vitae sed neque. Suspendisse lectus nulla, mollis in fermentum ac, tempus a sapien. Suspendisse tempor consectetur porttitor. Aenean sed purus tempor, mollis lectus ac, tristique odio. Sed purus risus, tempus non risus aliquet, tincidunt aliquam eros. Vestibulum eget sollicitudin diam, porta rhoncus felis. Cras pellentesque vestibulum euismod. Phasellus placerat iaculis quam, quis suscipit elit semper ut.
Foundus theus secretus. Donec tempus sed justo nec semper. Vestibulum blandit velit quis risus lobortis, sit amet efficitur nulla scelerisque. Phasellus condimentum lectus non augue molestie, egestas auctor turpis porta. Mauris eget est a eros venenatis tempus. Duis lorem nisl, vulputate et neque et, ullamcorper ornare ipsum.</p>