CMU CSD PhD BlogZola2024-08-26T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/atom.xmlModeling BBRv1's Interactions with Loss-Based Congestion Control2024-08-26T00:00:00+00:002024-08-26T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/modeling-bbr/
<p>In 2006, Senator Ted Stevens infamously described the Internet as a “series of tubes”, where “filled tubes” delayed his email. While Senator Stevens was ridiculed at the time for his simplistic understanding of the Internet, his characterization was not entirely wrong:</p>
<blockquote>
<p>“It’s a series of tubes. And if you don’t understand, those tubes can be filled and if they are filled, when you put your message in it, it gets in line and it’s going to be delayed by anyone that puts into that tube enormous amounts of material…”</p>
</blockquote>
<p>The more technical term for what Senator Stevens describes here is congestion. Logical “tubes” are created between two end-points across a network by transport protocols like TCP. These tubes run over physical links with limited capacity which may become filled by competing traffic. If end-points send more data into the network than the capacity of these links, the network becomes overloaded and congested, delaying packets and degrading throughput. If congestion goes unchecked, the network can become so slow it grinds to a halt, a condition called congestion collapse.</p>
<p>Historically, all senders could share an overloaded network because they all used the same congestion control algorithm (CCA). For example, if all senders used a loss-based CCA like TCP Reno, then all senders could share the network both fairly and efficiently, <em>without explicit coordination between the senders.</em> However, over the 30 years since TCP Reno’s deployment, large content providers have proposed and deployed novel CCAs to keep up with the growing demands of a faster Internet, emerging applications like video conferencing, and billions of mobile users. Since the sender must implement congestion control, content providers can deploy any CCA. Consequently, if we are moving away from a homogeneous deployment of one loss-based CCA toward a wild west of many CCAs, the fairness of the Internet is at stake.</p>
<p>Most notable of these new algorithms is TCP BBR, proposed and deployed by Google in 2016, including an open-source implementation in the Linux kernel. When Google published BBR, a surge of researchers began studying the algorithm: could BBR share fairly with the already widely deployed loss-based CCA TCP Cubic? Several studies showed empirically, with experiments in controlled testbeds, that BBR could be very unfair to Cubic, but offered little explanation of why. Consequently, in this work, we ask if we can determine when and why BBR is unfair to Cubic.</p>
<p>In the past, researchers have used modeling to understand a CCA’s behavior. For example, the <a rel="noopener" target="_blank" href="https://www.cs.utexas.edu/%7Elam/395t/papers/Mathis1998.pdf">Mathis equation</a> showed that TCP Reno’s throughput is inversely proportional to the square root of the loss rate. So in this work, we build a similar model to understand a BBR flow’s throughput when competing with other flows.</p>
<p>Our model has an important finding: BBR’s throughput when competing with loss-based algorithms does not depend on the number of competing loss-based flows. Regardless of the number of competing flows, BBR flows will always achieve a fixed fraction of the available throughput.</p>
<h1 id="how-does-bbrv1-work">How does BBRv1 work?</h1>
<p>BBR is designed to be a rate-based algorithm. In contrast to window-based TCP variants (for example Cubic and Reno) which control the amount of in-flight data, BBR’s goal is to figure out what its fair share of the bottleneck link is and to pace the sending of packets at exactly that rate.</p>
<p>A BBR flow determines its sending rate by repeatedly probing for bandwidth (in its ProbeBW state) over 8 round trip times (RTTs). For 6 RTTs, BBR will send at rate \(R\), its current maximum achieved throughput. Then BBR will send at a higher rate to see if it can get better throughput. It does this by increasing its sending rate to \(1.25R\) for 1 RTT and observing the achieved throughput. Finally, it will reduce its sending rate to \(0.75R\) for 1 RTT to drain any excess packets from the queue. If a BBR flow observes a higher achieved throughput, it will adjust its sending rate \(R\) to what it now thinks is its maximum throughput. In total, these steps (ProbeBW) take 8 RTTs. BBR will repeat this 8 RTT cycle over and over again.</p>
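The 8-RTT ProbeBW schedule described above can be sketched as a fixed cycle of pacing gains. This is a simplified illustration of the cycle from the text, not the kernel implementation, and the function names are ours:

```python
# BBRv1's ProbeBW pacing-gain cycle, as described above: 1 RTT probing
# at 1.25x, 1 RTT draining at 0.75x, then 6 RTTs cruising at 1.0x.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def pacing_rate(max_bw_estimate: float, rtt_index: int) -> float:
    """Sending rate during the given RTT of the 8-RTT ProbeBW cycle."""
    gain = PROBE_BW_GAINS[rtt_index % len(PROBE_BW_GAINS)]
    return gain * max_bw_estimate

# Over one full cycle the average gain is 1.0, so in steady state the
# flow's throughput matches its bandwidth estimate R.
rates = [pacing_rate(10.0, i) for i in range(8)]  # R = 10 Mbps
print(rates[0], rates[1], sum(rates) / 8)  # 12.5 7.5 10.0
```

Because the probe and drain RTTs average out, a solo BBR flow paces at its estimated bandwidth overall while still periodically testing for more.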
<p>When one BBR flow is not sharing the bottleneck link with any other flow, repeatedly probing for bandwidth allows BBR to both maximize throughput and minimize delay by figuring out exactly how much bandwidth is available.</p>
<p>But what happens during ProbeBW when BBR competes with Reno or Cubic?</p>
<p>During ProbeBW, by putting additional data into the network when the bottleneck link and queue are already full, BBR will cause packet loss. Because Reno and Cubic are loss-based algorithms, they will reduce their sending rates in response to packet loss.</p>
<p>BBR, on the other hand, will not reduce its rate; instead, it will see that it was able to get better throughput and will increase its sending rate. This causes Reno and Cubic to end up with less bandwidth than BBR.</p>
<p><img src="./BBR_fig1.png" alt="1-bbr-vs-1-cubic" /></p>
<p><em><strong>Fig. 1:</strong> 1 BBR vs. 1 Cubic. (10 Mbps network, 32 x bandwidth delay product queue). During ProbeBW, BBR causes Cubic to back off.</em></p>
<p>During ProbeBW, BBR will put more and more packets into the bottleneck queue, increasing its bandwidth estimate. This process will repeat over and over again. While the Cubic flows back off, BBR will push more and more packets into the queue. We can observe this behavior in Figure 1.</p>
<p>Given what we know about BBR so far, during ProbeBW, BBR should just keep putting more and more packets into the network until Cubic is starved. However, from Figure 1, we see that eventually BBR stops probing for more bandwidth.</p>
<h1 id="under-competition-bbr-is-window-limited">Under competition, BBR is window-limited</h1>
<p>Surprisingly, we find that under competition, BBR’s rate is not determined by its bandwidth estimate but by a window-limit called the <em>in-flight cap</em>.</p>
<p>BBR limits the amount of data in flight to two bandwidth-delay products (BDP) as a safety mechanism against delayed and stretched ACKs. The BDP is the product of BBR’s bandwidth estimate and its estimate of the end-to-end latency.</p>
<p>We find that this in-flight cap is what ultimately dictates what fraction of the link a BBR flow will get when competing with other flows and is what stops BBR during ProbeBW in Figure 1.</p>
<p>Therefore, if we can model the in-flight cap, we can figure out what BBR’s throughput will be when competing with loss-based flows.</p>
<h1 id="a-simple-model-for-bbr-s-throughput">A simple model for BBR’s throughput</h1>
<p>To derive a simple model of BBR’s in-flight cap, and consequently its throughput, we assume that we have 1 BBR flow vs. any number of Cubic flows in a deep-buffered network (for example in Figure 1 we set the buffer size to 32 BDP).</p>
<p><img src="./BBR_fig2.png" alt="variables in simple BBR model" />
<em><strong>Fig. 2:</strong> Variables in simple BBR model.</em></p>
<p>First, we define some variables in the model shown in Figure 2. This illustrates what the bottleneck link and queue might look like. We assume the bottleneck link capacity is \(c\) and the bottleneck queue size is \(q\). If the Cubic flows occupy \(p\) fraction of the queue, we assume that 1 BBR flow occupies the remaining \((1-p)\) fraction of the queue.</p>
<p><img src="./BBR_fig3.png" alt="A simple model for BBR’s queue occupancy/throughput." />
<em><strong>Fig. 3:</strong> A simple model for BBR’s queue occupancy/throughput. This model says 1 BBR flow can get up to half of the available queue/link capacity when competing with any number of Cubic flows.</em></p>
<p>Given this, we can draw Cubic’s queue occupancy vs. BBR’s, as shown in Figure 3. First, the blue line shows that if Cubic occupies a \(p\) fraction of the queue, then BBR must have the remaining \((1-p)\) fraction of the queue in flight.</p>
<p>Next, we need to model what BBR’s bandwidth and RTT estimates will be, so we can also draw the 2 BDP in-flight cap. BBR’s bandwidth estimate is equivalent to BBR’s fraction of the queue (which we have already said is \(1-p\) times the link capacity), so its bandwidth estimate is \((1-p)c\).</p>
<p>BBR determines its RTT estimate by continually tracking the minimum RTT it has seen over the last 10 seconds. Every 10 seconds, BBR enters its ProbeRTT state and lowers its in-flight data to four packets, so it can drain its own packets from the queue and measure the RTT without any self-induced queueing delay. Since BBR drains its packets from the queue, the packets remaining in the queue belong to Cubic. Thus, assuming negligible propagation delay, BBR’s RTT estimate will be Cubic’s queue occupancy divided by the link capacity: \((pq) / c\).</p>
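The 10-second minimum-RTT filter can be sketched as follows. This is a simplified stand-in (the class name is hypothetical, and the real implementation couples this filter with the ProbeRTT state machine):

```python
from collections import deque

class MinRttFilter:
    """Sketch of BBR's min-RTT tracking: the minimum RTT sample
    observed over a sliding 10-second window."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, rtt) pairs

    def update(self, now: float, rtt: float) -> float:
        self.samples.append((now, rtt))
        # Discard samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return min(r for _, r in self.samples)

f = MinRttFilter()
f.update(0.0, 0.050)
est = f.update(5.0, 0.080)    # the 50 ms sample is still in the window
est2 = f.update(12.0, 0.080)  # the 50 ms sample has aged out
print(est, est2)  # 0.05 0.08
```

Once BBR's own packets no longer drain from the queue, every fresh sample includes Cubic's standing queue, so after 10 seconds the filter's output rises to \((pq)/c\).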
<p>Thus, BBR’s in-flight cap is $$2 \cdot \mathrm{BW} \cdot \mathrm{RTT} = 2 \cdot (1-p)c \cdot \frac{pq}{c} = 2q(p - p^2).$$ This quadratic equation for the in-flight cap is the green line in Figure 3.</p>
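The two curves from Figure 3 can be evaluated directly (a small sketch; the function names are ours, not from the paper):

```python
def bbr_inflight_cap(p: float, q: float) -> float:
    """In-flight cap from the simple model:
    2 * BW * RTT = 2 * (1-p)*c * (p*q)/c = 2*q*(p - p**2)."""
    return 2.0 * q * (p - p * p)

def bbr_queue_share(p: float, q: float) -> float:
    """BBR's in-flight data if Cubic holds fraction p of a queue of size q."""
    return (1.0 - p) * q

# The curves intersect where (1-p)q = 2q(p - p^2), i.e. at p = 1/2:
q = 32.0  # queue size in BDPs, as in Figure 1
print(bbr_inflight_cap(0.5, q), bbr_queue_share(0.5, q))  # 16.0 16.0
```

Note that the link capacity \(c\) cancels out of the cap entirely, which is why the model depends only on \(p\) and \(q\).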
<p>Returning to Figure 1, what we are seeing here is BBR moving up this blue line, putting more and more data into the queue until the amount of data it has in-flight meets its in-flight cap. This intersection is illustrated in Figure 3 by the dashed orange line.</p>
<p>This model shows that 1 BBR flow can get up to half of the link when competing with any number of Cubic flows! This is why we see unfairness between 1 BBR flow and 2 or more Cubic flows.</p>
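The half-link result follows directly by intersecting the blue and green lines of Figure 3:

$$ (1-p)q = 2q(p - p^2) \implies 1 - p = 2p(1-p) \implies p = \tfrac{1}{2} \quad (p \neq 1), $$

so Cubic’s fraction at the intersection is \(p = 1/2\), leaving \(1-p = 1/2\) of the queue, and hence of the link, to the single BBR flow, no matter how many Cubic flows make up the other half.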
<h1 id="a-more-robust-model-for-bbr-s-throughput">A more robust model for BBR’s throughput</h1>
<p>Thus far, we have made many simplifying assumptions to make a simple model of BBR’s throughput. Our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3355369.3355604">IMC</a> paper builds a more complete model of BBR’s throughput when competing with any number of loss-based flows, relaxing those assumptions. The robust model is shown in Figure 4.</p>
<p><img src="./BBR_fig4.png" alt="A robust model for BBR’s queue occupancy/throughput." />
<em><strong>Fig. 4:</strong> The robust model for BBR’s throughput/link fraction when competing with loss-based flows. Notably, none of the variables in the model depend on the number of loss-based flows.</em></p>
<p>Figure 4 highlights the variables that impact BBR’s fraction of the link and throughput. Here the queue size is expressed as a multiple of the BDP, \(q = Xcl\), where \(X\) is the multiple, \(c\) is the link capacity, and \(l\) is the link propagation delay. In addition, \(N\) is the number of BBR flows. Notably, none of these variables depends on the loss rate or the number of loss-based flows. Consequently, BBR can be unfair to Cubic when there are more Cubic flows than BBR flows, because the BBR flows’ aggregate fraction of the link is not proportional to the number of flows.</p>
<p>The robust model predicts BBR’s throughput with a median error of 5% when competing against Cubic flows, and 8% when competing against Reno flows.</p>
<p>In summary, this model has two important explanations for why BBR can be unfair to Cubic:</p>
<ol>
<li>While BBR is supposed to be a rate-based algorithm, when competing, BBR is window-limited. As a result, although one of BBR’s goals is to minimize queueing delay, it will fill network buffers when competing with loss-based algorithms.</li>
<li>BBR’s throughput when competing with loss-based algorithms does not depend on the number of competing loss-based flows. As a result, a single BBR flow will grab a fixed fraction of the link regardless of the number of competing flows.</li>
</ol>
<p>Google recently proposed <a rel="noopener" target="_blank" href="https://datatracker.ietf.org/meeting/117/materials/slides-117-ccwg-bbrv3-algorithm-bug-fixes-and-public-internet-deployment-00">BBRv3</a> (preceded by <a rel="noopener" target="_blank" href="https://datatracker.ietf.org/meeting/112/materials/slides-112-iccrg-bbrv2-update-00">BBRv2</a>) which aims to address the fairness issues discussed in this work and other fairness concerns by incorporating loss into BBR’s model of the network. An interesting direction for future work is to study BBRv3’s interactions with loss-based CCAs.</p>
<p>This blog is based on a paper published at IMC 2019: <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu//%7Erware/assets/pdf/ware-imc2019.pdf">Modeling BBR’s Interactions with Loss-Based Congestion Control</a>.</p>
miniCodeProps: a Benchmark for Proving Code Properties2024-08-23T00:00:00+00:002024-08-23T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/mini-code-props/<p>It is nearly inevitable that bugs will appear in a codebase during software development. To catch these bugs before they lead to real-world consequences,
the formal verification community has developed a wide variety of tools for ensuring code correctness. These tools fall
into two main classes: Automated Reasoning and Interactive Theorem Proving. Unfortunately, proving code properties with either of these approaches tends
to require significant effort from human experts. In this blog post, we describe early steps towards using the emerging capabilities of <a rel="noopener" target="_blank" href="https://aws.amazon.com/what-is/large-language-model/">Large Language Models (LLMs)</a>
to automate the labor-intensive portions of the Interactive Theorem Proving paradigm. In particular, we introduce <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2406.11915">miniCodeProps</a>, a benchmark
for automated Interactive Theorem Proving on code properties.</p>
<h1 id="background">Background</h1>
<h2 id="verification-with-automated-reasoning">Verification with Automated Reasoning</h2>
<p>Automated Reasoning tools such as boolean satisfiability solvers (a.k.a. SAT solvers) take formulas in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">a form of propositional logic</a>
as input and use “<a rel="noopener" target="_blank" href="https://cacm.acm.org/research/the-science-of-brute-force/">brute reasoning</a>” to search for
a variable assignment such that the input formula evaluates to True. If no such assignment exists, a <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emheule/publications/p01c15-prf.pdf">proof</a> of this
fact is returned. If we want to verify that the outputs of a function \(y = f(x)\) satisfy property \(p(y)\), we can encode \(\exists x, \lnot p(f(x))\) into conjunctive normal form (CNF) and run a SAT solver.
If \(f\) fails to satisfy \(p\) on any input(s), the solver will return one such input as a satisfying assignment to the formula. Otherwise, the returned proof that the formula
is unsatisfiable is equivalent to a proof that \(\forall x, p(f(x))\), i.e. \(f\) satisfies \(p\) on all possible inputs. </p>
<p>For example, \(f\) may be a function that takes an unsorted list of numbers \(x\) as input and returns \(y\), a new sorted list. If \(p(y)\) mathematically encodes the statement
“the list \(y\) is ordered from least to greatest”, then solving the aforementioned \(\exists x, \lnot p(f(x)) \) is effectively asking the SAT solver to search for input lists
that cause \(f\) to not return a sorted list. In practice, verifying a sorting algorithm requires proving other properties as well. If \(f\) always sets \(y\) to an empty list,
\(p(y)\) will always be True! We follow up with this example in a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#sorting">later section</a>.</p>
<p>Satisfiability Modulo Theory (SMT) solvers allow more complicated variable types such as integers and arrays, as well as clauses that use for-all quantifiers. These extensions make encoding correctness properties
simpler, but make <a rel="noopener" target="_blank" href="https://leodemoura.github.io/files/SMTProofs.pdf">producing proofs of unsatisfiability much more difficult</a>.
While this formulation succeeds in many practical settings, solving SAT and SMT formulas
is an NP-hard problem. In practice, when an input formula takes prohibitively long to solve, a human expert must modify the problem encoding by adding extra information such as
an inductive invariant hint to reduce the search space. Although not the focus of this post, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.17807">recent work</a> has shown that
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08774">OpenAI’s GPT-4 LLM</a> can be used to reliably produce SMT encodings of some problems.</p>
<h2 id="leanExample">Verification with Interactive Theorem Proving</h2>
<p>In contrast to the Automated Reasoning approach, Interactive Theorem Provers (ITPs) were explicitly designed to include a human expert in the proving process. Well known ITP
environments such as <a rel="noopener" target="_blank" href="https://isabelle.in.tum.de/overview.html">Isabelle</a>, <a rel="noopener" target="_blank" href="https://coq.inria.fr/">Coq</a>, and <a rel="noopener" target="_blank" href="https://lean-lang.org/">Lean</a> require that the user specify the
program and the property to be proved in mathematical languages unique to each ITP, which are roughly as expressive
as the SMT language while remaining more human-readable. The user is then presented with the current state of the proof, containing the goal(s) and other relevant information.
The user then adds lines to a growing proof of the property in which each line modifies the goal(s) in a mathematically valid way that is verified by the ITP until
all goals are proven. At first glance, this paradigm seems far inferior to Automated Reasoning in terms of scaling potential because a human expert is an integral part of the
proving process. However, recent advances in LLM capabilities provide hope that Artificial Intelligence (AI) will soon be able to automate proof writing in ITP environments.</p>
<p>Our work uses Lean 4 to state and prove theorems about code. The image below is a sample Lean file containing two functions involving
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Binary_tree">binary trees</a>: <code>tree_size</code> and <code>balanced</code> (section B). Section C contains a property of these functions written
in Lean, namely that when <code>balanced</code> returns true on any tree <code>t</code>, <code>tree_size(t)</code> returns an odd number.
The definition of <code>Odd</code> is part of mathlib (Lean’s library of mathematical statements and theorems), which is imported in section A.
Section D contains the proof of this property. The proof state (section E) displays Lean’s internal proof environment at the location of the cursor, which in this case is
line 20 in section D. The yellow items in section E describe objects available for use in the proof, i.e. <code>p</code> and <code>q</code> are the left and right branches of the input tree.
The proof state also contains facts available for use in the proof, i.e. <code>hb</code> stores the fact that a tree with <code>x</code> as the root and <code>p</code> and <code>q</code> as branches is balanced.
The final line of section E is the “goal” of the proof, which by line 20 has been simplified to showing that adding 1 to the tree sizes of <code>p</code> and <code>q</code> results in an odd number.</p>
<img src="lean_example.png" alt="Example Lean environment showing file context, imports, and proof state" width="1000"/>
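For readers who cannot see the image, the definitions it describes look roughly like the following Lean 4 sketch. The tree type, constructor shapes, and the commented-out theorem are our reconstruction, not the exact benchmark file:

```lean
-- Hypothetical reconstruction of sections A–C of the figure.
inductive MyTree (α : Type) where
  | leaf : MyTree α
  | node : MyTree α → α → MyTree α → MyTree α

def tree_size : MyTree α → Nat
  | .leaf => 1
  | .node p _ q => 1 + tree_size p + tree_size q

def balanced : MyTree α → Bool
  | .leaf => true
  | .node p _ q => tree_size p == tree_size q && balanced p && balanced q

-- The property from section C: a balanced tree has odd size.
-- theorem balanced_size_odd (t : MyTree α) (hb : balanced t) :
--     Odd (tree_size t) := by
--   ... (section D of the figure)
```

Intuitively the property holds because a balanced node has two equal-sized subtrees, so its size is \(1 + 2k\) for some \(k\), which is odd.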
<h1 id="new-applications-of-llms">New Applications of LLMs</h1>
<h2 id="mathematical-theorem-proving">Mathematical theorem proving</h2>
<p>Mathematics research currently relies on extensive peer review to spot errors in new publications. This process can be difficult and time consuming, without any guarantees
on review quality. To address this problem, several well-known mathematicians have begun to
<a rel="noopener" target="_blank" href="https://terrytao.wordpress.com/2023/12/05/a-slightly-longer-lean-4-proof-tour/">formalize parts of their work in Lean</a>. From a computer scientist’s point of view, this context provides
several possible avenues of research, such as generating new formal and informal proofs and translating between the two types. In this post, we focus on formal proof generation.
Recently, LLMs fine-tuned on medium-sized formal math datasets such as Lean’s Mathlib have shown state-of-the-art performance on the formal mathematical proof
benchmark <a rel="noopener" target="_blank" href="https://github.com/openai/miniF2F">miniF2F</a> (see <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.00656">lego-prover</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.10631">llemma</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2210.12283">Draft-Sketch-Prove</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2205.11491">HTPS</a>, <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2405.14333">DeepSeek-Prover</a>, and this <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2312.14188">improved data sampling method</a>).</p>
<h2 id="code-generation">Code Generation</h2>
<p>Code generation, modification, and repair have been active areas of research for decades. Recent work has shown <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2212.09420">significant progress</a>
on influential benchmarks such as <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/code-generation-on-humaneval">HumanEval</a> and <a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/mbpp">MBPP</a>.
In principle, the advances in this area are directly applicable to formal theorem proving.
Lean 4 is both a programming language and an ITP, so generating Lean proofs can also be viewed as generating code in the Lean programming language.
Additionally, <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2307.02503">several code generation models</a> can generate accompanying natural language explanations of generated code. If a language model can explain
how code works, it may also be able to generate proofs of properties of said code.</p>
<h1 id="challenges-and-mitigations">Challenges and Mitigations</h1>
<p>Although LLMs can prove formal mathematical theorems and explain generated code, the niche of proving program properties with LLMs is underexplored due to several technical challenges.</p>
<h2 id="incorporating-context">Incorporating Context</h2>
<p>Until recently, input size constraints have been a well-known problem for LLMs. In particular, early LLMs could only process between hundreds and thousands of words at a time
due to various architecture choices and constraints. Recent advances have significantly increased the effective allowed input size:
see <a rel="noopener" target="_blank" href="https://agi-sphere.com/context-length/">this post</a> for an introduction to the topic.
In early work on using LLMs for interactive theorem proving, only the proof state (see section E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">image above</a>) was used as input due to input size constraints.
Increases in allowed input length have allowed prompts about code to also include code dependencies and file context, which are both useful to humans when reasoning about ITP proofs.</p>
<h2 id="hallucinations">Hallucinations</h2>
<p>As people use Large Language Models (LLMs) such as ChatGPT more often and for an increasing variety of tasks,
LLM <a rel="noopener" target="_blank" href="https://www.ibm.com/topics/ai-hallucinations">hallucinations</a> have garnered significant attention. Essentially, LLMs sometimes fabricate
information that appears plausible but does not hold up under scrutiny. For example, an LLM might produce a proof that contains logical errors or uses nonexistent lemmas.
Many partial solutions exist for handling hallucinations; some examples include <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08896">self-consistency</a>
and <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval Augmented Generation (RAG)</a>.</p>
<p>ITPs provide a somewhat unique context where hallucinations are caught immediately by the ITP’s internal verifier. When
an LLM produces an invalid proof step, the error message produced by the ITP can also be used to prompt the LLM for an alternate proof step (see <a rel="noopener" target="_blank" href="https://www.arxiv.org/abs/2408.08152">DeepSeek-Prover-V1.5</a>),
similarly to how humans interact with an ITP.</p>
<h2 id="benchmark-availability">Benchmark Availability</h2>
<p>In large part, progress on tasks such as code generation and (in)formal math proofs is driven by reporting progress on widely accepted benchmarks such as <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/code-generation-on-humaneval">HumanEval</a>, <a rel="noopener" target="_blank" href="https://paperswithcode.com/sota/math-word-problem-solving-on-math">MATH</a>, and <a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/minif2f">miniF2F</a>.
At the time of writing, no such benchmark exists for proofs of code properties. The main contribution of our work in this field is the creation of <a rel="noopener" target="_blank" href="https://huggingface.co/datasets/elohn/miniCodeProps">miniCodeProps</a>, a new benchmark containing
a variety of programs and corresponding properties to be proven in Lean 4. We intend that miniCodeProps be used to benchmark the capabilities of LLMs to produce correct proofs
of code properties.</p>
<h1 id="benchmark-minicodeprops">Benchmark: miniCodeProps</h1>
<p>miniCodeProps is intended to mirror the utility of miniF2F (a formal mathematical theorem proving benchmark) in the space of proving properties of code.
We describe the way the benchmark was created and our baseline experiments with several techniques from code generation and theorem proving literature.</p>
<h2 id="benchmark-collection">Benchmark Collection</h2>
<p>The programs and associated properties in miniCodeProps were all sourced from the <a rel="noopener" target="_blank" href="https://tip-org.github.io/">Tons of Inductive Problems (TIP)</a> dataset. We selected files from TIP with properties
describing functions defined in TIP, then translated those properties and functions from Haskell to Lean 4. During the translation process, we were required to prove several
lemmas in Lean regarding the termination of the recursive functions being defined. These lemmas are also properties of the functions translated from TIP, and are also included
in the benchmark. Each example in our benchmark contains the text of sections A, B, C, and E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">example</a> above, where section E is the initial proof state.
An automated ITP system succeeds on a given example by producing a correct section D, where correctness is verified by Lean. The next section explores the common methods such ITP
systems use to produce proofs.</p>
<h1 id="methods">Methods</h1>
<p>In this section we will address the following questions: </p>
<ul>
<li>How do you programmatically check generated proofs?</li>
<li>What are current and potential future techniques used to generate proofs using LLMs? </li>
</ul>
<p>There are two main classes of proof generation techniques in current automated ITP literature: </p>
<ol>
<li>Next-Step tactic prediction: one line is generated at a time until the proof is complete</li>
<li>Full-Proof generation: entire candidate proofs are generated until one is found to be correct</li>
</ol>
<p>In practice, researchers designate a computational budget (for example, 8 attempts at Full-Proof generation) and terminate the proof search once this budget is exhausted without a successful proof. We evaluate both approaches on miniCodeProps.</p>
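<p>The budgeted loop described above can be sketched in a few lines. In this illustrative sketch, <code>generate</code> and <code>check</code> are hypothetical stand-ins for an LLM call and the Lean kernel, respectively; neither name comes from our actual implementation.</p>

```python
def prove_with_budget(generate, check, theorem, budget=8):
    """Sample full candidate proofs until one verifies or the budget runs out.

    `generate` stands in for an LLM producing a candidate proof string;
    `check` stands in for the Lean kernel validating it against the theorem.
    """
    for _ in range(budget):
        candidate = generate(theorem)
        if check(theorem, candidate):
            return candidate  # a verified proof
    return None  # budget exhausted without a successful proof

# Toy usage: a "generator" that always suggests `simp`, and a "checker"
# that accepts exactly that tactic.
proof = prove_with_budget(lambda t: "simp", lambda t, p: p == "simp", "thm")
```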
<h2 id="interaction-with-lean">Interaction with Lean</h2>
<p>Interaction with the Lean 4 kernel is necessary for most Next-Step tactic generation and self-refinement Full-Proof generation methods. Our work uses the <a rel="noopener" target="_blank" href="https://github.com/leanprover-community/repl">Lean REPL</a>, a tool that
facilitates backtracking and continuing from specific steps in the proof. Each time Lean code is generated, Lean REPL checks the validity of the line in the context of the definitions
and earlier proof lines. The REPL returns error messages if any invalid steps were taken, or the new proof state containing the list of remaining goals (statements to prove) otherwise.
When the list of goals in the proof state is empty, the original theorem has been proven correct.</p>
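<p>The control flow just described can be sketched as a small helper that classifies a REPL-style reply. The dictionary fields below follow the general shape of Lean REPL responses but are simplified for illustration and should not be read as the exact protocol.</p>

```python
def interpret_reply(reply: dict):
    """Classify a (simplified) Lean-REPL-style reply.

    Returns ("error", msgs) if the step was invalid, ("proved", []) if no
    goals remain, and ("open", goals) otherwise.
    """
    errors = [m["data"] for m in reply.get("messages", [])
              if m.get("severity") == "error"]
    if errors:
        return ("error", errors)
    goals = reply.get("goals", [])
    if not goals:
        return ("proved", [])  # empty goal list: theorem proven
    return ("open", goals)

# Example replies, mimicking the three cases:
ok = interpret_reply({"goals": []})                          # proof complete
step = interpret_reply({"goals": ["⊢ ordered (hsort xs)"]})  # goals remain
bad = interpret_reply({"messages": [{"severity": "error",
                                     "data": "unknown tactic"}]})
```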
<h2 id="next-step-tactic-generation">Next-Step Tactic Generation</h2>
<p>At the beginning of a proof and after each valid line, the Lean kernel generates a proof state, i.e. a collection of the variables and hypotheses defined in the current context
(section E of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">the earlier example</a>).
As Lean is also a programming language, the proof state can also be thought of as a debug trace of the current context (theorems are effectively functions that produce certificates that a property holds! See <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence">Curry-Howard</a>). Tactics are functions that modify the proof state. Common examples include <code>simp</code>, a
broadly useful tactic that simplifies the goal by rewriting with a large library of lemmas, and <code>rw</code>, which rewrites the goal using specific lemmas provided by the user.</p>
<p>The most basic variant of Next-Step tactic generation is a function from a proof state to a set of possible next tactics. There are many ways to extend this idea: for example,
expanding the input to include other relevant Lean code and lemmas, or expanding the output to attach to each candidate tactic a “confidence” score estimating the likelihood that the proof
can be completed using it as the next step. In the earlier example, a successful Next-Step tactic prediction given the proof state in section E would be the line
after the cursor in section D, i.e. <code>unfold balanced at hb</code>. To produce a full proof, the system starts from the initial proof state and repeatedly generates candidate next tactics
until the proof state has an empty list of goals.</p>
<h2 id="full-proof-generation">Full-Proof Generation</h2>
<p>Researchers have discovered that LLMs in some cases exhibit <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2301.00234">In-Context Learning</a>: the ability to generalize patterns
from a small number of examples provided in the prompt. Additionally,
the data that massive LLMs such as GPT-4 are trained on contains examples of proofs in Lean 3, as well as in other proof assistants such as Isabelle and Coq.
Therefore, it is reasonable to expect that LLMs could generate full proofs of a given theorem statement given example pairs of theorem statement and proof.
Concretely, the “theorem statement” used is generally sections A, B, C, and optionally E in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#leanExample">earlier example</a>, while the expected output is an entire valid
proof (section D). In our experiments we ignored initial proof state (section E) as it was mostly redundant with the theorem definition (section C) in our examples.</p>
<p>Deciding on inputs and outputs is a good first step, but it is generally suboptimal to simply send the LLM a list of input-output pairs. The nascent field of Prompt Engineering provides
a variety of approaches to constructing high-performing prompts for language models. One such technique is to tell the language model that it
is an expert, e.g., beginning with “You are an expert in producing Lean 4 proofs tasked with…”. A second common approach is “few-shot prompting”: providing several examples
of the input and desired output in the prompt. Another common approach is <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.17651">self-refinement</a>, i.e., feeding any available output describing the results
of the previous LLM response into the next prompt. </p>
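<p>Combining these ideas, a few-shot prompt can be assembled mechanically. The sketch below is illustrative only: the persona line and formatting are hypothetical, not the exact prompt used in our experiments.</p>

```python
def build_prompt(examples, target_statement):
    """Assemble a few-shot prompt: a persona line, worked (statement, proof)
    example pairs, then the target statement awaiting a proof."""
    parts = ["You are an expert in producing Lean 4 proofs."]
    for statement, proof in examples:
        parts.append(f"Theorem:\n{statement}\nProof:\n{proof}")
    parts.append(f"Theorem:\n{target_statement}\nProof:")
    return "\n\n".join(parts)

demo = build_prompt(
    [("theorem add_zero (n : Nat) : n + 0 = n", "by simp")],
    "theorem zero_add (n : Nat) : 0 + n = n",
)
```

<p>The model's completion after the final “Proof:” is then checked by Lean; self-refinement would append any resulting error messages and ask again.</p>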
<h1 id="minicodeprops-baselines">miniCodeProps Baselines</h1>
<p>We tested several models using Next-Step tactic generation, and GPT-4 for Full-Proof generation. Results can be found in the table below.
<a rel="noopener" target="_blank" href="https://github.com/wellecks/llmstep">LLMStep</a> is a framework for getting Next-Step proof suggestions from arbitrary LLMs in VS Code. We modified it to communicate with Lean REPL
directly and to output confidence scores for each generated next step. We applied the following proof search approach:</p>
<ol>
<li>Given a proof state, generate 10 (next tactic, confidence score) pairs.</li>
<li>Deduplicate and keep the 5 highest-confidence tactics.</li>
<li>Send each tactic to Lean REPL using the current state. For each proof state returned:
<ul>
<li>If the proof state is invalid (i.e., the tactic caused Lean REPL to error), ignore it.</li>
<li>If there are no goals remaining, return the list of steps taken.</li>
<li>If the maximum proof search depth has not been reached, repeat steps 1-3 on the new proof state.</li>
</ul>
</li>
</ol>
<p>The LLMs we used were all fine-tuned to produce Lean tactics from a proof state; ntp-context-1.3b in particular was also fine-tuned to use surrounding file context. Due to computational constraints, we used a proof search depth of 3 in our experiments.</p>
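<p>The search procedure above can be sketched as a depth-limited recursion. Here <code>suggest</code> stands in for the fine-tuned LLM (returning (tactic, confidence) pairs) and <code>apply_tactic</code> for the Lean REPL, which returns the new goal list, or <code>None</code> on error; both names are placeholders for illustration.</p>

```python
def search(state, suggest, apply_tactic, depth=3):
    """Depth-limited proof search over LLM-suggested next steps.

    Keeps the top 5 distinct suggestions by confidence, applies each via
    `apply_tactic` (None = REPL error, [] = no goals left), and recurses.
    Returns the list of tactics forming a proof, or None on failure.
    """
    if depth == 0:
        return None
    top, seen = [], set()
    for tactic, _conf in sorted(suggest(state), key=lambda p: -p[1]):
        if tactic not in seen:
            seen.add(tactic)
            top.append(tactic)
        if len(top) == 5:
            break
    for tactic in top:
        new_state = apply_tactic(state, tactic)
        if new_state is None:      # invalid step: ignore this branch
            continue
        if not new_state:          # empty goal list: proof complete
            return [tactic]
        rest = search(new_state, suggest, apply_tactic, depth - 1)
        if rest is not None:
            return [tactic] + rest
    return None
```

<p>With a toy <code>suggest</code>/<code>apply_tactic</code> pair, the search skips erroring tactics and returns the first tactic sequence that empties the goal list.</p>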
<p>For our Full-Proof generation approach, we constructed a base prompt containing three examples of program property proofs similar to those in miniCodeProps. For each property in miniCodeProps,
we appended the property and accompanying context (function definitions and lemmas) to the base prompt and requested a full proof. We sampled 8 responses from GPT-4 and reported success
if any of them contained a correct proof.</p>
<table><thead><tr><th>Method</th><th align="left">LLM</th><th align="center">Medley (Easy)</th><th align="center">Termination (Med)</th><th align="center">Sorting (Hard)</th></tr></thead><tbody>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/pythia-2.8b">Pythia2.8b</a></td><td align="center">44/86</td><td align="center">1/28</td><td align="center">0/63</td></tr>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/llemma_7b">Llemma7b</a></td><td align="center">46/86</td><td align="center">2/28</td><td align="center">0/63</td></tr>
<tr><td>Next-Step</td><td align="left"><a rel="noopener" target="_blank" href="https://huggingface.co/l3lab/ntp-mathlib-context-deepseek-coder-1.3b">ntp-context-1.3b</a></td><td align="center">38/86</td><td align="center">0/28</td><td align="center">0/63</td></tr>
<tr><td>Full-Proof</td><td align="left"><a rel="noopener" target="_blank" href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo">GPT-4-turbo</a></td><td align="center">44/86</td><td align="center">1/28</td><td align="center">0/63</td></tr>
</tbody></table>
<p>Our results indicate that proving properties of programs is nontrivial for simple applications of fine-tuned language models and basic few-shot prompting of GPT-4. Further analysis of
the failure modes of these models (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/mini-code-props/#sorting">Sorting Discussion</a>), as well as more sophisticated (and higher computational budget) approaches to proof search, will likely improve
these results in the near future.
In the following section, we analyze the Sorting component of miniCodeProps and a sample incorrect proof generated by GPT-4.</p>
<h2 id="sorting">Discussion: Sorting</h2>
<p>The Sorting section of miniCodeProps contains a variety of sorting algorithms with associated properties. In particular, after defining 11 sorting algorithms on
lists of natural numbers, TIP defines the following properties for each algorithm:</p>
<ol>
<li>the algorithm returns an ordered list</li>
<li>the list returned by the algorithm has the same number of elements as the input list</li>
<li>the list returned by the algorithm is a permutation of the original list</li>
<li>the algorithm is equivalent to another sorting algorithm (insertion sort)</li>
</ol>
<p>Properties 1 and 4 are deeply connected: there is a unique ordered permutation of any list of natural numbers, although this fact is itself a property of the <code>ordered</code> function used!
The best way to prove property 4 for most algorithms may well be to prove property 1 together with that uniqueness fact, but I argue that any
theorem prover that does so has demonstrated a valuable skill. Property 2 is also strictly easier than property 3, since a permutation of a list necessarily preserves the number of elements. This allows for interesting analysis of future
theorem provers: will they succeed at proving property 2 but struggle with property 3? Or will they prove property 3 directly and use its corollary to immediately prove property 2?</p>
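<p>The fact that property 3 subsumes property 2 can be stated in Lean itself. A minimal sketch, assuming the standard-library <code>List.Perm</code> relation and its <code>length_eq</code> lemma, with an arbitrary <code>sort</code> function standing in for any of the algorithms:</p>

```lean
-- If `sort xs` is a permutation of `xs` (property 3), then it has the
-- same number of elements (property 2), since permutations preserve length.
example (sort : List Nat → List Nat)
    (prop_perm : ∀ xs : List Nat, (sort xs).Perm xs) :
    ∀ xs : List Nat, (sort xs).length = xs.length :=
  fun xs => (prop_perm xs).length_eq
```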
<p>Unfortunately, the approaches we have tested so far have not succeeded at proving any properties of sorting algorithms. However, the ways in which they fail are informative. Below we
have a sample output of GPT-4 attempting to prove that heapsort (<code>hsort</code> below) returns an ordered list. The proof <em>looks</em> mostly reasonable to a Lean user, a common characteristic
of LLM-produced output. Notable problems occur on lines 25, 31, and 40. </p>
<pre data-linenos data-lang="lean" style="background-color:#393939;color:#dedede;" class="language-lean "><code class="language-lean" data-lang="lean"><table><tbody><tr><td>1</td><td><span style="color:#fed6af;">inductive </span><span style="color:#fffd87;">MyHeap </span><span style="color:#fed6af;">where
</span><tr><td>2</td><td><span>| nil : MyHeap
</span><tr><td>3</td><td><span>| node : MyHeap → Nat → MyHeap → MyHeap
</span><tr><td>4</td><td><span>
</span><tr><td>5</td><td><span style="color:#fed6af;">def </span><span style="color:#fffd87;">numElem </span><span>: MyHeap → Nat
</span><tr><td>6</td><td><span>| MyHeap.nil => </span><span style="font-weight:bold;color:#87d6d5;">0
</span><tr><td>7</td><td><span>| MyHeap.node p _x q => </span><span style="font-weight:bold;color:#87d6d5;">1</span><span> + numElem p + numElem q
</span><tr><td>8</td><td><span>
</span><tr><td>9</td><td><span style="color:#fed6af;">lemma </span><span style="color:#fffd87;">numElem_merge_branches_lt </span><span>(p q: MyHeap) (x: Nat): numElem (hmerge p q) < numElem (MyHeap.node p x q) := </span><span style="color:#fed6af;">by
</span><tr><td>10</td><td><span> rw [←merge_elems _ _];
</span><tr><td>11</td><td><span> </span><span style="color:#fed6af;">have</span><span> h': numElem (MyHeap.node p x q) = </span><span style="font-weight:bold;color:#87d6d5;">1</span><span> + numElem p + numElem q; rfl
</span><tr><td>12</td><td><span> rw [h']
</span><tr><td>13</td><td><span> linarith;
</span><tr><td>14</td><td><span>
</span><tr><td>15</td><td><span style="color:#fed6af;">def </span><span style="color:#fffd87;">ordered </span><span>: List Nat -> Bool
</span><tr><td>16</td><td><span>| [] => True
</span><tr><td>17</td><td><span>| [_x] => True
</span><tr><td>18</td><td><span>| x::y::xs => x <= y && ordered (y::xs)
</span><tr><td>19</td><td><span>
</span><tr><td>20</td><td><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> hsort (heapsort) omitted for brevity
</span><tr><td>21</td><td><span>
</span><tr><td>22</td><td><span style="color:#fed6af;">theorem </span><span style="color:#fffd87;">prop_HSortSorts' </span><span>(xs: List Nat) : ordered (hsort xs) == True := </span><span style="color:#fed6af;">by
</span><tr><td>23</td><td><span>unfold hsort
</span><tr><td>24</td><td><span>unfold ordered
</span><tr><td>25</td><td><span>induction xs generalizing MyHeap </span><span style="color:#fed6af;">with
</span><tr><td>26</td><td><span>| nil => unfold toHeap; unfold toList; simp
</span><tr><td>27</td><td><span>| cons x xs' ih =>
</span><tr><td>28</td><td><span> unfold toHeap
</span><tr><td>29</td><td><span> unfold toList
</span><tr><td>30</td><td><span> unfold hmerge
</span><tr><td>31</td><td><span> </span><span style="color:#fed6af;">have</span><span> h := numElem_merge_branches_lt ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific heaps if needed
</span><tr><td>32</td><td><span> </span><span style="color:#fed6af;">have</span><span> m := merge_elems ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific heaps if needed
</span><tr><td>33</td><td><span> specialize ih (xs'.map (</span><span style="color:#fed6af;">fun</span><span> x => MyHeap.node MyHeap.nil x MyHeap.nil))
</span><tr><td>34</td><td><span> induction (toList (toHeap xs')) generalizing (toList (toHeap (x :: xs')))
</span><tr><td>35</td><td><span> | nil =>
</span><tr><td>36</td><td><span> apply ih
</span><tr><td>37</td><td><span> | cons y ys' ysih =>
</span><tr><td>38</td><td><span> unfold toList at ih
</span><tr><td>39</td><td><span> unfold toHeap at ih
</span><tr><td>40</td><td><span> unfold ordered at ih ⊢
</span><tr><td>41</td><td><span> unfold hsort at ih
</span><tr><td>42</td><td><span> </span><span style="color:#fed6af;">have</span><span> h := numElem_merge_branches_lt ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific values if needed
</span><tr><td>43</td><td><span> </span><span style="color:#fed6af;">have</span><span> m := merge_elems ... </span><span style="color:#a0cfa1;">--</span><span style="color:#87ae86;"> Placeholder for specific values if needed
</span><tr><td>44</td><td><span> case h_1 =>
</span><tr><td>45</td><td><span> unfold ordered
</span><tr><td>46</td><td><span> simp [List.map]
</span><tr><td>47</td><td><span> simp [toList]
</span><tr><td>48</td><td><span> case h_2 =>
</span><tr><td>49</td><td><span> unfold toHeap
</span><tr><td>50</td><td><span> unfold ordered
</span><tr><td>51</td><td><span> simp [toList]
</span><tr><td>52</td><td><span> rfl
</span></tr></tbody></table></code></pre>
<p>On line 25, GPT-4 attempts to generalize <code>MyHeap</code>, a type defined earlier. Induction with generalization
is a common idea in many proofs about recursive programs, but the object generalized is always some term in the proof context, not a type. Generalizing a type is semantically
meaningless, and indeed the Lean kernel throws its first error on this line.</p>
<p>On line 31, GPT-4 again demonstrates interesting but incorrect behavior. <code>numElem_merge_branches_lt</code> is a lemma stating that merging two heaps results in a heap with fewer elements
than a heap with a new value at the root and the two original heaps as children. GPT-4 invokes this lemma but does not provide any arguments, instead writing <code>...</code> (not valid Lean syntax),
seemingly telling the user “I don’t know what should go here, so you fill it in.” However, the <code>h</code> that GPT-4 binds to this invocation is never used in the proof. I interpret this as follows: GPT-4’s model of correct Lean proofs includes invoking lemmas defined in the context, but not the logic necessary to use such lemmas effectively.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Our baselines for miniCodeProps demonstrate that despite recent advances in LLM-powered mathematical theorem proving in ITPs, proving complex code properties remains difficult.
While future work on benchmarking this capability will likely expand outside of TIP to include properties of the wide range of non-inductive functions and data, miniCodeProps represents
a challenging first step. When models are capable of automatically producing proofs in the Sorting category, theorem proving technology will have taken a large step towards
the elusive goal of automatic generation of provably correct code. We hope the theorem proving community finds miniCodeProps useful for improving the capabilities of automated ITP systems.</p>
<h1 id="reference-list">Reference List</h1>
<ul>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2406.11915">miniCodeProps on Arxiv</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/datasets/elohn/miniCodeProps">miniCodeProps benchmark</a></li>
<li><a rel="noopener" target="_blank" href="https://aws.amazon.com/what-is/large-language-model/">What is an LLM blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">CNF explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://cacm.acm.org/research/the-science-of-brute-force/">Brute Reasoning explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emheule/publications/p01c15-prf.pdf">SAT solver unsatisfiability proofs</a></li>
<li><a rel="noopener" target="_blank" href="https://leodemoura.github.io/files/SMTProofs.pdf">SMT solver proofs</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2310.17807">Clover: Closed-Loop Verifiable Code Generation</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08774">GPT-4 Technical Report</a></li>
<li><a rel="noopener" target="_blank" href="https://lean-lang.org/">Lean homepage</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Binary_tree">Binary Tree explanation</a></li>
<li><a rel="noopener" target="_blank" href="https://terrytao.wordpress.com/2023/12/05/a-slightly-longer-lean-4-proof-tour/">Terrence Tao formalizing parts of his work in Lean</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2312.14188">Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2212.09420">Large Language Models Meet NL2Code: A Survey</a></li>
<li><a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/mbpp">MBPP Dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2307.02503">Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review</a></li>
<li><a rel="noopener" target="_blank" href="https://agi-sphere.com/context-length/">Context Length blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://www.ibm.com/topics/ai-hallucinations">LLM Hallucinations blog post</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.08896">SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a></li>
<li><a rel="noopener" target="_blank" href="https://www.arxiv.org/abs/2408.08152">DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search</a></li>
<li><a rel="noopener" target="_blank" href="https://paperswithcode.com/dataset/minif2f">miniF2F dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://tip-org.github.io/">Tons of Inductive Problems (TIP) dataset</a></li>
<li><a rel="noopener" target="_blank" href="https://github.com/leanprover-community/repl">Lean REPL</a></li>
<li><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence">Curry-Howard Correspondence</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2301.00234">A Survey on In-Context Learning</a></li>
<li><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.17651">Self-Refine: Iterative Refinement with Self-Feedback</a></li>
<li><a rel="noopener" target="_blank" href="https://github.com/wellecks/llmstep">LLMStep</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/pythia-2.8b">Pythia2.8b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/EleutherAI/llemma_7b">Llemma7b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://huggingface.co/l3lab/ntp-mathlib-context-deepseek-coder-1.3b">ntp-context-1.3b fine-tuning</a></li>
<li><a rel="noopener" target="_blank" href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo">OpenAI documentation for GPT-4-turbo</a></li>
</ul>
Mariposa: the Butterfly Effect in SMT-based Program Verification2024-08-06T00:00:00+00:002024-08-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/mariposa/<p>Satisfiability Modulo Theories (SMT) solvers are powerful tools
that answer logical and mathematical questions.
As an example, let’s say I want to know whether there exist integers
\(a, b, c\) such that \(3a^2 - 2ab - b^2c = 7\).
To ask an SMT solver, I need to write an SMT query, which is in a <a rel="noopener" target="_blank" href="https://smtlib.cs.uiowa.edu/">standardized format</a> for expressing logical problems. In the SMT query below, the <code>declare-fun</code> command creates a variable (i.e., a
function with no arguments), and the <code>assert</code> command states the equation
as a constraint. More generally, an SMT query may contain
multiple assertions, and the <code>check-sat</code> command checks if
the query context, i.e., the <em>conjunction</em> of the
assertions, is satisfiable.</p>
<!-- A slight quirk is that the expressions are in prefix form,
where each operator comes before its operand(s). -->
<pre style="background-color:#393939;color:#dedede;"><code><span>(declare-fun a () Int)
</span><span>(declare-fun b () Int)
</span><span>(declare-fun c () Int)
</span><span>(assert
</span><span> (=
</span><span> (+ (* 3 a a) (* -2 a b) (* -1 b (* c b)))
</span><span> 7
</span><span> )
</span><span>)
</span><span>(check-sat)
</span></code></pre>
<p>The possible answers from the SMT solver can be “Yes”
(satisfiable), “No” (unsatisfiable) or “I don’t know”
(unknown). Suppose the solver responds with “Yes”
(satisfiable) in this case. This is good, because the
question is not so straightforward to me at least, and the
solver gives a definitive answer. What’s more, the solver
provides fairly high assurance about its responses, which
are justified by <strong>precise mathematical reasoning</strong>. For
this example, the solver can also provide a solution, <code>a = 1, b = 2, c = -2</code>, which serves as a checkable witness to
the “Yes” answer.</p>
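<p>The witness is easy to check by hand, or in a few lines of code: substituting <code>a = 1, b = 2, c = -2</code> into the left-hand side recovers 7.</p>

```python
# Substitute the reported witness into 3a^2 - 2ab - b^2*c.
a, b, c = 1, 2, -2
lhs = 3 * a * a - 2 * a * b - b * b * c
print(lhs)  # 3 - 4 + 8 = 7
```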
<p>However, the solver is not perfect, because even a seemingly
benign change to a query can
trip up the SMT solver, causing it to give up. Suppose that
I slightly tweak the formula and ask again:</p>
<!-- $ \exists \, e, f, g \in Z \, | \,
3e^{2} -2ef -e^2g = 7 $
-->
<pre style="background-color:#393939;color:#dedede;"><code><span>(declare-fun e () Int)
</span><span>(declare-fun f () Int)
</span><span>(declare-fun g () Int)
</span><span>(assert
</span><span> (=
</span><span> (+ (* 3 e e) (* -2 e f) (* -1 f (* g f)))
</span><span> 7
</span><span> )
</span><span>)
</span><span>(check-sat)
</span></code></pre>
<p>This time, the following may happen: the solver gives up,
saying “I don’t know” to this new query. Understandably,
this may seem puzzling. As you might have noticed, the two
queries are essentially the same, just with different
variable names.
Is it even a legitimate move for it to give up? Why would the solver give different responses?</p>
<p>Before you get mad at the solver (this is a made-up example
BTW), let me explain why it can unexpectedly fail with a
seemingly innocuous query change. As mentioned earlier, the
SMT solver sticks to precise mathematical reasoning.
Therefore, if a best-effort try doesn’t work out, the solver
is allowed to give up, instead of giving bogus answers.
Moreover, the solver heuristics may not be robust against
superficial modifications to the input, leading to
confusing responses on similar inquiries.</p>
<!-- How hard? Well, some questions can
be NP-hard! In fact, the example above pertains to
[Diophantine
equations](https://en.wikipedia.org/wiki/Diophantine_equation),
which are undecidable in general. Therefore, no program can
correctly answer all such questions. The poor solver has to
resort to heuristics, which may not be robust against
superficial modifications to the input query. -->
<!-- ### Instability of SMT Solving -->
<p>What we have observed in this example is the phenomenon of
<strong>SMT instability</strong>, where trivial changes to the input
query may incur large performance variations (or even
different responses) from the solver. While there are many
applications of the SMT solver, in this blog post, I will focus
on instability in <strong>SMT-based program verification</strong>, where
we ask the solver to prove programs correct. More
concretely, instability manifests as a butterfly effect:
even tiny, superficial changes in the program may lead to
noticeable change in proof performance and even spurious
verification failures.</p>
<!-- spurious proof failures, where a previously proven program
may be (wrongfully) rejected after trivial changes to the
source code. -->
<h1 id="instability-in-smt-based-program-verification">Instability in SMT-based Program Verification</h1>
<p>Please allow me to briefly explain why program verification
is useful, how SMT solvers can help with verification, and
why instability comes up as a concern. If you are already
familiar with the background topic, please feel free to skip
this section. </p>
<p>As programmers, we often make informal claims about our
software. For example, I might say that a filesystem is
crash-safe or that an encryption tool is secure. However, as
many of us can testify, these claims might be
unfounded or even straight-up wrong. Sometimes, the cost of software failure can be catastrophic (e.g., consider <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Ariane_flight_V88">spacecraft</a> or <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Therac-25">medical devices</a>).
Fortunately, formal
verification offers a path to move beyond informal claims and avoid such
disasters.</p>
<p>Formal verification uses proofs to show that the code meets
its specification. In comparison to testing, formal
verification offers a higher level of assurance, since it
reasons about the program’s behavior for <em>all possible
inputs</em>, not just the ones in the test cases. In a
more-or-less <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Predicate_transformer_semantics">standard
algorithm</a>,
program properties can be encoded as logical statements,
often called the verification conditions (VCs). Essentially, the
task of formal verification is to prove that the VCs hold.</p>
<p>In SMT-based program verification, the solver takes as input
the VCs and searches for proofs. As you might have gathered
from the previous example, the SMT solver can reason about
pretty complex logical statements. In this way, the solver
enables a high degree of automation, allowing the developer
to skip manual and tedious proof steps. This methodology has
thus made verification of complex software systems a
reality.</p>
<p>However, SMT-based automation also introduces the problem of
instability. Verified software, similar to regular software,
has an iterative development process. As the developers make
incremental changes to the code, corresponding queries also
change constantly. Even seemingly trivial changes, such as
renaming a variable, would create a different query. As we
have discussed, the solver may not respond consistently to
these changes, leading to confusing verification results and
frustrated developers.</p>
<h1 id="detecting-instability-with-mariposa">Detecting Instability with Mariposa</h1>
<p>Now that we have a basic understanding of instability, let’s
try to quantify it more systematically. I will
introduce the methodology used in <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/mariposa">Mariposa</a>, a tool that we
have built to measure and detect instability. In this blog post, I will
stick to the key intuitions and elide the details. For a
more thorough discussion, I encourage you to check out our
<a rel="noopener" target="_blank" href="https://www.andrew.cmu.edu/user/bparno/papers/mariposa.pdf">paper</a>. At a high level, given an original query \( q \) and
an SMT solver \( s \), Mariposa answers the question: </p>
<blockquote>
<p>Is the query-solver pair \((q, s)\) stable?</p>
</blockquote>
<p></p>
<p>Intuitively, instability means that \( s
\) experiences a mix of successes and failures when we apply seemingly irrelevant mutations to
\( q \). Mariposa detects instability by generating a set
of mutated queries and evaluating the performance of \( s
\) on each mutant. In this section, I will explain what
mutations are used, and how Mariposa decides the stability
status of the query-solver pair.</p>
<h2 id="what-mutations-to-use">What Mutations to Use?</h2>
<p>In Mariposa, a mutation method needs to preserve not only
the semantic meaning but also the syntactic structures of a
query. More precisely, the original query \( q \) and its
mutant \( q’ \) need to be both semantically equivalent
and syntactically isomorphic. </p>
<!-- it seems reasonable to expect similar performance from
the solver on both queries. -->
<ul>
<li>
<p><strong>Semantic Equivalence</strong>. \( q \) and \( q’ \) are semantically equivalent
when there is a bijection between the set of proofs for \( q \)
and those for \( q’ \) . In other words, a proof of \( q \) can be
transformed into a proof of \( q’ \) , and vice versa. </p>
</li>
<li>
<p><strong>Syntactic Isomorphism</strong>. \( q \) and \( q’ \) are
syntactically isomorphic if there exists a bijection between
their symbols (e.g., variables) and commands (e.g.,
<code>assert</code>). In other words, each symbol or command in \( q \) has
a counterpart in \( q’ \), and vice versa. </p>
</li>
</ul>
<p>For our concrete experiments, we are interested in mutations
that also correspond to common development practices.
Specifically, we consider the following three mutation
methods:</p>
<ul>
<li>
<p><strong>Assertion Shuffling</strong>. Reordering of source-level lemmas
or methods is a common practice when developing verified
software. Such reordering roughly corresponds to shuffling
the commands in the query. Since an SMT query is a
conjunction of assertions, the assertion order does not
impact query semantics. Further, shuffling the assertions
guarantees syntactic isomorphism.</p>
</li>
<li>
<p><strong>Symbol Renaming</strong>. It is common to rename source-level
methods, types, or variables, which roughly corresponds to
renaming the symbols in the SMT queries. As long as the
symbol names are used consistently, renaming preserves
semantic equivalence and syntactic isomorphism. </p>
</li>
<li>
<p><strong>Randomness Reseeding</strong>. SMT solvers optionally take as
input a random seed, which is used in some of their
non-deterministic choices. Changing the seed has no effect
on the query’s semantics but is known to affect the solver’s
performance. While technically not a mutation, reseeding has
been used as a proxy for measuring instability, which is why
we have included it here.</p>
</li>
</ul>
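<p>To make the first two mutations concrete, here is a minimal sketch (not Mariposa's actual implementation) of assertion shuffling and symbol renaming applied to a toy query represented as a list of SMT-LIB commands. The renaming uses naive textual substitution; a real implementation would respect token boundaries.</p>

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// A toy SMT-LIB query, represented as one assert command per line.
using Query = std::vector<std::string>;

// Assertion shuffling: permuting the asserts preserves semantics,
// since a query is a conjunction of its assertions.
Query shuffle_assertions(Query q, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(q.begin(), q.end(), rng);
    return q;
}

// Symbol renaming: consistently replacing a symbol name everywhere
// yields a semantically equivalent, syntactically isomorphic query.
// NOTE: naive substring replacement, for illustration only.
Query rename_symbol(Query q, const std::string& from, const std::string& to) {
    for (auto& cmd : q) {
        size_t pos = 0;
        while ((pos = cmd.find(from, pos)) != std::string::npos) {
            cmd.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return q;
}
```

<p>Randomness reseeding needs no query transformation at all: the mutant is the same query run with a different seed passed to the solver.</p>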
<!-- Historically, some verification tools have
attempted to use reseeding to measure instability: Dafny and
F* have options to run the same query multiple times with
different random seeds and report the number of failures
encountered.
-->
<p>As an example, suppose we have a query \( q \) with \(
100 \) assertions. If we exhaustively apply shuffling to
\( q \), we obtain a set of mutated queries consisting of all \(100!
\approx 9 \times 10^{157}\) permutations of \( q \). </p>
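<p>As a quick sanity check on that magnitude, \(\log_{10} n!\) can be computed with the standard library's log-gamma function:</p>

```cpp
#include <cmath>

// log10(n!) via the log-gamma function: ln(n!) = lgamma(n + 1).
double log10_factorial(int n) {
    return std::lgamma(n + 1.0) / std::log(10.0);
}
// log10_factorial(100) is about 157.97, i.e., 100! is roughly 9.3e157.
```
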
<h2 id="is-it-stable-or-not">Is it Stable or Not?</h2>
<p>Whether a query-solver pair \( (q, s) \) is stable or not
depends on how the mutants perform. A natural measure is the
<strong>Mutant Success Rate</strong>, i.e., the percentage of \( q\)’s
mutants that are verified by \( s \). Intuitively, the
success rate, denoted by \(r\), reflects performance
consistency. A low \(r\) indicates consistently poor
results; a high \(r\) indicates consistently good results;
and a moderate \(r\) indicates inconsistent results, i.e.,
instability.</p>
<p>Mariposa thus introduces four stability categories based on \(r\): <strong>unsolvable</strong>, <strong>unstable</strong>, <strong>stable</strong>, and <strong>inconclusive</strong>.<br />
The scheme includes two additional parameters,
\(r_{solvable}\) and \(r_{stable}\), which serve as the
lower and upper bounds of the success rate range for
unstable queries: \( (q, s) \) is <strong>unsolvable</strong> when
\(r < r_{solvable}\), <strong>unstable</strong> when \( r_{solvable} \leq r \leq r_{stable} \),
and <strong>stable</strong> when \(r > r_{stable}\). In our
concrete experiments, we set \(r_{solvable} = 5\% \) and
\(r_{stable} = 95\%\).</p>
<img src="./mariposa_categories.png" alt="intuition of Mariposa Categories" style="width:80%">
<p>The <strong>inconclusive</strong> category is needed because of
statistical testing. Specifically, it is often infeasible to
enumerate all the mutants of a query and obtain the true
success rate. (Think about the \(100!\) mutants or more!)
Therefore, Mariposa uses random sampling to <strong>estimate</strong> the
success rate. When the estimated success rate based on the
sampled mutants is close to the boundaries, the statistical
test may not result in enough confidence to place \( (q, s)
\) in any of the previous three categories, yielding an
inconclusive result.</p>
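<p>Putting the pieces together, the decision scheme can be sketched as follows. This is only the intuition: the real Mariposa decision uses a proper statistical test, which I approximate here by treating a confidence interval that straddles a threshold as the inconclusive case. The thresholds are the ones used in our experiments.</p>

```cpp
#include <string>

// Classify a query-solver pair from an estimated mutant success rate r
// (in %) and the half-width of its confidence interval (in %).
// Sketch only; Mariposa's actual decision procedure is statistical.
std::string classify(double r, double interval) {
    const double r_solvable = 5.0, r_stable = 95.0;
    // If the interval straddles a threshold, we cannot commit to a category.
    if (r - interval < r_solvable && r + interval > r_solvable) return "inconclusive";
    if (r - interval < r_stable && r + interval > r_stable) return "inconclusive";
    if (r < r_solvable) return "unsolvable";
    if (r > r_stable) return "stable";
    return "unstable";
}
```
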
<!-- , which correspond
respectively to the lower and upper bounds of the success
rate range for unstable queries. -->
<!-- The scheme
includes two additional parameters: \\(r_{solvable}\\) and
r stable , which correspond respectively to the lower and
upper bounds of the success rate range for unstable queries. -->
<!-- * **Unsolvable**: \\(r < r_{solvable} \\)
* **Unstable**: \\( r_{solvable} \leq r \leq r_{stable} \\)
* **Stable**: \\( r_{stable} < r \\)
* **Inconclusive**. This category is needed due to a
technicality that is less important for our discussion
here. In short, because Mariposa uses random sampling to
estimate the success rate, sometimes statistical tests do
not result in enough confidence to place \\( (q, s) \\) in
any of the previous three categories. -->
<h1 id="measuring-instability-in-the-wild">Measuring Instability in the Wild</h1>
<p>So far we have discussed Mariposa’s methodology to detect
and quantify instability. How much instability is there in
practice? Let me share some experimental results from
existing program verification projects.</p>
<h2 id="projects-and-queries">Projects and Queries</h2>
<p>The table below lists the projects we experimented on.
Generally speaking: (1) they are all verified system
software, such as storage systems, boot loaders, and
hypervisors; (2) they all involve non-trivial engineering
effort, creating a considerable number of SMT queries; and (3)
they are all published at top venues, with source code and
verification artifacts available online. </p>
<table><thead><tr><th align="left">Project Name</th><th align="right">Source Line Count</th><th align="right">Query Count</th><th align="center">Artifact Solver</th></tr></thead><tbody>
<tr><td align="left">Komodo\(_D \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3132747.3132782">(SOSP’17)</a></td><td align="right">26K</td><td align="right">2,054</td><td align="center">Z3 4.5.0</td></tr>
<tr><td align="left">Komodo\(_S \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3341301.3359641">(SOSP’19)</a></td><td align="right">4K</td><td align="right">773</td><td align="center">Z3 4.2.2</td></tr>
<tr><td align="left">VeriBetrKV\(_D \) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/osdi20/presentation/hance">(OSDI’20)</a></td><td align="right">44K</td><td align="right">5,325</td><td align="center">Z3 4.6.0</td></tr>
<tr><td align="left">VeriBetrKV\(_L \) <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3527313">(OOPSLA’22)</a></td><td align="right">49K</td><td align="right">5,600</td><td align="center">Z3 4.8.5</td></tr>
<tr><td align="left">Dice\(_F^⋆\) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity21/presentation/tao">(USENIX’21)</a></td><td align="right">25K</td><td align="right">1,536</td><td align="center">Z3 4.8.5</td></tr>
<tr><td align="left">vWasm\(_F \) <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya">(USENIX’22)</a></td><td align="right">15K</td><td align="right">1,755</td><td align="center">Z3 4.8.5</td></tr>
</tbody></table>
<!-- | Project Name | Source Line Count | Query Count |
|:------------- | -----------:| -----------:|
| Komodo\\(_D \\) | 26K | 2,054 |
| Komodo\\(_S \\) | 4K | 773 |
| VeriBetrKV\\(_D \\) | 44K | 5,325 |
| VeriBetrKV\\(_L \\) | 49K | 5,600 |
| Dice\\(_F^⋆\\) | 25K | 1,536 |
| vWasm\\(_F \\) | 15K | 1,755 | -->
<h2 id="how-much-instability">How Much Instability?</h2>
<p>For our experiments, we focus on the Z3 SMT solver, with
which the projects were developed. We are interested in both
the current and historical status of SMT stability.
Therefore, in addition to the latest Z3 solver (version
4.12.1, as of this work), we include seven legacy versions of
Z3, with the earliest released in
2015. In particular, for each project we include its
artifact solver, which is the version used in the project’s
artifact.</p>
<p>We run each project-solver pair through Mariposa. For each
original query in a project, Mariposa outputs a stability
category. Therefore, each project-solver pair is associated with
a breakdown of different stability categories, plotted as
stacked bars in Figure 1. Focusing on Z3 4.12.1 (the right-most
group), the unstable proportion is highest in
Komodo\(_D \) (\( 5.0\% \)), and is \(2.6\%\) across all queries. This
might not seem like a significant number, but imagine a
regular software project’s continuous integration (CI) where \(\sim 2\%\) of the test cases fail
randomly: it would be a nightmare! Nevertheless, developers
have to bear such a burden in SMT-based verification.</p>
<!-- In all project-solver pairs, the majority of queries are
stable. However, a non-trivial amount of instability
persists as well. -->
<img src="./stability_status.png" alt="historical stability status of Mariposa projects" style="width:100%">
<figcaption>Figure 1: Overall Stability Status. From bottom to top, each stacked bar shows the proportions of unsolvable (lightly shaded), unstable
(deeply shaded), and inconclusive (uncolored) queries. The remaining portion of the queries (stacking each bar to 100%), not shown, are
stable. The solver version used for each project’s artifact is marked with a star (⋆). </figcaption>
<br>
<p>Now that we know instability is not a rare occurrence, the
next question is: what gives? Well, first off, instability
is a property that is jointly determined by the solver and
the query. Therefore, the causes can roughly be categorized
as solver-related and query-related. Of course, I cannot
possibly be exhaustive here, so let me discuss the
significant ones we have found for each side. </p>
<!-- Before we delve into the details, here is a disclaimer: I
can only cover significant causes -->
<h1 id="debugging-the-solver">“Debugging” the Solver</h1>
<!-- As we zoom out to the full picture, the projects exhibit
different historical trends. The unstable proportion of
vWasm\\(_F\\) and Komodo\\(_S\\) remain consistently small
across the solver versions. On the other hand, some
projects seem "overfitted" to their artifact solver, in that
they become less stable with solver upgrades. Specifically,
Komodo\\(_D \\), VeriBetrKV\\(_D \\), and VeriBetrKV\\(_L
\\) show more instability in newer Z3 versions, with a
**noticeable gap** between Z3 4.8.5 and Z3 4.8.8. -->
<p>As you might have noticed already, in Figure 1, there is a
“gap” between Z3 4.8.5 and 4.8.8, where several projects
suffer from noticeably more instability in the newer
solver. In other words, certain queries used to be stable,
but somehow become unstable with the solver upgrade. Since
the query sets did not change, the solver change must be responsible for the regression. </p>
<p>We perform further experiments to narrow down the Z3 git
commits that may have been the problem. In the six
experiment projects, \(285\) queries are stable under Z3
4.8.5 but unstable under Z3 4.8.8. For each query in this
set, we run <a rel="noopener" target="_blank" href="https://www.git-scm.com/docs/git-bisect">git bisect</a> (which calls Mariposa) to find the
commit to blame, i.e., where the query first becomes
unstable.</p>
<p>There are a total of \(1,453\) commits between the two versions,
among which we identify the two most impactful. Out of
the \(285\) regressed queries, \(115\) (\(40\%\)) are blamed on commit <code>5177cc4</code>;
another \(77\) (\(27\%\)) are blamed on <code>1e770af</code>. The
remaining queries are dispersed across the other commits.</p>
<p>The two commits are small and localized: <code>5177cc4</code> has \( 2
\) changed files with \( 8 \) additions and \( 2 \)
deletions; <code>1e770af</code> has only \( 1 \) changed file with
\( 18 \) additions and \( 3 \) deletions. Both commits
are related to the order of disjunctions in a query’s
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">conjunctive normal form</a>.
<code>1e770af</code>, the earlier of the two, sorts the disjunctions,
while <code>5177cc4</code> adds a new term ordering, updating the
sorted disjunction order. Similar to
conjunction order, disjunction order does not affect the
semantics, but interacts with other solver heuristics. Therefore,
the change of disjunction order can be thought of as “internal mutations”
to the query, exposing more instability.</p>
<!-- The results suggest that the solver's internal heuristics
can have a significant impact on stability.
-->
<h1 id="debugging-the-query">“Debugging” the Query</h1>
<p>The discussion so far is condensed from our work on
Mariposa. However, we have yet to cover the query side of
the problem. To that end, let me share some results in our
follow-up work on query context. As it turns out, the
queries we have studied often contain a large amount of
irrelevant information (assertions): the solver does not
need all of that context to find a proof. In fact, the
presence of irrelevant information can “confuse” the
solver, leading to instability. </p>
<!-- , which can be a major source of
instability -->
<h2 id="most-of-the-context-is-irrelevant">Most of the Context is Irrelevant</h2>
<p>Our experiments in this section analyze each query’s
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Unsatisfiable_core"><strong>core</strong></a>. Recall that an SMT query is a
conjunction of assertions. Upon verification success, the
solver can report a core, which is the subset
of the original assertions used in constructing the proof.
This “slimmed-down” version of the query thus serves as an
oracle of <strong>relevant assertions</strong>, and whatever is excluded from
it can be considered irrelevant.</p>
<p>After acquiring a core, we compare its context to the
original query. Using the assertion count as a proxy for the
“size” of the context, we examine the <strong>Relevance Ratio</strong>: \(
\frac{\# \text{core\ assertions}}{\# \text{original\ assertions}} \times 100\% \). Since a core is a
subset of the original query, the lower this ratio is, the
less context remains, and the more irrelevant context the
original query has.</p>
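<p>The relevance ratio itself is a simple conditional proportion. As a hedged illustration (the assertion counts here are made up, not taken from any project):</p>

```cpp
#include <cmath>

// Relevance ratio: the percentage of original assertions kept by the core.
double relevance_ratio(int core_assertions, int original_assertions) {
    return 100.0 * core_assertions / original_assertions;
}
// e.g., a hypothetical query with 10,000 assertions whose core keeps
// only 6 of them has a relevance ratio of 0.06%.
```
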
<img src="./relevance_ratio.png" alt="small cache" style="width:50%">
<figcaption>Figure 2: Query Relevance Ratios. Further to the left means more irrelevant context. Typically, the vast majority of an original query’s context is irrelevant. </figcaption>
<p>Figure 2 shows the CDFs of the relevance ratios for
different projects. For example, on the left side lies the
line for Dice\(_F^⋆\). The median relevance ratio is \(
0.06\% \), meaning that for a typical query in the
project, only \( 0.06\% \) of the context is relevant.
Note that I have excluded Komodo\(_S\) from this
experiment, as its queries each contain only a single
assertion due to special encoding rules. Among
the remaining projects, typically \( 96.23\% – 99.94\%
\) of the context is irrelevant.</p>
<!-- In
vWasm\\(_F \\), the median is \\( 3.77 \\% \\), which is
almost an of order of magnitude higher than the other
projects. This can be attributed to the manual context
tuning vWasm\\(_F \\) developers, who explicitly documented
the tedious effort. -->
<h2 id="irrelevant-context-harms-stability">Irrelevant Context Harms Stability</h2>
<p>Considering the significant amount of irrelevant context, we
further analyze how that impacts stability, by comparing the
original queries and their core counterparts. Given an
original query \(q\) and its core \(q_c\), we introduce
the following metrics among the possible stability status
transitions. </p>
<ul>
<li><strong>Core Preservation</strong>: given that \( q \) is stable, the probability that \( q_c \) remains stable.</li>
<li><strong>Core Mitigation</strong>: given that \( q \) is unstable, the
probability that \( q_c \) becomes stable.</li>
</ul>
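<p>In code, both metrics are conditional proportions; the check below reproduces the Komodo\(_D\) row of the table that follows (\(1,902/1,914 \approx 99.4\%\) preservation, \(84/93 \approx 90.3\%\) mitigation):</p>

```cpp
#include <cmath>

// P(core stable | original stable), in %.
double core_preservation(int stable_originals, int cores_still_stable) {
    return 100.0 * cores_still_stable / stable_originals;
}

// P(core stable | original unstable), in %.
double core_mitigation(int unstable_originals, int cores_became_stable) {
    return 100.0 * cores_became_stable / unstable_originals;
}
```
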
<p>We use the Mariposa tool with Z3 version 4.12.5 in this
experiment. Below we have listed the number of original
queries and the preservation and mitigation scores. As an
example, in the original Komodo\(_D\) queries, \(1,914\)
are stable and \(93\) are unstable. In its core
counterpart, \(99.4\%\) of the stable queries remain
stable, while \(90.3\%\) of the unstable ones become
stable. The vWasm\(_F\) project is the only case where the
core has no mitigation effect. However, its original
unstable query count is very low to begin with. </p>
<!-- As we noted previously,
vWasm\\(_F\\) also starts off with more relevant context
originally. Therefore, the stability of vWasm\\(_F\\) can be
explained by the manual tuning done by the developers. -->
<table><thead><tr><th align="left">Project Name</th><th align="right">Stable Count</th><th align="right">Core Preservation Count</th><th align="right">Unstable Count</th><th align="right">Core Mitigation Count</th></tr></thead><tbody>
<tr><td align="left">Komodo\(_D \)</td><td align="right">1,914</td><td align="right">1,902 (99.4%)</td><td align="right">93</td><td align="right">84 (90.3%)</td></tr>
<tr><td align="left">VeriBetrKV\(_D \)</td><td align="right">4,983</td><td align="right">4,958 (99.5%)</td><td align="right">172</td><td align="right">111 (64.5%)</td></tr>
<tr><td align="left">VeriBetrKV\(_L \)</td><td align="right">4,999</td><td align="right">4,979 (99.6%)</td><td align="right">256</td><td align="right">214 (83.6%)</td></tr>
<tr><td align="left">Dice\(_F^⋆\)</td><td align="right">1,483</td><td align="right">1,477 (99.6%)</td><td align="right">20</td><td align="right">18 (90.0%)</td></tr>
<tr><td align="left">vWasm\(_F \)</td><td align="right">1,731</td><td align="right">1,726 (99.7%)</td><td align="right">4</td><td align="right">0 (0.0%)</td></tr>
<tr><td align="left">Overall</td><td align="right">15,110</td><td align="right">15,042 (99.5%)</td><td align="right">545</td><td align="right">427 (78.3%)</td></tr>
</tbody></table>
<p>Generally, the core is highly likely to preserve what is
stable. Moreover, across all projects, \(78.3\%\) of the
unstable instances can be mitigated by using the core. In
other words, irrelevant context can be thought of as <strong>a
major factor in instability</strong> on the query side! While this
is far from an end-to-end solution, the result suggests a
promising direction to mitigate instability by pruning
irrelevant assertions, which we are exploring in our ongoing
work. </p>
<h1 id="takeaways">Takeaways</h1>
<p>I will conclude with some TLDRs in case someone has skipped ahead
or wants a quick recap.</p>
<ul>
<li>SMT solvers are immensely useful for program verification,
but they introduce the problem of instability, where
trivial changes to the input query may incur spurious
verification failures.</li>
<li>Mariposa is a tool (and a methodology) to detect and
quantify instability.</li>
<li>Instability is a real concern in existing SMT-based
program verification projects. \(2.6\%\) of the queries
in our study are unstable using Z3 4.12.1.</li>
<li>Tiny changes in the solver heuristics can cause noticeable
regression in stability. Just \( 2 \) tiny commits to Z3
are responsible for \( 67.3\% \) of the regression in
our study.</li>
<li>Irrelevant context in the queries is a major source of
instability. Typically, \( 96.23\% – 99.94\% \) of the
context is irrelevant, and pruning it mitigates \(
78.3\%\) of the instability we observed.</li>
</ul>
<p>Last but not least, I would like to reiterate that
instability is a joint problem of the solver heuristic and
the query property, which will take a joint effort to
address. I hope that the work we have done can help to
improve the stability of SMT-based verification in
future research and practice.</p>
Oblivious Maps for Trusted Execution Environments2024-08-05T00:00:00+00:002024-08-05T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/oblivious-maps/<p>\[
\gdef\lf{\left \lfloor}
\gdef\rf{\right \rfloor}
\]</p>
<p>Imagine using a popular messaging app that includes a contact discovery feature to find which of your phone contacts are already using the service and to get information on how to communicate with them. While convenient, this process raises significant privacy concerns: how can you discover mutual contacts without revealing your entire contact list to the messaging app's server?</p>
<p>In a standard implementation, the app might upload your entire contact list to the server to perform the matching, potentially exposing sensitive information to unauthorized access. To address this issue, we need a solution that allows for secure contact discovery without compromising user privacy.</p>
<p>One approach is to leverage Trusted Execution Environments (TEEs), like Intel SGX, to perform these operations securely on the server. TEEs create isolated environments where code and data can be processed without being accessible to the rest of the system. This means that even if the server's operating system is compromised, the information inside the TEE remains protected.</p>
<p>By implementing an oblivious map inside a TEE, we can ensure that neither the app's server nor potential attackers learn anything about your contact list or which queries you performed. Being oblivious means that no information is revealed from the CPU's memory access patterns, making it an ideal solution for privacy-preserving applications.</p>
<p>This blog post explores our research on ENIGMAP <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>, an efficient external-memory oblivious map designed for secure enclaves, offering significant performance improvements over previous work. ENIGMAP enables privacy-preserving contact discovery and other applications by protecting sensitive data and queries from unauthorized access even from the operating system of the machine where ENIGMAP is running.</p>
<h1 id="background">Background</h1>
<p>Before we can dive into the details of ENIGMAP, we first need to understand a few basic concepts: sorted maps, TEEs, external memory, and oblivious algorithms.</p>
<h2 id="sorted-map">Sorted Map</h2>
<p>In ENIGMAP, our goal is to implement an oblivious sorted map. A sorted map of size \(N\) is a data structure that can store up to \(N\) key-value pairs and efficiently supports the following operations:</p>
<ul>
<li><strong>Get(key) -> value:</strong> Returns the value associated with the key, or None if the key was not set before.</li>
<li><strong>Set(key, value):</strong> Sets or updates the value of the key.</li>
<li><strong>Delete(key):</strong> Removes the key from the map.</li>
<li><strong>RangeQuery(keyMin, keyMax) -> [(key, value)]:</strong> Returns all the key-value pairs in the specified range.</li>
</ul>
<p>A search tree, such as a B+ tree or an AVL tree, is typically used to implement a sorted map. Following previous work <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>, we chose to use an AVL tree. <sup class="footnote-reference"><a href="#whynotbplusorhashmap">1</a></sup></p>
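<p>As a point of reference for the interface semantics (with no obliviousness whatsoever), the four operations map directly onto C++'s <code>std::map</code>, which is itself backed by a balanced search tree:</p>

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A plain, non-oblivious sorted map illustrating the interface above.
class SortedMap {
    std::map<int, std::string> tree;  // balanced (red-black) search tree
public:
    // Get(key): returns false (i.e., "None") if the key was never set.
    bool Get(int key, std::string& value) const {
        auto it = tree.find(key);
        if (it == tree.end()) return false;
        value = it->second;
        return true;
    }
    void Set(int key, const std::string& value) { tree[key] = value; }
    void Delete(int key) { tree.erase(key); }
    // RangeQuery: all pairs with keyMin <= key <= keyMax, in sorted order.
    std::vector<std::pair<int, std::string>> RangeQuery(int keyMin, int keyMax) const {
        std::vector<std::pair<int, std::string>> out;
        for (auto it = tree.lower_bound(keyMin);
             it != tree.end() && it->first <= keyMax; ++it)
            out.emplace_back(it->first, it->second);
        return out;
    }
};
```
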
<h3 id="avl-tree">AVL tree</h3>
<p>An AVL tree of size \(N\) is a binary search tree with at most \(N\) nodes, and the following properties:</p>
<ul>
<li>
<p><strong>binary tree</strong> - a tree where each node has a key, a value, and at most 2 children.</p>
</li>
<li>
<p><strong>search tree</strong> - the key of every node is larger than the key of every node on its left subtree and smaller than the key of every node on its right subtree.</p>
</li>
<li>
<p><strong>AVL invariant</strong> - the height of the two child subtrees of any node differs by at most one.</p>
</li>
</ul>
<p>The maximum height of an AVL tree of size \(N\) is \( 1.44 \log_2 N\), and <code>Search(key)</code>, <code>Insert(key,value)</code> and <code>Delete(key)</code> operations – with their standard semantic meaning – can be implemented <sup class="footnote-reference"><a href="#gowiki">2</a></sup> to only access \( O(\log N) \) nodes by doing a binary search for <code>key</code> on the tree. <sup class="footnote-reference"><a href="#boundedheightimportant">3</a></sup></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/avl.png" alt="An example avl tree" />
<strong>Figure 1</strong> - <em>An example AVL tree. Each node is represented only by its key. To search for the key 26, we would touch the nodes on the path from the root: 42, 20, 27 and 26.</em> </p>
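<p>The \( O(\log N) \) access pattern of <code>Search</code> comes from following a single root-to-leaf path. A minimal sketch, on a hand-built fragment reproducing the lookup path described in the caption (42 → 20 → 27 → 26):</p>

```cpp
#include <vector>

struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
    explicit Node(int k) : key(k) {}
};

// Binary-search-tree lookup, recording every node touched. On a balanced
// (e.g., AVL) tree of size N, this path has O(log N) nodes.
std::vector<int> search_path(Node* root, int key) {
    std::vector<int> path;
    for (Node* cur = root; cur != nullptr;) {
        path.push_back(cur->key);
        if (key == cur->key) break;
        cur = (key < cur->key) ? cur->left : cur->right;
    }
    return path;
}
```
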
<!-- TODO: Add avl tree picture here -->
<p>To implement a map using an AVL tree, <code>Get</code>, <code>Set</code> and <code>Delete</code> translate to the equivalent operations on a binary search tree (<code>Search</code>, <code>Insert</code> and <code>Delete</code>), while <code>RangeQuery</code> can be implemented by searching for <code>keyMin</code> and iterating over the successive entries of the map until we hit <code>keyMax</code>.</p>
<p>Ok. So we know AVL trees can be used to implement a map efficiently. Great! But how can we hide the content of our queries from an attacker who controls the machine where the map is stored? </p>
<!-- Isn\'t this impossible to do efficiently? Doesn't [PIR](@/2024/piano-private-information-retrieval.md) imply this problem requires either large communication or large client storage? -->
<p>Well, this is where Trusted Execution Environments (TEEs) come into play. Rather than trusting standard cryptographic assumptions like Computational Diffie-Hellman or the existence of one-way functions, we instead trust… Intel. </p>
<!--
While AVL trees provide an efficient and well-structured way to manage key-value pairs, implementing them in a secure and privacy-preserving manner requires additional considerations, especially in the context of limited secure memory provided by Trusted Execution Environments (TEEs).
-->
<h2 id="tees-and-external-memory">TEEs and External Memory</h2>
<p><code>Trusted Execution Environments</code> (TEEs), like Intel SGX, provide isolated execution environments for sensitive computations. They ensure that data and code running inside of a secure memory region called an <code>enclave</code> are protected from external tampering and observation, even if the operating system is compromised. However, TEEs come with limited secure memory, which poses a challenge for applications to handle large datasets securely.</p>
<blockquote>
<p>The <strong>“Enclave Assumption”</strong> – code inside an enclave runs under the crash-fault model (it either runs the correct computation or crashes, without unexpected behaviors); all of its memory contents are encrypted and cannot be accessed by any other applications running on the same machine; and the speed of code execution is similar to if it was not inside of an enclave.</p>
</blockquote>
<p></p>
<p>To manage datasets that don't fit in the TEE, data must frequently be swapped between the secure enclave memory - called the Enclave Page Cache (EPC) - and insecure <strong>external memory</strong> (RAM or disk). This swapping process, known as page swapping, can significantly impact performance due to the overhead of moving data in and out of the enclave, context switching, and the need to encrypt and decrypt data during these transfers. In fact, we manually measured the cost of <strong>external memory</strong> accesses (Figure 2) - in SGXv2 copying a page from <strong>external memory</strong> to the EPC is about 47x-80x slower than copying the same amount of data inside of the EPC. </p>
<blockquote>
<p>Optimizing the number of <strong>external memory</strong> page swaps is crucial for enhancing the performance of applications running in TEEs.</p>
</blockquote>
<p></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/intrinsics.png" alt="A graph showing the time of several operations in SGXv2 relative to moving a page in enclave memory. Moving swapping to unprotected RAM is about 47x slower, while swapping to disk is 80x slower." />
<strong>Figure 2</strong> - <em>Cost of PageSwap operation of 4KB relative to a MOV of 4KB inside of enclave protected memory. A page swap is about 47
times more expensive than moving 4KB in memory within the enclave.
The costs are color-coded to show the breakdown; blue is the cost of EWB/OCall.</em> <strong>EWB</strong> <em>- enclave write back (the mechanism used by the operating system to swap enclave pages),</em> <strong>OCall</strong> <em>- using SGX's OCall mechanism so that the enclave application manually swaps pages</em>.</p>
<p></p>
<p>Additionally, all the accesses to the external memory can be seen by the operating system and thus by an attacker running the server. Therefore, these accesses should not reveal any information about the client's queries. This is where oblivious algorithms come into play.</p>
<h2 id="oblivious-algorithms">Oblivious Algorithms</h2>
<blockquote>
<p>An <strong>oblivious algorithm</strong> is an algorithm that doesn't leak any information about its inputs to an attacker that has access to a trace of the algorithm's execution.</p>
</blockquote>
<p></p>
<p>In the context of TEEs there are 3 traces to consider:</p>
<ol>
<li>
<p>The addresses of external memory accesses - when we need to access disk, the operating system can always see which disk pages we accessed without any physical attacks <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[3]</a>.</p>
</li>
<li>
<p>The addresses of every RAM access inside of the TEE protected memory - in SGX, it is still the operating system that manages memory pages; therefore the operating system can know at the page level which addresses were accessed <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[3,4]</a>.</p>
</li>
<li>
<p>The instruction trace - the list of executed instructions - which is visible because the CPU fetches instructions by reading RAM addresses <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[4]</a>.</p>
</li>
</ol>
<p>Algorithms that are oblivious with respect to only 1) are typically called <code>weakly oblivious</code> or <code>external memory oblivious</code>; while algorithms that are oblivious with respect to 1), 2) and 3) are called <code>strongly oblivious</code> or simply <code>oblivious algorithms</code>. </p>
<p>In this blogpost we will focus on <code>strongly oblivious algorithms</code>. Under this notion:</p>
<ol>
<li>
<p>All the traces above are public. Only the traces of the CPU registers and CPU caches are private.</p>
</li>
<li>
<p>The limited enclave-protected memory is encrypted and accessible to the enclave, even though its memory access trace is public<sup class="footnote-reference"><a href="#pagelevel">4</a></sup>.</p>
</li>
<li>
<p>The external memory is public, and therefore the enclave needs to encrypt data before moving it there.</p>
</li>
</ol>
<h2 id="oblivious-algorithms-in-practice">Oblivious Algorithms in practice</h2>
<p>So, what do oblivious algorithms look like in practice? Let's consider the following function:</p>
<!-- The concept of oblivious algorithms is also tied to the concept of constant-time algorithms, as the attacker can also learn information based on the number of memory accesses executed. -->
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">constexpr int</span><span> PASSWORD_SIZE </span><span style="color:#ececec;">=</span><span> </span><span style="font-weight:bold;color:#87d6d5;">16</span><span>; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> must be a compile-time constant for the array sizes below</span><span>
</span><span style="color:#fffb9d;">char</span><span> CORRECT[PASSWORD_SIZE];
</span><span style="color:#fffb9d;">bool </span><span style="color:#fffd87;">check_password_nonoblivious</span><span>(</span><span style="color:#fffb9d;">char </span><span>input[PASSWORD_SIZE]) {
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> i</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; i</span><span style="color:#ececec;"><</span><span>PASSWORD_SIZE; i</span><span style="color:#ececec;">++</span><span>) {
</span><span> </span><span style="color:#fed6af;">if </span><span>(CORRECT[i] </span><span style="color:#ececec;">!=</span><span> input[i]) </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#d6d6ae;">false</span><span>;
</span><span> }
</span><span> </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#d6d6ae;">true</span><span>;
</span><span>}
</span></code></pre>
<p><strong>Listing 1</strong> - <em>A non-oblivious version of the check_password function - based on the number of instructions executed an attacker can learn the size of the common prefix between CORRECT and input</em></p>
<p>In <em>Listing 1</em>, the attacker can infer how many initial characters of the input are correct based on the number of memory accesses that <code>check_password_nonoblivious</code> performs - therefore it is not an oblivious algorithm. To make it oblivious, we can make the number of memory accesses independent of the input:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">bool </span><span style="color:#fffd87;">check_password_oblivious</span><span>(</span><span style="color:#fffb9d;">char </span><span>input[PASSWORD_SIZE]) {
</span><span> </span><span style="color:#fffb9d;">bool</span><span> ret </span><span style="color:#ececec;">= </span><span style="font-weight:bold;color:#d6d6ae;">true</span><span>;
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> i</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; i</span><span style="color:#ececec;"><</span><span>PASSWORD_SIZE; i</span><span style="color:#ececec;">++</span><span>) {
</span><span>  </span><span style="color:#fffb9d;">bool</span><span> condition </span><span style="color:#ececec;">=</span><span> CORRECT[i] </span><span style="color:#ececec;">==</span><span> input[i];
</span><span> ret </span><span style="color:#ececec;">=</span><span> ret </span><span style="color:#ececec;">*</span><span> condition; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> if (!condition) ret = false;
</span><span> }
</span><span> </span><span style="color:#fed6af;">return</span><span> ret;
</span><span>}
</span></code></pre>
<p><strong>Listing 2</strong> - <em>An oblivious version of the check_password function - the memory access trace is now constant.<sup class="footnote-reference"><a href="#andcppshortcircuits">5</a></sup></em></p>
<p>If you are familiar with <em>constant-time cryptography</em> you probably noticed that this oblivious algorithm is in fact also a <a rel="noopener" target="_blank" href="https://www.bearssl.org/constanttime.html">constant-time algorithm</a>. These two notions are closely related - if an attacker knows the number of memory addresses accessed, then the attacker has the information needed for timing attacks. </p>
<p>However, compared to constant-time algorithms - where we only need to care about the computation time being constant - the oblivious notion is stronger - we also need to make sure every single address we access does not leak any information. Even accessing a single array index can leak information about the data being processed:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">access_array_nonoblivious</span><span>(</span><span style="color:#fffb9d;">int </span><span>Array[MAX_SIZE], </span><span style="color:#fffb9d;">int </span><span>i) {
</span><span> </span><span style="color:#fed6af;">return</span><span> Array[i];
</span><span>}
</span></code></pre>
<p><strong>Listing 3</strong> - <em>A non-oblivious array access</em></p>
<p>When we call <code>access_array_nonoblivious</code>, we will access the memory address <code>(Array+i)</code>. If the attacker can see every address we use, then the attacker can learn whether two calls to this function have the same arguments. To protect against this, we could again rely on a constant-time algorithm - scanning the entire array every time we need to access a single index, making use of the x86 conditional move instruction - <code>CMOV(cond, target, value)</code>. </p>
<blockquote>
<p>The <code>CMOV(cond, target, value)</code> instruction assigns <code>value</code> to <code>target</code> if <code>cond</code> is true, but always fetches <code>target</code> and <code>value</code> from memory, resulting in a constant memory trace. </p>
</blockquote>
<p></p>
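<p>As a concrete illustration, a branchless <code>CMOV</code> can be sketched in plain C++. This is a toy: production constant-time code typically uses the actual x86 <code>cmov</code> instruction (or inline assembly) so that the compiler cannot turn the selection back into a branch.</p>

```cpp
#include <cassert>
#include <cstdint>

// Branchless CMOV sketch: the mask, not a branch, decides the result.
inline void CMOV(bool cond, int& target, int value) {
    // mask is all-ones when cond is true, all-zeros when it is false
    uint32_t mask = -static_cast<uint32_t>(cond);
    // both operands are always read; only the mask selects between them
    target = static_cast<int>((static_cast<uint32_t>(target) & ~mask) |
                              (static_cast<uint32_t>(value) & mask));
}
```

<p>Both <code>target</code> and <code>value</code> are always read, so the memory trace does not depend on <code>cond</code>.</p>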
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">access_array_linear_scan</span><span>(</span><span style="color:#fffb9d;">int </span><span>Array[MAX_SIZE], </span><span style="color:#fffb9d;">int </span><span>index) {
</span><span> </span><span style="color:#fffb9d;">int</span><span> ret </span><span style="color:#ececec;">= </span><span style="font-weight:bold;color:#87d6d5;">0</span><span>;
</span><span> </span><span style="color:#fed6af;">for </span><span>(</span><span style="color:#fffb9d;">int</span><span> j</span><span style="color:#ececec;">=</span><span style="font-weight:bold;color:#87d6d5;">0</span><span>; j</span><span style="color:#ececec;"><</span><span>MAX_SIZE; j</span><span style="color:#ececec;">++</span><span>) {
</span><span>    CMOV(j</span><span style="color:#ececec;">==</span><span>index, ret, Array[j]);
</span><span> }
</span><span>  </span><span style="color:#fed6af;">return</span><span> ret;
</span><span>}
</span></code></pre>
<p><strong>Listing 4</strong> - <em>An oblivious array access via linear scan - this takes time<sup class="footnote-reference"><a href="#timemeaning">6</a></sup> \( O( \) MAX_SIZE \() \)</em></p>
<p>While using a linear scan for each array access ensures obliviousness, it is highly inefficient - we need to traverse the whole array every time we want to access a single element. However, this kind of linear scan solution is used in practice; for instance, <a rel="noopener" target="_blank" href="https://signal.org/">Signal</a> used it along with other techniques to offer <a rel="noopener" target="_blank" href="https://signal.org/blog/private-contact-discovery/">Private Contact Discovery</a>.</p>
<p>To address this inefficiency, more sophisticated methods have been developed to hide which index of an array is accessed. Rather than producing a constant memory access trace, these methods produce a random-looking one - PathORAM is one such method.</p>
<h2 id="pathoram">PathORAM</h2>
<p>Path Oblivious RAM (PathORAM) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a> is an efficient protocol designed to hide the access patterns to an array of size \(N\). The key insight for PathORAM is that rather than doing a linear scan to create a constant trace, we keep dynamically moving data in unprotected memory, so that the memory access trace is indistinguishable from that of accessing random positions in the array - so nothing can be inferred about which indexes we are accessing from the memory access trace. </p>
<h3 id="interface-for-pathoram">Interface for PathORAM</h3>
<p>PathORAM provides a straightforward interface:</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">Access</span><span>(</span><span style="color:#fffb9d;">bool </span><span>readOrWrite, </span><span style="color:#fffb9d;">int </span><span>addr, </span><span style="color:#fffb9d;">int</span><span style="color:#ececec;">& </span><span>pos, </span><span style="color:#fffb9d;">int</span><span style="color:#ececec;">& </span><span>data)
</span></code></pre>
<blockquote>
<p>Performs a read or write operation (<code>readOrWrite</code>) on the specified address <code>addr</code>, reading from or writing to <code>data</code>. The position <code>pos</code> has information about where the specified address is stored and is updated for <code>addr</code> after each call to access. It is up to the caller to keep track of <code>pos</code> for each address<sup class="footnote-reference"><a href="#readpathoram">7</a></sup>.</p>
</blockquote>
<p></p>
<p>In PathORAM, each address is assigned a random position in {0, ..., N-1} that identifies where that address is stored in public memory (we will see how soon). This position is leaked after each access call; therefore, after each access call, a new random position is generated for that address. (We will explain how to keep track of all the positions for a binary search tree in the next section - <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#keeptrackofpositions">Oblivious Data Structures</a> - for now, just assume there is a way to keep track of the position for each address.)</p>
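<p>The caller-side bookkeeping this interface implies can be sketched as a toy - here <code>Access</code> is just a stand-in that serves a plain array and hands back a fresh random position, and <code>Read</code>, <code>Write</code>, and <code>posMap</code> are our own illustrative names:</p>

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

constexpr int CAPACITY = 8;
static int backingStore[CAPACITY];        // stand-in for the ORAM tree

// Stand-in for PathORAM's Access: same signature shape, no obliviousness.
void Access(bool readOrWrite, int addr, int& pos, int& data) {
    if (readOrWrite) backingStore[addr] = data;   // true = write
    else data = backingStore[addr];               // false = read
    pos = std::rand() % CAPACITY;   // addr is remapped after every call
}

// The caller owns the position map and passes the stored position back in.
std::vector<int> posMap(CAPACITY, 0);

int Read(int addr) {
    int data = 0;
    Access(false, addr, posMap[addr], data);  // posMap[addr] updated in place
    return data;
}

void Write(int addr, int value) {
    Access(true, addr, posMap[addr], value);
}
```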
<h3 id="how-pathoram-works">How PathORAM works</h3>
<p>In PathORAM, each array index is stored as a block: </p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">template </span><span><</span><span style="color:#fffb9d;">typename</span><span> T>
</span><span style="color:#fffb9d;">struct </span><span>Block{
</span><span> </span><span style="color:#fffb9d;">int</span><span> address;
</span><span> </span><span style="color:#fffb9d;">int</span><span> position; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> only used if the block is in the stash
</span><span> T data;
</span><span>}
</span></code></pre>
<p>PathORAM has two major structures:</p>
<ol>
<li>
<p><strong>Stash</strong> - where we keep recently accessed blocks. The stash has a constant maximum size and is accessed using linear scans.</p>
</li>
<li>
<p><strong>ORAM Tree</strong> (Figure 3) - an almost<sup class="footnote-reference"><a href="#acbst">8</a></sup> complete binary tree with \(N\) leaves where each node is called a Bucket and can have up to <code>Z</code> (typically <code>Z=4</code>) blocks of data. Whenever we access this tree, we will leak which nodes are being accessed in the trace.</p>
</li>
</ol>
<p>The <code>position</code> mentioned in the PathORAM interface identifies a unique path from the root to a leaf in this tree. If an address has a given position, then its block can be stored in any bucket on the path corresponding to that position.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/orambasic.png" alt="A visualization of ORAM" />
<strong>Figure 3</strong> - <em>A visualization of the PathORAM tree with <code>Z=4</code>, where path 2 - the path with all the buckets that could contain position 2 - is highlighted.</em></p>
<p>When we construct the ORAM, every address is assigned a random <code>position</code>, and is placed in some bucket in the corresponding path<sup class="footnote-reference"><a href="#readpathoram">7</a></sup>. When the access operation is called, the PathORAM algorithm does the following steps:</p>
<ol>
<li>
<p><strong>Read Path</strong> - Move, from the ORAM tree to the stash, all the blocks on the path identified by the original position of the address.</p>
</li>
<li>
<p><strong>Generate new position</strong> - Generate a new random position for the address we just accessed.</p>
</li>
<li>
<p><strong>Stash Operation</strong> - Do a linear scan over the stash to do the read/write operation on the address we want, and update its position.</p>
</li>
<li>
<p><strong>Stash Eviction</strong> - Pick a random path, and try to obliviously move blocks from the stash to this path - without revealing how many blocks were moved and the locations they were moved to. After this operation, the remaining number of blocks in the stash is below a small constant with very high probability <sup class="footnote-reference"><a href="#infoconstant">9</a></sup>.</p>
</li>
</ol>
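<p>The four steps above can be condensed into a toy, non-secure PathORAM sketch. The sizes, <code>Z</code>, and all helper names are our assumptions for illustration; a real implementation encrypts buckets and keeps the stash scans themselves oblivious.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

constexpr int NLEAVES = 4, Z = 2, NNODES = 2 * NLEAVES - 1;

struct OramBlock { int addr, pos, data; };

std::vector<OramBlock> oramTree[NNODES];       // one bucket (up to Z) per node
std::vector<OramBlock> stash;                  // recently accessed blocks
std::vector<int> positionMap(2 * NLEAVES, 0);  // addr -> current leaf position

std::vector<int> PathNodes(int pos) {          // node indices, leaf up to root
    std::vector<int> path;
    for (int n = NLEAVES - 1 + pos; ; n = (n - 1) / 2) {
        path.push_back(n);
        if (n == 0) break;
    }
    return path;
}

int Access(bool write, int addr, int data) {
    int pos = positionMap[addr];
    for (int n : PathNodes(pos)) {             // 1. Read Path into the stash
        for (OramBlock& b : oramTree[n]) stash.push_back(b);
        oramTree[n].clear();
    }
    int newPos = std::rand() % NLEAVES;        // 2. Generate new position
    int ret = 0;                               // 3. Stash op via linear scan
    bool found = false;
    for (OramBlock& b : stash)
        if (b.addr == addr) {
            found = true;
            b.pos = newPos;
            if (write) b.data = data; else ret = b.data;
        }
    if (!found && write) stash.push_back({addr, newPos, data});
    positionMap[addr] = newPos;
    int evictPos = std::rand() % NLEAVES;      // 4. Evict along a random path,
    for (int n : PathNodes(evictPos)) {        //    deepest bucket first
        for (auto it = stash.begin();
             it != stash.end() && (int)oramTree[n].size() < Z; ) {
            std::vector<int> bp = PathNodes(it->pos);
            if (std::find(bp.begin(), bp.end(), n) != bp.end()) {
                oramTree[n].push_back(*it);    // stays on its assigned path
                it = stash.erase(it);
            } else ++it;
        }
    }
    return ret;
}
```

<p>The invariant to notice is that a block assigned position <code>p</code> is always either in the stash or in some bucket on path <code>p</code> - which is exactly why reading path <code>p</code> in step 1 is guaranteed to find it.</p>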
<p>Provided we can keep track of the positions in some way (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#keeptrackofpositions">section Oblivious Data Structures</a>), we can do the access operation in an ORAM of size \(N\) in \( O(\log N \log\log^2 N) \) (non-private) memory accesses. <sup class="footnote-reference"><a href="#readpathoram">7</a></sup> </p>
<p>Let's now see how we can keep track of the node positions for a binary search tree.</p>
<h2 id="keeptrackofpositions">Oblivious Data Structures</h2>
<p>In order to keep track of the positions of all the nodes in a binary search tree, we can use an insight from the Oblivious Data Structures (ODS) paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[2]</a>:</p>
<blockquote>
<p><strong>We only need to store the position of the root node</strong> - since every operation in a binary search tree always accesses the nodes starting from the root of the binary search tree, we can store in each node the position of its two children directly (See Figure 4).</p>
</blockquote>
<p></p>
<p>This works because, when we want to access a node's children, we can generate the children's new random positions ahead of time and store them in the parent before actually accessing the children. </p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/avloram.png" alt="AVL Tree and how it is mapped to ORAM" />
<strong>Figure 4</strong> - <em>A visualization of the logical AVL tree and how it is mapped to the PathORAM tree. We only need to keep track that node 42 is in position 2. Its children (nodes 20 and 73) have their position information (2 and 7, respectively) stored inside node 42.</em></p>
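<p>The descent rule can be sketched as follows. All names here are hypothetical: <code>FakeOramRead</code> stands in for PathORAM's <code>Access</code>, and a plain <code>std::map</code> plays the role of the ORAM tree, so positions are accepted but ignored.</p>

```cpp
#include <cassert>
#include <cstdlib>
#include <map>

struct NodeRec { int key; int leftAddr, leftPos, rightAddr, rightPos; };
std::map<int, NodeRec> fakeOram;  // addr -> node; addr -1 means "no child"

NodeRec& FakeOramRead(int addr, int /*oldPos*/, int /*newPos*/) {
    return fakeOram[addr];        // a real ORAM would remap addr to newPos
}

bool Lookup(int rootAddr, int rootPos, int key) {
    int addr = rootAddr, pos = rootPos;
    int newPos = std::rand() % 1024;  // only the root's position is remembered
    while (addr != -1) {
        NodeRec& n = FakeOramRead(addr, pos, newPos);
        if (n.key == key) return true;
        // Generate the child's next position and record it in the parent
        // *before* accessing the child.
        int childNewPos = std::rand() % 1024;
        if (key < n.key) {
            addr = n.leftAddr;  pos = n.leftPos;  n.leftPos  = childNewPos;
        } else {
            addr = n.rightAddr; pos = n.rightPos; n.rightPos = childNewPos;
        }
        newPos = childNewPos;
    }
    return false;
}
```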
<p>This insight from ODS was previously implemented by P. Mishra et al. in Oblix <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>, where an oblivious AVL tree was used to develop an oblivious map. <strong>In ENIGMAP, we build on the insight from ODS to develop an oblivious AVL tree, with practical optimizations related to TEEs.</strong></p>
<h1 id="enigmap">ENIGMAP</h1>
<p>Our main contributions with ENIGMAP are:</p>
<ol>
<li>
<p>Identifying external memory accesses as an important cost of oblivious algorithms in TEEs (Figure 2).</p>
</li>
<li>
<p>An asymptotically and concretely faster <em>strongly oblivious</em> sorted map both in number of instructions executed as well as external memory accesses (section <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#mainqueryoptimizations">Main Query Optimizations</a>).</p>
</li>
<li>
<p>A faster initialization algorithm for oblivious maps, making it practical for large database sizes (section <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#fastinitializationalgorithm">Fast Initialization Algorithm</a>).</p>
</li>
</ol>
<p>So, let's take a look at a few of the optimizations done in ENIGMAP!</p>
<h2 id="mainqueryoptimizations">Main Query Optimizations</h2>
<h3 id="optimization-1-locality-friendly-layout-for-the-oram-tree">Optimization 1 - Locality Friendly Layout for the ORAM tree</h3>
<p>To improve the locality of data accesses, ENIGMAP leverages concepts from cache-oblivious algorithms and the van Emde Boas (vEB) layout <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[8]</a>. By organizing the ORAM tree in a cache-efficient manner, we reduce the number of page swaps needed to access a path, significantly improving access times<sup class="footnote-reference"><a href="#btwnotonlyexternalmemory">10</a></sup>.</p>
<p>When we want to access an AVL tree node, we need to call <code>Access</code> on the ORAM to get that node (recall from Figure 4 that each AVL node is stored in the ORAM), and therefore we have to read all the buckets in the path where that node is stored. In external memory, these buckets are stored in pages, which are read atomically but can typically hold \(B\) buckets each. </p>
<p>If we were to store the buckets in heap layout - that is, level by level, left to right - we would have to read \( \log N \) pages, since apart from the first few levels, all the buckets would end up in different pages (see Heap Layout in Figure 5).</p>
<blockquote>
<p>Instead, ENIGMAP uses a locality-friendly layout<sup class="footnote-reference"><a href="#noteembdaboas">11</a></sup> - we find the size of the largest ORAM subtree that fits in a page (its height is \( \lfloor \log_2{B} \rfloor \)) and store each subtree of that size together in a single disk page (see Our Layout in Figure 5). This optimization means we only have to read \( \log{\frac{N}{B}} \) pages per ORAM access.</p>
</blockquote>
<p></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/treepacking.png" alt="alt text" />
<strong>Figure 5</strong> - <em>Comparison of the Heap Layout with the layout used in ENIGMAP, considering B=3. Each triangle/rectangle represents a disk page. Red: pages read while accessing a given path. To read a given path, ENIGMAP reads an optimal number of pages (3 in the example), while the Heap Layout reads one page per bucket (5 in the example).</em></p>
<h3 id="optimization-2-ensuring-integrity-and-freshness-with-low-cost">Optimization 2 - Ensuring integrity and freshness with low cost</h3>
<p>Another key optimization enabled by the locality-friendly layout is that we can ensure integrity and freshness of data in external memory at almost no extra cost. Since we always access ORAM pages along a path from the root, and each friendly-layout subtree corresponds to a disk page, we can build a Merkle tree over the friendly-layout subtrees. Each subtree is encrypted with AES-GCM and stores the nonces of its children subtrees' encryptions. The main application only needs to keep the nonce of the root subtree to ensure integrity and freshness.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/friendlylayout_integrity.png" alt="alt text" />
<strong>Figure 6</strong> - <em>Achieving integrity protection efficiently in external memory on a tree can be done with a Merkle-tree. The fact we have subtrees packed together allows us to have good nonce-size to data-size ratio.</em></p>
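<p>The tag-chaining idea behind this Merkle construction can be illustrated with a toy. Here <code>std::hash</code> stands in for AES-GCM authentication and the names are ours, not ENIGMAP's API: each sealed page binds its payload to its children's tags, so keeping only the root tag authenticates every page on a root-to-leaf read.</p>

```cpp
#include <cassert>
#include <functional>
#include <string>

struct SealedPage { std::string payload; size_t childTag[2]; };

size_t Seal(const SealedPage& p) {  // tag covers payload and both child tags
    std::hash<std::string> h;
    return h(p.payload + "|" + std::to_string(p.childTag[0]) +
             "|" + std::to_string(p.childTag[1]));
}

bool Verify(const SealedPage& p, size_t expectedTag) {
    return Seal(p) == expectedTag;
}
```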
<h3 id="optimization-3-multi-level-caching-batching">Optimization 3 - Multi-level caching + batching</h3>
<p>ENIGMAP employs a multi-level caching scheme to optimize data access:</p>
<ul>
<li>
<p><strong>Page-Level Cache</strong>: Outside the enclave, this cache reduces the frequency of page swaps from disk.<sup class="footnote-reference"><a href="#sgxv2tdxbad">12</a></sup></p>
</li>
<li>
<p><strong>Bucket-Level Cache</strong>: Inside the enclave, it caches frequently accessed data to minimize external-memory calls; specifically, we cache the top levels of the ORAM tree so they always stay inside the enclave.</p>
</li>
<li>
<p><strong>AVL-Tree node Cache</strong>: This cache is specifically designed to optimize searches within the AVL tree. It is implemented by temporarily marking AVL nodes as sticky - these nodes should stay in the stash during eviction, until we mark them as non-sticky. If a node is sticky, we can just access it directly via a linear scan from the stash, without paying the ORAM overhead. We use this optimization in two ways:</p>
<ul>
<li><strong>AVL tree top caching</strong> - the first few AVL levels during search can just be accessed via linear scan.</li>
<li><strong>AVL batched operations</strong> - when we do an AVL tree insertion, we need to do two passes over the same AVL path. In the first pass, we mark all the nodes we will have to access as sticky, so that on the second pass we can access them faster via a linear scan of the stash.</li>
</ul>
</li>
</ul>
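<p>The sticky-marking mechanism can be sketched as a toy with illustrative names; the real stash additionally keeps its scans and eviction pattern oblivious.</p>

```cpp
#include <cassert>
#include <vector>

struct CacheBlock { int addr, data; bool sticky; };
std::vector<CacheBlock> localStash;

void PinSticky(int addr, int data) { localStash.push_back({addr, data, true}); }

void UnpinSticky(int addr) {
    for (CacheBlock& b : localStash)
        if (b.addr == addr) b.sticky = false;
}

bool StashRead(int addr, int& data) {  // full linear scan, as in the stash
    bool found = false;
    for (const CacheBlock& b : localStash)
        if (b.addr == addr) { data = b.data; found = true; }
    return found;
}

void EvictNonSticky() {                // sticky blocks survive eviction
    std::vector<CacheBlock> kept;
    for (const CacheBlock& b : localStash)
        if (b.sticky) kept.push_back(b);
    localStash.swap(kept);
}
```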
<p>We encourage you to read about further optimizations and how each of these optimizations impacts performance in our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>.</p>
<h2 id="fastinitializationalgorithm">Fast Initialization Algorithm</h2>
<p>Imagine we have an array of \(N\) key-value pairs and want to initialize an oblivious map with them. The simplest solution (Naive Solution) would be to start with an empty map and call <code>Set(key, value)</code> once for each key-value pair - this would cost us \(N\) AVL tree insertions. We can do better!</p>
<blockquote>
<p>Instead of doing \(N\) insertions, we construct the AVL tree with all the values at once. </p>
</blockquote>
<p></p>
<p>This works in two phases:</p>
<ul>
<li>
<p><strong><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#phase1">Phase 1</a></strong> - we build the logical AVL tree nodes with the correct random positions assigned.</p>
</li>
<li>
<p><strong><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#phase2">Phase 2</a></strong> - we place the nodes into the ORAM tree obliviously, using a PathORAM initialization algorithm.</p>
</li>
</ul>
<p>In order to better understand our algorithm, we will go over it step by step with an example. For the sake of simplicity, we will represent each key-value pair in the initial array by its key:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part1.png" alt="alt text" /></p>
<h3 id="phase1">Phase 1 - AVL Tree Construction</h3>
<p>In the first phase, our goal is to build the AVL tree, by assigning random positions to all the nodes, and correctly assigning children and storing the children's position on each node such that the AVL tree properties are preserved.</p>
<p>We start by obliviously sorting<sup class="footnote-reference"><a href="#osort">13</a></sup> the array and assigning random positions to each node<sup class="footnote-reference"><a href="#nosortneeded">14</a></sup>:
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part2.png" alt="alt text" /></p>
<p>Notice that any sorted array represents an implicit binary search tree with the AVL property - doing a binary search on the array corresponds to traversing a binary search tree.
So now, we need to correctly store the children positions on each node:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/init_part3.png" alt="alt text" /></p>
<p>To do so, we use the <code>Propagate</code> procedure - <em>Listing 5</em>. For our example, the propagate algorithm for our array should be called as <code>Propagate(arr, 0, 7)</code>.</p>
<pre data-lang="c++" style="background-color:#393939;color:#dedede;" class="language-c++ "><code class="language-c++" data-lang="c++"><span style="color:#fffb9d;">struct </span><span>AddrPos {
</span><span> </span><span style="color:#fffb9d;">int</span><span> addr;
</span><span> </span><span style="color:#fffb9d;">int</span><span> pos;
</span><span>};
</span><span>
</span><span style="color:#fffb9d;">struct </span><span>Node {
</span><span> </span><span style="color:#fffb9d;">int</span><span> key;
</span><span> AddrPos left, right;
</span><span> AddrPos ap; </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> exists only during the construction algorithm.
</span><span>};
</span><span>
</span><span>AddrPos Propagate(vectorExternalMemory&lt;Node&gt;&amp; nodes, int left, int right) {
  if (left &gt; right) return NULL_ADDRPOS; // empty range: sentinel for a missing child
  int curr = (left + right) / 2;          // this node is the subtree root
  if (left == right) return nodes[curr].ap; // leaf: no children to update
  nodes.MarkSticky(curr);                 // pin in the stash while we recurse
  nodes[curr].left  = Propagate(nodes, left, curr - 1);
  nodes[curr].right = Propagate(nodes, curr + 1, right);
  nodes.MarkNotSticky(curr);
  return nodes[curr].ap;
}
</span></code></pre>
<p><strong>Listing 5</strong> - <em>Propagate procedure pseudocode</em></p>
<p><code>Propagate</code> will be called once for each node. Every time we access a node, we mark it as sticky so it won't be swapped into external memory, and then recurse on each child to get its position. This means each node will be transferred at most once from external memory, and at any given time at most \(\log N\) nodes are marked as sticky - since that is the maximum tree depth. Therefore, this algorithm incurs at most \(N\) external memory transfers.</p>
<p>Notice that the memory access pattern is oblivious since it depends only on the length of the array, and not on the content of the key-value pairs themselves. </p>
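<p>This is easy to check with a simplified <code>Propagate</code> (no ORAM, no stickiness) that merely records which indices it touches - the trace is a function of <code>(left, right)</code> alone:</p>

```cpp
#include <cassert>
#include <vector>

void PropagateTrace(int left, int right, std::vector<int>& trace) {
    if (left > right) return;       // empty range
    int curr = (left + right) / 2;  // root of this subtree
    trace.push_back(curr);          // record the "access"
    PropagateTrace(left, curr - 1, trace);
    PropagateTrace(curr + 1, right, trace);
}
```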
<p>Apart from the initial sorting, each Phase 1 stage does a linear number of external memory accesses and computation steps. </p>
<h3 id="phase2">Phase 2 - PathORAM Initialization</h3>
<p>The second phase of the algorithm is an ORAM initialization algorithm - we have the contents of all the blocks, as well as a randomly assigned position for each block, and we now want to place them in the ORAM tree without leaking where each block is stored. We encourage you to read about it in our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[6]</a>.</p>
<p><strong>For now, let's take a look at the performance of ENIGMAP.</strong></p>
<h1 id="results">Results</h1>
<p>In order to evaluate ENIGMAP, we compare the performance of each map operation against two implementations:</p>
<ul>
<li>
<p>Signal's <a rel="noopener" target="_blank" href="https://signal.org/blog/private-contact-discovery/">Linear Scan Solution</a> - It does a linear scan of the whole database for a batch of \( \beta \) queries, indexing each entry of the database against a hashtable built obliviously from the batch of queries.</p>
</li>
<li>
<p>Oblix <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a> - The previous state of the art. It also uses an ODS-based AVL tree.</p>
</li>
</ul>
<blockquote>
<p>Experimental Setup</p>
<ul>
<li><strong>Database Size (N):</strong> We tested with varying sizes, up to 256 million entries.</li>
<li><strong>Batch Size (\( \beta \)):</strong> Because Signal's solution is optimized to work with batches of queries, we introduce the parameter \( \beta \) to define the number of queries in a batch. We used batch sizes of 1, 10, 100, and 1000 queries.</li>
<li><strong>SGX Setting:</strong> In this blogpost we report results for a large EPC size (192GB)<sup class="footnote-reference"><a href="#largeepcbad">15</a></sup>, on a machine with 256GB of non-EPC RAM. Please refer to our paper for other settings, such as a small EPC size (128MB). </li>
<li><strong>Key size:</strong> All the keys in the experiments have 8 bytes each.</li>
</ul>
</blockquote>
<h2 id="get-latency"><code>Get</code> Latency</h2>
<p>We analyzed the performance of doing batches of \( \beta \) <code>Get</code> operations on each map implementation, in terms of latency and throughput.</p>
<blockquote>
<p>At a database size of \(2^{26}\), ENIGMAP achieves a throughput speedup of 2x on <code>Get</code> queries, while maintaining a latency per <code>Get</code> of 0.45ms compared to Signal's 930ms and Oblix's 11ms.</p>
</blockquote>
<blockquote>
<p>At a database size of \(2^{32}\), ENIGMAP achieves a throughput speedup of 130x on <code>Get</code> queries, while maintaining a latency per <code>Get</code> of 2ms compared to Signal's 133000ms.<sup class="footnote-reference"><a href="#nooblixlargeepc">16</a></sup></p>
</blockquote>
<h3 id="asymptotics">Asymptotics</h3>
<!--
In terms of query throughput and latency, we are assymptotically faster than both Oblix and Signal, as we can see from Table 1.
-->
<table><thead><tr><th>Scheme</th><th>Page Swaps</th><th>Compute</th></tr></thead><tbody>
<tr><td>Signal</td><td>\( O\left(\frac{N}{B}\right) \)</td><td>\( O(N + \beta^2) \)</td></tr>
<tr><td>Oblix</td><td>\( O\left(\beta \log^2 N\right) \)</td><td>\( O\left(\beta \log^3 N\right) \)</td></tr>
<tr><td>ENIGMAP</td><td>\( O\left(\beta \log_B N \log N\right) \)</td><td>\( O\left(\beta \log^2 N \log \log N\right) \)</td></tr>
</tbody></table>
<p><em>Table 1 - Cost of a batch of \( \beta \) <code>Get</code> queries, on a map with N elements, page size B (key-value pairs) and EPC of size M (key-value pairs)</em></p>
<h3 id="experimental-latency">Experimental Latency</h3>
<!--
| \\( \beta \\) | Signal (ops/s) | Oblix (ops/s) | ENIGMAP (ops/s) | Signal (latency \\( ms\\)) | Oblix (latency \\( ms\\)) | ENIGMAP (latency \\( ms\\)) |
|--------------------------|------------------|-----------------|-------------------|------------------|-----------------|-------------------|
| 1 | 1.1 | 91.5 | **2200** | 920 | 11 | **0.45** |
| 10 | 10.9 | 91.4 | **2200** | 921 | 109 | **4.58** |
| 100 | 109 | 91.1 | **2200** | 924 | 1096 | **45.8** |
| 1000 | 1086 | 90.6 | **2200** | 930 | 11040 | **458** |
*Table 3 - Throughput of each solution for varying batch sizes at a database size \\(N=2^{26}\\). The best results for each row are shown in bold.*
| \\( \beta \\) | Signal (ops/s) | ENIGMAP (ops/s) | Signal (latency \\( ms\\)) | ENIGMAP (latency \\( ms\\)) |
|--------------------------|------------------|-----------------|-------------------|------------------|
| 1 | 0.008 | **970** | 133507 | **1.03** |
| 10 | 0.008 | **970** | 133502 | **10.3** |
| 100 | 0.008 | **970** | 133522 | **103** |
| 1000 | 0.008 | **970** | 133531 | **1027** |
*Table 4 - Throughput of each solution for varying batch sizes at a database size \\(N=2^{32}\\). The best results for each row are shown in bold.*
-->
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/graph_query.png" alt="alt text" />
<strong>Figure 7</strong> - <em>Comparison of ENIGMAP and Signal on
SGXv2. Enclave memory size is 192GB, RAM size is
256GB. The vertical lines mark when ENIGMAP and Signal
start to incur RAM and disk swaps, respectively. Comparison with Oblix in our paper is done through relative comparison to Signal (refer to Figure 9 in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[5]</a>)</em></p>
<p><em>Figure 7</em> shows that:</p>
<ul>
<li>In terms of latency (measured at \(\beta=1\)), ENIGMAP always outperforms Signal (and Oblix). </li>
<li>In terms of throughput, for batch sizes of 10, 100 and 1000, ENIGMAP starts to outperform Signal at database sizes of \(2^{17}\), \(2^{22}\), and \(2^{25}\), respectively. Signal's quadratic computation term on the batch size makes it perform worse with batches larger than the ones tested.</li>
</ul>
<p>In our paper, we also analyze the query performance of Insertions and Deletions, and repeat the same experiments under different EPC and external-memory constraints. </p>
<blockquote>
<p>The key takeaway is that for medium and large database sizes (larger than \(2^{25}\) entries), ENIGMAP's throughput always outperforms the linear scan solution, making clear the superiority of ODS for TEEs.</p>
</blockquote>
<h2 id="initialization">Initialization</h2>
<p>We analyzed how long it takes to initialize an oblivious map of \(N\) entries.</p>
<blockquote>
<p>At a database size of \(2^{26}\), ENIGMAP's initialization takes 9.5h, a speedup of 18x compared to Oblix, but much slower than the few minutes Signal needs to create an enclave and write the key-value pairs to enclave memory.</p>
</blockquote>
<!--
In terms of initialization, we are faster than Oblix, but worse than Signal, since their initialization is just copying a single array from outside the enclave to the enclave, as we can see from Table 2.
-->
<h3 id="initialization-assymptotics">Initialization - Asymptotics</h3>
<table><thead><tr><th>Scheme</th><th>Page Swaps</th><th>Compute</th></tr></thead><tbody>
<tr><td>Signal</td><td>\( O\left(\frac{N}{B}\right) \)</td><td>\( O(N) \)</td></tr>
<tr><td>Oblix</td><td>\( O\left(\frac{N}{B} \log^2 N\right) \)</td><td>\( O(N \log N) \)</td></tr>
<tr><td>ENIGMAP</td><td>\( O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right) \)</td><td>\( O(N \log N) \)</td></tr>
</tbody></table>
<p><strong>Table 2</strong> - <em>Cost of initializing a map with N elements, page size B (key-value pairs) and EPC of size M (key-value pairs)</em></p>
<h3 id="initialization-experimental">Initialization - Experimental</h3>
<!-- TODO: change text from fast to ENIGMAP -->
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/graph_init.png" alt="alt text" />
<strong>Figure 8</strong> - <em>Initialization cost of ENIGMAP (Fast), compared to our implementation of Oblix's initialization and to the Naive Initialization - doing \(N\) insertions on the database</em></p>
<p><em>Figure 8</em> shows that ENIGMAP's initialization outperforms Oblix's by 2-18x depending on the database size. This improvement in initialization time is crucial for making ODS practical for larger database sizes. However, at 9.5h for \(N=2^{26}\), ENIGMAP's initialization is still much slower than Signal's linear scan initialization, which only takes a few minutes since it only needs to create a large enclave and copy the key-value pairs there. <sup class="footnote-reference"><a href="#notallhopelost">17</a></sup></p>
<h2 id="results-summary">Results Summary</h2>
<p>ENIGMAP shows significant improvements in query performance, achieving higher throughput and lower latency compared to Signal's linear scan solution and Oblix's ODS-based AVL tree. Specifically:</p>
<ul>
<li>
<p><strong>Query Throughput</strong>: ENIGMAP consistently outperforms both Signal and Oblix. At larger database sizes, ENIGMAP achieves up to 130x speedup over Signal for <code>Get</code> queries. This makes it highly efficient for handling large volumes of queries in practical applications.</p>
</li>
<li>
<p><strong>Query Latency</strong>: ENIGMAP maintains low latency per query, making it suitable for real-time applications. For example, at a database size of \(2^{26}\), ENIGMAP's latency per <code>Get</code> is 0.45ms compared to Signal's 930ms and Oblix's 11ms. This significant reduction in latency ensures quick response times for individual queries.</p>
</li>
</ul>
<p>While ENIGMAP excels in query performance, its initialization is slower than Signal's. Even so, ENIGMAP's higher throughput and lower latency make it highly practical for applications where query performance matters more than initialization time.</p>
<h1 id="finale">Applications, Limitations and Open Problems</h1>
<p>ENIGMAP's Oblivious Sorted Map has a broad range of applications:</p>
<ul>
<li><strong>Secure Databases</strong>: A sorted map can be used to build databases that protect the privacy of user queries. This is especially relevant for sensitive data such as medical records, financial transactions, and personal communication logs.</li>
<li><strong>Private Contact Discovery</strong>: Similar to the use case implemented by Signal, ENIGMAP can help in securely finding mutual contacts without revealing the contact list to the server.</li>
<li><strong>Cloud Computing</strong>: With the increasing reliance on cloud services, ensuring data privacy and security is paramount. ENIGMAP allows users to securely store and query data on untrusted cloud servers while maintaining the confidentiality of their access patterns. In this setting the external memory is no longer RAM or Disk, but the remote cloud server itself.</li>
<li><strong>Multi-party computations (MPC)</strong>: MPC and Fully Homomorphic Encryption (FHE) provide encrypted computations similar to Intel SGX, but rely on strong cryptographic primitives rather than trusting hardware vendors like Intel. In MPC, it is crucial for algorithms to be oblivious to prevent information leakage through data access patterns. Traditionally, maps in MPC have been implemented using linear scans or an online phase of Private Information Retrieval (PIR) or Oblivious RAM (ORAM). ENIGMAP can serve as an efficient implementation of online ORAM in MPC, offering significant performance enhancements.</li>
</ul>
<p>While ENIGMAP presents significant advancements, it also comes with certain limitations to address in future work:</p>
<p><strong>Initialization Time</strong>: ENIGMAP's initialization is slower than simpler solutions like Signal's linear scan, and for large databases it can be a bottleneck. Future work should explore how to minimize the initialization time of ORAMs/ODSs in the TEE setting; there is still room for improvement. In <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[7]</a> we developed a TEE external-memory optimized oblivious sorting algorithm that significantly improves ORAM initialization time; however, even with this optimization, ENIGMAP's initialization is still significantly slower than Signal's. Exploring other types of BSTs instead of AVL trees, or further improving ORAM initialization, could also help.</p>
<p><strong>Memory Overheads</strong>: The use of oblivious algorithms and PathORAM requires additional memory to store metadata, such as positions and encrypted blocks, as well as a linear amount of fake metadata used to store fake blocks, which may be a constraint in memory-limited environments. </p>
<p><strong>Exploring other BST implementations</strong>: In non-oblivious databases, typically B+ trees, AVL trees, skip-lists, or variations of them are used for indices. It would be interesting to explore in depth the tradeoffs between each of these solutions.</p>
<p>Our code <a rel="noopener" target="_blank" href="https://github.com/odslib/odsl">is available on github</a>, as well as ongoing work on <a rel="noopener" target="_blank" href="https://github.com/gty929/oram">more efficient oblivious maps</a>.</p>
<p><strong>Footnotes:</strong></p>
<div class="footnote-definition" id="whynotbplusorhashmap"><sup class="footnote-definition-label">1</sup>
<p>In our research, we considered both AVL trees and B+ trees. We opted for AVL trees in ENIGMAP because previous work had successfully used them, and for the specific problems we were addressing, the key and value sizes were relatively small. If range queries were not required, a hash table could be a faster alternative to a search tree.</p>
</div>
<div class="footnote-definition" id="gowiki"><sup class="footnote-definition-label">2</sup>
<p>To learn how each of the operations are implemented, the <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/AVL_tree">wikipedia page on AVL trees</a> is a great starting point.</p>
</div>
<div class="footnote-definition" id="boundedheightimportant"><sup class="footnote-definition-label">3</sup>
<p>Having a bounded height and number of nodes that are touched during an operation is needed so that the time it takes to do a query doesn't leak information about the query. Since the tree depth is at most \( 1.44 \log_2 N\), we can make every <code>Search</code> operation always touch \( 1.44 \log_2 N\) nodes, potentially accessing fake nodes after we found the <code>key</code> we were looking for.</p>
</div>
<div class="footnote-definition" id="andcppshortcircuits"><sup class="footnote-definition-label">5</sup>
<p>We need to use <code>*</code> instead of <code>&&</code> because in C++ the <code>&&</code> operator short circuits.</p>
</div>
<div class="footnote-definition" id="pagelevel"><sup class="footnote-definition-label">4</sup>
<p>The granularity of the public memory access trace for Intel SGX is typically at the page level (4KB pages). </p>
</div>
<div class="footnote-definition" id="timemeaning"><sup class="footnote-definition-label">6</sup>
<p>Both number of CPU instructions as well as number of memory accesses whose trace is public.</p>
</div>
<div class="footnote-definition" id="acbst"><sup class="footnote-definition-label">8</sup>
<p>An almost complete binary tree with N leaves is a binary tree where all levels except the last are full, and the last level contains its leaves in the leftmost positions only.</p>
</div>
<div class="footnote-definition" id="infoconstant"><sup class="footnote-definition-label">9</sup>
<p>The failure probability is negligible in the stash size - the probability of the stash becoming larger than K after an operation is \( o(2^{-K}) \) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a>.</p>
</div>
<div class="footnote-definition" id="readpathoram"><sup class="footnote-definition-label">7</sup>
<p>I suggest reading the PathORAM paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[1]</a> for more details on why the stash size is kept constant, how initialization works, and the recursive ORAM technique used to keep track of the positions of all the addresses.</p>
</div>
<div class="footnote-definition" id="btwnotonlyexternalmemory"><sup class="footnote-definition-label">10</sup>
<p>This locality-friendly layout is useful also in the scenario where we don't have a disk, since it can also translate to RAM pages that don't need to be cached, or it can also make trees with smaller nodes fit in CPU cache lines directly.</p>
</div>
<div class="footnote-definition" id="noteembdaboas"><sup class="footnote-definition-label">11</sup>
<p>This is not the van Emde Boas layout - we don't need to be cache-agnostic, since we know the page size for disk and SGX - and can even experimentally measure it, as you can find in our paper.</p>
</div>
<div class="footnote-definition" id="sgxv2tdxbad"><sup class="footnote-definition-label">12</sup>
<p>This cache is not as useful in SGXv2/TDX, since all the RAM can be used as part of the enclave.</p>
</div>
<div class="footnote-definition" id="osort"><sup class="footnote-definition-label">13</sup>
<p>Oblivious Sorting can efficiently be done on an array of size N in time \(O(N \log^2 N)\) <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/oblivious-maps/#cite">[7]</a>.</p>
</div>
<div class="footnote-definition" id="nosortneeded"><sup class="footnote-definition-label">14</sup>
<p>If the key-value pairs are already sorted by key, we can instead just verify it using a linear scan.</p>
</div>
<div class="footnote-definition" id="largeepcbad"><sup class="footnote-definition-label">15</sup>
<p>These larger EPC sizes come with weaker security guarantees: EPC memory in RAM no longer has a freshness check, so the hardware TCB is no longer just the CPU but all of the machine's hardware. Industry interest has recently shifted towards larger enclave sizes as they become available in cloud datacenters. The assumption of trusting the hardware is addressed by a “proof of cloud”: the cloud provider signs a statement that it is running the enclave and, since the provider faces a potentially huge economic loss if it lies, developers can trust that the hardware is not being tampered with. Since this is now a utility-based model, with large EPCs, trusting that an SGX enclave follows the Crash-Fault model becomes risk management rather than an expected guarantee.</p>
</div>
<div class="footnote-definition" id="nooblixlargeepc"><sup class="footnote-definition-label">16</sup>
<p>The Oblix paper does not report results for database sizes over \(2^{30}\), but even using the time for \(2^{28}\), we achieve a speedup of at least 53x, which further increases with database size. </p>
</div>
<div class="footnote-definition" id="notallhopelost"><sup class="footnote-definition-label">17</sup>
<p>Not all hope is lost in terms of initialization time; from our ongoing experiments, we believe the initialization time for binary search trees can be further improved if we move away from AVL trees.</p>
</div>
<h1 id="cite">Bibliography</h1>
<ol>
<li>
<p>E. Stefanov, M. Van Dijk, E. Shi, T.-H. H. Chan, C. Fletcher, L. Ren, X. Yu, and S. Devadas. “PathORAM: An Extremely Simple Oblivious RAM Protocol” <em>Journal of the ACM (JACM)</em> 65, 4, Article 18 (August 2018), 26 pages. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3177872">https://doi.org/10.1145/3177872</a></p>
</li>
<li>
<p>X. S. Wang, K. Nayak, C. Liu, T.-H. H. Chan, E. Shi, E. Stefanov, and Y. Huang. “Oblivious Data Structures” <em>Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14)</em>, Association for Computing Machinery, New York, NY, USA, 2014, pp. 215-226. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/2660267.2660314">https://doi.org/10.1145/2660267.2660314</a></p>
</li>
<li>
<p>V. Costan and S. Devadas. “Intel SGX Explained” <em>Cryptology ePrint Archive</em>, Report 2016/086. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2016/086.pdf">https://eprint.iacr.org/2016/086.pdf</a></p>
</li>
<li>
<p>J. V. Bulck, F. Piessens, and R. Strackx. “SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control” <em>Proceedings of the 2nd Workshop on System Software for Trusted Execution (SysTEX’17)</em>, Association for Computing Machinery, New York, NY, USA, Article 4, 1–6. <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3152701.3152706">https://doi.org/10.1145/3152701.3152706</a></p>
</li>
<li>
<p>P. Mishra, R. Poddar, J. Chen, A. Chiesa, and R. A. Popa. “Oblix: An Efficient Oblivious Search Index” <em>2018 IEEE Symposium on Security and Privacy (SP)</em>, San Francisco, CA, USA, 2018, pp. 279-296. <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/document/8418609">https://doi.org/10.1109/SP.2018.00045</a></p>
</li>
<li>
<p>A. Tinoco, S. Gao, and E. Shi. “EnigMap: External-Memory Oblivious Map for Secure Enclaves” <em>32nd USENIX Security Symposium (USENIX Security 23)</em>, Anaheim, CA, August 2023, pp. 4033-4050. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity23/presentation/tinoco">https://www.usenix.org/conference/usenixsecurity23/presentation/tinoco</a></p>
</li>
<li>
<p>T. Gu, Y. Wang, B. Chen, A. Tinoco, E. Shi, K. Yi. “Efficient Oblivious Sorting and Shuffling for Hardware Enclaves” <em>Cryptology ePrint Archive</em>, Report 2023/1258. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/1258">https://eprint.iacr.org/2023/1258</a></p>
</li>
<li>
<p>M. A. Bender, E. D. Demaine, and M.
Farach-Colton. “Cache-oblivious b-trees” <em>SIAM J. Comput.</em>, 35(2):341–358, 2005. <a rel="noopener" target="_blank" href="https://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf">https://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf</a></p>
</li>
</ol>
Efficient Anonymous Blocklisting2024-07-31T00:00:00+00:002024-07-31T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/anonymous-blocklisting/<h2 id="tl-dr">TL;DR:</h2>
<ul>
<li>Truly Anonymous Service Providers can offer users genuine privacy, but also allow inappropriate behavior 😳</li>
<li>Anonymous Blocklisting permits blocking ill-behaved users without deanonymizing them. Specifically, the service provider blocks individual posts, and the benign users use zero-knowledge arguments (zk-SNARKs) to prove that they didn’t make said posts, without revealing any information about them 😮</li>
<li>The more posts get blocked, the less efficient the blocklist becomes, eventually making it practically infeasible 😞</li>
<li>SNARKBlock reduces the cost to logarithmic with respect to the size of the blocklist by introducing HICIAP, aggregating all the individual proofs into one efficient proof 🙌</li>
<li>You can now live out your anonymous double life 😎</li>
</ul>
<h1 id="introduction">Introduction </h1>
<p>The goal of this blog post is to teach the reader the fundamentals behind anonymous blocklisting, as well as to introduce a state-of-the-art blocklisting algorithm called SNARKBlock. In the conclusion, I will reflect on open problems for future anonymous blocklisting algorithms.</p>
<h2 id="anonymous-communications-systems">Anonymous Communications Systems</h2>
<p>Anonymous communications systems bring benefits but also harms. Users’ private information has always been vulnerable to irresponsible corporations and identity theft. On top of that, oppressed citizens living under authoritarian governments struggle to maintain their privacy and risk facing prosecution for speaking out, while political journalists have to fight to keep their sources hidden. Anonymous communications systems aim to help these users by keeping their identities private from other observers.
The largest network that allows anonymous communication to date is <a rel="noopener" target="_blank" href="https://svn-archive.torproject.org/svn/projects/design-paper/tor-design.pdf">Tor</a>, which utilizes “onion routing”, encrypting the data multiple times and passing it through a network of different nodes, making it difficult to trace back to the source. Unfortunately, malicious users can take advantage of the gift of anonymity resulting in online bullying and harassment, trolling, and the spread of harmful or illegal content without consequences. The problem is the following:</p>
<p><em>If no one knows who you are, no one can stop you.</em> </p>
<!-- <p></p> -->
<p>How can internet services provide anonymity to users without allowing inappropriate behavior?
Many service providers claim to be anonymous but have often been criticized for storing the user’s information, or metadata that can help identify them.<!--, like [Whisper](https://whisper.sh/) and [Blind](https://www.teamblind.com/). Another example is--> For instance, <a rel="noopener" target="_blank" href="https://www.wikipedia.org/">Wikipedia</a> provides weak anonymity by connecting a user’s personal information to a pseudonym instead of directly storing it. Thus, all of their actions (e.g., page edits) are publicly linked to their profile, and analyzing patterns in editing behavior or content preferences could lead to inferences about their identity.
One existing solution to linked metadata is using “revocable anonymity systems”, which allow for a user to be deanonymized or pseudonymized (having their actions linked) when necessary. For example, imagine if Wikipedia users were completely anonymous (i.e., without a public pseudonym), but if one of your edits is deemed “inappropriate”, your anonymity is stripped and your identity is revealed. This type of system typically relies on a Trusted Third Party aware of the identity of the user and capable of revoking a user’s privacy at their discretion.</p>
<h2 id="anonymous-blocklisting">Anonymous Blocklisting</h2>
<p>Anonymous blocklisting systems come to the rescue to enforce policies on users without deanonymizing them. These systems allow users to authenticate anonymously with a service provider, while service providers can revoke a user’s access without learning any information about their identity or involving a Trusted Third Party. Anonymous blocklisting systems block users from posting again by flagging individual posts instead of their accounts.</p>
<p>A way to realize this is to provide each user with a secret identity, and every one of their posts is secretly linked to that identity. Unlike the example with the Wikipedia users, there is not a public pseudonym connected to them, and their identity remains hidden even from the service provider. Whenever a user wants to post they have to prove that none of the flagged posts are linked to their identity, without leaking any information about it.</p>
<p>A savvy reader (you) can spot an immediate problem; how can we prevent users from making many different accounts? This is a common network service attack called the <a rel="noopener" target="_blank" href="https://www.freehaven.net/anonbib/cache/sybil.pdf">Sybil Attack</a>. In a “normal” system, users have to register through an identity provider (e.g., Google) using some identifier (e.g., their Gmail account). To solve this problem, blocklisting schemes can also utilize identity providers who would maintain a log of “registered users”. Hence, when a user posts, they have to also prove they are registered without revealing it’s them posting, using <del>magic</del> cryptography.</p>
<p>To summarize, anonymous blocklisting systems achieve blocking anonymous users without the need to deanonymize them. They allow users to post anonymously (even to the service provider), while service providers can block individual posts without any of the user’s information getting leaked.</p>
<h1 id="cryptographic-protocols">Cryptographic Protocols</h1>
<p>To delve deeper into the mechanics of anonymous blocklisting schemes, we’ll go over some cryptographic protocols that help ensure the users’ privacy.</p>
<h2 id="zk-snarks">ZK-SNARKs</h2>
<p>The main building block needed to build an anonymous blocklisting scheme is called a <a rel="noopener" target="_blank" href="https://www.di.ens.fr/%7Enitulesc/files/Survey-SNARKs.pdf">zk-SNARK</a>; Zero-Knowledge Succinct Non-Interactive Argument of Knowledge. Even if it is a mouthful, every single property is necessary. Let’s break them down together below. Assume we have two parties denoted as the Prover and the Verifier.</p>
<ul>
<li>Argument of Knowledge: A SNARK is a proof<sup class="footnote-reference"><a href="#1">1</a></sup> where the Prover can prove their possession of some information to the Verifier. Typically, the “information” is the solution (“witness”) to a computational problem that the Verifier could not solve by themselves.<!-- without knowledge of the information they are proving possession of.--></li>
<li>Non-Interactive: The communication between the two parties solely consists of a single proof message sent from the Prover to the Verifier (in more general models, the Verifier and Prover could engage in multiple rounds of interactive communication).</li>
<li>Succinct: The Prover’s message should be small compared to their witness.</li>
<li><a rel="noopener" target="_blank" href="https://people.csail.mit.edu/silvio/Selected%20Scientific%20Papers/Proof%20Systems/The_Knowledge_Complexity_Of_Interactive_Proof_Systems.pdf">Zero-Knowledge</a>: The Prover manages to prove possession of their witness without revealing any information about the witness itself.</li>
</ul>
<p>The crux of the whole protocol is the zero-knowledge property which can accompany a SNARK. To make zero-knowledge more tangible we can revisit one of the most overdone examples in the history of ZK, drawing from the world of <a rel="noopener" target="_blank" href="https://www.imdb.com/title/tt0417299/">Avatar: The Last Airbender</a><sup class="footnote-reference"><a href="#2">2</a></sup>. Imagine our Prover is Aang, and our Verifier is Toph. Aang has two different colored boomerangs (one red and one green) and wants to prove to Toph, who is (color)blind, that they are indeed different colors. However, Aang doesn’t want to let Toph know which boomerang is which color! Thankfully, they came up with the following Zero-Knowledge Protocol: Toph holds the boomerangs behind her back. She briefly displays one of the two boomerangs before hiding it again. She then again chooses one of the two at random, brings it out, and asks “Did I switch the boomerangs?”. Of course, Toph knows if she displayed the same or a different boomerang, and Aang can easily differentiate given their different colors. If Aang lies, he will only succeed 50% of the time. By repeating the protocol multiple times, Toph can be convinced, without actually learning anything about each boomerang’s color!</p>
<!--<img src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdW4wYmRsaDZkcnkzbHNpbzc1OW40djVmbDkxN3B5d2t0azd5cXllciZlcD12MV9naWZzX3NlYXJjaCZjdD1n/4IzOgM1bfOe6k/giphy.gif" width="45%"/>-->
<img src="toph.gif" width="40%"/>
<p>Of course in this example, even though we demonstrate the ZK property, this is not a SNARK since it is interactive. With a ZK-SNARK, Toph and Aang would not need multiple rounds of communication.</p>
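<p>The soundness of this toy protocol is easy to check numerically. Below is a small simulation of my own (not from any paper): an honest prover who can see the colors always answers correctly, while a cheating prover who cannot tell the boomerangs apart survives each round only with probability 1/2, so after \(n\) rounds a cheater passes with probability \(2^{-n}\).</p>

```python
import random

def run_protocol(prover_sees_colors: bool, rounds: int = 20) -> bool:
    """One execution of the boomerang protocol; True if the prover passes."""
    last_shown = random.choice(["red", "green"])
    for _ in range(rounds):
        switched = random.choice([True, False])  # Toph's secret coin flip
        shown = ({"red": "green", "green": "red"}[last_shown]
                 if switched else last_shown)
        if prover_sees_colors:
            answer = (shown != last_shown)       # Aang just compares colors
        else:
            answer = random.choice([True, False])  # a cheater can only guess
        if answer != switched:
            return False                         # caught lying, protocol aborts
        last_shown = shown
    return True

# An honest prover always passes; a cheater passes with probability 2^-20.
```

<p>Note that the verifier learns nothing she didn't already know: she chose whether to switch herself, so the transcript reveals nothing about which boomerang is which color.</p>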
<h2 id="signature-schemes">Signature Schemes</h2>
<p>Another cryptographic protocol required to understand anonymous blocklisting is a <a rel="noopener" target="_blank" href="https://people.csail.mit.edu/rivest/pubs/GMR88.pdf">signature scheme</a>. A digital signature, just like a real-life signature, gives a Prover the ability to sign a message before sending it. Then a Verifier, using some public information, can verify that the Prover was the one who sent that message. Signature Schemes consist of 3 algorithms:</p>
<ul>
<li>Generation Algorithm<!-- \\( Gen(\lambda) \leftarrow (sk,vk) \\)-->: It produces a signature (secret) key <em>sk</em> only known by the signer and a verification key <em>vk</em> public to everyone.</li>
<li>Signing Algorithm<!-- \\(Sig(sk, m) \leftarrow \sigma\\)-->: Given a message <em>m</em> and the secret key <em>sk</em>, it produces the digital signature σ.</li>
<li>Verification Algorithm<!-- \\(Ver(vk, m, \sigma) \leftarrow \{0,1\}\\)-->: Given the public verification key <em>vk</em>, the original message <em>m</em> and the signature σ, it produces 1 if the signature is valid and 0 if it’s not.</li>
</ul>
<p>Signature schemes allow users to authenticate the origin and integrity of a message. Let’s understand their importance through another Avatar example, where Zuko is trying to capture the Avatar<sup class="footnote-reference"><a href="#2">2</a></sup>. Imagine Zuko wants to announce to the world online, and as an extension his dad -the Firelord-, that he caught the Avatar. However, anyone can try and give false information, impersonating Zuko, and claim the Avatar has been caught. To avoid this scenario, Zuko sets up a digital signature scheme; he runs the Generation algorithm and shares the (public) verification key with his dad before his trip. Now, if he catches the Avatar, he can publish his message “I caught the Avatar, and with him, my honor”, along with a signature σ. As a result, the Firelord can run the (public) verification algorithm, which would return true if this message is truly from Zuko, or false, indicating it did not, in fact, come from Zuko.</p>
<img src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExb28zZW5yaHozNjd0NmM5MGdudnZ6dWMwYmxuYjQxamVwdjZhMWlrZyZlcD12MV9naWZzX3NlYXJjaCZjdD1n/kN2THUm8712diLwn0F/giphy.gif" width="50%"/>
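<p>Zuko's setup can be sketched with a minimal Lamport one-time signature scheme, built only from a hash function. This is a toy illustration of my own choosing, not the scheme used by any system discussed in this post, and each key pair must sign at most one message:</p>

```python
import hashlib
import secrets

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def gen():
    # sk: 256 pairs of random secrets; vk: the hash of each secret
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    vk = [(H(a), H(b)) for a, b in sk]
    return sk, vk

def msg_bits(m: bytes):
    digest = H(m)
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(256)]

def sign(sk, m: bytes):
    # reveal one secret of each pair, chosen by the bits of H(m)
    return [sk[i][b] for i, b in enumerate(msg_bits(m))]

def verify(vk, m: bytes, sig) -> bool:
    return all(H(s) == vk[i][b] for i, (s, b) in enumerate(zip(sig, msg_bits(m))))
```

<p>Zuko publishes <em>vk</em> before his trip and later releases the message together with <code>sign(sk, m)</code>; anyone holding <em>vk</em> can verify it, and forging a signature on a different message would require inverting the hash function.</p>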
<h2 id="commitment-schemes">Commitment Schemes</h2>
<p>We will also encounter a <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Emblum/research/pdf/coin/">commitment scheme</a>, which allows one to commit to a specific value while keeping it hidden from everyone else. In addition, the committed value can be revealed later without any possibility of alteration. This ensures both confidentiality and integrity, ensuring that the committed value is securely locked and that after revealing it, the original commitment and the revealed value match perfectly. Commitment schemes consist of the following two phases, which take place between a committer and a receiver:</p>
<ul>
<li>The commit phase, during which the committer chooses and commits to a value by producing a “commitment” message.</li>
<li>The reveal phase, during which the committer sends the value to the receiver along with a “decommitment” message (e.g. the randomness used in the commit phase), and the receiver can verify its authenticity.</li>
</ul>
<p>Commitment schemes have to satisfy two properties:</p>
<ul>
<li>Hiding property: Given a commitment, no information about the committed value can be extracted.</li>
<li>Binding Property: The value chosen in the commit phase is the only one that the commitment can decommit to.</li>
</ul>
<p>Assume that Aang and Toph are playing a game where Toph thinks of a number between 1 and 10, and Aang tries to guess it. They want to ensure that Toph cannot change her pick after Aang makes his guess, so they use a commitment scheme. Toph thinks of the number 7 and sends a commitment to Aang. Then Aang makes his guess and says: “I think the number is 5”. Finally, Toph reveals her pick by sending a decommitment message. Due to the properties of the commitment scheme, Aang cannot cheat and get any information about Toph’s number before the reveal phase. At the same time, if Aang guessed 7 correctly, Toph cannot lie about her initial pick claiming it was a different number.</p>
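<p>The guessing game above can be sketched with a simple hash-based commitment (a sketch of my own; modeling the hash as a random oracle, hashing a fresh random nonce together with the value gives both hiding and binding):</p>

```python
import hashlib
import secrets

def commit(value: bytes):
    # Commit phase: Toph hashes a fresh random nonce together with her value.
    nonce = secrets.token_bytes(32)
    return hashlib.sha256(nonce + value).digest(), nonce

def open_commitment(commitment: bytes, value: bytes, nonce: bytes) -> bool:
    # Reveal phase: Aang recomputes the hash and compares.
    return hashlib.sha256(nonce + value).digest() == commitment

# Toph commits to 7; the commitment alone reveals nothing (hiding),
# and she cannot later open it to any other number (binding).
c, nonce = commit(b"7")
assert open_commitment(c, b"7", nonce)
assert not open_commitment(c, b"5", nonce)
```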
<!--Assume that Aang wants to correctly guess how many guards Suki can take down, without letting her know beforehand. However, Suki doesn't trust him, so they use a commitment scheme. Soka guesses 42 and produces and sends a commitment message to Suki. After the fight, Aang reveals his answer, while also sending a decommitment message. Now, if the real number of guards was 43, Soka could not change his answer and still convince Suki he was right. At the same time, Suki cannot get any information about Aang's answer before the reveal phase and change her actions during the fight.
![suki](https://64.media.tumblr.com/5fe3d1ed55146768d4aabbb7a419f132/9b647a9d5c31eada-f0/s250x400/4814c265ff350458f9beb79eb4b7fe62509590ef.gifv)-->
<h2 id="pseudorandom-functions-prfs">Pseudorandom Functions (PRFs)</h2>
<p>The last protocol we will refer to is a <a rel="noopener" target="_blank" href="https://www.wisdom.weizmann.ac.il/%7Eoded/X/ggm.pdf">pseudorandom function</a>, or PRF. A PRF is a function that takes as input a key and a message, and returns a random-looking string: <em>PRF(key, message) = pseudorandom string</em>. A PRF must have the following two properties:</p>
<ul>
<li>It is easy to compute (i.e., in polynomial time).</li>
<li>One cannot distinguish between random strings and the results of a PRF without access to the key.</li>
</ul>
<p>In other words, randomness is expensive; PRFs are cryptography’s efficient way of faking randomness.</p>
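<p>In practice, HMAC over a cryptographic hash is commonly modeled as a PRF; a minimal sketch using Python's standard library:</p>

```python
import hashlib
import hmac
import secrets

def prf(key: bytes, message: bytes) -> bytes:
    # HMAC-SHA256: while the key stays secret, outputs are assumed to be
    # indistinguishable from random 32-byte strings.
    return hmac.new(key, message, hashlib.sha256).digest()

key = secrets.token_bytes(32)
out = prf(key, b"hello")
assert out == prf(key, b"hello")   # deterministic given (key, message)
assert out != prf(key, b"world")   # different messages look unrelated
```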
<h1 id="anonymous-blocklisting-1">Anonymous Blocklisting</h1>
<p>We will now focus on anonymous blocklisting schemes, starting with their necessary properties and continue by doing a survey of state-of-the-art systems <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock</a> and its predecessor <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/1880022.1880033">BLAC</a>.</p>
<p>Overall, in an Anonymous Blocklisting Scheme, there is a blocklist filled with flagged posts. Every user gets a “secret identity” when they register for the service. Every time they want to post, they have to produce a token that is linked to their identity, and then prove both that they are registered and that none of the flagged posts were made by them. To do that, as we will see later, they can use zero-knowledge proofs, attesting to the fact that their identity is valid (produced during the registration) and that it is not connected to any of the posts in the blocklist, without revealing any actual information about their identity. </p>
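<p>To make the flow above concrete, here is a simplified (and deliberately <em>non</em>-zero-knowledge) model of the statement a user proves; real schemes such as BLAC and SNARKBlock prove this kind of predicate inside a zero-knowledge proof, so the secret identity <em>k</em> is never revealed. The function names and the HMAC-based token construction are my own illustration, not the exact construction of either system:</p>

```python
import hashlib
import hmac
import secrets

def make_token(k: bytes, nonce: bytes) -> bytes:
    # Each post carries (nonce, token); the token is a PRF of the user's
    # secret identity k, so it is linked to k but looks random to everyone else.
    return hmac.new(k, nonce, hashlib.sha256).digest()

def blocklist_statement(k: bytes, blocklist) -> bool:
    # The predicate the zero-knowledge proof attests to: none of the
    # blocked (nonce, token) pairs were produced with this user's k.
    return all(make_token(k, nonce) != token for nonce, token in blocklist)

alice, bob = secrets.token_bytes(32), secrets.token_bytes(32)
nonce = secrets.token_bytes(16)
blocklist = [(nonce, make_token(bob, nonce))]   # one of Bob's posts is flagged
assert blocklist_statement(alice, blocklist)    # Alice can still prove the statement
assert not blocklist_statement(bob, blocklist)  # Bob cannot
```

<p>Evaluating this predicate directly would reveal <em>k</em>; the whole point of the zk-SNARK is to convince the service provider that it holds without the prover disclosing <em>k</em> or any of the comparisons.</p>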
<h2 id="definition">Definition</h2>
<p>We will begin with a simplified definition of anonymous blocklisting schemes due to <a rel="noopener" target="_blank" href="https://cacr.uwaterloo.ca/techreports/2010/cacr2010-24.pdf">Henry and Goldberg</a> that can be generalized to most existing blocklisting schemes.</p>
<p>The parties involved are the following:</p>
<ul>
<li>Users: The set of individuals using the service are called users. Every user is assigned a random unique identifier <em>k</em>, which constitutes their identity and remains secret.</li>
<li>Identity Providers: Every user has to connect to an identity provider (e.g., Google) to register for the service and acquire a new and valid identity.</li>
<li>Service Providers: The entity (or entities) providing the service (e.g., Wikipedia).</li>
<li>Revocation Authorities: The authority responsible for flagging content and blocking users. For the context of this blog, we assume the service providers also play the role of the revocation authorities, which is the case for most blocklisting schemes. </li>
</ul>
<p>The protocols that can take place in an anonymous blocklisting scheme are:</p>
<ul>
<li>Registration Protocol: This protocol takes place between a user and an identity provider and happens once so that the user registers for the service. By running this algorithm, the user receives a valid unique identifier. </li>
<li>Token Extraction Protocol: For a user to take an action on the service (e.g., post), they need an authentication token. By running this protocol with their unique identifier, they can obtain a token secretly linked to their identity. This ensures that the token can be used for authentication while preventing anyone from learning anything about the user's identity by merely observing the token.
</li>
<li>Authorization Protocol: In this protocol, the service provider takes as input an authentication token and verifies that the user is eligible to use the service (i.e., not blocked).</li>
<li>Revocation Protocol: This protocol is run by the service provider, taking as input an authentication token and blocking the corresponding user by adding the token to the blocklist.</li>
<li>Reinstatement Protocol: Similarly to the revocation protocol, the service provider can also unban a user by removing their token from the blocklist. </li>
</ul>
<p>The crux of the anonymous blocklisting scheme is ensuring the following three security requirements:</p>
<ul>
<li>
<p>Blocklistability: A user can successfully authenticate to an honest service provider only if they hold a valid identity, issued by an identity provider, that is not on the blocklist. Specifically, it encompasses the following two notions:</p>
<ol>
<li>Verification should succeed only on authentication tokens that are the result of a user correctly executing the established protocols.</li>
<li>Given an authentication token issued to some anonymous user, a service provider can have the user’s access revoked, such that the user cannot post again until all of their banned tokens are removed. </li>
</ol>
</li>
<li>
<p>Anonymity: No information about the user can be linked to an authentication token, which encompasses the following two notions:</p>
<ol>
<li>Given an authentication token from one of two users, it should be infeasible for an attacker to determine which user that authentication token was issued to.</li>
<li>Given two or more authentication tokens, it should be infeasible for an attacker to distinguish whether they came from the same user or from two different users.</li>
</ol>
</li>
<li>
<p>Non-frameability: An honest user cannot be prevented from being authenticated by an honest service provider.</p>
</li>
</ul>
<p>Let’s go back to our example setting<sup class="footnote-reference"><a href="#2">2</a></sup> to better understand the mechanics of such a scheme. Imagine that the officials of the city of Ba Sing Se have set up an anonymous forum for people to share secrets from their everyday lives, like the app Whisper, and Joo Dee is an identity provider. Assume that Aang wants to subscribe to the forum and make posts. He first has to contact Joo Dee to register. As an outcome, he gains a unique identifier <em>k</em>, secret even to Joo Dee. Then, assume he wants to post the message “There is war in Ba Sing Se”. He runs the token extraction protocol using his identifier to get a token α and anonymously sends the message along with the token to the city officials. They run the authorization protocol, verify that the message is not coming from a banned user, and publish it.
Of course, it’s not long before the message gets flagged for harmful content, so the authorities run the revocation protocol on that token. Now, if Aang tries to post again with a new token, it won’t be authorized. However, no one can link his message to him or to any of his futile attempts to post again.</p>
<img src="https://media.tenor.com/SOnmo9jnfQsAAAAM/avatar-the-last-airbender.gif" width="45%"/>
<h2 id="inefficient-constructions">Inefficient Constructions</h2>
<p>As usual in cryptography, things in practice are a little different, and by different, I mean worse. The security requirements explained above are necessary but not sufficient for an anonymous blocklisting scheme that is useful in practice. The size of the blocklist can grow extremely fast depending on the use case.
For example, Wikipedia sees approximately <a rel="noopener" target="_blank" href="https://stats.wikimedia.org/#/en.wikipedia.org/contributing/edits/normal%7Cbar%7C2020-11-04%7E2021-11-24%7C%7Etotal%7Cmonthly">2 edits per second</a> and Reddit around <a rel="noopener" target="_blank" href="https://old.reddit.com/r/blog/comments/k967mm/reddit_in_2020/">64 comments per second</a>. Estimating from event logs from 2020, the ban rate for Wikipedia is around 1%, which would result in approximately 2 thousand bans per day for Wikipedia and around 40 thousand per day for Reddit.</p>
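<p>Treating these as per-day figures (consistent with the 4-million-entry, roughly 100-day estimate used further down for Reddit), a quick sanity check:</p>

```python
SECONDS_PER_DAY = 86_400

# Wikipedia: ~2 edits/second with a ~1% ban rate.
wiki_bans_per_day = 2 * SECONDS_PER_DAY * 0.01
assert round(wiki_bans_per_day) == 1728   # i.e., roughly 2 thousand per day

# Reddit: at ~40 thousand bans/day, the blocklist reaches 4 million entries
# (the size used in the comparison with BLAC) in about 100 days.
assert 4_000_000 / 40_000 == 100
```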
<p>We thus need schemes that are efficient, both for the users and the service provider. </p>
<p>On the user’s side, to be efficient means that authenticating a token and using the service has a predictable runtime and bandwidth so as not to add too much latency to their requests. On the service provider’s side, we need both the authentication and revocation processes to have predictable running times and bandwidth, so that the cost of servicing a user is not too high and the system can keep up with the expected rate of revocations.</p>
<p>Let’s start with a construction inspired by <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/1880022.1880033">BLAC</a> by Tsang et al., the first anonymous blocklisting scheme without a trusted third party, to delve deeper into the mechanics<sup class="footnote-reference"><a href="#3">3</a></sup>.
Consider the blocklist as a set of tokens, where every token is of the form <em>(nonce, PRF(k, nonce))</em> for some random number <em>nonce</em> and someone’s unique identifier <em>k</em>.</p>
<p>The <strong>registration protocol</strong> has to take place once, before the user can access the service. The user randomly chooses their credential <em>k</em>, then computes and sends a commitment <em>com(k)</em> to the identity provider. The identity provider answers with a signature <em>σ</em> on that commitment. </p>
<p>During the <strong>token extraction protocol</strong>, the user randomly chooses a value <em>nonce</em> and computes a token <em>α = (nonce, PRF(k, nonce))</em>, tying the token with their identity.</p>
<p>The <strong>authorization protocol</strong> works as follows: the user sends the service a token <em>α</em>, along with a zk-SNARK proving that: (i) the token is computed correctly and is equal to <em>PRF(k, nonce)</em>, (ii) they have a well-formed commitment <em>com(k)</em> that is signed by an identity provider, and (iii) none of the tokens in the blocklist are related to the user’s identifier, i.e., <em>for all α′ = (nonce′, h) in the blocklist, PRF(k, nonce′) ≠ h</em>. </p>
<p>Then, in the <strong>verification protocol</strong> the service provider checks that the proofs are valid (i.e., the user is not blocked), and only then offers their service.</p>
<p>Finally, for the <strong>revocation protocol</strong>, if the service provider notices harmful content, they add the token accompanying it in the blocklist. Respectively, they can remove it if they decide to unban them by running the <strong>reinstatement protocol</strong>.</p>
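<p>To make the moving parts concrete, here is a toy Python sketch of this construction, with HMAC-SHA256 standing in for the PRF. The commitment, signature, and zk-SNARK layers are omitted: the blocklist check below runs directly on the secret <em>k</em>, whereas in the real scheme the user proves the same predicate in zero knowledge without revealing <em>k</em>.</p>

```python
import hashlib
import hmac
import secrets

def prf(key: bytes, nonce: bytes) -> bytes:
    # HMAC-SHA256 standing in for the PRF.
    return hmac.new(key, nonce, hashlib.sha256).digest()

# Registration: the user picks a secret identifier k.
# (The commitment com(k) and the identity provider's signature are omitted.)
k = secrets.token_bytes(32)

def extract_token(k: bytes) -> tuple[bytes, bytes]:
    # Token extraction: a fresh nonce ties each token to k.
    nonce = secrets.token_bytes(16)
    return (nonce, prf(k, nonce))

blocklist: list[tuple[bytes, bytes]] = []

def not_blocked(k: bytes) -> bool:
    # The predicate the user's zk-SNARK would prove WITHOUT revealing k.
    return all(prf(k, nonce) != h for (nonce, h) in blocklist)

token = extract_token(k)
assert not_blocked(k)        # authorization succeeds: no blocklist entry matches

blocklist.append(token)      # revocation: the service adds the offending token
assert not not_blocked(k)    # every future token from this user now fails
```

<p>Note that revoking one token blocks the whole identity: any later token still shares the same <em>k</em>, so the predicate over the blocklist fails.</p>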
<p>The security of this construction is straightforward to verify:</p>
<ul>
<li>
<p>The scheme satisfies <strong>blocklistability</strong> since, if a user is blocked or tries to use fake credentials, their zk-SNARK won’t verify: there would either be a token in the blocklist connected to their unique identifier <em>k</em>, or they wouldn’t have a valid signature on the commitment of their identifier. Moreover, because of the binding property of the commitment scheme, they cannot connect a different value to the signed commitment <em>σ</em> from the identity provider. At the same time, the service provider can block any user by adding their corresponding token to the blocklist.</p>
</li>
<li>
<p>The scheme is also <strong>anonymous</strong>, since all the information sent to the service provider is through a zk-SNARK, revealing no information about the user. In addition, due to the hiding property of the commitment scheme, the identity provider also never learns anything about the user’s identity.</p>
</li>
<li>
<p>As far as <strong>non-frameability</strong> goes, to prevent an honest user from using the service, one would have to produce a token tied to that user’s unique identifier, which is infeasible given the pseudorandomness of the PRF.</p>
</li>
</ul>
<p>However, there is an immediate efficiency flaw in the above construction: the server has to do work linear in the size of the blocklist to verify a user, since the proof goes over the whole blocklist every time. The proof sizes are also linear in the size of the blocklist. In BLAC, a single proof for a blocklist with 4 million blocks (a size that, according to our previous estimates, Reddit would reach in approximately 100 days) would require a client to upload 549MiB of data. Overall, existing zk-SNARK implementations can only handle pieces of the blocklist efficiently. </p>
<h1 id="snarkblock">SNARKBlock</h1>
<p>Enter <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock</a>, a new anonymous blocklisting scheme by Rosenberg et al. The authors build upon the aforementioned construction and offer proofs that are only logarithmic in the size of the blocklist, while also requiring only logarithmic verification time. </p>
<p>Blocklists are mostly append-only: existing entries rarely change, while the service provider keeps adding new ones. As a result, both the service provider and the users end up recomputing a lot of the same information. More specifically, if a user has calculated a proof for a blocklist with 99 blocked posts, after a new post gets added they have to calculate a new proof covering all 100 posts.
The authors instead break the blocklist into non-overlapping chunks, so that users can reuse their proof computation over the unchanged chunks. They then combine all the distinct chunk proofs into one proof of logarithmic size (in relation to the blocklist size).
In our example, we could separate the blocklist into 10 chunks and only recompute the proof for the last chunk of 10 blocked posts.</p>
<p>There are two immediate problems with the above technique. To begin with, in the original protocol, the proof attesting to the validity of the user’s post took the user’s unique identifier as a witness, ensuring they had not made any of the blocked posts. What happens, though, with proofs over different chunks? Each proof would take its witness anew, and a malicious user could potentially use a different identity for a specific chunk, bypassing the block. </p>
<p>Another less obvious problem is the need for rerandomization over the proofs. Reusing a proof for a specific chunk can reveal information that connects the user with previous posts. There are indeed SNARK proofs that allow rerandomization, like the <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2016/260.pdf">Groth16</a> scheme used in SNARKBlock. Nevertheless, the same cannot be said when presenting multiple proofs with a common hidden input.</p>
<p>Both of these problems are solved with the introduction of HICIAP.</p>
<h2 id="hiciap">HICIAP</h2>
<p>The main contribution of SNARKBlock is a new type of zero-knowledge proof, called HIdden Common Input Aggregate Proofs, or HICIAP (pronounced “high-chop”). With HICIAP, one can aggregate many zk-SNARKs (specifically Groth16 proofs) of the same relation into a single logarithmic-sized proof and show that they all share a common hidden input. At the same time, it is possible to link multiple HICIAP proofs of different relations, showing in a zero-knowledge proof that those inputs are equal. For our setting, this means that we can (i) have different proofs for each chunk of the blocklist that we aggregate to a single proof and (ii) link that proof with the other distinct proofs to make sure the same secret identity was used for all of them.</p>
<p>Let’s see now how SNARKBlock’s protocol differs from the BLAC-inspired inefficient construction. The authors separate the Authentication protocol into <strong>Sync</strong>, which is run by the user offline (i.e., before the authentication has to go through) performing necessary pre-computation, <strong>Attest</strong>, where the user produces and sends the token along with a zk-SNARK to prove eligibility, and <strong>Verify</strong>, where the service provider finally authenticates the user if the SNARK verifies correctly. Overall, the user can gather all the different proofs for each chunk and wrap them in a HICIAP proof, proving they share a common input. Later they can link this HICIAP proof with the rest of the proofs related to honest token extraction and registration.</p>
<p>More specifically, in <strong>Sync</strong>, the user starts by fetching the most recent version of the blocklist and its division into chunks. Then they compute:</p>
<ul>
<li>a proof \( \pi_{chunk_i} \) for each chunk of the blocklist that was altered or updated, proving that the user’s unique identifier is not correlated with any of the blocks in that chunk; i.e., if the user’s identifier is <em>k</em>, then for all blocks <em>α = (nonce′, h)</em> in the chunk, <em>PRF(k, nonce′) ≠ h</em></li>
<li>a proof \( \pi_{isu} \) attesting to having registered, i.e. having a witness for a commitment signed by the identity provider</li>
</ul>
<p>When it’s time to use the service, the <strong>Attest</strong> protocol takes place. The user does the following:</p>
<ul>
<li>computes a proof \(\pi_{token}\) after they have extracted a token α, to prove that the token was computed honestly using their unique identifier <em>k</em>, <em>α = (nonce, PRF(k, nonce))</em></li>
<li>wraps the \( {\pi_{chunk_i}} \) proofs for all chunks, \( \pi_{isu} \) and \(\pi_{token}\) proofs into HICIAP proofs \( \hat{\pi_{chunk}} \), \( \hat{\pi_{isu}} \) and \(\hat{\pi_{token}}\) respectively</li>
<li>produces a proof \( {\pi_{link}} \) that all of the aforementioned HICIAP proofs share the same witness, their unique identifier <em>k</em></li>
<li>sends \( \hat{\pi_{chunk}} \), \( \hat{\pi_{isu}} \), \(\hat{\pi_{token}}\) and \( {\pi_{link}} \) to the service provider.</li>
</ul>
<p>Finally, the service provider checks the validity of the proofs during <strong>Verify</strong>.</p>
<h2 id="efficiency">Efficiency</h2>
<p>As noted before, SNARKBlock is much faster than BLAC, since both the verification time and proof size become logarithmic instead of linear in the size of the blocklist. In BLAC, a blocklist with 4 million bans would require a proof of 549MiB, whereas a SNARKBlock attestation for the same size blocklist is only 130KiB, making it feasible for use without elaborate hardware! But does this automatically make SNARKBlock efficient enough to be used in practice? We also cannot forget the extra cost SNARKBlock introduces: offline synchronization.</p>
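<p>Plugging in the reported figures at 4 million blocklist entries gives a sense of the gap:</p>

```python
KiB, MiB = 1024, 1024 * 1024

blac_proof_bytes = 549 * MiB        # linear in the blocklist size
snarkblock_proof_bytes = 130 * KiB  # logarithmic in the blocklist size

ratio = blac_proof_bytes / snarkblock_proof_bytes
assert ratio > 4000                 # over a 4000x reduction in upload size
```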
<p>Here we can see the authors’ evaluation of different-sized blocklists. These include the synchronization time depending on how much the blocklist was altered, the attestation time (which translates to how much time the user takes to produce a proof), the verification time on the service provider’s side, and the size of the proof, with or without the use of different sized buffers<sup class="footnote-reference"><a href="#4">4</a></sup>.</p>
<figure>
<img src="sync.png" alt="Image 1" style="display: inline-block; width: 45%; margin-right: 1%;">
<img src="attestation.png" alt="Image 2" style="display: inline-block; width: 45%; margin-right: 1%;">
<img src="verification.png" alt="Image 3" style="display: inline-block; width: 45%;">
<img src="proof_size.png" alt="Image 4" style="display: inline-block; width: 45%;">
<figcaption style="text-align: center;"><a href="https://eprint.iacr.org/2021/1577.pdf">SNARKBlock's evaluation</a> from the paper: (top left) synchronization time depending on blocklist alterations, (top right) attestation time, (bottom left) verification time, and (bottom right) proof size depending on the blocklist size.</figcaption>
</figure>
<p>More specifically, the top left figure shows the offline computation a client must do as a function of the number of changes to the blocklist. This includes syncing chunks and precomputing a proof that they are registered through an identity provider. We can see that the offline precomputation can take up to a couple of minutes for a large number of additions. Since the user can perform it asynchronously and periodically, it doesn’t introduce any significant overhead.</p>
<p>The top right figure shows the time clients take to attest to non-membership on a blocklist that has recently changed. This is the time it takes for a user to recompute the last chunk proof and link them all together. These results can be interpreted differently considering the different services that SNARKBlock can be used for; if the time to write a message and send it to get posted is smaller than the authentication time (a few seconds here) then the message would get queued. These times seem to be acceptable for forums primarily focused on posting and commenting anonymously. However, the results are impractical for implementations like real-time chat forums, where speed is of the essence and attestation needs to be on the order of milliseconds. </p>
<p>The two bottom graphs show the throughput and proof sizes for server verification. These graphs are in a semi-log scale and do in fact show that SNARKBlock proofs scale logarithmically with the number of elements in the blocklist, both in terms of size and time efficiency on the server’s side.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Anonymous communication systems protect user privacy but face challenges in managing inappropriate behavior. Anonymous blocklisting schemes, powered by advanced cryptographic protocols like zk-SNARKs, enable blocking individual posts without revealing user identities. These schemes use signature and commitment schemes, along with pseudorandom functions, to maintain privacy while ensuring message authenticity. </p>
<p>SNARKBlock addresses inefficiencies in traditional systems by introducing HIdden Common Input Aggregate Proofs (HICIAP), which aggregate multiple proofs into a single efficient proof. This innovation achieves logarithmic proof sizes and verification times in relation to the size of the blocklist, making anonymous blocklisting practical for some large-scale applications, such as social media platforms. </p>
<p>However, further advancements are needed to fully realize a world where anonymous blocklisting schemes are seamlessly deployed and used in everyday applications. Future steps include examining the use of <a rel="noopener" target="_blank" href="https://iacr.org/archive/tcc2008/49480001/49480001.pdf">Incrementally Verifiable Computation</a> (IVC) or recursion techniques in order to recursively combine many proofs into one, and thus further reduce proof sizes and verification times.
Additionally, minimizing the cost of reinstating users without major recomputation is a key challenge that needs addressing to make the schemes more adaptable and user-friendly.
Finally, it is crucial to explore interoperability to ensure that anonymous blocklisting schemes can be seamlessly integrated with existing communication platforms and systems.
By tackling these challenges, we can move closer to using anonymous blocklisting in everyday digital communication.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Technically this is the wrong terminology. The difference between a “proof” and an “argument” in cryptography lies in their soundness definition, which refers to the truthfulness of the protocol: if the statement is false, no Prover can convince a Verifier of the opposite. Proofs have statistical soundness (holds against an unbounded adversary), whereas arguments have only computational soundness (holds against a polynomially bounded adversary). For easier understanding, we can mislabel a SNARK, secure against bounded adversaries, as a “proof”.</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>In the world of Avatar, Aang (who is the Avatar) and Toph are part of a team trying to defend the Earth Kingdom against the Fire Nation, led by the Firelord and his son, Zuko. They end up in the city of Ba Sing Se (where Joo Dee resides) which has an authoritarian government refusing to acknowledge that a war is happening.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup>
<p>While the described scheme has the same general mechanics as BLAC, it is presented in a simplified form that is closer to the SNARKBlock scheme for easier understanding. More details about the protocols can be found in the publications. </p>
</div>
<div class="footnote-definition" id="4"><sup class="footnote-definition-label">4</sup>
<p>Some of the experiments include a buffer. This optimization aims to fix the problem when the blocklist might be updated during the Sync process, resulting in recomputation during the Attest process and thus added latency. So they use a buffer of smaller chunks at the end of the list and a separate HICIAP instance to process them, which increases the number of distinct proofs but reduces the overall attestation time.</p>
</div>
<h1 id="measuring-and-exploiting-network-usable-information">Measuring and Exploiting Network Usable Information (2024-06-21)</h1>
<p>In large cloud service providers such as AWS, the customer provides an attributed graph and would like to perform tasks such as recommendations (i.e., link prediction in a graph) using message-passing methods within restricted budgets.
An attributed graph consists of both the graph structure and the node features.
As one of the message-passing methods, Graph Neural Networks (GNNs) are commonly used for graph tasks by propagating node features through the graph structure.
However, it is possible that not all the information in the provided graph is usable for solving the task.
Training a GNN would thus be a waste of time and resources for the customer.
Therefore, we aim to answer two questions:</p>
<ol>
<li>Given a graph with node features, how can we tell whether utilizing both graph structure and node features will yield better performance than utilizing either of them separately?</li>
<li>How can we know what information in the graph (if any) is usable to solve the tasks, namely, node classification and link prediction? </li>
</ol>
<p>Our goal is to design a metric for measuring how informative the graph structure and node features are for the task at hand, which we call network usable information (NUI).</p>
<p>In this blog post, we introduce how to measure the information in the graph, and to exploit it for solving the graph tasks. This blog post is based on our research paper, “NetInfoF Framework: Measuring and Exploiting Network Usable Information” [1], presented at ICLR 2024.</p>
<h2 id="what-is-an-attributed-graph">What is an attributed graph?</h2>
<figure>
<img src="./figure1.png" alt="attributed graph" width="500"/>
<figcaption>
Figure 1. An example of an attributed social network graph, where the nodes denote the users, and the edges denote whether two users are friends.
</figcaption>
</figure>
<p>A graph is a data structure that includes nodes and edges, representing the connections between nodes.
An attributed graph indicates that each node in the graph has a set of features.
For example, Figure 1 shows the attributed graph for a social network.
The nodes represent the users, illustrated as circles with university acronyms.
The edges indicate whether two users are friends, illustrated as black lines connecting circles.
A node might also contain additional information:</p>
<ol>
<li><strong>Node ID</strong>: illustrated as a thumbnail image of the user along with the user’s text name.</li>
<li><strong>Node attributes/features</strong>: represent whether the user is located in the US and whether the user likes to bike, illustrated as a \( 2 \times 2 \) table. </li>
<li><strong>Node label</strong>: signifies the user’s university, categorized into two groups represented by blue and red colors with acronyms: Carnegie Mellon University (CMU) or National Chiao Tung University (NCTU). </li>
</ol>
<p>In fact, node labels are similar to node features, but they often contain missing values, which we are interested in predicting.</p>
<figure>
<img src="./figure2.png" alt="mathematical representation of graph" width="500"/>
<figcaption>
Figure 2. The mathematical representation of the attributed graph, including an adjacency matrix, node features, and node labels. The red question mark denotes unknown.
</figcaption>
</figure>
<p>In Figure 2, the graph structure can be represented by an adjacency matrix, where 1 denotes the presence of an edge between two nodes, and 0 denotes no edge.
The node features are also represented by a matrix, where each feature is binary in the example, but it can also be continuous.
The node labels are represented by a matrix with one-hot encoding of the class label.</p>
<h2 id="what-are-the-common-graph-tasks">What are the common graph tasks?</h2>
<p>We consider the two common graph tasks:</p>
<ul>
<li><strong>Node Classification</strong>
<ul>
<li><em>Goal:</em> Classify the unlabeled nodes, while some labeled nodes are given.</li>
<li><em>Example:</em> Given a social network with features, can we guess which university Bob goes to, i.e., the label of the gray node in Figure 1?</li>
</ul>
</li>
<li><strong>Link Prediction</strong>
<ul>
<li><em>Goal:</em> Predict the potential additional edges in the graph.</li>
<li><em>Example:</em> Given a social network with features, can we guess whether David and Grace could become friends, i.e., the potential additional red-dash line in Figure 1? </li>
</ul>
</li>
</ul>
<h2 id="what-are-message-passing-methods">What are message-passing methods?</h2>
<!-- | U<sub>[n×r]</sub> | Left singular vectors of adjacency matrix | -->
<!-- | r | Rank for matrix decomposition | -->
<figure>
<img src="./figure3.png" alt="node embedding" width="500"/>
<figcaption>
Figure 3. Illustration of the nodes in a given graph projected into low-dimensional embedding space. The nodes that are more similar in the graph are closer in the embedding space.
</figcaption>
</figure>
<p>Message-passing methods utilize the graph structure to propagate information from the neighbors of a node to the node itself.
Known as sum-product message-passing, belief propagation methods [2, 3] directly perform inference on the graph through several propagation iterations.
Although they are fast and effective because they require neither parameters nor training, belief propagation methods are mainly designed to solve node classification problems based solely on the graph structure and usually do not consider node features.</p>
<p>Another variety of message-passing methods, Graph Neural Networks (GNNs) [4], are a class of deep learning models.
They are commonly used to generate low-dimensional embeddings of nodes to perform graph tasks by learning end-to-end with a training objective.
As shown in Figure 3, the nodes that are better connected in the graph are expected to have closer embeddings in the low-dimensional space.</p>
<p>Some studies remove the non-linear functions in GNNs and still achieve good performance; such models are called linear GNNs [5, 6].
One of the many advantages of linear GNNs is that their node embeddings are available prior to model training.
A comprehensive study on linear GNNs can be found in [6].</p>
<h2 id="measuring-nui-would-a-message-passing-method-work-in-a-given-setting"><em>Measuring NUI:</em> Would a message-passing method work in a given setting?</h2>
<figure>
<img src="./figure4.png" alt="scenarios" width="400"/>
<figcaption>
Figure 4. Scenarios in which the message-passing method may not work well. (a): The graph structure exhibits no network effects. (b): Node features are not correlated with node labels.
</figcaption>
</figure>
<p>Given an attributed graph, how can we measure the network usable information (NUI)? The message-passing method may not work well in the following two conditions:</p>
<ol>
<li><strong>No network effects:</strong> the graph structure is not useful to solve the graph task. In Figure 4(a), since every labeled node has one blue and one red neighbor, we cannot infer the label for Bob by examining its neighbors.</li>
<li><strong>Useless node features:</strong> the node features are not useful to solve the graph task. In Figure 4(b), we can see that whether a user likes to bike is not correlated with the user’s university.</li>
</ol>
<p>If either of these two extreme conditions applies, a message-passing method is likely to fail in inferring the unknown node label, i.e., Bob’s university.
However, it is very likely that the graph information has varying levels of usefulness, ranging between completely useful and completely useless.
For example, in Figure 2, only 1 out of 2 node features is useful: the node feature ‘located in the US’ is useful, while the node feature ‘likes to bike’ is not.</p>
<figure>
<img src="./figure5.png" alt="structural and neighbors' feature embedding" width="800"/>
<figcaption>
Figure 5. Illustration of structural embedding (left) and neighbors' feature embedding (right). (a): SVD is conducted on the adjacency matrix to extract structural embedding. (b): Dimensionality reduction is conducted on the node features. (c): Messages passed from a node's neighbors are aggregated to generate its node embedding.
</figcaption>
</figure>
<p>We focus on analyzing whether GNNs will perform well in a given setting, which is an important problem in industry.
For a large cloud service provider, a customer supplies an attributed graph and asks the provider to solve a graph task (e.g., recommendation) using GNNs within a restricted budget.
However, if the given graph lacks network usable information (NUI), the resources spent on training GNNs are wasted.
Therefore, our method serves as a preliminary tool to determine whether resources should be allocated to training expensive deep models.</p>
<p>A straightforward way is to analyze the node embedding of the given graph generated by GNNs, but this is only available after training, which is expensive and time-consuming.
For this reason, we propose to analyze the derived node embedding in linear GNNs, which can be precomputed before model training.
More specifically, we derive three types of node embedding that can comprehensively represent the information of the given graph, namely:</p>
<ol>
<li><strong>Structural embedding (\( U \)):</strong> for the information of the graph structure. It is extracted by decomposing the adjacency matrix with singular value decomposition (SVD). Intuitively, the left singular vectors U give the information of the node community. For example, in Figure 5(a), U identifies that the first three users belong to the blue community, while the last four belong to the red community. This is useful when the node features are not useful to solve the graph task.</li>
<li><strong>Feature embedding (\( F \)):</strong> for the information of the node features. It consists of the original node features after dimensionality reduction. As shown in Figure 5(b), Principal Component Analysis (PCA) [7] is used as the dimensionality reduction technique. This is useful when there are no network effects, i.e. the graph structure is not useful to solve the graph task.</li>
<li><strong>Neighbors’ feature embedding (\( S \)):</strong> for the information of the features aggregated from the neighbors. As shown in Figure 5(c), the messages are passed and aggregated from the neighbors for two steps. The intuition is that, in addition to the information from the user, the user’s neighbors also provide useful information to the task. Leveraging their information leads to better performance on the graph task. This is useful when both the graph structure and the node features are useful to solve the graph task.</li>
</ol>
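<p>To make the three embeddings concrete, here is a minimal numpy sketch on a toy graph. The graph, features, and dimensions are all made up, and the normalization and aggregation choices are simplifications of the pipeline in [1].</p>

```python
import numpy as np

# Toy attributed graph (all values hypothetical): two triangles, 4-dim features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(6, 4))
k = 2                                     # embedding dimension

# 1) Structural embedding U: truncated SVD of the adjacency matrix.
U_full, sv, _ = np.linalg.svd(A)
U = U_full[:, :k] * sv[:k]

# 2) Feature embedding F: PCA, i.e. SVD of the mean-centered features.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
F = Xc @ Vt[:k].T

# 3) Neighbors' feature embedding S: two rounds of mean aggregation.
P = A / A.sum(axis=1, keepdims=True)      # row-stochastic propagation matrix
S = P @ (P @ F)

print(U.shape, F.shape, S.shape)          # each is (6, 2)
```

<p>Row-normalizing by degree makes each aggregation step a weighted average of neighbor features; other aggregators (e.g., symmetric normalization) would serve equally well in this sketch.</p>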
<p>Once we have embeddings that represent the information of the nodes of the graph, we propose NetInfoF_Score, and link graph information to task performance with the following definition and theorem:</p>
<p><strong>Definition 1</strong> (NetInfoF_Score of \( Y \) given \( X \)).
<em>Given two discrete random variables \( X \) and \( Y \), NetInfoF_Score \( s \) of \( Y \) given \( X \) is defined as:</em>
\[ s = 2^{-H(Y|X)} \]
<em>where \( H(\cdot | \cdot) \) denotes the conditional entropy [8].</em></p>
<p><strong>Theorem 1</strong> (NetInfoF_Score).
<em>Given two discrete random variables \( X \) and \( Y \), NetInfoF_Score \( s \) of \( Y \) given \( X \) lower-bounds the accuracy:</em>
\[ s = 2^{-H(Y|X)} \leq accuracy(Y|X) = \sum_{x \in X}{\max_{y \in Y}{p_{x, y}}} \]
<em>where \( p_{x, y} \) is the joint probability of \( x \) and \( y \).</em></p>
<p>The proof is in [1].
The intuition behind this theorem is that the conditional entropy of \( Y \) (e.g., labels) given \( X \) (e.g., “like biking”) is a strong indicator of how good a predictor \( X \) is for the target \( Y \).
This gives NetInfoF_Score an intuitive interpretation: it is a lower bound on the accuracy. When there is little usable information for the task, the value of NetInfoF_Score is close to that of random guessing.</p>
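<p>Both quantities in Theorem 1 can be estimated from empirical counts, which makes the bound easy to verify on toy data. A small sketch (the two example variables below are made up):</p>

```python
import numpy as np
from collections import Counter

def netinfof_score_and_accuracy(x, y):
    """s = 2^(-H(Y|X)) and acc = sum_x max_y p(x, y), from empirical counts."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    H = 0.0                                    # conditional entropy in bits
    for (xv, yv), c in joint.items():
        H -= (c / n) * np.log2(c / px[xv])     # p(x,y) * log2 p(y|x)
    acc = sum(max(c for (xv, yv), c in joint.items() if xv == x0)
              for x0 in px) / n
    return 2.0 ** (-H), acc

# Perfectly predictive feature: H(Y|X) = 0, so the score meets accuracy at 1.
s, acc = netinfof_score_and_accuracy([0, 0, 1, 1], ['a', 'a', 'b', 'b'])
print(s, acc)   # 1.0 1.0

# Useless feature: the score drops to chance and still lower-bounds accuracy.
s, acc = netinfof_score_and_accuracy([0, 1, 0, 1], ['a', 'a', 'b', 'b'])
print(s, acc)   # 0.5 0.5
assert s <= acc + 1e-9
```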
<figure>
<img src="./figure6.png" alt="Empirical study" width="400"/>
<figcaption>
Figure 6. Our theorem holds, where NetInfoF_Score is always less than or equal to validation accuracy.
</figcaption>
</figure>
<p>In Figure 6, each point represents the accuracy and NetInfoF_Score obtained by solving graph tasks using one type of node embedding.
We find that NetInfoF_Score lower-bounds the accuracy of the given graph task, as expected.
If an embedding has no usable information to solve the given task, NetInfoF_Score gives a score close to random guessing (see lower left corner in Figure 6).
The details of the experiment can be found in [1].</p>
<h2 id="exploiting-nui-how-to-solve-graph-tasks"><em>Exploiting NUI:</em> How to solve graph tasks?</h2>
<p>In this blog, we focus on explaining how to solve node classification. Link prediction is more involved, and the details can be found in [1].</p>
<p>To solve node classification, we concatenate different types of embedding, and the input of the classifier is as follows:
\[ U \parallel F \parallel S , \]
where \( \parallel \) is concatenation, \( U \) is the structural embedding, \( F \) is the feature embedding, and \( S \) is the neighbors’ feature embedding. Among all the choices, we use logistic regression as the node classifier, as it is fast and interpretable.</p>
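<p>This step can be sketched as follows. The embeddings here are synthetic stand-ins for \( U \), \( F \), and \( S \), and the hand-rolled gradient-descent fit is only a stand-in for an off-the-shelf logistic regression solver.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
y = np.repeat([0, 1], n // 2)              # two balanced classes

# Hypothetical stand-ins for the precomputed embeddings U, F, S:
# each carries a class-dependent signal plus Gaussian noise.
U = rng.normal(size=(n, k)) + y[:, None]
F = rng.normal(size=(n, k)) + y[:, None]
S = rng.normal(size=(n, k)) + y[:, None]

Z = np.hstack([U, F, S])                   # the classifier input: U || F || S
Z = np.hstack([Z, np.ones((n, 1))])        # bias column

# Plain logistic regression fitted by gradient descent on the log-loss.
w = np.zeros(Z.shape[1])
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-Z @ w))       # predicted P(y = 1)
    w -= 0.1 * Z.T @ (p - y) / n           # average gradient step

train_acc = np.mean(((Z @ w) > 0) == y)
print("training accuracy:", train_acc)
```

<p>Because the classifier is linear, the learned weights directly indicate which of the three embedding types contributes to the prediction, which is the interpretability argument made above.</p>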
<h2 id="how-well-does-our-proposed-method-perform">How well does our proposed method perform?</h2>
<figure>
<img src="./table1.png" alt="Node classification" width="1000"/>
</figure>
<p>As shown in Table 1, when applied to 12 real-world graphs, NetInfoF outperforms the GNN baselines on node classification in 7 out of 12 datasets.</p>
<figure>
<img src="./figure7.png" alt="NetInfoF_Score on real-world datasets" width="200"/>
<figcaption>
Figure 7. NetInfoF_Score highly correlates to test performance in real-world datasets. Each point denotes the result of one type of embedding in each dataset.
</figcaption>
</figure>
<p>In Figure 7, NetInfoF_Score is highly correlated with test performance on node classification.
This indicates that NetInfoF_Score is a reliable measure for quickly deciding whether to use a GNN on a given graph, without any model training.
Note that although our theorem proves that NetInfoF_Score is a lower bound on <em>training</em> accuracy, it is possible for <em>testing</em> accuracy to be lower than NetInfoF_Score (blue points below the 45-degree line in Figure 7).</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, we investigate the problem of predicting whether a message-passing method will work well on a given graph.
To solve this problem, we introduce NetInfoF, an approach to measure and exploit the network usable information (NUI).
Applied to real-world graphs, NetInfoF not only correctly measures NUI with NetInfoF_Score, but also outperforms the baselines on node classification in 7 out of 12 datasets.</p>
<p>Please find more details in our paper [1].</p>
<h2 id="references">References</h2>
<p>[1] Lee, M. C., Yu, H., Zhang, J., Ioannidis, V. N., Song, X., Adeshina, S., … & Faloutsos, C. NetInfoF Framework: Measuring and Exploiting Network Usable Information. International Conference on Learning Representations (ICLR), 2024.</p>
<p>[2] Koutra, D., Ke, T. Y., Kang, U., Chau, D. H., Pao, H. K. K., & Faloutsos, C. Unifying guilt-by-association approaches: Theorems and fast algorithms. Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD), 2011</p>
<p>[3] Günnemann, W. G. S., Koutra, D., & Faloutsos, C. Linearized and Single-Pass Belief Propagation. VLDB Endowment, 2015.</p>
<p>[4] Kipf, T. N., & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR), 2017.</p>
<p>[5] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. Simplifying Graph Convolutional Networks. PMLR International Conference on Machine Learning (ICML), 2019.</p>
<p>[6] Yoo, J., Lee, M. C., Shekhar, S., & Faloutsos, C. Less is More: SlimG for Accurate, Robust, and Interpretable Graph Mining. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.</p>
<p>[7] Principal component analysis. In Wikipedia. <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Principal_component_analysis">https://en.wikipedia.org/wiki/Principal_component_analysis</a>, 2024.</p>
<p>[8] Conditional entropy. In Wikipedia. <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conditional_entropy">https://en.wikipedia.org/wiki/Conditional_entropy</a>, 2024.</p>
<h1>Piano: Extremely Simple, Single-Server Private Information Retrieval</h1>
<p><em>2024-05-16 · https://www.cs.cmu.edu/~csd-phd-blog/2024/piano-private-information-retrieval/</em></p>
<p>Information retrieval is a pervasive aspect of our digital life, yet the process of retrieval consistently compromises the privacy of the users (i.e., the information requesters). For example, when you access a website, you need to first retrieve the website’s IP address from a DNS (Domain Name Service) server, so that you can talk to the website provider associated with that IP address. Unfortunately, this process discloses your browsing history to the DNS server. Similarly, submitting a query to a search engine exposes your search history to the search service provider. Despite efforts to develop privacy-preserving information retrieval services, those service providers often employ the same retrieval techniques as their non-private counterparts and just promise to keep the users’ records securely or delete them afterward. Nonetheless, these service providers are still susceptible to data breaches.</p>
<p>Is there a way to completely eradicate the leakage of information during information retrieval?
A natural attempt is to encrypt the queries, so the server cannot read the queries in plaintext. However, without seeing the query, how can the server locate the relevant information? It seems like we are now facing a dilemma. </p>
<h1 id="what-is-private-information-retrieval-pir">What is Private Information Retrieval (PIR)?</h1>
<p>Private information retrieval (PIR), first introduced by <a rel="noopener" target="_blank" href="https://www.computer.org/csdl/proceedings-article/focs/1995/71830041/12OmNzYNNfi">Chor et al.</a> back in 1995, is exactly the formal abstraction of “retrieving public information privately”. It is defined as follows.</p>
<blockquote>
<p><strong>Definition (Private Information Retrieval Protocol):</strong> For simplicity, let’s assume there is only one server storing a database consisting of \(n\) integers, denoted by \(X_1,\dots, X_n\). Now assume the client wants to learn \(X_i\). The client and the server then engage in an interactive algorithm, formally referred as a Private Information Retrieval protocol. At the end of the protocol, the client should learn the correct value of \(X_i\). Also, the protocol should provide query privacy: the server should learn nothing about the query index \(i\).</p>
</blockquote>
<p></p>
<p>At first glance, it might seem impossible to design such a protocol. Don’t worry – there is at least a naive protocol satisfying the definition: given any query index \(i\), the client just downloads the whole database and reads \(X_i\) locally. This protocol is perfectly private: the server only knows that the client downloads the database and that is independent of the actual query. Nonetheless, this protocol is not practical – the computation and communication costs per query are both linear in the size of the database.</p>
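<p>The naive protocol is short enough to write out directly (a toy sketch; the database values are made up):</p>

```python
# Naive PIR: the client downloads the entire database and reads X_i locally.
# Perfectly private (the server's view is independent of i), but both the
# communication and the computation are linear in n.
def naive_pir_query(server_db, i):
    downloaded = list(server_db)   # the full-database "transcript"
    return downloaded[i]

db = [10, 20, 30, 40]
assert naive_pir_query(db, 2) == 30
```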
<p>Previously, cryptographers focused on improving the <em>communication cost</em> of PIR. Most proposed PIR schemes rely on <em>Homomorphic Encryption (HE)</em>, a special type of encryption scheme that allows <em>computation on the ciphertexts</em>. HE essentially avoids the dilemma we saw before – the server can perform some form of computation on the encrypted query and somehow locate the necessary information for the client. Existing schemes already achieved \(O(\log n)\) communication per query based on HE (e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3460120.3485381">OnionPIR</a> and <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9833700/">Spiral</a>).</p>
<!---
A typical scheme is as follows. The server represents the database as a $\sqrt{n}\times \sqrt{n}$ matrix, denoted as $M$. Suppose the client is interested in a database value located the $j$-th column of the matrix. The client can homomorphically encrypts an one-hot vector $e_j=(0,\dots,0,1,0\dots,0)^T\in \{0,1\}^{\sqrt{n}}$ where only the $j$-th location is 1. The clients sends the homomorphically encrypted vector $[e_j]$ to the server, and the server computes the matrix-vector multiplication $M[e_j]$ and return the results to the client. The client decrypts the result, which contains exactly the $j$-th column of the matrix, and they gets the desired value. This protocol saves the communication to $O(\sqrt{n})$.
-->
<p>An unsolved issue is the <em>computation cost</em>. Even with the help of HE, the computation cost is still linear in \(n\). Such linear-computation-cost PIRs are not suitable for larger databases. For example, a typical DNS server contains several hundreds of gigabytes of records and it is over-costly if the server has to scan all the records for each query. Unfortunately, <a rel="noopener" target="_blank" href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c3404368a32ab694c862f88cd1b5a3e6208f1bff">Beimel, Ishai and Malkin</a> showed that in the classical PIR model, linear computation is inevitable. The intuition behind their lower bound is actually pretty straightforward – if the server does not touch any particular database entry during the query, the server knows the client’s query cannot be this entry. This at least leaks one bit of information. </p>
<p>To get around this lower bound and achieve sublinear computation cost PIR, we can resort to a powerful idea in computer science – <strong>preprocessing</strong>. Preprocessing PIR allows the client or the server to run some (possibly interactive) protocols before the query phase begins, and store some necessary hints in the client space or the server space. The client or the server then uses those hints to help with the online queries.
<a rel="noopener" target="_blank" href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c3404368a32ab694c862f88cd1b5a3e6208f1bff">Beimel, Ishai and Malkin</a> first showed that preprocessing PIR can indeed achieve \(O(\frac{n}{\log n})\) computation cost per query. <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2019/1075.pdf">Corrigan-Gibbs and Kogan</a> (and <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/081.pdf">their follow-up work</a> with Henzinger) further showed that the computation cost can be improved to \(O(\sqrt{n})\) per query. Nonetheless, these schemes are relatively complicated and have remained theoretical. Then, a natural question is:</p>
<blockquote>
<p>Is there a practical, and sublinear computation cost PIR protocol?</p>
</blockquote>
<p></p>
<h1 id="piano-extremely-simple-pir-with-preprocessing">Piano: Extremely Simple PIR with Preprocessing</h1>
<p>We now introduce our work Piano (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/452">Zhou et al., IEEE S&P 2024</a>), an extremely simple and practical PIR scheme with preprocessing. It is easy to implement – the core idea can be implemented within 150 lines of Go code, and it is blazingly fast. Piano only takes 12ms to finish a query on a 100GB database, which is nearly 1000x faster than the previous best solution (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/949">Henzinger et al. 2022</a>)!</p>
<p>Piano starts with the following idea: </p>
<blockquote>
<p>Downloading the whole database for every query seems bad, but what if the client can just download the database once and prepare for many queries?</p>
</blockquote>
<p></p>
<p>That is, the client downloads the database during the preprocessing (importantly, without knowing the following queries), computes and stores some sublinear size hints, and then deletes the database. The client then utilizes those hints to help with online queries.</p>
<p>This idea cannot scale to Internet-size databases (like Google search), for sure. However, for many medium-size databases, the idea is practical if we can build a <strong>streaming preprocessing</strong>. Streaming means that instead of downloading the database at once, the client can download a small portion of the database each time, locally update the hints, and delete that portion of the database from the local storage. One can imagine that this process is similar to watching a YouTube video – you do not need to download the whole video at once, but rather just dynamically fetch a piece of the video 30 seconds ahead. We indeed designed a streaming preprocessing algorithm, and given a database of size around 100 gigabytes, the cost will be similar to watching YouTube videos for several hours. </p>
<h2 id="preprocessing">Preprocessing</h2>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/preprocessing.png" alt="Preprocessing" />
<em>Figure 1: Illustration of the preprocessing for a database with 16 entries.</em></p>
<p>We now get to introduce some details about the preprocessing.
We split the \(n\) indices into \(\sqrt{n}\) chunks of size \(\sqrt{n}\) (see Figure 1 above).
During the preprocessing, the client will store roughly \(\tilde{O}(\sqrt{n})\) “random linear equations”, where the variables are the entries from the database. To generate one linear equation, the client samples one entry from each database chunk, resulting in \(\sqrt{n}\) variables for one equation. The client computes the sum of those variables during the preprocessing. In the end, the client only stores the sum values and the random seeds used to generate the equations. In addition, the client stores around \(\log n\) random entries for each chunk. Those random entries will be called the <em>replacement entries</em>.
Finally, the equations and the replacement entries comprise the client’s hint, which takes \(O(\sqrt{n}\log n)\) storage space. The server does not store any hints.
Since the client does all the preprocessing computation locally, the server cannot observe the preprocessing result. Moreover, as we mentioned before, the preprocessing is done in a streaming manner, so the temporary storage requirement is small: the client only needs to initialize the equations to zeros, download the chunks one by one, and accumulate the sum values as it goes.</p>
<h2 id="handling-a-single-query">Handling a Single Query</h2>
<p>Now let’s see how the preprocessing helps with the online query. We use the example in Figure 1. Assume the client is querying for \(X_7\). The client will first scan the local equations and look for the first equation that contains \(X_7\). Because of the structure of those equations, during the scanning process, the client only needs to regenerate the second index in each equation using the stored random seeds, and match the generated index against “7”. The client will find the equation \(X_1 + X_7 + X_{10} + X_{12}=Y_1\) in the example. Since the client already stores \(Y_1\), which is the sum of these four elements, if the client can further learn the sum of \(X_1 + X_{10} + X_{12}\), it will learn the value of \(X_7\) by a basic subtraction. Unfortunately, the client cannot directly send the indices “\(1, 10, 12\)” to the server and ask the server to return the sum of \(X_1 + X_{10} + X_{12}\), because the server will immediately see that there are no indices from the second chunk and learn that the query must belong to the second chunk. This is an information leakage.</p>
<p>Luckily, we can mitigate this information leakage easily – remember that the client additionally stores some replacement entries in each chunk. So instead of directly removing the query index from the equation, the client <strong>replaces</strong> the query index with a known replacement entry index in the same chunk as the query. In our example, the client can simply replace \(X_7\) with \(X_6\), and send the four indices “\(1, 6, 10, 12\)” to the server. The server should return the sum of \(X_1 + X_6 + X_{10} + X_{12}\) to the client, upon receiving the indices. The client can then compute \(X_7\) as in Figure 2.
Note that this query <strong>leaks no information about the actual query</strong>: the server just sees four random indices and each one of them is just an independently random index within one chunk, given the fact that the server cannot see the preprocessing results. This query is also <strong>efficient</strong>: the client takes \(O(\sqrt{n})\) time to find the equation, the server takes \(O(\sqrt{n})\) time to compute the sum, and the communication cost is just \(O(\sqrt{n})\) (sending the edited equation to the server).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/query.png" alt="Query" />
<em>Figure 2: Illustration of the online query.</em></p>
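<p>The whole single-query flow fits in a short, self-contained sketch. Everything below is a toy model of the scheme: a 16-entry database, plain Python lists instead of PRF-generated seeds, and a top-up loop to guarantee that every index is covered by some equation (in Piano, roughly \( \sqrt{n}\log n \) random equations achieve this with high probability).</p>

```python
import random

random.seed(0)
n, chunk = 16, 4                          # 16 entries, sqrt(n) = 4 chunks of size 4
DB = [random.randrange(1000) for _ in range(n)]   # the server's database

# --- Preprocessing (done by the client; in Piano the DB is streamed) ---
def sample_eq():
    """One random equation: one index per chunk, plus the sum of those entries."""
    idxs = [c * chunk + random.randrange(chunk) for c in range(n // chunk)]
    return idxs, sum(DB[j] for j in idxs)

equations = [sample_eq() for _ in range(8)]
# Toy-only top-up so that every index appears in some equation.
while not all(any(e[0][i // chunk] == i for e in equations) for i in range(n)):
    equations.append(sample_eq())

# Replacement entries: one known (index, value) pair per chunk.
replacements = {}
for c in range(n // chunk):
    j = c * chunk + random.randrange(chunk)
    replacements[c] = (j, DB[j])

# --- Online query for index q ---
q = 7
c_q = q // chunk                              # the chunk containing q
idxs, Y = next(e for e in equations if e[0][c_q] == q)
r_idx, r_val = replacements[c_q]              # known entry in the same chunk
sent = [r_idx if j == q else j for j in idxs] # swap q out before sending

T = sum(DB[j] for j in sent)                  # server: O(sqrt(n)) summation
X_q = Y - (T - r_val)                         # client: recover the answer
assert X_q == DB[q]
```

<p>This sketch omits the machinery needed for multiple queries (equation refresh, backups, and the database permutation); it also ignores the corner case where the sampled replacement index equals \( q \) itself, which a real implementation must handle to preserve privacy.</p>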
<h2 id="multiple-queries">Multiple Queries</h2>
<p>To amortize the cost of the preprocessing, we want to support as many queries as possible. Let’s first assume we need to support \(\sqrt{n}\) random queries, which amortize the preprocessing costs to the same as the query costs, i.e., \(O(\sqrt{n})\) computation and communication cost per query.</p>
<p>The main issue is that we cannot reuse the same preprocessed equation or the same replacement entry to handle two queries. Otherwise, it will cause privacy leakage. We also don’t want to deplete our equations and replacement entries. What should we do?</p>
<p>It is easy to handle the replacements: recall that we have \(\sqrt{n}\) random queries and \(\sqrt{n}\) chunks. With a classical balls-into-bins argument, there will be at most \(O(\log n)\) queries in each chunk with high probability. So preparing \(O(\log n)\) replacements per chunk is enough. </p>
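<p>The balls-into-bins claim is easy to check empirically. The sketch below throws \( \sqrt{n} \) random queries into \( \sqrt{n} \) chunks for \( n = 10^6 \) and confirms that the maximum load stays well under \( \log_2 n \) (the parameters are illustrative):</p>

```python
import math
import random

random.seed(42)
n = 10**6
chunks = math.isqrt(n)            # sqrt(n) = 1000 chunks
load = [0] * chunks
for _ in range(chunks):           # sqrt(n) uniformly random queries
    load[random.randrange(chunks)] += 1

print("max load:", max(load), "  log2(n):", round(math.log2(n)))
```

<p>In fact, the expected maximum load here is only \( \Theta(\log n / \log\log n) \), comfortably below the \( O(\log n) \) budget of replacement entries per chunk.</p>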
<p>It is trickier to handle the equations. A not-so-obvious observation is that we cannot just remove the consumed equation or replace it with a random backup equation, because it will skew the joint distribution of the remaining equations – doing so makes the current query less likely to appear in the remaining equations! The correct strategy is to replace the consumed equation with a random equation conditioned on the current query being included. We modify our preprocessing algorithm and add additional structural backup equations to facilitate this refresh strategy. We omit the details here due to space constraints and refer the interested readers to <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/452">our original paper</a> for the full description.</p>
<!---
To facilitate this refresh strategy, we prepare some special backup equations, as shown in Figure 3. Specifically, we prepare \\(O(\log n)\\) backup equations per chunk, and those backups prepared for the \\(i\\)-th chunk will ignore the entries in the \\(i\\)-th chunk. The reason behind it will be clear if we see the refresh algorithm: after each query \\(X\\) located in the \\(j\\)-th chunk, we will pick a backup equation prepared for chunk \\(j\\), and just complete that backup equation by adding \\(X\\) to it. We then replace the consumed equation with this completed backup. See an example in Figure 4.
![Backup](backup2.png)
*Figure 4: Illustration of the backup strategy. We prepare \\(\sqrt{n}\\) groups of backups, each contains \\(\log n\\) equations. The \\(i\\)-th group ignores the entries in the \\(i\\)-th chunk. Assume the client just queried for \\(X_7\\). So the client picks a backup from the second group, completes it with \\(X_7\\), and replaces the consumed equation with this completed equation.*
--->
<p>Our ultimate goal is to support many adaptive queries and remove the “\(\sqrt{n}\) random queries” restriction. First, to make the queries look random, we can require the server to randomly permute the database upfront and share the permutation seed with the client. As long as the client is not making queries that depend on the permutation, the queries can be viewed as randomly distributed. An experienced reader may notice that a malicious server may not necessarily permute the database correctly. In our paper, we proposed extra steps to ensure that a malicious server only hurts the correctness but not the privacy of the scheme.</p>
<p>Moreover, to support more queries, the simplest way is to redo the preprocessing per \(\sqrt{n}\) queries. We can do better by utilizing a pipelining trick, shown in Figure 5. During the online query phase for the first \(\sqrt{n}\) queries, we simultaneously run the preprocessing for the next batch of \(\sqrt{n}\) queries. So when we finish the first batch of queries, we have already finished the preprocessing for the next batch of queries, and we can immediately start the next query.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/piano-private-information-retrieval/pipeline.png" alt="Pipeline" />
<em>Figure 5: Illustration of the pipelining trick. We run the preprocessing for the next batch simultaneously with the online queries. Then, the whole protocol can have a one-time preprocessing and a continuous online phase.</em></p>
<h2 id="asymptotic-metrics-of-piano">Asymptotic Metrics of Piano</h2>
<p>Piano’s asymptotic behaviors can be summarized as follows. </p>
<blockquote>
<p><strong>Simplified Theorem.</strong> Piano is a PIR protocol with one-time preprocessing, and it supports a polynomially bounded number of queries, while having the following asymptotic behaviors:</p>
<ol>
<li>One-time Preprocessing:
<ul>
<li>\(O(n)\) communication;</li>
<li>\(\tilde{O}(n)\) client computation.</li>
</ul>
</li>
<li>Online Query (per query):
<ul>
<li>\(O(\sqrt{n})\) communication;</li>
<li>\(\tilde{O}(\sqrt{n})\) client computation;</li>
<li>\(O(\sqrt{n})\) server computation.</li>
</ul>
</li>
<li>Storage:
<ul>
<li>\(\tilde{O}(\sqrt{n})\) client storage;</li>
<li>No additional server storage.</li>
</ul>
</li>
</ol>
<p>Here, \(\tilde{O}()\) hides the polylogarithmic terms.</p>
</blockquote>
<p></p>
<p>Notably, Piano achieves nearly optimal time-space tradeoff: <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/081.pdf">Corrigan-Gibbs, Henzinger and Kogan</a> showed that in any preprocessing PIR scheme, if the client stores \(S\) bits after preprocessing and the online query time cost is \(T\), then \(S \times T \ge \Omega(n)\). Piano achieves \(\tilde{O}(\sqrt{n})\) client storage and \(\tilde{O}(\sqrt{n})\) online time, which matches the bound except for a polylogarithmic factor!</p>
<!---
**Client Computation.** The client takes \\(n \log n\\) time to do preprocessing for \\(\sqrt{n}\\) queries. Each online query requires the client to take \\(O(\sqrt{n})\\) expected time to find the equation. So the amortized client cost per query is \\({O}(\sqrt{n}\log n)\\).
**Server Computation.** The server just takes linear time to stream the database for the client during the preprocessing phase and take \\(O(\sqrt{n})\\) time to compute the equation sum for each online query. So the amortized server cost per query is \\({O}(\sqrt{n})\\).
**Communication.** The client streams \\(n\\) integers for \\(\sqrt{n}\\) queries. The client also sends a \\(\sqrt{n}\\)-size equation to the server. The server's response is just a single integer. So the amortized communication cost per query is \\({O}(\sqrt{n})\\).
**Storage.** The client stores no more than \\({O}(\sqrt{n}\log n)\\) equations and \\({O}(\sqrt{n}\log n)\\) replacements. The client only stores the random seeds and the sum value for the equations. So the total client storage requirement is \\({O}(\sqrt{n}\log n)\\). Note that the server has no per-client storage.
--->
<h2 id="empirical-results">Empirical Results</h2>
<p>We tested Piano on a 100GB database containing 1.6 billion 64-byte records. We compared it to the previous state-of-the-art scheme SimplePIR (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/949">Henzinger et al. 2022</a>). As shown in the following table, our scheme has a nearly 1000x improvement in terms of pure computation, and a 120x improvement in the end-to-end latency, while having advantages in communication and storage. Note that the streaming preprocessing only takes 45 minutes (8-thread parallelization).</p>
<table><thead><tr><th align="left"></th><th align="center">Piano</th><th align="center">SimplePIR</th></tr></thead><tbody>
<tr><td align="left">End-to-end Latency</td><td align="center">12ms (computation) + 60ms (network)</td><td align="center">~11s</td></tr>
<tr><td align="left">End-to-end Communication</td><td align="center">220KB</td><td align="center">2.3MB</td></tr>
<tr><td align="left">Storage</td><td align="center">0.8GB</td><td align="center">1.2GB</td></tr>
</tbody></table>
<h1 id="applications-limitations-and-open-problems">Applications, Limitations and Open Problems</h1>
<p>We are now actively exploring potential applications of Piano. As we mentioned earlier, a private search engine is one of the most attractive applications of PIR,
and we are indeed building such an engine based on the combination of an optimized version of Piano and graph-based search algorithms. Our preliminary results show
that this private search engine can handle a static database nearly the size of English Wikipedia.
It achieves search quality comparable to that of non-private search algorithms and provides orders-of-magnitude efficiency improvements over the previous best
solutions, including <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2023/1438">Tiptoe</a> and <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/154">Coeus</a>.</p>
<p>Some other applications, such as private DNS query, remain more challenging. One of the main technical difficulties is that the database (e.g., the DNS records) is
being updated frequently, which contradicts the assumption of Piano. Two recent works proposed possible solutions for updatable databases (<a rel="noopener" target="_blank" href="https://eprint.iacr.org/2024/303.pdf">Lazzaretti and Papamanthou</a>, <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2024/318.pdf">Hoover et al.</a>), but how to design a truly practical updatable PIR scheme remains an open problem.</p>
<p>Another limitation of Piano is the lack of appropriate permission control since the preprocessing of Piano reveals the whole database to the client. Imagine we are designing a PIR protocol for a personal credit score lookup service. It is not acceptable for one individual to learn all others’ credit scores. It remains an interesting open problem to design a PIR scheme with proper permission control, which is required by many real-world applications. </p>
Precise Data Center Traffic Engineering with Constrained Hardware Resources2024-05-10T00:00:00+00:002024-05-10T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/precise-traffic-engineering/<p>Data center networks are similar to large cities:
they are at massive scale, and there exist many available roads/paths between a pair of source and destination.
To move data from one place to another, we transfer data in chunks of bytes known as packets (roughly analogous to vehicles),
which traverse one of the available paths.
Just like what happens during rush hours—i.e., vehicles get congested on a road—packets
also suffer from congestion if too many of them end up taking the same path.
Congestion leads to bad application performance since more time is needed to transfer data.
Therefore, we want to route packets via different paths to avoid congestion.
Since the paths have different capacities (analogous to the number of car lanes), the number of packets
sent onto each path should follow a calculated split ratio, which we refer to as a weight distribution, in order to achieve optimal load balancing.</p>
<p>At path intersections—i.e., where paths split or converge—a
<a rel="noopener" target="_blank" href="https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56990-series">purpose-built network switching chip</a>
exists to forward packets to an assigned path following a calculated weight distribution.
I will describe shortly how a switch achieves this.
However, making packets precisely follow a given weight distribution is difficult because switches have only constrained hardware resources to support it.
When packets are distributed imprecisely, we observe negative outcomes such as load imbalance and congestion.
<strong>A key challenge is the following: how can we achieve precise packet distribution given the constrained hardware resources?</strong>
Our recent work addresses this challenge and achieves very high fidelity in following the distribution.
In this blog post, I will first describe how load balancing is implemented in switch hardware
and then quantify the impact when the weight distributions are not followed precisely.
Finally, I will explain our proposed approach and show how it mitigates the problem.</p>
<h1 id="traffic-engineering-switch-hardware-and-precision-loss">Traffic engineering, switch hardware and precision loss</h1>
<p>There are two types of elements in a data center network: end hosts and switches.
These elements are interconnected by network links in a hierarchical, multi-rooted tree structure, as illustrated in the top right portion of Figure 1.
The end hosts are colored, and switches are labeled as S1, S2, etc.
A pair of end hosts with the same color might want to exchange some amount of bytes, say 100 GB (or simply 100G),
which is called <em>demand</em>. It is simple to find a path to route one demand.
But when there are so many of these demands from various end hosts, routing becomes a complicated optimization problem.
This is because we also need to ensure that the load on each path is balanced.
Hence we rely on a technique called <em>traffic engineering (TE)</em> to solve this problem.
The goal of TE is to route all demands while achieving optimal load balancing.
TE works as follows (shown in the top left part of Figure 1):</p>
<ol>
<li>With a global view of the network, a <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi21/presentation/ferguson">TE controller</a> collects demands from the network.</li>
<li>Next, TE solves the aforementioned optimization problem using some <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a> algorithm on a powerful server platform.</li>
<li>Once the TE controller has found the optimized routing for these demands, it generates a <em>TE solution</em> that contains weight distributions for each demand.
More specifically, the TE solution provides a concrete plan for each demand, specifying which paths to use and what fraction of the demand to allocate to each path.
This TE solution is then forwarded to the switches, which are responsible for implementing the specified distributions.</li>
</ol>
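<p>To make step 2 concrete, here is a minimal Python sketch of the optimization for the special case of parallel paths, where the min-max-utilization optimum is simply to load each path in proportion to its capacity. The capacities below are hypothetical; real TE controllers solve general topologies with linear programming.</p>

```python
def min_max_split(demand, capacities):
    """Split a demand across parallel paths so that every path ends up
    at the same utilization -- the min-max-utilization optimum for
    this special (parallel-path) case."""
    total = sum(capacities)
    return [demand * c / total for c in capacities]

# A 100G demand over two paths with (hypothetical) capacities of
# 310G and 190G is split 62G:38G, putting both links at 20% utilization.
split = min_max_split(100, [310, 190])
```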
<img src="te-system.png" alt="Architecture of a TE system" width="550"/>
<p align="center"><i>Figure 1: <b>Top:</b> A typical TE system in data center networks. <b>Bottom:</b> Weighted traffic distribution in switch hardware.</i></p>
<p>Taking the red host pair in Figure 1 as an example, the TE solution specifies that the 100G red demand should be split as 62G and 38G across 2 ports on switch S3.
(I will explain the strikethroughs in the figure shortly.)
This blog post focuses solely on how S3 implements this desired traffic distribution.
How the TE controller generates the TE solution or why the TE system works in this particular way is beyond the scope of this post.</p>
<p>Now, let us zoom into S3 and try to understand how it works, as depicted by the bottom half of Figure 1.
Switches forward packets using two sets of configuration rules stored in their <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Static_random-access_memory">SRAM</a> memory: destination rules and groups.
A destination rule stores a pre-configured IP address and a pointer to a group.
Upon receiving an incoming packet, the switch compares the destination IP address in the packet header with the IP address in each destination rule.
If the two addresses match, the packet follows the pointer in the rule to the corresponding group, which determines the packet’s next destination.
A <em>group</em> is a data structure containing a set of egress ports.
The packet consults the group, which selects one of the ports uniformly at random as the egress port.
For packets destined for the red host, they are assigned to group G1, while packets bound for other hosts use different groups.</p>
<p>To approximate the desired traffic distribution, ports p1 and p2 are replicated 62 and 38 times within group G1.
This ensures that 62% of the packets will be directed to port p1, with the remaining 38% routed to port p2.
However, a problem arises: these switches have very limited memory.
Meanwhile, entry replication often consumes a significant amount of memory space.
In other words, the groups may not fit into the switch memory.
Unfortunately, a straightforward solution of simply adding more memory proves difficult due to
a concept known as the <a rel="noopener" target="_blank" href="https://www.linkedin.com/pulse/understanding-power-performance-area-ppa-analysis-vlsi-priya-pandey/">power-performance-area tradeoff</a>.
On-chip switch memory cannot be built to simultaneously achieve high read/write speed, large space and energy efficiency.
We thus have to seek other approaches.</p>
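<p>As a minimal sketch of the entry-replication idea above (assuming uniform random port selection per packet, as described; real switches typically hash packet headers so that all packets of one flow stay on one path):</p>

```python
import random
from collections import Counter

# Group G1: ports p1 and p2 replicated 62 and 38 times, consuming
# 100 memory entries to encode the 62:38 weight distribution.
group_g1 = ["p1"] * 62 + ["p2"] * 38

rng = random.Random(0)
counts = Counter(rng.choice(group_g1) for _ in range(100_000))
share_p1 = counts["p1"] / 100_000   # close to 0.62
```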
<p>For demonstration purposes, let us assume that there is only space available for 10 port entries.
To fit into this space, G1’s size needs to be reduced since it currently requires 100 entries.
The straightforward approach is to reduce the weights to 31:19, which is equivalent to 62:38 but smaller.
However, this is still too large.
Another option is to adjust the weights and round them to 3:2.
This is advantageous because G1 now only consumes 5 entries!
The process of reducing groups to a smaller version is called <em>group reduction</em>.
Typically, this is handled by a group reduction algorithm running alongside the TE controller.
It is important to note that the final weights of 3:2 would change the desired traffic distribution from 62G:38G (struckthrough in Figure 1) to 60G:40G (bold).
This creates a difference in link loads (i.e., the absolute volume of bytes placed on a link) between what the TE solution specifies and what is actually implemented on the switch.
This difference on a link is referred to as <em><strong>precision loss</strong></em>.
While precision loss may seem detrimental, it is crucial to understand its impact on real-world networks, which will be discussed in the next section.</p>
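<p>A toy version of group reduction (my own illustrative heuristic, not the production algorithm) scales the weights down to the memory budget and rounds. The rounding, together with the floor of one entry per port, is exactly where precision loss enters; for skewed weights like 99:1, the floor can even make the rounded group overshoot the budget.</p>

```python
from math import gcd

def reduce_group(weights, max_entries):
    """Shrink integer port weights so the group fits in `max_entries`
    table slots. Rounding is where precision loss creeps in."""
    total = sum(weights)
    scaled = [max(1, round(w * max_entries / total)) for w in weights]
    g = gcd(*scaled)                      # drop any common factor
    return [w // g for w in scaled]

original = [62, 38]                   # desired split of the 100G demand
reduced = reduce_group(original, 10)  # -> [3, 2], i.e., a 60G:40G split

# Precision loss on each link: |62G - 60G| = |38G - 40G| = 2G.
```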
<h1 id="precision-loss-in-the-wild">Precision loss in the wild</h1>
<p>A good metric for quantifying the impact of precision loss is <em>link utilization</em>.
Link utilization reflects the aggregated load on a link as a percentage.
If the utilization exceeds the expected value, we know the link is overloaded.
While it is trivial to assess the utilization of one link, how can we evaluate the network as a whole?
The answer is to examine the distribution of link utilization.
Specifically, we are interested in the common case (p50) and tail (p99 and max) link utilization.</p>
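<p>For concreteness, the p50/p99 statistics can be computed with a nearest-rank percentile over per-link utilization samples (the utilization values below are hypothetical):</p>

```python
from math import ceil

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of the samples are at or below it."""
    s = sorted(samples)
    rank = max(1, ceil(p / 100 * len(s)))
    return s[rank - 1]

# Hypothetical link utilizations (as percentages) for a tiny network.
utils = [10, 12, 15, 20, 22, 25, 30, 40, 55, 90]
p50, p99, peak = percentile(utils, 50), percentile(utils, 99), max(utils)
```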
<p>Figure 2 presents a cumulative distribution function (CDF) of the utilization of all links
measured from a large <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3544216.3544265">Google production data center</a>.
There are two curves in the figure: the blue dashed curve reflects the inferred (ideal) link utilization if all weight distributions are faithfully implemented,
while the red solid curve represents the actual link utilization after weight adjustments in group reduction.
The reality deviates significantly from the ideal:
approximately 15% of the total links exhibit utilization higher than the <em>maximum ideal utilization</em>, by up to <em>5 times</em>!
Consequently, due to overloading caused by precision loss, the worst few links experience severe congestion.</p>
<img src="jupiter-lu.png" alt="Link utilization of a Google production network" width="240"/>
<p align="center"><i>Figure 2: Link utilization of a large Google production data center network.</i></p>
<p>While limited memory is the root cause of precision loss, it’s important to understand how limited memory affects precision loss in different ways.
Let’s revisit the groups in Figure 1. The total space required depends on three factors:
(1) the number of groups, (2) the number of ports per group, (3) the port weights.
It turns out that all three of these factors, along with switch heterogeneity, contribute to precision loss.
I will discuss them one by one:</p>
<ol>
<li><em>Number of groups.</em>
As the network scales larger, the number of groups increases,
resulting in less relative memory space per group.</li>
<li><em>Number of ports per group.</em>
The TE solution sometimes requires using many ports in a group to ensure failure resilience.
This also enlarges the group.</li>
<li><em>Skewed port weights.</em>
Some groups contain skewed weight distributions that are hard to reduce.
For example, consider a distribution like 99:1—this requires a total of 100 entries.
Since 1 is the smallest possible weight value, it cannot be reduced further.
To make this group smaller, we have to reduce the weight of 99.
However, the more we reduce it, the more precision loss occurs.</li>
<li><em>Heterogeneity.</em>
Data centers typically comprise switches from various generations, as illustrated in Table 1.
Older-generation switches have one-eighth the memory of newer ones, making it more challenging to accommodate groups.</li>
</ol>
<p align="center"><i>Table 1: Switch hardware profile in a Google production network.</i></p>
<table><thead><tr><th align="center">Switch generation</th><th align="center">Memory limit</th></tr></thead><tbody>
<tr><td align="center">Old generation</td><td align="center">4096 port entries</td></tr>
<tr><td align="center">New generation</td><td align="center">32768 port entries</td></tr>
</tbody></table>
<p>We have just seen the impact of precision loss and its root causes.
The next section will explore how to minimize precision loss.</p>
<h1 id="time-for-some-new-group-reduction-algorithms">Time for some new group reduction algorithms</h1>
<p>Recall that the focus of this work is to accurately map TE solutions to groups in switches.
The key to this task lies in an efficient group reduction algorithm.
The existing algorithm used in Google’s production network—called <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2592798.2592803">WCMP TableFitting</a>—is inadequate in certain scenarios.
We have identified opportunities for improvement, thus proposing two new group reduction algorithms.
Some of you might wonder: why two?
It is because each of the two algorithms focuses on a different aspect:
one algorithm (named <em>Direct Reduction</em>) aims to find the most precise weights possible;
the other algorithm (named <em>Greedy Reduction</em>) strives to achieve faster reduction speed by sacrificing some optimality.</p>
<img src="dedup.png" alt="Example of de-duplication." width="800"/>
<p align="center"><i>Figure 3: Illustration of how group de-duplication works.</i></p>
<img src="alloc.png" alt="Example of demand-aware allocation." width="450"/>
<p align="center"><i>Figure 4: Illustration of how demand-aware resource allocation works.</i></p>
<p>I will omit the technical details about the algorithms. Those interested can take a look at our <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi24/presentation/chen-shawn">NSDI paper</a>.
The full implementation of our group reduction algorithms can also be found <a rel="noopener" target="_blank" href="https://shuoshuc.github.io/FabricEval/">on GitHub</a>.
Here, let me just highlight the core ideas that power both algorithms:</p>
<ul>
<li>
<p>Idea 1: <strong>De-duplication</strong>.
Recall that packets destined for different IP addresses are handled by different groups, as Figure 3 shows.
But what if these “different groups” actually appear identical?
Namely, the groups contain the same set of ports, and the weights for each port are also identical.
This situation is not uncommon, especially after group reduction, where groups can become identical.
In such cases, de-duplicating these identical groups and reusing a single instance allows for a more efficient use of the limited memory space available in switches.</p>
</li>
<li>
<p>Idea 2: <strong>Space allocation</strong>.
Groups naturally handle varying amounts of network traffic (demands), as reflected by the arrow sizes in Figure 4.
Consequently, their contribution to the overall precision loss differs.
Allocating a large chunk of memory space to a group serving minimal traffic would be inefficient (see the demand-unaware allocation example in Figure 4).
Therefore, we require a form of performance isolation.
Our idea is to divide the available memory space into multiple allocations,
with the size of each allocation proportional to the traffic volume of a group.
In other words, each group receives dedicated space based on its traffic volume.
This is illustrated by the demand-aware allocation example in Figure 4.</p>
</li>
</ul>
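<p>The two ideas above can be sketched in a few lines of Python (the data structures and numbers are illustrative, not the paper's actual implementation):</p>

```python
def dedup(groups):
    """Idea 1: destination rules whose groups are identical share one
    table instance. Each group is a tuple of (port, weight) pairs."""
    table, pointers = {}, []
    for g in groups:
        key = tuple(sorted(g))
        if key not in table:
            table[key] = len(table)     # install one shared copy
        pointers.append(table[key])     # destination rule -> group
    return table, pointers

def allocate(traffic, total_entries):
    """Idea 2: give each group table space proportional to the traffic
    it carries, with a floor of 1 entry."""
    total = sum(traffic)
    return [max(1, total_entries * t // total) for t in traffic]

# Two rules map to the same reduced group, a third to another:
groups = [(("p1", 3), ("p2", 2)), (("p1", 3), ("p2", 2)), (("p3", 1),)]
table, pointers = dedup(groups)      # 2 unique groups, pointers [0, 0, 1]

# A 90G group gets 9x the space of a 10G group under a 100-entry budget.
budget = allocate([90, 10], 100)     # [90, 10]
```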
<p>These simple techniques work surprisingly well. In the final section, I’d like to show you some evaluation results.</p>
<h1 id="evaluation">Evaluation</h1>
<p>We evaluate the proposed group reduction algorithms using <a rel="noopener" target="_blank" href="https://shuoshuc.github.io/FabricEval/">FabricEval</a>,
a traffic engineering evaluation framework developed specifically for this project.
Both the proposed algorithms and the current approach (WCMP TableFitting)
are implemented inside FabricEval for side-by-side comparison.</p>
<div style="display:flex">
<div style="padding-left:30px;align-self:center;">
<img src="lu.png" width="260"/>
<p align="center"><i>Figure 5: Common and tail link utilization.</i></p>
</div>
<div style="padding-left:30px;align-self:center;">
<img src="solve-speed.png" width="390"/>
<p align="center"><i>Figure 6: Time to complete group reduction.</i></p>
</div>
</div>
<p>The previous section has mentioned that WCMP TableFitting falls short in certain scenarios.
Here is a set of results showing how the new algorithms reduce precision loss compared to WCMP TableFitting.
Figure 5 displays the common case (p50) link utilization and tail utilization (p99 and max) in a
large-scale production-like data center network. Instead of testing in a real
deployment, we model after a Google production network and evaluate with simulation in FabricEval.
FabricEval allows us to run controlled experiments without interrupting user traffic.
The ideal (blue) bar represents the ideal link utilization without precision loss.
WCMP TableFitting exhibits up to 67% higher link utilization than the ideal case.
Both <em>Direct Reduction</em> and <em>Greedy Reduction</em> show no more than 7% higher link utilization than the ideal.
<strong>Our algorithms reduce precision loss by 10x compared to WCMP TableFitting.</strong></p>
<p>Precision loss is just one aspect of the story. Since these algorithms operate in an online environment, we also care about their speed.
Figure 6 illustrates the time they take to complete reduction for a batch of groups.
It suffices to note that the evaluation covers a range of network scenarios, from common to rare, as listed on the x-axis.
One observation is that reducing a set of groups can take anywhere from a tenth of a second to hundreds of seconds,
depending on their complexity.
<strong>Compared to WCMP TableFitting, <em>Greedy Reduction</em> runs 1-2 orders of magnitude faster.
<em>Direct Reduction</em>, on the other hand, performs similarly to WCMP TableFitting.</strong></p>
<p>Of course, what I have presented above are just a few examples from a comprehensive set of experiments.
More information can be found in our <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi24/presentation/chen-shawn">NSDI paper</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Precision loss is an inherent problem when implementing traffic engineering with
limited switch memory resources. Large-scale heterogeneous data center networks
have exacerbated this problem. We propose two new group reduction algorithms that
can map original weight distributions in the TE solution to groups on the switch with minimal precision loss.
Evaluation results show that our algorithms achieve significant improvements
over the existing approach (WCMP TableFitting).
Our algorithms reduce precision loss by 10x and run up to 10x faster than WCMP TableFitting under various network scenarios.</p>
<p><strong>Acknowledgements.</strong> This work is a collaboration with <a rel="noopener" target="_blank" href="https://keqianghe.github.io/">Keqiang He</a> (Airbnb),
<a rel="noopener" target="_blank" href="https://research.google/people/rui-wang/">Rui Wang</a> (Google),
and my advisors <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Esrini/">Srinivasan Seshan</a> and <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Eprs/">Peter Steenkiste</a>.
I would like to thank Adithya Abraham Philip, Zico Kolter, Akshitha Sriraman,
Miguel Ferreira, Wei Bai, Brian Chang, Yige Hong, Weiyang Wang, Min Yee Teh and Nandita Dukkipati
for their feedback. The CMU authors are sponsored by the U.S. Army Contracting
Command under award number W911QX20D0008.
The views and conclusions contained in this document are those of the author
and should not be interpreted as representing the official policies,
either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.</p>
Understanding Setup Times2024-05-08T00:00:00+00:002024-05-08T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/understanding-setup/<h2 id="why-understand-setup-times"><strong>Why Understand Setup Times?</strong></h2>
<h3 id="setup-times-waiting-frustration"><strong>Setup Times = Waiting + Frustration</strong></h3>
<p>Nobody likes waiting in line.
But some of the most frustrating experiences I’ve ever had waiting are when I get in a super long line, peek around to the front, and see that the server isn’t even ready to serve: they’re still setting up!
It’s terrible, and it happens everywhere:</p>
<ul>
<li>You’re on break and you just want to log in to your favorite game, but “the system is experiencing unusually high load” and the login process is taking forever.</li>
<li>You’re at a cookout and you just want to grab a burger, but the grill is still heating up.</li>
<li>You’re dead-tired from being sick and you just want to grab your antibiotics and go to sleep, but the pharmacist has to go through an excruciatingly long badge-in process.</li>
</ul>
<p>The frustrating part of these situations isn’t really the waiting <em>per se</em>; kids learn to wait their turn in kindergarten.
No, the frustrating part is that you’re waiting and somehow it almost feels unnecessary; why weren’t these servers ready before this huge line formed in the first place?
<strong>Why do we spend so much time waiting for servers to set up?</strong></p>
<h3 id="why-we-wait"><strong>Why we wait</strong></h3>
<p>Of course, the answer to “why do we wait?” is the usual answer: <strong>because not-waiting costs money.</strong>
In basically all of these queueing systems, you could just have all of your servers running all the time.
And if the only thing you cared about was how long people spent waiting, then of course you <em>would</em> just have all of your servers running all the time.
But keeping a server on costs money, even if that server is not actively doing work.
That’s why, in many queueing systems, instead of keeping their servers on all the time, system managers will <em>actively scale</em> the number of servers that are on in a dynamic way.
It turns out that if you do this “dynamic scaling” in the right way, then you can cut down on operating costs <em>a lot.</em>
For example, Google’s version of dynamic scaling, called <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/3342195.3387524">Autopilot</a>, was able to cut average resource waste in half, from 46% to 23%.
And keep in mind that when we say “wasted resources,” we’re not just talking about wasted money; we’re also talking about unnecessary CO2 emissions and increased energy demand.</p>
<h3 id="why-we-only-sometimes-wait"><strong>Why we (only sometimes) wait</strong></h3>
<p>Alright, so then why doesn’t everyone implement the most-extreme version of dynamic scaling they can imagine, always keeping their system an inch away from being understaffed?
Well, the answer is simple: nobody likes waiting, and if your customers have to wait for too long, then they’ll take their business elsewhere.
If you want to keep your customers while also conserving resources, you’ve got to balance <em>waiting</em> with <em>wasting</em> when you design your system.
In the best case scenario, you find a design sitting in that optimal sweet spot, where your system uses just enough resources to be sure that your customers don’t spend too long waiting.
Unfortunately, we’re not even close to being able to find that sweet spot, since we haven’t been able to answer one of the most basic questions in this space: <strong>“How does the average waiting time behave in systems with setup times?”</strong></p>
<h2 id="understanding-setup-what-we-know"><strong>Understanding Setup: What we know</strong></h2>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setupVnosetup.png" alt="bar plot comparing setup to no-setup. The blue bar (no-setup) is much smaller than the green bar (Setup). Sim results with average job length of 1 m.s., load of one half, and setup time of 100 m.s. " /></th></tr></thead><tbody>
<tr><td align="center">Setup times hurt.</td></tr>
<tr><td align="center">Systems <em>without</em> setup have much, much lower delay than systems <em>with</em> setup. Simulation results with average job length of 1 ms, load of 0.5, and setup time of 100 ms.</td></tr>
</tbody></table>
<h3 id="setup-times-hurt"><strong>Setup times hurt.</strong></h3>
<p>At this point you might be wondering whether setup times actually hurt performance <em>that</em> much.
The short answer is: yes, in real systems, setup times hurt.
The longer answer is that the effect of setup times on a system is a complex interaction between 1) the length of a typical setup time, 2) the length of a typical <em>service time</em> (the average length of a job), 3) the total number of servers available, and 4) the arrival rate of jobs.
In real systems, where setup times are <em>hundreds</em> —even <em>thousands</em>— of times larger than service times, the average waiting time of a customer can be <em>almost wholly determined</em> by the system’s setup behavior.</p>
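<p>To see the scale of the effect, here is a minimal single-server simulation (my own sketch, much simpler than the multiserver model discussed below): the server turns off whenever it goes idle and must pay a fixed setup time before serving the next arrival. With the parameters from the table above (1 ms jobs, load 0.5, 100 ms setup), the average wait is dominated by the setup behavior.</p>

```python
import random

def avg_wait(n_jobs, arrival_rate, mean_job, setup, seed=0):
    """Mean queueing delay in a single-server FCFS queue where an idle
    server turns off and pays `setup` before the next service."""
    rng = random.Random(seed)
    t = 0.0         # arrival clock
    free_at = 0.0   # when the server next becomes free
    total_wait = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)        # Poisson arrivals
        if t >= free_at:                          # server idled -> off
            free_at = t + setup                   # pay the setup time
        total_wait += free_at - t                 # time spent in queue
        free_at += rng.expovariate(1 / mean_job)  # exponential service
    return total_wait / n_jobs

no_setup = avg_wait(100_000, 0.5, 1.0, 0.0)      # ~1 ms (M/M/1 at load 0.5)
with_setup = avg_wait(100_000, 0.5, 1.0, 100.0)  # far larger: setup dominates
```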
<table><thead><tr><th align="center"><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-model.png"><img src="setup-model.png" width="250"></a></th></tr></thead><tbody>
<tr><td align="center">A multiserver system with setup.</td></tr>
<tr><td align="center">Jobs (blue rectangles) enter into a central queue, where they wait in FCFS order until they are served by one of <em>k</em> servers (white circles with black outline containing 1 of 3 elements). Servers can be <em>off</em> (red “X”), <em>on</em> (blue rectangle), or <em>in setup</em> (green hourglass).</td></tr>
</tbody></table>
<h3 id="but-understanding-setup-is-hard"><strong>But understanding setup is <em>hard.</em></strong></h3>
<p>That said, understanding <em>how</em> and <em>why</em> setup times destroy queueing performance has taken the better part of a century.
Formal study began with <a rel="noopener" target="_blank" href="https://www.jstor.org/stable/167778">Welch’s</a> paper on “exceptional first services.”
In his study of single-server systems with setup times, he obtained a closed-form expression for not just the average waiting time, but the entire waiting time distribution!
In the multiserver case, little was known until the seminal 2014 paper of <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> on “Recursive Renewal Reward.”
In their breakthrough paper, the authors develop and study a model of multiserver setup,
demonstrating that, given a number of servers \( k \), the average waiting time can be computed by solving a system of \( O(k^2) \) quadratic equations.
Since then, all theoretical work on setup times has studied setup times using their model.</p>
<p>However, even considering the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> paper and all the work it spawned, we still don’t <em>really</em> understand setup.
The model they study is very much an <em>approximate</em> model: it assumes that setup times vary much more than they do in practice, which enables them to adapt a wide array of existing queueing analysis techniques to the setup time setting.
Unfortunately, when setup times are large (and they are often <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9582255">thousands of times larger than the average job size</a>), this assumption causes their model to dramatically underestimate the harm caused by setup times, especially when looking at larger systems.
However, even after acknowledging that this model underestimates the harm due to setup, <strong>most setup researchers continue to use the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> model.</strong></p>
<h2 id="what-makes-setup-hard"><strong>What makes setup hard?</strong></h2>
<p>To fully understand why the <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s11134-014-9409-7">Gandhi et al.</a> model is still in use, we first need to discuss what makes setup systems so hard to understand.</p>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-indirect.png" alt="A comic illustrating the indirect nature of setup. In the first panel, the queue starts empty. Then, a job arrives triggering the setup of the server. While the initial job is waiting for the server to turn on, more jobs arrive, all observing the server in setup, the main cause of their delay. However, jobs which arrive after the server is ready still experience additional delay, but do not directly observe why they are delayed." /></th></tr></thead><tbody>
<tr><td align="center">Setup can be invisible.</td></tr>
<tr><td align="center">A comic illustrating the indirect nature of setup. In the first panel, the queue starts empty. Then, a job arrives triggering the setup of the server. While the initial job is waiting for the server to turn on, more jobs arrive, all observing the server in setup —the main cause of their delay. However, jobs which arrive after the server is on still experience additional delay —but do not directly observe <em>why</em> they are delayed.</td></tr>
</tbody></table>
<h3 id="first-reason-the-setup-effect-can-be-invisible"><strong>First reason: The setup effect can be <em>invisible.</em></strong></h3>
<p>There are two reasons why setup is so hard to understand.
First, the harm caused by setup times can be invisible.
For example, if I’m the first person in line when the pharmacist starts badging in, then I can directly see the reason why I’m waiting; I can observe the setup process.
But, while I’m waiting in line, other people will get in line behind me, and when the pharmacist finally finishes badging in, the line might be pretty long.
At that point, everyone in the line knows exactly why we have been waiting for so long: setup times.
But if another customer arrives after the pharmacist has finished badging in, then they’ll have <em>no idea</em> why the line is so long; the harm caused by setup times has become <strong>invisible.</strong></p>
<h3 id="second-reason-servers-interact"><strong>Second reason: Servers <em>interact.</em></strong></h3>
<p>The second hard-to-understand aspect of setup times only emerges when there are multiple servers in play.
Setup gets more complicated in multiserver systems because now the work of one server can influence the setup behavior of another.
For example, suppose there are two pharmacists on hand, but only one is currently badged in and serving customers; the other pharmacist is in the back, filling prescriptions.
If the line gets <em>too</em> long, the pharmacist in the back might think they need to start serving customers, and thus begin the long drawn-out badge-in process.
If, however, the already-serving front pharmacist somehow quickly works through the line, then it might not even make sense for the not-yet-serving back-pharmacist to complete the badge-in process; it might make sense for them to <em>cancel</em> their setup.
Note that something like this would never happen if there was only one pharmacist, since, if there’s only one pharmacist and they’re currently badging in, there’s no way for the line to disappear (assuming everyone doesn’t abandon the line altogether).
More generally, if we scale <em>up</em> when the line is long and scale <em>down</em> when the line is short/empty, then the setup behavior of our servers becomes governed by how quickly the already-on servers are working;
<strong>this interaction between servers makes multiserver setup much, much more complex.</strong></p>
<h2 id="where-we-went-wrong-before"><strong>Where we went wrong before</strong></h2>
<p>The main issue with all the previous research on setup lies in how they dealt with this “interaction” complication.
For context, when studying complicated systems, reality is often way too hard to understand directly.
In order to make progress, researchers need to make simplifying assumptions about various aspects of their system.
Done correctly, these simplifying assumptions can allow us to discard the unnecessary details of a system and draw meaningful conclusions about the parts that actually matter.
That said, if these simplifying assumptions are <em>too unrealistic,</em> then, even if we understand the simplified model well, the conclusions we obtain might end up meaning very little; this is exactly what happens in the previous model of setup.</p>
<h3 id="first-issue-an-unrealistic-but-tractable-model"><strong>First issue: An unrealistic, but tractable model.</strong></h3>
<p>As we discussed, before our work, the state-of-the-art model of multiserver setup made a simplification that turns out to be too unrealistic.
I mean “unrealistic” here in two different ways.
First, the model behavior is unrealistic in the sense that its moment-to-moment behavior doesn’t seem to match what happens in reality.
Without going into too much detail, by assuming that setup times are distributed <em>Exponentially</em>, in the previous model, <strong>the speed of setup ends up <em>scaling</em> with the number of servers in setup.</strong>
For example, if 100 servers are setting up, the first of them finishes setting up ~100 times faster than a lone server would.
In real systems, this couldn’t be further from what actually happens: when you turn on a computer, it goes through a series of steps that take almost the same amount of time, every time.
That said, plenty of useful models contain strange or unrealistic edge cases in their behavior; that, in and of itself, is not enough to prevent a model from being useful.</p>
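<p>To make the scaling artifact concrete: the minimum of <em>n</em> independent Exponential setup times with mean <em>s</em> is itself Exponential with mean <em>s/n</em>, so under the previous model the first of 100 booting servers finishes ~100 times sooner than a lone server would. A quick simulation sketch (the mean setup time of 100 and the trial count are illustrative values, not from the thesis):</p>

```python
import random

# Sketch of the Exponential model's scaling artifact (mean setup time
# s = 100 and the trial count are illustrative, not from the thesis).
def first_setup_completion(n, s=100.0, trials=10_000):
    """Average time until the FIRST of n booting servers finishes,
    when each setup time is an independent Exponential with mean s."""
    total = 0.0
    for _ in range(trials):
        total += min(random.expovariate(1.0 / s) for _ in range(n))
    return total / trials

random.seed(0)
one = first_setup_completion(1)        # ~ s       = 100
hundred = first_setup_completion(100)  # ~ s / 100 = 1
print(one, hundred)
# Under Deterministic setup, both cases would take exactly s = 100.
```

<p>In other words, the Exponential assumption quietly turns a fixed boot-up delay into one that shrinks as more servers boot, which real hardware does not do.</p>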
<h3 id="second-issue-unrealistic-behavior-poor-predictions"><strong>Second issue: Unrealistic behavior => Poor predictions</strong></h3>
<p>However, although we could maybe ignore some unrealistic behavior, the second problem of prior work is much more serious: <strong>previous models <em>vastly</em> underestimate the harm caused by setup times.</strong>
In simulations, we’ve found that the average waiting time predicted by previous models can be orders of magnitude smaller than the true waiting time.
Although the true waiting times and the previous model’s predicted waiting times are (relatively) close when studying small systems, the relative gap between these predictions rapidly widens as we increase the system scale.
Given that we don’t understand exactly how the <em>size</em> of this gap behaves, it’s extremely difficult to know when it’s actually alright to use the previous model.</p>
<table><thead><tr><th align="center"><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/understanding-setup/setup-comparison-smaller.png" alt="bar plot with no setup (blue bars), previous model (orange bars), Reality (green bars), and Our approximation (light blue bars). The no setup bars are very, very small and the previous model varies widely, but the Our approximation bars are very close to their Reality counterparts. Uses same parameters as previous figure." /></th></tr></thead><tbody>
<tr><td align="center">Our results match reality.</td></tr>
<tr><td align="center">A bar plot comparing the predictions of our approximation to the predictions of the Exponential model. Our approximation is much closer to reality! Simulation results with average job length of 1 ms, load of 0.5, and setup time of 100 ms.</td></tr>
</tbody></table>
<h2 id="how-our-results-change-the-game"><strong>How our results change the game</strong></h2>
<p>Our results change the game in three major ways: Compared to previous work, we 1) study a much more realistic model, 2) prove much stronger theoretical results, and 3) greatly improve upon the practical utility of existing work.
Let’s describe each point in a little more detail.</p>
<h3 id="our-model-is-more-realistic"><strong>Our model is more realistic.</strong></h3>
<p>First, let’s talk about how the setup process in our model is more realistic than in previous models.
As we noted before, there’s a big difference in performance between systems with and without setup times.
But previous models assume that setup times are distributed Exponentially, an assumption which leads them to dramatically underestimate the harm caused by setup times.
In particular, this Exponential assumption makes it so that, when more servers set up, the setup process happens faster.
In contrast, in our model, we make the setup time <em>Deterministic:</em> setup times take the same amount of time, every time.
For example, if booting up a single server takes a minute, then booting up 100 servers will <em>also</em> take a full minute.
In other words, we study setup times as they actually occur in <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/9582255">real systems</a>.</p>
<h3 id="our-results-are-stronger"><strong>Our results are stronger.</strong></h3>
<p>Second, our main results are stronger than most (if not all) previous results on multiserver setup.
Our two main results investigate the average waiting time in our new, more realistic model.
In particular, we give both an upper bound and a lower bound on the average wait in our model, and also show that these upper and lower bounds differ by at most a multiplicative factor.
Moreover, the bounds we derive are just explicit closed-form formulae; no additional computation is needed (see my <a rel="noopener" target="_blank" href="https://jalaniw.github.io">thesis document</a> for the details).
While our results are enhanced by our more realistic model, these are also the <strong>first closed-form results ever</strong> for any finite-server system with setup times.</p>
<h3 id="our-results-are-more-practically-useful"><strong>Our results are more practically useful.</strong></h3>
<p>These bounds also give rise to our practical contribution: by combining the analysis of our upper and lower bounds, we also construct an easy-to-compute and extremely accurate approximation to the average waiting time.
And while you’ll need to look at my thesis to see the full formula, this approximation goes beyond the state-of-the-art in three important respects:</p>
<ol>
<li>First, as we’ve already stated, our approximation gives extremely accurate predictions, whereas the previous state-of-the-art model underpredicts waiting times by orders of magnitude.</li>
<li>Second, our approximation is fast/cheap to compute. Since it is a simple rational function of the system parameters, it can be easily computed, even by hand. By contrast, in order to generate a single prediction, the previous state-of-the-art required one to solve a large system of quadratic equations.</li>
<li>Third, our approximation gives intuition. As discussed, since our approximation is a simple formula, one can directly observe <em>how</em> and <em>why</em> the waiting time will increase or decrease in response to some alteration in the system parameters. On the other hand, it’s difficult to anticipate someone getting a lot of intuition from the previous state-of-the-art’s complicated system of equations.</li>
</ol>
<p>In other words, our approximation is <em>better,</em> <em>faster,</em> and <em>stronger.</em> These three aspects of the approximation together make it <em>much, much easier</em> to design multiserver systems with both performance <em>and</em> efficiency in mind.</p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>Given the above, you’re probably chomping at the bit to learn how we managed to prove such uniquely powerful results,
especially given that the model we study has <em>Deterministic</em> setup times, which are notoriously difficult to analyze with existing methods.
Unfortunately, we don’t have time here to properly discuss the method that we developed to analyze our complicated Deterministic model.
However, those interested will be able to find an in-depth description of the <strong>MIST</strong> method in my thesis document, available on my <a rel="noopener" target="_blank" href="https://jalaniw.github.io">website</a> after my defense.
And don’t worry, it’ll definitely be posted by that day — no need to factor in setup times.</p>
T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction
2024-05-05
https://www.cs.cmu.edu/~csd-phd-blog/2024/t2fpv/
<p>As AI technology advances, more and more autonomous robots are being tasked with
navigating among people in shared environments. Such applications span academia
and industry, including sidewalk delivery robots, robotic museum
guides, and automated room service in hotels. In order to have robust,
socially-compliant navigation policies, these robots must be adept at predicting
pedestrian motion in busy environments. In these environments,
the most natural setting for humans and robots is an egocentric, first-person
view (FPV), where a camera is placed on the robot itself as it moves around, as
highlighted in the figure below:</p>
<figure style="text-align:center;">
<img src=./sidewalk_robot.png width="400" alt="Coco Delivery sidewalk robot" title="Coco Delivery sidewalk robot">
<figcaption style="margin-top:10px">A Coco Delivery robot, having to navigate among humans.</figcaption>
</figure>
<p>However, the vast majority of prior work in pedestrian trajectory prediction
instead relies on third-person sensing, where cameras are mounted
on infrastructure (such as rooftops) or on a stationary drone hovering
above the scene in order to record naturalistic behavior of humans.
These bird’s-eye view (BEV) recordings are then processed into datasets
comprising ground-truth examples of how humans navigate among each other, and
used to train downstream trajectory prediction models.
So, <strong>why has this been the standard approach</strong>, and <strong>why is it problematic</strong>?</p>
<h2 id="background">Background</h2>
<p>Pedestrian motion prediction is important as it is used to inform a robot’s
planning module as to other peoples’ intents and likely paths. This is used not
just to avoid collisions, but also to ensure social compliance—that is, moving
in a way so as to not cause disruption, discomfort, or general inconvenience to
humans in the scene.</p>
<p>While heuristic models, such as social forces, have long been utilized, the
advancement of deep learning techniques has dominated the recent
state-of-the-art (SOTA) [8, 9]. In these approaches, a machine learning model is
trained on a dataset of recorded human behavior as a form of static
learning-from-demonstration. These datasets are considered trajectory datasets,
containing at minimum the coordinates over time (or trajectory) of all people
(or “agents”) in a given scene. High quality dataset curation is thus paramount
for high quality model performance.</p>
<p>A majority of existing datasets for this problem utilize a top-down perspective,
such as the ETH/UCY [1] collection of pedestrian datasets, shown below. One
reason for this approach is the ease of annotation. In a BEV perspective,
peoples’ trajectories can be easily tracked in ground-plane coordinates over
time. This is in stark contrast to FPV, where 3D segmentation and detection
algorithms are required to annotate observed pedestrians. Furthermore, using BEV
eliminates much of the occlusion problem, where people may be impossible to
annotate when behind other people, buildings, or objects. This occlusion can
lead to tracking errors, like misassociation or losing somebody’s trajectory
altogether, which can require data imputation to fill the missing values.
Together, these errors and noise from FPV sensing are denoted as “FPV errors”.</p>
<figure style="text-align:center;">
<img src=./hotel_bev.png width="400" alt="Hotel example from ETH/UCY" title="Hotel example from ETH/UCY">
<figcaption style="margin-top:10px">ETH/UCY dataset example in Hotel setting.</figcaption>
</figure>
<p>Another problem that the BEV perspective addresses is ensuring naturalistic
human behavior. To collect data in FPV, either robots or humans themselves have
to wear cameras when navigating among others. This can lead to several undesired
psychological effects, such as the Hawthorne effect, where people’s behavior can
change when they know they’re being observed, as well as the novelty effect,
where humans behave differently when first exposed to new technology [10].
Thus, while FPV datasets for pedestrian trajectory prediction exist, they
both contain FPV errors and offer no guarantees of naturalistic behavior in the
first place.</p>
<p>However, BEV data has a very serious flaw: in nearly all deployed
situations, a robot does <strong>not</strong> have access to top-down, perfect information of
others in the scene. Training a prediction policy which <strong>relies</strong> on having
this information, rather than having to deal with FPV errors, causes prediction
models to be unrealistically effective, leading to a false sense of confidence
in their abilities.</p>
<h2 id="our-approach-trajectories-to-first-person-view-t2fpv">Our Approach: Trajectories to First Person View (T2FPV)</h2>
<p>To address the above limitations, we propose “Trajectories to First Person View”
(T2FPV), a method for constructing an FPV version of data from a trajectory-only
dataset. This process entails starting with a BEV-recorded dataset and then
performing a high-fidelity simulation from each person’s FPV perspective. We use
this approach to generate, annotate, and release a version of the popular
ETH/UCY dataset in this new perspective. Then, we conduct SOTA detection and
tracking therein to get realistic partial perception from each person’s view. In
this setting, we observe the effects of FPV errors, and develop a module to
address them by refining the initial imputation of missing data in an end-to-end
manner with trajectory prediction. This “Correction of FPV Errors” (CoFE)
module decreases prediction displacement errors by more than 10% on average when
compared to all tested imputation and prediction approach combinations.</p>
<h2 id="constructing-an-fpv-dataset">Constructing an FPV Dataset</h2>
<p>We begin by leveraging the SEANavBench [2] simulation environment, consisting of
the five high-fidelity pre-modeled locations, or “folds”, within the ETH/UCY
dataset. We then replay the recorded data by attaching a simulated camera to
each pedestrian, requiring a few simplifying assumptions: pedestrians are all
roughly the same height, are rendered with randomly selected human models, and
gaze in the direction they are moving. We render videos for
each person and output ground truth (GT) annotations at each frame, consisting of the
list of which other people are visible at any given time.</p>
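<p>As a rough illustration of what these per-frame visibility annotations encode, the sketch below marks an agent visible when it lies inside the camera’s horizontal field of view and a maximum range. The function name, the 90-degree FOV, and the 15 m range are hypothetical choices, and occlusion is ignored for brevity:</p>

```python
import math

# Hypothetical visibility test (the 90-degree FOV, 15 m range, and all
# names are illustrative assumptions; occlusion is ignored for brevity).
def visible(cam_pos, cam_heading_rad, agent_pos,
            fov_rad=math.radians(90), max_range=15.0):
    dx, dy = agent_pos[0] - cam_pos[0], agent_pos[1] - cam_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_range:
        return False
    # Smallest angle between the camera heading and the bearing to the agent.
    bearing = math.atan2(dy, dx)
    diff = abs((bearing - cam_heading_rad + math.pi) % (2 * math.pi) - math.pi)
    return diff <= fov_rad / 2

print(visible((0, 0), 0.0, (5, 1)))   # in front, within FOV
print(visible((0, 0), 0.0, (-5, 0)))  # behind the camera
```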
<p>Next, we conduct SOTA detection and tracking on these rendered videos, in order
to emulate realistic perception which a deployed social navigation robot uses.
We employ an off-the-shelf object detector, DD3D [3], as well as a very
effective probabilistic tracker [4]. We modify these components to improve performance
on our specific task, for example by altering the tracker’s matching metric and
adjusting the detector’s feature map thresholds. As is common for ETH/UCY evaluation, we
train one model for each of the five folds as a test set, using the other four
folds as training and validation in a leave-one-out manner. Overall, we find
that this approach performs reasonably well on standard metrics including
Average Precision and Average Multi-Object Tracking Accuracy.</p>
<p>Finally, we put together these outputs into an FPV dataset. We start with the
standard scene segmentation steps on ETH/UCY, where scenes are considered in a
fixed-length, sliding window manner over the original data, and only scenes with
at least two agents present at the same time are kept. These scenes consist of
20 timesteps at 2.5 frames per second, where the first eight are kept as
“history” and the next 12 are considered the “future” to be predicted. We
utilize the Hungarian matching algorithm [11] to associate together the GT set of
visible agents (from our simulation annotations directly) with the observed set
of people from detection and tracking. Where a given BEV scene has <em>N</em> people
moving around at the same time, we thus create <em>N</em> FPV variations of this scene,
from each agent’s perspective. Importantly, <strong>these scenes contain FPV errors!</strong></p>
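<p>The association step amounts to a minimum-cost assignment between the two sets of positions. The post uses the Hungarian algorithm [11]; for the tiny scene below (all coordinates invented), brute force over permutations finds the same optimal matching:</p>

```python
import itertools
import math

# Toy association between ground-truth agents and tracked detections
# (coordinates invented). The post uses the Hungarian algorithm; for a
# tiny scene, brute force over permutations finds the same minimum-cost
# assignment.
def associate(gt_positions, tracked_positions):
    n = len(gt_positions)
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(gt_positions[i], tracked_positions[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))

gt = [(0.0, 0.0), (5.0, 0.0)]
tracks = [(5.1, 0.2), (0.2, -0.1)]
print(associate(gt, tracks))  # [(0, 1), (1, 0)]: GT 0 matches track 1
```

<p>The Hungarian algorithm computes the same optimum in polynomial time, which matters once a scene contains more than a handful of agents.</p>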
<p>This entire process is highlighted in the figure below, showing how a single top-down scene
produces many first-person scenes. The heading titles such as “Sec IV-A” refer to sections in
our reference paper, for further reading [5].</p>
<figure style="text-align:center;">
<img src=./overview.png width="800" alt="T2FPV process overview" title="T2FPV process overview">
<figcaption style="margin-top:10px">T2FPV process overview; real-world top down trajectories are transformed to a first-person version in simulation, with limited (i.e., more realistic) perception.</figcaption>
</figure>
<h2 id="improving-robustness-to-perception-errors-cofe">Improving Robustness to Perception Errors: CoFE</h2>
<p>As discussed above, one key type of FPV error is that of missing observations in
the history portion of a trajectory due to detection and tracking errors.
This requires the imputation of missing data points for most trajectory prediction
methods. Although many prior works leverage simple imputation approaches like
linear interpolation and exponential smoothing, there are more sophisticated,
SOTA deep learning imputation techniques such as NAOMI [6]. However, these
approaches still rely on assumptions that are unrealistic for human motion
prediction: 1) data points are missing in a random manner; and 2) data points
observed around missing values can be trusted. The first assumption does not hold
because data is missing pathologically due to errors in the perception system,
whereas the second fails because surrounding data points also incur positional
estimation errors from perception.</p>
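<p>For reference, the linear-interpolation baseline mentioned above can be sketched as follows; the <code>None</code>-based encoding of missing points is an assumption made for this illustration:</p>

```python
# Linear-interpolation imputation over a 2D history track, with missing
# observations encoded as None (an assumption made for this sketch).
def linear_impute(track):
    track = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for i in range(len(track)):
        if track[i] is None:
            lo = max((k for k in known if k < i), default=None)
            hi = min((k for k in known if k > i), default=None)
            if lo is None:        # nothing observed before: copy forward
                track[i] = track[hi]
            elif hi is None:      # nothing observed after: copy backward
                track[i] = track[lo]
            else:                 # interpolate between nearest neighbors
                t = (i - lo) / (hi - lo)
                track[i] = tuple(a + t * (b - a)
                                 for a, b in zip(track[lo], track[hi]))
    return track

print(linear_impute([(0.0, 0.0), None, None, (3.0, 3.0)]))
# [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
```

<p>Note how this baseline embodies exactly the two assumptions above: gaps are filled as if they occurred at random, and the observed endpoints are trusted completely.</p>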
<p>Therefore, we propose to incorporate a new module to sit in between the
imputation and prediction steps of the pipeline, consisting of a neural network
trained end-to-end (E2E) with the downstream prediction model. This “Correction
of FPV Errors”, or CoFE, module is similar to previous recurrent neural network
(RNN) prediction approaches. The model first takes in an initial guess at
imputation from an upstream method (e.g. NAOMI). Then, it proceeds in an
encoder-decoder manner, where an encoder RNN builds a hidden state
representation of this input sequence. A decoder RNN then sequentially outputs
<strong>refinements</strong> of the trajectory, before passing it on to the trajectory
prediction phase. To encourage this module to perform such refinements, a
simple mean-square error (MSE) loss objective is utilized, between the refined
history track (i.e., the output of the decoder) and the ground truth associated
history. The refined trajectory is used <strong>instead of</strong> the trajectory produced
directly from the detection and tracking modules, to train both the CoFE module
as well as the trajectory prediction model itself in an E2E fashion, along with
the original loss objective of the prediction model. </p>
<p>The full architecture is visualized in the figure below, with a deeper
discussion of each component explained in our paper.</p>
<figure style="text-align:center;">
<img src=./cofe.png width="800" alt="CoFE module architecture" title="CoFE module architecture">
<figcaption style="margin-top:10px">CoFE module architecture.</figcaption>
</figure>
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>We implemented several representative approaches for the ETH/UCY trajectory
prediction task, covering key techniques in human motion prediction:
variational prediction (VRNN [7]), social awareness (A-VRNN [8]), and goal
conditioning (SGNet [9]). For data imputation, we also incorporate three
relevant approaches, including the commonly-used linear interpolation,
exponential smoothing, and the aforementioned NAOMI deep learning method.</p>
<p>We utilized the standard leave-one-out evaluation methodology for ETH/UCY, where
one model is trained for each of the five folds and each imputation and
prediction approach combination. We trained one version of the prediction
approach with our CoFE module and objective, and one version without it.
Finally, we used the standard metrics in the trajectory prediction task of
Average and Final Displacement Errors (ADE and FDE). These measure the L2-distance
between the predicted future path and ground truth for the entire predicted
portion and just the last time step, respectively. The table below summarizes
our results, where the final column refers to the average of ADE / FDE
respectively over the five folds. As shown, all combinations of approaches
performed better with our CoFE module than without, by an average of more than
10%.</p>
<figure style="text-align:center;">
<img src=./results.png width="400" alt="Experiment results" title="Experiment results">
</figure>
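<p>The two metrics are straightforward to compute. In this sketch (with made-up trajectories), ADE averages the per-timestep L2 error while FDE keeps only the final timestep:</p>

```python
import math

# ADE / FDE on a single made-up prediction: ADE averages the L2 error
# over every future timestep; FDE keeps only the last one.
def ade_fde(pred, gt):
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, gt)
print(ade, fde)  # ADE = (0 + 1 + 2) / 3 = 1.0, FDE = 2.0
```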
<p>To gain further insight into the behavior of CoFE, we conducted an ablation
study and various qualitative analyses. In the ablation, we find that the E2E
training is essential for the improved performance, as is the effect of only
refining the missing, imputed data points rather than surrounding observed
points as well. We include an example of the qualitative analysis below:</p>
<figure style="text-align:center;">
<img src=./qualitative.png width="1000" alt="Qualitative example" title="Qualitative example">
<figcaption style="margin-top:10px">Qualitative example of CoFE effectiveness in improving predictions.</figcaption>
</figure>
<p>In this example, NAOMI by itself trusts surrounding points in the data too much,
performing a simple extrapolation. When paired with CoFE, the approach is more
effective at capturing underlying patterns in the data, correcting the FPV
errors and resulting in better prediction.</p>
<h2 id="future-work">Future Work</h2>
<p>While our T2FPV approach and CoFE module are effective, we note here some
potential avenues of future improvements. Although the simulation platform we
used, SEANavBench, is a high-fidelity environment, further effort in improving
its realism would be useful. Realism could be enhanced not just by increasing
the 3D-modeling asset and animation qualities, but also by further improving
alignment between the reproduced scenery and the original locations.
Additionally, for associating detection and tracking trajectories with their
corresponding GT tracks, we relied on Hungarian matching on our tracking output
directly, which incurred a number of identity association errors.
Incorporating recent works on affinity-based techniques for re-tracking
algorithms could be a promising way to help with this problem and even further
reduce FPV errors. One further thread of research is to apply these
techniques to related domains where FPV sensing is required, such as autonomous
driving. While this related field has its own challenges, considering imputation
and prediction together to account for sensing errors could be a promising
direction therein.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In existing work, pedestrian trajectory prediction has mainly been studied in an
unrealistic BEV perspective. In this work, we introduce a more realistic
first-person view trajectory prediction problem where agents need to make
predictions based on partial, imprecise information. We present T2FPV, a method
for generating high-fidelity FPV datasets for pedestrian navigation by
leveraging existing real-world trajectory datasets, and use it to create and
release an FPV version of ETH/UCY. We also propose and evaluate CoFE, a module
that successfully refines imputation of missing data in an end-to-end manner
with trajectory prediction algorithms to reduce FPV errors. Therefore, we argue
that incorporating more realism throughout the perception pipeline is an
important direction to move toward in enabling robots to navigate in the real
world. For more information, please see our paper [5].</p>
<h2 id="references">References</h2>
<p>[1] Pellegrini, Stefano, et al. “You’ll never walk alone: Modeling social behavior for multi-target tracking.” 2009 IEEE 12th international conference on computer vision. IEEE, 2009. <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/document/5459260">https://ieeexplore.ieee.org/document/5459260</a></p>
<p>[2] Tsoi, Nathan, et al. “An approach to deploy interactive robotic simulators on the web for hri experiments: Results in social robot navigation.” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2012.12336">https://arxiv.org/abs/2012.12336</a></p>
<p>[3] Park, Dennis, et al. “Is pseudo-lidar needed for monocular 3d object detection?.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2108.06417">https://arxiv.org/abs/2108.06417</a></p>
<p>[4] Chiu, Hsu-kuang, et al. “Probabilistic 3d multi-object tracking for autonomous driving.” arXiv preprint arXiv:2001.05673 (2020). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2001.05673">https://arxiv.org/abs/2001.05673</a></p>
<p>[5] Stoler, Benjamin, et al. “T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction.” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2209.11294">https://arxiv.org/abs/2209.11294</a></p>
<p>[6] Liu, Yukai, et al. “Naomi: Non-autoregressive multiresolution sequence imputation.” Advances in neural information processing systems 32 (2019). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/1901.10946">https://arxiv.org/abs/1901.10946</a></p>
<p>[7] Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems 28 (2015). <a rel="noopener" target="_blank" href="https://arxiv.org/abs/1506.02216">https://arxiv.org/abs/1506.02216</a></p>
<p>[8] Bertugli, Alessia, et al. “AC-VRNN: Attentive Conditional-VRNN for multi-future trajectory prediction.” Computer Vision and Image Understanding 210 (2021): 103245. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2005.08307">https://arxiv.org/abs/2005.08307</a></p>
<p>[9] Wang, Chuhua, et al. “Stepwise goal-driven networks for trajectory prediction.” IEEE Robotics and Automation Letters 7.2 (2022): 2716-2723. <a rel="noopener" target="_blank" href="https://arxiv.org/abs/2103.14107">https://arxiv.org/abs/2103.14107</a></p>
<p>[10] Irfan, Bahar, et al. “Social psychology and human-robot interaction: An uneasy marriage.” Companion of the 2018 ACM/IEEE international conference on human-robot interaction. 2018. <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3173386.3173389">https://dl.acm.org/doi/abs/10.1145/3173386.3173389</a></p>
<p>[11] Kuhn, Harold W. “The Hungarian method for the assignment problem.” Naval research logistics quarterly 2.1‐2 (1955): 83-97. <a rel="noopener" target="_blank" href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109">https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109</a></p>
Boot: accelerating training data generation for self-driving database management systems
2024-04-24
https://www.cs.cmu.edu/~csd-phd-blog/2024/boot/
<h1 id="tl-dr">TL;DR</h1>
<p>Background:</p>
<ol>
<li>Optimizing a database management system (DBMS) is difficult; the best configuration
depends on time-varying factors like its environment and workload.</li>
<li>Researchers have developed machine learning (ML) models that outperform humans at these tasks.</li>
<li>However, the cost (e.g., training time, dollars) of obtaining these models makes them
impractical for establishing an end-to-end database tuning loop.</li>
<li>To build these ML models, the DBMS must gather training data by collecting telemetry as it
executes a workload.</li>
<li>ML techniques and model training keep getting faster and better. Workload execution has not
changed much, in comparison.
<ul>
<li>Historically, it took weeks to collect data and weeks to train models – both processes were
slow, so we just had to accept that models are hard to get.</li>
<li>Today, weeks to collect data and minutes to train models – workload execution is the main
bottleneck now!</li>
</ul>
</li>
</ol>
<p>Big idea:</p>
<ol>
<li>Training data collection is slow because the DBMS couples it to workload execution.</li>
<li>However, training data collection <strong>does not care about query results</strong>; it fundamentally
differs from workload execution.</li>
<li>Take shortcuts during workload execution for faster training data generation!
<ul>
<li>Workloads are repetitive - avoid executing similar queries twice.</li>
<li>Operators are repetitive - execute less of an operator if there is enough data.</li>
</ul>
</li>
<li>Obtain up to 225x speedups by eliminating repetition at the:
<ul>
<li>Inter-query level (macro-acceleration)</li>
<li>Intra-query level (micro-acceleration)</li>
<li>At a modest cost to model accuracy, but tuning algorithms are surprisingly robust.</li>
</ul>
</li>
</ol>
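<p>The macro-acceleration idea above (skipping queries that repeat an already-seen shape) can be sketched as template-based deduplication; the regex-based fingerprint below is a simplification for illustration, not Boot’s actual mechanism:</p>

```python
import re

# Toy fingerprinting: two queries share a template if they differ only
# in their literal constants (this regex is a simplification, not
# Boot's actual mechanism).
def template(sql):
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    return re.sub(r"\b\d+\b", "?", sql)  # numeric literals

workload = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT * FROM users WHERE id = 42",  # same template: skip it
    "SELECT name FROM users WHERE id = 1",
]
seen, executed = set(), []
for q in workload:
    t = template(q)
    if t not in seen:
        seen.add(t)
        executed.append(q)
print(executed)  # the second id-lookup is never executed
```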
<h1 id="introduction">Introduction</h1>
<blockquote>
<p>Database management systems are hard to configure by hand.<br />
Machine learning models perform better.</p>
</blockquote>
<p></p>
<p>Database management systems (DBMSs) are challenging to optimize correctly because their ideal
configuration depends on their workload, database contents, hardware, and run-time environment,
which all fluctuate over time.
To address this difficulty, researchers have designed methods for automated DBMS configuration, in
<a rel="noopener" target="_blank" href="https://ottertune.com/blog/history-ottertune-research-part1">one case</a> obtaining 20% more
performance than the most skilled human.</p>
<p>The unifying goal of such research is to develop a
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf"><em>self-driving</em> DBMS</a>
that configures, tunes, and optimizes itself automatically.
Given a target objective function (e.g., latency, throughput, cost), a self-driving DBMS aims to
find the best configuration for its objective autonomously.
It <a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2021/p3211-pavlo.pdf">achieves this</a>
by relying on machine learning (ML) models that predict its run-time behavior under
different configurations.
Such models allow the DBMS to evaluate whether a candidate configuration is beneficial without
executing the queries in its workload, which would be too expensive.
For example, the DBMS may need hours to complete a computationally intensive SQL query.
If the DBMS only needs the query’s run-time, it can achieve significant time savings by using an ML
model to predict the query’s latency instead of running it.</p>
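<p>As a toy illustration of this kind of model (every number and the single plan feature below are invented; real behavior models use many features and richer ML techniques), a one-feature least-squares fit can map a plan statistic to an estimated latency without executing the query:</p>

```python
# One-feature least-squares fit from a plan statistic (estimated rows
# scanned) to observed latency; every number here is invented.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rows = [1e3, 1e4, 1e5, 1e6]          # hypothetical telemetry
latency_ms = [2.0, 11.0, 101.0, 1001.0]
slope, intercept = fit_line(rows, latency_ms)
predicted = slope * 5e5 + intercept  # estimate without running the query
print(predicted)
```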
<p>To build its ML models, the DBMS collects
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2022/moddm074-butrovich.pdf"><em>training data</em></a>
comprised of database metadata (e.g., optimizer statistics) and run-time telemetry
(e.g., the latency of an operator in its query plan).
It generates this data by observing itself as it executes a
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2018/mod435-maA.pdf">representative workload</a>,
such as an application trace of SQL queries.
It then constructs its models by applying ML
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2021/ma-sigmod2021.pdf">techniques</a>.</p>
<blockquote>
<p>Obtaining training data for ML models is expensive.<br />
Especially when training data collection is not a one-time cost.</p>
</blockquote>
<p></p>
<p>However, the high cost of obtaining these models makes them impractical for real-world deployments.
In the past, the primary contributors to this cost were training data collection and model
building (<em>collection time</em> and <em>training time</em>, respectively).
But although ML techniques continue to improve and model construction becomes faster, training data
collection speeds have largely remained the same.
Today, collection time accounts for over 99% of the total (e.g., weeks to collect data,
minutes to train models).<br />
<strong>Problem 1: Training data is expensive to collect!</strong></p>
<p>Furthermore, unlike other ML domains that dismiss training data collection as a one-time cost
(e.g., LLM researchers share model weights because their training data does not change as much),
an autonomous DBMS needs training data
<a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2023/p27-lim.pdf">specific to its workload and configuration</a>.
Reusing models from other deployments is challenging because the database’s contents, hardware,
and system configuration influence its training data labels.
For example, the speed at which the DBMS’s sequential scan operator reads tuples from disk depends
on its hardware and configured buffer pool size.
Additionally, even if the DBMS already has models, they are often invalidated because of workload
drift, schema changes, dataset growth, and more.<br />
<strong>Problem 2: Training data is difficult to reuse!</strong></p>
<p>Consequently, an autonomous DBMS must regularly collect training data from scratch to maintain
its ML models.
Due to the high cost and frequency of training data generation, it often spends more time
collecting training data than improving its configuration.</p>
<h1 id="insight">Insight</h1>
<h2 id="decoupling">Decoupling</h2>
<blockquote>
<p>Training data generation is slow because it is coupled to regular query execution.<br />
This coupling is unnecessary, so we can obtain the same amount of training data much faster.</p>
</blockquote>
<p></p>
<p>Training data generation is slow because it builds off the DBMS’s regular query execution.
To execute a SQL query, the DBMS uses its internal statistics to search for a <em>query plan</em> that
efficiently computes the query’s answer.
A query plan is a tree composed of the DBMS’s <em>operators</em> (e.g., sequential scans read a table
from disk, hash joins combine the output of their children).
The DBMS executes the query plan by sending all <em>tuples</em> (i.e., rows of data) from the children
nodes to their parent nodes, which compute the query’s result.</p>
<p>But training data generation differs from regular query execution because it concerns telemetry
(e.g., timing information), not results (i.e., the tuples returned from a SQL query).
A DBMS <strong>does not need accurate query results</strong> during this process.
Therefore, it can accelerate its training data production if it can infer a workload’s telemetry
without executing the workload to completion.
This inference is possible because a DBMS’s workload is repetitive at the query and operator level.</p>
<h2 id="query-repetition">Query Repetition</h2>
<blockquote>
<p>For most workloads, the DBMS executes the same query template repeatedly.<br />
Many of those executions are redundant for training data generation.</p>
</blockquote>
<p></p>
<p>Most accesses to a DBMS are <a rel="noopener" target="_blank" href="https://db.cs.cmu.edu/papers/2018/mod435-maA.pdf">programmatic</a>.
Software (e.g., dashboards, reporting tools, web applications) generates most SQL queries
from similar <em>query templates</em>, differing only in their input parameters.
A query template is a normalized SQL string with its constants removed. For example,</p>
<p> Query: <em>SELECT name FROM pets WHERE kind = ‘dog’ and name = ‘terrier’</em><br />
Template: <em>SELECT name FROM pets WHERE kind = ‘$1’ and name = ‘$2’</em></p>
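<p>As a rough illustration of this normalization (not Boot’s actual implementation), constants in a SQL string can be replaced with numbered placeholders:</p>

```python
import re

def fingerprint(sql: str) -> str:
    """Normalize a SQL string into a template by replacing string
    constants with numbered placeholders ($1, $2, ...)."""
    counter = 0
    def repl(_match):
        nonlocal counter
        counter += 1
        return f"'${counter}'"
    # Replace single-quoted string literals; a fuller implementation
    # would also handle numeric literals and IN-lists.
    return re.sub(r"'[^']*'", repl, sql)

query = "SELECT name FROM pets WHERE kind = 'dog' and name = 'terrier'"
print(fingerprint(query))
# SELECT name FROM pets WHERE kind = '$1' and name = '$2'
```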
<p>Given the same template, a DBMS often generates an identical query plan; some DBMSs even cache
query plans by their corresponding template.
That is, they reuse the same query plan for different instantiations of a query template, leading
to repetition in query execution.</p>
<table><thead><tr><th align="center">Workload</th><th align="center"># Queries</th><th align="center"># Templates</th><th align="center">Repetition</th></tr></thead><tbody>
<tr><td align="center">Admissions</td><td align="center">2564M</td><td align="center">4060</td><td align="center">627k X</td></tr>
<tr><td align="center">BusTracker</td><td align="center">1223M</td><td align="center">334</td><td align="center">3.66M X</td></tr>
<tr><td align="center">MOOC</td><td align="center">95M</td><td align="center">885</td><td align="center">107k X</td></tr>
<tr><td align="center">TPC-H</td><td align="center">20000</td><td align="center">22</td><td align="center">909 X</td></tr>
<tr><td align="center">DSB</td><td align="center">11440</td><td align="center">52</td><td align="center">220 X</td></tr>
<tr><td align="center">Stack</td><td align="center">5000</td><td align="center">25</td><td align="center">200 X</td></tr>
</tbody></table>
<p style="text-align: left;">
<b>Table 1, Query Repetition:</b>
<em>
The number of queries and query templates in the workloads used in recent ML approaches for
database tuning.
</em></p>
<p>We observe that both real-world application traces and synthetic benchmarks exhibit high
repetition – the DBMS executes a small number of query templates hundreds, thousands, or even
millions of times.
This repetition is necessary during regular query execution because the DBMS must return
accurate results.
However, during training data generation, eliminating this repetition is an opportunity to achieve
significant speedups.</p>
<p>What is needed is a way to determine <em>when</em> and <em>why</em> the DBMS should execute a query again so that
it can execute fewer queries during training data generation.
For the purpose of obtaining training data for its ML models,
it should skip all queries that do not contribute valuable training data
(i.e., exhibit new behavior).
The DBMS should only re-execute a query if it produces substantially different telemetry (e.g.,
run-time) from its past parameterizations.</p>
<h2 id="operator-repetition">Operator Repetition</h2>
<blockquote>
<p>Across all queries, the DBMS only uses a small number of operators.<br />
If a query is slow, skip the expensive operators – they already appear elsewhere.</p>
</blockquote>
<p></p>
<p>For every unskipped query that the DBMS must execute, there remain opportunities to identify
and eliminate redundancies in its query plan.
Recall that a query plan is composed of a DBMS’s operators.
However, because a typical DBMS has only a few dozen to a hundred operators, the repetition in
observing a particular operator’s behavior is even more frequent than that of entire queries in
Table 1.
Eliminating this repetition is especially important when the DBMS is exploring new configurations
that turn out to be bad, causing queries to take a long time to complete.
For example, a query that runs slowly because of missing indexes will remain slow for the rest of
its execution.
Yet the importance of that particular query to the training data corpus
is minimal because it spends most of its time performing disk reads in its highly predictable
sequential scan operators.
Hence, it is essential to reduce the time the DBMS spends executing operators after they
have become predictable.</p>
<p>The reason this is possible is that operators are independent of each other; a DBMS operator’s
behavior depends only on its input tuple(s).
Existing research relies on this independence to build ML models for the DBMS.
Our key insight is that integrating such modeling assumptions earlier into the training data
generation process enables early termination in query execution.</p>
<h2 id="aside-the-need-for-models">Aside: The Need for Models</h2>
<p>Exploiting repetition allows us to obtain cheap training data for the DBMS’s models,
but such techniques may not be suitable for replacing the models themselves
(e.g., directly running accelerated queries during database tuning).</p>
<p>We aim to build <em>bootstrap</em> models, which are fast and cheap but not necessarily as precise.
These models allow the DBMS to begin its tuning loop.
However, as that happens, the DBMS can spend more compute to build more complex models (e.g.,
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2403.02286">hierarchical models</a>) on the same training data.
The higher precision of these complex models can improve the quality of its tuning recommendations.</p>
<p>Additionally, most models only require query plans as an input.
They do not require the DBMS to be running on the same machine.
This allows the DBMS to deploy and parallelize its tuning algorithms across different machines
(e.g., with GPUs for faster inference) without paying the high cost of provisioning separate copies
of the DBMS’s hardware and data.</p>
<p>Because keeping modeling as a separate step in the tuning pipeline confers various benefits,
we limit our scope to collecting training data faster.</p>
<h1 id="solution">Solution</h1>
<p>To summarize the discussion above:</p>
<ul>
<li>The DBMS generates training data by executing queries.</li>
<li>Executing queries is a bottleneck for ML-based DBMS automation.</li>
<li>Because training data does not need to compute exact query results, the DBMS can skip or
accelerate query execution by exploiting repetition to synthesize training data.</li>
</ul>
<h2 id="architecture">Architecture</h2>
<p>Given this, we present the Boot framework to accelerate training data generation.
Boot is transparent to the DBMS’s upstream ML components and leverages workload repetition in two
ways to expedite training data collection while minimizing its impact on the accuracy of the ML
models.
First, Boot reduces the number of queries the DBMS executes by recognizing redundant queries
based on their high-level semantics, avoiding re-execution through reusing previously computed
training data (<em>macro-acceleration</em>).
Next, for the queries that the DBMS does execute, Boot modifies their run-time execution behavior
by injecting special operators into their query plans.
These operators (1) dynamically identify redundant computations and then (2) intelligently
short-circuit parts of the plan to expedite their completion (<em>micro-acceleration</em>).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/boot/architecture.png" alt="Figure 1: Architecture." /></p>
<p style="text-align: left;">
<b>Figure 1, Architecture:</b>
<em>
An overview of Boot's internal modules and execution flow.
The Macro-Accelerator (MA) decides whether to execute a query, and the
Micro-Accelerator (µA) accelerates the execution of a specific query.
</em></p>
<style>
.step {
border-radius: 50%;
width: 1.5em;
height: 1.5em;
background: #fff;
border: 2px solid #000000;
color: #FFFFFF !important;
background-color: #C41230;
pointer-events: none;
display: inline-block;
text-align: center;
}
</style>
<p>As Fig 1 shows, Boot integrates into a DBMS using two modules:</p>
<ol>
<li>The <strong>Macro-Accelerator</strong> (MA) sits between the DBMS’s network handler and planner,</li>
<li>The <strong>Micro-Accelerator</strong> (µA) embeds itself in the DBMS’s execution engine.</li>
</ol>
<p>Boot’s design does not modify the DBMS’s interface for training data generation.
Clients still connect to a Boot-enhanced DBMS over standard APIs (e.g., JDBC, ODBC) to execute
their workloads and collect training data.
This compatibility allows Boot to drop into existing modeling pipelines without any code change;
the only effect is that the DBMS produces training data faster.
However, because Boot alters the DBMS’s regular query execution semantics, it is
<a rel="noopener" target="_blank" href="https://www.cidrdb.org/cidr2023/papers/p27-lim.pdf">fundamentally unsuitable</a>
for production environments.
Therefore, we deploy Boot on an offline clone of the production DBMS to avoid application errors.</p>
<pre data-lang="SQL" style="background-color:#393939;color:#dedede;" class="language-SQL "><code class="language-SQL" data-lang="SQL"><span style="font-weight:bold;color:#b7b7b7;">SELECT</span><span style="font-weight:bold;color:#95bff3;"> nation, o_year, </span><span style="font-weight:bold;color:#fffd87;">SUM</span><span style="font-weight:bold;color:#95bff3;">(amount) </span><span style="font-weight:bold;color:#ececec;">as</span><span style="font-weight:bold;color:#95bff3;"> sum_profit
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> (
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">SELECT</span><span style="font-weight:bold;color:#95bff3;"> n_name </span><span style="font-weight:bold;color:#ececec;">as</span><span style="font-weight:bold;color:#95bff3;"> nation, EXTRACT(YEAR </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> o_orderdate) </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> o_year,
</span><span style="font-weight:bold;color:#95bff3;"> l_extendedprice*(</span><span style="font-weight:bold;color:#87d6d5;">1</span><span style="font-weight:bold;color:#ececec;">-</span><span style="font-weight:bold;color:#95bff3;">l_discount)</span><span style="font-weight:bold;color:#ececec;">-</span><span style="font-weight:bold;color:#95bff3;">ps_supplycost*l_quantity </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> amount
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">FROM</span><span style="font-weight:bold;color:#95bff3;"> part, supplier, lineitem, partsupp, orders, nation
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#b7b7b7;">WHERE</span><span style="font-weight:bold;color:#95bff3;"> s_suppkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_suppkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> ps_suppkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_suppkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> ps_partkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_partkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> p_partkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_partkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> o_orderkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> l_orderkey </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> s_nationkey </span><span style="font-weight:bold;color:#ececec;">=</span><span style="font-weight:bold;color:#95bff3;"> n_nationkey
</span><span style="font-weight:bold;color:#95bff3;"> </span><span style="font-weight:bold;color:#ececec;">AND</span><span style="font-weight:bold;color:#95bff3;"> p_name </span><span style="font-weight:bold;color:#ececec;">LIKE </span><span style="font-weight:bold;color:#d6d6d680;">'</span><span style="font-weight:bold;color:#d68686;">%[COLOR]%</span><span style="font-weight:bold;color:#d6d6d680;">'
</span><span style="font-weight:bold;color:#95bff3;">) </span><span style="font-weight:bold;color:#ececec;">AS</span><span style="font-weight:bold;color:#95bff3;"> profit </span><span style="font-weight:bold;color:#b7b7b7;">GROUP BY</span><span style="font-weight:bold;color:#95bff3;"> nation,o_year </span><span style="font-weight:bold;color:#b7b7b7;">ORDER BY</span><span style="font-weight:bold;color:#95bff3;"> nation, o_year </span><span style="font-weight:bold;color:#fed6af;">DESC</span><span style="font-weight:bold;color:#95bff3;">;
</span></code></pre>
<p style="text-align: center;">
<b>Listing 1:</b>
<em>TPC-H Q9.</em>
</p>
<p>We now provide an overview of Boot’s macro- and micro-accelerators using the TPC-H Q9 query shown
in Listing 1 as a running example. Executing 1000 iterations of Q9
(with different strings substituted for <code>[COLOR]</code>) at scale-factor (SF) 100 on
PostgreSQL (v15) takes 17 hours. Enabling Boot reduces the time required to 1 minute with minor
degradation in ML model accuracy. We omit technical details for brevity and present only the
high-level intuition in this blog post; we encourage interested readers to check out our paper.</p>
<h2 id="macro-accelerator">Macro-Accelerator</h2>
<p>Boot’s Macro-Accelerator (MA) module inspects each query request as it arrives to determine whether
it should be executed again (i.e., whether executing it would increase the diversity of the
training data gathered thus far).</p>
<blockquote>
<p>Macro-Accelerator = (1) Similarity + (2) Adaptivity<br />
(1) <em>Fingerprinting</em> identifies similar queries<br />
(2) <em>Exponential skipping</em> decides whether to re-execute queries</p>
</blockquote>
<p></p>
<p>Figure 1 shows that <a class="step">1</a> when a SQL query arrives,
the MA fingerprints it and checks whether it has executed the query before.
This fingerprint is computed on raw SQL strings because the MA is placed before query planning
to avoid the overhead of planning unnecessary queries.
A simple example of a fingerprint is for the MA to remove the SQL string’s constants, producing a
query template.
In the Q9 example above, the MA replaces the ‘%[COLOR]%’ in the string with a placeholder.
More complex schemes for fingerprinting may incorporate a query’s parametric behavior or the DBMS’s
current configuration to increase the quality of the training data, which we elaborate on in the
paper.</p>
<p>Next, <a class="step">2</a> the MA looks up the query’s fingerprint in its <em>result cache</em> to
determine whether the DBMS recently executed a similar query.
This cache maps each fingerprint to a record containing (1) the query’s output and (2) the
telemetry produced by the DBMS while executing the query.
The former is necessary because existing workload replay and benchmarking tools assume the
DBMS returns query results with a particular schema (e.g., JSON-formatted plans).
After performing the lookup,</p>
<ul>
<li>If the cache does not contain a matching fingerprint, the Boot framework forwards the request
for the DBMS’s processing as usual.</li>
<li>However, if the MA’s cache contains a match, it then decides whether to skip the query.
<ul>
<li>To skip the query, it returns the cached result and records that it skipped.</li>
<li>To execute the query, it again forwards the request to the DBMS for processing.</li>
</ul>
</li>
</ul>
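<p>The lookup-then-decide flow above can be sketched as follows; the helper names (<code>fingerprint</code>, <code>should_skip</code>, <code>execute</code>) are hypothetical stand-ins for Boot’s internals, not its actual API:</p>

```python
def handle_query(sql, cache, fingerprint, should_skip, execute):
    """Sketch of the MA's decision flow: replay a cached result when the
    skip policy allows it, otherwise forward the query to the DBMS."""
    fp = fingerprint(sql)
    entry = cache.get(fp)
    if entry is not None and should_skip(entry):
        entry["skips"] += 1
        # Skip: return the cached query result and telemetry unchanged.
        return entry["result"], entry["telemetry"]
    # Execute: forward the request to the DBMS and refresh the cache entry.
    result, telemetry = execute(sql)
    cache[fp] = {"result": result, "telemetry": telemetry, "skips": 0}
    return result, telemetry

# Toy usage: a fingerprint that strips everything after WHERE, a policy
# that always skips, and a fake executor.
cache = {}
fp = lambda s: s.split(" WHERE ")[0]
run = lambda s: (("row",), {"runtime_s": 1.0})
handle_query("SELECT 1 FROM t WHERE a = 1", cache, fp, lambda e: True, run)
r, t = handle_query("SELECT 1 FROM t WHERE a = 2", cache, fp, lambda e: True, run)
print(cache["SELECT 1 FROM t"]["skips"])  # 1: the second query was skipped
```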
<p>We now sketch a brief overview of the MA’s policies for skipping query re-execution, leaving
the technical details to our paper.
The high-level idea is similar to exponential backoff, adapted here as <em>exponential
skipping</em>.
Each time the DBMS executes a query, the MA analyzes its resulting telemetry to see whether the
run-time falls within two standard deviations of the fingerprint’s historical mean.</p>
<ul>
<li>If so, the latest query instance is considered similar. The MA exponentially increases the
number of times to skip this fingerprint until the subsequent execution.</li>
<li>Otherwise, the skipping algorithm resets. The MA clears the corresponding cache entry.</li>
</ul>
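<p>One way to implement the “within two standard deviations” check with constant state per fingerprint is Welford’s online algorithm; this is an illustrative sketch, not Boot’s exact bookkeeping:</p>

```python
import math

class RunningStats:
    """Track a fingerprint's run-time mean/stddev incrementally
    (Welford's algorithm), keeping only constant state per template."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def is_similar(self, x, k=2.0):
        """True if x falls within k standard deviations of the mean."""
        if self.n < 2:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) <= k * std

stats = RunningStats()
for runtime in [64.0, 65.0, 66.0, 65.0, 64.0]:
    stats.add(runtime)
print(stats.is_similar(65.5), stats.is_similar(120.0))  # True False
```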
<p>For example, suppose that Q9 always takes a median run-time of 65 seconds and that the MA skips
queries up to a threshold of 100 times.
The number of times that the MA skips Q9 in between executions is given by the sequence
<code>[1, 2, 4, 8, 16, 32, 64, 100, 100, ... ]</code>, dropping the time required for 1000 executions of Q9
from 17 hours to 16 minutes.
However, should a Q9 invocation exhibit new behavior, the skipping sequence is reset to sample
future instances more frequently.
The choice of run-time as a metric is an optimization to avoid storing and comparing against all
executed plans, allowing the MA to maintain only approximately 10 KB of state
(e.g., run-time, number of input rows per operator)
per query template.
As Table 1 shows, a workload typically contains up to a few thousand query templates, so the total
storage overhead of MA is only in the tens of MBs.</p>
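<p>The arithmetic behind this example can be checked with a small simulation of the skipping schedule (the loop below is a simplification of the MA’s actual policy, with no resets):</p>

```python
def skip_schedule(total_invocations, cap=100):
    """Simulate exponential skipping: after each real execution, skip
    the next `skip` invocations of the same fingerprint, doubling
    `skip` up to `cap`. Returns the number of real executions."""
    executed, skip, i = 0, 1, 0
    while i < total_invocations:
        executed += 1       # run this invocation
        i += 1 + skip       # then skip the next `skip` invocations
        skip = min(skip * 2, cap)
    return executed

# Only 16 of 1000 invocations actually execute; at ~65 s each, that is
# roughly a quarter hour of work instead of many hours.
print(skip_schedule(1000))  # 16
```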
<h2 id="micro-accelerator">Micro-Accelerator</h2>
<blockquote>
<p>Micro-Accelerator = (1) Modifying Tuple Flows + (2) Sampling + (3) Scaling<br />
(1) When the flow of tuples stabilizes, the operator is predictable<br />
(2) An operator’s output induces more work, so sample to control the work generated<br />
(3) When stopping an operator early, scale up its telemetry to prevent underestimation</p>
</blockquote>
<p></p>
<p>Each query that the MA module does not skip then goes to the DBMS’s query planner.
Boot’s Micro-Accelerator (µA) injects its special operators into the query plan at this stage.
These injected components wrap a plan’s operators so that the µA can continuously monitor their
run-time execution.
When the µA detects that an operator’s behavior has stabilized (i.e., more training data
from that operator is unnecessary), it sends a message to the corresponding wrapper to alter
the operator’s tuple processing behavior (e.g., reduce the amount of output produced).</p>
<p>The core idea that the µA’s injection exploits is that the DBMS performs query processing at the
granularity of its operators.
For example, the
<a rel="noopener" target="_blank" href="https://15445.courses.cs.cmu.edu/spring2023/notes/12-queryexecution1.pdf">Volcano model</a>
requires every operator to implement a <em>Next()</em> function that returns its next tuple.
Therefore, to support µA’s injection, the DBMS only needs to provide a way to wrap the tuple
production function (e.g., by replacing an operator’s <em>Next()</em> function pointer), either at a
source-code level or through
<a rel="noopener" target="_blank" href="https://archive.fosdem.org/2021/schedule/event/postgresql_extensibility/">extensibility hooks</a>.
Because most DBMSs have existing code paths that provide such functionality for
<a rel="noopener" target="_blank" href="https://www.postgresql.org/docs/current/sql-explain.html">instrumentation</a>,
implementing µA’s injection is relatively straightforward.</p>
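<p>Under the Volcano model, such a wrapper can be sketched as below; the class and the fixed tuple budget are illustrative assumptions, not the paper’s implementation:</p>

```python
class Scan:
    """Stand-in for a DBMS operator exposing a Volcano-style Next()."""
    def __init__(self, rows):
        self.rows = iter(rows)
    def next(self):
        return next(self.rows, None)  # None signals exhaustion

class Wrapper:
    """µA-style wrapper: forwards Next() to the wrapped operator and
    short-circuits it once a tuple budget is exhausted."""
    def __init__(self, op, max_tuples):
        self.op = op
        self.max_tuples = max_tuples
        self.emitted = 0
    def next(self):
        if self.emitted >= self.max_tuples:
            return None  # terminate early: stop creating work for parents
        tup = self.op.next()
        if tup is not None:
            self.emitted += 1
        return tup

# A parent operator pulling from the wrapped scan sees only the budgeted
# fraction of the 1000 underlying tuples.
wrapped = Wrapper(Scan(range(1000)), max_tuples=100)
count = 0
while wrapped.next() is not None:
    count += 1
print(count)  # 100
```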
<p>Figure 1 shows that <a class="step">3</a> the µA encapsulates each of the DBMS’s physical
plan operators (e.g., scans, joins) with a special wrapper operator to control its run-time
behavior dynamically.
For example, this wrapper can sample the wrapped operator’s output to emit only 10% of the tuples
that it would otherwise produce.
It can also terminate an operator’s execution early when the µA detects it should do so,
stopping the operator from creating additional work.</p>
<p>Since µA may short-circuit an operator’s execution, <a class="step">4</a> Boot scales each
operator’s telemetry to approximate the telemetry of its full execution.
For example, a scan operator that processes only 10% of its expected rows may have its measured
timings scaled ten-fold.
This scaling improves ML model accuracy as it guards against underpredicting query execution times.</p>
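<p>For instance, the scaling step might look like the linear extrapolation below (an assumption for illustration; real operator costs need not scale linearly):</p>

```python
def scale_telemetry(measured_seconds, rows_processed, rows_expected):
    """Extrapolate an early-terminated operator's measured time to
    approximate a full execution, guarding against underestimation."""
    if rows_processed == 0:
        return 0.0
    return measured_seconds * (rows_expected / rows_processed)

# A scan stopped after 10% of its expected rows has its timing scaled 10x.
print(scale_telemetry(1.2, 100_000, 1_000_000))  # 12.0
```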
<p>The key advantage of µA’s wrapper-based approach that directly modifies a physical query plan is
that the DBMS is guaranteed to generate the same plan with and without Boot enabled.
Some DBMSs alter the query plan when using other sampling techniques (e.g., SQL-level
<em>TABLESAMPLE</em>), producing plans that differ in shape and performance characteristics.
Because the DBMS is collecting training data to build models of its behavior, the training plans
must resemble an actual deployment’s plans as much as possible.</p>
<p>Lastly, after the DBMS executes the µA-wrapped plan, it forwards the query result and telemetry to
the MA module.
The MA module stores this information in its result cache for future invocations of similar
queries.
Boot’s MA and µA modules are independent: if either component is disabled, the DBMS will
process data with its regular non-accelerated components instead.
However, they enhance each other’s ability to accelerate workload execution.
In the paper, we show that the combined effect of both accelerators obtains up to 225x speedup.</p>
<h1 id="results">Results</h1>
<p>We highlight a few key results on a “standard” benchmarking setup for PostgreSQL (v15), that is:</p>
<ul>
<li>Benchmark 1: <a rel="noopener" target="_blank" href="https://www.tpc.org/tpch/">TPC-H</a>, scale factor 100
<ul>
<li>Represents a workload with uniform data distribution.</li>
</ul>
</li>
<li>Benchmark 2: <a rel="noopener" target="_blank" href="https://aka.ms/dsb/">DSB</a>, scale factor 10
<ul>
<li>Represents a workload with complex data distributions, join patterns, and skew.</li>
</ul>
</li>
<li>Modeling technique: <a rel="noopener" target="_blank" href="https://github.com/autogluon/autogluon">AutoGluon</a>
<ul>
<li>AutoGluon automatically searches over hyperparameters and techniques (e.g., gradient-boosted
trees, random forests, linear models, neural networks) that existing work uses to build models.</li>
<li>We found that AutoGluon is competitive with (and often outperforms) hand-crafted models from
prior work.</li>
</ul>
</li>
</ul>
<p>In this blog post, we compare four configurations:</p>
<ol>
<li>the default DBMS without Boot enabled (<strong>Original</strong>),</li>
<li>the DBMS with only Boot’s macro-accelerator active (<strong>MA</strong>),</li>
<li>the DBMS with only Boot’s micro-accelerator active (<strong>µA</strong>),</li>
<li>the DBMS with both of Boot’s accelerators active (<strong>Combined</strong>).</li>
</ol>
<p>The full paper contains other scenarios, sensitivity analyses, and comparisons to different
techniques.</p>
<h2 id="collection-time">Collection Time</h2>
<blockquote>
<p>Boot achieves up to 225x speedup on the examples in this blog post.<br />
In general, it reduces collection time from weeks to days.</p>
</blockquote>
<p></p>
<p>We first measure the <em>collection time</em> as previously defined; that is, the time the DBMS
takes to generate training data by executing the workload.</p>
<table style="pointer-events: none; box-shadow: none">
<tr>
<td style="border: none"><img src="./tpch_sf_100_runtime_dataset.png" alt="TPC-H Collection Time."></img></td>
<td style="border: none"><img src="./dsb_sf_10_runtime_dataset.png" alt="DSB Collection Time."></img></td>
</tr>
<tr>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 2a: TPC-H</p></td>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 2b: DSB</p></td>
</tr>
<tr>
<td style="border: none; padding: 0" colspan="2">
<b>Figure 2, Collection Time:</b>
<em>
The time to generate training data with different modules of Boot active (lower is better).
</em>
</td>
</tr>
</table>
<p>Figure 2 shows the following:</p>
<ul>
<li>MA obtains speedups of 2.26x on TPC-H and 3.14x on DSB.</li>
<li>µA obtains speedups of 2.22x on TPC-H and 23.2x on DSB.</li>
<li>Combined obtains speedups of 6.64x on TPC-H and 225x on DSB.</li>
</ul>
<p>We describe a broader range of speedups across different scale factors in the paper, with Boot
generally performing better as the dataset size grows.
The takeaway is that Boot reduces collection time from weeks to days, or from days to hours.
This acceleration is significant because it allows an autonomous DBMS to obtain its ML models in a
fraction of the time, minimizing the time it spends without autonomous capabilities.</p>
<p>We also observe from Figure 2 that:</p>
<ul>
<li>The MA and µA obtain different speedups depending on the workload’s complexity and skew.
We analyze speedup sources in the paper and only provide a summary here.
<ul>
<li>The MA executes exponentially fewer queries.
On a histogram of query runtimes, enabling the MA maintains the overall shape but decreases
each bar’s height significantly (i.e., fewer query invocations) because of exponential
skipping.</li>
<li>The µA executes the same number of queries but accelerates individual queries by reducing the
time spent in individual operators.
A bar chart of individual operator timings shows that the µA achieves up to 209x speedup on
physical operators such as index scans, sequential scans, and hash joins.
These speedups are from reducing the number of expensive disk operations and tuples processed.</li>
</ul>
</li>
<li>The MA and µA achieve higher speedups together than they do individually.
<ul>
<li>In the paper, we show that this effect is most pronounced on queries that are long-running or
have instances that time out.</li>
</ul>
</li>
</ul>
<h2 id="absolute-error">Absolute Error</h2>
<blockquote>
<p>Boot results in models with comparable mean absolute error.<br />
Tuning methods are surprisingly robust to high errors.</p>
</blockquote>
<p></p>
<p>We next evaluate the absolute error; that is, the absolute difference between a query’s actual
latency and the corresponding model’s predicted latency. We also measure the mean absolute error
(MAE). We visualize the absolute error across all predictions as a boxplot.</p>
<table style="pointer-events: none; box-shadow: none">
<tr>
<td style="border: none"><img src="./tpch_sf_100_ae_boxplot.png" alt="TPC-H Absolute Error."></img></td>
<td style="border: none"><img src="./dsb_sf_10_ae_boxplot.png" alt="DSB Absolute Error."></img></td>
</tr>
<tr>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 3a: TPC-H</p></td>
<td style="border: none; padding: 0"><p style="text-align: center;">Figure 3b: DSB</p></td>
</tr>
<tr>
<td style="border: none; padding: 0" colspan="2">
<b>Figure 3, Absolute Error:</b>
<em>
The absolute error of models that are trained on the individual datasets (lower is better).
The red circle shows sample mean and the whiskers extend to 1.5 interquartile range.
</em>
</td>
</tr>
</table>
<p>Figure 3 shows that compared to the MAE of the Original configuration’s models:</p>
<ul>
<li>MA’s MAE is 2.64x on TPC-H and 1.06x on DSB.</li>
<li>µA’s MAE is 11.0x on TPC-H and 0.999x on DSB.</li>
<li>Combined’s MAE is 11.5x on TPC-H and 1.11x on DSB.</li>
</ul>
<p>Intuitively, these results make sense.</p>
<ul>
<li>The MA introduces less error because it produces telemetry under similar conditions.
It only decides whether to execute a query and does not modify query execution itself.</li>
<li>The µA results in more error because it (1) terminates execution early and (2) scales the
telemetry.</li>
<li>The Combined configuration generally inherits both sources of error.</li>
</ul>
<p>To contextualize these results, we present a scenario from the paper in which we summed the
prediction error for all invocations of TPC-H’s Q1.</p>
<ul>
<li>Invoking Q1 for 1000 iterations took 3.7 hours.</li>
<li>The Original models are off by 308 seconds (5 minutes), whereas the Combined models are off by
2099 seconds (35 minutes).
<ul>
<li>The Combined models have 7x more error. However, they only required four days of
training data generation, whereas the Original models required three weeks.</li>
<li>The final prediction of the Combined models, 3.1 hours when the actual time was 3.7 hours,
remains accurate enough to be useful.</li>
</ul>
</li>
</ul>
<p>At this stage, a natural question is how bad the errors can get before the models are no
longer usable.
<a rel="noopener" target="_blank" href="https://www.vldb.org/pvldb/vol17/p823-zhao.pdf">Recent work</a>
shows that even models with terrible errors (e.g., 10x, 50x) have comparable
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/F-score">F1 scores</a> for the task of index recommendation.
We started our investigation into this line of research after making similar observations in
internal experiments; intuitively, for many tuning tasks, all that matters is getting the
“direction” of tuning right.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Researchers have developed effective ML models for database tuning, but integrating these models
into an end-to-end tuning loop is challenging because their construction requires an expensive and
lengthy training data generation process.
We introduce two acceleration techniques to expedite training data collection by leveraging the
unique characteristics of the training data environment; unlike regular query execution, there is
no need for the DBMS to compute accurate results.
We integrate these techniques into our framework Boot that drops into existing modeling pipelines.
Boot’s sharp reduction in training data collection time makes it well-suited for bootstrapping an
autonomous DBMS’s initial models, minimizing the time spent without self-tuning capabilities.</p>
<p>Our paper is under submission.</p>
Better streaming algorithms for Maximum Directed Cut via 'snapshots'2024-04-11T00:00:00+00:002024-04-11T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/streaming-csps/<p>\[
\gdef\bias{\mathrm{bias}}
\gdef\deg{\mathrm{deg}}
\gdef\indeg{\mathrm{indeg}}
\gdef\outdeg{\mathrm{outdeg}}
\gdef\Snap{\mathrm{Snap}}
\gdef\RSnap{\mathrm{RefSnap}}
\]</p>
<p>In this blog post, I’ll discuss a new algorithm based on two joint papers of mine with Raghuvansh Saxena, Madhu Sudan, and Santhoshini Velusamy (appearing in SODA’23 and FOCS’23). The goal of this algorithm is to “approximate” the value of a graph optimization problem called “maximum directed cut”, or <strong>Max-DICUT</strong> for short, and the algorithm operates in the so-called “streaming model”. After defining these terms, I will describe how we reduce the problem of approximating the <strong>Max-DICUT</strong> value of a directed graph to the problem of estimating a certain matrix, which we call the “snapshot”, associated to a directed graph; finally, I will present some ideas behind streaming algorithms for estimating these snapshot matrices.</p>
<h1 id="introduction">Introduction</h1>
<p>To start, we will define the particular algorithmic model (streaming algorithms) and computational problem (<strong>Max-DICUT</strong>) that we are interested in.</p>
<h2 id="streaming-algorithms">Streaming algorithms</h2>
<p>Motivated by applications to “big data”, in the last two decades, theoretical models of computing on massive inputs have been widely studied. In these models, the algorithm is given <em>limited</em>, <em>partial</em> access to some input object, and is required to produce an output fulfilling some guarantee related to that object. Some classes of models include:</p>
<ul>
<li><em>Property testing</em>, where an algorithm must decide whether a large object either has some property \(P\) or “really doesn’t have \(P\)”<sup class="footnote-reference"><a href="#ppty-tst">1</a></sup> given only a few queries to the object. Depending on the specific model, the algorithm may be able to choose these queries adaptively, or, more restrictively, the queries might just be randomly and independently sampled according to some distribution.</li>
<li><em>Online algorithms</em>, where an algorithm is forced to make progressive decisions about an object while it is revealed piece by piece.</li>
<li><em>Streaming algorithms</em>, where an algorithm is allowed to make a decision about an object after seeing it revealed progressively in a “stream”, but there is a limit on the amount of information that can be stored in memory.</li>
</ul>
<p>This blog post is concerned with streaming algorithms. In this setting, <em>memory space</em> is the most important limited resource. Sometimes, there are even algorithms that pass over a data stream of length \(n\) but maintain their internal state using only \(O(\log n)\) or even fewer bits of memory! One exciting aspect of the streaming model of computation is that space restrictions can often be studied mathematically from the standpoint of information theory, opening an avenue for proving impossibility results.<sup class="footnote-reference"><a href="#contrast">2</a></sup></p>
<p>Numerous algorithmic problems have been studied in the context of streaming algorithms. These include statistical problems, such as finding frequent elements in a list (so-called “heavy hitters”) or estimating properties of the distribution of element frequencies in lists (like so-called “frequency moments”), as well as questions about graphs, such as testing for connectivity, or computing the maximum matching size, where the stream consists of the list of edges. The common denominator between all these problems is that the “usual” algorithms might not be good streaming algorithms, whether because they require too much space or because they require “on-demand” access to the input data.</p>
<h2 id="constraint-satisfaction-problems">Constraint satisfaction problems</h2>
<p>Many “classical” computational problems can be recast into questions in the streaming model. Here, we are interested in one class of problems that have been particularly well-studied classically, namely <em>constraint satisfaction problems</em> (CSPs). These occur often in practice and include many problems one might encounter in introductory algorithms courses, such as <strong>Max-3SAT</strong>, <strong>Max-CUT</strong>, and <strong>Max-\(q\)Coloring</strong>.</p>
<p>CSPs are defined by variables and local constraints over a finite alphabet. More formally, a CSP is defined by:</p>
<ul>
<li>A finite set \(\Sigma\), called an <em>alphabet</em>; in the typical “Boolean” case, \(\Sigma=\{0,1\}\).</li>
<li>A number of <em>variables</em>, \(n\).</li>
<li>A number of local <em>constraints</em>, \(m\), and a list of constraints \(C_1,\ldots,C_m\). Each constraint \(C_j\) is defined by four objects:
<ol>
<li>A number \(k_j \geq 1 \in \mathbb{N}\), called the <em>arity</em>, that determines the number of variables \(C_j\) involves.</li>
<li>A choice of \(k_j\) distinct variables \(i_{j,1},\ldots,i_{j,k_j} \in \{1,\ldots,n\}\).</li>
<li>A <em>predicate</em> (or “goal” function) \(f_j : \Sigma^{k_j} \to \{0,1\}\) for those variables.</li>
<li>A <em>weight</em> \(w_j \geq 0\).</li>
</ol>
</li>
</ul>
<p>The CSP asks us to optimize over potential <em>assignments</em>, which are functions \(x : \{1,\ldots,n\} \to \Sigma\) mapping each variable to an element of \(\Sigma\). In particular, the objective is to maximize<sup class="footnote-reference"><a href="#max">3</a></sup> the number of “satisfied” (or “happy”, if you’d like) constraints, where a constraint \(C_j\) is “satisfied” if the alphabet symbols assigned by \(x\) on the variables \(i_{j,1},\ldots,i_{j,k_j}\) satisfy the predicate \(f_j\). The maximum number of constraints satisfied by any assignment is called the <em>value</em> of the CSP.</p>
<p>Some examples of CSPs are:</p>
<ul>
<li>In <strong>Max-CUT</strong> (a.k.a. “Maximum Cut”), the alphabet is Boolean (\(\Sigma = \{0,1\}\)), and all constraints are binary and use the same predicate: \(f(x,y) = x \oplus y\) (where \(\oplus\) denotes the Boolean XOR operation). I.e., if we apply a constraint to the variables \((i_1,i_2)\), then the constraint is satisfied iff \(x(i_1) \neq x(i_2)\). <strong>Max-\(q\)Coloring</strong> is similar, over a larger alphabet of size \(q\), with the predicate \(f(x,y)=1 \iff x \neq y\).</li>
<li>In <strong>Max-DICUT</strong> (a.k.a. “Maximum Directed Cut”), the alphabet is again Boolean, and the predicate is now \(f(x,y) = x \wedge \neg y\), so that a constraint \((i_1,i_2)\) is satisfied iff \(x(i_1) = 1 \wedge x(i_2) = 0\).</li>
<li>In <strong>Max-3SAT</strong>, the alphabet is also Boolean, all constraints are ternary, and the assorted predicates are all possible disjunctions on literals, such as \(f(x,y,z) = x \vee \neg y \vee z\) or \(f(x,y,z) = \neg x \vee \neg y \vee \neg z\).</li>
</ul>
<p>Both <strong>Max-CUT</strong> and <strong>Max-DICUT</strong> can be described interchangeably in the language of <em>graphs</em>, which might be more familiar. For <strong>Max-CUT</strong>, given an instance on \(n\) variables, we can form a corresponding undirected graph on \(n\) vertices, and add an edge \(i_1 \leftrightarrow i_2\) for each constraint \((i_1,i_2)\) in the instance (with the same weight). Now an assignment assigns each vertex to either \(0\) or \(1\), and an edge is satisfied iff its endpoints are on different sides of the cut. (We can think of an assignment as a “cut” which partitions the vertex-set into two sets: one side corresponding to the variables \(\{i : x(i)=0\}\) and one for \(\{i : x(i)=1\}\).) For <strong>Max-DICUT</strong>, because of the asymmetry, we have to create a <em>directed</em> graph. We add an edge \(i_1 \to i_2\) for each constraint \((i_1,i_2)\), and an edge \(i_1 \to i_2\) is satisfied iff \(i_1\) is assigned to \(1\) and \(i_2\) to \(0\). (Similarly, an assignment is an “ordered” partition of the vertex-set into two sets, i.e., we have a designated “source” set and “target” set and they are not interchangeable.)</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/csps-graphs.png" alt="A table with two columns, marked “CSPs” and “Graphs”, and then three rows, the first with “Max-CUT constraint \(x_3 \oplus x_7\)” and an undirected edge between \(3\) and \(7\), the second with “Max-DICUT constraint \(x_3 \wedge \overline{x_7}\)” and a directed edge from \(3\) to \(7\), and finally a row with “Boolean assignment \(x_1=0,x_2=0,x_3=1,x_4=0\)” and a “graph cut” where \(3\) is on one side and \(1,2,4\) on the other." />
<em>Figure.</em> A “dictionary” between the CSP and graph versions of <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Each variable becomes a vertex, each constraint becomes an edge (directed in <strong>Max-DICUT</strong>, undirected in <strong>Max-CUT</strong>), and a Boolean assignment \(x\) becomes a “cut” of the vertices in the graph.</p>
<p>Note that in these examples, the arity is a small constant (i.e., either \(2\) or \(3\)). What makes CSPs so interesting is that we can “build up” complicated global instances on arbitrarily many variables by applying predicates to “local” sets of a few variables at a time.</p>
<p>For various reasons, we are interested in studying the feasibility of <em>approximating</em> the values of CSPs (and not <em>exactly</em> determining this value). Firstly, the approximability of CSPs by “classical” (i.e., polynomial-time) algorithms is a subject of intense interest, stemming from connections to probabilistically checkable proofs and semidefinite programming. But the theory of classical CSP approximations relies on unproven assumptions like \(\mathbf{P} \neq \mathbf{NP}\). Space-bounded streaming algorithms generally seem very weak compared to polynomial-time algorithms, but this gives us the satisfaction of proving unconditional hardness results — and some CSPs still admit nontrivial streaming approximation algorithms. Secondly, it turns out that exact computation of CSP value is very hard in the streaming setting. Further, exact computation is hardest for <em>dense</em> instances, which is typical for many streaming problems, while approximation is, interestingly, hardest for <em>sparse</em> instances, i.e., for instances with \(O(n)\) constraints. This is because of the following well-known “sparsification lemma”, which reduces computing the value (approximately) for arbitrary instances to computing the value (approximately) for sparse instances:</p>
<p><strong>Lemma (sparsification, informal)</strong>. Let \(\Psi\) be an instance of a CSP with \(n\) variables and \(m\) constraints. Suppose we construct a new instance \(\Psi’\), also on \(n\) variables, but with \(m’ = \Theta(n)\) constraints, by randomly sampling constraints from \(\Psi\). Then with high probability, the values of \(\Psi\) and \(\Psi’\) will be roughly the same.</p>
<p>(To make this lemma formal: For \(\epsilon > 0\), if \(m’ = \Theta(n/\epsilon^2)\), then we get high probability of the values being within an additive \(\pm\epsilon\). In the unweighted case, “randomly sampling constraints” literally means that each constraint is randomly sampled from \(\Psi\)’s constraints. It is possible to generalize to the weighted case assuming the ratio of maximum to minimum weights is bounded.)</p>
<p>Because of this sparsification lemma, in the remainder of this post, we will assume for simplicity that all CSP instances on \(n\) variables have \(\Theta(n)\) constraints. (Note we are assuming they also have \(\Omega(n)\) constraints. The algorithms we describe below will also work for \(o(n)\) constraints, but this case can sometimes be messier.)</p>
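<p>As a quick sanity check on the lemma, here is a small toy experiment (our own illustrative setup, not from the papers): we brute-force the <strong>Max-DICUT</strong> value of a dense random digraph and of a \(\Theta(n)\)-constraint random sample of its edges, and observe that the two values are close.</p>

```python
import random
from itertools import product

def dicut_value(n, edges):
    """Fraction of (unit-weight) directed edges cut by the best assignment."""
    best = 0
    for x in product((0, 1), repeat=n):
        best = max(best, sum(1 for (u, v) in edges if x[u] == 1 and x[v] == 0))
    return best / len(edges)

random.seed(0)
n = 10
# A dense random digraph: each ordered pair is an edge with probability 1/2
dense = [(u, v) for u in range(n) for v in range(n)
         if u != v and random.random() < 0.5]
# "Sparsify": sample Theta(n) constraints uniformly with replacement
sparse = [random.choice(dense) for _ in range(10 * n)]
diff = abs(dicut_value(n, dense) - dicut_value(n, sparse))
print(diff)  # small: the two values agree up to a small additive error
```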
<h1 id="streaming-algorithms-meet-csps-max-cut-and-max-dicut">Streaming algorithms meet CSPs: <strong>Max-CUT</strong> and <strong>Max-DICUT</strong></h1>
<p>It is natural to ask whether streaming algorithms can use the <em>local</em> constraints defining an instance to deduce something about the quality of the best <em>global</em> assignment:</p>
<blockquote>
<p><em>Key question:</em> How much space does a streaming algorithm need to approximate the value of (the best global assignment to) a CSP given a pass over its list of local constraints?</p>
</blockquote>
<p> </p>
<p>This question was first posed at the 2011 Bertinoro workshop on sublinear algorithms (see <a rel="noopener" target="_blank" href="https://sublinear.info/index.php?title=Open_Problems:45">the <code>sublinear.info</code> wiki</a>). In this section, we examine this question through the lens of <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>, which are two of the simplest and most widely studied Boolean, binary CSPs.</p>
<h2 id="streaming-csps-and-max-cut">Streaming CSPs and Max-CUT</h2>
<p>For the rest of this blog post, we adopt the “graph” language for describing <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Thus, in the streaming setting, we are interested in algorithms for <strong>Max-CUT</strong> and <strong>Max-DICUT</strong> where the input is a stream of undirected edges (<strong>Max-CUT</strong>) or directed edges (<strong>Max-DICUT</strong>) from a graph, and the goal is to output an approximation to the value of the graph.</p>
<p>Now, we turn to some prior results about streaming algorithms for <strong>Max-CUT</strong> and <strong>Max-DICUT</strong>. Recall that streaming algorithms are characterized by the amount of space they use. We will be interested in three “regimes” of space. We define these regimes using “\(O\)-tilde” notation: \(g(n) = \tilde{O}(f(n))\) if \(g(n) = O(f(n) \cdot \log^C n)\) for some constant \(C>0\). The regimes are as follows.</p>
<h3 id="large-space">Large space</h3>
<p>We use “large space” to refer to space between \(\Omega(n)\) and \(\tilde{O}(n)\). This space regime is sufficient to store entire input instances in memory! Thus, we can exactly calculate the value of instances once we see all their constraints, simply by enumerating all possible \(2^n\) global assignments. (Recall that the streaming model places no restrictions on the time usage of algorithms!)</p>
<p>Kapralov and Krachun (STOC’19) showed that for <strong>Max-CUT</strong>, this algorithm is the best possible: no algorithms using less-than-large space can get a \((1/2+\epsilon)\)-approximation for any \(\epsilon>0\). (\(1/2\)-approximation is “trivial” since every <strong>Max-CUT</strong> instance has value at least \(1/2\); indeed, a random assignment has expected value \(1/2\) in any instance.) However, the picture for <strong>Max-DICUT</strong> is much more complicated.</p>
<h3 id="medium-space">Medium space</h3>
<p>We use “medium space” to refer to space between \(\Omega(\sqrt n)\) and \(\tilde{O}(\sqrt n)\). This space regime is important because the “birthday paradox” phenomenon kicks in:</p>
<blockquote>
<p><em>Key fact:</em> Medium space is sufficient to store a set \(S\) of variables large enough that we expect that there are constraints involving at least two variables in \(S\).</p>
</blockquote>
<p> </p>
<p>Indeed, suppose we have an instance \(\Psi\) on \(n\) variables, we pick a random subset \(S \subseteq [n]\) of \(\Theta(\sqrt n)\) variables, and we look at all constraints which involve at least two variables in \(S\). Each constraint has this property with probability roughly \(\Theta((1/\sqrt n)^2) = \Theta(1/n)\), so by linearity of expectation, we expect roughly \(\Theta(1)\) constraints to have this property.</p>
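<p>This back-of-the-envelope calculation is easy to simulate (a toy setup of ours, with unit weights): on a random sparse digraph, a set \(S\) of \(\Theta(\sqrt n)\) vertices captures \(\Theta(1)\) edges on average.</p>

```python
import random

random.seed(1)
n = 10_000
m = 3 * n  # a sparse instance: Theta(n) constraints
edges = [(random.randrange(n), random.randrange(n)) for _ in range(m)]

k = int(n ** 0.5)  # |S| = Theta(sqrt(n))
trials, hits = 100, 0
for _ in range(trials):
    S = set(random.sample(range(n), k))
    hits += sum(1 for (u, v) in edges if u in S and v in S)
avg = hits / trials
print(avg)  # on average about m * (k/n)^2 = 3 edges land inside S
```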
<p>This key fact implied the breakdown of certain lower bound techniques for problems like <strong>Max-DICUT</strong> which worked in less-than-medium space, and it is also the starting point for unlocking improved approximation algorithms for <strong>Max-DICUT</strong> once medium space is available, as we’ll discuss below.</p>
<h3 id="small-space">Small space</h3>
<p>Finally, we use “small space” to refer to space which is \(\tilde{O}(1)\). Surprisingly, a result of Guruswami, Velingker, and Velusamy (APPROX’17, based out of CMU!) showed that even in small space, there <em>are</em> nontrivial algorithms for <strong>Max-DICUT</strong>. Chou, Golovnev, and Velusamy (FOCS’20) gave a variant of this algorithm with better approximation guarantees, which they also showed is optimal in less-than-medium space.<sup class="footnote-reference"><a href="#cgv-ratio">4</a></sup> These algorithms achieve nontrivial approximations in small space by using an important tool from the literature on streaming algorithms: the small-space streaming algorithm, from the seminal work of Indyk (2006), for estimating vector norms.</p>
<h2 id="max-dicut-and-bias"><strong>Max-DICUT</strong> and bias</h2>
<p>The work of Chou <em>et al.</em> left wide open the gap between medium and large space for approximating <strong>Max-DICUT</strong>. That is: Are there medium-space (or even less-than-large-space) algorithms which get better approximations than is possible in less-than-medium space? In the next section, I present our affirmative answer to this question, but first, I will introduce a further quantity we will need, which first showed up in this context in the work of Guruswami <em>et al.</em></p>
<p>Given an instance \(\Psi\) of <strong>Max-DICUT</strong> (a.k.a., a directed graph), and a vertex \(i \in \{1,\ldots,n\}\), let \(\outdeg_\Psi(i)\) denote the total weight of edges \(i \to i’\), \(\indeg_\Psi(i)\) the total weight of edges \(i’ \to i\), and \(\deg_\Psi(i) = \outdeg_\Psi(i) + \indeg_\Psi(i)\) the total weight of edges \(i_1\to i_2\) in which \(i \in \{i_1,i_2\}\). (These are called, respectively, the out-degree, in-degree, and total-degree of \(i\).) If \(\deg_\Psi(i) > 0\), then we define a scalar quantity called the <em>bias</em> of \(i\):
\[ \bias_\Psi(i) := \frac{\outdeg_\Psi(i) - \indeg_\Psi(i)}{\deg_\Psi(i)}. \] Note that \(-1 \leq \bias_\Psi(i) \leq +1\). The quantity \(\bias_\Psi(i)\) captures whether the edges incident to \(i\) are mostly outgoing (\(\bias_\Psi(i) \approx +1\)), mostly incoming (\(\bias_\Psi(i) \approx -1\)), or mixed (\(\bias_\Psi(i) \approx 0\)).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/vertices.png" alt="Three vertices, each incident to eight directed edges. The first vertex is labeled \(\approx +1\) and has mostly outgoing edges. The second vertex is labeled \(\approx 0\) and has a mix of outgoing and incoming edges. The third vertex is labeled \(\approx -1\) and has mostly incoming edges." />
<em>Figure.</em> Visual depictions of three vertices in a directed graph with biases close to \(+1,0,-1\), respectively. Green edges are outgoing and red edges are incoming.</p>
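<p>In code, the bias of every vertex can be read off directly from the edge list. A minimal sketch with unit weights (the helper name is ours):</p>

```python
from collections import defaultdict

def biases(n, edges):
    """bias(i) = (outdeg(i) - indeg(i)) / deg(i), for unit-weight edges."""
    outdeg, indeg = defaultdict(float), defaultdict(float)
    for (u, v) in edges:
        outdeg[u] += 1.0
        indeg[v] += 1.0
    return {i: (outdeg[i] - indeg[i]) / (outdeg[i] + indeg[i])
            for i in range(n) if outdeg[i] + indeg[i] > 0}

# Vertex 0 is a pure "source", 2 a pure "sink", and 1 is mixed
b = biases(3, [(0, 1), (0, 2), (1, 2)])
print(b)  # {0: 1.0, 1: 0.0, 2: -1.0}
```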
<p>This concept of bias, which relies crucially on the asymmetry of the predicate (and therefore has no analogue for <strong>Max-CUT</strong>), is the key to unlocking nontrivial streaming approximation algorithms for <strong>Max-DICUT</strong>. Observe that if e.g. \(\bias_\Psi(i) = -1\), then <em>all</em> edges incident to \(i\) are incoming, and therefore, the optimal assignment for \(\Psi\) should assign \(i\) to \(0\).<sup class="footnote-reference"><a href="#opt-asst">5</a></sup> Indeed, an instance is perfectly satisfiable iff all variables have bias either \(+1\) or \(-1\). What Guruswami <em>et al.</em> showed was that (i) this relationship is “robust”, in that instances with “many large-bias variables” have large value and vice versa, and (ii) whether an instance has “many large-bias variables” can be quantified using small-space streaming algorithms. Chou <em>et al.</em> gave an algorithm with better approximation ratios by strengthening the inequalities in (i).</p>
<p><strong>Remark:</strong> While we will not require this below, we mention that the notion of “many large-bias variables” is formalized by a quantity called the <em>total bias</em> of \(\Psi\), which is simply the sum over \(i\), weighted by \(\deg_\Psi(i)\), of \(|\bias_\Psi(i)|\). By definition, the total bias is equal to \(\sum_{i=1}^n |\outdeg_\Psi(i)-\indeg_\Psi(i)|\), which is simply the \(1\)-norm of the vector associated to \(\Psi\) whose \(i\)-th entry is \(\outdeg_\Psi(i)-\indeg_\Psi(i)\)! So the <strong>Max-DICUT</strong> algorithms of Guruswami <em>et al.</em> and Chou <em>et al.</em> use the small-space \(1\)-norm sketching algorithm of Indyk as a black-box subroutine to estimate the total bias of the input graph.</p>
<h1 id="improved-algorithms-from-snapshot-estimation">Improved algorithms from snapshot estimation</h1>
<p>Finally, we turn to the improved streaming algorithm for <strong>Max-DICUT</strong> from our recent papers in (SODA’23, FOCS’23). Our result is the following:</p>
<blockquote>
<p><strong>Theorem (Saxena, S., Sudan, Velusamy, FOCS’23).</strong> There is a medium-space streaming algorithm for <strong>Max-DICUT</strong> which achieves an approximation ratio \(\alpha\) strictly larger than the ratio \(\beta\) possible in less-than-medium space (and achievable in small space).</p>
</blockquote>
<p> </p>
<p>The various results on streaming approximations for <strong>Max-DICUT</strong> are collected in the following figure:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/ratios.png" alt="A 2D chart. The horizontal axis is labeled “exponent of \(n\)”, the vertical axis “approximation ratio”. There are green points at \((0,1/4)\) labeled “Trivial”, \((0,2/5)\) labeled “GVV’17”, \((0, 4/9)\) labeled “CGV’20”, and \((1,1)\) labeled “Sparsifier”. There are red points at \((1/2,4/9)\) labeled “CGV’20” and \((1,1/2)\) labeled “KK’19”. There is a blue point at \((1/2,0.483)\) labeled “SSSV’23”." />
<em>Figure.</em> A diagram of the known upper and lower bounds on streaming approximations for <strong>Max-DICUT</strong>. The exponents of \(0,1/2,1\) on the \(x\)-axis correspond to the small-, medium-, and large-space regimes; green dots are prior upper bounds, red dots are prior lower bounds, and the blue dot is our new upper bound. Of note, Chou, Golovnev, and Velusamy showed that \(4/9\)-approximations are achievable in small space and optimal in sub-medium space, while Kapralov and Krachun showed that \(1/2\)-approximations are optimal in sub-large space (where in fact arbitrarily good approximations are known). Our new algorithm gives a \(0.483\)-approximation, lying strictly between \(4/9\) and \(1/2\).</p>
<h2 id="the-snapshot-matrix">The snapshot matrix</h2>
<p>To present our algorithm, we first need to define a matrix, which we call the <em>snapshot</em>, associated to any directed graph \(\Psi\). This matrix has the property that a certain linear combination of its entries gives a good approximation to the <strong>Max-DICUT</strong> value of \(\Psi\) (a better approximation than is possible with a less-than-medium space streaming algorithm). Then, the goal of our algorithm becomes simply estimating the snapshot.</p>
<p>The snapshot matrix is simply the following. Recall that the interval \([-1,+1]\) is the space of possible biases of a variable in a <strong>Max-DICUT</strong> instance. Fix a partition \(I_1,\ldots,I_B\) of this interval into a finite number of subintervals. Given this partition, we can partition the (positive-degree) variables in \(\Psi\) into “bias classes”: Each vertex \(i \in \{1,\ldots,n\}\) has bias \(\bias_\Psi(i)\) falling into a unique interval \(I_b\) for some \(b \in \{1,\ldots,B\}\). Edges are also partitioned into bias classes: To an edge \(i_1 \to i_2\) in \(\Psi\) we associate class \((b_1,b_2) \in \{1,\ldots,B\} \times \{1,\ldots,B\}\), where \(b_1\) and \(b_2\) are respectively the classes of \(i_1\) and \(i_2\). The snapshot matrix, which we denote \(\Snap_\Psi \in \mathbb{R}_{\geq 0}^{B \times B}\), is simply the \(B \times B\) matrix which captures the weight of edges in each bias class, i.e., the \((b_1,b_2)\)-th entry is the total weight of edges \(i_1 \to i_2\) with \(\bias_\Psi(i_1) \in I_{b_1}\) and \(\bias_\Psi(i_2) \in I_{b_2}\).</p>
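<p>A minimal sketch of computing the snapshot from an edge list, assuming unit weights and an illustrative three-interval partition of \([-1,+1]\) (the partition, cutpoints, and helper names are ours):</p>

```python
from bisect import bisect_right
from collections import defaultdict

def snapshot(n, edges, cutpoints):
    """snap[b1][b2] = weight of edges whose endpoints fall in bias
    classes b1 and b2. `cutpoints` lists the interior endpoints of the
    partition of [-1, +1], e.g. [-1/3, 1/3] for three classes."""
    outdeg, indeg = defaultdict(float), defaultdict(float)
    for (u, v) in edges:
        outdeg[u] += 1.0
        indeg[v] += 1.0
    bias = {i: (outdeg[i] - indeg[i]) / (outdeg[i] + indeg[i])
            for i in range(n) if outdeg[i] + indeg[i] > 0}
    cls = lambda i: bisect_right(cutpoints, bias[i])  # index b of interval I_b
    B = len(cutpoints) + 1
    snap = [[0.0] * B for _ in range(B)]
    for (u, v) in edges:
        snap[cls(u)][cls(v)] += 1.0
    return snap

# Three bias classes: [-1,-1/3), [-1/3,1/3), [1/3,+1]
snap = snapshot(3, [(0, 1), (0, 2), (1, 2)], [-1 / 3, 1 / 3])
print(snap)  # [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
```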
<h2 id="aside-oblivious-algorithms">Aside: Oblivious algorithms</h2>
<p>At this point, we can “black-box” the notion of snapshot, since our algorithmic goal is now only to estimate the snapshot. However, to give intuition for the snapshot and show why it lets us achieve good approximations for <strong>Max-DICUT</strong>, we first take a detour into describing a simple class of “local” algorithms for <strong>Max-DICUT</strong>. These algorithms, called <em>oblivious algorithms</em>, were introduced by Feige and Jozeph (Algorithmica’17). Again, fix a partition of the space of possible biases \([-1,+1]\) into intervals \(I_1,\ldots,I_B\). For each interval \(I_b\), also fix a probability \(\pi_b\). Now an <em>oblivious algorithm</em> is one which, given an instance \(\Psi\), inspects each variable \(i\) independently and randomly sets it to \(1\) with probability \(\pi_b\), where \(b\) is the class of \(i\), and to \(0\) otherwise. These algorithms are “oblivious” in the sense that they ignore everything about each variable except its bias.</p>
<p>As discussed in the previous section, in <strong>Max-DICUT</strong>, if a variable has bias \(+1\), we always “might as well” assign it to \(1\), and if it has bias \(-1\), we “might as well” assign it to \(0\). Oblivious algorithms flesh out this connection by choosing how to assign <em>every</em> variable based on its bias. For instance, if a variable has bias \(+0.99\), we should still want to assign it to \(1\) (at least with large probability).</p>
<p>Feige and Jozeph showed that for a specific choice of the partition \((I_b)\) and probabilities \((\pi_b)\), the oblivious algorithm gives a good approximation to the overall <strong>Max-DICUT</strong> value. In particular, we observed that the ratio achieved by their oblivious algorithm is strictly better than what Chou <em>et al.</em> showed was possible with a less-than-medium space streaming algorithm. (In a paper of mine at APPROX’23, I generalized this definition and the corresponding algorithmic result to <strong>Max-\(k\)AND</strong> for all \(k \geq 2\).) Thus, to give improved streaming algorithms, it suffices to “simulate” oblivious algorithms (and in particular the oblivious algorithm of Feige and Jozeph).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/fj-sel.gif" alt="A step function, see caption for more details." />
<em>Figure.</em> The specific choice of bias partition \(I\) and probabilities \(\pi\) employed by Feige and Jozeph to achieve a \(0.483\)-approximation for <strong>Max-DICUT</strong>. Here, these two objects are presented together as a single step function, with bias on the horizontal axis and probability on the vertical axis. This choice deterministically rounds vertices with bias \(\geq +1/2\) to \(1\), \(\leq -1/2\) to \(0\), and it performs a (discretized version of a) linear interpolation between these extremes for vertices with bias closer to \(0\).</p>
<p>The key observation is then that to simulate an oblivious algorithm on an instance \(\Psi\), <em>it suffices to only know (or estimate) the snapshot of \(\Psi\)</em>. Indeed, every edge of class \(b_1, b_2\) is satisfied with probability \((\pi_{b_1})(1-\pi_{b_2})\) (the first factor is the probability that the first endpoint is assigned to \(1\), the second the probability that the second endpoint is assigned to \(0\), and these two events are independent). Thus, by linearity of expectation, the expected weight of the constraints satisfied by the oblivious algorithm is</p>
<p>\[ \mathbb{E}\left[\mathsf{Obl}(\Psi) \right] = \sum_{b_1,b_2 = 1}^B (\pi_{b_1})(1-\pi_{b_2}) \cdot \Snap_\Psi(b_1,b_2), \]
where the expectation is over the randomness of the oblivious algorithm’s assignment.</p>
<p>The upshot of this for us is that to estimate the value of an instance \(\Psi\), it suffices to calculate some linear function of this snapshot matrix \(\Snap_\Psi\). Another important consequence of this formula is that it allowed Feige and Jozeph to determine the approximation ratio of any oblivious algorithm using a linear program which minimizes the weight of constraints satisfied over all valid snapshots.<sup class="footnote-reference"><a href="#symmetry">6</a></sup> </p>
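<p>This formula is easy to verify numerically on a toy instance (the graph, class assignments, and probabilities below are illustrative choices of ours): summing \((\pi_{b_1})(1-\pi_{b_2})\) per edge, which is the same as weighting by the entries of the snapshot, matches the expectation computed by brute force over the rounding distribution.</p>

```python
from itertools import product

edges = [(0, 1), (0, 2), (1, 2)]  # vertex 0 is a source, 2 a sink, 1 mixed
pi = {0: 0.0, 1: 0.5, 2: 1.0}    # rounding probability for each bias class
cls = {0: 2, 1: 1, 2: 0}         # bias class of each vertex in this toy graph

# Formula: sum over edges of pi[b1] * (1 - pi[b2])
formula = sum(pi[cls[u]] * (1 - pi[cls[v]]) for (u, v) in edges)

# Direct expectation over the product distribution of the random rounding
direct = 0.0
for x in product((0, 1), repeat=3):
    p = 1.0
    for i in range(3):
        p *= pi[cls[i]] if x[i] == 1 else 1 - pi[cls[i]]
    direct += p * sum(1 for (u, v) in edges if x[u] == 1 and x[v] == 0)
print(formula, direct)  # both 2.0
```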
<h1 id="a-medium-space-algorithm-and-smoothing-the-snapshot">A medium-space algorithm and “smoothing” the snapshot</h1>
<p>At this point, our goal is to use streaming algorithms to estimate a linear function of the entries of the snapshot \(\Snap_\Psi\). To calculate this function up to a (normalized) \(\pm \epsilon\), it suffices to estimate each entry of the snapshot up to \(\pm \epsilon/B^2\). Since \(B\) is a constant, after reparametrizing \(\epsilon\) we seek an algorithm that estimates a given entry of the snapshot up to \(\pm \epsilon\) error.</p>
<p>Recall that the \((b_1,b_2)\)-th entry of the snapshot of \(\Psi\) is the weight of edges in \(\Psi\) with bias class \((b_1,b_2)\), i.e., the weight of edges from bias class \(b_1\) to bias class \(b_2\). To estimate this, we would ideally sample a random set \(E\) of \(T = O(1)\) edges in \(\Psi\), measure the biases of their endpoints, and then use the fraction of edges in the sample with bias class \((b_1,b_2)\) as an estimate for the total fraction of edges with this bias class. But it is not clear how to use a streaming algorithm to randomly sample a small set of edges and measure the biases of their endpoints simultaneously.<sup class="footnote-reference"><a href="#model">7</a></sup> Indeed, this cannot be possible in small space, since we know via Chou <em>et al.</em>’s lower bound that medium space is necessary for improved <strong>Max-DICUT</strong> approximations, and therefore for snapshot estimation! In this final section, we describe how we are able to estimate the snapshot using medium space.</p>
<h2 id="algorithm-for-bounded-degree-graphs">Algorithm for bounded-degree graphs</h2>
<p>First, suppose we were promised that in \(\Psi\), every vertex has degree at most \(D\), and \(D = O(1)\). An algorithm to estimate the \((b_1,b_2)\)-th entry of the snapshot of \(\Psi\) in this case is the following:</p>
<ol>
<li><em>Before the stream</em>, sample a set \(S \subseteq \{1,\ldots,n\}\) of \(k\) random vertices, where \(k\) is a parameter to be chosen later.</li>
<li><em>During the stream</em>, (i) store all edges whose endpoints are both in \(S\), and (ii) measure the bias of each vertex in \(S\).</li>
<li><em>After the stream</em>, take \(E\) to be the set of edges whose endpoints are both in \(S\). Observe that we know the biases of the endpoints of all edges in \(E\), and therefore the bias class of every edge in \(E\). Use the number of edges in \(E\) in bias class \((b_1,b_2)\) to estimate the total number of edges in \(\Psi\) in this bias class.</li>
</ol>
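<p>The three steps can be sketched as a single pass in Python (a toy implementation of ours with unit weights; since each edge survives into \(E\) with probability roughly \((k/n)^2\), the fraction of kept edges in a bias class estimates the corresponding fraction over all edges):</p>

```python
import random
from bisect import bisect_right

def estimate_class_fraction(n, stream, k, target_class, cutpoints):
    """One pass over an edge stream: sample S, record the degrees of
    vertices in S, and keep edges with both endpoints in S. The kept
    edges estimate the fraction of all edges in a given bias class."""
    S = set(random.sample(range(n), k))       # step 1: sample S up front
    kept, out_s, in_s = [], {}, {}
    for (u, v) in stream:                     # step 2: a single pass
        if u in S:
            out_s[u] = out_s.get(u, 0) + 1
        if v in S:
            in_s[v] = in_s.get(v, 0) + 1
        if u in S and v in S:
            kept.append((u, v))               # both endpoints in S
    def cls(i):                               # step 3: classify kept edges
        o, d = out_s.get(i, 0), out_s.get(i, 0) + in_s.get(i, 0)
        return bisect_right(cutpoints, (2 * o - d) / d)  # bias = (2o-d)/d
    if not kept:
        return 0.0
    return sum(1 for (u, v) in kept
               if (cls(u), cls(v)) == target_class) / len(kept)

random.seed(2)
n = 1000
stream = [(i, i + n // 2) for i in range(n // 2)]  # every edge: source -> sink
est = estimate_class_fraction(n, stream, k=200,
                              target_class=(2, 0), cutpoints=[-1 / 3, 1 / 3])
print(est)  # 1.0 (with this seed): every kept edge has bias class (2, 0)
```

<p>Note that the example graph has maximum degree \(1\), consistent with the bounded-degree promise above.</p>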
<p>Observe that the expected number of edges in \(E\) is \(\sim m (k/n)^2\) where \(m\) is the number of edges in \(\Psi\). If \(m = O(n)\), then \(|E| = \Omega(1)\) (in expectation) as long as \(k = \Omega(\sqrt n)\), which is precisely why this algorithm “kicks in” once we have medium space! <sup class="footnote-reference"><a href="#hash">8</a></sup> Once \(S\) is this large, we can indeed show that \(E\) suffices to estimate the snapshot. The proof of correctness of the estimate relies on <em>bounded dependence</em> of \(E\), by which we mean that in the collection of events \(\{e \in E\}_{e \in \Psi}\), each event is independent of all but \(O(1)\) other events. Indeed, observe that since \(\Psi\) has maximum degree \(D\), every edge in \(\Psi\) is incident to \(\leq 2(D-1)\) other edges. (Two edges are <em>incident</em> if they share at least one endpoint.) And for any two edges \(e, e’ \in \Psi\), the events “\(e \in E\)” and “\(e’ \in E\)” are <em>not</em> independent iff \(e\) and \(e’\) are incident.</p>
<h2 id="the-general-case">The general case</h2>
<p>General instances \(\Psi\) need not have bounded maximum degree. This poses a serious challenge for the bounded-degree algorithm we just presented. Consider the case where \(\Psi\) is a “star”, where each edge connects a designated center vertex \(i^*\) to one of the remaining vertices. In this situation, not every vertex is created equal. Indeed, if \(i^* \not\in S\) (which happens asymptotically almost surely), \(E\) will be empty, and therefore we learn nothing about \(\Psi\)’s snapshot.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/streaming-csps/star.png" alt="A graph on \(9\) vertices with one high-degree central vertex, and three side vertices marked by a blob. The one edge within the blob is solid, while all other edges are dashed." />
<em>Figure.</em> An example graph with a highlighted subset of vertices \(S\) (green). Only edges with both endpoints in \(S\) are placed in \(E\) — in this case, there is only a single solid edge. All other edges are not in \(E\). There is a high-degree vertex (\(1\)) which we would ideally put in \(S\): since it is adjacent to so many other vertices, adding it to \(S\) would make \(E\) much larger.</p>
<p>To deal with this issue, the algorithm must become substantially more complex. We design the new algorithm to treat vertices of different degrees differently, giving “higher priority” to storing high-degree vertices, and it also captures more information than the above algorithm — in particular, it stores edges that have <em>one</em> endpoint in the “sampled set”, as opposed to both.</p>
<p>Our new algorithm aims to estimate a <em>more detailed</em> object than the snapshot itself, which we call the <em>refined snapshot</em> of \(\Psi\). To define this object, we also choose a partition into intervals \(J_1,\ldots,J_D\) of the space \([0,O(n)]\) of possible degrees. (We only need that each interval has ratio \(O(1)\) between the minimum and maximum degrees it contains. For simplicity, we pick the intervals to be powers of two: \([1,2), [2,4), [4,8),\ldots\).) This lets us define a unique <em>degree class</em> in \(\{1,\ldots,D\}\) for every vertex, and a corresponding degree class in \(\{1,\ldots,D\}^2\) for every edge. Now the refined snapshot is a four-dimensional array \(\RSnap_\Psi \in \mathbb{R}^{D^2 \times B^2}\), whose \((d_1,d_2,b_1,b_2)\)-th entry is the number of edges in \(\Psi\) with degree class \((d_1,d_2)\) and bias class \((b_1,b_2)\).</p>
<p>Now, how do we estimate entries of this refined snapshot, i.e., estimate the number of edges in \(\Psi\) with degree class \((d_1,d_2)\) and bias class \((b_1,b_2)\)? First, we sample a subset \(\Phi_1 \subseteq \Psi\) of \(\Psi\)’s edges, which I’ll call a <em>slice</em>, in the following way:</p>
<ol>
<li>Sample a set \(S_1\) of vertices by including each vertex in \(\{1,\ldots,n\}\) independently w.p. \(p_1\).</li>
<li>Sample a set \(H_1\) of edges in \(\Psi\) by including each edge in \(\Psi\) independently w.p. \(q_1\).</li>
<li>\(\Phi_1\) consists of edges in \(H_1\) with at least one vertex in \(S_1\).</li>
</ol>
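<p>The slice-sampling procedure itself is a single pass, and can be sketched as follows (hypothetical names; a real streaming implementation would decide membership in \(S_1\) and \(H_1\) with hash functions rather than storing random sets):</p>

```python
import random

def sample_slice(stream, n, p, q, seed=0):
    """One-pass slice sampling: S collects each vertex w.p. p, H collects
    each edge w.p. q, and the slice keeps the edges of H that touch S."""
    rng = random.Random(seed)
    S = {v for v in range(n) if rng.random() < p}   # step 1: vertex sample S
    slice_edges = []
    for (u, v) in stream:
        if rng.random() < q:                        # step 2: the edge is in H
            if u in S or v in S:                    # step 3: at least one end in S
                slice_edges.append((u, v))
    return S, slice_edges
```

<p>With \(p = q = 1\) the slice is the whole stream, which makes the code easy to test; the interesting regime is \(p q = \tilde{O}(1/\sqrt n)\), where the slice fits in medium space.</p>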
<p>Here, \(p_1\) and \(q_1\) are two parameters that depend only on the degree class \(d_1\). We claim that a streaming algorithm can sample a slice (this follows from the definitions), and we observe that this slice can be stored in medium space assuming that \(p_1 q_1 = \tilde{O}(1/\sqrt n)\), since \(\Psi\) has \(O(n)\) edges and therefore \(\Phi_1\) has \(O(p_1q_1n)\) edges in expectation. We repeat the above process to produce a second slice \(\Phi_2\), with corresponding parameters \(p_2,q_2\), and then use the slices \(\Phi_1,\Phi_2\) to calculate our estimate of the snapshot.</p>
<p>The choices of \(p_1,q_1,p_2,q_2\) are delicate. Taking \(p_1,q_1\) as an example, if the highest degree in class \(J_{d_1}\) is constant, then we pick \(p_1=\Theta(1/\sqrt n)\) and \(q_1 = 1\), and our algorithm recovers the bounded-degree algorithm above. But in general, \(q_1\) is chosen so that vertices in degree-class \(J_{d_1}\) have expected constant degree in \(H_1\), which allows us to recover similar “bounded dependence” behavior to the bounded-degree algorithm and therefore get concentration in the estimate.</p>
<p>But still, how does the algorithm use the slices \(\Phi_1,\Phi_2\) to estimate the snapshot entry? Let \(W_1\) denote the set of “target” vertices in \(\Psi\) which <em>actually</em> have bias class \(b_1\) and degree class \(d_1\). Similarly, define \(W_2\) as the “target” vertices in bias class \(b_2\) and degree class \(d_2\). The \((d_1,d_2,b_1,b_2)\)-th entry of the snapshot is then simply \(|\Psi \cap (W_1 \times W_2)|\). Let \(V_1 = W_1 \cap S_1\) and \(V_2 = W_2 \cap S_2\). Suppose that the algorithm, in addition to the slices \(\Phi_1,\Phi_2\), received \(V_1,V_2\) as its input. Now note that for any edge \(e = (v_1,v_2) \in \Psi \cap (W_1 \times W_2)\), the event “\(e \in \Phi_1 \cap (V_1 \times V_2) \)” has probability \(p_1 p_2 q_1\), since the events “\(v_1 \in S_1\)”, “\(v_2 \in S_2\)”, and “\(e \in H_1\)” are all independent. We could therefore hope to use \(|\Phi_1 \cap (V_1 \times V_2)|\) to estimate the snapshot entry;<sup class="footnote-reference"><a href="#counting">9</a></sup> indeed, (assuming that \(d_1 < d_2\)) this turns out to be true, and the proof goes by first conditioning on \(H_1\), and then arguing that given \(H_1\), degrees are sufficiently small to imply bounded dependence of which edges are in \(\Phi_1\) over the choice of \(S_1,S_2\).</p>
<p>But unfortunately, the algorithm does not get to see the actual sets \(V_1\) and \(V_2\). Instead, we have to employ certain “proxy” sets \(\hat{V}_1,\hat{V}_2\). To define these sets, observe that in the graph \(H_1\), for every vertex \(v \in \{1,\ldots,n\}\),
\[ \mathbb{E}_{H_1}[\deg_{H_1}(v)] = q_1 \cdot \deg_{\Psi}(v). \] Thus, by just looking at the slice \(\Phi_1\), we can estimate the degree of every vertex in \(S_1\). We can similarly estimate the bias, since
\[ \mathbb{E}_{H_1}[\bias_{H_1}(v)] = \bias_\Psi(v). \] So, given \(\Phi_1\) we can define a set \(\hat{V}_1 \subseteq \{1,\ldots,n\}\) of vertices in \(S_1\) which <em>appear to have</em> bias class \(b_1\) and degree class \(d_1\), based on their estimated degrees and biases in the slice. \(\hat{V}_1\) is an “estimate” for \(V_1\), and similarly we can define \(\hat{V}_2\) “estimating” \(V_2\) using the second slice \(\Phi_2\).</p>
<h3 id="smoothing-the-snapshot">Smoothing the snapshot</h3>
<p>There is an additional complication caused by using “estimated” sets \(\hat{V}_1,\hat{V}_2\) instead of the actual sets \(V_1,V_2\): It is not improbable for there to be “extra” or “missing” vertices in the estimated sets. Suppose, for instance, there is a vertex \(v\) which is in degree class \(d_1+1\), but whose degree is close to the lower limit of the interval \(J_{d_1+1}\). Then \(v\) is by definition not in \(V_1\), but depending on the randomness of \(H_1\), it could end up in \(\hat{V}_1\) with decent probability. This means we actually cannot estimate any particular entry of the refined snapshot with good probability!</p>
<p>To deal with this issue, we slightly modify the underlying problem we are trying to solve: Instead of aiming to directly estimate the refined snapshot, we aim to estimate a “smoothed” version of this snapshot, where the entries “overlap”, in that each entry captures edges whose bias and degree classes fall into certain “windows”. More precisely, for some window-size parameter \(w\), the \((d_1,d_2,b_1,b_2)\)-th entry captures the number of edges whose degree class is in \(\{d_1-w,\ldots,d_1+w\} \times \{d_2-w,\ldots,d_2+w\}\) and bias class is in \(\{b_1-w,\ldots,b_1+w\} \times \{b_2-w,\ldots,b_2+w\}\). Each particular edge will fall into many (\(\sim w^4\)) of these windows, meaning that any errors from mistakenly shifting a vertex into adjacent bias or degree classes are “averaged out” for sufficiently large \(w\). Finally, we show that estimating the “smoothed” snapshot is still sufficient to estimate the <strong>Max-DICUT</strong> value using a continuity argument, essentially because slightly perturbing vertices’ biases cannot modify the <strong>Max-DICUT</strong> value too much.</p>
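<p>The windowed counting is simple to state in code; a minimal sketch, assuming the class vector \((d_1,d_2,b_1,b_2)\) has already been computed for each edge:</p>

```python
def smoothed_entry(edge_classes, center, w):
    """Count edges whose class vector (d1, d2, b1, b2) lies within distance w
    of `center` in every coordinate, i.e. inside the window around `center`."""
    return sum(1 for cls in edge_classes
               if all(abs(c - t) <= w for c, t in zip(cls, center)))
```

<p>An edge whose classes sit just across a boundary still lands in most windows around its true position, which is the averaging effect the proof exploits.</p>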
<h1 id="finale">Conclusion</h1>
<p>Several interesting open questions remain after the above results on streaming algorithms for <strong>Max-DICUT</strong>. Firstly, it would be interesting to extend these results to other CSPs besides <strong>Max-DICUT</strong>. For instance, we know of analogues for oblivious algorithms for <strong>Max-\(k\)AND</strong> for all \(k \geq 2\), but whether there are snapshot estimation algorithms that “implement” these oblivious algorithms in less-than-large space is an open question. Also, there is a yawning gap between medium and large space. Proving any approximation <em>impossibility</em> result, or constructing better approximation algorithms, in the between-medium-and-large space regime would be very exciting. We do mention that the snapshot-based approach cannot give optimal (i.e., ratio-\(1/2\)) approximations for <strong>Max-DICUT</strong> because of another result of Feige and Jozeph, namely, a pair of graphs \(\Psi,\Phi\) which have the same snapshot, but the ratio of their <strong>Max-DICUT</strong> values is strictly less than \(1/2\).</p>
<h1 id="bibliography">Bibliography</h1>
<p>J. Boyland, M. Hwang, T. Prasad, N. Singer, and S. Velusamy, “On sketching approximations for symmetric Boolean CSPs,” in <em>Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques</em>, A. Chakrabarti and C. Swamy, Eds., in LIPIcs, vol. 245. Schloss Dagstuhl — Leibniz-Zentrum für Informatik, Jul. 2022, p. 38:1–38:23. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.38">10.4230/LIPIcs.APPROX/RANDOM.2022.38</a>.</p>
<p>C.-N. Chou, A. Golovnev, and S. Velusamy, “Optimal Streaming Approximations for all Boolean Max-2CSPs and Max-\(k\)SAT,” in <em>IEEE 61st Annual Symposium on Foundations of Computer Science</em>, IEEE Computer Society, Nov. 2020, pp. 330–341. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1109/FOCS46700.2020.00039">10.1109/FOCS46700.2020.00039</a>.</p>
<p>U. Feige and S. Jozeph, “Oblivious Algorithms for the Maximum Directed Cut Problem,” <em>Algorithmica</em>, vol. 71, no. 2, pp. 409–428, Feb. 2015, doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1007/s00453-013-9806-z">10.1007/s00453-013-9806-z</a>.</p>
<p>V. Guruswami, A. Velingker, and S. Velusamy, “Streaming Complexity of Approximating Max 2CSP and Max Acyclic Subgraph,” in <em>Approximation, randomization, and combinatorial optimization. Algorithms and techniques</em>, K. Jansen, J. D. P. Rolim, D. Williamson, and S. S. Vempala, Eds., in LIPIcs, vol. 81. Schloss Dagstuhl — Leibniz-Zentrum für Informatik, Aug. 2017, p. 8:1–8:19. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2017.8">10.4230/LIPIcs.APPROX-RANDOM.2017.8</a>.</p>
<p>P. Indyk, “Stable distributions, pseudorandom generators, embeddings, and data stream computation,” <em>J. ACM</em>, vol. 53, no. 3, pp. 307–323, May 2006, doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1145/1147954.1147955">10.1145/1147954.1147955</a></p>
<p>M. Kapralov, S. Khanna, and M. Sudan, “Streaming lower bounds for approximating MAX-CUT,” in <em>Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms</em>, Society for Industrial and Applied Mathematics, Jan. 2015, pp. 1263–1282. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1137/1.9781611973730.84">10.1137/1.9781611973730.84</a>.</p>
<p>M. Kapralov and D. Krachun, “An optimal space lower bound for approximating MAX-CUT,” in <em>Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing,</em> Association for Computing Machinery, Jun. 2019, pp. 277–288. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3313276.3316364">10.1145/3313276.3316364</a>.</p>
<p>N. G. Singer, “Oblivious algorithms for the Max-\(k\)AND problem,” in <em>Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques</em>, N. Megow and A. D. Smith, Eds., in LIPIcs, vol. 275. May 2023. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2023.15">10.4230/LIPIcs.APPROX/RANDOM.2023.15</a>.</p>
<p>R. R. Saxena, N. G. Singer, M. Sudan, and S. Velusamy, “Streaming complexity of CSPs with randomly ordered constraints,” in <em>Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms</em>, Jan. 2023. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1137/1.9781611977554.ch156">10.1137/1.9781611977554.ch156</a>.</p>
<p>R. R. Saxena, N. Singer, M. Sudan, and S. Velusamy, “Improved streaming algorithms for Maximum Directed Cut via smoothed snapshots,” in <em>IEEE 63rd Annual Symposium on Foundations of Computer Science</em>, IEEE Computing Society, 2023, pp. 855–870. doi: <a rel="noopener" target="_blank" href="https://doi.org/10.1109/FOCS57990.2023.00055">10.1109/FOCS57990.2023.00055</a>.</p>
<div class="footnote-definition" id="ppty-tst"><sup class="footnote-definition-label">1</sup>
<p>More precisely, this typically means that the object is “far from” the set of objects having \(P\) in some mathematical sense. For instance, if the objects are graphs and the property \(P\) is the graph property of bipartiteness, “really not having \(P\)” might mean that many edges in the graph must be added or deleted in order to get \(P\) to hold.</p>
</div>
<div class="footnote-definition" id="contrast"><sup class="footnote-definition-label">2</sup>
<p>This is in contrast to more traditional areas of theory, such as time complexity, where many impossibility results are “conditional” on conjectures like \(\mathbf{P} \neq \mathbf{NP}\).</p>
</div>
<div class="footnote-definition" id="max"><sup class="footnote-definition-label">3</sup>
<p>It is also interesting to study <em>minimization</em> versions of CSPs (i.e., trying to minimize the number of <em>unsatisfied</em> constraints), but that is out of scope for this post.</p>
</div>
<div class="footnote-definition" id="cgv-ratio"><sup class="footnote-definition-label">4</sup>
<p>Specifically, Chou <em>et al.</em> showed a sharp threshold in the space needed for \(4/9\)-approximations. The analysis of their algorithm was subsequently simplified in a joint work of mine with Boyland, Hwang, Prasad, and Velusamy (APPROX’22).</p>
</div>
<div class="footnote-definition" id="opt-asst"><sup class="footnote-definition-label">5</sup>
<p>More precisely, there exists an optimal assignment with this property.</p>
</div>
<div class="footnote-definition" id="symmetry"><sup class="footnote-definition-label">6</sup>
<p>This is an oversimplification: The goal is to minimize the <em>approximation ratio</em> (i.e., the value of the oblivious assignment over the value of the optimal assignment). However, Feige and Jozeph observe that under a symmetry assumption for \(\pi\), it suffices to only minimize over instances where (i) the (unnormalized) value of the instance is \(1\) and (ii) the all-\(1\)’s assignment is optimal. Given (i), the algorithm’s ratio on an instance is simply the (unnormalized) expected value of the assignment produced by the oblivious algorithm, and (i) and (ii) together can be implemented as an additional linear constraint in the LP.</p>
</div>
<div class="footnote-definition" id="model"><sup class="footnote-definition-label">7</sup>
<p>This task is easier in some “nonstandard” streaming models. Firstly, suppose we were guaranteed that the edges showed up in the stream in a <em>uniformly random order</em>. Then since the first \(T\) edges in the stream are a random sample of \(\Psi\)’s edges, we could simply use these edges for our set \(E\), and then record the biases of their endpoints over the remainder of the stream. Alternatively, suppose we were allowed <em>two passes</em> over the stream of edges. We could then use the first pass to sample \(T\) random edges \(E\), and use the second pass to measure the biases of their endpoints. Both of these algorithms use small space, since we are only sampling a constant number of edges.</p>
</div>
<div class="footnote-definition" id="hash"><sup class="footnote-definition-label">8</sup>
<p>To avoid having to sample \(S\) upfront and store it, it turns out to be instead sufficient to use a \(4\)-wise independent hash function.</p>
</div>
<div class="footnote-definition" id="counting"><sup class="footnote-definition-label">9</sup>
<p>It turns out to be important for the concentration bounds that we use the slice with <em>smaller</em> degree, e.g., if \(d_1 < d_2\) then we count edges in \(\Phi_1\). In this case, if we instead counted edges in \(\Phi_2\), the expectation would be \(O(p_1 p_2 q_2 m)\), which could be smaller than \(1\) if \(d_2\) is very large.</p>
</div>
<div class="footnote-definition" id="factor"><sup class="footnote-definition-label">10</sup>
<p>More precisely, for all \(\epsilon>0\) these algorithms output some value \(\hat{v}\) satisfying \(\hat{v} \in (1\pm\epsilon) \|\mathbf{v}\|_p\) with high probability, and use \(O(\log n/\epsilon^{O(1)})\) space.</p>
</div>
<div class="footnote-definition" id="hash2"><sup class="footnote-definition-label">11</sup>
<p>See <sup class="footnote-reference"><a href="#hash">8</a></sup>.</p>
</div>
Integrating Static and Data-Driven Resource Analyses for Programs2024-03-16T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2024/hybrid-resource-analysis/<p>Resource analysis of programs aims to infer their worst-case cost bounds. It has
a number of practical use cases. For example, when executing a client’s program
in cloud computing, a cloud-service provider (e.g., Amazon Web Services or
Microsoft Azure) would like to avoid both over-provisioning resources (which
would reduce profits) and under-provisioning resources (which would violate the
service-level agreement). So the provider would like to estimate the resource
usage of the client’s program in advance, thereby optimizing resource
allocation.</p>
<p>There are two approaches to resource analysis: <em>static analysis</em> and
<em>data-driven analysis</em>. Static analysis infers a cost bound by examining the
source code and reasoning about all theoretically possible behaviors of a
program, including its worst-case behavior. Data-driven analysis
first runs the program on many inputs and records the execution costs. It then
analyzes the cost measurements to infer a cost bound.</p>
<p>Static and data-driven analyses have complementary strengths and weaknesses.
Static analysis is <em>sound</em>: if it returns some candidate cost bound, it is
guaranteed to be a valid upper bound on the actual execution cost. However,
resource analysis for a Turing-complete language is generally undecidable.
Consequently, static analysis is <em>incomplete</em>: no matter how clever the static
analysis is, there always exists a program that the static analysis cannot
handle. In contrast to static analysis, data-driven analysis can infer a cost
bound for any program. However, because data-driven analysis cannot rigorously
reason about the program’s worst-case behavior, data-driven analysis offers no
soundness guarantees of its inferred cost bounds.</p>
<p>In this blog post, we describe how to integrate static and data-driven resource
analyses into <em>hybrid resource analysis</em>. By combining the two complementary
analysis techniques, hybrid resource analysis partially retains their respective
strengths while mitigating their weaknesses. We first introduce static resource
analysis, followed by data-driven resource analysis. We then describe hybrid
resource analysis and demonstrate its advantages over static and data-driven
analyses using an example of a linear-time selection algorithm.</p>
<h1 id="formulation-of-resource-analysis">Formulation of Resource Analysis</h1>
<p>Given a program \(P\), the goal of resource analysis is to infer its
worst-case cost bound. Concretely, it is a function \(\text{cost}_{P} (x)\)
parametric in an input \(x\) (or its size \(\lvert x \rvert\)) to the
program \(P\) such that, for any input \(x\), the value \(\text{cost}_{P}
(x)\) is a correct upper bound of the execution cost of \(P(x)\). The
execution cost of the program \(P\) is defined by a resource metric such as
running time, memory, or energy.</p>
<p>To specify a resource metric of interest, a user inserts an instruction <code>tick q</code>
throughout their code, where <code>q</code> is a (positive or negative) number. The
instruction indicates that <code>q</code> many resources are consumed. For example, if we
are interested in the amount of memory (in bits) consumed by a program, whenever
a 64-bit memory cell is allocated in the source code, we indicate it by
inserting an instruction <code>tick 64</code>.</p>
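<p>The <code>tick</code> construct belongs to the analyzed (functional) language, but its bookkeeping can be mimicked with an ordinary counter; a Python sketch, purely for intuition:</p>

```python
cost = 0  # global resource counter standing in for the tick semantics

def tick(q):
    # Consume q resource units (q may be negative, modeling freed resources).
    global cost
    cost += q

def alloc_cell(value):
    tick(64)          # a 64-bit memory cell costs 64 units under this metric
    return [value]

cells = [alloc_cell(i) for i in range(3)]
# After three allocations the recorded cost is 3 * 64 = 192 units.
```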
<h1 id="static-analysis">Static Analysis</h1>
<p>To automate resource analysis, <em>static resource analysis</em> automatically analyzes
the source code of a program. In this blog post, we focus on <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1926385.1926427">Automatic
Amortized Resource Analysis</a>
(AARA) as a concrete example of state-of-the-art static resource analysis.
Taking as input a functional program \(P\), AARA tries to automatically infer
a polynomial cost bound of the program \(P\).</p>
<h2 id="aara">AARA</h2>
<p>AARA builds on the potential method from <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/10.1137/0606031">amortized
analysis</a>. Every data structure
during program execution is equipped with <em>potential</em>, which we can consider as
fuel for computation. To perform computation, data structures must come with
enough potential to pay for the cost of computation. For example, if we are to
run an instruction <code>tick 64</code>, we must have at least 64 units of potential
available. The remaining potential can be later used to pay for subsequent
computation. In the potential method, our goal is to figure out an appropriate
amount of potential to store in data structures such that, whenever they go
through computation, they have enough potential to pay for it. The overall cost
is then bounded above by the initial potential minus the final potential stored
in the data structures. So the difference between the initial and final
potential serves as a cost bound.</p>
<p>AARA uses types to express the amount of potential stored in data structures.
For illustration, consider a variable \(x\) of the integer-list type \(
L(\mathtt{int}) \). If we want the list \(x\) to come with one unit of
potential per list element, we write
$$
x: L^{1} (\mathtt{int})
$$
where the superscript 1 indicates the amount of potential stored in each list
element. The type \(L^{1} (\mathtt{int})\) is called a <em>resource-annotated
type</em>, and the superscript 1 is called a <em>resource annotation</em>.</p>
<p>Let’s look at a slightly more complicated example. Consider function <code>split</code>
that splits an input list \(x\) into two equal halves. The function traverses
the input \(x\) and processes each list element. If the cost of processing
each element is one, the total computational cost is equal to the input list
length \( \lvert x \rvert\). We express the resource usage of the function
<code>split</code> by
$$
x : L^{1} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{0} (\mathtt{int}) \times L^{0} (\mathtt{int})
$$
On the left-hand side of the turnstile (i.e., \(\vdash\)), we have \(x :
L^{1} (\mathtt{int})\), which means the input \(x\) carries one unit of
potential per element. On the right-hand side of the turnstile, we have
\(\mathtt{split} \; x : L^{0} (\mathtt{int}) \times L^{0} (\mathtt{int})\),
which means the two output lists of function <code>split</code> each carry zero units of
potential per element. This assignment of potential makes sense. In the input
list \(x\), each element initially carries one unit of potential. This
potential is used to pay for the cost of function <code>split</code>, and the
remaining zero units of potential are stored in the two output lists after
splitting. The difference between the input potential (i.e., \(1 \cdot \lvert x
\rvert\)) and output potential (i.e., zero) immediately translates to the cost
bound \(\lvert x \rvert\) of function <code>split</code>.</p>
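<p>To make the accounting concrete, here is a runnable <code>split</code> with a unit tick per element (a Python sketch, not the functional-language original; the alternating split is one arbitrary way to produce two equal halves):</p>

```python
def split(x):
    """Split x into two (almost) equal halves, charging tick 1 per element."""
    cost = 0
    left, right = [], []
    for i, v in enumerate(x):
        cost += 1                      # tick 1: processing one list element
        (left if i % 2 == 0 else right).append(v)
    return (left, right), cost

(y1, y2), cost = split([1, 2, 3, 4, 5])
# cost == 5 == |x|: the input potential 1*|x| minus the zero output potential
# is exactly the cost bound from the typing L^1 -> L^0 x L^0.
```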
<p>Although the function <code>split</code> stores zero potential in the output, we will need
a positive amount of potential in the output if it later undergoes computation
that demands potential. In such a case, we can increase the potential stored in
the input and output of the function <code>split</code>. For example, another valid
assignment of potential is
$$
x : L^{3} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{2} (\mathtt{int}) \times L^{2} (\mathtt{int})
$$
where the input carries three units of potential per list element and the output
carries two units of potential per element, which can be used to pay for subsequent
computation.</p>
<p>Given a program \(P\), how do we infer its resource-annotated type? First, we
assign numeric variables to all inputs and outputs that appear in \(P\)’s
source code. These variables stand for (yet-to-be-determined) resource
annotations, which encode the amounts of potential. For example, in the program
\(\mathtt{split} \; x\), the input and output of the whole program are
assigned variables \(q_0, q_1, q_2 \in \mathbb{R}_{\geq 0}\):
$$
x : L^{q_0} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{q_1} (\mathtt{int}) \times L^{q_2} (\mathtt{int})
$$</p>
<p>We then walk through the source code, collecting linear constraints that relate
the variables assigned to the inputs and outputs. For example, if the program
\(P\) contains instruction <code>tick 64</code>, AARA’s type system imposes a linear
constraint that the input potential must be at least 64 plus the leftover
potential after running <code>tick 64</code>. In the example of \(\mathtt{split} \; x\),
we obtain two linear constraints
$$
q_1 + 1 \leq q_0 \qquad q_2 + 1 \leq q_0
$$</p>
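<p>For <code>split</code>, the constraint system is small enough to inspect by hand; the following sketch checks candidate annotations against the two collected constraints plus non-negativity:</p>

```python
def satisfies(q0, q1, q2):
    # Non-negativity plus the two constraints collected for `split x`:
    # the input potential must cover the unit cost plus each output's potential.
    return min(q0, q1, q2) >= 0 and q1 + 1 <= q0 and q2 + 1 <= q0

# Minimizing the input annotation q0 yields q0 = 1, q1 = q2 = 0, i.e. the
# tight bound |x|; the looser typing (3, 2, 2) from the text is also feasible.
assert satisfies(1, 0, 0) and satisfies(3, 2, 2)
assert not satisfies(0.5, 0, 0)
```

<p>In the real system the objective and constraints go to an off-the-shelf LP solver; this check only illustrates what any solution must satisfy.</p>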
<p>Finally, we solve the linear constraints using a linear-program (LP) solver. If
we obtain a solution, we can extract a resource-annotated type of the program
\(P\).</p>
<h2 id="incompleteness">Incompleteness</h2>
<p>The static resource analysis technique AARA is sound but incomplete. Soundness
means that, if AARA returns a candidate cost bound, it is guaranteed to be a
valid worst-case cost bound. However, AARA is incomplete: even if a program
\(P\) has a polynomial cost bound, AARA can fail to infer it. This happens
when the linear constraints collected during type inference are unsolvable. In
fact, this limitation is not unique to AARA. All static resource analysis
techniques suffer from incompleteness because resource analysis is undecidable in
general.</p>
<p>To illustrate AARA’s incompleteness, let us consider the median-of-medians-based
linear-time selection algorithm. Given an input list \(x\) and an input
integer \(i\), the selection algorithm returns the \(i\)-th smallest element
in the list \(x\).</p>
<p>In this algorithm, we split input list \(x\) into blocks of five elements and
compute each block’s median (e.g., by brute force). We then recursively call the
algorithm on these medians to compute their median \(m\). The median of
medians \(m\) (hence the name of the algorithm) is used to partition the input
list \(x\) into two lists, \(x_1\) and \(x_2\). To prove the linear time
complexity of this algorithm, we must show that the sublists \(x_1\) and
\(x_2\) each contain at most a \(7/10\) fraction of the elements of the list \(x\). Intuitively, the
median of medians \(m\) partitions the list \(x\) <em>evenly</em> (up to some
factor) even in the worst case.</p>
<p>However, AARA cannot reason about the mathematical properties of medians. As a
result, it cannot conclude that the median of medians \(m\) splits the list
\(x\) evenly. Instead, AARA deduces that, in the worst case, the list \(x\)
is split unevenly into a singleton list (i.e., containing only one element) and
the remaining sublist. If the list \(x\) were split this way, the worst-case
time complexity would be exponential. Hence, AARA is unable to infer a
polynomial cost bound for this algorithm even though it has a linear cost bound.
Furthermore, we are not aware of static analysis techniques that can correctly
infer linear cost bounds for the linear-time selection algorithm.</p>
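<p>For reference, the algorithm itself is short; a runnable Python sketch with an explicit cost counter (one unit per element touched at each recursion level) shows the behavior AARA cannot certify:</p>

```python
def select(x, i, counter):
    """Return the i-th smallest element of x (0-indexed), counting one
    cost unit per element touched at each level of recursion."""
    counter[0] += len(x)
    if len(x) <= 5:
        return sorted(x)[i]
    # Median of each block of five, then recurse for the median of medians.
    blocks = [x[j:j + 5] for j in range(0, len(x), 5)]
    medians = [sorted(b)[len(b) // 2] for b in blocks]
    m = select(medians, len(medians) // 2, counter)
    smaller = [v for v in x if v < m]
    larger = [v for v in x if v > m]
    equal = len(x) - len(smaller) - len(larger)
    if i < len(smaller):
        return select(smaller, i, counter)
    elif i < len(smaller) + equal:
        return m
    else:
        return select(larger, i - len(smaller) - equal, counter)

c = [0]
assert select(list(range(100, 0, -1)), 10, c) == 11   # 11th-smallest of 1..100
```

<p>The counter grows linearly in \(\lvert x \rvert\) because each recursive call discards a constant fraction of the list, but proving that requires exactly the \(7/10\)-partition argument that AARA cannot express.</p>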
<h1 id="data-driven-analysis">Data-Driven Analysis</h1>
<p>The second approach to automatic resource analysis is <em>data-driven resource
analysis</em>. It starts with collecting cost measurements of the program \(P\).
Given inputs \( x_1, \ldots, x_n\), we run \( P \; x_i \) for each \(1
\leq i \leq n\) and record its output \( y_{i} \) and execution cost \(c_i
\in \mathbb{R}_{\geq 0}\). This yields a runtime cost dataset \(\mathcal{D}\)
defined as
$$
\mathcal{D} \coloneqq \{ (x_i, y_i, c_i) \mid 1 \leq i \leq n \}
$$
The dataset \(\mathcal{D}\) records output \(y_i\) as well as input
\(x_i\) since we need to know output sizes to calculate a cost bound (i.e.,
the difference between the input and output potential). We then infer a cost
bound of the program \(P\) by analyzing the dataset \(\mathcal{D}\). We have
a variety of choices for data-analysis techniques, ranging from linear
regression to deep learning.</p>
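<p>The data-collection step can be sketched in a few lines of Python, using a toy instrumented <code>split</code> whose cost is its input length (all names here are hypothetical):</p>

```python
import random

def split_with_cost(x):
    # Toy instrumented program: unit cost per element processed.
    return (x[0::2], x[1::2]), len(x)

def collect_dataset(inputs):
    """Run the program on each input, recording (input, output, cost)."""
    data = []
    for x in inputs:
        y, c = split_with_cost(x)
        data.append((x, y, c))
    return data

rng = random.Random(0)
inputs = [[rng.randint(0, 9) for _ in range(n)] for n in (2, 5, 8)]
D = collect_dataset(inputs)
# The recorded costs are the input lengths: [2, 5, 8].
```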
<h2 id="bayesian-resource-analysis">Bayesian Resource Analysis</h2>
<p>This blog post introduces <em>Bayesian resource analysis</em>, where we apply <em>Bayesian
inference</em> to resource analysis. In abstract, the goal of Bayesian inference is
to infer <em>latent variables</em> \(\theta\) (i.e., variables that we want to know
but cannot observe) from <em>observed variables</em> \(D\) (i.e., variables whose
concrete values are available) using Bayes’ rule from probability theory. In
Bayesian resource analysis, latent variables \(\theta\) are resource
annotations (i.e., cost bounds), and observed variables \(D\) are the runtime
cost dataset of the program.</p>
<p>To conduct Bayesian resource analysis, the user first provides a <em>probabilistic
model</em>, which specifies a joint probability distribution \( p (\theta, D) \)
of resource annotations \(\theta\) and dataset \(D\). Next, by Bayes’ rule,
the <em>posterior distribution</em> of the cost bound \(\theta\) conditioned on a
concrete dataset \(\mathcal{D}\) is given by
$$
p (\theta \mid D = \mathcal{D})
= \frac{p (\theta, D = \mathcal{D})}{p (D = \mathcal{D})}
= \frac{p (\theta, D = \mathcal{D})}{\int p (\theta, D = \mathcal{D}) \, \mathrm{d} \theta}
$$
This equation suggests that we can compute the posterior distribution \(p
(\theta \mid D = \mathcal{D})\) by taking the ratio between the joint
distribution \(p (\theta, D = \mathcal{D})\) and the denominator \(\int p
(\theta, D = \mathcal{D}) \, \mathrm{d} \theta\). However, because the
denominator is an integral over the space of the resource annotations
\(\theta\), which may have many dimensions, it is often intractable to compute
the denominator. As a result, we cannot precisely compute the posterior
distribution \(p (\theta \mid D = \mathcal{D})\) by directly applying Bayes’
rule.</p>
<p>Instead, in practice, we run a sampling-based Bayesian inference algorithm,
drawing a large number of samples from the posterior distribution. We then use
these samples, which serve as an approximation of the posterior distribution, to
estimate various properties (e.g., mean, median, variance, etc.) of the
posterior distribution.</p>
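<p>A minimal random-walk Metropolis sampler illustrates the idea for a one-parameter bound \(c_{\text{predict}} = q_0 \lvert x \rvert\). This is a simplified sketch with a flat prior, not the authors’ inference pipeline; the likelihood is zero whenever an observed cost exceeds the predicted bound, so every accepted sample is a sound bound on the observed data:</p>

```python
import math
import random

def log_likelihood(q0, data, sigma=2.0):
    """Each cost c must lie below the predicted bound q0 * size; costs nearer
    the bound are more likely (a simplified truncated-normal model)."""
    total = 0.0
    for size, c in data:
        pred = q0 * size
        if c > pred:
            return -math.inf          # q0 would not be a sound bound
        total += -((c - pred) ** 2) / (2 * sigma ** 2)
    return total

def metropolis(data, steps=2000, seed=0):
    rng = random.Random(seed)
    q0, ll = 2.0, log_likelihood(2.0, data)
    samples = []
    for _ in range(steps):
        prop = q0 + rng.gauss(0, 0.1)         # random-walk proposal
        if prop >= 0:
            pll = log_likelihood(prop, data)
            if pll >= ll or rng.random() < math.exp(pll - ll):
                q0, ll = prop, pll            # accept the proposal
        samples.append(q0)
    return samples

data = [(n, n) for n in (4, 8, 16)]           # observed cost is exactly |x|
samples = metropolis(data)
# Every accepted state satisfies the bound, so min(samples) >= 1.
```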
<p>Figure 1 displays a schematic diagram of Bayesian resource analysis. We perform
Bayesian inference to infer a posterior distribution of cost bounds (blue lines)
from the runtime cost measurements (black dots).</p>
<figure>
<img src="./bayespc.jpg" alt="schematic diagram for Bayesian resource analysis" width="500"/>
<figcaption>
Figure 1. Schematic diagram of Bayesian resource analysis. We perform Bayesian
inference to infer a posterior distribution of cost bounds (blue lines) from the
runtime cost measurements (black dots).
</figcaption>
</figure>
<p>To illustrate Bayesian resource analysis, consider the function <code>split</code> that was
introduced <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#aara">earlier</a>. Its resource-annotated type has the form
$$
x : L^{q_0} (\mathtt{int}) \vdash \mathtt{split} \; x : L^{q_1} (\mathtt{int}) \times L^{q_2} (\mathtt{int})
$$
where the resource annotations \(q_0, q_1, q_2 \in \mathbb{R}_{\geq 0}\) are
to be inferred by Bayesian inference.</p>
<p>Let a dataset \(\mathcal{D}\) of runtime cost
measurements be
$$
\mathcal{D} \coloneqq \{(x_i,(y_{i,1}, y_{i,2}),c_i) \mid 1 \leq i \leq n \}
$$
where \(x_i\) is an input list, \((y_{i,1}, y_{i,2})\) is a pair of
two output lists, and \(c_i\) is the cost of running \(\mathtt{split} \;
x_i\).</p>
<p>The user constructs a probabilistic model \(p (\theta, D)\) based on whatever
domain knowledge<sup class="footnote-reference"><a href="#domain_knowledge">1</a></sup> they have. For example, a probabilistic
model for <code>split</code> can be
$$
\begin{aligned}
q_0, q_1, q_2 & \sim \mathrm{Normal}_{[0, \infty)}(0, 5) \\
c_{i, \text{predict}} & = q_0 \lvert x_i \rvert - q_1 \lvert y_{i,1} \rvert - q_2 \lvert y_{i,2} \rvert & \qquad (i = 1, \ldots, n) \\
c_i & \sim \mathrm{Normal}_{[0, c_{i, \text{predict}}]}(c_{i, \text{predict}}, 2) & (i = 1, \ldots, n)
\end{aligned}
$$
The first line states that the resource annotations \(q_0, q_1, q_2\) follow a
normal distribution truncated to the non-negative region. In the second line,
the predicted costs \(c_{i, \text{predict}}\) are defined as \( q_0 \lvert
x_i \rvert - q_1 \lvert y_{i,1} \rvert - q_2 \lvert y_{i,2} \rvert \),
which is the difference between input and output potential. The third line
states that the observed costs \(c_i\) from the dataset \(\mathcal{D}\)
follow a normal distribution truncated to the interval \([0, c_{i,
\text{predict}}]\). The distribution is truncated because the prediction of a
<em>worst-case</em> cost bound must be larger than or equal to the observed cost.</p>
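To make this concrete, the model above can be sampled with a minimal, stdlib-only Metropolis sketch. Everything here is illustrative: the dataset assumes a hypothetical cost model of one unit of work per input element, and the step size, iteration count, and starting point are arbitrary choices; a real analysis would use a dedicated inference framework.

```python
import math
import random

random.seed(0)

# Hypothetical dataset for `split`: (|x_i|, |y_i1|, |y_i2|, observed cost c_i),
# assuming a cost of one unit of work per input element.
data = [(n, (n + 1) // 2, n // 2, float(n)) for n in (2, 5, 8, 11, 17, 23)]

def log_trunc_normal(x, mu, sigma, lo, hi):
    """Log density of Normal(mu, sigma) truncated to [lo, hi]."""
    if hi <= lo or not (lo <= x <= hi):
        return float("-inf")
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    mass = cdf((hi - mu) / sigma) - cdf((lo - mu) / sigma)
    if mass <= 0.0:
        return float("-inf")
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi)) - math.log(mass)

def log_posterior(q):
    q0, q1, q2 = q
    # Priors: Normal(0, 5) truncated to [0, infinity).
    lp = sum(log_trunc_normal(qi, 0.0, 5.0, 0.0, float("inf")) for qi in q)
    for n, m1, m2, c in data:
        c_pred = q0 * n - q1 * m1 - q2 * m2   # input minus output potential
        lp += log_trunc_normal(c, c_pred, 2.0, 0.0, c_pred)
    return lp

def metropolis(n_iter=2000, step=0.3):
    q = [3.0, 0.0, 0.0]        # feasible start: the bound 3|x| covers every c_i
    lp = log_posterior(q)
    samples = []
    for _ in range(n_iter):
        prop = [qi + random.gauss(0.0, step) for qi in q]
        lp_prop = log_posterior(prop)
        if math.log(random.random() + 1e-300) < lp_prop - lp:
            q, lp = prop, lp_prop
        samples.append(list(q))
    return samples

samples = metropolis()
```

Because infeasible proposals (negative annotations, or a predicted cost below an observed cost) get log density negative infinity, every retained sample is a valid bound for the observed data; summary statistics of the samples then serve as point estimates of the coefficients.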
<h2 id="unsoundness">Unsoundness</h2>
<p>Data-driven analysis can infer a cost bound for any program \(P\), provided
that it terminates. When we construct a dataset \(\mathcal{D}\) of the program
\(P\)’s runtime cost measurements, the program must terminate on all inputs;
otherwise, we will never finish collecting runtime cost measurements. Once we
finish data collection, we statistically infer a polynomial cost bound from the
dataset \(\mathcal{D}\). As the dataset \(\mathcal{D}\) is finite, for any
degree \(d \geq 0\), we always have some degree-\(d\) polynomial bound
\(\text{cost}_P\) that lies above all runtime cost measurements of
\(\mathcal{D}\). Therefore, data-driven analysis always returns an inference
result.</p>
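To see why an inference result always exists, here is a minimal sketch (with hypothetical measurements, and simplified to a single monomial rather than a full polynomial): scaling a degree-\(d\) term until it dominates every observation always succeeds on a finite dataset.

```python
def degree_d_bound(measurements, d):
    """Smallest coefficient a such that a * n**d >= c for every (n, c) observed."""
    return max(c / max(n, 1) ** d for n, c in measurements)

# Hypothetical cost measurements: (input size, observed cost).
obs = [(1, 3.0), (4, 10.0), (9, 21.0), (16, 40.0)]
a1 = degree_d_bound(obs, 1)   # linear bound a1 * n
a2 = degree_d_bound(obs, 2)   # quadratic bound a2 * n**2
```

Any degree works: the resulting curve lies above all measurements by construction, which is exactly why finite data alone cannot distinguish a valid worst-case bound from an unsound one.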
<p>However, data-driven analysis lacks the soundness guarantee<sup class="footnote-reference"><a href="#soundness">2</a></sup> of
inference results. Data-driven analysis examines runtime cost data of the input
program \(P\), rather than its source code. Consequently, data-driven analysis
cannot reason about the theoretically worst-case behavior of the program
\(P\), failing to provide a soundness guarantee.</p>
<p>For illustration, we ran Bayesian resource analysis on the
median-of-medians-based linear-time selection algorithm. To collect cost
measurements, we randomly generated input lists. Figure 2 plots the inferred
posterior distribution of cost bounds. Black dots are runtime cost measurements.
The light-blue shade is the 10-90th percentile range of the posterior
distribution, and the blue line is the median cost bound. The red line is the
true worst-case cost bound.</p>
<figure>
<img src="./posterior_distribution_BayesPC.jpg" alt="posterior distributions of Bayesian resource analysis for the linear-time selection algorithm" width="400"/>
<figcaption>
Figure 2. Inference result of Bayesian resource analysis for the linear-time
selection algorithm. Black dots are runtime cost measurements. The light-blue
shade is the 10-90th percentile range of the cost bounds sampled from the
posterior distribution, and the blue line is the median cost bound. The red line
is the true worst-case cost bound.
</figcaption>
</figure>
<p>Although the inferred cost bounds (light-blue shade) all lie above the observed
costs (black dots), they are unsound worst-case cost bounds since they lie below
the true worst-case cost bound (red line). The true cost bound (red line) is
significantly larger than observed costs (black dots) because, when inputs are
randomly generated, the worst-case behavior of the selection algorithm rarely
arises.</p>
<p>Certainly, we can fix this problem by adjusting the probabilistic model such
that it adds more buffer on top of the maximum observed costs in the dataset
\(\mathcal{D}\). However, it is difficult to tell a priori how much buffer we
should add. On the other hand, static analysis is better suited for reasoning
about the worst-case behavior than data-driven analysis. But if we perform
static analysis on the entire source code of the linear-time selection
algorithm, the analysis fails as described <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#incompleteness">earlier</a>. Can we
perform static analysis on a fragment of the source code and data-driven
analysis on the rest?</p>
<h1 id="hybrid-analysis">Hybrid Analysis</h1>
<p>To overcome the limitations of purely static analysis (e.g., conventional AARA)
and purely data-driven analysis (e.g., Bayesian resource analysis), we integrate
them into a framework called hybrid AARA.</p>
<h2 id="hybrid-aara">Hybrid AARA</h2>
<p>First, a user indicates which part of the source code should be analyzed by
data-driven analysis. For example, if we want expression <code>e</code> to be analyzed by
data-driven analysis, we enclose <code>e</code> with the annotation <code>statistics</code>, resulting
in <code>statistics(e)</code>. The rest of the source code will be analyzed by static
analysis. To construct a dataset \(\mathcal{D}\) of runtime cost measurements,
we run the program \(P\) on many inputs (to \(P\)) and record the inputs,
outputs, and costs of the expression <code>e</code> inside <code>statistics(e)</code>. Here, the input
to the expression <code>e</code> is its evaluation context (i.e., the values of free
variables appearing in <code>e</code>) during the program \(P\)’s execution.</p>
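A rough Python analogue of this data-collection step can be sketched as follows. The `statistics` decorator, the `partition` fragment, and the explicit cost counter are all illustrative inventions, not the actual hybrid AARA tooling; the point is only that the annotated fragment logs its (input, output, cost) triples while the surrounding program runs normally.

```python
# Global log of (input, output, cost) triples for the annotated fragment.
measurements = []

def statistics(f):
    """Sketch of a `statistics(e)`-style annotation: record each evaluation."""
    def wrapped(*args):
        cost = [0]                       # cost counter threaded through the call
        out = f(*args, cost=cost)
        measurements.append((args, out, cost[0]))
        return out
    return wrapped

@statistics
def partition(xs, pivot, cost):
    """Fragment chosen for data-driven analysis: one cost unit per element."""
    lo, hi = [], []
    for x in xs:
        cost[0] += 1
        (lo if x < pivot else hi).append(x)
    return lo, hi

# Running the whole program exercises the fragment in its real context,
# so the recorded inputs reflect the fragment's actual evaluation contexts.
for xs in ([3, 1, 4, 1, 5], [9, 2, 6]):
    partition(xs, 4)
```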
<p>The data-driven analysis of the expression <code>e</code> incorporates its contextual
information at runtime as the dataset \(\mathcal{D}\) captures this contextual
information. For example, suppose <code>statistics(e)</code> appears inside the if-branch
of a conditional expression <code>if ... then ... else ...</code>. If the if-branch
satisfies some invariant (e.g., inside the if-branch, variable <code>x</code> appearing
inside <code>e</code> is even), then all measurements recorded in the dataset
\(\mathcal{D}\) satisfy this invariant. Thus, data-driven analysis does not
analyze the expression <code>e</code> in isolation from its context.</p>
<p>Next, we infer a cost bound in hybrid AARA. Given a program \(P\) containing
<code>statistics(e)</code> for some expression <code>e</code>, we perform data-driven analysis on <code>e</code>
and static analysis on the rest of the source code, and then combine their
inference results. Just like we did for the cost-bound inference of
<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2024/hybrid-resource-analysis/#aara">conventional AARA</a>, we assign variables to inputs and outputs throughout
the source code of the program \(P\), where these variables stand for
yet-to-be-inferred resource annotations. In hybrid AARA, however, we do not
assign variables <em>inside</em> the expression <code>e</code>. That is, the expression <code>e</code> is
treated as a black box whose source code is invisible. Let \(\theta_e\) be the
set of resource annotations in the input and output of the expression
<code>statistics(e)</code>. Also, let \(\theta \supseteq \theta_e\) be a set of all
variables in the program \(P\)’s entire source code.</p>
<p>A key challenge in hybrid AARA is the interface between conventional AARA and
Bayesian resource analysis. Suppose conventional AARA generates a set \(C\) of
linear constraints over the variables \(\theta\). Any solution to the linear
constraints \(C\) is a valid cost bound. Conventional AARA optimizes the
variables \(\theta\) subject to the linear constraints \(C\). This
optimization problem is solved by an LP solver. On the other hand, Bayesian
resource analysis infers a posterior distribution of the variables \(\theta_e
\subseteq \theta\) by running a sampling-based Bayesian inference algorithm.
Thus, conventional AARA and Bayesian resource analysis both involve the
variables \(\theta_e\) in common, but they each use different algorithms for
inference. How do we design their interface?</p>
<p>One idea is to restrict the state space of the sampling algorithm to the
feasible region of the linear constraints \(C\). We first construct a
probabilistic model over all variables \(\theta \supseteq \theta_e\), which
represent resource annotations in the program \(P\)’s source code. We then run
a sampling-based Bayesian inference algorithm over these variables, subject to
the linear constraints \(C\). Thanks to the constraints \(C\), any cost
bound drawn from the posterior distribution is a valid cost bound according to
conventional AARA.</p>
<p>To implement hybrid AARA, we rely on recent advances in the literature of
sampling algorithms. In 2021, a C++ library
<a rel="noopener" target="_blank" href="https://github.com/GeomScale/volesti">volesti</a> started to support
sampling from a user-specified probability distribution subject to arbitrary
linear constraints. This is the first (and so far, only) tool that supports such
sampling. Popular programming languages for Bayesian inference, such as
<a rel="noopener" target="_blank" href="https://mc-stan.org/">Stan</a>, only support box constraints (i.e., upper and
lower bounds) on random variables, but not arbitrary linear constraints that may
involve multiple variables.</p>
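To convey the flavor of constrained sampling, here is a minimal hit-and-run sketch restricted to a polytope {x : Ax <= b}. It targets only the uniform distribution over a toy two-dimensional region and is far simpler than what volesti provides, but it shows the key move: every step stays inside the feasible region of the linear constraints.

```python
import random

random.seed(1)

def hit_and_run(A, b, x0, n_samples=500):
    """Uniform sampling over the polytope {x : A @ x <= b} via hit-and-run."""
    x = list(x0)
    out = []
    for _ in range(n_samples):
        d = [random.gauss(0.0, 1.0) for _ in x]       # random direction
        t_lo, t_hi = float("-inf"), float("inf")
        # Intersect the line x + t*d with every half-space a . y <= b_i.
        for a, bi in zip(A, b):
            ad = sum(ai * di for ai, di in zip(a, d))
            slack = bi - sum(ai * xi for ai, xi in zip(a, x))
            if abs(ad) < 1e-12:
                continue
            if ad > 0.0:
                t_hi = min(t_hi, slack / ad)
            else:
                t_lo = max(t_lo, slack / ad)
        t = random.uniform(t_lo, t_hi)                # uniform on the chord
        x = [xi + t * di for xi, di in zip(x, d)]
        out.append(list(x))
    return out

# Toy feasible region for two annotations: q0 >= 0, q1 >= 0, q0 + q1 <= 1.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]
samples = hit_and_run(A, b, [0.25, 0.25])
```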
<h2 id="evaluation">Evaluation</h2>
<p>We ran hybrid AARA on the linear-time selection algorithm. Inside the selection
algorithm’s source code, the code fragment <code>partition x m</code>, which partitions
list <code>x</code> around the median of medians <code>m</code>, is analyzed by data-driven analysis.
The rest of the source code is analyzed by static analysis. Figure 3 displays
the inference results.</p>
<figure>
<img src="./posterior_distribution_hybrid_BayesPC.jpg" alt="posterior distributions of hybrid resource analysis for the linear-time selection algorithm" width="400"/>
<figcaption>
Figure 3. Inference results of hybrid AARA for the linear-time selection
algorithm. Black dots are runtime cost measurements. The light-blue shade is the
10-90th percentile range of the cost bounds sampled from the posterior
distribution, and the blue line is the median cost bound. The red line is the
true cost bound.
</figcaption>
</figure>
<p>In Figure 3, the 10-90th percentile ranges (light-blue shade) of inferred cost
bounds now contain or lie above the ground truth (red line). This is a
significant improvement over Bayesian resource analysis (Figure 2), where the
inferred cost bounds are below the true worst-case bounds.</p>
<p>More generally, we have evaluated hybrid AARA on a suite of seven benchmarks:
<code>MapAppend</code>, <code>Concat</code>, <code>InsertionSort2</code>, <code>QuickSort</code>, <code>QuickSelection</code>,
<code>MedianOfMedians</code> (this is the linear-time selection algorithm we have seen in
this blog post), and <code>ZAlgorithm</code>. Conventional AARA fails in all benchmarks as
they each contain a code fragment that cannot be analyzed statically by
conventional AARA. Further, in all benchmarks, hybrid AARA infers cost bounds
closer to the true worst-case cost bounds than purely data-driven resource
analysis. Thus, our evaluation demonstrates the benefits of hybrid resource
analysis: hybrid analysis infers more accurate worst-case cost bounds than
purely data-driven analysis, while overcoming the incompleteness of purely
static analysis. The details of the evaluation can be found in our paper <em>Robust
Resource Bounds with Static Analysis and Bayesian Inference</em>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Hybrid resource analysis combines purely static analysis, which offers soundness
guarantees of worst-case cost bounds but is incomplete, and purely data-driven
analysis, which is not sound but can infer a cost bound for any program. By
combining these two complementary analysis techniques, hybrid resource analysis
successfully infers cost bounds that neither purely static analysis nor purely
data-driven analysis can infer. This is demonstrated by the experiment results
of the linear-time selection algorithm.</p>
<p>Hybrid AARA has a limitation that its data-driven analysis only infers resource
annotations, but not other quantities (e.g., depth of recursion). As a result,
hybrid AARA cannot handle some programs such as bubble sort. In bubble sort,
conventional AARA cannot infer the number of recursive calls, but it can still
correctly infer a cost bound of each recursive call. Therefore, ideally, we
would like to infer (i) the number of recursive calls by data-driven analysis
and (ii) the cost of each recursive call by conventional AARA. This requires a
different hybrid analysis technique from hybrid AARA presented in this blog
post, and we plan to investigate it as future work.</p>
<p>Acknowledgement: hybrid AARA is joint work with <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Efsaad/">Feras
Saad</a> and <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Ejanh/">Jan
Hoffmann</a>. Our paper <em>Robust Resource Bounds with
Static Analysis and Bayesian Inference</em> is currently under review, and hopefully
we can share it soon.</p>
<p>Further reading: the <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/640128.604148">original paper on
AARA</a> targets linear cost
bounds. Subsequently, AARA has been extended to <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1007/978-3-642-11957-6_16">univariate polynomial cost
bounds</a> and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1926385.1926427">multivariate
polynomial cost bounds</a>.
<a rel="noopener" target="_blank" href="https://www.raml.co/">Resource-aware ML</a> is an implementation of AARA for
analyzing OCaml programs. Papers on data-driven resource analysis include the
<a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/1287624.1287681">trend profiler</a>, <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2254064.2254074">algorithmic
profiler</a>, and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/2254064.2254076">input-sensitive
profiler</a>. They are all
concerned with average-case cost bounds, rather than worst-case cost bounds as
in this blog post. Also, these papers all use optimization instead of Bayesian
inference.</p>
<div class="footnote-definition" id="domain_knowledge"><sup class="footnote-definition-label">1</sup>
<p>In contrast to static analysis, data-driven analysis
(including Bayesian resource analysis) always requires the user’s domain
knowledge to construct a statistical model. This dependency on domain
knowledge is inherent in statistics: the inference result depends on our
choice of data-analysis methodologies (e.g., optimization or Bayesian
inference), statistical models, and hyperparameters.</p>
</div>
<div class="footnote-definition" id="soundness"><sup class="footnote-definition-label">2</sup>
<p>Here, by the lack of soundness guarantee, we mean that the cost
bound inferred by data-driven analysis is not guaranteed to be a valid
worst-case cost bound for all possible inputs to the program. Nonetheless,
data-driven analysis is sound with respect to the probabilistic model and
dataset used, as long as we use a correct inference algorithm.</p>
</div>
Baleen: ML admission & prefetching for flash caches2024-01-12T00:00:00+00:002024-01-12T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/baleen-ml-flash-caching/
<h1 id="introduction">Introduction</h1>
<p>Large-scale storage is still dominated by hard disks (HDDs) as they are cost-effective. However, HDDs are limited to ~100 IOs per second. Thus, modern storage systems in datacenters widely rely on flash caches to absorb backend load and reduce the number of HDDs required to satisfy the IO workload.</p>
<p>While flash has orders of magnitude higher IOPS, it suffers from wearout as it is written to.
Flash drive lifetime projections assume relatively low average write rates such as “three drive-writes per day (DWPD)”, meaning 3N TB of writes to an N TB flash drive each day. Flash drives with even lower write endurance (e.g., 1 DWPD) are priced correspondingly lower. Given that traditional cache management policies designed for dynamic random-access memory (DRAM) can incur writes exceeding 100 DWPD, there is a need for smart flash admission policies to filter the right items to be written into cache.</p>
<p>Machine learning (ML) policies have been proposed to improve upon historically popular policies, which include random admission and history-based policies that reject items without sufficient recent usage. However, caching is a challenging problem for ML to get right<a rel="noopener" target="_blank" href="https://pdl.cmu.edu/PDL-FTP/BigLearning/2018MachineLearningCDNcache_HOTNETS.pdf">[3]</a>. Furthermore, systems practitioners desire that policies also be understandable in addition to being performant<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a>.</p>
<p>We decompose the flash caching problem into admission, prefetching, and eviction. This helps us align policy decisions to well-understood supervised ML techniques. We also co-design these components, as we show that a policy can have synergistic or antagonistic effects on other parts of the system.</p>
<p>The Baleen flash cache exploits a new cache residency model (which we call episodes) to improve ML training effectiveness. The episodes model also enables a new useful comparison point (OPT). Baleen focuses on optimizing for an end-to-end metric (HDD disk-head time) that balances IOPS and bandwidth, rather than hit rate. We find that a combination of ML-guided admission and ML-guided prefetching works best in reducing peak backend load.</p>
<p>Baleen reduces HDD peak load by 11.8% over state-of-the-art policies on seven recent real-world storage cluster traces collected over 3 years. This work is under submission.</p>
<h1 id="background-bulk-storage-systems">Background: Bulk storage systems</h1>
<p>Bulk storage systems are relied on by hyperscalers to aggregate persistent
storage needs in data centers including blob storage and data warehouses
(such as HDFS<a rel="noopener" target="_blank" href="https://hadoop.apache.org">[7]</a>). Users might not
even know they are using one, as such systems function quietly behind the scenes
at cloud computing platforms like Amazon Web Services, Google Cloud Platform and
Microsoft Azure. In this paper, we use Meta’s Tectonic<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">[2]</a> as an important
and representative example of a bulk storage system. Many other systems have a
similar design (e.g., Google’s Colossus<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a><a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">[5]</a>, YouTube<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">[6]</a>). In Tectonic, as in other systems<a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">[5]</a><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">[6]</a>, flash caches are used to reduce load on the backing HDDs and meet throughput requirements.</p>
<p>Accesses are made to byte ranges within blocks. Blocks are mapped to a location on backing HDDs, and subdivided into many smaller units called segments that can be individually cached (Tectonic has 8 MB blocks and 128 KB segments). Upon an access, the cache is checked for all segments needed to cover the request byte range. If any are missing, an IO is made to the backing store to fetch them, at which point they can be admitted into the cache.</p>
<p>The storage system has 10,000s of storage nodes independently serving requests. The ratio of backing HDD space : flash cache : DRAM cache is 37,800 : 40 : 1. We focus on the scope of an individual node.</p>
<h2 id="bulk-storage-limited-by-disk-head-time">Bulk storage limited by disk-head time</h2>
<p>At scale, hard disks (HDDs) remain the choice of backing store as they are cheaper by an order of magnitude per GB than SSDs<a rel="noopener" target="_blank" href="https://web.archive.org/web/20221004225419/https://blocksandfiles.com/2020/08/24/10x-enterprise-ssd-price-premium-over-nearline-disk-drives">[8]</a>. Newer HDDs offer increased storage density, resulting in shrinking IO capacity (IOPS and bandwidth) per GB as more GBs are served by the same disk head.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/disk-head-time_vs_access-size_simple.png" alt="Fig 1: Disk-head Time consists of a seek & transfer time. This reflects disk-head times on our testbed." /></p>
<p style="text-align: center;"><em>Fig 1: <b>Disk-head Time</b> consists of a seek & transfer time. This reflects disk-head times on our testbed.</em></p>
<p>Disk-head time on backing HDDs is a premium resource. The mechanical nature of HDDs results in a high, size-independent access time penalty (e.g., 10 ms) for positioning the read/write head. With a transfer cost of, e.g., 5.5 ms per MB and a maximum block size of 8 MB, a request could take 10 to 70 ms. In provisioning bulk storage, peak demand for disk-head time matters most. If the system has insufficient IO capacity, requests queue up, and slowdowns occur. If sustained, clients retry requests and failures occur, affecting user experience. Thus, bulk storage IO requirements are defined by peak load, which in turn affects storage costs.</p>
<h2 id="flash-caches-absorb-backend-load-but-have-limited-write-endurance">Flash caches absorb backend load but have limited write endurance</h2>
<p>Flash caching plays an important role in absorbing backend load, compensating for disk-head time limitations of the underlying HDDs. This setup enables resource-efficient storage for workloads that exceed the throughput requirements of HDDs but which are infeasible to store using flash alone. With the trends towards higher density HDDs and fewer bytes per HDD spindle, flash caches unlock more usable bytes per spindle.</p>
<p>Flash does not have access setup penalties, but does have wearout that translates into long-term average-write-rate limits. SSD manufacturers rate their drives’ endurance in terms of drive-writes per day (DWPD) over their warranty period.</p>
<p>Caching is an especially challenging workload for flash, since items will have widely varying lifetimes, resulting in a usage pattern closer to random I/Os than large sequential writes. Items admitted together may not be evicted at the same time, worsening write amplification. Writing every miss into flash would cause it to wear out prematurely.</p>
<p>Flash caches leverage <strong>admission policies</strong> (APs) to decide if items should be inserted into the flash cache or discarded, and have simple eviction policies (Least Recently Used, First-In First-Out) to minimize write amplification<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a>. Like eviction policies, admission policies weigh the benefit of hits from new items against lost hits from evicted items. They must also weigh the write cost of admitting the new item against other past or future items. Policies have an admission threshold that can be varied to achieve the target flash write rate. We provide some examples.</p>
<ul>
<li><strong>CoinFlip (baseline)</strong> On a miss, segments for an access are either all admitted, or not at all, with probability 𝑝. This simple policy does not need tracking of past items seen.</li>
<li><strong>RejectX (baseline)</strong> rejects a segment the first <em>X</em> times it is seen. Past accesses are tracked using probabilistic data structures similar to Bloom filters. We use X = 1 and vary the window size of past accesses to achieve the desired write rate. Both Meta <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a> and Google <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">[4]</a> used this prior to switching to more complex policies.</li>
<li><strong>ML admission policies</strong> use offline features to make decisions in addition to online features such as past access counts. A ML model can be trained offline based on a trace (as we do), or online using reinforcement learning.</li>
</ul>
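The two baseline policies can be sketched as follows. This is an illustrative simplification: the window bookkeeping uses a plain list for clarity, whereas production systems use Bloom-filter-like sketches, as noted above.

```python
import random

class CoinFlip:
    """Admit a missed item with fixed probability p (no per-item state)."""
    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def admit(self, block_id):
        return self.rng.random() < self.p

class RejectX:
    """Reject an item the first X times it is seen within a sliding window."""
    def __init__(self, x=1, window=4):
        self.x = x
        self.window = window
        self.history = []          # recent misses; real systems use sketches

    def admit(self, block_id):
        seen = self.history.count(block_id)
        self.history.append(block_id)
        if len(self.history) > self.window:
            self.history.pop(0)
        return seen >= self.x

ap = RejectX(x=1, window=4)
decisions = [ap.admit(b) for b in ["a", "b", "a", "c", "a"]]
# "a" is rejected at first sight, then admitted once re-seen within the window.
```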
<h1 id="baleen-design">Baleen Design</h1>
<h2 id="optimize-for-disk-head-time-not-hits-or-bandwidth">Optimize for Disk-head time, not hits or bandwidth</h2>
<p>We propose that backing store load be measured using disk-head time (DT), which is a throughput metric that balances IOPS and bandwidth.</p>
<p><strong>Definition</strong>: Disk-Head Time (DT) is the cost of serving a single request to the backend. For a single IO that fetches <em>n</em> bytes:</p>
<p>$$ DT(n) = \text{SeekTime} + \text{TransferTime} \times n $$</p>
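Using the example costs from earlier (a 10 ms seek and 5.5 ms per MB transfer, both illustrative figures from the testbed description rather than universal constants), the definition translates directly to code:

```python
SEEK_MS = 10.0             # assumed size-independent positioning penalty
TRANSFER_MS_PER_MB = 5.5   # assumed sequential transfer cost

def disk_head_time_ms(n_mb):
    """DT(n) = SeekTime + TransferTime * n, for a request of n_mb megabytes."""
    return SEEK_MS + TRANSFER_MS_PER_MB * n_mb

# A single 128 KB segment vs. a full 8 MB block:
dt_segment = disk_head_time_ms(0.125)   # seek dominates small requests
dt_block = disk_head_time_ms(8)         # transfer dominates large requests
```

Note how the seek term dominates small requests while the transfer term dominates large ones, which is why disk-head time behaves like a weighted combination of IOPS and bandwidth.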
<p>Policies are then assessed in terms of Disk-Head Time saved, rather than
object-level hits (corresponding to IOPS) or byte-level hits (corresponding to
bandwidth). Disk-Head Time can also be seen as a weighted sum of object-level
hits and byte-level hits.
We use Disk-Head Time to score episodes for OPT (our approximate optimal online
admission policy) and ultimately generate labels for training Baleen’s ML
admission policy. In training Baleen’s prefetcher, we use Disk-Head Time to
assess the benefit of prefetching for a particular episode.</p>
<p>System capacity, such as the number of backend servers, is provisioned to handle peak load in systems that need to meet realtime demand. Therefore, to reduce the backend size, one should minimize peak disk-head time. This introduces the need for scheduling (i.e., when to spend the flash write rate budget) to prioritize the admission of items that contribute to peak disk-head time. As explicitly optimizing for the peak introduces significant complexity, we leave that for future work. For Baleen, we design our methods to minimize average disk-head time, but show that they are successful in reducing peak disk-head time as well.</p>
<h2 id="decomposing-caching-into-admission-prefetching-and-eviction">Decomposing caching into admission, prefetching and eviction</h2>
<p>We define the caching problem as determining which times we should fetch, admit and evict each segment in order to minimize the backend’s DT given a flash write rate limit.</p>
<p>We propose a heuristic decomposition of this problem into three sub-problems: admission, prefetching, and eviction. This makes it easier to reason about the optimal solutions to each sub-problem and the training and behavior of ML solutions for each part.</p>
<p><strong>Admission:</strong> Whether to admit something into cache in anticipation of future hits that reduce disk-head time. We trade off the disk-head time saved against the write rate used from caching an item. We model admission as a binary classifier, where misses are admitted if the output probability exceeds the policy threshold.</p>
<p><strong>Prefetching:</strong> Whether to prefetch extra segments outside the current access range (which was a miss). We trade off disk-head time saved from hits on the first accesses against the additional time spent in cache, and for incorrect prefetches, the disk-head time wasted and the opportunity cost of the wasted flash write rate. We further decompose the prefetching problem into a) deciding what segments to prefetch and b) when to prefetch (whether the expected benefit exceeds the cost, taking into account the possibility of mispredictions).</p>
<p><strong>Eviction:</strong> Which segment in the cache to pick for eviction upon an admission. One can employ existing approaches for non-flash caches, including ML-based policies. We employ a simple eviction policy (in our case, Least Recently Used) as is used in production systems, leaving ML-based flash-aware eviction policies for future work.</p>
<h2 id="introducing-episodes-an-offline-model-for-flash-caching">Introducing episodes: an offline model for flash caching</h2>
<p>We devised an offline model for flash caching for efficient evaluation of flash caching improvements, and to facilitate the training of ML-based policies. This model revolves around episodes, which are defined as:</p>
<p><strong>Definition</strong> An <strong>episode</strong> is a sequence of accesses that would be hits (apart from the first access) if the corresponding item was admitted. It is defined on a block, and may span multiple segments. As shown in Fig 2, an episode’s size is the number of segments needed to cache it, and its timespan is the length of time between the first access of any segment and the last eviction of a segment.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/episode_with_segments.png" alt="Fig 2: Episodes span space (measured in segments) in addition to time. An episode’s size is the smallest number of segments required to be admitted to get all possible hits within an episode. OPT-Range is (1,3) and (2,3) respectively." /></p>
<p style="text-align: center;"><em>Fig 2: Episodes span space (measured in segments) in addition to time. An episode’s size is the smallest number of segments required to be admitted to get all possible hits within an episode. OPT-Range is (1,3) and (2,3) respectively.</em></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/episode.png" alt="Fig 3: An episode is a group of accesses corresponding to a block’s residency. Accesses (in blue) are grouped into two episodes as the interarrival time (in red) exceeds the assumed eviction age." /></p>
<p style="text-align: center;"><em>Fig 3: An episode is a group of accesses corresponding to a block’s residency. Accesses (in blue) are grouped into two episodes as the interarrival time (in red) exceeds the assumed eviction age.</em></p>
<p>We generate episodes by exploiting the model of a LRU (Least Recently Used) cache as evicting items a constant logical time (eviction age) after the last access. In a LRU cache, the eviction age is the logical time between an item’s last access & eviction. As shown in Fig 3, we group accesses into episodes such that all inter-arrival times within episodes are no larger than the assumed eviction age.</p>
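The grouping rule illustrated in Fig 3 can be sketched directly (the timestamps and eviction age below are hypothetical): a new episode begins whenever the gap since the previous access exceeds the assumed eviction age.

```python
def episodes(access_times, eviction_age):
    """Group a block's access timestamps into episodes: start a new episode
    whenever the inter-arrival gap exceeds the assumed eviction age."""
    eps, current = [], []
    for t in sorted(access_times):
        if current and t - current[-1] > eviction_age:
            eps.append(current)
            current = []
        current.append(t)
    if current:
        eps.append(current)
    return eps

# Accesses at these logical times, with an assumed eviction age of 10:
eps = episodes([0, 3, 7, 25, 28, 60], eviction_age=10)
```

Within each episode, only the first access is a compulsory miss; the rest would be hits if the block were admitted at the start of the episode.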
<p>Episodes provide a direct mapping to the costs and benefits associated with an admission, one that corresponds directly to the decisions actually being made by admission policies. These benefits and costs are associated with an item’s entire lifespan in cache, and are not obvious from looking at a stream of individual accesses. Moreover, with flash caching, it is optimal to admit as early as possible in the episode, given that the flash writes required are a fixed cost. By shifting the mental model from interdependent accesses to independent episodes, we can reason about decisions more easily.</p>
<p>By assuming a constant eviction age, decisions on episodes become independent of one another, which also allows them to be made in parallel. The added pressure on cache space from an admission is accounted for via downward pressure on the eviction age. We determine an appropriate eviction age using simulations that measure the average eviction age.</p>
<p>The episode model also allows for an efficient offline analysis of policies via Little’s Law. Given the arrival rate and assumed eviction age, we can estimate the cache size required, and set the eviction age such that the analytical cache size equals the cache size constraint. While this is much more efficient than an online simulation and is useful for exploring a greater range of parameters than is possible with online simulation, the numbers will differ from simulated ones because the cache size constraint is enforced only as a long-term average, not at all times.</p>
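<p>As a rough illustration of this Little’s Law sizing (names and units are assumptions, not the paper’s implementation): the long-term average occupancy equals the admitted byte arrival rate times the residency time (N = λW), so the analytical cache size is just their product, and inverting it yields the eviction age that matches a given cache size constraint:</p>

```python
def analytical_cache_size(admitted_bytes_per_sec, eviction_age_sec):
    # Little's Law: average occupancy = arrival rate x residency time.
    return admitted_bytes_per_sec * eviction_age_sec

def solve_eviction_age(admitted_bytes_per_sec, cache_size_bytes):
    # Invert Little's Law: the eviction age that makes the
    # analytical cache size equal the cache size constraint.
    return cache_size_bytes / admitted_bytes_per_sec
```

<p>For example, admitting 100 MB/s with a 600-second eviction age implies roughly a 60 GB cache, and vice versa.</p>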
<p>Admission policies can be viewed as partitioning these episodes into those admitted and discarded. This can be done via scoring episodes and ranking them by score.</p>
<h1 id="baleen-system-architecture">Baleen System Architecture</h1>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/architecture.png" alt="Fig 4: Baleen System Architecture." /></p>
<p style="text-align: center;"><em>Fig 4: Baleen System Architecture.</em></p>
<p>We describe Baleen’s architecture in terms of what happens at training time and when deployed with a CacheLib<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">[1]</a> implementation. At training time, episodes are generated and used to train Baleen’s ML admission and ML prefetching policies. At deployment time, the trained models are supplied to CacheLib, which uses them to make decisions on the fly.</p>
<h2 id="opt-approximates-optimal-online-admission-policy">OPT approximates optimal online admission policy</h2>
<p>We devise an offline admission policy, <strong>OPT</strong>, that we train Baleen’s ML policy to imitate. In OPT, first, each block’s accesses are grouped into episodes using an assumed eviction age. Second, all episodes are scored using the equation below and sorted. Last, the maximum number of episodes is admitted such that the total flash writes required do not exceed the write rate budget. During online simulation, accesses are admitted if they belong to episodes that were marked as admitted during the offline process.</p>
<p>$$ Score(Episode)= \frac{DTSaved(Episode)}{Size(Episode)} $$</p>
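<p>The selection step above can be sketched as a greedy, knapsack-style loop (a hedged illustration with assumed field names, not the paper’s code): rank episodes by disk-head time saved per byte written, then admit in order until the flash-write budget is exhausted:</p>

```python
def opt_admit(episodes, write_budget_bytes):
    """episodes: list of (dt_saved, size_bytes) tuples.

    Score = disk-head time saved per byte written to flash;
    admit highest-scoring episodes within the write budget.
    """
    ranked = sorted(episodes, key=lambda e: e[0] / e[1], reverse=True)
    admitted, writes = [], 0
    for ep in ranked:
        if writes + ep[1] <= write_budget_bytes:
            admitted.append(ep)
            writes += ep[1]
    return admitted
```

<p>For instance, with a write budget of 6 bytes, an episode saving 3 units of disk-head time for 1 byte outranks one saving 10 units for 5 bytes, and both fit the budget while a low-scoring 4-byte episode is rejected.</p>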
<h2 id="training-baleen-to-imitate-opt">Training Baleen to imitate OPT</h2>
<p>We use OPT’s decisions as binary labels for training Baleen. Training examples are added for the first k (k = 6) accesses of each episode (to avoid biasing the training set towards popular but easy episodes). Features include offline metadata provided by the bulk storage system (which helps identify which application the request originates from) and online history-based counts (how many hits the object has received in the last 1, 2, 3, 4, 5, and 6 hours).</p>
<h2 id="training-baleen-to-predict-what-and-when-to-prefetch">Training Baleen to predict what and when to prefetch</h2>
<p>By default, on a miss, the smallest IO that covers all missed segments is made, i.e., no prefetching occurs. It is possible to extend this IO and preemptively admit more segments. If done correctly, this reduces the total number of IOs needed and thus reduces Disk-head Time.</p>
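<p>A minimal sketch of this default miss path (hypothetical helper names): the IO issued is the smallest contiguous segment range covering every missed segment, and prefetching simply widens that range to include a predicted one:</p>

```python
def covering_io(missed_segments):
    """Smallest contiguous (start, end) range covering all misses."""
    return (min(missed_segments), max(missed_segments))

def extend_with_prefetch(missed_segments, predicted_range):
    # Union of the required range and a predicted prefetch range,
    # so one IO covers both the misses and the prefetched segments.
    start, end = covering_io(missed_segments)
    return (min(start, predicted_range[0]), max(end, predicted_range[1]))
```

<p>Extending the range this way trades extra bytes read (and written to flash) now against avoiding a second disk IO later, which is the tradeoff ML-Range and ML-When manage below.</p>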
<p>Baleen has two prefetching models: ML-Range and ML-When.</p>
<h2 id="learning-what-to-prefetch-ml-range-learns-from-opt-range">Learning what to prefetch: ML-Range learns from OPT-Range</h2>
<p>OPT-Range is the minimal range of segments that will cover all accesses in an episode. Using the episodes model, we generate OPT-Range for each episode and use these as labels for ML-Range.</p>
<p>ML-Range is an ML model that predicts a range of segments for prefetching. We use the same features as the ML admission model, but add size-related features (access start index, access end index, access size). We train two regression models to predict the episode range start and end. Each episode is represented once in the training data, with only episodes that meet the score cutoff for the target write rate included.</p>
<h2 id="learning-when-to-prefetch-ml-when">Learning when to prefetch: ML-When</h2>
<p>Fetching insufficient segments results in minimal or no Disk-head Time reduction. On the other hand, fetching excess segments results in a high write rate. To balance these tradeoffs, we need to know our confidence in our range prediction.</p>
<p>Mispredictions by the ML admission policy and in ML-Range can easily cause prefetching to hurt instead of help. In reality, the expected benefit will be lower than OPT prefetching and the cost can only be higher. The disk-head time saved from prefetching ML-Range may not be realized. Moreover, prefetching mispredictions are costly in terms of disk-head time consumed to fetch unused segments and the opportunity cost of flash writes used to store them. ML-When aims to address this and exclude episodes that do not have a high probability of benefiting from prefetching.
The exact equations are provided in our paper.</p>
<h1 id="evaluation">Evaluation</h1>
<p>We evaluate Baleen using a testbed and a simulator. We validate both with counters from production deployments. Each node in our testbed has a 400 GB flash drive and two 4 TB HDDs.</p>
<p>We report results on 7 Meta production traces collected in 2019, 2021 and 2023 and take an average across the traces.
These traces show week-long workloads on 7 Tectonic clusters from 3 different years,
with each cluster serving the storage needs of an entire data center<a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">[2]</a>.
Each trace represents 7 days of production traffic from a single
cluster (except for Region3, which has 3 days), with traffic sampled at every
node (each cluster has thousands of nodes) and later aggregated into a trace.
The Region1 and Region2 traces were recorded from different clusters over the same 7 days in Oct 2019, while the Region3 trace was recorded from another cluster over 3 days in Sep 2019. Region4 was recorded over 7 days in Oct 2021, and the remaining traces (Region5, Region6, Region7) were collected in Mar 2023.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_wr-34-01.png" alt="Fig 5: Baleen reduces Peak Disk-head Time (DT) by an average of 11.8% over the best non-ML policy (RejectX), and 18.6% over random admission on 7 production traces from Meta under flash write rate constraints." /></p>
<p style="text-align: center;"><em>Fig 5: Baleen reduces Peak Disk-head Time (DT) by an average of 11.8% over the best non-ML policy (RejectX), and 18.6% over random admission on 7 production traces from Meta under flash write rate constraints.</em></p>
<p>Fig 5 shows Baleen reduces Peak DT over RejectX by an average of 11.8% across all traces.
In our paper, we show this ranges from 4.8% to 22.6% across the 7 traces,
with 3 regions deriving most of their gains from prefetching.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_csize.png" alt="Fig 6a: Baleen delivers improvements at higher cache sizes." /></p>
<p style="text-align: center;"><em>6a) Cache Sizes</em></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/baleen-ml-flash-caching/peak-st-ratio_wr.png" alt="Fig 6b: Baleen delivers improvements at higher cache sizes." /></p>
<p style="text-align: center;"><em>6b) Write Rates</em><br>
<em>Fig 6: Baleen continues to deliver improvements at higher cache sizes and write rates.</em></p>
<p>Fig 6 shows that the benefits of Baleen are consistent at higher cache sizes and write rates, with Baleen enabling a reduction in cache size by 55% while keeping
the same Peak DT as RejectX, or alternatively a reduction in Peak DT equivalent
to a 4X increase in cache size. As expected, increasing write rate or cache size has diminishing returns in reducing Peak DT. Also, the different admission policies (without prefetching) start to converge, indicating that admission by itself is insufficient to drive further reductions in Peak DT. We provide graphs for all 7 traces in our
paper.</p>
<p>Further results are available in our paper, such as:</p>
<ol>
<li><strong>Prefetching should be selective and in tandem with admission</strong>
We show both ML-Range and ML-When are effective in reducing Peak DT over static baselines, and contribute to Baleen’s robustness across the multiple traces.
We also show that prefetching must be paired with a good admission policy; if not, the same prefetching policy can hurt rather than help.</li>
<li><strong>Optimizing the right metric: Peak DT</strong> We show how optimizing for IO
hit ratio can be misleading, as doing so is optimal for reducing seeks, not
Disk-head Time.</li>
<li><strong>Validation of simulator and testbed.</strong> We validated Baleen on our simulator
against Baleen on our testbed. We took the additional step of showing that our
testbed is consistent with production counters.</li>
<li><strong>Trace analysis</strong> We show distributions for block popularity, interarrival times, and access sizes, as well as the compulsory miss trace, the one-hit-wonder trace (fractions of blocks
with no reuse), and the Admit-All flash write rate.</li>
</ol>
<p>In our paper, we described a few lessons gleaned from 3 years of deploying ML
in production caches at Meta. These lessons were that 1) optimizing the wrong metric is an easy misstep, 2) ML model performance does not always translate to production system performance, 3) the use of DRAM in flash caching should be rethought, and 4) ML-based caching should aim for encapsulation of ML, caching, and storage.
To read more, please see Section 6 (Lessons from deploying ML in production) of our paper.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Baleen is a flash cache that uses ML to guide both prefetching and cache admission, reducing peak storage backend load on real workload traces from Meta. Baleen’s design arose from a number of false steps and lessons learned, and from a cache residency (episodes) formulation that improves training effectiveness, provides an ideal (OPT) target, and exposes the particular value of ML-guided prefetching. As such, Baleen is an important step forward in flash caching for disk-based storage systems.</p>
<p>More details are available in our paper, which <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast24/presentation/wong">has been accepted to FAST 2024</a>. Please
direct any correspondence to <a href="mailto:wonglkd@cmu.edu">Daniel Wong</a>.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>This post is based on the paper <em>Baleen: ML Admission & Prefetching for Flash Caches</em>. I would like to thank my collaborators and the CacheLib and Tectonic teams at Meta: Hao Wu (Meta), Carson Molder (UT Austin), Sathya Gunasekar (Meta), Jimmy Lu (Meta), Snehal Khandkar (Meta), Abhinav Sharma (Meta), Daniel S. Berger (Microsoft Research & University of Washington), Nathan Beckmann (CMU), and Greg Ganger (CMU).
I would also like to thank the reviewers of this post: George Amvrosiadis, Rashmi Vinayak, and Thomas Kim.</p>
<h1 id="references">References</h1>
<ol>
<li>Benjamin Berg, Daniel S Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, et al. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-berg.pdf">The CacheLib caching engine: Design and experiences at scale.</a> In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020.</li>
<li>Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, et al. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-pan.pdf">Facebook’s Tectonic filesystem: Efficiency from exascale.</a> In 19th USENIX Conference on File and Storage Technologies (FAST 21), 2021</li>
<li>Daniel S Berger. <a rel="noopener" target="_blank" href="https://pdl.cmu.edu/PDL-FTP/BigLearning/2018MachineLearningCDNcache_HOTNETS.pdf">Towards lightweight and robust machine learning for CDN caching.</a> In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets), 2018.</li>
<li>Tzu-Wei Yang, Seth Pollen, Mustafa Uysal, Arif Merchant, and Homer Wolfmeister. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/atc22-yang-tzu-wei.pdf">CacheSack: Admission optimization for google datacenter flash caches.</a> In USENIX Annual Technical Conference (USENIX ATC 22), 2022.</li>
<li>Dean Hildebrand and Denis Serenyi. <a rel="noopener" target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">Colossus under the hood: a peek into google’s scalable storage system</a>, 2021</li>
<li>Zhenyu Song, Kevin Chen, Nikhil Sarda, Deniz Altınbüken, Eugene Brevdo, Jimmy Coleman, Xiao Ju, Pawel Jurczyk, Richard Schooler, and Ramki Gummadi. <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi23-song-zhenyu.pdf">Halp: Heuristic aided learned preference eviction policy for youtube content delivery network.</a> In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023.</li>
<li>Apache Software Foundation. (2010). <a rel="noopener" target="_blank" href="https://hadoop.apache.org">Hadoop</a>.</li>
<li><a rel="noopener" target="_blank" href="https://web.archive.org/web/20221004225419/https://blocksandfiles.com/2020/08/24/10x-enterprise-ssd-price-premium-over-nearline-disk-drives">Chris Mellor. Enterprise ssds cost ten times more than nearline disk drives.</a> Accessed: 2022-10-04.</li>
</ol>
Mimir: Finding cost-efficient storage configurations in the public cloud2023-12-15T00:00:00+00:002023-12-15T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/mimir/<p>In today’s landscape of diverse public cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, organizations are increasingly turning to cloud computing with pay-as-you-go pricing models. <a rel="noopener" target="_blank" href="https://www.cloudzero.com/blog/cloud-computing-statistics">Many businesses</a> are adopting public cloud services to simplify data center management or leverage the scalability and elasticity offered by these providers.</p>
<p>A pressing question that accompanies this shift to public cloud adoption is how to optimize the overall cost of utilizing cloud resources. While researchers have recently delved into cost optimization for virtual machine (VM) instances used in computational workloads, there has been limited focus on optimizing storage choices. Frequently, companies require high-performance storage clusters to efficiently operate their workloads in public clouds. However, the costs associated with these storage clusters cannot be underestimated, given that VMs and block storage options can strain budgets.</p>
<p><strong>Thus, companies need to carefully select resources for storage clusters to reduce their Total Cost of Ownership.</strong> If organizations opt for only inexpensive resources to minimize costs, their storage clusters may fail to meet performance requirements. Conversely, selecting solely high-performance Virtual Machines and storage types can lead to substantial spending compared to an optimized resource selection approach.</p>
<p><strong>Nonetheless, choosing the cost-efficient set of resources for storage clusters in public clouds remains a challenging task, and no existing system helps with this provisioning decision.</strong> The multitude of available VM and storage types adds complexity. For instance, AWS alone offers over a <a rel="noopener" target="_blank" href="https://www.amazonaws.cn/en/ec2/instance-types/">hundred different instance types</a> and various block storage options, including <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html">locally attached (LocalSSD)</a> and <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html">remotely disaggregated (EBS)</a>. Each resource option comes with <a rel="noopener" target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html">distinct pricing and performance models</a>, and the performance also varies based on workload characteristics. This necessitates a deep understanding of both cloud resource attributes and workload characteristics to make informed selections. If we factor in the potential use of heterogeneous storage cluster configurations, the problem’s search space becomes significantly larger and more intricate.</p>
<p><strong>To address these challenges, we introduce Mimir, a resource auto-selection tool designed to identify the most cost-efficient set of resources for storage clusters in public clouds, all while meeting specified performance requirements.</strong> Our system assesses all available VM types, block storage options, and even combinations of these options (heterogeneous configurations) to determine the optimal solution. <strong>As a result, Mimir can yield storage cluster configurations up to 81% cheaper than those generated by the current state-of-the-art resource auto-selection tools.</strong> In our evaluations, we demonstrate that Mimir can also serve as a resource selector for mixed workloads (comprising multiple workloads with distinct characteristics) and dynamic workloads, efficiently identifying cost-effective cluster configurations within a reasonable time.</p>
<h2 id="challenges-navigating-diverse-resource-options-and-heterogeneity"><strong>Challenges: navigating diverse resource options and heterogeneity</strong></h2>
<h3 id="challenge-1-diverse-storage-options-characteristics"><strong>Challenge 1: diverse storage options’ characteristics</strong></h3>
<p>Public cloud providers have established unique performance characteristics for their block storage options, setting them apart from traditional solutions like SSDs and HDDs. Workload attributes such as access pattern (random/sequential), read ratio, and I/O unit size can exert significant influence on the performance of cloud block storage. Overlooking these factors or assuming that cloud storage behaves analogously to traditional storage can result in erroneous storage cluster configurations. To illustrate this, we present two examples showcasing how workload characteristics impact storage performance.</p>
<table><thead><tr><th align="center"><img src="./storage-io.png" alt="alt text" /></th><th align="center"><img src="./storage-rw.png" alt="alt text" /></th></tr></thead><tbody>
<tr><td align="center">(a) by I/O unit size</td><td align="center">(b) by read ratio</td></tr>
</tbody></table>
<p><strong>Fig. 1:</strong> Performance characteristics of public cloud storage volume types by (a) I/O unit size and (b) workload read ratio. In (a), both volume types have throughput limits defined by AWS (horizontal lines).</p>
<p>In our tests, we employed the <a rel="noopener" target="_blank" href="https://fio.readthedocs.io/en/latest/fio_doc.html">fio benchmark</a> to assess cloud block storage performance on AWS, using three different storage types: local NVMe SSD, remote SSD (gp2), and remote HDD (st1). We varied access patterns, read ratios, and I/O unit sizes. Fig. 1 provides insights into the characteristics of 1 TiB gp2 and 1 TiB st1 volumes, with performance of 3000 IOPS and 40 MiB/s respectively, following the performance model provided by AWS, along with a local SSD attached to i3.xlarge.</p>
<p>Fig. 1(a) shows <em>how performance characteristics vary with I/O unit size and access pattern for each storage type</em>. For gp2, whose performance is defined in IOPS, increasing the I/O unit size results in higher throughput, eventually reaching the maximum limit set by AWS; this holds regardless of the access pattern. In contrast, st1’s performance, defined in MiB/s, should ideally maintain a throughput of 40 MiB/s regardless of the I/O unit size. However, it exhibits reduced throughput for workloads featuring random access patterns and I/O units smaller than 1 MiB, different from the behavior observed with sequential accesses.</p>
<p>In Fig. 1(b), we examine the <em>impact of read ratios on each volume type’s throughput</em>. EBS volumes remain unaffected by the read ratio, as it lies outside their performance models. Conversely, local SSD exhibits considerably higher throughput than EBS and is notably influenced by the read ratio.</p>
<p>As highlighted above, in public clouds, storage performance characteristics differ from traditional storage. For instance, remote SSD throughput remains consistent regardless of read-to-write ratios, while performance of different storage options changes differently for I/O unit size changes. This can confuse users configuring cloud storage clusters, as they may erroneously assume that cloud storage exhibit conventional storage behavior. However, by accurately considering these pricing and performance models, <strong>Mimir can mathematically deduce performance specifications from allocated resources, aiding in cost-efficient cloud storage configurations that meet performance needs</strong>.</p>
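<p>As a simplified illustration of the gp2 behavior in Fig. 1(a) (this omits AWS details such as burst credits and size-dependent baselines, so treat it as a sketch rather than the official AWS model), an IOPS-limited volume delivers roughly the minimum of its IOPS-derived bandwidth and its throughput cap:</p>

```python
def gp2_throughput_mib_s(iops_limit, io_size_kib, max_throughput_mib_s):
    # Bandwidth achievable at the IOPS limit for this I/O unit size.
    iops_bound = iops_limit * io_size_kib / 1024  # MiB/s
    # Throughput saturates at the volume's cap for large I/O units.
    return min(iops_bound, max_throughput_mib_s)
```

<p>This captures the shape in Fig. 1(a): at small I/O units a 3000-IOPS volume is IOPS-bound (e.g., about 47 MiB/s at 16 KiB I/Os), while at large I/O units it hits the per-volume throughput ceiling.</p>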
<p>It is worth noting that local SSD’s highest throughput in Fig. 1 does not always make it the best choice. The throughput of each storage option varies with its configuration; larger gp2 and st1 volumes can outperform local SSD. st1 and gp2 come with lower per-byte costs, making them cost-efficient alternatives when high throughput is not crucial.</p>
<h3 id="challenge-2-heterogeneity-is-important-for-cost-efficiency"><strong>Challenge 2: heterogeneity is important for cost-efficiency</strong></h3>
<p>One easy way of selecting resources for a storage cluster in the public cloud is configuring a homogeneous storage cluster by using a single storage option.
However, we found that there is no single storage option that is the most cost-efficient for every workload, and sometimes, even a mix of storage options is needed to minimize the cost.</p>
<p><img src="./motiv.png" alt="alt text" /></p>
<p><strong>Fig. 2:</strong> No volume type is most cost-efficient for every workload, and a mix of volume types may be the most cost-effective option.</p>
<p>Fig. 2 demonstrates the need to consider various volume types and configurations for selecting a cost-efficient Virtual Storage Cluster (VSC) configuration. For each of the three workloads, it shows the cost for the best VSC configuration under three constraints: using only local SSD volume types, only remote storage (EBS) volume types, and arbitrary mixes of both.</p>
<p>For Workload 1, which demands high storage throughput per GB of data, opting for EBS volume types leads to over-provisioning capacity, making it an expensive choice due to the 3 IOPS per provisioned-GB limit. Conversely, Workload 2, with lower storage throughput requirements, renders local SSD an expensive option due to over-provisioning storage performance. Workload 3 combines varying performance needs, necessitating a mix of storage options to minimize costs.</p>
<p><strong>Therefore, it is crucial to consider a heterogeneous VSC configuration for the cost-efficiency.</strong> However, this introduces complexity, making it impractical to explore the search space using naive methods. So users can use Mimir as a solution to efficiently navigate this complex search space by using dynamic programming and integer-linear programming.</p>
<h2 id="mimir-resource-auto-selector-for-storage-cluster-in-public-clouds"><strong>Mimir: resource auto-selector for storage cluster in public clouds</strong></h2>
<p>To tackle these challenges, we introduce Mimir, a resource auto-selector that identifies the cost-efficient set of VMs and storage volumes for a storage cluster. Mimir takes into account workload characteristics (such as read/write request ratio and data access locality) and user-defined requirements (including request rate and capacity). Next, we will provide an overview of Mimir’s workflow and delve into our main optimization algorithm.</p>
<h3 id="mimir-design-and-workflow"><strong>Mimir design and workflow</strong></h3>
<p>Fig. 3 outlines Mimir’s workflow, which begins by inputting characteristics from multiple workloads requiring cluster storage. Each storage cluster’s workload profiler profiles these attributes, and the Resource Profiler assesses them to determine resource needs for cost-effective cloud operations. This involves resource utilization profiling using micro-benchmarks, considering given data access patterns like request rate, access locality, and read/write ratios. The Resource Predictor uses this resource profiling data to identify efficient container sizes (i.e., storage/network bandwidth, CPU count, memory) for each workload, as Mimir utilizes containers to run multiple storage servers in the same VM with resource isolation. Finally, the VSC Cost Optimizer combines these insights with the public cloud’s cost model to optimize the Virtual Storage Cluster (VSC) configuration for the distributed storage system.</p>
<p><img src="./mimir_design.png" alt="alt text" />
<strong>Fig. 3:</strong> Mimir’s workflow for optimizing the price of public cloud resources. Initially, Mimir profiles the provided workloads, learning the precise resource requirements (such as CPU and memory). Using this trained module and a cost model encompassing public cloud resources, the VSC Cost Optimizer then identifies the most cost-efficient Virtual Storage Cluster (VSC) configuration.</p>
<p>Mimir assumes that users provide or profile the workload characteristics, which the system uses as input for its optimization process. This modular approach makes Mimir adaptable to any storage system capable of profiling sufficient workload information. Next, we provide a brief overview of the optimization algorithm used by Mimir to minimize costs for the given workload characteristics. Further details regarding other components, such as the Resource Profiler and Resource Predictor, can be found in <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3579370.3594776">our paper</a>.</p>
<h3 id="optimization-algorithm-dynamic-programming"><strong>Optimization algorithm: dynamic programming</strong></h3>
<p>The VSC Cost Optimizer addresses the following question: <em>What resource configuration minimizes costs while meeting performance requirements and accommodating storage workload characteristics?</em></p>
<p>In this optimization problem, we identified an optimal substructure property. This means that if Mimir determines the most cost-efficient virtual storage cluster configuration for the given workloads, then any subset of storage servers from the entire cluster (a sub-cluster) must also represent the cost-efficient configuration for the workloads running on that specific sub-cluster.</p>
<p><img src="./dynamic_programming.png" alt="alt text" />
<strong>Fig. 4:</strong> Mimir’s optimization problem has an optimal substructure property. If we find the most cost-efficient configuration for the entire virtual storage cluster, then any sub-cluster of that configuration must also be cost-efficient for the portion of data stored within that sub-cluster.</p>
<p>Figure 4 exemplifies the optimal substructure property. Suppose Machines 1-4 represent the most cost-efficient VSC configuration for a given workload. We contend that any sub-cluster should also be the most cost-efficient for the portion of the workload it handles. To prove this, we use a proof by contradiction. Let’s assume that Machines 1 and 2 are not the most cost-efficient sub-cluster configuration for 3/10 of the workload. This would imply the existence of another sub-cluster (in this case, Machine 5) that’s cheaper than Machines 1 and 2. However, this contradicts the fact that Machines 1-4 (total VSC cost: $28) constitute the most cost-efficient VSC configuration for the entire workload, given that a cheaper configuration involving Machines 3-5 (total VSC cost: $26) exists.</p>
<p>Based on the optimal substructure property, we use dynamic programming to break down a large search space into manageable segments. For a more in-depth understanding of our approach, including how we use mixed-integer programming for the base case and how Mimir integrates other components (e.g., resource profiler, resource predictor, and cost model) into its optimization algorithm, please refer to our paper.</p>
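<p>The recurrence implied by this property can be sketched as follows (hypothetical names; in Mimir the per-machine base cases come from mixed-integer programming rather than a fixed option list): the cheapest cost to serve n workload shards is the cheapest single-machine configuration covering some k shards plus the optimal cost for the remaining n − k shards:</p>

```python
import math

def min_cluster_cost(total_shards, base_options):
    """base_options: list of (shards_served, cost) per machine config.

    dp[n] = cheapest cost to serve n shards, built bottom-up from
    the single-machine base cases (optimal substructure).
    """
    dp = [0.0] + [math.inf] * total_shards
    for n in range(1, total_shards + 1):
        for shards, cost in base_options:
            if shards <= n and dp[n - shards] + cost < dp[n]:
                dp[n] = dp[n - shards] + cost
    return dp[total_shards]
```

<p>For example, with machine configurations serving 1 shard for $10 or 3 shards for $25, serving 4 shards optimally costs $35 (one of each), rather than $40 for four single-shard machines.</p>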
<h2 id="mimir-can-find-up-to-81-cheaper-storage-cluster-over-sota"><strong>Mimir can find up to 81% cheaper storage cluster over SOTA</strong></h2>
<p>We evaluated Mimir using Apache BookKeeper as the distributed storage backend and six different <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast20/presentation/cao-zhichao">Meta’s RocksDB (MR) key-value workloads</a>.
The results of our evaluation demonstrate significant cost savings achieved by Mimir compared to state-of-the-art solutions, showing its ability to consider a wide range of volume types. We compared Mimir to three baseline configurations, each focusing on a limited subset of instance or storage types, in contrast to Mimir’s comprehensive consideration of all instance and block storage types:</p>
<ol>
<li><strong>i3.xlarge-only:</strong> The simplest way to configure a VSC is using a single instance type (storage-optimized instance, i3.xlarge) and determining the number of instances based on the storage server performance.</li>
<li><strong>Mimir-LocalOnly:</strong> Another way is to use only instance types that have local SSDs, including some compute or memory optimized instance types like m5d, c5d, and r5d.</li>
<li><strong>Mimir-EBSonly/OptimusCloud-like:</strong> Yet another way of configuring a VSC is using only EBS volumes, which can persist data independently of the instance status; but if the workload requires high performance, this can be more expensive
than local SSD. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/atc20/presentation/mahgoub">OptimusCloud</a>, the previous work we consider as the state-of-the-art, restricts the volume type to EBS volumes because of their persistent nature, but our results show that this approach is often much more costly.</li>
</ol>
<p><img src="./evaluation.png" alt="alt text" />
<strong>Fig. 5:</strong> The cost-efficiency analysis of the optimization results of the workloads of Meta’s RocksDB key-value workloads. Throughput-intensive workloads (MR-A,C,E,F) prefer local SSD as its storage type. In contrast, other
workloads (MR-B,D) that do not require high throughput prefer EBS volume to local SSD.</p>
<p>Fig. 5 shows the VSC costs of the most cost-efficient VSC configurations under different resource constraints. Overall, Mimir successfully identifies the most cost-efficient VSC configuration for any workload in Fig. 5, and achieves up to 81% cost savings compared to the <em>OptimusCloud-like</em> baseline. We also demonstrate that depending on the workload characteristics, different workloads prefer different storage types to store data cost-efficiently.</p>
<p>For instance, MR-D, a capacity-intensive workload, does not require high performance. Thus, local SSD proves costly as it under-utilizes its storage bandwidth, and gp2’s throughput (3 IOPS per provisioned GiB) suffices for MR-D.</p>
<p>Conversely, MR-F, with the second highest throughput needs among the six workloads, benefits from local SSD, making Mimir-LocalOnly more cost-efficient than Mimir-EBSonly. Interestingly, for MR-F, compute-optimized instance types like c5d are more economical than storage-optimized i3.xlarge. This is because MR-F demands high computing power for its high data request rate. This evaluation implies that not only considering various storage options, but also selecting the right instance type is important.</p>
<p>Our paper also covers additional evaluations, including the optimization overhead and Mimir’s effectiveness for dynamic workloads. For comprehensive details, please refer to our research paper.</p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>Mimir finds the cost-efficient virtual storage cluster configurations for distributed storage backends.
By using provided workload information and performance requirements, Mimir predicts resource requirements and explores the complex, heterogeneous set of block storage offerings to identify the lowest cost VSC configuration that satisfies the customer’s need.
Experiments show that no single allocation type is best for all workloads and that a mix of allocation types is the best choice for some workloads.
Compared to a state-of-the-art approach, Mimir finds the VSC configurations that satisfy requirements at up to 81% lower cost for Meta’s RocksDB workloads.</p>
<p>You can find more detailed information in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3579370.3594776">published paper</a>.</p>
Transfer Learning within a Heterogeneous Graph2023-10-31T00:00:00+00:002023-10-31T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/ktn/<h3 id="can-we-transfer-knowledge-between-different-data-types-using-their-connectivity-information">Can we transfer knowledge between different data types using their connectivity information?</h3>
<p>Ecosystems in industry commonly comprise various data types, differing in modality or feature distribution. <strong>Heterogeneous graphs</strong> (HGs) present these multimodal data systems in a unified view by defining multiple types of nodes and edges — for instance, e-commerce networks with (<em>user, product, review</em>) nodes or video platforms with (<em>channel, user, video, comment</em>) nodes. <strong>Heterogeneous graph neural networks</strong> (HGNNs) learn node embeddings, which summarize each node’s heterogeneous local structure into a vector. Unfortunately, real-world HGs suffer from a <strong>label imbalance</strong> between node types. For instance, publicly available content node types such as product nodes are abundantly labeled, whereas labels for user or account nodes may not be available due to privacy restrictions. As a result, label-scarce node types cannot exploit HGNNs, hampering the broader applicability of HGNNs.</p>
<p>In this blog, we introduce how to pre-train an HGNN model on label-abundant node types and then transfer the model to label-scarce node types using relational information given in HGs. You can find details of the work in our paper “<em>Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks</em>” [1], presented at NeurIPS 2022.</p>
<h2 id="what-is-a-heterogeneous-graph-hg">What is a heterogeneous graph (HG)?</h2>
<figure>
<img src="./figure1.png" alt="e-commerce heterogeneous graph" width="400"/>
<figcaption>Figure 1. E-commerce heterogeneous graph: Can we transfer knowledge from label-abundant node types (e.g., products) to zero-labeled node types (e.g., users) through relational information given in a heterogeneous graph?
</figcaption>
</figure>
<p>An HG is composed of multiple node and edge types. Figure 1 shows an e-commerce network presented as an HG. In e-commerce, “users” purchase “products” and write “reviews”. The HG presents this ecosystem using three node types (“user”, “product”, “review”) and three edge types (“user-buy-product”, “user-write-review”, “review-on-product”). Individual products, users, and reviews are then presented as nodes and their relationships as edges in the HG with the corresponding node/edge types.</p>
<p>In addition to all relational information, HGs are commonly provided with <em>input node attributes</em> that summarize each node’s information. For instance, product nodes could have product images as input node attributes, while review nodes could have review texts as their input attributes. As in the example, input node attributes could have different modalities across different node types. The goal is to predict <em>node labels</em> on each node, such as the category of each product or the category each user is most interested in.</p>
<p>In the following section, we introduce the main challenge we face while training HGNNs to predict labels using input node attributes and relational information from HGs.</p>
<h2 id="heterogeneous-graph-neural-networks-hgnns-and-label-scarcity-issues">Heterogeneous graph neural networks (HGNNs) and label scarcity issues</h2>
<p>HGNNs compute node embeddings that summarize each node’s local graph structures including the node and its neighbor’s input attribute distributions. Node embeddings are then fed into a classifier to predict each node’s label. To train an HGNN model and a classifier to predict labels for a specific node type, we require a good amount of labels for the node type.</p>
<p>A common issue in real-world applications of deep learning is label scarcity. With their diverse node types, HGNNs are even more likely to face this challenge. For instance, publicly available content node types are abundantly labeled, whereas labels for user nodes may not be available due to privacy restrictions. This means that in most standard training settings, HGNN models can only learn to make good inferences for a few label-abundant node types and can usually not make any inferences for the remaining node types, given the absence of any labels for them.</p>
<p>To solve this label scarcity issue, we will use a technique called zero-shot transfer learning that improves the performance of a model on a zero-labeled domain.</p>
<h2 id="transfer-learning-on-heterogeneous-graphs">Transfer Learning on Heterogeneous Graphs</h2>
<p>To improve the performance on a zero-labeled “target” domain, transfer learning exploits the knowledge earned from a related “source” domain, which has adequate labeled data. For instance, transfer learning on heterogeneous graphs first trains an HGNN model on the source domain using their labels, then reuses the HGNN model on the target domain.</p>
<p>In order to apply transfer learning on heterogeneous graphs to solve the label scarcity issue we described above, it is clear the target domain should be the zero-labeled node types. The remaining question is what the source domain should be. Previous works commonly set the source domain as the same type of nodes but located in an external HG, assuming those nodes are abundantly labeled (Figure 2). For instance, the source domain is user nodes in the Yelp review graph, while the target domain is user nodes in the Amazon e-commerce graph. This approach, also known as <em>graph-to-graph transfer learning</em>, pre-trains an HGNN model on the external HG and then runs the model on the original label-scarce HG [2, 3].</p>
<center>
<figure>
<img src="./figure2.png" alt="graph-to-graph transfer learning" width="600"/>
<figcaption>Figure 2. Illustration of graph-to-graph transfer learning on heterogeneous graph.</figcaption>
</figure>
</center>
<p>However, this approach is not applicable in many real-world scenarios for three reasons. First, any external HG that could be used in a graph-to-graph transfer learning setting would almost surely be <em>proprietary</em>, making it hard to access. Second, even if practitioners could obtain access to an external HG, it is unlikely that its <em>distribution</em> would match the target HG well enough to apply transfer learning. Finally, node types suffering from <em>label scarcity</em> are likely to suffer the same issue on other HGs; for instance, user nodes on the external HG are also likely to have scarce labels due to privacy constraints.</p>
<h2 id="our-approach-transfer-learning-between-node-types-within-a-heterogeneous-graph">Our approach: transfer learning between node types within a heterogeneous graph</h2>
<p>To overcome the limitations of using external HGs for transfer learning, we introduce a practical source domain: <em>other node types with abundant labels located in the same HG</em>. Instead of using extra HGs, we transfer knowledge across different types of nodes within a single HG assumed to be fully owned by the practitioners. More specifically, we first pre-train an HGNN model and a classifier on a label-abundant “source” node type. Then, we reuse the models on the zero-labeled “target” node types located in the same HG without additional finetuning. The one requirement for this approach is that the source and target node types share the same label set. This requirement is frequently satisfied in real-world settings. For instance, in e-commerce HGs, product nodes have a label set describing product categories, and user nodes share the same label set describing their favorite shopping categories.</p>
<h2 id="main-technical-challenge">Main technical challenge</h2>
<p>We now describe the main challenge in realizing our approach. We cannot directly reuse the pretrained HGNN and classifier on the target node type as described above, because the HGNN maps the source and target node types into different embedding spaces.</p>
<figure>
<img src="./figure3.png" alt="l2 norm of gradients passed to each module in the HGNN" width="450"/>
<figcaption>
Figure 3. The L2 norm of gradients passed to each module in the HGNN while pretraining on the source node type. Green and Red lines show large amounts of gradients passed to source node type-specific modules, while blue and orange lines show little or no gradients passed to target type-specific modules.
</figcaption>
</figure>
<p>This happens because of one crucial characteristic of HGNNs — HGNNs are composed of modules specialized to each node type and use distinct sets of modules to compute embeddings for each node type. While pretraining an HGNN on the source node type, modules specialized to the source node type are well-trained, whereas modules specialized to the target node type remain untrained or under-trained. In Figure 3, we can observe that the source modules (green and red lines) receive gradients with high L2 norms during pretraining. On the other hand, because of this specialization, the target modules (orange and blue lines) receive little or no gradient. With under-trained modules for the target node type, the pretrained HGNN model outputs poor node embeddings for the target node type and, consequently, performs poorly on the node prediction task.</p>
<h2 id="ktn-trainable-cross-type-transfer-learning-for-hgnns">KTN: Trainable Cross-Type Transfer Learning for HGNNs</h2>
<p>Now, we introduce a method to transform the poor, under-trained embeddings of the target node type to follow the source embedding space. This allows us to reuse the classifier trained on the source node type. To derive the transformation in a principled manner, let us look into how HGNNs compute node embeddings and analyze the relationship between source and target embeddings.</p>
<figure>
<img src="./figure4.png" alt="HGNN structure" width="600"/>
<figcaption>
Figure 4. (left) In HGNNs, the final L-layer node embeddings for both source and target types are computed using the same input, the previous (L-1)-layer’s node embeddings. (right) The L-layer node embeddings of the source type (product, blue) can be represented by the L-layer node embeddings of the target type (user, red) using (L-1)-layer node embeddings as intermediate values.
</figcaption>
</figure>
<p>In each layer, HGNNs aggregate connected nodes’ embeddings from the previous layer to update each node’s embedding. Node embeddings for both source and target node types are updated using the same input: the previous layer’s node embeddings of any connected node types (Figure 4, left). This means that they can be represented in terms of each other, using the previous layer’s embeddings as intermediate values (Figure 4, right).</p>
<p>We prove there is a mapping matrix from the target domain to the source domain, which is defined by HGNN parameters (Theorem 1 in [1]). Based on this theorem, we introduce an auxiliary network, named Knowledge Transfer Networks (KTN), that learns the mapping matrix from scratch during pretraining HGNN on the source domain. At test time, we first compute target embeddings using the pretrained HGNN, then map the target embeddings to the source domain using our trained KTN. Finally, we can reuse the classifier with transformed target embeddings.</p>
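<p>To make the mapping idea concrete, here is a toy, dependency-free Python sketch. In the actual method, the mapping (the KTN) is a trainable network learned jointly while pretraining the HGNN; here we simply fit a 2-D linear map by solving the normal equations on synthetic embeddings, purely for illustration — all names and numbers below are ours:</p>

```python
import random

# Toy KTN-style mapping: fit a linear map W that transforms target-type
# embeddings into the source embedding space, so a classifier trained on
# source embeddings can be reused on mapped target embeddings.
random.seed(0)
d = 2
W_true = [[2.0, 0.5], [-1.0, 3.0]]   # unknown "ground-truth" mapping

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Synthetic embeddings: H_src is where each target embedding *should* land.
H_tgt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(100)]
H_src = [matvec(W_true, x) for x in H_tgt]

# Least squares via the normal equations: W^T = (X^T X)^{-1} X^T S,
# where rows of X are target embeddings and rows of S are source ones.
M = matmul(transpose(H_tgt), H_tgt)                  # 2x2 Gram matrix
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]
Wt = matmul(Minv, matmul(transpose(H_tgt), H_src))
W_fit = transpose(Wt)                                # recovered mapping

# At test time: map a target embedding into the source space, then feed
# the result to the classifier pretrained on the source node type.
mapped = matvec(W_fit, H_tgt[0])
```

<p>The recovered <code>W_fit</code> matches the ground-truth map on this noise-free toy data; the paper’s KTN plays the role of this map, but is learned end-to-end from the HGNN’s parameters rather than fit post hoc.</p>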
<h2 id="experimental-results">Experimental results</h2>
<figure>
<img src="./figure5.png" alt="zero-shot transfer learning results on OAG and Pubmed" width="600"/>
<figcaption>
Figure 5. Zero-shot transfer learning performance measured in NDCG on Open Academic Graph (OAG) and Pubmed datasets. Higher is better. Our proposed method KTN (red bar) shows the highest accuracy among all baselines.
</figcaption>
</figure>
<p>To examine the effectiveness of our proposed KTN, we ran 18 different zero-shot transfer learning tasks on two public heterogeneous graphs, Open Academic Graph [4] and Pubmed [5]. We compare KTN with 8 state-of-the-art transfer learning methods. We show our results in Figure 5. Our proposed method KTN consistently outperforms all baselines on all tasks by up to 73.3%. The naive approach we discussed earlier — reusing the pretrained models directly on the target domain without any transfer learning — is shown as the blue bar. KTN provides relative gains of up to 340% over this naive approach without using any labels from the target domain.</p>
<figure>
<img src="./figure6.png" alt="KTN with 6 different HGNN models" width="450"/>
<figcaption>
Figure 6. KTN can be applied to 6 different HGNN models and improve their zero-shot performance on target domains. Performance is measured in NDCG. Higher is better.
</figcaption>
</figure>
<p>KTN can be applied to almost all HGNN models that have node/edge type-specific modules and improve their zero-shot performance on target domains. In Figure 6, KTN improves accuracy on zero-labeled node types across 6 different HGNN models by up to 960%.</p>
<h2 id="takeaway"><strong>Takeaway</strong></h2>
<p>Various real-world applications can be presented as heterogeneous graphs. Heterogeneous graph neural networks (HGNNs) are an effective technique for summarizing heterogeneous graphs into concise embeddings. However, label scarcity issues on certain types of nodes have prevented the broader application of HGNNs. In this post, we introduced KTN, the first cross-type transfer learning method designed for HGNNs. With KTN, we can fully exploit the rich relational information of heterogeneous graphs with HGNNs on any nodes regardless of their label scarcity.</p>
<p>For more details about KTN, check out our paper [1].</p>
<p>[1] Minji Yoon, John Palowitch, Dustin Zelle, Ziniu Hu, Ruslan Salakhutdinov, Bryan Perozzi. <em>Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks</em>, Neural Information Processing Systems (NeurIPS) 2022.</p>
<p>[2] Tiancheng Huang, Ke Xu, and Donglin Wang. <em>Da-hgt: Domain adaptive heterogeneous graph transformer.</em> arXiv preprint arXiv:2012.05688, 2020.</p>
<p>[3] Shuwen Yang, Guojie Song, Yilun Jin, and Lun Du. <em>Domain adaptive classification on heterogeneous information networks.</em> International Joint Conferences on Artificial Intelligence (IJCAI) 2021.</p>
<p>[4] Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, et al. <em>Oag: Toward linking large-scale heterogeneous entity graphs.</em> In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2019.</p>
<p>[5] Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. <em>Heterogeneous network representation learning: A unified framework with survey and benchmark.</em> IEEE Transactions on Knowledge and Data Engineering, 2020.</p>
FIFO is Better than LRU: the Power of Lazy Promotion and Quick Demotion2023-09-20T00:00:00+00:002023-09-20T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/fifo-lru/<blockquote>
<p><strong>TL;DR:</strong>
Historically, FIFO-based algorithms have been considered less efficient (having higher miss ratios) than LRU-based algorithms.
In this blog, we introduce two techniques, <strong>lazy promotion</strong>, which promotes objects only at eviction time, and <strong>quick demotion</strong>, which evicts most new objects quickly. We will show that</p>
<ul>
<li>The “weak LRUs” suggested by conventional wisdom, e.g., FIFO-Reinsertion, are actually more efficient (having lower miss ratios) than LRU;</li>
<li>Simply evicting most new objects quickly can improve a state-of-the-art algorithm’s efficiency;</li>
<li>Eviction algorithms can be designed like building with LEGOs by adding <strong>lazy promotion</strong> and <strong>quick demotion</strong> on top of FIFO.</li>
</ul>
</blockquote>
<h2 id="introduction">Introduction</h2>
<p>Caching is a well-known and widely deployed technique to speed up data access, reduce repeated computation and data transfer.
A core component of a cache is the eviction algorithm, which chooses the objects stored in the limited cache space.
Two metrics describe the performance of an eviction algorithm: efficiency, measured by the miss ratio, and throughput, measured by the number of requests that can be served per second.</p>
<p>The study of cache eviction algorithms has a long history, with a majority of the work centered around LRU (that is, to evict the least-recently-used object).
Generally, LRU maintains a doubly-linked list, promoting objects to the head of the list upon cache hits and evicting the object at the tail of the list when needed.
<a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3399709">Belady and others found</a> that memory access patterns often exhibit temporal locality — “the most recently used pages were most likely to be reused in the immediate future”. Thus, LRU using <em>recency</em> to promote objects was found to be better than FIFO.</p>
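<p>As a concrete reference point, here is a minimal Python sketch of LRU. The class and method names are ours for illustration only, and <code>OrderedDict</code> stands in for the doubly-linked list plus hash table used in real implementations:</p>

```python
from collections import OrderedDict

# Minimal LRU sketch: a hit promotes the object to the most-recent end
# (eager promotion on every access); a miss on a full cache evicts the
# least-recently-used object at the other end.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def request(self, key) -> bool:
        """Process one request; return True on a hit, False on a miss."""
        if key in self.store:
            self.store.move_to_end(key)      # promote on every hit
            return True
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the LRU object
        self.store[key] = True
        return False

cache = LRUCache(2)
hits = [cache.request(k) for k in ["A", "B", "A", "C", "B"]]
# "A" is promoted by its second request, so "B" is the one evicted
# when "C" arrives.
```

<p>Note that every hit reorders the queue — this is the eager promotion whose pointer-update cost the rest of this post contrasts with FIFO.</p>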
<p>Most eviction algorithms designed to achieve high efficiency start from LRU.
For example, many algorithms, such as <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/fast-03/arc-self-tuning-low-overhead-replacement-cache">ARC</a>, <a rel="noopener" target="_blank" href="https://research.facebook.com/publications/an-analysis-of-facebook-photo-caching/">SLRU</a>, <a rel="noopener" target="_blank" href="https://www.vldb.org/conf/1994/P439.PDF">2Q</a>, <a rel="noopener" target="_blank" href="https://www.usenix.org/legacy/events/usenix01/full_papers/zhou/zhou.pdf">MQ</a>, and <a rel="noopener" target="_blank" href="https://lwn.net/Articles/856931/">multi-generational LRU</a>, use multiple LRU queues to separate hot and cold objects. Some algorithms, e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/511399.511340?casa_token=x3My6rber5UAAAAA%3A7Gbpkgt2k6RMf95GUwvxrsY0-R-q5EpEN_uXRAfF4loxK2vo9yFtFh6Vo5R-30Vlkv1_3BtwnJiomlw">LIRS</a>, maintain an LRU queue but use different metrics to promote objects. While other algorithms, e.g., <a rel="noopener" target="_blank" href="https://www.computer.org/csdl/journal/tc/2001/12/t1352/13rRUxBJhES">LRFU</a>, <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/301464.301486">EE-LRU</a>, <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/hotstorage18/hotstorage18-paper-vietri.pdf">LeCaR</a>, and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/fast21-rodriguez.pdf">CACHEUS</a>, augment LRU’s recency with different metrics. In addition, many recent works, e.g., <a rel="noopener" target="_blank" href="https://ieeexplore.ieee.org/abstract/document/7056022">Talus</a>, improve LRU’s ability to handle scan and loop requests.</p>
<p>Besides efficiency (miss ratio), there have been fruitful studies on enhancing the cache’s execution performance and thread scalability. Each cache hit in LRU promotes an object to the head of the queue, which requires updating at least six pointers guarded by locks.
These overheads are not acceptable in many deployments that need high performance.
Thus, performance-centric systems often use FIFO-based algorithms to avoid LRU’s overheads.
For example, FIFO-Reinsertion and variants of CLOCK have been developed, which serve as LRU approximations.
<em>It is often perceived that these algorithms trade miss ratio for better throughput and scalability.</em></p>
<p>In this blog, I am going to show that FIFO is in fact better than LRU, not only because of its higher throughput and better scalability, but also because of its higher efficiency (lower miss ratios).</p>
<h2 id="why-fifo-and-what-it-needs">Why FIFO and What it needs</h2>
<p>FIFO has many benefits over LRU.
For example, FIFO has <em>less metadata</em> and requires no metadata update on each cache hit, and thus is <em>faster and more scalable</em> than LRU. In contrast, LRU requires updating six pointers on each cache hit, which is not friendly for modern computer architecture due to random memory accesses. Moreover, FIFO is always the first choice when implementing a flash cache because it does not incur write amplification. Although FIFO has throughput and scalability benefits, it is conventional wisdom that FIFO is less effective (having higher miss ratio) than LRU.</p>
<p align="center">
<figure class="image"><img src="cacheAbs.svg" alt="A cache abstraction" style="width:80%; display: block; margin-left: auto; margin-right: auto;">
<figcaption>A cache can be viewed as a logically ordered queue with four operations: insertion, removal, promotion and demotion. Most eviction algorithms can be viewed as promotion algorithms because they focus on how to promote objects. </figcaption>
</figure>
</p>
<p>To understand the various factors that affect the miss ratio, we introduce a cache abstraction.
A cache can be viewed as a logically total-ordered queue with four operations: <span style="font-family:monaco;">insertion</span>, <span style="font-family:monaco;">removal</span>, <span style="font-family:monaco;">promotion</span>, and <span style="font-family:monaco;">demotion</span>.
Objects in the cache can be compared and ordered based on some metric (e.g., time since the last request), and the eviction algorithm evicts the least valuable object based on the metric.
<span style="font-family:monaco;">Insertion</span> and <span style="font-family:monaco;">removal</span> are user-controlled operations, where <span style="font-family:monaco;">removal</span> can either be directly invoked by the user or indirectly via the use of time-to-live (TTL).
<span style="font-family:monaco;">Promotion</span> and <span style="font-family:monaco;">demotion</span> are internal operations of the cache used to maintain the logical ordering between objects.</p>
<p>We observe that most eviction algorithms use <span style="font-family:monaco;">promotion</span> to update the ordering between objects.
For example, LRU-based algorithms promote objects to the head of the queue on cache hits, which we call <span style="font-family:monaco;">eager promotion</span>.
Meanwhile, <span style="font-family:monaco;">demotion</span> is performed implicitly: when an object is promoted, other objects are passively demoted.
We call this process <span style="font-family:monaco;">passive demotion</span>, a slow process as objects need to traverse through the cache queue before being evicted.
However, we will show that instead of eager promotion and passive demotion, eviction algorithms should use <strong>lazy promotion</strong> and <strong>quick demotion</strong>.</p>
<h2 id="lazy-promotion">Lazy Promotion</h2>
<p>To keep popular objects from being evicted while not incurring much performance overhead, we propose adding <strong>lazy promotion</strong> on top of FIFO (called <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>), which <em>promotes objects only when they are about to be evicted</em>.
<strong>Lazy promotion</strong> aims to retain popular objects with minimal effort.
An example is FIFO-Reinsertion (note that FIFO-Reinsertion, 1-bit CLOCK, and Second Chance are different implementations of the same eviction algorithm): an object is reinserted at eviction time if it has been requested while in the cache.</p>
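<p>A minimal sketch of FIFO-Reinsertion in Python (our own illustrative code, not a production implementation): a hit only flips a bit, and all reordering is deferred to eviction time.</p>

```python
from collections import deque

# FIFO-Reinsertion / 1-bit CLOCK sketch: a hit only sets a visited bit
# (lazy promotion); at eviction time, a visited object is reinserted at
# the head with its bit cleared, and the first unvisited object found at
# the tail is evicted.
class FifoReinsertion:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()    # left = head (newest), right = tail (oldest)
        self.visited = {}       # key -> requested-while-cached bit

    def request(self, key) -> bool:
        """Process one request; return True on a hit, False on a miss."""
        if key in self.visited:
            self.visited[key] = True            # no queue update on a hit
            return True
        while len(self.visited) >= self.capacity:
            victim = self.queue.pop()           # examine the oldest object
            if self.visited[victim]:
                self.visited[victim] = False
                self.queue.appendleft(victim)   # reinsert: lazy promotion
            else:
                del self.visited[victim]        # evict
        self.queue.appendleft(key)
        self.visited[key] = False
        return False

cache = FifoReinsertion(2)
hits = [cache.request(k) for k in ["A", "B", "A", "C", "B", "C"]]
# "A" earns a reinsertion via its second request; never-requested "B"
# is evicted when "C" arrives.
```

<p>Unlike the LRU sketch earlier, a hit here touches only a Boolean field, which is why this design keeps FIFO’s throughput and scalability.</p>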
<p><span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> has several benefits over eager promotion (promoting on every access) used in LRU-based algorithms.
First, <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> inherits FIFO’s throughput and scalability benefits because few metadata operations are needed when an object is requested. For example, FIFO-Reinsertion only needs to update a Boolean field upon the <em>first</em> request to a cached object without locking.
Second, performing promotion at eviction time allows the cache to make better decisions by accumulating more information about the objects, e.g., how many times an object has been requested.</p>
<style>
td,th {
font-size: 96%;
}
</style>
<table><thead><tr><th>Trace</th><th>approx time</th><th align="right">#trace</th><th align="right">cache type</th><th align="right">#req (millions)</th><th align="right">#obj (millions)</th></tr></thead><tbody>
<tr><td>MSR</td><td>2007</td><td align="right">13</td><td align="right">block</td><td align="right">410</td><td align="right">74</td></tr>
<tr><td>FIU</td><td>2008</td><td align="right">9</td><td align="right">block</td><td align="right">514</td><td align="right">20</td></tr>
<tr><td>Cloudphysics</td><td>2015</td><td align="right">106</td><td align="right">block</td><td align="right">2,114</td><td align="right">492</td></tr>
<tr><td>Major CDN</td><td>2018</td><td align="right">219</td><td align="right">object</td><td align="right">3,728</td><td align="right">298</td></tr>
<tr><td>Tencent Photo</td><td>2018</td><td align="right">2</td><td align="right">object</td><td align="right">5,650</td><td align="right">1,038</td></tr>
<tr><td>Wiki CDN</td><td>2019</td><td align="right">3</td><td align="right">object</td><td align="right">2,863</td><td align="right">56</td></tr>
<tr><td>Tencent CBS</td><td>2020</td><td align="right">4030</td><td align="right">block</td><td align="right">33,690</td><td align="right">551</td></tr>
<tr><td>Alibaba</td><td>2020</td><td align="right">652</td><td align="right">block</td><td align="right">19,676</td><td align="right">1702</td></tr>
<tr><td>Twitter</td><td>2020</td><td align="right">54</td><td align="right">KV</td><td align="right">195,441</td><td align="right">10,650</td></tr>
<tr><td>Social Network</td><td>2020</td><td align="right">219</td><td align="right">KV</td><td align="right">549,784</td><td align="right">42,898</td></tr>
</tbody></table>
<p>To understand <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>’s efficiency,
we performed a large-scale simulation study on 5307 production traces from 10 data sources, which include open-source and proprietary datasets collected between 2007 and 2020.
The 10 datasets contain 814 billion (6,386 TB) requests and 55.2 billion (533 TB) objects, and cover different types of caches, including block, key-value (KV), and object caches.
We further divide the traces into block and web (including Memcached and CDN).
We choose small/large cache size as 0.1%/10% of the number of unique objects in the trace.</p>
<p>We compare the miss ratios of LRU with two <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> algorithms:
FIFO-Reinsertion and 2-bit CLOCK.
2-bit CLOCK tracks object frequency up to three, and an object’s frequency decreases by one each time the CLOCK hand scans through it. Objects with frequency zero are evicted.</p>
<p>Common wisdom suggests that these two <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> examples are LRU approximations and will exhibit higher miss ratios than LRU.
However, we found that <strong><span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> often exhibits miss ratios lower than LRU</strong>.</p>
<div style="display: flex; justify-content: space-around;">
<img src="multi_LRU_FIFO_Reinsertion_1.svg" alt="small cache" style="width:40%">
<img src="multi_LRU_FIFO_Reinsertion_3.svg" alt="large cache" style="width:40%">
</div>
<div style="width: 88%; margin: 0 auto;">
Comparison of FIFO-Reinsertion and LRU on 10 datasets with 5307 traces. Left: small cache, right: large cache.
</div>
<div style="display: flex; justify-content: space-around;">
<img src="multi_LRU_Clock-2_1.svg" alt="small cache" style="width:40%">
<img src="multi_LRU_Clock-2_3.svg" alt="large cache" style="width:40%">
</div>
<div style="width: 88%; margin: 0 auto;">
Comparison of 2-bit CLOCK and LRU on 10 datasets with 5307 traces. Left: small cache, right: large cache. A longer bar means the algorithm is more efficient (having lower miss ratios on more traces). Note that we do not consider the overhead of LRU metadata in these evaluations.
</div>
<p>The figure above shows that FIFO-Reinsertion and 2-bit CLOCK are better than LRU on most traces.
Specifically, FIFO-Reinsertion is better than LRU on 9 and 7 of the 10 datasets using a small and large cache size, respectively.
Moreover, on half of the datasets, more than 80% of the traces in each dataset favor FIFO-Reinsertion over LRU at both sizes.
On the two social network datasets, LRU is better than FIFO-Reinsertion (especially at the large cache size). This is because most objects in these two datasets are accessed more than once, and using one bit to track object access is insufficient. Therefore, when increasing the one bit in FIFO-Reinsertion (CLOCK) to two bits (2-bit CLOCK), we observe that the number of traces favoring <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> increases to around 70%.
Overall, 2-bit CLOCK is better than LRU on all datasets at the small cache size and on 9 of the 10 datasets at the large cache size.</p>
<figure class="image">
<img src="LP.svg" alt="Lazy promotion leads to quick demotion" style="width:64%">
<figcaption>FIFO-Reinsertion demotes new objects faster than LRU because objects requested before the new object also push it down the queue.</figcaption>
</figure>
<p>Two reasons contribute to <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span>’s high effectiveness.
First, <strong>lazy promotion</strong> often leads to <strong>quick demotion</strong>. For example, under LRU, a newly-inserted object <em>G</em> is pushed down the queue only by (1) new objects and (2) cached objects that are requested after <em>G</em>. However, besides the objects requested after <em>G</em>, the objects requested before <em>G</em> (but have not been promoted, e.g., <em>B</em>, <em>D</em>) also push <em>G</em> down the queue when using FIFO-Reinsertion.
Second, compared to promotion at each request, object ordering in <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> is closer to the insertion order, which we conjecture is better suited for many workloads that exhibit popularity decay — old objects have a lower probability of getting a request.</p>
<p>While <span style="font-family:arial; font-variant-cap:small-caps">LP-FIFO</span> surprisingly wins over LRU in miss ratio, it does not outperform state-of-the-art algorithms. We next discuss another building block that bridges this gap.</p>
<h2 id="quick-demotion">Quick Demotion</h2>
<p>Efficient eviction algorithms not only need to keep popular objects in the cache but also need to evict unpopular objects fast. In this section, we show that <strong>quick demotion</strong> (QD) is critical for an efficient eviction algorithm, and it enables FIFO-based algorithms to achieve state-of-the-art efficiency.</p>
<p>Because demotion happens passively in most eviction algorithms, an object typically traverses the entire cache before being evicted. Such a traversal gives each object a good chance to prove its value and stay in the cache.
However, cache workloads often follow a Zipf popularity distribution, with most objects being unpopular.
This is further exacerbated by (1) the scan and loop access patterns in the block cache workloads, and (2) the prevalence of dynamic and short-lived data, the use of versioning in object names, and the use of short TTLs in the web cache workloads.
We believe the <em>opportunity cost of new objects demonstrating their value is often too high</em>: the object being evicted at the tail of the queue may be more valuable than the objects recently inserted.</p>
<figure class="image">
<img src="QD.svg" alt="An example of quick demotion" style="width:64%">
<figcaption>An example of quick demotion: adding a small FIFO to filter most new objects that do not have a request soon after insertion.</figcaption>
</figure>
<p>To illustrate the importance of <strong>quick demotion</strong>, we add a simple QD technique on top of state-of-the-art eviction algorithms.
The QD technique consists of a small probationary FIFO queue storing cached data and a ghost FIFO queue storing metadata of objects evicted from the probationary FIFO queue.
The probationary FIFO queue uses 10% of the cache space and acts as a filter for unpopular objects: objects not requested after insertion are evicted early from the FIFO queue. The main cache runs a state-of-the-art algorithm and uses 90% of the space.
The ghost FIFO queue stores as many entries as the main cache.
Upon a cache miss, the object is written into the probationary FIFO queue unless it is in the ghost FIFO queue, in which case, it is written into the main cache.
When the probationary FIFO queue is full, if the object to evict has been accessed since insertion, it is inserted into the main cache. Otherwise, it is evicted and recorded in the ghost FIFO queue.</p>
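<p>The QD technique just described can be sketched as follows. This is a simplified illustration: the main cache here is a plain LRU as a stand-in for the state-of-the-art algorithms evaluated below, and the 10%/90% split follows the text.</p>

```python
from collections import OrderedDict

class QDCache:
    """Sketch of the quick-demotion filter: a small probationary FIFO
    (10% of space), a ghost FIFO of keys evicted from it, and a main
    cache (a plain LRU here, standing in for ARC/LIRS/LeCaR/etc.)."""

    def __init__(self, capacity):
        self.fifo_cap = max(1, capacity // 10)
        self.main_cap = capacity - self.fifo_cap
        self.fifo = OrderedDict()    # key -> accessed-since-insertion bit
        self.main = OrderedDict()    # LRU order, least recent first
        self.ghost = OrderedDict()   # metadata only, bounded by main_cap

    def request(self, key):
        """Returns True on a cache hit, False on a miss."""
        if key in self.main:
            self.main.move_to_end(key)       # LRU promotion in the main cache
            return True
        if key in self.fifo:
            self.fifo[key] = True            # mark as accessed
            return True
        if key in self.ghost:                # ghost hit: straight to main
            del self.ghost[key]
            self._insert_main(key)
        else:
            self._insert_fifo(key)
        return False

    def _insert_fifo(self, key):
        while len(self.fifo) >= self.fifo_cap:
            victim, accessed = self.fifo.popitem(last=False)
            if accessed:
                self._insert_main(victim)    # proved popular: promote
            else:
                self.ghost[victim] = None    # demoted quickly; remember it
                while len(self.ghost) > self.main_cap:
                    self.ghost.popitem(last=False)
        self.fifo[key] = False

    def _insert_main(self, key):
        while len(self.main) >= self.main_cap:
            self.main.popitem(last=False)    # evict the LRU object
        self.main[key] = None
```

<p>Objects that receive no request while in the probationary FIFO are demoted after traversing only 10% of the cache space, rather than the whole queue.</p>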
<p>We add this FIFO-based QD technique to five state-of-the-art eviction algorithms, ARC, LIRS, CACHEUS, LeCaR, and LHD.
We used the open-source LHD implementation from the authors, implemented the others following the corresponding papers, and cross-checked with open-source implementations.
We evaluated the QD-enhanced and original algorithms on the 5307 traces.
Because the traces have a wide range of miss ratios, we choose to present each algorithm’s miss ratio reduction from FIFO calculated as <em>(mr<sub>FIFO</sub> - mr<sub>algo</sub>) / mr<sub>FIFO</sub></em>. Therefore, higher values are better. </p>
<div style="display: flex; justify-content: space-around;">
<img src="block_1.svg" alt="block cache traces, small cache size" style="width:48%">
<img src="block_3.svg" alt="block cache traces, large cache size" style="width:48%">
</div>
<div style="display: flex; justify-content: space-around;">
<img src="web_1.svg" alt="web cache traces, small cache size" style="width:48%">
<img src="web_3.svg" alt="web cache traces, large cache size" style="width:48%">
</div>
<div style="width: 88%; margin: 0 auto;">
On the block (first row) and web traces (second row), quick demotion can improve most state-of-the-art algorithms' efficiency. Left: small cache, right: large cache.
</div>
<p>The figures above show that the QD-enhanced algorithms further reduce the miss ratio of each state-of-the-art algorithm on almost all percentiles. For example, QD-ARC (QD-enhanced ARC) reduces ARC’s miss ratio by up to 59.8% with a mean reduction of 1.5% across all workloads on the two cache sizes, QD-LIRS reduces LIRS’s miss ratio by up to 49.6% with a mean of 2.2%, and QD-LeCaR reduces LeCaR’s miss ratio by up to 58.8% with a mean of 4.5%.
Note that achieving a large miss ratio reduction on a large number of diverse traces is non-trivial. For example, the best state-of-the-art algorithm, ARC, can only reduce the miss ratio of LRU by 6.2% on average.</p>
<p>The gap between a QD-enhanced algorithm and the original algorithm is wider (1) when the state-of-the-art algorithm is relatively weak, (2) when the cache size is large, and (3) on the web workloads.
First, with a weaker state-of-the-art algorithm, the opportunity for improvement is larger, allowing QD to provide more prominent benefits. For example, QD-LeCaR reduces LeCaR’s miss ratios by 4.5% on average, larger than the reductions for the other state-of-the-art algorithms.
Second, when the cache size is large, unpopular objects spend more time in the cache, and <strong>quick demotion</strong> becomes more valuable.
For example, QD-ARC and ARC have similar miss ratios on the block workloads at the small cache size. But QD-ARC reduces ARC’s miss ratio by 2.3% on average at the large cache size.
However, when the cache size is too large, e.g., 80% of the number of objects in the trace,
adding QD may increase the miss ratio (not shown).
Third, QD provides more benefits on the web workloads than the block workloads, especially when the cache size is small. We conjecture that web workloads have more short-lived data and exhibit stronger popularity decay, which leads to a more urgent need for <strong>quick demotion</strong>.
While <strong>quick demotion</strong> improves the efficiency of most state-of-the-art algorithms, for a small subset of traces, QD may increase the miss ratio when the cache size is small because the probationary FIFO is too small to capture some potentially popular objects.</p>
<p>Although adding the probationary FIFO improves efficiency, it further increases the complexity of the already complicated state-of-the-art algorithms.
To reduce complexity, we add the same QD technique on top of 2-bit CLOCK and call it <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span>.
<span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> uses two FIFO queues to cache data and a ghost FIFO queue to track evicted objects.
It is not hard to see that <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> is simpler than all state-of-the-art algorithms — it requires at most one metadata update on a cache hit and no locking for any cache operation. Therefore, we believe it will be faster and more scalable than all state-of-the-art algorithms.
Besides enjoying all the benefits of simplicity, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> also achieves lower miss ratios than state-of-the-art algorithms.
For example, compared to LIRS and LeCaR, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> reduces miss ratio by 1.6% and 4.3% on average, respectively, across the 5307 traces.
While the goal of this work is not to propose a new eviction algorithm, <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> illustrates how we can build simple yet efficient eviction algorithms by adding <strong>quick demotion</strong> and <strong>lazy promotion</strong> techniques to a simple base eviction algorithm such as FIFO.</p>
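<p>As a rough single-threaded illustration of this composition (the real appeal of the design is its lock-free, scalable implementation, which a sketch like this cannot show), QD-LP-FIFO combines the probationary/ghost FIFOs with a 2-bit CLOCK main queue. The decrement-on-scan behavior below is one common CLOCK variant, assumed here for concreteness.</p>

```python
from collections import OrderedDict

class QDLPFifo:
    """Sketch of QD-LP-FIFO: a small probationary FIFO, a ghost FIFO,
    and a main 2-bit CLOCK queue (FIFO with reinsertion, where a hit
    only bumps a 2-bit counter)."""

    def __init__(self, capacity):
        self.fifo_cap = max(1, capacity // 10)
        self.main_cap = capacity - self.fifo_cap
        self.fifo = OrderedDict()   # key -> accessed bit
        self.main = OrderedDict()   # key -> 2-bit frequency counter
        self.ghost = OrderedDict()  # metadata of quickly-demoted keys

    def request(self, key):
        if key in self.main:
            self.main[key] = min(self.main[key] + 1, 3)  # lazy promotion
            return True
        if key in self.fifo:
            self.fifo[key] = True
            return True
        if key in self.ghost:
            del self.ghost[key]
            self._insert_main(key)
        else:
            self._insert_fifo(key)
        return False

    def _insert_fifo(self, key):
        while len(self.fifo) >= self.fifo_cap:
            victim, accessed = self.fifo.popitem(last=False)
            if accessed:
                self._insert_main(victim)
            else:
                self.ghost[victim] = None   # quick demotion
                while len(self.ghost) > self.main_cap:
                    self.ghost.popitem(last=False)
        self.fifo[key] = False

    def _insert_main(self, key):
        while len(self.main) >= self.main_cap:
            victim, count = self.main.popitem(last=False)
            if count > 0:
                self.main[victim] = count - 1  # reinsert at tail, decremented
            # count == 0: evicted; the loop re-checks the size
        self.main[key] = 0
```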
<h2 id="discussion">Discussion</h2>
<p>We have demonstrated reinsertion as an example of LP and the use of a small probationary FIFO queue as an example of QD. However, these are not the only techniques.
For example, reinsertion can leverage different metrics to decide whether the object should be reinserted. Besides reinsertion, several other techniques are often used to reduce promotion and improve scalability, e.g., periodic promotion, batched promotion, promoting old objects only, and promoting with try-lock.
Although these techniques do not fall into our strict definition of <strong>lazy promotion</strong> (promotion on eviction), many of them effectively keep popular objects from being evicted.
On the <strong>quick demotion</strong> side, besides the small probationary FIFO queue, one can leverage other techniques to define and discover unpopular objects such as <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/atc17/atc17-blankstein.pdf">Hyperbolic</a> and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/conference/nsdi18/nsdi18-beckmann.pdf">LHD</a>.
Moreover, admission algorithms, e.g., <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/3149371">TinyLFU</a>, Bloom Filter, probabilistic, and <a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/nsdi19-eisenman.pdf">ML-based admission algorithms</a>, can be viewed as a form of QD — though some of them are too aggressive at demotion (rejecting objects from entering the cache).</p>
<p>Note that QD bears similarity with some <a rel="noopener" target="_blank" href="https://wiki.c2.com/?GenerationalGarbageCollection">generational garbage collection algorithms</a>, which separately store short-lived and long-lived data in young-gen and old-gen heaps.
Therefore, ideas from garbage collection may be borrowed to strengthen cache eviction algorithms.</p>
<p>The design of <span style="font-family:arial; font-variant-cap:small-caps">QD-LP-FIFO</span> opens the door to designing simple yet efficient cache eviction algorithms by innovating on LP and QD techniques. And we envision future eviction algorithms can be designed like building LEGO — adding <strong>lazy promotion</strong> and <strong>quick demotion</strong> on top of a base eviction algorithm.</p>
Verus: A tool for verified systems code in Rust2023-08-03T00:00:00+00:002023-08-03T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/rust-verification-with-verus/<p>Part of the challenge (and fun) of low-level systems code is in the optimizations it employs:
developers might use manual memory management, they might use bit-packing and bit-twiddling optimizations,
or they might use multi-threading to speed up their code.
When dealing with such things for critical software, though, it can be difficult to ensure their correctness.
This is why my research group is interested in the formal verification of systems software:
ensuring through computer-checked mathematical proofs that software does what it is supposed to,
and ideally not compromising on these optimizations.</p>
<p>For this purpose, we have been developing <a rel="noopener" target="_blank" href="https://github.com/verus-lang/verus">Verus</a>,
a verification tool for <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/stable/book/">the Rust programming language</a>.
Rust is increasingly popular as a systems programming language today,
but we didn’t (just) choose it because of its popularity.
Rather, it turns out that the properties that make it attractive as a systems programming language—most notably,
that it allows manual memory management while simultaneously guaranteeing memory-safety—<em>also</em> make it excellent
in the setting of formal verification: in some ways straightforward,
and in some ways surprising. In this blog post, I’ll explain what these ways are.</p>
<h1 id="verification-mutable-memory-and-rust">Verification, mutable memory, and Rust</h1>
<p>First, we’re interested in proving code to be “correct”. What does that mean exactly?
Let’s get our feet wet in verification with some simple examples and then talk about a challenge that Rust helps us solve.</p>
<h2 id="intro-to-verus">Intro to Verus</h2>
<p>The key idea behind Verus is to check additional properties of programs that Rust doesn’t check on its own.
For example, consider the following valid Rust program, operating over an 8-bit integer.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">double</span><span>(i: </span><span style="color:#fffb9d;">u8</span><span>) -> </span><span style="color:#fffb9d;">u8 </span><span>{
</span><span> </span><span style="color:#fed6af;">return</span><span> i </span><span style="color:#ececec;">* </span><span style="font-weight:bold;color:#87d6d5;">2</span><span>;
</span><span>}
</span></code></pre>
<p>Though it’s a valid program, it (potentially) has a problem: if the argument <code>i</code> is more than 127, then the multiplication will overflow the 8-bit integer.
If you run Verus on it (which you can <a rel="noopener" target="_blank" href="https://play.verus-lang.org/?version=stable&mode=basic&edition=2021&code=use+vstd%3A%3Aprelude%3A%3A*%3B%0A%0Averus%21+%7B%0A%0A++++fn+double%28i%3A+u8%29+-%3E+u8+%7B%0A++++++++return+i+*+2%3B%0A++++%7D%0A++++%0A%7D%0A%0Afn+main%28%29+%7B%7D%0A%0A">try yourself at the Verus playground</a>),
Verus reports this error:</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>error: possible arithmetic underflow/overflow
</span><span> --> /playground/src/main.rs:6:16
</span><span> |
</span><span>6 | return i * 2;
</span><span> | ^^^^^
</span></code></pre>
<p>To remedy this, the programmer can declare their <em>intent</em>: namely, that the <code>double</code> function should never be called with any argument greater than 127.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">double</span><span>(i: </span><span style="color:#fffb9d;">u8</span><span>) -> </span><span style="color:#fffb9d;">u8
</span><span> requires i <= 127
</span><span>{
</span><span> return i </span><span style="color:#ececec;">* </span><span style="font-weight:bold;color:#87d6d5;">2</span><span>;
</span><span>}
</span></code></pre>
<p>The <code>requires</code> clause is not a standard Rust feature, but a feature of Verus: in general, Verus source code comprises both Rust code and extra directives for Verus, such as this
<code>requires</code> clause, also known as a <em>precondition</em>. In any case, Verus now accepts the program (<a rel="noopener" target="_blank" href="https://play.verus-lang.org/?version=stable&mode=basic&edition=2021&code=use+vstd%3A%3Aprelude%3A%3A*%3B%0A%0Averus%21+%7B%0A%0A++++fn+double%28i%3A+u8%29+-%3E+u8%0A++++++++requires+i+%3C%3D+127%0A++++%7B%0A++++++++return+i+*+2%3B%0A++++%7D%0A++++%0A%7D%0A%0Afn+main%28%29+%7B%7D%0A%0A">playground link</a>) because with the new assumption, Verus can determine that this arithmetic operation never overflows.</p>
<p>Furthermore, any time the developer calls <code>double</code> from elsewhere in the program, Verus will check that the call satisfies the precondition.
Keep in mind, also, that this check is done statically, covering all possible executions of the program; it is not a runtime check.</p>
<h2 id="specifications-and-program-correctness">Specifications and program correctness</h2>
<p>With Verus, we are actually interested in correctness criteria that go beyond simple arithmetic bounds checks.
Usually, we are interested in proving that a program’s behavior meets some <em>specification</em>, as in this function that computes the maximum of two integers:</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">max</span><span>(a: </span><span style="color:#fffb9d;">u64</span><span>, b: </span><span style="color:#fffb9d;">u64</span><span>) -> (result: </span><span style="color:#fffb9d;">u64</span><span>)
</span><span> ensures
</span><span> result </span><span style="color:#ececec;">==</span><span> a </span><span style="color:#ececec;">||</span><span> result </span><span style="color:#ececec;">==</span><span> b,
</span><span> result </span><span style="color:#ececec;">>=</span><span> a,
</span><span> result </span><span style="color:#ececec;">>=</span><span> b,
</span><span>{
</span><span> </span><span style="color:#fed6af;">if</span><span> a </span><span style="color:#ececec;">></span><span> b {
</span><span> </span><span style="color:#fed6af;">return</span><span> a;
</span><span> } </span><span style="color:#fed6af;">else </span><span>{
</span><span> </span><span style="color:#fed6af;">return</span><span> b;
</span><span> }
</span><span>}
</span></code></pre>
<p>Again, let’s highlight the Verus-specific parts:
first, we have the <code>ensures</code> clause (also known as a <em>postcondition</em>) serving as the function’s specification, along with the name <code>result</code> on the return type,
which is referenced from said postcondition.
Once again, the body of the <code>max</code> function is the Rust code we are verifying.</p>
<p>The <code>ensures</code> clause denotes a predicate that should hold true at the end of the call to <code>max</code>.
This determines what it means for an implementation of <code>max</code> to be “correct”: it is correct if every execution of its code
returns a result that meets its specification.</p>
<p>So, how does Verus actually check that this property holds?
To do this, Verus (and similar tools) encode the correctness of <code>max</code> as logical formulae called <em>verification conditions</em>:</p>
<p>\[ a > b \implies result = a \implies (result = a \lor result = b) \land (result \ge a) \land (result \ge b) \]</p>
<p>\[ \lnot(a > b) \implies result = b \implies (result = a \lor result = b) \land (result \ge a) \land (result \ge b) \]</p>
<p>These conditions are simplified a bit for presentation, but they are close enough for intuition.
The first of these would be read as, “if \( a > b \) (i.e., the first branch is taken), and if \( result \) is set to the return value \( a \), then the conditions of
the ensures clause hold”. The second condition is similar, but for the <code>else</code> side of the branch.</p>
<p>If we prove the verification conditions are correct, this implies the correctness of the program according to its specification.
To do so, Verus uses an automated theorem prover—in this case, <a rel="noopener" target="_blank" href="https://github.com/Z3Prover/z3">Z3</a>—to prove the verification conditions hold for all
values of <em>a</em>, <em>b</em>, and <em>result</em>. This example is simple enough that Z3 can validate the conditions quickly, though for more complicated examples, the developer may need to write additional proofs
to help it out. If Z3 is unable to prove the conditions, either because they do not hold or because it needs additional help, then Verus outputs an error message like the one from
the previous section.</p>
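<p>To build intuition for what these verification conditions say, we can sanity-check them by brute force over a small domain in Python. This is only an illustration: Z3 proves the conditions symbolically, for <em>all</em> values of the variables, not by enumeration.</p>

```python
def post(a, b, result):
    """The ensures clause of max: the conjunction of its three predicates."""
    return (result == a or result == b) and result >= a and result >= b

def vc1(a, b, result):
    """a > b  ==>  result == a  ==>  postcondition (the first branch)."""
    return (not (a > b)) or (not (result == a)) or post(a, b, result)

def vc2(a, b, result):
    """not (a > b)  ==>  result == b  ==>  postcondition (the else branch)."""
    return (a > b) or (not (result == b)) or post(a, b, result)

# Enumerate a small domain; both conditions hold for every assignment.
N = 16
ok = all(vc1(a, b, r) and vc2(a, b, r)
         for a in range(N) for b in range(N) for r in range(N))
assert ok
```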
<p>Specification-checking is extremely useful for situations where an implementation is optimized and handles low-level details, but we would like to provide a higher-level, mathematically precise specification.
For example:</p>
<ul>
<li>A program uses the bitwise operation <code>(x & (x - 1)) == 0</code> to determine if <code>x</code> is a power-of-2, but uses a more mathematically precise specification, \( \exists b.~ 2^b = x \).</li>
<li>A data-structure implements a hash table or a red-black tree, but has a specification stating that its operations are equivalent to those of a mathematical set.</li>
<li>A replicated data structure with a sophisticated synchronization algorithm uses a specification that it acts indistinguishably from a single copy of the data structure.</li>
</ul>
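<p>The first item in the list above is easy to check by hand. Below is a Python illustration comparing the bit trick to its mathematical specification; note that the trick alone also accepts 0, so the sketch adds an <code>x != 0</code> guard (an assumption of mine, not part of the original example).</p>

```python
def is_pow2_trick(x):
    """The optimized bitwise check; the x != 0 guard is needed because
    the bit trick alone classifies 0 as a power of two."""
    return x != 0 and (x & (x - 1)) == 0

def is_pow2_spec(x):
    """The mathematical specification: exists b such that 2**b == x."""
    b = 0
    while 2 ** b <= x:
        if 2 ** b == x:
            return True
        b += 1
    return False

# Exhaustively compare implementation and spec on a small range; a
# verifier like Verus proves such equivalences for the whole type.
assert all(is_pow2_trick(x) == is_pow2_spec(x) for x in range(1024))
```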
<h2 id="challenge-handling-mutable-memory">Challenge: handling mutable memory</h2>
<p>One such “low-level detail” we often have to reason about is <em>mutable heap state</em>.
To see why this is challenging without Rust’s help, let us set aside Rust for a moment,
and imagine we designed a programming language with general pointer types, like in C.
Consider a simple function that takes two pointers and updates one of them:</p>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> Imagined C-like verification language
</span><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">compute_boolean_not</span><span>(</span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x, </span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x_not)
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">bool</span><span> tmp </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>This program looks straightforward at first, but it actually has a problem: what if <code>x</code> and <code>x_not</code> point to the same memory?
Then <code>*x</code> would be updated when we update <code>*x_not</code>. Therefore, a tool would never be able to prove this code matches its specification—it simply isn’t true.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/compute_boolean_not_graphical.png" alt="Visual representation of the above example" /></p>
<p align="center"><i><b>Left:</b> what the developer imagines happening. <b>Right:</b> what might actually happen.</i></p>
<p>One solution is to specify that the pointers do not <em>alias</em> with each other, i.e., that they don’t point to the same memory location:</p>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> Imagined C-like verification language
</span><span style="color:#fffb9d;">void </span><span style="color:#fffd87;">compute_boolean_not</span><span>(</span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x, </span><span style="color:#fffb9d;">bool</span><span style="color:#ececec;">* </span><span>x_not)
</span><span> requires x </span><span style="color:#ececec;">!=</span><span> x_not </span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> This line has been added
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">bool</span><span> tmp </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>Recall the <code>requires</code> clause here indicates an assumption the function can make at the beginning of its execution.
By making this assumption, Verus can now check that the specification holds, although now every call to <code>compute_boolean_not</code>
will need to uphold this contract.</p>
<p>Unfortunately, adding these “non-aliasing conditions” gets unwieldy fast, as data structures increase both in breadth and depth.
This was our experience when we wrote the first version of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">VeriBetrKV</a>, a key-value store developed in <a rel="noopener" target="_blank" href="https://dafny.org/">Dafny</a>, which has a similar aliasing situation to our C-like language.
Not only were the conditions difficult to write manually, but getting them wrong often led to error messages that were difficult to diagnose.</p>
<h2 id="rust-to-the-rescue">Rust to the rescue</h2>
<p>In Rust, it isn’t common to use general-purpose pointer types. Instead, Rust uses more restricted <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/book/ch04-02-references-and-borrowing.html"><em>reference</em> types</a>. In Rust, the types <code>&T</code> and <code>&mut T</code>
each denote a reference to a value of type <code>T</code>.
In the case of <code>&mut T</code>, which is specifically a <em>mutable</em> reference, the user is able
to modify the value behind the pointer.
Thus, in Rust/Verus, our boolean-negation example would look like this, with the <code>x_not</code> parameter marked as a mutable reference.</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">fn </span><span style="color:#fffd87;">compute_boolean_not</span><span>(x: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">bool</span><span>, x_not: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">mut bool</span><span>)
</span><span> ensures (</span><span style="color:#ececec;">*</span><span>x_not) </span><span style="color:#ececec;">== !</span><span>(</span><span style="color:#ececec;">*</span><span>x)
</span><span>{
</span><span> </span><span style="color:#fffb9d;">let</span><span> tmp: </span><span style="color:#fffb9d;">bool </span><span style="color:#ececec;">= *</span><span>x;
</span><span> </span><span style="color:#ececec;">*</span><span>x_not </span><span style="color:#ececec;">= !</span><span>tmp;
</span><span>}
</span></code></pre>
<p>At the machine code level, these references are just like pointers, but the Rust type system enforces additional properties: namely, a <code>&mut</code> reference to a piece of data can never coexist
with another reference to that data. Rust enforces this property because it is crucial to Rust’s guarantees about memory safety.</p>
<p>However, this property is also a huge boon for software verification. Because the non-aliasing property is checked by Rust’s type system,
the developer no longer has to write the non-aliasing conditions
manually. Furthermore, Rust’s type system is fast and often presents high-quality error messages when the property is violated.</p>
<p>One can think of this as if these non-aliasing conditions are
inserted automatically, so the developer doesn’t have to worry about them, but in fact, the situation is even better: the verification tool can simplify the verification conditions to not include any
notion of pointer addresses in the first place! Indeed, some of my colleagues have <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">published a paper</a> quantifying the gains from this kind of simplification.</p>
<h1 id="are-reference-types-all-we-need">Are reference types all we need?</h1>
<p>The fact that Rust works as a language at all is evidence that reference types are sufficient
<em>most</em> of the time. Unfortunately, most of the time isn’t good enough. The non-aliasing
restriction on references gets in the way of implementing any of the following:</p>
<ul>
<li>Doubly-linked lists</li>
<li>Reference-counted pointers (e.g., Rust’s <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/std/rc/struct.Rc.html"><code>Rc</code></a>, similar to C++’s <code>shared_ptr</code>)</li>
<li>Any manner of concurrent algorithm: locks, message-passing queues, memory allocators, systems with domain-specific logic for avoiding data races</li>
</ul>
<p>The reason these examples are difficult is that Rust’s type system enforces that every object has a unique “owner” (with shared access allowed only through immutable references).
However, these examples seemingly need to violate that restriction:</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/dlist.png" alt="Visual representation of a doubly-linked list. Each node has two incoming pointers from its neighbors, and two outgoing pointers to its neighbors." /></p>
<p align="center"><i>In a doubly-linked list, each node has two neighbors which point to it. Thus, these nodes do not have unique owners.</i></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/rc.png" alt="Visual representation of reference-counted smart pointer, Rc. The shared object has multiple reference objects pointing to it." /></p>
<p align="center"><i>When working with reference-counted smart pointers, each object may have multiple reference objects. These objects need to coordinate via the reference count to drop the given object at the appropriate time. This counter does not have a unique owner.</i></p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/queue.png" alt="Visual representation of message-passing queue. The producer thread and the consumer thread each have a pointer to a shared queue buffer." /></p>
<p align="center"><i>In a message passing queue, the producer thread and the consumer thread have to share a queue buffer to store in-flight messages. This buffer does not have a unique owner.</i></p>
<p>So how can we tackle these kinds of problems?</p>
<p>For such things, Rust programmers need to use Rust’s notorious <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/stable/book/ch19-01-unsafe-rust.html">“unsafe code”</a>, which opts in to various Rust features whose safe use
the type system is unable to validate. As such, the burden shifts from the
type checker to the programmer to ensure they are used correctly.
Applications like the above are generally considered low-level, and they are often
relegated to time-tested libraries. It’s these kinds of low-level systems, though,
that we are especially interested in verifying! So what do we do?</p>
<h2 id="unsafe-code-in-verus-or-condititionally-safe-code">Unsafe code in Verus, or: “conditionally safe code”</h2>
<p>With Verus, we can recover the ability to implement such things while having a computer-checked guarantee of memory safety.
A Rust feature being “unsafe” really just means that the developer has to uphold a certain contract to use it safely, which Rust cannot check.
It is for this reason that I like to call unsafe code <em>conditionally safe</em>—i.e., it is safe subject to meeting
certain conditions. Rust cannot check these conditions, but Verus <em>can</em>.</p>
<p>Here is a simple example: Rust’s common vector indexing operation performs a bounds-check to ensure there is no memory corruption from an out-of-bounds access.
Therefore, this function is <em>unconditionally</em> “safe” to call, no matter what index the caller provides: even if the caller provides something out-of-bounds, the program might panic and exit, but it will never corrupt memory.
However, there is a lesser-used <a rel="noopener" target="_blank" href="https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked"><code>get_unchecked</code></a> operation which performs <em>no</em> such bounds check.
Thus, <code>get_unchecked</code> is only safe to call if the index is
<em>already known to be in-bounds</em>, making it unsafe (conditionally safe).
This condition can be codified as a Verus <code>requires</code> clause:</p>
<pre data-lang="rust" style="background-color:#393939;color:#dedede;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#fffb9d;">unsafe fn </span><span style="color:#fffd87;">get_unchecked</span><span><T>(vec: </span><span style="color:#ececec;">&</span><span style="color:#fffb9d;">Vec</span><span><T>, idx: </span><span style="color:#fffb9d;">usize</span><span>) -> </span><span style="color:#ececec;">&</span><span>T
</span><span> requires idx < vec</span><span style="text-decoration:underline;font-weight:bold;font-style:italic;color:#ffccee;">.</span><span>len()
</span><span> </span><span style="color:#ececec;">...
</span></code></pre>
<p>Now, Verus will check that the index is in-bounds whenever <code>get_unchecked</code> is called.
Thus, we can regain assurance in code that uses this function, provided that Verus is able to validate the code.</p>
<h2 id="handling-unsafe-ownership">Handling unsafe ownership</h2>
<p>Bounds-checking makes for an easy example, but when we consider programs like the ones
diagrammed above, the situation gets a little more complicated.
Recall that what characterizes these systems is that the objects may be pointed to
from multiple owners, which have to coordinate their access somehow.</p>
<p>As a result, the “conditions” of the conditionally safe operations become
substantially more involved. For example, accessing data through a pointer is only safe if there is no
<em>data race</em>, i.e., another thread trying to access it at the same time. Such a condition seems inherently “non-local” as it involves talking about all threads at once,
and therefore is much harder to check than that of a simple index being in bounds.</p>
<p>However, we have already discussed that Rust’s type system allows us to ensure the unique ownership of data, which then rules out illegal operations such as data races.
Therefore, the kind of “condition” we need to check is already the exact kind of condition that Rust’s type system is designed to ensure.
The problem here is just that these particular data structures do not use the specific types that are designed to ensure this. So how can we apply Rust’s philosophy anyway?</p>
<p>Since the data structures we want to verify use objects that don’t obey Rust’s unique ownership, our trick is to add <em>new</em> objects that <em>do</em>.
However, we don’t want to bog down the program with extra data—that would defeat the point of writing optimized code—so these new objects are merely “conceptual proof objects.”
In verification languages, such objects are often called <em>ghost</em> objects, not because they are spooky, but because they have no influence on the “physical world.” The real data structures in the compiled binary
would be the ones diagrammed above, but Verus treats the program as if the ghost objects were really there when generating its verification conditions.</p>
<p>For example, for a program that uses pointers, Verus programs can use a ghost object that represents “the right to read or write memory from the given location.”
Just like for ordinary (“real”) data, Rust’s type system ensures that ownership of this object is unique. Verus in turn ensures that such an object is
present when the program accesses the data behind the pointer. Combining both results, we can be confident that such an access really is data-race-free.
Even while multiple owners might point to the same piece of data, in the sense of physically having a pointer to it, only one owner at a time can have the <em>right</em> to manipulate that data.</p>
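<p>As a rough plain-Rust analogue of this idea (hypothetical code, not Verus’s actual API): a cell that many parties can reference, paired with a separate token type whose unique ownership gates access. In Verus the token would be ghost and erased before compilation; here it is a zero-sized runtime value.</p>

```rust
use std::cell::UnsafeCell;
use std::marker::PhantomData;

// Many owners may hold a reference (a "pointer") to the cell...
struct SharedCell<T> {
    data: UnsafeCell<T>,
}

// ...but the *right* to touch its contents lives in this token.
// Rust's ownership rules guarantee the token sits in one place at
// a time, so two accesses can never overlap.
struct Permission<T> {
    _marker: PhantomData<T>,
}

impl<T> SharedCell<T> {
    fn new(value: T) -> (SharedCell<T>, Permission<T>) {
        (
            SharedCell { data: UnsafeCell::new(value) },
            Permission { _marker: PhantomData },
        )
    }

    // Writing demands exclusive access to the token.
    fn write(&self, _perm: &mut Permission<T>, value: T) {
        unsafe { *self.data.get() = value }
    }

    fn read(&self, _perm: &Permission<T>) -> T
    where
        T: Copy,
    {
        unsafe { *self.data.get() }
    }
}

fn main() {
    let (cell, mut perm) = SharedCell::new(1);
    cell.write(&mut perm, 5);
    assert_eq!(cell.read(&perm), 5);
}
```

<p>This toy version does not tie a token to one particular cell (any two cells’ tokens are interchangeable), a gap that Verus’s ghost permissions close by tracking which memory location each permission governs.</p>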
<p>To verify a doubly-linked list, then, we would arrange nodes with pointers in the usual way, but in addition to the “real” nodes, we would have an additional collection of ghost objects
that represent the right to access those nodes. By writing additional Verus annotations, we can explain, mathematically, how these ghost objects relate to the structure of the linked list,
and as a result we can use the ghost objects to traverse the list.
For more details, you can see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/rust-verification-with-verus/#further-reading">our paper</a>, where we present the doubly-linked list in detail.</p>
<h1 id="further-reading">Further reading</h1>
<p>There is currently one paper on Verus available, which introduces Verus and works out
the doubly-linked list example in detail, among others. (If you compare to this blog post,
you may notice Verus’ syntax has evolved a bit since this paper was written.)</p>
<p><a rel="noopener" target="_blank" href="https://arxiv.org/abs/2303.05491">Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. <em>Verus: Verifying Rust Programs Using Linear Ghost Types.</em> (OOPSLA 2023)</a></p>
<p>Before Verus, we explored this space of verification techniques through a language
we developed called <em>Linear Dafny</em>, an extension of the verification language <a rel="noopener" target="_blank" href="https://dafny.org/">Dafny</a>. Verus incorporates many of our
lessons from Linear Dafny, on which there are several papers.
We first introduced Linear Dafny in this paper on VeriBetrKV, a verified key-value store:</p>
<p><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi20-hance.pdf">Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. <em>Storage Systems are Distributed Systems (So Verify Them That Way!).</em> (OSDI 2020)</a></p>
<p>Some of my colleagues quantified the utility of Linear Dafny’s type system via direct comparison:</p>
<p><a rel="noopener" target="_blank" href="https://homes.cs.washington.edu/%7Ejlli/papers/oopsla2022.pdf">Jialin Li, Andrea Lattuada, Yi Zhou, Jonathan Cameron, Jon Howell, Bryan Parno, and Chris Hawblitzel. <em>Linear Types for Large-Scale Systems Verification.</em> (OOPSLA 2022)</a></p>
<p>Finally, we explored the combination of ghost objects and ownership types to verify
some sophisticated concurrent systems in a Linear Dafny framework called IronSync:</p>
<p><a rel="noopener" target="_blank" href="https://www.usenix.org/system/files/osdi23-hance.pdf">Travis Hance, Yi Zhou, Andrea Lattuada, Reto Achermann, Alex Conway, Ryan Stutsman, Gerd Zellweger, Chris Hawblitzel, Jon Howell, and Bryan Parno. <em>Sharding the State Machine: Automated Modular Reasoning for Complex Concurrent Systems.</em> (OSDI 2023)</a></p>
<h1 id="related-work">Related work</h1>
<p>Verus is far from the only Rust verification tool around.
<a rel="noopener" target="_blank" href="https://plv.mpi-sws.org/rustbelt/popl18/">RustBelt</a> is a framework for verifying unsafe code within a precise mathematical model of the Rust language.
It is notable because it can prove general memory-safety theorems about Rust’s type system, even in the presence of libraries that use unsafe code.
However, it does not take <em>advantage</em> of Rust’s type system for the sake of verification, and it doesn’t target developers writing actual Rust code.</p>
<p>Other tools which, like Verus, target developers include <a rel="noopener" target="_blank" href="https://www.pm.inf.ethz.ch/research/prusti.html">Prusti</a>,
<a rel="noopener" target="_blank" href="https://github.com/model-checking/kani">Kani</a>,
<a rel="noopener" target="_blank" href="https://arxiv.org/abs/2206.07185">Aeneas</a>,
and <a rel="noopener" target="_blank" href="https://github.com/xldenis/creusot">Creusot</a>.
Of these, the one most similar to Verus is likely Creusot, which takes advantage of the Rust type system in a similar way to generate simple verification conditions.
Creusot is also notable for its “prophecy encoding” of mutable references, which is more general than Verus’ current mutable reference support.
What distinguishes Verus, by contrast, is its support for these ghost objects and especially their use in concurrency.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Rust’s type system, and similar type systems that enforce unique ownership over data,
are enormously helpful in designing a verification language for low-level code.
Just as Rust guarantees memory safety, thus taking the burden off the developer in the common case,
Verus takes advantage of the same to remove the burden of complex aliasing conditions for verification developers.
More surprisingly, though, we can apply Rust’s type system even for code that initially seems very un-Rust-like, which is common in highly-optimized systems code.
Specifically, by utilizing ghost objects, we recover the ability to use Rust’s ownership system (together with Verus checking the conditions of conditionally safe code) to verify code that the type system could not handle in ordinary Rust.</p>
Provably-Safe Sandboxing with WebAssembly2023-07-25T00:00:00+00:002023-07-25T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/provably-safe-sandboxing-wasm/<blockquote>
<p>What if you could run untrusted code and still be able to sleep at night, safe and sound?</p>
</blockquote>
<p></p>
<p>Disclaimer: our award-winning work <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> can only calm your unsafe-software related fears; we recommend complementing this by additionally checking for monsters under your bed, and leaving a night light on, for any fears of things that go bump in the night.</p>
<figure><a name="fig1"></a><br>
<p><img src="./intra-process-sandboxing.svg" alt="A block diagram, representing intra-process sandboxing. Multiple sandboxes are shown inside a single host process, each of which interact via an API with the runtime. The runtime itself interacts with the kernel via syscalls. Multiple sandboxes can run within a single process, and multiple processes can run on the same OS kernel." /></p>
<figcaption>Figure 1: Intra-process sandboxing</figcaption>
<p><br></figure></p>
<p>Whether you want to include third party libraries in your code, support software plugins, use a smart content delivery network, or just browse the Web, you might need to execute untrusted code, which creates a risk that it will compromise its environment. Intra-process software sandboxing (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#fig1">Figure 1</a>), such as with Software Fault Isolation (SFI), is a useful primitive that allows for safe and lightweight execution of such untrusted code in the same process as its environment. Unfortunately, despite being a well-studied technique with a rich and long history, previous efforts to deploy it in production have failed, due to technical and marketplace hurdles, such as requiring access to original source code, complex binary rewriting, or only being supported by a single vendor.</p>
<p><a rel="noopener" target="_blank" href="https://webassembly.org/">WebAssembly</a> (Wasm) is ideally positioned to provide this crucial primitive and support such applications, since Wasm promises both safety <em>and</em> performance, while serving as a popular compiler target for many high-level languages. As a virtual architecture designed with sandboxing in mind, it has clean, succinct, and well-defined semantics, allowing for safe execution of high-performance code on the Web. However, this same design can also benefit non-Web applications, since the Wasm standard explicitly separates the core Wasm language from the specific API provided to each Wasm module by the runtime or other modules. For example, instead of offering a Web-oriented API, (say) for manipulating the DOM, many runtimes offer the WebAssembly System Interface (WASI) API to run Wasm beyond the Web. All of this has made Wasm an attractive compilation target, and compilers for most popular languages, such as C, C++, Rust, Java, Go, C#, PHP, Python, TypeScript, Zig, and Kotlin, now support it as a target. Thus, a single compiler <em>from</em> Wasm to executable code is sufficient to immediately support sandboxed code execution for all such languages. This makes Wasm an attractive narrow waist to provide high-performance lightweight sandboxing.</p>
<p>However, Wasm’s safety guarantees are only as strong as the implementation that enforces them. While Wasm might seem to immediately provide sandboxing, note that the actual implementation of the compiler from Wasm is a critical part of the trusted computing base (TCB) for the guarantee of sandboxing. In particular, any bug in the compiler could threaten the sandboxing protections, and indeed such bugs have been found in existing runtimes, and would lead to arbitrary code execution by an adversary. For example, using carefully crafted Wasm modules, an attacker could achieve:</p>
<ul>
<li>a memory-out-of-bounds read in Safari/WebKit using a logic bug (CVE-2018-4222),</li>
<li>memory corruption in Chrome/V8 using an integer overflow bug (CVE-2018-6092),</li>
<li>an arbitrary memory read in Chrome/V8 using a parsing bug (CVE-2017-5088),</li>
<li>arbitrary code execution in Safari/WebKit using an integer overflow bug (CVE-2021-30734),</li>
<li>a sandbox escape in both Lucet <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[6]</a> and Wasmtime <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[7]</a> using an optimization bug (CVE-2021-32629),</li>
<li>a memory-out-of-bounds read/write in Wasmtime (CVE-2023-26489),</li>
<li>and many others.</li>
</ul>
<p>A plausible explanation for such disastrous sandbox-compromising bugs, even in code designed with sandboxing as an explicit focus, is that the correct (let alone secure) implementation of high-performance compilers is difficult and remains an active area of research, despite decades of work.</p>
<p><span style="color:rgba(65,120,150,1);font-size:1.3rem;margin:0.5em 1em 0.5em 1em;display:block;">Upon reviewing the design space for executing Wasm code, we identified a crucial gap: Wasm implementations that provide <em>both</em> strong security and high performance. In our work, we thus propose, explore, and implement two distinct techniques, with varying performance and development complexity, which guarantee safe sandboxing using provably-safe compilers.</span> The first draws on traditional formal methods to produce mathematical, machine-checked proofs of safety. The second carefully embeds Wasm semantics in safe Rust code such that the Rust compiler can emit safe executable code with good performance. We describe each of these techniques in the upcoming sections, but additionally refer the interested reader to our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> for further details.</p>
<h2 id="vwasm-a-formally-verified-sandboxing-compiler">vWasm: A Formally Verified Sandboxing Compiler</h2>
<p>The first of our techniques, implemented as an open-source compiler, vWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[2]</a>, achieves provably-safe sandboxing via formal verification. Formal verification of software consists of writing a formal (mathematical) statement of the property we wish to prove about the software, and then writing a formal proof that shows that this statement is true for our software. The proof is machine-checked and thus provides the highest degree of assurance in its correctness. In contrast to techniques such as software testing, fuzzing, and manual reviews, formal verification is able to reason about all execution paths, thereby accounting for any possible input. This means that behaviors like buffer overflows, use-after-frees, etc. are completely ruled out. We describe vWasm’s top-level property, as well as our proof strategy, shortly.</p>
<p>Our choice of verification tool, F* <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[4]</a>, is a general-purpose functional programming language with effects, built for formal verification. Syntactically, it is closest to languages from the ML family (such as OCaml, F#, or SML). It has the full expressive power of dependent types, and has proof automation backed by Z3, an SMT solver. Code written in F* can be extracted to multiple languages, and for vWasm, we use F*’s OCaml extraction. Proofs are written within vWasm as a combination of pre-/post-conditions, extrinsic lemmas, intrinsic dependently-typed values, and layered effects.</p>
<p>vWasm is implemented as a compiler from Wasm to x86-64 (abbreviated as x64 henceforth), but it is designed to keep most of its code and proofs generic with respect to the target architecture. Here, we describe the process of compiling to x64, but the techniques generalize in a straightforward way to other architectures such as ARM. In compiling from Wasm to x64, there are three important conceptual stages: (i) a frontend which compiles Wasm to an architecture-parametric intermediate representation (IR), (ii) a sandboxing pass which acts upon the architecture-parametric IR, and (iii) a printer which outputs the x64 assembly code.</p>
<p>The frontend for the compiler is both untrusted and unverified. This means that one neither needs to trust its correctness for the overall theorem statement to be true, nor does one need to write proofs about it. Note that this is in stark contrast with traditional compiler verification, such as with CompCert <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[5]</a>, where any stage of the compilation must either be trusted or verified. This means that we are free to use any compiler technology for the compiler’s frontend, including arbitrarily complicated optimizations, as long as it outputs code within our architecture-parametric IR. Since compiler optimization is orthogonal to our primary goal, for vWasm’s frontend, we implemented only a simple register allocator and a basic peep-hole optimizer. We leave other optimizations for future work.</p>
<p>On the other end of the compilation pipeline is the x64 assembly printer, which is trusted to be correct. This means it is included in vWasm’s overall TCB, but we note that the printer is largely a straightforward one-to-one translation of our IR to strings, making it fairly simple to audit.</p>
<p>Finally, the sandboxing pass, which lies between the above two, is untrusted but verified to be correct. We define this formally below, but informally, this means that the sandboxing code has been proven (and the proof mechanically checked) to produce safely sandboxed code, given any input. Within the sandboxing pass, all accesses (reads or writes) into the Wasm module’s linear memory, indirect function call table, imports, globals, etc. are proven (sometimes after suitable transformations) to be safe. To prove sandbox safety, we additionally prove that the sandboxing pass also guarantees (a restricted form of) Control-Flow Integrity (CFI) that ensures that any checks performed for sandboxing cannot be bypassed, and thus must be obeyed.</p>
<p>Formally reasoning about the safety of sandboxing requires first defining a machine model, and then defining what sandbox safety is in that model. Our machine model covers the subset of x64 targeted by the compiler. A simplified version of this model can be found in our paper, while the complete model can be found in our open-sourced code. We define the semantics for x64 as small-step semantics, allowing for reasoning about even potentially infinitely running code. Within this machine model, the program state contains an <code>ok</code> field, which is set to the value <code>AllOk</code> if and only if, until that point in execution, nothing invalid has occurred. Crucially, this also means that no accesses outside the memory allocated to the module have occurred. Sandboxing is safe if and only if, informally, starting from any initial <code>AllOk</code> state, executing the sandboxed code for any number of steps leads to an <code>AllOk</code> state.</p>
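<p>As a highly simplified sketch of such a model (the real one is written in F* and covers the targeted subset of x64), hypothetical Rust along these lines captures the <code>ok</code>-flag discipline:</p>

```rust
// Toy machine model in the spirit of vWasm's: the state carries an
// `ok` flag that becomes (and stays) NotOk the moment anything
// invalid occurs; here, an out-of-bounds memory access.
#[derive(Clone, Copy, PartialEq, Debug)]
enum OkState {
    AllOk,
    NotOk,
}

enum Instr {
    Store { addr: usize, val: u8 },
    Nop,
}

struct State {
    mem: Vec<u8>,
    ip: usize,
    ok: OkState,
}

// Small-step semantics: execute a single instruction.
fn step(code: &[Instr], mut s: State) -> State {
    if s.ok != OkState::AllOk || s.ip >= code.len() {
        return s;
    }
    match code[s.ip] {
        Instr::Store { addr, val } => {
            if addr < s.mem.len() {
                s.mem[addr] = val;
            } else {
                s.ok = OkState::NotOk; // escaped the sandbox
            }
        }
        Instr::Nop => {}
    }
    s.ip += 1;
    s
}

// Sandbox safety then says: for *any* number of steps n, starting
// from an AllOk state, the state remains AllOk.
fn eval_steps(n: usize, code: &[Instr], mut s: State) -> State {
    for _ in 0..n {
        s = step(code, s);
    }
    s
}

fn main() {
    let code = vec![
        Instr::Nop,
        Instr::Store { addr: 2, val: 9 },   // in bounds: fine
        Instr::Store { addr: 100, val: 1 }, // out of bounds: trips `ok`
    ];
    let s2 = eval_steps(2, &code, State { mem: vec![0; 4], ip: 0, ok: OkState::AllOk });
    assert_eq!(s2.ok, OkState::AllOk);
    let s3 = eval_steps(3, &code, State { mem: vec![0; 4], ip: 0, ok: OkState::AllOk });
    assert_eq!(s3.ok, OkState::NotOk);
}
```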
<p>Written more formally in F*, but still slightly simplified for easier reading:</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>val sandbox_compile
</span><span> (a:aux) (c:code) (s:erased state): Err code
</span><span> (requires (
</span><span> (s.ok = AllOk) /\
</span><span> (reasonable_size a.sandbox_size s.mem) /\
</span><span> (s.ip `in_code` c) /\ ...))
</span><span> (ensures (fun c' ->
</span><span> forall n. (eval_steps n c' s).ok = AllOk))
</span></code></pre>
<br>
<p>This statement, written as pre- and post-conditions for the sandboxing pass <code>sandbox_compile</code>, shows that any code (<code>c'</code>) output by the sandboxer is formally guaranteed via the machine-checked proof to be safe. The pass takes two arguments <code>a</code> (auxiliary data) and <code>c</code> (the input program), and a computationally-irrelevant argument <code>s</code> (the initial state of the program, which is used for reasoning in our proofs, but that is erased when running the compiler), and returns output code <code>c'</code> under the custom effect <code>Err</code> (which allows the compiler to quit early upon error, for example if it finds a call to a non-existent function). The statement guarantees that as long as the pre-conditions in the requires clause are satisfied, the post-condition in the ensures clause provably holds on the produced output code. The pre-conditions say that the initial state must be safe, have a reasonable sandbox size, and start from a valid location in the code; if these conditions are met, the output code <code>c'</code> will be safe when executed for any number of steps <code>n</code>.</p>
<p>The proofs for this theorem span approximately 3,500 lines of F* code, not including the machine model or any of the supporting framework we built to write this proof. In total, vWasm consists of approximately 15,000 lines of F* code and proofs, and required approximately two person-years of development effort.</p>
<h2 id="rwasm-high-performance-informally-proven-safe-sandboxing">rWasm: High-Performance Informally-Proven-Safe Sandboxing</h2>
<p>Our second technique, implemented as an open-source compiler, rWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[3]</a>, achieves provably-safe sandboxing via a careful embedding of Wasm semantics into safe Rust, such that the Rust compiler can then emit high-performance, safe machine code. This approach provides multiple benefits, such as portability across architectures, performance that is competitive with other unsafe compilers, and the ability to introduce runtime extensions (such as inline reference monitors—IRMs) that can be optimized in-tandem with the executed code.</p>
<p>Our insight for this approach is that the specific property of safe sandboxing is heavily intertwined with memory safety. In particular, code written in a memory-safe language cannot escape the confines of the memory provided to it. Informally, this means that by lifting (potentially unsafe) code to a memory-safe language, and then compiling that lifted code to machine code, the generated machine code must be safely sandboxed, due to the memory safety of the intermediate memory-safe language.</p>
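<p>For intuition, a Wasm store instruction might lift to safe Rust along the following lines (a hypothetical sketch; rWasm’s actual generated code is different): the linear memory becomes a <code>Vec&lt;u8&gt;</code>, and every access is bounds-checked by construction, so the compiled output cannot touch host memory outside the sandbox.</p>

```rust
// Sketch: Wasm linear memory lifted to safe Rust. The memory safety
// of the intermediate Rust gives sandboxing of the output "for free".
struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    fn new(pages: usize) -> LinearMemory {
        // A Wasm page is 64 KiB.
        LinearMemory { bytes: vec![0; pages * 65536] }
    }

    // A Wasm `i32.store8` lifts to a checked write: an out-of-bounds
    // address traps instead of corrupting the host process.
    fn store8(&mut self, addr: u32, val: u8) -> Result<(), &'static str> {
        match self.bytes.get_mut(addr as usize) {
            Some(slot) => {
                *slot = val;
                Ok(())
            }
            None => Err("trap: out-of-bounds memory access"),
        }
    }

    fn load8(&self, addr: u32) -> Result<u8, &'static str> {
        self.bytes
            .get(addr as usize)
            .copied()
            .ok_or("trap: out-of-bounds memory access")
    }
}

fn main() {
    let mut mem = LinearMemory::new(1); // one 64 KiB page
    mem.store8(10, 7).unwrap();
    assert_eq!(mem.load8(10).unwrap(), 7);
    // A wild access traps rather than touching host memory.
    assert!(mem.store8(70_000, 1).is_err());
}
```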
<p>While other memory-safe languages would also suffice to obtain safe sandboxing, we pick Rust as our memory-safe language of choice for rWasm, since it is a non-garbage-collected systems-oriented language, which allows us to obtain predictable performance. While Rust <em>does</em> have a non-memory-safe escape hatch via the <code>unsafe</code> keyword (since certain scenarios, such as writing an operating system, might need more control than directly allowed by the language), as long as this keyword is not used (ensured by the declaration <code>#![forbid(unsafe_code)]</code>), Rust guarantees memory safety. Given the prevalence of Rust in industry, and how seriously the Rust team takes unsoundness bugs, safe Rust is thus battle-tested to be memory safe, even if not (yet) proven to be so. Early efforts towards formalization of Rust and its security guarantees have already begun, such as with the RustBelt and Oxide projects.</p>
<p>We implement all stages of rWasm in safe Rust, but note that none of it needs to be trusted or verified. This means we do not need to depend upon the safety or correctness of any part of rWasm for the safety of the produced executable machine code. Instead, the safety of the produced code simply comes from the lack of any <code>unsafe</code> in the generated Rust code (and that unsafe-free Rust guarantees memory safety, as mentioned before). Contrast this with say, wasm2c, which requires either trusting (in addition to the C compiler itself) the wasm2c compiler, or its generated C code, since C does not guarantee memory safety.</p>
<p>Astute readers will note that sandbox safety in any type-safe language also depends on the language’s runtime libraries. Fortunately, rWasm imports nothing, uses only allocation-related features (for <code>Vec</code>), and even eliminates dependency on the Rust standard library via the <code>#![no_std]</code> directive. As with any sandbox, care is required when exposing an API to sandboxed code (e.g., to avoid APIs enabling sandbox bypasses directly or via confused deputies), but such concerns are orthogonal to sandbox construction.</p>
<h2 id="evaluation">Evaluation</h2>
<p>How do vWasm and rWasm perform in practice? We measure both techniques on a collection of quantitative and qualitative metrics, and while more details can be found in our full paper, we show some selected results here.</p>
<figure><a name="fig2"></a><br>
<p><img src="./execution-time.svg" alt="A graph, plotting normalized slowdown (on a log scale) on the y-axis against the Wasm runtimes on the x-axis. A summary of the graph is in the upcoming text." /></p>
<figcaption>Figure 2: Mean execution time of PolyBench-C benchmarks across the Wasm runtimes, normalized to pure native execution. Interpreters have square brackets; just-in-time (JIT) compilers have braces; the rest are ahead-of-time (AOT) compilers. vWasm* disables sandboxing.</figcaption>
<p><br></figure></p>
<p>Run-time performance is critical for practical adoption in most applications. Hence, we benchmark our compilers and various baselines using the PolyBench-C benchmark suite, which consists of thirty programs and has been a standard benchmark suite for Wasm since its inception. <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#fig2">Figure 2</a> summarizes our results, showing the normalized execution time of the benchmarks on the Wasm runtimes. Each point in the chart is the ratio of the mean time taken to execute the benchmark with the particular runtime vs. the mean time taken to execute by compiling the C code directly to non-sandboxed x64, skipping Wasm entirely.</p>
<p>The results indicate that, unsurprisingly, compiled code strictly outperforms interpreted code for run-time performance. <span style="color:rgba(65,120,150,1);font-size:1.3rem;margin:0.5em 1em 0.5em 1em;display:block;">With respect to our compilers, we see that vWasm consistently outperforms the interpreters on all benchmarks, and that rWasm is competitive even with the compilers which are optimized for speed, and not necessarily safety.</span> We note that the relative performance amongst the compilers can vary drastically based upon the workload (for example, on some of the longer-running programs in the benchmark suite, rWasm is more than twice as fast as WAVM <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[8]</a>, which itself is twice as fast as rWasm on other benchmarks). Looking at vWasm and vWasm* (which is vWasm but with the sandboxing pass disabled), we find that the run time is marginally affected (by only 0.2%), indicating that almost all of the slowdown for vWasm, compared to other compilers, is due to the unverified portion of the compiler, which can be improved without needing to write any new proofs or even impacting existing proofs.</p>
<p>Next, we quantify the development effort needed to implement both vWasm and rWasm. The former took approximately two person-years to develop, including both code and proofs, while the latter took one person-month. This stark contrast is a testament to the daunting amount of work formal verification requires, even with modern, automated tools like F*. It also illustrates the significant benefit of rWasm’s carefully leveraging Rust’s investment in safety.</p>
<p>Finally, provable safety is an important property of a verified sandboxing compiler, but one might wish to prove other properties, such as traditional compiler correctness. Here, vWasm has the upper hand, as this is feasible to do in F*, and we have even structured the compiler to make such proofs possible. In contrast, proving correctness for rWasm would be a challenging task, since one would need to formally model the Rust language, show that rWasm preserves Wasm semantics in compiling to Rust, and then implement a semantics-preserving Rust compiler (or prove <code>rustc</code> as semantics-preserving). The nature of the provable sandboxing property is what puts it into the sweet spot where we obtain it “for free” when compiling to Rust, and we believe there may be other such properties where one can obtain provable guarantees in a similar fashion. However, all these properties are a strict subset of what might be proven for an implementation like vWasm, which is built in a full-blown verification-oriented language.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this work, we have explored two concrete points in the design space for implementing a sandboxing execution environment, with a focus on WebAssembly. We proposed designs for these two points, implemented them as open-source tools, vWasm and rWasm, and evaluated them on a collection of both quantitative and qualitative metrics. We show that run-time performance and provable safety are not in conflict, and indeed rWasm is the first Wasm runtime that is both provably-sandboxed and fast.</p>
<p>We refer the interested reader to our paper <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[1]</a> and to our open-source tools vWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[2]</a> and rWasm <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/provably-safe-sandboxing-wasm/#references">[3]</a>.</p>
<hr />
<p>A version of this blogpost was previously posted as an <a rel="noopener" target="_blank" href="https://www.usenix.org/publications/loginonline/provably-safe-multilingual-software-sandboxing-using-webassembly">article in USENIX ;login:</a>.</p>
<hr />
<p><a name="references"></a>
<small>
[1] Provably-Safe Multilingual Software Sandboxing using WebAssembly. Jay Bosamiya, Wen Shih Lim, and Bryan Parno. In Proceedings of the USENIX Security Symposium, August, 2022. Distinguished Paper Award <em>and</em> Internet Defense Prize. <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya">https://www.usenix.org/conference/usenixsecurity22/presentation/bosamiya</a><br>
[2] vWasm: A formally-verified provably-safe sandboxing Wasm-to-native compiler. <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/vWasm/">https://github.com/secure-foundations/vWasm/</a><br>
[3] rWasm: A cross-platform high-performance provably-safe sandboxing Wasm-to-native compiler. <a rel="noopener" target="_blank" href="https://github.com/secure-foundations/rWasm/">https://github.com/secure-foundations/rWasm/</a><br>
[4] F*: A Proof-Oriented Programming Language. <a rel="noopener" target="_blank" href="https://fstar-lang.org/">https://fstar-lang.org/</a><br>
[5] Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. CompCert - a formally verified optimizing compiler. In Embedded Real Time Software and Systems (ERTS). SEE, 2016.<br>
[6] Announcing Lucet: Fastly’s native WebAssembly compiler and runtime. <a rel="noopener" target="_blank" href="https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime">https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime</a>, March 2019.<br>
[7] Wasmtime: A small and efficient runtime for WebAssembly & WASI. <a rel="noopener" target="_blank" href="https://wasmtime.dev/">https://wasmtime.dev/</a><br>
[8] WAVM: WebAssembly virtual machine. <a rel="noopener" target="_blank" href="https://wavm.github.io/">https://wavm.github.io/</a><br>
</small></p>
Code Conversion in Distributed Storage Systems2023-07-19T00:00:00+00:002023-07-19T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/convertible-codes/<p><a name="fig-intro"></a>
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/conversion_intro.png" alt="Code conversion" />
<em>Figure 1: diagram showing the code conversion process in a distributed storage system.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Today’s society is data-driven, and many of the applications that society relies on require storing ever-increasing amounts of data.
To this end, distributed storage systems, such as cloud storage systems, have become the foundation of data infrastructure.
These large-scale systems are typically run on massive clusters which have thousands to millions of disks, and store amounts of data on the scale of exabytes (\(10^{18}\) bytes).
At this large scale, failures become a common occurrence.
Given the fundamental role that distributed storage systems play in supporting other applications, they must guarantee high levels of reliability despite these failures.
One common way to ensure reliability is through replication.
However, duplicating (or triplicating) the amount of space used, as replication requires, is prohibitively expensive.
Instead, most current large-scale storage systems primarily employ <em>erasure codes</em>.
An erasure code encodes data in a way that makes it resilient against failures with lower overhead than replication.</p>
<p>The level of fault tolerance and the storage space consumed by an erasure code are determined by its parameters.
For example, the popular Reed-Solomon codes (and other traditional maximum-distance separable codes) have two main parameters: code length (\(n\)) and dimension (\(k\)).
These parameters are typically set based on the failure rate of storage devices, the required degree of reliability, and some additional requirements on system performance and storage overhead.</p>
<p>In practice, there are multiple reasons which necessitate changing the parameters of an erasure code for <em>already-encoded data</em>.
The process of transforming the data from the old encoding to the new encoding is known as <em>code conversion</em> (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#fig-intro">figure 1</a>).
One of the reasons for doing code conversions is <em>disk-adaptive redundancy</em> (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2019cluster">Kadekodi et al. 2019</a>):
it has been shown that the failure rate of disks can vary drastically across make/models and over time (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#schroeder2007disk">Schroeder and Gibson, 2007</a>), and that significant savings in storage space (and hence operating costs) can be achieved by tuning the code parameters to the observed failure rates.
For example, on production cluster traces from Google, disk-adaptive redundancy can lead to up to 25% space savings (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2020pacemaker">Kadekodi et al. 2020</a>).
Due to the large scale, this translates to savings of millions of dollars and significant reductions in the carbon footprint.
Another reason for code conversion is changes in the popularity of data.
More popular data is typically encoded with higher redundancy (to support faster reads) and less popular data is encoded with less redundancy (for storage efficiency).
Thus, as popularity of the data changes, it is beneficial to change the parameters of the code used to encode the data (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#xia2015tale">Xia et al. 2015</a>).
In this case, data needs to be redistributed to make use of the new disks’ I/O throughput.</p>
<p>The default approach to code conversion is to read all data, re-encode it, and write it back.
This approach requires a large amount of disk I/O access and bandwidth which, due to inherent physical limitations of hard disk drives, are very scarce resources.
The following figure (from <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#kadekodi2020pacemaker">Kadekodi et al. 2020</a>) shows the fraction of the total cluster I/O used by code conversion in a simulation using real-world cluster traces.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/heart_mono_front.png" alt="Conversion IO in simulated cluster" />
<em>Figure 2: Trace-based simulation of a cluster using the default approach to code conversion. X-axis represents calendar date, left Y-axis represents the total fraction of the IO used by conversion, right Y-axis shows the size of the cluster in terms of the number of disks.</em></p>
<p>Observe that code conversion can use up to 100% of the cluster IO during significant periods of time.
Therefore, the default approach to code conversion can easily overwhelm a storage system, interfering with other important (foreground) tasks such as serving user requests.
In this post, we summarize our work on <strong>convertible codes</strong>, which are erasure codes that can be converted more efficiently than the default approach.
So far the information theory community has extensively studied various aspects of storage codes such as rate, update cost, repair bandwidth, and repair locality.
The conversion problem opens up a new dimension to optimize for when designing storage codes.
There are several open problems in this new design space, with a high potential for real-world impact.</p>
<p>We start by providing some <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#background">background</a> about storage systems, erasure codes, and the way erasure codes are used in storage systems.
Then, we introduce and formally define <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#conversion">code conversion and convertible codes</a>.
Afterwards, we provide a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#access-opt">summary of our results</a> and showcase some examples that show how convertible codes can reduce conversion cost.
Finally, we conclude with some <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#conclusion">open problems</a>.</p>
<h2 id="background">Background on storage systems</h2>
<p>Many modern applications require storing large amounts of data; amounts which far exceed the capacity of a single disk drive.
In such cases, data needs to be distributed across many disks, attached to many different machines.
One immediate problem that emerges in this scenario is that, as the number of components in the system increases, the probability that at least one component fails becomes very high.
Because distributed systems need to keep running despite these failures, reliability is an essential part of their design.</p>
<p>The simplest way of adding reliability to a distributed storage system is to use <em>replication</em>: each piece of data has multiple copies each stored in a different disk, so that if any disk fails, no data is permanently lost.
However, replication significantly increases the total amount of space used by the system, which makes it very expensive to use in large-scale systems.
For example, three-replication (which tolerates up to two failures) is used in some systems, and it uses 200% additional storage.
Storage cost is normally measured as the ratio of the total space used to the size of the original data, and is called <em>storage overhead</em>.
So three-replication incurs a storage overhead of 3.</p>
<p>Given the high cost of replication, nowadays most storage systems use <em>erasure coding</em> instead, which can offer the same (or even higher) reliability guarantees while using much less storage space.
For example, an \([5, 3]\) MDS code (explained in detail in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#erasure-codes">Background on erasure codes</a>) can tolerate up to two failures (same as three-replication) and has a storage overhead of \(\frac{5}{3} = 1.66\), i.e., only 66.6% extra storage.</p>
<p>Storage overhead is one of the main cost metrics for distributed storage systems.
This is because the top costs of running a system come from buying all the necessary hardware and operating it: providing infrastructure, power, cooling, networking, and so on.
Such is the scale of these systems, that even a single-digit percentage reduction in storage overhead is significant.</p>
<p>There are many other costs and metrics apart from storage overhead.
Among them, disk I/O resources come first, because they are important for the performance of the system.
HDDs offer relatively low I/O bandwidth compared to their total storage capacity, so disk I/O is often the bottleneck of the system’s throughput.
Due to the mechanical overheads involved in moving the read head to the right place within a disk, the number of I/O operations (called accesses) is also an important metric.
Similarly, the amount of network I/O operations, CPU, and memory usage are also important.</p>
<h3 id="design">Distributed storage system design</h3>
<p>One of the most well-known types of distributed storage systems is <abbr title="Redundant Array of Inexpensive Disks">RAID</abbr>.
A RAID system typically consists of an array of \(n\) disks with the same capacity attached to a single machine.
Data is encoded with an \([n, k]\) MDS (maximum distance separable) code and for each codeword, each of the \(n\) codeword symbols is placed on a different disk.</p>
<p>Modern distributed storage systems need to scale past a single machine, and thus have a different design from RAID.
An example of such a system is <a rel="noopener" target="_blank" href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a>, which also supports erasure coding.
These systems manage a large number of nodes and disks: sometimes thousands or tens of thousands of them.
As in RAID, data is encoded with an \([n, k]\) MDS code, and each codeword symbol is placed on a different disk, but \(n\) is much smaller than the number of disks (typically \(n \leq 50\)).
The disks where a codeword is placed are chosen by a semi-random placement algorithm, which tries to avoid choosing disks that might fail at the same time (e.g., by choosing disks in different racks).</p>
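A rack-aware placement policy like the one described above can be sketched in a few lines. The following Python sketch is purely illustrative: the rack/disk layout and names are assumptions, not drawn from HDFS or any specific system.

```python
# Hedged sketch of rack-aware semi-random placement: pick n distinct racks
# at random, then one disk within each, so that a whole-rack failure erases
# at most one symbol of any codeword.
import random

def place_codeword(racks, n):
    # racks: dict mapping rack id -> list of disk ids on that rack.
    chosen = random.sample(sorted(racks), n)
    return [random.choice(racks[r]) for r in chosen]

# Illustrative cluster: 8 racks with 12 disks each.
racks = {f"rack{r:02d}": [f"rack{r:02d}/disk{d}" for d in range(12)]
         for r in range(8)}

placement = place_codeword(racks, 5)
assert len({disk.split("/")[0] for disk in placement}) == 5  # distinct racks
```

Real systems add further constraints (disk fullness, network topology, failure-domain hierarchies), but the core idea is the same: spread each codeword across independent failure domains.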
<h3 id="erasure-codes">Background on erasure codes</h3>
<p>While many types of erasure codes exist, in this post we will focus specifically on <em>linear</em> erasure codes with the <em>MDS</em> property, which we explain in the following.
An \([n, k]\) MDS (maximum-distance separable) erasure code takes \(k\) symbols of data, and encodes them into \(n\) code symbols with the property that any \(k\) out of the \(n\) code symbols can recover the original \(k\) data symbols.
Symbols are elements from a specific finite field (denoted \(\mathbb{F}\)).
Many practical applications use the finite field \(\mathrm{GF}(2^8)\), where each symbol is represented as a single byte.
Let \(r \coloneqq n - k\), and let \([i] = \{1,2,\ldots,i\}\).
Mathematically, an \([n, k]\) erasure code over \(\mathbb{F}\) is a function \(\mathcal{C}: \mathbb{F}^{k} \to \mathbb{F}^{n}\).
Elements in the image of \(\mathcal{C}\) are called <em>codewords</em>.</p>
<p>A linear code can be described by the mapping \(\mathbf{m} \to \mathbf{m} \mathbf{G}\), where \(\mathbf{G}\) is the \(k \times n\) <em>generator matrix</em>.
A code is MDS iff for any codeword it is possible to recover the original data after erasing any \(r\) arbitrary symbols.
For linear codes, this is equivalent to the property that the \(k \times k\) matrix formed by the columns corresponding to any \(k\) code symbols is invertible.
In practice, <em>systematic</em> codes are often used, which permit reading data without decoding.
A linear code is said to be <em>systematic</em> if its generator matrix contains a \(k \times k\) identity matrix as a submatrix; for such codes, we refer to the \(k \times r\) submatrix defined by the remaining columns as the <em>parity matrix</em> \(\mathbf{P}\).</p>
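The definitions above can be made concrete with a small example. The sketch below implements a systematic \([5,3]\) MDS code over \(\mathrm{GF}(17)\), matching the storage-overhead example earlier in the post; the specific parity matrix (a Vandermonde-style matrix with evaluation points 1 and 2) is an illustrative choice of ours, not taken from any particular system.

```python
# Hedged sketch: a systematic [5, 3] linear MDS code over GF(17)
# (integers mod 17). Any 3 of the 5 symbols suffice to recover the data.
MOD = 17
K, N = 3, 5
# k x r Vandermonde-style parity matrix, entry (i, j) = theta_j ** i.
PARITY = [[pow(t, i, MOD) for t in (1, 2)] for i in range(K)]

def encode(data):
    # Systematic encoding: data symbols appear verbatim, followed by
    # r = n - k parity symbols computed as data * PARITY.
    par = [sum(d * PARITY[i][j] for i, d in enumerate(data)) % MOD
           for j in range(N - K)]
    return data + par

def decode(codeword, present):
    # Recover the data from any K surviving positions (`present`) by
    # solving a K x K linear system mod 17 with Gauss-Jordan elimination.
    gen = [[1 if c == i else 0 for c in range(K)] + PARITY[i]
           for i in range(K)]                        # generator [I | P]
    M = [[gen[i][c] for i in range(K)] + [codeword[c]] for c in present]
    for col in range(K):
        piv = next(r for r in range(col, K) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], MOD - 2, MOD)         # inverse via Fermat
        M[col] = [v * inv % MOD for v in M[col]]
        for r in range(K):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(M[r][c] - f * M[col][c]) % MOD for c in range(K + 1)]
    return [M[r][K] for r in range(K)]

codeword = encode([7, 11, 2])
assert decode(codeword, [0, 3, 4]) == [7, 11, 2]  # survives two erasures
```

Because the code is systematic, reading intact data requires no decoding at all; the linear-system solve is only needed after erasures.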
<h2 id="conversion">The code conversion problem</h2>
<p>The problem of changing data encoded under an initial code \(\mathcal{C}^I\) to its corresponding encoding under a final code \(\mathcal{C}^F\) is called <em>code conversion</em>.
In this section, we describe <em>convertible codes</em>, which are capable of efficiently changing the erasure code parameters from \([n^I, k^I]\) to \([n^F, k^F]\).
Let us start by showing an example of convertible codes in action.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/convertible_codes_example.png" alt="Code conversion" />
<em>Figure 2: Example of code conversion from a \([7,4]\) MDS code to a \([11,8]\) MDS code.</em></p>
<blockquote>
<p><a name="ex-merge"></a>
<strong>Example 1.</strong> Consider conversion from \([n^I = 7, k^I = 4]\) to \([n^F = 11, k^F = 8]\).
In this example, both codes are systematic (i.e. the data symbols are contained in the codewords), and each box represents a symbol.
Two codewords from the \([7,4]\) code are combined to obtain a single codeword from the \([11,8]\) code.
The first observation we make is that since both codes are systematic, we can simply keep the data symbols where they are (i.e., unchanged) through the conversion (dashed arrows).
Thus, in this case, we only need to define how the new parities are computed.</p>
<p>The default approach to conversion would be to read all of the data symbols \((a_1,\ldots,a_8)\), and use those to compute the new parities \(q_1, q_2, q_3\).
However, it is possible to reduce that cost.
Let the field \(\mathbb{F}\) be the integers modulo 17.
When we define the code, we want to ensure two properties: (1) the initial and final codes are MDS, and (2) the new parities can be computed efficiently.
To ensure this, we build the code from a <em>Vandermonde matrix</em>.
In a Vandermonde matrix, each column is determined by an <em>evaluation point</em> (an element from the field), and row \(i\) corresponds to the evaluation point raised to the power of \(i - 1\).
We can carefully choose the evaluation points to ensure the MDS property holds (it does not suffice to just choose distinct points).
Choosing evaluation points \((\theta_1 = 1, \theta_2 = 2, \theta_3 = 6)\) we have:
\[
\mathbf{P}^F =
\begin{bmatrix}
1 & 1 & 1 \\
\theta_1 & \theta_2 & \theta_3 \\
\theta_1^2 & \theta_2^2 & \theta_3^2 \\
\theta_1^3 & \theta_2^3 & \theta_3^3 \\
\theta_1^4 & \theta_2^4 & \theta_3^4 \\
\theta_1^5 & \theta_2^5 & \theta_3^5 \\
\theta_1^6 & \theta_2^6 & \theta_3^6 \\
\theta_1^7 & \theta_2^7 & \theta_3^7 \\
\end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 6 \\
1 & 4 & 2 \\
1 & 8 & 12 \\
1 & 16 & 4 \\
1 & 15 & 7 \\
1 & 13 & 8 \\
1 & 9 & 14 \\
\end{bmatrix}.
\]
Let \(\mathbf{P}^I\) denote the matrix defined by the first 4 rows of \(\mathbf{P}^F\).
The parities for the initial code are computed using \(\mathbf{P}^I\), and the parities of the final code are computed using \(\mathbf{P}^F\), i.e., the \((p,p^{\prime},q)\) elements in Figure 2 are defined as:
\[
(p_1,p_2,p_3) = (a_1,\ldots,a_4) \mathbf{P}^I \\
(p^{\prime}_1,p^{\prime}_2,p^{\prime}_3) = (a_5,\ldots,a_8) \mathbf{P}^I \\
(q_1,q_2,q_3) = (a_1,\ldots,a_8) \mathbf{P}^F.
\]
It is straightforward (although tedious) to check that the codes defined with these matrices have the MDS property.
During conversion, instead of reading all data we can simply compute the new parities from the old ones by scaling them by the appropriate powers of the chosen evaluation points:
\[
(a_1,\ldots,a_4)\mathbf{P}^I +
(a_5,\ldots,a_8)\mathbf{P}^I
\begin{bmatrix}
\theta_1^4 & 0 & 0 \\
0 & \theta_2^4 & 0 \\
0 & 0 & \theta_3^4 \\
\end{bmatrix} = (a_1,\ldots,a_8)\mathbf{P}^F.
\]
Notice that this is possible due to the Vandermonde structure of the matrices, which allows us to turn \(\mathbf{P}^I\) into the bottom half of \(\mathbf{P}^F\) by scaling each column.
This allows us to compute the final parities by using the existing initial parities, without the need to read the data.</p>
<p>By doing this, we can achieve code conversion by reading (and transferring) just 6 symbols in total.
In comparison, the default approach of read-reencode-write would require reading (and transferring) 8 symbols (i.e., all the original data symbols).</p>
</blockquote>
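The arithmetic in Example 1 can be checked mechanically. The Python sketch below takes the field, evaluation points, and parity matrices directly from the example (the variable names are ours) and verifies that scaling the old parities reproduces exactly what re-encoding from scratch would produce.

```python
# Verify Example 1's merge conversion over GF(17), i.e. integers mod 17.
MOD = 17
THETAS = [1, 2, 6]  # evaluation points from the example

def parity_matrix(rows):
    # Vandermonde parity matrix: row i holds theta_j ** i for each point.
    return [[pow(t, i, MOD) for t in THETAS] for i in range(rows)]

def parities(data, pm):
    # Row vector `data` times matrix `pm`, mod 17.
    return [sum(d * pm[i][j] for i, d in enumerate(data)) % MOD
            for j in range(len(THETAS))]

P_I = parity_matrix(4)  # initial [7, 4] code
P_F = parity_matrix(8)  # final [11, 8] code

data = [3, 14, 1, 5, 9, 2, 6, 10]    # a_1, ..., a_8 (arbitrary message)
p = parities(data[:4], P_I)           # parities of the first initial codeword
pp = parities(data[4:], P_I)          # parities of the second initial codeword

# Conversion: scale the second codeword's parities by theta_j^4 and add,
# instead of re-reading all eight data symbols.
q_converted = [(p[j] + pow(THETAS[j], 4, MOD) * pp[j]) % MOD for j in range(3)]
q_reencoded = parities(data, P_F)     # default read-reencode-write result
assert q_converted == q_reencoded
```

The identity holds for every message, since column \(j\) of \(\mathbf{P}^I\) scaled by \(\theta_j^4\) is exactly the bottom half of column \(j\) of \(\mathbf{P}^F\).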
<h3 id="framework">The convertible codes framework</h3>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/framework_example.png" alt="Diagram of code conversion" />
<em>Figure 3: Abstract depiction of a conversion from an \([n^I,k^I]\) MDS code to a \([n^F,k^F]\) MDS code.</em>
<em>Each box represents a symbol, and the boxes are grouped into codewords.</em>
<em>The top row represents initial codewords and the bottom row represents final codewords.</em>
<em>Some symbols are kept unchanged, and reused in the final codewords (denoted with dashed arrows).</em>
<em>The converter (the node labeled “c”) reads data from some symbols in the initial codewords, and computes the new symbols in the final codewords (denoted with solid arrows).</em></p>
<p>Convertible codes focus on conversions where \(\mathcal{C}^I\) is an \([n^I,k^I]\) code and \(\mathcal{C}^F\) is an \([n^F,k^F]\) code.
In this post, we focus on the case where \(\mathcal{C}^I\) and \(\mathcal{C}^F\) are linear and MDS.
To achieve the change in code dimension from \(k^I\) to \(k^F\) the conversion procedure needs to consider multiple codewords at a time.
Let \(\lambda^I\) be the number of codewords of \(\mathcal{C}^I\) taken as input, and let \(\lambda^F\) be the number of codewords of \(\mathcal{C}^F\) produced as output.
To preserve the amount of data, we must have \(\lambda^I k^I = \lambda^F k^F\).
In particular, we define \(\lambda^I\) and \(\lambda^F\) as the smallest possible integers that satisfy the above equation, i.e.:
\[
\lambda^I \coloneqq \frac{\mathrm{lcm}(k^I,k^F)}{k^I}
\text{ and }
\lambda^F \coloneqq \frac{\mathrm{lcm}(k^I,k^F)}{k^F}.
\]
For example, in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">Example 1</a> we have \(k^I = 4\) and \(k^F = 8\), which means that we consider a total of \(\mathrm{lcm}(k^I,k^F) = 8\) data symbols in total, which at the beginning form \(\lambda^I = 2\) codewords, and at the end form \(\lambda^F = 1\) codeword.</p>
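These \(\lambda\) values are straightforward to compute. A small helper (ours, assuming Python 3.9+ for `math.lcm`):

```python
from math import lcm

def codeword_counts(k_initial, k_final):
    # Smallest integers lambda_i, lambda_f with
    # lambda_i * k_initial == lambda_f * k_final == lcm(k_initial, k_final).
    m = lcm(k_initial, k_final)
    return m // k_initial, m // k_final

assert codeword_counts(4, 8) == (2, 1)    # Example 1: two codewords become one
assert codeword_counts(5, 12) == (12, 5)  # k^I = 5, k^F = 12: lcm is 60
```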
<p>Since multiple codewords are being converted, we also need to specify how data is distributed across different codewords.
This is specified through an <em>initial partition</em> \(\mathcal{P}^I\) and <em>final partition</em> \(\mathcal{P}^F\) of the set \([\mathrm{lcm}(k^I,k^F)]\), which indicate the \(k^I\) data symbols encoded by each initial codeword, and \(k^F\) data symbols encoded by each final codeword.
Let \(\mathbf{m} \in \mathbb{F}^{\mathrm{lcm}(k^I,k^F)}\) be the data to be stored, let \(P \subseteq [\mathrm{lcm}(k^I,k^F)]\) be a subset of indexes, and let \(\mathbf{m}_{P} \in \mathbb{F}^{|P|}\) be the entries of \(\mathbf{m}\) indexed by \(P\).
Then, the set of <em>initial codewords</em> is \(\{\mathcal{C}^I(\mathbf{m}_P) \mid P \in \mathcal{P}^I\}\), and the set of <em>final codewords</em> is \(\{\mathcal{C}^F(\mathbf{m}_P) \mid P \in \mathcal{P}^F\}\).
In the case of <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">Example 1</a>, the initial partition is \(\mathcal{P}^I = \{\{1,2,3,4\},\{5,6,7,8\}\}\), and the final partition is \(\mathcal{P}^F = \{\{1,2,3,4,5,6,7,8\}\}\), and thus the initial codewords are \(\{\mathcal{C}^I(a_1,\ldots,a_4), \mathcal{C}^I(a_5,\ldots,a_8)\}\) and the final codeword is \(\mathcal{C}^F(a_1,\ldots,a_8)\).</p>
<p>The <em>conversion procedure</em> takes the initial codewords as input, and outputs the final codewords.
Formally, a convertible code is defined as follows.</p>
<blockquote>
<p><strong>Definition (Convertible code).</strong>
A convertible code over \(\mathbb{F}\) is defined by:</p>
<ol>
<li>a pair of codes \((\mathcal{C}^I, \mathcal{C}^F)\) where \(\mathcal{C}^I\) is an \([n^I, k^I]\) code over \(\mathbb{F}\) and \(\mathcal{C}^F\) is an \([n^F, k^F]\) code over \(\mathbb{F}\);</li>
<li>a pair of partitions \(\mathcal{P}^I, \mathcal{P}^F\) of \([\mathrm{lcm}(k^I, k^F)]\) such that each subset in \(\mathcal{P}^I\) is of size \(k^I\) and each subset in \(\mathcal{P}^F\) is of size \(k^F\); and</li>
<li>a conversion procedure that on input \(\{\mathcal{C}^I(\mathbf{m}_P) \mid P \in \mathcal{P}^I\}\) outputs \(\{\mathcal{C}^F(\mathbf{m}_P) \mid P \in \mathcal{P}^F\}\), for any \(\mathbf{m} \in \mathbb{F}^{\mathrm{lcm}(k^I,k^F)}\).</li>
</ol>
</blockquote>
<h3 id="conversion-procedure">Conversion procedure</h3>
<p>The objective of the conversion procedure is to convert the initial codewords into the final codewords efficiently.
This is modeled with a <em>converter</em> which reads data from some symbols in the initial codewords, and computes new symbols in the final codewords.
As seen in the figure above, not all symbols in the final codewords need to be new; some symbols can be kept unchanged from the initial codewords, which incurs no cost.
Since our objective is to minimize cost, we will focus only on the so-called <em>stable</em> convertible codes, which have \(k^F\) unchanged symbols in each final codeword (which was proven in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi, 2022a</a> to be the maximum possible).</p>
<p>To decide whether a conversion procedure is efficient, we need to measure its cost.
Since each final codeword has exactly \(r^F\) new symbols, the cost of writing the new symbols is fixed, regardless of the conversion procedure.
Therefore, we will focus only on the read costs of conversion.
Two types of cost have been considered in the literature.</p>
<blockquote>
<p><strong>Definition (Access cost).</strong>
The total number of symbols read by the converter.</p>
</blockquote>
<p></p>
<blockquote>
<p><strong>Definition (Conversion bandwidth).</strong>
The total size of the data read by the converter (note that the converter may read only part of a symbol).</p>
</blockquote>
<p></p>
<p>In this post, we will focus only on <em>access cost</em>.</p>
<h3 id="conversion-regimes">Conversion regimes</h3>
<p>To facilitate the study of convertible codes, two special subcases have been identified in the literature.</p>
<blockquote>
<p><strong>Definition (Merge regime).</strong>
Code conversions which merge multiple codewords into a single one, i.e., \(\lambda^I \geq 2\), \(\lambda^F = 1\), and \(k^F = \lambda^I k^I\), with arbitrary \(n^I\) and \(n^F\).</p>
</blockquote>
<p></p>
<blockquote>
<p><strong>Definition (Split regime).</strong>
Code conversions which split a single codeword into multiple ones, i.e., \(\lambda^I = 1\), \(\lambda^F \geq 2\), and \(k^I = \lambda^F k^F\), with arbitrary \(n^I\) and \(n^F\).</p>
</blockquote>
<p></p>
<p>The case where all parameters \((n^I,k^I,n^F,k^F)\) are arbitrary is referred to as the <em>general regime</em>.</p>
<p>The benefit of studying the merge regime and the split regime separately is that in these two subcases one need not worry about defining the partitions \(\mathcal{P}^I\) and \(\mathcal{P}^F\), which simplifies the analysis.
This is because in these two subcases all data gets mapped to the same codeword (either in the initial or final configuration).
Thus, all partitions are equivalent by just relabeling the symbols.</p>
<h2 id="access-opt">Minimizing the access cost of conversion</h2>
<p>The following table shows the known lower bounds for the access cost of conversion.
In this section, we will describe the constructions that achieve each of the non-trivial bounds.</p>
<p><a name="table-access-lb"></a>
<strong>Table.</strong> <em>Summary of known lower bounds on access cost (assuming \(r^F = n^F - k^F \leq \min\{k^I, k^F\}\)).</em></p>
<table><thead><tr><th>Regime</th><th>\(r^I < r^F\)</th><th>\(r^I \geq r^F\)</th></tr></thead><tbody>
<tr><td>Merge</td><td>\( \lambda^I k^I \) <sup class="footnote-reference">(1)</sup></td><td>\( \lambda^I r^F \) <sup class="footnote-reference">(1)</sup></td></tr>
<tr><td>Split</td><td>\( \lambda^F k^F \) <sup class="footnote-reference">(2)</sup></td><td>\( (\lambda^F - 1) k^F + r^F \) <sup class="footnote-reference">(2)</sup></td></tr>
<tr><td>\( k^I = k^F \)</td><td>\( k^I \)</td><td>0</td></tr>
<tr><td>\( k^I \neq k^F\)</td><td>\(\mathrm{lcm}(k^I,k^F)\) <sup class="footnote-reference">(2)</sup></td><td>\(\lambda^I r^F + (\lambda^I \bmod \lambda^F) (k^I - \max\{k^F \bmod k^I, r^F\})\) <sup class="footnote-reference">(2)</sup></td></tr>
</tbody></table>
<p><em><sup class="footnote-reference">(1)</sup>: <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi, 2022a</a>.</em>
<em><sup class="footnote-reference">(2)</sup>: <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2020access">Maturana et al., 2020</a>.</em></p>
<h3 id="merge-regime">Merge regime</h3>
<p>Recall that, in this case, \(\lambda^I\) codewords are merged into a single one.
During conversion, all the data nodes are kept unchanged.
To meet the bound in the table above, the converter can access \(r^F\) symbols from each initial codeword.
As we saw in <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">example 1</a>, this is possible by designing the parity matrices in a way that allows the converter to compute the new parities using only the old parities.
This can be done, for example, by using a Vandermonde matrix, although Vandermonde parity matrices are not guaranteed to produce MDS codes.
However, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022convertible">Maturana & Rashmi (2022a)</a> provide a method for constructing access-optimal codes that are guaranteed to be MDS over large enough field sizes.</p>
<h3 id="split-regime">Split regime</h3>
<p>Achieving the bound in the table above is simple: during conversion, the converter reads the data symbols corresponding to all final codewords except one, along with \(r^F\) initial parity symbols.
Then, the read data symbols are used to compute the corresponding parity symbols, and to remove their interference from the read initial parities.</p>
<blockquote>
<p><strong>Example 2.</strong> <a name="ex-access-split"></a>
Consider the conversion from \([n^I = 11, k^I = 8]\) to \([n^F = 7, k^F = 4]\) over \(\mathrm{GF}(17)\).
Suppose we use the same \(\mathbf{P}^I\) and \(\mathbf{P}^F\) from <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#ex-merge">example 1</a> but swapped.
During conversion, the converter reads \((a_5,\ldots,a_8)\) and the 3 initial parities \((a_1,\ldots,a_8)\mathbf{P}^I\).
The parity symbols of the second final codeword can be computed directly from the data;
the parity symbols of the first final codeword are computed as follows:
\[
(a_1,\ldots,a_8)\mathbf{P}^I -
(a_5,\ldots,a_8)
\begin{bmatrix}
1 & 16 & 4 \\
1 & 15 & 7 \\
1 & 13 & 8 \\
1 & 9 & 14 \\
\end{bmatrix}
=
(a_1,\ldots,a_4)\mathbf{P}^F.
\]
Thus, in total 7 symbols are read, compared to the default approach of reading all 8 data symbols.</p>
</blockquote>
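Example 2's split conversion can be verified the same way. The sketch below uses the matrices from the two examples (initial and final parity matrices swapped, as the example states; variable names are ours) and confirms that subtracting the contribution of the read data symbols from the initial parities yields the first final codeword's parities.

```python
# Verify Example 2's split conversion over GF(17).
MOD = 17
THETAS = [1, 2, 6]

def parity_matrix(rows):
    # Vandermonde parity matrix: row i holds theta_j ** i for each point.
    return [[pow(t, i, MOD) for t in THETAS] for i in range(rows)]

def parities(data, pm):
    # Row vector `data` times matrix `pm`, mod 17.
    return [sum(d * pm[i][j] for i, d in enumerate(data)) % MOD
            for j in range(len(THETAS))]

P_init = parity_matrix(8)   # initial [11, 8] code (Example 1's P^F)
P_final = parity_matrix(4)  # final [7, 4] code (Example 1's P^I)
bottom = P_init[4:]         # rows 5-8: the 4 x 3 matrix shown in the example

data = [3, 14, 1, 5, 9, 2, 6, 10]   # a_1, ..., a_8 (arbitrary message)
init_parities = parities(data, P_init)

# Subtract the read data symbols' (a_5, ..., a_8) interference from the
# initial parities to obtain the first final codeword's parities.
recovered = [(init_parities[j] - parities(data[4:], bottom)[j]) % MOD
             for j in range(3)]
assert recovered == parities(data[:4], P_final)
```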
<h3 id="general-regime">General regime</h3>
<p>In the general regime, partitions need to be specified; <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2020access">Maturana et al., 2020</a> show the optimal way of choosing them.
At a high level, the optimal partition keeps data from the same initial codeword together in the final codeword whenever possible; that way, parity symbols can be used more effectively.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/general_regime_example.png" alt="General regime example" />
<em>Figure 4: Code conversion from \([6,5]\) MDS code to \([13,12]\) MDS code.</em></p>
<blockquote>
<p><strong>Example 3.</strong>
Consider conversion from \([n^I=6, k^I=5]\), to \([n^F=13, k^F=12]\).
Thus, there are a total of \(\mathrm{lcm}(5,12)=60\) data symbols, organized into \(\lambda^I=12\) initial codewords or \(\lambda^F=5\) final codewords.
The parity matrices of the codes are designed as if the final code was \([16,15]\) (which combines 3 codewords into 1).
The conversion procedure splits two of the initial codewords into “intermediate codewords” (which are not materialized, but only used to describe the construction).
Then, two initial codewords are merged along with two data symbols from the intermediate codewords.
The split and merge are executed with the same techniques we showcased for the merge and split regime, and thus only 18 symbols need to be read (marked by a dot in the figure).
Compare this to the default approach of reading all 60 data symbols.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>The code conversion problem adds a new dimension to the design of codes.
This new dimension not only opens a variety of interesting theoretical questions, but has a huge potential for real-world impact in distributed storage systems.
In this post, we only scratched the surface of the code conversion problem:
other work on code conversion has focused on minimizing conversion bandwidth instead of access cost (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2022bandwidth">Maturana & Rashmi, 2022</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2023bandwidth">Maturana & Rashmi, 2023</a>) and on codes with better repair properties (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#xia2015tale">Xia et al., 2015</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#wu2022optimal">Wu et al. 2022</a>, <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/convertible-codes/#maturana2023locally">Maturana & Rashmi, 2023</a>).
Even when considering these additional works, there are still many open questions in this nascent area of research.</p>
<h2 id="references">References</h2>
The Quantum Physicist's Method of Resource Analysis2023-06-06T00:00:00+00:002023-06-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/quantum-physicists-method/<p>The physicist’s method is a powerful framework for cost analysis that
many a computer scientist will learn at some point in their undergraduate career.
However, its high-level description leaves some practical gaps, especially concerning
how to actually bookkeep its finer details, and these details become important
when trying to build a more explicit accounting framework.
This post explains how to fill in these gaps with
the <em>quantum</em> physicist’s method, a refinement of the physicist’s method
that is robust enough for automatic program analysis, as in
my paper <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>. (Quick disclaimer: There is
no quantum computing in here, despite the name.) To explain the new
bookkeeping devices of the quantum physicist’s method,
this post will first explain the classical physicist’s method
for algorithm analysis, then describe the difficulties it encounters when
adapted to the domain of program analysis, and finally lay out the
solution granted by bookkeeping with the quantum physicist’s method.</p>
<h1 id="the-classical-physicist-s-method">The Classical Physicist’s Method</h1>
<p>To make sense of the physicist’s method (and the later refinements we’ll make to it), it is
good to start by recalling the physical reasoning behind it. Think back to your high school physics class
where you learned about energy. If you drop a 1 kilogram ball from
1 meter above the Earth, and drop an identical ball from the top of a 1 meter high ramp, how do their speeds compare
when they hit the ground?</p>
<p>It might seem like I haven’t given you enough information, but a neat little physical
principle called <em>conservation of energy</em> tells you all you need to know. At the start, both balls
have the same amount<sup class="footnote-reference"><a href="#grav">1</a></sup> of (gravitational) potential energy since they are the same height, same mass, and subject to the same
gravity. And at the end, both balls have none, since their distance from the ground is 0. Because the
total energy is <em>conserved</em>, we know that all that energy must still be around, just in some other form - in this case,
as kinetic energy in the balls’ speeds. So even though both
balls took different routes to the ground, the same energy goes into their speed, and thus the speeds are the same<sup class="footnote-reference"><a href="#speed">2</a></sup>.
Let me emphasize that point: <em>As long as we know energy is conserved,
we can measure expenditure with the difference between starting and ending energy</em>.</p>
<p>Eventually Robert Tarjan and Danny Sleator brought this idea to computer science.
They introduced it to define
<em>amortized</em> algorithm costs (see <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/pdf/10.1137/0606031?casa_token=cR8nppnD8MQAAAAA%3AgK8XhJzUtPvkIVXTHIe299HSRuczuwiYVM74VDBjOMpHDlLcZLIVlziYWpRQMHeuN3lz84b9kIUg&">here</a>).
However, the idea of amortization itself is much older, and comes from the financial industry.
Amortization is used to express a notion of average cost where occasional
spikes of high cost are prepaid over longer periods of time, like paying off a loan
in installments instead of all at once at the due date. However, if we think about this
prepayment as storing extra potential energy for later, the reasoning becomes exactly the
same as reasoning about conservation of energy. Hence, Tarjan and Sleator suggested
calling the approach “the physicist’s method”<sup class="footnote-reference"><a href="#personal">3</a></sup>.</p>
<p>To see how this all comes up in the analysis of algorithms, consider implementing an arbitrarily sized list using
fixed-size arrays<sup class="footnote-reference"><a href="#list">4</a></sup>. In particular, let’s look at the list’s insertion function, and measure its
cost in how many array write accesses it uses. The common case is that list insertion will just be able to directly write a new
element into the next unused slot in the array, for a cost of 1 write. But eventually, the array will be full with no unused slots.
When this happens, our implementation will:</p>
<ol>
<li>allocate a new array double the size of the old one</li>
<li>write all the elements from the old array into the new one</li>
<li>write the new element into the next empty space of the new array</li>
</ol>
<p>If you count them, you’ll find this uncommon case uses a number of array writes equal to the length of the list plus one,
which is a far cry from the common case’s constant cost. Worst-case cost analysis thus makes this algorithm look much more
inefficient than it usually is.</p>
<p>If instead we think through a lens of amortization, we find that insertion is, morally-speaking, a constant cost operation.
Essentially, insertion is cheap enough often enough that prepaying a little extra at each common case
can offset the high cost of the uncommon case. We can see how that looks in the graph below<sup class="footnote-reference"><a href="#graph">5</a></sup>, where the
black spikes of cost
never exceed the red constant-per-step payment.</p>
<p><img src="./amortizedgraph.jpeg" alt="a graph showing a constant-per-step bound over spiky costs" /></p>
<p>To show this formally, we define a suitable <em>potential</em> function \(\Phi\) giving the amount of prepaid potential energy stored in the
program state. Specifically, our desired \(\Phi(A)\) will be equal to twice the number of
filled slots past
the halfway point in the array \(A\).
We can think of this like attaching a 2-charge battery to each individual array cell past the halfway point, so that
we deal with that battery’s energy if and only if we access that array cell.
The amortized cost of an operation \(o\) is then defined as \(\Phi(o(A)) - \Phi(A) + C(A,o)\), which is the difference in potential induced by \(o\) plus \(C(A,o)\) its true cost on the array \(A\).
If we account for this potential energy alongside our normal costs, suddenly the cost profile becomes much smoother:</p>
<ul>
<li>
<p>In the common case, insertion just writes into the next unused slot of \(A\). We still pay the true cost of
\(C(A,\mathsf{insert}) = 1\) for that write, but now we might also need to pay 2 more to “charge the battery”
if the element is past the halfway point in the array. This “battery-charging” is how we pay for the difference in
potential as given by \(\Phi\). In the worst case, the total amortized cost is therefore 3.</p>
</li>
<li>
<p>In the uncommon case, our array \(A\) is now full. Thus, we have stored 2 units of potential with half the elements of \(A\),
which works out to one unit of potential for each element. So, potential can pay for each element’s write into the new array,
with none leftover. The new array itself then has no potential, because it is exactly half full. At this point, stored
potential has exactly covered the cost of all writes and all the state’s potential, accruing a running cost of 0. Finally,
list insertion behaves like its common case again, giving worst-case amortized cost of 3.</p>
</li>
</ul>
<p>Thus, through mediating energy expenditure with potential, we find that
insertion into these array-backed lists takes an amortized constant number of writes. The magic
happened when we prepaid for two extra writes in the common case to “charge the battery”.
Eventually, that prepayment gets spent in the uncommon case to cover the writes into the
new array.</p>
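<p>This accounting is easy to check mechanically. Below is a small Python sketch (names like <code>DynArray</code> are invented for illustration, not taken from any particular library) that instruments the array-backed list and asserts that the amortized cost, i.e. the true cost plus the change in \(\Phi\), never exceeds 3 per insertion:</p>

```python
class DynArray:
    """Array-backed list that counts array writes as its true cost."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.writes = 0  # running true cost: array write accesses

    def potential(self):
        # Phi(A): a 2-charge "battery" on each filled slot past halfway
        return 2 * max(0, self.size - self.capacity // 2)

    def insert(self, x):
        if self.size == self.capacity:
            # uncommon case: double the array and copy every element over
            self.capacity *= 2
            self.writes += self.size
        self.size += 1
        self.writes += 1  # write the new element itself

a = DynArray()
prev_phi, prev_writes = a.potential(), a.writes
for i in range(1000):
    a.insert(i)
    true_cost = a.writes - prev_writes
    amortized = true_cost + a.potential() - prev_phi
    assert amortized <= 3  # the worst-case amortized cost from the analysis
    prev_phi, prev_writes = a.potential(), a.writes
```

<p>Keeping <code>writes</code> as a running counter means each insertion’s true cost can be read off as a difference, mirroring the per-operation cost \(C(A, \mathsf{insert})\).</p>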
<p>Now that you’ve seen an example, we can look at the general case:</p>
<blockquote>
<p>Given:</p>
<ul>
<li>a set of operations \(\mathsf{op} = \mathsf{state} \rightarrow \mathsf{state} \)</li>
<li>a true cost function \(C : \mathsf{state} \times \mathsf{op} \rightarrow \mathbb{R}\)</li>
</ul>
<p>If you can find:</p>
<ul>
<li>a potential function \(\Phi : \mathsf{state} \rightarrow \mathbb{R}_{\geq 0}\) </li>
<li>amortized cost \(a_o\) for each operation \(o\)</li>
</ul>
<p>such that \(\Phi(S) + a_{o_i} \geq \Phi(o_i(S)) + C(S, o_i)\)
for any state \(S\) and operation \(o_i\),</p>
<p>Then for any sequence of \(n\) operations \((o_i)\) and the sequence of states \((S_i)\) that they induce :</p>
<p>\[\sum_{i=0}^{i<n} a_{o_i} + \Phi(S_{0}) - \Phi(S_{n}) \geq \sum_{i=0}^{i<n} C(S_i, o_i)\]</p>
<p>i.e., the total amortized cost plus change in potential covers the total true cost.</p>
</blockquote>
<p></p>
<p>The condition placed on \(\Phi\) and \(a_{o_i}\) is what corresponds to conservation of energy<sup class="footnote-reference"><a href="#technically">6</a></sup>.
The potential in the state \(\Phi(S)\), and the extra energy paid \(a_{o_i}\) are sufficient to
cover the potential stored in the resulting state \(\Phi(o_i(S))\) and the energy
expenditure \(C(S, o_i)\) – no new energy is created. With that condition in place, just like in physics,
we can forget about intermediate states and just focus on the initial and ending states \(S_0\) and \(S_{n}\).
Hence the conclusion of the theorem, that the potential difference between \(\Phi(S_{0})\) and \(\Phi(S_{n})\)
plus all the total supplied extra energy can pay for the total energy expenditure.</p>
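<p>To spell out that final step: summing the conservation condition over steps \(i = 0, \ldots, n-1\) gives</p>

```latex
\sum_{i=0}^{i<n} \Phi(S_i) + \sum_{i=0}^{i<n} a_{o_i}
  \geq \sum_{i=0}^{i<n} \Phi(S_{i+1}) + \sum_{i=0}^{i<n} C(S_i, o_i)
```

<p>and each intermediate potential \(\Phi(S_1), \ldots, \Phi(S_{n-1})\) appears exactly once on both sides, so it cancels, leaving only \(\Phi(S_0)\) on the left and \(\Phi(S_n)\) on the right, which rearranges into the stated bound.</p>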
<p>In the above formalization, you might notice that the form of the potential function \(\Phi\) is left abstract.
The function <em>could</em> be any sort of complicated, non-uniform, ugly function. But it is no coincidence that
the \(\Phi\) we chose in our above example was “nice”. Specifically, this “niceness” amounts to potential being
<em>local</em> – one can think of the state \(S\) as broken up into many pieces (our array cells),
each with their own local amount of potential (our “batteries”).
Then \(\Phi\) just gives the sum of potential stored on these different pieces,
and adjusts the potential on a piece only when that piece is directly operated on.
In fact, this appears to be exactly how Tarjan intended the
bookkeeping for the physicist’s method to be conceptualized:</p>
<blockquote>
<p>In order to keep track of saved or borrowed credits [potential], it is generally convenient to
store them in the data structure. Regions of the structure containing credits are
unusually hard to access or update (the credits saved are there to pay for extra work);
regions containing “debits” are unusually easy to access or update. It is important to
realize that this is only an accounting device; the programs that actually manipulate
the data structure contain no mention of credits or debits.</p>
</blockquote>
<p>– Tarjan in <a rel="noopener" target="_blank" href="https://epubs.siam.org/doi/pdf/10.1137/0606031?casa_token=cR8nppnD8MQAAAAA%3AgK8XhJzUtPvkIVXTHIe299HSRuczuwiYVM74VDBjOMpHDlLcZLIVlziYWpRQMHeuN3lz84b9kIUg&"><em>Amortized Computational Complexity</em></a></p>
<p>This local-view of potential has been time-tested, and is basically the only form of potential
you will find in the literature. As such, our goal throughout the rest of this
post will be to keep our definition of potential as local as possible.</p>
<h1 id="building-a-program-analysis">Building a Program Analysis</h1>
<p>To build a program analysis based on the physicist’s method, we first need to
adapt the framework above. This is because some of the assumptions made
above are simply not applicable in our programmatic setting. The differences
are mostly technical, but accounting for them does lead to a slightly
different-looking theorem.</p>
<ol>
<li>
<p>The above framework assumes that operations can be executed in any order.
This makes sense when treating the collection of operations like an
interface – you don’t know what order an external user might call operations, so
your analysis needs to be prepared for anything. However this assumption
is wrong for analyzing a program (like the implementation of such an interface).
The program itself dictates specific sequences of operations, and the
analysis must take this into account to get sensible results<sup class="footnote-reference"><a href="#timesensitive">7</a></sup>.</p>
</li>
<li>
<p>The above framework assumes that extra energy \(a_o\) is
paid out on a per-operation basis.
Again, this makes sense when reasoning about an interface, since an external
user pays for each operation they call. However, when a program executes an operation,
there is no external user to introduce extra energy into the system, so costs
must be paid solely out of the energy supply internal to the program, i.e., the potential
of the state<sup class="footnote-reference"><a href="#pool">8</a></sup>.</p>
</li>
</ol>
<p>After adapting the theorem from the previous section to account for these
differences we are left with something
like the statement below. The main changes are that we consider only certain
sequences of operations, and that we drop amortized costs.</p>
<blockquote>
<p>Given:</p>
<ul>
<li>a set of operations \(\mathsf{op} = \mathsf{state} \rightarrow \mathsf{state} \)</li>
<li>a collection of possible sequences of such operations \(\mathsf{seq}\)</li>
<li>a true cost function \(C : \mathsf{state} \times \mathsf{op} \rightarrow \mathbb{R}\)</li>
</ul>
<p>If you can find:</p>
<ul>
<li>a potential function \(\Phi : \mathsf{state} \rightarrow \mathbb{R}_{\geq 0}\) </li>
</ul>
<p>such that \(\Phi(S_i) \geq \Phi(S_{i+1}) + C(S_i, o_i)\)
across all state sequences induced by \(\mathsf{seq}\)
from any initial state \(S_0\)</p>
<p>Then for any sequence of \(n\) operations \((o_i)\) prefixing \(\mathsf{seq}\)
and the sequence of states \((S_i)\) that they induce:</p>
<p>\[\Phi(S_{0}) - \Phi(S_{n}) \geq \sum_{i=0}^{i<n} C(S_i, o_i)\]</p>
<p>i.e., difference in energy bounds the total cost at every point<sup class="footnote-reference"><a href="#corollary">9</a></sup></p>
</blockquote>
<p></p>
<p>With this framework, our program analysis just needs to find a suitable \(\Phi\).
We are currently only considering a <em>local</em> definition of \(\Phi\), so our
task is really just finding a way of
locally assigning potential
to the parts of each individual data structure at each point in our program.</p>
<p>There might be many ways to find such a local \(\Phi\),
but one simple option is to type the data structures. These
types can then include some annotation indicating how much potential the data structure
stores where, like “list but with 2 units of potential per element”. This tells
you exactly how much potential each piece holds, making it easy to recover a
locally-definable \(\Phi\).</p>
<p>If you run
with this idea, you might eventually get something that looks similar to
the type system called Automatic Amortized Resource Analysis (AARA).
AARA can infer a valid \(\Phi\) through the inference of
potential-carrying types, and is fully automatable (as its name suggests).
See <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/640128.604148">here</a> for AARA’s origin
and <a rel="noopener" target="_blank" href="https://www.raml.co/">here</a> for an up-to-date implementation.</p>
<p>There are also a lot of different ways to approach this problem
apart from AARA. Some approaches are more manual
(like <a rel="noopener" target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-89884-1_19">this</a>
verification framework using separation logic). Some use potential with
other traditional techniques (like <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3408979">this</a>
adaptation of recurrence solving). And some are designed for different
programming environments (like <a rel="noopener" target="_blank" href="https://drops.dagstuhl.de/opus/volltexte/2020/12355/pdf/LIPIcs-FSCD-2020-33.pdf">this</a>
one for client-server interactions). I’m certain there are many more options still,
but the reason I bring up AARA in particular is that,
while all of these approaches <em>could</em> potentially employ the quantum physicist’s method in
the future, AARA is the only one that <em>has</em> (and I’m the one that did it).</p>
<h1 id="trouble-in-paradise">Trouble in Paradise</h1>
<p>This localized-potential approach happens to work rather well in many cases. For instance, AARA
can analyze sorting functions and many list manipulations without issue. Nonetheless, it is not hard to confound this approach.
Consider a simple loading function that populates one of our array-backed lists from one of two other lists.
When called, the load function first executes some code (e.g. <code>shouldComeFromList1</code>) to decide which list the data should
come from, and then inserts it all one element at a time. Here we see what this might look like in pseudo-code<sup class="footnote-reference"><a href="#python">10</a></sup>.</p>
<pre data-lang="python" style="background-color:#393939;color:#dedede;" class="language-python "><code class="language-python" data-lang="python"><span style="color:#fed6af;">def </span><span style="color:#fffd87;">load</span><span>(target, list1, list2):
</span><span> </span><span style="color:#fed6af;">if </span><span>shouldComeFromList1():
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span>list1:
</span><span> target.insert(i)
</span><span> </span><span style="color:#fed6af;">else</span><span>:
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span>list2:
</span><span> target.insert(i)
</span></code></pre>
<p>If we assume that <code>shouldComeFromList1</code> has no array writes, then we only need consider the cost
of insertion. Clearly, only one list’s worth of insertions occurs, and each insertion has an amortized cost of 3,
so \(\Phi\) need only assign 3 units of energy per element to the list selected by <code>shouldComeFromList1</code>.
However, there is in general no way to statically know which list that is – it is <em>undecidable</em>,
even if we had access to the source code for <code>shouldComeFromList1</code>.
This confounds our local method of accounting, since it must store potential in a specific list,
but cannot say which list will end up sourcing the data. We might get around this by having \(\Phi\) yield something
like \(3*\mathsf{max}(|\verb“list1“|, |\verb“list2“|)\) to cover the worst case, but this \(\mathsf{max}\) is not
expressible in a local way - at best, the local approach can overapproximate
\(\mathsf{max}\) with a sum, giving potential of \(3*(|\verb“list1“| + |\verb“list2“|)\), the cost for loading <em>both</em> lists.
And while this bound can only be loose by a constant factor of 2, other examples can loosen the bound to be exponentially worse
(like binary search <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>).</p>
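<p>A quick numeric sanity check (the list lengths here are invented) confirms both that the sum bound is safe and that it carries up to a factor-of-2 slack over the \(\mathsf{max}\) bound:</p>

```python
# Hypothetical lengths; at runtime only one of the two lists gets loaded.
len1, len2 = 1000, 10

tight_bound = 3 * max(len1, len2)  # what nonlocal reasoning would justify
local_bound = 3 * (len1 + len2)    # potential stored on *both* lists

# The sum always covers the max, but can be up to twice as large.
assert tight_bound <= local_bound <= 2 * tight_bound
print(tight_bound, local_bound)  # → 3000 3030
```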
<p>At this point, you might think the bound looseness
is just some weakness on <em>the analysis’s</em> end, where
presumably <em>some</em> localization of the tightest potential exists, but the analysis just can’t figure it out.
However, the situation is actually worse:
We can create an example where <em>no</em> tight localization suffices, even while nonlocal reasoning
makes a tight solution obvious<sup class="footnote-reference"><a href="#Bell">11</a></sup>.</p>
<p>This problem happens especially when measuring the cost of a resource like memory,
since memory is returned after use<sup class="footnote-reference"><a href="#neg">12</a></sup> and can be reused. When a resource is returned, it is as
if additional payment is provided midway through the computation. This <em>could</em> lessen the amount
of resources needed upfront, but only if those resources aren’t needed prior to when the extra
resources are returned. In effect, the cost of resources like memory is measured with
their <em>peak</em> cost, the high water mark of the number of resources in use at one time.
These resources are therefore a bit more complicated than resources that only tick down, like
time. This makes it easy to create a situation with no tight localization of potential, like that below.</p>
<p>To see this problem in action, imagine we have a list of data, and two different
data processing procedures <code>process1</code> and <code>process2</code>. To compare the results of
these procedures, we might write the code below.
How should we account for the <em>memory</em> cost of the comparison, if each of <code>copy</code><sup class="footnote-reference"><a href="#copy">13</a></sup>,
<code>process1</code>, and <code>process2</code> temporarily
use one unit of memory per element in the list?</p>
<pre data-lang="python" style="background-color:#393939;color:#dedede;" class="language-python "><code class="language-python" data-lang="python"><span style="color:#fed6af;">def </span><span style="color:#fffd87;">copy</span><span>(list):
</span><span> ret </span><span style="color:#ececec;">= </span><span>emptyList()
</span><span> </span><span style="color:#fed6af;">for </span><span>i </span><span style="color:#fed6af;">in </span><span style="color:#fffb9d;">list</span><span>:
</span><span> ret.insert(i)
</span><span> </span><span style="color:#fed6af;">return </span><span>ret
</span><span>
</span><span style="color:#fed6af;">def </span><span style="color:#fffd87;">processBoth</span><span>(data):
</span><span> dataCopy </span><span style="color:#ececec;">= </span><span>copy(data)
</span><span> </span><span style="color:#fed6af;">return </span><span>(process1(data), process2(dataCopy))
</span></code></pre>
<p>It seems obvious from the outset that whatever memory <code>copy</code> uses can be
reused for <code>process1</code>, and that <code>process1</code>’s memory in turn can be reused for <code>process2</code>,
since all act on lists of equal length. So, we should only need to allocate \(|\verb“data“|\)
memory units. However, if that is all the memory we have,
accounting for it locally is impossible.</p>
<p>To follow the accounting, let’s step through a call to <code>processBoth</code>. We start with the only
data structure being our input <code>data</code>, so it must contain all the potential.
We proceed to copy <code>data</code> to ready it for each of the processing functions.
This copying procedure temporarily uses all the \(|\verb“data“|\) memory units,
leaving some amount stored on <code>data</code> and some amount stored on <code>dataCopy</code> when
the memory is returned.
Then <code>process1</code> is applied to <code>data</code>, requiring all of
the \(|\verb“data“|\) memory units. Now, because <code>process1</code> doesn’t touch
<code>dataCopy</code>, <code>process1</code> cannot use any of the potential in <code>dataCopy</code>
– this means <code>data</code>
needs to have received all the potential, and none is stored on <code>dataCopy</code>. However,
this is followed by applying <code>process2</code> to <code>dataCopy</code>, which results in mirrored accounting for
potential: all potential should have been returned to <code>dataCopy</code>, with none stored in <code>data</code>!
While we intuitively know that this could be solved by having <code>process1</code> return
potential to <code>dataCopy</code>, there is never a time where <code>process1</code> and <code>dataCopy</code>
are local to the same operation.
Thus, no local allocation of \(|\verb“data“|\) potential suffices.
Just like before, the local
approach can only manage to overapproximate this example by a factor of 2, and can
be exponentially worse in other examples.</p>
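<p>This peak-cost behavior can be made concrete with a toy meter (a sketch only; the helper names are invented, and <code>copy</code>, <code>process1</code>, and <code>process2</code> are abstracted as phases that each borrow and return \(n\) memory units):</p>

```python
class Meter:
    """Tracks currently live memory units and their high-water mark."""

    def __init__(self):
        self.live = 0
        self.peak = 0

    def alloc(self, n):
        self.live += n
        self.peak = max(self.peak, self.live)

    def free(self, n):
        self.live -= n

def phase(m, n):
    # each of copy, process1, process2 temporarily uses n units
    m.alloc(n)
    m.free(n)

def process_both_peak(n):
    m = Meter()
    phase(m, n)  # copy data into dataCopy
    phase(m, n)  # process1 on data, reusing the memory copy returned
    phase(m, n)  # process2 on dataCopy, reusing process1's memory
    return m.peak

# The peak is n, not 3n: each phase reuses what the previous one returned.
print(process_both_peak(100))  # → 100
```

<p>Measuring the peak rather than the total allocation is exactly what separates reusable resources like memory from resources that only tick down, like time.</p>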
<h1 id="the-quantum-physicist-s-method">The Quantum Physicist’s Method</h1>
<p>So far, our situation is rather unfortunate. We have this beautiful framework
from algorithm analysis, but when we naively adapt it to a program analysis we
must sacrifice either the efficacy of the result or the beauty of locality.
However, there is a solution: bookkeeping using the <em>quantum</em> physicist’s method.
To keep things intelligible to non-physicists, this section will focus on
the actual execution of the method, while any quantum
physical parallels that come up will be kept
contained in the footnotes.</p>
<p>The idea behind the quantum physicist’s method is to introduce
the accounting device of “worldviews”. Each individual worldview
\(\phi_j : \mathsf{state} \rightarrow \mathbb{R} \) is
just a normal local accounting of potential like our previous \(\Phi\),
though with the added caveat that they are
allowed to locally assign <em>negative</em> amounts of potential under special
conditions<sup class="footnote-reference"><a href="#detail">14</a></sup>.</p>
<p>Formally, the collection of worldviews satisfies the following
properties for all state sequences induced by \(\mathsf{seq}\)
from any initial state \(S_0\)</p>
<ol>
<li>
<p>\(\forall j. \hspace{4pt} \phi_j(S_i) \geq \phi_j(S_{i+1}) + C(S_i, o_i)\),
i.e., every worldview pays out the usual costs</p>
</li>
<li>
<p>\(\exists j. \hspace{4pt} \forall T\subseteq S_i. \hspace{4pt} \phi_j(T) \geq 0 \),
i.e., some worldview is classically valid, wherein potential is non-negative
everywhere<sup class="footnote-reference"><a href="#whole">15</a></sup></p>
</li>
</ol>
<p>Given these properties, one can prove the following key theorem:</p>
<blockquote>
<p>Theorem: \(\max_j \phi_j\) is a suitable definition
of \(\Phi\) for the classical physicist’s method</p>
</blockquote>
<p></p>
<p>Indeed, the first property meets the bulk of the requirements for a valid
potential function, and the second property ensures that the max potential
is always classically valid.</p>
<p>You might at this point wonder what this new way of finding a potential
function buys us. The answer is that this simple way of combining our
familiar local accounts of potential introduces some powerful <em>nonlocal</em>
flexibility. By allowing different worldviews to tactically “go into debt”,
this method can infer tighter cost bounds than naive local reasoning can usually
supply.</p>
<p>To better understand how the mechanics of these worldviews actually work,
it might help to walk through a situation without so much technical cruft:
Suppose that
Alice and Bob get $5 to share from their parents to spend on candy in a candy store. Alice wants a $3
pack of caramels and Bob wants a $2 chocolate bar. However, Alice’s caramels are in a
vending machine that only takes $5 bills. If Alice keeps $5 to herself, then Bob can’t buy his candy.
But if Bob keeps $2 to himself, then Alice can’t use the vending machine for her candy.
So, what do Alice and Bob do?</p>
<p>The answer is quite simple: Alice buys her caramels with all the money, gets the change back,
and then gives it to Bob –
they both can then get what they want with no extra money needed.
I’m sure I have done the same with my brother plenty of times growing up.
We can bookkeep this the following way:</p>
<table><thead><tr><th align="center"></th><th align="center">start</th><th>Alice buys</th><th>get change</th><th>transfer</th><th>Bob buys</th></tr></thead><tbody>
<tr><td align="center">Alice/Bob money split</td><td align="center">5/0</td><td>0/0</td><td>2/0</td><td>0/2</td><td>0/0</td></tr>
</tbody></table>
<p>And this is exactly what we want, with one small caveat:
The “transfer” operation is actually quite nontrivial to work with. Only
highly specialized programming languages will even have constructs for <em>mentioning</em>
potential, and those that do will be burdened (or burden the programmer) with
figuring out how such constructs can be soundly used. But, by using worldviews for
bookkeeping, this whole problem can be bypassed entirely. We provide such an
account below:</p>
<table><thead><tr><th align="center"></th><th align="center">start</th><th>Alice buys</th><th>get change</th><th>Bob buys</th></tr></thead><tbody>
<tr><td align="center">worldview 1 Alice/Bob money split</td><td align="center">5/0</td><td>0/0</td><td>2/0</td><td>2/-2</td></tr>
<tr><td align="center">worldview 2 Alice/Bob money split</td><td align="center">3/2</td><td>-2/2</td><td>0/2</td><td>0/0</td></tr>
</tbody></table>
<p>With this worldview accounting<sup class="footnote-reference"><a href="#qt">16</a></sup>, we pay the exact same amount out of the same place at each step.
The only difference between the two worldviews is that worldview 1 starts in the allocation of money needed
for Alice to buy her candy, and worldview 2 starts in the allocation needed for Bob to buy his. Then,
we find that the problematic “transfer” occurs where different worldviews become classically valid –
we see that happen at “get change”, since worldview 1 is classically valid at “Alice buys”, and
worldview 2 is classically valid at “Bob buys”. This pattern will hold in general, allowing transfers
to be coded completely implicitly into our analysis.</p>
<p>Using worldviews like this, we can solve both of the problems from the previous section:</p>
<ul>
<li>
<p>To solve the first – the data loading problem – simply start with 2 worldviews: one where <code>list1</code> carries all potential,
and one where <code>list2</code> does. No matter which list pays, there will then be a classically valid worldview.</p>
</li>
<li>
<p>To solve the second – the data processing problem – start with 2 worldviews assigning <code>data</code> all the potential. Then upon copying,
let the worldviews diverge – one leaves all the potential on <code>data</code>, and one moves it all
to <code>dataCopy</code>. The former can be the classically valid one while applying <code>process1</code>, and the latter when applying <code>process2</code>.</p>
</li>
</ul>
<p>In either case the max amount of potential across the worldviews is exactly the tight amount of potential
we wanted assigned.</p>
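<p>The mechanics of this worldview table can be sketched in a few lines of Python (this models only the bookkeeping device, not AARA itself; dollar balances stand in for local potential, and may go negative so long as some worldview stays non-negative):</p>

```python
def pay(worldviews, who, amount):
    """Charge `amount` against `who`'s local balance in every worldview."""
    for w in worldviews:
        w[who] -= amount

def refund(worldviews, who, amount):
    for w in worldviews:
        w[who] += amount

def check(worldviews):
    # Property 2: some worldview must be classically valid,
    # i.e., non-negative everywhere.
    assert any(all(v >= 0 for v in w.values()) for w in worldviews)

# Worldview 1 budgets for Alice's vending machine; worldview 2 for the split.
worldviews = [{"alice": 5, "bob": 0}, {"alice": 3, "bob": 2}]
check(worldviews)

pay(worldviews, "alice", 5)     # Alice feeds $5 into the vending machine
check(worldviews)               # worldview 1 is the valid one here
refund(worldviews, "alice", 2)  # the machine returns $2 in change
check(worldviews)
pay(worldviews, "bob", 2)       # Bob buys his chocolate bar
check(worldviews)               # worldview 2 is the valid one here

# The classical potential is the max total over worldviews; both end at 0.
print(max(sum(w.values()) for w in worldviews))  # → 0
```

<p>Note that money never explicitly moves between Alice and Bob in any single worldview; the “transfer” is implicit in which worldview happens to be classically valid at each step.</p>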
<p>And so, with worldviews in hand, we can salvage the niceness of locality by wrapping a bunch of local accountings
together and letting them make each other more flexible. From such an accounting we can
reconstruct a potential function that satisfies the standard framework for
amortized analysis, just as our new theorem ensures. This
leaves us with a program analysis built off the physicist’s method that can give many tighter
bounds than its predecessors.</p>
<h1 id="wrap-up">Wrap Up</h1>
<p>If you are interested in seeing such an analysis in action,
I’ll point you again to my work extending AARA <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>.
My paper adds the quantum physicist’s method along with some special infrastructure called <em>remainder contexts</em>, and then
uses its new capabilities to be able to automatically reason about memory usage and tree depth. The work also
comes with an implementation, a description of how it was designed, and tables of experiments run with it on
the OCaml standard library <code>Set</code> module. The implementation never gave worse cost bounds than the local approach, and often
gave much better ones. You can check it out and see for yourself!</p>
<div class="footnote-definition" id="grav"><sup class="footnote-definition-label">1</sup>
<p>Specifically, they both have \(1\mathsf{kg} * 9.81\frac{\mathsf{m}}{\mathsf{s}^2} * 1 \mathsf{m} = 9.81 \mathsf{J}\) of energy.</p>
</div>
<div class="footnote-definition" id="speed"><sup class="footnote-definition-label">2</sup>
<p>Specifically, solving for
\(v\) in the conversion between energy and speed
\(9.81\mathsf{J} = \frac 1 2 * 1\mathsf{kg} * (v \frac{\mathsf{m}}{\mathsf{s}})^2 \) gives
\( 4.43\frac{\mathsf{m}}{\mathsf{s}}\) as the speed of the balls at ground level.</p>
</div>
<div class="footnote-definition" id="personal"><sup class="footnote-definition-label">3</sup>
<p>I personally found this analogy with physical reasoning very useful to my
understanding when I was learning about algorithm analysis in undergrad. I’m sure many students feel the same.</p>
</div>
<div class="footnote-definition" id="list"><sup class="footnote-definition-label">4</sup>
<p>This list would be the data structure called a <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Dynamic_array">dynamic array</a>.
It is the <a rel="noopener" target="_blank" href="https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html"><code>ArrayList</code> in Java</a>
and the <a rel="noopener" target="_blank" href="https://www.cplusplus.com/reference/vector/vector/"><code>vector</code> in C++</a>, and probably underlies a lot of other list implementations too.</p>
</div>
<div class="footnote-definition" id="graph"><sup class="footnote-definition-label">5</sup>
<p>Taken from <a rel="noopener" target="_blank" href="https://stackoverflow.com/questions/200384/constant-amortized-time">here</a>.</p>
</div>
<div class="footnote-definition" id="technically"><sup class="footnote-definition-label">6</sup>
<p>Technically speaking, it only corresponds to the non-creation of energy, since we are interested
in an upper-bound on cost. Energy conservation means both non-creation and
non-loss of energy. Adapting the amortized analysis framework to non-loss would result in a lower-bound on cost.</p>
</div>
<div class="footnote-definition" id="timesensitive"><sup class="footnote-definition-label">7</sup>
<p>To help with this order-sensitivity, we will also from
now on consider the program state to have some notion of where it lies in
execution, like a program counter. However, this is just a technical point to
allow \(\Phi\) the flexibility to leverage operation order, and its exact
implementation is not important.</p>
</div>
<div class="footnote-definition" id="pool"><sup class="footnote-definition-label">8</sup>
<p>One might consider that external energy could be introduced at the
very start when a user calls on the program to execute. However, we will just
streamline this initial
payment by treating it as part of the energy assigned
to the initial program state.</p>
</div>
<div class="footnote-definition" id="corollary"><sup class="footnote-definition-label">9</sup>
<p>As a corollary, since the amortized cost payments are gone,
we also find that the potential of the initial
state bounds the peak cost. This is more useful to measure resources like memory.</p>
</div>
<div class="footnote-definition" id="python"><sup class="footnote-definition-label">10</sup>
<p>By pseudo-code I mean Python.</p>
</div>
<div class="footnote-definition" id="Bell"><sup class="footnote-definition-label">11</sup>
<p>For those with a physics background, you might consider this our version of a <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Bell_test">Bell test</a>.
In physics, this is a case proving that <em>local realism</em> is incompatible with
quantum mechanics; in our setting, it is a case proving that
purely local potential is insufficient for a tight cost analysis.</p>
</div>
<div class="footnote-definition" id="neg"><sup class="footnote-definition-label">12</sup>
<p>This return of energy is modeled in our framework simply by letting \(C\) return negative costs.</p>
</div>
<div class="footnote-definition" id="copy"><sup class="footnote-definition-label">13</sup>
<p>Copying is only really needed in this code if <code>process1</code> or <code>process2</code> might mutate the underlying list. However,
the pertinent features of this code pattern also arise in side-effect-free settings during, e.g., tree traversal.
See <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3473581">here</a>.</p>
</div>
<div class="footnote-definition" id="overpay"><sup class="footnote-definition-label">17</sup>
<p>Well, technically a worldview could choose to overpay for the cost too.</p>
</div>
<div class="footnote-definition" id="detail"><sup class="footnote-definition-label">14</sup>
<p>This sets up our worldviews to begin looking somewhat like
states in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Quantum_superposition">quantum superposition</a>.
Both are collections of simultaneous classical-looking states, just with negative
values allowed where they usually wouldn’t be. In quantum physics, those values are
probabilities; in our setting, they are potentials.</p>
</div>
<div class="footnote-definition" id="whole"><sup class="footnote-definition-label">15</sup>
<p>While only a technical point here, the consequences of
allowing the accumulation of negative potential in some parts of the
program state does
provide another quantum physical parallel. Two famous no-go theorems of
quantum physics, <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/No-cloning_theorem"><em>no-cloning</em></a>
and <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/No-deleting_theorem"><em>no-deleting</em></a>, mean
that a quantum state cannot simply duplicate or delete one of its pieces. These
same principles are relevant to the program states of the quantum physicist’s method: we cannot
simply duplicate potential when copying a data structure, nor may we simply lose potential
when deleting/ignoring a data structure. Either case could
introduce extra potential, when positive amounts are duplicated or negative amounts
are lost, which would violate conservation.</p>
</div>
<div class="footnote-definition" id="qt"><sup class="footnote-definition-label">16</sup>
<p>We call this particular way of accounting for how to get around the barrier of the vending machine
“resource tunneling”, because it is analogous to
<a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Quantum_tunnelling">quantum tunneling</a>
around a potential barrier. In quantum physics, this occurs because a
particle’s position (or energy, depending on what you measure) is in a
superposition of many states, a small portion of which allow being on the
other side of the potential barrier; in our setting, it is because potential
is tracked through the collection of worldviews, at least one of which is
sufficient to pay for the potential needed. In either case, there may be no
one state of the collection that can explain the tunneling; no state that,
if tracked individually from the start, could pass the potential barrier.</p>
</div>
<h1>Robustness between the worst and average case</h1>
<p><em>Published 2023-04-21 · <a href="https://www.cs.cmu.edu/~csd-phd-blog/2023/intermediate-robustness/">https://www.cs.cmu.edu/~csd-phd-blog/2023/intermediate-robustness/</a></em></p>
<p>As machine learning systems become increasingly deployed in safety-critical applications, such as autonomous driving and healthcare, we need to ensure these systems are reliable and trustworthy. For example, we might wish to determine whether a car’s camera-based autopilot system can correctly classify the color of a traffic light even in the presence of severe weather conditions, such as snow. Consider that the average snowy day looks something like the following:</p>
<img src="./snow1.png" width="400">
<p>Overall, the visibility is not too bad, and we can guess that these weather conditions do not present too much of an issue for the car’s autopilot system. However, every once in a while, we might get a snowy day that looks more like this:</p>
<img src="./snow2.png" width="400">
<p>The visibility is much worse in this scenario, and these conditions might be more difficult for the car’s autopilot system to safely navigate. However, the traffic light color, as well as most of the objects on the road, can still be identified, and we would hope that the autopilot would be able to operate correctly in these conditions. Finally, very rarely, we might have a snow squall like the following: </p>
<img src="./snow3.png" width="400">
<p>These conditions are so extreme that a human driver would probably need to pull over to the side of the road rather than attempt to drive with so little visibility. Therefore, we probably should not require the autopilot system to be robust to such conditions. Now we ask the question: how should we evaluate the robustness of the car’s autopilot to severe weather conditions? </p>
<p>Existing methods for evaluating the robustness of a machine learning model to perturbed inputs (e.g. images that have been corrupted due to severe weather) are largely based on one of two notions. Average-case robustness measures the model’s average performance on randomly sampled perturbations. In the autopilot example, for instance, we could randomly sample a set of images from all days with recorded snow precipitation, and measure the average accuracy of the traffic light detection on those days. If most of those samples look like the first image shown above, we should expect the system’s average robustness to be pretty good. This notion of robustness, however, doesn’t tell us much about how our autopilot system will operate in more extreme conditions, as depicted in the second and third images. </p>
<p>Alternatively, worst-case robustness, or adversarial robustness, measures the model’s worst-case performance across all possible perturbations. For example, the worst-case performance of the autopilot system might be the result of navigating in the conditions depicted by the third image, displaying the snow squall. In this case, we should expect the system’s worst-case robustness to be pretty bad. But as we mentioned previously, we may not care so much if the system is able to navigate the worst-case conditions shown in the third image. </p>
<p>So then, how do we best measure the robustness of the system to conditions like those shown in the second image, i.e. conditions worse than average, but not the worst possible conditions? In this blog post, we present an alternative method for evaluating the test-time performance of machine learning models that measures robustness <em>between</em> the worst and average case, or <em>intermediate</em> robustness. </p>
<h2 id="a-simple-example-robustness-to-gaussian-noise">A simple example: robustness to Gaussian noise</h2>
<p>To further motivate the notion of intermediate robustness, consider the simple scenario in which we are interested in evaluating the robustness of an image classification model to Gaussian noise applied to the input images. The image below is a sample from the ImageNet validation dataset, which an image classifier trained on the ImageNet training dataset correctly classifies as “pizza”. </p>
<img src="./pizza1.png" width="500">
<p>Given ten thousand random samples of Gaussian noise, the model classifies 97% of these noised images correctly, including the image below. Since the model correctly classifies a large majority of the randomly perturbed images, evaluating according to average-case robustness will place most weight on “easy” noise samples like this image.</p>
<img src="./pizza2.png" width="500">
<p>The image below shows an example of a randomly noised image that the model incorrectly classifies as “soup bowl”. Evaluating according to average-case robustness will not place much weight on these samples that are harder for the model to classify correctly. </p>
<img src="./pizza3.png" width="500">
<p>What if we want to evaluate the model’s robustness on a stricter level than average-case robustness? Evaluating the image classifier according to worst-case robustness doesn’t make much sense for this particular noise distribution, because the worst-case noise could be an arbitrarily large amount of noise with extremely low probability due to the Gaussian distribution being unbounded. A more practical evaluation of robustness would consider something stricter than simply average performance on random perturbations, but not quite as strict as adversarial robustness, which is exactly what our intermediate robustness metric enables.</p>
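To make the average-case notion concrete, here is a minimal sketch of how one might estimate a model's average-case robustness under Gaussian noise. The "classifier", the input, and the noise scale are all illustrative stand-ins, not the actual ImageNet model from the example:

```python
import numpy as np

def average_case_accuracy(predict, x, y, sigma=0.1, n_samples=1000, seed=0):
    """Estimate average-case robustness: accuracy over random Gaussian noise."""
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(n_samples):
        noise = rng.normal(0.0, sigma, size=x.shape)
        correct += int(predict(x + noise) == y)
    return correct / n_samples

# Toy stand-in "classifier": predict label 1 if the mean pixel exceeds 0.5.
predict = lambda x: int(x.mean() > 0.5)
x = np.full(16, 0.6)  # stand-in for an image whose true label is 1
acc = average_case_accuracy(predict, x, y=1)
```

Note that this estimator is all about averages: a rare, catastrophic noise sample barely moves `acc`, which is exactly the limitation motivating the metric in the next section.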
<h2 id="an-intermediate-robustness-metric">An intermediate robustness metric</h2>
<p>We’ll now go into the details of how we formulate an intermediate robustness metric. We start by observing that we can naturally generalize average-case and worst-case robustness under one framework. To show this, we will make use of the mathematical definition of an \( L^p \)-norm of a function \( f \) on a measure space \( X \): \( ||f||_p = (\int_X |f|^p )^{(1/p)} \). However, to differentiate this from the use of \( \ell_p \)-norm balls as perturbation sets in adversarial robustness, we will from here on refer to this definition as the \( q \)-norm of a function. Ultimately, average and worst-case robustness can be generalized by taking the \( q \)-norm of the loss function over the perturbation distribution, where the loss just measures how well the model performs on the perturbed data. The setting of \( q=1 \) results in average-case robustness, whereas the setting of \( q = \infty \) results in worst-case robustness, because by definition the \( L^\infty \)-norm is given by the pointwise maximum of a function. Then, any value of \( q \) between \( 1 \) and \( \infty \) results in <em>intermediate</em> robustness. This is more formally written below:</p>
<blockquote>
<p>Define a neural network \( h \) with parameters \( \theta \), and a loss function \( \ell \) that measures how different the model predictions are from the true label \( y \) given an input \( x \). Consider we are interested in measuring the robustness of this model to perturbations \( \delta \) from some perturbation distribution with density \( \mu \). Now consider the following expectation over the \( q \)-norm of the loss according to this perturbation density,
$$ \mathbf{E}_{x, y \sim \mathcal{D}} \Big[ ||\ell(h_\theta(x+\delta), y)||_{\mu, q} \Big], $$
where the \( q \)-norm of the loss with perturbation density \( \mu \) is the following:
$$ ||\ell(h_\theta(x+\delta), y)||_{\mu, q} = \mathbf{E}_{\delta \sim \mu} [|\ell(h_\theta(x+\delta), y)|^q]^{1/q} = \Big( \int |\ell(h_\theta(x+\delta), y)|^q \mu(\delta)d\delta \Big)^{1/q}.$$
Assuming a smooth, non-negative loss function, this expectation corresponds to the expected loss on random perturbations (average-case) when \( q = 1 \),
$$ || \ell(h_\theta(x+\delta), y) ||_{\mu, 1} = \mathbf{E}_{\delta \sim \mu} [\ell(h_\theta(x+\delta), y)], $$
and the expected maximum loss over all possible perturbations (worst-case) when \( q = \infty \),
$$ || \ell(h_\theta(x+\delta), y) ||_{\mu, \infty} = \text{max}_{\delta \in \text{Supp}(\mu)}[\ell(h_\theta(x+\delta), y)].$$</p>
</blockquote>
<p> </p>
<p>Intuitively, as we increase \( q \), more emphasis will be placed on high loss values, as the losses become more strongly “peaked” due to the exponent of \( q \). And so by increasing \(q \) from \( 1 \) to \( \infty \), we enable a full spectrum of intermediate robustness measurements that are increasingly strict by placing more weight on high loss values. This formulation allows us to evaluate a model’s robustness in a wide range between the two extremes of average and worst case performance.</p>
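The finite-sample version of this quantity is straightforward: given loss values on sampled perturbations, take the empirical \( q \)-norm of the batch. A minimal sketch (the loss values are made up purely to show how increasing \( q \) shifts weight toward the largest loss):

```python
import numpy as np

def q_norm(losses, q):
    """Empirical q-norm of a batch of loss values: (mean |l|^q)^(1/q).
    q=1 recovers the average loss; large q approaches the maximum loss."""
    losses = np.asarray(losses, dtype=float)
    return np.mean(np.abs(losses) ** q) ** (1.0 / q)

losses = [0.1, 0.2, 0.3, 5.0]
q_norm(losses, 1)    # the plain average, 1.4
q_norm(losses, 100)  # ≈ 4.93, close to the maximum loss of 5.0
```

As \( q \) grows, the single large loss of 5.0 dominates the norm, so one rare bad perturbation moves the metric in a way it cannot move the average.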
<h2 id="approximating-the-metric-using-path-sampling">Approximating the metric using path sampling</h2>
<p>In most cases, we have to approximate the metric we just defined, which cannot be calculated exactly because it requires computing a high-dimensional integral over the perturbation space. Ultimately, we approximate the integral using the path sampling method [Gelman and Meng, 1998], but to motivate why this is important, we’ll first give an example of a naive, yet suboptimal, way of estimating the integral.</p>
<h3 id="monte-carlo-estimator">Monte Carlo estimator</h3>
<p>For illustration purposes, let’s consider approximating the integral \( \int_a^b f(x)^q \mu(x)dx \) for an arbitrary function \( f \) and probability density \( \mu \). Recall that the integral of a function can be interpreted as the area below the function’s curve. We could pick a random sample \( x \), evaluate the function \( f(x)^q \) at \( x \), and multiply by \( (b-a) \) to estimate the area. Using just one sample, this will likely underestimate or overestimate the area; but if we instead average the estimates from many samples, the result should eventually converge to something close to the desired integral. This is known as the Monte Carlo estimator, and can be visualized in the plot below for the function \( f(x)^q \) with \( q = 1 \).</p>
<img src="./integral1.png" width="400">
<p>Now let’s see what this plot looks like for \( q=2 \). We see that values of \( x \) for which the value \( f(x) \) is large make a larger contribution to the integral. However, because these values of \( x \) have a lower probability of being sampled, random sampling places a disproportionate amount of weight on estimates from \( x \) with lower values of \( f(x) \).</p>
<img src="./integral2.png" width="400">
<p>As we continue to increase the value of \( q \), as shown in the plot below for \( q=3 \), we can see that Monte Carlo sampling will be increasingly insufficient to approximate this integral well.</p>
<img src="./integral3.png" width="400">
<p>Translating this back to our integral of interest, when the perturbation density is concentrated in a region with low loss values, the Monte Carlo estimator will be less capable of producing an accurate approximation of the integral when we want to evaluate intermediate robustness for larger values of \( q \).</p>
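This failure mode can be reproduced numerically. The sketch below uses a hypothetical integrand \( f(x) = e^x \) with \( \mu = \mathcal{N}(0,1) \), chosen only because the exact value \( \mathbf{E}[e^{qX}] = e^{q^2/2} \) is known in closed form, so the estimator's behavior can be checked directly:

```python
import numpy as np

def mc_estimate(f, sample_mu, q, n, seed=0):
    """Plain Monte Carlo estimate of E_mu[f(x)^q].
    Unbiased, but its variance explodes as q grows, because most of the
    integral mass comes from rarely sampled x with large f(x)."""
    rng = np.random.default_rng(seed)
    x = sample_mu(rng, n)
    return np.mean(f(x) ** q)

f = np.exp
sample_mu = lambda rng, n: rng.normal(0.0, 1.0, size=n)

est_q1 = mc_estimate(f, sample_mu, q=1, n=100_000)  # exact value: e^0.5 ≈ 1.65
est_q5 = mc_estimate(f, sample_mu, q=5, n=100_000)  # exact value: e^12.5 ≈ 2.7e5,
# which the plain estimate typically falls far short of
```

For \( q=1 \) the estimate is reliably close to the truth; for \( q=5 \) the integral is dominated by samples several standard deviations into the tail, which \( 10^5 \) draws almost never hit.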
<h3 id="path-sampling-estimator">Path sampling estimator</h3>
<p>To better approximate the integral for large values of \( q \), we need to more frequently sample perturbations that contribute most to the integral (i.e., those that result in higher loss values). Path sampling is one method that boosts the frequency of these more “important” samples by sampling from a “path” of alternative densities that encourages samples where the integrand is large. </p>
<p>The path sampling estimator of the intermediate robustness metric ultimately takes the form of the geometric mean of the losses given the sampled perturbations from these alternative densities, which are annealed to follow an increasingly “peaked” distribution. Practically, these samples can be drawn using Markov chain Monte Carlo (MCMC) methods. The path sampling estimator is written more formally below:</p>
<blockquote>
<p>Consider the following class of densities,
$$ p(\delta|t) \propto \ell(h_\theta(x+\delta),y)^t \mu(\delta),$$
and construct a path by interpolating \( t^{(i)} \) between 0 and \( q \) and sampling a perturbation \( \delta^{(i)} \) from \( p(\delta|t^{(i)}) \) using MCMC. Then, the path sampling estimator of the intermediate robustness metric is the following geometric mean,
$$ \hat{Z}_\text{Path sampling} := \Big( \prod_{i=1}^m \ell \big( h_\theta(x+\delta^{(i)}), y \big) \Big)^{1/m}.$$</p>
</blockquote>
<h2 id="evaluating-the-intermediate-robustness-of-an-image-classifier">Evaluating the intermediate robustness of an image classifier</h2>
<p>Now that we have introduced a metric for evaluating the intermediate robustness of a model, along with methods for approximating this metric, let’s evaluate the performance of a model at different robustness levels. We’ll see that the intermediate robustness metric interpolates between measurements of average and the worst-case robustness, providing a multitude of additional ways in which we can measure a model’s robustness, and we’ll empirically show the advantage of the path sampling estimator over the Monte Carlo estimator.</p>
<p>Because it is a setting commonly considered in the adversarial (worst-case) robustness literature, consider evaluating the robustness of an image classifier to perturbations \( \delta \) uniformly distributed within the \( \ell_\infty \)-norm ball with radius \( \epsilon \) (i.e. each component of \( \delta \) is uniformly distributed between \( [-\epsilon, \epsilon] \)).</p>
<p>In the figure below, we plot the test-time performance of an image classifier, trained on the CIFAR-10 dataset, using our intermediate robustness metric for different values of \( q \).</p>
<img src="./interpolating.jpeg" width="500">
<p>This figure shows that our proposed intermediate robustness metric does indeed capture the gap between the two existing robustness metrics, effectively interpolating between average-case robustness (\( q=1 \)) and worst-case (adversarial) robustness measurements when increasing the value of \( q \) from left to right.</p>
<p>We can also compare the Monte Carlo and path sampling estimators for different values of \( q \). While both approximation methods produce similar estimates for \( q=1 \), for larger values of \( q \) path sampling yields a higher, more accurate estimate of the intermediate robustness metric, approaching the adversarial loss more closely than Monte Carlo sampling does.</p>
<p>The benefit of the path sampling estimator can be further shown in the figure below, which plots the convergence of the Monte Carlo sampling and path sampling estimates of the intermediate robustness metric given an increasing number of samples.</p>
<table><thead><tr><th align="center">Convergence with \( q=1 \)</th><th align="center">Convergence with \( q=100 \)</th></tr></thead><tbody>
<tr><td align="center"><img src="./convergence-q=1.jpeg" alt="q=1" /></td><td align="center"><img src="./convergence-q=100.jpeg" alt="q=100" /></td></tr>
</tbody></table>
<p>Again, when approximating the robustness metric for \( q=1 \), shown on the left, both estimators converge to the same value with relatively few iterations. However, when approximating the intermediate robustness metric for \( q=100 \), shown on the right, the Monte Carlo sampler results in estimates that are consistently lower and less accurate than those of path sampling, even with a large number of samples. </p>
<h2 id="training-for-different-levels-of-robustness">Training for different levels of robustness</h2>
<p>We can also <em>train</em> machine learning models according to specific levels of robustness by choosing a value of \( q \) and minimizing the intermediate robustness objective. However, training intermediately robust models is computationally challenging because a non-trivial number of perturbation samples is needed to accurately estimate the robustness objective, even when using the path sampling method. While evaluating a model simply requires one pass over the test dataset, training requires multiple passes over the training dataset, resulting in an extremely expensive procedure that effectively multiplies the dataset size by the number of perturbation samples.</p>
<p>Due to this computational complexity, we train an image classifier on the simpler MNIST dataset (using the same perturbation set) to minimize the intermediate robustness objective for different values of \( q \) (approximated using path sampling). We train one model with \( q=1 \), which is just like training with data augmentation (training on randomly sampled perturbations), and we train one model with \( q=100 \), which is somewhere in between training with data augmentation and adversarial training (training on worst-case perturbations).</p>
<p>We evaluate the intermediate and adversarial robustness of each of the final trained models, the results of which can be seen in the figure below.</p>
<table><thead><tr><th align="center">Training with \( q=1 \)</th><th align="center">Training with \( q=100 \)</th></tr></thead><tbody>
<tr><td align="center"><img src="./train_q1.png" alt="q=1" /></td><td align="center"><img src="./train_q100.png" alt="q=100" /></td></tr>
</tbody></table>
<p>While the model trained with \( q=1 \), shown on the left, and the model trained with \( q=100 \), shown on the right, have similar performance when evaluated at less strict robustness levels, \( q=1 \) and \( q=10 \), the model trained with \( q=100 \) is much more robust when comparing the stricter \( q=1000 \) and adversarial robustness measurements.</p>
<p>Ultimately, the main takeaway from training using the proposed intermediate robustness objective is that the choice of \( q \) allows for fine-grained control over the desired level of robustness, rather than being restricted to average-case or worst-case objectives.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We’ve introduced a new robustness metric that allows for evaluating a machine learning model’s intermediate robustness, bridging the gap between evaluating robustness to random perturbations and robustness to worst-case perturbations. This intermediate robustness metric generalizes average-case and worst-case notions of robustness under the same framework as functional \( q \)-norms of the loss function over the perturbation distribution. We introduced a method for approximating this metric using path sampling, which results in a more accurate estimate of the metric compared to naive Monte Carlo sampling when evaluating at robustness levels approaching adversarial robustness. Empirically we showed that by evaluating an image classifier on additive noise perturbations, the proposed intermediate robustness metric enables a broader spectrum of robustness measurements, between the least strict notion of average performance on random perturbations and the most conservative notion of adversarial robustness. Finally, we highlighted the potential ability to train for specific levels of robustness using intermediate-\( q \) robustness as a training objective. For additional details, see our paper <a rel="noopener" target="_blank" href="https://proceedings.neurips.cc/paper/2021/file/ea4c796cccfc3899b5f9ae2874237c20-Paper.pdf">here</a> and code <a rel="noopener" target="_blank" href="https://github.com/locuslab/intermediate_robustness">here</a>.</p>
<h2 id="references">References</h2>
<p>Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.</p>
<p>Charles H. Bennett. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245–268, 1976.</p>
<p>Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, pages 831–860, 1996.</p>
<p>Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This blog post is based on the NeurIPS 2021 paper titled <a rel="noopener" target="_blank" href="https://proceedings.neurips.cc/paper/2021/file/ea4c796cccfc3899b5f9ae2874237c20-Paper.pdf">Robustness between the worst and average case</a>, which was joint work with <a rel="noopener" target="_blank" href="https://annaebair.github.io/">Anna Bair</a>, <a rel="noopener" target="_blank" href="https://www.huan-zhang.com/">Huan Zhang</a>, and <a rel="noopener" target="_blank" href="http://zicokolter.com/">Zico Kolter</a>. This work was supported by a grant from the Bosch Center for Intelligence.</p>
<h1>Classification with Strategically Withheld Data</h1>
<p><em>Published 2023-02-21 · <a href="https://www.cs.cmu.edu/~csd-phd-blog/2023/withheld/">https://www.cs.cmu.edu/~csd-phd-blog/2023/withheld/</a></em></p>
<p><em>This blog post is based on a <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2012.10203.pdf">research paper</a> with the same title, authored by Anilesh Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent Conitzer, published at AAAI 2021.</em></p>
<p><em>TL;DR: We investigate a classification problem where each data point being classified is controlled by an agent who has its own goals or incentives, and may strategically withhold certain features in order to game the classifier and get a more desirable label. We use (an oversimplified version of) college admissions as a running example to illustrate how traditional methods may fail in such settings, as well as how insights from the economic field of mechanism design may help. We then demonstrate a principled method — Incentive-Compatible Logistic Regression — for classification problems with strategically withheld features, which achieves remarkable empirical performance on credit approval data.</em></p>
<p>Applicants to most colleges in the US are required to submit their scores for at least one of the SAT and the ACT.
Applicants usually take one of these two tests — <a rel="noopener" target="_blank" href="https://www.princetonreview.com/college/sat-act">whichever works to their advantage</a>.
However, given the growing competitiveness of college admissions, many applicants now take both tests and then strategically decide whether to <a rel="noopener" target="_blank" href="https://blog.collegevine.com/should-you-submit-your-sat-act-scores/">drop one of the scores</a> (if they think it will hurt their application) or report both.
The key issue here is that it is impossible to distinguish between an applicant who takes both tests but reports only one, and an applicant who takes only one test — for example, because the applicant simply took the one required by their school, the dates for the other test did not work with their schedule, or for other reasons that are not strategic in nature.
Such ambiguity makes it harder for colleges to accurately evaluate applicants, especially since colleges now increasingly <a rel="noopener" target="_blank" href="https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong">rely on machine learning techniques to help make admissions decisions</a>.</p>
<h2 id="what-can-go-wrong">What Can Go Wrong?</h2>
<p>Consider the following simplified scenario: each applicant may naturally (i.e., before they strategically drop one of the scores) have an SAT score, an ACT score, or both.
We also assume these scores are normalized, so they become real numbers between 0 and 1.
Suppose the true competitiveness of an applicant is the average of the scores they naturally have — that is, if an applicant naturally has only one score, then that score is their true competitiveness; if an applicant naturally has both scores, then their true competitiveness is the average of the two scores.
We will use this setup as our running example from now on.
We will not try to “solve” this example problem (later we will see that in some cases, there is no satisfactory solution to the problem), but rather, we will use the example to illustrate the limitations of some classical methods, as well as to motivate the more principled method that we propose.</p>
<p>Now a college wishes to assess each applicant’s competitiveness based on the scores, and admit all applicants whose competitiveness is at least 0.5 (or some threshold chosen by the college).
Assuming all applicants report all scores they naturally have, it is easy to make admissions decisions: the college simply computes each applicant’s average score, and admits that applicant if the average is at least 0.5.
In other words, the college implements a simple <strong>classifier</strong>, which assigns any applicant <strong>label “admitted”</strong> if the average value of their <strong><em>natural</em> features</strong> is at least 0.5.</p>
<p>However, the simple classifier has its problems: after it is used for admissions for a couple of years, applicants may eventually figure out how it works (for example, by talking to past applicants and researching their test scores and application results).
Once applicants know how the decisions are made, they can easily game the system by strategically withholding information.
Consider, for example, an applicant with an SAT score of 0.6 and an ACT score of 0.2.
The applicant would normally be rejected since their true competitiveness is 0.4, which is smaller than the classifier’s threshold, 0.5.
However, knowing how the classifier works, the applicant can withhold the ACT score and report the SAT score only to the college.
Then the classifier would mistakenly believe that the applicant’s competitiveness is 0.6, and admit the applicant.
As a result, the classifier is not accurate anymore when applicants act strategically and try to game it.</p>
<h2 id="how-can-we-fix-it">(How) Can We Fix It?</h2>
<p>Taking into consideration the fact that applicants will eventually figure out how decisions are made, and in response to that, withhold scores strategically to maximize their chances, is it still possible for the college to admit exactly those applicants that the college wants to admit?
The answer is — perhaps not so surprisingly — it <em>depends on the <strong>distribution</strong> of applicants</em>, including how often each score is missing, as well as how the two scores correlate.
To illustrate this dependence, below we discuss two extreme cases.</p>
<p><img src="./examples.png" alt="two extreme cases" /></p>
<p>In one extreme case (illustrated in the left of the figure), every applicant naturally has both scores and the college knows that.
Then, the college’s problem is again simple: the college admits an applicant if and only if that applicant reports both scores, and the average of the two scores is at least 0.5.
This ensures that no applicant would want to withhold a score, because that would lead to automatic rejection.
Moreover, no applicant would be mistakenly rejected because they cannot report both scores, since everyone naturally has both scores.</p>
<p>In another extreme case (illustrated in the right of the figure), there are only two types of applicants: a type-1 applicant naturally has an SAT score of 0.6 and does not have an ACT score; a type-2 applicant naturally has an SAT score of 0.6 and an ACT score of 0.2.
Ideally, the college would like to admit all type-1 applicants (because their competitiveness is 0.6), and reject all type-2 applicants (because their competitiveness is 0.4).
However, this is impossible once applicants respond strategically to the college’s classifier.
For example, if the college admits all type-1 applicants whose SAT score is 0.6 and ACT score is missing, then a type-2 applicant would pretend to be a type-1 applicant by withholding their ACT score, and get admitted too.
On the other hand, to prevent type-2 applicants from getting in by pretending to be type-1 applicants, the college would have to reject all type-1 applicants too, eventually admitting no one.</p>
<h2 id="a-principled-approach-via-mechanism-design">A Principled Approach via Mechanism Design</h2>
<p>The above discussion highlights one fact: when applicants respond strategically, the optimal classifier must depend on the distribution of applicants, even if the college’s criteria for admissions stay the same, and there are no restrictions whatsoever on how many applicants can be admitted.
This is reminiscent of problems in <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Mechanism_design">mechanism design</a>.
In a mechanism design problem, a <strong>principal</strong> designs and commits to a decision rule, or a <strong>mechanism</strong> — in the admissions problem discussed above, the principal is the college, and the decision rule is the classifier used for admissions.
Self-interested <strong>agents</strong> (e.g., applicants) then respond to this rule by reporting (possibly nontruthfully) their private information (e.g., their test scores) to the principal.
The mechanism then chooses an <strong>outcome</strong> (e.g., admissions decisions) based on the reported information.
Taking the agents’ strategic behavior into consideration, the principal aims to design a mechanism to maximize their own <strong>utility</strong> (e.g., accuracy of admissions decisions), which generally depends on both the outcome and the agents’ private information.
In fact, in our running example, the college’s problem can be cast directly as a mechanism design problem.
Below we will see how tools from mechanism design can help in solving the college’s classification problem.</p>
<h3 id="incentive-compatibility-and-the-revelation-principle">Incentive Compatibility and the Revelation Principle</h3>
<p>A key notion in mechanism design is <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Incentive_compatibility">incentive compatibility</a>: a mechanism is incentive-compatible if it is always in the agents’ best interest to truthfully report their private information.
Applied to our running example, incentive compatibility means that applicants would never want to withhold a test score that they naturally have.
One reason that incentive compatibility is so important in mechanism design is that it is often <em>without loss of generality</em>: if there is no restriction on the ways in which an agent can (mis)report their private information, then for any (possibly not incentive-compatible) mechanism, there always exists an “incentive-compatible version” of that mechanism which achieves the same effects.
This is famously known as the <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Revelation_principle">revelation principle</a>.
The reason that the revelation principle holds is simple: the principal can adapt any mechanism into an incentive-compatible one by “misreporting for” the agents, in the exact way that the agents would misreport in response to the original mechanism.
We show that a variant of the revelation principle applies to the college’s classification problem (and more generally, to all classification problems with strategically withheld features).
This greatly simplifies the problem, because without loss of generality, we only need to consider classifiers under which applicants have no incentive to withhold any score.
This effectively removes the strategic aspect and leaves a clean classification problem.</p>
<h3 id="incentive-compatible-logistic-regression">Incentive-Compatible Logistic Regression</h3>
<p>Given the revelation principle, we propose a principled method, <strong>incentive-compatible logistic regression</strong>, for classification problems with strategically withheld data.
The idea is simple: we run the classical gradient-based algorithm for logistic regression, <em>but with the search space restricted to classifiers that are incentive-compatible</em>.
The college can then use the resulting model to classify applicants in an incentive-compatible way.
We will see below how this can be done by adding a projection step to the region of incentive-compatible classifiers, after each gradient step.</p>
<p>Recall that in logistic regression, the goal is to learn a set of coefficients \({\beta_i}\), one for each feature \(i\), as well as an intercept \(\beta_0\), such that for each data point \((x, y)\), the predicted label \(\hat{y}\) given by
\[
\hat{y} = \mathbb{I}\left[\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) \ge 0.5\right]
\]
fits the true label \(y\) as well as possible.
Here, \(\sigma\) is the logistic function, defined as
\[
\sigma(t) = 1 / (1 + e^{-t}).
\]
Mapping these notions back to our running example, each data point \((x, y)\) corresponds to an applicant, where each feature \(x_i\) is one of the two scores, and the true label \(y\) is \(1\) (corresponding to “admitted”) if the applicant’s true competitiveness is at least the college’s desired threshold, and \(0\) (corresponding to “rejected”) otherwise.
The classifier computes a predicted label \(\hat{y}\) for each data point, which is the admissions decision for that specific applicant.
Naturally, the college wants \(\hat{y}\) to fit \(y\) as well as possible.</p>
<p>It turns out there is a simple condition for the classifier of the above form to be incentive-compatible.
Without loss of generality, suppose each feature \(x_i\) is always nonnegative.
This is naturally true in our running example, since each feature is a score between \(0\) and \(1\); in general, we can shift the features if they are not nonnegative.
Moreover, if a feature is missing in a data point, then we simply treat that feature as \(0\).
Then a classifier induced by \({\beta_i}\) is incentive-compatible if and only if each \(\beta_i\) is nonnegative.
This is because if some \(\beta_i < 0\), then a data point with feature \(x_i > 0\) will be able to increase their score, \(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right)\), by withholding feature \(x_i\).
Depending on the values of other features, this will sometimes change the predicted label of that data point from \(0\) (i.e., rejected) to \(1\) (i.e., admitted).
In other words, such a classifier cannot be incentive-compatible.
On the other hand, if each \(\beta_i\) is nonnegative, then for any data point, withholding a feature \(x_i\) can never increase the score, so there is no incentive to withhold any feature.</p>
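<p>As a quick numerical sanity check of this characterization (the coefficients below are made up purely for illustration), withholding a negatively weighted feature strictly increases the score:</p>

```python
import math

def sigma(t):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients with a negative weight on the second feature.
beta0, beta = 0.0, [1.0, -1.0]
x = [0.6, 0.2]

score_full = sigma(beta0 + sum(xi * bi for xi, bi in zip(x, beta)))
score_withheld = sigma(beta0 + x[0] * beta[0])  # withhold x_2 (treated as 0)
print(score_withheld > score_full)  # True: withholding raised the score
```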
<p>Given the above characterization, we can simply adapt the gradient-based algorithm for (unconstrained) logistic regression to find a good incentive-compatible classifier.
We initialize the classifier arbitrarily, and repeat the following steps for each data point \((x, y)\) until convergence:</p>
<ul>
<li>
<p><strong>The gradient step</strong>: Let
\[
\beta_0 \gets \beta_0 - \eta_t \cdot \left(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) - y\right).
\]
For each feature \(i\), let
\[
\beta_i \gets \beta_i - \eta_t \cdot \left(\sigma\left(\beta_0 + \sum_i x_i \cdot \beta_i\right) - y\right) \cdot x_i.
\]
Here, \(\eta_t\) is the learning rate in step \(t\).
This rate normally decreases in \(t\), e.g., \(\eta_t = 1 / \sqrt{t}\).</p>
</li>
<li>
<p><strong>The projection step</strong>: For each feature \(i\), let
\[
\beta_i \gets \max\{0, \beta_i\}.
\]</p>
</li>
</ul>
<p>This can be viewed as an instantiation of the projected gradient descent algorithm: the gradient step is exactly the same as in (unconstrained) logistic regression, and the projection step ensures that the incentive-compatibility constraint is satisfied.</p>
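<p>The two steps above translate directly into code. Below is a minimal pure-Python sketch (the hyperparameters, data encoding, and toy training set are my own choices, not from the paper):</p>

```python
import math
import random

def sigma(t):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def ic_logistic_regression(data, n_features, epochs=300, seed=0):
    """Incentive-compatible logistic regression via projected SGD.

    `data` is a list of (x, y) pairs, where x is a list of nonnegative
    feature values (a withheld/missing feature is encoded as 0) and y is
    0 or 1. Clipping each beta_i to be nonnegative after every gradient
    step keeps the classifier incentive-compatible.
    """
    rng = random.Random(seed)
    beta0, beta, t = 0.0, [0.0] * n_features, 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / math.sqrt(t)  # decreasing learning rate
            err = sigma(beta0 + sum(xi * bi for xi, bi in zip(x, beta))) - y
            beta0 -= eta * err                     # gradient step (intercept)
            beta = [max(0.0, bi - eta * err * xi)  # gradient step + projection
                    for bi, xi in zip(beta, x)]
    return beta0, beta

# Toy training set: label 1 iff the average of the two scores is >= 0.5.
data = [([a / 10, b / 10], int(a + b >= 10))
        for a in range(11) for b in range(11)]
beta0, beta = ic_logistic_regression(data, n_features=2)
```

The learned weights are nonnegative by construction, so no data point can gain by withholding a feature.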
<p>Coming back to our running example, incentive-compatible logistic regression will assign a nonnegative weight to each test score and admit an applicant if the weighted sum of the two scores exceeds some threshold. Note that this does not “solve” the college’s problem in all cases: for example, between the two extreme cases discussed above, incentive-compatible logistic regression would work very well in the first case, but in the second case its performance would not be practically meaningful, simply because the second case is intrinsically hard and no classifier can achieve a reasonable accuracy there.</p>
<h2 id="experimental-results">Experimental Results</h2>
<p>We empirically evaluate incentive-compatible logistic regression on 4 real-world credit approval datasets from the <a rel="noopener" target="_blank" href="http://archive.ics.uci.edu/ml/index.php">UCI ML Repository</a>, based on historical data collected in Australia, Germany, Poland, and Taiwan.
Each data point in each dataset corresponds to a single credit application, with tens of features (3 datasets provide 15-23 features, and the other provides 64), including annual income, employment status, current balance in savings account, etc.
Each data point has a binary label, which is either “approve” (i.e., 1) or “reject” (i.e., 0).
We preprocess the datasets by randomly dropping some features for each data point, thus simulating naturally missing features.
We consider two ways of reporting in our evaluation:</p>
<ul>
<li>
<p><strong>Truthful reporting</strong>: Each data point always reveals all features it naturally has to the classifier.
This is the assumption made by the baseline methods, which we compare against in our evaluation.</p>
</li>
<li>
<p><strong>Strategic reporting</strong>: In response to the classifier, each data point optimally withholds a subset of features to maximize the chance of getting approved (i.e., label 1).
For incentive-compatible logistic regression, strategic reporting is equivalent to truthful reporting.
However, as we will see, the baseline methods perform significantly worse with strategic reporting (which is natural, since they were not designed to be robust against strategic manipulation).</p>
</li>
</ul>
<p>As for the baseline methods, we compare against <strong>logistic regression</strong> (without incentive-compatibility), <strong>neural networks</strong>, and <strong>random forests</strong>.
These are the most popular and accurate methods in credit approval applications.
For more details of the experiments, please see Section 6 of <a rel="noopener" target="_blank" href="https://arxiv.org/pdf/2012.10203.pdf">our paper</a>.</p>
<p>The accuracy of each classifier tested on each dataset can be found in the table below.
Note that there are two numbers in each cell: the left one corresponds to the accuracy under truthful reporting, and the right one corresponds to the accuracy under strategic reporting.</p>
<table><thead><tr><th>Classifier</th><th>Australia</th><th>Germany</th><th>Poland</th><th>Taiwan</th></tr></thead><tbody>
<tr><td>Incentive-compatible logistic regression</td><td><strong>0.800</strong> / <strong>0.800</strong></td><td>0.651 / <strong>0.651</strong></td><td>0.698 / <strong>0.698</strong></td><td>0.646 / <strong>0.646</strong></td></tr>
<tr><td>Logistic regression (baseline)</td><td><strong>0.800</strong> / 0.763</td><td><strong>0.652</strong> / 0.580</td><td>0.714 / 0.660</td><td>0.670 / 0.618</td></tr>
<tr><td>Artificial neural networks (baseline)</td><td><strong>0.800</strong> / 0.747</td><td><strong>0.652</strong> / 0.580</td><td><strong>0.719</strong> / 0.636</td><td><strong>0.688</strong> / 0.543</td></tr>
<tr><td>Random forest (baseline)</td><td>0.797 / 0.541</td><td>0.633 / 0.516</td><td>0.709 / 0.522</td><td>0.684 / 0.588</td></tr>
</tbody></table>
<p>Here we make two observations:</p>
<ul>
<li>Under strategic reporting, incentive-compatible logistic regression is consistently much more accurate than all 3 baseline methods.
This highlights the importance of robustness against strategic manipulation by design.</li>
<li>The accuracy of incentive-compatible logistic regression under strategic reporting is often comparable to that of the baseline methods under truthful reporting.
In other words, although strategic manipulation poses challenges in the design of a good classifier, from an information-theoretic perspective, the classification problem does not become much harder.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>We study the problem of classification when each data point can strategically withhold some of its features to obtain a more favorable outcome.
We propose a principled classification method, incentive-compatible logistic regression, which is robust to strategic manipulation.
The new method is tested on real-world datasets, showing that it outperforms out-of-the-box methods that do not account for strategic behavior.
More generally, we draw connections between strategic classification and mechanism design, which may inspire future work in other strategic classification settings.</p>
Designing Data Structures for Collaborative Apps2023-02-17T00:00:00+00:002023-02-17T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2023/collaborative-data-design/<h1 id="introduction-collaborative-apps-via-crdts">Introduction: Collaborative Apps via CRDTs</h1>
<blockquote>
<p>An extended version of this post appears <a rel="noopener" target="_blank" href="https://mattweidner.com/2022/02/10/collaborative-data-design.html">on my personal site</a>.</p>
</blockquote>
<p></p><br />
<p>Suppose you’re building a collaborative app, along the lines of Google Docs/Sheets/Slides, Figma, Notion, etc., but <em>without a central server</em>. One challenge you’ll face is the actual collaboration: when one user changes the shared state, their changes need to show up for every other user. For example, if multiple users type at the same time in a text field, the result should reflect all of their changes and be consistent (identical for all users).</p>
<p><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type"><strong>Conflict-free Replicated Data Types (CRDTs)</strong></a> provide a solution to this challenge. They are data structures that look like ordinary data structures (maps, sets, text strings, etc.), except that they are collaborative: when one user updates their copy of a CRDT, their changes automatically show up for everyone else. Each user sees their own changes immediately, while under the hood, the CRDT broadcasts a message describing the change to everyone else. Other users see the change once they receive this message.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/message_sending.png" alt="CRDTs broadcast messages to relay changes" /></p>
<p><a name="correctness"></a>Note that multiple users might make changes at the same time, e.g., both typing at once. Since each user sees their own changes immediately, their views of the document will temporarily diverge. However, CRDTs guarantee that once the users receive each others’ messages, they’ll see identical document states again: this is the definition of <strong>CRDT correctness</strong>. Ideally, this state will also be “reasonable”, i.e., it will incorporate both of their edits in the way that the users expect.</p>
<blockquote>
<p>In distributed systems terms, CRDTs are <em>Available</em>, <em>Partition tolerant</em>, and have <em>Strong Eventual Consistency</em>.</p>
</blockquote>
<p></p><br />
<p>CRDTs work even if messages might be arbitrarily delayed, or delivered to different users in different orders. This lets you make collaborative experiences that don’t need a central server, work offline, and/or are end-to-end encrypted (<a rel="noopener" target="_blank" href="https://www.inkandswitch.com/local-first/"><strong>local-first software</strong></a>).</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/google_docs_offline.png" alt="Google Docs doesn’t let you type while offline" /></p>
<p align="center"><i>CRDTs allow offline editing, unlike Google Docs.</i></p>
<p>I’m particularly excited by the potential for <strong>open-source collaborative apps</strong> that anyone can distribute or modify, without requiring app-specific hosting.</p>
<h2 id="the-challenge-designing-crdts">The Challenge: Designing CRDTs</h2>
<p>Having read all that, let’s say you choose to use a CRDT for your collaborative app. All you need is a CRDT representing your app’s state, a frontend UI, and a network of your choice (or a way for users to pick the network themselves). But where do you get a CRDT for your specific app?</p>
<p>If you’re lucky, it’s described in a <a rel="noopener" target="_blank" href="https://crdt.tech/papers.html">paper</a>, or even better, implemented in a <a rel="noopener" target="_blank" href="https://crdt.tech/implementations">library</a>. But those tend to have simple or one-size-fits-all data structures: maps, text strings, unstructured JSON, etc. You can usually rearrange your app’s state to make it fit in these CRDTs; and if users make changes at the same time, CRDT correctness guarantees that you’ll get <em>some</em> consistent result. However, it might not be what you or your users expect. Worse, you have little leeway to customize this behavior.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/json_anomaly.png" alt="Anomaly in a published JSON CRDT: In a collaborative todo-list, concurrently deleting an item and marking it done results in a nonsense list item with no text field." /></p>
<p align="center"><i>
In a <a href="https://doi.org/10.1109/TPDS.2017.2697382" target="_blank">published JSON CRDT</a>, when representing a todo-list using items with "title" and "done" fields, you can end up with an item <code>{"done": true}</code> having no "title" field. Image credit: Figure 6 from the paper.
</i></p>
<!--figure: hypothetical user Q&A asking for a change in the conflict-resolution, and you just reply "sorry".-->
<p>This blog post will instead teach you how to design CRDTs from the ground up. I’ll present a few simple CRDTs that are obviously correct, plus ways to compose them together into complicated whole-app CRDTs that are still obviously correct. I’ll also present principles of CRDT design to help guide you through the process. To cap it off, we’ll design a CRDT for a collaborative spreadsheet.</p>
<p><strong>Ultimately, I hope that you will gain not just an understanding of some existing CRDT designs, but also the confidence to tweak them and create your own!</strong></p>
<h2 id="related-work">Related Work</h2>
<p>The CRDTs I describe are based on <a rel="noopener" target="_blank" href="http://dx.doi.org/10.1007/978-3-642-24550-3_29">Shapiro et al. 2011</a> unless noted otherwise. However, the way I describe them, and the design principles and composition techniques, are my own way of thinking about CRDT design. It’s inspired by the way <a rel="noopener" target="_blank" href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/">Figma</a> and <a rel="noopener" target="_blank" href="https://hex.tech/blog/a-pragmatic-approach-to-live-collaboration">Hex</a> describe their collaboration platforms; they likewise support complex apps by composing simple, easy-to-reason-about pieces. Relative to those platforms, I incorporate more academic CRDT designs, enabling more flexible behavior and server-free operation.</p>
<!--I'll describe most CRDTs in terms of an implementation, because I find implementations easier to explain. However, my real goal is to describe their *semantics*: what users see after they perform various operations, possibly concurrently. If you can find alternate implementations that have the same behavior as the ones I describe but are more efficient, then by all means, use those instead. -->
<h1 id="basic-designs">Basic Designs</h1>
<p>I’ll start by going over some basic CRDT designs.</p>
<h2 id="unique-set">Unique Set</h2>
<p>Our foundational CRDT is the <strong>Unique Set</strong>. It is a set in which each added element is considered unique.</p>
<p>Formally, the user-facing operations on the set, and their collaborative implementations, are as follows:</p>
<ul>
<li><code>add(x)</code>: Adds an element <code>e = (t, x)</code> to the set, where <code>t</code> is a <em>unique new tag</em>, used to ensure that <code>(t, x)</code> is unique. To implement this, the adding user generates <code>t</code>, e.g., as a pair (device id, device-specific counter), then serializes <code>(t, x)</code> and broadcasts it to the other users. The receivers deserialize <code>(t, x)</code> and add it to their local copy of the set.</li>
<li><code>delete(e)</code>: Deletes the element <code>e = (t, x)</code> from the set. To implement this, the deleting user serializes <code>t</code> and broadcasts it to the other users. The receivers deserialize <code>t</code> and remove the element with tag <code>t</code> from their local copy, if it has not been deleted already.</li>
</ul>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/message_flow.png" alt="In response to user input, the operator calls “Output message”. The message is then delivered to every user’s “Receive & Update display” function." /></p>
<p align="center"><i>The lifecycle of an <code>add</code> or <code>delete</code> operation.</i></p>
<p>When displaying the set to the user, you ignore the tags and just list out the data values <code>x</code>, keeping in mind that (1) they are not ordered (at least not consistently across different users), and (2) there may be duplicates.</p>
<p><strong>Example:</strong> In a collaborative flash card app, you could represent the deck of cards as a Unique Set, using <code>x</code> to hold the flash card’s value (e.g., its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed. <!--Note that the collaborative state is just the *set* of cards; there is no ordering info. You could perhaps sort them alphabetically in editing mode (to make them consistent), and randomly in practice mode (deliberately inconsistent).--></p>
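<p>Here is a minimal Python sketch of a Unique Set replica (the message format and class name are my own; a real implementation would also handle serialization and the delivery rules discussed next):</p>

```python
import itertools

class UniqueSetReplica:
    """One user's copy of a Unique Set CRDT (illustration only).

    add/delete return the message to broadcast; every replica,
    including the sender, applies messages via `receive`.
    """
    def __init__(self, device_id):
        self.device_id = device_id
        self.counter = itertools.count()  # device-specific counter for tags
        self.elements = {}                # tag -> value

    def add(self, x):
        tag = (self.device_id, next(self.counter))  # unique new tag
        msg = ("add", tag, x)
        self.receive(msg)
        return msg

    def delete(self, tag):
        msg = ("delete", tag, None)
        self.receive(msg)
        return msg

    def receive(self, msg):
        op, tag, x = msg
        if op == "add":
            self.elements[tag] = x
        else:
            self.elements.pop(tag, None)  # no-op if already deleted

    def values(self):
        return sorted(self.elements.values())  # display order is arbitrary

# Two users concurrently add a duplicate flash card, then sync.
alice, bob = UniqueSetReplica("alice"), UniqueSetReplica("bob")
m1, m2 = alice.add("card"), bob.add("card")
alice.receive(m2)
bob.receive(m1)
print(alice.values() == bob.values())  # True: both keep the duplicate
```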
<p><a name="causal-order"></a>When broadcasting messages, we require that they are delivered <em>reliably</em> and <em>in causal order</em>, but it’s okay if they are arbitrarily delayed. (These rules apply to all CRDTs, not just the Unique Set.) Delivery <strong>in causal order</strong> means that if a user sends a message \(m\) after receiving or sending a message \(m^\prime\), then all users delay receiving \(m\) until after receiving \(m^\prime\). This is the strictest ordering we can implement without a central server and without extra round-trips between users, e.g., by using <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Vector_clock">vector clocks</a>.</p>
<p>Messages that aren’t ordered by the causal order are <strong>concurrent</strong>, and different users might receive them in different orders. But for <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#correctness">CRDT correctness</a>, we must ensure that all users end up in the same state regardless, once they have received the same messages.</p>
<p>For the Unique Set, it is obvious that the state of the set, as seen by a specific user, is always the set of elements for which they have received an <code>add</code> message but no <code>delete</code> messages. This holds regardless of the order in which they received concurrent messages. Thus the Unique Set is correct.</p>
<blockquote>
<p>Note that delivery in causal order is important—a <code>delete</code> operation only works if it is received after its corresponding <code>add</code> operation.</p>
</blockquote>
<p></p><br />
<p>We now have our first principle of CRDT design:</p>
<p><a name="principle-1"></a><strong>Principle 1. Use the Unique Set CRDT for operations that “add” or “create” a unique new thing.</strong></p>
<p>Although it is simple, the Unique Set forms the basis for the rest of our CRDTs.</p>
<!-- > **Aside.** Traditionally, one proves CRDT correctness by proving that concurrent messages *commute*---they have the same effect regardless of delivery order ([Shapiro et al. 2011](http://dx.doi.org/10.1007/978-3-642-24550-3_29))---or that the final state is a function of the causally-ordered message history ([Baquero, Almeida, and Shoker 2014](https://doi.org/10.1007/978-3-662-43352-2_11)). However, as long as you stick to the techniques in this blog post, you won't need explicit proofs: everything builds on the Unique Set in ways that trivially preserve CRDT correctness. For example, a deterministic view of a Unique Set is obviously still a CRDT.
<p></p><br /> -->
<h2 id="lists">Lists</h2>
<p>Our next CRDT is a <strong>list CRDT</strong>. It represents a list of elements, with <code>insert</code> and <code>delete</code> operations. For example, you can use a list CRDT of characters to store the text in a collaborative text editor, using <code>insert</code> to type a new character and <code>delete</code> for backspace.</p>
<p>Formally, the operations on a list CRDT are:</p>
<ul>
<li><code>insert(i, x)</code>: Inserts a new element with value <code>x</code> at index <code>i</code>, between the existing elements at indices <code>i</code> and <code>i+1</code>. All later elements (index <code>>= i+1</code>) are shifted one to the right.</li>
<li><code>delete(i)</code>: Deletes the element at index <code>i</code>. All later elements (index <code>>= i+1</code>) are shifted one to the left.</li>
</ul>
<p>We now need to decide on the semantics, i.e., what is the result of various insert and delete operations, possibly concurrent. The fact that insertions are unique suggests using a Unique Set (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>). However, we also have to account for indices and the list order.</p>
<p>One approach would use indices directly: when a user calls <code>insert(i, x)</code>, they send <code>(i, x)</code> to the other users, who use <code>i</code> to insert <code>x</code> at the appropriate location. The challenge is that your intended insertion index might move around as a result of users’ inserting/deleting in front of <code>i</code>.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/ot.png" alt="The gray cat jumped on the table." /></p>
<p align="center"><i>Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.</i></p>
<p>It’s possible to work around this by “transforming” <code>i</code> to account for concurrent edits. That idea leads to <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Operational_transformation"><strong>Operational Transformation (OT)</strong></a>, the earliest-invented approach to collaborative text editing, and the one used in Google Docs and most existing apps. Unfortunately, OT algorithms are quite complicated, leading to numerous <a rel="noopener" target="_blank" href="https://core.ac.uk/download/pdf/54049928.pdf">flawed algorithms</a>. You can reduce complexity by using a central server to manage the document, like Google Docs does, but that precludes decentralized networks, end-to-end encryption, and server-free open-source apps.</p>
<!--Several incorrect attempts at server-free OT were published before the [first correct one](https://core.ac.uk/download/pdf/54049928.pdf) in 2005 (cite, check correctness via citations)---the same year the [first CRDT paper](https://hal.inria.fr/inria-00071240/document) was published. -->
<p>List CRDTs use a different perspective from OT. When you type a character in a text document, you probably don’t think of its position as “index 17” or whatever; instead, its position is at a certain place within the existing text.</p>
<p>“A certain place within the existing text” is vague, but at a minimum, it should be between the characters to the left and right of your insertion point (“on” and “ table” in the example above). Also, unlike an index, this intuitive position is <em>immutable</em>.</p>
<p>This leads to the following implementation. The list’s state is a Unique Set whose values are pairs <code>(p, x)</code>, where <code>x</code> is the actual value (e.g., a character), and <code>p</code> is a <strong>unique immutable position</strong> drawn from some abstract total order. The user-visible state of the list is the list of values <code>x</code> ordered by their positions <code>p</code>. Operations are implemented as:</p>
<ul>
<li><code>insert(i, x)</code>: The inserting user looks up the positions <code>pL</code>, <code>pR</code> of the values to the left and right (indices <code>i</code> and <code>i+1</code>), generates a unique new position <code>p</code> such that <code>pL < p < pR</code>, and calls <code>add((p, x))</code> on the Unique Set. </li>
<li><code>delete(i)</code>: The deleting user finds the element <code>e</code> of the Unique Set at index <code>i</code>, then calls <code>delete(e)</code> on the Unique Set.</li>
</ul>
<p>Of course, we need a way to create the positions <code>p</code>. That’s the hard part—in fact, the hardest part of any CRDT—and I don’t have space to go into it here; you should use an existing algorithm (e.g., <a rel="noopener" target="_blank" href="http://dx.doi.org/10.1016/j.jpdc.2010.12.006">RGA</a>) or implementation (e.g., <a rel="noopener" target="_blank" href="https://docs.yjs.dev/api/shared-types/y.array">Yjs’s <code>Y.Array</code></a>). <!--Generally, solutions involve a tree, sorted by the tree walk on nodes; you create a unique new position in between `pL` and `pR` by adding a new leaf somewhere between `pL` and `pR`, e.g., as a right child of `pL`.--></p>
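<p>That said, to make the structure concrete, here is a toy Python sketch where positions are random floats between the neighbors’ positions, with the tag as a tiebreak. This naive position scheme is merely a stand-in for a real algorithm like RGA, and the indexing convention (the new element ends up at index <code>i</code>) is my own; don’t use this in production:</p>

```python
import random

class ToyListReplica:
    """Toy list CRDT: a Unique Set of (position, value) pairs, shown in
    position order. Random-float positions are for illustration only.
    """
    def __init__(self, device_id, seed=None):
        self.device_id = device_id
        self.counter = 0
        self.rng = random.Random(seed)
        self.elements = {}  # tag -> (position, value)

    def _ordered(self):
        # Sort by position; break (unlikely) float ties by tag.
        return sorted(self.elements.items(), key=lambda kv: (kv[1][0], kv[0]))

    def insert(self, i, x):
        """Insert x so it ends up at index i (left neighbor: index i-1)."""
        items = self._ordered()
        pL = items[i - 1][1][0] if i > 0 else 0.0
        pR = items[i][1][0] if i < len(items) else 1.0
        self.counter += 1
        tag = (self.device_id, self.counter)
        msg = ("add", tag, (self.rng.uniform(pL, pR), x))
        self.receive(msg)
        return msg

    def delete(self, i):
        msg = ("delete", self._ordered()[i][0], None)
        self.receive(msg)
        return msg

    def receive(self, msg):
        op, tag, payload = msg
        if op == "add":
            self.elements[tag] = payload
        else:
            self.elements.pop(tag, None)

    def values(self):
        return [x for _, (_, x) in self._ordered()]

# Alice types "cat"; Bob receives her messages, then they edit concurrently.
a, b = ToyListReplica("alice", seed=1), ToyListReplica("bob", seed=2)
for i, ch in enumerate("cat"):
    b.receive(a.insert(i, ch))
m_a = a.insert(3, "s")  # Alice appends "s"...
m_b = b.insert(0, "x")  # ...while Bob concurrently prepends "x".
a.receive(m_b)
b.receive(m_a)
print("".join(a.values()) == "".join(b.values()))  # True: replicas converge
```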
<p>The important lesson here is that we had to translate indices (the language of normal, non-CRDT lists) into unique immutable positions (what the user intuitively means when they say “insert here”). That leads to our second principle of CRDT design:</p>
<p><a name="principle-2"></a><strong>Principle 2. Express operations in terms of user intention—what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.</strong></p>
<h2 id="registers">Registers</h2>
<p>Our last basic CRDT is the <strong>register</strong>. This is a variable that holds an arbitrary value that can be set and get. If multiple users set the value at the same time, you pick one of them arbitrarily, or perhaps average them together.</p>
<p><strong>Example uses for registers:</strong></p>
<ul>
<li>The font size of a character in a collaborative rich-text editor.</li>
<li>The name of a document.</li>
<li>The color of a specific pixel in a collaborative whiteboard.</li>
<li>Basically, anything where you’re fine with users overwriting each others’ concurrent changes and you don’t want to use a more complicated CRDT.</li>
</ul>
<p>Registers are very useful and suffice for many tasks (e.g., <a rel="noopener" target="_blank" href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/">Figma</a> and <a rel="noopener" target="_blank" href="https://hex.tech/blog/a-pragmatic-approach-to-live-collaboration">Hex</a> use them almost exclusively).</p>
<p>The only operation on a register is <code>set(x)</code>, which sets the value to <code>x</code> (in the absence of concurrent operations). We can’t perform these operations literally, since if two users receive concurrent <code>set</code> operations in different orders, they’ll end up with different values.</p>
<p>However, we can <em>add</em> the value <code>x</code> to a Unique Set, following <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>. The state is now a set of values instead of a single value, but we’ll address that soon. We can also delete old values each time <code>set(x)</code> is called, overwriting them.</p>
<p>Thus the implementation of <code>set(x)</code> becomes:</p>
<ul>
<li>For each element <code>e</code> in the Unique Set, call <code>delete(e)</code> on the Unique Set; then call <code>add(x)</code>.</li>
</ul>
<p>The result is that at any time, the register’s state is the set of all the most recent concurrently-set values.</p>
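<p>As a runnable sketch (illustrative names and message format, not a real library API), the delete-then-add implementation looks like this. Note how two <code>set</code>s that haven’t seen each other delete nothing of each other’s, so both survive as conflicts:</p>

```typescript
// A register built on a Unique Set: set(x) deletes every entry the setter
// has seen, then adds x with a unique tag.
type SetMessage<T> = { deleted: string[]; id: string; value: T };

class MultiValueRegister<T> {
  private entries = new Map<string, T>(); // the Unique Set
  private counter = 0;

  // Local set: apply immediately and return the message to broadcast.
  set(x: T): SetMessage<T> {
    const msg: SetMessage<T> = {
      deleted: Array.from(this.entries.keys()), // everything we've seen
      id: `local-${this.counter++}`, // stand-in for a globally unique tag
      value: x,
    };
    this.apply(msg);
    return msg;
  }

  // Apply a local or remote set message.
  apply(msg: SetMessage<T>): void {
    for (const d of msg.deleted) this.entries.delete(d);
    this.entries.set(msg.id, msg.value);
  }

  // The conflict set: all most recent concurrently-set values.
  conflicts(): T[] {
    return Array.from(this.entries.values());
  }
}
```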
<p>Loops of the form “for each element of a collection, do something” are common in programming. We just saw a way to extend them to CRDTs: “for each element of a Unique Set, do some CRDT operation”. I call this a <strong>causal for-each operation</strong> because it only affects elements that are prior to the for-each operation in the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#causal-order">causal order</a>. It’s useful enough that we make it our next principle of CRDT design:</p>
<p><a name="principle-3a"></a><strong>Principle 3a. For operations that do something “for each” element of a collection, one option is to use a <em>causal for-each operation</em> on a Unique Set (or list CRDT).</strong></p>
<p>(Later we will expand on this with <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3b">Principle 3b</a>, which also concerns for-each operations.)</p>
<p>Returning to registers, we still need to handle the fact that our state is a set of values, instead of a specific value.</p>
<p>One option is to accept this as the state, and present all conflicting values to the user. That gives the <strong>Multi-Value Register (MVR)</strong>.</p>
<p><a name="lww-register"></a>Another option is to pick a value arbitrarily but deterministically. E.g., the <strong>Last-Writer Wins (LWW) Register</strong> tags each value with the wall-clock time when it is set, then picks the value with the latest timestamp.</p>
<p><img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/pixelpusher.png" alt="Grid of pixels, some conflicting (outlined in red). One conflicting pixel has been clicked on, revealing the conflicting choices." /></p>
<p align="center"><i>In <a href="https://medium.com/@pvh/pixelpusher-real-time-peer-to-peer-collaboration-with-react-7c7bc8ecbf74" target="_blank">Pixelpusher</a>, a collaborative pixel art editor, each pixel shows one color by default (LWW Register), but you can click to pop out all conflicting colors (MVR). Image credit: Peter van Hardenberg (<a href="https://miro.medium.com/max/270/1*tXSBtdqf6yBCO6i77VVH1A.png" target="_blank">original</a>).</i></p>
<p>In general, you can define the value getter to be an arbitrary deterministic function of the set of values.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>If the values are colors, you can average their RGB coordinates.</li>
</ul>
<!--figure: illustration-->
<ul>
<li><a name="enable-wins-flag"></a>If the values are booleans, you can choose to prefer <code>true</code> values, i.e., the register’s value is <code>true</code> if its set contains any <code>true</code> values. That gives the <strong>Enable-Wins Flag</strong>.</li>
</ul>
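<p>Each of these getters is a small pure function over the conflict set. A sketch, with hypothetical names (<code>Tagged</code> pairs each surviving value with the wall-clock time it was set):</p>

```typescript
// Deterministic getters over a register's conflict set.
type Tagged<T> = { time: number; value: T };

// Multi-Value Register: expose every conflicting value.
const mvrGet = <T>(conflicts: Tagged<T>[]): T[] =>
  conflicts.map((c) => c.value);

// LWW Register: the value with the latest timestamp wins.
const lwwGet = <T>(conflicts: Tagged<T>[]): T =>
  conflicts.reduce((a, b) => (b.time > a.time ? b : a)).value;

// Enable-Wins Flag: true if any conflicting value is true.
const enableWinsGet = (conflicts: Tagged<boolean>[]): boolean =>
  conflicts.some((c) => c.value);

// Averaging getter, e.g., for one RGB color channel.
const averageGet = (conflicts: Tagged<number>[]): number =>
  conflicts.reduce((sum, c) => sum + c.value, 0) / conflicts.length;
```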
<h1 id="composing-crdts">Composing CRDTs</h1>
<p>We now have enough basic CRDTs to start making more complicated data structures through composition. I’ll describe three techniques: CRDT objects, CRDT-valued maps, and collections of CRDTs.</p>
<h2 id="crdt-objects">CRDT Objects</h2>
<p>The simplest composition technique is to use multiple CRDTs side-by-side. By making them instance fields in a class, you obtain a <strong>CRDT Object</strong>, which is itself a CRDT (trivially correct). The power of CRDT objects comes from using standard OOP techniques, e.g., implementation hiding.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>In a collaborative flash card app, to make individual cards editable, you could represent each card as a CRDT object with two text CRDT (list CRDT of characters) instance fields, one for the front and one for the back.</li>
<li>You can represent the position and size of an image in a collaborative slide editor by using separate registers for the left, top, width, and height. <!--To get a complete image object, you might also add registers for border color/size/style, a text CRDT for the caption, a register for the image source (unless it's immutable, in which case you can use an ordinary, non-CRDT instance field), etc.--></li>
</ul>
<!--- Recall that we defined lists and registers in terms of the Unique Set. We can consider these as CRDT objects as well, even though they just have one instance field (the set). The object lets us delegate operations and reads to the inner set while exposing the API of a list/register.-->
<p>To implement a CRDT object, each time an instance field requests to broadcast a message, the CRDT object broadcasts that message tagged with the field’s name. Receivers then deliver the message to their own instance field with the same name. <!--When nesting CRDT objects, this effectively creates a tree with a [basic CRDT](#basic-designs) at each leaf; each basic CRDT message is sent tagged with its path to the root.--></p>
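<p>A runnable sketch of that routing (the <code>ChildCrdt</code> interface and the JSON wrapping are illustrative, not a real API):</p>

```typescript
// The CRDT object tags each field's broadcast with the field name,
// and dispatches received messages to the field with the same name.
interface ChildCrdt {
  receive(message: string): void;
}

class CrdtObjectRouter {
  constructor(private fields: Map<string, ChildCrdt>) {}

  // Called when the field named `name` wants to broadcast `message`.
  tag(name: string, message: string): string {
    return JSON.stringify({ name, message });
  }

  // Deliver a received message to the matching local field.
  receive(tagged: string): void {
    const { name, message } = JSON.parse(tagged);
    const field = this.fields.get(name);
    if (field) field.receive(message);
  }
}
```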
<h2 id="crdt-valued-map">CRDT-Valued Map</h2>
<p>A CRDT-valued map is like a CRDT object but with potentially infinite instance fields, one for each allowed map key. Every key/value pair is implicitly always present in the map, but values are only explicitly constructed in memory as needed, using a predefined factory method (like Apache Commons’ <a rel="noopener" target="_blank" href="https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/LazyMap.html">LazyMap</a>).</p>
<p><strong>Examples:</strong></p>
<ul>
<li><a id="add-wins-set"></a>Consider a shared notes app in which users can archive notes, then restore them later. To indicate which notes are normal (not archived), we want to store them in a set. A Unique Set won’t work, since the same note can be added (restored) multiple times. Instead, you can use a CRDT-valued map whose keys are the notes and whose values are <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#enable-wins-flag">enable-wins flags</a>; the value of the flag for key <code>note</code> indicates whether <code>note</code> is in the set. This gives the <strong>Add-Wins Set</strong>.</li>
<li><a rel="noopener" target="_blank" href="https://quilljs.com/">Quill</a> lets you easily display and edit rich text in a browser app. In a Quill document, each character has an <code>attributes</code> map, which contains arbitrary key-value pairs describing formatting (e.g., <code>"bold": true</code>). You can model this using a CRDT-valued map with arbitrary keys and <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#lww-register">LWW register</a> values; the value of the register for key <code>attr</code> indicates the current value for <code>attr</code>.</li>
</ul>
<p>A CRDT-valued map is implemented like a CRDT object: each message broadcast by a value CRDT is tagged with its serialized key. Internally, the map stores only the explicitly-constructed key-value pairs; each value is constructed using the factory method the first time it is accessed by the local user or receives a message. However, this is not visible externally—from the outside, the other values still appear present, just in their initial states. (If you want an explicit set of “present” keys, you can track them using an <a href="#add-wins-set">Add-Wins Set</a>.)</p>
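<p>The lazy construction can be sketched as follows (illustrative names, in the spirit of <code>LazyMap</code>):</p>

```typescript
// Every key implicitly maps to a value, but values are only built
// (via the factory) the first time they are accessed.
class LazyCrdtMap<K, V> {
  private values = new Map<string, V>();

  constructor(
    private factory: () => V, // builds a value in its initial state
    private serializeKey: (key: K) => string
  ) {}

  // Returns the value for `key`, constructing it on first access.
  get(key: K): V {
    const k = this.serializeKey(key);
    let value = this.values.get(k);
    if (value === undefined) {
      value = this.factory();
      this.values.set(k, value);
    }
    return value;
  }

  // Only explicitly-constructed values occupy memory.
  constructedCount(): number {
    return this.values.size;
  }
}
```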
<blockquote>
<p>CRDT-valued maps are based on the <a rel="noopener" target="_blank" href="https://docs.riak.com/riak/kv/2.2.3/learn/concepts/crdts/index.html#maps">Riak map</a>.</p>
</blockquote>
<p></p><br />
<h2 id="collections-of-crdts">Collections of CRDTs</h2>
<p>Our above definition of a Unique Set implicitly assumed that the data values <code>x</code> were immutable and serializable (capable of being sent over the network). However, we can also make a <strong>Unique Set of CRDTs</strong>, whose values are dynamically-created CRDTs.</p>
<p>To add a new value CRDT, a user sends a unique new tag and any arguments needed to construct the value. Each recipient passes those arguments to a predefined factory method, then stores the returned CRDT in their copy of the set. When a value CRDT is deleted, it is forgotten and can no longer be used.</p>
<p>Note that unlike in a CRDT-valued map, values are explicitly created (with dynamic constructor arguments) and deleted—the set effectively provides collaborative <code>new</code> and <code>free</code> operations.</p>
<p>We can likewise make a <strong>list of CRDTs</strong>.</p>
<p><strong>Examples:</strong></p>
<ul>
<li>In a shared folder containing multiple collaborative documents, you can define your document CRDT, then use a Unique Set of document CRDTs to model the whole folder. (You can also use a CRDT-valued map from names to documents, but then documents can’t be renamed, and documents “created” concurrently with the same name will end up merged.)</li>
</ul>
<!--- In a todo-list app, you can define a "todo-item CRDT" with fields `text` and `done`, giving the item text and whether it is done. The whole app's state is then a list of todo-item CRDTs.-->
<ul>
<li>Continuing the Quill rich-text example from the previous section, you can model a rich-text document as a list of “rich character CRDTs”, where each “rich character CRDT” consists of an immutable (non-CRDT) character plus the <code>attributes</code> map CRDT. This is sufficient to build <a rel="noopener" target="_blank" href="https://compoventuals-tests.herokuapp.com/host.html?network=ws&container=demos/rich-text/dist/rich_text.html">a simple Google Docs-style app with CRDTs</a> (<a rel="noopener" target="_blank" href="https://github.com/composablesys/collabs/blob/master/demos/apps/rich-text/src/rich_text.ts">source</a>).</li>
</ul>
<h2 id="using-composition">Using Composition</h2>
<p>You can use the above composition techniques and basic CRDTs to design CRDTs for many collaborative apps. Choosing the exact structure, and how operations and user-visible state map onto that structure, is the main challenge.</p>
<p>A good starting point is to design an ordinary (non-CRDT) data model, using ordinary objects, collections, etc., then convert it to a CRDT version. So variables become registers, sets become Unique Sets or Add-Wins Sets, etc. You can then tweak the design as needed to accommodate extra operations or fix weird concurrent behaviors.</p>
<p>To accommodate as many operations as possible while preserving user intention, I recommend:</p>
<p><a name="principle-4"></a><strong>Principle 4. Independent operations (in the user’s mind) should act on independent state.</strong></p>
<p><strong>Examples:</strong></p>
<ul>
<li>As mentioned earlier, you can represent the position and size of an image in a collaborative slide editor by using separate registers for the left, top, width, and height. If you wanted, you could instead use a single register whose value is a tuple (left, top, width, height), but this would violate Principle 4. Indeed, then if one user moved the image while another resized it, one of their changes would overwrite the other, instead of both moving and resizing. <!--Likewise, it would be a mistake to replace (left, top, width, height) with (left, top, right, bottom) (this also violates [Principle 2](#principle-2)).--></li>
<li>Again in a collaborative slide editor, you might initially model the slide list as a list of slide CRDTs. However, this provides no way for users to move slides around in the list, e.g., swap the order of two slides. You could implement a move operation using cut-and-paste, but then slide edits concurrent to a move will be lost, even though they are intuitively independent operations.<br />
<a name="list-with-move"></a>Following Principle 4, you should instead implement move operations by modifying some state independent of the slide itself. You can do this by replacing the <em>list</em> of slides with a <em>Unique Set</em> of objects <code>{ slide, positionReg }</code>, where <code>positionReg</code> is an LWW register indicating the position. To move a slide, you create a unique new position like in a list CRDT, then set the value of <code>positionReg</code> equal to that position. This construction gives the <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3380787.3393677"><strong>list-with-move</strong></a> CRDT.</li>
</ul>
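<p>In the pseudocode style used later for the spreadsheet (classes are implicitly CRDT objects, and <code>createPositionBetween</code> stands in for a list-CRDT position generator), a move operation might look like:</p>

```ts
class ListWithMove<C> {
  state: UniqueSet<{ value: C, positionReg: LWWRegister<ListCRDTPosition> }>;

  move(e: { value: C, positionReg: LWWRegister<ListCRDTPosition> }, i: number) {
    // Create a unique new position between the neighbors at the target
    // index, as in a list CRDT, then overwrite e's position. Concurrent
    // edits to e.value are untouched: they act on independent state.
    const p = createPositionBetween(i - 1, i);
    e.positionReg.set(p);
  }
}
```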
<h1 id="new-concurrent-causal-for-each-operations">New: Concurrent+Causal For-Each Operations</h1>
<p>There’s one more trick I want to show you. Sometimes, when performing a for-each operation on a Unique Set or list CRDT (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>), you don’t just want to affect existing (causally prior) elements. You also want to affect <em>elements that are added/inserted concurrently</em>.</p>
<p>For example:</p>
<ul>
<li>In a rich text editor, if one user bolds a range of text, while concurrently, another user types in the middle of the range, the latter text should also be bolded.
<br />
<img src="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/weird_bolding.png" alt="One user bolds a range of text, while concurrently, another user types “ the” in the middle. In the final result, “ the” is also bolded." />
<br />
In other words, the first user’s intended operation is “for each character in the range <em>including ones inserted concurrently</em>, bold it”.</li>
<li>In a collaborative recipe editor, if one user clicks a “double the recipe” button, while concurrently, another user edits an amount, then their edit should also be doubled. Otherwise, the recipe will be out of proportion, and the meal will be ruined!</li>
</ul>
<p>I call such an operation a <strong>concurrent+causal for-each operation</strong>. To accommodate the above examples, I propose the following addendum to <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>:</p>
<p><a name="principle-3b"></a><strong>Principle 3b. For operations that do something “for each” element of a collection, another option is to use a <em>concurrent+causal for-each operation</em> on a Unique Set (or list CRDT).</strong></p>
<p>To implement this, the initiating user first does a causal for-each operation. They then send a message describing how to perform the operation on concurrently added elements. The receivers apply the operation to any concurrently added elements they’ve received already (and haven’t yet deleted), then store the message in a log. Later, each time they receive a new element, they check if it’s concurrent to the stored message; if so, they apply the operation.</p>
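<p>Here is a runnable sketch of that implementation, with causality simplified: each for-each message records the ids of the elements it has already seen, and any element not in that set counts as concurrent. Names are illustrative:</p>

```typescript
// A Unique Set supporting concurrent+causal for-each operations.
type LoggedForEach<T> = { seen: Set<string>; op: (x: T) => T };

class ForEachSet<T> {
  private elements = new Map<string, T>();
  private log: LoggedForEach<T>[] = [];

  receiveAdd(id: string, x: T): void {
    // Apply any logged for-each operations concurrent to this add.
    for (const f of this.log) {
      if (!f.seen.has(id)) x = f.op(x);
    }
    this.elements.set(id, x);
  }

  receiveForEach(op: (x: T) => T): void {
    // Causal part: apply to every element we've already received...
    this.elements.forEach((x, id) => this.elements.set(id, op(x)));
    // ...then log it for elements that arrive concurrently.
    this.log.push({ seen: new Set(this.elements.keys()), op });
  }

  get(id: string): T | undefined {
    return this.elements.get(id);
  }
}
```

<p>A real implementation would also garbage-collect log entries once no concurrent adds can still arrive, which I ignore here.</p>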
<!-- > **Aside.** It would be more general to split Principle 3 into "causal for-each" and "concurrent for-each" operations. However, I haven't yet found a good use-case for a concurrent for-each operation that isn't part of a concurrent+causal for-each.
<p></p><br /> -->
<p>Concurrent+causal for-each operations are novel as far as I’m aware. They are based on <a rel="noopener" target="_blank" href="https://doi.org/10.1145/3408976">a paper</a> I, <a rel="noopener" target="_blank" href="https://heather.miller.am/">Heather Miller</a>, and <a rel="noopener" target="_blank" href="http://christophermeiklejohn.com/">Christopher Meiklejohn</a> wrote last year, about a composition technique we call the <em>semidirect product</em>, which can implement them (albeit in a confusing way). <!--Unfortunately, the paper doesn't make clear what the semidirect product is doing intuitively (since we didn't understand this ourselves!). My current opinion is that concurrent+causal for-each operations are what it's really trying to do; the semidirect product is a special case of an optimized implementation, but written in the confusing traditional style (implementation + proof that concurrent operations commute). --></p>
<!-- > If you do want to use the semidirect product as an optimized implementation, be aware that it is not as general as it could be. E.g., the recipe example can be optimized, but not using the semidirect product. I'll write up a tech report about a more general approach at some point.
<p></p><br /> -->
<!-- Aside: dual view: controller for the for-each part plus oppositely-adjusted state. E.g. for scaling, or reversible list? Perhaps contrast with that approach---ours should be easier, in comparison to e.g. rich-text CRDT using invisible formatting characters (direct construction approach). -->
<!--# Summary: Principles of CRDT Design
also non-principle advice (basic designs, composition techniques)
For easy reference, here are our principles of CRDT design.
[**Principle 1.**](#principle-1) Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.
[**Principle 2.**](#principle-2) Express operations in terms of user intention---what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.
**Principle 3([a](#principle-3a), [b](#principle-3b)).** For operations that do something "for each" element of a collection, use a *causal for-each operation* or a *concurrent+causal for-each operation* on a Unique Set (or list CRDT).
[**Principle 4.**](#principle-4) Independent operations (in the user's mind) should act on independent state.-->
<h1 id="case-study-a-collaborative-spreadsheet">Case Study: A Collaborative Spreadsheet</h1>
<p>Now let’s get practical: we’re going to design a CRDT for a collaborative spreadsheet editor (think Google Sheets).</p>
<p>As practice, try sketching a design yourself before reading any further. The rest of this section describes how I would do it, but don’t worry if you come up with something different—there’s no one right answer! The point of this blog post is to give you the confidence to design and tweak CRDTs like this yourself, not to dictate “the one true spreadsheet CRDT™”.</p>
<h2 id="design-walkthrough">Design Walkthrough</h2>
<p>To start off, consider an individual cell. Fundamentally, it consists of a text string. We could make this a text (list) CRDT, but usually, you don’t edit individual cells collaboratively; instead, you type the new value of the cell, hit enter, and then its value shows up for everyone else. This suggests instead using a register, e.g., an LWW register.</p>
<p>Besides the text content, a cell can have properties like its font size, whether word wrap is enabled, etc. Since changes to these properties are all independent operations, following <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>, they should have independent state. This suggests using a CRDT object to represent the cell, with a different CRDT instance field for each property. In pseudocode (classes are implicitly <a href="#crdt-objects">CRDT objects</a>):</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Cell </span><span>{
</span><span> content</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">string</span><span>>;
</span><span> fontSize</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> wordWrap</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span></code></pre>
<p>The spreadsheet itself is a grid of cells. Each cell is indexed by its location (row, column), suggesting a map from locations to cells. (A 2D list could work too, but then we’d have to put rows and columns on an unequal footing, which might cause trouble later.) Thus let’s use a <code>Cell</code>-CRDT-valued map.</p>
<p>What about the map keys? It’s tempting to use conventional row-column indicators like “A1”, “B3”, etc. However, then we can’t easily insert or delete rows/columns, since doing so renames other cells’ indicators. (We could try making a “rename” operation, but that violates <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-2">Principle 2</a>, since it does not match the user’s original intention: inserting/deleting a different row/column.)</p>
<p>Instead, let’s identify cell locations using pairs (row, column), where “row” means “the line of cells horizontally adjacent to this cell”, independent of that row’s literal location (1, 2, etc.), and likewise for “column”. That is, we create an opaque <code>Row</code> object to represent each row, and likewise for columns, then use pairs <code>(Row, Column)</code> for our map keys.</p>
<p>The word “create” suggests using Unique Sets (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-1">Principle 1</a>), although since the rows and columns are ordered, we actually want list CRDTs. Hence our app state looks like:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span>rows: ListCRDT</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListCRDT</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span><span>cells: CRDTValuedMap</span><span style="color:#ececec;"><</span><span style="color:#78cecc80;">[</span><span>row: Row, column: Column</span><span style="color:#78cecc80;">]</span><span>, Cell</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Now you can insert or delete rows and columns by calling the appropriate operations on <code>columns</code> and <code>rows</code>, without affecting the <code>cells</code> map at all. (Due to the lazy nature of the map, we don’t have to explicitly create cells to fill a new row or column; they implicitly already exist.)</p>
<p>Speaking of rows and columns, there’s more we can do here. For example, rows have editable properties like their height, whether they are visible, etc. These properties are independent, so they should have independent states (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>). This suggests making <code>Row</code> into a CRDT object:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Row </span><span>{
</span><span> height</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span></code></pre>
<p>Also, we want to be able to move rows and columns around. We already described how to do this using a <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#list-with-move">list-with-move</a>:</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">ListWithMove</span><span><</span><span style="color:#d6d6d6;">C</span><span>> {
</span><span> state</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">UniqueSet</span><span><{value</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">C</span><span>, positionReg</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#d6d6d6;">ListCRDTPosition</span><span>>}>;
</span><span>}
</span><span>
</span><span>rows: ListWithMove</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListWithMove</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Next, we can also perform operations on every cell in a row, like changing the font size of every cell. For each such operation, we have three options:</p>
<ol>
<li>Use a causal for-each operation (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3a">Principle 3a</a>). This will affect all current cells in the row, but not any cells that are created concurrently (when a new column is inserted). E.g., a “clear” operation that sets every cell’s value to <code>""</code>.</li>
<li>Use a concurrent+causal for-each operation (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-3b">Principle 3b</a>). This will affect all current cells in the row <em>and</em> any created concurrently. E.g., changing the font size of a whole row.</li>
<li>Use an independent state that affects the row itself, not the cells (<a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#principle-4">Principle 4</a>). E.g., our usage of <code>Row.height</code> for the height of a row.</li>
</ol>
<!-- > **Aside.** Note that the for-each loops loop over every cell in the row, even blank cells that have never been used. This has the downside of making all those cells explicitly exist in the CRDT-valued map, increasing memory usage. We tolerate this since our focus is to pin down the semantics, not give an efficient implementation. Once the semantics are pinned down, though, you are free to optimize the implementation.
<p></p><br /> -->
<!--Lastly, let's take another look at cell contents. Before I said it was just a string, but it's more interesting than that: cells can reference other cells in formulas, e.g., "= A2 + B3". If a column is inserted in front of column A, these references should update to "= B2 + C3", since they intuitively describe a *cell*, not the indicators themselves. So, we should store them using a pair `[row: Row, column: Column]`, like the map keys. The content then becomes an array of tokens, which can be literal strings or cell references:
```ts
class Cell {
content: LWWRegister<(string | [row: Row, column: Column])[]>;
fontSize: LWWRegister<number>;
wordWrap: EnableWinsFlag;
// ...
}
```-->
<h2 id="finished-design">Finished Design</h2>
<p>In summary, the state of our spreadsheet is as follows.</p>
<pre data-lang="ts" style="background-color:#393939;color:#dedede;" class="language-ts "><code class="language-ts" data-lang="ts"><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> ---- CRDT Objects ----
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Row </span><span>{
</span><span> height</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Column </span><span>{
</span><span> width</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> isVisible</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">Cell </span><span>{
</span><span> content</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">string</span><span>>;
</span><span> fontSize</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#fffb9d;">number</span><span>>;
</span><span> wordWrap</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">EnableWinsFlag</span><span>;
</span><span style="color:#a0cfa1;"> //</span><span style="color:#87ae86;"> ...
</span><span>}
</span><span>
</span><span style="color:#fffb9d;">class </span><span style="color:#f4a020;">ListWithMove</span><span><</span><span style="color:#d6d6d6;">C</span><span>> {
</span><span> state</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">UniqueSet</span><span><{value</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">C</span><span>, positionReg</span><span style="color:#ececec;">: </span><span style="color:#d6d6d6;">LWWRegister</span><span><</span><span style="color:#d6d6d6;">ListCRDTPosition</span><span>>}>;
</span><span>}
</span><span>
</span><span style="color:#a0cfa1;">//</span><span style="color:#87ae86;"> ---- App state ----
</span><span>rows: ListWithMove</span><span style="color:#ececec;"><</span><span>Row</span><span style="color:#ececec;">></span><span>;
</span><span>columns: ListWithMove</span><span style="color:#ececec;"><</span><span>Column</span><span style="color:#ececec;">></span><span>;
</span><span>cells: CRDTValuedMap</span><span style="color:#ececec;"><</span><span style="color:#78cecc80;">[</span><span>row: Row, column: Column</span><span style="color:#78cecc80;">]</span><span>, Cell</span><span style="color:#ececec;">></span><span>;
</span></code></pre>
<p>Note that I never explicitly mentioned <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2023/collaborative-data-design/#correctness">CRDT correctness</a>—the claim that all users see the same document state after receiving the same messages. Because we assembled the design from existing CRDTs using composition techniques that preserve CRDT correctness, it is trivially correct. Plus, it should be straightforward to reason out what would happen in various concurrency scenarios.</p>
<p>As exercises, here are some further tweaks you can make to this design, phrased as user requests:</p>
<ol>
<li>“I’d like to have multiple sheets in the same document, accessible by tabs at the bottom of the screen, like in Excel.” <em>Hint (highlight to reveal): <font color="white">Use a list of CRDTs.</font></em></li>
<li>“I’ve noticed that if I change the font size of a cell, while at the same time someone else changes the font size for the whole row, sometimes their change overwrites mine. I’d rather keep my change, since it’s more specific.” <em>Hint: <font color="white">Use a register with a custom getter.</font></em></li>
<li>“I want to reference other cells in formulas, e.g., <code>= A2 + B3</code>. Later, if <code>B3</code> moves to <code>C3</code>, its references should update too.” <em>Hint: <font color="white">Store the reference as something immutable.</font></em></li>
</ol>
<h1 id="conclusion">Conclusion</h1>
<p>I hope you’ve gained an understanding of how CRDTs work, plus perhaps a desire to apply them in your own apps. We covered a lot:</p>
<ul>
<li><strong>Traditional CRDTs:</strong> Unique Set, List/Text, LWW Register, Enable-Wins Flag, Add-Wins Set, CRDT-Valued Map, and List-with-Move.</li>
<li><strong>Novel Operations</strong>: Concurrent+causal for-each operations on a Unique Set or list.</li>
<li><strong>Whole Apps</strong>: Spreadsheet, rich text, and pieces of various other apps.</li>
</ul>
<p>For more info, <a rel="noopener" target="_blank" href="https://crdt.tech/">crdt.tech</a> collects most CRDT resources in one place.</p>
<p>I’ve also started putting these ideas into practice in a library, <a rel="noopener" target="_blank" href="https://www.npmjs.com/package/@collabs/collabs">Collabs</a>. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in <a rel="noopener" target="_blank" href="https://www.youtube.com/watch?v=Exr0iY_D-vw">my Strange Loop talk</a>.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I thank Justine Sherry, Jonathan Aldrich, and Pratik Fegade for reviewing this post and giving helpful feedback. I also thank Heather Miller, Ria Pradeep, and Benito Geordie for numerous CRDT design discussions that led to these ideas.</p>
Time-Traveling Simulation for Security2022-12-06T00:00:00+00:002022-12-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/timetraveling-simulation/<p>Blockchains are a powerful technology which allow decentralized agreement with an immutable history. Since transactions can be added, but not removed, blockchains allow distributed banking as a trustworthy alternative to central banking.
A vast amount of cryptographic research on constructing secure blockchains has led to them being trusted to secure currency worth <a rel="noopener" target="_blank" href="https://coinmarketcap.com/currencies/bitcoin/">hundreds of billions</a> of US dollars.</p>
<p>Recently, blockchains have received attention as an enabler of cryptography rather than just a goal of it. Several works have used blockchains to build a variety of cryptographic tools, including <a rel="noopener" target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-70500-2_18">one-time programs</a> and <a rel="noopener" target="_blank" href="https://link.springer.com/article/10.1007/s10623-018-0461-x">time-lock encryption</a>. These tools are impossible to construct without special assumptions. These works model cryptographic protocols as occurring in a world where a blockchain protocol is being executed. The cryptographic protocol is therefore able to perform actions such as reading the state of the blockchain or posting transactions to it. The exact security definitions vary significantly between these approaches.</p>
<p>Time-traveling simulation is a new security model for protocols executed in the presence of a blockchain. Intuitively, time-traveling simulation captures the philosophy that “any extra information an adversary learns in a real execution could have been learned on their own by waiting for the natural passage of time”. Since a blockchain will naturally progress no matter what the adversary does, it provides the notion of time needed to formalize this philosophy. </p>
<p>Time-traveling simulation bypasses many impossibility results, while at the same time yielding an arguably stronger notion of security than prior blockchain-based works. For example, time-traveling simulation enables zero knowledge arguments and secure two-party computation in three messages. It is currently not known how to construct these protocols in three messages with the standard notion of security, without relying on new hardness assumptions. </p>
<p>In this article, we will dive into the <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#the-philosophy-of-security">definition of time-traveling simulation</a> and how it <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#comparison-to-other-relaxed-security-notions">compares to other security notions</a>. Additionally, we will explore how it can be used to bypass impossibility results for <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#application-time-traveling-simulators-for-zero-knowledge">three message zero knowledge arguments</a>.</p>
<h1 id="the-philosophy-of-security">The Philosophy of Security</h1>
<p>In modern cryptography, the central philosophy for security is “any extra information an adversary learns in a real execution could have been learned on their own”. In other words, the adversary learns nothing from participating in the real execution, beyond what they were supposed to learn. For example, in a zero knowledge argument, the adversary only learns that a given NP statement is true, without learning a witness for <em>why</em> it is true. This particular notion is actually too strong for many applications, so cryptographers usually consider weakenings of this philosophy with the same spirit. The most common weakening is “any extra information an adversary learns in a real execution could have been learned on their own using a little extra computation”.</p>
<p>These philosophies are captured formally by a mathematical object called a simulator. A simulator’s job is to reproduce whatever knowledge the adversarial verifier would have learned in a real execution of the protocol. However, it must do this without access to the real prover; it only has the adversary’s code. If such a simulator exists, then the adversary could run the simulator on its own. By doing so, it learns everything it would have learned in a real interaction, without interacting with the real prover.</p>
<p>More formally, a simulator (for zero knowledge) takes as input the adversary’s code and the statement being proven, then outputs a transcript of a protocol execution, along with the adversary’s internal state. In the real world, without loss of generality, the adversary outputs the transcript of the protocol execution along with its own internal state. This is before any post-processing. A protocol is zero knowledge if there exists a simulator whose output distribution is indistinguishable from the output distribution of the adversary in the real world. This guarantees that whatever information can be derived from the output of the adversary in the real world is indistinguishable from what can be derived from the simulator. Thus, by running the simulator, the adversary can learn whatever it would have learned in a real execution.</p>
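<p>As a sketch (one common way to formalize the definition just described, not a quote from any particular paper), the zero knowledge condition can be written as follows, where \( \langle V^{*}\rangle \) denotes the adversarial verifier’s code and \( \mathsf{out}_{V^{*}} \) denotes the verifier’s output, i.e., the transcript together with its internal state:</p>

```latex
\exists\, \mathsf{Sim}\ \ \forall\ \text{PPT } V^{*}:\quad
\{\mathsf{Sim}(x, \langle V^{*}\rangle)\}_{x \in L}
\;\approx_{c}\;
\{\mathsf{out}_{V^{*}}\langle P(x, w), V^{*}(x)\rangle\}_{x \in L}
```

<p>Here \( \approx_{c} \) denotes computational indistinguishability, matching the requirement that no efficient distinguisher can tell the simulator’s output from the adversary’s real-world output.</p>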
<p><img src="./simulator-paradigm.png" alt="In the real world, the adversary interacts with someone who knows a secret. In the ideal world, the simulator does not know the secret, and may internally interact with the adversary to produce a realistic looking view." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> The simulator imagines an interaction between the adversarial verifier and an imaginary prover. This interaction is indistinguishable from a real interaction, from the adversary's point of view.</div>
<p>In some sense, a simulator can be viewed as a method for the adversary to fool itself into accepting the truth of a statement without knowing a witness. It is important that the adversary can only fool itself - an adversarial prover should not be able to fool an honest verifier. This requires some asymmetry between the simulator and a real-world adversary. One of the most basic forms of asymmetry is knowledge of the adversary’s code, which allows the simulator to internally run and interact with the adversary. Any adversary knows its own code, but it certainly shouldn’t know anyone else’s!</p>
<p>To relax the security philosophy, the simulator is provided with some form of additional power which represents additional asymmetry between the simulator and a real-world adversary. The more asymmetry, the easier it is to create a simulator without allowing an adversarial prover to convince an honest verifier of a false statement. In general, providing more extra power to the simulator corresponds to a weaker security notion. The adversary can learn whatever the simulator can learn, so a more powerful simulator corresponds to an adversary which can learn more information. The table below compares common relaxations to time-traveling simulation in terms of their philosophies and what extra power is given to the simulator.</p>
<table><thead><tr><th align="left">Security Notion</th><th align="left">Philosophy: <br/>“Any extra information an adversary learns in a real execution could have been learned on their own…”</th><th align="left">Simulator</th></tr></thead><tbody>
<tr><td align="left">Expected PPT (Standard)</td><td align="left">in expected PPT.</td><td align="left">Runs in expected poly time.</td></tr>
<tr><td align="left">Superpolynomial Simulation</td><td align="left">in superpolynomial time.</td><td align="left">Runs in superpolynomial time.</td></tr>
<tr><td align="left">Common Reference String (CRS)</td><td align="left">using the CRS trapdoor.</td><td align="left">Can choose the CRS used by both parties. This allows adding a trapdoor to it.</td></tr>
<tr><td align="left">Majority Simulation</td><td align="left">if they controlled the blockchain.</td><td align="left">Controls the majority of blockchain participants.</td></tr>
<tr><td align="left">Time-Traveling Simulation</td><td align="left">shortly into the future.</td><td align="left">Can look into the future.</td></tr>
</tbody></table>
<h2 id="security-implications-of-time-traveling-simulation">Security Implications of Time-Traveling Simulation</h2>
<p>As mentioned previously, time-traveling simulation captures the philosophy that “any extra information an adversary learns in a real execution could have been learned on their own by waiting for the natural passage of time”. This is realized by allowing the simulator to see a potential future state of the blockchain, which consists of a valid extension by \( F \) blocks. Since such a state will become public information after a short time regardless of what the adversary does, this only reveals information that would have anyway been revealed with the natural passage of time.</p>
<p>Simulator access to a future state allows time-traveling simulation to bypass impossibility results for expected probabilistic polynomial time simulation, which is considered the standard notion of simulation.
A common blockchain property is that a computationally-bounded adversary cannot compute a valid extension by \( F \) blocks faster than the honest parties can extend the chain by, say, \( \sqrt{F} \) blocks. Therefore access to a future state represents additional asymmetry between the simulator and a real adversary.
This additional asymmetry makes it possible for the simulator to “imagine” the adversary’s real-world view in protocols where it otherwise would not have been able to, bypassing the impossibility results for expected PPT simulation (aka standard simulation).</p>
<p><img src="./future-state.png" alt="A blockchain comes equipped with a validity predicate which allows checking whether a state is a valid extension of a previous state. A future state is a valid extension of the current state." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> A blockchain comes equipped with a validity predicate which allows checking whether a state is a valid extension of a previous state. A future state is a valid extension of the current state.</div>
<p>Time-traveling simulation is almost as meaningful as standard simulation when it comes to long-term knowledge.
For example, imagine the task of constructing multi-party computation protocols which are secure against malicious adversaries. A malicious adversary may deviate from the protocol arbitrarily. Another kind of adversary is a semi-honest adversary, which follows the protocol, but may attempt to analyze the transcript later. It is much easier to construct multi-party computation protocols which are secure against semi-honest adversaries. A multi-party computation protocol with semi-honest security can be transformed to have malicious security by using the <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/pdf/10.1145/28395.28420">GMW compiler</a>. To do the transformation, each party proves the statement “I executed the protocol honestly using some input” in zero knowledge. This convinces the other parties that they did indeed behave honestly, but does not reveal an explanation for the honest behavior. Crucially, this means that the zero knowledge argument preserves the privacy of each party’s inputs. Now consider using a zero knowledge argument with time-traveling simulation to instantiate the GMW compiler. Since honest behavior in a non-time-sensitive protocol does not depend on the passage of time, this does not reveal an explanation for the honest behavior. In particular, the inputs of each party are still private.</p>
<p>In contrast, time-traveling simulation may not be suitable for applications which are inherently time sensitive. For example, consider using a zero knowledge argument with time-traveling simulation to prove knowledge of a solution to a time-lock puzzle. A time-lock puzzle can be solved in some set amount of time (for example, a day), but cannot be solved faster than that. Since the simulator has access to a future state from after the time-lock puzzle can be solved, in this situation time-traveling simulation may allow the solution to be leaked today instead of tomorrow.</p>
<h3 id="comparison-to-other-relaxed-security-notions">Comparison to Other Relaxed Security Notions</h3>
<p>Several of these security notions also bypass impossibility results for expected PPT simulation. One way to further compare security notions is comparing how powerful their simulators are. As mentioned previously, a security notion which allows the simulator more power may allow the adversary to learn more information. In many cases, time-traveling simulation gives the simulator less power than other simulation notions, so it corresponds to better security guarantees.</p>
<p><strong>Super-Polynomial Time Simulation.</strong> Time-traveling simulation can be seen as a very restricted form of super-polynomial time or angel-based simulation. Angel-based simulation is similar to super-polynomial time simulation, but restricts the extra computational power to performing one specific task. For example, an angel may break the security of a particular commitment scheme. Both super-polynomial time and angel-based simulators are very powerful and can bypass many impossibility results. However, it can be challenging to argue that the simulator cannot break the security of other primitives. These primitives may only have security against polynomial-time adversaries, so they can be broken using any super-polynomial time computation. Continuing the example of commitments, if the simulator could also break a second commitment scheme, then it cannot guarantee that the second scheme is secure against the real adversary.</p>
<p>In the case of time-traveling simulation, the angel’s task is to quickly compute a potential future state of the blockchain exactly once. It is worth emphasizing the special nature of this task: it is computing something which will be publicly available information in just a short while. As such, whatever security a time-traveling simulator breaks would have been broken soon anyway. For example, regardless of which commitment scheme the parties use, the commitment to their input can never be broken by a time-traveling simulator.</p>
<p><strong>Common Reference String.</strong> Another good point of comparison is the common reference string model, since the blockchain state represents a pre-agreed-upon string. One important difference between a CRS and the way time-traveling simulation uses a blockchain is that the format of a common reference string often depends on the exact protocol being run (for example, a zero knowledge proof or a secure computation protocol). However, a blockchain does not adapt to auxiliary protocols. A second, and perhaps more important difference, is the notion of control. In the CRS model, the simulator has full control over the CRS. A time-traveling simulator, on the other hand, has no actual control over the blockchain, only some extra information about it. This means that a time-traveling simulator can learn less information than a simulator with full control over the blockchain. Since an adversary might be able to learn whatever a simulator can, the security notion is stronger if the simulator only has extra information, instead of full control.</p>
<p><strong>Majority Simulation.</strong> This difference in control over versus knowledge about the blockchain is especially illustrated when comparing time-traveling simulation to majority simulation. Majority simulation is another relaxed security model for protocols executed alongside a blockchain. In majority simulation, the simulator is allowed control over all honest parties which are participating in the progression of the blockchain. Since blockchain security requires the honest parties to be in control of the blockchain, this allows a majority simulator to perform tasks such as pausing or even rewinding the blockchain. Such capabilities should even allow computation of future states of the blockchain, which is the only power given to a time-traveling simulator. </p>
<p>In particular, majority simulation can introduce security vulnerabilities when running two different protocols using the same blockchain. Since the two protocols rely on the security of the blockchain, a simulator with full control over the blockchain can easily break the security of either protocol. Therefore majority simulation does not guarantee that a party which participates in one protocol cannot violate the security of the other protocol. Although it is nontrivial to see, time-traveling simulation can allow multiple carefully designed protocols to use the same blockchain at the same time. </p>
<h1 id="application-time-traveling-simulators-for-zero-knowledge">Application: Time-Traveling Simulators for Zero Knowledge</h1>
<p>Time-traveling simulators allow a particularly simple construction for zero knowledge arguments with three messages. As mentioned previously, constructing zero knowledge arguments with three messages is very difficult under the standard notion of security (expected PPT simulation). <a rel="noopener" target="_blank" href="https://iacr.org/archive/tcc2008/49480068/49480068.pdf">Prior work</a> shows that any security proof for a three message zero knowledge argument must make non-blackbox use of the adversary’s code. However, non-blackbox techniques are notoriously difficult. The only <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3188745.3188870">current construction</a> for three message zero knowledge relies on new cryptographic hardness assumptions.</p>
<p>A zero knowledge argument is, first and foremost, an argument. A prover attempts to convince a verifier that a statement \( x\) is in an NP language \( L\). The prover should not be able to convince the verifier of a false statement; this property is called soundness. The zero knowledge property requires that the argument does not allow the verifier to learn anything about the witness for \( x \in L\). This is formalized using the simulator definition discussed <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/timetraveling-simulation/#the-philosophy-of-security">above</a>. As a reminder, the simulator must approximate a real view of the argument, except it does not have access to the real prover. In the standard notion of simulation, the simulator is an expected PPT algorithm.</p>
<p>In time-traveling simulation for zero knowledge arguments, the simulator additionally receives a valid extension of the blockchain by \(F\) blocks. Then it must produce the adversary’s view. If left alone, the blockchain will generate extensions of itself which are independent of the statement \(x\) or its witnesses. Therefore the future state which the simulator receives is effectively harmless and contains no information about the witness beyond what is naturally leaked with the passage of time.</p>
<p><img src="./timetraveling-simulator-zk.png" alt="In a real zero knowledge argument execution, the prover knows the witness. A time-traveling simulator for zero knowledge receives a future state of the blockchain instead of the witness." /></p>
<div style="margin-left: 50px; margin-right: 40px;"><b>Figure:</b> In a real zero knowledge argument execution, the prover knows the witness. A time-traveling simulator for zero knowledge receives a future state of the blockchain instead of the witness.</div>
<h2 id="zero-knowledge-in-three-rounds">Zero Knowledge in Three Rounds</h2>
<p>The construction of a three round zero knowledge argument uses a three round witness indistinguishable proof of knowledge (WIPoK). In a WIPoK, a prover convinces a verifier that they “know” a witness for some NP statement. The witness indistinguishability property guarantees that if there are two possible witnesses for the statement, then the verifier cannot tell which one the prover knows. This is a weaker security guarantee than zero knowledge, so it is possible to construct a WIPoK in just three rounds (even without assuming special setup like a CRS or a blockchain).</p>
<p>The construction is as follows. To prove the truth of an NP statement \( x\), the prover and verifier engage in a WIPoK for the statement “I know a witness for \( x\) or I know a blockchain state \(F\) blocks ahead of the current state”. Showing zero knowledge requires constructing a time-traveling simulator, which is initialized with a future state. The simulator acts as a prover in the WIPoK with the adversary, using the future state as its witness. Witness indistinguishability guarantees that an execution using the future state as a witness is indistinguishable from an execution using a witness for \(x \). The latter case is exactly what occurs in a real execution, so the simulator’s output is indistinguishable from a real execution.</p>
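<p>To make the compound statement concrete, here is a minimal Python sketch of the witness relation underlying the WIPoK. The helper names (<code>verify_np_witness</code>, <code>is_valid_extension</code>) are hypothetical placeholders standing in for the NP verifier and the blockchain’s validity predicate, not part of any real API:</p>

```python
def or_relation(x, witness, current_state, F, verify_np_witness, is_valid_extension):
    """Relation for the WIPoK statement: 'I know a witness for x, OR
    I know a valid F-block extension of the current blockchain state.'"""
    kind, value = witness
    if kind == "np":
        # The real prover's branch: an NP witness for x in L.
        return verify_np_witness(x, value)
    if kind == "future":
        # The simulator's branch: a state F blocks ahead of current_state.
        return is_valid_extension(current_state, value, F)
    return False
```

<p>Witness indistinguishability is exactly what guarantees the verifier cannot tell which of the two branches the prover (or simulator) used.</p>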
<p>To show soundness, observe that any adversarial prover must know a witness for the statement. This is either a witness for \( x \) or it is a future state of the blockchain. Since a real adversary cannot possibly know a future state of the blockchain without violating the blockchain’s security, it must know a witness for \( x\). The full argument for soundness requires some additional care in order to use the proof of knowledge property, since the WIPoK is composed in parallel with a blockchain protocol and many security properties break down during parallel composition. See the <a rel="noopener" target="_blank" href="https://eprint.iacr.org/2022/035.pdf">full paper</a> for details.</p>
Kangaroo: Caching billions of tiny objects on flash2022-05-02T00:00:00+00:002022-05-02T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/kangaroo/<p>Many social-media and Internet-of-Things services have large numbers of tiny objects, each a few hundred bytes or less.
For example, edges in Facebook’s social graph, which are needed to connect friends, posts, and images among other content, <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/osdi20/presentation/berg">average under 100 bytes</a>.
Twitter tweets <a rel="noopener" target="_blank" href="https://techcrunch.com/2018/10/30/twitters-doubling-of-character-count-from-140-to-280-had-little-impact-on-length-of-tweets/">average 33 bytes</a>.</p>
<p>These objects are permanently stored in large-scale databases, object stores, or filesystems.
On top of this permanent storage layer,
popular objects are cached.
Caches allow quicker access to the popular objects and lower load on the storage layer.
A cache’s effectiveness in these systems is primarily measured by the ratio of
the number of requests it cannot fulfill to the total number of requests, or its miss ratio.
As the quantity of data scales, caching layers need to also scale to maintain
their miss ratio, otherwise end-user experiences such as website load times suffer.
However, scaling traditional DRAM caches is prohibitively expensive.
Instead, companies are increasingly using flash
to build larger caches since flash is 100x cheaper per bit than DRAM.</p>
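<p>As a quick illustrative sketch (the request counts below are made up, not production numbers from Facebook or Twitter), the miss ratio and the resulting load on the storage layer can be computed directly:</p>

```python
def miss_ratio(misses, total_requests):
    # Fraction of requests the cache cannot serve; lower is better.
    return misses / total_requests

# e.g., 1 million requests of which 50,000 miss the cache:
ratio = miss_ratio(50_000, 1_000_000)   # 0.05
storage_layer_requests = 50_000          # every miss falls through to storage
```
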
<p>Unfortunately, prior flash caches fall short of efficiently caching tiny objects,
a challenging workload for flash caching.
Prior approaches either increase the cache’s cost through a high indexing overhead
that requires excessive DRAM capacity,
or write too much, rapidly wearing out flash devices.
Thus, with prior designs, flash caching fails to live up to its potential as a cheap, large cache for tiny objects.</p>
<p>Kangaroo is a new flash cache optimized for tiny objects.
It enables efficient caching of tiny objects by requiring only a small
DRAM overhead and a small write overhead for cached objects.
In addition, Kangaroo introduces a new cache eviction policy that uses
minimal DRAM overhead while significantly reducing cache
misses, further reducing load on the storage layer.
Kangaroo is <a rel="noopener" target="_blank" href="https://github.com/saramcallister/Kangaroo">open source</a>
and implemented in <a rel="noopener" target="_blank" href="https://cachelib.org/">CacheLib</a>,
Facebook’s open-source caching engine.</p>
<p>Kangaroo lowers the number of cache misses by 29% over state-of-the-art
flash caching systems under production DRAM and flash constraints on traces
from production social-graph caching workloads at Facebook and Twitter.
These results are also corroborated with a
test deployment of Kangaroo in a shadow production setup at Facebook.
This research was published at <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">SOSP 2021</a> where it won the <a rel="noopener" target="_blank" href="https://sosp2021.mpi-sws.org/awards.html">Best Paper Award</a>.</p>
<h2 id="prior-approaches-too-much-dram-or-too-many-writes">Prior approaches: Too much DRAM or too many writes</h2>
<p>Prior flash caches fall into two main categories: <em>log-structured caches</em> and <em>set-associative caches</em>. Neither of these flash caches can efficiently support tiny objects
because, as explained further below, log-structured caches require prohibitively large
DRAM overheads whereas set-associative caches require prohibitively large write overheads.</p>
<h3 id="log-structured-caches-too-much-dram">Log-structured caches: Too much DRAM</h3>
<p>Log-structured caches use flash as a circular log.
During an insert, objects are first buffered in DRAM and then written to flash
sequentially in large groups.
Since objects can end up anywhere on flash, the cache maintains an in-memory index to find objects.</p>
<p>The advantage of a log-structured design is that it has a low <em>write amplification</em>.
Write amplification is the number of bytes written to flash divided by
the cumulative object size, and it represents the write overhead of a cache.
A write amplification of one is optimal, though often it is higher.
For example, writing a 100-byte object to flash by itself has a write amplification
of ~40x since flash has a minimum write granularity of 4KB.
Flash has a limited number of times it can be rewritten before becoming unusable.
Therefore, this significant write amplification wears out the flash device quickly,
requiring it to be replaced often.
Since a log-structured cache buffers objects in DRAM,
it can wait until it has enough objects to write them to flash efficiently.
Thus, log-structured caches have close-to-optimal write amplification.</p>
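<p>The effect of buffering shows up in a quick back-of-the-envelope sketch, using the 4 KB page granularity and ~100-byte objects from above:</p>

```python
PAGE_BYTES = 4096

def write_amplification(flash_bytes_written, object_bytes):
    # Bytes written to flash divided by the cumulative object size.
    return flash_bytes_written / object_bytes

# Writing one 100-byte object alone still costs a full page write:
solo = write_amplification(PAGE_BYTES, 100)           # ~41x
# Buffering 40 such objects and flushing them in one sequential write:
buffered = write_amplification(PAGE_BYTES, 40 * 100)  # ~1.02x
```
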
<p>However, log-structured caches have a large DRAM overhead when caching tiny objects.
They have to keep an index entry for every on-flash object to enable
finding those objects again on a lookup request.
Since objects are around 100 bytes, there would be roughly 20 billion of them
in a 2 TB flash cache.
Even with the <a rel="noopener" target="_blank" href="https://www.usenix.org/conference/nsdi19/presentation/eisenman">lowest overhead in the literature at 30 bits/object</a>,
the cache would require 75 GB just to index the objects on flash.
Since caching on flash is meant to lower costs through removing DRAM,
log-structured caches are inefficient for tiny objects because they require too much DRAM.</p>
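<p>The DRAM math works out as follows, using the numbers above:</p>

```python
cache_bytes = 2 * 10**12        # 2 TB flash cache
object_bytes = 100              # ~100-byte objects
bits_per_index_entry = 30       # lowest per-object overhead in the literature

num_objects = cache_bytes // object_bytes              # 20 billion objects
index_bytes = num_objects * bits_per_index_entry // 8  # 75 GB of DRAM
```
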
<h3 id="set-associative-caches-too-many-writes">Set-associative caches: Too many writes</h3>
<p>Meanwhile, set-associative caches use flash as a large hash table where each flash page is a single <em>set</em>, or hash bucket.
During a lookup request, the cache hashes an object’s key to find its potential set on
flash and reads that flash page to find the object.</p>
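<p>A set-associative cache’s lookup and insert paths can be sketched as follows. This is a simplified in-memory model where each Python dict stands in for a 4 KB flash page; a real cache issues actual page reads and writes:</p>

```python
import hashlib

NUM_SETS = 1024

def set_index(key: str) -> int:
    # Hash the key to pick the one set (flash page) that may hold it.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SETS

class SetAssociativeCache:
    def __init__(self):
        # Each "set" stands in for one 4 KB flash page.
        self.sets = [dict() for _ in range(NUM_SETS)]

    def lookup(self, key):
        # One page read: fetch the set and scan it for the key.
        return self.sets[set_index(key)].get(key)

    def insert(self, key, value):
        # One page write per inserted object -- the source of the ~40x
        # write amplification for ~100-byte objects.
        self.sets[set_index(key)][key] = value
```
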
<p>Since finding objects is based on a hash function, set-associative caches
do not need large amounts of memory to track objects.
Thus, unlike log-structured caches, set-associative caches have a low
enough memory overhead to support large flash caches.</p>
<p>However, these caches write many more bytes than necessary.
When inserting a new object, the cache has to write, at a minimum,
a 4 KB flash page for every object.
If objects are roughly 100 bytes, the cache has a <em>40x</em> write amplification.
Thus, set-associative caches are also inefficient for tiny objects because they
require too many writes.</p>
<h2 id="kangaroo-an-efficient-tiny-object-flash-cache">Kangaroo: An efficient tiny-object flash cache</h2>
<p>Kangaroo caches tiny objects on flash effectively by combining log-structured and set-associative caches to reduce both DRAM and flash-write overheads.
Kangaroo has two main parts: <em>KLog</em>, a small log-structured flash cache, and <em>KSet</em>, a large set-associative flash cache.
At a high level, Kangaroo uses KLog as a staging area for objects so that
writing them to KSet is more efficient.</p>
<h3 id="finding-objects-in-kangaroo">Finding objects in Kangaroo</h3>
<p><img src="../kangaroo-lookup.png" alt="Lookup in Kangaroo" /></p>
<figure-caption>
<p>On a lookup, Kangaroo looks for the object in (1) the DRAM cache, then (2a) KLog’s index and (2b) KLog if the key is in the index, then finally (3a) KSet’s
Bloom filters and (3b) KSet if the Bloom filters indicate the object could be there.
If the object is not found in any of these locations, Kangaroo returns a miss.</p>
</figure-caption>
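<p>The lookup order can be sketched in Python (all names here are illustrative stand-ins for Kangaroo’s components, not CacheLib’s actual API):</p>

```python
def kangaroo_lookup(key, dram_cache, klog_index, klog, kset_bloom, kset):
    # (1) Check the DRAM cache first.
    value = dram_cache.get(key)
    if value is not None:
        return value
    # (2a) Check KLog's in-memory index; (2b) on a hit, read from KLog on flash.
    if key in klog_index:
        return klog.read(klog_index[key])
    # (3a) KSet's per-set Bloom filter avoids needless flash reads;
    # (3b) read the set only if the filter says the key may be present.
    if kset_bloom.may_contain(key):
        return kset.read_set(key)  # may still be None (Bloom false positive)
    return None  # miss
```
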
<h3 id="inserting-objects-in-kangaroo">Inserting objects in Kangaroo</h3>
<p><img src="../kangaroo-insert.png" alt="Insert into Kangaroo" /></p>
<figure-caption>
<p>On an insert, Kangaroo first places the object in (1) the DRAM cache.
This insertion may evict an object from the DRAM cache.
If the object is not admitted to flash, (2a) it is evicted from Kangaroo.
For instance, objects can be evicted at this stage based on a random admission policy,
where each object has a fixed probability of admission to the flash cache.
Otherwise, it is inserted into (2b) KLog’s index and (2c) written to flash in KLog via a buffered write.
When objects are evicted from KLog, they are again subject to an admission policy,
described more in the next section,
and (3a) can be evicted from Kangaroo entirely.
Admitted objects are written to (3b) KSet along with any other objects in KLog
that map to the same set in KSet.</p>
</figure-caption>
<p>One important aspect of the insertion path in Kangaroo that reduces write amplification
is how Kangaroo moves objects from KLog to KSet.
KLog often contains multiple objects mapping to the same set in KSet,
such as the pink and yellow objects in the figure above.
Whenever an object is evicted from KLog, Kangaroo proactively uses KLog’s index to
find any other objects that map to the same set in KSet,
and moves them to KSet as well.
Since writing a set always requires writing 4 KB, regardless of the number of objects inserted, writing multiple new objects instead of just one greatly reduces the write amplification.</p>
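<p>A toy sketch of that move (the hashing and sizes are assumptions, not Kangaroo's actual layout): when one object is evicted from KLog, gather every logged object bound for the same KSet set and flush them in a single 4 KB set write.</p>

```python
# Batched KLog -> KSet move: one 4 KB set write is shared by every
# logged object that maps to the same set.
PAGE_BYTES, OBJ_BYTES, NUM_SETS = 4096, 100, 8  # toy sizes

def flush_with_neighbors(evicted_key, klog_keys):
    """Return the keys written in one set write, and the resulting
    per-object write amplification for that write."""
    target_set = hash(evicted_key) % NUM_SETS
    batch = [k for k in klog_keys if hash(k) % NUM_SETS == target_set]
    write_amp = PAGE_BYTES / (OBJ_BYTES * len(batch))
    return batch, write_amp
```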
<p>Thus, Kangaroo amortizes writes to KSet over multiple objects, decreasing the overall number of bytes written to flash.
Kangaroo accomplishes this amortization with a small KLog (~5% of flash), resulting in only a small DRAM overhead to index KLog’s entire capacity.
Kangaroo thus addresses both the DRAM and flash-write overheads of caching tiny objects on flash.</p>
<h3 id="kangaroo-optimizations">Kangaroo optimizations</h3>
<p>On top of this basic design, Kangaroo introduces additional techniques to increase its effectiveness.
In particular, since Kangaroo is a cache and not a key-value store, it can evict objects to minimize writes.
Kangaroo exploits this opportunity by adding a threshold admission policy that evicts objects from KLog, instead of admitting them to KSet, whenever fewer than n objects would be inserted into a set in KSet.
This admission policy allows Kangaroo to guarantee that the write amplification for moving objects to KSet will be much lower than that of a set-associative cache.</p>
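<p>A minimal sketch of the threshold idea (the threshold and sizes here are placeholders, not Kangaroo's production settings): because a set is only rewritten when at least <code>n</code> objects can be added, the per-object write amplification of KSet writes is capped.</p>

```python
# Threshold admission: only rewrite a 4 KB set when the batch is big enough.
PAGE_BYTES, OBJ_BYTES = 4096, 100  # toy sizes

def admit_to_kset(batch_size: int, n: int) -> bool:
    """Admit a batch of KLog evictions only if it meets the threshold."""
    return batch_size >= n

def worst_case_write_amp(n: int) -> float:
    """Admitted batches contain >= n objects, so this bounds the amplification."""
    return PAGE_BYTES / (n * OBJ_BYTES)

print(worst_case_write_amp(8))  # -> 5.12, versus ~41x for single-object inserts
```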
<p>Kangaroo also introduces RRIParoo, a low DRAM-overhead eviction policy for KSet
based on the processor eviction policy <a rel="noopener" target="_blank" href="https://people.csail.mit.edu/emer/papers/2010.06.isca.rrip.pdf">RRIP</a>.
At a high level, RRIParoo keeps one bit in DRAM per object in KSet
to represent whether an object has been requested since the object was last
written to flash.
When a set is rewritten, this bit is used to update a 3-bit recency value kept on flash per object.
Objects in a set are then ordered by their 3-bit recency value
and Kangaroo evicts the least valuable
objects to make room for objects coming from KLog.
Thus, RRIParoo allows an advanced eviction policy in KSet
while keeping a low DRAM overhead.</p>
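<p>The bookkeeping described above can be sketched as follows. The field widths come from the text; the exact promotion and aging rule below is a simplifying assumption, not RRIParoo's precise policy:</p>

```python
# One DRAM bit per object ("requested since the last set write") refreshes
# a 3-bit recency value stored on flash whenever the set is rewritten.
MAX_RECENCY = 7  # highest 3-bit value = most recently useful

def rewrite_set(flash_recency, dram_bits):
    """Merge DRAM hit bits into the on-flash 3-bit recency values."""
    for key in flash_recency:
        if dram_bits.get(key):
            flash_recency[key] = MAX_RECENCY   # re-referenced: promote
        elif flash_recency[key] > 0:
            flash_recency[key] -= 1            # not referenced: age
        dram_bits[key] = False                 # reset the DRAM bit
    return flash_recency

def eviction_order(flash_recency):
    """Least valuable (lowest recency) objects are evicted first."""
    return sorted(flash_recency, key=flash_recency.get)
```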
<p>Kangaroo provides further optimizations to reduce DRAM overhead and reduce misses, as explained in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">SOSP’21 paper</a>.
Together, these optimizations allow Kangaroo to overcome the limitations of log-structured caches and set-associative caches,
creating a flash cache that delivers on the goal of efficient caching for tiny objects.</p>
<h2 id="kangaroo-outperforms-other-flash-caches">Kangaroo outperforms other flash caches</h2>
<p>We evaluated Kangaroo on a 2 TB flash drive using a production trace from Facebook
under production DRAM and write rate constraints.
We also evaluated CacheLib’s default small object cache (SA), a set-assocative
cache that Facebook uses to serve its social graph,
and an optimistic version of a log-structured cache (LS) with a full in-DRAM
index.</p>
<p><img src="../kangaroo-results.png" alt="Kangaroo vs LS vs SA on production FB trace" /></p>
<figure-caption>
<p>Kangaroo reduces misses compared to LS by 56% and to SA by 29% over the last
2 days of the production FB trace.
LS’s high DRAM overhead means that it cannot index the entire flash drive.
Thus, it has a lower effective capacity, which increases its miss ratio.
SA’s high write amplification means that it has to rate limit its insertions
and greatly over-provision flash to prevent the flash device from
wearing out too quickly.
Kangaroo does not run into these issues and has a better eviction policy,
allowing it to outperform other flash caches.</p>
</figure-caption>
<p>We corroborated these results in a production shadow deployment at Facebook.
In addition, Kangaroo maintains its advantage when operated under different constraints,
such as different write rate limits, more or less available DRAM, different tiny-object workloads, and larger device capacities. More details on these results can be found in our <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">paper</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Kangaroo is a flash cache for billions of tiny objects that handles a wide range of DRAM and flash-write budgets.
Kangaroo leverages prior log-structured and set-associative designs, together with new techniques, to achieve the best of both designs.
Experiments using a trace from Facebook show DRAM usage close to the best prior DRAM-optimized design,
flash writes close to the best prior write-optimized design,
and miss ratios better than either.
Kangaroo shows that flash caches can support tiny objects,
an adversarial workload for DRAM usage and write amplification,
while maintaining flash’s cost advantage.</p>
<p>For more details about Kangaroo, check out our SOSP <a rel="noopener" target="_blank" href="https://www.youtube.com/watch?v=bJ4rqSrcVqs">presentation</a> and <a rel="noopener" target="_blank" href="https://dl.acm.org/doi/10.1145/3477132.3483568">paper</a>.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I want to thank my other collaborators on this work:
<a rel="noopener" target="_blank" href="https://bsb20.github.io/">Benjamin Berg</a> (CMU), <a rel="noopener" target="_blank" href="http://cmu.io/%7Ejtutuncu/">Julian Tutuncu-Macias</a> (CMU, now at Goldman Sachs), <a rel="noopener" target="_blank" href="https://jasony.me/">Juncheng Yang</a> (CMU), Sathya Gunasekar (Facebook), Jimmy Lu (Facebook),
<a rel="noopener" target="_blank" href="https://www.microsoft.com/en-us/research/people/daberg/">Daniel Berger</a> (Microsoft Research and University of Washington), <a rel="noopener" target="_blank" href="https://www.cs.cmu.edu/%7Ebeckmann/">Nathan Beckmann</a> (CMU), and <a rel="noopener" target="_blank" href="https://users.ece.cmu.edu/%7Eganger/">Greg Ganger</a> (CMU).
I would also like to give a special thanks to the <a rel="noopener" target="_blank" href="https://cachelib.org/">CacheLib</a> team at Facebook
as well as both Facebook and Twitter for sharing traces with us.</p>
Cases2Beds: A Case Study in Actionable Intelligence Highlights2022-01-06T00:00:00+00:002022-01-06T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2022/casestobeds/<p><em>This blog post is adapted from the <a rel="noopener" target="_blank" href="https://delphi.cmu.edu/blog/2021/03/10/cases2beds-a-case-study-in-actionable-intelligence/">Delphi blog</a>, originally published on March 10th, 2021. Again, thank you to the Allegheny County Health Department, the DELPHI Group, Chris Scott, and Roni Rosenfeld.</em></p>
<p>One of the <a rel="noopener" target="_blank" href="https://delphi.cmu.edu/">Delphi Group</a>’s goals is to create informative tools for healthcare organizations. Tools are only useful if the insights they provide can inform concrete actions. That is to say these tools must provide actionable intelligence. In early November 2020, as COVID case rates in Allegheny County continued to rise, the Delphi Group partnered with the Allegheny County Health Department (ACHD) to create such tools for investigating if hospitals located in the county would run out of hospital beds for COVID patients <a href="#f1">(Fig. 1)</a>.</p>
<div id="f1"></div>
<p><img src="./WPRDC-1.svg" alt="Image of the hospitalizations due to COVID-19 and new cases from positive PCR tests in Allegheny County. There are rapid upward trends in hospitalizations and positive cases from October 2020 to mid-December 2020. The maximum number of hospitalizations is about 600 and the minimum is less than 50 [in Oct 2020]. The maximum number of positive cases is over 7000 and the minimum is less than 1000 [in Oct 2020]." />
<strong>Fig. 1:</strong> Hospitalizations Due to COVID-19 and New Cases from Positive PCR Tests in Allegheny County (WPRDC Data <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#WPRDCLink">1</a></sup>)</p>
<p>Based on its planning, the ACHD needed at least a week to open emergency COVID facilities. If the emergency space wasn’t open and hospital beds ran out, mortality rates could soar. But, if we didn’t need the facility, that decision would have stretched already thin resources. Many of the hospitals in Allegheny County were in contact, but each hospital system only had visibility into its own facilities. We wanted to offer a more holistic picture of hospital resources for ACHD to assist in its planning.</p>
<h2 id="a-probabilistic-approach">A Probabilistic Approach</h2>
<p>To provide county-level intelligence on hospital bed usage, we developed Cases2Beds<sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#Cases2BedsLink">2</a></sup>.</p>
<p>To extrapolate bed utilization 1-2 weeks into the future, we needed to estimate:</p>
<ol>
<li>The probability that a person who tested positive for COVID-19 would require hospitalization</li>
<li>How many days after testing a person would be hospitalized</li>
<li>How long a person with COVID would stay in the hospital</li>
<li>The current number of positive COVID tests</li>
</ol>
<p>These values vary by demographic factors, most notably age (<a href="#f2">Fig. 2</a>), and to a lesser extent, sex and race.</p>
<div id="f2"></div>
<p><img src="./rates-1.svg" alt="Age Group Comparisons based on the Allegheny County COVID-19 Tableau. The age groups are 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70+, and unspecified. As the age group increases, the percent of those who were tested in that age group and were later hospitalized in that age group increases (the 70+ age group being > 5%)." />
<strong>Fig. 2:</strong> Age Group Comparisons based on the Allegheny County COVID-19 Tableau <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#ACHDDashboardLink">3</a></sup></p>
<p>We used public data from Allegheny County about the number of people tested, test positivity rate, and hospitalization rate, broken down by the aforementioned demographic factors.</p>
<p>We also acquired information for two critical parameters: </p>
<ul>
<li><strong>Offset</strong>: Offset is the number of days between the day of testing (called specimen collection date) and the first day of hospitalization. For example, if the test date were 5 days before hospitalization, the offset would be 5 days. Also, if the test date is the hospital admit date, the offset would be 0 days (or sometimes, if, for example, they are admitted at midnight, -1 or +1 days). Notably, the offset can be negative, meaning a person may have been tested some days or weeks after admission.</li>
<li><strong>Length of Stay</strong>: The length of stay is approximately how many days a person uses a bed in the hospital (± 1 day).</li>
</ul>
<p>Given the hospitalization rate, the offset distribution, and the length of stay distribution, we can simulate multiple futures for any given set of positive cases and their testing dates. Estimating the future given a set of probabilities is a common problem, and one possible approach is a Monte Carlo simulation. This process ultimately shows the expected distribution of the number of beds needed each day.</p>
<p>Monte Carlo simulations involve running a large number of scenarios based on a set of probabilities. The more scenarios run, the more accurate the model tends to be. For example, if you gave 1000 people one dart to throw at a dartboard, even though each throw may not be very good, you’d still be able to get a pretty good idea of where the bull’s eye is after 1000 throws. This is the same principle we applied for Cases2Beds – after many simulations, we had a good idea of how many beds might be needed in the next two weeks.</p>
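<p>The dart-throwing analogy translates directly into code. A simplified sketch of such a simulation, with made-up placeholder rates rather than the PHI-derived parameters ACHD used:</p>

```python
import random
from collections import Counter

def simulate_beds(num_cases, hosp_rate, offset_days, stay_days, trials=1000):
    """For each positive case, flip a hospitalization coin, draw an offset
    (test date to admission) and a length of stay, and tally occupied beds.
    Returns the mean number of beds occupied on each day across trials."""
    totals = Counter()
    for _ in range(trials):
        for _ in range(num_cases):
            if random.random() < hosp_rate:
                start = random.choice(offset_days)
                stay = random.choice(stay_days)
                for day in range(start, start + stay):
                    totals[day] += 1
    return {day: count / trials for day, count in sorted(totals.items())}

# e.g. 100 cases tested today, 5% hospitalized, admitted 2-6 days later,
# staying 3-10 days:
curve = simulate_beds(100, 0.05, range(2, 7), range(3, 11))
```

<p>Running more trials tightens the estimate, exactly as with the 1000 darts: each individual scenario is noisy, but the average converges on the expected bed demand per day.</p>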
<p>Our prototype Monte Carlo simulation was written in Python and had a runtime of a few minutes. However, because the simulation works best with probabilities derived from Protected Health Information (PHI), ACHD needed to run it privately and offline so there would be no data transmission. Thus, any type of web application (which would transmit data to our servers) was ruled out. Even asking ACHD to run our Python software on their machines fell into a grey area. However, Microsoft Excel was easy to use and supported by ACHD. So we converted Cases2Beds into a spreadsheet. </p>
<p>It was relatively straightforward to port the Python application to VBA macros for Microsoft Excel. However, those macros aren’t designed to run large simulations, and the time required to generate a model was far worse, bordering on unusable.</p>
<h2 id="an-alternative-to-monte-carlo-the-analytical-model">An Alternative to Monte Carlo: the Analytical Model</h2>
<p>As an alternative, we developed an analytical model for Microsoft Excel that offered a much faster run time than the full Monte Carlo simulation. The sheet has two tabs of inputs: constant parameters (first tab, static), and case counts (second tab, dynamic). </p>
<p>The analytical model had the same idea as the Monte Carlo simulation. Some fraction of individuals who test positive today will be hospitalized after a varying offset (from test date to admit date) and variable duration (from admit date to discharge date) based on their characteristics (see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#app">appendix</a>). Because these parameters can vary by region, anyone can change these values in spreadsheet tab 1.</p>
<p>The characteristics are:</p>
<ol>
<li>Age Group: (Most important) [unspecified, 0-9, 10-19, 20-29 … 70-79, 80+]</li>
<li>Sex: [unspecified, M, F]</li>
<li>Race: [unspecified, Black, White, Hispanic, Asian]</li>
</ol>
<p>And the parameters are:</p>
<ol>
<li>Hospitalization Rate</li>
<li>Offset Distribution Parameter Set: Parameters describing the number of days before someone who tests positive is hospitalized</li>
<li>Duration Distribution Parameter Set: Parameters describing the number of days someone will be in the hospital</li>
</ol>
<p>The second type of input is the daily positive case counts split by their traits. This is the input that the user actively changes on their end.</p>
<p>Behind the scenes, we take these parameters (first input tab) and generate Offset Fractions, which is the probability that a patient with given traits will occupy a bed for a duration k days after the specimen testing date. These Offset Fractions and the daily positive case breakdown (second input) give us the expected mean and variance up to 1 month in the future of the number of patients in the hospital per day based on the cases already seen (for details, see <a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#app">appendix</a>). This information can be used to generate plots like <a href="#f3">(Fig. 3)</a>. This graph isn’t to suggest that there won’t be any need for beds after February! It is just that based on the cases we know, very few people will be hospitalized for more than a month.</p>
<div id="f3"></div>
<p><img src="./C2B-1.svg" alt="Output of Cases2Beds using historical data until January 21st for Allegheny County Using Public Parameters. In the output of Cases2Beds, we see a peak in mid-December 2020 in the mean number of beds, followed by a stagnation period in mid-January 2021 and a rapid decline until the end of March 2021. The 25-75 Quantile and 5-95 Quantile are highlighted on the graph with the band having the largest width between mid-December 2020 and mid-January 2021. " />
<strong>Fig. 3:</strong> Output of Cases2Beds using historical data until January 21st for Allegheny County Using Public Parameters</p>
<p>If we assume independence between patients, the mean and variance calculations are exact. However, our quantile estimates are based on approximating the sum of independent binary variables, which is inaccurate near the tails. So the accuracy of the more extreme quantiles (95%+) depends on the number of cases present, which in practice makes them less trustworthy.</p>
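<p>A small numeric illustration of this point (toy numbers, not Allegheny County parameters): treating each case as an independent Bernoulli trial gives exact means and variances, while quantiles come from a normal-style approximation that weakens in the tails.</p>

```python
import math

def day_stats(case_groups):
    """case_groups: list of (num_cases, probability_of_occupying_a_bed).
    The day's bed count is a sum of independent Bernoulli variables,
    so its mean and variance are exact."""
    mean = sum(n * p for n, p in case_groups)
    var = sum(n * p * (1 - p) for n, p in case_groups)
    return mean, var

def approx_quantile(mean, var, z):
    """Normal approximation to a quantile; inaccurate near the tails."""
    return mean + z * math.sqrt(var)

# 200 low-risk cases (4% bed probability) plus 50 high-risk cases (10%):
mean, var = day_stats([(200, 0.04), (50, 0.10)])
print(mean)                               # 200*0.04 + 50*0.10 = 13 expected beds
print(approx_quantile(mean, var, 1.645))  # ~95th percentile estimate
```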
<h2 id="cases2beds-in-action">Cases2Beds in Action</h2>
<p>By the end of November 2020, we had a viable prototype Cases2Beds spreadsheet used by ACHD. Over the following months, we made various modifications with their feedback. For example, the ACHD staff did not have time to manually input case numbers. So, we were able to use the granular public data to give them estimates of future hospital utilization without any additional work on their end. </p>
<p>At the peak of bed utilization, hospital systems themselves increased their COVID bed utilization to 6x its October 2020 level. Fortunately, in Allegheny County, we never reached a point where demand for beds exceeded a somewhat elastic supply. In early January 2021, multiple organizations told us that the pandemic’s most acute problem had changed to vaccine distribution, and the number of COVID-19 beds needed dropped. Cases2Beds continues to act as an early warning system if the number of cases rises quickly.</p>
<div id="f4"></div>
<p><img src="./HHS-1.svg" alt="Numbers of staffed COVID beds over time vs. capacity from the HHS Protect Data. There was peak hospital utilization (7-day Average of COVID Adult Beds Used) in mid-December 2020 (over 800 beds avg.) before a steady decline until February 2021 (around 200 beds avg). " />
<strong>Fig. 4:</strong> Numbers of staffed COVID beds over time vs. capacity from the HHS Protect Data <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#HHSLink">5</a></sup>.</p>
<p>We were also able to show the efficacy of the spreadsheet to other health departments and hospitals by generating tailored, public parameters for offset and length of stay from different national public resources, like the Florida line-level COVID dataset <sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#FloridaLineLevelLink">4</a></sup>. </p>
<p>Based on these organizations’ feedback that they needed projections more than 2 weeks out, we started to use Cases2Beds as an input to hospital utilization forecasting models. Other inputs to the hospital forecasting model included current hospital bed utilization (from HHS Protect<sup><a href="https://www.cs.cmu.edu/%7Ecsd-phd-blog/2022/casestobeds/#HHSLink">5</a></sup>), how long current patients are likely to continue to be hospitalized, and how many new cases there will be in the near future. A preliminary evaluation of such a method shows decent predictive power when parameters are tailored to a location.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Cases2Beds was a case study about the realities of research institutions offering actionable intelligence in healthcare. While the Cases2Beds tool demonstrated reasonable predictive power, it was difficult to deploy it in a timely and actionable way. Our most significant challenges were data access and bureaucratic limitations to develop solutions at the granularity needed. </p>
<p>Research institutions can be effective partners to health organizations, but the next set of challenges of this pandemic–or the next–will require quick action. The tools we build now can set the stage for the future. </p>
<p>Thank you to the Allegheny County Health Department (especially Antony Gnalian, Dr. LuAnn Brink, and Dr. Debra Bogen) for their invaluable feedback, efforts, and shared interest in actionable intelligence.</p>
<p>Many members of the Delphi Group, including Sumit Agrawal, Katie Mazaitis, and Phil McGuinness, met regularly with the Allegheny County Health Department, provided data, edited this blog post, and investigated various solutions other than Cases2Beds.</p>
<h2 id="resources">Resources</h2>
<p>Please check out the <a rel="noopener" target="_blank" href="https://github.com/cmu-delphi/cases-to-beds-public">Cases2Beds Github Repo</a></p>
<p><a id="WPRDCLink">1.</a> <a rel="noopener" target="_blank" href="https://data.wprdc.org/dataset/allegheny-county-covid-19-tests-cases-and-deaths">WPRDC Allegheny County COVID dataset</a></p>
<p><a id="Cases2BedsLink">2.</a> <a rel="noopener" target="_blank" href="https://www.cmu.edu/delphi-web/cases2beds-v0.2.3.xlsm">Cases2Beds Worksheet</a></p>
<p><a id="ACHDDashboardLink">3.</a> <a rel="noopener" target="_blank" href="https://tableau.alleghenycounty.us/t/PublicSite/views/AlleghenyCountyCOVID-19Information_15912788131180/Landingpage?iframeSizedToWindow=true&%3Aembed=y&%3AshowAppBanner=false&%3Adisplay_count=no&%3AshowVizHome=no&%3Aorigin=viz_share_link">ACHD COVID-19 Dashboard</a></p>
<p><a id="FloridaLineLevelLink">4.</a> <a rel="noopener" target="_blank" href="https://experience.arcgis.com/experience/96dd742462124fa0b38ddedb9b25e429">Florida line-level COVID dataset</a></p>
<p><a id="HHSLink">5.</a> <a rel="noopener" target="_blank" href="https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u">HHS Protect Hospital Utilization Data</a></p>
<div id="app"></div>
<h2 id="appendix">Appendix</h2>
<p>To generate the Offset Fractions (OF(k|traits)), the probability that a patient with given traits will occupy a bed k days after the specimen testing date, we follow <strong>Alg 1</strong>. For a given set of traits, the Offset Fraction for day k, where k is between -10 and 31, is the sum, over every (offset, duration) pair whose stay covers day k, of the offset probability * duration probability * hospitalization rate. From these Offset Fractions, the mean/var of bed occupancy on a given day is given in <strong>Alg 2</strong>.</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>for o in (-10, 30): #This is the offset
</span><span> for d in (0, 40): #This is the duration of the stay
</span><span> for k in (o, o+d):
</span><span> if (k<31):
</span><span> OF(k|traits) += Offset(o|traits) * Duration(d|traits) * Hospitalization(traits)
</span></code></pre>
<p><strong>Alg 1</strong>: Generate Offset Fractions for a given set of traits</p>
<pre style="background-color:#393939;color:#dedede;"><code><span>for specimen_date and num_cases in case_inputs:
</span><span> for t in (-10, 30):
</span><span> p = OF(t|traits)
</span><span> beds_mean(spec_date + t) += num_cases * p
</span><span> beds_var(spec_date + t) += num_cases*p*(1-p)
</span></code></pre>
<p><strong>Alg 2</strong>: Generate Mean and Variances</p>
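<p>For readers who want to run the algorithms above, here is a self-contained Python rendering of <strong>Alg 1</strong> and <strong>Alg 2</strong> for a single trait group; the offset, duration, and hospitalization numbers in the example are toy placeholders:</p>

```python
from collections import defaultdict

def offset_fractions(offset_pdf, duration_pdf, hosp_rate):
    """Alg 1: OF[k] = P(a positive case occupies a bed k days after testing)."""
    OF = defaultdict(float)
    for o, p_o in offset_pdf.items():        # offset: test date -> admission
        for d, p_d in duration_pdf.items():  # duration: length of stay
            for k in range(o, o + d):        # days the bed is occupied
                if k < 31:
                    OF[k] += p_o * p_d * hosp_rate
    return OF

def bed_moments(case_inputs, OF):
    """Alg 2: per-day mean and variance of occupied beds."""
    mean, var = defaultdict(float), defaultdict(float)
    for spec_date, num_cases in case_inputs:
        for t, p in OF.items():
            mean[spec_date + t] += num_cases * p
            var[spec_date + t] += num_cases * p * (1 - p)
    return mean, var

# Toy example: everyone admitted 2 days after testing, staying 3 days,
# with a 5% hospitalization rate; 100 cases tested on day 0.
OF = offset_fractions({2: 1.0}, {3: 1.0}, hosp_rate=0.05)
mean, var = bed_moments([(0, 100)], OF)
```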
<p><strong>High-level Mathematical Formulation of the Model:</strong> </p>
<p>O<sub>r,l</sub>: The offset value for a given subset of the population r <span>∈</span> R where R := {race}x{gender}x{age group} for a given day l where -10 <span>≤</span> l <span>≤</span> 30. This <strong>pdf</strong> is derived from a piecewise function using segments of exponential distributions characterized by the offset parameters. </p>
<p>D<sub>r,k</sub>: The duration value for a given subset of the population r <span>∈</span> R for a given day k where 0 <span>≤</span> k <span>≤</span> 40. This <strong>pdf</strong> is derived from a piecewise function using segments of exponential distributions characterized by the duration parameters. </p>
<p>h<sub>r</sub>: The hospitalization rate for a given subset of the population r <span>∈</span> R where 0 <span>≤</span> h<sub>r</sub> <span>≤</span> 1 </p>
<p>c<sub>r,d</sub>: The number of cases for a given subset of the population r <span>∈</span> R on a particular specimen collection date d (ex: 5 cases with specimen collected on January 1st 2021).</p>
<p>$$OF_{r, j} = \sum_{l=-10}^{30} \sum_{k=0}^{40} \mathbb{I} ( l \leq j \leq l+k ) O_{r, l} * D_{r, k}*h_r $$
The offset fraction for a given subset of the population r <span>∈</span> R for a given delta j where -10 <span>≤</span> j <span>≤</span> 30.</p>
<p>$$ \mathbb{E}[\beta_i] = \sum_{d \in D}\sum_{r \in R}\sum_{j = -10}^{30} \mathbb{I} ( d+j = i) OF_{r, j}*c_{r, d} $$
The expected number of beds on date i, where i can start 10 days before the first case date and can end 30 days after the last case date (c<sub>r,d</sub>).</p>
Hello World2021-08-16T00:00:00+00:002021-08-16T00:00:00+00:00https://www.cs.cmu.edu/~csd-phd-blog/2021/helloworld/<h1 id="hello-world">Hello World!</h1>
<h2 id="hello">Hello</h2>
<p>This is the first post being made to the CSD PhD blog, testing out the
system. And so, indeed, hello world!</p>
<p>That’s really all there is to this
post. You don’t need to keep reading. I just have to fill this space so
that the preview of this post is filled up. That way when it renders
we can see what it looks like full of text. So this is just filler text,
explaining what is going on in a meta way. Feel free to ignore it and
just go about your business.</p>
<p>But it seems that you are in fact continuing to read. I wonder why.
Perhaps if I had filled this
section in with <em>Lorem ipsum</em> it would be a better signal that there are
no secrets to be gotten from reading this section. You are still reading though.
Just reading along. This is just a test post,
and here you are, taking all this time to read it. It’s just gonna be
filled with meaningless filler text. Well, that and some markdown
rendering tests. Most of them are coming up in
the next section. And since you keep on reading, you’ll certainly run
into them. That’s probably going to be even more bland to read. It’s
just going to repeat “Hello World” over and over again. But maybe
you just enjoy reading any words at all. You are, after all, still
reading this.</p>
<p>What, did you still think there was going to be some
secret in this section? Well, there’s not. Honestly its just filler text.
I know, there were these whole additional paragraphs, but they’re not special - just testing
the paragraph break rendering. And, yeah, it works. You saw the paragraph break, right?
Or are you just reading on without paying attention? Or, actually, did the site break?
Well, whatever, this test post can’t do anything about it. No, this post is just going to
go on, unread, moldering in a virtual corner. Well, almost unread. You are reading this.
I still don’t know why, but you’ve made it a long way through. Honestly, you could
probably go longer than I care to write for a post as meaningless as this.
Next time, I’m just going to use <em>Lorem ipsum</em> to fill space.</p>
<h2 id="world">World</h2>
<p><em>Hello World</em></p>
<p><strong>Hello World</strong></p>
<p><del>Hello World</del></p>
<p><code>Hello World</code></p>
<p>$$Hello World$$</p>
<blockquote>
<p>Hello World</p>
</blockquote>
<ul>
<li>Hello</li>
<li>World</li>
</ul>
<pre data-lang="c" style="background-color:#393939;color:#dedede;" class="language-c "><code class="language-c" data-lang="c"><span style="color:#fed6af;">#include </span><span style="color:#d6d6d680;"><</span><span style="color:#d68686;">stdio.h</span><span style="color:#d6d6d680;">>
</span><span>
</span><span style="color:#fffb9d;">int </span><span style="color:#fffd87;">main</span><span>() {
</span><span> </span><span style="color:#fffd87;">printf</span><span>(</span><span style="color:#d6d6d680;">"</span><span style="color:#d68686;">Hello World!</span><span style="color:#d6d6d680;">"</span><span>);
</span><span> </span><span style="color:#fed6af;">return </span><span style="font-weight:bold;color:#87d6d5;">0</span><span>;
</span><span>}
</span></code></pre>
<table><thead><tr><th align="right">Hello</th><th align="left">World</th></tr></thead><tbody>
<tr><td align="right">Hi</td><td align="left">Universe</td></tr>
<tr><td align="right">Greetings</td><td align="left">Earth</td></tr>
<tr><td align="right">Hey</td><td align="left">Everything</td></tr>
<tr><td align="right">Sup</td><td align="left">Realm</td></tr>
</tbody></table>
<p><a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program">Further Reading</a></p>
<h1 id="lorem-ipsum">Lorem Ipsum</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut rutrum nulla luctus tristique porttitor. Curabitur ut nibh non nulla dapibus facilisis. In maximus, nisi bibendum volutpat sagittis, enim ligula vehicula dolor, a dignissim est justo quis lorem. Nulla cursus sagittis magna facilisis imperdiet. Etiam non luctus arcu. Sed vulputate urna urna, sed convallis metus imperdiet et. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Praesent ut ornare nisl, sit amet congue ligula. Ut iaculis euismod dictum. Donec est arcu, porta nec sem vel, euismod mollis eros. Nulla consequat vel magna nec ornare. Pellentesque eu massa vel orci ornare ultrices nec in nunc.</p>
<p>Quisque tellus est, accumsan vitae ullamcorper a, maximus et ante. Mauris odio sem, bibendum fringilla ullamcorper tempor, molestie id dolor. Nulla sed tincidunt sapien. Duis vitae arcu sollicitudin, ullamcorper est vel, varius dolor. Nunc augue erat, congue ut tincidunt id, ornare a libero. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque purus diam, ornare sed suscipit a, euismod in justo.</p>
<p>Aliquam aliquam congue eros vel volutpat. Nunc ullamcorper vitae mi vehicula commodo. Phasellus ultricies a nunc a blandit. Integer tincidunt velit ut metus vehicula, vitae dictum eros sodales. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Cras consectetur suscipit maximus. Integer ut sem fringilla, interdum nulla sed, pretium nisi. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nam lobortis mollis leo, ut condimentum erat hendrerit sit amet. Donec vitae semper risus. Aenean sollicitudin tincidunt laoreet. Quisque velit tellus, vestibulum sed nisi et, pharetra feugiat nunc.</p>
<p>Morbi luctus lobortis orci id aliquam. Pellentesque viverra arcu nunc, sed ultricies lectus molestie quis. Praesent cursus dui elementum purus tempor vehicula. Nulla sed ligula blandit, tristique purus nec, consequat ex. Nunc et consequat ligula, nec vehicula nisi. Integer imperdiet nisl felis, nec porttitor quam maximus quis. Sed commodo lacus eget urna consequat gravida. Proin pellentesque mollis magna, eu consectetur nulla efficitur vitae. Nullam rhoncus faucibus sapien id gravida. Nam maximus pellentesque lorem, auctor vulputate quam porttitor sed. Praesent fringilla id eros sit amet lobortis. Donec ultrices pretium nisl sit amet euismod. Vestibulum consectetur euismod orci non fermentum. Nam congue sapien id interdum malesuada. Sed sit amet rhoncus magna, vel gravida sem. Praesent tincidunt consectetur gravida.</p>
<p>Ut consectetur, ex at sagittis blandit, libero magna dictum velit, nec ullamcorper erat diam nec urna. Curabitur tincidunt nisi risus, non pulvinar ipsum eleifend et. Pellentesque nec dolor non tellus efficitur mattis vitae sed neque. Suspendisse lectus nulla, mollis in fermentum ac, tempus a sapien. Suspendisse tempor consectetur porttitor. Aenean sed purus tempor, mollis lectus ac, tristique odio. Sed purus risus, tempus non risus aliquet, tincidunt aliquam eros. Vestibulum eget sollicitudin diam, porta rhoncus felis. Cras pellentesque vestibulum euismod. Phasellus placerat iaculis quam, quis suscipit elit semper ut.
Foundus theus secretus. Donec tempus sed justo nec semper. Vestibulum blandit velit quis risus lobortis, sit amet efficitur nulla scelerisque. Phasellus condimentum lectus non augue molestie, egestas auctor turpis porta. Mauris eget est a eros venenatis tempus. Duis lorem nisl, vulputate et neque et, ullamcorper ornare ipsum.</p>