Large language models (LLMs) have led to significant progress in various NLP tasks, with long-context models becoming more prominent for processing larger inputs. However, the growing size of the key-value (KV) cache required by Transformer architectures increases memory demands,…
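To make the KV-cache memory pressure concrete, here is a back-of-the-envelope sketch; the model shape, batch size, and context length below are illustrative assumptions, not figures from the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Size of the KV cache: one K and one V tensor per layer,
    each of shape [batch, heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical 32-layer model with 32 KV heads of dim 128, serving a
# batch of 8 requests at a 32k-token context in fp16:
gib = kv_cache_bytes(32, 32, 128, 32_768, 8) / 2**30
print(f"{gib:.0f} GiB")  # -> 128 GiB: the cache alone exceeds most single GPUs
```

At these (hypothetical) settings the cache dwarfs the memory of a single accelerator, which is exactly the pressure the abstract describes.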
Not only are GPUs expensive, but they are also too often idle. Bursty machine
learning (ML) inference requests leave gaps in time. Even when jobs are busy,
they may be compute- or memory-bound, leaving the other resource underutilized
(in space). The obvious solution to underutiliza…
Snowflake recently unveiled ArcticInference, the fastest speculative decoding solution for vLLM currently available. ArcticInference can reduce the end-to-end latency for LLM agent tasks by up to 4.5 times and can improve open-ended chat completion workloads by as much as 2.8 tim…
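ArcticInference's internals aren't shown here, but the accept/reject rule at the heart of any speculative decoding scheme can be sketched over a toy vocabulary; `p` and `q` below stand in for the target- and draft-model distributions and are my own illustrative names.

```python
import random

def sample(dist):
    """Draw an index from a discrete distribution given as a list of probs."""
    r, acc = random.random(), 0.0
    for i, pr in enumerate(dist):
        acc += pr
        if r < acc:
            return i
    return len(dist) - 1

def speculative_token(p, q):
    """Accept/reject core of speculative sampling: propose from the draft
    distribution q, accept with prob min(1, p/q), otherwise resample from
    the residual max(p - q, 0). The result is distributed exactly as p,
    so verification preserves the target model's output distribution."""
    x = sample(q)
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    return sample([r / z for r in residual])
```

The speedup comes from the draft model proposing several tokens cheaply while the target model verifies them in a single batched forward pass; the rule above is what makes the verified output statistically identical to decoding from the target model alone.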
In the last couple of years, a pressing question has started to permeate the mathematical community: how will mathematicians’ jobs coexist harmoniously with AI as it gets progressively better at mathematics? “A.I. Is Coming for Mathematics, Too” was the title chosen by the NY Tim…
Modern computer systems are fast—until they are not.
The memory channel bandwidth between DRAM and the CPU has lagged far behind CPU performance for more than three decades, and the gap between them (known as the Memory Wall) is wider than ever today.
This has b…
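One standard way to quantify what the Memory Wall does to a workload is the roofline model (my framing, not the text's); the peak numbers below are hypothetical.

```python
def attainable_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Roofline model: achieved performance is capped by either the compute
    roof or the memory-bandwidth slope.
    intensity: arithmetic intensity in FLOPs per byte moved over the channel."""
    return min(peak_gflops, intensity * peak_bw_gbs)

# Hypothetical machine: 1000 GFLOP/s peak compute, 50 GB/s DRAM bandwidth.
# A streaming daxpy-like kernel (~0.125 FLOP/byte) is memory-bound:
print(attainable_gflops(0.125, 1000, 50))  # 6.25 GFLOP/s, far below peak
# A well-blocked matmul (~100 FLOP/byte) hits the compute roof instead:
print(attainable_gflops(100, 1000, 50))    # 1000 GFLOP/s
```

The asymmetry is the point: low-intensity kernels see only a tiny fraction of the machine's nominal compute, and the fraction shrinks every year the bandwidth gap widens.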
Hyperparameter tuning is critical to the success of cross-device federated learning applications. Unfortunately, federated networks face issues of scale, heterogeneity, and privacy; addressing these issues with standard techniques (client subsampling and differential privacy) int…
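The "standard techniques" the abstract names combine client subsampling with differentially private aggregation; a minimal DP-FedAvg-style sketch of one such round (clipping plus Gaussian noise, with all parameter values illustrative and no claim about the paper's own method) looks like:

```python
import random

def dp_fedavg_round(client_updates, sample_frac, clip_norm, noise_std):
    """One round: subsample clients, clip each update's L2 norm to bound
    per-client sensitivity, average, then add Gaussian noise."""
    k = max(1, int(sample_frac * len(client_updates)))
    sampled = random.sample(client_updates, k)
    clipped = []
    for u in sampled:
        norm = sum(x * x for x in u) ** 0.5
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in u])
    dim = len(client_updates[0])
    avg = [sum(u[i] for u in clipped) / k for i in range(dim)]
    return [a + random.gauss(0.0, noise_std) for a in avg]
```

The tension the abstract points at is visible even here: the clipping norm, noise scale, and sampling fraction are themselves hyperparameters, and every tuning trial spends privacy budget.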
For years, interpretability research in machine learning has been guided by a “microscope” metaphor: if we zoom in far enough on neurons and circuits, perhaps we can reverse-engineer how models think. This bottom-up program, rooted in the search for mechanistic detail, has yield…
Imagine that you program NASA satellites that orbit the Moon. These satellites are programmed remotely from Earth, as the inaccessibility of space makes manual device maintenance (e.g., regular battery replacement) infeasible. Instead of batteries, these satellites are equipped…
Picture this. It’s a warm summer evening. After a long day, you and your roommate settle down on the couch to indulge
your respective Internet-fueled pastimes. You’ve pulled up undeniably adorable cat videos on YouTube, and your roommate
is catching up on the latest episode of he…
In this blog post, I will explain a series of three papers, the third of which gives the first construction of explicit constant-degree lossless vertex expanders.
[HMMP24] Explicit two-sided unique-neighbor expanders. With Theo McKenzie, Sidhanth Mohanty, Pedro Paredes.
STOC, 2024…
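For readers meeting the object in the title for the first time, the standard definition can be stated as follows (the notation is mine; $\varepsilon$ and $\delta$ are the usual parameters):

```latex
A $d$-regular graph $G = (V, E)$ is an \emph{$\varepsilon$-lossless
vertex expander} if there is a constant $\delta > 0$ such that every
subset $S \subseteq V$ with $|S| \le \delta |V|$ satisfies
\[
  |N(S)| \ge (1 - \varepsilon)\, d\, |S|,
\]
i.e., small sets have nearly as many distinct neighbors as the trivial
upper bound of $d|S|$ allows.
```

"Explicit" and "constant-degree" are the hard parts: random $d$-regular graphs achieve this with high probability, but constructing them deterministically was open until the third paper in this series.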