Adam Kohlhaas | Monday, June 2, 2025
A collaborative research project led in part by researchers in Carnegie Mellon University's School of Computer Science has received top honors at the 2025 Conference on Machine Learning and Systems (MLSys). The paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," was named Best Paper at this year's event, held May 12-15 in Santa Clara, California.
Ruihang Lai, a Ph.D. student in the Computer Science Department (CSD), and Tianqi Chen, an assistant professor in both CSD and the Machine Learning Department, were among the paper's lead contributors. They joined forces with the University of Washington's Zihao Ye, who is a visiting scholar at CMU, to develop scalable solutions for deploying large language models (LLMs) in real-time environments.
FlashInfer introduces a high-performance attention engine optimized to serve LLMs. The project began as a joint effort between CMU; the University of Washington's Allen School of Computer Science and Engineering; and OctoAI, an AI systems startup acquired by NVIDIA. Initially designed to improve inference throughput and flexibility, FlashInfer has evolved into a widely used open-source library with production deployments and active contributions from across the AI systems community.
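At its core, the workload FlashInfer accelerates is the attention computation performed at each decoding step, where a new query token attends over a growing cache of key and value vectors. The sketch below is a minimal NumPy illustration of that single-head scaled dot-product attention; it is not FlashInfer's API, and the shapes and names are illustrative assumptions only.

```python
# Illustrative sketch of the scaled dot-product attention that an LLM
# inference engine such as FlashInfer accelerates. This is NOT the
# FlashInfer API -- just the underlying math, in plain NumPy.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention.

    q: (n_q, d) query vectors; during autoregressive decoding n_q is
       typically 1 (the newly generated token).
    k, v: (n_kv, d) cached key/value vectors (the KV cache), which grow
       by one row per generated token.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v                  # (n_q, d) attended output

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))    # one new query token, head dim 64
k = rng.standard_normal((128, 64))  # 128 cached key vectors
v = rng.standard_normal((128, 64))  # 128 cached value vectors
out = attention(q, k, v)
print(out.shape)  # (1, 64)
```

In a serving system this computation runs for every layer, head, and request at every decoding step, which is why specialized kernels for it dominate inference throughput.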
"It is amazing to see FlashInfer grow from a collaborative research project to a community project being used in major open-source LLM inference engines," Chen said.
FlashInfer's source code and documentation are available through the project's repository. More information about MLSys is available on the conference website.
Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu