Google Scholar page
Gates 9011


• Invited attendee at Microsoft Systems Faculty Summit 2018.
• Program committee member for USENIX OSDI 2018.
• DART used in Top 5 WSDM 2018 Music Recommendation Challenge winner.
• Program committee member for SysML 2018.
• Joined CMU CSD as an assistant professor.
• Talk at Stanford Information Theory Forum.
• 'EC-Cache' accepted at USENIX OSDI 2016.
• Received Eli Jury Award 2016 for best thesis in the area of Systems, Communications, Control, or Signal Processing at EECS, UC Berkeley.
• 'Piggybacking framework' accepted to IEEE Transactions on Information Theory.
• 'Sparsifying storage codes for fast encoding' accepted at IEEE ISIT 2016.
• Invited talk at ITA Graduation Day 2016.
• Invited talk at Allerton 2015.
• Awarded Google Anita Borg Memorial Scholarship 2015. Thanks, Google!
• 'Distributed secret sharing' accepted to IEEE Journal of Selected Topics in Signal Processing.
• 'Reducing I/O cost in distributed storage codes' at USENIX FAST 2015. Chosen as the best paper of USENIX FAST 2015 by StorageMojo.

I am an Assistant Professor in the Computer Science Department at Carnegie Mellon University, with a courtesy appointment in the ECE department. At CMU, I am a part of the Parallel Data Lab (PDL). My research interests lie in the broad area of computer and networked systems with a current focus on big data systems and live video streaming. I am interested in the fault tolerance, scalability, and performance challenges that arise in all layers of the big data stack -- storage/caching, distributed computation, networking, and applications.

Research Overview

The bulk of my past research has focused on the storage/caching layer and in part on the application (specifically, machine learning) layer:

  • Storage/caching: My research focus here has been on fault tolerance, scalability, load balancing, and reducing latency in large-scale distributed data storage and caching systems. We designed coding theory based solutions that we showed are provably optimal. We also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2 showing significant benefits over the state-of-the-art. Our solutions are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco.

  • Machine learning: My research focus here has been on the generalization performance of a class of learning algorithms that are widely used for ranking. We designed an algorithm building on top of Multiple Additive Regression Trees, and through empirical evaluation on real-world datasets showed significant improvement across classification, regression, and ranking tasks. The new algorithm that we proposed is now deployed in production in Microsoft's data-analysis toolbox, which powers the Azure Machine Learning product.


I take a holistic approach towards solving real-world problems considering both theoretical and systems perspectives. I am interested in designing solutions rooted in fundamental theory as well as in building systems that employ these solutions and insights to advance the state-of-the-art.

Recent projects

Today's large-scale distributed storage systems comprise thousands of nodes storing hundreds of petabytes of data. In these systems, failures are common, and this mandates storing data in a redundant fashion to ensure reliability and availability. The most common way of adding redundancy is replication. However, replication is highly inefficient in utilizing storage capacity. With the rapid increase in the volume of data that needs to be stored, replication is quickly becoming unsustainable, and many distributed storage systems are now turning to erasure coding, which offers a storage-efficient alternative. While classical erasure codes are optimal in terms of storage utilization, they come with many drawbacks when applied to the distributed storage setting. For instance, they result in a significant increase in the usage of network bandwidth and device I/O. Furthermore, in big-data systems, the usage of codes has largely been limited to achieving space-efficient fault tolerance in disk-based storage systems, that is, for storing "cold" (less-frequently accessed) data.
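The trade-off described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 3-way replication and (10, 4) Reed-Solomon parameters are assumed examples rather than a description of any particular deployment, and the `Scheme` class is a hypothetical helper. It models replication as a code with one data block, and relies on the fact that a classical MDS code with k data blocks reads k surviving blocks to rebuild a single lost block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A redundancy scheme with k data blocks and r redundant blocks (hypothetical helper)."""
    name: str
    data_blocks: int       # k
    redundant_blocks: int  # r

    @property
    def storage_overhead(self) -> float:
        # Bytes stored per byte of user data.
        return (self.data_blocks + self.redundant_blocks) / self.data_blocks

    @property
    def failures_tolerated(self) -> int:
        # Replication and MDS codes both survive the loss of any r blocks.
        return self.redundant_blocks

    @property
    def blocks_read_per_repair(self) -> int:
        # Replication (k = 1) copies one surviving replica;
        # a classical MDS code reads k surviving blocks to rebuild one block.
        return self.data_blocks


# Assumed example parameters, chosen only for illustration.
replication = Scheme("3-way replication", data_blocks=1, redundant_blocks=2)
reed_solomon = Scheme("Reed-Solomon (10, 4)", data_blocks=10, redundant_blocks=4)

for s in (replication, reed_solomon):
    print(f"{s.name}: {s.storage_overhead:.1f}x storage, "
          f"tolerates {s.failures_tolerated} failures, "
          f"reads {s.blocks_read_per_repair} block(s) per repair")
```

Under these assumed parameters the erasure code stores less than half as much data as replication while tolerating more failures, but repairing one block touches ten times as many blocks. That repair cost is the network and I/O drawback noted above.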

We have constructed new erasure codes (i.e., designed new encoding and decoding algorithms) that provably overcome the limitations of classical erasure codes when applied to large-scale distributed storage systems, and we have designed and built systems that employ this new generation of storage codes in novel ways. We have also employed erasure coding in large-scale cluster caching systems for achieving load balancing under skewed popularity and for reducing latency.

EC-Cache: In-memory object caching in data-intensive clusters routinely faces the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. EC-Cache is a load-balanced, low-latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.

Hitchhiker: An erasure-coded storage system that reduces both network traffic and device I/O by around 25-45% during recovery with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to Reed-Solomon based systems. We have implemented Hitchhiker on top of Facebook's Hadoop Distributed File System (HDFS) and evaluated various metrics on the data-warehouse cluster in production at Facebook with real-time traffic and workloads. The underlying erasure code employed in Hitchhiker is designed by making use of our Piggybacking framework (see below). Hitchhiker has been incorporated into Apache Hadoop.

Piggybacking Code Design Framework: Piggybacking is a framework for designing practical distributed storage codes that are efficient in terms of device I/O and network bandwidth while not compromising on storage efficiency. Using the Piggybacking framework, we have constructed the best known codes for multiple settings. The basic idea behind this framework is to take multiple instances of existing codes and add carefully designed functions of the data of one instance to the other.
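A toy sketch of the basic idea, under loudly stated assumptions: it is not the construction used in Hitchhiker or in the published codes. It takes two instances, `a` and `b`, of a small code with 2 data nodes and 2 parity nodes, does all arithmetic modulo an assumed prime `P = 257`, and adds one symbol of instance `a` (the "piggyback") to the second parity of instance `b`. The function names and parameters are hypothetical.

```python
P = 257  # assumed small prime; all arithmetic is modulo P in this toy


def encode(a, b):
    """Encode two instances a = (a1, a2) and b = (b1, b2) onto 4 nodes.

    Each node stores one symbol of each instance. The second parity of
    instance b additionally carries the piggyback a1.
    """
    a1, a2 = a
    b1, b2 = b
    return [
        (a1, b1),                                       # data node 0
        (a2, b2),                                       # data node 1
        ((a1 + a2) % P, (b1 + b2) % P),                 # parity node 2
        ((a1 + 2 * a2) % P, (b1 + 2 * b2 + a1) % P),    # parity node 3 (piggybacked)
    ]


def repair_node0(nodes):
    """Rebuild data node 0 from 3 downloaded symbols.

    Without the piggyback, repair reads both symbols from 2 nodes (4 symbols).
    """
    b2 = nodes[1][1]             # symbol 1: b2 from data node 1
    parity_b = nodes[2][1]       # symbol 2: b1 + b2
    piggybacked = nodes[3][1]    # symbol 3: b1 + 2*b2 + a1
    b1 = (parity_b - b2) % P
    a1 = (piggybacked - b1 - 2 * b2) % P
    return (a1, b1), 3


def decode_from_parities(parity2, parity3):
    """Recover all data from the two parity nodes alone (both data nodes lost)."""
    s, u = parity2
    t, v = parity3
    a2 = (t - s) % P
    a1 = (s - a2) % P
    v = (v - a1) % P             # strip the piggyback once a1 is known
    b2 = (v - u) % P
    b1 = (u - b2) % P
    return (a1, a2), (b1, b2)


nodes = encode((7, 200), (45, 99))
print(repair_node0(nodes))
print(decode_from_parities(nodes[2], nodes[3]))
```

In this toy, instance `a` is decoded first and the piggyback is then subtracted, so the code still survives the loss of both data nodes, while repairing data node 0 reads three symbols instead of four. Only node 0 benefits here; a practical design would choose piggyback functions that help every node.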

TheSys (theory + systems) research group

I am fortunate to be advising and working with the following amazing students at CMU.

PhD advisees:
Jack Kosaian
Michael Rudow

Working with:
Saurabh Kadekodi (Prof. Greg Ganger's PhD student)
Devdeep Ray (Prof. Srini Seshan's PhD student)

Undergraduate advisees:
Amadou Ngom
Eliot Robson


(On Google Scholar)

* indicates equal contribution


Conference Papers

Workshop Papers

Journal Papers


Rashmi K. Vinayak is an assistant professor in the Computer Science department at Carnegie Mellon University. She received her PhD from the EECS department at UC Berkeley in 2016, and was a postdoctoral researcher at AMPLab/RISELab and BLISS. Her dissertation received the Eli Jury Award 2016 from the EECS department at UC Berkeley for outstanding achievement in the area of systems, communications, control, or signal processing. Rashmi is the recipient of the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012. She is also a recipient of the Facebook Fellowship 2012-13, the Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16. Her research interests lie in building high performance and resource-efficient big data systems based on theoretical foundations.