A bulk of my past research has focussed on the storage/caching layer and in part on the application (specifically, machine learning) layer:
- Storage/caching: My research focus here has been on fault tolerance, scalability, load balancing, and reducing latency in large-scale distributed data storage and caching systems. We designed coding theory based solutions that we showed are provably optimal. We also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2 showing significant benefits over the state-of-the-art. Our solutions are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco.
- Machine learning: My research focus here has been on the generalization performance of a class of learning algorithms that are widely used for ranking. We designed an algorithm building on top of Multiple Additive Regression Trees, and through empirical evaluation on real-world datasets showed significant improvement over classification, regression, and ranking tasks. The new algorithm that we proposed is now deployed in production in Microsoft's data-analysis toolbox which powers the Azure Machine Learning product.
I take a holistic approach towards solving real-world problems considering both theoretical and systems perspectives. I am interested in designing solutions rooted in fundamental theory as well as in building systems that employ these solutions and insights to advance the state-of-the-art.
Today's large-scale distributed storage systems comprise of thousands of nodes, storing hundreds of petabytes of data. In these systems, failures are common, and this mandates storing data in a redundant fashion to ensure reliability and availability. The most common way of adding redundancy is replication. However, replication is highly inefficient in utilizing storage capacity. With the rapid increase in the volume of data needed to be stored, replication is quickly becoming unsustainable, and many distributed storage systems are now turning to erasure coding
which offers a storage-efficient alternative. While classical erasure codes are optimal in terms of storage utilization, they come with many drawbacks when applied to the distributed storage setting. For instance, they result in siginificant increase in the usage of network bandwidth and device I/O. Furthermore, in big-data systems, the usage of codes has largely been limited to achieving space-efficient fault tolerance in disk-based storage systems, that is, for storing "cold" (less-frequently accessed) data.
We have constructed new erasure codes (i.e., designing new encoding and decoding algorithms) that provably overcome the limitations of classical erasure codes for application into large-scale distributed storage systems, and designing and building systems that employ these new generation of storage codes in novel ways. We have also employed erasure coding in large-scale cluster caching systems for achieving load balancing under skewed popularity and for reducing latency.
In-memory object caching in data-intensive clusters routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
An erasure-coded storage system that reduces both network traffic and device I/O by around 25-45% during recovery with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to Reed-Solomon based systems. We have implemented Hitchhiker on top of Facebook's Hadoop Distributed File System (HDFS) and evaluated various metrics on the data-warehouse cluster in production at Facebook with real-time traffic and workloads. The underlying erasure code employed in Hitchhiker is desgined by making use of our Piggybacking framework (see below). Hitchhiker has beeb incorporated into Apache Hadoop.
Piggybacking Code Design Framework:
Piggybacking is a framework for designing practical distributed storage codes that are efficient in terms of device I/O and network-bandwidth while not compromising on storage efficiency. Using the Piggybacking framework, we have constructed best known codes for multiple settings. The basic idea behind this framework is to take multiple instances of existing codes and add carefully designed functions of the data of one instance to the other.