Hadoop Distributed Files System (HDFS) & Parallel Virtual File System (PVFS): Data-Intensive File Systems Comparison
Cloud file systems, such as GFS and HDFS, are emerging as a key component in large scale computing systems that compute on massive amounts of data. In order to run these applications fast, computations are distributed over a large cluster. By exposing data layout, Cloud file systems enable Map-reduce and Hadoop to minimize the transfer of large amounts of data by shipping computation to nodes that store the data. Although it is commonly believed that high performance computing (HPC) systems use specialized infrastructure, that their parallel file systems are designed for vastly different data access patterns, and that they cannot support Internet services workloads efficiently, in fact, many HPC clusters use commodity compute, storage and network infrastructure. Moreover, parallel file systems have mature deployments and data managements, are cost effective, and have high performance. In this project I compared a parallel file system, developed for HPC, and a Cloud file system. Using PVFS as a representative for parallel file systems and HDFS as a representative for Cloud file systems, I configured a parallel file system into a distributed computing system, Hadoop, and tested performance with micro-benchmarks and macro-benchmarks running on a 4,000 core Internet services cluster, Yahoo!s M45. Once a number of configuration issues such as stripe unit sizes and application buffering sizes are dealt with, issues of replication, data layout and data-guided function shipping are found to be different, but supportable in parallel file systems. Performance of Hadoop applications storing data in an appropriately configured PVFS are comparable to those using a purpose built HDFS.
DiskReduce: RAID for Cloud file systems
Cloud file systems, such as GFS and HDFS, provide high reliability and availability by replicating data, typically three copies of each file while high performance computing file systems, such as Lustre, PVFS, and PanFS, achieve tolerance for the same numbers of concurrent disk failures using much lower overhead erasure encoding, or RAID schemes. In this project, I modify HDFS to include RAID6 without change to the HDFS client, reducing storage overhead in HDFS from 200% down to about 25%. My implementation writes three copies initially, using the existing HDFS client code, and asynchronously encodes data into RAID sets. Delaying encoding trades space for the performance optimizations possible when reading can be satisfied by any one of three nodes, and delaying encoding trades space for reducing the amount of work that is done during the encoding. Finally, triplication and RAID6 are both two failure tolerant, that is, no data is lost if only two disks are concurrently failed. But many more than two disks are likely to fail in large data-intensive clusters, so we analyze reliability in these systems more closely to better understand the impact of data loss resulting from lowering capacity overhead. An earlier version on this project has already stimulated Dhruba Borthakur, the Hadoop author at Facebook, to implement and release HDFS-RAID, a variant on these ideas.
techinal report -
Cloud distributed databases systems, such as HBase, HyperTable, Cassandra and many others, provide a lightweight database system to manage structured data for cloud applications. Although they typically do not support ACID transactions, they can support a wide range of cloud applications. Given the number of different emerging Cloud database systems and the diverse range of Cloud applications, an apples-to-apples comparison is hard and it is difficult to understand tradeoffs between systems. In this project, I am creating a benchmark suit to represent a diverse range of applications including a real machine learning application code. The goal of this benchmark suits is to highlight a set of important workloads for different types of applications to help developers optimize their systems and help users choose a system that suits their workloads. With such a benchmark suits, we hope to identify areas for improving the Cloud database for future research.
Scalable Metadata Service