Wittawat Tantisiriroj

Office: GHC 6219

E-mail: wtan....@cs.cmu.edu

I am currently a Ph.D. student in the Computer Science Department at Carnegie Mellon University. My research interests are in storage, databases, and optimization for computing in massively distributed environments. My advisor is Professor Garth Gibson. I am a member of the Parallel Data Lab (PDL) at CMU.

Research

Hadoop Distributed File System (HDFS) & Parallel Virtual File System (PVFS): Data-Intensive File Systems Comparison

Cloud file systems, such as GFS and HDFS, are emerging as a key component in large-scale computing systems that compute on massive amounts of data. To run these applications fast, computations are distributed over a large cluster. By exposing data layout, Cloud file systems enable MapReduce and Hadoop to minimize the transfer of large amounts of data by shipping computation to the nodes that store the data. It is commonly believed that high performance computing (HPC) systems use specialized infrastructure, that their parallel file systems are designed for vastly different data access patterns, and that they cannot support Internet services workloads efficiently. In fact, many HPC clusters use commodity compute, storage, and network infrastructure. Moreover, parallel file systems have mature deployments and data management, are cost effective, and have high performance. In this project, I compared a parallel file system, developed for HPC, with a Cloud file system. Using PVFS as a representative of parallel file systems and HDFS as a representative of Cloud file systems, I integrated a parallel file system into a distributed computing system, Hadoop, and tested performance with micro-benchmarks and macro-benchmarks running on a 4,000-core Internet services cluster, Yahoo!'s M45. Once a number of configuration issues, such as stripe unit sizes and application buffering sizes, are dealt with, issues of replication, data layout, and data-guided function shipping are found to be different but supportable in parallel file systems. The performance of Hadoop applications storing data in an appropriately configured PVFS is comparable to that of applications using a purpose-built HDFS.
paper - talk - poster
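
The idea of shipping computation to the nodes that store the data can be illustrated with a small sketch. This is not Hadoop's scheduler or the PVFS integration described above; it is a toy greedy assignment, and the names `assign_tasks`, `block_locations`, and `free_slots` are hypothetical. It only shows why a file system that exposes block locations lets a scheduler prefer local reads.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily place one task per block, preferring a node that stores the block.

    block_locations: dict mapping block id -> list of nodes holding that block
                     (hypothetical stand-in for layout exposed by the file system)
    free_slots:      dict mapping node -> number of free task slots
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, nodes in block_locations.items():
        # Nodes that both store the block and still have a free slot.
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fall back to any node with capacity; data must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    layout = {"b0": ["n1"], "b1": ["n2"], "b2": ["n1"]}
    print(assign_tasks(layout, {"n1": 1, "n2": 1, "n3": 1}))
```

In this toy run, the third block cannot be placed on the node that stores it because that node's only slot is taken, so its task runs remotely. Without layout information, every placement would be a guess of this kind.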

DiskReduce: RAID for Cloud file systems

Cloud file systems, such as GFS and HDFS, provide high reliability and availability by replicating data, typically keeping three copies of each file, while high performance computing file systems, such as Lustre, PVFS, and PanFS, tolerate the same number of concurrent disk failures using erasure coding, or RAID schemes, with much lower overhead. In this project, I modify HDFS to include RAID6 without changing the HDFS client, reducing storage overhead in HDFS from 200% down to about 25%. My implementation writes three copies initially, using the existing HDFS client code, and asynchronously encodes data into RAID sets. Delaying encoding trades space for two benefits: the performance optimizations possible when a read can be satisfied by any one of three nodes, and a reduction in the amount of work done during encoding. Finally, triplication and RAID6 are both two-failure tolerant; that is, no data is lost when at most two disks have failed concurrently. But many more than two disks are likely to fail in large data-intensive clusters, so we analyze reliability in these systems more closely to better understand the impact of data loss resulting from lowering capacity overhead. An earlier version of this project has already stimulated Dhruba Borthakur, the Hadoop author at Facebook, to implement and release HDFS-RAID, a variant on these ideas.
paper - technical report - talk - talk(implementation) - implementation - posters - website
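
The overhead figures quoted above follow from simple arithmetic, sketched here. The function names are hypothetical, and the stripe width of 8 data blocks plus 2 parity blocks is an assumption chosen because it yields exactly 25%; the text only says "about 25%" and does not state the group size.

```python
def replication_overhead(copies: int) -> float:
    """Extra storage, as a fraction of user data, when keeping `copies` full copies."""
    return float(copies - 1)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Extra storage, as a fraction of user data, for a stripe of data plus parity blocks."""
    return parity_blocks / data_blocks


# Triplication: two extra copies of every block.
print(f"3 copies:    {replication_overhead(3):.0%}")  # 200%

# RAID6-style stripe, ASSUMING 8 data blocks and 2 parity blocks per RAID set.
print(f"8+2 stripe:  {erasure_overhead(8, 2):.0%}")   # 25%
```

Both layouts survive any two concurrent disk failures, which is why the comparison is made at equal failure tolerance rather than equal capacity.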

Cloud Database

Cloud distributed database systems, such as HBase, HyperTable, Cassandra, and many others, provide a lightweight database system to manage structured data for cloud applications. Although they typically do not support ACID transactions, they can support a wide range of cloud applications. Given the number of different emerging Cloud database systems and the diverse range of Cloud applications, an apples-to-apples comparison is hard, and it is difficult to understand the tradeoffs between systems. In this project, I am creating a benchmark suite to represent a diverse range of applications, including real machine learning application code. The goal of this benchmark suite is to highlight a set of important workloads for different types of applications, to help developers optimize their systems and to help users choose a system that suits their workloads. With such a benchmark suite, we hope to identify areas for improving Cloud databases in future research.
paper - talk - poster - website

Scalable Metadata Service

talk - poster

Resources

Publications

On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson (CMU), Seung Woo Son, Samuel J. Lang, Robert B. Ross (ANL). Appears in the proceedings of the 24th Supercomputing Conference (SC 2011). November 12-18, 2011. Seattle, Washington, USA.
DiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing. Bin Fan, Wittawat Tantisiriroj, Lin Xiao, Garth Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-112. October 2011.
YCSB++: Benchmarking and Performance Debugging Advanced Features in Scalable Table Stores. Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio López, Garth Gibson (CMU), Adam Fuchs, Billie Rinaldi (NSA). Appears in the proceedings of the 2nd Symposium on Cloud Computing (SOCC 11). October 26-28, 2011, Cascais, Portugal.
DiskReduce: RAID for Data-Intensive Scalable Computing. Bin Fan, Wittawat Tantisiriroj, Lin Xiao, Garth Gibson. Appears in the proceedings of the 4th Petascale Data Storage Workshop (PDSW 09). November 15, 2009. Portland, Oregon, USA.
In Search of an API for Scalable File Systems: Under the Table or Above It? Swapnil Patil, Garth A. Gibson, Gregory R. Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiriroj, Lin Xiao. Appears in the proceedings of the 1st Workshop on Hot Topics in Cloud Computing (HotCloud 09). June 15, 2009. San Diego, California, USA.
Fast Log-based Concurrent Writing of Checkpoints. Milo Polte, Jiri Simsa, Wittawat Tantisiriroj, Garth Gibson, Shobhit Dayal, Mikhail Chainani, Dilip Kumar Uppugandla. Appears in the proceedings of the 3rd Petascale Data Storage Workshop (PDSW 08). November 17, 2008. Austin, Texas, USA.
Data-intensive File Systems for Internet Services: A Rose by Any Other Name .... Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-114. October 2008.

Talks

On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. Presented at the 24th Supercomputing Conference (SC 2011). November 2011.
YCSB++ Benchmarking Tool: Performance Debugging Advanced Features of Scalable Table Stores. Presented by Swapnil Patil at the 19th annual Parallel Data Lab Workshop & Retreat. November 2011.
Scalable Metadata Service in HDFS. Presented by Lin Xiao at the 19th annual Parallel Data Lab Workshop & Retreat. November 2011.
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. Presented at the 19th annual Parallel Data Lab Workshop & Retreat. November 2011.
RAIDTool: A First Step to RAID 6 in HDFS. Presented at the 18th annual Parallel Data Lab Workshop & Retreat. October 2010.
DiskReduce Analysis. Presented by Bin Fan at the 18th annual Parallel Data Lab Workshop & Retreat. October 2010.
DiskReduce: RAID for Data-Intensive Scalable Computing. Presented at the 4th Petascale Data Storage Workshop (PDSW), Supercomputing '09. November 2009.
DiskReduce: Making Room for More Data on DISCs. Presented at the 17th annual Parallel Data Lab Workshop & Retreat. November 2009.
Crossing the Chasm: Sneaking a Parallel File System into Hadoop. Presented at the 16th annual Parallel Data Lab Workshop & Retreat. November 2008.

Posters

Overcoming Metadata Bottlenecks: Scalable HDFS - 2011
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS - 2011
DiskReduce: RAIDing the Cloud - Grouping Choices, Encoding Options and Reliability - 2011
YCSB++: Benchmarking Advanced Features of BigTable-like Stores - 2011
Scaling the Metadata Service in DISC storage - 2011
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS - 2011
DiskReduce: RAIDing the Cloud - 2011
YCSB++: Benchmarking Advanced Features of Cloud Databases - 2011
DiskReduce: RAIDing the Cloud - 2010
DiskReduce: Implementation - 2010
Benchmarking Key Value Stores with Yahoo!'s YCSB - 2010
DiskReduce: Making Room for More Data on DISCs - 2009
Scalable Distributed Table Storage Experiments on OpenCloud - 2009
Tables and File Systems: Moving Into the Cloud - 2009
DiskReduce: Making Room for More Data on DISCs - 2009
Crossing the Chasm: Sneaking a Parallel File System into Hadoop - 2008
Log-structured Files for Fast Checkpointing - 2008
Network File System (NFS) in High Performance Networks - 2008

Misc

Resume - 2011
Undergraduate Homepage - 2007

Courses

15-740: Graduate Computer Architecture
15-780: Graduate Artificial Intelligence
15-744: Graduate Computer Networks
15-746: Advanced Storage Systems
15-712: Advanced Operating Systems and Distributed Systems
15-812: Semantics of Programming Languages
15-857A: Performance Modeling and Design of Computer Systems
15-750: Graduate Algorithms
15-441: Computer Networks (TA)
15-746: Advanced Storage Systems (TA)
