Notes on Parallel File Systems: HDFS & GFS
15-440, Fall 2012
Carnegie Mellon University
Randal E. Bryant

References:
  Ghemawat, Gobioff, Leung, "The Google File System," SOSP '03
  Shvachko, Kuang, Radia, Chansler, "The Hadoop Distributed File System," MSST '10
  Chang, Dean, et al., "BigTable: A Distributed Storage System for Structured Data," OSDI '06, TOCS '08
  Borthakur, et al., "Apache Hadoop Goes Realtime at Facebook," SIGMOD '11

Both GFS & HDFS were originally designed with similar goals:
* High throughput (latency less important)
  - Designed for batch processing jobs, e.g., MapReduce
* High capacity (large block size). Typical files > 1 GB
* High scalability. Handle > 10^7 files
* Reliability through replication. Treat failure as normal.

Although GFS was developed first, HDFS is much simpler, so we describe it first.

HDFS

Based on a "write once, read many" model:
* Each file has a single writer
* Originally: file fully written and closed before any reader given access
* Now: after a file is written, it can be reopened and appended to
  - Still guarantees that a file is never mutated

According to the 2010 paper:
* Yahoo's largest cluster has 3500 nodes
* Yahoo supplied 80% of the engineering effort on HDFS

Components:
* Clients
* NameNode. Single node containing all metadata about all files
* DataNodes. Set of nodes that store the actual file contents
* CheckpointNode. Creates disk image of NameNode state
* BackupNode. Creates shadow image of NameNode state

* File represented as a sequence of fixed-size blocks (64 or 128 MB).
  (Given a byte offset for a read, can immediately determine which block to read.)
* Each block has a unique block ID.
* Blocks are distributed across multiple DataNodes to enable parallel access.
* Blocks replicated (default 3X) to enable recovery when a DataNode fails.
  - Thought question: what is the advantage of replication over RAID?
* When a block is created, the NameNode decides its placement
  - Default: two replicas within a single rack, third on a different rack
  - Access time / safety tradeoff.

NameNode
* Metadata: information about each file, plus the set of block IDs and the locations of all replicas of all blocks
* Treats each FS operation as a transaction
  - Maintains all information in memory
  - Logs to EditLog file. Must write to the EditLog before a file operation is considered complete. Can group updates from independent clients.
  - (Possibly outdated) backup copy stored on disk.
  - Can bring this copy up to date by replaying entries in the EditLog
* Periodic checkpoint:
  - Done by CheckpointNode
  - Apply transactions in EditLog to disk image
  - Delete old parts of EditLog
  - Can do in background while other updates are occurring.
  - Does not store locations of replicas
* Tries to satisfy a given read request with a nearby DataNode
  - Same node / same rack / same system

DataNode
* Uses its local file system to store blocks
* Each block has two files
  - Actual data
  - Metadata: checksum, generation stamp (to detect stale copies)
* Has no understanding of overall FS semantics
* Periodically (every hour) sends a "block report" to the NameNode, containing information about all replicas it holds.
* More often, sends a heartbeat message to the NameNode. (Default: every 3 seconds)

Client
* Buffers file data as it is being written
* Creates another block only when a threshold is reached
* When it is time to push data to the DataNodes, sets up a pipeline from the client through each replica's DataNode.
* File not committed until closed.
* Read: retrieve the list of blocks and where their replicas are available
  - Subsequent reads involve direct interaction between client & DataNodes.
* API exposes block replica locations
  - E.g., so that MapReduce can schedule a task near a copy of its data.
  - (See the client sketch below.)
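The following is a minimal sketch of this client-side view, using the standard Hadoop Java API (org.apache.hadoop.fs). The path and payload are made up for illustration, and error handling is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);    // RPC connection to the NameNode
            Path path = new Path("/demo/log.txt");   // hypothetical file name

            // Write once: bytes are pipelined through each replica's DataNode;
            // the file is not committed until close().
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeBytes("some records\n");
            }

            // Read many: fetch the block -> replica-location map from the
            // NameNode, then read directly from a (preferably nearby) DataNode.
            FileStatus st = fs.getFileStatus(path);
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("block @ " + b.getOffset() + ", " + b.getLength()
                        + " bytes, replicas on " + String.join(", ", b.getHosts()));
            }
        }
    }

Note that getFileBlockLocations is exactly the hook MapReduce uses to place a task on (or near) a machine holding a replica of its input split.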
Interactions
* Clients & DataNodes communicate with the NameNode via RPC

Failures
* DataNode
  - Detected by NameNode when the DataNode fails to send heartbeat messages (default timeout: 10 minutes)
  - NameNode will decrement the replica counts for each of its blocks
  - Will cause re-replication to commence
* NameNode
  - Point of high vulnerability
  - Requires manual intervention: 1-3 hours of effort.
  - Must rebuild memory image of metadata
  - Must build map of replicas from DataNodes (poll DataNodes for their block info).
  - Recently: availability of the BackupNode means that only the replica maps must be gathered from the DataNodes (still requires ~20 minutes for failure recovery)

Some statistics

Facebook, 2010 (largest HDFS installation at the time)
  2000 machines, 22,400 cores
  24 TB / machine (21 PB total)
  Writing 12 TB / day
  Reading 800 TB / day
  25K MapReduce jobs / day
  65 million HDFS files
  30K simultaneous clients

NameNode biggest impediment to scaling
* Performance bottleneck
* Holds all data structures in memory
* Takes a long time to rebuild metadata
* Most vulnerable point for reliability

Current workaround
* Support multiple name spaces
  - Not ideal from application perspective
* Each name space has a separate NameNode
* Share DataNodes
  - Each data block labeled with a group ID

HDFS reliability at Yahoo, 2009
  Created 329M blocks on 10 clusters with a total of 20K DataNodes.
  650 lost blocks:
  * 533 orphans from dead clients
  * 98 where the user had specified that there should be only 1 replica
  * 19 lost due to software bugs (these are the more serious ones)

HDFS availability
  22 NameNode failures over 25 clusters in 18 months.
  Gives MTBF ~= 600 days; 1-3 hours to recover.
  Assuming 3 hours to recover, availability = 14,400 / (14,400 + 3) hours ~= 0.9998.
  (OK, but being out of commission for 3 hours is not good.)

GFS

Supports mutable files:
* Writes to arbitrary positions
  - Special case: single-writer append
* Record append
  - Multiple writers
  - Atomic, concurrent append
  - Each record will appear in the file at least once
  - May have duplicate records
  - File may also contain padding & record fragments
  - Useful for implementing log files
  - (See the reader sketch after this section for how clients cope with this.)
* Snapshots
  - Can quickly make a copy of any file
  - Uses copy-on-write, similar to AFS

Same general idea as HDFS (because the Hadoop developers read the papers about GFS):
* Data divided into "chunks" of 64 MB each
* Single master node, many chunk servers

Interesting features
* Clients get cached copies of metadata via leases (reduces load on master)
* Replicas migrate
* Log file from master replicated on a remote machine
* Automatic failover of master ("10s of seconds")

Supporting arbitrary writes
* One replica designated "primary" via a lease
* It determines the serialization of writes to the file

Supporting record appends
* If not enough room within the chunk, then pad the rest of the chunk and retry with a new chunk
* Possible to create a duplicate or fragment of a record if a failure occurs while writing
* Replicas may differ byte-for-byte, but each will contain at least one copy of every record, with records in the same order on all replicas.

Limitations of GFS
* Single master is a serious performance bottleneck
  - MapReduce: create many files at once
  - Now have systems with multiple master nodes, all sharing a set of chunk servers. Not a uniform name space.
* Large chunk size. Can't afford to make it smaller, since this would create more work for the master.
  - Mitigated by the move to BigTable
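Returning to record append: because it only guarantees at-least-once, atomic insertion, readers must tolerate duplicates, padding, and fragments; the GFS paper's suggestion is that writers embed checksums and unique record IDs so readers can filter. Here is a minimal Java sketch of such a reader, under an invented record framing (the field layout and names are assumptions for illustration, not GFS's actual format):

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.zip.CRC32;

    public class RecordAppendReader {
        // Assumed framing (invented for this sketch):
        // [int length][long recordId][payload bytes][long crc32-of-payload]
        public static void readAll(DataInputStream in) throws IOException {
            Set<Long> seen = new HashSet<>();
            while (true) {
                try {
                    int len = in.readInt();
                    if (len <= 0) continue;          // chunk padding: skip ahead
                    long id = in.readLong();
                    byte[] payload = new byte[len];
                    in.readFully(payload);
                    long storedCrc = in.readLong();

                    CRC32 crc = new CRC32();
                    crc.update(payload);
                    if (crc.getValue() != storedCrc) continue;  // fragment: discard
                    if (!seen.add(id)) continue;                // duplicate: discard
                    process(payload);
                } catch (EOFException eof) {
                    break;  // end of file, possibly a truncated trailing fragment
                }
            }
        }
        private static void process(byte[] payload) { /* application-specific */ }
    }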
Building on GFS: BigTable

GFS was originally designed to support high-throughput, batch operations, e.g., MapReduce jobs. BigTable was added later.

A "database"
* Information stored as records (rows), each containing a set of fields (columns).
  - Also support for maintaining multiple entries per cell, each identified by a timestamp
  - Each row or column identified by a string key
* Does not support relational operations
* Provides record-level atomicity (not general transactions)

Implementation
* On top of GFS
* Basic data unit: "tablet"
  - 100-200 MB
  - Stores a contiguous (by key) subset of the rows in a table
  - Also used to build high-radix trees
* Multiple "tablet servers"
* Single master
* Tablet represented in different ways:
  - Base level via "Sorted String Table" (SSTable)
    + Immutable key/value storage
    + Sorted by key
  - Updates accumulated in a log file
    + Periodically perform "minor compaction": generate an SSTable from the current log file, describing updates (including deletions) to the set of existing SSTables
    + Periodically perform "major compaction": compress the entire tablet into a single SSTable
  - Note that BigTable uses only immutable files (SSTables) and append-only files (log files). (See the compaction sketch at the end of these notes.)
* Table represented by a 3-level hierarchy of tablets
  - High-radix tree structure
  - Maximum capacity = 2^62 bytes (~4 exabytes)

Modifying file systems to support real-time applications

Both GFS & HDFS were originally conceived to support background tasks, e.g.:
* Generating a search index for a set of web pages (using MapReduce) every few hours/days
* Analyzing log/click data

Underlying assumptions
* Large file sizes
* Throughput more important than latency
* File system outage of 2+ hours acceptable

Now companies have applications that require immediate response
* Real-time updating of search data
* Personalized searching
* Email, messages

HDFS
* Driven by Facebook. Wants to use HDFS to store Facebook messages
* Created real-time failover for the NameNode
  - Operate two "Avatar" NameNodes
  - Primary operates as master
  - Standby kept up to date
    + Receives duplicate messages from the DataNodes
    + Continuously reads a copy of the primary's EditLog to keep its own state up to date
  - Can transfer control from primary to standby in a few seconds

GFS
* Replaced by "Colossus" ~2010. Sketchy information (Wired Magazine)
  - Eliminates the master node as a single point of failure
  - Reduces block size to 1 MB
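To make the log-plus-immutable-SSTable scheme from the BigTable section concrete, here is a toy Java sketch of a tablet's compaction logic. In-memory sorted maps stand in for on-disk SSTables, and all names are invented; this shows the minor/major compaction idea, not BigTable's actual implementation:

    import java.util.*;

    // Toy tablet: a mutable in-memory log (memtable) plus a stack of
    // immutable, key-sorted maps standing in for SSTable files.
    public class TabletSketch {
        private NavigableMap<String, String> log = new TreeMap<>();
        private final List<NavigableMap<String, String>> sstables = new ArrayList<>();

        public void put(String key, String value) { log.put(key, value); }
        public void delete(String key) { log.put(key, null); }  // deletion marker

        // Minor compaction: freeze the current log as a new immutable SSTable.
        public void minorCompaction() {
            sstables.add(Collections.unmodifiableNavigableMap(log));
            log = new TreeMap<>();
        }

        // Major compaction: merge all SSTables into one (newer entries win),
        // dropping deletion markers, so the tablet becomes a single SSTable.
        public void majorCompaction() {
            NavigableMap<String, String> merged = new TreeMap<>();
            for (NavigableMap<String, String> t : sstables)  // oldest first
                merged.putAll(t);                            // newer overwrites older
            merged.values().removeIf(Objects::isNull);       // drop deletions
            sstables.clear();
            sstables.add(Collections.unmodifiableNavigableMap(merged));
        }

        // Read: check the log first, then SSTables from newest to oldest.
        public String get(String key) {
            if (log.containsKey(key)) return log.get(key);
            for (int i = sstables.size() - 1; i >= 0; i--)
                if (sstables.get(i).containsKey(key))
                    return sstables.get(i).get(key);
            return null;
        }
    }

Note how every structure is either append-only (the log) or immutable once written (the SSTables), which is exactly what makes the scheme a good fit on top of GFS's append-oriented files.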