Notes on Parallel File Systems: HDFS & GFS
15-440, Fall 2011
Carnegie Mellon University
Randal E. Bryant

References:
* Ghemawat, Gobioff, Leung, "The Google File System," SOSP '03
* Shvachko, Kuang, Radia, Chansler, "The Hadoop Distributed File System," MSST '10
* Chang, Dean, et al., "BigTable: A Distributed Storage System for Structured Data," OSDI '06

Both GFS & HDFS are designed with similar goals:
* High throughput (latency less important)
  - Designed for batch processing jobs, e.g., MapReduce
* High capacity (large block size). Typical files > 1 GB
* High scalability. Handle > 10^7 files
* Reliability through replication. Treat failure as normal.

Although GFS was developed first, HDFS is much simpler, so we describe it first.

HDFS

Based on "write once, read many" model:
* Each file has a single writer
* File is fully written and closed before any reader is given access

Three components:
* Clients
* NameNode. Single node containing all metadata about all files
* DataNodes. Set of nodes that store the actual file contents

* File is represented as a sequence of fixed-size blocks (64 or 128 MB).
  (Given the byte offset of a read, can immediately determine which block to read.)
* Each block has a unique block ID.
* Blocks are distributed across multiple DataNodes to enable parallel access.
* Blocks are replicated (default 3X) to enable recovery when a DataNode fails.
* When a block is created, the NameNode decides its placement
  - Default: two replicas within a single rack, third on a different rack
  - Access time / safety tradeoff

NameNode
* Metadata: information about each file, plus its set of block IDs and the locations of all replicas of all blocks
* Treats each FS operation as a transaction
  - Maintains all information in memory
  - Logs to EditLog file
  - (Possibly outdated) backup copy stored on disk
  - Can bring this copy up to date by replaying the entries in the EditLog
* Periodic checkpoint:
  - Apply transactions in EditLog to the disk image
  - Delete old parts of the EditLog
  - Can be done in the background while other updates are occurring
  - Disk image does not store locations of replicas (those are learned from the DataNodes)
* Tries to satisfy a given read request with a nearby DataNode
  - Same node / same rack / same system

DataNode
* Uses its local file system to store blocks
* Each block is stored as two files
  - Actual data
  - Metadata: checksum, generation stamp (to detect stale copies)
* Has no understanding of overall FS semantics
* Periodically (every hour) sends a "block report" to the NameNode, containing information about all replicas it holds
* More often, sends a heartbeat message to the NameNode

Client
* Buffers file data as it is being written
* Creates another block only when the buffered data reaches a threshold
* When it is time to push data to the DataNodes, sets up a pipeline from the client through each replica's DataNode
* File is not committed until it is closed
* Read: retrieve the list of blocks and where their replicas are available (sketched below, after the Interactions list)
  - Subsequent reads involve direct interaction between the client & the DataNodes
* API exposes block replica locations
  - E.g., so that MapReduce can schedule a task near a copy of its data

Interactions
* Clients & DataNodes communicate with the NameNode via RPC
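As a concrete illustration of the read path just described, here is a minimal Python sketch. All names (get_block_locations, read_block, the distance field) are hypothetical stand-ins, not the actual HDFS client API; the point is that the NameNode is consulted only for metadata, and the data itself flows directly from a nearby DataNode.

    # Sketch of the HDFS client read path (hypothetical names, not the real API).
    BLOCK_SIZE = 64 * 2**20    # 64 MB blocks; 128 MB is also common

    def network_distance(replica):
        # Placeholder locality metric: 0 = same node, 1 = same rack, 2 = off-rack.
        return replica.distance

    def read(namenode, path, offset, length):
        # Fixed block size: the block index follows directly from the byte offset.
        block_index = offset // BLOCK_SIZE
        block_offset = offset % BLOCK_SIZE
        # One RPC to the NameNode returns the block ID and its replica locations.
        block_id, replicas = namenode.get_block_locations(path, block_index)
        # Prefer a nearby replica: same node, then same rack, then anywhere.
        datanode = min(replicas, key=network_distance)
        # The data transfer itself is directly between client and DataNode;
        # the NameNode is not on the data path.
        return datanode.read_block(block_id, block_offset, length)

The sketch assumes the request does not cross a block boundary; a real client would split such a read across blocks.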
Failures
* DataNode
  - Detected by the NameNode when the DataNode fails to send heartbeat messages
  - NameNode will decrement the replica counts for each of its blocks
  - This will cause re-replication to commence
* NameNode
  - Point of high vulnerability
  - Requires manual intervention: 1-3 hours of effort
  - Must rebuild memory image of metadata
  - Must rebuild map of replicas from the DataNodes

Some statistics

Facebook, 2010 (largest HDFS installation at the time):
* 2000 machines, 22,400 cores
* 24 TB / machine (21 PB total)
* Writing 12 TB / day
* Reading 800 TB / day
* 25K MapReduce jobs / day
* 65 million HDFS files
* 30K simultaneous clients

NameNode is the biggest impediment to scaling:
* Performance bottleneck
* Holds all data structures in memory
* Most vulnerable point for reliability

HDFS reliability at Yahoo, 2009
Created 329M blocks on 10 clusters with a total of 20K DataNodes.
650 blocks lost:
* 533 orphans from dead clients
* 98 where the user had specified that the file should have only 1 replica
* 19 lost due to software bugs (these are the more serious ones)

HDFS availability
22 NameNode failures over 25 clusters in 18 months.
Gives MTBF ~= 600 days, with 1-3 hours to recover.
Assuming 3 hours to recover: availability = MTBF / (MTBF + MTTR) = 14,400 hours / 14,403 hours ~= 0.9998.
(OK, but being out of commission for 3 hours is not good.)

GFS

Supports mutable files:
* Writes to arbitrary positions
  - Special case: single-writer append
* Record append
  - Atomic, concurrent append
  - Each record will appear in the file at least once
  - May have duplicate records
  - File may also contain padding & record fragments
  - Useful for implementing log files
* Snapshots
  - Can quickly make a copy of any file
  - Uses copy-on-write, similar to AFS

Same general idea as HDFS:
* Data divided into "chunks" of 64 MB each
* Single master node, many chunkservers

Interesting features
* Clients get cached copies of metadata via leases (reduces load on master)
* Replicas migrate
* Log file from the master is replicated on a remote machine
* Automatic failover of the master ("10s of seconds")

Supporting arbitrary writes
* One replica is designated "primary" via a lease
* It determines the serialization of writes to a file

Supporting record appends
* If there is not enough room within the chunk, pad the rest of the chunk and retry with a new chunk
* Possible to create a duplicate or a fragment of a record if a failure occurs while writing
* Replicas may hold different versions, but all of them will have at least one copy of each record, in the same order

Limitations of GFS
* Single master is a serious performance bottleneck
  - MapReduce: may create many files at once
  - Now have systems with multiple master nodes, all sharing a set of chunkservers. Not a uniform name space.
* Large chunk size. Can't afford to make it smaller, since that would create more work for the master.
  - Mitigated by the move to BigTable
* Now used for tasks that require low latency: Gmail, etc.

Building on GFS: BigTable

GFS was originally designed to support high-throughput, batch operations, e.g., MapReduce jobs.
BigTable was added later. A "database":
* Information stored as records (rows), each containing a set of fields (columns)
* Does not support relational operations
* Provides record-level atomicity

Implementation
* On top of GFS
* Basic data unit: "tablet"
  - 100 MB - 200 MB
  - Stores a subset of the rows in a table
  - Also used to build high-radix trees
* Multiple "tablet servers"
* Single master
* Tablet represented in different ways:
  - Base level via "string to string table" (SSTable)
    + Immutable key/value storage
    + Sorted by key
  - Updates accumulated in a log file
    + Periodically perform "minor compaction"
      - Generate an SSTable from the current log file
      - Describes updates (including deletions) to the set of existing SSTables
    + Periodically perform "major compaction"
      - Compress the entire tablet into a single SSTable
  - Note that this uses only immutable files (SSTables) and append-only files (log files); a small sketch of this structure follows the list.
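To make the tablet representation concrete, here is a minimal, in-memory Python sketch of the SSTable + log + compaction idea. It is an illustration under simplifying assumptions (everything lives in memory, values are never legitimately None), not BigTable's actual on-disk format; all class and method names are made up.

    # Sketch of the SSTable / log / compaction structure (simplified, hypothetical).
    import bisect

    TOMBSTONE = object()   # marks a deletion in the log and in SSTables

    class SSTable:
        """Immutable map from keys to values, sorted by key."""
        def __init__(self, items):
            self.keys = sorted(items)
            self.vals = [items[k] for k in self.keys]
        def get(self, key):
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                return self.vals[i]            # may be TOMBSTONE
            return None

    class Tablet:
        def __init__(self):
            self.log = {}                      # stands in for the append-only log file
            self.sstables = []                 # immutable once created; newest last
        def put(self, key, value):
            self.log[key] = value
        def delete(self, key):
            self.log[key] = TOMBSTONE          # deletion is just another update
        def get(self, key):
            # Check the log first, then SSTables from newest to oldest.
            for layer in [self.log] + self.sstables[::-1]:
                v = layer.get(key)
                if v is not None:
                    return None if v is TOMBSTONE else v
            return None
        def minor_compaction(self):
            # Freeze the current log's contents into a new immutable SSTable.
            if self.log:
                self.sstables.append(SSTable(self.log))
                self.log = {}
        def major_compaction(self):
            # Merge log + all SSTables into one SSTable, dropping deleted keys.
            merged = {}
            for table in self.sstables:        # oldest to newest, so newer wins
                for k in table.keys:
                    merged[k] = table.get(k)
            merged.update(self.log)
            live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
            self.sstables = [SSTable(live)]
            self.log = {}

Note that nothing mutable is ever overwritten in place: minor compaction only creates a new SSTable, and major compaction replaces the whole set with a single new one, which matches the observation above that the tablet uses only immutable and append-only files.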