===========================================================================
SOSP 2009 Review #188A
Updated Wednesday 3 Jun 2009 3:36:37pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

The authors investigate the question of whether one can build a cost-effective cluster for data-intensive applications (e.g., reading small random objects such as thumbnail images and wall posts). The answer is yes, given one uses wimpy nodes: low-power, efficient embedded CPUs paired with flash storage. The paper nicely motivates why a Fast Array of Wimpy Nodes is the best approach for combining high performance with power efficiency (i.e., the power consumption of traditional CPUs cannot be scaled down well).

Their system has two main components, FAWN-DS and FAWN-KV. The authors begin by describing their single-node data store, FAWN-DS. This is a log-structured key-value store supporting store, lookup, and delete. This design makes sense for flash, given its fast random reads and slow random writes. Each physical node in FAWN is responsible for multiple virtual nodes, each of whose data objects are stored in their own file; operations exist to split and merge logs (to handle vnodes entering and exiting the system) and to compact them (for garbage collection). FAWN-KV, the key-value system, is built from known techniques: consistent hashing, front-end nodes that manage the vnode membership lists, and chain replication.

The evaluation occupies a significant portion of the paper and appears fairly solid. They start by showing that FAWN-DS obtains close to the random-read performance of ext2, that FAWN-DS comes close to the peak sequential write performance for bulk store operations, and that concurrent puts to different files (i.e., vnodes) are not prohibitively bad. The authors then focus on get queries for FAWN-KV, since they perform worse than puts. They show that a 21-node system obtains 80% of the sustained rate that a single FAWN-DS node could handle. They measure the power consumption during puts/gets and report the query rates while a split is being performed. The authors conclude with a comparison of which configurations (traditional CPUs vs. wimpy nodes, paired with disk, SSD, or DRAM) are best for TCO, depending upon the mix of dataset size and query rate.

Strengths of paper:
+ The authors investigate an interesting problem domain, looking at power consumption and TCO
+ Really like that they have implemented a system with 21 nodes
+ Reasonable evaluation (if not perfect)
+ Interesting, fun paper to read; well written

Weaknesses:
- No grand new conceptual idea (but the combination of ideas is fine)
- No compelling end-to-end application to fully demonstrate FAWN
- I'd like to see evaluation of their full functionality, including delete operations, garbage collection, and cleaning/compaction

===== Comments for author =====

- You say the key space is allocated to front-ends by a single management node, which you envision replicating using a small Paxos cluster. I hope you implement this.
- I don't think you need to devote quite so much space to your description of the FAWN-KV ring (e.g., Figures 4 and 5), given how similar your approach is to previous systems.
- I would have liked more description of the three flash devices shown in Figure 8. What are the prices of the three? Power consumption? I would like more of a conclusion to the Put Speed experiments. It isn't clear to me that the SanDisk Extreme IV is the appropriate choice. Why was it chosen? What values do you expect for the number of database files in a typical FAWN-KV system? I'd also like to see the graph extended to 256 files for all three lines.
- As stated above, an evaluation of deletes and cleaning would be appreciated.
- I assume you are rate-limiting the traffic for the split in Section 4.3? Please discuss this and the trade-off against the amount of time the split takes.
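To make the FAWN-DS design that Review #188A summarizes concrete, here is a minimal sketch of a log-structured key-value store: an in-memory index over an append-only data log, with deletes recorded as tombstones and reclaimed by compaction. It illustrates the general technique only; the names and record layout are invented, and the paper's FAWN-DS differs in its specifics (e.g., the compact in-DRAM hash entries discussed in Reviews #188G and #188H, and per-vnode files with split/merge operations).

    # Minimal sketch of a log-structured key-value store in the spirit of the
    # FAWN-DS described above: an append-only data log (cheap sequential writes,
    # as on flash) plus an in-memory index mapping each key to the offset of its
    # latest entry. Deletes append a tombstone; compaction rewrites live entries.
    # Names and record layout are illustrative, not the paper's.

    import os
    import struct

    class LogStore:
        def __init__(self, path):
            self.path = path
            self.index = {}                 # key -> offset of latest record in the log
            self.log = open(path, "a+b")    # append-only data log

        def _append(self, key, value):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            kb = key.encode()
            vb = b"" if value is None else value.encode()
            vlen = -1 if value is None else len(vb)   # -1 marks a delete tombstone
            self.log.write(struct.pack(">ii", len(kb), vlen) + kb + vb)
            self.log.flush()
            return offset

        def put(self, key, value):
            self.index[key] = self._append(key, value)

        def delete(self, key):
            self._append(key, None)          # tombstone persists until compaction
            self.index.pop(key, None)

        def get(self, key):
            if key not in self.index:
                return None
            self.log.seek(self.index[key])
            klen, vlen = struct.unpack(">ii", self.log.read(8))
            self.log.read(klen)              # skip the key
            return self.log.read(vlen).decode() if vlen >= 0 else None

        def compact(self):
            # Garbage collection: rewrite only live entries into a fresh log.
            live = {k: self.get(k) for k in list(self.index)}
            self.log.close()
            os.remove(self.path)
            self.index.clear()
            self.log = open(self.path, "a+b")
            for k, v in live.items():
                self.put(k, v)

    store = LogStore("/tmp/fawn_ds_sketch.log")
    store.put("thumb:123", "jpeg bytes")
    print(store.get("thumb:123"))            # -> jpeg bytes
    store.delete("thumb:123")
    store.compact()

Splitting and merging per-vnode logs, and rebuilding the index by scanning the log on recovery, are omitted from this sketch.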
===========================================================================
SOSP 2009 Review #188B
Updated Monday 30 Mar 2009 10:49:16pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

This paper proposes a hierarchical key-value store (FAWN) built out of many low-power "wimpy" nodes with flash as the backing store. The system is based on the assumption that random-access reads dominate the workload, such that highly parallel flash access leads to a higher query servicing rate. Furthermore, because it is built out of wimpy, low-power nodes, FAWN's queries per joule are much higher than those of a desktop-class, disk-based solution.

The paper has several strengths: it's a complete design, including reliability through chain replication, dynamically adding and removing nodes, a hierarchy for breaking up the key space, and a nice demonstration of how a goal of energy efficiency can completely change system design considerations.

The paper has four weaknesses; none is fatal. The first is that FAWN is an application of solid system design principles to an interesting problem, rather than novel design ideas. The second is that the description of a join operation in 3.4.4 doesn't seem quite right; given the depth and soundness of descriptions elsewhere, I'm willing to assume this is a miscommunication, and even if it weren't, fixing it would not be a big change. The third is that the paper does not evaluate the most expensive operation in FAWN, deletion, making it hard to judge the tradeoffs chosen. Finally, the paper itself is a bit rough; many of the design decisions in 3.3 are based on particular hardware properties not described until Section 4, the fact that FAWN is designed for read-heavy workloads doesn't come up until Section 4, and the desire for fault tolerance (through replication chains) isn't at all mentioned in the introduction.

===== Comments for author =====

This is a good paper: still a bit rough around the edges, but a good paper. What's really nice is that it examines a simple, albeit narrow, problem -- read-dominated key-value stores of small data items -- and shows how energy efficiency can completely change the design space.

I found Section 3.3 amazingly frustrating. There are many assumptions about the relative sizes of different hardware components in the tradeoffs made, yet the implementation platform isn't described until Section 4. For example, the fact that the Hash Index is kept in RAM assumes that the ratio of flash to RAM is such that this works. If you changed any of the quantities significantly (e.g., storing 32-byte values, using 64GB of flash, having only 16MB of RAM), then this design would break. Also, using more than 4GB of flash would require a larger pointer into the Data Log. It would be much better if the hardware were described first, with the caveat that different values might lead to slightly different structures.

I was a bit confused by the sentence in 3.3 "uses standard hash chaining to pick another bucket." I assumed that by hash chaining this meant a linked list in a bucket. But "pick another bucket" suggests rehashing. Rehashing, however, runs into performance issues once the store is mostly full.

The description of a join in 3.4.4 seems to have a problem. It goes like this:
1. C1 starts join.
2. E1 sends log to C1.
3. C1 is inserted between B1 and D1.
4. C1 starts receiving updates from B1.
5. C1 requests entries which E1 received between 2 and 4.

My concern has to do with an assumption about the latency of in-flight writes between 3 and 5. For example, it might go like this:
2a. E1 sends log to C1.
2b. B1 sends update to D1.
3. C1 is inserted between B1 and D1.
4. C1 starts receiving updates from B1.
5. C1 requests entries from E1, gets them all.
6. D1 forwards the update to E1.
There are, of course, a couple of simple solutions.

Table 1: the variation between QPS and MB/s is annoying. If all operations are 1 kB, use kB/s?

It's nice that the paper tries to look at TCO. As costs might change quickly, the exact values here aren't really a strong statement of FAWN's financial tradeoffs in the long term, but this is a good demonstration that the idea isn't financially crazy today.

An examination of how FAWN performs over time, as entries are deleted and the log is cleaned, would be nice. But this isn't critical.
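The flash-to-RAM sizing concern in Review #188B above (echoed by Review #188G's memory-overhead arithmetic later in these reviews) comes down to a one-line calculation: every object on flash needs an in-DRAM index entry, so the index grows with flash capacity divided by average object size. The sketch below works through that arithmetic; the roughly 6 bytes per entry is quoted in the reviews, while the over-allocation factor and the flash/object-size combinations are illustrative assumptions.

    # Back-of-envelope sketch of the flash-to-DRAM constraint raised above:
    # every object stored on flash needs an in-DRAM index entry, so
    #     entries     = flash_bytes / avg_object_size
    #     dram_needed = entries * bytes_per_entry * overallocation
    # The ~6 bytes/entry figure is quoted in the reviews (#188B, #188G); the
    # over-allocation factor and the size combinations below are illustrative.

    def index_dram_bytes(flash_bytes, avg_object_size, bytes_per_entry=6, overalloc=1.5):
        """DRAM needed for the hash index, with slack to keep collisions rare."""
        entries = flash_bytes // avg_object_size
        return int(entries * bytes_per_entry * overalloc)

    GB, MB = 2**30, 2**20
    for flash, obj in [(4 * GB, 256), (4 * GB, 1024), (32 * GB, 256), (32 * GB, 3072)]:
        need = index_dram_bytes(flash, obj)
        print(f"flash = {flash // GB:2d} GB, avg object = {obj:4d} B "
              f"-> index needs ~{need / MB:6.0f} MB of DRAM")

    # With ~256 MB of DRAM per node (the figure Review #188G assumes), 32 GB of
    # flash only fits if average objects are a few KB, and flash larger than
    # 4 GB also needs log offsets wider than 4 bytes -- exactly the sensitivity
    # the reviewers describe.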
===========================================================================
SOSP 2009 Review #188C
Updated Monday 30 Mar 2009 11:19:45pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Z. I am not an expert. My evaluation is that of an informed outsider.

===== Paper summary & rationale for evaluation =====

Summary:

FAWN is a new cluster architecture that couples low-power, efficient embedded CPUs with flash storage to provide efficient, fast, and cost-effective access to large, random-access data. Such a workload is typical for some existing datacenter systems like Dynamo (Amazon), Voldemort (LinkedIn), and memcached (Facebook). The authors use consistent hashing (as in some structured P2P systems) for lookup (but not routing) and use log-structured storage to cope with flash's expensive random writes.

Strengths:

The authors identified a well-motivated problem scenario in the datacenter computing environment. The authors have made great observations: that power consumption is a significant part of the total cost of ownership, and that neither disk nor DRAM is cost-effective. Aimed at optimizing power consumption without sacrificing performance, the authors have set off to build a very interesting system that uses low-power embedded CPUs together with flash storage. The authors constructed the FAWN system nicely. They cleverly adopted existing techniques that are well suited to a well-motivated problem scenario. The authors have demonstrated the benefits of the approach through an interesting evaluation of both performance and total cost of ownership.

Weaknesses:

An open question (which is acknowledged by the authors) is how such a system would cope with a variety of workloads, such as that of MapReduce. Would FAWN still have the same kind of power-efficiency advantage over traditional cluster constructions?

===========================================================================
SOSP 2009 Review #188D
Updated Monday 27 Apr 2009 9:42:54am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===========================================================================
SOSP 2009 Review #188E
Updated Thursday 4 Jun 2009 8:32:12am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Z. I am not an expert. My evaluation is that of an informed outsider.

===== Paper summary & rationale for evaluation =====

+ Complete system
+ First system that builds a key-value store on flash

Others have made the observation that flash is much lower power than hard drives and have built file systems around flash. The paper should compare with other work that uses flash in embedded systems:
- ELF (SenSys 2004), which also uses a log-structured technique to build a flash file system.
- Capsule (SenSys 2006), a log-structured object store for flash.

===========================================================================
SOSP 2009 Review #188F
Updated Tuesday 12 May 2009 11:31:58am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

Nothing struck me as fundamentally new in this paper---but still, a well-thought-out system and a well-written paper that provides enough detail to understand the system without being bogged down in details. The intellectual value of the paper for me comes from appreciating how the design elegantly brings together all the different technical pieces, even as each of them is not per se surprising. I had fun reading this paper.

===== Comments for author =====

Typos in the second column of page 5:
- line 4: "so that the system *was* robust" --- perhaps "would be"?
- line 6: "what keys it are responsible..."
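Reviews #188A and #188C both note that FAWN-KV's lookup rests on consistent hashing over virtual nodes, as in structured P2P systems. For readers who want the idea in code, here is a minimal, generic consistent-hashing sketch; the class name, hash choice, and vnode count are arbitrary illustrations of the standard technique, not FAWN-KV's implementation or its ring-maintenance and chain-replication machinery.

    # Generic consistent-hashing sketch: each physical node owns several virtual
    # nodes ("vnodes") placed on a hash ring; a key is served by the first vnode
    # clockwise from the key's hash, so adding or removing a node only remaps the
    # key ranges adjacent to its vnodes. Illustrative only -- not FAWN-KV's code.

    import bisect
    import hashlib

    def ring_hash(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    class Ring:
        def __init__(self, nodes, vnodes_per_node=4):
            self.points = sorted(
                (ring_hash(f"{node}#{v}"), node)
                for node in nodes
                for v in range(vnodes_per_node)
            )
            self.positions = [h for h, _ in self.points]

        def owner(self, key: str) -> str:
            i = bisect.bisect_right(self.positions, ring_hash(key))
            return self.points[i % len(self.points)][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    for key in ["thumb:123", "wallpost:456", "user:789"]:
        print(key, "->", ring.owner(key))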
===========================================================================
SOSP 2009 Review #188G
Updated Thursday 4 Jun 2009 9:00:17am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: X. I am an expert in the subject area of this paper.

===== Paper summary & rationale for evaluation =====

This paper proposes a new architecture for clustered storage, using CPUs that are slower but much more energy-efficient, and using flash memory instead of disk. The paper describes an object store built using the system and measures that system's performance.

+ Overall, a very nice paper.
+ Interesting ideas, described clearly, with comprehensive performance measurements.
+ Figure 14 is going to get referenced or reproduced in a lot of papers in the future. It cuts to the heart of the issue.

Potential impact: high.

===== Comments for author =====

* Nice abstract.
* Page 3, Section 3.2: when describing flash as "low power consumption" you compare with disk using total watts per device; shouldn't this be characterized as watts per byte? By that metric I don't believe flash is better than disk.
* Page 3: it would be helpful to give base performance numbers for flash memory here: how long does it take to read and write?
* Section 3.3: it looks to me like FAWN has pretty significant memory overheads. If you keep 6 bytes in memory for each object, and if you overallocate hash table space to reduce collisions, you will probably end up with at least 10 bytes in memory for each object. If objects are only 100 bytes long, then you will need 10% as much memory as flash storage. If the average object size is 500 bytes, then you will need 2% as much memory as flash.
* Section 3.3.2, last paragraph: how much does cleaning increase the write traffic, which is already slow?
* Section 3.4.2 mentions a buffer cache; how much memory are you assuming for this? I didn't understand the description of the front-end query cache; aren't there consistency issues with this?
* Section 3.4.4: joins and leaves will cause significant additional write traffic to flash, no? How much will this impact system performance and flash memory lifetime?
* Section 4 mentions an "Atom-based front-end": what is this?
* Page 7, "Evaluation Hardware": can you really support 4GB of flash per node? Assuming you have to keep 10 bytes in main memory for each object, and assuming that you can use 100 MB of main memory for the hash table, that means you can handle 10M objects, which means the average object size must be at least 400 bytes. Any idea whether this is a reasonable assumption? This makes me wonder about Table 4, where you hypothesize 32GB of flash per node. This will only work if you have an average object size of 3KB, or if you have significantly more than 256MB of memory per node.
* Using a log-structured approach in FAWN-DS will expand the size of storage needed: log cleaning overheads become unacceptable if utilization gets close to 100%. Any estimates on how big this expansion effect will be?
* Page 8, "Bulk store speed": what do you mean by inserting into a single file? What is a "file" in FAWN-DS?
* Page 9, Section 4.1.3: I don't understand why FAWN-DS can't handle more puts per second than gets. I get it for disk, but random access is so fast with flash that I wouldn't think there'd be much difference.
* Page 9, Section 4.1.3: the last paragraph says that frequent cleaning could reduce throughput by half. In fact, things can be much, much worse than this. For example, if you are cleaning regions in which 90% of the data is still alive, then for each byte of new data you write you must read 9-10 bytes of existing data and write 9 bytes of live data (depending on how clever you are about identifying the live data), so your throughput has dropped by a factor of 19-20, not 2. It's hard to imagine a scenario where the throughput reduction would be as little as 2x.
* Page 10, Figure 13: the "ledge" at 300 microseconds latency seems like it may be unrealistic. How much memory are you dedicating to the cache, and what access patterns are you assuming?
* Page 12, Table 4: how did you come up with 35K QPS for FAWN? From Figure 13 it looks to me like you can only get 1-3K QPS, depending on cache hit ratio. Are you assuming concurrent accesses to multiple flash devices? Even if this is possible, I doubt that your FAWN CPUs can sustain an RPC rate much higher than 3K QPS. A factor-of-10 change in FAWN QPS would have a significant impact on Figure 14.
* Page 12, Figure 14: I really like this figure! I'm not sure I agree with some of the places where you have drawn lines, but the figure does a great job of capturing the fundamental trade-offs.
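Review #188G's cleaning arithmetic above is the standard log-cleaning cost model; the sketch below simply restates it as a formula (an illustration of the reviewer's reasoning, not a measurement from the paper): if cleaned segments are a fraction u live, each byte of new data costs roughly 1/(1-u) bytes of reads and 1 + u/(1-u) bytes of writes.

    # The standard log-cleaning cost model behind Review #188G's arithmetic:
    # to make room for one byte of new data from segments that are a fraction
    # `u` live, the cleaner reads about 1/(1-u) bytes and rewrites u/(1-u)
    # bytes of still-live data, on top of the one new byte written.
    # Illustrative arithmetic only, not a measurement from the paper.

    def write_amplification(u: float) -> float:
        """Bytes written per byte of new data when cleaned segments are u live."""
        assert 0.0 <= u < 1.0
        return 1.0 + u / (1.0 - u)

    def read_amplification(u: float) -> float:
        """Bytes read back by the cleaner per byte of new data."""
        assert 0.0 <= u < 1.0
        return 1.0 / (1.0 - u)

    for u in (0.5, 0.8, 0.9):
        print(f"segments {u:.0%} live: ~{write_amplification(u):.0f}x writes, "
              f"~{read_amplification(u):.0f}x reads per byte of new data")

    # At 90% live data that is ~10x writes plus ~10x reads -- the 19-20x total
    # I/O cost the reviewer describes, far worse than 2x; the ~2x figure only
    # holds when cleaned segments are about half live.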
===========================================================================
SOSP 2009 Review #188H
Updated Wednesday 20 May 2009 9:26:42am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

FAWN is a brick-based key-value store using consistent hashing and chain replication. The focus here is on matching the brick hardware to the problem. Each brick combines flash/SSD to store the values, a low-power embedded processor, and sufficient DRAM to store a key hash for the values in the SSD. The crux trick is to store the key hash table compactly in DRAM, and use the SSD as an append-only store for the values. The claim here is that there is a significant workload sweet spot for this design that makes it a winner from the standpoint of lookups-per-second-per-joule (roughly a factor of 6).

The design is clever but it comes with a lot of limitations. The system as described won't handle deletes or overwrites, and if the objects are too small then it needs a lot of DRAM. The cost per byte is much higher than disk, so it won't work if the dataset is too big. And it supports only primary-key queries.

The paper is timely in its focus on combining flash and embedded processors in a cluster setting, for the purpose of energy efficiency. The strength of the paper is that it is a complete implementation with some clever tricks, real measurements, and a good exploration of tradeoffs in the design space. That sets it apart from the other recent work in this area. In particular, Gordon (ASPLOS 2009) is a paper design and simulation study for MapReduce clusters using similar hardware. The real measurements offered in this FAWN paper are particularly valuable because simulation models for energy consumption are not yet accepted as mature. This paper puts it all together and establishes something of an upper bound on the potential benefits from this idea (an array of flash SSD bricks with matched embedded processors) using existing technology.

The weakness is that the paper does not actually demonstrate a compelling application in the sweet spot for this design, and the limitations of the simple design would exclude many potential applications. Key-value stores are indeed "critical parts of major Internet services", but the effectiveness of this design is sensitive to object size, read/write ratio, dataset size, and the need for overwrite/delete. Will it really work? The synthetic benchmarks mostly explore the sweet spot, and do not fully demonstrate or even acknowledge these sensitivities. The paper is all about the engineering tradeoffs for a particular design point: there are no new fundamental techniques here. And most of the discussion and analysis focuses on the clustering aspect (old news) and performance, rather than on energy. The paper would be much stronger if it offered some hybrid techniques to make this approach less sensitive to workload, or if it explored the sensitivities.

===== Comments for author =====

There is a lot of related work in the database community studying various flash-based structures that are more general (but also more complex) than this paper. From these there is only one cite, [26]. Look at the work of Ke Yi (FD-tree, ICDE 2009), other work by Suman Nath (VLDB 2008), Sang Won Lee (several SIGMOD papers and a log-structured B-tree), and others in the same proceedings. This makes the paper feel somewhat "SOSP-centric", although it is true that none of these works addresses the sweet spot in this paper, and none of them addresses clustering.

In the file/storage community, there are more sophisticated discussions of issues and tradeoffs for flash storage systems and the software to drive them. Some important issues glossed over here are wear-leveling and placement of data across SSD chips. See the Wobber paper in USENIX 2008, and papers cited from there.

Much of the implementation detail for hashing and chain replication, etc., seemed unnecessary since this is well-covered ground. It would be better to use the space to focus more on the issues relevant to the energy/performance claims. For example, 4.1.1 (Put Speed) gave a hint of some interesting tradeoffs between replication degree and cost, energy, and write speed in a clustered setting. What is the power cost of replication? Is it worth the cost to keep the replicas in flash? Perhaps it would be if we could use them to load-balance reads?

I liked the discussion of the sources of throughput loss in 4.2, but it would be useful to quantify the various factors.

I found the term "vnodes" confusing since it departs from standard usage of the term in the context of storage systems. I don't think you need the term.

3.3 could be clearer that "in-memory" means "in DRAM". 3.3 uses the terms "data store file", "data log", and "data store" to mean the same thing. Sometimes it just says "file". I struggled. The distinction between "FAWN-DS" and "FAWN-KV" also confused me, especially since "FAWN-DS" is a KV store.

3.4.1 sounds more like DDS [14] than Chord. Why talk about Chord at all?

The title is cute but it doesn't work for me.
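A small clarification of the efficiency metric several reviews invoke (Review #188H's "lookups-per-second-per-joule" reduces to the same quantity): queries per joule is simply sustained query rate divided by power draw.

    % The metric reduces to sustained query rate divided by power draw:
    \[
        \frac{\text{queries}}{\text{joule}}
        \;=\; \frac{\text{queries}/\text{second}}{\text{joules}/\text{second}}
        \;=\; \frac{\text{QPS}}{\text{power in watts}}
    \]

Read this way, the roughly factor-of-6 advantage Review #188H cites says that, in its sweet spot, the FAWN configuration sustains about six times the query rate per watt of the desktop-class, disk-based alternative mentioned in Review #188B.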
===========================================================================
SOSP 2009 Review #188I
Updated Monday 1 Jun 2009 12:23:09pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

Presents a key-value store suitable for a Facebook-style workload built using a cluster of low-power (Geode-based) machines, and arguments as to why this is a good thing: very few joules per query.

===== Comments for author =====

I thought this was an excellent (though slightly flawed) paper, and I'd be very happy to see it in SOSP. Much has already been said by the previous reviewers, so I'll stick to diffs and strong agreements.

First, the system name is dreadful. For this reviewer, it had way too many connotations of either Narnia (eugh), or The Merchant of Venice ("How like a fawning publican he looks!"). If this paper is to be remembered, you really need a better name. Think of RAID, or NOW, or half a dozen other ideas with good names. FAWN is just pathetic. You must be able to do better than this. Bunch Of Small Servers? Cloud Of Low-power Devices ("Run COLD!")?

Second, using simple devices for these workloads is not original to this paper, or this project. I think some acknowledgement of the other groups who have been proposing (and building) such schemes would strengthen the paper -- the contribution here is FAWN-KV, and the cost-benefit analysis, which are great.

Why flash disks (other than that they are cheap on a university budget)? What about directly attached flash, or (better) phase-change memory? I'm sure you know some people at Intel, and they apparently are shipping sample parts.

Several PC manufacturers (including Dell) are selling server boxes containing a number (>4) of VIA Nano processors. This sounds relevant...

Page 5: "... there are typically 80 or more back-end nodes per front-end node." Where? In a FAWN cluster? Already? Where does this figure come from? Also, the workload on the front-end node looks very different from the back-end's. Can you comment on this?

The description of FAWN-KV contains large reams that sound just like any other datacenter-based one-hop DHT. Perhaps you can clearly distinguish here the difference between your system and, say, Cassandra.

Your power analysis is lacking a clear discussion of how switch power consumption scales with the cluster size. Most datacenter switches today use a lot more power than yours, and for a 10,000-node cluster those nice cheap NetGears are not going to cut it.
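Review #188I's switch-power point can be framed with a trivial model: total cluster power is node power plus the power of the switches needed to connect the nodes, so the lower each node's draw, the larger the switches' share. The sketch below uses purely hypothetical wattages and port counts (none of these figures come from the paper) just to illustrate why the scaling question matters.

    # Toy cluster-power model for the switch-power question above. Every number
    # here is a hypothetical placeholder, not a figure from the paper; the point
    # is only that when individual nodes draw very little power, the switches'
    # share of total power (and of queries per joule) stops being negligible.

    import math

    def cluster_power_watts(nodes, watts_per_node, ports_per_switch, watts_per_switch):
        """Node power plus the edge switches needed to connect the nodes.
        Ignores aggregation/core tiers, which make large clusters look worse."""
        switches = math.ceil(nodes / ports_per_switch)
        return nodes * watts_per_node, switches * watts_per_switch

    for n in (21, 1000, 10000):
        node_w, switch_w = cluster_power_watts(
            n, watts_per_node=4.0, ports_per_switch=48, watts_per_switch=100.0
        )
        share = switch_w / (node_w + switch_w)
        print(f"{n:6d} nodes: {node_w:8.0f} W nodes + {switch_w:7.0f} W switches "
              f"({share:.0%} of the total)")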
===========================================================================
Comment
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

A sizeable chunk of this paper promotes the idea of using low-power processors for datacenter computing. This idea has become widely popular, and there are numerous related projects. The paper mentions some of them in related work, but, if published, it should cover other projects more thoroughly. Perhaps the comparison in Section 5 should be expanded beyond machine configurations not typically used in datacenters to include some of these.

The paper mentions the Gordon paper from ASPLOS 2009, but states that it focuses mostly on the FTL. However, that paper strongly resembles this one in that a lot of space is spent making a case for using low-power processors and flash storage, although it focuses on a different layer of the flash storage hierarchy.

Just recently, Dell announced that it would ship server blades with VIA Nano processors, which are more commonly used in netbooks:
- http://www.windowsfordevices.com/news/NS2212620043.html?kc=rss

Two people at Microsoft have similar ideas on using low-power processors, although without the focus on random-access workloads and flash memory:
- Amdahl's blades - http://perspectives.mvdirona.com/default,date,2009-01-25.aspx
- MSR has a project by Jim Larus looking at the same idea - http://www.greenm3.com/2009/02/microsoft-research-builds-intel-atom-servers.html

===========================================================================
Comment
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

PC Meeting notes: The PC was unanimous in accepting the paper, and liked the motivation and evaluation.