===========================================================================
SOSP 2009 Review #188A
Updated Wednesday 3 Jun 2009 3:36:37pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

The authors investigate the question of whether one can build a cost-effective cluster for data-intensive applications (e.g., reading small random objects such as thumbnail images and wall posts). The answer is yes, given one uses wimpy nodes: low-power, efficient embedded CPUs paired with flash storage. The paper nicely motivates why a Fast Array of Wimpy Nodes is the best approach for combining high performance with power efficiency (i.e., the power consumption of traditional CPUs cannot be scaled down well).

Their system has two main components, FAWN-DS and FAWN-KV. The authors begin by describing their single-node data store, FAWN-DS. This is a log-structured key-value store supporting store, lookup, and delete. This design makes sense for flash, given its fast random reads and slow random writes. Each physical node in FAWN is responsible for multiple virtual nodes, each of whose data objects are stored in their own file; operations exist to split and merge logs (to handle vnodes entering and exiting the system) and to compact them (for garbage collection). FAWN-KV, the key-value system, is built from known techniques: consistent hashing, front-end nodes that manage the vnode membership lists, and chain replication.

The evaluation occupies a significant portion of the paper and appears fairly solid. They start by showing that FAWN-DS obtains close to the random-read performance of ext2, that FAWN-DS comes close to the peak sequential write performance for bulk store operations, and that concurrent puts to different files (i.e., vnodes) are not prohibitively bad. The authors then focus on get queries for FAWN-KV, since they perform worse than puts. They show that a 21-node system obtains 80% of the sustained rate that a single FAWN-DS node could handle. They measure the power consumption during puts/gets and report the query rates while a split is being performed. The authors conclude with a comparison of which configurations (traditional CPUs vs. wimpy nodes, paired with disk, SSD, or DRAM) are best for TCO, depending upon the mix of dataset size and query rate.

Strengths of paper:
+ The authors investigate an interesting problem domain, looking at power consumption and TCO
+ Really like that they have implemented a system with 21 nodes
+ Reasonable evaluation (if not perfect)
+ Interesting, fun paper to read; well written

Weaknesses:
- No grand new conceptual idea (but the combination of ideas is fine)
- No compelling end-to-end application to fully demonstrate FAWN
- I'd like to see evaluation of their full functionality, including delete operations, garbage collection, and cleaning/compaction

===== Comments for author =====

- You say the key space is allocated to front-ends by a single management node, which you envision replicating using a small Paxos cluster. I hope you implement this.
- I don't think you need to devote quite so much space to your description of the FAWN-KV ring (e.g., Figures 4 and 5), given how similar your approach is to previous systems.
- I would have liked more description of the three flash devices shown in Figure 8. What are the prices of the three? Power consumption? I would like more of a conclusion to the Put Speed experiments. It isn't clear to me that the SanDisk Extreme IV is the appropriate choice. Why was it chosen? What values do you expect for the number of database files in a typical FAWN-KV system? I'd also like to see the graph extended to 256 files for all three lines.
- As stated above, an evaluation of deletes and cleaning would be appreciated.
- I assume you are rate-limiting the traffic for the split in Section 4.3? Please discuss this and the trade-off against the amount of time the split takes.
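To make the FAWN-DS design that Review #188A summarizes concrete, here is a minimal sketch of a log-structured key-value store: an in-memory index over an append-only data log, with deletes recorded as tombstones and reclaimed by compaction. It illustrates the general technique only; the names and record layout are invented, and the paper's FAWN-DS differs in its specifics (e.g., the compact in-DRAM hash entries discussed in Reviews #188G and #188H, and per-vnode files with split/merge operations).

    # Minimal sketch of a log-structured key-value store in the spirit of the
    # FAWN-DS described above: an append-only data log (cheap sequential writes,
    # as on flash) plus an in-memory index mapping each key to the offset of its
    # latest entry. Deletes append a tombstone; compaction rewrites live entries.
    # Names and record layout are illustrative, not the paper's.

    import os
    import struct

    class LogStore:
        def __init__(self, path):
            self.path = path
            self.index = {}                 # key -> offset of latest record in the log
            self.log = open(path, "a+b")    # append-only data log

        def _append(self, key, value):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            kb = key.encode()
            vb = b"" if value is None else value.encode()
            vlen = -1 if value is None else len(vb)   # -1 marks a delete tombstone
            self.log.write(struct.pack(">ii", len(kb), vlen) + kb + vb)
            self.log.flush()
            return offset

        def put(self, key, value):
            self.index[key] = self._append(key, value)

        def delete(self, key):
            self._append(key, None)          # tombstone persists until compaction
            self.index.pop(key, None)

        def get(self, key):
            if key not in self.index:
                return None
            self.log.seek(self.index[key])
            klen, vlen = struct.unpack(">ii", self.log.read(8))
            self.log.read(klen)              # skip the key
            return self.log.read(vlen).decode() if vlen >= 0 else None

        def compact(self):
            # Garbage collection: rewrite only live entries into a fresh log.
            live = {k: self.get(k) for k in list(self.index)}
            self.log.close()
            os.remove(self.path)
            self.index.clear()
            self.log = open(self.path, "a+b")
            for k, v in live.items():
                self.put(k, v)

    store = LogStore("/tmp/fawn_ds_sketch.log")
    store.put("thumb:123", "jpeg bytes")
    print(store.get("thumb:123"))            # -> jpeg bytes
    store.delete("thumb:123")
    store.compact()

Splitting and merging per-vnode logs, and rebuilding the index by scanning the log on recovery, are omitted from this sketch.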
===========================================================================
SOSP 2009 Review #188B
Updated Monday 30 Mar 2009 10:49:16pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

This paper proposes a hierarchical key-value store (FAWN) built out of many low-power "wimpy" nodes with flash as the backing store. The system is based on the assumption that random-access reads dominate the workload, such that highly parallel flash access leads to a higher query servicing rate. Furthermore, because it is built out of wimpy, low-power nodes, FAWN's queries per joule are much higher than those of a desktop-class, disk-based solution.

The paper has several strengths: it's a complete design, including reliability through chain replication, dynamically adding and removing nodes, a hierarchy for breaking up the key space, and a nice demonstration of how a goal of energy efficiency can completely change system design considerations.

The paper has four weaknesses; none is fatal. The first is that FAWN is an application of solid system design principles to an interesting problem, rather than novel design ideas. The second is that the description of a join operation in 3.4.4 doesn't seem quite right; given the depth and soundness of descriptions elsewhere, I'm willing to assume this is a miscommunication, and even if it weren't, fixing it would not be a big change. The third is that the paper does not evaluate the most expensive operation in FAWN, deletion, making it hard to judge the tradeoffs chosen. Finally, the paper itself is a bit rough; many of the design decisions in 3.3 are based on particular hardware properties not described until Section 4, the fact that FAWN is designed for read-heavy workloads doesn't come up until Section 4, and the desire for fault tolerance (through replication chains) isn't at all mentioned in the introduction.

===== Comments for author =====

This is a good paper: still a bit rough around the edges, but a good paper. What's really nice is that it examines a simple, albeit narrow, problem -- read-dominated key-value stores of small data items -- and shows how energy efficiency can completely change the design space.

I found Section 3.3 amazingly frustrating. There are many assumptions about the relative sizes of different hardware components in the tradeoffs made, yet the implementation platform isn't described until Section 4. For example, the fact that the Hash Index is kept in RAM assumes that the ratio of flash to RAM is such that this works. If you changed any of the quantities significantly (e.g., storing 32-byte values, using 64GB of flash, having only 16MB of RAM), then this design would break. Also, using more than 4GB of flash would require a larger pointer into the Data Log. It would be much better if the hardware were described first, with the caveat that different values might lead to slightly different structures.

I was a bit confused by the sentence in 3.3 "uses standard hash chaining to pick another bucket." I assumed that by hash chaining this meant a linked list in a bucket. But "pick another bucket" suggests rehashing. Rehashing, however, runs into performance issues once the store is mostly full.

The description of a join in 3.4.4 seems to have a problem. It goes like this:
1. C1 starts join.
2. E1 sends log to C1.
3. C1 is inserted between B1 and D1.
4. C1 starts receiving updates from B1.
5. C1 requests entries which E1 received between 2 and 4.

My concern has to do with an assumption about the latency of in-flight writes between 3 and 5. For example, it might go like this:
2a. E1 sends log to C1.
2b. B1 sends update to D1.
3. C1 is inserted between B1 and D1.
4. C1 starts receiving updates from B1.
5. C1 requests entries from E1, gets them all.
6. D1 forwards the update to E1.
There are, of course, a couple of simple solutions.

Table 1: the variation between QPS and MB/s is annoying. If all operations are 1 kB, use kB/s?

It's nice that the paper tries to look at TCO. As costs might change quickly, the exact values here aren't really a strong statement of FAWN's financial tradeoffs in the long term, but this is a good demonstration that the idea isn't financially crazy today.

An examination of how FAWN performs over time, as entries are deleted and the log is cleaned, would be nice. But this isn't critical.
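The flash-to-RAM sizing concern in Review #188B above (echoed by Review #188G's memory-overhead arithmetic later in these reviews) comes down to a one-line calculation: every object on flash needs an in-DRAM index entry, so the index grows with flash capacity divided by average object size. The sketch below works through that arithmetic; the roughly 6 bytes per entry is quoted in the reviews, while the over-allocation factor and the flash/object-size combinations are illustrative assumptions.

    # Back-of-envelope sketch of the flash-to-DRAM constraint raised above:
    # every object stored on flash needs an in-DRAM index entry, so
    #     entries     = flash_bytes / avg_object_size
    #     dram_needed = entries * bytes_per_entry * overallocation
    # The ~6 bytes/entry figure is quoted in the reviews (#188B, #188G); the
    # over-allocation factor and the size combinations below are illustrative.

    def index_dram_bytes(flash_bytes, avg_object_size, bytes_per_entry=6, overalloc=1.5):
        """DRAM needed for the hash index, with slack to keep collisions rare."""
        entries = flash_bytes // avg_object_size
        return int(entries * bytes_per_entry * overalloc)

    GB, MB = 2**30, 2**20
    for flash, obj in [(4 * GB, 256), (4 * GB, 1024), (32 * GB, 256), (32 * GB, 3072)]:
        need = index_dram_bytes(flash, obj)
        print(f"flash = {flash // GB:2d} GB, avg object = {obj:4d} B "
              f"-> index needs ~{need / MB:6.0f} MB of DRAM")

    # With ~256 MB of DRAM per node (the figure Review #188G assumes), 32 GB of
    # flash only fits if average objects are a few KB, and flash larger than
    # 4 GB also needs log offsets wider than 4 bytes -- exactly the sensitivity
    # the reviewers describe.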
===========================================================================
SOSP 2009 Review #188C
Updated Monday 30 Mar 2009 11:19:45pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Z. I am not an expert. My evaluation is that of an informed outsider.

===== Paper summary & rationale for evaluation =====

Summary:

FAWN is a new cluster architecture that couples low-power, efficient embedded CPUs with flash storage to provide efficient, fast, and cost-effective access to large, random-access data. Such a workload is typical for some existing datacenter systems like Dynamo (Amazon), Voldemort (LinkedIn), and memcached (Facebook). The authors use consistent hashing (as in some structured P2P systems) for lookup (but not routing) and use log-structured storage to cope with flash's expensive random writes.

Strengths:

The authors identified a well-motivated problem scenario in the datacenter computing environment. The authors have made great observations: that power consumption is a significant part of the total cost of ownership, and that neither disk nor DRAM is cost-effective. Aimed at optimizing power consumption without sacrificing performance, the authors have set off to build a very interesting system that uses low-power embedded CPUs together with flash storage. The authors constructed the FAWN system nicely. They cleverly adopted existing techniques that are well suited to a well-motivated problem scenario. The authors have demonstrated the benefits of the approach through an interesting evaluation of both performance and total cost of ownership.

Weaknesses:

An open question (which is acknowledged by the authors) is how such a system would cope with a variety of workloads, such as that of MapReduce. Would FAWN still have the same kind of power-efficiency advantage over traditional cluster constructions?

===========================================================================
SOSP 2009 Review #188D
Updated Monday 27 Apr 2009 9:42:54am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===========================================================================
SOSP 2009 Review #188E
Updated Thursday 4 Jun 2009 8:32:12am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Z. I am not an expert. My evaluation is that of an informed outsider.

===== Paper summary & rationale for evaluation =====

+ Complete system
+ First system that builds a key-value store on flash

Others have made the observation that flash is much lower power than hard drives and have built file systems around flash. The paper should compare with other work that uses flash in embedded systems:
- ELF (SenSys 2004), which also uses a log-structured technique to build a flash file system.
- Capsule (SenSys 2006), a log-structured object store for flash.

===========================================================================
SOSP 2009 Review #188F
Updated Tuesday 12 May 2009 11:31:58am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

Nothing struck me as fundamentally new in this paper---but still, a well-thought-out system and a well-written paper that provides enough detail to understand the system without being bogged down in details. The intellectual value of the paper for me comes from appreciating how the design elegantly brings together all the different technical pieces, even as each of them is not per se surprising. I had fun reading this paper.

===== Comments for author =====

Typos in the second column of page 5:
- line 4: "so that the system *was* robust" --- perhaps "would be"?
- line 6: "what keys it are responsible..."
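Reviews #188A and #188C both note that FAWN-KV's lookup rests on consistent hashing over virtual nodes, as in structured P2P systems. For readers who want the idea in code, here is a minimal, generic consistent-hashing sketch; the class name, hash choice, and vnode count are arbitrary illustrations of the standard technique, not FAWN-KV's implementation or its ring-maintenance and chain-replication machinery.

    # Generic consistent-hashing sketch: each physical node owns several virtual
    # nodes ("vnodes") placed on a hash ring; a key is served by the first vnode
    # clockwise from the key's hash, so adding or removing a node only remaps the
    # key ranges adjacent to its vnodes. Illustrative only -- not FAWN-KV's code.

    import bisect
    import hashlib

    def ring_hash(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    class Ring:
        def __init__(self, nodes, vnodes_per_node=4):
            self.points = sorted(
                (ring_hash(f"{node}#{v}"), node)
                for node in nodes
                for v in range(vnodes_per_node)
            )
            self.positions = [h for h, _ in self.points]

        def owner(self, key: str) -> str:
            i = bisect.bisect_right(self.positions, ring_hash(key))
            return self.points[i % len(self.points)][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    for key in ["thumb:123", "wallpost:456", "user:789"]:
        print(key, "->", ring.owner(key))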
===========================================================================
SOSP 2009 Review #188G
Updated Thursday 4 Jun 2009 9:00:17am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: X. I am an expert in the subject area of this paper.

===== Paper summary & rationale for evaluation =====

This paper proposes a new architecture for clustered storage, using CPUs that are slower but much more energy-efficient, and using flash memory instead of disk. The paper describes an object store built using the system and measures that system's performance.

+ Overall, a very nice paper.
+ Interesting ideas, described clearly, with comprehensive performance measurements.
+ Figure 14 is going to get referenced or reproduced in a lot of papers in the future. It cuts to the heart of the issue.

Potential impact: high.

===== Comments for author =====

* Nice abstract.
* Page 3, Section 3.2: when describing flash as "low power consumption" you compare with disk using total watts per device; shouldn't this be characterized as watts per byte? By that metric I don't believe flash is better than disk.
* Page 3: it would be helpful to give base performance numbers for flash memory here: how long does it take to read and write?
* Section 3.3: it looks to me like FAWN has pretty significant memory overheads. If you keep 6 bytes in memory for each object, and if you overallocate hash table space to reduce collisions, you will probably end up with at least 10 bytes in memory for each object. If objects are only 100 bytes long, then you will need 10% as much memory as flash storage. If the average object size is 500 bytes, then you will need 2% as much memory as flash.
* Section 3.3.2, last paragraph: how much does cleaning increase the write traffic, which is already slow?
* Section 3.4.2 mentions a buffer cache; how much memory are you assuming for this? I didn't understand the description of the front-end query cache; aren't there consistency issues with this?
* Section 3.4.4: joins and leaves will cause significant additional write traffic to flash, no? How much will this impact system performance and flash memory lifetime?
* Section 4 mentions an "Atom-based front-end": what is this?
* Page 7, "Evaluation Hardware": can you really support 4GB of flash per node? Assuming you have to keep 10 bytes in main memory for each object, and assuming that you can use 100 MB of main memory for the hash table, that means you can handle 10M objects, which means the average object size must be at least 400 bytes. Any idea whether this is a reasonable assumption? This makes me wonder about Table 4, where you hypothesize 32GB of flash per node. This will only work if you have an average object size of 3KB, or if you have significantly more than 256MB of memory per node.
* Using a log-structured approach in FAWN-DS will expand the size of storage needed: log cleaning overheads become unacceptable if utilization gets close to 100%. Any estimates on how big this expansion effect will be?
* Page 8, "Bulk store speed": what do you mean by inserting into a single file? What is a "file" in FAWN-DS?
* Page 9, Section 4.1.3: I don't understand why FAWN-DS can't handle more puts per second than gets. I get it for disk, but random access is so fast with flash that I wouldn't think there'd be much difference.
* Page 9, Section 4.1.3: the last paragraph says that frequent cleaning could reduce throughput by half. In fact, things can be much, much worse than this. For example, if you are cleaning regions in which 90% of the data is still alive, then for each byte of new data you write you must read 9-10 bytes of existing data and write 9 bytes of live data (depending on how clever you are about identifying the live data), so your throughput has dropped by a factor of 19-20, not 2. It's hard to imagine a scenario where the throughput reduction would be as little as 2x.
* Page 10, Figure 13: the "ledge" at 300 microseconds latency seems like it may be unrealistic. How much memory are you dedicating to the cache, and what access patterns are you assuming?
* Page 12, Table 4: how did you come up with 35K QPS for FAWN? From Figure 13 it looks to me like you can only get 1-3K QPS, depending on cache hit ratio. Are you assuming concurrent accesses to multiple flash devices? Even if this is possible, I doubt that your FAWN CPUs can sustain an RPC rate much higher than 3K QPS. A factor-of-10 change in FAWN QPS would have a significant impact on Figure 14.
* Page 12, Figure 14: I really like this figure! I'm not sure I agree with some of the places where you have drawn lines, but the figure does a great job of capturing the fundamental trade-offs.
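Review #188G's cleaning arithmetic above is the standard log-cleaning cost model; the sketch below simply restates it as a formula (an illustration of the reviewer's reasoning, not a measurement from the paper): if cleaned segments are a fraction u live, each byte of new data costs roughly 1/(1-u) bytes of reads and 1 + u/(1-u) bytes of writes.

    # The standard log-cleaning cost model behind Review #188G's arithmetic:
    # to make room for one byte of new data from segments that are a fraction
    # `u` live, the cleaner reads about 1/(1-u) bytes and rewrites u/(1-u)
    # bytes of still-live data, on top of the one new byte written.
    # Illustrative arithmetic only, not a measurement from the paper.

    def write_amplification(u: float) -> float:
        """Bytes written per byte of new data when cleaned segments are u live."""
        assert 0.0 <= u < 1.0
        return 1.0 + u / (1.0 - u)

    def read_amplification(u: float) -> float:
        """Bytes read back by the cleaner per byte of new data."""
        assert 0.0 <= u < 1.0
        return 1.0 / (1.0 - u)

    for u in (0.5, 0.8, 0.9):
        print(f"segments {u:.0%} live: ~{write_amplification(u):.0f}x writes, "
              f"~{read_amplification(u):.0f}x reads per byte of new data")

    # At 90% live data that is ~10x writes plus ~10x reads -- the 19-20x total
    # I/O cost the reviewer describes, far worse than 2x; the ~2x figure only
    # holds when cleaned segments are about half live.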
===========================================================================
SOSP 2009 Review #188H
Updated Wednesday 20 May 2009 9:26:42am PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: B. Good paper. I don't mind having it in SOSP.
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

FAWN is a brick-based key-value store using consistent hashing and chain replication. The focus here is on matching the brick hardware to the problem. Each brick combines flash/SSD to store the values, a low-power embedded processor, and sufficient DRAM to store a key hash for the values in the SSD. The crux trick is to store the key hash table compactly in DRAM, and use the SSD as an append-only store for the values. The claim here is that there is a significant workload sweet spot for this design that makes it a winner from the standpoint of lookups-per-second-per-joule (roughly a factor of 6).

The design is clever but it comes with a lot of limitations. The system as described won't handle deletes or overwrites, and if the objects are too small then it needs a lot of DRAM. The cost per byte is much higher than disk, so it won't work if the dataset is too big. And it supports only primary-key queries.

The paper is timely in its focus on combining flash and embedded processors in a cluster setting, for the purpose of energy efficiency. The strength of the paper is that it is a complete implementation with some clever tricks, real measurements, and a good exploration of tradeoffs in the design space. That sets it apart from the other recent work in this area. In particular, Gordon (ASPLOS 2009) is a paper design and simulation study for MapReduce clusters using similar hardware. The real measurements offered in this FAWN paper are particularly valuable because simulation models for energy consumption are not yet accepted as mature. This paper puts it all together and establishes something of an upper bound on the potential benefits from this idea (an array of flash SSD bricks with matched embedded processors) using existing technology.

The weakness is that the paper does not actually demonstrate a compelling application in the sweet spot for this design, and the limitations of the simple design would exclude many potential applications. Key-value stores are indeed "critical parts of major Internet services", but the effectiveness of this design is sensitive to object size, read/write ratio, dataset size, and the need for overwrite/delete. Will it really work? The synthetic benchmarks mostly explore the sweet spot, and do not fully demonstrate or even acknowledge these sensitivities. The paper is all about the engineering tradeoffs for a particular design point: there are no new fundamental techniques here. And most of the discussion and analysis focuses on the clustering aspect (old news) and performance, rather than on energy. The paper would be much stronger if it offered some hybrid techniques to make this approach less sensitive to workload, or if it explored the sensitivities.

===== Comments for author =====

There is a lot of related work in the database community studying various flash-based structures that are more general (but also more complex) than this paper. From these there is only one cite, [26]. Look at the work of Ke Yi (FD-tree, ICDE 2009), other work by Suman Nath (VLDB 2008), Sang Won Lee (several SIGMOD papers and a log-structured B-tree), and others in the same proceedings. This makes the paper feel somewhat "SOSP-centric", although it is true that none of these works addresses the sweet spot in this paper, and none of them addresses clustering.

In the file/storage community, there are more sophisticated discussions of issues and tradeoffs for flash storage systems and the software to drive them. Some important issues glossed over here are wear-leveling and placement of data across SSD chips. See the Wobber paper in USENIX 2008, and papers cited from there.

Much of the implementation detail for hashing and chain replication, etc., seemed unnecessary since this is well-covered ground. It would be better to use the space to focus more on the issues relevant to the energy/performance claims. For example, 4.1.1 (Put Speed) gave a hint of some interesting tradeoffs between replication degree and cost, energy, and write speed in a clustered setting. What is the power cost of replication? Is it worth the cost to keep the replicas in flash? Perhaps it would be if we could use them to load-balance reads?

I liked the discussion of the sources of throughput loss in 4.2, but it would be useful to quantify the various factors.

I found the term "vnodes" confusing since it departs from standard usage of the term in the context of storage systems. I don't think you need the term.

3.3 could be clearer that "in-memory" means "in DRAM". 3.3 uses the terms "data store file", "data log", and "data store" to mean the same thing. Sometimes it just says "file". I struggled. The distinction between "FAWN-DS" and "FAWN-KV" also confused me, especially since "FAWN-DS" is a KV store.

3.4.1 sounds more like DDS [14] than Chord. Why talk about Chord at all?

The title is cute but it doesn't work for me.
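A small clarification of the efficiency metric several reviews invoke (Review #188H's "lookups-per-second-per-joule" reduces to the same quantity): queries per joule is simply sustained query rate divided by power draw.

    % The metric reduces to sustained query rate divided by power draw:
    \[
        \frac{\text{queries}}{\text{joule}}
        \;=\; \frac{\text{queries}/\text{second}}{\text{joules}/\text{second}}
        \;=\; \frac{\text{QPS}}{\text{power in watts}}
    \]

Read this way, the roughly factor-of-6 advantage Review #188H cites says that, in its sweet spot, the FAWN configuration sustains about six times the query rate per watt of the desktop-class, disk-based alternative mentioned in Review #188B.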
===========================================================================
SOSP 2009 Review #188I
Updated Monday 1 Jun 2009 12:23:09pm PDT
---------------------------------------------------------------------------
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Evaluation: A. Excellent paper. I'd really want to see this paper in SOSP (I will Champion it).
Confidence: Y. I am knowledgeable in the area, though not an expert.

===== Paper summary & rationale for evaluation =====

Presents a key-value store suitable for a Facebook-style workload built using a cluster of low-power (Geode-based) machines, and arguments as to why this is a good thing: very few joules per query.

===== Comments for author =====

I thought this was an excellent (though slightly flawed) paper, and I'd be very happy to see it in SOSP. Much has already been said by the previous reviewers, so I'll stick to diffs and strong agreements.

First, the system name is dreadful. For this reviewer, it had way too many connotations of either Narnia (eugh), or The Merchant of Venice ("How like a fawning publican he looks!"). If this paper is to be remembered, you really need a better name. Think of RAID, or NOW, or half a dozen other ideas with good names. FAWN is just pathetic. You must be able to do better than this. Bunch Of Small Servers? Cloud Of Low-power Devices ("Run COLD!")?

Second, using simple devices for these workloads is not original to this paper, or this project. I think some acknowledgement of the other groups who have been proposing (and building) such schemes would strengthen the paper -- the contribution here is FAWN-KV, and the cost-benefit analysis, which are great.

Why flash disks (other than that they are cheap on a university budget)? What about directly attached flash, or (better) phase-change memory? I'm sure you know some people at Intel, and they apparently are shipping sample parts.

Several PC manufacturers (including Dell) are selling server boxes containing a number (>4) of VIA Nano processors. This sounds relevant...

Page 5: "... there are typically 80 or more back-end nodes per front-end node." Where? In a FAWN cluster? Already? Where does this figure come from? Also, the workload on the front-end node looks very different from the back-end's. Can you comment on this?

The description of FAWN-KV contains large reams that sound just like any other datacenter-based one-hop DHT. Perhaps you can clearly distinguish here the difference between your system and, say, Cassandra.

Your power analysis is lacking a clear discussion of how switch power consumption scales with the cluster size. Most datacenter switches today use a lot more power than yours, and for a 10,000-node cluster those nice cheap NetGears are not going to cut it.
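Review #188I's switch-power point can be framed with a trivial model: total cluster power is node power plus the power of the switches needed to connect the nodes, so the lower each node's draw, the larger the switches' share. The sketch below uses purely hypothetical wattages and port counts (none of these figures come from the paper) just to illustrate why the scaling question matters.

    # Toy cluster-power model for the switch-power question above. Every number
    # here is a hypothetical placeholder, not a figure from the paper; the point
    # is only that when individual nodes draw very little power, the switches'
    # share of total power (and of queries per joule) stops being negligible.

    import math

    def cluster_power_watts(nodes, watts_per_node, ports_per_switch, watts_per_switch):
        """Node power plus the edge switches needed to connect the nodes.
        Ignores aggregation/core tiers, which make large clusters look worse."""
        switches = math.ceil(nodes / ports_per_switch)
        return nodes * watts_per_node, switches * watts_per_switch

    for n in (21, 1000, 10000):
        node_w, switch_w = cluster_power_watts(
            n, watts_per_node=4.0, ports_per_switch=48, watts_per_switch=100.0
        )
        share = switch_w / (node_w + switch_w)
        print(f"{n:6d} nodes: {node_w:8.0f} W nodes + {switch_w:7.0f} W switches "
              f"({share:.0%} of the total)")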
===========================================================================
Comment
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

A sizeable chunk of this paper promotes the idea of using low-power processors for datacenter computing. This idea has become widely popular, and there are numerous related projects. The paper mentions some of them in related work, but, if published, it should cover other projects more thoroughly. Perhaps the comparison in Section 5 should be expanded beyond machine configurations not typically used in datacenters to include some of these.

The paper mentions the Gordon paper from ASPLOS 2009, but states that it focuses mostly on the FTL. However, that paper strongly resembles this one in that a lot of space is spent making a case for using low-power processors and flash storage, although it focuses on a different layer of the flash storage hierarchy.

Just recently, Dell announced that it would ship server blades with VIA Nano processors, which are more commonly used in netbooks:
- http://www.windowsfordevices.com/news/NS2212620043.html?kc=rss

Two people at Microsoft have similar ideas on using low-power processors, although without the focus on random-access workloads and flash memory:
- Amdahl's blades - http://perspectives.mvdirona.com/default,date,2009-01-25.aspx
- MSR has a project by Jim Larus looking at the same idea - http://www.greenm3.com/2009/02/microsoft-research-builds-intel-atom-servers.html

===========================================================================
Comment
Paper #188: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

PC Meeting notes: The PC was unanimous in accepting the paper, and liked the motivation and evaluation.