===========================================================================
OSDI 2008 Review #157A
Updated Wednesday 4 Jun 2008 5:58:09am EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 2. Reject

===== Paper strengths and weaknesses =====

This paper proposes an architecture for a key-value storage engine. Unlike several other proposals, this one takes power consumption into account: besides providing good performance, a high queries/joule rate is also desired. As a consequence, the paper uses a relatively unusual architecture composed of low-end processors, which consume less power than modern processors, as back-end servers, and replaces disk drives with power-efficient flash memory for stable storage. The resulting environment poses two main challenges: how to provide high performance using low-end processors and how to overcome the overheads of flash memory devices. The first challenge is addressed by interposing high-performance processors between client applications and back-end servers; these work pretty much like caches. The second challenge is addressed by using a log-like approach to handle flash memory.

Good scalability and tolerance to high churn are the main requirements of the system. The paper provides some empirical data to validate these, and speculates about how the system provides good scalability by extrapolating from the experimental results.

My impression is that this work is promising, although the current paper and results are premature --- in fact, the authors themselves recognize this. The scalability study is approximate at best. The conclusion that the system scales is drawn from an experiment with eight nodes and a statement that "...there is little in the architecture that impairs scaling to the small thousand of nodes...". Moreover, the issue of frequent membership changes is not appropriately evaluated. Currently, there is a discussion of the time it takes to create a new node; it would be more interesting to discuss how the overall system performance is impacted during the addition and removal of nodes. Finally, the paper builds on relevant and current assumptions, namely the need to account for power constraints, but it fails to address the real problem in the proposed architecture, which is how to provide some level of strong consistency. I suspect that, contrary to what is currently claimed, adding consistency would lead to revising some of the current design choices.

===== Comments for author =====

A detailed discussion of how to ensure consistency would enrich the paper; alternatively, explain clearly why consistency is not an issue for the kind of application envisioned (currently the matter is simply dismissed with a claim in a footnote). More detailed experiments concerning scalability and membership changes should be conducted.

While the paper in general reads well, some parts can be improved, such as the cache size analysis. There are also some typos in the discussion of node join and leave traffic (Section 3.4): in the first item, replace the second n1-1 by n1-2; in the second, it should be n3-2 and n1-2.
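For concreteness, here is roughly what I took the log-like flash layout to be. This is purely my own reconstruction for discussion, under assumed record and index layouts (an append-only log of length-prefixed records plus an in-memory offset table); it is not the authors' code:

    # My own sketch of an append-only key-value log for flash (assumed layout,
    # not the authors' code): all writes are appends, so the device never sees
    # small random writes; an in-memory table maps each key to the offset of
    # its most recent record, so a read costs one random read of the flash.
    # Keys and values are bytes.
    import os, struct

    class LogStore:
        def __init__(self, path):
            self.index = {}                  # key -> offset of latest record
            self.log = open(path, "ab+")
            self._rebuild_index()

        def _rebuild_index(self):
            # One sequential scan at startup; later records shadow earlier ones.
            self.log.seek(0)
            offset = 0
            while True:
                header = self.log.read(8)
                if len(header) < 8:
                    break
                klen, vlen = struct.unpack("<II", header)
                key = self.log.read(klen)
                self.log.seek(vlen, os.SEEK_CUR)
                self.index[key] = offset
                offset += 8 + klen + vlen

        def put(self, key, value):
            # Append a new record; never overwrite in place.
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(struct.pack("<II", len(key), len(value)) + key + value)
            self.log.flush()
            self.index[key] = offset         # point at the newest copy

        def get(self, key):
            offset = self.index.get(key)
            if offset is None:
                return None
            self.log.seek(offset)
            klen, vlen = struct.unpack("<II", self.log.read(8))
            self.log.seek(klen, os.SEEK_CUR)
            return self.log.read(vlen)

Even at this level of sketch, the questions above surface immediately: stale versions accumulate until some compaction pass, and nothing here says what happens to the index or to the replicas when a node fails in the middle of an update.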
===========================================================================
OSDI 2008 Review #157B
Updated Thursday 24 Jul 2008 9:23:32pm EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 3. Weak accept

===== Paper strengths and weaknesses =====

Strengths: interesting idea, low-power storage is important
Weaknesses: seems premature

===== Comments for author =====

The paper reads like a very well-written and convincing grant proposal. Power-awareness is important, and read-mostly workloads for small objects are increasingly common. The proposed architecture addresses both. I can see datacenter people getting excited about something like this. However, lacking hardware, the authors have to do a lot of extrapolating. Lacking realistic benchmarks, the authors limit themselves to some micro-benchmarks.

I like the design of the system. The split/merge mechanism fits very well with both the peculiarities of flash and with consistent hashing. It's original and of interest to the OSDI community.

The "state continuation" idea is not clear. Is the continuation parameter to GET simply an upcall? Why is "continuation" returned by GET?

In Section 3.5, what exactly do you mean by the distribution to individual objects being extremely unbiased?

In Table 2, I had difficulty understanding the difference between Soekris (1) and Soekris (8), and to the extent that I do understand it, I am somewhat surprised.

While the related work is extensive, I'm surprised there are so few references to prior work on power-aware storage. For example, Hibernator (SOSP'05) comes to mind, as well as the work by Ganesh et al. in HotOS'07.

One obvious point in the design space that is not covered by this paper is simply hooking up all the flash to the front-end. Do we really need a cluster if you need a front-end in any case?

===========================================================================
OSDI 2008 Review #157C
Updated Friday 6 Jun 2008 1:53:35pm EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 4. Accept

===== Paper strengths and weaknesses =====

The paper describes an architecture (FAWN) for supporting scalable, energy-efficient key-value storage by using embedded processors & flash instead of the more usual desktop- or server-class machines with DRAM or disk storage. This is an interesting idea and very topical. It will be of interest to many people building large-scale distributed systems for datacenters - although the paper is focused on key-value storage, the ideas and approach have wider applicability. The paper is generally well-written and very readable. On the downside, the prototype implementation does not provide consistency, which weakens the evaluation.

===== Comments for author =====

"imbalanced workloads" -> unbalanced

"that impair the system's ability" - misplaced modifier

"recent work in databases" - [3] is not exactly recent

Power Consumption [25] - Surely there is a better reference for this stuff. Can you summarize the costs & power consumption of different approaches to supporting a given workload? It would make these individual costs much more meaningful.
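Even a rough side-by-side would help; for example, something along these lines (every per-node rate and wattage below is a placeholder I invented for illustration, not a number from the paper):

    # Hypothetical back-of-the-envelope comparison: nodes needed for a target
    # query rate, total power, and queries/Joule. The per-node figures are
    # invented placeholders, not measurements from the paper.
    TARGET_QPS = 500_000                      # the paper's stated target rate

    configs = {                               # name: (queries/sec per node, watts per node)
        "server + disk": (600, 250.0),        # placeholder values
        "wimpy + flash": (700, 10.0),         # placeholder values
    }

    for name, (qps, watts) in configs.items():
        nodes = -(-TARGET_QPS // qps)         # ceiling division
        total_kw = nodes * watts / 1000
        print(f"{name}: {nodes} nodes, {total_kw:.1f} kW, "
              f"{TARGET_QPS / (nodes * watts):.1f} queries/Joule")

A table at this level, but with measured rates, measured wattages, and perhaps a dollar-cost column, would make the power argument far easier to evaluate.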
"accessing two sequential pages is no faster than accessing two random pages" -> accessing two random pages is as fast as accessing two sequential pages Your footnote about weak consistency is disappointing. Do the applications (large DNS zone, e-commerce catalog) OK with weak consistency? Please discuss why weak consistency is OK, or if it's not OK please implement sufficient consistency for your applications. Figure 4 is not worth the space, remove it. "To support this requires three operations" - Doesn't n3 also need to split the range from n1-2 to n3-2? "the wimpy (Soekris) nodes" - This is the first mention of Soekris. I have not idea what it means - please explain or cite. "the front-end would need only 512KB" - Hmm if N=1000 then N log N is 10,000 so the front-end needs 5MB, no? "coupon collectors problem" - cite a reference Section 4.1 - You keep saying small cluster, without saying exactly how many nodes are in the cluster. This is the place to say. Table 2 - Explain the difference between the Soekris (1) and Soekris (8) lines. I assume (1) and (8) are the number of nodes but make this explicit. In one place the model number is Net4801-60, and in another it's Net4801-66? Section 5.2 - Why is 1/2 million QPS and 4TB data your performance requirement? "over a 10Mbps takes" - 10Mbps what? Figure 9 - This is placed/numbered out of order relative to the discussion. "owes an intellectual debt to RAID" - I don't agree that RAID should be enshrined here. What about multiprocessor systems that predate RAID?? "Myers and Madden" - The reference just mentions Myers. And [19] is incomplete. "Hadoop" - cite reference =========================================================================== OSDI 2008 Review #157D Updated Sunday 22 Jun 2008 11:06:10am EDT --------------------------------------------------------------------------- Paper #157: FAWN: A Fast Array of Wimpy Nodes --------------------------------------------------------------------------- Overall merit: 3. Weak accept ===== Paper strengths and weaknesses ===== Strengths - new, fun architecture - easy to understand - potentially important - catchy name Weaknesses - some odd design decisions - not done with implementation - no numbers at scale - a little misleading upfront ===== Comments for author ===== This paper presents FAWN (fast array of wimpy nodes). The idea is simple: use a bunch of low-powered flash-based backends to build a high throughput but low power cluster key/value storage system. The paper presents the rationale for the idea, and then discusses its initial implementation as well as a few experiments. Overall, I was inclined to like this paper, but it has a few major flaws which prevent me from recommending it whole-heartedly. As it stands, it will need serious amounts of both rewriting and (more importantly) real technical work to make it into the truly great paper it should (and perhaps could) be. I wouldn't be against it being published (because I like the idea and direction enough, and I'd be a little worried the authors will get scooped if we reject it), but I also think there is a much better paper waiting to happen here. In no particular order: + The paper presents a number of design decisions which make little sense to me. For example, the use of Chord and the focus on join/leave performance make me think of a wide-area setting, whereas the environment here clearly is (and should be) a well-controlled server room. This is a cluster in a basement somewhere! 
Even in relatively large systems, churn should not be an issue. Thus, I felt like the paper focused on the wrong technical issues, which lessens my interest as well as the potential impact of the paper.

+ A related point: some important related work is missing. For example, why is there no discussion of systems such as Gribble's distributed hash tables or follow-ons such as Thekkath's Boxwood? Why wouldn't software architectures like these work well for the system you envision? The authors miss this work, and thus also miss the chance to tell us what is really new about FAWN from a s/w perspective.

+ The implementation is not complete. For example, failure during update is not handled, which can result in inconsistency across replicas (buried at the end of 3.4). This hints at the bigger problem: the implementation here is just premature. There is not much beyond the basics needed to get a few experiments running. The authors need to spend more time on this and make the system really work.

+ There is no compelling workload put forth here. Some application or real use of FAWN would make this paper very much more convincing. Even just taking a database workload, many of which are "seek hungry" when run on disks, and showing how FAWN is a better match, would be great. Probably forgivable not to have this if the rest of the paper were solid, but ...

+ The experiments are at incredibly small scale (8 nodes). Though the authors claim that the system should be able to scale to much bigger sizes, no such evidence is given. The claim that "the system should scale" indicates some naivete, alas -- indeed, it is those things we don't anticipate that make scaling such a challenging problem.

+ You should look up Amdahl's rules of thumb for balanced systems (1 MIPS per 1 MB of memory per 1 MB/s of I/O, or something like that). It would provide a nice foil; further, perhaps you could extend this to include power? I think there is something nice to be done along these lines. Probably the best reference to start with is Jim Gray's "Rules of Thumb in Data Engineering".

+ Section 2 (background) was nice. When talking about disks, though, it is probably best to talk about overall positioning time and not just seek. Rotational delay is part of the equation, after all.

+ 2.1: Not sure Cost/GB is the number you really want to look at. The point (as you make!) is that disks are over-provisioned when it comes to size. It would be great if you could point this out numerically in your comparison. Is there some data you could get on how much of a disk drive's capacity (in transactional settings) is actually used?

+ The cache-aware ref to Ailamaki's work [3] is fine but not right. The first such work is Gray's AlphaSort work (which makes this point first, and way before), and there are many other works between Gray and [3] (e.g., work on cache-conscious joins, etc.).

+ Figures 2 and 3 are awful. Not aesthetically pleasing (ok, ugly, I said it) and not very useful. You can do a lot better here. Spend some cash on a good drawing program (bye bye xfig) and spend more time on these figures. Done well, they can be useful. As it stands, not so much.

+ 3.5: awkward sentence in "load balanced assumptions", starting with "We assume that an adversary..." (the sentence is hard to parse). On top of that, why do you care about adversaries here? (not a big deal, just wondering)

+ Some of the numbers don't make sense to me.
For example, in the intro you say that each node can handle "700 queries/sec", whereas in Fig 6 a node seems to be handling about half of that.

+ The discussion of how caching in the front-end is important is a good one, and should be pushed on a bit more. You could open up the more general question: how do you configure a FAWN given a particular workload? How much to put in the front end vs. how much on the backends? Etc. Perhaps you could automate this a bit? (a la all the Minerva work from HP). There is a lot of potential future work here...

+ Intro: comparison to memcached. You have to be a little careful here. What exactly are you suggesting? For example, is FAWN a good replacement for all uses of memory-based cluster caches? Probably not. Certainly, when updates are thrown into the mix, things get more interesting. But you need to be clear: what exact types of workloads do you think this will work for, and (more importantly) where does FAWN fail? Be explicit. There is enough here to be excited about FAWN.

+ Intro: Use bullets when talking about the key points of FAWN.

+ The last para of the intro hints that a complete FAWN cluster can handle "one half million lookups per second while consuming 5KW of power..." (and so forth), but this is just based on a pen-and-paper analysis. You have to be more explicit about this; otherwise you will mislead the reader (as you did to me).

===========================================================================
OSDI 2008 Review #157E
Updated Sunday 22 Jun 2008 7:52:43pm EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 4. Accept

===== Paper strengths and weaknesses =====

A novel architecture for attacking a real problem. Smaller, lower-power systems with flash are used to replace much larger DRAM/disk-based systems for query processing. It isn't very general (we can't replace Google's world with what they design), but it points in a very reasonable direction.

===== Comments for author =====

I don't understand the n log n cache size argument for the front ends in Section 3.5. Could you expand that a bit?

I like the idea. How far can you push it as a query processor? Clearly a straight lookup is possible and easy to do. How complicated a query can you expect to process on your machines? In a web search query, one has to find the best matches to a query rather than just find an object. Do you think you can extend the architecture that far? What would it look like? Or do you care?

===========================================================================
OSDI 2008 Review #157F
Updated Tuesday 8 Jul 2008 10:03:44am EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 2. Reject

===== Paper strengths and weaknesses =====

This paper makes the case for optimizing data-center hardware according to query throughput per unit of energy consumed. New CPUs offer reduced performance according to a MIPS/energy measure compared with older CPUs. A cluster of older CPUs may offer the same performance as a single new CPU, provided that the application scales well, which is the case (easily) for a DHT-based service.
Since data-center costs are increasingly dependent on energy efficiency, the paper suggests designing a data center around "wimpy" -- low-end -- CPUs. A similar case is made regarding disks, where the sweet spot of storage/bandwidth vs. energy lies with flash disks. This point is highly interesting, and is argued clearly and with sufficient supporting data.

The paper then describes the design of FAWN, a system designed with wimpy nodes to support a large-scale DHT. The basic DHT design borrows from Chord, and provides a put/get API with a minor extension. Storage on nodes is organized into large chunks in order to avoid many small writes, which are expensive on flash disks.

Unfortunately, this part has very little novelty (if any). Perhaps there really are no particular technical challenges in making use of the new architecture; it is merely a hardware choice. If that is so, then I am not sure that the paper really belongs in OSDI. Otherwise, the work fails to advance the system beyond the basic insight, and should mature further before it is published.

The evaluation is not particularly enlightening. The interesting part is Section 5.1, which underlies the entire paper's thesis. The rest is a somewhat wimpy (pun intended) small-cluster DHT prototype, with global knowledge of the entire hash map at the front end, and really nothing much to learn from.

===========================================================================
OSDI 2008 Review #157G
Updated Sunday 13 Jul 2008 4:39:17pm EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 2. Reject

===== Paper strengths and weaknesses =====

Proposes building a scalable hash table service for small objects by using a cluster of many cheap, low-power, Flash storage-based nodes that are fronted by a set of more powerful stateless caching nodes. The key observation is that modern hardware is primarily optimized for certain kinds of high performance, that typical server configurations are unbalanced with respect to power consumption, and that Flash storage drives enable a more balanced scale-out storage design from a power point of view while still providing good performance.

While exploitation of Flash drives and balanced system design are good ideas, they are hardly novel, and my first reaction from reading the abstract was to think "SAN storage nodes with Flash instead of disk". Unfortunately, the paper doesn't mention the SAN literature and, in general, does a poor job of evaluating its design with respect to other real cluster system designs, as opposed to hypothetical designs based on evaluating the performance of individual standard PC and embedded-system nodes of various kinds.

The evaluation section furthermore seems to focus almost exclusively on a purely read-based workload (which is only vaguely described) as far as performance is concerned, without discussing any specific real-life workloads that its design would be appropriate for. Scaling is evaluated to 8 nodes, and then linear scaling is simply asserted to hold out to thousands of nodes, with no design details given to convince a reader that various second-order effects that can affect scaling at hundreds to thousands of nodes have been addressed.
Finally, the paper completely ignores how strong consistency requirements might change/unbalance the design, without giving any indication of how many real-life workloads would be happy with weakly consistent hash table semantics versus how many require strongly consistent semantics.

===== Comments for author =====

The overall topic that this paper is trying to address is an important one, and a good, detailed evaluation of how various different hardware and design configurations would perform compared to each other would be really valuable and interesting. However, this paper is far too preliminary and incomplete in its results. Examples of unanswered questions I was left with include the following:

- What exactly are the REAL workloads that are being targeted? What kinds of read/write ratios do they have? Which ones require strong consistency? How big are the databases involved? O(4) TB is a rather small data repository these days; what are the implications of serving substantially larger ones?

- As far as I can tell, you assume stateless front-end nodes, and that works great for weakly consistent semantics where each front-end can simply cache data without having to worry about invalidations. If strongly consistent semantics were required, then how would this permute the balance of your system design? Your claim of 80K qps/front-end and 500K qps in aggregate might go down rather substantially and require a non-trivially different balance of system resources.

- The scaling evaluation reminds me of the "engineering proof" that all odd numbers are prime: 1 is prime, 3 is prime, 5 is prime, 7 is prime, therefore all odd numbers are prime. I've never encountered a real system where every order of magnitude in scale didn't bring unexpected design problems out of the woodwork. For example, how exactly are your front-end nodes keeping track of liveness of the back-end nodes? How are front-end failures handled?

- What effect do failure recovery actions have on actual system performance? When splits and merges occur, how do the copies affect regular network traffic? Do you control network saturation? Are various key-value pairs unavailable and, if so, for how long?

- How do typical SAN nodes compare to the server and embedded nodes that you've chosen to evaluate? Where are SANs going wrt power considerations?

Some more specific comments:

- Sec. 3.5: please provide a reference for the coupon collector's problem.

- Sec. 5.1: What exactly is the "FAWN database"? In general, there is effectively no detailed discussion of the workload you're evaluating.

- Sec. 5.2: 81K qps for a front-end node seems to only be about how fast you can forward client requests to back-end nodes. But there is a lot more going on in a real front-end node than just that! You need to describe all the things that the front-end would need to do in a real implementation, including liveness monitoring, etc.

- Sec. 5.2: You claim that it would take ~2000 disks to provide 500K qps. But your table lists 600 qps for your server configuration. Wouldn't that imply that only ~850 disk-based machines are needed?

===========================================================================
OSDI 2008 Review #157H
Updated Monday 14 Jul 2008 9:37:53am EDT
---------------------------------------------------------------------------
Paper #157: FAWN: A Fast Array of Wimpy Nodes
---------------------------------------------------------------------------

Overall merit: 1. Strong reject
===== Paper strengths and weaknesses =====

This paper presents FAWN, an architecture for building scalable and power-efficient clusters of machines. It makes the observation that powerful machines with hard disks consume a significant amount of power, thereby providing poor throughput in terms of power efficiency (i.e., the number of operations per joule is low). It then argues that a cluster built using wimpy nodes (low CPU and low memory) with flash memory and no hard disks can provide much better power efficiency while providing good performance. It describes a single-hop DHT-based architecture that provides a put-get store using these nodes. The paper then presents the throughput per joule for different hardware configurations and shows that the chosen FAWN node hardware performs the best on this metric.

===== Comments for author =====

The paper's observation about using "crippled nodes" with flash is interesting, but the paper is lacking on several fronts that keep it from being a good paper at this point in time.

First, the software architecture of FAWN is not very novel - it is a tweak on systems such as Dynamo, with the storage architecture on the nodes done in such a way as to avoid some of the problems that Dynamo had faced. These improvements are useful in the sense that one can use consistent hashing if needed, but the improvement is relatively minor.

Second, the paper punts on the issue of consistency, which is very important for application writers. Dynamo got away with it because it had a very specific application in mind - the shopping cart. Is FAWN also designed for the same set of applications? If so, it would be important to get those applications to run with FAWN.

Finally, given the above facts, the main contribution of the paper is the observation that datacenters should be built using wimpy nodes. However, to prove this claim, one would need a very solid evaluation with workloads or systems from real internet web sites. Currently, the evaluation of FAWN is quite weak - it essentially consists of a set of microbenchmarks and synthetic workloads.

==================== Summary of PC discussion ====================

Comment
Tuesday 29 Jul 2008 11:38:54am EDT

Summary of PC discussion: The PC liked much of this paper, but felt it was premature. Please focus on finishing the work itself (e.g., consistent remote update with failures), bringing out what was interesting/novel from the perspective of software architecture, and scaling to a larger number of (wimpy) nodes. The PC looks forward to seeing the improved/mature version of the work at the next major conference.