In theory, their design makes sense: separating out the components makes it easier to adapt to different systems and scenarios, it lets you map out the architecture more succinctly (and hopefully simplify the interfaces), you can point that whole testing framework at the individual pieces, etc.
The problem is the implementation. The more you separate components, the more code you have to add so they can interoperate. More code means more complexity, and more complexity means more bugs. On top of that, when you keep splitting components out into separate failure domains, each with its own chance of failing ("3 components running on 1 node" -> "3 components running on 3 separate nodes"), you increase the odds of problems even further. So while the design looks nice in theory, and I'm sure they have some wonderful reports about how thorough their test framework is, in practice it might be a tire fire (depending on how it's actually used).
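To put rough numbers on the failure-domain point, here's a back-of-envelope sketch in Python with a completely made-up per-node availability figure; the exact number doesn't matter, only the direction:

    # Toy model, hypothetical numbers: assume each node is independently
    # "up" with probability p, and the system needs all of its components.
    p = 0.999  # made-up per-node availability

    # 3 components co-located on 1 node: the system is up iff that node is up.
    p_one_node = p

    # 3 components spread across 3 nodes: the system is up only when all
    # three nodes are up at the same time.
    p_three_nodes = p ** 3

    print(f"one node    : {p_one_node:.6f}")     # 0.999000
    print(f"three nodes : {p_three_nodes:.6f}")  # 0.997003 -> roughly 3x the downtime

And that toy model doesn't even count the extra interop code you had to write to glue the pieces together, which is the other half of the problem.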
Some people are going to go, "I've been using this in production for a year, it's rock solid!" Well, people said that about Riak, until they wanted to set it on fire when they found out how buggy the implementation was at actually handling various errors and edge cases, and that they each had to hire dedicated Riak programmers just to fix all its problems. You can make large-scale Riak clusters work "at scale" if you hire enough on-call to constantly fight fires and monkey-patch the bugs and juggle clusters and nodes and repair indexes, and hand-wave outages and flaws as due to some other issue ("the cluster was completely hosed and had to be recovered from backup because the disks got corrupted and a node went down and a network partitioned all at once, it's not Riak's fault").

(looking at this HN thread, it seems like my assumption may be correct: https://news.ycombinator.com/item?id=27424605)
> looking at this HN thread, it seems like my assumption may be correct
that has got one guy complaining about fdb, and a couple of others saying positive things.
so in short, you've got no experience with fdb, wrote a couple of paragraphs of speculation and references to other systems, and then declared victory, hoping nobody would read the other thread, i guess.
i've written stuff against fdb, and i've seen it in non-trivial production. it's not a panacea, it's a useful point in the design space of databases, and does pretty well there.
Do you work in distributed systems at a level where you understand their claims, and the significance of this work, versus having a general dislike of their high-level implementation choices? I'm not implying anything by that except that not all software people understand strict serializability, etc. in distributed systems, and in this case you need some of that understanding to critique the paper.
And if you do work in distributed systems at that level, please elaborate in detail on what you're trying to refute, because right now it's all "I feel this should be bad. We need to be careful in trusting them!", which is unconvincing.
My point is that even if it fails in a predictable way in test scenarios, it won't necessarily fail that way in real life. The paper & test framework are effectively going to trick people into trusting that the thing will work well, rather than going by actual observations of how it works in varied production environments. For the casual observer, who cares? But for the people about to spend millions of dollars to run this tech in production, I hope they aren't swayed purely by the theory.
Remember a decade ago when NoSQL came out, and everybody was whooping and hollering about how amazing the concept was, and people like me would go, "Well, wait a minute, how well does it actually run in production?" And people on HN would shout us down, because the cool new toy is always awesome. And lo and behold, most NoSQL databases are no better in production than Postgres, if not outright tire fires. People who have fought these fires before can smell the smoke miles away.
You're still missing my point. Testing shows you testing bugs. Running in production shows you production bugs. There are bugs you will literally never see until you run in production, and there is absolutely no way to discover them purely by testing, no matter how rigorous your tests are.
The most glaring of these are bugs caused by SEUs (single-event upsets) from cosmic rays. Unless your test framework is running on 10,000 machines, 24 hours a day, for 3 years, you will not receive the SEUs that will hit production, and the bugs they cause are literally impossible to reproduce without those randomly generated cosmic-ray events.
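The math here is nothing more than multiplication, but it's worth spelling out. With a completely made-up per-machine upset rate (the real rate depends on hardware, ECC, altitude, and so on), the expected number of upsets scales with machine-hours:

    # Purely hypothetical SEU rate -- NOT a measured figure, just to show scaling.
    seu_per_machine_hour = 1e-4

    test_hours = 100 * 24 * 30           # 100 test machines for a month
    prod_hours = 10_000 * 24 * 365 * 3   # 10,000 machines for 3 years

    print(test_hours * seu_per_machine_hour)   # ~7 expected upsets
    print(prod_hours * seu_per_machine_hour)   # ~26,000 expected upsets

Whatever the real rate is, production sees orders of magnitude more of these events than any test rig ever will.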
The more mundane bugs are things like buggy firmware. Or very specific sections of a protocol getting corrupted in very specific ways that only trigger certain logic in the program over a long period of time. Or simply rolling over a floating-point value that was never reached in test because "we thought 10,000 test nodes was enough" or "we didn't have 64 terabytes of RAM".
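As a concrete (and hypothetical) illustration of that last class of bug: a counter kept in a 64-bit float silently stops counting once it crosses 2**53, and you will never notice that on a small test dataset:

    # A double can represent every integer up to 2**53 exactly; past that,
    # small increments start getting silently rounded away.
    counter = float(2**53)
    print(counter + 1 == counter)  # True  -- the increment is lost
    print(counter + 2 == counter)  # False -- 2**53 + 2 is still representable

Nothing in a test environment with a few million records will ever get near that boundary; a few years of production traffic might.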
And then there are things like the implementation or the administrative tools just being written shittily. Even if you find every edge case, you can still write the program badly so that it simply doesn't handle them properly. Or not ship the tools that admins need to deal with every strange scenario (a common problem with distributed systems that haven't been run in production). Or 5 different bugs happening across 5 completely different systems at the same time, in just the right order, to cause a catastrophic failure.
Complex systems just fuck up more. In fact, if your big complex distributed system isn't fucking up, it's very likely that it will and you just haven't seen it yet, which is way more dangerous, because you have no idea how or when it's going to fuck up.