In theory, their design makes sense: separating out the components makes it easier to adapt to different systems and scenarios, it lets you map out the architecture more succinctly (and hopefully simplify the interfaces), you can point that whole testing framework at the individual pieces, etc.
The problem is the implementation. The more you separate components, the more code you have to add so they can interoperate. More code means more complexity, and more complexity means more bugs. On top of that, when you keep splitting components out into separate failure domains, each with its own chance of failing ("3 components running on 1 node" -> "3 components running on 3 separate nodes"), you increase the odds of problems even further. So while the design looks nice in theory, and I'm sure they have some wonderful reports about how thorough their test framework is, in practice it might be a tire fire (depending on how it's actually used).
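To put rough numbers on the failure-domain point, here's a back-of-envelope sketch in Python with a completely made-up per-node availability figure; the exact number doesn't matter, only the direction:

    # Toy model, hypothetical numbers: assume each node is independently
    # "up" with probability p, and the system needs all of its components.
    p = 0.999  # made-up per-node availability

    # 3 components co-located on 1 node: the system is up iff that node is up.
    p_one_node = p

    # 3 components spread across 3 nodes: the system is up only when all
    # three nodes are up at the same time.
    p_three_nodes = p ** 3

    print(f"one node    : {p_one_node:.6f}")     # 0.999000
    print(f"three nodes : {p_three_nodes:.6f}")  # 0.997003 -> roughly 3x the downtime

And that toy model doesn't even count the extra interop code you had to write to glue the pieces together, which is the other half of the problem.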
Some people are going to go, "I've been using this in production for a year, it's rock solid!" Well, people said that about Riak, until they wanted to set it on fire when they found out how buggy the implementation was at actually handling various errors and edge cases, and that they each had to hire dedicated Riak programmers just to fix all its problems. You can make large-scale Riak clusters work "at scale" if you hire enough on-call to constantly fight fires and monkey-patch the bugs and juggle clusters and nodes and repair indexes, and hand-wave outages and flaws as due to some other issue ("the cluster was completely hosed and had to be recovered from backup because the disks got corrupted and a node went down and a network partitioned all at once, it's not Riak's fault").

(looking at this HN thread, it seems like my assumption may be correct: https://news.ycombinator.com/item?id=27424605)
> looking at this HN thread, it seems like my assumption may be correct
that has got one guy complaining about fdb, and a couple of others saying positive things.
so in short, you've got no experience with fdb, wrote a couple of paragraphs of speculation and references to other systems, and then declared victory, hoping nobody would read the other thread, i guess.
i've written stuff against fdb, and i've seen it in non-trivial production. it's not a panacea, it's a useful point in the design space of databases, and does pretty well there.
Do you work in distributed systems at a level where you understand their claims, and the significance of this work, versus having a general dislike of their high-level implementation choices? I'm not implying anything by that except that not all software people understand strict serializability, etc. in distributed systems, and in this case you need some of that understanding to critique the paper.
And if you do work in distributed systems at that level, please elaborate in detail on what you're trying to refute, because right now it's all "I feel this should be bad. We need to be careful in trusting them!", which is unconvincing.
My point is that even if it fails in a predictable way in test scenarios, it won't necessarily fail that way in real life. The paper & test framework are effectively going to trick people into trusting that the thing will work well, rather than going by actual observations of how it works in varied production environments. For the casual observer, who cares? But for the people about to spend millions of dollars to run this tech in production, I hope they aren't swayed purely by the theory.
Remember a decade ago when NoSQL came out, and everybody was whooping and hollering about how amazing the concept was, and people like me would go, "Well, wait a minute, how well does it actually run in production?" And people on HN would shout us down, because the cool new toy is always awesome. And lo and behold, most NoSQL databases are no better in production than Postgres, if not outright tire fires. People who have fought these fires before can smell the smoke miles away.
You're still missing my point. Testing shows you testing bugs. Running in production shows you production bugs. There are bugs you will literally never see until you run in production, and there is absolutely no way to discover them purely by testing, no matter how rigorous your tests are.
The most glaring of these are bugs caused by SEUs (single-event upsets) from cosmic rays. Unless your test framework is running on 10,000 machines, 24 hours a day, for 3 years, you will not receive the SEUs that will hit production, and the bugs they cause are literally impossible to reproduce without those randomly generated cosmic-ray events.
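The math here is nothing more than multiplication, but it's worth spelling out. With a completely made-up per-machine upset rate (the real rate depends on hardware, ECC, altitude, and so on), the expected number of upsets scales with machine-hours:

    # Purely hypothetical SEU rate -- NOT a measured figure, just to show scaling.
    seu_per_machine_hour = 1e-4

    test_hours = 100 * 24 * 30           # 100 test machines for a month
    prod_hours = 10_000 * 24 * 365 * 3   # 10,000 machines for 3 years

    print(test_hours * seu_per_machine_hour)   # ~7 expected upsets
    print(prod_hours * seu_per_machine_hour)   # ~26,000 expected upsets

Whatever the real rate is, production sees orders of magnitude more of these events than any test rig ever will.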
The more mundane bugs are things like buggy firmware. Or very specific sections of a protocol getting corrupted in very specific ways that only trigger certain logic in the program over a long period of time. Or simply rolling over a floating-point value that was never reached in test because "we thought 10,000 test nodes was enough" or "we didn't have 64 terabytes of RAM".
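As a concrete (and hypothetical) illustration of that last class of bug: a counter kept in a 64-bit float silently stops counting once it crosses 2**53, and you will never notice that on a small test dataset:

    # A double can represent every integer up to 2**53 exactly; past that,
    # small increments start getting silently rounded away.
    counter = float(2**53)
    print(counter + 1 == counter)  # True  -- the increment is lost
    print(counter + 2 == counter)  # False -- 2**53 + 2 is still representable

Nothing in a test environment with a few million records will ever get near that boundary; a few years of production traffic might.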
And then there are things like the implementation or the administrative tools just being written shittily. Even if you find every edge case, you can still write the program badly so that it simply doesn't handle them properly. Or not ship the tools that admins need to deal with every strange scenario (a common problem with distributed systems that haven't been run in production). Or 5 different bugs happening across 5 completely different systems at the same time, in just the right order, to cause a catastrophic failure.
Complex systems just fuck up more. In fact, if your big complex distributed system isn't fucking up, it's very likely that it will and you just haven't seen it yet, which is way more dangerous, because you have no idea how or when it's going to fuck up.