
My point is that even if it fails in a predictable way in test scenarios, it won't necessarily do so in real life. The paper & test framework are effectively going to trick people into trusting that the thing will work well, rather than going by actual observations of how it works in varied production environments. For the casual observer, who cares? But for the people about to spend millions of dollars on this tech to run it in production, I hope they aren't swayed purely by theoreticals.

Remember a decade ago when NoSQL came out, and everybody was whooping and hollering about how amazing the concept was, and people like me would go "Well, wait a minute, how well does it actually run in production?" And people on HN would shout us down because the cool new toy is always awesome. And lo and behold, most NoSQL databases are no better in production than Postgres, if not a tire fire. People who have fought these fires before can smell the smoke miles away.




Their test framework uncovered a few ZooKeeper bugs.

ZooKeeper bugs are very rare even in production and hard to find.

If they can uncover a ZooKeeper bug, I think they did some serious testing.


You're still missing my point. Testing shows you testing bugs; running in production shows you production bugs. There are bugs you will literally never see until you run in production, and there is absolutely no way to discover them purely by testing, no matter how rigorous your tests are.

The most glaring of these are single-event upsets (SEUs) caused by cosmic rays. Unless your test framework is running on 10,000 machines, 24 hours a day, for 3 years, you will not see the SEUs that will hit production and cause bugs that are literally impossible to reproduce without randomly-generated cosmic rays.

Simpler examples are buggy firmware. Or very specific sections of a protocol getting corrupted in very specific ways that only trigger specific kinds of logic in a program over a long period of time. Or simply rolling over a floating-point counter that was never reached in test because "we thought 10,000 test nodes was enough" or "we didn't have 64 terabytes of RAM".
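A minimal sketch of that last rollover case, assuming the counter is stored as a float32 (the thread doesn't specify the type; this is just an illustration of a threshold a small test cluster may never reach):

  import numpy as np

  # Hypothetical metric accumulator stored as float32 (an assumption for
  # illustration; the real field/type is unspecified in the thread).
  counter = np.float32(2**24)   # 16,777,216 -- the last integer float32 can step by 1
  next_value = counter + np.float32(1)

  # Above 2**24 the spacing between adjacent float32 values is 2, so the +1
  # rounds away and the counter silently stops advancing. A test cluster that
  # never counts this high will never see the bug.
  print(next_value == counter)  # True

The increment doesn't crash or log anything; the value just quietly stops changing once production scale exceeds what testing ever reached.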

And more examples, like the implementation or the administrative tools just being programmed shittily. Even if you find every edge case, a badly written program may simply not deal with them properly. Or the admins may not have the tools to deal with every strange scenario (a common problem with distributed systems that haven't been run in production). Or 5 different bugs happening across 5 completely different systems at the same time, in just the right order, to cause catastrophic failure.

Complex systems just fuck up more. In fact, if your big complex distributed system isn't fucking up, it's very likely that it will fuck up and you just haven't seen it yet, which is way more dangerous, because you don't know how or when it's going to fuck up.



