Unreliability at Scale (dshr.org)
175 points by mooreds on June 16, 2021 | 48 comments



One of the most severe S3 outages[1] was triggered by a single NIC flipping bits. S3 maintained (at least back then) the routing table (object-to-node mapping) in a Merkle tree data structure that was kept up to date via a gossip protocol. The bit-flipping machine, however, would spread misinformation about the routing table, forcing all its neighbours, and their neighbours (and so on), to re-sync their Merkle trees. So a large number of machines spent more time gossiping than serving customer requests.

I vividly remember the outage, as I was at Amazon during that time; in fact I was in the AWS group, though on a different team. I guess the ticket was escalated to sev-1 because a large number of tier-1 services had begun depending on S3. The COE action items took months to roll out.

While it's one thing to say "be ready for all kinds of failure, including hardware," it takes massive failures to internalise the lessons. AWS is now paranoid about being resilient against all kinds of failures, including earthquakes and lightning.

[1] https://status.aws.amazon.com/s3-20080720.html
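
If it helps to picture it, here's a toy sketch of that failure mode (hypothetical names, nothing like the real S3 code; a plain hash stands in for the Merkle root): nodes gossip the root of their view of the routing table, a mismatch forces an expensive re-sync, and one machine with a bit-flipping NIC keeps advertising a bogus root, so its peers never converge.

    import hashlib, random

    def root_hash(table):
        # Stand-in for a Merkle root: hash of the serialized routing table.
        return hashlib.sha256(repr(sorted(table.items())).encode()).hexdigest()

    GOOD_TABLE = {f"key-{i}": f"node-{i % 8}" for i in range(64)}

    class Node:
        def __init__(self, name, corrupt_nic=False):
            self.name, self.corrupt_nic = name, corrupt_nic
            self.table = dict(GOOD_TABLE)
            self.syncs = 0

        def advertise(self):
            h = root_hash(self.table)
            if self.corrupt_nic:
                # The flaky NIC corrupts the advertised root on the wire.
                h = h[:-1] + random.choice("0123456789abcdef")
            return h

        def gossip_with(self, peer):
            if self.advertise() != peer.advertise():
                # Mismatched roots trigger an (expensive) tree sync on both sides.
                self.syncs += 1
                peer.syncs += 1

    nodes = [Node("bad", corrupt_nic=True)] + [Node(f"n{i}") for i in range(9)]
    for _ in range(1000):                    # gossip rounds
        a, b = random.sample(nodes, 2)
        a.gossip_with(b)
    for n in nodes:
        print(n.name, "syncs:", n.syncs)     # the one bad node drives all the syncing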


> So a large number of machines spent more time gossiping than serving the customer request.

Machines are becoming increasingly capable of automating jobs previously only done by humans.


It always takes that one incident to reprioritize reliability. Kudos to AWS for actually taking action to change the org to be more resilient though. Most organizations that have reliability issues will hire a QE expert or have some Quality initiatives that start hot but fizzle out as leadership loses interest until another disaster hits again.


What did you call that single-NIC bit-flipping event at Amazon?

Is the term 'cosmic-ray-induced error' outdated in FAANG?

>Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.

[1]https://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electroni...


This topic was raised earlier this week too:

https://news.ycombinator.com/item?id=27484866 Silent Data Corruptions at Scale

In the other comments, there was talk of having some kind of active search-for-unreliability as a background task in clusters. I was kinda pleased with my suggestion in that thread:

There are programs, used for testing and “fuzzing” compilers, that generate random programs based on a seed.

When a node runs a random program, it doesn’t know if the output is correct. But it could report the seed and result to a central database.

Then, if you had several nodes, and two nodes ran the same seed and got different output, that would mean something was wrong and needed investigation.

There are also programs for reducing such programs down to a minimum test case. So once a discrepancy is found, it can be reduced to some small program that recreates it.
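
A minimal sketch of the report-and-compare idea (hypothetical names; assume some harness exists that generates and runs a random program for a seed and hashes its output):

    from collections import defaultdict

    # seed -> {output_hash: [nodes that reported it]}, standing in for a central DB
    results = defaultdict(lambda: defaultdict(list))

    def report(node, seed, output_hash):
        results[seed][output_hash].append(node)

    def discrepancies():
        # Any seed with more than one distinct output means some node's
        # toolchain or hardware computed something different: investigate.
        return {seed: dict(outs) for seed, outs in results.items() if len(outs) > 1}

    # Two nodes agree on seed 1; a third disagrees with node-a on seed 2.
    report("node-a", seed=1, output_hash="ab12")
    report("node-b", seed=1, output_hash="ab12")
    report("node-a", seed=2, output_hash="cd34")
    report("node-c", seed=2, output_hash="cd3f")   # suspect

    print(discrepancies())   # {2: {'cd34': ['node-a'], 'cd3f': ['node-c']}}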

I once worked on a compiler backend, and a CI job generated random C programs and compared the x86 output against the novel CPU's simulator. Any discrepancies found were auto-reduced by these tools and a ticket was automatically created. Lots of our bugs were found and fixed this way.

(My memory is we used C-reduce for the reductions. I can’t remember the tool we used for generating the test programs, but there are several.)


Nice. Automated testing like this is going to be absolutely essential as the systems we build get more complex and operate at scales humans would find hard to work with.


SW testing before deployment would be nice to have, but it's expensive. /s


As someone said: "Everyone has a testing environment. Some are lucky enough to have a completely separate production environment."


The issue with testing environments is that they almost always differ from the production environment in important ways. Cloud and infrastructure-as-code can mitigate this to a certain extent, but I doubt this issue can ever be fully eliminated. Good CI pipelines should always be our first line of defense. Testing environments sometimes sound more appealing because you don't have to write tests, but that's a trap.


One important way testing environments and production environments necessarily differ is very germane to discussion of this problem. If you have an issue in 1 of 64 cores on a CPU on 1 out of 1000 CPUs, what are the chances your testing environment will ever see the issue? Do you have 10,000 CPUs in your testing environment? I know I don't.
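
Back-of-the-envelope, with made-up but plausible numbers:

    # Say 1 in 1000 CPUs has a defect confined to 1 of its 64 cores.
    p_bad_cpu = 1 / 1000

    prod_cpus, test_cpus = 10_000, 20
    print(prod_cpus * p_bad_cpu)               # 10.0  expected bad CPUs in prod
    print(test_cpus * p_bad_cpu)               # 0.02  expected bad CPUs in test
    print(1 - (1 - p_bad_cpu) ** test_cpus)    # ~0.02 chance test has even one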

At scale we really do need to think about monitoring, automated diagnosis, and recovery.


Before cloud, it was frustratingly common for QA to have lesser hardware than production. The problem with that is, when production hardware fails, the best you can do is get a vendor to overnight you a replacement.

Meanwhile if your QA team has the same hardware as prod, you can cannibalize a QA machine to replace the production hardware that failed.

Your QA probably cares less about being down overnight than your customers do.


I think we need to kill the fantasy of having testing environments for most software. For stuff like airplane control, pacemakers, etc., it's absolutely worth it. For most organizations serving APIs with an acceptable nonzero failure rate, canary deployments and other techniques provide almost all the value you need.


There is a solution to this problem, known for a few decades now: the vital coded processor. Basic idea: each operation can simultaneously compute a result and an arithmetic « signature » that can be pre-calculated for a program. The length in bits of the signature defines the probability of missing an error in the underlying hardware. Think ECC but the error correcting code defines the correct _execution_ of some machine code.

For some reason it has never caught on in the mainstream, but it truly would make sense for applications like this, albeit at the cost of computational overhead and more work for the compiler.
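
A toy illustration of the arithmetic-signature idea, using an AN code, the simplest member of this family (real vital coded processors are considerably more elaborate): every value is carried as A*x, correct operations keep results divisible by A, and a random hardware flip almost certainly breaks that property. The chance of missing an error is roughly 1/A, which is the "length of the signature" point above.

    A = 2**16 + 1        # code constant; a bigger A means fewer undetected errors

    def encode(x):
        return A * x

    def decode(cx):
        if cx % A != 0:
            raise RuntimeError("signature check failed: probable hardware fault")
        return cx // A

    def coded_add(cx, cy):          # A*x + A*y == A*(x + y), still a codeword
        return cx + cy

    def coded_mul_const(cx, k):     # (A*x) * k == A*(x*k), still a codeword
        return cx * k

    a, b = encode(1234), encode(5678)
    print(decode(coded_mul_const(coded_add(a, b), 3)))   # 20736

    broken = coded_add(a, b) ^ (1 << 7)   # simulate a single flipped bit
    decode(broken)                        # raises: signature check failed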


And that's why IBM mainframes have duplicate CPUs with comparison. [1]

[1] https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=c...


In the Facebook bug listed in the post, this specific mitigation would probably not have been enough, since the bug was due to invalid instructions emitted by the JIT.

Under the exact same workload, two duplicate JITs running on two duplicate CPUs would most likely have emitted the same erroneous code.


> due to invalid instructions emitted by the JIT.

That's not how I understood the blog post.

> Next they needed to understand the specific sequence of instructions causing the corruption. This turned out to be as much of a nightmare as anything else in the story. The application, like most similar applications in hyperscale environments, ran in a virtual machine that used Just-In-Time compilation, rendering the exact instruction sequence inaccessible. They had to use multiple tools to figure out what the JIT compiler was doing to the source code, and then finally achieve an assembly language test:

>> The assembly code accurately reproducing the defect is reduced to a 60-line assembly level reproducer. We started with a 430K line reproducer and narrowed it down to 60 lines.

It sounds like the JIT produced correct (although hard-to-recover) machine code. Then, when the CPU ran that machine code, it executed it incorrectly, but only on core 59.


Nit: if you're actually running multiple CPUs to mitigate data corruption, you need a third CPU to break ties.


Only if your response is to continue. If your response is to mark the processor pair as faulty and take it out of service, two is sufficient. I worked on the kernel for a fault-tolerant system in 1991 based on this model plus checkpointed memory. The sales story was that such a design was more cost-effective than the "pair and spare" approach used by competitors like Tandem and Stratus.


But you only need two to detect an issue, assuming they don't fail the same way. That does leave a gap on the mitigation side if you just let it limp along.


What they do in IBM zSeries is compare the instruction execution on two cores to see whether they agree. If they disagree, the operation is retried. If they disagree again, the CPU (both cores) is taken offline (and probably calls home to IBM to get a replacement shipped).
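
The decision logic is simple enough to sketch (a toy version, obviously not IBM's firmware):

    def run_checked(op, core_a, core_b, spare_pairs, report_failure):
        r1, r2 = core_a(op), core_b(op)
        if r1 == r2:
            return r1
        # First disagreement: retry once, in case it was a transient upset.
        r1, r2 = core_a(op), core_b(op)
        if r1 == r2:
            return r1
        # Persistent disagreement: retire the pair, call home, move the work.
        report_failure("lockstep pair disagreed twice; taking it offline")
        core_a2, core_b2 = spare_pairs.pop()
        return run_checked(op, core_a2, core_b2, spare_pairs, report_failure)

    good  = lambda x: x * x
    flaky = lambda x: x * x + (1 if x == 7 else 0)   # misbehaves on one input
    spares = [(good, good)]
    print(run_checked(7, good, flaky, spares, report_failure=print))   # 49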


Fascinating read. Kudos to the folks who no doubt had to work hard to identify these esoteric bugs. I do wonder if it is worth it for these companies to identify these issues - it sounds more economical to just isolate the faulty machine and get rid of it, instead of bothering to investigate. On the other hand, they perhaps can't rule out that the issue isn't something more widespread.


At scales this large, finding the root cause is the best choice. Once you know the cause, you can run tests on all boxes via cron to make sure everything works as expected, allowing you to pull hardware before an issue becomes widespread, and you can rerun a box's inputs on another machine if it was doing more than serving HTTP requests. Robust hardware validation and testing is a must when dealing with millions of cores.
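
A minimal sketch of that kind of screening job (Linux-only; the 1.1^53 check is the sort of known-answer computation the Facebook write-up describes, which came back 0 on the one bad core): pin the process to each core in turn, repeat a deterministic computation, and flag any core that ever disagrees with the precomputed answer.

    import math, os

    EXPECTED = 156   # int(math.pow(1.1, 53)) on known-good hardware

    def screen_cores(iterations=10_000):
        suspects = []
        original = os.sched_getaffinity(0)
        try:
            for core in sorted(original):
                os.sched_setaffinity(0, {core})          # pin to a single core
                if any(int(math.pow(1.1, 53)) != EXPECTED
                       for _ in range(iterations)):
                    suspects.append(core)
        finally:
            os.sched_setaffinity(0, original)            # restore affinity
        return suspects

    if __name__ == "__main__":
        # Run this from cron fleet-wide; file a ticket for any non-empty result.
        print("suspect cores:", screen_cores())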


I once worked where a faulty machine out of a massive fleet was finally sent back to intel. They put it under some kind of exotic test fixture and confirmed it was fucked. It was a defect never seen again or before. Root-causing that stupid chip was a complete waste of everyone's time.


That's sort of the opposite of what I've read from the companies operating at this scale; most seem to assume that failures are inevitable in a sufficiently large distributed system, that tracking the root cause of every failure is infeasible, and that the solution instead is to design resilient systems.


It's probably best to find a root cause for the first instance of a new class of problems. At first there was no way to be sure what part of the stack some of these were in. They ruled out their software, third-party libraries, and memory before focusing on a CPU and then which core of the CPU. Knowing that sometimes a single core goes bad means you can't randomly assign one test run to a CPU with no core affinity and reliably detect it. Yes, the fix is to disable that core or core complex if you have that kind of resolution, or to ditch the CPU if you must, or to pull the whole server if your scale demands it. But knowing it's a hardware error in the CPU and not chasing it through layers of code every time is invaluable.


You can design resiliency to specific faults. Server goes down, bad code deploy, cosmic rays flip a bit. Each resiliency design looks different.


What is a faulty machine? How do you find one? Some of the failures experienced happened with specific data and workloads assigned to specific cores of specific machines. How do you identify software bugs vs hardware issues? Given a suspect machine how do you figure out if it's actually the machine's fault? Is it the CPU? RAM? Motherboard that's faulty?


The issue at scale is that you don't know if there's an underlying flaw that will manifest all at once across enough machines to cause a real outage or major service disruption.


Somewhat of a deviation from the topic of this article, but I enjoyed this terminology from the Facebook bug analysis:

> After a few iterations, it became obvious that the computation of Int(1.1^53) as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU

The use of the word "obvious" here reminds me very heavily of this particular SMBC:

https://www.smbc-comics.com/comic/how-math-works

Not two paragraphs earlier they're talking about how discovering this required multiple entire teams of engineers. Lots of things are obvious after you've done all the hard work involved in understanding it!


I read it differently. More like after a bunch of debugging it was clear there was one and only one possible problematic instruction (vs any of several).


Oh the intent is pretty obvious (heh) - shorthand for "at this stage no further investigation was necessary to be confident" - but I just find the more broad interpretation amusing.


I never cease to be amazed at how awful we all are with statistics. I struggled in school with that class, and I think that made me more sensitive to this phenomenon.

You can't move anyone to action with conversations about failure percentages or probabilities. The light bulbs only click on when you rephrase things as the interval between incidents. If I tell you that you'll probably be pre-empted today to deal with operational bullshit, people don't like that; stated as a fraction, though, it just isn't compelling to most people.

Be it servers or deployments or CI issues, one negative event per day can be a very small proportion as you scale up. One substantial event per week is an even lower probability. People also don't get that 'all the time' means 'once a week, and three times in a row while I was in a bad mood'. So even if your incidence isn't that bad, statistical clustering will earn you enemies in other departments or among your customers, and your perceived rate of failure ends up much worse than your actual rate.
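
The arithmetic that usually lands, for what it's worth (made-up numbers):

    failure_rate = 0.001      # "99.9% reliable" sounds great as a fraction
    events_per_day = 10_000   # deploys, CI runs, requests, whatever you count

    failures_per_day = failure_rate * events_per_day
    print(failures_per_day)                  # 10.0 -- "ten fires a day"
    print(24 / failures_per_day, "hours")    # 2.4  -- one interruption every ~2.4 hours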


Another in a long line of unexpected causes for system errors, the classic being the cosmic ray:

"When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe.

While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices. It's called a single-event upset or SEU.

During an SEU, particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet."

https://www.computerworld.com/article/3171677/computer-crash...

Those of us who do high-level development generally treat these underlying systems as infallible, but as we continue to scale - and more money and lives are on the line - not only will we need to get used to not trusting the underlying hardware, as the article states, but we may have to get to the point where we have "ECC at the system level." We already have this in various distributed systems tech, but it tends to be application-specific. The next step would be to incorporate it directly into datastores generally.

This also suggests that heterogeneous hardware architectures can have an advantage in situations where data integrity is critical, even with the increased administration, hardware, and ops costs. Finally, it also highlights the importance of data audits and reconciliations for even non-suspect data on a regular basis, preferably with the aforementioned heterogeneous setup.


In a cloud environment, I worry how often there is a Market for Lemons situation going on.

If I have a machine that seems not to be behaving, am I going to spend as much time doing root cause analysis as the people in this article did? Or am I going to allocate a new VM and dump the old one? I would expect over time for the permanently allocated machines to see a smaller density of bad hardware, while the pool of available machines now has a higher than average rate of hardware failures.

This is bad both for newcomers and for anyone who tries to autoscale hardware - because hardware error density now increases with request rates, instead of remaining stable.


I think there's a quote from "Release It!" that says something along the lines of: in our field, so many computations are done as to make astronomically unlikely coincidences happen on a daily basis.

BTW: That entire book is great IMO.


Hardware failures become much more obvious once you hash your data and check the hashes on each access. You start to see that computers, even servers with ECC memory, aren't quite as reliable as you wished.
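
A minimal version of that pattern at the application level (on top of whatever checksumming the storage stack already does):

    import hashlib, json

    def put(store, key, value):
        blob = json.dumps(value).encode()
        store[key] = (blob, hashlib.sha256(blob).hexdigest())

    def get(store, key):
        blob, digest = store[key]
        if hashlib.sha256(blob).hexdigest() != digest:
            # Bad RAM, bad controller, bad copy, bit rot: surface it rather
            # than silently handing back corrupt data.
            raise IOError("checksum mismatch for %r" % key)
        return json.loads(blob)

    store = {}
    put(store, "user:42", {"name": "alice"})
    print(get(store, "user:42"))                                    # {'name': 'alice'}

    blob, digest = store["user:42"]
    store["user:42"] = (blob.replace(b"alice", b"alicf"), digest)   # corrupt it
    get(store, "user:42")                                           # raises IOError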


> even servers with ECC memory, aren't quite as reliable as you wished.

This made me recall a recent post by Linus criticizing Intel[1]. I didn't even know things like this could happen until I read that.

[1] https://www.realworldtech.com/forum/?threadid=198497&curpost...


It always gave me great pause when people (even on HN) decried ECC as unnecessary, often citing Google's first servers as evidence of that.

I operated at a considerably smaller scale than Google and saw enough memory errors to curl my toenails - to the point that I even bought a laptop with a Xeon processor and "ECC RAM", though it was exorbitantly expensive, hard to find, and doesn't have a viable successor.

And, for what it's worth, Google recanted that idea and are operating with ECC now.


It is breathtaking how arrogance by Intel has ruined an entire industry and, I bet, cost people money and ruined lives. We should never have allowed CPU instructions to be protectable.

Similarly, even though ZFS and BTRFS have checksums and data integrity, we continue to push unreliable data storage to consumers with no replication.

We have no checks and balances on this industry, and I don't know how to fix it without bringing down the legal hammer.


Hopefully these efforts from the big players result in more reliable hardware for everyone.


How about reliable software?


I think it goes without saying that you can't build a castle on a beach.

So it goes for reliability too: you need foundations on which to build.


Counterexample: the Internet. Reliable communication over unreliable networks. Any highly-available system with better reliability than any individual component also shows that it's not _impossible_ to build from unreliable components, though it might be difficult.


If you think the internet is reliable then your monitoring is lying to you.

Keeping a persistent connection open to a machine 1 ISP hop away for more than a couple of days is basically impossible, or at the very least improbable.

I ran an always-online game which required an unbroken TCP session; this was the single largest cause of user frustration. I also run an IRC network, which requires the same thing, and you need only look at the number of people ungracefully disconnecting in a day.

Stateless requests help, of course, but I'm pretty sure you and I brush failures away because we've internalised the idea that the internet is not really reliable; whether it be wifi, a faulty cable, DNS, the router, "the site", or so on.


A single TCP socket may not be reliable, but I understood the parent comment to be about the application layer. With enough redundancy, checks, retries and caches, it's possible to conduct business reliably[1] over the internet, tolerating failures from nearly any component involved along the way.

[1] By "reliable" I mean with an uptime satisfactory for the application involved. Nothing is ever 100% but we can get pretty darn close in terms of probabilities.


Yes, that's what I had in mind. I can make an HTTP request and (usually) get a response, even with intermediate routing failures, repeated packets, dropped packets, out-of-order packets, etc.


Not the only reason, to be sure, but this is my go-to explanation for why software engineering doesn't have the same discipline, apprenticeship, professional licensing, S&P, etc. as other engineering disciplines.

The systems we work in are fundamentally unsustainable beyond very limited scales and periods of time compared to other disciplines due to the nature of the medium.

If you had these same features in chemistry or civil engineering you wouldn't have nuclear power or the Brooklyn Bridge.


The "petabyte for a century" problem statement he mentions near the top is fun: can you preserve 1 PB with better-than-even odds of all the bits coming back right in a century? He wrote something about how he saw the durability situation in 2010: https://queue.acm.org/detail.cfm?id=1866298 . He seems to define the problem to allow for maintenance, e.g. check-and-repair operations are discussed.

Broadening a little, I read it as "if you're trying to be more cautious than you usually want for commercial/academic online storage or even backup, what do you do? And would it work?"

A lot (but not all) of the author's Queue piece talks about stats from online storage, which doesn't have some wins you can get if you're entirely about long-term durability.

In online systems, heavy ECC seems out of favor compared to replication for performance reasons, but additional LDPC or RS at the app layer can absorb a substantial % of your volumes having problems, or a ton of random bad blocks. (In a more near-line context, Backblaze uses RS: https://www.backblaze.com/blog/vault-cloud-storage-architect... ) Same for tape -- offline LTO is slow and drives are costly, but the cost/TB and the rated lifespan seem like advantages over HDDs for this specific goal, at least in a narrow engineering sense.
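
For intuition, the simplest possible erasure code is a single XOR parity shard (RAID-4 style; Reed-Solomon and LDPC generalize the idea so you can lose many shards, not just one):

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(shards):
        # Append one parity shard: the XOR of all (equal-length) data shards.
        parity = shards[0]
        for s in shards[1:]:
            parity = xor_bytes(parity, s)
        return shards + [parity]

    def recover(stored, lost_index):
        # XOR of every surviving shard reproduces the lost one.
        survivors = [s for i, s in enumerate(stored) if i != lost_index]
        out = survivors[0]
        for s in survivors[1:]:
            out = xor_bytes(out, s)
        return out

    stored = encode([b"disk", b"full", b"of d", b"ata!"])
    print(recover(stored, lost_index=2))   # b'of d'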

A pile of cryptographic hashes that fits on one device can help you check for and localize errors without sending everything over a network/doing the full ECC dance. If the hash being broken and data tampered with is in your threat model (hey, weird things can happen in a century), you can also hash with a secret nonce you keep separate.
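
The keyed variant is just standard HMAC, with the key kept apart from the archive (a sketch):

    import hashlib, hmac

    SECRET = b"kept somewhere separate from the archive"

    def tag(blob):
        return hmac.new(SECRET, blob, hashlib.sha256).hexdigest()

    def verify(blob, expected):
        # Without the key, someone who can rewrite the data can't also
        # rewrite the tag to match, unlike a plain hash.
        return hmac.compare_digest(tag(blob), expected)

    t = tag(b"archived block 0001")
    print(verify(b"archived block 0001", t))   # True
    print(verify(b"archived block 0002", t))   # False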

Initially loading a PB with a good chance of no mistakes is a thing too, and durability measures don't totally address that. Maybe your multi-site strategy loads up the original on different hardware at different locations with independent software implementations, and you compare those hashes after you do it.

With all that the hundred-years part is still deeply tricky in a couple ways.

"Lasting a hundred years" is just technologically different from "very low error rate at 5 years." Widely used media like LTO tape seem to max out at a 30y rating, and exotic archive media like the "hardened film" at GitHub's code vault has the big disadvantage of no ecosystem to read it. So seems like you really want refreshes of some sort at intervals, and being sure a task will be done decades from now is hard (assuming high-tech civilization is around and all that--some things just have to be outside the scope of the problem for it to be meaningful).

From that angle, maybe having an online copy of the data is a better investment than the pure engineering perspective would suggest: if other folks can grab a copy of the archive it has a better chance of outliving your organization.

Two, a lot of unknown unknowns crop up at that timescale. A couple decades back we didn't have the experience at scale we do now, and CEEs and other causes of SDC were less on anyone's radar. We could discover something else significant after it's too late to fix. The world can also change in ways that disrupt the durability picture substantially (changing laws or disaster risks, say), short of the types of change that make the whole problem meaningless.

Anyway, fun question, and if you find it fun too, you might like https://www.youtube.com/watch?v=eNliOm9NtCM , a talk on backups from someone at Google that talks about the tape restore after the big GMail glitch and various dimensions of resilience. And I'm sure there are storage papers and Long Now-ish stuff I'm not plugged into about things like this, wouldn't mind hearing about it.



