
The network is much, much more reliable than it used to be. In a typical large-scale datacenter today, it is common to deploy Clos networks with multiple ECMP paths between any pair of servers. Also, the cost of the network relative to the cost of the servers is quite low (less than 8-10% of overall cost). And fat connectivity between datacenters in a metro area is much cheaper than ever before, thanks to huge increases in fibre capacity over the last 10 years. If anything, the network has become cheaper faster than compute has over that period (and storage has done the same at an even higher rate). This affects architectural decisions in very interesting ways. Of course, public clouds make a huge profit on network bandwidth by billing customers based on old mental models.
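In case it helps to picture the ECMP part: switches typically hash each packet's flow 5-tuple and use the result to pick one of the equal-cost paths, so a single flow stays on one path (no reordering) while distinct flows spread across the fabric. A toy sketch, with made-up names and key format (real switches do this in hardware with vendor-specific hash functions):

```python
import hashlib

def ecmp_pick_path(src_ip, src_port, dst_ip, dst_port, proto, n_paths):
    """Hash the flow 5-tuple and pick one of n equal-cost paths.

    All packets of one flow land on the same path, while different
    flows are spread across the available paths.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths

# Two flows that differ only in source port may take different paths...
a = ecmp_pick_path("10.0.1.5", 40001, "10.0.2.9", 443, "tcp", 8)
b = ecmp_pick_path("10.0.1.5", 40002, "10.0.2.9", 443, "tcp", 8)
# ...but the same flow always hashes to the same path.
assert a == ecmp_pick_path("10.0.1.5", 40001, "10.0.2.9", 443, "tcp", 8)
```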



> Network is much much more reliable than it used to be.

But won't anyone think of the bit flips?

> We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

> We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
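For anyone curious what that missing protection looks like: the fix is essentially to checksum the internal state messages too, and reject any frame whose digest doesn't match. A toy sketch (MD5 as in the postmortem; the framing and function names are mine, not Amazon's):

```python
import hashlib

def send_message(payload: bytes) -> bytes:
    # Prepend an MD5 digest so the receiver can detect in-flight corruption.
    return hashlib.md5(payload).digest() + payload

def receive_message(frame: bytes) -> bytes:
    digest, payload = frame[:16], frame[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("checksum mismatch: corrupted message, drop it")
    return payload

frame = send_message(b"server 42 state: healthy")
assert receive_message(frame) == b"server 42 state: healthy"

# Flip a single bit in the payload: the message is still "intelligible"
# bytes, but the checksum catches it instead of poisoning system state.
corrupted = bytearray(frame)
corrupted[20] ^= 0x01
try:
    receive_message(bytes(corrupted))
except ValueError:
    pass  # corruption detected and rejected
```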

https://web.archive.org/web/20150726045623/http://status.aws...

See also: At scale, rare events aren't rare, https://news.ycombinator.com/item?id=14038044 (2017).


It blows my mind a tiny bit that ordinary cloud VMs are running on hosts with 100 Gbps NICs.

Even “medium” VMs can get gigabytes per second of throughput.
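The arithmetic backs that up: 100 Gbps is 12.5 GB/s of raw line rate, so even a modest slice of the host NIC gives a VM multiple GB/s (the 16 Gbps slice below is an assumed figure for illustration, not any provider's actual limit):

```python
# Raw line rate of the host NIC, before framing/encapsulation overhead.
line_rate_gbps = 100
bytes_per_sec = line_rate_gbps * 1e9 / 8
print(bytes_per_sec / 1e9)  # 12.5 (GB/s)

# A "medium" VM given, say, a 16 Gbps slice still sees 2 GB/s.
vm_slice_gbps = 16
print(vm_slice_gbps * 1e9 / 8 / 1e9)  # 2.0 (GB/s)
```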


> Network is much much more reliable than it used to be

Within DCs, sure. At the same time, with WFH in recent years, a lot more company networks have become spread thin over 4G/5G and finicky cable ISPs.

This put quite a strain on internal LOB apps that could previously assume zero packet loss and sub-1ms LAN latency :)


At the same time, IoT, mobile, and roaming make for unreliability and constant topology reshuffling. Granted, that's a different environment, and for various reasons properly distributed solutions aren't widespread on these networks so far. Just saying there's another side to it.



