
The network is much, much more reliable than it used to be. In a typical large-scale datacenter today, it is common to deploy Clos networks with multiple ECMP paths between any pair of servers. Also, the cost of the network relative to the cost of the servers is quite low (less than 8-10% of overall cost). And fat connectivity between datacenters in a metro area is much cheaper than ever before, thanks to huge increases in fibre capacity over the last 10 years. If anything, the network has become cheaper faster than compute has over that period (and storage has done the same at an even higher rate). This affects architectural decisions in very interesting ways. Of course, public clouds make a huge profit on network bandwidth by billing customers based on old mental models.
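In case it helps to picture the ECMP part: switches typically hash each packet's flow 5-tuple and use the result to pick one of the equal-cost paths, so a single flow stays on one path (no reordering) while distinct flows spread across the fabric. A toy sketch, with made-up names and key format (real switches do this in hardware with vendor-specific hash functions):

```python
import hashlib

def ecmp_pick_path(src_ip, src_port, dst_ip, dst_port, proto, n_paths):
    """Hash the flow 5-tuple and pick one of n equal-cost paths.

    All packets of one flow land on the same path, while different
    flows are spread across the available paths.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths

# Two flows that differ only in source port may take different paths...
a = ecmp_pick_path("10.0.1.5", 40001, "10.0.2.9", 443, "tcp", 8)
b = ecmp_pick_path("10.0.1.5", 40002, "10.0.2.9", 443, "tcp", 8)
# ...but the same flow always hashes to the same path.
assert a == ecmp_pick_path("10.0.1.5", 40001, "10.0.2.9", 443, "tcp", 8)
```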



> Network is much much more reliable than it used to be.

But won't anyone think of the bit flips?

> We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

> We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
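For anyone curious what that missing protection looks like: the fix is essentially to checksum the internal state messages too, and reject any frame whose digest doesn't match. A toy sketch (MD5 as in the postmortem; the framing and function names are mine, not Amazon's):

```python
import hashlib

def send_message(payload: bytes) -> bytes:
    # Prepend an MD5 digest so the receiver can detect in-flight corruption.
    return hashlib.md5(payload).digest() + payload

def receive_message(frame: bytes) -> bytes:
    digest, payload = frame[:16], frame[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("checksum mismatch: corrupted message, drop it")
    return payload

frame = send_message(b"server 42 state: healthy")
assert receive_message(frame) == b"server 42 state: healthy"

# Flip a single bit in the payload: the message is still "intelligible"
# bytes, but the checksum catches it instead of poisoning system state.
corrupted = bytearray(frame)
corrupted[20] ^= 0x01
try:
    receive_message(bytes(corrupted))
except ValueError:
    pass  # corruption detected and rejected
```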

https://web.archive.org/web/20150726045623/http://status.aws...

See also: At scale, rare events aren't rare, https://news.ycombinator.com/item?id=14038044 (2017).


It blows my mind a tiny bit that ordinary cloud VMs are running on hosts with 100 Gbps NICs.

Even “medium” VMs can get gigabytes per second of throughput.
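The arithmetic backs that up: 100 Gbps is 12.5 GB/s of raw line rate, so even a modest slice of the host NIC gives a VM multiple GB/s (the 16 Gbps slice below is an assumed figure for illustration, not any provider's actual limit):

```python
# Raw line rate of the host NIC, before framing/encapsulation overhead.
line_rate_gbps = 100
bytes_per_sec = line_rate_gbps * 1e9 / 8
print(bytes_per_sec / 1e9)  # 12.5 (GB/s)

# A "medium" VM given, say, a 16 Gbps slice still sees 2 GB/s.
vm_slice_gbps = 16
print(vm_slice_gbps * 1e9 / 8 / 1e9)  # 2.0 (GB/s)
```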


> Network is much much more reliable than it used to be

Within DCs, sure. At the same time, with WFH in recent years, a lot more company networks have become spread thin over 4G/5G and finicky cable ISPs.

This put quite a strain on internal LOB apps that could previously assume zero packet loss and sub-1ms LAN latency :)


At the same time, IoT, mobile, and roaming make for unreliability and constant topology reshuffling. Granted, that's a different environment, and for various reasons properly distributed solutions aren't widespread on these networks so far. Just saying there's another side to it.



