Fallacies of distributed systems (architecturenotes.co)
224 points by googletron on June 4, 2022 | 60 comments



These fallacies also lead to one of my favorite programming quotes:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport


You'd think a well-designed distributed system would be able to handle the failure of a component and fall back into a different configuration to handle the problem. That's the fundamental feature of Internet routing systems, i.e. they expect components to occasionally fail. Building a distributed system that can't handle such failures just seems like sloppy engineering.


The joke is that no matter how many redundancies you add, the complexity is so far beyond a human's ability to map at once that you always miss a single point of failure.

Each failure scenario must be gracefully handled in each subpart, taking into account all possible impacts on the whole system.

You'll always get some random crap you never heard of destroying an assumption you made in your own crap.

I remember an instance where a national mobile phone provider in France had a single DB outage, but during a high-volume SMS day (think New Year), so the failover happened but totally overloaded the new servers, and it created a chain reaction to other providers to the point that you couldn't read your email for a day or two because of the inability to get SMSes. Why is a random DB in a provider I don't even use impacting my Google account? Because the system is distributed.


In our distributed systems lecture, my professor started the introductory lesson with some descriptions of 'what is a distributed system.' The one by Lamport was by far the most quoted in exams, as it is the most memorable, because it is funny.

After watching Lamport's introduction to TLA+, I got the impression he's just a really funny guy (while being a genius, too).


Network is much much more reliable than it used to be. In a typical large-scale datacenter today, it is becoming common to deploy Clos networks with multiple ECMP paths to get from one server to another. Also, the cost of the network relative to the cost of the servers is quite low (less than 8-10% of overall cost). Also, fat connectivity between datacenters in a metro area is much, much cheaper than ever before (due to huge increases in fibre capacity in the last 10 years). If anything, the network has become cheaper faster than compute has in the last 10 years. (And storage has done the same at an even higher rate.) So this affects architectural decisions in very interesting ways. Of course, public clouds make a huge profit on network bandwidth by billing customers based on old mental models.


> Network is much much more reliable than it used to be.

But would anyone think of the bit flips?

> We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

> We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

https://web.archive.org/web/20150726045623/http://status.aws...

See also: At scale, rare events aren't rare, https://news.ycombinator.com/item?id=14038044 (2017).
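A minimal sketch of the kind of guard that postmortem says was missing on internal messages: checksum the payload at the sender and refuse to apply state from a message whose digest doesn't match. (The message format here is purely illustrative, not the actual S3 protocol.)

  import hashlib
  import json

  def wrap(payload: dict) -> bytes:
      # Serialize deterministically, then append a digest of the body.
      body = json.dumps(payload, sort_keys=True).encode()
      digest = hashlib.md5(body).hexdigest()
      return json.dumps({"body": body.decode(), "md5": digest}).encode()

  def unwrap(raw: bytes) -> dict:
      msg = json.loads(raw)
      body = msg["body"].encode()
      if hashlib.md5(body).hexdigest() != msg["md5"]:
          # A flipped bit anywhere in the body changes the digest,
          # so corrupted gossip is dropped instead of spreading.
          raise ValueError("corrupt internal message, discarding")
      return json.loads(body)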


It blows my mind a tiny bit that ordinary cloud VMs are running on hosts with 100 Gbps NICs.

Even “medium” VMs can get gigabytes per second of throughput.


> Network is much much more reliable than it used to be

Within DCs, sure. At the same time, a lot more company networks have become spread thin over 4G/5G networks and finicky cable ISPs in recent years with WFH.

This caused quite a strain on internal LOB apps which could reasonably assume zero packet loss and sub-1ms LAN latency beforehand :)


At the same time, IoT, mobile and roaming make for unreliability and constant topology reshuffling. Granted, it's a different environment, and for various reasons properly distributed solutions on these networks aren't widespread so far. Just saying there's another side to it.


Thanks so much for sharing this article. It reminded me of a fallacy we ran into when we first built TiDB Cloud (a distributed database service): transport cost is zero.

We deploy TiDB in three availability zones (AZs) in one region on AWS. Each AZ contains a SQL computing server and a columnar storage replica server. Sometimes the computing server will send a request to the replica server in another region and get the data result; if the data acquisition volume is high, this costs us a lot of money. Unfortunately, we only found out about this when we received the bill from AWS. After that, we started optimizing the size of the transferred data :sob:


Thanks for sharing that experience. What did you end up doing to lower your costs?


I have to say, this problem has not been fully solved, but we've still been doing a few things:

- We used gRPC with no compression before; now we tune an acceptable compression level to achieve a balance of performance and data volume (a rough sketch of what this looks like follows below).

- We improved our SQL optimizer to account for network cost more accurately, and our data placement rules to balance data more sensibly, to reduce data transfer crossing AZs.

- As a cloud service, we will also expose the cost transparently to our customers like MongoDB Atlas does. :-)
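For the compression point above, a minimal sketch of channel-level compression tuning, assuming a recent grpcio; the stubs and service names are placeholders, not TiDB's actual API:

  import grpc

  # Hypothetical generated stubs; names are illustrative only.
  # from replica_pb2 import ScanRequest
  # from replica_pb2_grpc import ReplicaStub

  def open_channel(target: str) -> grpc.Channel:
      # Enable gzip at the channel level so large cross-AZ responses
      # are compressed before they hit the (billed) network.
      return grpc.insecure_channel(
          target,
          compression=grpc.Compression.Gzip,
      )

  # A per-call override is also possible when a payload is small or
  # already compressed, to avoid wasting CPU, e.g.:
  # stub.Scan(ScanRequest(...), compression=grpc.Compression.NoCompression)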


The only way to fix this is to build a new cloud that has lower networking costs and force the existing competitors to lower their prices.


There are "clouds" with lower networking costs, like Digitalocean/linode. But they are not "capable" enough to support a sizable company. Actually, networking is like a "tax" on top of the whole platform.

One of the real problems is that cloud providers can just waive cross-AZ traffic for their own managed services but charge a lot for cloud customers.


> One of the real problems is that cloud providers can just waive cross-AZ traffic for their own managed services but charge a lot for cloud customers.

That is why, no matter how smart your software is, they'll always have the upper hand and can build things better themselves (which is why everybody is using the BSL license).


There is one administrator. I always found that fallacy the hardest to realize.

It’s like owning the entire supply chain. I remember this bit from someone who used to work at Heroku: it's impossible for us to switch, we spent so much time making the system work with AWS that switching or adding another provider would be akin to starting over.


What can everyone do before that "new cloud" happens? The approach the post above describes, where their cloud database tries different things on the technology side, seems like the right and practical way to pursue, though.


> What can everyone do before that "new cloud" happens?

Suffer, and try to vote with your wallet (for example, Oracle Cloud has lower bandwidth pricing).


Emm... moving and rebuilding whole systems onto another cloud is another wallet-ache, but yes, things may change if everyone votes with their wallet. The reality is that big players like Snowflake or Databricks have more bargaining power than smaller players in the market, and as long as those big players stay with these providers, it may not change the game.


Cross region or cross AZ?


Crossing-AZ.

We didn't pay attention to the cost of crossing AZs before; we just thought it would be cheap. Of course, we were wrong :sob:


Some Conway’s Law hooey at my current job means I don’t even have that information. I’m doing work to reduce cluster size, and maybe we’d be better off if I worked on cross-AZ chatter instead. No idea. Suspicions, yes. Dark, cynical suspicions, but no idea.


A few other ones that come up regularly:

- Servers/applications never restart

- Startup order is fully deterministic

- All instances run the same software revision (maybe somewhat covered by fallacy 8)

- Hardware is reliable

- All clients are playing nicely along (and I'm not even talking about attacks)

- Logging and metrics are cheap

- There are no software bugs in any layer


> Startup order is fully deterministic

It still is, for a minority of systems.

On single OSes, Google runs an init that is deterministic, as do the BSDs and some distributions of Linux.

For distributed systems, your ops team controls for this, usually at the service discovery layer (don’t publish until ready; don’t start a service until you can establish a connection to required endpoints).
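A minimal sketch of that "don't publish until ready" pattern: block service registration until the required endpoint actually accepts connections. The host/port and the registration call are placeholders.

  import socket
  import time

  def wait_for_dependency(host: str, port: int, timeout_s: float = 60.0) -> None:
      # Poll until the dependency accepts TCP connections, or give up.
      deadline = time.monotonic() + timeout_s
      while True:
          try:
              with socket.create_connection((host, port), timeout=2):
                  return
          except OSError:
              if time.monotonic() > deadline:
                  raise TimeoutError(f"{host}:{port} never became reachable")
              time.sleep(1)

  # wait_for_dependency("db.internal", 5432)   # placeholder endpoint
  # register_with_service_discovery()          # hypothetical call; only after deps are up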


Many people think it is - but the reality is that it depends a lot on the system one is working on, can be super tricky, and often relying on it will cause some pain points later on.

e.g. let's take the example of an init system starting up all processes. Now what happens if one of the processes crashes and gets restarted by a process manager? Now the order has already changed, and e.g. a former process which relied on the restarted one might be working from outdated data. Similar things can also happen at other layers - e.g. one of the services in a dependency chain might disappear and reappear.

Another example is a developer/administrator manually changing the config of a certain service and restarting it for the change to take effect - that could also trigger dependency problems.

Now those are absolutely solvable - either by making sure all services operate gracefully with any startup order or by other mitigations (e.g. "always reboot the full box"). But like everything else in the list, it is still a problem that is observed very often in distributed systems.


These are fallacies of distributed systems, and most of them are the reasons you want a distributed system in the first place. You build distributed systems because you know:

- Servers restart, so you replicate and load balance

- Dependencies matter, so you build circuit breakers and health checks (see the sketch below)

- Infra is heterogeneous, which is why you use containers

- Hardware is unreliable - again, replicate, load balance, and HA

- Provide APIs for your clients to communicate with your service

- Logging and metrics are expensive, so collect what you need - prom helps with this, logging is not fully solved

- Do people really think there are no bugs?
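For the "dependencies matter" point, a bare-bones circuit breaker sketch; the thresholds and names are arbitrary, and real implementations add half-open probing, metrics, and per-endpoint state:

  import time

  class CircuitBreaker:
      def __init__(self, max_failures=5, reset_after_s=30.0):
          self.max_failures = max_failures
          self.reset_after_s = reset_after_s
          self.failures = 0
          self.opened_at = None

      def call(self, fn, *args, **kwargs):
          # While open, fail fast instead of piling requests onto a sick dependency.
          if self.opened_at is not None:
              if time.monotonic() - self.opened_at < self.reset_after_s:
                  raise RuntimeError("circuit open, failing fast")
              self.opened_at = None  # cool-down elapsed, try again
              self.failures = 0
          try:
              result = fn(*args, **kwargs)
          except Exception:
              self.failures += 1
              if self.failures >= self.max_failures:
                  self.opened_at = time.monotonic()
              raise
          self.failures = 0
          return result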


“The network is reliable” is also the title of a relatively famous collaboration between Kyle Kingsbury and Peter Bailis which generated a long list of counter-examples.

* https://cacm.acm.org/magazines/2014/9/177925-the-network-is-...

Discussions here:

* https://news.ycombinator.com/item?id=5820245

* https://news.ycombinator.com/item?id=8162720


The original article seems to be from 1994.

https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_com...



This is like a meme in software development reposts and I hate it, because it just says "things can go wrong with component X of your system". It doesn't say when; it doesn't say what the threshold is. There are no equations, no direction for someone who hasn't dealt with one of these problems before. It's less than useless.


And this is why, before "fixing" the monolith by placing a network in the middle, one should start by learning about modules, packages and libraries in the language of choice.


YES! It is a shame one feels one has to fix organizational and coordination issues by introducing network barriers incidental to the real problems. I cannot wait for microservices to become unfashionable for cases where they are not solving technical scalability issues.

Software engineering is not a mature field, but microservices feel like a dead end everyone is marching down.


Relevant discussion on Modules, Monoliths, and Microservices, https://news.ycombinator.com/item?id=26247052 (2021).


Any time you interact with a third party API you're creating a distributed system. The principles of distributed systems are useful for many more things than "microservices".


An API isn't a network call, whereas a web API is a distributed service by definition, since it involves a network request.



Short little post about Fallacies of Distributed Systems. These are some thoughts I feel often get downplayed when engineers are designing microservices.


Antidotes exist too. I particularly like going back to these ones from AWS, from back in the day when distributed systems weren't as popular:

- Amazon S3 design requirements and principles (2006), https://archive.is/EP6HU#selection-1205.0-1425.66

- Amazon S3: Architecting for resiliency (2009), http://web.archive.org/web/20100719163331/https://qconsf.com...


I really don't think anybody believes the network is reliable; it's mostly that you have no freaking clue how the software layers below you will react to various conditions, and the way an error will be visible to your app will be completely random.

"oh, today the underlying layer decided to wait 1min30 for a timeout",

"oh, the DNS resolution is broken",

"no route to host???",

"invalid certificate, did somehone hijack the DNS query?"

"this library decided to wrap the errors, now I have learn an entire new vocabulary of failure modes"


Don't underestimate how performant a single thread is! Some problems are embarrassingly parallel.

Of course coordinating threads is difficult.

I am designing an LMAX disruptor system that translates a synchronous piece of code into nested LMAX disruptors.

Essentially you can do high IO and high CPU and not block other requests as they're pipelined.

https://github.com/samsquire/ideas4#51-rewrite-synchronous-c...

Don't block the UI thread or the Node.js event loop. We don't have to pick between tokio and rayon. We can have multiple event loops.
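Not the LMAX disruptor itself, but a minimal sketch of the same pipelining idea using plain queues: each stage runs its own loop, so a slow stage doesn't block requests still in earlier stages. The stage functions here are placeholders.

  import queue
  import threading

  def stage(inbox: queue.Queue, outbox: queue.Queue, work) -> None:
      # Each stage is its own event loop: pull, process, hand off.
      while True:
          item = inbox.get()
          if item is None:          # poison pill shuts the stage down
              outbox.put(None)
              return
          outbox.put(work(item))

  q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
  threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2), daemon=True).start()  # "CPU" stage
  threading.Thread(target=stage, args=(q2, q3, lambda x: x + 1), daemon=True).start()  # "IO" stage

  for i in range(3):
      q1.put(i)
  q1.put(None)
  while (out := q3.get()) is not None:
      print(out)  # 1, 3, 5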


> Pictured on the right is the time to access memory in a modern system, on the left the time it takes to do a round trip across the world.

These seem swapped.


Even if they're swapped, the maths doesn't work out - the ratio of squares is 1:42 while the ratio of latencies is 1:1500.


You're off by a factor of 1000: 150 ms / 100 ns = 150e-3 / 100e-9 = 1.5 million. So it's even worse!


I am aware, I just didn’t want to correct the person correcting the inaccuracies. I suppose I could have given the red squares a unit value in the illustration.


Definitely true! Good eye. Drawing out 1500 squares is tough.


Thanks for pointing this out I will fix it.


related: I love the illustration of the fallacies by Denise Yu. It has cats!

http://deniseyu.io/art/sketchnotes/topic-based/8-fallacies.p...


That’s really awesome!


This is a good summary of many of the most annoying problems of distributed systems.


Thanks. I would say quorum issues with flaky nodes are up there too.


Failing/misbehaving nodes are the worst. Everything from slow performance to data corruption.

Much better to lose a node immediately than to have it fail gradually.


Hard agree.


And a higher-level issue: for some types of workloads, a distributed system might never achieve the same level of performance and/or will cost orders of magnitude more than a vanilla system, making it impractical.


I’d love it if someone created fallacies of big data technologies, the most important being: if you’re using Spark, unless you’re really, really good at Spark, you’re not actually working with big data.


Can you expand on what you mean?


My experience has been that Spark doesn’t actually work with real big data without significant babysitting. Its algorithms, be it window functions or joins, can take forever or never finish on actual data that’s hundreds of terabytes large (even tens of TB). You immediately need to worry about crap like garbage collectors, worker memory, caching and data distribution. The majority of data engineers out there cannot actually deal with these problems, but Spark plus not-actually-big data lets them think they’re good at their jobs when in reality they’re not.
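A sketch of the kind of babysitting being described, in PySpark: the defaults rarely survive contact with tens of terabytes, so you end up hand-tuning shuffle partitions and join strategy. Table names, paths, and sizes here are made up.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("not-actually-big-data").getOrCreate()

  # Defaults like 200 shuffle partitions fall over at tens of TB;
  # this knob (plus executor memory, GC settings, ...) becomes your job.
  spark.conf.set("spark.sql.shuffle.partitions", "4000")

  events = spark.read.parquet("s3://bucket/events/")  # hypothetical, "huge" table
  users = spark.read.parquet("s3://bucket/users/")    # hypothetical, small table

  # Broadcasting the small side avoids shuffling the huge one at all.
  joined = events.join(broadcast(users), "user_id")
  joined.write.mode("overwrite").parquet("s3://bucket/joined/")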


This is usually the product of not having to worry about costs, combined with elastic platforms. Bad or inefficient methodology will still work but cost more, which may result in surprisingly slow feedback loops before their work comes back to bite them.


Hey, there’s a typo in the article. The image of memory access vs round trip says that the round trip is 100 ns and memory access is 150 ms; I guess you meant the other way around :P


Thanks. I believe it has been fixed now.


The illustrations look really cool. Wondering what software/tool is being used?



