Patterns of Distributed Systems (2022)

withinboredom · on June 27, 2023

> The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized.

Sentences like this will make me never regret to moving my infrastructure to bare-metal. My clocks are synchronized down to several nano-seconds, with leap-second skew and all kinds of shiny things. It literally took a day to set up and a blessing from an ISP in the same datacenter to use their clock sources (GPS + PTP). All the other servers are synchronized to that one via Chrony.

LennyWhiteJr · on June 28, 2023

Even if are "down to several nano-seconds", a slight clock drift can be the different between corrupt data or not, and when running at scale, it's only a matter of time before you start running into race conditions.

For a small web app, fine, but if you're running enterprise level software processing billions of DB transactions per day, clocks just don't cut it.

withinboredom · on June 28, 2023

That’s why you buy NICs with hardware timestamp support and enable PTP. You can detect clock drift within a few packets.

Race conditions are mitigated, not by clocks, but by other logics. The clock was just something done after frustrations in reading distributed logs and seeing them out of order. Logs are basically never out of order any more and there is sanity.

antonvs · on June 27, 2023

It’s not just about bare metal or not. The sort of distributed systems these patterns apply to are not small local clusters in a single datacenter.

codemac · on June 27, 2023

And the really big ones do much more epic clock management, and keep things wildly in sync across the globe. See the recent work with PTP that fb and others are trying to do openly. Google and others have their own internal implementations.

withinboredom · on June 27, 2023

In the cloud, you have very little control over the clocks. If you have baremetal, there's almost always the ability to configure (at least) GPS time sync for a server or two. If you get to the point where you have entire datacenters, there's no excuse NOT to invest in getting good clocks -- and is likely a requirement.

bastawhiz · on June 27, 2023

"Not bare metal" does not imply "the cloud". You might be part of a company that doesn't give teams access to raw servers, you might be paying for VMs with dedicated resources from a local data center, or any number of other situations.

Not everyone has the luxury of being able to procure and install hardware and/or run an antenna to someplace with gpc reception.

withinboredom · on June 27, 2023

I’d call that a “private cloud” and I think that’s the actual term for it. You can run a private cloud on bare metal, and if you are, then you can (as in physically able to) have control over things. If you have network cards with timestamp support, you can use PTP, or any number of things. If your org doesn’t support that, that doesn’t mean it isn’t a possibility, it just means you need to find someone else to ask.

preseinger · on June 28, 2023

bare metal doesn't solve this problem, clock synchronization works until it doesn't

ntp can fail, chrony can fail, system clocks can always drift undetectably

you can treat the system clock as an optimistic guess at the time, but it's never a reliable way to order anything across different machines

withinboredom · on June 28, 2023

That’s why you buy NICs with hardware timestamp support and enable PTP.

preseinger · on June 28, 2023

this doesn't magically fix the problem

node clocks are unreliable by definition, it's a fundamental invariant of distributed systems

withinboredom · on June 28, 2023

No, but you can detect skew in just a few packets and decide if you want to drain the node. If a node continues to have issues, put that thing on eBay and get another. Or, send back the motherboard to the manufacturer if it’s new enough.

Node clocks can be plenty reliable, but like any other hardware, sometimes they get defects.

preseinger · on June 28, 2023

node A is connected to nodes B, C, D, E, F

the A->B link is under DDoS or whatever and delivers packets with 10s latency

the A->C link is faulty and has 50% packet loss

the A->{D,E,F} links are perfectly healthy

node B has one view of A's skew which is pretty bad, node C has a different view which is also pretty bad for different reasons, and nodes D E and F have a totally different view which is basically perfect

you literally cannot "detect skew" in a way that's reliable and actionable

issues are not a function of the node, they're a function of everything between the node and the observer, and are different for each observer

even if clocks were perfectly accurate, there is no such thing as a single consistent time across a distributed system. two events arriving at two nodes at precisely the same moment require some amount of time to be communicated to other nodes in the system, that time is a function of the speed of light, the "light cone" defines a physical limit to the propagation of information

withinboredom · on June 28, 2023

I think you’re missing the forest for the trees a bit. In the code, order mostly doesn’t matter and where it does matter there is a monotonic clock decoupled from physical time, using an epoch framework (https://tli2.github.io/assets/pdf/epochs.pdf).

The clock sync is just to keep human-readable logs in order for debugging. It’s ok if it is sometimes out of order, though in practice, it never is.

preseinger · on June 28, 2023

i'm not sure we're talking about the same thing

flaminHotSpeedo · on June 28, 2023

It's easy to say you've solved distributed computing problems when you're not actually doing distributed computing

forkbomb123 · on June 27, 2023

The reason services like AWS or Azure are distributed is so that your resources are not concentrated which helps with fault tolerance. If your datacenter goes down, your whole service goes down. Also as was mentioned in other comments, asynchrony also applies in the same datacenter.

withinboredom · on June 27, 2023

That’s what insurance is for. Far less expensive than maintaining fault tolerance at the current scale. If there were a fire or something (like Google’s recent explosion in France from an overflowing toilet), we’d lose at least several minutes of data, and be able to boot up in a cloud with degraded capabilities within ten minutes or so. Not too worried about it.

plandis · on June 28, 2023

Not sure what you’re doing but it sounds like a great example for your competitors to point to and tell customers that they can avoid these issues.

withinboredom · on June 28, 2023

It’s a project for fun (a hobby), and given away for free. 10 minutes of downtime every 20-30 years is perfectly acceptable to me.

blackoil · on June 28, 2023

You can do same in the AWS by using TimeSync and limit yourself to one AZ.

slt2021 · on June 27, 2023

I never understood why "The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized." is considered True even with working NTP synchronization?

withinboredom · on June 27, 2023

A few milliseconds difference can mean all the difference in the world at high enough throughput (which is about the best you can get with NTP). When you can control the networking cards and time sources, you can get it within a few nanoseconds across an entire datacenter, with monitoring to drain the node if clock skew gets too high.

slt2021 · on June 28, 2023

And why applications are so sensitive for such small difference in time?

Seems like poor engineering practice.

preseinger · on June 28, 2023

usual problem is when you try to model logical causality (a before b) with physical time (a.timestamp < b.timestamp)

logical causality does not represent poor engineering practice :)

slt2021 · on June 28, 2023

this only applies if you carry over physical time from one machine to another, assuming perfect physical synchronization of time.

if you stick to a single source of truth - only one machine's time is used as a source of truth - then the problem disappears.

for example instead of using java/your-language's time() function (which could be out of sync across different app nodes) just use database's internal CURRENT_TIMESTAMP() when writing to db.

another alternative is compare timestamps with up to 1 minute/hour precision, if you carry over time from one machine to another. That way you have a a buffer of time for different machines to synchronize clocks over NTP

preseinger · on July 4, 2023

if you can delegate synchronization/ordering to the monotonic clock of a single machine, then you should definitely do that :)

but that's a sort of trivial base case -- the interesting bit is when you can't make that kind of simplifying assumption

plandis · on June 28, 2023

You can rephrase this question in terms of causality. Why does it matter that we know if some process happens before some other process at some defined(?) level of precision.

There are ways around this but they are restrictive or come at the cost of increased latency. Sometimes those are acceptable trade offs and sometimes they are not.

slt2021 · on June 28, 2023

root of problem is using clocks from different hosts (which could be out of sync), and carrying over that time from one machine to another - essentially assuming clocks across different machines are perfectly synchronized 100% of time.

if you use a single source of truth for clocks (simplest example is use RDBMS's current_timestamp() instead of your programming language's time() function), and the problem disappears

justsomehnguy · on June 28, 2023

Imagine you have an account holding $200.

Now two operations come, one adding $300, other one withdrawing $400. What the result would be, depending on thd order of operations?

shsbdncudx · on June 28, 2023

It is, agree. Imho in most cases the right answer is to build it to not require that kind of clock synchronisation

dikei · on June 28, 2023

You can build system that do not require physical clock synchronization, but using physical clock often lead to simpler code and major performance advantage.

That's why Google built True Time, which provides physical time guarantee of [min_real_timestamp, max_real_timestamp] for each timestamp instant. You can easily know the ordering of 2 events by comparing the bounds of their timestamps as long as the bounds do not overlap. In order to achieve that, Google try to keep the bound as small as possible, using the most accurate clocks they can find: atomic and GPS clocks.

plandis · on June 28, 2023

Yes, that is essentially the point of logical clocks :)

Solvency · on June 28, 2023

Can someone explain why, in an interview context, someone with the technical ability to understand and assess and write and communicate a deep set of domain-specific knowledge like this....might still be asked to do some in-person leetcode tests? How does on the fly recursive algo regurgitation sometimes mean more than being able to demonstrate such depth of knowledge?

galkk · on June 28, 2023

Because world is full of bullshiters who know enough buzzwords and buzzsentences and interview time is too short.

On lower levels leetcode style stuff gives more quantifiable signals per minute/session.

Plus, do you really want to fully explain something like paxos or raft in interview context?

My personal pet peeves are

1. “event sourcing” and derivatives. It really attracts people who love to talk but never built anything large using it

2. Adepts of uncle martin

flaminHotSpeedo · on June 28, 2023

If you can't differentiate buzzwords from signals you shouldn't be interviewing.

Plus how do you expect to get good signals from a leetcode whiteboard interview for someone who spends most of their time designing systems and only writes code when it gets to the point where it's faster and less frustrating to pair program vs explaining what needs to get implemented?

To clarify, I don't have a good answer: I still participate in leetcode style interviews (though system design is another component) - but although I sing the song and can't come up with anything better, I don't think it's the best way to go

mrkeen · on June 28, 2023

Raft is for event sourcing.

austin-cheney · on June 28, 2023

Or what about having a repository proving these very concepts with almost 2000 commits?

In my experience things like publications, online code repositories, and facts are more than irrelevant but not much more because people don’t know how to independently evaluate these things. Worse, attempting to evaluate such only exposes the insecurity of people not qualified to be there in the first place.

Far more important are numbers of NPM downloads and GitHub stars. Popularity is an external validation any idiot can understand. But popularity is also something that you can bullshit, so just play it safe treat everyone as a junior developer and leet code them out of consideration.

jstx1 · on June 28, 2023

It's easy to fake projects and commits. And when they aren't faked, you can't have an expectation of every candidate having these because it's so much work. And if you decide to bypass some of your usual interview process because a candidate has a project, now you aren't assessing all candidates equally.

austin-cheney · on June 28, 2023

All I see from your comment is that candidate evaluation requires too much effort, so you exchange one bias for another.

yencabulator · on June 28, 2023

I've interviewed people for roles as low-level C++/kernel programmers who did not know what hexadecimal was. Having a quick "What's 0x2A in decimal? Feel free to use paper & pen."[1] question can be a significant time-saver / direct people to more appropriate roles, if any.

[1] Starting to do math with non-base-10 numbers was already a pass, regardless of the number you reach, you'd normally use a computer for that. But it really isn't too hard to do in your head directly, for anyone who's dealt with binary data.

austin-cheney · on June 28, 2023

I am not sure how that has anything to do with what I posted.

jstx1 · on June 28, 2023

It's a choice to treat all candidates equally.

And intentionally excluding candidates is the whole point of designing an interview process, it's lazy to lift your arms up and declare that they're all biased.

jstx1 · on June 28, 2023

It's not one or ther other; many interviews assess both because they're both meaningful signals about a candidate.

If you're coming from the position that one is

"understand and assess and write and communicate a deep set of domain-specific knowledge"

and the other is

"on the fly recursive algo regurgitation"

then it will be hard to change your mind about any of this.

---

You could have just as easily called system design interviews regurgitation and coding interviews "communicating deep domain-specific knowledge".

totetsu · on June 28, 2023

On that point, can anyone recommend any good reading, info sources about technical interviewing methods etc? I recently had an interview that was just being asked to remember linux commands like for a certification exam off the top of my head, and It made me wonder what the point was, and if there is better ways.

lll-o-lll · on June 28, 2023

Because puzzle solving is an iq proxy and iq is correlated with job performance? But really, just do an iq test. Maybe because interviewers are bad at distinguishing gpt style BSing from actual knowledge and need a baseline test.

popinman322 · on June 28, 2023

The correlation between IQ and job performance is typically weak[0] (weaker than the correlation between conscientiousness+agreeableness on job performance in some studies[1]) with a more modest correlation for "high complexity jobs".

Interesting excerpt from [0]:

> Finally, it seems that even the weak IQ-job performance correlations usually reported in the United States and Europe are not universal. For example, Byington and Felps (2010) found that IQ correlations with job performance are “substantially weaker” in other parts of the world, including China and the Middle East, where performances in school and work are more attributed to motivation and effort than cognitive ability.

[0]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4557354/ [1]: https://www.academia.edu/download/50754745/The_Interactive_E...

Sorry about the PDF link to [1]. The APA link has a paywall otherwise I'd link there.

lll-o-lll · on June 30, 2023

Sorry for not replying earlier, but I am really grateful for you providing those links. I had not known that the iq-job perf had been challenged and it means I need to adjust my priors when looking at candidates.

da39a3ee · on June 28, 2023

What are you referencing? Did someone make Martin Fowler do a leetcode-style interview? (I wouldn't be against that, just curious.)

Solvency · on June 28, 2023

Not Martin per se but there have been a few submitters in the past with thoroughly robust content who have also shared bad interview experiences.

irreducible · on June 28, 2023

https://twitter.com/mxcl/status/608682016205344768

djkivi · on June 28, 2023

It seems Unmesh Joshi wrote this article on MartinFowler.com.

westurner · on June 28, 2023

Distributed computing > Theoretical foundations: https://en.wikipedia.org/wiki/Distributed_computing#Theoreti...

Distributed algorithm > Standard problems: https://en.wikipedia.org/wiki/Distributed_algorithm#Standard...

Notes from "Ask HN: Learning about distributed systems?" https://news.ycombinator.com/item?id=23932271 ; CAP(?), BSP, Paxos, Raft, Byzantine fault, Consensus (computer science), Category: Distributed computing

"Ask HN: Do you use TLA+?" (2022) https://news.ycombinator.com/item?id=30194993 :

> "Concurrency: The Works of Leslie Lamport" ( https://g.co/kgs/nx1BaB )

Lamport timestamp > Lamport's logical clock in distributed systems: https://en.wikipedia.org/wiki/Lamport_timestamp#Lamport's_lo... :

> In a distributed system, it is not possible in practice to synchronize time across entities (typically thought of as processes) within the system; hence, the entities can use the concept of a logical clock based on the events through which they communicate.

Vector clock: https://en.wikipedia.org/wiki/Vector_clock

> https://westurner.github.io/hnlog/#comment-27442819 :

>> Can there still be side channel attacks in formally verified systems? Can e.g. TLA+ help with that at all?

westurner · on June 30, 2023

One initial time synchronization may be enough, given availability of new quantum time-keeping methods (with e.g. a USB interface).

"Quantum watch and its intrinsic proof of accuracy" (2022) https://journals.aps.org/prresearch/abstract/10.1103/PhysRev... https://www.sciencealert.com/scientists-just-discovered-an-e...

esafak · on June 28, 2023

The book is available for pre-order, and shipping in September: https://www.amazon.com/dp/0138221987/

mdaniel · on June 28, 2023

> Kindle $9.99

> Paperback $49.99

what.the.hell.

Also, in order to be constructive: the kindle version was published 21-June and thus is available right now. Sadly the kindle preview does not include the table of contents

sidcool · on June 28, 2023

I know the author of the blog and the book personally. And the author has built small storage engines himself and grokked code bases of Cassandra and other DBs to understand the patterns in code and not just as theoretical concepts. The blogs has code excerpts as well. Highly recommended read for the hands on folks.

supriyo-biswas · on June 28, 2023

Commenters here are missing the point — the intent is to build otherwise isolated systems with properties that are very difficult to control, such as varying amounts of clock skew, arbitrary process pauses due to GC cycles or CPU consumption and build a system on top that allows for the storage of mutable state. An example would be a cluster of etcd or dqlite instances (which Kubernetes in multi-master setups also use BTW), or at a larger scale, something like DynamoDB.

It’s one of the more easily approached resources on the design of distributed systems, and a good read.

nijave · on June 27, 2023

https://microservices.io/patterns/index.html has some patterns as well that aren't necessarily specific to a microservice architecture.

timf · on June 27, 2023

> The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized.

Data center/cloud system clocks can be tightly synchronized now in practice. Still never perfect and race conditions abound.

But that doesn't mean you can't rely on a clock to determine ordering, Google popularized a different approach with TrueTime/Spanner: https://cloud.google.com/spanner/docs/true-time-external-con...

preseinger · on June 28, 2023

you definitely cannot rely on clocks to determine (reliable) ordering across machines

truetime doesn't provide precise timestamps, each timestamp has a "drift" window

timestamps within the same window have no well-defined order, applications have to take this into account when doing e.g. distributed transactions

dikei · on June 28, 2023

> you definitely cannot rely on clocks to determine (reliable) ordering across machines

You can if you consider "unknown/possibly concurrent" also a valid outcome: for any 2 events A and B, True Time can definitely answers whether A is before B, A is after B, or A is possibly concurrent with B.

preseinger · on June 28, 2023

"reliable ordering" usually implies a specific and well-defined order, without concurrent events :)

timf · on June 28, 2023

Agree, and the timestamp comparisons you can make when the windows do not overlap are the basis of the ordering guarantees (across machines, even globally).

preseinger · on June 28, 2023

kind of? not really?

ordering is a logical property which can be informed by physical timestamps, but those timestamps aren't accurately described as "the basis" of that ordering

erikerikson · on June 28, 2023

They do have a well defined order, just not one based on time.

timf · on June 28, 2023

I don't see how it isn't based on time. They use GPS and atomic clocks to establish the time, establish an uncertainty window, and in Spanner's case will have transactions wait out that uncertainty to guarantee an ordering (globally).

erikerikson · on June 28, 2023

Look up the case where two transactions affecting the same record share a timestamp (or have overlapping error ranges). An non-timestamp tiebreaker determines the order between the two overlapping commits. It is not unlike the pre-agreement on conflict resolution mechanisms for CRDTs.

The context to my first comment included "timestamps within the same window"

sidcool · on June 28, 2023

There's a book that compiles all these patterns:

https://learning.oreilly.com/library/view/patterns-of-distri...

dang · on June 28, 2023

dikei · on June 28, 2023

This looks pretty good, similar to Designing Data-intensive applications, but with more focus on Distributed Systems.