Patterns of Distributed Systems (2022) (martinfowler.com)
235 points by eclectic29 on June 27, 2023 | 71 comments



> The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized.

Sentences like this will make me never regret moving my infrastructure to bare metal. My clocks are synchronized down to several nanoseconds, with leap-second skew handling and all kinds of shiny things. It literally took a day to set up and a blessing from an ISP in the same datacenter to use their clock sources (GPS + PTP). All the other servers are synchronized to that one via Chrony.


Even if you are "down to several nanoseconds", a slight clock drift can be the difference between corrupted data or not, and when running at scale, it's only a matter of time before you start running into race conditions.

For a small web app, fine, but if you're running enterprise level software processing billions of DB transactions per day, clocks just don't cut it.


That’s why you buy NICs with hardware timestamp support and enable PTP. You can detect clock drift within a few packets.

Race conditions are mitigated not by clocks but by other logic. The clock sync was just something done after frustrations with reading distributed logs and seeing them out of order. Logs are basically never out of order any more and there is sanity.


It’s not just about bare metal or not. The sort of distributed systems these patterns apply to are not small local clusters in a single datacenter.


And the really big ones do much more epic clock management, and keep things tightly in sync across the globe. See the recent work with PTP that fb and others are trying to do openly. Google and others have their own internal implementations.


In the cloud, you have very little control over the clocks. If you have bare metal, there's almost always the ability to configure (at least) GPS time sync for a server or two. If you get to the point where you have entire datacenters, there's no excuse NOT to invest in getting good clocks -- and it's likely a requirement.


"Not bare metal" does not imply "the cloud". You might be part of a company that doesn't give teams access to raw servers, you might be paying for VMs with dedicated resources from a local data center, or any number of other situations.

Not everyone has the luxury of being able to procure and install hardware and/or run an antenna to someplace with GPS reception.


I’d call that a “private cloud” and I think that’s the actual term for it. You can run a private cloud on bare metal, and if you are, then you can (as in physically able to) have control over things. If you have network cards with timestamp support, you can use PTP, or any number of things. If your org doesn’t support that, that doesn’t mean it isn’t a possibility, it just means you need to find someone else to ask.


bare metal doesn't solve this problem, clock synchronization works until it doesn't

ntp can fail, chrony can fail, system clocks can always drift undetectably

you can treat the system clock as an optimistic guess at the time, but it's never a reliable way to order anything across different machines


That’s why you buy NICs with hardware timestamp support and enable PTP.


this doesn't magically fix the problem

node clocks are unreliable by definition, it's a fundamental invariant of distributed systems


No, but you can detect skew in just a few packets and decide if you want to drain the node. If a node continues to have issues, put that thing on eBay and get another. Or, send back the motherboard to the manufacturer if it’s new enough.

Node clocks can be plenty reliable, but like any other hardware, sometimes they get defects.
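
Roughly the idea, as a sketch (the class name and threshold here are made up for illustration, not from any particular tool):

    import java.time.Duration;

    class ClockSkewMonitor {
        // drain threshold is an arbitrary illustration; tune for your hardware
        private static final Duration MAX_SKEW = Duration.ofNanos(100_000); // 100 microseconds

        // offset as reported by the local PTP/chrony daemon
        boolean shouldDrain(Duration measuredOffset) {
            return measuredOffset.abs().compareTo(MAX_SKEW) > 0;
        }
    }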


node A is connected to nodes B, C, D, E, F

the A->B link is under DDoS or whatever and delivers packets with 10s latency

the A->C link is faulty and has 50% packet loss

the A->{D,E,F} links are perfectly healthy

node B has one view of A's skew which is pretty bad, node C has a different view which is also pretty bad for different reasons, and nodes D E and F have a totally different view which is basically perfect

you literally cannot "detect skew" in a way that's reliable and actionable

issues are not a function of the node, they're a function of everything between the node and the observer, and are different for each observer

even if clocks were perfectly accurate, there is no such thing as a single consistent time across a distributed system. two events arriving at two nodes at precisely the same moment require some amount of time to be communicated to other nodes in the system, that time is a function of the speed of light, the "light cone" defines a physical limit to the propagation of information


I think you’re missing the forest for the trees a bit. In the code, order mostly doesn’t matter and where it does matter there is a monotonic clock decoupled from physical time, using an epoch framework (https://tli2.github.io/assets/pdf/epochs.pdf).

The clock sync is just to keep human-readable logs in order for debugging. It’s ok if it is sometimes out of order, though in practice, it never is.
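
Roughly (a sketch with made-up names, not the actual code): ordering comes from the monotonic epoch counter, and the wall clock is only attached to log lines for readability.

    import java.time.Instant;
    import java.util.concurrent.atomic.AtomicLong;

    class EpochOrdering {
        private final AtomicLong epoch = new AtomicLong();

        // total order for operations within this process, independent of wall time
        long nextEpoch() {
            return epoch.incrementAndGet();
        }

        // the wall-clock timestamp is cosmetic, only there so humans can read the logs
        String logLine(String msg) {
            return Instant.now() + " [epoch=" + nextEpoch() + "] " + msg;
        }
    }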


i'm not sure we're talking about the same thing


It's easy to say you've solved distributed computing problems when you're not actually doing distributed computing


The reason services like AWS or Azure are distributed is so that your resources are not concentrated which helps with fault tolerance. If your datacenter goes down, your whole service goes down. Also as was mentioned in other comments, asynchrony also applies in the same datacenter.


That’s what insurance is for. Far less expensive than maintaining fault tolerance at the current scale. If there were a fire or something (like Google’s recent explosion in France from an overflowing toilet), we’d lose at least several minutes of data, and be able to boot up in a cloud with degraded capabilities within ten minutes or so. Not too worried about it.


Not sure what you’re doing but it sounds like a great example for your competitors to point to and tell customers that they can avoid these issues.


It’s a project for fun (a hobby), and given away for free. 10 minutes of downtime every 20-30 years is perfectly acceptable to me.


You can do the same in AWS by using TimeSync and limiting yourself to one AZ.


I never understood why "The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized" is considered true even with working NTP synchronization.


A few milliseconds of difference (which is about the best you can get with NTP) can mean all the difference in the world at high enough throughput. When you can control the networking cards and time sources, you can get within a few nanoseconds across an entire datacenter, with monitoring to drain the node if clock skew gets too high.


And why are applications so sensitive to such small differences in time?

Seems like poor engineering practice.


usual problem is when you try to model logical causality (a before b) with physical time (a.timestamp < b.timestamp)

logical causality does not represent poor engineering practice :)
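
e.g. a logical clock (Lamport's, say) captures "a before b" with a counter instead of wall time - minimal sketch:

    import java.util.concurrent.atomic.AtomicLong;

    class LamportClock {
        private final AtomicLong counter = new AtomicLong();

        // tick before a local event or before sending a message
        long onLocalEvent() {
            return counter.incrementAndGet();
        }

        // on receive, jump past the sender's timestamp to preserve causality
        long onReceive(long remoteTimestamp) {
            return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
        }
    }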


this only applies if you carry over physical time from one machine to another, assuming perfect physical synchronization of time.

if you stick to a single source of truth - only one machine's time is used as a source of truth - then the problem disappears.

for example instead of using java/your-language's time() function (which could be out of sync across different app nodes) just use database's internal CURRENT_TIMESTAMP() when writing to db.

another alternative is to compare timestamps at up to 1 minute/hour precision, if you carry over time from one machine to another. That way you have a buffer of time for different machines to synchronize clocks over NTP
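
a rough JDBC sketch of the single-source-of-truth option (table and column names are made up):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class EventWriter {
        // the database assigns the timestamp, so every row shares one clock
        void insertEvent(Connection conn, String payload) throws SQLException {
            String sql = "INSERT INTO events (payload, created_at) "
                       + "VALUES (?, CURRENT_TIMESTAMP)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, payload);
                ps.executeUpdate();
            }
        }
    }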


if you can delegate synchronization/ordering to the monotonic clock of a single machine, then you should definitely do that :)

but that's a sort of trivial base case -- the interesting bit is when you can't make that kind of simplifying assumption


You can rephrase this question in terms of causality. Why does it matter that we know whether some process happens before some other process at some defined(?) level of precision?

There are ways around this but they are restrictive or come at the cost of increased latency. Sometimes those are acceptable trade offs and sometimes they are not.


root of the problem is using clocks from different hosts (which could be out of sync), and carrying that time over from one machine to another - essentially assuming clocks across different machines are perfectly synchronized 100% of the time.

if you use a single source of truth for clocks (simplest example: use the RDBMS's current_timestamp() instead of your programming language's time() function), the problem disappears


Imagine you have an account holding $200.

Now two operations come in, one adding $300, the other withdrawing $400. What would the result be, depending on the order of operations?
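
A tiny illustration, assuming for the example that overdrafts are rejected:

    class Account {
        private long balance = 200;

        void deposit(long amount) {
            balance += amount;
        }

        boolean withdraw(long amount) {
            if (amount > balance) return false; // rejected: no overdraft, by assumption
            balance -= amount;
            return true;
        }
    }

    // deposit(300) then withdraw(400): 200 -> 500 -> 100, withdrawal succeeds
    // withdraw(400) then deposit(300): withdrawal rejected, balance ends at 500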


It is, agree. Imho in most cases the right answer is to build it to not require that kind of clock synchronisation


You can build systems that do not require physical clock synchronization, but using a physical clock often leads to simpler code and a major performance advantage.

That's why Google built True Time, which provides physical time guarantee of [min_real_timestamp, max_real_timestamp] for each timestamp instant. You can easily know the ordering of 2 events by comparing the bounds of their timestamps as long as the bounds do not overlap. In order to achieve that, Google try to keep the bound as small as possible, using the most accurate clocks they can find: atomic and GPS clocks.


Yes, that is essentially the point of logical clocks :)


Can someone explain why, in an interview context, someone with the technical ability to understand and assess and write and communicate a deep set of domain-specific knowledge like this....might still be asked to do some in-person leetcode tests? How does on the fly recursive algo regurgitation sometimes mean more than being able to demonstrate such depth of knowledge?


Because the world is full of bullshitters who know enough buzzwords and buzz-sentences, and interview time is too short.

On lower levels leetcode style stuff gives more quantifiable signals per minute/session.

Plus, do you really want to fully explain something like paxos or raft in an interview context?

My personal pet peeves are

1. “event sourcing” and derivatives. It really attracts people who love to talk but never built anything large using it

2. Adepts of uncle martin


If you can't differentiate buzzwords from signals you shouldn't be interviewing.

Plus how do you expect to get good signals from a leetcode whiteboard interview for someone who spends most of their time designing systems and only writes code when it gets to the point where it's faster and less frustrating to pair program vs explaining what needs to get implemented?

To clarify, I don't have a good answer: I still participate in leetcode style interviews (though system design is another component) - but although I sing the song and can't come up with anything better, I don't think it's the best way to go


Raft is for event sourcing.


Or what about having a repository proving these very concepts with almost 2000 commits?

In my experience things like publications, online code repositories, and facts are a little more than irrelevant, but not much more, because people don’t know how to independently evaluate these things. Worse, attempting to evaluate such only exposes the insecurity of people not qualified to be there in the first place.

Far more important are numbers of NPM downloads and GitHub stars. Popularity is an external validation any idiot can understand. But popularity is also something that you can bullshit, so just play it safe, treat everyone as a junior developer, and leetcode them out of consideration.


It's easy to fake projects and commits. And when they aren't faked, you can't have an expectation of every candidate having these because it's so much work. And if you decide to bypass some of your usual interview process because a candidate has a project, now you aren't assessing all candidates equally.


All I see from your comment is that candidate evaluation requires too much effort, so you exchange one bias for another.


I've interviewed people for roles as low-level C++/kernel programmers who did not know what hexadecimal was. Having a quick "What's 0x2A in decimal? Feel free to use paper & pen."[1] question can be a significant time-saver / direct people to more appropriate roles, if any.

[1] Starting to do math with non-base-10 numbers was already a pass, regardless of the number you reached; you'd normally use a computer for that anyway. But it really isn't too hard to do in your head directly, for anyone who's dealt with binary data.
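
A quick sanity check of that particular number (illustration only):

    class HexCheck {
        public static void main(String[] args) {
            System.out.println(0x2A);                       // 2*16 + 10 = 42
            System.out.println(Integer.parseInt("2A", 16)); // 42
        }
    }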


I am not sure how that has anything to do with what I posted.


It's a choice to treat all candidates equally.

And intentionally excluding candidates is the whole point of designing an interview process; it's lazy to throw your hands up and declare that they're all biased.


It's not one or the other; many interviews assess both because they're both meaningful signals about a candidate.

If you're coming from the position that one is

"understand and assess and write and communicate a deep set of domain-specific knowledge"

and the other is

"on the fly recursive algo regurgitation"

then it will be hard to change your mind about any of this.

---

You could have just as easily called system design interviews regurgitation and coding interviews "communicating deep domain-specific knowledge".


On that point, can anyone recommend any good reading or info sources about technical interviewing methods etc.? I recently had an interview that was just being asked to remember Linux commands off the top of my head, like for a certification exam, and it made me wonder what the point was, and if there are better ways.


Because puzzle solving is an IQ proxy and IQ is correlated with job performance? But really, just do an IQ test. Maybe because interviewers are bad at distinguishing GPT-style BSing from actual knowledge and need a baseline test.


The correlation between IQ and job performance is typically weak[0] (weaker than the correlation of conscientiousness+agreeableness with job performance in some studies[1]), with a more modest correlation for "high complexity jobs".

Interesting excerpt from [0]:

> Finally, it seems that even the weak IQ-job performance correlations usually reported in the United States and Europe are not universal. For example, Byington and Felps (2010) found that IQ correlations with job performance are “substantially weaker” in other parts of the world, including China and the Middle East, where performances in school and work are more attributed to motivation and effort than cognitive ability.

[0]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4557354/ [1]: https://www.academia.edu/download/50754745/The_Interactive_E...

Sorry about the PDF link to [1]. The APA link has a paywall otherwise I'd link there.


Sorry for not replying earlier, but I am really grateful to you for providing those links. I had not known that the IQ-job performance correlation had been challenged, and it means I need to adjust my priors when looking at candidates.


What are you referencing? Did someone make Martin Fowler do a leetcode-style interview? (I wouldn't be against that, just curious.)


Not Martin per se but there have been a few submitters in the past with thoroughly robust content who have also shared bad interview experiences.



It seems Unmesh Joshi wrote this article on MartinFowler.com.


Distributed computing > Theoretical foundations: https://en.wikipedia.org/wiki/Distributed_computing#Theoreti...

Distributed algorithm > Standard problems: https://en.wikipedia.org/wiki/Distributed_algorithm#Standard...

Notes from "Ask HN: Learning about distributed systems?" https://news.ycombinator.com/item?id=23932271 ; CAP(?), BSP, Paxos, Raft, Byzantine fault, Consensus (computer science), Category: Distributed computing

"Ask HN: Do you use TLA+?" (2022) https://news.ycombinator.com/item?id=30194993 :

> "Concurrency: The Works of Leslie Lamport" ( https://g.co/kgs/nx1BaB )

Lamport timestamp > Lamport's logical clock in distributed systems: https://en.wikipedia.org/wiki/Lamport_timestamp#Lamport's_lo... :

> In a distributed system, it is not possible in practice to synchronize time across entities (typically thought of as processes) within the system; hence, the entities can use the concept of a logical clock based on the events through which they communicate.

Vector clock: https://en.wikipedia.org/wiki/Vector_clock

> https://westurner.github.io/hnlog/#comment-27442819 :

>> Can there still be side channel attacks in formally verified systems? Can e.g. TLA+ help with that at all?


One initial time synchronization may be enough, given the availability of new quantum timekeeping methods (with e.g. a USB interface).

"Quantum watch and its intrinsic proof of accuracy" (2022) https://journals.aps.org/prresearch/abstract/10.1103/PhysRev... https://www.sciencealert.com/scientists-just-discovered-an-e...


The book is available for pre-order, and shipping in September: https://www.amazon.com/dp/0138221987/


> Kindle $9.99

> Paperback $49.99

what.the.hell.

Also, in order to be constructive: the Kindle version was published June 21 and thus is available right now. Sadly, the Kindle preview does not include the table of contents.


I know the author of the blog and the book personally. The author has built small storage engines himself and grokked the code bases of Cassandra and other DBs to understand the patterns in code, not just as theoretical concepts. The blog has code excerpts as well. Highly recommended read for the hands-on folks.


Commenters here are missing the point — the intent is to take otherwise isolated systems with properties that are very difficult to control, such as varying amounts of clock skew and arbitrary process pauses due to GC cycles or CPU consumption, and build a system on top of them that allows for the storage of mutable state. An example would be a cluster of etcd or dqlite instances (which Kubernetes in multi-master setups also uses, BTW), or at a larger scale, something like DynamoDB.

It’s one of the more easily approached resources on the design of distributed systems, and a good read.


https://microservices.io/patterns/index.html has some patterns as well that aren't necessarily specific to a microservice architecture.


> The main reason we can not use system clocks is that system clocks across servers are not guaranteed to be synchronized.

Data center/cloud system clocks can be tightly synchronized now in practice. Still never perfect and race conditions abound.

But that doesn't mean you can't rely on a clock to determine ordering, Google popularized a different approach with TrueTime/Spanner: https://cloud.google.com/spanner/docs/true-time-external-con...


you definitely cannot rely on clocks to determine (reliable) ordering across machines

truetime doesn't provide precise timestamps, each timestamp has a "drift" window

timestamps within the same window have no well-defined order, applications have to take this into account when doing e.g. distributed transactions


> you definitely cannot rely on clocks to determine (reliable) ordering across machines

You can if you consider "unknown/possibly concurrent" also a valid outcome: for any 2 events A and B, TrueTime can definitively answer whether A is before B, A is after B, or A is possibly concurrent with B.


"reliable ordering" usually implies a specific and well-defined order, without concurrent events :)


Agree, and the timestamp comparisons you can make when the windows do not overlap are the basis of the ordering guarantees (across machines, even globally).


kind of? not really?

ordering is a logical property which can be informed by physical timestamps, but those timestamps aren't accurately described as "the basis" of that ordering


They do have a well defined order, just not one based on time.


I don't see how it isn't based on time. They use GPS and atomic clocks to establish the time, establish an uncertainty window, and in Spanner's case will have transactions wait out that uncertainty to guarantee an ordering (globally).


Look up the case where two transactions affecting the same record share a timestamp (or have overlapping error ranges). A non-timestamp tiebreaker determines the order between the two overlapping commits. It is not unlike the pre-agreement on conflict-resolution mechanisms for CRDTs.

The context to my first comment included "timestamps within the same window"


There's a book that compiles all these patterns:

https://learning.oreilly.com/library/view/patterns-of-distri...


Related:

Patterns of Distributed Systems (2020) - https://news.ycombinator.com/item?id=26089683 - Feb 2021 (58 comments)


This looks pretty good, similar to Designing Data-Intensive Applications, but with more focus on distributed systems.



