The CAP theorem. The Bad, the Bad, & the Ugly (dtornow.com)
101 points by thunderbong 6 months ago | 74 comments



This conflict comes up everywhere:

Newtonian physics/mechanics: good enough in a lot of cases.

Einstein: accurate, but unnecessary in most cases.

In many cases CAP is good enough for us to have the conversation about how the system works. One can then formulate a plan for when it doesn't. The fact that it's imperfect at a formal level is academically interesting, but technically irrelevant for a LOT of conversations where it has utility.


Agreed. Sure, you can expand it with PACELC, but the important part is that you have something which starts the conversation/thinking about the fact that distributed systems have specific requirements and challenges. And CAP is sufficient for that.


TIL:

> In theoretical computer science, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

* https://en.wikipedia.org/wiki/PACELC_theorem


My first thought was The Structure of Scientific Revolutions. CAP was a very useful theory on top of which years of research were done until the paradigm was broken and replaced. That's how progress works. It doesn't mean CAP was a scam; it was our very good first draft at formalizing distributed database theory. And the fact that we still talk about it is proof of that.


Fundamental principles often just formally state the obvious.

People who want CAP to be profound and useful are missing the point - it just tells you some things that you might want to try to achieve are provably impossible. Just like Newton’s first law tells you things can’t accelerate without a force (duh) and the first law of thermodynamics tells you you can’t get energy out of a perpetual motion machine (duh) and the pigeonhole principle tells you if you put n things in less than n groups, at least one group has more than one thing in it (duh); CAP tells you you can’t build a distributed, partitionable data system that is 100% consistent and 100% available (duh).

It’s not that deep or profound but it is proven and it eliminates a whole class of ideas from meriting further thought because they’re demonstrably impossible. That lets us get on with making the most of what is possible.


Your 'duh's escalated quite quickly; I got a Euclid's-postulates feeling reading it - the first four are 'duh' and then the fifth one comes swinging.

(by the way, I think you have your Newton's law order wrong; that's the second one you're referring to)


First law (inertia) basically states that in the absence of a force, no acceleration can happen (sometimes stated as ‘an object in motion will remain in motion, an object at rest will remain at rest’).

Second law gets more specific, and clarifies that the size of the acceleration is proportional to the size of the force.


The CAP theorem expands to the PACELC theorem. PACELC stands for Partition, Availability, Consistency, Else, Latency, and Consistency. The theorem states that in the event of a network partition, a distributed system must choose between availability and consistency; otherwise, it must choose between latency and consistency.

https://en.wikipedia.org/wiki/PACELC_theorem


It seems like people make this really complicated.

There will occasionally be network partitions. When there are, a given node can either respond (potentially inconsistently) or not. So you pick some balance between consistency and availability. Latency is really just a proxy for availability - as latency tends towards infinity, availability tends towards zero. Of course you can wait until the network partition is resolved and the nodes are caught up - but I think it's simpler to consider that as not being available for a period of time rather than as having a high latency at that time.

Consistency or Availability - you don't have to pick one, but the more consistent you want to be (in the case of network partitions) the less available you'll be.


Disagree on latency; for certain classes of applications it really matters. Latency can be caused by a myriad of things, but if latency is what triggers your cutover, you haven't failed to respond, you've failed to respond in a timely way. The telco space has strict latency requirements. If a failure results in erratic behaviour on the network because the database takes time to handle a node failure, you aren't in a good place; it has a financial impact.

There are databases that can provide high consistency and availability.


Put an SLA on the latency, and it becomes availability.
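
A rough way to see that framing, as a Python sketch (the SLA threshold and latencies here are made up): count any response slower than the SLA, or missing entirely, as unavailable, and availability becomes the fraction of requests inside the SLA.

    SLA_SECONDS = 0.2  # hypothetical latency SLA

    # None means no response at all; numbers are observed latencies in seconds
    latencies = [0.05, 0.12, 0.8, 0.15, None, 0.09]

    def within_sla(latency, sla=SLA_SECONDS):
        return latency is not None and latency <= sla

    availability = sum(within_sla(l) for l in latencies) / len(latencies)
    print(f"availability under a {SLA_SECONDS}s SLA: {availability:.0%}")  # 67%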


That’s a great way of putting it!


If your decision between Consistency and Availability has business implications, you aren't making this "really complicated"; you're capturing business logic that has to be adhered to.

Failing to understand the implications of one over the other is often a sign of immature architecture.


Was going to say, latency doesn't need to be in there.


The ‘network partitions are not optional’ interpretation of CAP is forgetting about the fact that non-distributed-systems are a thing.

Sure, as soon as you decided to distribute your system across a network you opted into a world where partition can happen, and you will have to give up consistency or availability.

Mainframes, though, provide consistency and availability by being unpartitionable except through use of a chainsaw.


They don't provide availability, because if they go down, they are down. For example, if there is a fire, the whole thing is gone. That's a choice you make.


Mainframes have a ton of internal redundancy that enables insanely high availability, absent destruction. All 'availability' rests on some assumed radius of safety.

A nuclear blast in northern Virginia can take out all the availability zones for AWS US-East-1 - that doesn’t invalidate the availability claims of a system that is distributed across three AZs, it just puts an upper limit on it.


Sure. But the thing is: is there literally not a single component in that mainframe that can act as a single point of failure for the whole system?

If so, I agree - in that case it would be similar to running two servers in the same AZ. Which provides a certain form of availability.

Otherwise I disagree, because as soon as you have a single point of failure, your system is limited by the availability of that specific part, whereas in a distributed system you can by design increase availability by adding more nodes, at least up to a certain point.

Think about it: I can run the two servers I mentioned in the same AZ. But I can also move them to a separate AZ each. For a distributed system it doesn't matter conceptually. That doesn't mean that there is no impact on availability or performance; it just means that I don't have to change the logic of my system.

For a mainframe, however, that's not true - and that is the charm of it: I don't have (and don't want) to care, which simplifies things a lot; at the expense of (the option of) the availability of a distributed system.


I've been "arguing" AGAINST mainframes on HN in the past, and after that I've seen many really cool presentations about them. One, quite commonly known, is that two separate mainframes can run completely in lock-step if they are less than ≈ 50 km / 30 mi apart. So if one explodes mid instruction it will still continue fine. As for a single system, I don't think there is any single point of failure, you can swap cpu's mid operation.


First of all, I'm not arguing against mainframes or anything! I'm just trying to clarify/explain terms and remove confusion. Just to get that out of the way.

> One, quite commonly known, is that two separate mainframes can run completely in lock-step if they are less than ≈ 50 km / 30 mi apart

Since they are 50km apart, they are connected to each other by one or multiple cables. I assume that they are also both connected to the internet (or something else that matters) so that if one burns down the other can take over. Correct so far?

If so, then please answer this question: how does the system behave if all the cables between the two machines are severed, for an unknown time?


More importantly, for a lot of the things that people build today, a mainframe is powerful enough to handle however large the load will get.

The ultra-scalable techniques that Google, Facebook, etc. use are great, but most applications do not need that kind of scalability. Throwing a little more money at a fancy mainframe is much, much cheaper than the programmer time needed to build a distributed system.


I’m not arguing for mainframes here. I’m just saying that when you opt to improve your resiliency and availability by distributing your system, rather than by, say, adding more redundant power supplies and burying your computer in a deeper bunker, you are making your system partitionable, and that means you have to give up either consistency or availability.

Which is odd, when you think about it, since you allegedly made this choice in order to get better availability.


Right, if absolutely everything in the mainframe were redundant, that'd be two mainframes connected by a network, so P.


"All availability rests on some assumed radius of safety", thank you for that! If this is not a named law could you name it? I have needed this statement in the past but it never came together like that.


I agree it's a good point. A nuclear blast is extrinsic to the system. I think what something like CAP gets us to think about is failures/limitations that are _intrinsic_ to the system. Someone walking around the data center and unplugging all the machines, then flying to the next data center and unplugging them, etc, would also cause all of CAP to fail.

Though, thinking about it more, what it also tells us is that we need to be clear about what "the system" is. There are, after all, some systems that _do_ take nuclear blasts into account, or systems that take "physical breach by a hostile actor" into account.

Less dramatically, we make trade-offs all the time as to what we consider "in" system vs "out", and it's good to be conscious and explicit about them. We see this a lot in UI, especially web UI in terms of things like what browsers to support, whether we care about users' battery life or data cap limits, if we foresee ourselves running the system in a jurisdiction with different regulations around data collection and usage, etc.


No matter what, some hardware in that mainframe is making mutexes or some other kind of locking work, and that is a single point of failure. You can't have two of those be authoritative for the same thing, that's the two generals problem.


It's essential to rule out certain happenings when performing analysis, and acts of God are commonly among them.

For example, if you need to account for cosmic rays, you can't prove anything meaningful about software. Redundancy won't get you out of this: you add redundancy, I'll add more cosmic rays.

All that does is force a useful analysis into a muddy and probabilistic one. It's actually pretty important to account for cosmic rays! But you do that on top of an analysis which presumes that the hardware performs correctly. It's not a useful reality to expose to a proof assistant.

To anticipate an objection: yes, you absolutely can rule out network partitions in an analysis, if that's a useful thing to do. For a packet-switched network, it isn't useful. But different network topologies exist, or at least used to, ones where once a circuit is negotiated, you can say useful things about the network without accounting for the kind of equipment failure which might break that circuit. For packet switching, you're going to have a really bad time if you don't account for partitions, because those are expected behavior, one of the properties of the system under consideration.


That's not what this discussion is about. OP said:

> as soon as you decided to distribute your system across a network you opted into a world where partition can happen, and you will have to give up consistency or availability.

So we are talking in the context of distributed networks. Then:

> Mainframes, though, provide consistency and availability by being unpartitionable except through use of a chainsaw.

using the words "consistency" and "availability" here is clearly referring to the same words in the sentence before. Hence we are still in the context of distributed networks. And therefore you cannot just rule this out as an "act of god".

Had OP said "A mainframe has high availability compared to my laptop and for me that is more than enough." then I wouldn't have said anything.


Your fire is just the chainsaw in the original post.


Yeah, that's exactly my point.


Datacenter fires cause downtime no matter what consistency model you choose.

Also, mainframes are pretty big, multi-processor machines. They have their problems, but having all those components give out all at once? I'm sure it happens, but I've never heard of it.


> Datacenter fires cause downtime no matter what consistency model you choose.

Why is that? I can distribute my nodes over multiple datacenters. Netflix does that even with whole regions (to my knowledge) to avoid downtime when a region goes down.


In theory yes, and certainly if we're talking about Cassandra. In practice, most people do not distribute over multiple data centers. Doing so is very difficult and costly.


Haha, I've been there. Three companies I've been with have done that - and every time we had more downtime overall than had we just run a mirrored postgres.

But yeah, the theory is what I'm talking about. Not putting the theory into practice is a whole different matter. :-)


It can be useful to say that everything is a distributed system because it actually matters in a lot of contexts, just not for CAP. There are sorts of networks within a single machine or even a single CPU, and generally "further apart" things have more latency between them. For example, task on <cpu 1, core 1> talking to another also on <1, 1> vs <1, 2> vs <2, 1> vs all the way off on the GPU... A single failure probably takes down the entire system, but not necessarily.


You can have a distributed system that is partially available too. Consider a GPU cluster. A network partition occurs. Both nodes can suspend jobs or refuse to accept jobs that require more than one node's worth of compute until the partition is healed, while still independently serving jobs that require only one node, and reconcile the accounting details once reconciliation is possible.


> The ‘network partitions are not optional’ interpretation of CAP is forgetting about the fact that non-distributed-systems are a thing.

Not only that; a weird network partition is highly rare, and I have never seen one. In most cases there is only one network, and either the server is up and connected to "the" network or it is not connected at all. And I believe engineers have a tendency to overengineer for this failure mode while not thinking about much more probable ones.

I have asked many people who say we need a minimum of three nodes and three pods per service, and no one could answer why.


I have seen it happen even within a local dual-redundancy cluster, with machines connected by redundant ethernet links.

Some maintenance person disconnected power to the two servers, then reconnected the power assuming they would reliably recover. The admin instructions were to never do this - always boot up in sequence.

They booted up simultaneously, and the fully redundant network switches took too long to boot fully (Cisco), but started passing internet traffic. After a timeout the servers each assumed they were the master in a degraded cluster, so proceeded to make divergent modifications on the DRBD replicated storage and to serve requests.

I never found out why this happened in spite of the direct ethernet links between the servers which they were supposed to use for synchronisation decisions, but it did.

Recovery required manually comparing changes in files and databases to decide whether to merge or discard.

This problem was avoidable but an adequate fix evidently wasn't in place (mea culpa, limited time and budget).

It did not help that Pacemaker+Corosync was used, before Kubernetes was popular, and Ubuntu Server shipped a very buggy alpha version of Corosync that corrupted itself and crashed often, despite upstream warning it was an unreliable version. I had to manually build a different version of those tools from Red Hat source, because it was too late to change distro. This is one of two reasons I don't recommend Ubuntu Server in professional deployments any more, even though I still use it for my own projects.

Three servers, or two servers and a third special something for arbitration, is a standard solution to this problem.

But it's only useful for a stateful distributed system, like a database or filesystem with some level of multi-master or automatic failover.

There's no need for three nodes or any particular number, for stateless nodes like a web service whose shared state is all calls to a database or filesystem on other nodes.

Technically you don't need three servers. It's enough to have a cheap component or low-cost tiny computer to arbitrate. Even sending commands to the network switches to disable ports (if the switch doesn't behave too strangely, as the Cisco switches did in the above!), or IPMI commands to the other server's BMC. Just about anything can be used, even a high latency, offsite tiny VM, as it isn't needed when the main servers are synchronised.


Minimum of three is generally for one of two reasons:

1) to ensure you can deliberately take one offline for maintenance and still have redundancy in case a single node goes offline

2) in some systems an odd number of nodes is needed to ensure no ties in leader elections or decision votes. Three is the smallest odd number that has any redundancy.
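
The arithmetic behind 2) is short. A sketch, assuming simple majority quorums:

    def quorum(n):
        return n // 2 + 1        # smallest majority of n nodes

    def tolerated_failures(n):
        return n - quorum(n)     # nodes you can lose and still form a quorum

    for n in range(1, 6):
        print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")

    # 1 and 2 nodes tolerate 0 failures; 3 is the smallest size that tolerates 1.
    # 4 still only tolerates 1, which is why odd cluster sizes are preferred.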


1) I have taken a node into maintenance mode many times, deliberately, by launching a second node and draining the first. It's not an issue at all with just one node. For redundancy, is there ever a case where a single node could go down in practice? Nodes are basically EC2 instances, and they can run for years without going down.

2) 1 is odd and could win the election;)


EC2 instances can be rebooted at any time. The underlying hardware they are on fails from time to time and they get moved.

Running three nodes is a rule of thumb, not a hard rule for minimally guaranteeing availability.


> EC2 instances can be rebooted at any time

No, they can't be rebooted at any time. Where are you getting this information from?


https://repost.aws/knowledge-center/ec2-linux-degraded-hardw...

Note in particular:

> For instances that launched from an Amazon EC2 Auto Scaling group, the instance termination and replacement occur immediately


> Amazon EC2 also sends a notification in your email

Interesting. I have seen EC2 instances running for multiple years and never saw this, so I assumed it wasn't possible. I know for a fact that GCP has live migration, where the service may be degraded for a few seconds but you don't need to do anything, so I assumed AWS had something similar.


In my opinion this dead horse is beat. Anyone working on distributed systems knows that CAP is not useful in isolation. See Eric Brewer's followup article from 2012: https://www.infoq.com/articles/cap-twelve-years-later-how-th...


I had an interview question about it, I tried to cite Kleppmann from the distributed systems book and the interview manager got quite offended.


Same, but I actually forgot why he didn't like it, so I didn't.


care to elaborate? Not picking apart anything you said but am curious. I am slowly making my way through DDIA. Nothing he says is particularly egregious (so far).


the guy interviewing me just wanted a definition. I guess I should have started with that and then pointed out why it's a bad concept.


sounds like you dodged a bullet.


I view the CAP theorem as a statement about the universe we live in (information travels at the speed of light, "instantaneous" is observer-dependent, etc.), and not as a consequence of the information-theoretic definitions we chose to adopt (strength of consistency or availability). Yes, when stated simplistically ("you have 3 properties C, A, P; choose 2") it can be extremely misleading, since you can't really choose P; you can only decide what to do in case of P. But like any mathematical theorem, one has to understand the preconditions under which it applies, the various computing-model assumptions, etc. to actually use it.

Notwithstanding, I still find it very useful as a general guide when designing distributed systems.


A very common failure mode I see even among experienced senior engineers is to talk only about the availability of their own endpoint, not about whether the system as a whole meaningfully addresses user requests. Well-intentioned SLOs around error rate, 99th-percentile latency, etc. become meaningless if you don't understand how they affect the various objectives of the various production clients.

We generally accept that clients may have to perform retries, definitely jittered and hopefully bounded. Given that's the case, what does it matter whether a single instance takes 5 seconds to come back online after failure (possibly rescheduled on a new node), or multiple instances take the same 5 seconds to recognize leader failure and elect a new leader? Sure, they're unlikely to be identical numbers, but say they're within an order of magnitude and both have error margins so it's a wash.

An architecture astronaut will put forward a design with complex leader election, endpoint discovery, distributed locking, strong consistency, etc. for the latter solution and pat themselves on the back for their expertise and professionalism. They'll waste a lot of time for both dev and SRE as long as the service lives.

A reasonable person should be able to acknowledge that, given either solution will have the client retrying about the same amount of time, the simplest solution that delivers that experience is sufficient from the client's point of view and vastly preferable from a maintainer's or operator's point of view.

Of course not every scenario will be like this. Sometimes startup is unavoidably a lot slower than re-election. It's just rare you see this evaluated in terms of client-facing numbers before a very costly architecture decision is made.
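
For what it's worth, the client-side retry loop being assumed in all of this is small. A sketch with arbitrary constants (nothing here is specific to any particular system):

    import random
    import time

    def call_with_retries(fn, attempts=5, base=0.1, cap=5.0):
        """Bounded retries with full jitter; constants are arbitrary."""
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise  # bounded: give up after the last attempt
                # full jitter over a capped, exponentially growing window
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))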


Latency eventually becomes availability eventually becomes durability. The difference is only quantitative not qualitative.


There is a theorem called CALM[1] (Consistency as Logical Monotonicity) that shows that a program has a consistent, coordination-free distributed implementation iff it is monotonic.

I find that reasoning about consistency is much easier from that perspective, for example it immediately gives you an intuition on why individual CRDTs work.

1: https://arxiv.org/abs/1901.01930


Sounds interesting. Is there an intuitive definition of monotonicity for the purposes of distributed systems? I mean, I thought I knew what it means, but I’m not too sure now.


You never take something back. A grow-only set, for example, only ever grows. A monotonic counter only ever grows. Even a set where you can insert and delete once is like a set of counters that only ever grow (nothing: 0, added: 1, removed: 2).

Forgetting, for example, is fine, but deletion is not.
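
A grow-only set makes the point concrete. A small Python sketch (not any particular CRDT library): merge is just set union, so it is commutative, associative, and idempotent, and replicas converge no matter the order in which updates and merges arrive.

    class GSet:
        """Grow-only set: the simplest monotonic CRDT."""
        def __init__(self):
            self.items = set()

        def add(self, x):
            self.items.add(x)          # only ever grows

        def merge(self, other):
            self.items |= other.items  # set union: merge order doesn't matter

    a, b = GSet(), GSet()
    a.add("x"); b.add("y"); b.add("x")
    a.merge(b); b.merge(a)
    assert a.items == b.items == {"x", "y"}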


> Since you have to account for network partitions, you have to choose between consistency and availability if and when a partition occurs. In other words, the CAP divides the world into CP and AP systems.

No, it's two out of three. CA systems don't use a network. Consider a database and a work queue which communicates with the database, where they run on the same server. You can achieve consistency and availability in such a system, but only by eliminating partitioning, and there's only the one way to do that: don't run the systems on a network. Nor is this an unrealistic architecture! Far from it, it is in fact the one you should choose if you need both consistency and availability.

The parts of the article about the CAP theorem and its consequences are on fairly solid ground as I see it. But the observation about the CAP conjecture is bootless, the conjecture that CAP is "pick two" holds up to scrutiny. Not that it proves the conjecture, just that all three of the "pick two" options describe meaningful systems.


Whenever I see CAP questioned as a useful thought exercise in distributed systems, it seems to originate from people trying to defend database systems that don't respond well to the reality that partitions are inevitable.

That is, people doing implicit or explicit marketing of some enterprise data land grab, telling you things like they can scale arbitrary SQL joins across distributed tables. Large data sets do not teleport across the wire for merging.

FoundationDB, Cassandra, and kinda Dynamo (I don't trust their new global replication) actually scale, but I've never used GCP's DB techs.

CAP is imo one of the best distributed-systems principles ever stated. It is simply stated, compared to, say, the nitty-gritty of Paxos vs Raft. It is a very useful first-principles exercise for deconstructing whatever harebrained claim some marketroid is foisting upon you.


Volt is another good example of solving this well. Its consistency guarantees are strong, and it offers very high availability through its replication, which commits transactions to replicas simultaneously. In theory it's weaker on the partition side, but in practice, with modern hardware and networks, the impact is not felt. It also offers multi-data-centre replication, with all sites being active.


> On the other hand, given a non-permanent partition and the requirements that a request receives a response eventually, that is, in unbounded time, both strong consistency as well as weak consistency can be achieved.

Is that why S3 works so well?


The point they're trying to make is that the A in CAP is ambiguous. If your timeout is unbounded and a request completes an hour/day/month/year or more later, is the system available?


Doesn't strong consistency always imply some form of explicit locking that collides with availability? As in, in a distributed system it needs to go something like:

1. Get write request to a node

2. Node sends lock on that updated data to rest of nodes

3. After they send back OK the write happens

4. Propagate to rest of nodes

5. After receiving OK on the update from all of the nodes send notification to nodes to lift the lock

So if a partition happens, the system fails if it is CA, while otherwise it cannot guarantee strong consistency without resolving the split brain (killing one of the partitions based on e.g. a quorum)?


If by "strong consistency" you mean "linearizability" then no, you can build a distributed atomic register without locking (ABD algorithm[0][1]). The trick is to force readers to complete a possibly-incomplete write returned from a read quorum by writing it to a write quorum before they can return their result (in practice you can elide the write-on-read in the common case where the result from the read quorum already is written to a write quorum). Similarly, writers must find the latest possibly-written version from a read quorum to obtain the version number for their write.

If you need stronger primitives than atomic read/write (e.g. CAS), then ABD is insufficient and you need consensus in the asynchronous model (although synchronized clocks can allow even CAS to be implemented with ABD + leader election with leases[2]).

[0] https://dl.acm.org/doi/pdf/10.1145/200836.200869

[1] https://www.cl.cam.ac.uk/teaching/2223/ConcDisSys/dist-sys-n...

[2] https://arxiv.org/pdf/1702.04242.pdf
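
For anyone who wants the shape of ABD without reading the papers, here is a rough in-process sketch of the quorum pattern, not a faithful implementation (no real network, no concurrency); it just shows the two phases and relies on any two majorities intersecting:

    import random

    N = 5
    # each replica stores a (timestamp, value); timestamp = (counter, writer_id)
    replicas = [{"ts": (0, 0), "val": None} for _ in range(N)]

    def quorum():
        # any majority of replicas; stands in for "whichever replicas responded"
        return random.sample(range(N), N // 2 + 1)

    def write(writer_id, value):
        # phase 1: learn the highest timestamp seen by a read quorum
        highest = max(replicas[i]["ts"] for i in quorum())
        ts = (highest[0] + 1, writer_id)          # writer_id breaks ties
        # phase 2: store the new (ts, value) at a write quorum
        for i in quorum():
            if ts > replicas[i]["ts"]:
                replicas[i]["ts"], replicas[i]["val"] = ts, value

    def read():
        # phase 1: find the latest (ts, value) among a read quorum
        latest = max(quorum(), key=lambda i: replicas[i]["ts"])
        ts, val = replicas[latest]["ts"], replicas[latest]["val"]
        # phase 2 (write-back): make sure a write quorum has it before returning
        for i in quorum():
            if ts > replicas[i]["ts"]:
                replicas[i]["ts"], replicas[i]["val"] = ts, val
        return val

    write(writer_id=1, value="hello")
    print(read())  # "hello": any read quorum intersects the write quorum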


My favorite mental framework to reason about consistency in distributed systems, Invariant Confluence, was formulated by Peter Bailis et al. in Coordination Avoidance in Database Systems (overview and link to paper: http://www.bailis.org/blog/when-does-consistency-require-coo...)

Invariant confluence determines whether an application requires coordination for correct execution.

You can tailor the invariants to your requirements, making invariant confluence a much better tool than CAP.


This blog post felt unsatisfying. I’m not sure I’ve talked to anyone in the last decade about distributed systems where:

- CAP came up

- all parties immediately agreed that CA was impossible

I still think it’s a useful model for forcing people to think about their failure modes though. Systems, largely, tend to fall into either CP or AP or be horrendously expensive.

You have to choose your trade-offs.

I wish the author had expounded a little further on better options. I’ll read the paper linked, but there’d be more punch with more inline content.


My understanding of

[[The “Pick 2 out of 3” interpretation implies that network partitions are optional, something you can opt-in or opt-out of.]]

is you can choose to not have a distributed system.


This is also how I understand it. You choose networked or not networked, and if you choose networked you get a 2-for-1 with partitions.


I worked on a wide-area distributed system in 2010-2012. CAP and NoSQL were all the rage then (MongoDB is webscale, after all).

The approach we took was to have a single database node be the primary and another replicating from it asynchronously. When a node went offline, a third monitoring node made a decision to promote the backup to primary and communicate this decision to all clients. The three nodes were located in three different cities and as such the possibility of all three having issues at once was reduced. By doing this, we could literally pull the plug on the primary node and within 30 seconds the clients were talking to the backup-now-primary. If the monitoring node failed, of course we had no way of automatically switching database nodes but again it had to be a major issue for a node in Seattle to get cut off at the same time as the nodes in Dallas or DC. This worked because of the kinds of data we processed so that our 30 second failover time was acceptable AND a small window of lost writes was also OK.

I think another interesting approach is to have your nodes explicitly communicate write status when responding. Basically when you write(node1, key, val) and expect node1 to normally propagate the write to node2, in the case of a network partition node1 would respond with ack(key, nodes_written=[node1]) explicitly excluding node2 from nodes_written if it was unable to push the write to node2 in a timely fashion. Similarly, read(node1, key) could have node1 return resp(key, val, inconsistent=true) or again spelling out which nodes do or do not know this value.

This would give the application a way to decide if the write should be considered successful or not based on what key represents. For example, updating the current position of a fast moving object that sends updates all the time could easily lose a write or two without it being a problem for the user, but a financial transaction could not.

Lastly, the approach I never explored but was curious about is the idea of the client being responsible for pushing all values to all nodes rather than replication happening in the background. This would effectively allow the write to happen simultaneously and also know if the write was a fail, a partial success, or a full success. Then a read could be done from just one node, but before returning it would poll other nodes and return the value that is consistent amongst the majority of the nodes or a failure if it could not reach any of them.
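
As a sketch of what that last idea might look like on the client side (the per-node `put` call and node names are hypothetical, not a real API): the client fans the write out itself and reports which nodes took it, so the application can decide whether "partial" is acceptable for this particular kind of data.

    def fan_out_write(nodes, key, value):
        """Client-driven replication sketch; node.put is a hypothetical per-node API."""
        written, failed = [], []
        for name, node in nodes.items():
            try:
                node.put(key, value)
                written.append(name)
            except ConnectionError:
                failed.append(name)
        status = "success" if not failed else ("partial" if written else "failure")
        return {"key": key, "status": status,
                "nodes_written": written, "nodes_failed": failed}

    # A fast-moving position update might tolerate status == "partial";
    # a financial transaction would treat anything but "success" as an error.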


Yeah, the post doesn't go into quorum decision-making with the "monitoring node", which helps enforce either eventual or strong consistency. I've found that you typically want a minimum of 3 nodes + a monitoring node (or it could be client-side) to have a read quorum (at least 2 nodes with consistent data).


> Lastly, the approach I never explored but was curious about is the idea of the client being responsible for pushing all values to all nodes rather than replication happening in the background.

You can do client replication at industrial scale by having clients push writes to Kafka and then have multiple, independent systems read and apply them. You still have a SPOF on Kafka, of course. Another practical issue is that if you take this approach you'll likely need a utility to detect and correct data drift. That's been a feature of the systems I've seen use this approach successfully.


What I was picturing was more like the client directly pushing to all the backend nodes simultaneously. That way the client knows right away whether it was a success, a failure, or a partial success. The whole problem with any of this is that writes can be a partial success, and what that means cannot be correctly interpreted by the database cluster alone in the general case. Whether it's merely bad or catastrophic, or whether it can be recovered from, really depends on what the data is, and it should be up to the application to decide that. Instead we treat databases as if it's always success or failure, with maaaaybe some eventual-consistency semantics, and when a partition happens or a node is simply down, the application doesn't really know whether any given piece of data is fully, durably committed or only partially committed.


I don't find this position particularly helpful, unless you're talking to someone who takes CAP literally as a law and picks only 2. That's never been the case in discussions I've had.

It's still extremely useful as a mental model for tradeoffs in designing a system.

All it's telling you is that there's a pendulum from availability to consistency: the more you want of one, generally the less you get of the other, so choose the amounts you want of each deliberately.

People take things too literally and seriously. Has anyone in their company really argued, "we should choose availability instead of consistency"? It's way more nuanced than that, and as far as I know, everyone knows that.


> Has anyone in their company really argued that, "we should choose availability instead of consistency"? It's way more nuanced than that, and as far as I know, everyone knows that

… I'm usually experiencing the CP/AP split from the seat of a user, using the system. From where I sit … yes, it certainly feels like people are, somewhere, saying "we should choose availability", given the number of systems I've had to interact with that are trivially not CP.

Specifically, refer to https://jepsen.io/consistency — these are better, more specific terms, IMO; the number of systems that don't obey "Read Your Writes" (at the very bottom of the tree!) is pretty stark. Many Azure services, for example, are not read-your-writes; this means I'm perpetually wrapping things in loops that attempt to wait for the upstream system to become consistent, which is impossible to do in any manner that's foolproof, prior to moving on to the next API call (which would otherwise fail, if it depends on a write from the prior API call, but can't read it, because read-your-writes). Off the top of my head, I've seen empirical violations of read-your-writes in all of ARM, AAD, ACR. My latest container registry is also not read-your-writes.

S3 used to fail read-your-writes, in certain circumstances. (That have since been fixed.)
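
The loops being described look roughly like this; `read_back` and the timing constants are stand-ins, not any particular Azure or S3 API, and choosing the timeout is exactly the guesswork being complained about:

    import time

    def wait_until_visible(read_back, expected, timeout=60.0, interval=2.0):
        """Poll until a prior write is visible upstream, or give up after `timeout` seconds."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if read_back() == expected:
                return True
            time.sleep(interval)
        return False  # gave up; the next dependent call may still fail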


Yep, seems reasonable. Kind of like Carnot efficiency for heat engines, it’s old and boring and we should stop talking about it.



