An Illustrated Proof of the CAP Theorem (mwhittaker.github.io)
209 points by networked on July 14, 2018 | 71 comments



"CP/AP: a false dichotomy" https://martin.kleppmann.com/2015/05/11/please-stop-calling-... . Martin Kleppman is the author of "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" https://www.amazon.com/Designing-Data-Intensive-Applications...


I think the CP/AP dichotomy is a good example of how we treat tradeoffs that are fundamental, but not harsh, as if they were hard binary choices.

For example, we often have a fundamental tradeoff between latency and throughput, and it's impossible to score 100% on both metrics. However, we can still do very well on both, and that's what matters in practice.

We have the same thing with CAP. You can build a CP system with high availability.


The fundamental tradeoff in CAP is latency/time versus consistency, since real-world networks are always partitioned. You are just using a different meaning of availability when talking about high availability, not the one from CAP.


When I speak about Datomic, my most frequent questions are about the CAP theorem, which Datomic seemingly subverts to do apparently impossible things (like the aforementioned Spanner): http://www.dustingetz.com/:datomic-cap-theorem/


Once it's clear what the terms mean, the theorem is so obvious as to scarcely need proof. If G1 and G2 can never talk to each other, of course you can't write data to G1 and read it back from G2.

I think the point is more about timing. If your client writes to G1, then you pretty much have to wait until the write has propagated to the entire network to acknowledge the write or accept some risk that some other client will read back stale data after the write has been acknowledged. I should think this would be obvious to coders also, but it is not at all obvious to managers.

I say "pretty much" above because there is technically a ridiculous third option where you essentially lock down the whole network with every read. But it still doesn't get you around CAP.


The combination of partition tolerance being mandatory and the timing caveat is why I prefer to think in terms of an extended CAP theorem called PACELC[1].

The first part, PAC, is your traditional CAP theorem - in the presence of partitions (P), you can provide either availability (A), or consistency (C). The second part, ELC, describes the system characteristics during the normal, non-partition case. It reads as "else (E), you can provide either low latency (L) or consistency (C)".

Even though, apart from some curious outliers, most systems are either PA/EL or PC/EC, I find the framing helpful for reasoning about a system in more than just the partition or failure case.

[1] https://en.wikipedia.org/wiki/PACELC_theorem


Your first paragraph is spot on. The rest is overstating things though.

"If your client writes to G1, then you pretty much have to wait until the write has propagated to the entire network to acknowledge the write"

PAXOS, for example, just needs it to be written to a majority of nodes.

If you write it to node "key % n", where n is the number of nodes, then only that single node needs to be up & available to all writers / readers.
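To make the contrast concrete, here's a toy sketch of both schemes, with nodes modeled as plain dicts and reachability passed in explicitly (all names made up):

    # Majority quorum, Paxos-style: the write succeeds once more than
    # half of the nodes have accepted it.
    def quorum_write(key, value, nodes, reachable):
        acks = 0
        for i, node in enumerate(nodes):
            if i in reachable:  # only reachable nodes can ack
                node[key] = value
                acks += 1
        return acks > len(nodes) // 2

    # Single-owner sharding: key % n picks the one node that must be up.
    def owner_write(key, value, nodes, reachable):
        owner = hash(key) % len(nodes)
        if owner not in reachable:
            return False  # the single owner is all that matters
        nodes[owner][key] = value
        return True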


I'm not really a fan of that proof because it's a proof by contradiction in the least helpful sense. It doesn't tell you why the theorem is true; only that some assumption was wrong. Much better would be if the proof were accompanied by examples of systems that had CA but not P, CP but not A, and so on, to show that this result is best possible.


So you are saying that showing that less is the best you can do would be the next step, and by extension you could omit the first step and go directly to showing that CP or AP is optimal. But showing the contradiction is what motivates the first step. Going the other way around, "thus CAP is not in the set of optimals" shows less and begets the question.


"begets the question" is a beautiful solution to the "begs the question" error, but I think in this case you meant "begs the question"?


There's really very little reason to ever use that expression. It's not a particularly good translation of petitio principii, which is itself not a good translation of a previous text. The Latin is a less ambiguous way to refer to the logical concept, and "begets" or "provokes" similarly avoids confusion with the common sense. That said, I do think you might be right about which they wanted.


As far as I can tell, begets is the original form; begs is a mondegreen. I am using this deliberately only since I came across [1], to say "escapes the question". Alas, that's just an unsubstantiated hunch. Maybe I should just write "escapes".

[1] https://en.wiktionary.org/wiki/Reconstruction:Proto-Indo-Eur...

... which is akin to Latin fugio, fugere -- to flee. The analysis of OE begietan as be+get strikes me as faux-etymology fashioned after behold and the like, where beholden has little to do with holding, all the same.

This is a hobby of mine to the point of becoming neurotic. It's all in vain if a play on words needs explaining. It's still kind of insightful, if you'll entertain me a little longer.

If bʰegʷ- means to run, flee, then how is bʰeg-, to break, related? Why is "to break" rather reconstructed as bʰreg-? Why is bʰegʷ- given an alternative form bʰewg-? Does preḱ-, to ask, fit in here; or prey-, or whatever it was with a related meaning? Maybe wreg-, whence wreck?

What about bag, pray, pay, fag, vag, way, weigh, etc etc.

Maybe "to break away" always meant "departure", "to go, run away". Why is hurried news called breaking? Why does German have "brandaktuell" instead, burning news? I have to break it to you, I don't know. I can already hear you begging the question to stop. But I have one more. Is a beggar someone living out of bags? I really don't want to know.


>It doesn't tell you why the theorem is true; only that some assumption was wrong.

A proof establishes truth; it does not typically explain. And yes, some assumption was wrong: the assumed premise. A system which is consistent, available, and partition-tolerant does not exist. The truth of the theorem that no such system can exist is proven. I imagine it's not the proof you dislike, but the article about it, which you wish had more explanation, which is reasonable. It would be a different matter to discuss which solution is "best possible", though, as that would vary greatly with the application. If you're running a bank, you would never want to sacrifice consistency. If you're running a blog, availability and partition tolerance are probably more important.


> If you're running a bank, you would never want to sacrifice consistency.

Banks do sacrifice CAP consistency. That doesn't mean they have no consistency at all, or anything like that; just that they don't wait for writes to become visible to all nodes.
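A toy sketch of that "don't wait" behavior, in the spirit of asynchronous replication (made-up names):

    import threading

    def replicate_later(node, key, value):
        node[key] = value  # lands eventually; reads elsewhere may be stale until then

    def write(key, value, local, others):
        local[key] = value  # visible here immediately
        for node in others:  # push to the other nodes in the background
            threading.Thread(target=replicate_later,
                             args=(node, key, value)).start()
        return "done"  # acknowledged before all nodes have seen the write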


Proof by example or analogy is not a proof.


Proof by example can be a proof. It depends on the theorem statement.


Sure, but given the proof by contradiction, the examples would then illustrate the tradeoffs that the theorem forces.


Prove that you can walk. Walk. Proved by an example. Basically, to prove that something exists, or that a property is not valid for every element of a set, you can prove it with an example.


Prove that you can't walk. Don't walk.

Hmmmmmm.

Of course you can prove something exists by an example. And you can prove a property does not always hold with a counterexample.

But if you want to prove that something doesn't exist, or a property always holds, then you shouldn't be looking for examples.


This is exactly what I said. My initial "proof by example" was an answer to the OP's statement:

> Proof by example or analogy is not a proof.

So as you also said "Of course you can prove something exists by an example".

To answer this: > Prove that you can't walk. Don't walk.

As I said, proof by example works only

> to prove that something exists or that a property is not valid for every element of a set.

Divide your lifetime into seconds. With that, "you can't walk" is the claim that the property "walking" isn't valid for any element of the set "lifetime". To prove that the property "can't walk" doesn't hold for every element of the set, just show an example of one second during which you were walking, and you're done.


The reverse proves something different. If you walk, you've proved that you CAN walk. Not walking is not a proof of a lack of ability. It does, however, give a possible counterexample to the assumption "if you can walk, you will walk".


This sounds wrong to me. Do visual proofs not count?


This proof is missing a step showing that the problem exemplified by the simple two-node system applies to all system topologies. This website only proves: "there exists a system for which CAP is impossible." I do not think the generalizing argument will be complicated, but it should be included.


I noticed this as well. Now I'm wondering -- how would one generalize this proof, keeping the same easy-to-grok style?


Once you understand exactly what is meant by the terms, the theorem seems immediately obvious. If your servers G1 and G2 cannot communicate with each other, of course you can't write data to G1 and read the updated data from G2.


What if you added a rule that if 2 nodes lose communication, then after some period T the lowest-ID node kills itself? And you add a delay of 2T to all client/server communication to give that a chance to happen.

Then in the example, the client writes to V1, V1 waits time T to communicate with V2, it doesn't manage to, so it returns failure to the client and kills itself. Then the client writes to and reads from V2 instead.
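In code, the proposed rule might look something like this (everything here is hypothetical):

    import sys, time

    def check_peer(self_id, peer_id, last_contact, T):
        # Proposed tie-breaker: after T seconds without contact, the
        # lower-ID node removes itself so the higher-ID node survives.
        if time.time() - last_contact > T and self_id < peer_id:
            sys.exit("partition suspected: lowest-ID node shutting down")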

Does this break availability because returning an error that you can't act on a client request counts as ignoring it in this informal definition?


Returning an error is basically saying "I'm not available", because there is no guarantee there is an alternative.

Your example can't deal with there being no path at all between V1 and V2. So even the client can't reach V2. In that case the entire system V is unavailable to the client.

Availability in CAP is about the case where you can reach a node, but not all of them, and the system failing to function because of that. I guess killing a node when a problem occurs can be argued to defeat CAP (all the nodes that are up can reach each other), but it definitely doesn't improve the situation.

Also, what does your system do when V5 out of a 5-node cluster crashes? Do the nodes V1-V4 kill themselves, because they can no longer reach a higher-ID node?


Ah I see what you mean, fair point - I'd agree that doesn't really work!

TBH my 2-node system was a simplified version of Corosync which I've worked with a bit, in that you allow quorum if you have half+1 of nodes up - so 4/5 up will maintain quorum. The highest-node lives rule is just a tie breaker in the case of a half/half split which the 2 node system always has when one node goes down. Good point though, that rule alone is definitely not a good idea.


Is partition intolerance the only piece of the three that disproves it? I'm curious to know more about how partition intolerance is handled.


The proof only shows that you can't get all three of C, A, and P. You can get any two out of the three, though.

In practice, partition tolerance is the most important property to have, since there's no point in having a distributed system if it can't function if any part of the system is down (it has to be fault tolerant).


There's no point in having a system if it's not available.

There's no point in having a system if it gives incorrect answers (inconsistent).

In the real world, you have to make tradeoffs.

"Partition tolerant" is "most important" because it's awkwardly factored out of Availability -- "tolerant" is a variant of "available". Your system can't be "partition intolerant", because that just means "not always available".

Another way of looking at it is what you imply: "partition tolerant" is a synonym for "distributed".


It seems you think about partition tolerance in terms of whether the whole system tolerates a partition. Usually partition tolerance is understood as the ability of every single node to accept operations (e.g. new writes) even if said node is on the minority side of a partition, while still operating within spec (i.e. the database offloads consistency problems to the application, usually also guaranteeing some form of eventual consistency).


> partition tolerance is the most important property to have, since there's no point in having a distributed system if it can't function if any part of the system is down

This is really a question of definitions, no?

If partition-tolerance is a defining property of a distributed database system, then of course they all have it. Plenty of database systems don't have partition-tolerance: non-distributed database systems.


Partition tolerance is not a defining property of distributed systems, just a very valuable one.

You only really want to sacrifice partition tolerance if the precise correctness of your data is not tremendously important. A CA system can always respond and appear correct, but network partitions result in desyncs. I believe this approach is fairly popular in video games, where a CA system for multiplayer can provide each client with a coherent gameplay experience at the expense of things just breaking if a partition forms. Sorry, game over, please try again.

Most serious systems, though, require P.

One potentially interesting example that uses the current fashion of the day could be to compare payment networks. Visa's distributed payments system is CP, which is to say that it's a consistent system and getting a successful response back means your payment has gone through, but you have no guarantee of getting a successful response. The bitcoin network is AP, which is to say that it's a highly _available_ system which will never fail a response (so long as you can reach one node), but getting a successful response is no guarantee that your payment has gone through.

In all cases, the sacrificed letter is not gone, merely imperfect. This does not mean that Visa is not Available, it just isn't _perfectly_ Available. It does not mean Bitcoin is not Consistent, it just isn't _perfectly_ Consistent. It does not mean that your video game is not Partition Tolerant, it just is not _perfectly_ Partition Tolerant.

You can sacrifice any of the three and still have a distributed system, it will just behave differently.


> Plenty of database systems don't have partition-tolerance: non-distributed database systems.

Non-distributed database systems are inherently partition-tolerant because they cannot break into partitions in the first place.


If partition tolerance is the most important property to have, you can have it, as long as you don't need "C", consistency.


Partition tolerance isn't about a system's behavior when some part of the system is down. Partition tolerance is about preventing diverging state in multiple sets of nodes that can't reach the nodes in another set, usually because of networking problems.


>when some part of the system is down

>usually because of networking problems.

In distributed systems, those are the same thing.

If I am a node, n1, communicating with various other nodes, the only way I can tell whether another node, n2, is alive or not is by sending messages to it and receiving responses. If the network is having issues and I can't communicate with n2, then for all intents and purposes, node n2 is down from n1's perspective.

Edit: Also, diverging of state relates to consistency and not partition tolerance.


There is a difference. If you cannot communicate with a group of nodes, they may still be processing requests: they might be merely unreachable, not down. So if you care about consistency, you must assume they are up and potentially processing requests, rather than assuming they are down and causing no trouble.


Partition tolerance is about preventing inconsistency (as you say) OR sacrificing availability (turning off the system when it partitions). That's what the CAP theorem is about.

You can get partition tolerance at the expense of availability, e.g. by refusing writes during a partition.
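As a tiny sketch of that choice (toy code):

    # CP-style behavior: during a partition, refuse writes so the two
    # sides cannot diverge -- consistency kept, availability given up.
    def write(key, value, store, partitioned):
        if partitioned:
            raise RuntimeError("partition detected: write refused")
        store[key] = value
        return "done"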


Are there solutions that don't meet the formal definition but still perform just as well as if they did in practical situations?

(As in: the formal definition is not what we actually need to get a consistent and reliable system for our real-world application. We need something a little less, and then suddenly it's possible to get a similar solution that actually provides C, A, and P.)


Google Spanner


Is it just me, or does the proof appear to be incorrect? In the last example, the write operation should simply block, not returning "done" to the client until it's able to replicate the state to the other server. In fact, in a fault-tolerant system there cannot be just two servers, because it would be impossible to achieve a majority vote.


The example looks fine to me. If the writer blocks until the partition is resolved, then it isn’t Available for writes, so only meets CP.


OK, if you define the availability of writes in this way, that they must complete immediately, then indeed the availability part of CAP is not met.

However his definition reads: "every request received by a non-failing node in the system must result in a response"

It doesn't say anything about the timing of response. The network partition will be resolved eventually and thus the write operation will complete. Or it can time out and return an error, which is also fine based on the definition of availability (must result in a response).


Yes, but if you're allowed to timeout, then any system is trivially available, even a system where the nodes are always offline (simply timeout every request).


I was under the impression that the partition need never be resolved.


Either the proof is incorrect or he/she set it up in a way to make it look extremely trivial.

The way I saw the proof go is: to satisfy CAP, you need replication, and you need high availability. To be consistent you need replication. But if you don’t have a reliable network, you don’t get reliable replication. Therefore you can’t be consistent. To me, this is a trivial result, and I’m not sure that this is what the CAP theorem intends to say?

It would be better to do as you say: assume that a system is CA, and prove why it can't be partition tolerant. Saying "the network is unreliable" isn't a suitable jump in the argument, IMO.


It's trivial because (when you say things like "high availability", which is neither precise nor part of CAP) you're ignoring the precision required for a valid proof.


Trivial is subjective. The sketch of the proof is simple, but you skipped the details and it's not obvious to everyone at first.


I know this is probably a stupid question, but please humor me. Since the Client node has bidirectional communication with both servers, why can it not be used as a channel to facilitate communication between the two servers when a direct link is not available?


I think in that case the client would essentially be another broker. It would need to understand the cluster topology, manage replication acks, etc. It would resemble a p2p protocol like BitTorrent.

I think that might also make split-brain situations more likely if all the client brokers didn't know the entire cluster topology. You'd end up with different data in different places if the clients are partitioned in addition to the brokers.


This proof is sound (for this simple structure), but does it imply you can never have all three (CAP) in some scheme where you can draw on many nodes and edges (not infinite) to handle partition failure while always being available and consistent? Perhaps a proof that you need to draw on infinite nodes and edges to achieve CAP would be interesting.

Disclaimer: I am not well educated in the distributed systems literature :(


I was thinking the same thing. The ability to handle partition failure comes from having more nodes. This proves it is not possible when having just two nodes. Kind of makes sense, as it stops being a distributed system if you disconnect all nodes.


The CAP Theorem thinks about guarantees, not probabilities.

There are various ways to reduce the impact and probability of partitions, but increasing the quantity of nodes does not make partitions impossible -- in fact, you now have to worry about n possible partitions (n being the number of nodes).

The only way to guarantee consistency under partitions is by majority vote. If you do that, you can no longer guarantee availability. For example, a cluster of 11 nodes might be partitioned three ways (2 + 4 + 5). No group holds a majority, so none of the nodes are allowed to answer, breaking the availability guarantee.
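The arithmetic, spelled out (toy code):

    n = 11
    majority = n // 2 + 1                       # 6
    partitions = [2, 4, 5]                      # the three-way split above
    print([p >= majority for p in partitions])  # [False, False, False]
    # No group reaches a majority, so no group may answer.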


If for any two nodes there is some path by which the nodes can communicate, the network is not partitioned. That's what partitioning means.


In asynchronous networks you never have that. You never know whether you can communicate with the other nodes until you try and get a response back. You can only reason about whether you were able to communicate with other nodes for communications that have already taken place.


Google's Spanner database is very likely an example of overcoming CAP. There are more details here: https://ai.google/research/pubs/pub45855. From what I've read, Google is able to do this because they control the physical data-transfer infrastructure (the nuts, bolts, and cables), which in many cases is the reason why CAP is "defeated".


>Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA.

>The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.

...

>Conclusion


>Spanner reasonably claims to be an “effectively CA” system despite operating over a wide area, as it is always consistent and achieves greater than 5 9s availability.

From the paper


The gist of the argument is that no DB gives 100% availability. Demanding that makes no sense: what if my client can't talk to anything? It would still have to return a value.

So instead we talk about high availability. All sorts of things cause downtime besides network partitions. If the network is reliable enough, then you can still achieve high availability.


Spanner is a CP system. Nobody can overcome a proven theorem: C, A, and P are impossible to achieve all at once.


Is there a paper about proving whether or not an arbitrary system is subject to the CAP theorem?


Having never seen this theorem before, it seems so trivial that it's completely useless. If you have no guarantee of having connected systems, then of course you can't always have consistency between them if they're required to answer all requests. Am I missing something here? What are the applications of this idea, beyond assigning a name to a triviality?


The immediate application of it is that when you choose a DB, you should be able to tell whether it's CP or AP. The field is full of subtleties (e.g. the network is not 100% reliable, so no system should claim CA; it's way subtler than that, but that is the gist of it. I'm no expert, btw).

Lots of DB vendors have tried to circumvent the theorem by introducing their own concepts or stretching the definitions: "We can beat the CAP theorem".


Is CA even possible in a distributed system, though? I mean, it's not like "the network never fails" is a design decision you're allowed to make (short of removing the network altogether and therefore not being a distributed system at all).


That's what Google Spanner is claiming: that Spanner is Consistent and highly Available (five nines, or 99.999% of the time available), which "in practice" they say is "CA".


From the perspective of the CAP theorem, Spanner is unquestionably a CP system, as the original paper [1] makes clear. (It relies on Paxos to elect leaders, and Paxos chooses consistency over availability.)

The "CA in practice" claims are purely marketing.

[1]: https://storage.googleapis.com/pub-tools-public-publication-...


It would be nice to have an overview of existing/possible workarounds.


The point is that eventually it's found out that there aren't any. They're all snake oil.

Usually companies find this out the hard way, with catastrophic loss of business data.


What about partial partitions? Let's say you have 7 nodes, and 4 of them can't see the other 3, while the client can see all of them.

There are systems tolerant to these partitions (see PAXOS).


Your definition of tolerant is incompatible with availability from CAP. The Paxos 3/7 minority in your example can't make progress when partitioned from the other 4 machines.



