“high availability through Paxos based replication”. I thought Paxos is supposed...

chousuke · on May 30, 2021

HA isn't really about 100% availability. Any single system that promises that is very likely misleading you somehow. Your in-flight query is going to get interrupted no matter how fancy your clustering is, and I struggle to even come up with hypothetical use cases where this is something you can't afford to have happen, ever.

All you need is that in the event of a failure the clustered system can still recover quickly enough (to a well-defined state!) that the application layer can deal with the transient failure without significant impact on users, maintaining the illusion of availability.
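A minimal sketch of what "the application layer can deal with the transient failure" might look like, assuming the client library raises some retryable error during a failover. `TransientError` and `run_with_retry` are hypothetical names for illustration, not part of any particular database driver:

```python
import time


class TransientError(Exception):
    """Hypothetical error a client might raise while the cluster fails over."""


def run_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The last failure is re-raised so the caller still sees a real outage.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

This only makes sense if the operation is safe to repeat (idempotent, or wrapped in a transaction that was rolled back), which is where the "well-defined state" part matters.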

chii · on May 30, 2021

> I struggle to even come up with hypothetical use cases where this is something you can't afford to have happen, ever.

a rocket is using this query to adjust its thrusters ;)

chousuke · on May 30, 2021

Such a rocket would likely have two or more independent systems that would each have to agree on the adjustment, so one of them temporarily failing would not pose a problem. Though I doubt there are any rockets using database queries as part of their control system.

In those kinds of systems I suspect the approach is to enumerate every possible scenario and prove that the system behaves correctly in all of them, and if you can't do that, the system may be too complex and you need to redesign it to be simpler so that you can guarantee that it does not fail.

jlokier · on May 30, 2021

You have N >= 3 nodes, or N >= 2 and a non-compute arbitrator. One of them goes down, stops responding. You still have quorum, data processing continues just fine. That's high availability in a CP system.
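The quorum arithmetic can be sketched roughly like this, assuming a simple strict-majority rule and counting the arbitrator as one extra voter. `has_quorum` is an illustrative helper, not any real system's API:

```python
def has_quorum(total_voters: int, reachable_voters: int) -> bool:
    """True if the reachable voters form a strict majority of all voters."""
    return reachable_voters > total_voters // 2


# Three data nodes, one goes down: 2 of 3 is still a majority.
three_nodes_one_down = has_quorum(total_voters=3, reachable_voters=2)

# Two data nodes plus an arbitrator (3 voters), one data node goes down.
two_nodes_plus_arbitrator = has_quorum(total_voters=2 + 1, reachable_voters=1 + 1)

# Two data nodes with no arbitrator, one goes down: 1 of 2 is not a majority.
two_nodes_one_down = has_quorum(total_voters=2, reachable_voters=1)
```

The last case is why the arbitrator is there: with only two voters, losing either one leaves the survivor unable to tell a dead peer from a network partition.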