PigPaxos: Removing the Scalability Bottlenecks in Paxos

excerionsforte · on March 18, 2020

Interesting idea moving the reception of messages to other nodes and getting back smaller set of messages. Not only that but distributing the reception responsibility into relay groups where it can be load balanced among peers.

The leader using relay nodes reminds me of how humans organize into boss and worker groups and the boss shouts out orders to group leaders.

I would like to see how dynamic relay groups will perform under stress and I wonder who would communicate new relay groups. Choice of leader imposing relay group structure or groups self organizing.

hinkley · on March 18, 2020

I use Raft for the same reason I use Ethernet. As my distributed computing teacher said, it’s a terrible protocol but the best one we have.

Raft, and I believe Panos, contain some of the 8 Fallacies of Distributed Computing. The most unavoidable being assuming the network is heterogenous. I won’t say this keeps me up at night, but it distracts me when I’m eating lunch.

Mere mortals don’t have all identical hardware or all equidistant network ping times. And even if you start that way, upgrades will at least temporarily change that dynamic. Geographical redundancy will really change that dynamic. And often it’s when we are changing things that they break. So it’s all fine until you get new customers and have to upgrade and then all of your new customers get to watch you crash and burn during an upgrade.

Rabbit and Consul reduce this surface area by having two classes of users. I’ve heard of Raft variants with three (pure consumers, full voting members, and new or returning members who are still catching up) so that you can use machines unprepared for leadership and they will never try to elect themselves leader.

Right now we are still in a weird era where networking is so stupid fast that it’s faster to get data from another machine than from your own storage, but that can’t last forever. Berkeley did some really crazy things when this inversion of costs happened in the 80’s, but by the time I learned about it less than 10 years later, we were already tut tutting about how foolish that was.

I think the big thing I’ve been waiting for from a relay group type solution is for the participants to discover the network topology and elect a local representative to receive and rebroadcast the inbound stream of updates.

I don’t know that you have to funnel all traffic through that intermediary, the way statsd does. Just the cumulative event stream could represent an order of magnitude decrease in packets with only a slight increase in latency.

dnautics · on March 18, 2020

> Raft, and I believe Panos, contain some of the 8 Fallacies of Distributed Computing. The most unavoidable being assuming the network is heterogenous.

Did you mean homogeneous?

Also, Raft doesn't assume the network is homogeneous.

> I’ve heard of Raft variants with three (pure consumers, full voting members, and new or returning members who are still catching up) so that you can use machines unprepared for leadership and they will never try to elect themselves leader.

What's the purpose of these non-leader-competent nodes? Is it merely to increase the quorum membership and make consistency more difficult? Why wouldn't you architect this as a client/server relationship?

grogers · on March 18, 2020

Regarding the other types of nodes, pure consumers are useful as "standby" nodes. They are ready to be swapped in when another host fails, but don't participate in voting until they are swapped in. Swapping them in still requires a quorum of the previous nodes, so they are only useful in small numbers to be swapped in when another node fails. Nodes that are catching up are just a special case of this pure consumer type that shouldn't be promoted to be a full voting member until they are fully caught up.

I'll add another type of node that sometimes comes in handy in certain network topologies. I've typically heard this called a "witness" - it is a voting member but doesn't store any actual data (just metadata) so it can't be a leader. The typical use for this is if you have a data center cluster with only two DCs.

In that scenario you can make the number of nodes in each DC unbalanced so if the minority DC goes down you still have a quorum. But you can't make it so that if either DC goes down you can still have a quorum. So you add a third DC over a WAN that acts as an arbitrator between the two. Because it's far away, you don't want it to store all the data - keeping it in sync would be expensive. You only need it to know enough metadata to vote correctly.

This obviously changes the failure characteristics quite a bit vs having 3 DCs in the cluster like normal. Such as how many real nodes and witness nodes you can tolerate failing. Latency while one of your your main DCs are down is much higher than normal (because you now need acks from the witnesses) but you at least keep running.

dnautics · on March 19, 2020

Do these alternative topologies still have the proven guarantees that raft has?

senderista · on March 19, 2020

See Cheap Paxos.

hinkley · on March 18, 2020

> Did you mean homogeneous?

Yes

> Also, Raft doesn't assume the network is homogeneous.

Raft assumes that if the leader dies, that quick and orderly election relies on all members doing their 'duty' and putting themselves forward as the new leader if some counter says they are next. That means it assumes that any server can be leader.

What is implicit in this state of affairs is that all cluster members are equipped to perform the duties of the leader, which includes rebroadcasting all state changes - directly - to all other members. So every single machine has to have enough bandwidth to every other machine to comfortably ferry all of that data across.

That is a huge assumption, both historically speaking, and with geographical distribution of servers.

dnautics · on March 18, 2020

ah ok, so you mean the "node membership is homogenous". i thought you were talking about the transport layer.

hinkley · on March 18, 2020

I think we might still be missing each other.

Everybody has to be able to see everybody. Everyone sends their updates to the leader. The leader sends those to everyone else. When the leader dies, consensus is used to elect a new leader, based on 1) who asks first, and 2) whether everyone agrees they have the latest stuff. That is the consensus moment.

It is possible for one server in a remote data center to have enough inbound bandwidth to be 100% up to date when the election happens, but be throttled for outbound traffic. It may get itself elected and then be voted out the very next time that a spike of traffic occurs. If you have a couple of these, they could spend all day swapping around leadership.

And to be crystal clear: I'm not talking hypotheticals here. I'm talking about corner cases that people have had to confront in real systems.

I brought up the case of Consul earlier. I could have a bunch of servers locally, harassing a local Consul server to avoid making higher-latency requests over a potentially constrained interface from, for instance, a branch office. You wouldn't want any of the branch offices to ever be leader unless the home office is completely offline. One of the 3 servers located there should always be online, even in a double fault situation (eg, someone is upgrading one server and restarts the wrong one).

dnautics · on March 19, 2020

Cool, that's a great example, and I see exactly why it's a problem (I've implemented raft).

Thank you for clarifying.

excerionsforte · on March 18, 2020

I was thinking about datacenter/network topology aware dynamic groups as one of the ideas for how a leader would impose structure or how a group would self organize. All nodes could keep statistics, as well as use user defined metadata, and when they are chosen as leader or group leader to figure out group structure to minimize latency considering latency time is based on slowest path to leader.

eternalban · on March 18, 2020

[edit: comparative performance to epaxos is noted in OP - a x10 improvement is claimed.]

Egalitarian Paxos (2013): https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf

code: https://github.com/efficient/epaxos

nano_o · on March 18, 2020

Great idea and great work!

A couple nitpicks: it would be nice to see what happens when the leader fails. Optimizing for the case of a stable leader might have impact on recovery time.

Another important aspect for fault-tolerance is whether you can really survive any minority crashing. For example, if only the strictly necessary number of nodes keep up with the leader, then if most of those crash the system will have a really hard time recovering due to the backlog accumulated at slow nodes which now need to catch up for the system to continue operating.

A performance number that does not take those things into account may not be very realistic. Nevertheless the idea is pretty good.

tptacek · on March 19, 2020

Doesn't Multi-Paxos already have stable leaders? My understanding was that the innovation here was to relay prepare/promise/accept/accepted across a random relay network.

nano_o · on March 19, 2020

Yes, it's a nitpick. The comparison to Multi-Paxos seems fair because it makes similar assumptions (unless re-configuring the relay network after a leader failure is somehow difficult, but I wouldn't expect that).

My point is that it would be nice to benchmark protocols that take into account the issues I brought up, and measure what happens in the worst failure scenarios they are supposed to tolerate. Otherwise we get a false sense of what performance can be achieved if one really cares about fault-tolerance.

This small issue does not diminish the main contribution of the paper in any way.

hadronzoo · on March 18, 2020

Is this a similar concept to compartmentalized consensus? https://mwhittaker.github.io/publications/compartmentalized_...

mad44 · on March 18, 2020

The compartmentalized consensus separates follower as acceptor and replica, and divorces command-log replication from data replication. It also seems to use two acceptors group as in horizontal scaling.

The compartmentalized consensus does not have relay nodes that do relay/aggregation. The idea in PigPaxos is simply that using randomized relay/aggregators have surprising power for vertically scaling Paxos.

The bipartisan Paxos, seems to apply the compartmentalized consensus idea to EPaxos.

zinclozenge · on March 18, 2020

Interesting that there is another Paxos variant that was published to arxiv around february called BipartisanPaxos https://mwhittaker.github.io/publications/bipartisan_paxos.p.... It too aims at removing bottlenecks.

mad44 · on March 18, 2020

The compartmentalized consensus separates follower as acceptor and replica, and divorces command-log replication from data replication. It also seems to use two acceptors group as in horizontal scaling.

The compartmentalized consensus does not have relay nodes that do relay/aggregation. The idea in PigPaxos is simply that using randomized relay/aggregators have surprising power for vertically scaling Paxos.

The bipartisan Paxos, seems to apply the compartmentalized consensus idea to EPaxos.

senderista · on March 19, 2020

You can achieve constant write throughput and read latency and linearly scalable read throughput (at the cost of write latency) with LCR, a criminally unknown uniform total order broadcast protocol.