What happens to your cluster when your network experiences intermittent packet l...

armon · on Oct 23, 2013

I'd highly recommend taking a look at this page: http://www.serfdom.io/docs/internals/gossip.html. One of the great attributes of the gossip protocol is that it is very robust to intermittent network failures. Under minimal packet loss conditions (<5%), the rate of false positives should be very low. This is due to a few techniques, one of which is indirect probing, and another is a novel "suspicion" mechanism. In the case of a network partition, the parts of the cluster can run in isolation and will recover when the partition heals. If you are interested, the paper referenced there ("SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol") is the foundation of Serf. In the paper you can find more details about the behavior of the cluster, false positive rates under packet loss, and partition handling.
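
To make the indirect probing point concrete, here is a rough back-of-the-envelope sketch, not Serf's actual code. It assumes every packet is lost independently with the same probability, that a direct probe needs two packets to get through (ping and ack), and that an indirect probe through a relay needs four (ping request, ping, ack, relayed ack). The function name `probe_round_fails` and the choice of 3 indirect peers are illustrative assumptions, not values taken from the Serf docs.

```python
def probe_round_fails(loss: float, indirect_peers: int) -> float:
    """Chance that one probe round gets no ack at all from a healthy node.

    Assumes independent, uniform packet loss (a simplification; real
    networks tend to lose packets in bursts).
    """
    if not 0.0 <= loss <= 1.0:
        raise ValueError("loss must be between 0 and 1")
    if indirect_peers < 0:
        raise ValueError("indirect_peers must be non-negative")
    direct_ok = (1.0 - loss) ** 2    # ping + ack both arrive
    indirect_ok = (1.0 - loss) ** 4  # ping-req, ping, ack, relayed ack all arrive
    # The round fails only if the direct probe AND every indirect probe fail.
    return (1.0 - direct_ok) * (1.0 - indirect_ok) ** indirect_peers


if __name__ == "__main__":
    for loss in (0.01, 0.05, 0.20):
        direct_only = probe_round_fails(loss, 0)
        with_indirect = probe_round_fails(loss, 3)  # 3 relays: hypothetical setting
        print(f"loss={loss:.0%}  direct only={direct_only:.4%}  "
              f"with 3 indirect probes={with_indirect:.4%}")
```

Under these assumptions, a failed round only makes the node a suspect. It still has the suspicion timeout to refute the claim before it is marked failed, so the real false positive rate should be lower still than what this prints.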

tl;dr the system is in fact designed with network errors in mind, rather than handling them as an afterthought.

0xbadcafebee · on Oct 23, 2013

What you're saying is it's designed with the knowledge that it's going to cause false positives, and basically doesn't work well under anything more than minimal packet loss. I think this is probably an important factor to note in the description (and I still fail to see how this is considered highly available or fault tolerant, as described in the Intro pages).

armon · on Oct 24, 2013

I think we are maybe just working with different definitions. High Availability for Serf means that it can continue to handle changes in topology and deliver user events in the face of node failures and network problems. However, it is inevitable that there will be a degradation in its performance given network failures. If there are serious packet loss issues, Serf will mark a node as failed.

I'm not saying it "won't work well". It works as it is designed to. It will be available for operations, it will automatically heal when the partition recovers, and the state will be resynchronized with the "failed" nodes. The system will be in an eventually consistent state, which is expressly documented and is its normal mode of operation.

If you consider 5% packet loss "minimal", I'm not sure what applications you are running. TCP degrades at over 0.1% packet loss, and most UDP streaming protocols have serious degradation over 5%.
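
For a rough sense of how quickly TCP throughput falls off with loss, here is a sketch using the well-known Mathis et al. approximation, throughput ≈ (MSS / RTT) · (C / √p) with C ≈ √(3/2). It only models steady-state Reno-style congestion avoidance under light random loss, so treat it as an order-of-magnitude estimate. The 1460-byte MSS and 50 ms RTT are made-up example values.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Approximate steady-state TCP throughput in bits per second.

    Uses the Mathis et al. approximation; it is not accurate at high
    loss rates, where retransmission timeouts dominate.
    """
    if mss_bytes <= 0 or rtt_s <= 0.0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0.0 < loss <= 1.0:
        raise ValueError("loss must be in (0, 1]")
    c = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss))


if __name__ == "__main__":
    # Hypothetical link: 1460-byte MSS, 50 ms round-trip time.
    for loss in (0.001, 0.01, 0.05):
        mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
        print(f"loss={loss:.1%}  approx throughput={mbps:.2f} Mbit/s")
```

Because the estimate scales with 1/√p, going from 0.1% to 5% loss cuts the predicted throughput by a factor of about 7 in this model.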

0xbadcafebee · on Oct 24, 2013

I'm still confused. You mention resynchronizing when the "partition" "recovers". First, can you clarify what a partition is? Second, can you define "recovery"? I'm not worried about performance degradation, I'm worried about nodes being marked down when they aren't down.

Please correct me if I'm wrong, but it sounds like this software only works reliably when you have two sets of nodes that suddenly can't communicate at all, and are eventually reconnected. Sometimes that does happen on a real network, but often the cause of failures is intermittent and undetermined for hours, days, or weeks. In this case, how would this program work? Would network nodes keep appearing and disappearing, triggering floods of handler scripts, loading boxes and keeping services unavailable?

Yes, TCP performance does degrade under packet loss. It also continues to operate (at well over 50% loss) and automatically tunes itself to regain performance once degradation ends. And it does not present false positives.

It maintains its own state (ordered delivery), checks its own integrity, stands up to Byzantine events (hacking), and is supported by any platform or application. Unfortunately, due to its highly-available nature, it will eventually report a failure to an application if one exists. But if latency is more of a priority than reliability, UDP-based protocols are more useful.

If you're designing a distributed, decentralized, peer-to-peer network, that's cool! But I personally wouldn't use one to support highly-available network services (which are three out of the five suggested use cases for Serf).