Is no-one else running into FastLeaderElectionFailed? When you have a system that writes a lot of offset/transaction info to zookeeper, you can push the zxid's 32-bit counter to rollover in a matter of days. When that happens it can bring zookeeper to a grinding halt for 15 minutes while two nodes try to nominate themselves for leadership and the rest of the cluster sits back and waits for a timeout.
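For a rough sense of scale (the write rates below are illustrative assumptions, not measurements): the low 32 bits of the zxid are a per-epoch transaction counter, so sustained write load exhausts an epoch faster than you might expect:

    # Back-of-the-envelope rollover math for the zxid's low 32 bits.
    # The write rates are illustrative assumptions, not measurements.
    COUNTER_MAX = 2 ** 32  # ~4.29 billion transactions per leader epoch

    for writes_per_sec in (5_000, 20_000, 50_000):
        days = COUNTER_MAX / writes_per_sec / 86_400
        print(f"{writes_per_sec:>6} writes/s -> counter exhausted in ~{days:.1f} days")

    # ~5k writes/s  -> ~9.9 days
    # ~20k writes/s -> ~2.5 days
    # ~50k writes/s -> ~1.0 days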
Past requests (I can't find them in JIRA at the moment, so I'm paraphrasing) for a call to initiate a controlled leadership move to another node have been turned down with "you don't need this", yet leadership election fails in some circumstances! On top of that, there's no command or configuration option to disable FastLeaderElection.
So the zookeeper maintainers keep operators limited to flipping nodes off and on again, which is a really bad way to manage software because it impacts clients as well as leadership (and even if clients recover, most code I've seen likes to make some noise when zk connections flap). I would really like to eliminate every use case for zookeeper where there's any chance the zxid will exceed its 32-bit counter component within, say, a decade, so that as an operator I don't have to set alerts on the zxid counter creeping up, or reset zookeeper and restart all of its clients (many versions of many zookeeper clients don't retry after connection loss, don't retry after a timeout, don't cope with the primary connection failing, will have totally given up after 15 minutes, etc.).
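To be concrete about the kind of alerting I mean, here's a rough sketch (host, port, and the 80% threshold are placeholders; on newer zookeeper versions the srvr four-letter-word also has to be enabled via 4lw.commands.whitelist):

    # Sketch: poll "srvr", pull out the zxid, and warn as the low 32-bit
    # counter approaches rollover. Not production code.
    import re
    import socket

    def zxid_parts(host="localhost", port=2181):
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        text = b"".join(chunks).decode()
        match = re.search(r"Zxid:\s*0x([0-9a-fA-F]+)", text)
        if not match:
            raise RuntimeError("no Zxid line in srvr output")
        zxid = int(match.group(1), 16)
        return zxid >> 32, zxid & 0xFFFFFFFF  # (epoch, counter)

    if __name__ == "__main__":
        epoch, counter = zxid_parts()
        usage = counter / 2 ** 32
        print(f"epoch={epoch} counter={counter} ({usage:.1%} of 32-bit range)")
        if usage > 0.80:  # placeholder threshold
            print("WARNING: zxid counter nearing rollover; expect a forced re-election")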
I think that the kafka maintainers have been doing a better job of actively maintaining their code and ensuring it works in adverse conditions, so I'm on board with this proposal.
Zookeeper isn't magic, it's just pretty good at most of what it does, and I think that projects that understand when they've pushed zookeeper into a bad corner may benefit from this kind of move, if they also have a good idea of how they can do better.
https://issues.apache.org/jira/browse/ZOOKEEPER-2164
https://issues.apache.org/jira/browse/ZOOKEEPER-2791