KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum (apache.org)
139 points by telekid on Aug 2, 2019 | hide | past | favorite | 62 comments



Note: this is not about replacing ZooKeeper in general with Kafka as the title might suggest, it's about Kafka considering an alternative to its internal use of ZooKeeper:

"Currently, Kafka uses ZooKeeper to store its metadata about partitions and brokers, and to elect a broker to be the Kafka Controller. We would like to remove this dependency on ZooKeeper. This will enable us to manage metadata in a more scalable and robust way, enabling support for more partitions. It will also simplify the deployment and configuration of Kafka"


The current heading is ambiguous as well.


Honestly, the new one is even less informative than the original.


Well, if the original gave the impression that Kafka and ZK were the same kind of software, it must have been pretty bad.


In the days before hosted Kafka implementations were readily available, I tried to set up Kafka in our AWS infrastructure. Getting Zookeeper working in a world based on autoscaling groups was a nightmare. It felt like it was built for the days where each server was a special snowflake.

Looking forward to seeing if this gains traction.


Using ASGs with zookeeper is kind of a pain, but also not very necessary. Production zookeeper clusters are almost always either 3, 5, or 7 nodes and it's very rare to change that once a cluster is in production. Given that this is all very static, the easiest thing is to just use terraform or similar to create the ec2 instances. If a node dies, you re-run your terraform to create a replacement.

Alternatively, you can create an ASG-per-node so that you get auto-replacement.

In my experience, ZK is one of the easiest and most reliable distributed systems to operate. I've only seen issues when it's used as a database instead of a distributed coordination service.


We ran into additional troubles based on ZK client bugs in the version that Kafka used. It would only ever resolve the ZK hostname once, at startup. Our workaround required some additional work to allocate an EIP for each host (3 AZs -> 3 EIPs) and then have the ZK host grab the EIP on startup.

Even though Netflix hasn't updated it in a while, Exhibitor was helpful as well, in that it allowed ZK to bootstrap nodes off of state stored in S3. That did come at the cost of an extra 2-3 minutes per node on initial quorum startup.


Might just need to set networkaddress.cache.ttl=60 in java.security.
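
If editing java.security on every host is a pain, the same knob can also be set programmatically before the first lookup. A minimal sketch (my own illustration, not anything from the ZK client itself):

    import java.security.Security;

    public class DnsTtlFix {
        public static void main(String[] args) {
            // Cache successful DNS lookups for 60s instead of the JVM default,
            // so a replaced ZK node's new IP eventually gets picked up.
            Security.setProperty("networkaddress.cache.ttl", "60");
            // ... start the ZK/Kafka client after this point ...
        }
    }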


This is just an FYI type of comment. With ZooKeeper, one needs an odd number of machines, usually not greater than seven. Servers are indexed from 1 to n. They're replaced one for one, so the server id matches. Scaling it up and down is a bit of a pain because every server needs to know all other servers. Changing the ensemble members requires restarting every ZooKeeper instance because ZooKeeper doesn't support config reload.

Zookeeper does expect a single “host” for every server but it runs perfectly fine with docker. But that’s no different from consul or etcd.
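
To make the "every server needs to know all other servers" bit concrete: the whole ensemble is listed statically in each node's config, keyed by server id (hostnames made up, just to show the shape):

    # zoo.cfg on every node; each host's myid file holds its own id (1..3)
    server.1=zk1.example.internal:2888:3888
    server.2=zk2.example.internal:2888:3888
    server.3=zk3.example.internal:2888:3888

Replacing a dead server.2 means bringing up a new box that takes over id 2, which is why the one-for-one replacement works.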


There are some other consensus systems that have trouble when members disappear or arrive permanently, and this has kinda been kicking around in my head for a while:

Should it not be the case that one of the responsibilities of the quorum is to vote new members in or out? I mean, as a first-class feature of the system.

The main problem I see is that in most consensus systems, any member can be nominated as leader. But if the member was inducted during a partition event - which one could do if the partition were long duration - then nominations for new leaders will go out. What happens if the new member gets elected? How do the partitioned machines find that leader when they return?

So the process of joining the quorum would have to be incremental, because demanding a unanimous vote to add a machine means you can never replace a dead one (except by impersonating it).


> Should it not be the case that one of the responsibilities of the quorum is to vote new members in or out? I mean, as a first-class feature of the system.

Indeed. What you're describing is one of the main motivations for Raft. Paxos showed that distributed consensus was mathematically sound, but did little to guide implementors in actually building such a system. Raft is not a fundamentally new consensus algorithm; it's an incremental refinement that formalizes many of the changes you needed to make to Paxos anyway, like membership changes, log compaction, and multi-decree support from the get-go.

If you're interested, the Raft paper is quite readable and goes into this in detail. [0]

> The main problem I see is that in most consensus systems, any member can be nominated as leader. But if the member was inducted during a partition event - which one could do if the partition were long duration - then nominations for new leaders will go out. What happens if the new member gets elected? How do the partitioned machines find that leader when they return?

This isn't actually the tricky bit, as it turns out. Communication flows from the leader to the other nodes, so the leader, even if it's a newly-inducted node, will initiate the connections to the partitioned nodes when the partition resolves. (The leader necessarily knows the addresses of the partitioned nodes, because the leader knows about all committed entries, and the identities of all the nodes in the cluster are committed into the Raft log.)
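
A crude sketch of that last point (hypothetical types, not from any real implementation): membership is just more replicated data, so whoever ends up leader has already replayed the config entries and knows every peer's address:

    import java.util.Set;

    // Purely illustrative; real Raft implementations differ in the details.
    record Peer(String id, String host, int port) {}

    interface LogEntry { long term(); }

    record DataEntry(long term, byte[] payload) implements LogEntry {}

    // A membership change is committed like any other entry...
    record ConfigEntry(long term, Set<Peer> members) implements LogEntry {}

    // ...so the leader's answer to "who do I send AppendEntries/heartbeats to"
    // is simply the members of the latest committed ConfigEntry, including
    // nodes that are currently partitioned away.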

Again, the Raft paper does a great job explaining cluster membership changes—much better than I can!

[0]: https://raft.github.io/raft.pdf


I've seen quite a few visualizations, a few descriptions and a number of conversations about Raft and somehow adjusting the membership automatically never came up.

Goes to show you should always go back to the source at some point, even if the third-party descriptions seem more approachable.


Zookeeper has implemented dynamic reconfiguration since circa 2013-2015, but it required support for the feature in clients as well as servers. That said, implementing an autoscaler that manages that as well was not trivial, in my experience.


This is only half true. Yes, the dynamic reconfiguration feature has been in trunk for a while, but ZooKeeper 3.5, which is the first GA release to include it, was only released 4-8 weeks ago.


Another FYI type comment I guess. :) Some of my more gripey statements here may be outdated info, so DYOR I guess.

Dynamic reconfig in 3.5 addresses the "restarting every zookeeper instance" problem. [0] You stand up an initial quorum with seed config, then tie in new servers with "reconfig -add". Not sure how well it would tie into cloudy autoscaling stuff though. I wouldn't start there myself.
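
For completeness, the reconfig described in [0] is exposed both through zkCli ("reconfig -add server.4=...") and through the Java admin client. Rough sketch, untested, hostnames made up:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;

    public class AddZkNode {
        public static void main(String[] args) throws Exception {
            ZooKeeperAdmin admin = new ZooKeeperAdmin(
                    "zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});
            // Incrementally join a hypothetical fourth server to the ensemble;
            // same effect as "reconfig -add" in zkCli.
            admin.reconfigure(
                    "server.4=zk4.example.internal:2888:3888;2181", // joining
                    null,  // leaving
                    null,  // newMembers (only used for non-incremental reconfig)
                    -1,    // fromConfig: don't condition on a config version
                    null); // Stat out-param, not needed here
            admin.close();
        }
    }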

A much bigger pain IMO is the handling of DNS in the official Java ZK client earlier than 3.4.13/3.5.5 (and by association, Curator, ZkClient, etc.). [1] The former was released mid-2018 and the latter this year, so there's tons of stuff out there that just won't find a host if IPs change. If you "own" all the clients it's maybe not a problem, but if you've got a lot of services owned by a ton of teams it's ... challenging.

Even with the fix for ZOOKEEPER-2184 in place I'm pretty sure DNS lookups are only retried if a connect fails, so there's still the issue of IPs "swapping" unexpectedly at the wrong time in cloud environments which can lead to a ZK server in cluster A talking to a ZK server in cluster B (or worse: clients of cluster A talking to cluster B mistakenly thinking that they're talking to cluster A). I'm sure this problem's not unique to ZK though.

Authentication helps prevent the worst-case scenarios, but I'm not sure if it helps from an uptime perspective.

TL;DR: ZK in the cloud can get messy (even if you play it relatively "safe").

[0] https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.ht...

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-2184


Good to know! Thank you! Haven’t dealt with zk in detail since ~2016.


ASGs + Zookeeper is really easy, but you have to use https://github.com/soabase/exhibitor. It coordinates replacing downed ZK nodes using S3, and also provides a bunch of extra management tools.

I had a single Zookeeper ASG running under load for over three years without maintenance. I pinged my old co-founder to see if he's willing to open source the CloudFormation template.


Relieving to hear this as I've had almost exactly the same experience and it made me feel like an actual idiot.


Yes, it was. More specifically, Zookeeper is designed as a piece of infrastructure you put together and then rely on.


Isn't managing consensus extremely hard to do? Wouldn't one want to rely on a proven solution rather than spinning up a new solution?


My thoughts exactly. I don't work with Kafka, but I do heavy work with Solr that also has ZK as a dependency.

To end users of whatever platform it is keeping together, Zookeeper is often seen as an unwanted dependency. The unwanted sentiment arises because maintaining zookeeper is not a gimme, and it takes additional knowledge and overhead to maintain. You need a quorum to be kept alive with an odd number (greater than one) of instances. So the question often arises, "how can we get rid of zookeeper?".


The document mentions using Raft for consensus and coordination - it's the same approach used by Etcd, Consul, Serf, RethinkDB and other systems. AFAIK it's easier to implement and understand than Zookeeper's consensus solution.

https://raft.github.io


Kafka was thinking about etcd:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-273+-+...

Considering most Kafkas will probably run in Kubernetes at some point, they could have shared the etcd used by Kubernetes.


There is also Zetcd[0] if you want to use etcd and Kafka together today. We use it right now and it seems to work.

[0] https://coreos.com/blog/introducing-zetcd


Sounds like a bad idea, no? The last thing you want is the whole cluster going down because Kafka is misbehaving.


I don't know. First, there may be some separation achievable - but I'm just guessing. Second, if Kafka is your primary workload, you don't want it to misbehave in any case. But of course I get your point. I'm just saying I've seen people thinking about it that way in some Github Issues.


Not to mention that ZooKeeper passed Jepsen (on the first try IIRC). Don't replace something so extraordinarily dependable just because it's a dependency.


Zookeeper can also be embedded.


As can etcd.

Edit: Oh right, the fact that etcd is golang might make that an issue for Kafka...


As an operator, having fewer dependencies I have to deploy and worry about is a boon to my productivity.

If you were building a Kafka-like system in house for private use only, then yes, I agree with your sentiment. If you are building something to be used by hundreds or thousands of organizations, then the cost-benefit tradeoffs shift to where it probably makes sense to pull the consensus logic into the primary application itself.


Note that you still have the same number of fault domains if you read the actual proposal.

It does not simplify the system. It is simply replacing the ZK ensemble w/ its own raft impl.

Same number of JVMs, same operational complexity.


Thanks for pointing that out. That's what I get for not actually reading the proposal.


As an operator, shouldn't you worry if something battle-tested, that implements extremely critical functions that are tricky to implement correctly, is replaced with something that is unproven?

As someone who enjoys sleeping at night, I wouldn't poke something like that with a stick before it has matured for a few years, and before many brave (?) operators have smoothed out most rough edges by landing on them repeatedly with their faces.


Yeah, don't deploy it when it's in beta or in its first few releases. At some point it will be stable and safe, at which point everyone from then on will benefit.


Finally somebody said it. Yes, it is hard. Zookeeper is time tested and I don't think there is anything revolutionary that lets you keep state with fewer than 2n nodes. Either Kafka will compromise on consistency or end up with a solution that is about as performant as Zookeeper. Maybe they are looking for something more specific to Kafka that can be done with a simpler algorithm than Zab?


Reading the proposal, it seems like they want to replace it because the ZooKeeper API is causing some issues for them, so even if ZooKeeper has perfect consensus, Kafka itself won't necessarily. From the "Metadata as an Event Log" section:

> although ZooKeeper is the store of record, the state in ZooKeeper often doesn't match the state that is held in memory in the controller

Maybe it would be better to put the effort in to changing ZooKeeper instead of writing their own consensus system in Kafka, but I would assume they know that tradeoff better than us since they work so closely with ZooKeeper.


I do find that statement really confusing. That implies they could have errors in the code handling the consensus? Hard to believe but otoh, they’re not really saying anything about zookeeper causing problems.

Maybe the correct solution is to understand why the states don’t match in the first place?


No, contrary to the sibling comments I would disagree that consensus is difficult. The theory can be somewhat daunting, and ZooKeeper's consensus protocol is complex. The most advanced consensus protocol today, Paxos (and its variations), is notoriously hard to understand.

But then we got the Raft paper, which has essentially single-handedly managed to commoditize distributed consensus. Raft is an elegant and simple protocol that has been implemented in many languages. To implement primitives like leader election and consistent, distributed reads/writes, you can grab an off-the-shelf library like Hashicorp's Raft library, which does all the heavy lifting as long as you implement the state storage yourself. It's absurdly trivial to turn a single-node app into a replicated, fully redundant one.

Of course, a project like Kafka might have particular requirements (I wouldn't know) that would mean Raft would not be a suitable solution for them, or would require modifications to the protocol. CockroachDB, for example, had to modify Raft ("MultiRaft") in order to achieve the scalability they needed within Raft's paradigm. Raft is a starting point more than a finished solution.


I'd read through some of Aphyr's jepsen tests [0] if you think implementing consensus algorithms correctly is easy. Nearly every distributed database he's looked at has consistency bugs (with the notable exception of zookeeper).

[0] https://jepsen.io/analyses


I've read every single one of Aphyr's tests. Aphyr's analyses probe a broader range of distributed consistency problems in replicated stateful databases. Consensus is a necessary mechanism to achieve distributed consistency, but if you look at the problems highlighted in products like CockroachDB and TiDB, they relate to secondary effects of failures to ensure linearizability of reads and writes.

You can have a perfectly functioning consensus algorithm and still battle inconsistent reads across nodes, loss of data on node failure or network splits, write skew, and so on. For example, TiKV uses Raft, yet the problems uncovered in TiKV have nothing to do with Raft itself. Conversely, those found in Elasticsearch are absolutely caused by its consensus algorithm, Zen, which was invented in ignorance of modern consensus research (Paxos, etc.).

If you read my comment carefully, you'll note that it's above all an endorsement of Raft, which is a milestone in that it democratizes advanced consensus and makes it a tool anyone can use. My point isn't that anyone can trivially whip up an infinitely scalable, consistent database because consensus is a solved problem, but that we have solid foundational theory to build on, to the point where it is a well-understood problem (albeit with many complicated implementations) and not black magic.

Also, I did not use the term "easy". But I also think the OP's phrasing "extremely hard" is not true, unless your requirements are extremely hard – which, again, may be the case with Kafka; scalability tends to be diametrically opposed to consistency.


Raft is not going to be the last consensus protocol.

If that assertion is true, then the replacement will exist because something in Raft is difficult or impossible to do. In which case consensus is hard, for some quantity of hard.

The one that's most obvious to me is that Raft violates one of the eight fallacies of distributed computing: the network is homogeneous.

The star topology of messages means that some messages - identical messages with different destinations - will be competing for bandwidth. It also means electing the slowest machine on the longest network link is probably a bad idea. I've certainly heard of this sort of problem happening.


There are a number of Java libraries that implement Raft already -- it's a fairly well understood algorithm.


TIL: there is an Apache Incubator raft project: http://ratis.incubator.apache.org/.


Exactly, and tbh this kind of thing pisses me off. The "Kafka people" seem to imagine that what Zookeeper does is dead simple and they can bash together their own in a weekend and it'll be better.

Zookeeper is great. Weird, for sure, and a bit of a pain in the arse. But it will neither lie to you nor lose data and this is much, much harder to achieve than you might think.


Yes, on both counts. However, requiring zookeeper as a dependency is also a barrier to adoption and increases the operational surface area, so offloading consensus management to zookeeper is not without trade offs.


There are other proven solutions that can be embedded, such as Raft implementations.


Well, there's etcd, consul, etc.

edit: i guess i should RTFA -- looks like they are going to build raft into kafka -- at this point, I'd think raft libraries are fairly mature.


This sounds really similar to some of the work coming out of Vectorized on Redpanda [0]. They're building a Kafka API compatible system in C++ that's apparently achieving significant performance gains (throughput, tail latency) while maintaining operational simplicity.

[0]: https://vectorized.io/redpanda


Where do you see performance comparisons to Kafka? As far as I can tell, the product you linked doesn't exist outside of a blog, so this reads like a marketing post.


I don't quite understand why everybody and their mother are trying to remove Zookeeper from their setup.

In my past I've seen this many times and each time people went back to Zookeeper after a while, because - as it turns out - consensus is hard; and Zookeeper is battle hardened.


First, because it's yet another dependency. Consensus-based systems like CockroachDB, Dgraph, Cassandra, Riak, Elasticsearch, ActorDB, Rqlite, Aerospike, YugaByte, etc. are wonderfully easy to deploy because they can run with no external dependencies. (Their consensus protocols may have different problems, but that's beside the point.)

Secondly, it's a pretty heavy dependency. The JVM is RAM-hungry and it's difficult to ensure that it always has enough RAM. Running multiple JVM apps on a single node must be done carefully to make sure each app has enough headroom. It consumes considerably more RAM than Etcd and Consul.

Thirdly, I think it's fair to say that ZK is showing its age. It's notorious for being hard to manage (see the other comments in this thread), with a fairly old design (based on the now-ancient Google Chubby paper) that, while resilient, is also less flexible than some other competing designs.


Is no-one else running into FastLeaderElectionFailed? When you have a system that writes a lot of offset/transaction info to zookeeper, you can push the zxid's 32-bit counter to roll over in a matter of days. When this happens it can bring zookeeper to a grinding halt for 15 minutes after 2 nodes try to nominate themselves for leadership and the rest of the cluster sits back and waits for a timeout.
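
Back-of-the-envelope, with my own numbers rather than anything measured: the zxid's low 32 bits count transactions within an epoch, so there are 2^32 ≈ 4.3 billion to burn through; at a sustained ~10,000 ZK writes/sec that's about 430,000 seconds, i.e. roughly five days before the counter wraps and forces an election.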

https://issues.apache.org/jira/browse/ZOOKEEPER-2164

https://issues.apache.org/jira/browse/ZOOKEEPER-2791

There have been past requests (I can't find them in JIRA at the moment, so I need to paraphrase) for a call to initiate a controlled leadership move to another node, and they have been turned down as "you don't need this", yet leader election fails in some circumstances! In addition there's no command or configuration option to disable FastLeaderElection.

So the zookeeper maintainers keep operators limited to having to flip nodes off and on again, which is really a bad way to manage software because it impacts clients as well as leadership (and even if clients recover, most code that I've seen likes to make some noise when zk connections flap). I would really like to eliminate all use cases for zookeeper where there is a chance that the zxid will exceed the size of its 32-bit counter component in the span of, say, a decade, so that as an operator I don't have to set alerts on the zxid counter creeping up, or reset zookeeper and restart all of its clients (many versions of many zookeeper clients don't retry after connection loss, don't retry after a timeout, don't cope with the primary connection failing, will have totally given up after 15 minutes, etc.).

I think that the kafka maintainers have been doing a better job of actively maintaining their code and ensuring it works in adverse conditions, so I'm on board with this proposal.

Zookeeper isn't magic, it's just pretty good at most of what it does, and I think that projects that understand when they've pushed zookeeper into a bad corner may benefit from this kind of move, if they also have a good idea of how they can do better.


There's a toy Kafka implementation written in Go that attempts to do this: https://github.com/travisjeffery/jocko

Previous HN discussion: https://news.ycombinator.com/item?id=13449728


The author now works for Confluent.


To all the people wondering why "replace battle tested ZK" and how "consensus is hard": it's right there under the motivation header:

> enabling support for more partitions

I don't know if any of you have ever run a high-throughput Kafka cluster with a large number of partitions (as in, thousands of them), but it's not pretty. Rebalancing can easily take half an hour after a rollout, and throughput is degraded during that time. We recently had to move to shared topics because it became untenable.

This is a very welcome change!


I'm not too convinced of the approach. I've been anxiously waiting for https://vectorized.io/ to release their message queue. It is built in modern C++ and uses ScyllaDB's Seastar framework to do I/O scheduling in userspace, with better mechanical sympathy. And like Hashicorp's Nomad and Vault, which I'm a fan of, it has built-in distributed consensus and easy operation.


It would be nice if all the cloud vendors agreed on a key/value and/or consensus protocol that all servers in a cluster could connect to - and maybe even supported it via docker, natively, even if there's just one cluster member. Like plug-and-play for clustering tech. (Bonjour, basically, but suitable for cloud/enterprise software.)


I have written a Raft implementation in Java. If anyone from the Kafka project wants it, please contact me. It's not open source, but I own it and could make it so.


Not to take away from what you did but the Apache Foundation actually has Ratis which is a Raft implementation: http://ratis.incubator.apache.org/


Whoever wrote this doesn't know WTF a quorum is.


Care to elaborate?


We (alluxio.io) have gone through a similar process, replacing Zookeeper with CopyCat (a Raft implementation) for both leader election and storing the shared journal since Alluxio 2.0. Works pretty well.



