
Thanks for sharing!

If anyone has any questions, I'll do my best to get them answered.

(Disclaimer: I work at ClickHouse)




Thanks for this excellent article! Enjoyed it from start to finish. It brought back good memories of the work we did at Docker, embedding our own replicated and consistent metadata storage using etcd's raft library.

Looking at the initial pull request, is it correct that ClickHouse Keeper is based on eBay's NuRaft library? Or did the ClickHouse team fork and modify this library to accommodate ClickHouse's usage and performance needs?


Yes, you are right: ClickHouse Keeper is based on NuRaft. We made a lot of modifications to this library, both for correctness and performance. Almost all of them (need to check) have been contributed back to the upstream ebay/NuRaft library.


1. can this be used without clickhouse as just a zookeeper replacement?

2. am i correct in that it's using s3 as disk? so can it be run as stateless pods in k8s?

3. if it uses s3, how are latency and costs of PUTs affected? does every write result in a PUT call to s3?


1. Yes, it can be used with other applications as a ZooKeeper replacement, unless some unusual ZooKeeper features are used (there is no Kerberos integration in Keeper, and it does not support the TTL of persistent nodes) or the application tests for a specific ZooKeeper version.

2. It can be configured to store snapshots, and RAFT logs other than the latest log, in S3. It cannot run as a stateless Kubernetes pod: the latest log has to be located on the filesystem.

Although I see you can make a multi-region setup with multiple independent Kubernetes clusters and store logs in tmpfs (which is not 100% wrong from a theoretical standpoint), it is too risky to be practical.

3. Only the snapshots and the previous logs could be on S3, so the PUT requests are done only on log rotation.


2. ok. so can i rebuild a cluster with just state in s3? eg: i create a cluster with local disks and s3 backing. entire cluster gets deleted. if i recreate cluster and point to same s3 bucket, will it restore its state?


It depends on how the entire cluster gets deleted.

If one out of three nodes disappears, but two out of three nodes are shut down properly and have written the latest snapshot to S3, it will restore correctly.

If two out of three nodes disappear, but one out of three nodes is shut down properly and has written the latest snapshot to S3, and you restore from its snapshot, it is equivalent to split-brain, and you could lose some of the transactions that were acknowledged on the other two nodes.

If all three nodes suddenly disappear, and you restore from some previous snapshot on S3, you will lose the transactions acknowledged after the time of this snapshot - this is equivalent to restoring from a backup.
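
The three cases above can be sketched as a small decision function. This is only an illustration: the names `Outcome` and `restore_outcome` are made up for this sketch, and generalizing the three-node cases to "a majority of nodes shut down cleanly" is an assumption based on how quorums work, not a statement about Keeper's code.

```python
from enum import Enum


class Outcome(Enum):
    CONSISTENT = "restores correctly"
    SPLIT_BRAIN_LIKE = "may lose transactions acknowledged on the lost nodes"
    BACKUP_LIKE = "loses transactions acknowledged after the snapshot"


def restore_outcome(total_nodes: int, clean_nodes: int) -> Outcome:
    """Classify a restore from S3.

    total_nodes: cluster size (3 in the examples above).
    clean_nodes: nodes that shut down properly and wrote the latest
                 snapshot to S3 before the cluster was deleted.
    """
    if total_nodes < 1 or not 0 <= clean_nodes <= total_nodes:
        raise ValueError("clean_nodes must be between 0 and total_nodes")

    # Assumption: a majority of cleanly stopped nodes is enough, because
    # every acknowledged transaction lives on at least one of them.
    quorum = total_nodes // 2 + 1
    if clean_nodes >= quorum:
        return Outcome.CONSISTENT
    if clean_nodes > 0:
        return Outcome.SPLIT_BRAIN_LIKE
    return Outcome.BACKUP_LIKE


if __name__ == "__main__":
    for clean in (2, 1, 0):
        print(f"3 nodes, {clean} clean: {restore_outcome(3, clean).value}")
```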

TLDR: Keeper writes the latest log to the filesystem. It does not continuously write data to S3. That could be tempting, but if we did, it would give a latency of around 100..500 ms, even in the same region, which is comparable to the latency between the most distant AWS regions. It still requires a quorum, and the support of S3 gives no magic.

The primary motivation for this feature was to reduce the space needed on the SSD/EBS disk.


Some time back, I tried using clickhouse-keeper as a ZooKeeper alternative with a few other systems like Kafka, Mesos, and Solr. I wrote some notes here: https://pradeepchhetri.xyz/clickhousekeeper/


1. Absolutely. clickhouse-keeper is distributed as a standalone static binary, a .deb package, or an .rpm package. You can use it without clickhouse as a ZooKeeper replacement.

2. It's not recommended to use slow storage devices for logs in any coordination system (zookeeper, clickhouse-keeper, etcd and so on). A good setup is a small, fast SSD/EBS disk for fresh logs, with old logs and snapshots offloaded to S3. In such a setup the number of PUT requests will be tiny and latency will be as good as possible.
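
A rough back-of-the-envelope sketch of why the PUT count stays tiny when only rotated logs go to S3. The function names and the `entries_per_log_file` parameter are hypothetical, the numbers are illustrative, and snapshot uploads are ignored.

```python
def puts_if_every_write_hit_s3(writes: int) -> int:
    """Hypothetical design where each write becomes one S3 PUT."""
    return writes


def puts_with_rotation(writes: int, entries_per_log_file: int) -> int:
    """Only completed (rotated) log files are uploaded, one PUT each.

    The latest, still-growing log file stays on the local disk, so a
    partially filled file does not count.
    """
    if entries_per_log_file <= 0:
        raise ValueError("entries_per_log_file must be positive")
    return writes // entries_per_log_file


if __name__ == "__main__":
    writes = 1_000_000
    per_file = 100_000  # assumed rotation size, not a measured value
    print("PUT per write:  ", puts_if_every_write_hit_s3(writes))
    print("PUT on rotation:", puts_with_rotation(writes, per_file))
```

With these assumed numbers, a million writes produce ten uploads instead of a million, and none of the uploads sit on the write path.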


Is there a python client library you can recommend?


All ZooKeeper libraries are compatible with clickhouse-keeper. The most popular and mature is https://kazoo.readthedocs.io/en/latest/. We use it a lot in our integration test framework (with clickhouse-keeper).
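
A minimal sketch of what using kazoo against clickhouse-keeper might look like. The host names, the port 9181, the path, and the helper names `hosts_string` and `roundtrip` are assumptions for illustration; use whatever your Keeper config actually listens on. kazoo is a third-party package (`pip install kazoo`), so it is imported only inside the function that needs it.

```python
def hosts_string(nodes, port=9181):
    """Build the comma-separated host list that KazooClient expects.

    The default port here is an assumption; check your Keeper config.
    """
    return ",".join(f"{node}:{port}" for node in nodes)


def roundtrip(hosts, path="/demo/greeting", value=b"hello"):
    """Write a value to a node and read it back, as with ZooKeeper."""
    from kazoo.client import KazooClient  # third-party dependency

    zk = KazooClient(hosts=hosts)
    zk.start(timeout=10)  # raises if no server answers in time
    try:
        zk.ensure_path(path)       # create the node and parents if missing
        zk.set(path, value)        # overwrite the node's data
        data, _stat = zk.get(path) # returns (bytes, ZnodeStat)
        return data
    finally:
        zk.stop()
        zk.close()


if __name__ == "__main__":
    # Requires a reachable clickhouse-keeper (or ZooKeeper) instance.
    print(roundtrip(hosts_string(["127.0.0.1"])))
```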


The same library that you use for ZooKeeper - kazoo.

Note: our stress tests have found a segmentation fault in Python's kazoo library.

We only wanted to test Keeper, but found every bug around it :) Let me find a link.



Did not expect to see the issue I created


What do you use for network stuff in C++, ASIO?


Yes, boost.asio is used for the internal RAFT implementation.



