Bizur: A Key-Value Consensus Algorithm for Scalable File-Systems

rystsov · on March 9, 2017

> Paxos-like algorithms which are used by existing distributed file systems, can have artificial contention points due to their dependence on a distributed log.

It's half true. Paxos is an ambiguous term. It includes Multi-Paxos and Single-Decree Paxos (aka Synod). The first one is a log based algorithm, while the second is a just a register. So your critic is valid only for the first part of the Paxos family.

Actually, Bizur is very similar (almost indistinguishable) to Single Decree Paxos - just compare it with the algorithm described in the video [1] by Heidi Howard or in my post "In search of a simple consensus algorithm" [2]

[1] https://www.youtube.com/watch?v=s8JqcZtvnsM

[2] http://rystsov.info/2017/02/15/simple-consensus.html

ezrahoch · on March 9, 2017

Single decree is a write-once algorithm. You can't use the same instance again, for another update of the associated value.

So just replacing bizur' should "register" implementation with a single decree paxos won't be sufficient. You'll need to do something more elaborate (creating more instances as you go along).

This kind of instance management isn't required in bizur.

I'm not sure what's the overhead off all that, w.r.t. memory foot print. Since this overhead is paid for each bucket, it might accumulate to quite a bit.

rystsov · on March 9, 2017

Not really :) Please follow the links. "Paxos Made Simple" describes the write once variant, but it's possible to choose a new value for the next ballot cycles to make it rewritable.

Proof of the algorithm: http://rystsov.info/2015/09/16/how-paxos-works.html

Moreover, I built a system on top of it and extensively tested it with fault injections: https://github.com/gryadka/js.

It seems that GoshawkDB[1] uses the similar idea, Treode[2] used it too.

[1] https://goshawkdb.io/

[2] http://treode.github.io/design/

ezrahoch · on March 9, 2017

Very nice work! Indeed we've done similar stuff :) I haven't read it all yet, I'll comment back once I do.

skyde · on March 9, 2017

Could you explain how Bizure compare to Gryadka. Are the performance tradeoff similar.

rystsov · on March 9, 2017

> We avoided reads (get operations) since the current version of etcd performs reads directly from the leader, without contacting the cluster, which doesn’t preserve the same consistency level as Bizur and ZooKeeper do.

It's wrong. Etcd provides "?quorum=true" option which makes reads go through the Raft consensus algorithm archiving linearizability see https://github.com/coreos/etcd/issues/741

ideal0227 · on March 9, 2017

etcd v3 read is linearizable by default.

Also I believe the authors of the paper tested against the old etcd v2 API/backend although they use etcd3.

The new API/backend has significant high performance.

I work on etcd.

mhf8o9a · on March 9, 2017

Bizur team member here.

For etcd benchmarking we used the 'benchmark' tool from etcd's source (commit d62ce55, ~3.1.0-rc1). AFAIK that tool does use the v3 api.

The paper was completed before etcd 3.1 was released. etcd's docs stated (at that time) that the client is not linearizable. We only tested writes.

As for performance, it's hard to say if our results pair up with the official etcd results (we were not aware of them - were they published at the time?). The hardware & software setup used for benchmarking is different.

rystsov · on March 9, 2017

I mean that without the ?quorum flag the stale reads are possible. I tested the etcd v3 (http api) and the reads were incredibly fast but when I set the flag, the read's latency became the same as write.

ideal0227 · on March 9, 2017

v3 does not have a native HTTP API. I suspect you still use v2 API. You need the gateway to make v3 works with HTTP: https://github.com/coreos/etcd/blob/master/Documentation/dev...

But that might impact performance quite bit depending on your workload.

rystsov · on March 9, 2017

I got it, I was using Etcd v3.1.0 but with v2 API.

sriram_malhar · on March 17, 2017

Unless I have missed something, I don't believe Bizur is "linearizable at the bucket level" (as the paper claims).

Suppose a server X was the leader, and subsequently a new leader was elected without X's knowledge. X does not learn of the new election unless it attempts a write. Until then, the read side may not detect the situation either since EnsureRecovery may return true (the bucket being read has the same elect_id as X.elect_id. Thus reads may return stale value. What we have here is "sequential consistency at the bucket level". which technically is cache consistency.

Separately, because the consistency claims are at a key/bucket level, and since there is no claim to global ordering (between different keys), a comparison to Zookeeper is not quite fair. Bizur cannot be used as a coordination service because it isn't sequentially consistent overall.

tzury · on March 9, 2017

Bizur in Hebrew mean distributing. I see all four authors[1] are Israelites, then, I guess, this is where the name comes from

[1]: Ezra N. Hoch, Yaniv Ben-Yehuda, Noam Lewis, Avi Vigder

rystsov · on March 9, 2017

What do you call by "queue depth". Is it the number of concurrent clients?

If it's true then it looks very suspicious that avg. latency decreases as the number of concurrent clients goes up. How do you measure avg. latency?

Let me illustrate what the avg. latency usually means.

Imagine that we have two concurrent users: Alice and Bob. During the testing lasting one second, Alice managed to send 3 requests per second, the first request took 0.5 seconds while the rest were executed in a quarter of a second each. In the same time, Bob did only 2 requests, each of them took 0.5 seconds.

So the average latency is the sum of execution time taken by each request divided by the number of all requests, in our case it's (0.5+0.25+0.25+0.5+0.5) / 5 = 0.4

Did you use the same definition?

mhf8o9a · on March 9, 2017

Latency does increase with concurrency. Latency decreases with the number of keys, as long as they are less than the concurrency. If the number of keys is more than the concurrency, there is no additional benefit and latency remains flat.

rystsov · on March 9, 2017

Thanks, I didn't notice that the clients operate over the same key set, so I didn't think about the contention. It makes sense.

unboxed_type · on March 9, 2017

I am no academic, but I read a lot of papers and going to write one this year. I skimmed the paper without getting into details about the protocol itself, but made some observations about the paper in general. My personal very subjective opinion on how the paper might be reinforced is presented below.

1. Add a system model description under which your protocol is expected to operate.

2. Add description of fault-tolerant properties of the protocol.

3. Do you plan to prove correctness of the algorithm more formally? Maybe you could model the algorithm before publishing. If you prove a safety property using model checking it could considerably reduce doubt about an algorithm correctness.

mhf8o9a · on March 9, 2017

Thanks for the feedback.

As for #3, unfortunately we don't currently have plans to publish a formal proof (we're building a product). Some of our internal documentation is more formal, but we decided on a more readable presentation for this paper.

Taek · on March 9, 2017

You will probably find it valuable to write that proof. Sometimes an edge case pops up or you end up proving that it doesn't work as expected. I've done this a few times in my work.

Unless you already have it but don't plan to publish, which seems a little odd to me. I get that you want readable documentation, but at least you could link to it somewhere for people who go digging.

unboxed_type · on March 9, 2017

Well personally I would rather suggest to write a model, not a rigorous proof. A formal proof of any real protocol is both long and tedious journey; production guys will hardly find it useful thing to do.

On the other hand a logical model can be encoded in a rather short amount of time by someone who is doing it periodically.

This hypothetical guy could benefit from this activity by co-authoring the paper. Original authors would benefit from it by reinforcing both the protocol and the paper.

daviesliu · on March 9, 2017

After reading the paper, I felt disappointed that: 1) Bizur is used for building a scalable object storage rather than a filesystem, which does not even support atomic rename for single key (the hash of key will change, it will be moved to another bucket without any transactional guarantee) 2) Bizur itself looks like a sharded Raft group, which is already used in TiDB (one raft group per region).

Please correct me if read it wrongly.

ideal0227 · on March 9, 2017

This (like a sharded Raft group) is somewhat correct :P.

If you can sacrifice consistency (causal instead of external), you can get EPaxos. If you can sacrifice more, you can get Single Paxos. Both of them can increase concurrency to some degree, and can improve availability over a leader based RSM system like Muti-paxos or Raft.

I believe there are other ways to make the trade-off, even making the trade-off dynamically. Hopefully, I can talk about it more soon.

I work on etcd/raft, which is a shared Raft implementation by TiDB(TiKV), Cockroachdb, etcd and many more.

rystsov · on March 9, 2017

> If you can sacrifice more, you can get Single Paxos

What do you mean? Single Decree Paxos is linearizable so it's consistent. When the multi-key updates are necessary - they can be implemented with client-side transactions such as 2PC, RAMP or Percolator.

Even with Raft we in the same situation. Once the data overgrows the size of one machine we need to shard them. As a result some keys can live on different shards and we need cross shards transactions to do multi-key updates :)

ideal0227 · on March 9, 2017

Well... Why do you need 2PC if nothing is sacrificed? :)

If you choose things like Raft (or other RSM), you can get a consistent key space to do txn over keys.

For a metadata store like etcd, one RSM group is OK. Performance/availability is better than something requires 2PC.

For database systems, there are quite a few txns that can actually fall into one shard and can bypass 2PC.

Of course, it is highly depend on the workload and use cases.

rystsov · on March 9, 2017

The paper is quite important - it's the first paper about the production usage of an algorithm similar to Single-Decree Paxos.

1) The atomicity usually isn't a problem, it can be added on top of the sharded storage with 2PC, RAMP or Percolator transactions, or with Sagas.

2) With Raft you have to implement log compaction and snapshotting, with Bizur and Single-Decree Paxos it isn't the case

ezrahoch · on March 9, 2017

One of the differences between a raft group and a bizur bucket is the granularity and the overhead of that granularity.

Bizur buckets can be very small (10s of kb), and the overhead for them is minimal (10s of bytes).

What would be the overhead of having a raft group for every 32KB? Would it be practical / a good use of the machine's resources?

skyde · on March 9, 2017

it seem that the performance gain of not having all update go to a single log could be also achieved by having one log per bucket. My understanding is that with Bizur all update for the same bucket also need to proceed sequentially.

Also with most system like Raft a failed machine trying to rejoin the cluster does not need to download the complete transaction log but can instead download a snapshot of the state machines then load log entry created after this snapshot.

Another aspect is that if the bucket contain a large number of key. From the Bizure paper it seem that the complete set of "key->value" need to be sent to replica each time the bucket is updated instead of only the entry that changed. This would produce much larger network traffic compared to Raft or Zab when all replica are mostly in sync

ezrahoch · on March 12, 2017

1) Keeping a log mechanism per bucket would be expensive, since it is a heavy mechanism. It will cost memory & network traffic. The additional information needed by Bizur for a bucket is minimal, basically, just a version.

2) In Bizur, rejoining the cluster includes the equivalent of copying the snapshot. However, that node can start participating in the Bizur even before it finished reading the entire snapshot, because of the key-value data model. Once a bucket has been copied, it can start serving it, which is much faster than log-based algorithms.

3) You are correct about the size of a bucket. In our implementation we aim to have up to 8-10 keys in a bucket, to avoid the overhead you mentioned. That's why it's especially important to be very efficient pet-bucket (see point #1).

skyde · on March 15, 2017

Thanks a lot for clarification.

xanton · on March 9, 2017

Am I right that in a system implementing algorithm explained in the article there is no way to execute operations on two key-value pairs stored in different buckets atomically? For example, two CAS operations that will either both succeed or both fail?

solidsnack9000 · on March 9, 2017

Their application is file systems and no file systems offer that capability. In order to provide transactionality for multi-file operations, you have to have a file that represents "the transaction" and move it or create/delete it.

In most file systems, you can't even update a file atomically, actually -- what you can update atomically is a disk page, 4-16k. Database servers might then maintain a separate, small file that contains only a transaction token, which is atomically updated to contain the last transaction written to disk in full.

xanton · on March 9, 2017

I don't judge them, I was confused by the choice of systems to compare with and thought that I didn't understand article. I think that comparison with MongoDB or RethinkDB would be more appropriate in this case.

mhf8o9a · on March 9, 2017

Keep in mind that the Bizur is being used for services internal to the file system. So the use case is more like etcd or ZooKeeper's, for example in one role it is a cluster configuration database (with high availablity, fault tolerance, consistency, etc.)

blopeur · on March 9, 2017

Yes, They guarantee consensus per key ( which is in this case per bucket as they aggregate them). You won't be able to have atomic operation across bucket neither you can have serialisation of transactions across bucket. It's a trade-off, my guess is that their use case does not require a consistent view of the overall k/v store.

ezrahoch · on March 9, 2017

Indeed, it is a tradeoff, and our solution doesn't require consistency across multiple keys.

More generally, any solution that doesn't require transactions across multiple keys, can use Bizur instead of classical Paxos, and achieve the same benefits (faster recovery from failures, higher concurrency, etc.)

ccleve · on March 9, 2017

Have you published the software anywhere? Do you have an open source implementation? (The Elastifile Github account doesn't appear to have one.)

ezrahoch · on March 12, 2017

It is not an open-source project, but rather a proprietary implementation.