Jepsen: Etcd 3.4.3 (jepsen.io)
219 points by aphyr on Jan 30, 2020 | 55 comments



Congrats on the community blog launch! <3

>We would love to see someone in the etcd community integrate the etcd Jepsen tests directly into the existing etcd release pipeline.

I consider this to be an issue of higher priority than any of the bugs they just found, because it would ensure that preventable bugs don't crop up in the future. It's shocking to me that Jepsen goes through all this effort and then very few projects build a permanent pipeline for it. It's debatable whether these bugs would've existed if a Jepsen pipeline had been consistently in use from the 0.4.x days. I'm sure it's no simple task, but neither is a lot of the existing testing infrastructure for etcd.


> It's debatable whether these bugs would've existed if a Jepsen pipeline had been consistently in use from the 0.4.x days.

I don't think it would have helped: the Jepsen tests I wrote in 2014 only checked single-key gets, puts, and CaS operations; the problems we found in this report were in watches and locks.


That's a good point that I hadn't remembered (first-class locks weren't totally baked back then), but I think the intent of my original comment holds true: we should have valued the Jepsen test suite more and continuously leveraged and improved our usage of it, rather than doing one-off tests every now and then. As a result, the community has no idea whether there were regressions between then and now. I will admit what I don't know: I have no idea if this was actually feasible or desirable for us or for you at the time, but I'd feel more comfortable if every release of etcd, ZooKeeper, etc. had some kind of Jepsen stamp of approval on it in terms of API coverage. I'm pretty sure CockroachDB has tried this[0], but it's been a few years and I don't know how it turned out for them long term.

[0]: https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockro...


CockroachDB runs the Jepsen test suite nightly. We've been following along with Aphyr's recent test additions (`multi-register`, for instance, which immediately caught [0]), porting them over when appropriate. We definitely still have work to do incorporating the more DDL-focused tests that tripped up YugaByte.

[0]: https://github.com/cockroachdb/cockroach/pull/40600


That's pretty damn cool. I wish I had more time to seriously try out CockroachDB.


A distributed state system should have Jepsen testing as a bare minimum to qualify for serious consideration.

It would be nice to get follow-ups too; personally, I'd like Cassandra, Kafka, Couch, and Elastic.


Congrats on the community blog launch!

I miss the unique and informative style of the old reports.


> This is, apparently, not correct. Asking for revision 0 causes etcd to stream updates beginning with whatever revision the server has now, plus one, rather than the first revision. Asking for revision 1 yields all changes. This behavior was not documented.

I had worked on an alternative etcd impl and had to work around this assumption as well. It is technically documented in the proto[0], and numeric 0 is of course "unset" or "default" in proto3 land.
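
For reference, this is roughly how the two cases look from clientv3 (a sketch; the key name and endpoint are made up):

    package main

    import (
        "context"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // Revision 0 is the proto3 default ("unset"), so the server starts the
        // stream at its current revision + 1, i.e. you only see future changes.
        fromNow := cli.Watch(context.Background(), "foo", clientv3.WithRev(0))

        // Revision 1 replays every change to the key from the first revision.
        fromStart := cli.Watch(context.Background(), "foo", clientv3.WithRev(1))

        _, _ = fromNow, fromStart
    }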

One thing I would like to see tested is nested transactions where one txn child mutates something and then a second sibling txn child uses that something. I've found implementations lacking there.
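
Roughly the shape I mean, sketched with clientv3 (made-up keys; clientv3.OpTxn is the nested form added in etcd 3.3). The open question is whether the second child's guard sees the first child's put:

    package sketch

    import (
        "context"

        "go.etcd.io/etcd/clientv3"
    )

    // Parent transaction containing two sibling child transactions.
    func nestedTxn(ctx context.Context, cli *clientv3.Client) error {
        child1 := clientv3.OpTxn(
            nil, // no guards
            []clientv3.Op{clientv3.OpPut("k", "v1")},
            nil,
        )
        // Does this guard see the sibling put above, or the pre-transaction
        // value of "k"? That's the behavior I found underspecified.
        child2 := clientv3.OpTxn(
            []clientv3.Cmp{clientv3.Compare(clientv3.Value("k"), "=", "v1")},
            []clientv3.Op{clientv3.OpPut("k2", "saw-sibling-write")},
            nil,
        )
        _, err := cli.Txn(ctx).Then(child1, child2).Commit()
        return err
    }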

0 - https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009...


This is also one of the things I suggested that'd be nice to have in the etcd API, along with an AST for boolean operations over guard expressions, more flexible comparators etc. You can emulate these with a sequence of independent transactions, but it'd be nice if the micro-transaction system were a tad more general.



This is a powerful validation for etcd and its status as a mission-critical backend. You don't see a lot of positive Jepsen results these days!


I didn't think it was too positive. He found some documentation bugs, as well as what seem like really bad locking bugs.


It is not possible to implement distributed locking correctly in the general case. Fencing tokens only work if your data store supports RMW operations, in which case you could just implement locking inside the data store itself.
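
To spell that out: a fencing token only helps if the downstream store can atomically reject writes carrying an older token, and that check is itself a read-modify-write. A sketch against a hypothetical store (not any real API):

    package sketch

    import (
        "errors"
        "sync"
    )

    // Hypothetical resource that honors fencing tokens. Rejecting a stale
    // token means comparing-and-updating lastToken atomically, i.e. the
    // store already needs an RMW primitive of its own.
    type Store struct {
        mu        sync.Mutex
        lastToken int64
        data      map[string]string
    }

    func NewStore() *Store { return &Store{data: map[string]string{}} }

    func (s *Store) Write(token int64, key, value string) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if token < s.lastToken {
            // A newer lock holder has already written; this caller's lock
            // (and lease) must have expired.
            return errors.New("stale fencing token")
        }
        s.lastToken = token
        s.data[key] = value
        return nil
    }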


Have you seen Jepsen reports before? Silent data loss isn't uncommon.

https://aphyr.com/posts/284-call-me-maybe-mongodb ("Mostly, it just throws those writes away entirely: no rollback files, no nothing. I don’t really know why.")

https://aphyr.com/posts/317-call-me-maybe-elasticsearch ("When the cluster comes back together, one primary blithely overwrites the other’s state.")

etc etc


> This is, apparently, not correct. Asking for revision 0 causes etcd to stream updates beginning with whatever revision the server has now, plus one, rather than the first revision. Asking for revision 1 yields all changes. This behavior was not documented.

Whoops, looks suspiciously like someone tested the revision integer for truthiness to see if something was passed.


Nope. It's because protobufs can't differentiate between nil/unset and a zero value in this field: https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009...


Not all protobufs. proto2 can do it just fine. The decision not to have emptiness support is the dumbest part of proto3, IMO.


I don't think it's that crazy a decision, since it was also kind of crazy to pay the cost of "was this field set" for every single field, when the vast majority of the time you're not going to use it.

But it can require a bit more thought in your protocol design. So, for example, in this case the options would be:

- design things such that 0 is OK to mean "most current" (i.e. start revisions at 1; this is hard after the fact, but if you know from day 0 that missing values for int types will read as 0, you can design everything to start at 1) (Edit: maybe this is how etcd works?)

- explicitly break out revision into a message type (so you can notice if it's not provided)

- use something like "-1" to mean "now", so that 0 isn't overloaded

etc...

(You could argue that maybe the right call for proto3 was to have a per-field flag saying whether you want to be able to notice if it was provided or not. Best of all worlds, at the cost of a bit of complexity.)


"Is this field set" costs 1 bit per _optional_ field in Proto2. _Required_ fields (which are also supported by proto2 and not supported by proto3) do not incur this cost. Those flags then would get packed into a bit mask. Not too big a cost, if you ask me.

But now you can't figure this out at all without adding another boolean field _and setting it separately_, which I'm pretty sure nobody is going to do unless they really have to, leading to the kind of issue we're seeing here.


At my place of work, we typically follow the "message type" solution in situations like this. I don't think it's the most legible solution, but it's the best we can do with the proto spec: I always feel like I should qualify these fields with a comment explaining the apparently pointless wrapper.

Google themselves provide https://github.com/protocolbuffers/protobuf/blob/master/src/... to deal with this situation.

Quoted from the documentation:

> Wrappers for primitive (non-message) types.

> These types are useful for places where we need to distinguish between the absence of a primitive typed field and its default value.

It should probably be advertised more, as we've experienced that default values of optional fields are a surprising feature for smart developers who are new to protobufs. Maybe it's seen as a wart in the design? Getting rid of Null is hard.

I kinda wish they went with the approach that all fields are required, unless they are explicitly declared as optional. This is how Rust does it, and people seem to like it.
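
For what it's worth, in Go the generated code for a wrapped field is a pointer, so presence is recoverable. A rough sketch (the request type and field name are made up, not etcd's actual schema):

    package sketch

    import "google.golang.org/protobuf/types/known/wrapperspb"

    // Made-up request type whose revision field uses the Int64Value wrapper
    // instead of a bare int64, so "absent" and "explicitly 0" differ.
    type WatchRequest struct {
        StartRevision *wrapperspb.Int64Value
    }

    func startRevision(req *WatchRequest) (rev int64, ok bool) {
        if req.StartRevision == nil {
            return 0, false // caller never set the field
        }
        return req.StartRevision.Value, true // may legitimately be 0
    }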


I think you got it backwards. The current situation, where an unset field results in a value, is the equivalent of "null"; it's the opposite of "getting rid of null". The previous design didn't have that problem.


I might be misunderstanding you, but the current situation results in a “zero” value, whereas “Null” would represent “unspecified”, or the absence of a value. I wouldn’t say that the current spec supports null, unless you’re using a wrapper like the one linked.

It’s the distinction between an optional type and an optional value (default value). With optional values but no optional types, you can’t be certain about the caller’s intentions. It’s a distinction that’s subtle but important, therefore a “gotcha”.

Getting rid of null is a noble idea, because of the headaches that null tends to induce. Optional types (like Rust’s) are a neat way to get the behavior of null without the value of null. Proto3 doesn’t have null; it’s replaced with arcane wrapping that’s arguably less straightforward.

Please let me know if I’m talking past you, it isn’t intentional, just late in the day. :)


It depends on the target language. I think the driving force behind the protobuf3 simplifications was making protobufs work better with Go, which pretty much required dropping 'required' fields and non-zero-value defaults. That annoys Python developers and anyone else with access to a None, nil, or NULL value. It's particularly visible in my main code base, where SQL data with NULLs goes through a Go gRPC server that has to specially handle all those NULLable columns, then via protobuf3 to Python clients which generally can't tell anymore whether the original data was an empty string or NULL (the solution to that is worse than the problem, which is annoying but manageable). protobuf3 is designed to be cross-language, which in this case meant settling on the lowest common denominator rather than making implementations work harder.

Yes, I would love it if there were set/unset bits available, even if it was awkward.


You’re probably right. Obviously you can model it in Go (just use a ptr!), but it doesn’t feel as natural.


The elegant way to do this is via coproducts and products.


You can implement emptiness with the wrapper message types, e.g. StringValue: https://github.com/protocolbuffers/protobuf/blob/master/src/...

and sometimes by defining your own, e.g. for nullable lists.

It's not elegant, but it is simple.


But the point is, proto2 had this feature, and it worked without any hacks.


Worse, this is apparently an intentional choice.


For me, etcd is the go-to for dynamic configuration management.


The corresponding Jepsen post: http://jepsen.io/analyses/etcd-3.4.3

I do no work at all in this area, but I love these reports. They're examples of well-written, clear, "engineer-mind" reports that we would all do well to emulate.


I agree they are great. I kind of miss the older style of reports though, with memes all over the place!


I completely agree. I have also always appreciated the very honest and reflective ethics policy that underlies the work:

https://jepsen.io/ethics


(This comment was originally posted to https://news.ycombinator.com/item?id=22191925. We merged the threads.)


I mostly see etcd being used to store metadata & configuration data for distributed systems.

Can etcd also be used as a general distributed database like FoundationDB or ScyllaDB? If so, how does it compare to those other options?


Etcd is designed to be able to cache its entire working set in memory.

https://etcd.io/docs/v3.3.12/dev-guide/limit/ https://github.com/etcd-io/etcd/blob/master/Documentation/op...

TiKV, a distributed kv store that can store many terabytes of data, actually uses etcd internally for metadata storage.


Most databases fit in memory though.


Many do, and even large DBs can be accommodated by throwing terabytes of RAM at the problem, but etcd is fundamentally targeted at a different problem domain than a general-purpose database. Everything in etcd is represented by a single-threaded state machine, which is great for simplicity and correctness, but that comes with performance limitations.


etcd is used as the primary data store of applications like Kubernetes or CoreDNS, but many projects, like M3 or Vitess, use it for configuration. You can see the different users in the docs[1].

There are also databases like CockroachDB or Dgraph that use etcd's raft libraries but build more application-focused (SQL, GraphQL, etc.) APIs[2].

[1]: https://github.com/etcd-io/etcd/blob/master/Documentation/in...

[2]: https://github.com/etcd-io/etcd/tree/master/raft#notable-use...


I don't think etcd was designed with the idea of competing with larger distributed databases. For example, the maximum database size limit in etcd (as of v3.3) is 10GB. This works for an application like Kubernetes, where you're storing fewer than a million records, but it likely isn't something you want backing your wildly successful Django application.


As usual, Kyle provides an awesome write-up. As much as I miss the old, funny prose, the level of detail is still unmatched.


The elementary confusions apparent in the quoted documentation do not inspire confidence in the design or implementation.


If you look at other reports by Jepsen (the Elasticsearch and MongoDB ones are real gems), you might find that this one found very few, very minor bugs and documentation problems. Etcd is pretty solid for what it does.

Edit: seems like my recollection of the MongoDB one is of the 2.4.x report [1], and the later ones are much better. Also, Mongo included the Jepsen test suite in their CI.

[1] https://aphyr.com/posts/284-call-me-maybe-mongodb


How many nodes can etcd handle without a noticeable decay in performance? Their FAQ says 7, but has anybody used it with more nodes in some distributed app other than k8s? Assuming that most of the operations are get and watch (i.e. write/read <<< 1.0), how big a cluster, in terms of number of nodes, can we scale up to?


Not sure why you're looking at so many nodes for a cluster. Most of the time, scaling the number of etcd nodes does little to help performance; instead, focus on giving the nodes plenty of IOPS.


If you run a 3-availability-zone architecture and want to survive an AZ failure plus one more node failure, you need more nodes. Nine nodes (three per AZ) is the minimum that guarantees survival while still maintaining a quorum: losing an AZ plus one more node leaves 5 of 9, which is still a majority.


Do you think you're actually gaining availability with an AZ + 1 setup?


C'mon, could you be constructive? Of course they do. Now what's your gripe with that?


I was more curious about why they thought that would increase availability. It didn't seem like it would be constructive to speculate about why they chose that.

AWS doesn't guarantee that failures within an AZ are independent of each other, so it's not clear how you would estimate what availability you'd gain with this. Losing everything in an AZ + 1 instance sounds like a very unusual and specific scenario to design for.


The use case here is an AZ going down (reasonable to guard against) and an individual machine failure (the first reason we use HA like this anyway).

An AZ going down doesn’t make all other hardware reliable, and equally a machine going down from a cluster doesn’t mean that all AZs are going to be reliable.

Many products have uptime requirements above what Amazon can provide at the AZ or machine level.


To improve read performance, you can add learner members or add a caching proxy layer.


Thanks! I'd never heard of the learner member feature until now. I was asking about an app where most of the nodes perform watch operations while only a few other nodes do PUTs, driven by human intervention, so the write/read ratio is very low. I also assume the number of nodes is usually stable, so it isn't a very dynamic system where nodes join and leave frequently. Does this make it easier to run a cluster of 50-100 nodes across different datacenters without breaking etcd?
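
To be concrete, each of those watcher nodes would basically just hold one watch stream, something like this (a sketch; endpoint and key prefix are made up):

    package main

    import (
        "context"
        "fmt"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // One long-lived watch stream per node; the rare PUTs come from the
        // handful of writer nodes (or humans) elsewhere.
        for resp := range cli.Watch(context.Background(), "config/", clientv3.WithPrefix()) {
            for _, ev := range resp.Events {
                fmt.Printf("%s %q = %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
            }
        }
    }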


Proxy is deprecated in v2. V3 has a gRPC proxy, but I think it's mainly good for coalescing watches and discovery.


Going off the name, I thought it would be something to manage Unix configurations (an /etc daemon).


It’s kind of used for things of that nature in a distributed context


For the impatient: “The etcd key-value store is a distributed database based on the Raft consensus algorithm. In our 2014 analysis, we found that etcd 0.4.1 exhibited stale reads by default. We returned to etcd, now at version 3.4.3, to investigate its safety properties in detail. We found that key-value operations appear to be strict serializable, and that watches deliver every change to a key in order. However, etcd locks are fundamentally unsafe, and those risks were exacerbated by a bug which failed to check lease validity after waiting for a lock.”



