Well, some notes from when I deployed this the other day.
If running in a Docker container, you'll need to mount /etc/ssl/certs: the etcd container is minimal, and the binary still wants to find some x.509 roots or something, even when running without HTTPS communication (that's what you get for using Go, I guess).
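For reference, a sketch of the kind of invocation I mean (the image tag, ports, and discovery URL are illustrative; adjust for your setup):

```shell
# Mount the host's CA bundle read-only so etcd can find x.509 roots.
# Ports 4001/7001 are the legacy client/peer ports; 2.0 also uses 2379/2380.
docker run -d \
  -v /etc/ssl/certs:/etc/ssl/certs:ro \
  -p 4001:4001 -p 7001:7001 \
  --name etcd \
  quay.io/coreos/etcd \
  -name "$(hostname)" \
  -discovery "$DISCOVERY_URL"
```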
I manually specify the proxy nodes by adding -proxy on (my VPN host was becoming a master, which was not optimal); this may not be a problem for you.
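In case it helps, this is roughly what I mean (the node name and discovery URL are placeholders):

```shell
# Start this node as a proxy: it forwards client requests to the cluster
# but never joins raft, so it can't be elected master.
etcd -name vpn-host \
  -proxy on \
  -discovery "$DISCOVERY_URL"
```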
Removing and re-adding a node was funky for me: although :7001/members is great, removing a node there does not remove it from the discovery node, making rejoining with a clean etcd from the same machine rather painful. Not much that can be done about that though.
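Something like this, if memory serves (I may have the exact path and port wrong for your version, so check the docs):

```shell
# List members to find the ID of the node you want to drop.
curl http://127.0.0.1:7001/v2/members

# Remove a member by ID. Note this does NOT clean up the entry in the
# discovery service, which is what makes a clean rejoin painful.
curl -X DELETE http://127.0.0.1:7001/v2/members/<member-id>
```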
All in all, I think I'll start writing my etcd-compatible interface to ZooKeeper :)
One new 2.0 feature you may like (discovery-wise) is SRV discovery, if you've got an internal DNS/dnsmasq or something. Set a few records for where machines can be found (and keep them appropriately up to date) and it'll do the same thing as a static bootstrap automatically.
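Roughly, the records look something like this (domain and hostnames are made up):

```
; SRV records telling etcd where the cluster peers live
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 etcd0.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 etcd1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 etcd2.example.com.
```

and then you point each node at the domain with `etcd -discovery-srv example.com` instead of a static `-initial-cluster` list.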
Unfortunately, I run skydns for service discovery, and having multiple dns servers on the same machine is a PITA. I should note, for anyone considering skydns, that skydns + etcd + docker is circular dependency hell :)
If skydns fails to load its config from etcd, it doesn't terminate, it just continues and emits a warning.
If you are using discovery.etcd.io for your discovery URL, you need to have DNS working.
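For reference, getting a fresh discovery URL from the public service is just an HTTP request (size is the intended cluster size):

```shell
# Ask the public discovery service for a new token for a 3-node cluster.
# It responds with a URL like https://discovery.etcd.io/<token>,
# which you then pass to each node via -discovery.
curl "https://discovery.etcd.io/new?size=3"
```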
At some point, passing multiple --dns flags to docker didn't give me working failover. Or at least, DNS resolution was failing in the etcd container, despite the docker daemon having `--dns 127.0.0.1 --dns 8.8.8.8 --dns 8.8.4.4`. I don't even know.
We had some trouble with etcd at work around last summer (constant leader re-election and high CPU usage), so we switched to Consul, and so far we're happy with it. But etcd seems to be better supported by third-party apps, so maybe we should take it for another spin.
This would certainly help. High CPU was also an issue that we started to notice on 0.4.6, here with some of the NYC CoreOS guys, and that's been fixed. Chalk it up to completely redoing the internal communication.
The team has worked _very_ carefully to follow the raft state machine as described in the paper as closely as possible. For example, we have a set of tests[1] that take possible problems outlined in the original paper and implement them as unit tests.
Seriously, this was originally 0.5. However, because people had started to use 0.4.6 as a pseudo-1.0 in production, and because the internals are completely different, it's a bit of a version jump to a base we actually want to call our stable branch.
To expand: 0.4.6 uses the internal v1 API and 2.0 uses the internal v2 API. It made sense to sync up the internal and external release numbers to make things clearer going forward.
Do excuse my ignorance, but what practical advantages does etcd offer over Cassandra (or even Riak)? To me, it seems that Raft's leader-does-the-heavy-lifting style of replication will only limit the cluster size and thus the horizontal scalability of the cluster. Gossip-based Cassandra has stability and proven scalability.
etcd is designed to store app settings, data for service discovery, feature flags, distributed locks, that type of thing. It's not a general purpose data store and it isn't designed to store data in the same way you would use Cassandra.
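To give a feel for that, setting and reading a feature flag is just HTTP against the keys API (host and port assume a local etcd; 4001 is the legacy client port, newer versions also listen on 2379):

```shell
# Store a small config value under /v2/keys (etcd's HTTP key-value API).
# -L follows the redirect to the current leader if you hit a follower.
curl -L http://127.0.0.1:4001/v2/keys/flags/new-ui -XPUT -d value=true

# Read it back.
curl -L http://127.0.0.1:4001/v2/keys/flags/new-ui
```

Small, consistent, watchable values like this are the sweet spot, not bulk row storage.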
I'm not sure you know how HN works. Just posting isn't enough, you need your colleagues to upvote it if you want the front page. Unless it's super interesting.
It's not difficult to get on the front page, but it has massive PR implications. My previous startup is living proof of that.
And this isn't the first time Docker has announced something on HN and CoreOS has followed straight up with something else. That's why I was wondering.