Etcd Clustering in AWS (monsanto.com)
95 points by polvi on June 12, 2015 | 30 comments



"If this happened our cluster would become unavailable and may have trouble re-clustering."

This was basically the repeated experience I had which caused me to abandon etcd for the time being.

If it can barely ever heal, what the fuck good is it? And I found that it could barely ever heal. A 3-node CoreOS cluster I ran _always_ crashed when it attempted a coordinated update, and could rarely be repaired even with hours of help from #CoreOS.

Because CoreOS pushes out updates bundling etcd versions incompatible with the version already running, the etcd cluster could never survive the upgrade.

Add this to the fact that the CEO of CoreOS told me in person that he expected them to be the _only_ Operating System on the internet, and I'm generally not along for the ride with CoreOS any longer.

Consul, Mesos, and Docker are looking good.

Anyone interested in this space should check out:

  https://github.com/CiscoCloud/microservices-infrastructure


I had way more trouble with Consul than I ever had with etcd. Actually, I've had almost no trouble with etcd whatsoever; it was way more resilient and tolerant of dying machines than Consul, which would repeatedly get into inconsistent states and attempt to connect to nodes that were no longer there. I've been running CoreOS in production with dozens of AWS instances over the past year and I really don't have many complaints. Most issues I've come across actually have a lot more to do with Docker than with the stuff CoreOS has built.

But I also handle upgrading releases differently; that's not something I trusted from the beginning, and it's easy enough to disable their update system and stand up new instances with upgraded CoreOS images.
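
For what it's worth, a minimal sketch of what disabling it looks like on a running box (assuming the stock update-engine and locksmithd units that ship with CoreOS):

  # stop and mask the CoreOS auto-update/reboot daemons
  # (update-engine fetches updates, locksmithd coordinates the reboots)
  sudo systemctl stop update-engine.service locksmithd.service
  sudo systemctl mask update-engine.service locksmithd.service

After that, instances only change OS version when you bake and roll a fresh image.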

Also, looking at your quote, I would consider it very out of context; here's the sentence right before it:

"If there were any changes to these etcd machines, AWS would reboot them to apply the changes, potentially all at the same time."

So they had CloudFormation potentially rebooting all their machines at the same time. I think any cluster is going to have an issue when that happens, and it really has nothing to do with CoreOS's update system.


Your experience with Consul and mine with etcd may come down to the same thing as preferences for hard drive brands.

> Also, looking at your quote, I would consider it very out of context; here's the sentence right before it:

> "If there were any changes to these etcd machines, AWS would reboot them to apply the changes, potentially all at the same time."

> So they had CloudFormation potentially rebooting all their machines at the same time. I think any cluster is going to have an issue when that happens, and it really has nothing to do with CoreOS's update system.

Two things:

(a) I'm saying that etcd has a tendency to break in _exactly_ _the_ _same_ _way_ without AWS rebooting anything.

(b) Production systems have a tendency to fail completely and then all power back on (or see a network partition end) at the same time. Anything as essential as etcd claims to be absolutely must be able to handle the situation where all machines are powered off or unreachable to each other, and then that situation suddenly ends.

CoreOS's update system happens to trigger this on its own, because when it updates, it relies on etcd.

If you're not going to rely on CoreOS to update itself, what in the world is the point of CoreOS?

I'm just saying there are other boxes of sticks, putting some sticks in a box ain't that fuckin' hard, and these particular stick-gatherers are suffering from a dangerous bout of megalomania.

Feel free to lean your livelihood up against whatever box of sticks you please. :)


While I like Consul, and actually run Consul in the CoreOS cluster I'm currently building, I have had far fewer problems with etcd in AWS than with Consul. For starters, there's a long-standing issue, supposedly to do with ARP caching, that makes restarting Consul iffy (we need to keep the Consul instance down for 3 minutes to prevent the cluster membership from flapping continuously after it rejoins, for example).
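
The workaround looks roughly like this (a sketch; the systemd unit name is an assumption, adjust for however you run Consul):

  # restart sequence to keep membership from flapping after a rejoin
  consul leave                  # gracefully deregister this node from the cluster
  sudo systemctl stop consul    # assumes Consul runs as a systemd unit named "consul"
  sleep 180                     # stay down ~3 minutes so the old membership entry settles
  sudo systemctl start consul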

On the other hand I've never seen the problem you describe where the cluster won't heal after an upgrade.


I have to second this. Both etcd 0.4.x and etcd 2 can get seriously wonky. Part of it is definitely that it's a lowish-level tool (as is fleet) and so it's up to you to make sure that you've got everything configured as it should be... but even so, sometimes shit just goes horribly wrong. I've put everything on a "No Reboot" schedule for now.

I really want to like the CoreOS ecosystem, but IMO it's still beta-quality software.


Have you tried using Mesos? We're doing a POC but ran into some issues that we're going to wait out. Also, I've spoken to Mesos and they stated that they had no intention of making deployments easier/more stable, in favour of pushing their commercial offering.


I pointed at an open-source project by Cisco that pretty much sidesteps them entirely.

Obviously, CoreOS is going to start wanting your money pretty soon, as well.

Having worked for one of the earliest commercial Linux distributors, I have little faith in such an effort getting anywhere. Red Hat and Canonical can barely make a dime.

Mesosphere isn't really an alternative to etcd, though. It relies on ZooKeeper, which isn't perfect, but is much more battle-tested than either Consul or etcd.

I have high hopes for something to replace ZooKeeper, but I'm not deploying something in its infancy that inherently can't heal from outages.


> Red Hat and Canonical can barely make a dime.

Red Hat fiscal 2015 revenue: $1.79 billion. Net income: $180 million.

Their fastest growing business areas are incidentally exactly in this space: OpenShift, OpenStack and Ceph.


What's your issue with ZK? It might be heavyweight, but so far in our deployments it behaves a lot better than etcd. But if you're interested, there's a lot being done with etcd and Consul. I haven't seen etcd handle the kind of mixed data/web services that ZooKeeper was built for.


We are using Mesos/Marathon/Chronos/Docker/Bamboo on 200+ nodes on Google Compute, both for data-processing workloads and web applications (intensive real-time-bidding advertising servers). Using Mesosphere packages and Salt, it was quite easy. Of course some services and some SSL stuff were a bit hairy, but it's been working great. Now I'm waiting for Mesosphere DCOS to become available on GCP. Tried it on AWS and, god, it's awesome. It makes spinning up services a breeze and deploying new products instant.


> Also, I've spoken to Mesos

Mesos is an Apache project; did you mean Mesosphere?


Yeah, we're clearly talking about Mesosphere.


The context makes that a silly question.


Only if you already know Mesos and that there's a company called Mesosphere.


> Add this to the fact that the CEO of CoreOS told me in person that he expected them to be the _only_ Operating System on the internet, and I'm generally not along for the ride with CoreOS any longer.

That's the part that would concern me the most. The guy sounds delusional at best.


... thanks Monsanto?

In all seriousness, this is really interesting. They solved some of the problems associated with persisting a cluster and we're likely going to use that. Feels weird thanking them for anything though.

Edit: Is anyone using CoreOS in a physical DC? We're using AWS with ~1.5k VMs but have another 5-6k hosts in physical DCs. Trying to move us towards containers but struggling.


I just sent an email to some clients that I've been trying to get to blog about technical issues (for recruiting and retention purposes)--if Monsanto can do it, most anyone can.


They're certainly known for not always knowing the difference between 'can' and 'should'. ;)


I'm using CoreOS to operate a Deis cluster. We deploy to our own data center using OpenStack.


I think they fixed the etcd clustering problem in the 2.0 release (previously this was the 0.5 branch).

For example, we use CF (old version), and we hit https://github.com/coreos/etcd/issues/863.


In my experience etcd is pretty rock solid, until you start using it across availability zones. Then, if you add SSL into the mix, reliability drops even further with the default configuration. At that point you need to start tweaking the heartbeat and timeout parameters for the cluster to stay stable.
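
For reference, these are the knobs I mean (etcd 2.x environment variables; the numbers are illustrative, not a recommendation; tune them against your actual cross-AZ round-trip times):

  # raise the raft heartbeat and election timeout above the defaults
  export ETCD_HEARTBEAT_INTERVAL=200   # default 100 (ms)
  export ETCD_ELECTION_TIMEOUT=2000    # default 1000 (ms); keep it several times the heartbeat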


We have fixed the SSL issue in 2.1. We are also considering back-porting it to the 2.0 release if possible.


We solve the bootstrapping problem with an internal ELB instead.

Auto Scaling groups can be configured to have instances join multiple ELBs. One is the regular ELB used to access the instances; the other is an internal ELB that only allows connections from instances in the cluster to other instances in the cluster on the etcd port (controlled via security groups).

When an instance comes up, it adds itself to the cluster via the internal ELB's hostname. The hostname is set in Route 53.
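
Roughly, the join step looks like this (a sketch, not our exact script; etcd.internal.example.com stands in for the Route 53 name pointing at the internal ELB):

  # register this instance with the existing cluster via the internal ELB
  etcdctl --peers http://etcd.internal.example.com:2379 \
      member add "$(hostname)" "http://$(hostname -i | awk '{print $1}'):2380"
  # then start the local etcd with --initial-cluster-state existing so it joins
  # the running cluster instead of trying to bootstrap a new one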

The biggest issues we've been having with etcd continue to be simultaneous reboots and/or joins to the cluster. It would also be great if the membership timeout feature that used to exist in 0.4 made its way back in. Right now, each member has to be explicitly removed rather than eventually timing out if it hasn't joined back in.
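
For anyone who hasn't hit it, the manual cleanup looks roughly like this (etcd 2.x; the member ID is hypothetical):

  # find the ID of the member that never came back, then remove it by hand;
  # there is currently no automatic timeout that does this for you
  etcdctl member list
  etcdctl member remove 6e3bd23ae5f1eae0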

Looking forward to hearing any other approaches folks have taken.


Running Docker clusters on AWS seems a little foolish to me unless you're trying to save money. Instead of managing containers, why not just manage instances?


If you're trying to save money, you wouldn't be using AWS in the first place.

Here's one reason to run Docker on AWS: Putting everything in Docker containers makes it far easier to migrate off AWS if/when you want to.


Well, unless you use SQS and SNS and S3 and Redshift... then you are pretty locked in anyway.


Boot time is one reason. Another is integration with the AWS ecosystem of services; it's not just EC2 these days.


I can't get over how bad a name etcd is. Every time I see it I think it's some sort of daemon for /etc files.


I think what they were going for was:

/etc is where your config files go on your server; etcd is where your config goes for your distributed system.


etcd IS a mashup of /etc (configuration files) and distributed. https://youtu.be/z6tjawXZ71E?t=30s



