If you've spent any significant amount of time with Mesos and Marathon, you've probably found that they're buggy (Marathon, at least), complicated, and hard to work with. Zookeeper is just one part of the complicated web of infrastructure you need to stand up. To top it off, you end up needing to run something like Netflix's Exhibitor to have any shred of confidence that if ZK goes away, the state of your cluster will be recoverable. It definitely "works", but it certainly leaves something to be desired in the stability and ease-of-use department.
We[1] haven't found this to be the case. These are early-stage projects but they hold promise for companies that need an internal PaaS at scale. I look at Marathon, Chronos and ZK (to some degree) as building blocks, and DCOS as the glue that will tie them together.
One big reason to use Marathon + Chronos + Mesos is to get multi-tenancy of compute and data workloads on your instances. You can't get that with web-service-specific tech like Tutum.
I mean, I'm a little biased because I work there, but Terminal gives you multi-tenancy of compute and data workloads by default, and you get almost all of the properties of VMware on containers (including vMotion-style migrations).
You can make an account and boot Apache Spark in about 30 seconds using this link [0]. It's running in production right now for a lot of people, and you can run Mesos on top of Terminal if you want it [1].
Again, I'm not trying to push this on you if you're happy with how stuff works today, but I think we've made a PaaS that solves a lot of these problems (and we'll let you run it on your own metal too if you want it). Check it out at terminal.com if you have some free time.
This post would be a lot less spammy if you actually added some content, like how you do things differently, or what stack you use, whatever. Anything except "look at us, we are great at this"
We hacked the Linux kernel to provide better checkpointing of RAM state, and we write that state to disk. Anyone can then summon the state from disk in the time it takes to read it off the SSD.
I'm curious about just how difficult managing ZK is. My initial experience was pretty pleasant. Sure, it doesn't have a pretty admin UI (Exhibitor fills in that gap), but it's not like the consensus is that ZK is unreliable or doesn't live up to its utility claims. So, what exactly am I missing here?
I wasn't picking on ZK in particular. I was more pointing out that the system as a whole is complicated. It might be fine for "enterprise deployments" where a company has a lot of time, money and resources to throw at the implementation. In my opinion the average end user is going to get lost in the details of how these systems interconnect and work together. Debugging some of the more complex issues that they will face will be challenging.
Even Exhibitor is not a silver bullet for managing ZK. It helps with rolling restarts and lets you do a sort of MySQL-esque bin-log replay of your state, but even with Exhibitor, it's not going to make for a smooth recovery at 4:00AM when PagerDuty rings.
The underlying design of how Mesos and Marathon communicate creates situations where they have differing views of the state of the cluster, and you end up with "orphans": tasks that Mesos is aware of but Marathon knows nothing about.
In my opinion, the system feels like it was designed to be something else, and then all this functionality was tacked on later. Coincidentally, that is exactly the case with Mesos. I think there is a lot of room for an improved user experience that these tools will struggle to provide.
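For anyone curious, here's a rough sketch of how you could surface those orphans yourself. It assumes the stock HTTP endpoints (/master/state.json on the Mesos master and /v2/tasks on Marathon) and that Marathon registers under the framework name "marathon"; treat it as illustrative rather than canonical, and note the hostnames are made up.

    # Hypothetical sketch: diff the Marathon tasks Mesos knows about against
    # the tasks Marathon itself reports, to surface "orphans".
    import requests

    MESOS_MASTER = "http://mesos-master.example.com:5050"  # placeholder
    MARATHON = "http://marathon.example.com:8080"          # placeholder

    def mesos_task_ids():
        state = requests.get(MESOS_MASTER + "/master/state.json").json()
        ids = set()
        for framework in state.get("frameworks", []):
            if framework.get("name") == "marathon":
                for task in framework.get("tasks", []):
                    ids.add(task["id"])
        return ids

    def marathon_task_ids():
        resp = requests.get(MARATHON + "/v2/tasks",
                            headers={"Accept": "application/json"})
        return {task["id"] for task in resp.json().get("tasks", [])}

    orphans = mesos_task_ids() - marathon_task_ids()
    for task_id in sorted(orphans):
        print("orphaned task:", task_id)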
Hi there, Marathon committer here. If you had a bad experience with the 0.7.x series, I apologize for that. We've made a serious effort to fix the consistency issues, and that work has paid off. 0.8.0 is current, and we tagged 0.8.1-RC1 just yesterday. I encourage you to give it another try! As mentioned in the other comments, we (Mesosphere) are actively working to provide alternatives to ZooKeeper, such as etcd. The reality is that ZK will stick around as a stable option for people, as the first >1.0 etcd release was tagged only two months ago.
Yep. In my experience, I've seen that the issue isn't with Mesos the Apache project, but rather with Marathon and Chronos, which are the applications built by Mesosphere.
There's a big naming and marketing confusion at the moment, with the open-source project and the VC-funded startup having such similar names. It's unfortunately giving the Mesos project a bad reputation.
Curious why there actually isn't a trademark issue here...?
Because Mesosphere (the VC backed startup) hired the creator of Mesos (the Apache project), and is now trying to capitalize on the open source project as one of their "products".
Mesosphere is arguably the largest contributor to Mesos, certainly on par with Twitter, especially when you consider all the surrounding ecosystem. The company also secured permission from the Apache Foundation to use the trademark when it was founded. It's good for the open-source ecosystem to have companies productize and support projects, particularly when they're plowing millions of dollars back into the open source.
FWIW, I've had much more success with HubSpot's Singularity than Marathon/Chronos. (I use Marathon to bootstrap Singularity as the "init system", because it had HA before Singularity, but that's the extent of its interaction with my system.)
That's interesting. I've also noticed that one of the big selling points of the Mesosphere ecosystem is that it is used at scale by Twitter. But AFAIK Twitter developed their own framework for long-running tasks (http://aurora.incubator.apache.org/). So they are using Mesos (the open-source project), maybe Chronos, and their own equivalent of Marathon. That can be somewhat misleading to newcomers.
To be clear, Twitter runs Aurora which is open source and has proven itself to be stable, battle-tested, and scalable; development on Aurora began in 2010, and over the years Mesos and Aurora have evolved together. Aurora has cron capabilities built into it, and Twitter does not run Marathon or Chronos.
Marathon is used in production at our company on > 1,000 servers. 0.7.x had some stability problems, but 0.8 has been terrific. Especially given that the project seems to be less than two years old (according to GitHub), I find it to be really stable. A lot of canary-style deploy features were added, and they're more advanced and scalable than anything else we tried.
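To give a flavor of those deploy features: every Marathon app definition can carry an upgradeStrategy, which controls how much healthy capacity is preserved while instances are rolled. A minimal sketch against the /v2/apps endpoint (the host, app id, and numbers here are made up):

    # Hypothetical sketch: post an app with an upgradeStrategy to Marathon.
    # Field names follow the Marathon 0.8 REST API; values are placeholders.
    import requests

    app = {
        "id": "/example/web",
        "cmd": "python -m SimpleHTTPServer $PORT0",
        "cpus": 0.25,
        "mem": 128,
        "instances": 4,
        # Keep at least half of the instances healthy during a rolling upgrade.
        "upgradeStrategy": {"minimumHealthCapacity": 0.5},
    }

    resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
    resp.raise_for_status()
    print(resp.json())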
I'm a bit concerned about the network layer (or, rather, lack thereof). I've looked at one or two of the "overlay network" approaches (that essentially have a container act as a router/tunneller between hosts), and wish there was something in Swarm that let me do basic port-mapping/load balancing across containers on multiple hosts (preferably with auto-scaling) without funkiness and undue overhead.
AFAIK it allows you to replace ZK with etcd, which makes it a lot easier to run this on top of CoreOS, since a working etcd cluster is a given in a working CoreOS cluster.
Really? I've found it a lot easier to manage than etcd. The fact that you manually specify all of your nodes in the config file removes a whole host of errors you can create in etcd with its fancy discovery stuff.
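For reference, the entire membership story in ZK is a few static lines in zoo.cfg, identical on every node (hostnames here are placeholders); there's no discovery protocol to get wrong:

    # zoo.cfg -- static ensemble definition, the same on every node
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

Each node additionally gets a myid file in dataDir containing its own server number, and that's it.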
What's your set up? How many zookeeper nodes do you run? What problems have you run into?
Many of the docs on etcd promote the use of the discovery service[1] as a convenient way of bootstrapping an etcd cluster -- really useful when you don't know the IP addresses of each cluster member up front. The discovery bootstrap method is also great for demos and testing environments, but as you correctly highlight this is not the ideal way to run a production setup.
With the release of etcd 2.0, I strongly recommend using the static bootstrap[2] method for provisioning an etcd cluster. The static bootstrap method provides the key documentation clues that help you reason about an etcd installation.
Finally, etcd 2.0 introduces support for bootstrapping using DNS[3], which provides the convenience of the discovery bootstrap, and the explicitness of the static bootstrap.
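For anyone reading along, a static bootstrap just means handing every member the full cluster list at start time. Roughly (the names, addresses, and token here are placeholders):

    etcd --name infra0 \
      --initial-advertise-peer-urls http://10.0.1.10:2380 \
      --listen-peer-urls http://10.0.1.10:2380 \
      --listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
      --advertise-client-urls http://10.0.1.10:2379 \
      --initial-cluster-token etcd-cluster-1 \
      --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
      --initial-cluster-state new

Every member gets the same --initial-cluster value; only its own --name and advertise URLs change.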
We would see increasing latency interacting with ZooKeeper over time until eventually it would fail completely. The log messages when this happened (either latency or failure) were extremely unhelpful. In fact, I found ZooKeeper's server-side logging downright terrible.
We wound up proactively restarting the ZK cluster regularly, which improved stability.
Granted, it was our own software written to use it, and we suspect there were problems with the way it was written. It was easier for us to just rip it out than debug, however. I find it overly complicated to write against given the need for thick clients.
"ZooKeeper provides ephemeral nodes which are K/V entries that are removed when a client disconnects. These are more sophisticated than a heartbeat system, but also have inherent scalability issues and add client side complexity. All clients must maintain active connections to the ZooKeeper servers, and perform keep-alives. Additionally, this requires "thick clients", which are difficult to write and often result in difficult to debug issues."
Edit: in retrospect our problems might have been solved by turning syncing off, as described here:
But you can figure out from the above how scalable Zookeeper is... not very. We run in physical datacenters and I certainly wouldn't be thrilled about building out snowflake RAID systems just for our ZK clusters (we generally try to use whitebox commodity hardware and we want individual nodes to be as disposable as possible).
It's ironic that ZK requires such consistency when the goals of Mesos are exactly the opposite.
> It's ironic that ZK requires such consistency when the goals of Mesos are exactly the opposite.
Is it ironic? Seems to me that if you want to have a distributed cluster that can deal with worker failure and still be useful, you need to rely on something to durably maintain your state at a lower level.
Well, you've described how zk doesn't meet the needs of your software as written, but I'm not sure you've established why Mesos would be better off without it...
And yet somehow they seem to have gotten it to work. I run a few Zookeeper-based systems where ZK just seems to work and I don't have to look at logging statements, either. I've never dealt with Zookeeper downtime that wasn't Amazon's fault.
So you are having ZK downtime? Such a system shouldn't ever really go down, even with Amazon problems. We strive for software in which individual nodes can be flaky or down without impacting the uptime of the whole cluster. ZK doesn't fit that criterion, as you have just stated. I prefer systems that pair gossip-based membership with Raft consensus, for example, which is why I like Consul so much. (We're also big Riak users.)
No, individual nodes have gone down due to hardware problems. The system stays up. Jepsen has given ZooKeeper probably the most ringing endorsement of anything it's tested, so I don't know what you're on about.
If you're in NYC, there will be a talk tonight on Building & Deploying Applications to Apache Mesos at the Digital Ocean meetup: http://eventhunt.io/node/5984.