If you've spent any significant amount of time with Mesos and Marathon, you've probably found that they're buggy (Marathon, at least), complicated, and hard to work with. Zookeeper is just one part of the complicated web of infrastructure you need to stand up. To top it off, you end up needing to run something like Netflix's Exhibitor to have any shred of confidence that if ZK goes away, the state of your cluster will be recoverable. It definitely "works", but it certainly leaves something to be desired in the stability and ease-of-use department.
We[1] haven't found this to be the case. These are early-stage projects but they hold promise for companies that need an internal PaaS at scale. I look at Marathon, Chronos and ZK (to some degree) as building blocks, and DCOS as the glue that will tie them together.
One big reason to use Marathon + Chronos + Mesos is to get multi-tenancy of compute and data workloads on your instances. You can't get that with web-service-specific tech like Tutum.
I mean, I'm a little biased because I work there, but Terminal gives you multi-tenancy of compute and data workloads by default, and you get almost all of the properties of VMware on containers (including vMotion-style migrations).
You can make an account and boot Apache Spark in about 30 seconds using this link [0]. It's running in production right now for a lot of people, and you can run Mesos on top of Terminal if you want it [1].
Again, I'm not trying to push this on you if you're happy with how stuff works today, but I think we've made a PaaS that solves a lot of these problems (and we'll let you run it on your own metal too if you want it). Check it out at terminal.com if you have some free time.
This post would be a lot less spammy if you actually added some content, like how you do things differently, or what stack you use, whatever. Anything except "look at us, we are great at this"
We hacked the Linux kernel to provide better checkpointing of RAM state, and we write that state to disk. Anyone can then summon the state from disk in the time it takes to read it off the SSD.
I'm curious about just how difficult managing ZK is. My initial experience was pretty pleasant. Sure, it doesn't have a pretty admin UI (Exhibitor fills in that gap), but it's not like the consensus is that ZK is unreliable or doesn't live up to its utility claims. So, what exactly am I missing here?
I wasn't picking on ZK in particular. I was more pointing out that the system as a whole is complicated. It might be fine for "enterprise deployments" where a company has a lot of time, money and resources to throw at the implementation. In my opinion the average end user is going to get lost in the details of how these systems interconnect and work together. Debugging some of the more complex issues that they will face will be challenging.
Even Exhibitor is not a silver bullet for managing ZK. It helps with rolling restarts and lets you do a sort of MySQL-esque bin-log replay of your state, but even with Exhibitor, it's not going to make for a smooth recovery at 4:00AM when PagerDuty rings.
The underlying design of how Mesos and Marathon communicate creates situations where they have differing views of the state of the cluster, and you end up with "orphans": tasks that Mesos is aware of but Marathon knows nothing about.
In my opinion, the system feels like it was designed to be something else, and then all this functionality was tacked on later. Coincidentally, that is exactly the case with Mesos. I think there is a lot of room for an improved user experience that these tools will struggle to provide.
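For anyone curious, here's a rough sketch of how you could surface those orphans yourself. It assumes the stock HTTP endpoints (/master/state.json on the Mesos master and /v2/tasks on Marathon) and that Marathon registers under the framework name "marathon"; treat it as illustrative rather than canonical, and note the hostnames are made up.

    # Hypothetical sketch: diff the Marathon tasks Mesos knows about against
    # the tasks Marathon itself reports, to surface "orphans".
    import requests

    MESOS_MASTER = "http://mesos-master.example.com:5050"  # placeholder
    MARATHON = "http://marathon.example.com:8080"          # placeholder

    def mesos_task_ids():
        state = requests.get(MESOS_MASTER + "/master/state.json").json()
        ids = set()
        for framework in state.get("frameworks", []):
            if framework.get("name") == "marathon":
                for task in framework.get("tasks", []):
                    ids.add(task["id"])
        return ids

    def marathon_task_ids():
        resp = requests.get(MARATHON + "/v2/tasks",
                            headers={"Accept": "application/json"})
        return {task["id"] for task in resp.json().get("tasks", [])}

    orphans = mesos_task_ids() - marathon_task_ids()
    for task_id in sorted(orphans):
        print("orphaned task:", task_id)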
Hi there, Marathon committer here. If you had a bad experience with the 0.7.x series, I apologize for that. We've made a serious effort to fix the consistency issues, and that work has paid off. 0.8.0 is current, and we tagged 0.8.1-RC1 just yesterday. I encourage you to give it another try! As mentioned in the other comments, we (Mesosphere) are actively working to provide alternatives to ZooKeeper, such as etcd. The reality is that ZK will stick around as a stable option for people, as the first >1.0 etcd release was tagged only two months ago.
Yep. In my experience, I've seen that the issue isn't with Mesos the Apache project, but rather with Marathon and Chronos, which are the applications built by Mesosphere.
There's a big naming and marketing confusion at the moment, with the open-source project and the VC-funded startup having such similar names. It's unfortunately giving the Mesos project a bad reputation.
Curious why there actually isn't a trademark issue here...?
Because Mesosphere (the VC backed startup) hired the creator of Mesos (the Apache project), and is now trying to capitalize on the open source project as one of their "products".
Mesosphere is arguably the largest contributor to Mesos, certainly on par with Twitter, especially when you consider all the surrounding ecosystem. The company also secured permission from the Apache Foundation to use the trademark when it was founded. It's good for the open-source ecosystem to have companies productize and support projects, particularly when they're plowing millions of dollars back into the open source.
FWIW, I've had much more success with HubSpot's Singularity than Marathon/Chronos. (I use Marathon to bootstrap Singularity as the "init system", because it had HA before Singularity, but that's the extent of its interaction with my system.)
That's interesting. I've also noticed that one of the big selling points of the Mesosphere ecosystem is that it is used at scale by Twitter. But AFAIK Twitter developed their own framework for long-running tasks (http://aurora.incubator.apache.org/). So they are using Mesos (the open-source project), maybe Chronos, and their own equivalent of Marathon. That can be somewhat misleading to newcomers.
To be clear, Twitter runs Aurora which is open source and has proven itself to be stable, battle-tested, and scalable; development on Aurora began in 2010, and over the years Mesos and Aurora have evolved together. Aurora has cron capabilities built into it, and Twitter does not run Marathon or Chronos.
Marathon is used in production at our company on > 1,000 servers. 0.7.x had some stability problems, but 0.8 has been terrific. Especially given that the project seems to be less than two years old (according to GitHub), I find it to be really stable. A lot of canary-style deploy features were added, and they're more advanced and scalable than anything else we tried.
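To give a flavor of those deploy features: every Marathon app definition can carry an upgradeStrategy, which controls how much healthy capacity is preserved while instances are rolled. A minimal sketch against the /v2/apps endpoint (the host, app id, and numbers here are made up):

    # Hypothetical sketch: post an app with an upgradeStrategy to Marathon.
    # Field names follow the Marathon 0.8 REST API; values are placeholders.
    import requests

    app = {
        "id": "/example/web",
        "cmd": "python -m SimpleHTTPServer $PORT0",
        "cpus": 0.25,
        "mem": 128,
        "instances": 4,
        # Keep at least half of the instances healthy during a rolling upgrade.
        "upgradeStrategy": {"minimumHealthCapacity": 0.5},
    }

    resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
    resp.raise_for_status()
    print(resp.json())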
I'm a bit concerned about the network layer (or, rather, lack thereof). I've looked at one or two of the "overlay network" approaches (that essentially have a container act as a router/tunneller between hosts), and wish there was something in Swarm that let me do basic port-mapping/load balancing across containers on multiple hosts (preferably with auto-scaling) without funkiness and undue overhead.
AFAIK it allows you to replace ZK with etcd, which makes it a lot easier to run this on top of CoreOS, since a working etcd cluster is a given in a working CoreOS cluster.
Really? I've found it a lot easier to manage than etcd. The fact that you manually specify all of your nodes in the config file removes a whole host of errors you can create in etcd with its fancy discovery stuff.
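For reference, the entire membership story in ZK is a few static lines in zoo.cfg, identical on every node (hostnames here are placeholders); there's no discovery protocol to get wrong:

    # zoo.cfg -- static ensemble definition, the same on every node
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

Each node additionally gets a myid file in dataDir containing its own server number, and that's it.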
What's your set up? How many zookeeper nodes do you run? What problems have you run into?
Many of the docs on etcd promote the use of the discovery service[1] as a convenient way of bootstrapping an etcd cluster -- really useful when you don't know the IP addresses of each cluster member up front. The discovery bootstrap method is also great for demos and testing environments, but as you correctly highlight this is not the ideal way to run a production setup.
With the release of etcd 2.0, I strongly recommend using the static bootstrap[2] method for provisioning an etcd cluster. The static bootstrap method provides the key documentation clues that help you reason about an etcd installation.
Finally, etcd 2.0 introduces support for bootstrapping using DNS[3], which provides the convenience of the discovery bootstrap, and the explicitness of the static bootstrap.
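For anyone reading along, a static bootstrap just means handing every member the full cluster list at start time. Roughly (the names, addresses, and token here are placeholders):

    etcd --name infra0 \
      --initial-advertise-peer-urls http://10.0.1.10:2380 \
      --listen-peer-urls http://10.0.1.10:2380 \
      --listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
      --advertise-client-urls http://10.0.1.10:2379 \
      --initial-cluster-token etcd-cluster-1 \
      --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
      --initial-cluster-state new

Every member gets the same --initial-cluster value; only its own --name and advertise URLs change.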
We would see increasing latency interacting with ZooKeeper over time until eventually it would fail completely. The log messages when this happened (either latency or failure) were extremely unhelpful. In fact, I found ZooKeeper's server-side logging downright terrible.
We wound up proactively restarting the ZK cluster regularly, which improved stability.
Granted, it was our own software written to use it, and we suspect there were problems with the way it was written. It was easier for us to just rip it out than debug, however. I find it overly complicated to write against given the need for thick clients.
"ZooKeeper provides ephemeral nodes which are K/V entries that are removed when a client disconnects. These are more sophisticated than a heartbeat system, but also have inherent scalability issues and add client side complexity. All clients must maintain active connections to the ZooKeeper servers, and perform keep-alives. Additionally, this requires "thick clients", which are difficult to write and often result in difficult to debug issues."
Edit: in retrospect our problems might have been solved by turning syncing off, as described here:
But you can figure out from the above how scalable Zookeeper is... not very. We run in physical datacenters and I certainly wouldn't be thrilled about building out snowflake RAID systems just for our ZK clusters (we generally try to use whitebox commodity hardware and we want individual nodes to be as disposable as possible).
It's ironic that ZK requires such consistency when the goals of Mesos are exactly the opposite.
> It's ironic that ZK requires such consistency when the goals of Mesos are exactly the opposite.
Is it ironic? Seems to me that if you want to have a distributed cluster that can deal with worker failure and still be useful, you need to rely on something to durably maintain your state at a lower level.
Well, you've described how zk doesn't meet the needs of your software as written, but I'm not sure you've established why Mesos would be better off without it...
And yet somehow they seem to have gotten it to work. I run a few Zookeeper-based systems where ZK just seems to work and I don't have to look at logging statements, either. I've never dealt with Zookeeper downtime that wasn't Amazon's fault.
So you are having ZK downtime? Such a system shouldn't ever really go down, even with Amazon problems. We strive for software in which individual nodes can be flaky or down without impacting the uptime of the whole cluster. ZK doesn't fit that criterion, as you have just stated. I prefer systems that pair gossip-based membership with Raft consensus, for example, which is why I like Consul so much. (We're also big Riak users.)
No, individual nodes have gone down due to hardware problems. The system stays up. Jepsen has given ZooKeeper probably the most ringing endorsement of anything it's tested, so I don't know what you're on about.
If you're in NYC, there will be a talk tonight on Building & Deploying Applications to Apache Mesos at the Digital Ocean meetup: http://eventhunt.io/node/5984.