A Decade of Container Control at Google (nextplatform.com)
113 points by jonbaer on March 27, 2016 | 55 comments



I believe this article is poorly researched and very pro-Google slanted. The author, Timothy Prickett Morgan, falsely gives the impression that Google has been driving in one direction for ten years, that they did most of the work on containers in Linux, and that everyone else can thank them for their hard work.

I was following the container automation area very closely a couple of years ago, including some personal email interactions with Wilkes and the primary LXC/cgroup userspace/kernelspace authors (none of whom worked at Google). My distinct impression, as others have noted, is that Kubernetes was not an in-house tool that happened to go public, but rather a product built after many successive in-house systems specifically with a view to sharing it with the public, probably partly in response to the public perception and popularization of LXC/Docker, and to Amazon EC2's rapid success, which Google's management would presumably like to replicate.

Bad journalism.


Nobody on the Kubernetes project ever claimed that it was an internal project gone public. It was absolutely, 100% designed to be open-sourced as a derivative of Borg and Omega. There's no sleight of hand here.

That said, we had been discussing a Borg-as-a-Service for quite some time BEFORE Docker became popular. Docker's popularity was undeniably a catalyst in getting Kubernetes built.


Wikipedia seems to indicate that the people behind cgroups did in fact work at Google https://en.m.wikipedia.org/wiki/Cgroups


This is true. Early cgroups came primarily from Google. Namespaces did not. We have also been very involved in the development of various cgroup controllers, and very vocal about which changes would make our style of isolation (which has suddenly been popularized by Docker) more robust.


Looking at /usr/src/linux/Documentation/cgroups/* it seems that while the renamed cgroups kernel functionality by itself did have initial Google authorship, the "actually make it useful" later developments of namespaces and LXC (i.e. the first userspace component) did not come from Google, and by the time Google was working on cgroups, code from SGI and other sources (such as cpusets) had already been merged. Precisely what Google wrote would be an interesting question; in any event it's significantly less than "all of the above" as the article claims. By the time I started looking at it in 2009, development was dominated by IBM, who apparently funded LXC userspace development because they felt the new features would be good for their big mainframes.


That's mostly accurate. Lots of people (at Google, IBM, SWSoft and elsewhere) had been working on approaches to get resource isolation into the Linux kernel since around 2000, but none had achieved general support. The main debate was around the abstractions to be used for defining/controlling the sets of processes being isolated and the isolation parameters, rather than the actual mechanisms used for isolation.

Around about the same time (~2005?) SGI got cpusets merged into the kernel; this was initially just intended for pinning groups of processes on to specific NUMA nodes on big-iron systems. At the suggestion of akpm we started using it internally at Google to do coarse-grained CPU and memory isolation, by making use of the fake-NUMA emulation support to split the memory on our servers into chunks of ~128MB each and pinning each job to some number of fake nodes. This worked surprisingly well, but required painfully-complex userspace support to keep track of memory usage of each job, and juggle memory node assignments (particularly since we wanted to be able to overcommit machines, so we had to dynamically shift nodes around from low-priority jobs to high-priority jobs in response to demand).
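For readers who haven't seen it, a minimal sketch of what that fake-NUMA trick looks like through the (v1) cpuset cgroup interface; the group path, CPU/node ranges, and PID are invented for illustration, and Google's actual userspace tooling was far more elaborate than this:

    // Pin a job to a few CPUs and "fake" NUMA memory nodes via the v1
    // cpuset interface. With fake-NUMA emulation each node is a ~128MB
    // chunk of RAM, so 16 nodes is roughly 2GB for this job.
    package main

    import (
        "os"
        "path/filepath"
    )

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

    func main() {
        // Hypothetical job group under the v1 cpuset hierarchy.
        group := "/sys/fs/cgroup/cpuset/job-1234"
        must(os.MkdirAll(group, 0755))

        // Give the job CPUs 0-3 and fake memory nodes 0-15.
        must(os.WriteFile(filepath.Join(group, "cpuset.cpus"), []byte("0-3"), 0644))
        must(os.WriteFile(filepath.Join(group, "cpuset.mems"), []byte("0-15"), 0644))

        // Move the job's main process into the cpuset (PID is invented).
        must(os.WriteFile(filepath.Join(group, "tasks"), []byte("4321"), 0644))

        // Overcommit then amounts to rewriting cpuset.mems on the fly,
        // shifting nodes from low-priority jobs to high-priority ones --
        // the painful userspace bookkeeping described above.
    }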

The cpuset API and abstractions turned out to fit the resource control problem pretty well, and they had already been merged into the kernel, which gave that API a kind of pre-approval compared to the other generic resource control approaches. So we worked on separating out the core process/group management code from cpusets, and adapting it to support multiple different subsystems, and multiple parallel hierarchies of groups. The original cpusets became just one subsystem that could be attached to cgroups (others included memory, CPU cycles, disk I/O slots, available TCP ports, etc). It turned out that this was an approach that everyone (different groups of resource-control enthusiasts, as well as Linux core maintainers) could get behind, and as a result Linux acquired a general-purpose resource control abstraction, and other folks (including some at Google) went to town on providing mechanisms for controlling specific resources.
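To make the resulting model concrete, here is a minimal sketch using the v1 cgroup filesystem interface: the same logical group created under two different controllers, each with its own knobs. The group name, PID, and limit values are invented; the controller file names are the standard v1 ones.

    package main

    import "os"

    func write(path, val string) {
        if err := os.WriteFile(path, []byte(val), 0644); err != nil {
            panic(err)
        }
    }

    func main() {
        pid := "4321" // some process to contain (invented)

        // Memory controller: cap the group at 512MB.
        os.MkdirAll("/sys/fs/cgroup/memory/demo", 0755)
        write("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes", "536870912")
        write("/sys/fs/cgroup/memory/demo/cgroup.procs", pid)

        // CPU controller: give the group a relative share of CPU time.
        os.MkdirAll("/sys/fs/cgroup/cpu/demo", 0755)
        write("/sys/fs/cgroup/cpu/demo/cpu.shares", "512")
        write("/sys/fs/cgroup/cpu/demo/cgroup.procs", pid)

        // The original cpusets became just another controller ("cpuset")
        // that can be attached in exactly the same way.
    }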

The namespace work was going on pretty much in parallel with this - it wasn't something that we were interested in since it was just added overhead from our point of view. The jobs we were running were fully aware that they were running in a shared environment (and mostly included a lot of Google core libraries that made dealing with the shared environment pretty straightforward) so we didn't need to give the impression that the job had a machine to itself. IP isolation would have been somewhat useful (and I think was later added in Kubernetes) but wasn't very practical to provide efficiently given Google's networking infrastructure at the time.

We weren't really interested in LXC since we had our own userspace components that had developed organically with our container support (and which as others have commented were so entwined with other bits of Google infrastructure that open-sourcing them wouldn't have been practical or very useful).


Container automation will be very important in the online postcard museum industry. It is good that you are vigilant.


Actually I got into the area for experimental large-scale video transcoding cluster design (2009-2010; $), then to enable microservice-based architectures for secure digital currency exchange systems (2011-2015; $). However, even in my postcard archive project (2016; interest) I am using it for scalable reverse image search, enhanced workflow (CI), and secure image ingestion, as well as standard web and database service segregation.


"Now, the top Google techies who work on Borg, Omega, and Kubernetes have published a less technical paper in ACM Queue that describes some of the lessons learned from a decade of container management,"

Refers to this post from three weeks ago: https://news.ycombinator.com/item?id=11216020


An interesting article, though I was surprised by the assertion that Google regrets Hadoop and Mesos being developed and controlled by groups outside of Google. At least for the last decade, I got the impression that Google deliberately didn't open source their code because they saw it as a major competitive advantage. Perhaps that is changing now that data is the new competitive advantage?


In the case of Kubernetes, I think Google's motivation is really simple: AWS has a huge lead but it is also a proprietary platform.

If they can commoditize the platform from underneath Amazon, they believe they can build out better/faster/cheaper cloud infrastructure.


There are multiple reasons not to open source code. One is the idea of the proprietary code as a competitive advantage. But even in the last decade there were definitely people inside Google who were arguing that the internal infrastructure, even if the best in the world at that time, was actually becoming a liability. Some obvious arguments are:

1 - You don't get to share the burden of the maintenance with other users of the code.

2 - It's guaranteed that after every acquisition the code-base of the bought company will be hard to integrate; it's been written against the publicly available infrastructure, which is of course wildly incompatible with the internal one.

3 - New hires will often already know how the open source tools work, while they'll have to learn all of the proprietary equivalents from scratch.

4 - Engineers will generally prefer to work on projects where the knowledge is transferable elsewhere. Staffing is easier if you can say with a straight face that people will not just be working with a big proprietary ball of mud, but at least partially with publicly available technology.

The other reason I can think of is that open sourcing parts of existing code bases is a lot harder than it looks [0]. It gets harder if the code is in a monorepo [1]. It gets harder if there's a strong culture of library reuse. The further you get away from the center of the dependency graph, the harder it gets. Could you open source Borg without open sourcing Chubby? GFS? Babysitter? (My recollection, which might be faulty, is that even long after Babysitter was otherwise dead, it was still used for bootstrapping Borg clusters).

And since the goal has to be to create a viable project rather than just a code dump, there has to be a way for outside contributors to make changes. That needs to be possible for both the project itself, and all of its dependencies. Either you need realtime synchronization between the internal and public repos, or you have to make the public one the repository of record, and pretty much give up on the concept of the monorepo. Which would have been a really hard sell at least at last decade's Google.

It's a lot easier to do this for green-field projects, with plans for releasing as open source being taken into account right from the start.

[0] I wrote more on that subject a year ago; https://www.snellman.net/blog/archive/2015-03-19-cant-even-t...

[1] http://danluu.com/monorepo/


I've been joking that Kubernetes and gRPC are really to seed the startup ecosystem to make companies easier to acquire, interesting to read that here.


Even better: a huge investment in their cloud and a desire to get more people to switch. The response to best-of-class technology has always been cries of lock-in. Open-sourcing these things completely removes that risk.

Kubernetes: our container services are best; run your containers here, but if you don't want to anymore, run them anywhere using exactly the same api.

gRPC: our pub/sub is the best; run your rpc here, but if you don't want to anymore, run them anywhere using the same api.

tensorflow: our managed learning api is the best; run your learning here, but if you don't want to anymore, run it anywhere with the same api.

etc etc with dataflow and apache beam. These are best in class services, and except for bigquery (which has been a big driver of some of the big-name moves recently), they're using this model to take away the risk of transition.

It's pretty smart.


Hadoop is moving the way Linux did: fast, and with a large number of companies pushing it together. As big as Google is, it wouldn't be able to dominate Linux, just as it can't now dominate Hadoop. My interpretation of the author is that it would have been better for Google to have something whose development they can control, rather than have somebody outside the tent pushing it in ways that don't suit them. Look at IBM and Spark: they saw a long-term threat/opportunity and moved quickly to negate it.


Disclaimer: I work at Google on Kubernetes.

To be clear, this is exactly why we donated 100% of Kubernetes to the Cloud Native Computing Foundation. We believe very strongly that the reason Linux was successful was that no single company controlled it. Though we have opinions about what makes container management successful, we are but a minority of the more than 650 contributors to the platform.


For a decade of container control, cAdvisor is really not a good product. It is fundamentally flawed and seriously lacking in documentation. This is an area where Google misses: they leave too many projects dangling and half-working. Luckily there are better alternatives on the market.


The problem with extrapolating like you just did is that Kubernetes has not been in use at Google for a decade, and it is not related to the container infrastructure at Google. When you are evaluating Kubernetes it's important to know that no major project at Google has used Kubernetes for anything, ever. Kubernetes is a weird project where Google is trying to give the public a large-scale container infrastructure, but not _their_ large-scale container infrastructure.


As cited above, we simply CAN'T opensource Borg. It's enormous, and it is deeply, DEEPLY entangled with millions of LoC of Google code. Nobody could untangle that. And even if we DID untangle it, it's alien technology. It does not make any attempt to meet people where they are. It does not focus on open standards or simple solutions to simple problems. On top of that, it's got 10+ years of semi-organic growth in it. There are a lot of mistakes that have been made that we simply have to live with internally. Also, it's C++, for which there is approximately ZERO opensource community.

We made a very strategic decision to rebuild it. It embodies many lessons from Borg and Omega (both things we got right and things we botched). It is implemented in an easier-to-approach language (Go) which has an active OSS community. It specifically focuses on "legacy apps" (everything written up to and including today) and open standards (HTTP, JSON, REST).

I've never been shy about my opinion that I do hope to supplant Borg one day, but that day is necessarily years away. Of course no major project has ever used it; the whole thing didn't exist two years ago.


Whilst I appreciate that you're effectively competing with Borg internally, having been a heavy user of Borg for many years I'm not sure why you think it doesn't focus on simple solutions to simple problems, or that it doesn't meet people where they are. Borg always impressed me as one of the best thought out pieces of Google infrastructure: bringing up simple jobs was in fact quite simple, or so it seemed to me, but it also had sufficient power to do far more complex tasks as well.

Much of the organic growth, as you put it, can also be described another way: as an accumulation of useful features and optimisations.

The language issue was addressed by another response. To claim there's no open source C++ community which is why it's written in Go is just bizarre. There's absolutely a thriving open source C++ community, but if having the biggest open source community was the driving factor in picking the implementation language then I guess you should have picked Java.


How would you run Apache + PHP + MySQL on Borg? Hint: you can't. Not without HUGE difficulty, anyway. Nobody does it. Part of Kubernetes' "meet people where they are" mindset is that we simply cannot ask people to rewrite their apps.

Truth is, a LOT of people don't write code. They write content and use pre-built code (think WordPress). Borg simply cannot accommodate that very well. It's simple as long as you control things from soup to nuts.

Yes, some "organic growth" was useful features. But a lot of it was useless features, or features that are now obsolete but can't be removed because someone somewhere is using them, and probably doesn't have enough time to re-test without the feature (true story).


I can't reply to your last comment directly, but: "port 80 request denied". And where do you store your MySQL data? The point being that nothing is impossible, just prohibitively hard.


There are any number of places you can store your MySQL data in a container world.

The first and best is to make all your MySQL IO go to an external cluster filesystem or other remote IO system. Because MySQL supports pluggable storage engines, you could write a Hadoop FS storage manager. This has the advantage that if a single MySQL instance is blown away, all the committed data is available for a new replica to start reading. I don't know if Docker or other container systems support automagically turning local IO calls into remote IO calls (or whether that really makes sense in a MySQL environment), but that's a similar approach. Condor supported this through its remote libc interface.

The second is to use some sort of per-task persistent local storage. In the Docker world, this would be a mounted volume: the Docker host would manage the storage, and new containers would remount it. You could have a process that restores the local storage from a backup, and then use replication from a master to catch up (a rough sketch of the volume idea follows the third option below).

The third would be to have some sort of per-container persistent storage (the Borg paper calls this an "alloc").
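Regarding the second option (mounted volumes), here is roughly what that means at the syscall level: a bind mount makes a host directory, which outlives any one container, appear at the path the database expects. The paths are invented, and real container runtimes do this while setting up the container's mount namespace rather than as a standalone program.

    package main

    import "syscall"

    func main() {
        // Persistent directory managed by the host; outlives any one container.
        hostDir := "/srv/volumes/mysql-data"
        // Where MySQL inside the container expects its data to live.
        containerDir := "/containers/db-1/rootfs/var/lib/mysql"

        if err := syscall.Mount(hostDir, containerDir, "", syscall.MS_BIND, ""); err != nil {
            panic(err)
        }
        // A replacement container gets the same hostDir mounted back in and
        // then catches up via replication, as described above.
    }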

For the server, most people wouldn't have Apache bind port 80 inside the container: you'd bind another port and use some other mechanism, such as load balancing, to expose the web server on a standard port.
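For what it's worth, in Kubernetes terms (not Borg) that pattern looks roughly like the sketch below. It is written against present-day client-go API types, which postdate this thread; the names, selector, and port numbers are invented.

    package main

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // webService exposes port 80 externally while Apache binds 8080 in-container.
    func webService() *corev1.Service {
        return &corev1.Service{
            ObjectMeta: metav1.ObjectMeta{Name: "web"},
            Spec: corev1.ServiceSpec{
                Selector: map[string]string{"app": "web"},
                Type:     corev1.ServiceTypeLoadBalancer, // the load balancer owns the standard port
                Ports: []corev1.ServicePort{{
                    Port:       80,                   // what clients connect to
                    TargetPort: intstr.FromInt(8080), // what the container actually binds
                }},
            },
        }
    }

    func main() { _ = webService() }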


The question had an implied "... in Borg" suffix. The point was to demonstrate that Borg does not have "legacy" affordances like durable storage (well, not in the same way as MySQL would need).


We are both Google employees. I used to be a MySQL SRE, with experience in this.


You're still making my point. Using something like MySQL on Borg is not trivial.


Put together the packages, request a fixed port, disable the health checks (unless I had a file in the Apache root with the right name), start it up?

I don't think Borg imposes all that many requirements on jobs, really, and the few it does can be disabled. Or at least could.

But I guess we're probably wandering out of the area covered by the papers now.


>> Also, it's C++, for which there is approximately ZERO opensource community.

Rubbish, and saying so undermines the rest of your points.


Wonderful rebuttal. Proof? There are some successful projects, but that is not a community. There are some libs, but that is not a community.

The Go community is vibrant and growing. Go is an easy language to learn (and I say that as someone who LIKES the power of C++) and it is not a total joke to ask people who report bugs to jump in and try to fix them. C++ is simply NOT approachable by mere mortals, and would have made for a very different community and a much slower pace.

And I say that as someone who detests many facets of Go - but it's just better at Getting Things Done than C++.


>> Also, it's C++, for which there is approximately ZERO opensource community.

> Rubbish, and saying so undermines the rest of your points.

Hardly rubbish. Although there are open-source projects which use C++, I and many others avoid them like the plague.

I think he meant 'approximately ZERO' in the Spolsky sense, which is 'sure, there are some, but in the grand scheme of things they're indistinguishable from ε'.


I don't strongly disagree with any of that. I was only pointing out that it's wrong to attribute a decade of history to something like cAdvisor, which is brand new and does not draw on anything more than lessons learned from Google production.

I also don't blame people for being confused about Google's container infrastructure. Google has issued blog posts in the past that were misleading (at best) about the relationship between Omega, Borg, and Kubernetes.


Hmm, what was misleading? That certainly was never the intent.

A decade of history led to the knowledge that a particular style of monitoring was needed. That knowledge led to cAdvisor. Is it perfect? Of course not, but it fills a need and is directly derived from Google's experience. I fail to see how that is misattribution, personally.


Disclaimer: I work at Google on Kubernetes.

Tim knows more about this than almost anyone, but I will add one point - we have used Kubernetes for significant internal projects, and plan to continue expanding its usage over time.

To Tim's point, though, it'll take time. The thought that you could move literally millions of lines of code and applications over to a new platform in just 10 months (the amount of time that Kubernetes has been GA) is... optimistic.


When did the internal projects start using Kubernetes? I heard that its uptake within Google has been very anemic.


Some internal projects started evaluating Kubernetes well before 1.0. We don't talk about them much because they are, well, internal.

This is NOT in competition with Borg, though. Not yet.


I tend to agree. It's another case of: here is an API to access our products so you can play around with it. The scalable, powerful scheduler, however, is not included. There are no other batteries that fit it, apart from GCE. Kind of like Blaze and TensorFlow: nice and shiny, but hollow (missing the good distributed filling).


FWIW, you can plug Mesos in as a Kubernetes scheduler. From the OSS world, that is about as heavy-duty as you're going to get, and it is proven on quite large (although not Google-large) 10k+ node clusters.


The internal google scheduling system is simply not appropriate for anyone other than Google to use. A huge number of machines, all alike, with a huge number of trusted binaries that can be multiplexed onto these machines without fear that they're going to break the jail and cause havoc (since there is a solid trail from source to running artifact). It's just not the reality that other companies exist in.


The scheduler is actually one relatively simple piece of the whole picture. If scheduling were the real pain point, Kubernetes would have just addressed that. The fact that Kubernetes did not lead with the scheduler means it is actually not a big problem, at least not the biggest one.


That is not an argument against releasing the code. Why would Google assign itself as gatekeeper? I personally could use this code on supercomputers now. I don't work for a supercomputing company; my use case is academic work and computational science. I absolutely have thousands of huge machines that I multiplex trusted binaries to -- and scheduling is not a trivial problem.

So what's the real reason nobody gets to see this code?


The code is literally millions of LoC, all of which needs to be audited for stuff we can't release for whatever reasons. All that code is built upon layer after layer of Google-internal stuff. Open-sourcing Borg means open-sourcing Chubby, our internal (older) equivalent of gRPC, hundreds of libs, monitoring infrastructure, etc. The net result is O(50M) LoC. And when someone sends us a patch, then what? The cost of doing it is simply prohibitive. I'd love to do it; it's just not practical and has no RoI.


That's a much more sensible reason! If the code is truly Google-specific, then I agree. It sounded to me like the code was not released because nobody else has a lot of computers, which I found odd.

Thank you for the details!


Borg jobs are not trusted! The system sandboxes them, prevents them from spying on other jobs' data files, and assumes they might abuse system resources in arbitrary ways. The days when all Google machines trusted all Google employees are long in the past.


As for "no other batteries that fit it" - I am confused. We do run on AWS, OpenStack, and other cloud providers, as well as on bare metal. It's not like nobody is using this thing. In fact, just a finger in the wind, I'd guess the number of people using it outside of Google Cloud is several times more than people using it on Google Cloud.


"Powerful scheduling" is such a tiny piece of what Kubernetes does, it's funny. Yeah, Borg's scheduler is faster and more scalable and has more features. It also has 12 years of optimizations under its belt. I have 100% confidence that, should Kubernetes be around 10 years from now,this will be a non-issue.


Sure, though support for multiple schedulers (for different types of workloads) would be great for increasing cluster utilization, which is one selling point of such systems. I understand that Kubernetes is developed in the open and with the community, but the heavy marketing as the "solution you can use now, directly from the creators of Omega" makes some people think it's ready, perfect, and will fix all of their problems, and that's simply not true.


We have multi-scheduler support in v1.2 :)


Great; the docs are a bit lacking there, so I'll have a look at the code :-)


cAdvisor is pretty good at what it does. At its core, it's a metrics-collection daemon that iterates over cgroup controllers and sends the resulting metrics to a backend server for centralized alerting and long-term retention. It also does other helpful things with the Docker API, such as extracting container labels so that you can rightsize resource allocations based on workload metadata. What's not to like?
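To make that description concrete, here is the general shape of such a collector. This is not cAdvisor's actual code; the cgroup path, counter, polling interval, and backend URL are all invented for illustration.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "os"
        "path/filepath"
        "strings"
        "time"
    )

    func main() {
        root := "/sys/fs/cgroup/memory" // one controller, cgroup v1 layout
        for {
            var buf bytes.Buffer
            // Walk the hierarchy and sample one counter per group.
            filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                if err != nil || !info.IsDir() {
                    return nil
                }
                usage, err := os.ReadFile(filepath.Join(path, "memory.usage_in_bytes"))
                if err != nil {
                    return nil // skip groups without the counter
                }
                fmt.Fprintf(&buf, "%s %s %d\n", path, strings.TrimSpace(string(usage)), time.Now().Unix())
                return nil
            })
            // Hypothetical backend; cAdvisor itself supports several real ones.
            http.Post("http://metrics.example.com/ingest", "text/plain", &buf)
            time.Sleep(30 * time.Second)
        }
    }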


How many companies actually need the scale of Google? When you are Google, you always need more. When you are a smaller company, understanding how to control costs is more important.

Capacity planning and the actual design of a system are more important for most companies.

I have seen several companies use 3 to 4 times the hardware they need because they want to do containers or the private cloud thing.


Can you cite some examples? Did these companies decide to run containers on VMs or just bare metal? In the latter case, they shouldn't need 3 to 4 times more hardware. Your point on capacity planning is well taken. Capacity will always be finite, and it is especially critical to put an upper bound on it if you are doing container-based services with multi-tenancy.


Because Docker lacks security, we ran Docker in a VM.

so....

1 VM to run 1 Docker container to run 1 JVM... probably 10 VMs per host... instead of 20 JVMs per host

...and we added Consul + HAProxy and a few other pieces we didn't need (think SRV records)... which added yet more hardware


> 1 VM to run 1 Docker container to run 1 JVM... probably 10 VMs per host... instead of 20 JVMs per host

That doesn't make sense; the Linux overhead is not enough to make it that much less dense. You get better HA by running 20 VMs instead of 20 JVMs on one machine, because a machine crash takes out 20 apps instead of 1. I know the physical host can crash and take out all 20 VMs, but you have that problem in either scenario.


I have yet to work at a place where making sense was the driving factor for technology choices :)


1) Will enterprises eventually shift to a DCOS?

2) Is it even worth it or necessary? AFAIK, even the biggest enterprises' infrastructure needs are considerably less than those of internet companies.



