cgroups were written at Google, and have been used internally for a very long time; they provide "container"-like limits on resource usage for a group of processes.
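For a flavor of what that looks like at the lowest level, here's a minimal sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and enough privileges to write there (the group name "demo" and the 256 MiB cap are made up):

    // Create a cgroup, cap its memory, and move the current process into it.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        group := "/sys/fs/cgroup/demo" // hypothetical group name
        if err := os.MkdirAll(group, 0o755); err != nil {
            panic(err)
        }
        // Every process in the group shares this 256 MiB memory ceiling.
        if err := os.WriteFile(filepath.Join(group, "memory.max"), []byte("268435456"), 0o644); err != nil {
            panic(err)
        }
        // Adding our own PID puts this process (and its future children) under the limit.
        pid := []byte(fmt.Sprintf("%d", os.Getpid()))
        if err := os.WriteFile(filepath.Join(group, "cgroup.procs"), pid, 0o644); err != nil {
            panic(err)
        }
        fmt.Println("now running inside", group)
    }

Container runtimes do roughly this, plus namespaces for isolation, on your behalf.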
I assume that Google isn't using Docker internally for production services, but don't know for sure (and I assume anyone who does know for sure can't tell you).
It also just doesn't make sense for Google to use Docker (or even Kubernetes) for their core infra. They've been the foremost leader in distributing containerized applications across data centers. Whatever they've already built is almost certainly more battle-tested and more customized to their needs than anything public that's based on their concepts.
I don't know where you heard that, but it's simply not feasible to run k8s at the scale they run Borg. Not to mention the hundreds, if not thousands, of features it's missing.
There’s a meme out there, helpfully nudged along by Google, that Kubernetes is Borg “done right” and the successor. It’s even mentioned in this article. Neither of those things is true. Not even remotely. Please pass along to everyone to stop repeating the meme, because it distracts from Kubernetes’ true purpose, which is to lock people into GKE and force competitors to ship a Kubernetes runtime to compete. It’s partly the OpenStack playbook: if everybody runs the same platform, competition inherently drifts toward other aspects of the business, such as customer support. Seriously, am I the only one who noticed the timing of Kubernetes and Google Cloud? Really? But Google shipped it, so now it has an ecosystem, and zealots who force entire teams onto it for zero upside and nonzero overhead, while the system that actually looks like Borg, HashiCorp Nomad (far superior to k8s, in my view), twiddles its thumbs with pretty much no mindshare.
For one, if Kubernetes were the successor to Borg, they wouldn’t have hobbled it architecturally as much as they have by marrying it to both Docker (kinda) and etcd, and deciding in the beginning to do every cluster mutation via external consensus in etcd, because you know, that’s a great idea and a classic Google design. Remember when Kubernetes pushed all job state changes through consensus and a flapping job could OOM etcd? I do too. Someone cynical could argue its fundamental architectural limitations are intentional. (I would argue simply that it didn’t have Paul Menage and most of the other names on the Borg paper working on it, to my knowledge.) I can hear keyboards getting angry, ready to yammer about how it just works. Not at seven-digit machine scale, it doesn’t, and never will. I’m happy it works for you. For a large fleet, it’s a toy, which I’ll revisit in another point. Everyone I am aware of at Borg or Mesos scale has ruled out or failed with Kubernetes. No, really.
Relatedly, if Kubernetes were the successor to Borg, it’d be in C++. It just would be, and that’s not a language flame war. Ever wonder what percentage of systems at Google are C++? Ever wonder what that number does when you qualify “infrastructural”?
For two, Google containers aren’t an entire operating system, unlike Docker containers (without crazy gymnastics, anyway). Seriously, this paragraph could be an essay. To paraphrase Jeremy Clarkson, Docker looks like a proper container system described over a blurry fax. Maybe we do need seventy probably-not-deduplicated copies of getty on every machine, and I’m yelling at clouds. I doubt it, and it reeks of “disk is cheap, fuck it.”
For three, Kubernetes is several orders of magnitude behind Borg and Omega, if that’s still alive, in terms of scheduling performance and maximum “cluster” size (I quote cluster because Google identifies a Borg unit as a cell, and a cluster means something else). This is not fixable in Kubernetes, in my reasonably informed opinion, without doing consensus differently. To my point, Borg does consensus and cluster state much differently, and you know what? It’s fine. Anybody who has used fauxmaster will back me up on that, and John Wilkes even said yup, every time we hit a limit we manage to double it with no end in sight. Why would that suddenly change? etcd remains Kubernetes’ Achilles heel, and this is why messaging around Kubernetes has gravitated toward smaller, targeted clusters. Bonus: if you find a bug in etcd make sure it loudly affects Kubernetes so it gets properly prioritized. Double bonus: someone was brave enough to suggest Consul in #1957 and children. Go read and sigh.
For four, when’s the last time you ran a 10,000+ node MapReduce on Kubernetes? Surprise, the underpinnings of Borg handle both batch and interactive with the exact same control plane, which is where the billions-of-containers-a-day number they occasionally talk about comes from. I mean, several JARs of glue might get you to Hadoop scheduling via Kubernetes, but that’s a much different animal from a platform that thinks in terms of jobs with different interactivity requirements.
For five, half of Borg is the shit around it. Borg works because everything behaves the same. Everything is a Web server. Everything exposes /statusz. Everything builds and monitors the same way. Everything speaks the same RPC to each other. All of this is implemented by forcing production systems at Google to choose from four (as of my tenure) languages which are well tended and manicured by hundreds of people. Google has a larger C++ standard library team than many startups have engineers. Borg works because apps let it work. They’re not black boxes. Unlike Kubernetes.
Which brings me to point the sixth, which is that the reason you haven’t seen open source Borg is (a) they’re not moving off it, like, ever (I’d bet my season tickets on it), because significant parts of every production system and tool would have to change and (b) they can’t unravel Borg and the rest of the google3 codebase, because it’s so fundamental to the Google ecosystem and half of Borg’s magic is wrapped up in other projects within Google which they aren’t keen to show you.
Link to this answer next time anyone gets tempted to relay what they’ve heard about Borg and Kubernetes, please. For years I’ve watched this tale evolve until it’s barely recognizable as factual. Saying Kubernetes is Borg’s proper successor not only drastically insults Borg and the hundreds (thousands?) who have worked on it, it also calls to mind thinking of a cotton candy machine as the successor to the automobile. They’re that different.
There's a lot of FUD in here, like suggesting that Google wouldn't use something written in Go for this purpose (lol), suggesting an open source platform that can span multiple clouds is an attempt to lock people into GKE (lol 2x), and suggesting that Kubernetes is "married" to Docker (CNI? CRI?) or isn't extensible (the entire gRPC/REST API? custom resources? device plugins?).
People use Kubernetes for the ecosystem, the "shit around it." They even formed a foundation for the "shit around it" called the CNCF. And if you think all of that stuff is just for GKE lockin, maybe take a look at gRPC sometime.
I don't disagree that Kubernetes is not positioned to be a replacement for Borg, at least not for many years. It has a long way to go, and the Storage API is indeed a sore spot for the Kubernetes project in general. But this realization is a very long walk from "Kubernetes was designed to lock people into GKE."
> I don't disagree that Kubernetes is not positioned to be a replacement for Borg
Good. Because that was my point, but the verbiage “don’t disagree” says a lot. We agree, except for the timeline. We will all be dead before Borg is replaced with Kubernetes. You can take that to the bank.
Kinda weird to fire up a throwaway, presumably to conceal your Google credentials, then attack a Xoogler who used to work on Borg SRE (alongside sjh under Peter Dahl, and long enough my NDA has long since lapsed) and has run Kubernetes since it was able to OOM as I described, for spreading FUD. The term FUD implies that I don’t know what I’m talking about and I’m making shit up, while I’m one of the few people, including presumably you, who can actually coherently comment on both.
It can only span multiple clouds now because other clouds had to ship Kubernetes. Remember the timeline: hello world, we made a container thing. Now we offer it as a service. Now Amazon does too. Oh, we are now multicloud. Your rebuttals are quite disingenuous, and casting them with a mocking aspersion doesn’t sell your point. It makes you come across as nothing like you intend.
Get back on your real username and stop being offended I criticized Kubernetes. I know I’m one of the few who does, but there are legitimate concerns, and sharing them toward a “why Borg isn’t going anywhere” point is a weird hill for you to die on.
> Kinda weird to fire up a throwaway, presumably to conceal
> your Google credentials, then attack a Xoogler who used to
> work on Borg SRE (alongside sjh under Peter Dahl, and long
> enough my NDA has long since lapsed)
That's an interesting way to phrase it. I also worked on Borg SRE with Seth under Peter (for six years), and I don't remember anyone else with the same initials as me being present during that time. Just in case I was forgetting someone, I checked the Borg paper (https://storage.googleapis.com/pub-tools-public-publication-...) for the SRE credits section -- I remember all of them, but none had your ... personality.
Luckily, the very first hit for a search of [[ jsmthrowaway site:news.ycombinator.com ]] is your old comment from 2013 (https://news.ycombinator.com/item?id=6750805), in which you discuss being terminated after two months due to failing a background check. That explains why I couldn't remember you, but it doesn't explain why you think two months of reading "Borg 101" tutorials has given you meaningful insight into which parts of the Kubernetes implementation are difficult to scale.
I found this post on the HN front page. I take it somewhat personally because you're using an account with my initials, claiming to be a member of the team I used to work on, and using it to pretend knowledge of a system you used for less time than a typical intern.
Fair enough, but you said a whole lot of other crap that is very much misleading in my opinion, and I was addressing all of that.
>It can only span multiple clouds now because other clouds had to ship Kubernetes
See, this is the kind of thing that makes me treat this as FUD rather than just criticism. I was running Kubernetes on AWS long before Amazon offered a service for it. If Kubernetes wasn't open source and instead each cloud provider came up with their own implementation, would that have been better?
>Get back on your real username
No thanks. Doesn't take a rocket scientist to figure out why.
>Do I think Kubernetes is great if I have to build a CloudFormation thing to spin up all of its infrastructure and run it myself,
Yes.
>or would I rather pay big company to do it?
Also yes. The two aren't mutually exclusive.
Come on, you've used Borg. You wanna volunteer to go back to just using machines and VMs by hand, or worse, wiring up a complex and unreadable (Ansible|SaltStack|Chef|Puppet) playbook to set everything up?
No. This is why in the early days I was indeed running Kubernetes on AWS, by hand. As the tooling improved it only got better. I honestly wanted to have alternative choices, but Docker continually missed the point with Swarm and I just gave up on it.
>Your rebuttals are honestly more misleading, in my opinion, than my points, because you’re personally wrapped up in it and that’s coming across.
> It can only span multiple clouds now because other clouds had to ship Kubernetes.
This isn't true. People were running open-source Kubernetes on AWS and Azure before either provider had a hosted Kubernetes service. In fact back when GKE was the only hosted Kubernetes service, more companies were running Kubernetes on non-Google platforms than on GCP (https://www.cncf.io/blog/2017/12/06/cloud-native-technologie...).
Thanks so much for the detailed clarification! The k8s-as-Borg-successor meme is even perpetuated in the Borg paper, so I guess that's why I repeated it :P
If I may ask, is it primarily just reliance on publicly-available infrastructural pieces that hobbles K8s in terms of scalability? i.e. that the problem is more about ecosystem than architecture, because the industry just doesn't have things like (or as "good" as) Stubby and Chubby, and Google's basically never going to open-source/reimplement those?
Stubby and Chubby are not related to Borg's scalability.
The reason Kubernetes scalability was originally not so great was that it simply wasn't prioritized. We were more concerned with building a feature set that would drive adoption (and making sure the system was stable). Only once Kubernetes began to have serious users did we start worrying about scalability. There have been a number of blog posts on the Kubernetes blog over the years about what we did to improve scalability, and how we measure it.
I'd encourage you to join the Kubernetes scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...) to learn more about this topic. The SIG is always interested in understanding people's scalability requirements, and improving Kubernetes scalability beyond the current 5000 node "limit." (I put that in quotes because there's no performance cliff, it's just the maximum number of nodes you can run today if you want Kubernetes to meet the Kubernetes performance SLOs given the workload in the scalability tests.)
In this thread there is a repeated meme of "Borg is way more scalable than Kubernetes, and will always be so".
But this ignores a lot of the history of Borg. When Borg was first created, it was not nearly as scalable as its current incarnation. We hit scalability bugs and limitations all the time! (I was working on a team which was exploring the scalability limits of MapReduce, which was often very good at finding the limits in Borg and other systems it interacted with.)
Over the years, many, many Borg engineers have taken on projects, both solving bugs and rearchitecting major pieces of Borg, with the intention of making it scale better (to run more jobs at once, utilize machines better, increase the degree of failure and performance isolation between jobs, and scale up to manage larger clusters of machines). Many of the lessons learned went into the design of Kubernetes, but Kubernetes is still much newer than Borg, which means it has fewer years of the "identify a scalability bug and squash it" feedback loop.
What is really needed to drive that loop is a major customer pushing the boundaries of scalability and identifying bugs. My guess (from the outside) is that the main users of Kubernetes have been pushing the limits in other directions, which has meant the team has been prioritizing other things (such as improving usability, and adding features) in their development efforts.
Borg will remain orders of magnitude beyond Kubernetes until Kubernetes is completely rearchitected. It’s not scalability bugs. It’s decisions regarding how the cluster maintains state that hamstring it, and that’s so fundamental to everything it’s not a find/squish loop.
As I said in my comment, those major customers (one personal experience, three anecdotally, eight or nine I’ve consulted with) have quietly ruled out Kubernetes, either by trying it or prying it apart and deciding not to try it. That feedback isn’t coming. At Borg scale, Kubernetes is very much considered a nonstarter.
> Borg will remain orders of magnitude beyond Kubernetes until Kubernetes is completely rearchitected. It’s not scalability bugs. It’s decisions regarding how the cluster maintains state that hamstring it, and that’s so fundamental to everything it’s not a find/squish loop.
Can you say more about this? Borgmaster uses Paxos for replicating checkpoint data, and etcd uses Raft for replicating the equivalent data, but these are really just two flavors of the same algorithm. I don't doubt that there are probably more efficient ways that Kubernetes could handle state (I don't claim to be an expert in that area), but I don't think they're approaches that would look any more like Borg than Kubernetes does.
If you're at liberty to do so, could you say what orchestrators the customers you mentioned chose in lieu of Kubernetes? What scale are they running at for a single cluster?
> It’s decisions regarding how the cluster maintains state that hamstring it
Jed, you keep repeating this like it's true, but it's not actually so. Here's an excerpt from the Borg paper (which David co-authored, btw ;-)):
> A single elected master per cell serves both as the Paxos leader and the state mutator, handling all operations that change the cell’s state, such as submitting a job or terminating a task on a machine.
And while we're at it, I don't know what it has to do with FauxMaster, since that ran as a single replica, and the passage about C++ is just pure FUD.
It’s using Chubby for locking (it’s actually the next sentence in that Borg paper) and some other things not related to quorum that I can’t go into. This is different from the kube master, which uses etcd for everything, but in terms of performance it’s not a big deal because elections don’t happen often (and you’d be surprised how many people run k8s with a single-master setup, even GKE).
I suppose that’s fair, but I’d argue against switching even being a primary motivation for anyone at Google, which is why I don’t think of it that way. You do have a point, though.
Without intimate knowledge of Borg, I can understand the successor discussion. With knowledge of what changed (i.e., was getting rid of borgmaster really important enough to sacrifice that much perf-wise?), I can’t even remotely fathom any purpose for Kubernetes other than what I’ve described. You, however, know far better than me. :)
[Disclaimers: I worked on Borg and Omega, and currently work on Kubernetes/GKE. Everything here is my personal opinion.]
There's a lot to unpack here, but I'll do my best.
I don't see Kubernetes locking people into GKE. There's an extensive conformance program (https://github.com/cncf/k8s-conformance) administered by the CNCF. AWS and Azure both have certified hosted Kubernetes offerings. Portability is in Google's best interest.
Go, Docker, and etcd were the best open-source technologies for the job at the time Kubernetes was created (and arguably still are). Open-sourcing Borg would have been impossible, due to its use of many Google-specific libraries (though a number of those have been open-sourced since then), and its close coupling to the Google production environment. Commenting more specifically on each of the pieces you mentioned:
* Go was chosen over C++ because, like C++, it is a systems language, but is much more accessible for building an open-source community.
* Docker was (and still is) by far the most popular container runtime, and the slimmer containerd makes it even more appropriate to serve as the container runtime for a system like Kubernetes. While it's true that in Borg the container runtime and "package" (container image) management systems are separate, the tradeoffs between packaging more in the image vs. pre-installing dependencies on the host are exactly the same as with Docker images. In any event, it's very feasible to build very slim Docker images (you definitely don't need getty in your image :-).
* You can read the reasons etcd was chosen in this recent comment (https://news.ycombinator.com/item?id=17476142) from a Red Hat employee who is one of the earliest contributors to Kubernetes and one of the most prolific. Regarding consensus, I didn't understand your comment; Borg uses Paxos and etcd uses Raft, but those are basically equivalent algorithms.
Regarding scalability, we do continuous scalability testing as part of the Kubernetes CI pipeline, at a cluster size of 5000 nodes. If you're interested in learning more, I'd encourage you to join the scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...). I'm not aware that "messaging around Kubernetes has gravitated toward smaller, targeted clusters." It's true that a lot of people do use small-ish clusters, but AFAICT that's not because of scalability limitations, but rather because (1) the hosted Kubernetes offerings make it so easy to spin up clusters on demand, and (2) until recently, Kubernetes was lacking critical multi-tenancy features that would allow, say, multiple teams within a company to safely share a cluster.
Regarding mixing batch and interactive/serving applications in a single cluster managed by a single control plane, this has been the intention of Kubernetes from the beginning. It's true that open-source batch systems like Hadoop and Spark have traditionally shipped with their own orchestrators/schedulers, but that's starting to change as Kubernetes becomes more popular, for example Spark now supports Kubernetes natively (https://kubernetes.io/blog/2018/03/apache-spark-23-with-nati...). In terms of features that enable batch and serving workloads to share a node and a cluster, Kubernetes has had the concept of QoS classes (https://kubernetes.io/docs/tasks/configure-pod-container/qua...) from the beginning, and as of the most recent Kubernetes release we now have priority/preemption (https://cloudplatform.googleblog.com/2018/02/get-the-most-ou...). QoS classes and priority/preemption are the two main concepts that allow batch and interactive/serving application to share nodes and clusters in Borg, and we now have them in Kubernetes.
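As a minimal sketch of how those two concepts show up on a pod (using client-go types; the PriorityClass name "production-serving" and the image are made up and would need to exist in a real cluster): setting requests equal to limits gives the Guaranteed QoS class, and priorityClassName is what preemption keys off.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        // requests == limits for every resource -> the pod is classified as Guaranteed QoS.
        guaranteed := corev1.ResourceList{
            corev1.ResourceCPU:    resource.MustParse("500m"),
            corev1.ResourceMemory: resource.MustParse("256Mi"),
        }
        pod := corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "serving-task"},
            Spec: corev1.PodSpec{
                // Hypothetical PriorityClass; higher priority lets the scheduler preempt batch pods.
                PriorityClassName: "production-serving",
                Containers: []corev1.Container{{
                    Name:  "app",
                    Image: "example.com/app:latest", // hypothetical image
                    Resources: corev1.ResourceRequirements{
                        Requests: guaranteed,
                        Limits:   guaranteed,
                    },
                }},
            },
        }
        fmt.Printf("%s: priorityClassName=%s\n", pod.Name, pod.Spec.PriorityClassName)
    }

Submitted to a cluster, a pod with this spec shows up with qosClass: Guaranteed in its status.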
On your fifth point, I agree that this is one of the strengths of the Google production environment, but Kubernetes is limited in how prescriptive it can be in dictating how people write applications, since we want Kubernetes to work with essentially any application. This is why we have, for example, extremely flexible liveness/readiness probes in Kubernetes (https://kubernetes.io/docs/tasks/configure-pod-container/con...) rather than the expectation that every application has a built-in web server that exports a predefined /statusz endpoint. That said, we have been more prescriptive in how to build Kubernetes control plane components (for example such components generally have /healthz endpoints and export Prometheus instrumentation according to the guidelines outlined at https://github.com/kubernetes/community/blob/master/contribu...). Over time as containers and the "cloud native" architecture become more popular, I think there will be more standardization in the ways you described when people see the benefits it provides in allowing them to plug in their app immediately to standard container ecosystems. To some extent Istio (https://github.com/istio/istio) is a step in that direction, and in some sense even better because it interposes transparently rather than requiring you to build your application a particular way.
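As a trivial illustration of the probe side, here's a sketch of the sort of endpoint a liveness or readiness probe can be pointed at; the /healthz path and port 8080 are just conventions, not anything Kubernetes mandates, since the probe is configured per container and can also be an exec command or TCP check:

    package main

    import (
        "log"
        "net/http"
    )

    func main() {
        http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            // A real application would check its dependencies here before answering.
            w.WriteHeader(http.StatusOK)
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

The corresponding livenessProbe stanza on the container just names that path and port.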
For anyone interested in learning more about the evolution of cluster management systems at Google, I recommend this paper: https://ai.google/research/pubs/pub44843
While Kubernetes is definitely not the same codebase as Borg, I do think it's accurate to say that it is the descendant of Borg.
Dumb question: why does K8s use a centralized architecture like Borg, if the perf gains from an Omega-style shared-state scheduler decentralization (and maybe a Mesos-style two-level scheduler for batch with multiple frameworks) were already known, and Omega was already being folded back into Borg?
Is this related to (I'm assuming) the fact that K8s was originally architected "mostly" with services rather than batch in mind, and a monolithic scheduler was "good enough"?
(Disclaimer: I haven't really followed K8s stuff in the last few months. How is multi-scheduler support for K8s nowadays, anyways?)
You can actually build an Omega vertical / Mesos framework architecture on Kubernetes, as described in this doc[1]. That doc pre-dated CRDs; the way you'd do it today is to build the application lifecycle management part of the framework using a CRD + controller, and run an application-specific scheduler (for pods created by that controller) alongside the default scheduler. The Kubernetes documentation page explaining how to run multiple/custom schedulers is here[2].
Borg only worked with a single scheduler, but Kubernetes allows you to build Omega/Mesos style verticals/frameworks and associated scheduling as user extensions to the control plane (as described above).
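The hook for that is just a field on the pod spec; a minimal sketch (the scheduler name "my-batch-scheduler" and the image are made up, and in practice the CRD's controller would stamp it on the pods it creates):

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "batch-task-0"},
            Spec: corev1.PodSpec{
                // Pods naming an alternative scheduler are left alone by the default one.
                SchedulerName: "my-batch-scheduler", // hypothetical custom scheduler
                Containers: []corev1.Container{{
                    Name:  "worker",
                    Image: "example.com/batch-worker:latest", // hypothetical image
                }},
            },
        }
        fmt.Printf("%s will be placed by %q\n", pod.Name, pod.Spec.SchedulerName)
    }

The default scheduler ignores such pods, so the custom scheduler is free to apply its own placement policy to them.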
The rescheduler in Borg isn't a scheduler -- it just evicts pods, and then they go into the regular scheduler's pending queue and the regular scheduler decides where to schedule them. (At least that's how it worked at the time I left the project -- I assume it hasn't changed in this regard, but I don't know for sure.)
As a Xoogler myself, I have always wondered about the logic of "we can't open source X because it uses too many libraries and is too integrated". The obvious answer is, OK, open source the libraries and refactor the integrations to make them more flexible.
Reimplementing all of Borg from scratch seems crazy to me given the huge effort that went into it. Does Google want an open source cluster infrastructure or not? If yes, in what universe is it less effort to write a totally new one from scratch vs just progressively open sourcing things?
What's the size of the transitive dependency graph of Borg? 10MLOC? 50MLOC? 100MLOC? I have no idea. But it's a lot of code no matter what. Open sourcing that much code is a huge undertaking, unless you're just planning to throw it over the wall with no expectation of external people working on it.
On the other hand starting from scratch you get to grow the community and the codebase in lockstep.
It may be a large undertaking but yes, it's clearly still less work to release code that exists and build a community around it, than rewrite it all from scratch and also build a community around that too.
I've worked at Google for many years, and built open source communities based on code I've written from scratch several times. This is not an area I'm inexperienced in.
Generally, Google software takes a bottom-up approach that is completely different from the industry norm/standard, and the divergence started at Google’s very beginning.
Open sourcing system software from its internal state requires the same amount of work as rewriting, plus the effort to morph interfaces and internals to fit external needs, plus changes to internal workloads (assuming a unified stack internal & external).
I worked on google3 for years, most likely some of the code I wrote is still there. I've also done a lot of open source coding too. I'm quite familiar with the structure of google3 and it's not as different as you claim - Borg is a bunch of C++ libraries and programs that depend on each other, nothing magic about that.
So I completely disagree that open sourcing code is as hard as rewriting it from scratch. I think if you tried to argue that to anyone outside the Google bubble they'd think you were crazy. Writing code is hard work! Uploading it to github and creating some project structures around it is vastly less work.
I can't help wondering if this is engineers looking for new promotion-worthy projects.
To rebut your statements, I would seemingly need to reveal a lot of technical details. You did not mention what type of software you were open sourcing when you were at Google, but it seems our overlap in knowledge is rather small.
I'll leave this open.
But I want to emphasize that what I stated are reasonable reasons for open sourcing by writing from scratch rather than releasing the internal code.
So it descends from Borg, which is fine. It does not replace Borg or indicate a Google strategy to replace Borg with Kubernetes, which was my entire point with supporting points on why, and explaining why you made the choices in Kubernetes that were made does not dispute that at all.
I note you were careful to use the word “descendant” instead of my “successor.”
What I mean is simple: Borg has borgmaster. Kubernetes approached the same concept like a Web application, and now Kubernetes has an entire SIG to play on the same field as Borg. It was a poor architectural decision, along with many others in Kubernetes, but I wasn’t discussing that. I was discussing why Google won’t replace Borg with it.
Yeah, so to clarify, I know from the Borg paper that Google basically implemented the first cgroups and the first cgroups-based containers. I'm pretty sure that lmctfy was the open-sourcing of this work, but it's also been deprecated and last I heard, that code was moving to libcontainer/runc.
If (as the article implies) Google uses Docker internally, that would be a surprising and interesting bit of news.
I've heard a few times that Google's containers actually run inside of VMs. I'm curious if anyone knows what their VM implementation is or what it's based on?
Lots of bits of KVM turned off, though. Makes it really interesting when I work with people on Open Source stuff. I find out all sorts of things KVM can apparently do that mostly leave me going "you put WHAT in host ring zero?!" :)
(note: as implied, I work on our userland QEMU replacement)
They mentioned that Kubernetes runs some workloads and they are probably using Docker for that, like GKE does (17.03, I think). No way they would use it for Borg.