Thanks so much for the detailed clarification! The k8s-as-Borg-successor meme is even perpetuated on the Borg paper, so I guess that's why I repeated it :P
If I may ask, is it primarily just reliance on publicly-available infrastructural pieces that hobbles K8s in terms of scalability? i.e. that the problem is more about ecosystem than architecture, because the industry just doesn't have things like (or as "good" as) Stubby and Chubby, and Google's basically never going to open-source/reimplement those?
Stubby and Chubby are not related to Borg's scalability.
The reason Kubernetes scalability was originally not so great was because it simply wasn't prioritized. We were more concerned with building a feature set that would drive adoption (and making sure the system was stable). Only once Kubernetes began to have serious users, did we start worrying about scalability. There have been a number of blog posts on the Kubernetes blog over the years about what we did to improve scalability, and how we measure it.
I'd encourage you to join the Kubernetes scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...) to learn more about this topic. The SIG is always interested in understanding people's scalability requirements, and improving Kubernetes scalability beyond the current 5000 node "limit." (I put that in quotes because there's no performance cliff, it's just the maximum number of nodes you can run today if you want Kubernetes to meet the Kubernetes performance SLOs given the workload in the scalability tests.)
In this thread there is a repeated meme of "Borg is way more scalable than Kubernetes, and will always be so".
But this ignores a lot of the history of Borg. When Borg was first created, it was not nearly as scalable as its current incarnation. We hit scalability bugs and limitations all the time! (I was working on a team which was exploring the scalability limits of MapReduce, which was often very good at finding the limits in Borg and other systems it interacted with.)
Over the years many many Borg engineers have taken on many projects, both in solving bugs and rearchitecting major pieces of Borg with the intention of making it scale better (to run more jobs at once, utilize machines better, increase the degree of failure and performance isolation between jobs, and scale up to manage larger clusters of machines). Many of the lessons learned went into the design of Kubernetes, but Kubernetes is still much newer than Borg, which means it has fewer years of the "identify a scalability bug and squash it" feedback loop.
What is really needed to drive that loop is a major customer pushing the boundaries of scalability and identifying bugs. My guess (from the outside) is that the main users of Kubernetes have been pushing the limits in other directions, which has meant the team has been prioritizing other things (such as improving usability, and adding features) in their development efforts.
Borg will remain orders of magnitude beyond Kubernetes until Kubernetes is completely rearchitected. It’s not scalability bugs. It’s decisions regarding how the cluster maintains state that hamstring it, and that’s so fundamental to everything it’s not a find/squish loop.
As I said in my comment, those major customers (one personal experience, three anecdotally, eight or nine I’ve consulted with) have quietly ruled out Kubernetes, either by trying it or prying it apart and deciding not to try it. That feedback isn’t coming. At Borg scale, Kubernetes is very much considered a nonstarter.
> Borg will remain orders of magnitude beyond Kubernetes until Kubernetes is completely rearchitected. It’s not scalability bugs. It’s decisions regarding how the cluster maintains state that hamstring it, and that’s so fundamental to everything it’s not a find/squish loop.
Can you say more about this? Borgmaster uses Paxos for replicating checkpoint data, and etcd uses Raft for replicating the equivalent data, but these are really just two flavors of the same algorithm. I don't doubt that there are probably more efficient ways that Kubernetes could handle state (I don't claim to be an expert in that area), but I don't think they're approaches that would look any more like Borg than Kubernetes does.
If you're at liberty to do so, could you say what orchestrators the customers you mentioned chose in lieu of Kubernetes? What scale are they running at for a single cluster?
> It’s decisions regarding how the cluster maintains state that hamstring it
Jed, you keep repeating this like it's true, but it's not actually so. Here's an excerpt from Borg paper (which David co-authored btw ;-)):
> A single elected master per cell serves both as the Paxos leader and the state mutator, handling all operations that change the cell’s state, such as submitting a job or terminating a task on a machine.
And while we're at it, I don't know what it has to do with FauxMaster since it ran single replica and the passage about C++ is just pure fud.
It’s using Chubby for locking (it’s actually next sentence in that Borg paper) and some othe things not related to quorum that i cant go into. This is different from kube master that uses etcd for everything but in terms of performance it’s not a big deal because elections dont happen often (and youd be surprised how many ppl run k8s with a single master setup, even GKE)
If I may ask, is it primarily just reliance on publicly-available infrastructural pieces that hobbles K8s in terms of scalability? i.e. that the problem is more about ecosystem than architecture, because the industry just doesn't have things like (or as "good" as) Stubby and Chubby, and Google's basically never going to open-source/reimplement those?
Thanks again!