To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project (https://armadaproject.io/). Armada is a multi-Kubernetes-cluster batch job scheduler, designed to address the following issues:
- A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.
- Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster, using a specialized storage layer.
Armada is designed primarily for ML, AI, and data analytics workloads, and to:
- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.
Armada is written in Go and uses Apache Pulsar for eventing, along with PostgreSQL and Redis. A web-based front end (named "Lookout") gives end users an easy view of the state of enqueued/running/failed jobs. A Kubernetes Operator for quick installation and deployment of Armada is in development.
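To make the out-of-cluster, multi-cluster queueing idea concrete, here is a toy Go sketch of the scheduling loop it implies: per-user queues held outside any cluster, drained round-robin for fairness, with each job placed on whichever cluster has the most free capacity. This is my own illustration of the design goals above, not Armada's actual code; all names and numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Toy model: jobs sit in per-user queues held outside any cluster; a
// scheduling loop drains the queues round-robin (one job per user per
// pass, a crude form of fair share) and places each job on whichever
// cluster currently has the most free capacity.
type Job struct {
	User string
	CPUs int
}

type Cluster struct {
	Name string
	Free int // free CPUs
}

func main() {
	// Per-user queues; users and job sizes are made up.
	queues := map[string][]Job{
		"alice": {{User: "alice", CPUs: 4}, {User: "alice", CPUs: 4}},
		"bob":   {{User: "bob", CPUs: 8}},
	}
	clusters := []*Cluster{{Name: "us-east", Free: 8}, {Name: "us-west", Free: 8}}

	users := make([]string, 0, len(queues))
	for u := range queues {
		users = append(users, u)
	}
	sort.Strings(users) // deterministic round-robin order

	for {
		progress := false
		for _, u := range users {
			q := queues[u]
			if len(q) == 0 {
				continue
			}
			job := q[0]
			// Pick the cluster with the most free capacity.
			sort.Slice(clusters, func(i, j int) bool { return clusters[i].Free > clusters[j].Free })
			if clusters[0].Free < job.CPUs {
				continue // doesn't fit anywhere yet
			}
			clusters[0].Free -= job.CPUs
			queues[u] = q[1:]
			fmt.Printf("scheduled %s's %d-CPU job on %s\n", u, job.CPUs, clusters[0].Name)
			progress = true
		}
		if !progress {
			break
		}
	}
}
```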
I'm not a huge fan of Kubernetes, though I think it has some great use cases, and there are undeniably some super-intelligent people pushing it to amazing limits.
That said, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.
There are many private, proprietary systems that exceed those scales. They are usually bespoke to the applications they run. I work with two such systems and know of others in finance, energy, and scientific computing, not to mention Borg at Google.
In the commercial proprietary world, clustered mainframes and supercomputers have addressed this niche for decades.
In the open source world, HashiCorp Nomad is the most analogous alternative to Kubernetes on commodity hardware, while SLURM is very successful for supercomputing.
I've also scale-tested k8s to 15k nodes (in a limited configuration, for a single application). At that point we ran out of the underlying hardware budgeted for the test.
> There are many private, proprietary systems that exceed those scales
Yeah, and then you have to run (and pay for) some private, proprietary system. K8s might be a bit worse for this, but it's at least “commodity tooling”, and lots of common tooling supports it out of the box.
Kubernetes is based on Omega, which was an attempt to replace Borg using a more distributed architecture. Borg uses a single master with elections, with the replicas serving as hot spares.
Source: I was on Borg-SRE at Google for ~six years, then worked with Kubernetes at Stripe for ~five.
Can you use SLURM to schedule services? I.e., long-running not-batch jobs that expose network ports (often multiple)? That's what k8s (and Borg) really are: service schedulers, not batch schedulers.
I can run batch jobs in GKE. It looks like SLURM doesn't have a way to handle endpoints (i.e., a port manager), so I agree: deployments will have to stay in something like k8s.
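To make "handle endpoints" concrete: a Service in k8s gives a churning set of pods one stable virtual IP and port, and the cluster keeps the endpoint list updated as pods come and go. A minimal client-go sketch (the service name, label, and port are made up for illustration):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; error handling kept minimal for brevity.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// All pods matching the selector get fronted by one stable virtual IP
	// and port. "my-model-server" is a hypothetical app label.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-model-server"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "my-model-server"},
			Ports: []corev1.ServicePort{{
				Name:       "grpc",
				Port:       50051,
				TargetPort: intstr.FromInt(50051),
			}},
		},
	}
	if _, err := clientset.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

SLURM has nothing playing this role, which is why long-running services stay on something like k8s even when the batch work moves elsewhere.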
>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH
It would be nice if someone could solve this problem in a more Kubernetes-native way, i.e., "here is a container, run it on N nodes using MPI", with the scheduler optimizing for the right NUMA-node/GPU configurations.
Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?
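For reference, the quoted setup roughly implies a launcher like the following: discover the worker pods' IPs from the API server, write an MPI hostfile, and shell out to mpirun, which starts ranks on each host over SSH. This is a rough sketch, not any project's actual code; the label selector, namespace, slot count, and binary path are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs in-cluster as the launcher pod.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "role=mpi-worker" is a hypothetical label on the worker pods.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "role=mpi-worker"})
	if err != nil {
		panic(err)
	}

	// One hostfile line per worker pod IP (one slot each, say one per GPU).
	hostfile, err := os.Create("/tmp/hosts")
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Fprintf(hostfile, "%s slots=1\n", p.Status.PodIP)
	}
	hostfile.Close()

	// Open MPI launches ranks on each listed host over SSH.
	cmd := exec.Command("mpirun", "--hostfile", "/tmp/hosts", "/app/train")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```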
But I thought Good System Design involved reserializing the same data multiple times across the cloud(tm) and a dedicated SRE and infra team; it's cheaper than one sysadmin!
You'd generally want each of those 7500 machines to be a full-sized server. There's no point running Kubernetes on tiny VMs, since its purpose is to provide bin-packed scheduling in a datacenter.
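"Bin-packed scheduling" here just means fitting pod resource requests onto nodes so that as little capacity as possible is stranded. A toy first-fit-decreasing sketch of that flavor of problem (made-up numbers, and not Kubernetes' actual algorithm):

```go
package main

import (
	"fmt"
	"sort"
)

// First-fit-decreasing bin packing: sort pods by CPU request, largest
// first, and place each on the first node with room, adding nodes as
// needed. Datacenter schedulers solve a fancier version of this.
func main() {
	pods := []int{6, 5, 4, 3, 2, 2} // CPU requests (made-up numbers)
	nodeCap := 8                    // CPUs per node
	var nodes [][]int               // each node holds the pods packed onto it

	sort.Sort(sort.Reverse(sort.IntSlice(pods)))
	for _, p := range pods {
		placed := false
		for i := range nodes {
			used := 0
			for _, q := range nodes[i] {
				used += q
			}
			if used+p <= nodeCap {
				nodes[i] = append(nodes[i], p)
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, []int{p}) // "provision" a new node
		}
	}
	fmt.Printf("packed %d pods onto %d nodes: %v\n", len(pods), len(nodes), nodes)
}
```

The bigger each node is, the more room the packer has to fill the gaps, which is the point about full-sized servers versus tiny VMs.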
This isn't some dipshit enterprise running LOB software. This is OpenAI. These are all giant multi-GPU nodes getting slammed all day with machine learning jobs.
I'm curious what they're doing now.