Scaling Kubernetes to 7,500 nodes (2021) (openai.com)
95 points by izwasm on March 15, 2023 | 37 comments



This is from 2021 and was discussed then at https://news.ycombinator.com/item?id=25907312

I'm curious what they're doing now.


Thanks! Macroexpanded:

Scaling Kubernetes to 7,500 Nodes - https://news.ycombinator.com/item?id=25907312 - Jan 2021 (53 comments)


> I'm curious what they're doing now.

Building Skynet, apparently. All powered by k8s!


Scaling it to 7,600 nodes

(kidding)


Given that their first post in this vein was https://openai.com/research/scaling-kubernetes-to-2500-nodes, one would expect it to be "Scaling it to 12,500 nodes" :-D

   $ kubectl get nodes
   I'm sorry, Dave, I can't do that


To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project (https://armadaproject.io/). Armada is a multi-Kubernetes-cluster batch job scheduler, designed to address the following issues:

A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.

Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster using a specialized storage layer.

Armada is designed primarily for ML, AI, and data analytics workloads, and to:

- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.

Armada is written in Go, using Apache Pulsar for eventing, PostgreSQL, and Redis. A web-based front-end (named "Lookout") gives end users an easy way to see the state of enqueued/running/failed jobs. A Kubernetes Operator to provide quick installation and deployment of Armada is in development.
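
For a feel of the submission model, a job spec plus submission looks roughly like this (a simplified sketch; exact field and command names may differ between releases, so check the repo for current examples):

    # jobs.yaml -- simplified sketch; field names may vary by version
    queue: example-queue
    jobSetId: example-job-set
    jobs:
      - priority: 1
        podSpec:
          restartPolicy: Never
          containers:
            - name: sleeper
              image: alpine:3.18
              command: ["sh", "-c", "sleep 60"]
              resources:
                requests:
                  cpu: 100m
                  memory: 64Mi
                limits:
                  cpu: 100m
                  memory: 64Mi

    # the queue must exist before submitting; names here are illustrative
    $ armadactl create queue example-queue
    $ armadactl submit jobs.yaml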

Source code is available at https://github.com/armadaproject/armada - we welcome contributors and user reports!


Also, they use Ray.io from Anyscale: https://archive.ph/ZlMi5


I'm not a huge fan of Kubernetes. However, I think there are some great use cases and undeniably some super intelligent people pushing it to amazing limits.

That said, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.


There is no alternative as far as I know. Which open-source or private solution scales above 10k nodes and 100k apps (pods)?


There are many private, proprietary systems that exceed those scales. They are usually bespoke to the applications that they run. I work with two such systems and know of others in finance, energy and scientific computing. Not to mention Borg at Google.

In the commercial proprietary world, clustered mainframes and supercomputers have addressed this niche for decades.

In the open source world, HashiCorp Nomad is the most analogous alternative to Kubernetes on commodity hardware, while SLURM is very successful for supercomputing.

I've also scale-tested k8s to 15k nodes (in a limited configuration for a single application). At that point we ran out of underlying hardware budgeted for the test.


> There are many private, proprietary systems that exceed those scales

Yeah, and then you have to run (and pay for) some private, proprietary system. K8s might be a bit worse for this, but it's at least "commodity tooling" and lots of common tooling supports it out of the box.


Usually the private proprietary systems are used because the application is at least one order of magnitude larger in scale than k8s can support.


HPC does a fraction of what k8s does, though? Can you run 100k never-ending jobs listening on the network?

From what I know, HPC is basically "run short-lived compute jobs on a very large cluster", and that's it.


They also sometimes predate k8s and migration is expensive.


> Borg at Google

Which is what Kubernetes is based on.


Kubernetes is based on Omega, which was an attempt to replace Borg using a more distributed architecture. Borg uses a single master with elections, with the replicas serving as hot spares.

Source: I was on Borg-SRE at Google for ~six years, then worked with Kubernetes at Stripe for ~five.


Yup, and while ISTR that Omega did (eventually) get rolled out, no one ever called it anything but Borg, so you're the best kind of correct :)


One of Nomad's major pitches is that it can scale larger than K8s. It's all over any comparison between the two.


And HashiCorp has a lot of success stories and experience making it work well. They can help you set up a continent-spanning cluster with 100k+ nodes.


HPC schedulers, most likely.


In HPC we see Slurm pretty often.


Can you use SLURM to schedule services? I.e., long-running non-batch jobs that expose network ports (often multiple)? That's what k8s (and Borg) really are: service schedulers, not batch schedulers.


You can schedule jobs without a wall-clock limit if the admin didn't disable it, but it's awkward. Those never-ending jobs will make scheduling MPI jobs hard or impossible.
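
For illustration, something like this (a sketch; per the sbatch man page a time limit of zero requests that no limit be imposed, but the partition's MaxTime still has the final say, and the service binary here is made up):

    # run a long-lived "service" as a batch job with no wall-clock limit
    $ sbatch --time=0 --wrap='./my-service --port 8080'

SLURM still won't manage the port or give you a stable endpoint, though; the job just gets to keep running.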


Run your batch jobs in Slurm and your deployments in GKE.


I can run batch jobs in GKE. It looks like SLURM doesn't have a way to handle endpoints (i.e., a port manager), so I agree: deployments will have to stay in something like k8s.
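
(E.g., a quick one-off without even writing a manifest; the image and command here are arbitrary:

    $ kubectl create job one-off --image=alpine -- sh -c 'echo done'
    $ kubectl wait --for=condition=complete job/one-off

For real batch work you'd write a Job manifest with parallelism/completions set.)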


Mesos definitely does


did*. The community is dead.


Ah that's sad to hear, although I guess I contributed to that death by leaving


> private solution

whatever Google and Facebook use internally.


>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH

It would be nice if someone could solve this problem in a more Kubernetes-native way. I.e., here is a container, run it on N nodes using MPI, optimizing for the right NUMA node / GPU configurations.

Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?
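
For context, the quoted approach boils down to gluing these together by hand (a sketch; the label and training binary are hypothetical):

    # build an MPI hostfile from pod IPs, then launch over SSH
    $ kubectl get pods -l app=trainer \
        -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' > hostfile
    $ mpirun --hostfile hostfile -np 64 ./train

Nothing in that pipeline knows about NUMA or GPU topology, which is exactly the gap.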


Good read. Should probably get a [2021] tag.


Success! Meanwhile, all 7,500 nodes are, computationally, replaced by a 96-core, $10k server in a dude's basement.

With power to spare.


But I thought Good System Design involved reserializing the same data multiple times across the cloud(tm) and having a dedicated SRE and infra team - it's cheaper than one sysadmin!


You'd generally want each of those 7500 machines to be a full-sized server. No point running Kubernetes on tiny VMs, since its purpose is to provide bin-packed scheduling in a datacenter.


This isn't some dipshit enterprise running LOB software. This is OpenAI. These are all giant multi-GPU nodes getting slammed all day with machine learning jobs.


Yeah, I'm sure OpenAI could've trained GPT-4 on a $10k machine.


Is Kubernetes simply BEAM but not on Erlang?



