To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project (https://armadaproject.io/). Armada is a multi-Kubernetes-cluster batch job scheduler, designed to address the following issues:
- A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.
- Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster, using a specialized storage layer.
Armada is designed primarily for ML, AI, and data analytics workloads, and to:
- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.
Armada is written in Go and uses Apache Pulsar for eventing, along with PostgreSQL and Redis. A web-based front end (named "Lookout") gives end users an easy view of the state of enqueued/running/failed jobs. A Kubernetes Operator for quick installation and deployment of Armada is in development.
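To make the out-of-cluster, multi-cluster queueing idea concrete, here is a toy Go sketch of the scheduling loop it implies: per-user queues held outside any cluster, drained round-robin for fairness, with each job placed on whichever cluster has the most free capacity. This is my own illustration of the design goals above, not Armada's actual code; all names and numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Toy model: jobs sit in per-user queues held outside any cluster; a
// scheduling loop drains the queues round-robin (one job per user per
// pass, a crude form of fair share) and places each job on whichever
// cluster currently has the most free capacity.
type Job struct {
	User string
	CPUs int
}

type Cluster struct {
	Name string
	Free int // free CPUs
}

func main() {
	// Per-user queues; users and job sizes are made up.
	queues := map[string][]Job{
		"alice": {{User: "alice", CPUs: 4}, {User: "alice", CPUs: 4}},
		"bob":   {{User: "bob", CPUs: 8}},
	}
	clusters := []*Cluster{{Name: "us-east", Free: 8}, {Name: "us-west", Free: 8}}

	users := make([]string, 0, len(queues))
	for u := range queues {
		users = append(users, u)
	}
	sort.Strings(users) // deterministic round-robin order

	for {
		progress := false
		for _, u := range users {
			q := queues[u]
			if len(q) == 0 {
				continue
			}
			job := q[0]
			// Pick the cluster with the most free capacity.
			sort.Slice(clusters, func(i, j int) bool { return clusters[i].Free > clusters[j].Free })
			if clusters[0].Free < job.CPUs {
				continue // doesn't fit anywhere yet
			}
			clusters[0].Free -= job.CPUs
			queues[u] = q[1:]
			fmt.Printf("scheduled %s's %d-CPU job on %s\n", u, job.CPUs, clusters[0].Name)
			progress = true
		}
		if !progress {
			break
		}
	}
}
```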
I'm not a huge fan of Kubernetes, though I think it has some great use cases, and there are undeniably some super-intelligent people pushing it to amazing limits.
That said, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.
There are many private, proprietary systems that exceed those scales. They are usually bespoke to the applications they run. I work with two such systems and know of others in finance, energy, and scientific computing, not to mention Borg at Google.
In the commercial proprietary world, clustered mainframes and supercomputers have addressed this niche for decades.
In the open source world, HashiCorp Nomad is the most analogous alternative to Kubernetes on commodity hardware, while SLURM is very successful for supercomputing.
I've also scale-tested k8s to 15k nodes (in a limited configuration, for a single application). At that point we ran out of the underlying hardware budgeted for the test.
> There are many private, proprietary systems that exceed those scales
Yeah, and then you have to run (and pay for) some private, proprietary system. K8s might be a bit worse for this, but it's at least “commodity tooling”, and lots of common tooling supports it out of the box.
Kubernetes is based on Omega, which was an attempt to replace Borg using a more distributed architecture. Borg uses a single master with elections, with the replicas serving as hot spares.
Source: I was on Borg-SRE at Google for ~six years, then worked with Kubernetes at Stripe for ~five.
Can you use SLURM to schedule services? I.e., long-running not-batch jobs that expose network ports (often multiple)? That's what k8s (and Borg) really are: service schedulers, not batch schedulers.
I can run batch jobs in GKE. It looks like SLURM doesn't have a way to handle endpoints (i.e., a port manager), so I agree: deployments will have to stay in something like k8s.
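To make "handle endpoints" concrete: a Service in k8s gives a churning set of pods one stable virtual IP and port, and the cluster keeps the endpoint list updated as pods come and go. A minimal client-go sketch (the service name, label, and port are made up for illustration):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; error handling kept minimal for brevity.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// All pods matching the selector get fronted by one stable virtual IP
	// and port. "my-model-server" is a hypothetical app label.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-model-server"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "my-model-server"},
			Ports: []corev1.ServicePort{{
				Name:       "grpc",
				Port:       50051,
				TargetPort: intstr.FromInt(50051),
			}},
		},
	}
	if _, err := clientset.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

SLURM has nothing playing this role, which is why long-running services stay on something like k8s even when the batch work moves elsewhere.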
>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH
It would be nice if someone could solve this problem in a more Kubernetes-native way, i.e., "here is a container, run it on N nodes using MPI", with the scheduler optimizing for the right NUMA-node/GPU configurations.
Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?
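For reference, the quoted setup roughly implies a launcher like the following: discover the worker pods' IPs from the API server, write an MPI hostfile, and shell out to mpirun, which starts ranks on each host over SSH. This is a rough sketch, not any project's actual code; the label selector, namespace, slot count, and binary path are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs in-cluster as the launcher pod.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "role=mpi-worker" is a hypothetical label on the worker pods.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "role=mpi-worker"})
	if err != nil {
		panic(err)
	}

	// One hostfile line per worker pod IP (one slot each, say one per GPU).
	hostfile, err := os.Create("/tmp/hosts")
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Fprintf(hostfile, "%s slots=1\n", p.Status.PodIP)
	}
	hostfile.Close()

	// Open MPI launches ranks on each listed host over SSH.
	cmd := exec.Command("mpirun", "--hostfile", "/tmp/hosts", "/app/train")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```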
But I thought Good System Design involved reserializing the same data multiple times across the cloud(tm) and a dedicated SRE and infra team; it's cheaper than one sysadmin!
You'd generally want each of those 7500 machines to be a full-sized server. There's no point running Kubernetes on tiny VMs, since its purpose is to provide bin-packed scheduling in a datacenter.
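"Bin-packed scheduling" here just means fitting pod resource requests onto nodes so that as little capacity as possible is stranded. A toy first-fit-decreasing sketch of that flavor of problem (made-up numbers, and not Kubernetes' actual algorithm):

```go
package main

import (
	"fmt"
	"sort"
)

// First-fit-decreasing bin packing: sort pods by CPU request, largest
// first, and place each on the first node with room, adding nodes as
// needed. Datacenter schedulers solve a fancier version of this.
func main() {
	pods := []int{6, 5, 4, 3, 2, 2} // CPU requests (made-up numbers)
	nodeCap := 8                    // CPUs per node
	var nodes [][]int               // each node holds the pods packed onto it

	sort.Sort(sort.Reverse(sort.IntSlice(pods)))
	for _, p := range pods {
		placed := false
		for i := range nodes {
			used := 0
			for _, q := range nodes[i] {
				used += q
			}
			if used+p <= nodeCap {
				nodes[i] = append(nodes[i], p)
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, []int{p}) // "provision" a new node
		}
	}
	fmt.Printf("packed %d pods onto %d nodes: %v\n", len(pods), len(nodes), nodes)
}
```

The bigger each node is, the more room the packer has to fill the gaps, which is the point about full-sized servers versus tiny VMs.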
This isn't some dipshit enterprise running LOB software. This is OpenAI. These are all giant multi-GPU nodes getting slammed all day with machine learning jobs.
I'm curious what they're doing now.