Even Google doesn't need it that much: back when I was there, each Borg cluster had something like 10,000+ cores. Large enough to run a typical SV startup wholesale. The ratio of "cluster management work" to "actual work being done on it" was not that high.

These days, some people are like "Dude, if you don't have one cluster per AWS availability zone per environment, you're doing it wrong." Why, just why.