> Ideally the team working on infra problems does not need to care how many versions of a backend are operating on the dev environment.
> The only thing infra needs to take into account is the requested resources.
> Perverse incentives on wasting resources aside
(I do infra.) That's like, 95% of the problem. AFAICT, most devs have absolutely no idea how powerful a computer is.
My last change was to resize a 400 core, 800 GiB set of compute into a 100 core, 150 GiB set. It was just ludicrously over-provisioned, because the dev teams aren't incentivized to care at all. (…sadly, I'm not allowed to go out and hire a dev now, even though I literally just saved that amount of money in cloud costs…)
(It's still over-provisioned, but that was the easy "we can lop this compute off and I promise you you won't notice.")
The economics/incentives at play are the hard part. Getting management to stop looking at infra for "ehrmagerd the cloud bill" and instead devote dev time to digging into "why is this app, which ostensibly just shuffles JSON about the landscape, using 8 cores and all the RAM?" is … tough. And not what I signed up for in SWE, damn it.
The other way is equally bad: devs find they've run out of resources? Knee-jerk is "resize compute upwards" not some introspection of "wait, what is a reasonable amount of CPU use for JSON shuffling?"
Usage graphing is the other tool that really puts some devs' work in a rather bad light: resource requests of like 20 CPU, but the usage graph says "0.02 CPU". So … at least the code's not inefficient, but the requested resources are wasted.
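To make the requests-vs-usage comparison concrete, here's a rough sketch of the check I mean. Everything in it is made up for illustration: the `Workload` type, the workload names, and the 2x `headroom` factor are assumptions, and in practice the numbers would come out of whatever metrics system you already have.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str             # hypothetical workload name
    requested_cpu: float  # cores requested by the dev team
    peak_used_cpu: float  # peak cores seen on the usage graph


def overprovisioned(workloads, headroom=2.0):
    """Return (name, suggested_request) pairs for workloads whose CPU
    request exceeds `headroom` times their observed peak usage."""
    flagged = []
    for w in workloads:
        suggested = w.peak_used_cpu * headroom
        if w.requested_cpu > suggested:
            flagged.append((w.name, suggested))
    return flagged


# Illustrative numbers only; the first mirrors the 20 CPU vs 0.02 CPU case.
workloads = [
    Workload("json-shuffler", requested_cpu=20.0, peak_used_cpu=0.02),
    Workload("busy-worker", requested_cpu=4.0, peak_used_cpu=3.1),
]

for name, suggested in overprovisioned(workloads):
    print(f"{name}: request could drop to about {suggested:.2f} CPU")
```

The headroom factor is a judgment call. The point is only that the request should be tied to something measured rather than guessed.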