As an HPC sysadmin for three research institutes (mostly life sciences and biology), I can't see how a cloud HPC system could be any cheaper than an on-prem one, especially when I look at the resource efficiency of our users' SLURM jobs (how much was requested vs. how much was actually used). Users often request hundreds of GB of memory but use only a fraction of it. On our on-prem system this mainly hurts utilization (not great), but in the cloud it would directly inflate compute costs (a bigger VM flavor for every job), which would probably be worse (CapEx vs. OpEx).
Of course you could argue that users should know better and properly size/measure their resource requirements, but most of our users come from a lab background and are new to computational biology, so estimating, or even understanding, what all the knobs of a job specification mean (cores, memory per core, total memory, etc.) is hard for them. We try to educate them with trainings and job efficiency reporting, but the researchers have little incentive to optimize their job requests; they are more interested in quick results and turnaround, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them, or rather their group leaders/institute heads, to focus on this, but until you actually move to the cloud you won't know.
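To give a concrete idea of what I mean by job efficiency reporting, here is a minimal sketch of the requested-vs-used memory check, assuming sacct is available and that ReqMem/MaxRSS are populated for the jobs in question (field formats and units vary between SLURM versions, so treat this as illustrative only, not our actual reporting tool):

    #!/usr/bin/env python3
    # Rough per-job memory efficiency report from SLURM accounting data.
    # Illustrative sketch: assumes sacct is on PATH and that ReqMem/MaxRSS
    # come back in megabytes via --units=M; adjust parsing to your version.
    import subprocess

    def sacct_jobs(since):
        # Completed jobs: requested memory vs. peak resident set size.
        out = subprocess.run(
            ["sacct", "--starttime", since, "--state=COMPLETED",
             "--parsable2", "--noheader", "--units=M",
             "--format=JobID,User,ReqMem,MaxRSS"],
            check=True, capture_output=True, text=True,
        ).stdout
        for line in out.splitlines():
            yield line.split("|")

    def to_mb(value):
        # Values look like "4000M"; older SLURM versions may append 'n'
        # (per node) or 'c' (per core) to ReqMem, so keep digits only.
        digits = "".join(ch for ch in value if ch.isdigit() or ch == ".")
        return float(digits) if digits else None

    if __name__ == "__main__":
        for jobid, user, reqmem, maxrss in sacct_jobs("2024-01-01"):
            req, used = to_mb(reqmem), to_mb(maxrss)
            if req and used:
                print(f"{jobid:>14} {user:>10} requested {req:8.0f} MB, "
                      f"peak {used:8.0f} MB ({100 * used / req:5.1f}% used)")

In practice you would aggregate this per user or per group, which is exactly the kind of report that gets ignored when the cluster is already paid for.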
Additionally, the typical workloads on our HPC system are often badly maintained bioinformatics software or throwaway R/Perl/Python scripts, and often enough a typo in a script causes the entire pipeline to fail after days of running, so it has to be restarted (maybe even multiple times). Again, on the on-prem system that is wasted electricity (bad enough), but in the cloud you also pay the compute costs of the failed runs. Cost transparency might force people to fix this, but the users are not software engineers.
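The cheapest mitigation we can suggest is to fail fast on the login node before anything hits the queue. A hypothetical submit wrapper along these lines (the script name and behaviour are made up for illustration; it only catches parse errors, i.e. the "typo on line 3" class of failure, not logic bugs):

    #!/usr/bin/env python3
    # Hypothetical wrapper: syntax-check a job script before handing it to sbatch.
    import py_compile
    import subprocess
    import sys

    def check_python(path):
        try:
            # Parse/compile only; nothing in the script is executed.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as err:
            sys.exit(f"refusing to submit, syntax error:\n{err}")

    def check_perl(path):
        # 'perl -c' compiles without running; non-zero exit means a syntax error.
        if subprocess.run(["perl", "-c", path]).returncode != 0:
            sys.exit("refusing to submit, Perl syntax check failed")

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("usage: submit_checked.py <jobscript> [args...]")
        script = sys.argv[1]
        if script.endswith(".py"):
            check_python(script)
        elif script.endswith(".pl"):
            check_perl(script)
        # Hand the (at least parseable) script over to SLURM unchanged.
        subprocess.run(["sbatch", *sys.argv[1:]], check=True)

It does not save a run that dies on bad input data after two days, but it costs seconds instead of node-hours.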
One thing the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPU to GPU: a new cryo-EM microscope was installed whose downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is increasingly GPU-accelerated as well.
We have around 100 GPUs, average utilization has climbed to 80-90%, and users are complaining about long waiting/queueing times for their GPU jobs.
For this, bursting to the cloud would be nice; unfortunately GPUs are prohibitively expensive in the cloud, and the caveats about job resource efficiency mentioned above still apply.
One thing that will hurt on-prem HPC systems, though, is rising electricity prices. We are now taking measures to actively save energy (e.g. powering down idle nodes and powering them back up when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't raised their prices yet, either because they absorb the electricity cost increase in their profit margins or because they are less affected thanks to better deals with electricity providers.
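For anyone curious, the power-down/power-up of idle nodes is standard SLURM power saving; a minimal slurm.conf sketch (the values and the suspend/resume script paths are placeholders for illustration, our actual setup differs):

    # slurm.conf excerpt -- SLURM power saving (illustrative values only)
    SuspendTime=1800                                  # power down a node after 30 min of idling
    SuspendProgram=/usr/local/sbin/node_poweroff.sh   # site-specific script (placeholder path)
    ResumeProgram=/usr/local/sbin/node_poweron.sh     # e.g. IPMI power-on (placeholder path)
    ResumeTimeout=600                                 # seconds a node may take to come back before it is marked DOWN
    SuspendExcNodes=login[01-02]                      # never power down these nodes (placeholder names)

The scheduler then transparently resumes suspended nodes when pending jobs need them, so users mostly only notice a slightly longer start time for the first job after a quiet period.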
You touch on a good point: because cloud compute requires pretty knowledgeable users in order not to waste massive amounts of money, it effectively imposes a much higher competency requirement on the users. You can view that in different ways. One is that it's a good thing if everyone learns to use compute better. Another is that you are locking a whole tier of scientific users out of doing computation at all, which is pretty unfortunate: we may miss out on real and important scientific discoveries, even if those users are horrifically bad at doing the computation efficiently.