Having had the responsibility of providing HPC for literal buildings full of scientists, I can say that it may be true that you can get computation cheaper with owned hardware than in a cloud. Certainly pay-as-you-go, one-project-at-a-time processing will look that way to the scientist. But I can also say with confidence that the contest is far closer than they think. Scientists who make this argument almost invariably leave major costs out of their calculation - assuming they can put their servers in a closet, maintain them themselves, do all the security infrastructure, provide redundancy, and still get to shared compute when they have an overflow need. When the closet starts to smoke because they stuffed it with too many cheaply sourced, hot-running cores and GPUs, or gets hacked by one of their postdocs resulting in an institutional HIPAA violation, well, that's not their fault.
Put like for like, a well-managed data center against negotiated and planned cloud services, and the former may still win, but it won't be dramatically cheaper; figured over the depreciable lifetime and including opportunity cost, it may cost more. It takes work to figure out which is true.
Let me echo this as someone who was once responsible for HPC at a research-intensive public university. Most career academics have NO IDEA how much enterprise computing infrastructure costs. If a 1-terabyte USB hard drive is $40 at Costco, then surely we (university IT) must be getting a much better deal than that. Take this argument and apply it to any aspect of HPC and that's what you're fighting against. The closet with racks of gear and no cooling is another fond memory. Don't forget the AC terminal strips that power the whole thing, sourced from the local dollar store.
I remember the first time a server caught fire in the closet we kept the rack in. Backups were kept on a server right below the one on fire. But, y'know, we saved money.
I never saw an actual fire, but we did see smoke, in a closet, on a floor that was 90% patient care but happened to have a research area as well, because the research needed access to expensive radiography equipment. The closet was literally stuffed with what amounted to gaming machines, purchased with grant money through an importer, directly from China. The guy who set them up was smart enough to put them all behind a little firewall, so the enterprise network couldn't see them. It was a (literally) hot mess, both infrastructure- and security-wise.
Don’t worry, we do incremental backups during weekdays and a full backup on Sunday. We use 2 tapes only, so one is always outside of the building. But you know, we saved money.
We had a million dollars worth of hardware installed in a closet. It had a portable AC hooked up that needed its water bin changed every so often.
Well, I was in the middle of that when the Director decided to show off the new security doors. So he closed the room up, then found out that the new security doors didn't work. I found out as I was coming back to turn the AC back on.
The room gets hot really fast.
We get office security to unlock the door. He says he doesn't have the authority; his supervisor will be by later in the day.
Completely deadpan, and in front of several VPs of a Fortune 50.
I turn to the guy to my right, who lived nearby: "Go home and get your chainsaw."
We were quickly let in. Also got fast approval to install proper cooling.
The homelab-priced sensor is the temp sensor already in your server: it's free! Actual servers have a bunch, usually including one at the intake; "random old PC" servers can use the motherboard temp as a rough proxy for ambient temperature.
Hell, even in a DC you can look at the temperatures and tell which server the technician was standing in front of, just from those sensors.
The second cheapest would be a USB-to-1-Wire module plus some DS18B20 1-Wire sensors. Easy hobby job to make. They also come with a unique ID, which means that if you record them in your TSDB by that ID, it doesn't matter where you plug the sensors in.
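For reference, a minimal sketch of what polling those sensors can look like on Linux, assuming the kernel's w1 subsystem exposes them under /sys/bus/w1/devices (as it typically does with the common USB 1-Wire bridges); where the readings go afterwards (TSDB write, etc.) is left out:

```python
import glob
import time

# DS18B20 devices appear on the w1 bus with a "28-" family prefix;
# the directory name is the sensor's unique 64-bit ID.
W1_GLOB = "/sys/bus/w1/devices/28-*/w1_slave"

def read_all():
    readings = {}
    for path in glob.glob(W1_GLOB):
        sensor_id = path.split("/")[-2]          # e.g. "28-<unique id>"
        with open(path) as f:
            lines = f.read().splitlines()
        # First line ends in "YES" when the CRC check passed;
        # second line ends in "t=23125" (milli-degrees Celsius).
        if lines and lines[0].endswith("YES"):
            millic = int(lines[1].rsplit("t=", 1)[1])
            readings[sensor_id] = millic / 1000.0
    return readings

if __name__ == "__main__":
    while True:
        for sensor_id, temp_c in read_all().items():
            # Tag the measurement with the sensor ID, not its location,
            # so re-plugging a sensor doesn't break the time series.
            print(f"temp,sensor={sensor_id} value={temp_c:.3f}")
        time.sleep(30)
```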
No more or less likely to catch fire than anything else connected to power. Just make sure the RCD works.
Hell, some weeks ago my oven decided the lower heater's line was now connected to ground and started blowing fuses...
Our 8 racks in the DC only had a single event of something blowing (a power supply), and aside from the smell and a blown fuse nothing really happened.
Servers are essentially metal boxes with a bit of glass-reinforced epoxy and some plastic inside, so there is a limited amount of stuff that can burn. The UPS is probably a bigger problem.
For example, the OVH datacenter fire was more a case of "the stuff around the servers was flammable" (they had wooden ceilings for some reason...) rather than "just" the servers.
"Residual current device". It detects current leaking to ground; essentially it prevents you from killing someone by throwing a toaster into the bathtub.
I am dealing with the exact opposite problem: "Oh you mean, we should leave the EC2 instance running 24/7??? No way, that would be too expensive"... to which I need to respond "No, it would be like $15/month. Trivial, stop worrying about costs in EC2 and S3, we're like 7 people here with 3 GB of data."
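The arithmetic really is that small. A rough sketch, with illustrative on-demand prices (the instance type and rates are assumptions, not the actual setup being described):

```python
# Back-of-envelope: a small always-on EC2 instance plus a few GB in S3.
hourly_rate = 0.0208          # roughly a t3.small in us-east-1, on-demand
hours_per_month = 730
s3_gb = 3
s3_rate_per_gb = 0.023        # S3 Standard, first-tier pricing

ec2_monthly = hourly_rate * hours_per_month
s3_monthly = s3_gb * s3_rate_per_gb
print(f"EC2: ${ec2_monthly:.2f}/mo, S3: ${s3_monthly:.2f}/mo")
# -> EC2 around $15/mo, S3 a few cents/mo
```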
I deal with scientists who think AWS is some sort of massively expensive enterprise thing. It can be, but not for the use case they're going to be embarking on. Our budget is $7M spanning 4 years.
Hahaha, maybe I need to just go into the AWS ether and start yakking big words like "Elastic Kubernetes Service" to confuse the scientists and get my AWS fix. These people are too stingy. I want some shit running in AWS; what good is this admin IAM role.
Your comment about cooling reminded me of a fun anecdote from my time in academia.
My PhD involved quite a lot of high performance computing simulations. We kept running into problems where on warm days, all our jobs would get killed pretty consistently. Our IT guy noticed a pattern where the temps would be perfectly normal, and then suddenly out of nowhere go through the roof, triggering a thermal shutdown of the racks.
In the end our IT guy camped out in the server room on a hot day to watch what was happening. The astrophysics group's cabinet was directly in front of ours, and they had jerry-rigged their cabinet door so that when it got too hot, it would swing open and their hot air would be blown all over the neighbouring cabinets...
It's kind of funny around this time of year when some researchers have $10,000 in their budget they need to spend, and they want to 'gift' us with some GPUs.
That was definitely one of the weirdest things of working in academia IT: “hey. Can you buy me a workstation that’s as close to $6,328.45 as it is possible to get, and can you do it by 4pm?”
Same thing happens in the government sector here (US). If you don't spend all of the budget you requested last year, you might not get it next year. There is an entire ecosystem of bottom-feeder GSA companies that apparently exist to spend year-end money that would otherwise go to 'waste'.
> Don't forget the AC terminal strips that power the whole thing, sourced from the local dollar store.
Love how you’ve fuzzed the root cause to make it seem like the “dollar store strip” is the problem and not that it was plugged into an overloaded outlet or run inside a closet at significantly elevated temperatures, leading to plastic melting and wires shorting.
Always helps to keep the “magic” a secret so the rubes have to keep us wizards employed, right?
I think it really depends on the task. Where a HIPAA violation is a real threat, the equation changes, and just for CYA purposes those projects can get pushed to a cloud. That does not necessarily involve any attempt to make them more secure, but this is a different topic.
That said, many scientists are operating on-premise hardware like this: some servers in a shared rack and an el-cheapo storage solution, with ssh access for the people working in the lab. And it works just fine for them.
Cloud services focus on running business computing in the cloud, emphasizing recurring revenue. Most research labs are much more comfortable spending the hardware portion of a grant upfront and not worrying about some student who, instead of working on some fluid dynamics problem, found a script to re-train Stable Diffusion and left it running over winter break. My 2c.
Until it doesn't because there's a fire or huge power surge or whatever.
That's the point -- there's a lot of risk they're not taking into account, and by focusing on the "it works just fine for them", you're cherry picking the ones that didn't suffer disaster.
I'd counter by saying I think you're over-estimating how valuable mitigating that risk is to this crowd.
I'd further say that you're probably over-estimating how valuable mitigating that risk is to anyone, although there is a limited set of customers that genuinely do care.
There are few places I can think of that would benefit more from avoiding cloud costs than scientific computing...
They often have limited budgets that are driven by grants, not derived from providing online services (a computer going down does not impact the bottom line).
They have real computation needs that mean hardware is unlikely to sit idle.
There is no compelling reason to "scale" in the way that a company might need to in order to handle additional unexpected load from customers or hit marketing campaigns.
Basically... the only meaningful offering from the cloud is likely preventing data loss, and this can be done fairly well with a simple backup strategy.
Again - they aren't a business where losing a few hours/days of customer data is potentially business ending.
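To be concrete about the "simple backup strategy" above, something like this nightly script covers a lot of ground for this crowd (the paths and retention window are hypothetical; in practice the destination would be a second box or NAS):

```python
import datetime
import pathlib
import tarfile

# Hypothetical layout: results live on the group server, backups land on a
# separately housed mount. Run nightly from cron.
DATA_DIR = pathlib.Path("/data/lab-results")
BACKUP_ROOT = pathlib.Path("/mnt/nas/backups")
KEEP = 14  # keep two weeks of nightly archives

def backup():
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    archive = BACKUP_ROOT / f"lab-results-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(DATA_DIR, arcname=DATA_DIR.name)
    # Rotate: drop archives older than the retention window.
    archives = sorted(BACKUP_ROOT.glob("lab-results-*.tar.gz"))
    for old in archives[:-KEEP]:
        old.unlink()

if __name__ == "__main__":
    backup()
```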
---
And to be blunt - I can make the same risk avoidance claims about a lot of things that would simply get me laughed out of the room.
"The lead researcher shouldn't be allowed in a car because it might crash!"
"The lab work must be done in a bomb shelter in case of war or tornados!"
"No one on the team can eat red meat because it increases the risk of heart attack!"
and on and on and on... Simply saying "There's risk" is not sufficient - you must still make a compelling argument that the cost of avoiding that risk is justified, and you're not doing that.
Ummm. I’ve def been unable to do anything for entire days because our AWS region went down and we had to rebuild the database from scratch. AWS goes down, you twiddle your thumbs and the people you report to are going to be asking why, for how long, etc. and you can’t give them an answer until AWS comes back to see how fubar things are.
When your own hardware rack goes down, you know the problem, how much it costs to fix, and when it will come back up, usually within a few hours (or minutes) of it going down.
Do things catch fire? Yes. But I think you're over-estimating how often. In my entire life, I've had a single SATA connector catch fire, and it just melted some plastic before going out.
I'm not talking about temporary outages, I'm talking about data loss.
With AWS it's extremely easy to keep an up-to-date database backup in a different region.
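For concreteness, a minimal sketch of the kind of thing meant here, assuming a Postgres database and a bucket created in a second region (the database name, bucket, and region are made up; a managed snapshot copy or S3 replication rule would be even less code):

```python
import datetime
import subprocess

import boto3

DB_NAME = "labdb"                   # hypothetical database
BUCKET = "my-lab-backups-usw2"      # hypothetical bucket, created in us-west-2
REGION = "us-west-2"                # deliberately not the primary region

def backup_to_other_region():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/{DB_NAME}-{stamp}.dump"
    # Custom-format dump; assumes pg_dump can reach the DB via env/.pgpass.
    subprocess.run(["pg_dump", "-Fc", "-d", DB_NAME, "-f", dump_path], check=True)
    # Ship it to the out-of-region bucket.
    s3 = boto3.client("s3", region_name=REGION)
    s3.upload_file(dump_path, BUCKET, f"{DB_NAME}/{stamp}.dump")

if __name__ == "__main__":
    backup_to_other_region()
```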
And it's great that you haven't personally encountered disaster, but of course once again that's cherry-picking. And it's not just a component overheating, it's the whole closet on fire, it's a broken ceiling sprinkler system going off, it's a hurricane, it's whatever.
So was I. Not everything can be replicated, but backups can be, and were, made.
For the rest, there’s insurance. Most calculations done in a research setting are dependent upon that research surviving. If there’s a fire and the whole building goes down, those calculations are probably worthless now too.
Hell, most companies probably can’t survive their own building/factory burning down.
I would say it's even easier on-prem, as you don't need to wade 15 layers deep to do anything. Since I moved to hosting my own stuff at my house, I have learned that connecting a monitor and keyboard to a 'server' is awesome for productivity. I know where everything is, it's fast as hell, and everything is locked down. Monitoring temps and adjusting and configuring hardware is just better in every imaginable way. Need more RAM, storage, compute? Slap those puppies in there and send it.
For home gamers like myself, it has become a no-brainer with advances in tunneling, Docker, and cheap prices on eBay.
Maybe consider that your use case and the average scientist's use case aren't the same? What works for you won't work for them, and vice versa. What you consider a risk, I wouldn't.
Consider the following: I have never considered applying Meltdown or Spectre mitigations if they make my code run slower, because I plain don't care. Assuming anyone even peeks at what my simulation is doing, whoopdeedo, I don't care. I wouldn't skip them on the laptop I use to buy shit off Amazon, but on the workstation I control? I don't care. I DO care if my simulation will take 10 days instead of a week.
My use case isn't yours because my needs aren't yours. Not everything maps across domains.
The point is, there's no need for everything to be 100% reliable in this context. If a fire destroys everything and their computational resources are unavailable for a few days, that's somewhat okay. Not ideal, but not a catastrophic loss either. Even data loss is not catastrophic - at worst it means redoing one or two weeks' worth of computations.
Some sort of 80/20 principle is at work here. Most of the cost in professional cloud solutions comes from making the infrastructure 99.99% reliable instead of 99% reliable. That is totally worth it if you have millions of customers that expect a certain level of reliability, but complete overkill if the worst-case scenario from a system failure is some graduate student having to redo a few days' worth of computations (which probably had to be redone several times anyway because of some bug in the code or something).
> there's a lot of risk they're not taking into account
I see it the other way: experimental scientists operate with unreliable systems all the time: fickle instruments, soldered one-time setups, shared lab space, etc. Computing is just one more thing that is not 100% reliable (but way more reliable than some other equipment), and USB sticks serve as a good-enough data backup.
The counterpoint to that point is that a significant percentage of scientific computing doesn't care about any of that. They are unlikely to have enough hardware to cause a fire, and in many cases they don't care about outages or even data loss. As others have said, it depends on the specifics of the research. In the cases where that stuff matters, the cloud would be a better option.
Even that depends on what you're doing. Most scientists aren't running apps that require several 9's of availability, connect to an irreplaceable customer database, etc.
An outage, or even permanent loss of hardware, might not be a big problem if you're running easily repeatable computations on data of which you have multiple copies. At worst, you might have to copy some data from an external hard drive and redo a few weeks' worth of computations.
Thankfully, only a small part of the academic research enterprise involves human subjects, HIPAA, and all that. Neither fruit flies nor quarks have privacy rights.
Research involving human subjects (psychology, cognitive neuroscience, behavioral economics, etc.) requires institutional review board approval and informed consent, etc. but mostly doesn't involve HIPAA either.
And many, many institutions are overcautious. My own university, for example, has no data classification between "it would be totally okay if anyone in the university had access" and "regulated data", so "I mean, it's health information, and it's governed by our data use agreement with the provider..." gets kicked to the same level as full-fat HIPAA data.
This is true, but to say it is "not a law", as you did, completely unqualified, is incorrect. If the research project is connected with a government grant (and many are) you need to pay attention to those laws. Many universities also have their own policies you need to follow, regardless. (Requiring informed consent and protecting people's privacy seems like a good thing.)
Let me repeat it another way. The law only restricts the actions of the government. Members of a university are not the government. Even if they took government money they could legally ignore all of that stuff. Worst case you will not get more funding from them in the future.
I believe you are technically correct, but that does not change the fact that universities have IRBs and will require reviews/approval if you are connected to that institution. You really think they're going to put their funding at risk? This seems very unlikely.
I've been running a group server (basically a shared workstation) for 5 years and it's been great. Way cheaper than cloud, no worrying about national rules on where data can be stored, no waiting in a SLURM batch queue, Jupyter notebooks on tap for everyone. A single ~$6k outlay (we don't need GPUs, which helps).
Classic big workstations are way more capable than people think - but at the same time it's hard to justify buying one machine per user unless your department is swimming in money. Also, academic budgets tend to come in fixed chunks, and university IT departments may not have your particular group as a priority - so often it's just better to invest once in a standalone server tower that you can set up to do exactly what you need it to than try to get IT to support your needs or the accounting department to pay recurring AWS bills.
Well, the title is scientific computing, which includes HPC but not only HPC. Anyway, the fact is that a lot of "HPC" on university clusters consists of smaller jobs that are too much for an average PC to handle but still fit on a single typical HPC node. These are usually the jobs that people think to farm out to AWS, but that you will generally find are cheaper, faster, and more reliable if you just run them on your own hardware.
Perspective from a computational biologist: Campus hosted HPC means the direct cost pressure is seen as IT staff and hardware related costs. Researchers are encouraged to use the available capacity. This is good.
Externally-hosted HPC means every single compute job is seen as something that directly costs money. This negatively affects the quality of scientific output (research playfulness / creativity / focus on the research / etc).
Yes. The costing models that are used (and often required by granting agencies) make apples to apples cost comparisons almost impossible, and impose undoubted and significant false costs on research budgets. No question this is true.
They leave out major costs because they don't pay those costs. Power, cooling, and real estate are all significant drivers of AWS costs. Researchers don't pay those costs directly. The university does, sure, but to the researcher, that means those costs are pre-paid. Going to AWS means you're essentially paying for them twice, plus all the profit margin and availability that AWS provides that you also don't need.
Definitely some truth in this, and the way research funding is managed, especially with Federal grants, makes it incredibly difficult to sort out and identify the right solution.
Running a modern AMD-based server that has 48 cores, at least 192 GB of RAM, and no included disk space costs:
~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
~$1014.7/mo for a c5a.24xlarge AWS reserved instance on a three-year term, paid upfront
~$558.65/mo on OVH Cloud[1]
~$512.92/mo on Hetzner[2]
~$200/mo on your own infrastructure as a large institution[3]
Footnote [3] explains this cost estimate as:
"Assumes an AMD EPYC 7552 run at 100% load in Boston with high electricity prices of $0.23/kWh, for $33.24/mo in raw power. Hardware is amortized over five years, for an average monthly price of $67.08/mo. We assume that your large institution already has 24/7 security and public internet bandwidth, but multiply base hardware and power costs by 2x to account for other hardware, cooling, physical space, and a half-a-$120k-sysadmin amortized across 100 servers."
When I worked as a network engineer I spent months working with some great scientists / their team who built a crazy microscope (I assumed it was looking at atoms or something...) the size of a small building.
Their budget for the network gear was a couple hundred bucks and some old garbage consumer-grade network gear. This was for something that spit out tens of GB a second (at least) across a ton of network connections (they didn't seem to know what would even happen when they ran it), and was so bursty that only the highest end of gear could handle it.
Can confirm that sometimes scientists aren't really up on the overall costs. Then they dump the "this isn't working" problem on their university IT team to absorb the costs / manpower.
Power (especially if there is some kind of significant scientific facility on premise), space (especially in reused buildings), manpower (undergrads, grad students, post docs, professional post graduates), running old/reused hardware, etc...
You can get away with those at large research universities. Some of that you can get away with at national lab sorts of places (not going to find as much free/cheap labor, surplus hardware). If you start going down in scale/prestige, etc... none of that holds true.
Running a bunch of hardware from the surplus store in a closet somewhere with Lasko fans taped to the door is cheap. To some extent, the university system encourages such subsidies.
In any case, once you get to actually building a datacenter, if you have to factor in power, a 4-year hardware refresh cycle, professional staffing, etc. - unless you are in one of those low-CoL college towns - cloud is probably no more than 1.5 to 3x more expensive for compute (spot, etc...). Storage on-prem is much cheaper - erasure-coded storage systems are cheap to buy and run, and everybody wants their own high-performance file system.
One continuing cloud obstacle, though: researchers don't want to spend their time figuring out how to make their code friendly to preemptible VMs, which is the cost-effective way to run on the cloud.
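Making code "preemptible-friendly" mostly comes down to checkpoint-and-resume. A generic sketch (the checkpoint path, step granularity, and placeholder workload are arbitrary, and the exact preemption signal varies by provider):

```python
import os
import pickle
import signal

CHECKPOINT = "state.pkl"      # on local or attached storage
stop_requested = False

def handle_term(signum, frame):
    # Spot/preemptible VMs typically get a termination signal or notice
    # shortly before shutdown: finish the current step, then save and exit.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_term)

# Resume from the last checkpoint if a previous run was preempted.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0, "result": 0.0}

TOTAL_STEPS = 1_000_000
while state["step"] < TOTAL_STEPS and not stop_requested:
    state["result"] += state["step"] * 1e-9   # stand-in for real work
    state["step"] += 1
    if state["step"] % 10_000 == 0:           # checkpoint periodically
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)

with open(CHECKPOINT, "wb") as f:
    pickle.dump(state, f)
```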
Another real issue with sticking to on-prem HPC is talent acquisition and staff development. When you don't care about those things so much, it's easy to say it's cheap to run on-prem, but often the pay is crap for the required expertise, and ignoring cloud doesn't help your staff either.
Which I thought was the best point of the article, that a lot of IT best practice comes from the web app world.
Web apps quickly become finely tuned factory machines, executing a million times a day and being duplicated thousands of times.
Scientific computing projects are often more like workshops. Getting charged by the second while you're sitting at a console trying to figure out what the giant blob you were sent even is can be unpleasant. The solution you create is most likely to be run exactly once. If it is a big hit, it may be run a dozen times.
Trying to run scientific workloads on the cloud is like trying to put a human shoe on a horse. It might be possible but it's clearly not designed for that purpose.
Plus the supposed savings of in-house hardware only materialize if you have sufficiently managed and queued load to keep your servers running at 100% 24/7. The advantage of AWS/other is to be able to acquire the necessary amount of compute power for the duration that you need it.
For a large university it probably makes sense to have and manage their own compute infrastructure (cheap post-doc labor, ftw!) but for smaller outfits, AWS can make a lot of sense for scientific computing (said as someone who uses AWS for scientific computing), especially if you have fluctuating loads.
What works best IMO (and what we do) is have a minimum-to-moderate amount of compute resources in house that can satisfy the processing jobs most commonly run (and where you haven't had to overinvest in hardware), and then switch to AWS/other for heavier loads that run for a finite period.
Another problem with in-house hardware is that you spent all that money on Nvidia V100's a few years ago and now there's the A100 that blows it away, but you can't just switch and take advantage of it without another huge capital investment.
You are paying 10x more because no one gets fired for using IBM. AWS has many benefits, most of which you don't need. Pair up with another school in a different region and back up your data there. Computers are not scary; they rarely catch fire.
Sounds like you've got the kind of outstanding IT guy who was motivated above all to make the electronics run the way the scientists wanted.
At the other end of the spectrum there are labs [0] where the scientists need to carefully study and develop ever-increasing skill at operating the electronics the way IT wants it done. It's an even worse distraction when that's a moving target: keeping up with an IT approach that changes faster than the progress most labs make in their own scientific field.
What labs need more of is your kind of IT operator who can bring that option (your extreme end of the spectrum in favor of lab workers) within reach for when it is the most appropriate choice.
When labs fail to retain such adequate talent, they can rule out that option going forward, and that's one less tool in the toolbox.
[0] Including many which have good records of breakthrough progress before becoming computerized to begin with.
Is a postdoc hacking a cluster something you have seen before? I am genuinely curious, because I worked on a cluster owned by my university as an undergrad and everyone was kind of assumed to be trusted. If you had shell access on the main node you could run any job you wanted on the cluster. You could enhance security; I just wonder about this threat model. It's an interesting one, and I'm sure it happens, to be clear.
Yes. Probably no surprise that the postdoc was a PRC national. Very competent in their field of study, but also in this country with instructions from an APT group.
Sorry you were on the receiving end of that and had to learn the hard way. The dean of the college at the large public research university where I was working at the time received a gift of a bunch of new MBPs as a token of goodwill from a foreign country, and we heard in the very next sentence that they would be going straight into a shredder. At the time I thought these brand-new laptops could easily be wiped and repurposed, but now I realize each one was a potential attack surface that could easily leak info to APT/espionage groups the second it connected to a network.
Less than a year later, a story made national headlines that a professor with direct access to classified material had mysteriously disappeared and had never disclosed his close ties to his home country. So not only should you worry about back doors and side doors; watch what's coming through your front door as well!
I had a feeling you would say that. I don't think this was part of our threat model until pretty recently. That was why I asked, because these stories aren't really internalized collectively yet I think and it's valuable to reconsider who we can trust and what threat actors might value.
I also managed HPC data centers and I agree with you. I feel like the term data center is a key word there. There is a point of scaling where it's just cheaper to manage a data center with a dedicated in house team. I think that holds true in other industries as well.
As far as HPC goes specifically, we could get some of the financial numbers to make sense in the cloud (CPU-intensive jobs), but couldn't make it work for others (data-intensive jobs shipping PBs around on the reg).
That, and HPC has a lot of grant funding. It can be quite advantageous for an org to use an on-prem data center almost like a slush fund. It can keep key projects running that would otherwise die when they're having a rough funding year.
I'd much rather store HIPAA data on a server in my office or closet than worry I got all the IAM settings right. And if I fire someone, security makes sure they can't get in the building. You cannot say the same about the cloud. Yes, I know you can do cloud security right, but on prem security is just harder to mess up.
More than half of the security penetrations in our institution (a medical center - research - med school complex) over the 8 years I worked there, ending in 2020, came through the research arm, even though Research accounted for no more than 10% of the enabled servers in the infrastructure. And we're talking APT penetrations. They weren't looking for HIPAA data (although I used that example in my original post); they were looking for a path to permanent presence in our network in order to mine research. So, why did they come through Research? Because a PI buys some equipment - servers or other network-enabled stuff - puts a grad student or postdoc in charge of it, and enjoys his or her cheap compute. But that student or postdoc is not an infrastructure expert, and most definitely doesn't understand enterprise security. Next thing you know, we've got an APT-owned server on the inside of the network. (And none of that is counting the cases where the postdoc is a foreign national who actually intends to use their position to compromise their employer. Had that happen too.) There are some computational scientists who actually do understand this stuff, but they're rare. Being on the cloud does not inherently fix this problem, but to fix it you have to be on institutionally, professionally managed infrastructure, and once you are, the cost differential between owned infrastructure and well-negotiated, managed cloud infrastructure becomes much more nuanced.
I appreciate your perspective, but it seems the security team should be watching for reverse proxies, tunnels, and other firewall anomalies for on-prem hardware just as a normal course of biz. And if a PI installs a self-managed server, that really should not gum up the works.
All that being said, I have never worked at a place (or in a dept) whose threat profile made APT a real thing.
Obviously we do all those things. But you NEVER want an outside machine, particularly an APT-owned one, inside your network. They can be well hidden and still do very real damage.
You're fortunate if APTs don't consider you a worthy target. They are no joke, and in most cases are playing a long game, more interested in penetration, persistent presence, and quiet theft of information, than in doing anything you'd notice - until they aren't. We had one who burned an asset that had been cultivated on our network for a couple of years to make a hard press play for information about a very particular high profile patient immediately after that patient had been seen (the fact that the patient was seen was public information). But with over 20,000 servers and 300,000 total nodes on the network - some of which cannot be fully patched because they run software someone has to have access to but which won't run on the newest versions of OSs or databases - you still don't know what else they've burrowed into.
A big tech company supplier to our organization had a very telling incident that ended up being detected on our side of the connection, where a brief mistake by network admins opened a channel through their layers of protection for their build pipeline. In the minutes it was open, an APT detected the access (likely because they already owned something internal) and inserted code into their certified OS build pipeline, which we ended up with in our institution.
Probably not hugely more significant, but there are differences. A workstation should be segmented and limited in the range of nodes it can communicate with, if you're running your network properly. A server will likely be in a segment that has much broader access to it, unless you're doing micro-segmentation, and doing it well. By construction, a server set up by a workgroup team outside your core IT server administration staff is unlikely to be properly segmented. And if you're doing traffic analysis to look for rogue behavior, it's harder to spot from something that profiles as a server, because, again, you expect a server to have lots of contacts within the network. Counterbalancing that, if it profiles as a server, you should be more suspicious of any outbound activity.
Also, it assumes full utilization of hardware. If you have variable load (such as only needing to run compute after an experiment), the overhead cost of maintaining a cluster you don't need all the time probably exceeds the cost of resources you can schedule on demand.
This nails so much of the discussion that should be had. When using any cloud service provider, you aren't just paying for the machines/hardware you use - you are paying for people to take care of a bunch of headaches of having to maintain this hardware. It's incredibly easy to overlook this aspect of costs and really easy to oversimplify what's involved if you don't know how these things actually work.
Even as a big cloud detractor, I have to disagree with this.
A lot of scientific computing doesn't need a persistent data center, since you are running a ton of simulations that only take a week or so, and scientific computing centers at big universities are a big expense that isn't always well-utilized. Also, when they are full, jobs can wait weeks to run.
These computing centers have fairly high overhead, too, although some of that is absorbed by the university/nonprofit who runs them. It is entirely possible that this dynamic, where universities pay some of the cost out of your grant overhead, makes these computing centers synthetically cheaper for researchers when they are actually more expensive.
One other issue here is that scientific computing really benefits from ultra-low-latency infiniband networks, and the cloud providers offer something more similar to a virtualized RoCE system, which is a lot slower. That means accounting for cloud servers potentially being slower core-for-core.
Author here. I agree with your points! I use AWS for a computational biology company I'm working on. A lot of scientific computing can spin up and down within a couple hours on AWS and benefits from fast turnaround. Most academic HPCs (by # of clusters) are slower than a mega-cluster on AWS, not well utilized, and have a lot of bureaucratic process.
That said, most of scientific computing (by % of total compute) happens in a different context. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc), and a well-maintained HPC cluster that analyzes that data. The researchers have already waited months for their data, so another couple weeks in a queue doesn't impact their cycle.
For that context, AWS doesn't really make sense. I do think there's room for a cloud provider that's geared towards an HPC use case and doesn't have the app-inspired limits (e.g., data transfer) of AWS, GCP, and Azure.
This is tangential to your point, but I'll just mention that Azure has some properly specced-out HPC gear: IB, FPGAs, the works. You used to be able to get time on a Cray XC with an Aries interconnect, but I never have occasion to use it, so I don't know if you still can. They've been aggressively hiring top-notch HPC people for a while.
That's the Sentinel system. I worked on it when I was at Cray, and we did some COVID stuff[1][2] with a researcher at UAH. We accelerated a docking code using some cool tech I created (in Perl, so there!) and some mods my teammates made to the queuing system.
The work won some award at SC20[3] (fka the Supercomputing conference). I had considered submitting for the Gordon Bell prize, which had been specifically requesting COVID work, though I thought the stuff we had done wasn't terribly sexy. We were getting ~250-500x better performance than single-CPU runs.
Looking back over these, I gotta chuckle, as this (press releases) is pretty much the only time I'm called "Dr.". :D
Back to the OP's points: they are right. In most cases, cloud doesn't make sense for traditional HPC workloads. There are some special cases where it does; those tend to be large ephemeral analysis pipelines, as in bioinformatics and related fields. But for hardcore distributed (mostly MPI) code, running for a long time on a set of nodes interconnected with low-latency networks, dedicated local nodes are the better economic deal.
During my stint at Cray, I was trying (quite hard) to get supercomputers, real classical ones, into cloud providers, or to become a supercomputing cloud provider ourselves. The Met Office system in Azure is a Cray Shasta, but that was more of a special case. I couldn't get enough support for this.
Such is life. I've moved on. Still doing HPC, but more throughput maximized.
The Azure Met Office win left me very conflicted. As someone who is relatively positive about cloud adoption for science it was good to see some forward thinking. On the other hand, what I've heard about how the procurement was run plus my taxpayer-based views on where critical national infrastructure should be housed makes me rather less happy about the outcome.
I don't speak for them (never have), but I believe it to be possible. MSFT do a number of things right (and a few really badly wrong), but you can generally spin up a decent bare metal system there. IO is going to be an issue with any cloud, it will cost for real performance. Between that and networking, clouds could potentially throw in the compute for free ...
Reminds me of a quip I made back in my SGI-Cray (1st time) days: a Cray supercomputer (back then) was a bunch of static RAM that was sold along with a free computer... Not really true, but it gave a sense of the costs involved.
This said, Azure had (last I checked) real Mellanox networking kit for RDMA access. At Cray we placed a cluster in Azure for an end user (who shall remain nameless), and used several of Mellanox's largest switch frames for 100G InfiniBand across >1k nodes, each with many V100 GPUs. The unit would have been in the mid single digits on the Top500 list that year.
AWS is doing their own thing network-wise. Not nearly as good performance-wise (latency or bandwidth) as the Mellanox kit. I don't know if Google Cloud is doing anything beyond TCP.
You can do bare metal at most/all of these. You can do some version of NVMe/local disk at all of them. Some/most let you spin up a parallel file system (network charges, so beware), either their own Lustre flavor, or one of BeeGFS, Weka, etc.
The Azure FPGAs are a bit tangential from a customer perspective; they are just the equivalent of the AWS Nitro smart-NIC. Azure IB is interesting in that I originally expected it to be a killer feature, but for customers I work with it just isn't enough to overcome the multitude of downsides of having to use Azure for everything else. In the end, hardly any commercially relevant codes absolutely need IB, and work well enough with the low-latency ethernet both AWS and GCP offer.
Neither grant agencies nor universities are ready to pay for commercial compute out of grant money. They'd rather have you run analysis on your work laptop/desktop they already provided. Even some of the folks who manage the HPC are unwilling to help researchers (many of whom are not programmers) to use the HPC, lest they mess up and damage the hardware. Source: I work at a tri-institutional collaboration research center in Georgia.
The author is making a brilliant argument for getting a secondhand workstation and shoving it under their desk.
If you are doing multi-machine batch-style processing, then you won't be using on-demand; you'd use spot pricing. The missing argument in that part is storage costs. Managing a high-speed, highly available synchronous file system that can do a sustained 50 GB/sec is hard bloody work (no, S3 isn't a good fit; too much management overhead).
Don't get me wrong AWS _is_ expensive if you are using a machine for more than a month or two.
However, if you are doing highly parallel stuff, Batch and Lustre on demand are pretty ace.
If you are doing a multi-year project, then real steel is where it's at. Assuming you have factored in hosting, storage, and admin costs.
Even for multi-year, if you factor in everything, does it still come out cheaper than AWS? Would you be running everything 24x7 on an HPC cluster? I don't think so. You need scale at some points, and there are probably times where research is done on your desktop.
You could invest in an HPC - but I think the human cost of maintaining one especially if you’re in a high cost of living area (e.g. Bay Area, NYC, etc.) is going to be pretty high. Admin cost, UPS, cable wiring, heat/cooling etc. can all be pretty expensive. Maintenance of these can be pretty pricey too.
Are there any companies that remotely manage data centers and rent out bare metal infra?
Not in the context the person you responded to meant it. Yes, you can very easily get 50GB/s from a few NVMe devices on a single box. Getting 50GB/s on a POSIX-ish filesystem exported to 1000 servers is very possible and common, but orders of magnitude more complicated. 500GB/s is tougher still. 5TB/s is real tough, but real fun.
Check out Apache Iceberg, which makes it fairly trivial to get high throughput from S3 without much fine-tuning. Bursts from 0 to 50 Gbps should be possible from S3 without much effort; just have object sizes that are in the NN+ MiB range. Personally, I find Lustre a mess; it's expensive and even more pain to fine-tune.
> Lustre is a mess, it's expensive and even more pain to fine-tune.
It's a huge RAID-0; as long as your entire team understands that, you'll be OK. It's a lot better than in 2008, but now that AWS has a managed service, I'd just use that. (My heart is always in GPFS land...)
Iceberg doesn't use the directory structure to represent which objects belong to which table. The change allows objects to have significant entropy in the object key prefix. This avoids hot-spotting S3 and allows for greater burst QPS against S3.
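The idea, roughly, is to hash something into the front of the key instead of piling every object under one common prefix. This is just a sketch of the concept, not Iceberg's actual layout; the table and file names are made up:

```python
import hashlib

def spread_key(table: str, filename: str) -> str:
    # A short hash first, so objects fan out across the S3 key space
    # instead of concentrating under a single "directory" prefix.
    h = hashlib.md5(f"{table}/{filename}".encode()).hexdigest()[:8]
    return f"{h}/{table}/{filename}"

print(spread_key("variants", "part-00042.parquet"))
# prints something like "<8 hex chars>/variants/part-00042.parquet"
```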
I think this post is identifying scientific computing with simulation studies and legacy workflows, to a fault. Scientific computing includes those things, but it also includes interactive analysis of very large datasets as well as workflows designed around cloud computing.
Interactive analysis of large datasets (e.g., genome and exome sequencing studies with hundreds of thousands of samples) is well suited to low-latency, serverless, and horizontally scalable systems (like Dremel/BigQuery, or Hail [1], which we build and which is inspired by Dremel, among other systems). The load profile is unpredictable because after a scientist runs an analysis they need an unpredictable amount of time to think about their next step.
As for productionized workflows, if we redesign the tools used within these workflows to directly read and write data to cloud storage as well as to tolerate VM-preemption, then we can exploit the ~1/5 cost of preemptible/spot instances.
One last point: for the subset of scientific computing I highlighted above, speed is key. I want the scientist to stay in a flow state, receiving feedback from their experiments as fast as possible, ideally within 300 ms. The only way to achieve that on huge datasets is through rapid and substantial scale-out followed by equally rapid and substantial scale-in (to control cost).
I've followed Hail and applaud the Broad Institute's work in establishing better bioinformatics software and toolkits, so I hope this doesn't come off as rude, but I can't imagine an instance in a real industry or academic workflow where you need 300 ms feedback from an experiment to "maintain flow", considering how long experiments on data that large (especially exome sequencing!) take overall. My (likely lacking) imagination aside, I guess what I'm really saying is that I don't know what prevents the use case you've described from being performed locally, where there'd be even less latency.
300ms is my ideal latency, but we don’t achieve that under all circumstances. Even for blob storage, I see as much as 100ms latency. That said, my laptop has maybe 8 cores. Even if I had 0ms reads from an SSD, I’m compute bound for some tasks.
Moreover, I think we have differing definitions of “experiment”. In the context of a sequencing study, I think an “experiment” can be as simple as answering the hypothesis: does the missingness of a genotype correlate with any sample metadata (e.g. sequencing platform). You might try to test that hypothesis by looking at a PC1-PC2 plot with points colored by sequencing platform where the PCA is conducted on the 0/1 indicator matrix of missingness.
In the dry lab, that is what I mean by experiment. By that definition, a scientist does many experiments a day. Particularly for sequencing studies, these experiments are data-intensive, I need to run a simple computation on a lot of data to confirm the hypothesis.
A missing element for me is that with a lot of exploratory scientific work we (often half intentionally) have no idea what we are doing. We can easily run a giant job that uses 100x more compute than expected. Yes, you can limit cloud compute resources if you are smart, but it's much better if the default is that you run out of compute and your job takes longer, rather than that you get a 100x bill from your cloud provider. And if you limited your cloud resources to a fixed amount, didn't you just eliminate half the benefit of using cloud in the first place?
Then the problems of data management, transfer and egress are huge. Again the "no idea what we are doing" factor comes into play. If you have a really good idea up front what is going to happen you can plan out a strategy that minimises costs. But if you have no idea at all - genuinely, because this is science and we are doing new things - then you could end up blowing huge amounts of money on unnecessary egress and storage costs. And at the small end we can be talking about experiments run on a shoestring where a few thousand dollars is a big deal.
The way I see it, we need everything - powerful individual workstations / laptops for direct analysis, then a tier of fixed HPC style compute for this kind of work that is poorly matched to cloud, and then for specific projects where it makes sense (massive scaleout, exotic hardware needs - GPUs, FPGAs etc) you embrace cloud resources for that.
Having worked for 2 of the largest cloud providers (1 of them being the largest), I have to say "The Cloud" just doesn't make sense yet for most use cases (maybe with the exception of cloud storage), and that includes startups and small and mid-size companies. It's just way too expensive for the benefits it provides. It moves your hardware acquisition/maintenance cost to development costs; you just think it's better/cheaper because that cost comes in small monthly chunks rather than as a single bill. Plus you add all the security risks, either those introduced by the vendor or those introduced by the massive complexity and poor training of the developers, which, if you want to avoid them, you will have to pay for by hiring a developer competent in security for that particular cloud provider.
Having worked in 3 startups that were AWS-first, I can say that you've learned the completely wrong lessons from your time at your cloud providers.
Building on AWS has provided scale, security, and redundancy at a substantially lower cost than doing any on-prem solution (except for a shitty one strung together with lowendbox machines).
The combined AWS bill for the three startups is less than the cost of an F5, even on a non-inflation adjusted basis.
The cloud doesn't mean that you can be totally clueless. I've had experience in HA/scalability/redundancy/deployment/development/networking/etc. It means that if you do know what you're doing you can deliver a scalable HA solution at a ridiculously lower price point than a DIY solution using bare iron and colo.
1 month, for sure. What about 1 year? Also, did those companies need to provide any training or hiring to achieve that? Because you also need to add that to the cost comparison.
If you are comparing a one-month bill against a one-time purchase (which, if chosen correctly, should not happen more than once every 10 years at the earliest), for sure it will be cheaper. When it comes to scalability, development, and deployment, you should check your tech stack rather than your infrastructure. Kubernetes and containerization should easily take care of those with on-premise hardware while also reducing complexity, plus you will no longer have to worry about off-the-chart network transit fees.
This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers"). Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientists' time by making them wait in a queue when you can just autoscale as high as needed instead? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you're doing that on prem then you need to provision for the peak load, which drives your utilization down and makes on-prem far less cost effective.
For fast-moving researchers who are blocked by a queue, cloud computing still makes sense. I guess I wasn't clear enough in the last section about how I still use AWS for startup-scale computational biology. My scientific computing startup (trytoolchest.com) is 100% built on top of AWS.
Most scientific computing still happens on supercomputers in slower moving academic or big co settings. That's the group for whom cloud computing – or at least running everything on the cloud – doesn't make sense.
Another service that runs on AWS is CodeOcean. It looks like Toolchest is oriented toward facilitating execution of specific packages rather than organization and execution like CodeOcean. Is that a fair summary?
Generally, scientists aren't blocked while they are waiting on a computational queue. The results of a computation are needed eventually, but there is lots of other work that can be done that doesn't depend on a specific calculation.
It's good to learn how not to be blocked on long-running calculations.
On the other hand, if transitioning to a bursty cloud model means you can do your full run in hours instead of weeks, that has real impact on how many iterations you can do and often does appreciably affect velocity.
It can, if you have the technical ability to write code that can leverage the scale-out that most bursty-cloud solutions entail. Coding for clustering can be pretty challenging, and I would generally recommend a user target a single large system with a job that takes a week over trying to adapt that job to a clustered solution of 100 smaller systems that can complete it in 8 hours.
This is a big part of it. In my lab, I have a lot of grad students who are computational scientists, not computer scientists. The time it will take them to optimize code far exceeds a quick-and-dirty job array on Slurm and then going back to working on the introduction of the paper, or catching up on the literature, or any one of a dozen other things.
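For a sense of how quick-and-dirty that is, here's a sketch of the kind of thing meant: a throwaway Slurm job array that fans one script over 100 input chunks. The script contents, file names, resource requests, and array size are all made up:

```python
import pathlib
import subprocess

# A quick-and-dirty Slurm job array: run the same analysis script over 100
# input chunks, one array task each, instead of re-engineering the code.
job = """#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=0-99
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
python analyze.py --input data/chunk_${SLURM_ARRAY_TASK_ID}.csv
"""

script = pathlib.Path("sweep.sbatch")
script.write_text(job)
subprocess.run(["sbatch", str(script)], check=True)
```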
Having worked in the high performance computing field and in cloud hosted commercial applications, I can agree with the article but for entirely different reasons. The reason why some scientific computing shouldn't be done on AWS has to do with networking and latency between compute nodes. Supercomputers often use specialized networking hardware to get single digit microsecond latencies for data transfer between compute nodes and much higher network bandwidth than what you would normally find between EC2 nodes. This allows simulations to efficiently operate on really large data sets that span hundreds or thousands of nodes. The network topology between these nodes is often denser than a tree (think a 2D or 3D grid topology) and offers shorter paths between nodes.
All of this allows you to run code that you can't run in AWS unless it fits on one computer only. It's also way more expensive than clusters of commodity hardware.
For problems that are trivially parallelizable without much communication between nodes - I don't think that most universities can actually operate those cheaper than renting them from cloud computing services. A lot of these calculations don't take the staff to operate data centers, the cost of the building itself or the opportunity cost of using lots of space for this purpose vs something else into account. Economics of scale also kick in here. It's way cheaper per computer for AWS to admin a data center because they do this for orders of magnitude bigger data centers than your typical university.
This is using IP over Ethernet in a tree topology, which has an order of magnitude more latency and less bandwidth than what you would see in a supercomputer, even if all your instances end up in the same rack.
This is the biggest un-addressed problem, IMO. Getting more scientific computing done in the cloud is where we are inevitably trending, but no-one yet has a good answer for completely ad-hoc, low value experimentation and skill building in cloud. I see universities needing to maintain clusters to allow PhDs and postdocs to develop their computing skills for a good while yet.
I agree that this is a big thing to consider here too. I set up a computing cluster in grad school and it was much less costly to make a mistake there than it would have been in a cloud service. Re-running something only wasted wall time and not any money. That said, money is not the only scarce resource here. Researchers can get allocations at university and government HPC systems, but you then have to be quite careful with your allocation of computing time. I remember keeping track of the number of SUs (core-hours) I was burning quite carefully when I used university clusters, since once it is gone, you might not get any more time.
This rings true for me. I have a federal grant that prohibits me from using its funds for capital acquisitions: i.e. servers. But I can spend it on AWS at massive cost for minimal added utility for my use case. Even though it would be a far better use of taxpayer funds to buy the servers, I have to rent them instead.
Lots of places (Hetzner for example) will rent you servers at 10-25% the cost of AWS if you want dedicated hardware, without the ability to autoscale. You can even set up a K8s cluster there if the overhead is worth it.
Doesn't have to be a university either. Depending on the amount of compute needed any capable IT guy can do it for you from their garage with a contract.
I'm not saying AWS is automatically the best option but the question isn't just servers. It's servers, networking hardware, HVAC, a facility to put them all in, and at least a couple people to run and maintain it all. The TCO of some servers is way higher than the cost of the hardware.
Basically, the granting organization doesn't want to pay for the full cost of capital equipment that will - either via time or capacity - not be fully used for that grant.
There are other grant mechanisms for large capital expenditures.
The problem is the thresholds haven't shifted in a long time, so you can easily trigger it with a nice workstation. But then, the budget for a modular NIH R01 was set in 1999, so that's hardly a unique problem.
I can think of a few ways to abuse it while still spinning it as "for research". The obvious one is to buy a $9999 gaming machine with several of whatever the fanciest GPU on the market is at the time, and say you're doing machine learning.
So my guess is it's an overly broad patch for that sort of thing.
There are other regulations to keep people from installing Steam on their ML workstations (which also cover machines below the threshold).
It's entirely about one grant-giving entity not wanting to pay for a piece of capital equipment that will have use beyond the project they're funding. It's a federal regulation, and it comes up far more commonly with lab equipment than it ever does with computers.
Cloud never has made sense for scientific computing.
Renting someone else's big computer makes good sense in a business setting, where you are not paying for your peak capacity when you are not using it, and you are not losing revenue by underestimating whatever peak capacity the market happens to dictate. For business, outsourcing the compute cost center eliminates both cost and risk for a big win each quarter.
Scientists never say, "Gee, it isn't the holiday season, guess we'd better scale things back." Instead they will always tend to push whatever compute limit there is; it is kind of in the job description.
As for the grant argument, that is letting the tool shape the hand. Business-science is not science; we will pay now or pay later.
We have a 500-node cluster at a chemical company, and we've been experimenting with "hybrid-cloud". This allows jobs to use servers with resources we just don't have, or couldn't add fast enough.
Storage is a huge issue for us. We have a petabyte of local storage from a big-name vendor that's bursting at the seams and expensive to upgrade. A lot of our users leave big files lying around for a long time. Every few months we have to hound everyone to delete old stuff.
The other thing that you get with the cloud is there's way more accountability for who's using how many resources. Right now we just let people have access and roam free. Cloud HPC is 5-10x more in cost and the beancounters would shut shit down real quick if the actual costs were divvied up.
We also still have a legacy datacenter so in a similar vein, it's hard to say how much not having to deal with physical hardware/networking/power/bandwidth would be worth. Our work is maybe 1% of what that team does.
I can relate to these problems. Cloud brings positive accountability that is difficult to justify onprem. I have some hope that higher level tools for project/data/experiment management (as opposed to a bash prompt and a path) will bring some accountability without stifling flexibility.
I think there are some things this misses about the scientific ecosystem in Universities/etc. that can make the cloud more attractive than it first appears:
* If you want to run really big jobs e.g. with multiple multi-GPU nodes, this might not even be possible depending on your institution or your access. Most research-intensive Universities have a cluster but they’re not normally big machines. For regional and national machines, you usually have to bid for access for specific projects, and you might not be successful.
* You have control of exactly what hardware and OS you want on your nodes. Often you’re using an out of date RHEL version and despite spack and easybuild gaining ground, all too often you’re given a compiler and some old versions of libraries and that’s it.
* For many computationally intensive studies, your data transfer actually isn’t that large. For example, you can often do the post-processing on-node and then only get aggregate statistics about simulation runs out.
A former colleague did his PhD in particle physics with a novel technique (matrix element method). I can't really explain it, but it is extremely CPU intensive. That working group did it on CERN's resources, and they had to borrow quotas from a bunch of other people. For fun they calculated how much it would have cost on AWS and came up with something ridiculous like 3 million euros.
The bigger experiments will routinely burn through tens of millions worth of computing. But 10 million euros isn't much for these experiments. The issue is that they are publicly funded: any country is much happier to build a local computing center and lend it to scientists than to fork the money over to an American cloud provider.
(The expensive part of these experiments is simulating billions of collisions and how the thousands of outgoing particles propagate through a detector the size of a small building. Simulating a single event takes around a minute on a modern CPU, and the experiments will simulate billions of events in a few months. If AWS is charging 5 cents a minute it works out to tens of millions easily.)
I can't speak specifically to CERN and the exact workload. But bear in mind that the 3MM euros is non-negotiated sticker pricing. In real life, negotiated pricing can be much, much less depending on your org size and spend. This is a variable most people neglect.
That is true, and a large part of the theoretical cost was probably also traffic, and the use of nonstandard nodes. They could have gotten a much more realistic price.
I guess the point is also that scientists often don't realize compute costs money when the computers have already been bought.
I almost got a Tenure Track position at a Data Science Faculty in Virginia, and I think them not having an HPC cluster was the single issue that blew this move (from both sides). During interviews, I asked the dean how they set up their HPC cluster - turned out, they hadn't. I then asked a Professor in the next review round how they teach their students without one:
> "I buy all resources on AWS - it's painful because I have to contact AWS almost monthly for accidental over-billing, but we don't have a solution".
All of this made me really sceptical, since coming from a big University in Germany, we have unlimited HPC resources for free. I have 16 VMs, the biggest one with 125 GB of memory, and I can set those up or move them around however I want. No space limitations - need 10 TB of space for 3 months? Open a service ticket, 3 hours later it's available. Ports need to be opened worldwide to the web? No problem. Need a Jupyter Hub Cluster on Kubernetes? Here you go. This has really improved my work (quality, performance, and convenience) so much.
I was once coordinator of a research project where we had 30k EUR left and didn't know what to do with it. I contacted our HPC and asked if they want the money - answer: "30k really isn't worth the effort, we don't know what to do with it atm."
I second this, would also ring my alarm bells in an interview. Back as a master student I went from a top 10 research university that didn't have a strong HPC infra at the time (ETH Zurich) to a top 100 one that did (Tokyo Tech), at least as a guest. Difference was staggering - all students just got handed a login on their big (top 100) cluster to be able to work with the actual hardware, even if limited to a couple of nodes. This much increased the likelihood that someone would be able to come up with an interesting project just together with their academic supervisor (no huge group projects with tons of extra funding needed).
I’ve also been skeptical of the commercial cloud for scientific computing workflows. I don’t think this cost benefit analysis mentions it, but the commercial cloud makes even less sense when you take into account brick and mortar considerations. In other words, if your company/institution has already paid for the machine rooms, sys admins, networks, the physical buildings, the commercial cloud is even less appealing. This is especially true with “persistent services” for example data servers that are always on because they handle real-time data, for example.
Another aspect of scientific computing on the commercial cloud that’s a pain if you work in academia is procurement or paying for the cloud. Academic groups are much more comfortable with the grant model. They often operate on shoe-string budgets and are simply not comfortable entering a credit card number. You can also get commercial cloud grants, but they often lack long-term, multiyear continuity.
This. It's got nothing to do with "comfort". I use cloud computing all the time in the rest of my life, but the rest of my life isn't subject to university policies and state regulations.
I completely agree for most cases. In many scientific computing applications, compute time isn’t the factor you prioritize in the good/fast/cheap triad. Instead, you often need to do things as cheaply as possible. And your data access isn’t always predictable, so you need to keep results around for an extended period of time. This makes storage costs a major factor. For us, this alone was enough to move workloads away from cloud and onto local resources.
Actually, compute is fine for most use cases (spot instances, preemptible VMs on GCP) and has been used in lots of situations, even at CERN. Where the cloud also excels is if you need any kind of infrastructure, because no HPC center has figured out a reasonable approach to that (some are trying with k8s). Also, obviously, you get a huge selection of hardware.
Where cloud/AWS doesn't make sense is storage, especially if you need egress, and if you actually need InfiniBand.
The killer we've seen is data egress costs. Crunching the numbers for some of our pipelines, we'd actually be paying more to get the data out of AWS than to compute it.
As in, the networking equipment consumes the most energy? Given the 30x markup on AWS egress I'm inclined to say it's more about incentives and marketing, but I'd love to learn otherwise.
This is the case for a large class of big data + high compute applications. Animation / simulation in engineering, planning, forecasting, not to mention entertainment require pipelines the typical cloud is simply too expensive to use.
I'm not in a position to recommend or not a particular provider for gpu-equipped servers, simply because I've never had the need for gpus.
My first thought was related to colocation services. From what I understand, a lot of people avoid on-premise/in-house solutions because they don't want to deal with server rooms, redundant power, redundant networks, etc.
So people go to the cloud and pay horrendous prices there.
Why not take a middle path? Build your own custom server with your preferred hardware and put it in a colocation facility.
There are several tier-two clouds that offer GPUs but I think they generally fall prey to many of the same issues you'll find with AWS. There is a new generation of accelerator-native clouds e.g. Paperspace (https://paperspace.com) that cater specifically to HPC, AI, etc. workloads. The main differentiators are:
- much larger GPU catalog
- support for new accelerators e.g. Graphcore IPUs
- different pricing structure that address problematic areas for HPC such as egress
However, one of the most important differences is the lack of unrelated web-service components that pose a major distraction/headache to users who don't have a DevOps background (which AWS obviously caters to). AWS can be incredibly complicated. Simple tasks are encumbered by a whole host of unrelated options/capabilities and the learning curve is very steep. A platform that is specifically designed to serve the scientific computing audience can be much more streamlined and user-friendly for this audience.
Lambda A100s - $1.10 / hr
Paperspace A100s - $3.09 / hr
Genesis A100s - none available, but their 3090 (roughly half the speed of an A100) is $1.30 / hr
Cloud worked really well for me when I was in school. A lot of the time, I would only need a beefy computer for a few hours at a time (often due to high memory usage) and you can/could rent out spot instances for very cheap. There are about 730 hours per month so the cost calculus is very different for a student/researcher who needs fast turnaround times (high performance), but only for a short period of time.
However, I know not all HPC/scientific computing works that way and some workloads are much more continuous.
That's how my department uses the cloud--we have an image we store up at AWS geared towards a couple of tasks and we spin up a big instance when we need it, run the task, pull out the results, then stop the machine. Total cost sub-100 dollars. If we had to go to the HPC group we'd have to fight with them to get the environment configured, get access to the system, get payment set up, teach the faculty to use the environment, etc. It's just a pain for very little gain.
Using on-demand for latency-insensitive work, especially when you're also very cost sensitive, isn't the right choice. Spot instances will get you somewhere in the realm of the Hetzner/on-prem numbers.
Even more importantly, if you have any reasonable amount of spend on cloud, you can get preferred pricing agreements. As much as I hate to talk to "salespeople", I did manage to cut millions in costs per year with discounts on serving and storage.
Personally, when I estimate the total cost of ownership of scientific cloud computing versus on prem (for extremely large-scale science with both significant server, storage, and bandwidth requirements) the cloud ends up winning for a number of reasons. I've seen a lot of academics who disagree but then I find out they use their grad students to manage their clusters.
But, as the article points out, you are still paying a lot of money for features that you don't need for scientific computing.
Also, AWS is notoriously easy to undercut with on-prem hardware, especially if your budget is large and your uptime requirements aren't - you'll save a few hundred thousand a year alone by not having to hire expert engineers for on-call duty and extreme reliability.
Even spot instances on AWS are still over 2x more expensive per month than Hetzner. The cheapest c5a.24xlarge spot instance right now is $1.5546/hr in us-east-1c. That's $1134.86/mo, excluding data transfer costs. If you transfer out 10 TB over the course of a month, that's another $921.60/mo – or now 4x more expensive than Hetzner.
Using the estimate from the article, spot instances are still over 8x more expensive than on-prem for scientific computing.
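For anyone who wants to redo that arithmetic with their own numbers, here's a minimal sketch. The hourly rate, egress volume, and per-GB price are just the figures quoted in this thread (using 730 hours per month and 1,024 GB per TB, which reproduces the $1134.86 and $921.60 figures), not current AWS rates.

    # Minimal sketch of the cost arithmetic above; all prices are placeholder
    # figures quoted in this thread, not current AWS rates.
    HOURS_PER_MONTH = 730  # roughly 24 * 365 / 12

    def monthly_compute(hourly_rate: float) -> float:
        """Instance cost for a full month at a given hourly spot price."""
        return hourly_rate * HOURS_PER_MONTH

    def egress_cost(tb_out: float, price_per_gb: float = 0.09) -> float:
        """Flat-rate approximation of data-transfer-out cost (ignores tiering)."""
        return tb_out * 1024 * price_per_gb

    spot = monthly_compute(1.5546)   # ~= $1134.86/mo for the c5a.24xlarge spot price above
    transfer = egress_cost(10)       # ~= $921.60/mo for 10 TB out at $0.09/GB
    print(f"spot: ${spot:,.2f}/mo, egress: ${transfer:,.2f}/mo, total: ${spot + transfer:,.2f}/mo")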
This has been my exact field of work for a few years now; in general I have found that:
When people claim it is 10x more expensive to use public cloud, they have no earthly idea what it actually costs to run an HPC service, a data centre, or do any of the associated maintenance.
When the claim is 3x more expensive in the cloud, they do know those things but are making a bad faith comparison because their job involves running an on-premises cluster and they are scared of losing their toys.
When the claim is 0-50% more to run in the cloud, someone is doing the math properly and aiming for a fair comparison.
When the claim is that cloud is cheaper than on-prem, you are probably talking to a cloud vendor account manager whose colleagues are wincing at the fact that they just torched their credibility.
It can categorically be stated that for a year's worth of CPU compute, local will always be less than Amazon. Of course, putting percentages on it doesn't work - there are just too many variables.
There are many admins out there who have no idea what an Alpha is who'll swear that if you're not buying Dell or HP hardware at a premium with expensive support contracts, you're doing things wrong and you're not a real admin. Visit Reddit's /r/sysadmin if you want to see the kind of people I'm talking about.
The point is that if people insist on the most expensive, least efficient type of servers such as Dell Xeons with ridiculous service contracts, the savings over Amazon won't be large.
It's a cumulative problem, because trying to cool and house less efficient hardware requires more power and that hardware ultimately has less tolerance for non-datacenter cooling.
Rethink things. You can have AMD Threadripper / EPYC systems in larger rooms that require less overall cooling, that have better temperature tolerance, that're more reliable in aggregate, which cost less and for which you can easily keep around spare parts which would give better turnaround and availability than support contracts from Dell / HP. Suddenly your compute costs are halved, because of pricing, efficiency, overall power, real estate considerations...
So percentages don't work, but the bottom line is that when you're doing lots of compute, over time it's always cheaper locally, even if you do things the "traditional" expensive and inefficient way, so arguing percentages with so many variables doesn't make any sense - it's still cheaper, no matter what.
All types of HPC; I'm a sysadmin/consultant. I don't think the problem with the cost gap is overestimating cloud costs but rather underestimating on-prem costs. Also, failing to account for financing differences and opportunity-costs of large up-front capital purchases.
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers
That sounds very much like an argument for a cloud. Instead of waiting months to do your processing, you spin up what you need, then tear it down when you are done.
I've never worked in this space, but I'm curious about the need for massive egress. What's driving the need to bring all that data back to the institution?
Could whatever actions have to be performed on the data also be performed in AWS?
Well for starters, if you are NIH or NSF funded, they have data storage requirements you must meet. So usually this involves something like tape backups in two locations.
The other is for reproducibility - typically you need to preserve lots of in-between steps for peer review and proving that you aren't making things up. Some intermediary data is wiped out, but usually only if it can be quickly and easily regenerated.
I've worked the last few years in a research institute well-funded by academic standards. During that time, people around me have used self-hosted hardware, at least three different cloud providers, and HPC clusters hosted by several different organizations. I've also worked with people from many other institutions, each of which has made its own infrastructure choices.
You may use AWS yourself, but your collaborators have made their own choices for their own reasons. The data goes where it's needed, often crossing institutional boundaries.
Regarding the waiver—"The maximum discount is 15 percent of total monthly spending on AWS services". Was very excited at first.
As for leaving data in AWS, data is often (not always) revisited repeatedly for years after the fact. If new questions are raised about the results it's often much easier to check the output than rerun the analysis. And cloud storage is not cheap. But yes it sometimes makes sense to egress only summary statistics and discard the raw data.
One of the aspects not touched on for this is PII/confidential data/HIPAA data, etc.
For that, whether it makes sense or not, a lot of universities are moving to AWS, and the infrastructure cost of AWS for what would be a pretty modest server are still considerably less than the cost of complying with the policies and regulations involved in that.
Recently at my institution I asked about housing it on premise, and the answer was that IT supports AWS, and if I wanted to do something else, supporting that - as well as the responsibility for a breach - would rest entirely on my shoulders. Not doing that.
My university owns hardware in multiple locations, plus uses hardware in a collocation, and still uses the cloud for bursting (overflow). You can't beat the provisioning time of cloud providers which is measured in seconds.
The author makes a convincing argument against doing this workload on on-demand instances, but what about spot instances? AWS explicitly calls out scientific computing as a major use case for spot instances in its training/promotional materials. Given the advertised ~70-90% markdown on spot instance time, it seems like a great option compared to paying almost the same amount as the workstation but not having to pay to buy, maintain, or replace the hardware.
Author here! Spot instance pricing is better than on-demand, but it doesn't include data transfer, and it's still more expensive than on-prem/Hetzner/etc. Data transfer costs exceed the cost of the instance itself if you're transferring many TB off AWS.
For one of the more popular AWS instance types I use – a c5a.24xlarge, used for comparison in the post – the cheapest spot price over the past month in us-east-1 was $1.69. That's still $1233.70/mo: above on-prem, colo, or Hetzner pricing. Data transfer is still extremely expensive.
That said, for bursty loads that can't be smoothed with a queue, spot instances (or just normal EC2 instances) do make sense! I use them all the time for my computational biology company.
FWIW, spot prices for c5a.24xlarge in us-east-2b and us-east-2c seem to have been under $0.92/hr for most of the last 3 months. So, assuming some flexibility on the choice of region, that would adjust your estimate to $0.92 / $1.69 * $1233.70/mo = $671.60/mo, which looks a lot more reasonable. Hopefully I did that math right. Data egress prices are definitely still ridiculous, I agree.
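If you want to check this across zones yourself, a rough boto3 sketch like the one below will pull recent spot price history; the region, instance type, and 90-day window are just examples.

    # Rough sketch: pull recent c5a.24xlarge spot prices and report the cheapest
    # observation per availability zone. Region and instance type are examples.
    from collections import defaultdict
    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")
    paginator = ec2.get_paginator("describe_spot_price_history")
    cheapest = defaultdict(lambda: float("inf"))

    for page in paginator.paginate(
        InstanceTypes=["c5a.24xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(days=90),
    ):
        for record in page["SpotPriceHistory"]:
            az = record["AvailabilityZone"]
            cheapest[az] = min(cheapest[az], float(record["SpotPrice"]))

    for az, price in sorted(cheapest.items()):
        print(f"{az}: lowest observed spot price ${price:.4f}/hr")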
True! It sometimes drops even more, which definitely makes spot instances attractive. The r5.16xlarges had a ~80% discount and a <5% termination rate for a while.
Also, if all of scientific computing switched to AWS in order to exploit spot instance pricing, I don't think those market dynamics would stay the same.
As an aside, I've had trouble creating large clusters of high-memory instances in us-east-2. They might have increased capacity recently, though.
On the other hand it makes sense if you just need to borrow their infrastructure for a while to calculate something.
A lot of scientific computing isn't happening continuously; a lot of it is a one-time experiment, or maybe run a couple of times, after which you would have to tear down and reassign.
Another fun fact people forget is that our ability to predict the future is still pretty poor. Not only that, we are biased towards thinking we can predict it when in fact this is complete bullshit.
You have to buy and set up infrastructure before you can use it, and then you have to be ready to use it. What if you are not ready? What if you will not need as many resources? What if you stop needing it earlier than you thought? When you borrow it from AWS you have the flexibility to start using it when you are ready and drop it immediately when you no longer need it. Which has value on its own.
At the company I work for, we basically banned signing long-term contracts for discounts. We found that, on average, we pay many times more for unused services than whatever we gained through discounts. Also, when you pay for the resources there is an incentive to improve efficiency. When you have basically prepaid for everything, that incentive is very small and is basically limited to making sure you stay within limits.
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers – that's the API equivalent of storing your inbound API requests, and then responding to them months later
Makes sense if the jobs are all low urgency.
We have a similar problem in trading so we have a composite solution with non-cloud simulation hardware and additional AWS hardware. That's because we have the high utilization solution combined with high urgency.
I did have to chuckle a bit because, working on HPC simulations of the pandemic during the pandemic, there was an awful lot of "This needs to be done tomorrow" urgency.
What does the landscape look like now for "terraform for bare metal"? Is ansible/chef still the main name in town? I just wanna netboot some lightweight image, set up some basic network discovery on a control plane, and turn every connected box into a flexible worker bee I can deploy whatever cluster control layer (k8s/nomad) on top of and start slinging containers.
I really like this description of how baremetal infrastructure should work, and this is where I think (shameless self promotion) Triton DataCenter[1] plays really well today on-prem.
PXE booted lightweight compute nodes with a robust API, including operator portal, user portal, and cli.
Keep an eye out for the work we are doing with Triton Linux + K8s. Very lightweight Triton Linux compute node + baremetal k8s deployments on Triton.
Thanks! This is pretty sleek! I'm going to have to dust off my homelab and play around with this.
What is the stack written in? Looks like a lot of javascript and makefiles from the github side of things but idk if that's the whole kit and caboodle.
Is genomic code typically distributed-memory parallel? I'm under the impression that it is more like batch processing, not a ton of node-to-node communication but you want lots of bandwidth and storage.
If you are doing a big distributed-memory numerical simulation, on the other hand, you probably want infiniband I guess.
AWS seems like an OK fit for the former, maybe not great for the latter...
The fastest way to do a lot of genomics stuff is with FPGA accelerators, which also aren't used by most of the other tenants in a multi-tenant scientific computing center. The cloud is perfect for that kind of work.
That's interesting. It is sort of funny that I was right (putting genomics in the "maybe good for cloud" bucket) for the wrong reason (characterizing it as more suited for general-purpose commodity resources, rather than suited for the more niche FPGA platform).
This is a bit overstated. Yes, FPGA accelerators can be used effectively in some common genomics workflows. However, my experience is that the right software running on regular Intel/AMD/ARM processors is very competitive with FPGA-using solutions.
>a month-long DNA sequencing project can generate 90 TB of data
Our EM facility generates 10 TB of raw data per day, and once you start computing on it, that increases by 30%-50% depending on what you do with it. Plus, moving between network storage and local scratch for computational steps basically never ends and keeps multiple 10 GbE links saturated 100% of the time.
A trend I've seen on HN over past few years is that people love showing off how they are able to save money by spending more of their own time, especially on infra/cloud things - if you calculate your own hourly rate correctly, it's oftentimes more costly to DIY than outsourcing to experts (e.g., managed cloud).
AWS is one tool but it's a lot like the proprietary computing ecosystems that have existed for a long time (remember the Micro$oft days?). It offers convenience in return for lock-in and very high margins. There's no clear answer but it's definitely not a clear-cut decision where AWS is guaranteed to save money.
There are 2 major costs that are overlooked by the in-house crowd, which are operational maintenance cost (an increasingly rare and expensive skillset), and also the cost of downtime -- how much does it cost you when your team of data scientists are blocked because of a failed OS update etc. That being said, hiring competent people to maintain AWS properly isn't cheap either -- and it is quite easy to start running up very wasteful AWS bills on things you don't need.
As always there's a tradeoff -- the key is to choose a path and to execute it well.
I imagine what makes this especially hard is you have (at least) three parties in play here:
- the people doing the research
- the institution's IT services group
- the administrator who writes the checks
And in my experience, "actual knowledge of what must be done and what it will or could cost" can vary greatly across these three groups; frequently in very unintuitive ways.
These MPI-based scientific computing applications make up the bulk of the compute hours on HPC clusters, but there is a crazy long tail of scientists who have workloads that can't (or shouldn't) run on their personal computers. The other option is HPC. This sucks for a ton of reasons, but I think the biggest one is that it's more or less impossible to set up a persistent service of any kind. So no databases; if you want Spark, be ready to spin it up from nothing every day (also no HDFS unless you spin that up in your SLURM job too). This makes getting work done harder, but it also makes integrating existing work so much harder, because everyone's workflow involves reinventing everything and everyone does it in subtly incompatible ways; there are no natural (common) abstraction layers because there are no services.
AWS is fantastic for scientific computing. With it you can:
- Deploy a thousand servers with GPUs in 10 minutes, churn over a giant dataset, then turn them all off again. Nobody ever has to wait for access to the supercomputer.
- Automatically back up everything into cold storage over time with a lifecycle policy.
- Avoid the massive overhead of maintaining HPC clusters, labs, data centers, additional staff and training, capex, load estimation, months/years of advance planning to be ready to start computing.
- Automation via APIs to enable very quick adaptation with little coding.
- An entire universe of services which ramp up your capabilities to analyze data and apply ML without needing to build anything yourself.
- A marketplace of B2B and B2C solutions to quickly deploy new tools within your account.
- Share data with other organizations easily.
AWS costs are also "retail costs". There are massive savings to be had quite easily.
I don't control my AWS account. I don't even have an AWS account in my professional life.
I tell my IT department what I want. They tell the AWS people in central IT what they want. It's set up. At some point I get an email with login information.
I email them again to turn it off.
Do I hate this system? Yes. Is it the system I have to work with? Also yes.
"AWS as implemented by any large institution" is considerably less agile than AWS itself.
Calculating costs based on sticker price is sometimes misleading because there’s another variable: negotiated pricing, which can be much much lower than sticker prices, depending on your negotiating leverage. Different companies pay different prices for the same product.
If you’ve ever worked at a big company or university (any place where you spend at scale), you’ll know you rarely pay sticker price. Software licensing is particularly elastic because it’s almost zero marginal cost. Raw cloud costs are largely a function of energy usage and amortized hardware costs — there’s a certain minimum you can’t go under but there remains a huge margin that is open to be negotiated on.
Startups/individuals rarely even think about this because they rarely qualify. But big orgs with large spends do. You can get negotiated cloud pricing.
This is definitely true for cloud retail prices. However, this becomes not true in cases I've seen when there is an existing discount. Reserved instances, for example.
When a company reaches a certain mass, hardware cost is a factor that is considered, but not a big one.
The bigger problems are lost opportunity costs and unnecessary churn.
Businesses lose a lot when a product launch is delayed by a year simply because the hardware arrived late or has too many defects (ask your hardware fulfillment people how many defective RAM sticks and SSDs they get per new shipment).
Churn can cost the business a lot as well. For example, imagine the model that everyone has been using is trained on a Mac Pro under XYZ's desk. And then when XYZ quits, they never properly backed up the code and the model.
Bare metal allows for sloppiness that the cloud cannot afford to allow. Accountability and ownership is a lot more apparent in the cloud.
There is a lot of discussion about supercomputers in this article. I don't think public cloud providers can compete easily with traditional super computers because they are built for optimal processing of extremely large scale MPI workloads. Such workloads are not common so I expect that public cloud providers wouldn't bother optimizing for this niche use case (though I know they all have offerings). Also when you are only optimizing for a single variable (i.e. speed), you can make design choices that would be impossible to make in a more general situation.
Of course, not all scientific computing workloads require a traditional supercomputer. In fact, I suspect most do not.
I agree with the article. We at croit.io support customers around the globe to build their clusters and save huge amounts. For example, AWS S3 compared to Ceph S3 in any data center of your choice is around 1/10 of the AWS price.
>Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.
Can’t imagine you are paying public prices on any cloud provider if you have a $50M/yr budget.
In addition, if, as the article states, the scientists are ok to wait some considerable time for results, then one can run most, if not all, on spot instances, and that can save 10x right there.
If you don’t have $50M/yr there are companies that will move your workload around different AWS regions to get the best price - and will factor in the cost of transferring the data too.
I was architect at large scientific company using AWS.
Author here. I agree that pricing is highly negotiable for any large cloud provider, and there are even (capped) egress fee waivers that you can negotiate as a part of your contract. There's also a place for using AWS; I used it for a smaller DNA sequencing facility, and I use it for my computational biology startup.
That said, I'll repeat something that I commented somewhere else: most of scientific computing (by % of compute) happens in a context that still doesn't make sense in AWS. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc), and a well-maintained HPC cluster that analyzes that data.
Spot instances are still pretty expensive for a steady queue (2x of Hetzner monthly costs, for reference), and you still have to pay AWS data transfer egress costs – which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link. Data transfer to optimize for spot instance pricing becomes prohibitive when your job has 100 TB of raw data.
Surely the raw data is input, so ingress costs, which is free?
The problem is if you have large amounts of intermediate data, and you want to transfer that somewhere else to continue analysis. Then it's "expensive". So the logical conclusion is to do all the work on the cloud, so you never have egress costs. That causes anxiety, sure. That said, 100TB costs $1000 to egress, assuming that your $50M/yr covers a Verizon 10GB/s line.
These opinions are my own and not those of my employer or former employer.
Database analyst for a large communication company here.
I have similar doubts about AWS for certain kinds of intensive business analysis. Not API based transactions, but back-office analysis where complex multi-join queries are run in sequence against tables with 10s of millions of records.
We do some of this with SQL servers running right on the desktop (and one still uses Excel with VLOOKUP). We have a pilot project to try these tasks in a new Azure instance. I look forward to seeing how it performs, and at what cost.
I'd love to buy my own servers for small-scale (i.e. startup size or research lab size) projects, but it's very hard to be utilizing them 24x7. Does anyone know of open-source software or tools that allow multiple people to timeshare one of these? A big server full of A100s would be awesome, with the ability to reserve the server on specific days.
> the ability to reserve the server on specific days
In an environment where there are not too many users and everyone is cooperative, using Google Calendar to reserve time slots works very well and is very low maintenance. Technical restrictions are needed only when the users can't be trusted to stay out of each other's way.
If you pay $500 to form an LLC with Stripe Atlas you get $10,000 worth of AWS credits that can be used any way you want. It's a pretty solid way to cost effectively do scientific computing even if you need like five companies.
If a policy change is made because of this comment. I'm sorry. For sure let me know though. I'll put it on my resume.
Five years is a pretty typical amortisation schedule for HPC hardware. During my sysadmin days, of CPU, memory, cooling, power, storage, and networking, the only things that broke were hard disks and a few cooling fans. Disks were replaced by just grabbing a spare and slotting it in, and fans were replaced by, well, swapping them out.
Modern CPUs and memory last a very long time. I think I remember seeing Ivy Bridge CPUs running in Hetzner servers in a video they put out, and they're still fine.
If you expect downtime during the 5 years to replace fans and whatnot, you're not getting 100% of your money/perf back - and I didn't see that in the article.
If you have spares, value lost to downtime stays minimal, but you have to include the spares in the expenses. If you don't have spares, a 1-2 day downtime is going to be a decent hit to value.
I’m not sure I understand what you mean. I’ve run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500 node cluster are down for a few days while you wait for RMA parts to arrive, you haven’t lost much value. Your cluster is still functioning at nearly peak capacity.
You have a handful of nodes that the cluster can’t function without (scheduler, fileservers, etc), but you buy spares and 24x7 contracts for those nodes.
I think spot instances on amazon at least don't do partial hours, do they? So, you'll also have some wasted cycles there. Probably enough to compensate for your on-prem downtime.
I've worked with a YARN cluster with around 200 nodes which ran non-stop for well over 5 years and is still kicking. There were a handful of failures and replacements, but I'd say 95% of the cluster was fine 7 years in.
When I was looking at AWS for personal use, I first thought it was oddly expensive even when factoring in not having to buy the hardware. But when I looked at just what the electricity to run it myself would cost, I think that addition alone meant AWS was actually cheaper. This is without factoring in cooling / dedicated space / maintenance.
Most of the comments get fixated on the most and least expensive options in this (AWS where you pay through the nose / own DC where you'll get bad service from your institution and have to fight with hw procurement stuff). What about the middle ones presented, the more reasonably priced cloud/rental server providers?
Buying your own fleet of dedicated servers seems like a smart move in the short term, but then five years from now you’ll get someone on the team insisting that they need the latest greatest GPU to run their jobs. Cloud providers give you the option of using newer chipsets without having to re-purchase your entire server fleet every five years.
In HPC land, most hardware is amortized over five years and then replaced! If you keep your service in life for five years at high utilization, you're doing great.
Very odd article (to be written by a scientist)—Shouldn't it be comparing with GCE? Doesn't make sense to compare on cost against AWS instead of GCP, except.. for wow numbers and moar clicks.
"Linkedin doesn't make sense for connecting with friends".
I can tell you that NASA is in the midst of a multi-year effort to move their computing to AWS and that yes.. downloading 324 terabytes of data is very expensive but very soon all this data will just remain in the cloud and be accessed virtually.
The fact that scientific computing has a different pattern than the typical web app is actually a good thing. If you can architect large batch jobs to use spot instances, it's 50-80% cheaper.
Also this bit: "you can keep your servers at 100% utilization by maintaining a queue of requested jobs" isn't true in practice. The pattern of research is that the work normally comes in waves. You'll want to train a new model or run a number of large simulations. And then there will be periods of tweaking and work on other parts. And then more need for a lot of training. Yes, you can always find work to put on a cluster to keep it >90% utilization, but if it can be elastic (and the compute has a budget attached to it), it will rise and fall.
Author here! I worked on the computing infrastructure for a DNA sequencing facility, and I run a computational biology infrastructure company (trytoolchest.com, YC W22). Both are built on AWS, so I do think AWS in scientific computing has its use-cases – mostly in places where you can't saturate a queue or you want a fast cycle time.
Spot instances are still pretty expensive for a steady queue (2x of Hetzner monthly costs, for reference), and you still have to pay AWS data transfer egress costs – which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link.
This post was born from frustration at AWS for their pricing and offerings after trying to get people to switch to AWS in scientific computing for years :)
I'm surprised no-one mentioned Amazon Lightsail by now. Anyway, yes, AWS can be super expensive especially if you don't know what you're doing, regardless of type of processing.
Using your own infrastructure always wins (assuming free labor), since you can load your own infrastructure to ~95% pretty much 24/7, which is unbeatable.
It might also depend on how long you're actually willing to wait. There's nothing stopping you from having a job queue in AWS, and you can set things up so that instances are only running if the price is low enough.
Otherwise completely agree, there might be some cases where the cost of labour means that you're better off running something in AWS, even if that requires someone to do the configuration as well.
- CERN started planning its computing grid before AWS was launched.
- It's pretty complicated (politics, mission, vision) for CERN to use external proprietary software/hardware for its main functions (they have even started moving away from MS Office-like products).
- [cost] CERN is quite different from a small team of researchers doing a few years of research. The scale is enormous and very long-lived, continuing for decades.
- and more...
HPC and scientific computing aside, I would have loved to be able to use AWS when I worked there; the internal infra for running web apps and services wasn't nearly as good or reliable, nor did it have as wide a catalog of services on offer.
I think the spirit of the article is to put the cloud in perspective of the organization size and the workload type. There is a sweet spot where the cloud is the only option that makes sense: definitely with variable loads and the capacity to scale on demand as big as our budget allows, there is no match for that. However... there are organizations with certain types of workloads that could afford to put infrastructure in place, and even with the costs of staffing, energy, etc. they will save millions in the long run. NASA, CERN, etc. are some. This is not limited to HPC; the cloud at scale is not cheap either, see: https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...
The vast majority of researchers don't need anywhere close to the amount of resources that CERN needs. The fact that CERN doesn't use EC2 and lambdas shouldn't be taken as a lesson by anyone who's not operating at their scale.
This feels like a similar argument to the one made by people who use Kubernetes to ensure their web app with 100 visitors a day is web scale.
Toolchest actually runs scientific computing on AWS! I'm just frustrated by what we can build, because most scientific compute can't effectively shift to AWS.
As an HPC sysadmin for 3 research institutes (mostly life sciences & biology) I can't see how a cloud HPC system could be any cheaper than an on-prem HPC system, especially if I look at the resource efficiency (how much was requested vs how much was actually used) of our users' SLURM jobs. Often the users request 100s of GB but only use a fraction of it. In our on-prem HPC system this might decrease utilization (which is not great), but in the cloud this would result in increased computing costs (because of a bigger VM flavor), which would probably be worse (CapEx vs OpEx).
Of course you could argue that the users should do and know better and properly size/measure their resource requirements. However, most of our users have a lab background and are new to computational biology, so estimating or even knowing what all the knobs (cores, mem per core, total memory, etc.) of the job specification mean is hard for them. We try to educate by providing trainings and job efficiency reporting, but the researchers/users have little incentive to optimize their job requests and are more interested in quick results and turnover, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them, or rather their group leaders/institute heads, to put a focus on this, but until you move to the cloud you won't know.
Additionally, the typical workloads that run on our HPC system are often some badly maintained bioinformatics software or R/Perl/Python throwaway scripts, and often enough a typo in the script causes the entire pipeline to fail after days of running on the HPC system and it needs to be restarted (maybe even multiple times). Again, on the on-prem system you have wasted electricity (bad enough), but in the cloud you have to pay the computing costs of the failed runs. Again, cost transparency might force people to fix this, but the users are not software engineers.
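For what it's worth, the job-efficiency reporting mentioned above can be approximated with a small wrapper around sacct (Slurm's seff contrib tool does this more completely). This is only a rough sketch: it assumes ReqMem and MaxRSS come back in simple K/M/G form, which varies between Slurm versions, so treat the parsing as a starting point rather than anything definitive.

    # Sketch of a per-job memory-efficiency report built on `sacct`.
    # Assumes ReqMem like "200G"/"4000M" and MaxRSS like "12345678K";
    # real output varies by Slurm version, so treat this as a starting point.
    import subprocess

    def to_mb(value: str) -> float:
        """Convert a Slurm size string ("4000M", "200G", "12345K") to megabytes."""
        value = value.strip().rstrip("nc")  # older Slurm appends n (per node) / c (per CPU)
        units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024}
        if value and value[-1] in units:
            return float(value[:-1]) * units[value[-1]]
        return float(value) if value else 0.0  # bare numbers: assume MB

    rows = subprocess.run(
        ["sacct", "--allusers", "--state=COMPLETED", "--noheader", "--parsable2",
         "--format=JobID,ReqMem,MaxRSS"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    for row in rows:
        jobid, reqmem, maxrss = row.split("|")
        if not maxrss:  # parent job lines usually carry no MaxRSS; the steps do
            continue
        requested, used = to_mb(reqmem), to_mb(maxrss)
        if requested > 0:
            print(f"{jobid}: requested {requested:.0f} MB, peak {used:.0f} MB, "
                  f"efficiency {100 * used / requested:.0f}%")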
One thing that the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPUs to GPUs. A new CryoEM microscope was installed whose downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is now using GPUs too.
We have around 100 GPUs, average utilization has increased to 80-90%, and the users are complaining about long waiting/queueing times for their GPU jobs.
For this, bursting to the cloud would be nice; however, GPUs are prohibitively expensive in the cloud, unfortunately, and the above-mentioned caveats regarding job resource efficiency still apply.
One thing that will hurt on-prem HPC systems though is the increased electricity prices. We are now taking measures to actively save energy (i.e. by powering down idle nodes and powering them up again when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't increased their prices yet, either because they cover the electricity cost increase with their profit margins or because they are not affected as much since they have better deals with electricity providers.
You touch on a good point: because cloud compute requires pretty knowledgeable users in order not to waste massive amounts of money, it effectively imposes a much higher competency requirement on the users. You can view that in different ways. One is that it's a good thing if everyone learns to use compute better. But another is that you are locking out a whole tier of scientific users from doing computation at all, which is a pretty unfortunate thing - we may miss out on real and important scientific discoveries - even if those users are horrifically bad at doing it efficiently.
This is less than $300/month for 2.5x the compute capacity he's referencing! The author's estimate is $200/month for an on-prem server with just 48 cores. Scaled down to that level, the equivalent in cloud spot pricing would be $120.
That's assuming on-prem is 100% utilised and the cloud compute is not auto-scaled. If those assumptions are lifted, the cloud is much cheaper.
The cloud makes sense in several other ways also:
- Once the data is in cloud storage like S3 or Azure Storage Accounts, sharing it with government departments, universities, or other research institutes is trivial. Just send them a SAS URL and they can probably download it at 1GB/s without killing the Internet link at the source.
- Many of these processes have 10 GB inputs that produce about 1 TB of output due to all the intermediate and temporary files. These are often kept for later analysis, but they're of low value and go cold very quickly. Tiered storage in the cloud is very easy to set up and dirt cheap compared to on-prem network attached storage. These blobs can be moved to "Cold" storage within a few days, and then to "Archive" within a month or two at most (see the lifecycle sketch after this list).
- The algorithms improve over time, at which point it would be oh-so-nice to be able to re-run them over the old multi-petabyte data sets. But on-prem, this is an extravagance, and needs a lot of justification. In the cloud, you can just spin up a large pool of Spot instances with a low price cap, and let it chunk through the old data when it can. Unlike on-prem, this can read the old data in much faster, easily up to 30-100 Gbps in my tests. Good luck building a disk array that can stream 100 Gbps and also have good performance for high-priority workloads!
- The hardware is evolving much more rapidly than typical enterprise purchase cycles. We have a customer that is about to buy one (1) NVIDIA A100 GPU to use for bioinformatics. In a matter of months, it'll be superseded by the NVIDIA "Hopper" H100 series, which is 7x faster for the same genomics codes. In the cloud, both AWS and Azure will soon have instances with four H100 cards in them. That'll be 28 times faster than one A100 card, making the on-prem purchase obsolete years before the warranty runs out. A couple of years later, when the successor to the H100 is available in the cloud, these guys will still be using the A100!
- The cloud provides lots of peripheral services that are a PitA to set up, secure, and manage locally. For example, EKS or AKS are managed Kubernetes clusters that can be used to efficiently bin-pack HPC compute jobs and restart jobs on Spot instances if they're deallocated. Similarly, Azure CycleCloud provides managed Slurm clusters with auto-scale and spot pricing. For Docker workloads there are managed container registries, and both single-instance and scalable "container apps" that work quite well for one-off batch jobs, Jupyter notebooks, and the like.
- In the cloud, it's easy to temporarily spin up a true HPC cluster with 200 Gbps Infiniband and a matching high-performance storage cache. It's like a tiny supercomputer, rented by the hour. On-prem, just buying a single Infiniband switch will set you back more than $30K, and it'll be just the chassis. No cables, SFPs, or host adapters. A full setup is north of $100K. Good luck buying "cheap" storage that can keep up with that network!
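The tiered-storage point in the list above is straightforward to automate. Here is a minimal boto3 sketch under assumed names (the bucket, prefix, and day thresholds are placeholders, not anything from this thread); Azure Storage Accounts have an equivalent lifecycle-management feature.

    # Minimal sketch of the tiering described above: transition objects under a
    # prefix to cheaper storage classes as they go cold. Bucket, prefix, and
    # day counts are placeholders.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-sequencing-results",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "cool-down-intermediate-files",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "pipeline-output/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )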
Author here. It's clear that you genuinely believe my intent here is to be deceptive, so I think your comment deserves a thoughtful response. For context, I work on a computational biology infrastructure company that only uses cloud computing; my incentive is for more scientific computing to be on the cloud.
Responses to your points below. Azure does have some better HPC infrastructure than AWS, so maybe some of my response to your comment will be wrong. I'm happy to talk about this more if you want, you can reach me using the email address in my profile.
Spot/pre-emptible instances vs. the prices in this post: large CPU/RAM instances have availability issues, especially for spot instances. I spent a lot of time trying to exploit spot instance pricing, and for standard (run command-line tool that reads in file and writes out files) bioinformatics programs, spot instances haven't made sense averaging their performance over a long period after factoring in restarts and their corresponding data transfer vs. other options for decreasing instance cost (reserved instances, negotiating, etc).
Sending S3 links: yeah, AWS makes sending data easier! Although if the destination is not within the same cloud provider (or region), you get hit with a surprisingly large charge for sufficiently large files.
Input size vs. output size and storing the results: generally, I agree. Cloud storage costs aren't unreasonable for S3, but I want to note how significantly the pipeline can differ. A new Illumina NovaSeq sequencer (about the size of a copy machine) with dual S4 flow cells produces 6Tb every couple days. Some pipelines are definitely inefficient, but others have more raw data. Storing that data in an infrequent access or archive tier decreases the restore speeds and increases the restore cost. If you have 100 TB of data, that increases the cost of re-running data – especially if it's in cold storage or archived.
Improved algorithms and re-running large sets of data: sure, it's a trade-off between cost and the bandwidth of a queue than you can run. For some use cases, the cloud does make sense.
Hardware improvement cycles in bioinformatics: what software are they using in bioinformatics that uses a GPU, AlphaFold? From what I've seen, most computational genomics still happens on a CPU, although fields like computational chemistry use more GPUs.
Infrastructure components and easily installable components: yeah, this is a definite value-add of cloud services, and the off-AWS/GCP/Azure analogues aren't as good yet.
Cost of networking equipment vs. by-the-hour in a cloud: yeah, if you want results quickly and occasionally, this makes sense.
Overall, this post is about most of scientific computing, not all. For this to work, you need a smoothable queue of jobs. Most computational science (by % of compute) runs in this context, in universities, larger/growing co's, and government research institutions. If you want instant scalability, the math is different.
> Azure does have some better HPC infrastructure than AWS
That's somewhat surprising to hear, I just assumed AWS has equivalent products. My customer is 80% AWS and 20% Azure, so it's a useful data point to know that some HPC workloads are better off in Azure.
> large CPU/RAM instances have availability issues, especially for spot instances.
I've had two different ~128 vCPU instances running for days and days in my lab environment, but that's probably because my region tends not to have a lot of HPC workloads that would compete for spot instances. I've noticed that "popular" sizes in Azure such as D4, D8, E4, and E8 are pre-empted regularly, but the "special" sizes like HPC not so much.
> surprising large charge for sufficiently large files.
Both Azure and AWS use this as a "roach motel" to encourage vendors and partners to co-locate in the same cloud. It's unfortunate that they charge on the order of $100 per terabyte, but it is what it is. However, bioinformatics files compress well, and compared to getting something out of an on-prem traditional network, the egress fees are a bargain.
My customer has a rural site with a "WAN" link. Their non-cloud option is to buy a NAS, replicate it to another NAS in a data center, and then build a permanent "file sharing solution". This is going to cost tens of thousands of dollars. They might share a few terabytes annually, which makes cloud egress fees look practically free in comparison.
> A new Illumina NovaSeq sequencer (about the size of a copy machine) with dual S4 flow cells produces 6Tb every couple days.
The scientists I talked to raised this, and to be honest I'm also concerned, especially as some of the groups I deal with are in rural areas "far from the cloud." (They analyse samples from cattle ranchers to try and prevent foot and mouth disease.)
Let's say the machine generates 6 terabytes in 2 days, so 3 TB daily. Assuming that's the uncompressed data, it'll be about 1 TB after compression. The location I'm thinking of has a 500 Mbps link, but that can transfer that data volume in just 4 hours: https://www.wolframalpha.com/input?i=%28+1+TB+%29+%2F+500+Mb...
One trick I discovered recently is that both the s3cmd and azcopy tools can take pipeline input. Combine that with a parallel compression tool like 'pigz' that can output to the pipeline, and you can have a workflow where the "raw" input files are compressed and streamed to the cloud storage at the same time. That alone can cut hours off the transfer time!
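As a concrete illustration of that trick, here is a minimal sketch. It assumes pigz plus the AWS CLI (whose "aws s3 cp - <dest>" form reads the object body from stdin) rather than s3cmd/azcopy, and the file and bucket names are made up.

    # Sketch of "compress while you upload": pigz writes gzip to a pipe and the
    # AWS CLI reads the object body from stdin, so nothing extra touches disk.
    # Paths and bucket names are placeholders; s3cmd and azcopy have similar
    # pipe-friendly modes, as described above.
    import subprocess

    src = "run_folder/raw_reads.fastq"
    dest = "s3://example-bucket/raw/raw_reads.fastq.gz"

    pigz = subprocess.Popen(["pigz", "-c", src], stdout=subprocess.PIPE)
    subprocess.run(["aws", "s3", "cp", "-", dest], stdin=pigz.stdout, check=True)
    pigz.stdout.close()
    pigz.wait()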
> If you have 100 TB of data, that increases the cost of re-running data – especially if it's in cold storage or archived.
Not necessarily. Azure Cold storage access has no special charges associated with it. It's a bit slower, but streaming reads were quite fast in my experience. Archive tier of course has some additional costs, but it's not a drama in most cases. For example, "high priority" retrieval costs extra, but normal priority appears to be free.
> what software are they using in bioinformatics that uses a GPU
"An example is the Smith-Waterman algorithm for genomics processing".
Admittedly, GPU usage for genomics is still rare, but it is becoming more common.
> If you want instant scalability, the math is different.
In my example, think of 10-20 scientists doing semi-regular gene sequencing workloads and running related analyses. Sometimes needing a single machine with 2TB of memory, other times running 10,000 trivial jobs.
In principle, the flexibility of the cloud is nearly optimal for a scenario like this.
Data Transfer OUT From Amazon EC2 To Internet
First 10 TB / Month $0.09 per GB
Next 40 TB / Month $0.085 per GB
Next 100 TB / Month $0.07 per GB
Greater than 150 TB / Month $0.05 per GB
Which means if you transfer out 90 TB in one month, it's $0.09 * 10000 + $0.085 * 40000 + $0.07 * 40000 = $7100.
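For other volumes, the same tiered arithmetic can be wrapped in a small helper; the price breaks are the ones quoted above, and any free-tier allowance is ignored, so treat the output as illustrative.

    # Tiered egress cost for the per-GB price breaks quoted above.
    TIERS = [                  # (tier size in GB, price per GB)
        (10_000, 0.09),        # first 10 TB
        (40_000, 0.085),       # next 40 TB
        (100_000, 0.07),       # next 100 TB
        (float("inf"), 0.05),  # everything above 150 TB
    ]

    def egress_cost(gb_out: float) -> float:
        cost, remaining = 0.0, gb_out
        for size, price in TIERS:
            chunk = min(remaining, size)
            cost += chunk * price
            remaining -= chunk
            if remaining <= 0:
                break
        return cost

    print(egress_cost(90_000))  # 90 TB -> 7100.0, matching the figure above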
It always was that way for loads that don't allow autoscaling to save you money; the savings were always from the convenience of not having to do ops and pay for ops.
Then again, part of the ops cost you save is paid again in the salaries of devs who have to deal with AWS stuff instead of just throwing a blob of binaries at ops and letting them worry about the rest.