Having had the responsibility of providing HPC for literal buildings full of scientists, I can say that it may be true that you can get computation cheaper with owned hardware than in a cloud. Certainly pay-as-you-go, one-project-at-a-time processing will look that way to the scientist. But I can also say with confidence that the contest is far closer than they think. Scientists who make this argument almost invariably leave major costs out of their calculation - assuming they can put their servers in a closet, maintain them themselves, do all the security infrastructure, provide redundancy, and still get to shared compute when they have an overflow need. When the closet starts to smoke because they stuffed it with too many cheaply sourced, hot-running cores and GPUs, or gets hacked by one of their postdocs resulting in an institutional HIPAA violation, well, that's not their fault.
Put like for like, a well-managed data center against negotiated and planned cloud services, and the former may still win, but it won't be dramatically cheaper; figured over the depreciable lifetime and including opportunity cost, it may cost more. It takes work to figure out which is true.
Let me echo this as someone who was once responsible for HPC at a research-intensive public university. Most career academics have NO IDEA how much enterprise computing infrastructure costs. If a 1-terabyte USB hard drive is $40 at Costco, then surely we (university IT) must be getting a much better deal than that. Take this argument and apply it to any aspect of HPC and that's what you're fighting against. The closet with racks of gear and no cooling is another fond memory. Don't forget the AC terminal strips that power the whole thing, sourced from the local dollar store.
I remember the first time a server caught fire in the closet we kept the rack in. Backups were kept on a server right below the one on fire. But, y'know, we saved money.
I never saw an actual fire, but we did see smoke, in a closet, on a floor that was 90% patient care but happened to have a research area as well, because the research needed access to expensive radiography equipment. The closet was literally stuffed with what amounted to gaming machines, purchased with grant money through an importer, directly from China. The guy who set them up was smart enough to put them all behind a little firewall, so the enterprise network couldn't see them. It was a (literally) hot mess, both infrastructure- and security-wise.
Don’t worry, we do incremental backups during weekdays and a full backup on Sunday. We use 2 tapes only, so one is always outside of the building. But you know, we saved money.
We had a million dollars worth of hardware installed in a closet. It had a portable AC hooked up that needed its water bin changed every so often.
Well, I was in the middle of that when the Director decided to show off the new security doors. So he closed the room up, then found out that the new security doors didn't work. I found out as I was coming back to turn the AC back on.
The room gets hot really fast.
We get office security to unlock the door. He says he doesn't have the authority; his supervisor will be by later in the day.
Completely deadpan, and in front of several VPs of a Fortune 50.
I turn to the guy to my right, who lived nearby: "Go home and get your chainsaw."
We were quickly let in. Also got fast approval to install proper cooling.
The homelab-priced sensor is the temp sensor already in your server: it's free! Actual servers have a bunch, usually including one at the intake; "random old PC" servers can use the motherboard temp as a rough proxy for ambient temperature.
Hell, even in a DC you can look at the temperatures and tell which server the technician was standing in front of, just from those sensors.
The second cheapest would be a USB-to-1-Wire module plus some DS18B20 1-Wire sensors. Easy hobby job to make. They also come with a unique ID, which means that if you record them in your TSDB by that ID, it doesn't matter where you plug the sensors in.
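For reference, a minimal sketch of what polling those sensors can look like on Linux, assuming the kernel's w1 subsystem exposes them under /sys/bus/w1/devices (as it typically does with the common USB 1-Wire bridges); where the readings go afterwards (TSDB write, etc.) is left out:

```python
import glob
import time

# DS18B20 devices appear on the w1 bus with a "28-" family prefix;
# the directory name is the sensor's unique 64-bit ID.
W1_GLOB = "/sys/bus/w1/devices/28-*/w1_slave"

def read_all():
    readings = {}
    for path in glob.glob(W1_GLOB):
        sensor_id = path.split("/")[-2]          # e.g. "28-<unique id>"
        with open(path) as f:
            lines = f.read().splitlines()
        # First line ends in "YES" when the CRC check passed;
        # second line ends in "t=23125" (milli-degrees Celsius).
        if lines and lines[0].endswith("YES"):
            millic = int(lines[1].rsplit("t=", 1)[1])
            readings[sensor_id] = millic / 1000.0
    return readings

if __name__ == "__main__":
    while True:
        for sensor_id, temp_c in read_all().items():
            # Tag the measurement with the sensor ID, not its location,
            # so re-plugging a sensor doesn't break the time series.
            print(f"temp,sensor={sensor_id} value={temp_c:.3f}")
        time.sleep(30)
```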
No more or less likely to catch fire than anything else connected to power. Just make sure the RCD works.
Hell, some weeks ago my oven decided the lower heater's line was now connected to ground and started blowing fuses...
Our 8 racks in the DC only had a single event of something blowing (a power supply), and aside from the smell and a blown fuse nothing really happened.
Servers are essentially metal boxes with a bit of glass-reinforced epoxy and some plastic inside, so there is a limited amount of stuff that can burn. The UPS is probably a bigger problem.
For example, the OVH datacenter fire was more a case of "the stuff around the servers was flammable" (they had wooden ceilings for some reason...) rather than "just" the servers.
"Residual current device". It detects current leaking to ground; essentially it prevents you from killing someone by throwing a toaster into the bathtub.
I am dealing with the exact opposite problem: "Oh you mean, we should leave the EC2 instance running 24/7??? No way, that would be too expensive"... to which I need to respond "No, it would be like $15/month. Trivial, stop worrying about costs in EC2 and S3, we're like 7 people here with 3 GB of data."
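The arithmetic really is that small. A rough sketch, with illustrative on-demand prices (the instance type and rates are assumptions, not the actual setup being described):

```python
# Back-of-envelope: a small always-on EC2 instance plus a few GB in S3.
hourly_rate = 0.0208          # roughly a t3.small in us-east-1, on-demand
hours_per_month = 730
s3_gb = 3
s3_rate_per_gb = 0.023        # S3 Standard, first-tier pricing

ec2_monthly = hourly_rate * hours_per_month
s3_monthly = s3_gb * s3_rate_per_gb
print(f"EC2: ${ec2_monthly:.2f}/mo, S3: ${s3_monthly:.2f}/mo")
# -> EC2 around $15/mo, S3 a few cents/mo
```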
I deal with scientists who think AWS is some sort of massively expensive enterprise thing. It can be, but not for the use case they're going to be embarking on. Our budget is $7M spanning 4 years.
Hahaha, maybe I need to just go into the AWS ether and start yakking big words like "Elastic Kubernetes Service" to confuse the scientists and get my AWS fix. These people are too stingy. I want some shit running in AWS; what good is this admin IAM role.
Your comment about cooling reminded me of a fun anecdote from my time in academia.
My PhD involved quite a lot of high performance computing simulations. We kept running into problems where on warm days, all our jobs would get killed pretty consistently. Our IT guy noticed a pattern where the temps would be perfectly normal, and then suddenly out of nowhere go through the roof, triggering a thermal shutdown of the racks.
In the end our IT guy camped out in the server room on a hot day to watch what was happening. The astrophysics group's cabinet was directly in front of ours, and they had jerry-rigged their cabinet door so that when it got too hot, it would swing open and their hot air would be blown all over the neighbouring cabinets...
It's kind of funny around this time of year when some researchers have $10,000 in their budget they need to spend, and they want to 'gift' us with some GPUs.
That was definitely one of the weirdest things of working in academia IT: “hey. Can you buy me a workstation that’s as close to $6,328.45 as it is possible to get, and can you do it by 4pm?”
Same thing happens in the government sector here (US). If you don't spend all of the budget you requested last year, you might not get it next year. There is an entire ecosystem of bottom-feeder GSA companies that apparently exist to spend year-end money that would otherwise go to 'waste'.
> Don't forget the AC terminal strips that power the whole thing, sourced from the local dollar store.
Love how you’ve fuzzed the root cause to make it seem like the “dollar store strip” is the problem and not that it was plugged into an overloaded outlet or run inside a closet at significantly elevated temperatures, leading to plastic melting and wires shorting.
Always helps to keep the “magic” a secret so the rubes have to keep us wizards employed, right?
I think it really depends on the task. Where a HIPAA violation is a real threat, the equation changes, and just for CYA purposes those projects can get pushed to a cloud. That does not necessarily involve any attempt to make them more secure, but this is a different topic.
That said, many scientists are operating on-premise hardware like this: some servers in a shared rack and an el-cheapo storage solution, with ssh access for the people working in the lab. And it works just fine for them.
Cloud services focus on running business computing in the cloud, emphasizing recurring revenue. Most research labs are much more comfortable spending the hardware portion of a grant upfront and not worrying about some student who, instead of working on some fluid dynamics problem, found a script to re-train Stable Diffusion and left it running over winter break. My 2c.
Until it doesn't because there's a fire or huge power surge or whatever.
That's the point -- there's a lot of risk they're not taking into account, and by focusing on the "it works just fine for them", you're cherry picking the ones that didn't suffer disaster.
I'd counter by saying I think you're over-estimating how valuable mitigating that risk is to this crowd.
I'd further say that you're probably over-estimating how valuable mitigating that risk is to anyone, although there is a limited set of customers that genuinely do care.
There are few places I can think of that would benefit more from avoiding cloud costs than scientific computing...
They often have limited budgets that are driven by grants, not derived from providing online services (a computer going down does not impact the bottom line).
They have real computation needs that mean hardware is unlikely to sit idle.
There is no compelling reason to "scale" in the way that a company might need to in order to handle additional unexpected load from customers or hit marketing campaigns.
Basically... the only meaningful offering from the cloud is likely preventing data loss, and this can be done fairly well with a simple backup strategy.
Again - they aren't a business where losing a few hours/days of customer data is potentially business ending.
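To be concrete about the "simple backup strategy" above, something like this nightly script covers a lot of ground for this crowd (the paths and retention window are hypothetical; in practice the destination would be a second box or NAS):

```python
import datetime
import pathlib
import tarfile

# Hypothetical layout: results live on the group server, backups land on a
# separately housed mount. Run nightly from cron.
DATA_DIR = pathlib.Path("/data/lab-results")
BACKUP_ROOT = pathlib.Path("/mnt/nas/backups")
KEEP = 14  # keep two weeks of nightly archives

def backup():
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    archive = BACKUP_ROOT / f"lab-results-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(DATA_DIR, arcname=DATA_DIR.name)
    # Rotate: drop archives older than the retention window.
    archives = sorted(BACKUP_ROOT.glob("lab-results-*.tar.gz"))
    for old in archives[:-KEEP]:
        old.unlink()

if __name__ == "__main__":
    backup()
```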
---
And to be blunt - I can make the same risk avoidance claims about a lot of things that would simply get me laughed out of the room.
"The lead researcher shouldn't be allowed in a car because it might crash!"
"The lab work must be done in a bomb shelter in case of war or tornados!"
"No one on the team can eat red meat because it increases the risk of heart attack!"
and on and on and on... Simply saying "There's risk" is not sufficient - you must still make a compelling argument that the cost of avoiding that risk is justified, and you're not doing that.
Ummm. I’ve def been unable to do anything for entire days because our AWS region went down and we had to rebuild the database from scratch. AWS goes down, you twiddle your thumbs and the people you report to are going to be asking why, for how long, etc. and you can’t give them an answer until AWS comes back to see how fubar things are.
When your own hardware rack goes down, you know the problem, how much it costs to fix, and when it will come back up, usually within a few hours (or minutes) of it going down.
Do things catch fire? Yes. But I think you're over-estimating how often. In my entire life, I've had a single SATA connector catch fire, and it just melted some plastic before going out.
I'm not talking about temporary outages, I'm talking about data loss.
With AWS it's extremely easy to keep an up-to-date database backup in a different region.
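For concreteness, a minimal sketch of the kind of thing meant here, assuming a Postgres database and a bucket created in a second region (the database name, bucket, and region are made up; a managed snapshot copy or S3 replication rule would be even less code):

```python
import datetime
import subprocess

import boto3

DB_NAME = "labdb"                   # hypothetical database
BUCKET = "my-lab-backups-usw2"      # hypothetical bucket, created in us-west-2
REGION = "us-west-2"                # deliberately not the primary region

def backup_to_other_region():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/{DB_NAME}-{stamp}.dump"
    # Custom-format dump; assumes pg_dump can reach the DB via env/.pgpass.
    subprocess.run(["pg_dump", "-Fc", "-d", DB_NAME, "-f", dump_path], check=True)
    # Ship it to the out-of-region bucket.
    s3 = boto3.client("s3", region_name=REGION)
    s3.upload_file(dump_path, BUCKET, f"{DB_NAME}/{stamp}.dump")

if __name__ == "__main__":
    backup_to_other_region()
```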
And it's great that you haven't personally encountered disaster, but of course once again that's cherry-picking. And it's not just a component overheating, it's the whole closet on fire, it's a broken ceiling sprinkler system going off, it's a hurricane, it's whatever.
So was I. Not everything can be replicated, but backups can be, and were, made.
For the rest, there’s insurance. Most calculations done in a research setting are dependent upon that research surviving. If there’s a fire and the whole building goes down, those calculations are probably worthless now too.
Hell, most companies probably can’t survive their own building/factory burning down.
I would say it's even easier on-prem, as you don't need to wade 15 layers deep to do anything. Since I moved to hosting my own stuff at my house, I have learned that connecting a monitor and keyboard to a 'server' is awesome for productivity. I know where everything is, it's fast as hell, and everything is locked down. Monitoring temps and adjusting and configuring hardware is just better in every imaginable way. Need more RAM, storage, compute? Slap those puppies in there and send it.
For home gamers like myself, it has become a no-brainer with advances in tunneling, Docker, and cheap prices on eBay.
Maybe consider that your use case and the average scientist's use case aren't the same? What works for you won't work for them, and vice versa. What you consider a risk, I wouldn't.
Consider the following: I have never considered applying Meltdown or Spectre mitigations if they make my code run slower, because I plain don't care. Assuming anyone even peeks at what my simulation is doing, whoopdeedo, I don't care. I wouldn't skip them on the laptop I use to buy shit off Amazon, but on the workstation I control? I don't care. I DO care if my simulation will take 10 days instead of a week.
My use case isn't yours because my needs aren't yours. Not everything maps across domains.
The point is, there's no need for everything to be 100% reliable in this context. If a fire destroys everything and their computational resources are unavailable for a few days, that's somewhat okay. Not ideal, but not a catastrophic loss either. Even data loss is not catastrophic - at worst it means redoing one or two weeks' worth of computations.
Some sort of 80/20 principle is at work here. Most of the cost in professional cloud solutions comes from making the infrastructure 99.99% reliable instead of 99% reliable. That is totally worth it if you have millions of customers that expect a certain level of reliability, but complete overkill if the worst-case scenario from a system failure is some graduate student having to redo a few days' worth of computations (which probably had to be redone several times anyway because of some bug in the code or something).
> there's a lot of risk they're not taking into account
I see it the other way: experimental scientists operate with unreliable systems all the time: fickle instruments, soldered one-time setups, shared lab space, etc. Computing is just one more thing that is not 100% reliable (but way more reliable than some other equipment), and USB sticks serve as a good-enough data backup.
The counterpoint to that point is that a significant percentage of scientific computing doesn't care about any of that. They are unlikely to have enough hardware to cause a fire, and in many cases they don't care about outages or even data loss. As others have said, it depends on the specifics of the research. In the cases where that stuff matters, the cloud would be a better option.
Even that depends on what you're doing. Most scientists aren't running apps that require several 9's of availability, connect to an irreplaceable customer database, etc.
An outage, or even permanent loss of hardware, might not be a big problem if you're running easily repeatable computations on data of which you have multiple copies. At worst, you might have to copy some data from an external hard drive and redo a few weeks' worth of computations.
Thankfully, only a small part of the academic research enterprise involves human subjects, HIPAA, and all that. Neither fruit flies nor quarks have privacy rights.
Research involving human subjects (psychology, cognitive neuroscience, behavioral economics, etc.) requires institutional review board approval and informed consent, etc. but mostly doesn't involve HIPAA either.
And many, many institutions are overcautious. My own university, for example, has no data classification between "it would be totally okay if anyone in the university had access" and "regulated data", so "I mean, it's health information, and it's governed by our data use agreement with the provider..." gets kicked to the same level as full-fat HIPAA data.
This is true, but to say it is "not a law", as you did, completely unqualified, is incorrect. If the research project is connected with a government grant (and many are) you need to pay attention to those laws. Many universities also have their own policies you need to follow, regardless. (Requiring informed consent and protecting people's privacy seems like a good thing.)
Let me repeat it another way. The law only restricts the actions of the government. Members of a university are not the government. Even if they took government money they could legally ignore all of that stuff. Worst case you will not get more funding from them in the future.
I believe you are technically correct, but that does not change the fact that universities have IRBs and will require reviews/approval if you are connected to that institution. You really think they're going to put their funding at risk? This seems very unlikely.
I've been running a group server (basically a shared workstation) for 5 years and it's been great. Way cheaper than cloud, no worrying about national rules on where data can be stored, no waiting in a SLURM batch queue, Jupyter notebooks on tap for everyone. A single ~$6k outlay (we don't need GPUs, which helps).
Classic big workstations are way more capable than people think - but at the same time it's hard to justify buying one machine per user unless your department is swimming in money. Also, academic budgets tend to come in fixed chunks, and university IT departments may not have your particular group as a priority - so often it's just better to invest once in a standalone server tower that you can set up to do exactly what you need it to than try to get IT to support your needs or the accounting department to pay recurring AWS bills.
Well, the title is scientific computing, which includes HPC but not only HPC. Anyway, the fact is that a lot of "HPC" on university clusters consists of smaller jobs that are too much for an average PC to handle but still fit on a single typical HPC node. These are usually the jobs that people think to farm out to AWS, but that you will generally find are cheaper, faster, and more reliable if you just run them on your own hardware.
Perspective from a computational biologist: Campus hosted HPC means the direct cost pressure is seen as IT staff and hardware related costs. Researchers are encouraged to use the available capacity. This is good.
Externally-hosted HPC means every single compute job is seen as something that directly costs money. This negatively affects the quality of scientific output (research playfulness / creativity / focus on the research / etc).
Yes. The costing models that are used (and often required by granting agencies) make apples to apples cost comparisons almost impossible, and impose undoubted and significant false costs on research budgets. No question this is true.
They leave out major costs because they don't pay those costs. Power, cooling, and real estate are all significant drivers of AWS costs. Researchers don't pay those costs directly. The university does, sure, but to the researcher, that means those costs are pre-paid. Going to AWS means you're essentially paying for them twice, plus all the profit margin and availability that AWS provides that you also don't need.
Definitely some truth in this, and the way research funding is managed, especially with Federal grants, makes it incredibly difficult to sort out and identify the right solution.
Running a modern AMD-based server that has 48 cores, at least 192 GB of RAM, and no included disk space costs:
~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
~$1014.7/mo for a c5a.24xlarge AWS reserved instance on a three-year term, paid upfront
~$558.65/mo on OVH Cloud[1]
~$512.92/mo on Hetzner[2]
~$200/mo on your own infrastructure as a large institution[3]
Footnote [3] explains this cost estimate as:
"Assumes an AMD EPYC 7552 run at 100% load in Boston with high electricity prices of $0.23/kWh, for $33.24/mo in raw power. Hardware is amortized over five years, for an average monthly price of $67.08/mo. We assume that your large institution already has 24/7 security and public internet bandwidth, but multiply base hardware and power costs by 2x to account for other hardware, cooling, physical space, and a half-a-$120k-sysadmin amortized across 100 servers."
When I worked as a network engineer I spent months working with some great scientists / their team who built a crazy microscope (I assumed it was looking at atoms or something...) the size of a small building.
Their budget for the network gear was a couple hundred bucks and some old garbage consumer-grade network gear. This was for something that spit out tens of GB a second (at least) across a ton of network connections (they didn't seem to know what would even happen when they ran it), and was so bursty that only the highest end of gear could handle it.
Can confirm that sometimes scientists aren't really up on the overall costs. Then they dump the "this isn't working" problem on their university IT team to absorb the costs / manpower.
Power (especially if there is some kind of significant scientific facility on premise), space (especially in reused buildings), manpower (undergrads, grad students, post docs, professional post graduates), running old/reused hardware, etc...
You can get away with those at large research universities. Some of that you can get away with at national lab sorts of places (not going to find as much free/cheap labor, surplus hardware). If you start going down in scale/prestige, etc... none of that holds true.
Running a bunch of hardware from the surplus store in a closet somewhere with Lasko fans taped to the door is cheap. To some extent, the university system encourages such subsidies.
In any case, once you get to actually building a datacenter, if you have to factor in power, a 4-year hardware refresh cycle, professional staffing, etc. - unless you are in one of those low-CoL college towns - cloud is probably no more than 1.5 to 3x more expensive for compute (spot, etc...). Storage on-prem is much cheaper - erasure-coded storage systems are cheap to buy and run, and everybody wants their own high-performance file system.
One continuing cloud obstacle, though: researchers don't want to spend their time figuring out how to make their code friendly to preemptible VMs, which is the cost-effective way to run on the cloud.
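Making code "preemptible-friendly" mostly comes down to checkpoint-and-resume. A generic sketch (the checkpoint path, step granularity, and placeholder workload are arbitrary, and the exact preemption signal varies by provider):

```python
import os
import pickle
import signal

CHECKPOINT = "state.pkl"      # on local or attached storage
stop_requested = False

def handle_term(signum, frame):
    # Spot/preemptible VMs typically get a termination signal or notice
    # shortly before shutdown: finish the current step, then save and exit.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_term)

# Resume from the last checkpoint if a previous run was preempted.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0, "result": 0.0}

TOTAL_STEPS = 1_000_000
while state["step"] < TOTAL_STEPS and not stop_requested:
    state["result"] += state["step"] * 1e-9   # stand-in for real work
    state["step"] += 1
    if state["step"] % 10_000 == 0:           # checkpoint periodically
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)

with open(CHECKPOINT, "wb") as f:
    pickle.dump(state, f)
```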
Another real issue with sticking to on-prem HPC is talent acquisition and staff development. When you don't care about those things so much, it's easy to say it's cheap to run on-prem, but often the pay is crap for the required expertise, and ignoring cloud doesn't help your staff either.
Which I thought was the best point of the article, that a lot of IT best practice comes from the web app world.
Web apps quickly become finely tuned factory machines, executing a million times a day and being duplicated thousands of times.
Scientific computing projects are often more like workshops. Getting charged by the second while you're sitting at a console trying to figure out what the giant blob you were sent even is can be unpleasant. The solution you create is most likely to be run exactly once. If it is a big hit, it may be run a dozen times.
Trying to run scientific workloads on the cloud is like trying to put a human shoe on a horse. It might be possible but it's clearly not designed for that purpose.
Plus the supposed savings of in-house hardware only materialize if you have sufficiently managed and queued load to keep your servers running at 100% 24/7. The advantage of AWS/other is to be able to acquire the necessary amount of compute power for the duration that you need it.
For a large university it probably makes sense to have and manage their own compute infrastructure (cheap post-doc labor, ftw!) but for smaller outfits, AWS can make a lot of sense for scientific computing (said as someone who uses AWS for scientific computing), especially if you have fluctuating loads.
What works best IMO (and what we do) is have a minimum-to-moderate amount of compute resources in house that can satisfy the processing jobs most commonly run (and where you haven't had to overinvest in hardware), and then switch to AWS/other for heavier loads that run for a finite period.
Another problem with in-house hardware is that you spent all that money on Nvidia V100's a few years ago and now there's the A100 that blows it away, but you can't just switch and take advantage of it without another huge capital investment.
You are paying 10x more because no one gets fired for using IBM. AWS has many benefits, most of which you don't need. Pair up with another school in a different region and back up your data there. Computers are not scary; they rarely catch fire.
Sounds like you've got the kind of outstanding IT guy who was motivated above all to make the electronics run the way the scientists wanted.
At the other end of the spectrum there are labs [0] where the scientists need to carefully study and develop ever-increasing skill at operating the electronics the way IT wants it done. It's an even worse distraction when that's a moving target: keeping up with an IT approach that changes faster than the progress most labs make in their own scientific field.
What labs need more of is your kind of IT operator who can bring that option (your extreme end of the spectrum in favor of lab workers) within reach for when it is the most appropriate choice.
When labs fail to retain such adequate talent, they can rule out that option going forward, and that's one less tool in the toolbox.
[0] Including many which have good records of breakthrough progress before becoming computerized to begin with.
Is a postdoc hacking a cluster something you have seen before? I am genuinely curious, because I worked on a cluster owned by my university as an undergrad and everyone was kind of assumed to be trusted. If you had shell access on the main node you could run any job you wanted on the cluster. You could enhance security; I just wonder about this threat model. It's an interesting one, and I'm sure it happens, to be clear.
Yes. Probably no surprise that the postdoc was a PRC national. Very competent in their field of study, but also in this country with instructions from an APT group.
Sorry you were on the receiving end of that and had to learn the hard way. The dean of the college at the large public research university where I was working at the time received a gift of a bunch of new MBPs as a token of goodwill from a foreign country, and we heard in the very next sentence that they would be going straight into a shredder. At the time I thought these brand-new laptops could easily be wiped and repurposed, but now I realize each one was a potential attack surface that could easily leak info to APT/espionage groups the second it connected to a network.
Less than a year later, a story made national headlines that a professor with direct access to classified material had mysteriously disappeared and had never disclosed his close ties to his home country. So not only should you worry about back doors and side doors; watch what's coming through your front door as well!
I had a feeling you would say that. I don't think this was part of our threat model until pretty recently. That was why I asked, because these stories aren't really internalized collectively yet I think and it's valuable to reconsider who we can trust and what threat actors might value.
I also managed HPC data centers and I agree with you. I feel like the term data center is a key word there. There is a point of scaling where it's just cheaper to manage a data center with a dedicated in house team. I think that holds true in other industries as well.
As far as HPC goes specifically, we could get some of the financial numbers to make sense in the cloud (CPU-intensive jobs), but couldn't make it work for others (data-intensive jobs shipping PBs around on the reg).
That, and HPC has a lot of grant funding. It can be quite advantageous for an org to use an on-prem data center almost like a slush fund. It can keep key projects running that would otherwise die when they're having a rough funding year.
I'd much rather store HIPAA data on a server in my office or closet than worry I got all the IAM settings right. And if I fire someone, security makes sure they can't get in the building. You cannot say the same about the cloud. Yes, I know you can do cloud security right, but on prem security is just harder to mess up.
More than half of the security penetrations in our institution (a medical center - research - med school complex) over the 8 years I worked there, ending in 2020, came through the research arm, even though Research accounted for no more than 10% of the enabled servers in the infrastructure. And we're talking APT penetrations. They weren't looking for HIPAA data (although I used that example in my original post); they were looking for a path to permanent presence in our network in order to mine research. So, why did they come through Research? Because a PI buys some equipment - servers or other network-enabled stuff - puts a grad student or postdoc in charge of it, and enjoys his or her cheap compute. But that student or postdoc is not an infrastructure expert, and most definitely doesn't understand enterprise security. Next thing you know, we've got an APT-owned server on the inside of the network. (And none of that is counting the cases where the postdoc is a foreign national who actually intends to use their position to compromise their employer. Had that happen too.) There are some computational scientists who actually do understand this stuff, but they're rare. Being on the cloud does not inherently fix this problem, but to fix it you have to be on institutionally, professionally managed infrastructure, and once you are, the cost differential between owned infrastructure and well-negotiated, managed cloud infrastructure becomes much more nuanced.
I appreciate your perspective, but it seems the security team should be watching for reverse proxies, tunnels, and other firewall anomalies for on-prem hardware just as a normal course of biz. And if a PI installs a self-managed server, that really should not gum up the works.
All that being said, I have never worked at a place (or in a dept) whose threat profile made APT a real thing.
Obviously we do all those things. But you NEVER want an outside machine, particularly an APT-owned one, inside your network. They can be well hidden and still do very real damage.
You're fortunate if APTs don't consider you a worthy target. They are no joke, and in most cases are playing a long game, more interested in penetration, persistent presence, and quiet theft of information, than in doing anything you'd notice - until they aren't. We had one who burned an asset that had been cultivated on our network for a couple of years to make a hard press play for information about a very particular high profile patient immediately after that patient had been seen (the fact that the patient was seen was public information). But with over 20,000 servers and 300,000 total nodes on the network - some of which cannot be fully patched because they run software someone has to have access to but which won't run on the newest versions of OSs or databases - you still don't know what else they've burrowed into.
A big tech company supplier to our organization had a very telling incident that ended up being detected on our side of the connection, where a brief mistake by network admins opened a channel through their layers of protection for their build pipeline. In the minutes it was open, an APT detected the access (likely because they already owned something internal) and inserted code into their certified OS build pipeline, which we ended up with in our institution.
Probably not hugely more significant, but there are differences. A workstation should be segmented and limited in the range of nodes it can communicate with, if you're running your network properly. A server will likely be in a segment that has much broader access to it, unless you're doing micro-segmentation, and doing it well. By construction, a server set up by a workgroup team outside your core IT server administration staff is unlikely to be properly segmented. And if you're doing traffic analysis to look for rogue behavior, it's harder to spot from something that profiles as a server, because, again, you expect a server to have lots of contacts within the network. Counterbalancing that, if it profiles as a server, you should be more suspicious of any outbound activity.
Also, it assumes full utilization of hardware. If you have variable load (such as only needing to run compute after an experiment), the overhead cost of maintaining a cluster you don't need all the time probably exceeds the cost of resources you can schedule on demand.
This nails so much of the discussion that should be had. When using any cloud service provider, you aren't just paying for the machines/hardware you use - you are paying for people to take care of a bunch of headaches of having to maintain this hardware. It's incredibly easy to overlook this aspect of costs and really easy to oversimplify what's involved if you don't know how these things actually work.
Even as a big cloud detractor, I have to disagree with this.
A lot of scientific computing doesn't need a persistent data center, since you are running a ton of simulations that only take a week or so, and scientific computing centers at big universities are a big expense that isn't always well-utilized. Also, when they are full, jobs can wait weeks to run.
These computing centers have fairly high overhead, too, although some of that is absorbed by the university/nonprofit who runs them. It is entirely possible that this dynamic, where universities pay some of the cost out of your grant overhead, makes these computing centers synthetically cheaper for researchers when they are actually more expensive.
One other issue here is that scientific computing really benefits from ultra-low-latency infiniband networks, and the cloud providers offer something more similar to a virtualized RoCE system, which is a lot slower. That means accounting for cloud servers potentially being slower core-for-core.
Author here. I agree with your points! I use AWS for a computational biology company I'm working on. A lot of scientific computing can spin up and down within a couple hours on AWS and benefits from fast turnaround. Most academic HPCs (by # of clusters) are slower than a mega-cluster on AWS, not well utilized, and have a lot of bureaucratic process.
That said, most of scientific computing (by % of total compute) happens in a different context. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc), and a well-maintained HPC cluster that analyzes that data. The researchers have already waited months for their data, so another couple weeks in a queue doesn't impact their cycle.
For that context, AWS doesn't really make sense. I do think there's room for a cloud provider that's geared towards an HPC use case and doesn't have the app-inspired limits (e.g., data transfer) of AWS, GCP, and Azure.
This is tangential to your point, but I'll just mention that Azure has some properly specced-out HPC gear: IB, FPGAs, the works. You used to be able to get time on a Cray XC with an Aries interconnect, but I never have occasion to use it, so I don't know if you still can. They've been aggressively hiring top-notch HPC people for a while.
That's the Sentinel system. I worked on it when I was at Cray, and we did some COVID stuff[1][2] with a researcher at UAH. We accelerated a docking code using some cool tech I created (in Perl, so there!) and some mods my teammates made to the queuing system.
The work won some award at SC20[3] (fka the Supercomputing conference). I had considered submitting for the Gordon Bell prize, which had been specifically requesting COVID work, though I thought the stuff we had done wasn't terribly sexy. We were getting ~250-500x better performance than single-CPU runs.
Looking back over these, I gotta chuckle, as this (press releases) is pretty much the only time I'm called "Dr.". :D
Back to the OP's points: they are right. In most cases, cloud doesn't make sense for traditional HPC workloads. There are some special cases where it does; those tend to be large ephemeral analysis pipelines, as in bioinformatics and related fields. But for hardcore distributed (mostly MPI) code, running for a long time on a set of nodes interconnected with low-latency networks, dedicated local nodes are the better economic deal.
During my stint at Cray, I was trying (quite hard) to get supercomputers, real classical ones, into cloud providers, or to become a supercomputing cloud provider ourselves. The Met Office system in Azure is a Cray Shasta, but that was more of a special case. I couldn't get enough support for this.
Such is life. I've moved on. Still doing HPC, but more throughput maximized.
The Azure Met Office win left me very conflicted. As someone who is relatively positive about cloud adoption for science it was good to see some forward thinking. On the other hand, what I've heard about how the procurement was run plus my taxpayer-based views on where critical national infrastructure should be housed makes me rather less happy about the outcome.
I don't speak for them (never have), but I believe it to be possible. MSFT do a number of things right (and a few really badly wrong), but you can generally spin up a decent bare metal system there. IO is going to be an issue with any cloud, it will cost for real performance. Between that and networking, clouds could potentially throw in the compute for free ...
Reminds me of a quip I made back in my SGI-Cray (1st time) days: a Cray supercomputer (back then) was a bunch of static RAM that was sold along with a free computer... Not really true, but it gave a sense of the costs involved.
This said, Azure had (last I checked) real Mellanox networking kit for RDMA access. At Cray we placed a cluster in Azure for an end user (who shall remain nameless), and used several of Mellanox's largest switch frames for 100G InfiniBand across >1k nodes, each with many V100 GPUs. The unit would have been in the mid single digits on the Top500 list that year.
AWS is doing their own thing network-wise. Not nearly as good performance-wise (latency or bandwidth) as the Mellanox kit. I don't know if Google Cloud is doing anything beyond TCP.
You can do bare metal at most/all of these. You can do some version of NVMe/local disk at all of them. Some/most let you spin up a parallel file system (network charges, so beware), either their own Lustre flavor, or one of BeeGFS, Weka, etc.
The Azure FPGAs are a bit tangential from a customer perspective; they are just the equivalent of the AWS Nitro smart-NIC. Azure IB is interesting in that I originally expected it to be a killer feature, but for customers I work with it just isn't enough to overcome the multitude of downsides of having to use Azure for everything else. In the end, hardly any commercially relevant codes absolutely need IB, and work well enough with the low-latency ethernet both AWS and GCP offer.
Neither grant agencies nor universities are ready to pay for commercial compute out of grant money. They'd rather have you run analysis on your work laptop/desktop they already provided. Even some of the folks who manage the HPC are unwilling to help researchers (many of whom are not programmers) to use the HPC, lest they mess up and damage the hardware. Source: I work at a tri-institutional collaboration research center in Georgia.
The author is making a brilliant argument for getting a secondhand workstation and shoving it under their desk.
If you are doing multi-machine batch-style processing, then you won't be using on-demand; you'd use spot pricing. The missing argument in that part is storage costs. Managing a high-speed, highly available synchronous file system that can do a sustained 50 GB/sec is hard bloody work (no, S3 isn't a good fit; too much management overhead).
Don't get me wrong AWS _is_ expensive if you are using a machine for more than a month or two.
However, if you are doing highly parallel stuff, Batch and Lustre on demand are pretty ace.
If you are doing a multi-year project, then real steel is where it's at. Assuming you have factored in hosting, storage, and admin costs.
Even for multi-year, if you factor in everything, does it still come out cheaper than AWS? Would you be running everything 24x7 on an HPC cluster? I don't think so. You need scale at some points, and there are probably times where research is done on your desktop.
You could invest in an HPC - but I think the human cost of maintaining one especially if you’re in a high cost of living area (e.g. Bay Area, NYC, etc.) is going to be pretty high. Admin cost, UPS, cable wiring, heat/cooling etc. can all be pretty expensive. Maintenance of these can be pretty pricey too.
Are there any companies that remotely manage data centers and rent out bare metal infra?
Not in the context the person you responded to meant it. Yes, you can very easily get 50GB/s from a few NVMe devices on a single box. Getting 50GB/s on a POSIX-ish filesystem exported to 1000 servers is very possible and common, but orders of magnitude more complicated. 500GB/s is tougher still. 5TB/s is real tough, but real fun.
Check out Apache Iceberg, which makes it fairly trivial to get high throughput from S3 without much fine-tuning. Bursts from 0 to 50 Gbps should be possible from S3 without much effort; just have object sizes that are in the NN+ MiB range. Personally, I find Lustre a mess; it's expensive and even more pain to fine-tune.
> Lustre is a mess, it's expensive and even more pain to fine-tune.
It's a huge RAID-0; as long as your entire team understands that, you'll be OK. It's a lot better than in 2008, but now that AWS has a managed service, I'd just use that. (My heart is always in GPFS land...)
Iceberg doesn't use the directory structure to represent which objects belong to which table. The change allows objects to have significant entropy in the object key prefix. This avoids hot-spotting S3 and allows for greater burst QPS against S3.
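The idea, roughly, is to hash something into the front of the key instead of piling every object under one common prefix. This is just a sketch of the concept, not Iceberg's actual layout; the table and file names are made up:

```python
import hashlib

def spread_key(table: str, filename: str) -> str:
    # A short hash first, so objects fan out across the S3 key space
    # instead of concentrating under a single "directory" prefix.
    h = hashlib.md5(f"{table}/{filename}".encode()).hexdigest()[:8]
    return f"{h}/{table}/{filename}"

print(spread_key("variants", "part-00042.parquet"))
# prints something like "<8 hex chars>/variants/part-00042.parquet"
```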
I think this post is identifying scientific computing with simulation studies and legacy workflows, to a fault. Scientific computing includes those things, but it also includes interactive analysis of very large datasets as well as workflows designed around cloud computing.
Interactive analysis of large datasets (e.g., genome and exome sequencing studies with hundreds of thousands of samples) is well suited to low-latency, serverless, and horizontally scalable systems (like Dremel/BigQuery, or Hail [1], which we build and which is inspired by Dremel, among other systems). The load profile is unpredictable because after a scientist runs an analysis they need an unpredictable amount of time to think about their next step.
As for productionized workflows, if we redesign the tools used within these workflows to directly read and write data to cloud storage as well as to tolerate VM-preemption, then we can exploit the ~1/5 cost of preemptible/spot instances.
One last point: for the subset of scientific computing I highlighted above, speed is key. I want the scientist to stay in a flow state, receiving feedback from their experiments as fast as possible, ideally within 300 ms. The only way to achieve that on huge datasets is through rapid and substantial scale-out followed by equally rapid and substantial scale-in (to control cost).
I've followed Hail and applaud the Broad Institute's work in establishing better bioinformatics software and toolkits, so I hope this doesn't come off as rude, but I can't imagine an instance in a real industry or academic workflow where you need 300 ms feedback from an experiment to "maintain flow", considering how long experiments on data that large (especially exome sequencing!) take overall. My (likely lacking) imagination aside, I guess what I'm really saying is that I don't know what prevents the use case you've described from being performed locally, where there'd be even less latency.
300ms is my ideal latency, but we don’t achieve that under all circumstances. Even for blob storage, I see as much as 100ms latency. That said, my laptop has maybe 8 cores. Even if I had 0ms reads from an SSD, I’m compute bound for some tasks.
Moreover, I think we have differing definitions of “experiment”. In the context of a sequencing study, I think an “experiment” can be as simple as answering the hypothesis: does the missingness of a genotype correlate with any sample metadata (e.g. sequencing platform). You might try to test that hypothesis by looking at a PC1-PC2 plot with points colored by sequencing platform where the PCA is conducted on the 0/1 indicator matrix of missingness.
In the dry lab, that is what I mean by experiment. By that definition, a scientist does many experiments a day. Particularly for sequencing studies, these experiments are data-intensive, I need to run a simple computation on a lot of data to confirm the hypothesis.
A missing element for me is that with a lot of exploratory scientific work we (often half intentionally) have no idea what we are doing. We can easily run a giant job that uses 100x more compute than expected. Yes, you can limit cloud compute resources if you are smart, but it's much better if the default is that you run out of compute and your job takes longer, rather than that you get a 100x bill from your cloud provider. And if you limited your cloud resources to a fixed amount, didn't you just eliminate half the benefit of using cloud in the first place?
Then the problems of data management, transfer and egress are huge. Again the "no idea what we are doing" factor comes into play. If you have a really good idea up front what is going to happen you can plan out a strategy that minimises costs. But if you have no idea at all - genuinely, because this is science and we are doing new things - then you could end up blowing huge amounts of money on unnecessary egress and storage costs. And at the small end we can be talking about experiments run on a shoestring where a few thousand dollars is a big deal.
The way I see it, we need everything - powerful individual workstations / laptops for direct analysis, then a tier of fixed HPC style compute for this kind of work that is poorly matched to cloud, and then for specific projects where it makes sense (massive scaleout, exotic hardware needs - GPUs, FPGAs etc) you embrace cloud resources for that.
Having worked for 2 of the largest cloud providers (1 of them being the largest), I have to say "The Cloud" just doesn't make sense yet for most use cases (maybe with the exception of cloud storage), and that includes startups and small and mid-size companies. It's just way too expensive for the benefits it provides. It moves your hardware acquisition/maintenance cost to development costs; you just think it's better/cheaper because that cost comes in small monthly chunks rather than as a single bill. Plus you add all the security risks, either those introduced by the vendor or those introduced by the massive complexity and poor training of the developers, which, if you want to avoid them, you will have to pay for by hiring a developer competent in security for that particular cloud provider.
Having worked in 3 startups that were AWS-first, I can say that you've learned the completely wrong lessons from your time at your cloud providers.
Building on AWS has provided scale, security, and redundancy at a substantially lower cost than doing any on-prem solution (except for a shitty one strung together with lowendbox machines).
The combined AWS bill for the three startups is less than the cost of an F5, even on a non-inflation adjusted basis.
The cloud doesn't mean that you can be totally clueless. I've had experience in HA/scalability/redundancy/deployment/development/networking/etc. It means that if you do know what you're doing you can deliver a scalable HA solution at a ridiculously lower price point than a DIY solution using bare iron and colo.
1 month, for sure. What about 1 year? Also, did those companies need to provide any training or hiring to achieve that? Because you also need to add that to the cost comparison.
If you are comparing a one-month bill against a one-time purchase (which, if chosen correctly, should not happen more than once every 10 years at the earliest), for sure it will be cheaper. When it comes to scalability, development, and deployment, you should check your tech stack rather than your infrastructure. Kubernetes and containerization should easily take care of those with on-premise hardware while also reducing complexity, plus you will no longer have to worry about off-the-chart network transit fees.
This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers"). Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientists' time by making them wait in a queue when you can just autoscale as high as needed instead? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you're doing that on prem then you need to provision for the peak load, which drives your utilization down and makes on-prem far less cost effective.
For fast-moving researchers who are blocked by a queue, cloud computing still makes sense. I guess I wasn't clear enough in the last section about how I still use AWS for startup-scale computational biology. My scientific computing startup (trytoolchest.com) is 100% built on top of AWS.
Most scientific computing still happens on supercomputers in slower moving academic or big co settings. That's the group for whom cloud computing – or at least running everything on the cloud – doesn't make sense.
Another service that runs on AWS is CodeOcean. It looks like Toolchest is oriented toward facilitating execution of specific packages rather than organization and execution like CodeOcean. Is that a fair summary?
Generally, scientists aren't blocked while they are waiting on a computational queue. The results of a computation are needed eventually, but there is lots of other work that can be done that doesn't depend on a specific calculation.
It's good to learn how not to be blocked on long-running calculations.
On the other hand, if transitioning to a bursty cloud model means you can do your full run in hours instead of weeks, that has real impact on how many iterations you can do and often does appreciably affect velocity.
It can, if you have the technical ability to write code that can leverage the scale-out that most bursty-cloud solutions entail. Coding for clustering can be pretty challenging, and I would generally recommend a user target a single large system with a job that takes a week over trying to adapt that job to a clustered solution of 100 smaller systems that can complete it in 8 hours.
This is a big part of it. In my lab, I have a lot of grad students who are computational scientists, not computer scientists. The time it will take them to optimize code far exceeds a quick-and-dirty job array on Slurm and then going back to working on the introduction of the paper, or catching up on the literature, or any one of a dozen other things.
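For a sense of how quick-and-dirty that is, here's a sketch of the kind of thing meant: a throwaway Slurm job array that fans one script over 100 input chunks. The script contents, file names, resource requests, and array size are all made up:

```python
import pathlib
import subprocess

# A quick-and-dirty Slurm job array: run the same analysis script over 100
# input chunks, one array task each, instead of re-engineering the code.
job = """#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=0-99
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
python analyze.py --input data/chunk_${SLURM_ARRAY_TASK_ID}.csv
"""

script = pathlib.Path("sweep.sbatch")
script.write_text(job)
subprocess.run(["sbatch", str(script)], check=True)
```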
Having worked in the high performance computing field and in cloud hosted commercial applications, I can agree with the article but for entirely different reasons. The reason why some scientific computing shouldn't be done on AWS has to do with networking and latency between compute nodes. Supercomputers often use specialized networking hardware to get single digit microsecond latencies for data transfer between compute nodes and much higher network bandwidth than what you would normally find between EC2 nodes. This allows simulations to efficiently operate on really large data sets that span hundreds or thousands of nodes. The network topology between these nodes is often denser than a tree (think a 2D or 3D grid topology) and offers shorter paths between nodes.
All of this allows you to run code that you can't run in AWS unless it fits on one computer only. It's also way more expensive than clusters of commodity hardware.
For problems that are trivially parallelizable without much communication between nodes - I don't think that most universities can actually operate those cheaper than renting them from cloud computing services. A lot of these calculations don't take the staff to operate data centers, the cost of the building itself or the opportunity cost of using lots of space for this purpose vs something else into account. Economics of scale also kick in here. It's way cheaper per computer for AWS to admin a data center because they do this for orders of magnitude bigger data centers than your typical university.
This is using IP over Ethernet in a tree topology, which has an order of magnitude more latency and less bandwidth than what you would see in a supercomputer, even if all your instances end up in the same rack.
This is the biggest un-addressed problem, IMO. Getting more scientific computing done in the cloud is where we are inevitably trending, but no-one yet has a good answer for completely ad-hoc, low value experimentation and skill building in cloud. I see universities needing to maintain clusters to allow PhDs and postdocs to develop their computing skills for a good while yet.
I agree that this is a big thing to consider here too. I set up a computing cluster in grad school and it was much less costly to make a mistake there than it would have been in a cloud service. Re-running something only wasted wall time and not any money. That said, money is not the only scarce resource here. Researchers can get allocations at university and government HPC systems, but you then have to be quite careful with your allocation of computing time. I remember keeping track of the number of SUs (core-hours) I was burning quite carefully when I used university clusters, since once it is gone, you might not get any more time.
This rings true for me. I have a federal grant that prohibits me from using its funds for capital acquisitions: i.e. servers. But I can spend it on AWS at massive cost for minimal added utility for my use case. Even though it would be a far better use of taxpayer funds to buy the servers, I have to rent them instead.
Lots of places (Hetzner for example) will rent you servers at 10-25% the cost of AWS if you want dedicated hardware, without the ability to autoscale. You can even set up a K8s cluster there if the overhead is worth it.
Doesn't have to be a university either. Depending on the amount of compute needed any capable IT guy can do it for you from their garage with a contract.
I'm not saying AWS is automatically the best option but the question isn't just servers. It's servers, networking hardware, HVAC, a facility to put them all in, and at least a couple people to run and maintain it all. The TCO of some servers is way higher than the cost of the hardware.
Basically, the granting organization doesn't want to pay for the full cost of capital equipment that will - either via time or capacity - not be fully used for that grant.
There are other grant mechanisms for large capital expenditures.
The problem is the thresholds haven't shifted in a long time, so you can easily trigger it with a nice workstation. But then, the budget for a modular NIH R01 was set in 1999, so that's hardly a unique problem.
I can think of a few ways to abuse it while still spinning it as "for research". The obvious one is to buy a $9999 gaming machine with several of whatever the fanciest GPU on the market is at the time, and say you're doing machine learning.
So my guess is it's an overly broad patch for that sort of thing.
There are other regulations to keep people from installing Steam on their ML workstations (which also cover machines below the threshold).
It's entirely about one grant-giving entity not wanting to pay for a piece of capital equipment that will have use beyond the project they're funding. It's a federal regulation, and it comes up far more commonly with lab equipment than it ever does with computers.
Cloud never has made sense for scientific computing.
Renting someone else's big computer makes good sense in a business setting, where you are not paying for your peak capacity when you are not using it, and you are not losing revenue by underestimating whatever peak capacity the market happens to dictate. For business, outsourcing the compute cost center eliminates both cost and risk for a big win each quarter.
Scientists never say, "Gee, it isn't the holiday season, guess we'd better scale things back." Instead they will always tend to push whatever compute limit there is; it is kind of in the job description.
As for the grant argument, that is letting the tool shape the hand. Business-science is not science; we will pay now or pay later.
We have a 500-node cluster at a chemical company, and we've been experimenting with "hybrid-cloud". This allows jobs to use servers with resources we just don't have, or couldn't add fast enough.
Storage is a huge issue for us. We have a petabyte of local storage from a big-name vendor that's bursting at the seams and expensive to upgrade. A lot of our users leave big files lying around for a long time. Every few months we have to hound everyone to delete old stuff.
The other thing that you get with the cloud is there's way more accountability for who's using how many resources. Right now we just let people have access and roam free. Cloud HPC is 5-10x more in cost and the beancounters would shut shit down real quick if the actual costs were divvied up.
We also still have a legacy datacenter so in a similar vein, it's hard to say how much not having to deal with physical hardware/networking/power/bandwidth would be worth. Our work is maybe 1% of what that team does.
I can relate to these problems. Cloud brings positive accountability that is difficult to justify onprem. I have some hope that higher level tools for project/data/experiment management (as opposed to a bash prompt and a path) will bring some accountability without stifling flexibility.
I think there are some things this misses about the scientific ecosystem in Universities/etc. that can make the cloud more attractive than it first appears:
* If you want to run really big jobs e.g. with multiple multi-GPU nodes, this might not even be possible depending on your institution or your access. Most research-intensive Universities have a cluster but they’re not normally big machines. For regional and national machines, you usually have to bid for access for specific projects, and you might not be successful.
* You have control of exactly what hardware and OS you want on your nodes. Often you’re using an out of date RHEL version and despite spack and easybuild gaining ground, all too often you’re given a compiler and some old versions of libraries and that’s it.
* For many computationally intensive studies, your data transfer actually isn’t that large. For example, you can often do the post-processing on-node and then only get aggregate statistics about simulation runs out.
A former colleague did his PhD in particle physics with a novel technique (matrix element method). I can't really explain it, but it is extremely CPU intensive. That working group did it on CERN's resources, and they had to borrow quotas from a bunch of other people. For fun they calculated how much it would have cost on AWS and came up with something ridiculous like 3 million euros.
The bigger experiments will routinely burn through tens of millions worth of computing. But 10 million euros isn't much for these experiments. The issue is that they are publicly funded: any country is much happier to build a local computing center and lend it to scientists than to fork the money over to an American cloud provider.
(The expensive part of these experiments is simulating billions of collisions and how the thousands of outgoing particles propagate through a detector the size of a small building. Simulating a single event takes around a minute on a modern CPU, and the experiments will simulate billions of events in a few months. If AWS is charging 5 cents a minute it works out to tens of millions easily.)
I can't speak specifically to CERN and the exact workload. But bear in mind that the 3MM euros is non-negotiated sticker pricing. In real life, negotiated pricing can be much, much less depending on your org size and spend. This is a variable most people neglect.
That is true, and a large part of the theoretical cost was probably also traffic, and the use of nonstandard nodes. They could have gotten a much more realistic price.
I guess the point is also that scientists often don't realize compute costs money when the computers have already been bought.
I almost got a Tenure Track position at a Data Science Faculty in Virginia, and I think them not having an HPC cluster was the single issue that blew this move (from both sides). During interviews, I asked the dean how they set up their HPC cluster - turned out, they hadn't. I then asked a Professor in the next review round how they teach their students without one:
> "I buy all resources on AWS - it's painful because I have to contact AWS almost monthly for accidental over-billing, but we don't have a solution".
All of this made me really sceptical, since coming from a big University in Germany, we have unlimited HPC resources for free. I have 16 VMs, the biggest one with 125 GB of memory, and I can set those up or move them around however I want. No space limitations - need 10 TB of space for 3 months? Open a service ticket, 3 hours later it's available. Ports need to be opened worldwide to the web? No problem. Need a Jupyter Hub Cluster on Kubernetes? Here you go. This has really improved my work (quality, performance, and convenience) so much.
I was once coordinator of a research project where we had 30k EUR left and didn't know what to do with it. I contacted our HPC and asked if they want the money - answer: "30k really isn't worth the effort, we don't know what to do with it atm."
I second this, would also ring my alarm bells in an interview. Back as a master student I went from a top 10 research university that didn't have a strong HPC infra at the time (ETH Zurich) to a top 100 one that did (Tokyo Tech), at least as a guest. Difference was staggering - all students just got handed a login on their big (top 100) cluster to be able to work with the actual hardware, even if limited to a couple of nodes. This much increased the likelihood that someone would be able to come up with an interesting project just together with their academic supervisor (no huge group projects with tons of extra funding needed).
I’ve also been skeptical of the commercial cloud for scientific computing workflows. I don’t think this cost benefit analysis mentions it, but the commercial cloud makes even less sense when you take into account brick and mortar considerations. In other words, if your company/institution has already paid for the machine rooms, sys admins, networks, the physical buildings, the commercial cloud is even less appealing. This is especially true with “persistent services” for example data servers that are always on because they handle real-time data, for example.
Another aspect of scientific computing on the commercial cloud that’s a pain if you work in academia is procurement or paying for the cloud. Academic groups are much more comfortable with the grant model. They often operate on shoe-string budgets and are simply not comfortable entering a credit card number. You can also get commercial cloud grants, but they often lack long-term, multiyear continuity.
This. It's got nothing to do with "comfort". I use cloud computing all the time in the rest of my life, but the rest of my life isn't subject to university policies and state regulations.
I completely agree for most cases. In many scientific computing applications, compute time isn’t the factor you prioritize in the good/fast/cheap triad. Instead, you often need to do things as cheaply as possible. And your data access isn’t always predictable, so you need to keep results around for an extended period of time. This makes storage costs a major factor. For us, this alone was enough to move workloads away from cloud and onto local resources.
Actually, compute is fine for most use cases (spot instances, preemptible VMs on GCP) and has been used in lots of situations, even at CERN. Where the cloud also excels is if you need any kind of infrastructure, because no HPC center has figured out a reasonable approach to that (some are trying with k8s). Also, obviously, you get a huge selection of hardware.
Where cloud/AWS doesn't make sense is storage, especially if you need egress, and if you actually need InfiniBand.
The killer we've seen is data egress costs. Crunching the numbers for some of our pipelines, we'd actually be paying more to get the data out of AWS than to compute it.
As in, the networking equipment consumes the most energy? Given the 30x markup on AWS egress I'm inclined to say it's more about incentives and marketing, but I'd love to learn otherwise.
This is the case for a large class of big data + high compute applications. Animation / simulation in engineering, planning, forecasting, not to mention entertainment require pipelines the typical cloud is simply too expensive to use.
I'm not in a position to recommend or not a particular provider for gpu-equipped servers, simply because I've never had the need for gpus.
My first thought was related to colocation services. From what I understand, a lot of people avoid on-premise/in-house solutions because they don't want to deal with server rooms, redundant power, redundant networks, etc.
So people go to the cloud and pay horrendous prices there.
Why not take a middle path? Build your own custom server with your preferred hardware and put it in a colocation facility.
There are several tier-two clouds that offer GPUs but I think they generally fall prey to many of the same issues you'll find with AWS. There is a new generation of accelerator-native clouds e.g. Paperspace (https://paperspace.com) that cater specifically to HPC, AI, etc. workloads. The main differentiators are:
- much larger GPU catalog
- support for new accelerators e.g. Graphcore IPUs
- different pricing structure that address problematic areas for HPC such as egress
However, one of the most important differences is the lack of unrelated web-service components that pose a major distraction/headache to users who don't have a DevOps background (which AWS obviously caters to). AWS can be incredibly complicated. Simple tasks are encumbered by a whole host of unrelated options/capabilities and the learning curve is very steep. A platform that is specifically designed to serve the scientific computing audience can be much more streamlined and user-friendly for this audience.
Lambda A100s - $1.10 / hr
Paperspace A100s - $3.09 / hr
Genesis A100s - none available, but their 3090 (roughly half the speed of an A100) is $1.30 / hr
Cloud worked really well for me when I was in school. A lot of the time, I would only need a beefy computer for a few hours at a time (often due to high memory usage) and you can/could rent out spot instances for very cheap. There are about 730 hours per month so the cost calculus is very different for a student/researcher who needs fast turnaround times (high performance), but only for a short period of time.
However, I know not all HPC/scientific computing works that way and some workloads are much more continuous.
That's how my department uses the cloud--we have an image we store up at AWS geared towards a couple of tasks and we spin up a big instance when we need it, run the task, pull out the results, then stop the machine. Total cost sub-100 dollars. If we had to go to the HPC group we'd have to fight with them to get the environment configured, get access to the system, get payment set up, teach the faculty to use the environment, etc. It's just a pain for very little gain.
Using on-demand for latency-insensitive work, especially when you're also very cost sensitive, isn't the right choice. Spot instances will get you somewhere in the realm of the Hetzner/on-prem numbers.
Even more importantly, if you have any reasonable amount of spend on cloud, you can get preferred pricing agreements. As much as I hate to talk to "salespeople", I did manage to cut millions in costs per year with discounts on serving and storage.
Personally, when I estimate the total cost of ownership of scientific cloud computing versus on prem (for extremely large-scale science with both significant server, storage, and bandwidth requirements) the cloud ends up winning for a number of reasons. I've seen a lot of academics who disagree but then I find out they use their grad students to manage their clusters.
But, as the article points out, you are still paying a lot of money for features that you don't need for scientific computing.
Also, AWS is notoriously easy to undercut with on-prem hardware, especially if your budget is large and your uptime requirements aren't - you'll save a few hundred thousand a year alone by not having to hire expert engineers for on-call duty and extreme reliability.
Even spot instances on AWS are still over 2x more expensive per month than Hetzner. The cheapest c5a.24xlarge spot instance right now is $1.5546/hr in us-east-1c. That's $1134.86/mo, excluding data transfer costs. If you transfer out 10 TB over the course of a month, that's another $921.60/mo – or now 4x more expensive than Hetzner.
Using the estimate from the article, spot instances are still over 8x more expensive than on-prem for scientific computing.
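For anyone who wants to redo that arithmetic with their own numbers, here's a minimal sketch. The hourly rate, egress volume, and per-GB price are just the figures quoted in this thread (using 730 hours per month and 1,024 GB per TB, which reproduces the $1134.86 and $921.60 figures), not current AWS rates.

    # Minimal sketch of the cost arithmetic above; all prices are placeholder
    # figures quoted in this thread, not current AWS rates.
    HOURS_PER_MONTH = 730  # roughly 24 * 365 / 12

    def monthly_compute(hourly_rate: float) -> float:
        """Instance cost for a full month at a given hourly spot price."""
        return hourly_rate * HOURS_PER_MONTH

    def egress_cost(tb_out: float, price_per_gb: float = 0.09) -> float:
        """Flat-rate approximation of data-transfer-out cost (ignores tiering)."""
        return tb_out * 1024 * price_per_gb

    spot = monthly_compute(1.5546)   # ~= $1134.86/mo for the c5a.24xlarge spot price above
    transfer = egress_cost(10)       # ~= $921.60/mo for 10 TB out at $0.09/GB
    print(f"spot: ${spot:,.2f}/mo, egress: ${transfer:,.2f}/mo, total: ${spot + transfer:,.2f}/mo")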
This has been my exact field of work for a few years now; in general I have found that:
When people claim it is 10x more expensive to use public cloud, they have no earthly idea what it actually costs to run an HPC service, a data centre, or do any of the associated maintenance.
When the claim is 3x more expensive in the cloud, they do know those things but are making a bad faith comparison because their job involves running an on-premises cluster and they are scared of losing their toys.
When the claim is 0-50% more to run in the cloud, someone is doing the math properly and aiming for a fair comparison.
When the claim is that cloud is cheaper than on-prem, you are probably talking to a cloud vendor account manager whose colleagues are wincing at the fact that they just torched their credibility.
It can categorically be stated that for a year's worth of CPU compute, local will always be less than Amazon. Of course, putting percentages on it doesn't work - there are just too many variables.
There are many admins out there who have no idea what an Alpha is who'll swear that if you're not buying Dell or HP hardware at a premium with expensive support contracts, you're doing things wrong and you're not a real admin. Visit Reddit's /r/sysadmin if you want to see the kind of people I'm talking about.
The point is that if people insist on the most expensive, least efficient type of servers such as Dell Xeons with ridiculous service contracts, the savings over Amazon won't be large.
It's a cumulative problem, because trying to cool and house less efficient hardware requires more power and that hardware ultimately has less tolerance for non-datacenter cooling.
Rethink things. You can have AMD Threadripper / EPYC systems in larger rooms that require less overall cooling, that have better temperature tolerance, that're more reliable in aggregate, which cost less and for which you can easily keep around spare parts which would give better turnaround and availability than support contracts from Dell / HP. Suddenly your compute costs are halved, because of pricing, efficiency, overall power, real estate considerations...
So percentages don't work, but the bottom line is that when you're doing lots of compute, over time it's always cheaper locally, even if you do things the "traditional" expensive and inefficient way, so arguing percentages with so many variables doesn't make any sense - it's still cheaper, no matter what.
All types of HPC; I'm a sysadmin/consultant. I don't think the problem with the cost gap is overestimating cloud costs but rather underestimating on-prem costs. Also, failing to account for financing differences and opportunity-costs of large up-front capital purchases.
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers
That sounds very much like an argument for a cloud. Instead of waiting months to do your processing, you spin up what you need, then tear it down when you are done.
I've never worked in this space, but I'm curious about the need for massive egress. What's driving the need to bring all that data back to the institution?
Could whatever actions have to be performed on the data also be performed in AWS?
Well for starters, if you are NIH or NSF funded, they have data storage requirements you must meet. So usually this involves something like tape backups in two locations.
The other is for reproducibility - typically you need to preserve lots of in-between steps for peer review and proving that you aren't making things up. Some intermediary data is wiped out, but usually only if it can be quickly and easily regenerated.
I've worked the last few years in a research institute well-funded by academic standards. During that time, people around me have used self-hosted hardware, at least three different cloud providers, and HPC clusters hosted by several different organizations. I've also worked with people from many other institutions, each of which has made its own infrastructure choices.
You may use AWS yourself, but your collaborators have made their own choices for their own reasons. The data goes where it's needed, often crossing institutional boundaries.
Regarding the waiver—"The maximum discount is 15 percent of total monthly spending on AWS services". Was very excited at first.
As for leaving data in AWS, data is often (not always) revisited repeatedly for years after the fact. If new questions are raised about the results it's often much easier to check the output than rerun the analysis. And cloud storage is not cheap. But yes it sometimes makes sense to egress only summary statistics and discard the raw data.
One of the aspects not touched on for this is PII/confidential data/HIPAA data, etc.
For that, whether it makes sense or not, a lot of universities are moving to AWS, and the infrastructure cost of AWS for what would be a pretty modest server are still considerably less than the cost of complying with the policies and regulations involved in that.
Recently at my institution I asked about housing it on premise, and the answer was that IT supports AWS, and if I wanted to do something else, supporting that - as well as the responsibility for a breach - would rest entirely on my shoulders. Not doing that.
My university owns hardware in multiple locations, plus uses hardware in a collocation, and still uses the cloud for bursting (overflow). You can't beat the provisioning time of cloud providers which is measured in seconds.
The author makes a convincing argument against doing this workload on on-demand instances, but what about spot instances? AWS explicitly calls out scientific computing as a major use case for spot instances in its training/promotional materials. Given the advertised ~70-90% markdown on spot instance time, it seems like a great option compared to paying almost the same amount as the workstation but not having to pay to buy, maintain, or replace the hardware.
Author here! Spot instance pricing is better than on-demand, but it doesn't include data transfer, and it's still more expensive than on-prem/Hetzner/etc. Data transfer costs exceed the cost of the instance itself if you're transferring many TB off AWS.
For one of the more popular AWS instance types I use – a c5a.24xlarge, used for comparison in the post – the cheapest spot price over the past month in us-east-1 was $1.69. That's still $1233.70/mo: above on-prem, colo, or Hetzner pricing. Data transfer is still extremely expensive.
That said, for bursty loads that can't be smoothed with a queue, spot instances (or just normal EC2 instances) do make sense! I use them all the time for my computational biology company.
FWIW, spot prices for c5a.24xlarge in us-east-2b and us-east-2c seem to have been under $0.92/hr for most of the last 3 months. So, assuming some flexibility on the choice of region, that would adjust your estimate to $0.92 / $1.69 * $1233.70/mo = $671.60/mo, which looks a lot more reasonable. Hopefully I did that math right. Data egress prices are definitely still ridiculous, I agree.
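If you want to check this across zones yourself, a rough boto3 sketch like the one below will pull recent spot price history; the region, instance type, and 90-day window are just examples.

    # Rough sketch: pull recent c5a.24xlarge spot prices and report the cheapest
    # observation per availability zone. Region and instance type are examples.
    from collections import defaultdict
    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")
    paginator = ec2.get_paginator("describe_spot_price_history")
    cheapest = defaultdict(lambda: float("inf"))

    for page in paginator.paginate(
        InstanceTypes=["c5a.24xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(days=90),
    ):
        for record in page["SpotPriceHistory"]:
            az = record["AvailabilityZone"]
            cheapest[az] = min(cheapest[az], float(record["SpotPrice"]))

    for az, price in sorted(cheapest.items()):
        print(f"{az}: lowest observed spot price ${price:.4f}/hr")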
True! It sometimes drops even more, which definitely makes spot instances attractive. The r5.16xlarges had a ~80% discount and a <5% termination rate for a while.
Also, if all of scientific computing switched to AWS in order to exploit spot instance pricing, I don't think those market dynamics would stay the same.
As an aside, I've had trouble creating large clusters of high-memory instances in us-east-2. They might have increased capacity recently, though.
On the other hand it makes sense if you just need to borrow their infrastructure for a while to calculate something.
A lot of scientific computing isn't happening continuously; a lot of it is a one-time experiment, or maybe run a couple of times, after which you would have to tear down and reassign.
Another fun fact people forget is that our ability to predict the future is still pretty poor. Not only that, we are biased towards thinking we can predict it when in fact this is complete bullshit.
You have to buy and set up infrastructure before you can use it, and then you have to be ready to use it. What if you are not ready? What if you will not need as many resources? What if you stop needing it earlier than you thought? When you borrow it from AWS you have the flexibility to start using it when you are ready and drop it immediately when you no longer need it. Which has value on its own.
At the company I work for, we basically banned signing long-term contracts for discounts. We found that, on average, we pay many times more for unused services than whatever we gained through discounts. Also, when you pay for the resources there is an incentive to improve efficiency. When you have basically prepaid for everything, that incentive is very small and is basically limited to making sure you stay within limits.
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers – that's the API equivalent of storing your inbound API requests, and then responding to them months later
Makes sense if the jobs are all low urgency.
We have a similar problem in trading so we have a composite solution with non-cloud simulation hardware and additional AWS hardware. That's because we have the high utilization solution combined with high urgency.
I did have to chuckle a bit because, working on HPC simulations of the pandemic during the pandemic, there was an awful lot of "This needs to be done tomorrow" urgency.
What does the landscape look like now for "terraform for bare metal"? Is ansible/chef still the main name in town? I just wanna netboot some lightweight image, set up some basic network discovery on a control plane, and turn every connected box into a flexible worker bee I can deploy whatever cluster control layer (k8s/nomad) on top of and start slinging containers.
I really like this description of how baremetal infrastructure should work, and this is where I think (shameless self promotion) Triton DataCenter[1] plays really well today on-prem.
PXE booted lightweight compute nodes with a robust API, including operator portal, user portal, and cli.
Keep an eye out for the work we are doing with Triton Linux + K8s. Very lightweight Triton Linux compute node + baremetal k8s deployments on Triton.
Thanks! This is pretty sleek! I'm going to have to dust off my homelab and play around with this.
What is the stack written in? Looks like a lot of javascript and makefiles from the github side of things but idk if that's the whole kit and caboodle.
Is genomic code typically distributed-memory parallel? I'm under the impression that it is more like batch processing, not a ton of node-to-node communication but you want lots of bandwidth and storage.
If you are doing a big distributed-memory numerical simulation, on the other hand, you probably want infiniband I guess.
AWS seems like an OK fit for the former, maybe not great for the latter...
The fastest way to do a lot of genomics stuff is with FPGA accelerators, which also aren't used by most of the other tenants in a multi-tenant scientific computing center. The cloud is perfect for that kind of work.
That's interesting. It is sort of funny that I was right (putting genomics in the "maybe good for cloud" bucket) for the wrong reason (characterizing it as more suited for general-purpose commodity resources, rather than suited for the more niche FPGA platform).
This is a bit overstated. Yes, FPGA accelerators can be used effectively in some common genomics workflows. However, my experience is that the right software running on regular Intel/AMD/ARM processors is very competitive with FPGA-using solutions.
>a month-long DNA sequencing project can generate 90 TB of data
Our EM facility generates 10 TB of raw data per day, and once you start computing on it, that increases by 30%-50% depending on what you do with it. Plus, moving between network storage and local scratch for computational steps basically never ends and keeps multiple 10 GbE links saturated 100% of the time.
A trend I've seen on HN over past few years is that people love showing off how they are able to save money by spending more of their own time, especially on infra/cloud things - if you calculate your own hourly rate correctly, it's oftentimes more costly to DIY than outsourcing to experts (e.g., managed cloud).
AWS is one tool but it's a lot like the proprietary computing ecosystems that have existed for a long time (remember the Micro$oft days?). It offers convenience in return for lock-in and very high margins. There's no clear answer but it's definitely not a clear-cut decision where AWS is guaranteed to save money.
There are 2 major costs that are overlooked by the in-house crowd, which are operational maintenance cost (an increasingly rare and expensive skillset), and also the cost of downtime -- how much does it cost you when your team of data scientists are blocked because of a failed OS update etc. That being said, hiring competent people to maintain AWS properly isn't cheap either -- and it is quite easy to start running up very wasteful AWS bills on things you don't need.
As always there's a tradeoff -- the key is to choose a path and to execute it well.
I imagine what makes this especially hard is you have (at least) three parties in play here:
- the people doing the research
- the institution's IT services group
- the administrator who writes the checks
And in my experience, "actual knowledge of what must be done and what it will or could cost" can vary greatly across these three groups; frequently in very unintuitive ways.
These MPI-based scientific computing applications make up the bulk of the compute hours on HPC clusters, but there is a crazy long tail of scientists who have workloads that can't (or shouldn't) run on their personal computers. The other option is HPC. This sucks for a ton of reasons, but I think the biggest one is that it's more or less impossible to set up a persistent service of any kind. So no databases; if you want Spark, be ready to spin it up from nothing every day (also no HDFS unless you spin that up in your SLURM job too). This makes getting work done harder, but it also makes integrating existing work so much harder, because everyone's workflow involves reinventing everything and everyone does it in subtly incompatible ways; there are no natural (common) abstraction layers because there are no services.
AWS is fantastic for scientific computing. With it you can:
- Deploy a thousand servers with GPUs in 10 minutes, churn over a giant dataset, then turn them all off again. Nobody ever has to wait for access to the supercomputer.
- Automatically back up everything into cold storage over time with a lifecycle policy.
- Avoid the massive overhead of maintaining HPC clusters, labs, data centers, additional staff and training, capex, load estimation, months/years of advance planning to be ready to start computing.
- Automation via APIs to enable very quick adaptation with little coding.
- An entire universe of services which ramp up your capabilities to analyze data and apply ML without needing to build anything yourself.
- A marketplace of B2B and B2C solutions to quickly deploy new tools within your account.
- Share data with other organizations easily.
AWS costs are also "retail costs". There are massive savings to be had quite easily.
I don't control my AWS account. I don't even have an AWS account in my professional life.
I tell my IT department what I want. They tell the AWS people in central IT what they want. It's set up. At some point I get an email with login information.
I email them again to turn it off.
Do I hate this system? Yes. Is it the system I have to work with? Also yes.
"AWS as implemented by any large institution" is considerably less agile than AWS itself.
Calculating costs based on sticker price is sometimes misleading because there’s another variable: negotiated pricing, which can be much much lower than sticker prices, depending on your negotiating leverage. Different companies pay different prices for the same product.
If you’ve ever worked at a big company or university (any place where you spend at scale), you’ll know you rarely pay sticker price. Software licensing is particularly elastic because it’s almost zero marginal cost. Raw cloud costs are largely a function of energy usage and amortized hardware costs — there’s a certain minimum you can’t go under but there remains a huge margin that is open to be negotiated on.
Startups/individuals rarely even think about this because they rarely qualify. But big orgs with large spends do. You can get negotiated cloud pricing.
This is definitely true for cloud retail prices. However, this becomes not true in cases I've seen when there is an existing discount. Reserved instances, for example.
When a company reaches a certain mass, hardware cost is a factor that is considered, but not a big one.
The bigger problems are lost opportunity costs and unnecessary churn.
Businesses lose a lot when a product launch is delayed by a year simply because the hardware arrived late or has too many defects (ask your hardware fulfillment people how many defective RAM sticks and SSDs they get per new shipment).
Churn can cost the business a lot as well. For example, imagine the model that everyone has been using is trained on a Mac Pro under XYZ's desk. And then when XYZ quits, they never properly backed up the code and the model.
Bare metal allows for sloppiness that the cloud cannot afford to allow. Accountability and ownership is a lot more apparent in the cloud.
There is a lot of discussion about supercomputers in this article. I don't think public cloud providers can compete easily with traditional super computers because they are built for optimal processing of extremely large scale MPI workloads. Such workloads are not common so I expect that public cloud providers wouldn't bother optimizing for this niche use case (though I know they all have offerings). Also when you are only optimizing for a single variable (i.e. speed), you can make design choices that would be impossible to make in a more general situation.
Of course, not all scientific computing workloads require a traditional supercomputer. In fact, I suspect most do not.
I agree with the article. We at croit.io support customers around the globe to build their clusters and save huge amounts. For example, AWS S3 compared to Ceph S3 in any data center of your choice is around 1/10 of the AWS price.
>Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.
Can’t imagine you are paying public prices on any cloud provider if you have a $50M/yr budget.
In addition, if, as the article states, the scientists are ok to wait some considerable time for results, then one can run most, if not all, on spot instances, and that can save 10x right there.
If you don’t have $50M/yr there are companies that will move your workload around different AWS regions to get the best price - and will factor in the cost of transferring the data too.
I was architect at large scientific company using AWS.
Author here. I agree that pricing is highly negotiable for any large cloud provider, and there are even (capped) egress fee waivers that you can negotiate as a part of your contract. There's also a place for using AWS; I used it for a smaller DNA sequencing facility, and I use it for my computational biology startup.
That said, I'll repeat something that I commented somewhere else: most of scientific computing (by % of compute) happens in a context that still doesn't make sense in AWS. There's often a physical machine within the organization that's creating data (e.g. a DNA sequencer, particle accelerator, etc), and a well-maintained HPC cluster that analyzes that data.
Spot instances are still pretty expensive for a steady queue (2x of Hetzner monthly costs, for reference), and you still have to pay AWS data transfer egress costs – which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link. Data transfer to optimize for spot instance pricing becomes prohibitive when your job has 100 TB of raw data.
Surely the raw data is input, so ingress costs, which is free?
The problem is if you have large amounts of intermediate data, and you want to transfer that somewhere else to continue analysis. Then it's "expensive". So the logical conclusion is to do all the work on the cloud, so you never have egress costs. That causes anxiety, sure. That said, 100TB costs $1000 to egress, assuming that your $50M/yr covers a Verizon 10GB/s line.
These opinions are my own and not those of my employer or former employer.
Database analyst for a large communication company here.
I have similar doubts about AWS for certain kinds of intensive business analysis. Not API based transactions, but back-office analysis where complex multi-join queries are run in sequence against tables with 10s of millions of records.
We do some of this with SQL servers running right on the desktop (and one still uses Excel with VLOOKUP). We have a pilot project to try these tasks in a new Azure instance. I look forward to seeing how it performs, and at what cost.
I'd love to buy my own servers for small-scale (i.e. startup size or research lab size) projects, but it's very hard to be utilizing them 24x7. Does anyone know of open-source software or tools that allow multiple people to timeshare one of these? A big server full of A100s would be awesome, with the ability to reserve the server on specific days.
> the ability to reserve the server on specific days
In an environment where there are not too many users and everyone is cooperative, using Google Calendar to reserve time slots works very well and is very low maintenance. Technical restrictions are needed only when the users can't be trusted to stay out of each other's way.
If you pay $500 to form an LLC with Stripe Atlas you get $10,000 worth of AWS credits that can be used any way you want. It's a pretty solid way to cost effectively do scientific computing even if you need like five companies.
If a policy change is made because of this comment. I'm sorry. For sure let me know though. I'll put it on my resume.
Five years is a pretty typical amortisation schedule for HPC hardware. During my sysadmin days, of CPU, memory, cooling, power, storage, and networking, the only things that broke were hard disks and a few cooling fans. Disks were replaced by just grabbing a spare and slotting it in, and fans were replaced by, well, swapping them out.
Modern CPUs and memory last a very long time. I think I remember seeing Ivy Bridge CPUs running in Hetzner servers in a video they put out, and they're still fine.
If you expect downtime during the 5 years to replace fans and whatnot, you're not getting 100% of your money/perf back - and I didn't see that in the article.
If you have spares, value lost to downtime stays minimal, but you have to include the spares in the expenses. If you don't have spares, a 1-2 day downtime is going to be a decent hit to value.
I’m not sure I understand what you mean. I’ve run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500 node cluster are down for a few days while you wait for RMA parts to arrive, you haven’t lost much value. Your cluster is still functioning at nearly peak capacity.
You have a handful of nodes that the cluster can’t function without (scheduler, fileservers, etc), but you buy spares and 24x7 contracts for those nodes.
I think spot instances on amazon at least don't do partial hours, do they? So, you'll also have some wasted cycles there. Probably enough to compensate for your on-prem downtime.
I've worked with a YARN cluster with around 200 nodes which ran non-stop for well over 5 years and is still kicking. There were a handful of failures and replacements, but I'd say 95% of the cluster was fine 7 years in.
When I was looking at AWS for personal use, I first thought it was oddly expensive even when factoring in not having to buy the hardware. But when I looked at just what the electricity to run it myself would cost, I think that addition alone meant AWS was actually cheaper. This is without factoring in cooling / dedicated space / maintenance.
Most of the comments get fixated on the most and least expensive options in this (AWS where you pay through the nose / own DC where you'll get bad service from your institution and have to fight with hw procurement stuff). What about the middle ones presented, the more reasonably priced cloud/rental server providers?
Buying your own fleet of dedicated servers seems like a smart move in the short term, but then five years from now you’ll get someone on the team insisting that they need the latest greatest GPU to run their jobs. Cloud providers give you the option of using newer chipsets without having to re-purchase your entire server fleet every five years.
In HPC land, most hardware is amortized over five years and then replaced! If you keep your service in life for five years at high utilization, you're doing great.
Very odd article (to be written by a scientist)—Shouldn't it be comparing with GCE? Doesn't make sense to compare on cost against AWS instead of GCP, except.. for wow numbers and moar clicks.
"Linkedin doesn't make sense for connecting with friends".
I can tell you that NASA is in the midst of a multi-year effort to move their computing to AWS and that yes.. downloading 324 terabytes of data is very expensive but very soon all this data will just remain in the cloud and be accessed virtually.
The fact that scientific computing has a different pattern than the typical web app is actually a good thing. If you can architect large batch jobs to use spot instances, it's 50-80% cheaper.
Also this bit: "you can keep your servers at 100% utilization by maintaining a queue of requested jobs" isn't true in practice. The pattern of research is that the work normally comes in waves. You'll want to train a new model or run a number of large simulations. And then there will be periods of tweaking and work on other parts. And then more need for a lot of training. Yes, you can always find work to put on a cluster to keep it >90% utilization, but if it can be elastic (and the compute has a budget attached to it), it will rise and fall.
Author here! I worked on the computing infrastructure for a DNA sequencing facility, and I run a computational biology infrastructure company (trytoolchest.com, YC W22). Both are built on AWS, so I do think AWS in scientific computing has its use-cases – mostly in places where you can't saturate a queue or you want a fast cycle time.
Spot instances are still pretty expensive for a steady queue (2x of Hetzner monthly costs, for reference), and you still have to pay AWS data transfer egress costs – which are at least 30x more expensive than a colo or on-prem, if you're saturating a 1 Gbps link.
This post was born from frustration at AWS for their pricing and offerings after trying to get people to switch to AWS in scientific computing for years :)
I'm surprised no-one mentioned Amazon Lightsail by now. Anyway, yes, AWS can be super expensive especially if you don't know what you're doing, regardless of type of processing.
Using your own infrastructure always wins (assuming free labor), since you can load your own infrastructure to ~95% pretty much 24/7, which is unbeatable.
It might also depend on how long you're actually willing to wait. There's nothing stopping you from having a job queue in AWS, and you can set things up so that instances are only running if the price is low enough.
Otherwise completely agree, there might be some cases where the cost of labour means that you're better off running something in AWS, even if that requires someone to do the configuration as well.
- CERN started planning its computing grid before AWS was launched.
- It's pretty complicated (politics, mission, vision) for CERN to use external proprietary software/hardware for its main functions (they have even started moving away from MS Office-like products).
- [cost] CERN is quite different from a small team of researchers doing a few years of research. The scale is enormous and very long-lived, continuing for decades.
- and more...
HPC and scientific computing aside, I would have loved to be able to use AWS when I worked there; the internal infra for running web apps and services wasn't nearly as good or reliable, nor did it have as wide a catalog of services on offer.
I think the spirit of the article is to put the cloud in perspective of the organization size and the workload type. There is a sweet spot where the cloud is the only option that makes sense: definitely with variable loads and the capacity to scale on demand as big as our budget allows, there is no match for that. However... there are organizations with certain types of workloads that could afford to put infrastructure in place, and even with the costs of staffing, energy, etc. they will save millions in the long run. NASA, CERN, etc. are some. This is not limited to HPC; the cloud at scale is not cheap either, see: https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...
The vast majority of researchers don't need anywhere close to the amount of resources that CERN needs. The fact that CERN doesn't use EC2 and lambdas shouldn't be taken as a lesson by anyone who's not operating at their scale.
This feels like a similar argument to the one made by people who use Kubernetes to ensure their web app with 100 visitors a day is web scale.
Toolchest actually runs scientific computing on AWS! I'm just frustrated by what we can build, because most scientific compute can't effectively shift to AWS.
As an HPC sysadmin for 3 research institutes (mostly life sciences & biology) I can't see how a cloud HPC system could be any cheaper than an on-prem HPC system, especially if I look at the resource efficiency (how much was requested vs how much was actually used) of our users' SLURM jobs. Often the users request 100s of GB but only use a fraction of it. In our on-prem HPC system this might decrease utilization (which is not great), but in the cloud this would result in increased computing costs (because of a bigger VM flavor), which would probably be worse (CapEx vs OpEx).
Of course you could argue that the users should do and know better and properly size/measure their resource requirements. However, most of our users have a lab background and are new to computational biology, so estimating or even knowing what all the knobs (cores, mem per core, total memory, etc.) of the job specification mean is hard for them. We try to educate by providing trainings and job efficiency reporting, but the researchers/users have little incentive to optimize their job requests and are more interested in quick results and turnover, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them, or rather their group leaders/institute heads, to put a focus on this, but until you move to the cloud you won't know.
Additionally, the typical workloads that run on our HPC system are often some badly maintained bioinformatics software or R/Perl/Python throwaway scripts, and often enough a typo in the script causes the entire pipeline to fail after days of running on the HPC system and it needs to be restarted (maybe even multiple times). Again, on the on-prem system you have wasted electricity (bad enough), but in the cloud you have to pay the computing costs of the failed runs. Again, cost transparency might force people to fix this, but the users are not software engineers.
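For what it's worth, the job-efficiency reporting mentioned above can be approximated with a small wrapper around sacct (Slurm's seff contrib tool does this more completely). This is only a rough sketch: it assumes ReqMem and MaxRSS come back in simple K/M/G form, which varies between Slurm versions, so treat the parsing as a starting point rather than anything definitive.

    # Sketch of a per-job memory-efficiency report built on `sacct`.
    # Assumes ReqMem like "200G"/"4000M" and MaxRSS like "12345678K";
    # real output varies by Slurm version, so treat this as a starting point.
    import subprocess

    def to_mb(value: str) -> float:
        """Convert a Slurm size string ("4000M", "200G", "12345K") to megabytes."""
        value = value.strip().rstrip("nc")  # older Slurm appends n (per node) / c (per CPU)
        units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024}
        if value and value[-1] in units:
            return float(value[:-1]) * units[value[-1]]
        return float(value) if value else 0.0  # bare numbers: assume MB

    rows = subprocess.run(
        ["sacct", "--allusers", "--state=COMPLETED", "--noheader", "--parsable2",
         "--format=JobID,ReqMem,MaxRSS"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    for row in rows:
        jobid, reqmem, maxrss = row.split("|")
        if not maxrss:  # parent job lines usually carry no MaxRSS; the steps do
            continue
        requested, used = to_mb(reqmem), to_mb(maxrss)
        if requested > 0:
            print(f"{jobid}: requested {requested:.0f} MB, peak {used:.0f} MB, "
                  f"efficiency {100 * used / requested:.0f}%")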
One thing that the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPUs to GPUs. A new CryoEM microscope was installed whose downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is now using GPUs too.
We have around 100 GPUs, average utilization has increased to 80-90%, and the users are complaining about long waiting/queueing times for their GPU jobs.
For this, bursting to the cloud would be nice; however, GPUs are prohibitively expensive in the cloud, unfortunately, and the above-mentioned caveats regarding job resource efficiency still apply.
One thing that will hurt on-prem HPC systems though is the increased electricity prices. We are now taking measures to actively save energy (i.e. by powering down idle nodes and powering them up again when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't increased their prices yet, either because they cover the electricity cost increase with their profit margins or because they are not affected as much since they have better deals with electricity providers.
You touch on a good point: because cloud compute requires pretty knowledgeable users in order not to waste massive amounts of money, it effectively imposes a much higher competency requirement on the users. You can view that in different ways. One is that it's a good thing if everyone learns to use compute better. But another is that you are locking out a whole tier of scientific users from doing computation at all, which is a pretty unfortunate thing - we may miss out on real and important scientific discoveries - even if those users are horrifically bad at doing it efficiently.
This is less than $300/month for 2.5x the compute capacity he's referencing! The author's estimate is $200/month for an on-prem server with just 48 cores. Scaled down to that level, the equivalent in cloud spot pricing would be $120.
That's assuming on-prem is 100% utilised and the cloud compute is not auto-scaled. If those assumptions are lifted, the cloud is much cheaper.
The cloud makes sense in several other ways also:
- Once the data is in cloud storage like S3 or Azure Storage Accounts, sharing it with government departments, universities, or other research institutes is trivial. Just send them a SAS URL and they can probably download it at 1GB/s without killing the Internet link at the source.
- Many of these processes have 10 GB inputs that produce about 1 TB of output due to all the intermediate and temporary files. These are often kept for later analysis, but they're of low value and go cold very quickly. Tiered storage in the cloud is very easy to set up and dirt cheap compared to on-prem network attached storage. These blobs can be moved to "Cold" storage within a few days, and then to "Archive" within a month or two at most (see the lifecycle sketch after this list).
- The algorithms improve over time, at which point it would be oh-so-nice to be able to re-run them over the old multi-petabyte data sets. But on-prem, this is an extravagance, and needs a lot of justification. In the cloud, you can just spin up a large pool of Spot instances with a low price cap, and let it chunk through the old data when it can. Unlike on-prem, this can read the old data in much faster, easily up to 30-100 Gbps in my tests. Good luck building a disk array that can stream 100 Gbps and also have good performance for high-priority workloads!
- The hardware is evolving much more rapidly than typical enterprise purchase cycles. We have a customer that is about to buy one (1) NVIDIA A100 GPU to use for bioinformatics. In a matter of months, it'll be superseded by the NVIDIA "Hopper" H100 series, which is 7x faster for the same genomics codes. In the cloud, both AWS and Azure will soon have instances with four H100 cards in them. That'll be 28 times faster than one A100 card, making the on-prem purchase obsolete years before the warranty runs out. A couple of years later, when the successor to the H100 is available in the cloud, these guys will still be using the A100!
- The cloud provides lots of peripheral services that are a PitA to set up, secure, and manage locally. For example, EKS or AKS are managed Kubernetes clusters that can be used to efficiently bin-pack HPC compute jobs and restart jobs on Spot instances if they're deallocated. Similarly, Azure CycleCloud provides managed Slurm clusters with auto-scale and spot pricing. For Docker workloads there are managed container registries, and both single-instance and scalable "container apps" that work quite well for one-off batch jobs, Jupyter notebooks, and the like.
- In the cloud, it's easy to temporarily spin up a true HPC cluster with 200 Gbps Infiniband and a matching high-performance storage cache. It's like a tiny supercomputer, rented by the hour. On-prem, just buying a single Infiniband switch will set you back more than $30K, and it'll be just the chassis. No cables, SFPs, or host adapters. A full setup is north of $100K. Good luck buying "cheap" storage that can keep up with that network!
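The tiered-storage point in the list above is straightforward to automate. Here is a minimal boto3 sketch under assumed names (the bucket, prefix, and day thresholds are placeholders, not anything from this thread); Azure Storage Accounts have an equivalent lifecycle-management feature.

    # Minimal sketch of the tiering described above: transition objects under a
    # prefix to cheaper storage classes as they go cold. Bucket, prefix, and
    # day counts are placeholders.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-sequencing-results",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "cool-down-intermediate-files",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "pipeline-output/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )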
Author here. It's clear that you genuinely believe my intent here is to be deceptive, so I think your comment deserves a thoughtful response. For context, I work on a computational biology infrastructure company that only uses cloud computing; my incentive is for more scientific computing to be on the cloud.
Responses to your points below. Azure does have some better HPC infrastructure than AWS, so maybe some of my response to your comment will be wrong. I'm happy to talk about this more if you want, you can reach me using the email address in my profile.
Spot/pre-emptible instances vs. the prices in this post: large CPU/RAM instances have availability issues, especially for spot instances. I spent a lot of time trying to exploit spot instance pricing, and for standard (run command-line tool that reads in file and writes out files) bioinformatics programs, spot instances haven't made sense averaging their performance over a long period after factoring in restarts and their corresponding data transfer vs. other options for decreasing instance cost (reserved instances, negotiating, etc).
Sending S3 links: yeah, AWS makes sending data easier! Although if the destination is not within the same cloud provider (or region), you get hit with a surprisingly large charge for sufficiently large files.
Input size vs. output size and storing the results: generally, I agree. Cloud storage costs aren't unreasonable for S3, but I want to note how significantly the pipeline can differ. A new Illumina NovaSeq sequencer (about the size of a copy machine) with dual S4 flow cells produces 6Tb every couple days. Some pipelines are definitely inefficient, but others have more raw data. Storing that data in an infrequent access or archive tier decreases the restore speeds and increases the restore cost. If you have 100 TB of data, that increases the cost of re-running data – especially if it's in cold storage or archived.
Improved algorithms and re-running large sets of data: sure, it's a trade-off between cost and the bandwidth of a queue than you can run. For some use cases, the cloud does make sense.
Hardware improvement cycles in bioinformatics: what software are they using in bioinformatics that uses a GPU, AlphaFold? From what I've seen, most computational genomics still happens on a CPU, although fields like computational chemistry use more GPUs.
Infrastructure components and easily installable components: yeah, this is a definite value-add of cloud services, and the off-AWS/GCP/Azure analogues aren't as good yet.
Cost of networking equipment vs. by-the-hour in a cloud: yeah, if you want results quickly and occasionally, this makes sense.
Overall, this post is about most of scientific computing, not all. For this to work, you need a smoothable queue of jobs. Most computational science (by % of compute) runs in this context, in universities, larger/growing co's, and government research institutions. If you want instant scalability, the math is different.
> Azure does have some better HPC infrastructure than AWS
That's somewhat surprising to hear, I just assumed AWS has equivalent products. My customer is 80% AWS and 20% Azure, so it's a useful data point to know that some HPC workloads are better off in Azure.
> large CPU/RAM instances have availability issues, especially for spot instances.
I've had two different ~128 vCPU instances running for days and days in my lab environment, but that's probably because my region tends not to have a lot of HPC workloads that would compete for spot instances. I've noticed that "popular" sizes in Azure such as D4, D8, E4, and E8 are pre-empted regularly, but the "special" sizes like HPC not so much.
> surprising large charge for sufficiently large files.
Both Azure and AWS use this as a "roach motel" to encourage vendors and partners to co-locate in the same cloud. It's unfortunate that they charge on the order of $100 per terabyte, but it is what it is. However, bioinformatics files compress well, and compared to getting something out of an on-prem traditional network, the egress fees are a bargain.
My customer has a rural site with a "WAN" link. Their non-cloud option is to buy a NAS, replicate it to another NAS in a data center, and then build a permanent "file sharing solution". This is going to cost tens of thousands of dollars. They might share a few terabytes annually, which makes cloud egress fees look practically free in comparison.
> A new Illumina NovaSeq sequencer (about the size of a copy machine) with dual S4 flow cells produces 6Tb every couple days.
The scientists I talked to raised this, and to be honest I'm also concerned, especially as some of the groups I deal with are in rural areas "far from the cloud." (They analyse samples from cattle ranchers to try and prevent foot and mouth disease.)
Let's say the machine generates 6 terabytes in 2 days, so 3 TB daily. Assuming that's the uncompressed data, it'll be about 1 TB after compression. The location I'm thinking of has a 500 Mbps link, but that can transfer that data volume in just 4 hours: https://www.wolframalpha.com/input?i=%28+1+TB+%29+%2F+500+Mb...
One trick I discovered recently is that both the s3cmd and azcopy tools can take pipeline input. Combine that with a parallel compression tool like 'pigz' that can output to the pipeline, and you can have a workflow where the "raw" input files are compressed and streamed to the cloud storage at the same time. That alone can cut hours off the transfer time!
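As a concrete illustration of that trick, here is a minimal sketch. It assumes pigz plus the AWS CLI (whose "aws s3 cp - <dest>" form reads the object body from stdin) rather than s3cmd/azcopy, and the file and bucket names are made up.

    # Sketch of "compress while you upload": pigz writes gzip to a pipe and the
    # AWS CLI reads the object body from stdin, so nothing extra touches disk.
    # Paths and bucket names are placeholders; s3cmd and azcopy have similar
    # pipe-friendly modes, as described above.
    import subprocess

    src = "run_folder/raw_reads.fastq"
    dest = "s3://example-bucket/raw/raw_reads.fastq.gz"

    pigz = subprocess.Popen(["pigz", "-c", src], stdout=subprocess.PIPE)
    subprocess.run(["aws", "s3", "cp", "-", dest], stdin=pigz.stdout, check=True)
    pigz.stdout.close()
    pigz.wait()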
> If you have 100 TB of data, that increases the cost of re-running data – especially if it's in cold storage or archived.
Not necessarily. Azure Cold storage access has no special charges associated with it. It's a bit slower, but streaming reads were quite fast in my experience. Archive tier of course has some additional costs, but it's not a drama in most cases. For example, "high priority" retrieval costs extra, but normal priority appears to be free.
> what software are they using in bioinformatics that uses a GPU
"An example is the Smith-Waterman algorithm for genomics processing".
Admittedly, GPU usage for genomics is still rare, but it is becoming more common.
> If you want instant scalability, the math is different.
In my example, think of 10-20 scientists doing semi-regular gene sequencing workloads and running related analyses. Sometimes needing a single machine with 2TB of memory, other times running 10,000 trivial jobs.
In principle, the flexibility of the cloud is nearly optimal for a scenario like this.
Data Transfer OUT From Amazon EC2 To Internet
First 10 TB / Month $0.09 per GB
Next 40 TB / Month $0.085 per GB
Next 100 TB / Month $0.07 per GB
Greater than 150 TB / Month $0.05 per GB
Which means if you transfer out 90 TB in one month, it's $0.09 * 10000 + $0.085 * 40000 + $0.07 * 40000 = $7100.
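For other volumes, the same tiered arithmetic can be wrapped in a small helper; the price breaks are the ones quoted above, and any free-tier allowance is ignored, so treat the output as illustrative.

    # Tiered egress cost for the per-GB price breaks quoted above.
    TIERS = [                  # (tier size in GB, price per GB)
        (10_000, 0.09),        # first 10 TB
        (40_000, 0.085),       # next 40 TB
        (100_000, 0.07),       # next 100 TB
        (float("inf"), 0.05),  # everything above 150 TB
    ]

    def egress_cost(gb_out: float) -> float:
        cost, remaining = 0.0, gb_out
        for size, price in TIERS:
            chunk = min(remaining, size)
            cost += chunk * price
            remaining -= chunk
            if remaining <= 0:
                break
        return cost

    print(egress_cost(90_000))  # 90 TB -> 7100.0, matching the figure above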
It always was that way for loads that don't allow autoscaling to save you money; the savings were always from the convenience of not having to do ops and pay for ops.
Then again, part of the ops cost you save is paid again in the salaries of devs who have to deal with AWS stuff instead of just throwing a blob of binaries at ops and letting them worry about the rest.