Intel announces the Aurora supercomputer has broken the exascale barrier (intel.com)
146 points by mepian 5 months ago | 140 comments



More context: This is related to today's release of the Spring Top 500 list (https://news.ycombinator.com/item?id=40346788). Aurora rated 1,012.00 PetaFLOPS/second Rmax, and is in 2nd place, behind Frontier.

In the November 2023 list, Aurora was also in second place, with an Rmax of 585.34 PetaFLOPS/second.

See https://www.top500.org/system/180183/ for the specs on Aurora, and https://www.top500.org/system/180047/ for the specs on Frontier.

See https://www.top500.org/project/top500_description/ and https://www.top500.org/project/linpack/ for a description of Rmax and the LINPACK benchmark, by which supercomputers are generally ranked. The Top 500 list only includes supercomputers that are able to run the LINPACK benchmark, and where the owner is willing to publish the results.
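
For a rough sense of what an Rmax figure measures, here's a toy single-node sketch (Python/numpy, not the actual HPL code) that times a dense solve and converts it to FLOP/s using the standard HPL operation count; the problem size is arbitrary:

  # Toy LINPACK-style measurement: solve a dense Ax = b and report GFLOP/s.
  # Real HPL runs a distributed LU factorization across the whole machine;
  # this single-node numpy sketch only illustrates how the number is derived.
  import time
  import numpy as np
  n = 4000                            # arbitrary problem size
  rng = np.random.default_rng(0)
  A = rng.standard_normal((n, n))
  b = rng.standard_normal(n)
  t0 = time.perf_counter()
  x = np.linalg.solve(A, b)           # LU factorization + triangular solves
  elapsed = time.perf_counter() - t0
  flops = (2 / 3) * n**3 + 2 * n**2   # standard HPL operation count
  print(f"{flops / elapsed / 1e9:.1f} GFLOP/s (double precision)")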

The jump in Aurora's Rmax score is explained by Aurora's difficult birth. https://morethanmoore.substack.com/p/5-years-late-only-2 (published when the November 2023 list came out) has a good explanation of what's been going on.


Looking at the two specs, it's interesting to see how Frontier (first place, running AMD CPUs) has much better power efficiency than Aurora (second place, running Intel): 18.89 kW/PFLOPS vs 38.24 kW/PFLOPS respectively... Good advertisement for AMD? :)
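
Those kW/PFLOPS figures fall straight out of the published power and Rmax numbers; a quick sketch, using approximate values from the Top500 spec pages (roughly 22.8 MW / 1,206 PFLOPS for Frontier and 38.7 MW / 1,012 PFLOPS for Aurora):

  # Back-of-the-envelope power efficiency from (approximate) Top500 figures.
  systems = {
      "Frontier": {"rmax_pflops": 1206.0, "power_mw": 22.8},
      "Aurora":   {"rmax_pflops": 1012.0, "power_mw": 38.7},
  }
  for name, s in systems.items():
      kw_per_pflops = s["power_mw"] * 1000 / s["rmax_pflops"]
      print(f"{name}: {kw_per_pflops:.1f} kW/PFLOPS")
  # -> Frontier ~18.9, Aurora ~38.2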


These days this is true from top to bottom: desktop, servers, ... Even in gaming, the 7800X3D is cheaper than the 14700K, it is also more performant, and yet uses roughly 20% less power at idle, and the gap only grows at full load.

AMD's current architecture is very power efficient, and Intel has more or less resorted to overfeeding watts to catch up in performance.


Is there any good estimate of how much of AMD’s power efficiency advantage can be attributed to TSMC’s process vs Intel’s? I know in GPUs AMD doesn’t enjoy the same advantage vs nVidia since they’re both manufactured by TSMC, and with nVidia actually being on a smaller node, iirc.


7800x3d maxes out around 80 watts (has to be gentle to the vcache), the 14900k can go up to 300w (out of box, though Intel is issuing a new bios to limit that), and they trade blows in gaming.

I would say that's a bit more than process efficiency?

https://youtu.be/2MvvCr-thM8?t=423


Oh, certainly there are significant architectural advantages, especially for the vcache SKUs in gaming. It would just be interesting to see how much TSMC is still (or maybe further) ahead of Intel. Intel was so used to having the process advantage vs AMD that their architecture could afford to be less efficient. But now that they're the ones behind in both process and arch, they're really hurting, especially on mobile now that AMD is making inroads and Snapdragon X is about to get a serious launch in a week. I'm typing this on a ThinkPad 13s with a Snapdragon 8cx CPU running Windows, and it's a pretty usable device that lasts much longer on a smaller battery than my comparable Intel laptop. It seems to particularly use much less power on standby, although it can't seem to wake up from hibernation reliably.


Aurora has 21K Xeons and 64K Intel X(e) GPUs which provide most of the compute power. The GPUs are made by TSMC.

https://en.wikipedia.org/wiki/Intel_Xe


I was under the impression that AMD desktops/home servers generally don't go below 15-20 W, while Intel can get down to 4-6 W idle for the full system. Has that changed? AMD seems to generally be the better perf/$, but I thought power usage at idle was their big drawback for desktops/low-usage servers.

IIRC the numbers I've read are that (at least desktop) Intel CPUs should be using something like 0.2 W package power at idle if the OS is correctly configured, regardless of whether it's a performance (K) or "efficiency" (T) model. Most power usage is the rest of the system.


https://en.wikipedia.org/wiki/Cool%27n%27Quiet

They both have similar frequency and voltage scaling algorithms at this point. You will probably not see 0.2W idle though, both probably idle around 10W on desktop and 5W on laptop. But Intel is getting much more aggressive with "turbo boost" to try to hide their IPC/process deficit vs. AMD/TSMC, to the point that a 14900k will use 120W+ to match the performance of a 7800x3d at 60W.


As far as I can gather, that's not the case. These guys[0] have been crowdsourcing information about power efficiency for a while now, and the big takeaways right now seem to be that

* Intel is the best for idle (there's several people that have systems that run at less than 5 W for the full system using modified old business minipcs off ebay). Allegedly someone has a 9500T at less than 2 W full system power.

* It doesn't matter which Intel processor you use; all of them for many years will get down to 1 W or less for the CPU at idle. A 14900K will idle just as well as an 8100T, which will be much better than a Ryzen 7950X.

* AMD pretty much never gets below 10 W with any of the Ryzen chiplet CPUs. Only their mobile processors can do it, but they don't sell them retail and they're usually (always?) soldered.

* Every component except the CPU is more important. Your motherboard and PCIe devices need to support power management. You need an efficient PSU (which has nothing to do with the 80-plus rating, which doesn't consider power draw at idle). One bad PCIe device like an SSD or a NIC can draw 10s of watts if it breaks sleep states. Unfortunately, this information seems to be almost entirely undocumented beyond these crowdsourcers.

For a usually idle home-server, Intel seems to be better for power usage, which is unfortunate because AMD tends to have more IO and supports ECC.

[0] https://www.hardwareluxx.de/community/threads/die-sparsamste...


Also the delta between theoretical performance and benchmarked performance is much smaller for Frontier (AMD) than for Aurora (Intel).

That being said, note that the software is also different on the two computers.


Wouldn't be surprised if it's the same thing: more watt usage, more heat, more throttling.


Note that all mentions of FLOPS in this thread refer to FP64 (double precision), unlike the more popular "AI OPS" figures specified for modern GPUs, which are typically INT8.


> which are typically INT8

These systems are used for training which is VERY rarely INT8. On Frontier, for example, it's recommended to use bfloat16 or float32 if that doesn't work for you/your application.

Nvidia has FP8 with >=Hopper and supposedly AMD MI300 has it as well although I have no experience with the MI300 so I can't speak to that.


What does FLOPS/second mean? Isn’t FLOPS already per second? Are they accelerating?


I'd actually be interested in an estimate of the world's overall flop/s^2. Could someone please run a back of the envelope calculation for me, e.g. looking at last year's data?


We added 6.3 gigaFLOPS per second in 2022-23, based on an increase of 200 million gigaFLOPS observed over that period. This is in contrast to 20 gigaFLOPS per second in 2021-22. It's nil in 2020-21, but that seems only partially attributable to the pandemic, as there appears to be a tick-tock pattern going back to 2013.

https://ourworldindata.org/grapher/supercomputer-power-flops
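
Taking the figure above at face value (an aggregate Rmax increase of roughly 200 million gigaFLOPS over 2022-23), the arithmetic is just:

  # "Acceleration" of aggregate Top500 performance, 2022-23.
  # The 200 million GFLOPS delta is the figure quoted above, not mine.
  delta_gflops = 200e6
  seconds_per_year = 365.25 * 24 * 3600
  print(delta_gflops / seconds_per_year)   # ~6.3 GFLOPS added per second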


Yeah, the top500 pages cited use Flop/s (apparently using "Flop" for "Floating point operations" – not sure which "o" and "p" are intended). I could swear I've seen FLOPS expanded specifically as "FLoating point Operations Per Second" when I first encountered it. FLOPS/s seems to be using "FLOPS" like the "Flop" above (probably as "FLoating point OPerationS"), in which case the "/s" makes sense.



Made me chuckle. F=ma, where a is the derivative of FLOPS with respect to time.


Some people treat FLOPS as “FLoating point OPerationS”.


But that doesn't make much sense in comparison to evaluating system performance. A Pentium III could perform a billion FP32 operations given almost 16 years, but you wouldn't say it's 1 GFLOPS. Assuming the "S" is seconds, it becomes a useful metric and we can say it has 2 FLOPS.
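
For the record, the arithmetic behind that example (the 16-year span is just the comment's illustrative number):

  # A billion floating point operations spread over ~16 years is about 2 per
  # second, which only reads as "2 FLOPS" if the S means seconds.
  ops = 1e9
  seconds = 16 * 365.25 * 24 * 3600
  print(ops / seconds)   # ~1.98 operations per second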


Then it should be "FLOPs" to indicate that the S is not a separate word in the acronym, just the plural form.


It’s not an acronym, it’s an abbreviation. Unfortunately, the rules for abbreviations are effectively arbitrary (or specific to some etymology that’s not available from context). The “s” could be op(s) or seconds, but back in 90s trade publications it was seconds.


I don't know anything about supercomputer architecture; are lifetime upgrades that double the performance typical, let alone YoY?

What do those kinds of upgrades entail from a hardware side? Software side? Is this just a horizontal scaling of a cluster?


This isn’t really an upgrade, it’s the system still being commissioned.

See the last paragraph of my post for a link to more info.


Serious question: my understanding of HPC is that there are many workloads running on a given supercomputer at any time. There is no singular workload that takes up the entire or most of the resources of a supercomputer.

Is my understanding correct? If yes, then why is it important to build supercomputers with more and more compute? Wouldn't it be better to build smaller systems that focus more on power/cost/space efficiency?


There are many variables that go into supercomputers; "company/country propaganda" is just one of them.

Supercomputer admins would love to have a single code that used the whole machine, both the compute elements and the network elements, at close to 100%. In fact they spend a significant fraction on network elements to unblock the compute elements, but few codes are really so light on networking that the program scales to the full core count of the machine. So, instead they usually have several codes which can scale up to a significant fraction of the machine and then backfill with smaller jobs to keep the utilization up (because the acquisition cost and the running cost are so high).
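
A minimal sketch of that "big jobs first, then backfill with small ones" idea, with made-up job names and sizes; real schedulers such as Slurm also account for walltime, priority, fair-share, reservations, and so on:

  # Greedy sketch of "run the big jobs, backfill with small ones".
  # Job names and sizes are hypothetical.
  TOTAL_NODES = 10_000
  queue = [("climate_hero", 8_000), ("qcd", 3_000), ("md_batch", 512),
           ("post_proc", 64), ("viz", 32), ("test", 8)]   # priority order
  free = TOTAL_NODES
  scheduled = []
  for name, nodes in queue:      # start anything that still fits
      if nodes <= free:
          scheduled.append(name)
          free -= nodes
  print(scheduled, f"utilization={(TOTAL_NODES - free) / TOTAL_NODES:.0%}")
  # -> the 3,000-node job has to wait, but backfill keeps ~86% of nodes busy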

Supercomputers have limited utility: beyond country bragging rights, only a few problems really justify spending this kind of resource. I intentionally switched my own research in molecular dynamics away from supercomputers (where I'd run one job on 64-128 processors for a 96X speedup) to closet clusters, where I'd run 128 independent jobs for a 128X speedup, but then have to do a bunch of post-processing to make the results comparable to the long, large runs on the supercomputer (https://research.google/pubs/cloud-based-simulations-on-goog...). I actually was really relieved when my work no longer depended on expensive resources with little support, as my scientific productivity went up and my costs went way down.

I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.


Ever used GPT-3, DALL-E, or other LLMs?

The GPUs used to train them only existed because the DoE explicitly worked with Nvidia on a decade-long roadmap for delivery in its various supercomputers, and would often work in tandem with private sector players to coordinate purchases and R&D (for example, protein folding and just about every Big Pharma company).

Hell, the only reason AMD EPYC exists is for the same reason.


Yes. In my computational history I have: used the largest non-classified DOE supercomputers, built my own modest closet clusters, developed an embarrassingly parallel computing system using Google's idle prod cycles, and helped debug training of LLMs when I worked on the TPU team at Google. I work for big pharma now (and my phd is in biophysics) and I'm also comfortable with other HPC domains.

I know the DOE/Nvidia history quite well as the Chief Scientist of NVIDIA visited LBL around 2005(6? 7?) and talked about the new hardware they were just starting to build and sell, with the goal of getting it into supercomputers.

We asked if they had double precision performance yet (because that was a must for many supercomputer jobs), but at the time, nvidia DP was still lagging SP (I guess it still does?) and we also quibbled about their non-compliance with some esoteric details in IEEE 754. The best part of the whole talk was when he walked us through the idea of visualizing our operations by drawing the matrices as textures, because you can easily see the NaNs- they render as nvidia Green!

I left DOE (Berkeley Lab) shortly after to work in industry because it was clear that ML wasn't going to be innovated in the government labs.


Thank you for that comment, as someone who has had exposure to the DOE as well (and LBL at that) I can only echo your sentiment. I'd even go further and state that it is hard for me to believe that any innovation can happen in the calcified structures of governmental labs. Maybe the classified ones are different.

The "DOE made NVIDIA" myth is a story I haven't seen pushed outside the DOE complex. It is true that the supercomputers the DOE pushes could be considered industry subsidies, by providing industry companies with a steady customer with a very high tolerance for unfinished products. That applies to NVIDIA, AMD, Intel, HPE/Cray and IBM more or less equally.

I also want to stress what often gets overlooked: supercomputers are hell to operate and use. Aurora runs on Slingshot, a Cray interconnect. Those things look good on paper. Examples: Cray Aries (and "network quiesces") or Cray DataWarp. Who knows how Slingshot actually works in practice, for a hero run it only needs to hold things together for a few hours. As long as you get a high TOP500 ranking, a supercomputer is a success.

There is no market for those things anymore and they are beholden to the same economics as everything else; hence codes that can't afford an army of postdocs to work around the bugs and design decisions that are only necessary due to the scale of those systems are better suited to plain old mid-range clusters. And I haven't even mentioned the eccentric userland of supercomputers.

There are many reasons the DOE affords to run those behemoths. Some more trivial and petty than most people would like to believe. Like the author of the parent post, I have come to believe that the best bang for the buck on scientific output can be found elsewhere.


> I have come to believe that the best bang for the buck on scientific output can be found elsewhere.

The best bang for the buck is never at the very top. The top is just for the biggest bang.


>The GPUs used to train them only existed because the DoE explicitly worked with Nvidia

Do you have a source for this claim? Isn't e.g. an H100 basically just an RTX GPU with more and faster memory? (Or, at least, an RTX GPU with the same VRAM as an H100 would perform similarly.) And these GPUs were created to run video games. Unless you are referring to something like NVLink?


> Unless you are referring to something like NVLink

Yep

> an H100

H100 NVL is the SKU for HPC.


> I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.

Or doing Numerical Weather Prediction. :-)

But seriously, as a cluster sysadmin, the “128 jobs, followed by post-processing” is great for me, because it lets those separate jobs be scheduled as soon as resources are available.

> expensive resources with little support

Unfortunately, there isn't as much funding available in places for user training and consultation. Good writing & education is a skill, and folks aren't always interested in a job that is term-limited, or whose future is otherwise unclear.


Not all jobs can be parallelized in that manner without communication.


Thanks for the great reply and for linking the research paper - I am going to check it out. Could you suggest a good review paper for someone new to get into modern HPC?


It depends what you are looking for. This provides a good overview:

https://hpc.llnl.gov/documentation/tutorials/introduction-pa...

This is a very detailed free book focusing on programming:

https://theartofhpc.com

But HPC is very diverse. Some care about compute performance, others about memory bandwidth and others about IO performance. Some run a ton of small jobs while others run a single large job.


Most of the time yes, HPC systems are shared among many users. Sometimes though the whole system (or near it) will be used in a single run. These are sometimes referred to as "hero runs" and while they're more common for benchmarking and burn-ins there are some tightly-coupled workloads that perform well in that style of execution. It really depends on a number of factors like the workloads being run, the number of users, and what the primary business purpose of the HPC resource is. Sites that have to run both types of jobs will typically allow any user to schedule jobs most of the time but then pre-reserve blocks of time for hero runs to take place where other user jobs are held until the primary scheduled run is over.


Thanks for the reply! Can you give some examples of these "hero runs"?


At our university, at least when I studied there some 15 years ago, the whole cluster was occupied doing weather predictions each and every night.

No point in staying up waiting for a job, it'd get rescheduled in the early morning at best.

It wasn't the largest cluster around, IIRC 768 quad-core nodes, but I'm sure the meteorological department would find a way to utilize any extra capacity, so still requiring the whole thing all night.


Feels like this might be an invariant in computer science. I once worked on an IBM/360. The majority of the compute was taken up by a single person who did weather simulation.

An IBM/360 has laughably less compute than your phone.


ENIAC ran weather as well. It's a kind of numerical simulation that will always be in need of compute cycles. We'll need more of it as we go deeper into climate change.


"Computational fluid dynamics is hard" seems like a true bona fide CS invariant.


It might still have produced a better weather prediction than my phone, though /s


I have a project on Frontier. Generally these systems (including Frontier) use slurm[0] for scheduling and workload management.

The OLCF Frontier user guide[1] has some information on scheduling and Frontier specific quirks (very minor).

Current status of jobs on Frontier:

[kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l

137

[kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l

1016

The running jobs are relatively low because there are some massive jobs using a significant number of nodes ATM.

[0] - https://slurm.schedmd.com/documentation.html

[1] - https://docs.olcf.ornl.gov/systems/frontier_user_guide.html

EDIT: I give up on HN code formatting


> EDIT: I give up on HN code formatting

Just FYI: https://news.ycombinator.com/formatdoc

> Text after a blank line that is indented by two or more spaces is reproduced verbatim. (This is intended for code.)

  [kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l

  137

  [kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l

  1016


Yeah I've seen that but was annoyed by not being able to just use backticks like everywhere else.

Oh the irony of using Frontier but not "understanding" HN formatting ;).


> There is no singular workload that takes up the entire or most of the resources of a supercomputer.

I performed molecular dynamics simulations on the Titan supercomputer at ORNL during grad school. At the time, this supercomputer was the fastest in the world.

At least back then around 2012, ORNL really wanted projects that uniquely showcased the power of the machine. Many proposals for compute time were turned down for workloads that were “embarrassingly parallel” because these computations could be split up across multiple traditional compute clusters. However, research that involved MD simulations or lattice QCD required the fast Infiniband interconnects and the large amount of memory that Titan had, so these efforts were more likely to be approved.

The lab did in fact want projects that utilized the whole machine at once to take maximum advantage of its capabilities. It’s just that oftentimes this wasn’t possible, and smaller jobs would be slotted into the “gaps” between the bigger ones.


> my understanding correct

Yes

> why is it important to build supercomputers with more and more compute

A mix of

- research in distributed systems (there are plenty of open questions in Concurrency, Parallelization, Computer Architecture, etc)

- a way to maintain an ecosystem of large vendors (Intel, AMD, Nvidia and plenty of smaller vendors all get a piece of the pie to subsidize R&D)

- some problems are EXTREMELY computationally and financially expensive, so they require large On-Prem compute capabilities (eg. Protein folding, machine learning when I was in undergrad [DGX-100s were subsidized by Aurora], etc)

- some problems are extremely sensitive for national security reasons and it's best to keep all personnel in a single region (eg. Nuclear simulations, turbine simulations, some niche ML work, etc)

In reality you need to do both, and planners have known this for decades.


You are generally correct, however there are workloads that do use larger portions of a supercomputer that wouldn't be feasible on smaller systems.

Also, I guess I'm not sure what you mean by "smaller systems that focus more on power/cost/space". A proper queueing system generally efficiently allocates the resources of a large supercomputer to smaller tasks, while also making larger tasks possible in the first place. And I imagine there's somewhat an efficiency of scale in a large installation like this.

There are, of course, many many smaller supercomputers, such as at most medium to large universities. But even those often have 10-50k cores or so.

(In general, efficiency is a consideration when building/running, but not of using. Scientists want the most computational power they can get, power usage be damned :) )

edit: A related topic is capacity vs. capability: https://en.wikipedia.org/wiki/Supercomputer#Capability_versu...


From my experience they are running whole-cluster dedicated jobs quite frequently. Climate models can use whatever resources they get, and nuclear weapons modelling, especially for old warheads, can use a lot.


What is being calculated with nuclear weapons? I understand it must have been computationally expensive to get them working, but once completed, what is there left to calculate?


I don't work in this area, but think about all of the variables that go into warhead maintenance. Your single supercomputer simulation can show that the warhead should work as-designed, but what will happen after the Plutonium pit has been sitting for a decade, slowly decaying? Will satisfactory implosion still happen if storage conditions slightly change the performance of the conventional explosives?

Since modern warheads are all fusion-type warheads, there's also the fusion stage to consider with even more highly classified top-secret sauce. It appears that the conditions for fusion are triggered by radiation pressure, and that likely makes things even more complicated. Now, you need not just a successful supercritical fission event, but one of the right shape(?), timing, and interaction with other secret-sauce materials that might have their own degradation curves.

So, rather than simulate one design, now you need to simulate hundreds to thousands to explore the full decay-over-time space. Getting the answer wrong means either very expensive premature warhead refurbishments or a nuclear stockpile that wouldn't work properly.


You can’t test idle weapons to make sure they still go boom in real life, so you have to simulate it.


The original nuclear powers are accumulating really old atomic bombs / rockets. We don't know for sure what is going on inside the warheads (or possibly a subsection of the warhead).

Cracking them open to have a look may not be a good idea, but leaving them alone for another few decades might not be wise either.

Funny fact: for a lot of the nuclear weapons that have been destroyed, the warheads were just removed from the bombs or rockets. In many cases those warheads were then moved into storage, ready to slap back on a rocket if that should become needed.


>The original nuclear powers are accumulating really old atomic bombs / rocket

The US is replacing its nukes: the new nukes have all new parts except the fissile material. Russia is ahead of the US here and has finished replacing its Soviet-era nukes in this way up to the limit of what they are allowed to deploy under the START treaties. (I.e., they might have some Soviet-era nukes, but if so, they need to stay in storage for Russia to stay in compliance with its treaty obligations.)

US and Russia no longer explode nukes to make sure they still work, which is where simulations using supercomputers come in.


Bigger systems when utilized at 100% are more efficient than multiple smaller systems when utilized at 100%, in terms of engineering work, software, etc.

But also, bigger systems have more opportunities to achieve higher utilization than smaller systems due to the dynamics of the bin packing problem.
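
A toy illustration of the bin-packing point, with a made-up job mix and plain first-fit placement rather than a real scheduler:

  # Same job mix, first-fit packing: one big machine vs. ten small ones.
  # Job sizes are made up; this is not a real scheduler.
  jobs = [4096, 2048, 2048, 1024, 512, 512, 256, 128, 64, 64, 32, 16, 8, 8]
  def first_fit_utilization(machine_sizes):
      free = list(machine_sizes)
      used = 0
      for j in jobs:
          for i, f in enumerate(free):
              if j <= f:           # place the job on the first machine it fits
                  free[i] -= j
                  used += j
                  break
      return used / sum(machine_sizes)
  print(first_fit_utilization([10_000]))       # one 10k-node machine -> 1.0
  print(first_fit_utilization([1_000] * 10))   # ten 1k-node machines -> 0.16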


In general there can be many smaller workloads running in parallel. However, periodically the whole supercomputer can be reserved for a "hero" run.


there’s also Bell Prize submissions, which is the only time some machines get completely reserved


Off topic but "Break the ___ barrier" has got to be my least favorite expression that PR people love. It's a "tell" that an article was written as meaningless pop science fluff instead of anything serious. The sound barrier is a real physical phenomenon. This is not. There's no barrier! Nothing was broken!


The Exascale barrier is an actual barrier in HPC/Distributed Systems.

It took 15-20 years to reach this point [0]

A lot of innovations in the GPU and distributed ML space were subsidized by this research.

Concurrent and Parallel Computing are VERY hard problems.

[0] - http://helper.ipam.ucla.edu/publications/nmetut/nmetut_19423...


The comment isn't saying that the benchmark isn't useful. They are saying that there is no 'barrier' to be broken.

The sound barrier was relevant because there were significant physical effects to overcome specifically when going transonic. It wasn't a question of just adding more powerful engines to existing aircraft. They didn't choose the sound barrier because it's a nice number; it was a big deal because all sorts of things behaved outside of their understanding of aerodynamics at that point. People died in the pursuit of understanding the sound barrier.

The 'exascale barrier', afaict, is just another number chosen specifically because it is a round(ish) number. It didn't turn computer scientists into smoking holes in the desert when it went wrong. This is an incremental improvement in an incredible field, but not a world changing watershed moment.


When Exascale was defined as a barrier in the mid-late 2000s, a lot of computing technologies and techniques that are taken for granted today did not exist outside of the lab.

For example, FPGAs were considered as much more viable for sparse matrix computation instead of GPUs, BLAS implementations were not as robust yet, parallel programming APIs like CUDA and Vulkan were in their infancy, etc.

Just because you didn't do well in your systems classes or you think spinning up an EC2 instance on AWS is "easy" doesn't mean it's an easy problem.

That's like saying Einstein or Planck are dummies because AP Physics E&M students can handle basic relativity and quantum theory.


Exascale was an arbitrary target- it didn't unlock any magical capabilities. In fact the supercomputer folks are now saying the real barrier that will unlock their science is "zettaflops" (just saw a post from a luminary about it).

(also please don't be rude or condescending, it detracts from your argument)


Exascale was the primary target decided on by the HPC community in the mid-late 2000s because FLOPS (floating point operations per second) is the unit used to benchmark, as other variables like compiler, architecture, etc. are very difficult to account for.

It's functionally the same as arguing that GB or TB are arbitrary units to represent storage.


Yes, that's correct: GB and TB are arbitrary units. We use them because historically, successive multiples of 1000 have been used to represent significant changes in size, mass, and velocity.

I used to work with/for NERSC and was around when they announced exascale as a target, and now they want to target zettascale. There is no magic threshold where science simulations suddenly work better as you scale up. It's mainly about setting goals that are 10-15 years away to stimulate spending and research.


> I used to work with/for NERSC and was around when they announced exascale as a target

We most likely crossed paths. How close were you to faculty in AMPLab?

> It's mainly about setting goals that are 10-15 years away to stimulate spending and research.

And that's not a barrier to you?

That feels very condescending about applied research or the amount of effort put into the entire Distributed Systems field.


It’s not a barrier because there is nothing qualitatively different at 999 vs 1000. It’s just a goal.

This is not condescending to the field at all. Crossing an arbitrary goal that is very difficult to get to is still impressive. Just stop using the “breaking the barrier” phrase.


I am being condescending about the effort put into the distributed systems field- very specifically, about classical supercomputers.

How close was I to faculty in AMPLab? Pretty close; I attended a retreat one year, helped steer funding their way, tried to hire Matei into Google Research, and have chatted with Patterson extensively around the time he wrote this (https://www.nytimes.com/2011/12/06/science/david-patterson-e...) then later when he worked on TPUs at Google.

(I'm not a stellar researcher or anything, really more of a functionary, but the one thing I do have is an absolutely realistic understanding of how academic and industrial HPC works)


> > It's mainly about setting goals that are 10-15 years away to stimulate spending and research.

> And that's not a barrier to you?

In what way is that a barrier? Barrier and goal aren't synonymous, it's just marketing speak that confuses them.

Running a marathon is not a barrier, it's a goal, even if many people can't reach it. A combat zone at the halfway point of the marathon is a barrier because it requires a completely different approach to solve.


> There is no magic threshold where science simulations suddenly work better as you scale up.

For certain problems there are absolutely magic thresholds. ML was famously abandoned for 2 decades because the computers were too slow, and the ML revolution has only been possible because of having a ~teraflop on a per-machine level. Weather and climate models are another where there have been concrete compute targets, a whole earth model requires about 10 teraflops (hence the earth simulator super computer). ~1 meter resolution is an exascale level target.


What specifically about the system changes as you cross from "almost exascale" to "definitely exascale" to justify calling it a 'barrier'?


The same reason we choose to define a Gigabyte as 10^9 or use the Richter scale to measure earthquakes.

We need a benchmark to delineate between large magnitudes.

Furthermore, there are very real engineering problems that had to be solved to even reach this point.

A lot of noobs take GPUs, Scipy, BLAS, CUDA, etc for granted when in reality all of these were subsidized by HPC research.

The DoE's Exascale project was one of the largest buyers for compute for much of the 21st century and helped subsidize Nvidia, Intel, AMD, and other vendors when they were at their worst financially.


Did you even read the question?


You're wasting your time by engaging here.


Indeed.

I rarely do these days.


And I answered - building the logistics and ecosystem


You did not answer the question. Reaching 0.7 exaflops also requires those logistics and ecosystem. You didn't say what changes when you reach 1.0 (because it's nothing).

It's an easy to understand mark on a very smooth difficulty curve. Not a barrier.


He asked what is so significant about exascale that it's a "barrier".

Now granted, the original rendition of this saying ("Breaking the sound barrier.") is also arbitrary because Mach 1 is the speed of sound travelling through air on planet Earth, but it's still a valid question that you did not answer.


Breaking the sound barrier isn't arbitrary. Indeed even measuring it using the Mach number shows that. Mach 1 isn't fixed, it is variable based on a number of attributes.

At Mach numbers above 1 the compressibility of the air is entirely different. The medium in which the airplane operates behaves differently, in other words. The sound barrier was a barrier because the planes they were using stopped behaving predictably at Mach > 1. They had to learn to design planes differently if they wanted to fly at those speeds.

Mach 1 is an external constraint mandated by the laws of physics. There is a good reason that sound can't travel faster.

That is why it is a barrier to be broken. It is a paradigm shift imposed entirely by the properties of our physical world.


> Just because you didn't do well in your systems classes or you think spinning up an EC2 instance on AWS is "easy" doesn't mean it's an easy problem.

Please leave personal attacks out of this, it is not in the spirit of HN, or in helping to see people's perspectives.

I'm not saying its not a hard problem, not at all. I respect the hell out of the work that has been done here.

I'm saying that the sound barrier is called a barrier for a very good reason. Aerodynamics on one side of the sound barrier are different than aerodynamics on the other. It is a different game entirely. That is why it is considered a barrier. A plane that has superb subsonic aerodynamics will not perform well on the other side of that barrier.

Exascale computers on the other hand, while truly amazing, are not operating differently by hitting 10^18 FLOPS. If your computer does 10^18 - 20 FLOPS it is not operating under a fundamentally different set of rules than one running above the exaflop benchmark.

I never said that the achievement wasn't laudable. I argued that there is no barrier there.

If I'm wrong I would appreciate you explaining why doing things at 10^18 FLOPS is fundamentally different than computing just below that benchmark.


Because the unit used to measure performance is FLOPS [0], and most societies have settled on base-10 as their numerical system of choice for thousands of years. Furthermore, it is not possible to predict the exact number of cycles you'll need, since that depends on the architecture, compiler and many other factors. This is why FLOPS are used as the unit of choice.

Each jump in flops by a magnitude of 10^3 is a significant problem in concurrency, IO, parallelism, storage, and existing compute infrastructure.

Managing racks is difficult, managing concurrent workloads is difficult, managing I/O and storage is difficult, designing compute infra like FPGAs/GPUs/CPUs for this is difficult, etc.

[0] - https://en.m.wikipedia.org/wiki/FLOPS


So it's just an arbitrary base-10 number, and not a constraint imposed by physics or some other outside constraint like the sound barrier?

That's kind of my point, an Exaflop is a benchmark and not a 'barrier'. The sound barrier wasn't a 10^3 change in speed, the difference between a subsonic plane and a supersonic plane is measured in percentages when it comes to speed, e.g. a plane that is happy at .8 Mach for a top speed is going to be designed under a completely different set of rules than one that tops out at 1.2 Mach.

Again, I'm not saying that the accomplishments are insignificant, I'm just arguing press-release semantics here.


If you're not being facetious I recommend listening to the 2022 ACM Gordon Bell Award Winner lecture.

What you're doing is the equivalent of asking why do we use Gigabyte or Terabyte as a metric.

Just reaching 10^18 floating point operations per second was not something that was done until 2022.


No, it’s not equivalent at all. Nobody is asking about the unit of measure.


It is definitely not incremental. If you watch some of the talks on the Gordon Bell prize work for Frontier, you see that the dynamics have changed completely.

Data, software stack, and I/O have suddenly become bottlenecks in multiple places. So yes, it is a watershed moment.


Can you explain how 10^18 FLOPS is fundamentally different than (10^18 - 20) FLOPS? Do the conventional rules of computing completely change at that exact number?


https://www.osti.gov/servlets/purl/1902810

See this for example. Different applications have different scales at which they reach similar problems. Exascale is a decent upper bound for most of the fields. If you really dig in deep, the bound may be found at a slightly lower value, > 500 PFlop/s. But it's a good rule of thumb to consider 1 EFlop/s to be safe.

Also see this https://irp.fas.org/agency/dod/jason/exascale.pdf


But that same point was made at every 3 levels of magnitude improvement in supercomputing history.


For example, in situ visualization is now preferred due to the I/O bottleneck in applications rather than the compute bottleneck. This seems different from previous generations.


“It didn't turn computer scientists into smoking holes in the desert when it went wrong. “

Shame, really, HPC could use a little bit of high stakes adventure to make it sexier. (Funny how risking death makes things more attractive ?)

There’s something about working with equipment where the line between top performance and a smoking hole is a matter of degree.

Also opens up a lot more Netflix production opportunities and I bet code safety would get a bump as well.


100% agreed. I might put my thoughts about it into an essay and name it "Breaking barriers considered harmful".


Was there actually a barrier at exascale? I mean, was this like the sound barrier in flight where there is some discontinuity that requires some qualitatively different approaches? Or is the headline just a fancy way of saying, "look how big/fast it is!"


One thing is having a bunch of computers. Another thing is to get them working efficiently on the same problem.

While I'm not sure exascale was something like the sound barrier, I do know a lot of hard work has been done to be able to efficiently utilize such large clusters.

Especially the interconnects and network topology can make a huge difference in efficiency[1] and Cray's Slingshot interconnect[2], used in Aurora[3], is an important part of that[4].

[1]: https://www.hpcwire.com/2019/07/15/super-connecting-the-supe...

[2]: https://www.nextplatform.com/2022/01/31/crays-slingshot-inte...

[3]: https://www.alcf.anl.gov/aurora

[4]: https://arxiv.org/abs/2008.08886


Diagonalization of working set and scaling-up and -out coordination. Some programs (algorithms) just have >= O(n) time and space, temporal-dependent "map" or "reduce" steps that require enormous amounts of "shared" storage and/or "IPC".


[flagged]


I specifically said I wasn't sure it's a barrier. My point was that you can't scale up without hard work. That is, you can't scale just by buying more of the same hardware.


> I specifically said I wasn't sure it's a barrier.

Why would it be?

> That is, you can't scale just by buying more of the same hardware.

What did the other super computers get wrong that made them slower?


It isn't comparable to the sound barrier, but it was still a challenge.

It took significantly longer than it should have if it was just business as usual: "At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018." [1]

We got the first true exascale system with Frontier in 2022.

Part of the problem was the power consumption and having a purely CPU based system online for an exascale job. From slide 12 from[2]: "Aggressive design is at 70 MW" and "HW failure every 35 minutes".

[1]: https://en.wikipedia.org/wiki/Exascale_computing

[2]: https://wgropp.cs.illinois.edu/bib/talks/tdata/2011/aramco-k...
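
To put the reliability side in perspective, a back-of-the-envelope sketch; the per-node MTBF and node count here are made-up illustrative values, not figures from the cited slides:

  # With enough nodes, even very reliable parts fail constantly system-wide.
  # Hypothetical numbers, not taken from the cited slides.
  node_mtbf_hours = 3 * 365 * 24   # assume each node fails about once per 3 years
  nodes = 50_000                   # rough scale of an exascale-class machine
  system_mtbf_min = node_mtbf_hours / nodes * 60
  print(f"~{system_mtbf_min:.0f} minutes between hardware failures")  # ~32 min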


It's a milestone, not a barrier.


Barrier sounds cooler in a press release than "we made it fast enough to surpass an arbitrary benchmark"


Yeah. Back in the 90s, "terascale" was the same kind of milestone buzzword that was being thrown around all the time.

Because of that, I felt a bit of nostalgia when I first saw consumer-accessible GPUs hitting the 1 TFLOP performance level, which now I suppose qualifies as a cheap iGPU.


I have no problem with “exascale” or whatever, it’s just not a barrier, there was nothing in particular to overcome to get there, it’s just a signpost along the normal path forwards.

Unlike the sound barrier in flight, it was for a while thought impossible to fly faster than sound and there are indeed models of subsonic flight that have an infinity in them at the speed of sound. It indeed took new models and radically different designs to fly faster than sound.

Making this computer do a certain number of calculations didn’t involve any such problem to overcome.


It's like a speedrunning barrier


On target to be somewhat slower than Frontier at double the power consumption.


And very, very late


Just use more efficiency cores that aren't efficient.


"and is the fastest AI system in the world dedicated to AI for open science"

Cool. Please ask it to sue for peace in several parts of the world, in an open way. Whilst it is at it, get it to work out how to be realistically carbon neutral.

I'm all in favour of willy waving when you have something to wave but in the end this beast will not favour humanity as a whole. It will suck up useful resources and spit out some sort of profit somewhere for someone else to enjoy.


They’re simply latching onto the AI buzzwords for the good press. Leadership class HPCs have been designed around GPUs for over a decade now, it just so happens they can use those GPUs to run the AI models in addition to the QCD or Black hole simulations etc that they’ve been doing for ages.


It's much more than that.

For example, I have an "AI" project on Frontier. The process was remarkably simple and easy - a couple of Google Meets, a two page screed on what we're doing, a couple of forms, key fob showed up. Entire process took about a month and a good chunk of that was them waiting on me.

Probably half a day's work total for 20k node hours (four MI250x per node) on Frontier for free, which is an incredible amount of compute that my early resource-constrained startup would never have been able to fathom on some cloud, etc. It was like pulling teeth to even get a single H100 x8 on GCP for what would cost at least $250k for what we're doing. And that's with "knowing people" there...

These press releases talking about AI are intended to encourage these kinds of applications and partnerships. It's remarkable to me how many orgs, startups, etc don't realize these systems exist (or even consider them) and go out and spend money or burn "credits" that could be applied to more suitable things that make more sense.

They're saying "Hey, just so you know these things can do AI too. Come talk to us."

As an added bonus you get to say you're working with a national lab on the #1 TOP500 supercomputer in the world. That has remarkable marketing, PR, and clout value well beyond "yeah we spent X$ on $BIGCLOUD just like everyone else".


Nvidia's entire DGX and Maxwell product line was subsidized by Aurora's precursor, and Nvidia worked very closely with Argonne to solve a number of problems in GPU concurrency.

A lot of the foundational models used today were trained on Aurora and its predecessors, as well as tangential research such as containerization (e.g. in the early 2010s, a joint research project between ANL's Computing team, one of the world's largest pharma companies, and Nvidia became one of the largest customers of Docker and sponsored a lot of its development).


We are in an age of incessant whining


…and, apparently, unintended irony.


No, the irony was understood. Some f-wad always likes to point out the “irony“. It’s not the first time I’ve addressed the problem. Human nature seems to head straight to complaining

That’s why it appears unstoppable.


Every single paragraph contains the word "AI".


According to Wikipedia[0] it uses 38.7MW of power, beating Fugaku (29.9MW) to be #1 in the TOP500 for power consumption.

[0] https://en.wikipedia.org/wiki/Aurora_(supercomputer)


Aurora was supposed to go into production years ago and be the first exascale supercomputer. Intel is dog shit, so this kept getting delayed. Now they're going for the two exaflop mark. It's pathetic that Aurora is only now benchmarking over an exaflop, and it's even more pathetic than this is apparently newsworthy.


The link is Intel's own website. Has excellent quotes on it like:

> Why It Matters: Designed as an AI-centric system from its inception

Announcement was in 2015. I'm curious whether Argonne are pleased with it.


>Announcement was in 2015. I'm curious whether Argonne are pleased with it.

All I'll say is that you can probably guess how they feel about it given the context.


Intel was chosen because the DoE wanted to foster an entire ecosystem of GPU vendors, because having a single vendor is a MASSIVE bottleneck.

Intel, AMD, and Nvidia were all vendors on Exascale projects [0]

> Intel is dog shit

Intel has issues with execution, but their engineers are still top notch. They are the last American company to actually do semiconductor fabrication, and they only fell behind TSMC and Samsung in fabrication 6-7 years ago because they chose not to invest in EUV lithography, betting on other methods instead.

[0] - http://helper.ipam.ucla.edu/publications/nmetut/nmetut_19423...


Case in point, Intel’s discrete consumer GPUs: They’re making major gains with each driver release, and I hope to see them seriously competing with nvidia in the gaming market.


A solution can always be achieved by spending more money.™ Rather than engineering for economic scale, engineer for maximizing billable hours and make yourself indispensable. - Consultant's credo

With all of the taxpayer money they've wasted so far, they could've bought zillions of Cerebras WSE-3 and exceeded 10 exaflops.


I feel like a lot of the big GPU clusters that companies like Meta are using would score highly on the benchmark, even though Linpack seems to be based on fp64 which the H200 is kinda sucky at. However, there's only one GH200-based machine (Alps, with 2688 GH200 nodes) in the top 10 even though I'm fairly sure it's quite a bit smaller than whatever Meta is training Llama 3 on. Is there any reason why we don't see those show up in TOP500?


The machines that the large ML companies are using are typically not going to run TOP500 competitively. They tend to skip the MPI stack, and weren't optimized for LINPACK performance. When I worked on TPUs there was no interest in attempting to benchmark them using supercomputer workloads (probably a good thing).


> I feel like a lot of the big GPU clusters that companies like Meta are using would score highly

Some might, but most of the work done on GPU compute was subsidized by DoE Exascale projects like Aurora, like my anecdote about the DGX product line above.


They are busy running training/inference on them and haven't benchmarked them?


So these benchmark lists like TOP500 are just fundamentally flawed because the serious players are too busy doing actual work than to participate?


Or have them set up in a way that makes them hard to run full-system benchmarks on. I can think of a couple of financial firms that have clusters that would rate on the Top 10, but a) as you suspected, they’re too busy running the money printers to take them down for a few days to run benchmarks, and b) they’re split up into more manageable little clusters, so the high speed fabrics don’t connect and allow every node to talk to every other node, which you need for an HPL run.


Are you willing to dump a name or two of those finance firms?


TOP500 and other similar HPC benchmark lists are only for FP64 computations.

While both NVIDIA and AMD design their top GPU models for both FP64 and AI/ML workloads to save on design cost, you can do AI training using GPUs that only have high AI performance (like the RTX 4090 or its workstation counterpart, the RTX 6000) without implementing FP64 operations at all (the FP64 performance of the RTX 4090 is negligible, worse than that of any decent cheap CPU; it is provided only for compatibility in testing).


It is a good advertisement for interconnect vendors.


I mean there is no rule if you buy a supercomputer that you have to benchmark it and submit the results. This said, in the days before AI the number and type of players that had this amount of compute were also commonly the types to submit their scores to said benchmark lists.


Most players that have this amount of compute also tend to pay less than the equivalent corporate wage, so besides being a good way to stress-test your cluster (prior to general availability), submitting a run gives you the positive feeling of "I helped make this, I help run this."


I work with a couple of the national labs and the talent there is generally incredible.

"I went to school for > 25 years so my life's work could be selling more ads" isn't a motivator for most of these people.


Thanks much for the info! Sentiments like that make me hopeful for humanity.


A single 8xH100 node hits 15.6 fp16 petaFlops


Wasn't it supposed to be a 2 exascale system?


Modern day Manhattan Project


That line seems to emerge every time a new supercomputer is commissioned in the US… in fact, the naming of Los Alamos' new Crossroads, along with its predecessor Trinity, has obvious roots in US nuclear testing history.

https://discover.lanl.gov/news/0830-crossroads/


Imagine a Beowulf cluster of those!



