The art of high performance computing (theartofhpc.com)
623 points by rramadass 8 months ago | 120 comments



The hardware / datacenter side of this is equally fascinating.

I used to work at AWS, on the software / services side of things. Now and then, though, we would crash some talks from the datacenter folks.

One key revelation for me was that increasing compute power in DCs is primarily a thermodynamics problem rather than a computing one. The nodes have become so dense that shipping power in and shipping heat out, with all kinds of redundancies, is an extremely hard problem. And it's not like you can push a software update if you discover some inefficiencies.

This was ~10 years ago, so probably some things have changed.

What blows me away is that Amazon, which started out as an internet bookstore, is at the cutting edge of solving thermodynamics problems.


Seymour Cray used to say this all the way back in the 1970s: his biggest problems were associated with dissipating heat. For the Cray 2 he took an even more dramatic approach: "The Cray-2's unusual cooling scheme immersed dense stacks of circuit boards in a special non-conductive liquid called Fluorinert™" (https://www.computerhistory.org/revolution/supercomputers/10...)


A few days ago I saw an article passing by about chips hitting the kW floor.


It always made me wonder why liquid cooling wasn't more of a thing for datacenters.

Water has a massive amount of thermal capacity and can quickly and in bulk be cooled to optimal temperatures. You'd probably still need fans and AC to dissipate heat of non-liquid cooled parts, but for the big energy items like CPUs and GPUs/compute engines, you could ship out huge amounts of heat fairly quickly and directly.

I guess the complexity and risk of a leak would be a problem, but for Amazon-sized data centers that doesn't seem like a major concern.


> It always made me wonder why liquid cooling wasn't more of a thing for datacenters.

Liquid cooling is almost a de facto standard in data centers in the HPC world. The top of the TOP500 machines are all liquid cooled. Not by choice, but due to physics constraints.

There is a big gap in power density between the HPC world and the usual datacenter-commodity-hardware world.

Commodity DCs are designed with the assumption that the average machine will run at a fraction of its maximum load. HPC systems, at the opposite end, are designed to operate safely at 100% load all the time.

At a previous company where I worked, we attempted to install a medium-size HPC cluster with a well-known commercial datacenter and network provider. Their sales rep almost fell off his chair when we announced the power requirements.


> we attempted to install a medium-size HPC cluster with a well-known commercial datacenter and network provider. Their sales rep almost fell off his chair when we announced the power requirements.

Heh. We tried it too. At first they didn't believe that a single node used their entire rack's power budget.


I'm going through a similar thing. Next week our procurement is arriving, but most likely the majority of the nodes will sit in a box for a while before they can figure out how to power them...

Also, this involves a UK university, where it takes forever to upgrade the power delivery to the building/floor/room. There's no planned upgrade whatsoever, so people just need to make do with what we have.


Sounds fascinating. Can you give any more details? What kind of nodes are they, and how do they differ from "traditional" DC hardware, say from Supermicro?


The difference is GPUs. A normal dual-socket system serving a database or webserver uses around 200-300W under medium load. One of these [1], equipped with 10xA100s, can easily use in the ballpark of 3kW under load. So we are talking 10x the power usage.

[1] https://www.supermicro.com/en/products/system/gpu/5u/sys-521...


Thanks, this makes sense!


I'm a bit skeptical about the claim that the top of the TOP500 are all liquid cooled due to physical constraints. (Where "physics" here seems to mean the density and properties of air cooling in general, ignoring the environment of the machine, such as the ambient weather.)

The one data point I know well is NERSC, which was air cooled until relatively recently. Part of the success of air cooling in the past was the great Bay Area weather, which is typically cool enough. I forget the exact reason for upgrading to water cooling, but it was before the recent upgrade to a new machine (Perlmutter). It may have to do with our weather getting more extreme, so that there are more incidents where air cooling alone would "throttle" the machine.

The reason I'm skeptical about that claim is that the ambient temperature also plays a role. But certainly the density is increasing, so maybe the biggest supercomputers at the moment are too dense (in power consumption) to be air cooled.


If you take the top 5 of the TOP500 (01/2024):

- Frontier: Liquid cooled

- Aurora: liquid cooled

- Eagle: liquid cooled

- Fugaku: liquid cooled

- Lumi: Liquid cooled with heat recycling.

The need for liquid cooling arrived with the use of accelerators, which bumped the power density significantly.

That said, Fugaku, which is not accelerator driven, is also liquid cooled today.

I am surprised to learn that NERSC was air cooled until recently. Most of the BG/Q machines, which dominated the TOP500 10 years ago, were already liquid cooled.


I probably wasn't clear. I don't doubt your first statement, "The Top of the TOP500 machines are all liquid cooled." My skepticism is about your second statement, "due to physics constraints". It may very well be true, but I'm just skeptical that it can't be done with a combination of engineering and the ambient environment (mostly temperature, but also a tradeoff between density and space). Maybe I focused on the word physics too much (being a physicist), but it seems the decision would basically be a cost-benefit analysis and risk management involving many factors including money, maturity of solutions in the market, safety, etc. For example, at NERSC there is a real risk of a massive, long-overdue earthquake (the whole floor is quake-proof, but I guess the risk is reduced, not eliminated), so I guess they probably have considered this in that design choice.

But perhaps you're right that the physics is the ultimate factor here: what comes in must go out. With an order-of-magnitude increase in incoming power, water seems to be the most effective and cheapest medium to absorb that heat and transport it off the floor efficiently.


Because it’s complex. Even more complex than “engineered” air.

You need two circuits, and a CDU between them. Coolant needs maintaining. You add antifreeze, biocides, etc.

Air is brute force. It cools everything it touches. Liquid cooling is serialized in a node. Two sockets? Second will be hotter. HBA not making good contact? It’ll overheat.

You add extensive leak detection subsystems, and the amount of coolant moving in your primary circuit becomes massive.

Currently you can remove 97% of the heat via liquid (including the PSUs), and it’s cheaper to do so than air, but it’s not “rails, screws, cables, power on”. Air cooled systems can be turned on in a week. Liquid cooled ones take a month.

However, using liquid is mandatory after some point. You can’t cool systems that dense and under that load with air. They’ll melt.


> Liquid cooling is serialized in a node. Two sockets?

I've seen tests done on heavy PC loops (i.e. multi-GPU), both high-flow and low-flow, as well as on car engines, in different coolant flow configurations. The results from all of those are that the water doesn't rise meaningfully in temperature between components.

Unless I did my back-of-the-napkin math wrong, this seems reasonable. If you have a single 10mm ID pipe going through a 1U server and up to the next, then for a full 42U rack you have about 1.7kg of water going through the servers. If the flow rate is about 1s per server (so 42 seconds for the full rack) and each 1U server dumps 500W of energy into the water, there should be just a 3 degree C difference in the water temperature between the first and the last server.
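For anyone who wants to redo the estimate, the governing relation is just a steady-state heat balance, delta_T = P / (mass flow * c_p). A minimal sketch with my own reading of the numbers above (treating the 1.7 kg as roughly 1.7 kg/s of mass flow through the rack, and taking c_p of water as about 4186 J/(kg*K)):

  // Rough steady-state heat balance for a liquid-cooled rack.
  // Assumed inputs (for illustration only): ~1.7 kg/s of water through the
  // loop, 42 x 500 W servers, c_p of water ~4186 J/(kg*K).
  #include <iostream>

  int main() {
    const double mass_flow    = 1.7;     // kg/s of water through the rack
    const double cp_water     = 4186.0;  // J/(kg*K)
    const double per_server_w = 500.0;   // heat dumped per 1U server
    const int    servers      = 42;

    double total_w       = per_server_w * servers;            // 21 kW
    double dT_rack       = total_w / (mass_flow * cp_water);  // inlet -> outlet
    double dT_per_server = per_server_w / (mass_flow * cp_water);

    std::cout << "rack delta T: " << dT_rack
              << " K, per server: " << dT_per_server << " K\n";
  }

With those inputs the inlet-to-outlet rise over a 21 kW rack comes out to roughly 3 C, matching the estimate; a lower flow rate scales the delta up proportionally.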


In our system every node gets inlet water at the same temperature via parallel piping, but when it’s in node, it goes through processors first, then RAM, then PCIe and disks. Delta T between two sockets is 5 degrees C, and the delta T between input and output is around 15-18 C depending on load.


First, thanks for sharing these details. I find them fascinating because details like these are not commonly read or heard about.

> Delta T between two sockets is 5 degrees C

And secondly, ~5-10 degrees is what I see on my dual-socket workstation, and I have been wondering about this delta ever since the first day I started monitoring the temperatures. At first, I thought the heat sink wasn't installed properly, but after reinstalling it the delta remained. Since I didn't notice any CPU throttling whatsoever, I figured it's "normal" and ignored it.


Hey, no worries. Using one is as fascinating as reading about it. It feels like a space shuttle, so different, yet so enjoyable.

I mean, water travels from one socket to another, so one processor adds heat equal to 5 degrees C under nominal load. The second socket doesn't complain much, but this is an enormous amount of heat transferred at such a quick pace.


Interesting. What's your flow rate and pipe size?


Is raising the air's water content (humidity) worth it? Humid air can "store" more heat.

I guess it could be bad past some %, but there's probably a point where it's worth it.


What does this all look like without an atmosphere?


Worse, heat dissipation is a major constraint for spacecraft and satellites because you can only radiate heat away as infrared photons.


Doesn’t have to be infrared but yeah, space isn’t “cold” so much as it’s an insulator.


at ~2 kelvin I'd have thought you can radiate away a truckload of heat surely


You can radiate easily.

But not convect. Hence why it’s much, much harder than removing heat on Earth.


Of course you're not convecting but if you are radiating from a hot body into an ambient two Kelvin then you are going to lose heat really, really fast. IIRC heat loss by black body radiation into its surroundings is proportional to the fourth power of the temperature difference between the body and surroundings (from memory, and going back a very long way, so maybe incorrect).


> proportional to the fourth power of the temperature difference between the body and surroundings

Almost. It’s proportional [0] to T_hot^4 - T_cold^4. For a 100C surface with emissivity 1, that’s about 1kW/m2 if there is no radiation coming back, which really isn’t very high. You cannot cheat this with fancy folded-up radiating surfaces (it’s thermodynamically impossible, and the actual mechanism that kills it is one fin of the heatsink radiating right at the next one).

So cooling in space is hard. You’re not getting GPU-like power densities without a physically immense radiating surface extending way past those GPUs.

[0] Caveat: emissivity can depend on wavelength, and the law holds independently at each wavelength. So this can introduce interesting effects, which is how all the fancy prototype roof-cooling materials work, and it’s also related to how “spectrally selective” windows and window films work.
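If anyone wants to check the arithmetic behind that ~1kW/m2 figure, here is a minimal sketch (assuming emissivity 1, a 100 C radiator, and the ~2.7 K background; sigma is the Stefan-Boltzmann constant):

  // Net radiated power per unit area: q = eps * sigma * (T_hot^4 - T_cold^4)
  #include <cmath>
  #include <iostream>

  int main() {
    const double sigma  = 5.670e-8;  // Stefan-Boltzmann constant, W/(m^2 K^4)
    const double eps    = 1.0;       // ideal (black-body) emissivity
    const double T_hot  = 373.15;    // 100 C radiator surface, in kelvin
    const double T_cold = 2.7;       // deep-space background, in kelvin

    double q = eps * sigma * (std::pow(T_hot, 4) - std::pow(T_cold, 4));
    std::cout << q << " W/m^2\n";    // prints roughly 1100 W/m^2
  }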


Thanks, you clearly know your subject. I must say that without doubting your figures, I am astonished at how little heat can be lost radiatively. Given the 4th powers I assumed it must be vastly higher, but clearly not, much against my intuition. Thanks for a good answer!


Amazing considering how much heat travels from Sun (and punches through atmosphere) to Earth surface. Didn’t realize there was that much of an insulation property.


Besides the heat insulation, without the vacuum of space we would all be deafened by the sun's roar.


Vacuum is a great insulator, which is why we have vacuum flasks.


You just need to radiate in the visible spectrum then the problem will be much reduced.


Couldn't you heat up and poop out blocks of material all day long, or is that starting down a "solution creates more heat than it dissipates" path?



I wonder if you could design a rack or half rack that has a connector for attaching a minisplit refrigerant line directly to it, instead of cooling the entire room.

Problem is that’s how you sell 1-3 racks not an entire room of them.

Ethylene lines might be more practical, easier to wire into existing systems.


Immersion cooling is getting big. At the last Supercomputing conference I probably saw at least a dozen vendors of immersion cooling equipment. My datacenter has one cluster with liquid cooling caps over the sockets, and two immersed clusters. The latter two have basins of various degrees of sophistication under them for when they do spring a leak.


In contexts where there's a good chance of standardisation, I believe it is. Both OCP [0] and Open19 [1] have liquid cooling as part of the standard.

[0]: https://www.opencompute.org/projects/cooling-environments

[1]: https://gitlab.com/open19/v2-specification/-/blob/main/syste...


Is there any good data on the scale of this problem or that can be used to visualise it?

What is the cutting edge of cooling tech like?


It's very interesting how abstracted away from hardware HPC sometimes looks. The books seem to revolve a lot around SPMD programming, algorithms & data structures, task parallelism, synchronization etc., but say very little about computer architecture details like supercomputer memory subsystems, high-bandwidth interconnects like CXL, GPU architecture and so on. Are the abstractions and tooling already good enough that you don't need to worry about these details? I'm also curious if HPC practitioners have to fiddle a lot of black-box knobs to squeeze out performance?


For most HPC, you will not be able to maximize parallelism and throughput without intimate knowledge of the hardware architecture and its behavior. As a general principle, you want the topology of the software to match the topology of the hardware as closely as possible for optimal scaling behavior. Efficient HPC software is strongly influenced by the nature of the hardware.

When I wrote code for new HPC hardware, people were always surprised when I asked for the system hardware and architecture docs instead of the programming docs. But if you understood the hardware design, the correct way of designing software for it became obvious from first principles. The programming docs typically contained quite a few half-truths intended to make things seem misleadingly easier for developers than a proper understanding would suggest. In fact, some HPC platforms failed in large part because they consistently misrepresented what was required from developers to achieve maximum performance in order to appear "easy to use", and then failed to deliver the performance the silicon was capable of if you actually wrote software the way the marketing implied would be effective.

You can write HPC code on top of abstractions, and many people do, but the performance and scaling losses are often an unavoidable integer factor. As with most software, this was considered an acceptable loss in many cases if it allowed less capable software devs to design the code. HPC is like any other type of software in that most developers who notionally specialize in it struggle to produce consistently good results. Much of the expensive hardware used in HPC is there to mitigate the performance losses of worse software designs.

In HPC there are no shortcuts to actually understanding how the hardware works if you want maximum performance. Which is no different than regular software, in HPC the hardware systems are just bigger and more complex.


I started in HPC about 2 years ago on a ~500 node cluster at a Fortune 100 company. I was really just looking for a job where I was doing Linux 100% of the time, and it's been fun so far.

But it wasn't what I thought it would be. I guess I expected to be doing more performance oriented work, analyzing numbers and trying to get every last bit of performance out of the cluster. To be honest, they didn't even have any kind of monitoring running. I set some up, and it doesn't really get used. Once in a while we get questions from management about "how busy is the cluster", to justify budgets and that sort of thing.

Most of my 'optimization' work ends up being things like making sure people aren't (usually unknowingly) requesting 384 CPUs when their script only uses 16, testing software to see what number of CPUs it works with before you see degradation, etc. I've only had the Intel profiler open twice.

And I've found that most of the job is really just helping researchers and such with their work. Typically running either a commercial or open-source program, troubleshooting it, or getting some code written by another team on another cluster and getting it built and running on yours. Slogging through terrible Python code. Trying to get a C++ project built on a more modern cluster in a CentOS 7 environment.

It can be fun in a way. I've worked with different languages over the years so I enjoy trying to get things working, digging through crashes and stack traces. And working with such large machines, your sense of normal gets twisted when you're on a server with 'only' 128GB of RAM or 20TB of disk.

It's a little scary when you know the results of some of this stuff are being used in the real world, and the people running the simulations aren't even doing things right. Incorrect code, mixed-up source code, not using the data they think they are. I once found a huge bug that had existed for 3 years. Doesn't this invalidate all the work you've done on this subject?

The one drawback I find is that a lot of HPC jobs want you to have a master's degree. Even to just run the cluster. Doesn't make sense to me; I'm not writing the software you're running, and we aren't running some state-of-the-art TOP500 cluster. We're just getting a bunch of machines networked together and running some code.


> The one drawback I find is that a lot of HPC jobs want you to have a master's degree.

Is it possible that pretty much any specialization, outside of the most common ones, engages in a lot of gatekeeping? I remember how difficult it appeared to be after I graduated to break into embedded systems (I never did). I persisted until I realized it doesn't even pay very well, comparatively.


Yes, but to varying degrees. I imagine the whole "Fortune 500" deal probably gatekept more than it really needed to. While I don't think any specialization NEEDS a master's, since 2-3 years of industry work in that field will do just as much 90% of the time (a few may need PhDs, mostly for R&D labs), some can justify it more than others.

It's also cultural. From what I hear, the east coast US cares a lot more about prestige than the west coast, which focuses a lot more on performance.


I always found that funny too. A business that needs a powerful computing solution can come up with some amazingly robust stuff, whereas science/research just buys a big mainframe and hopes it works.


I was working in a company that had been spun out of a university until recently and it was shocking how hopeless the researchers were. I've always been critical of how poor the job security in academia is but you'd think it's still too much given how slapdash some of the crap you see is. We basically had to reinvent their product from the ground up, awful.


This is probably a naive question but isn't that the point of having developers on staff? The researchers aren't coders and vice versa, so having researchers produce prototypes that are productized by engineers makes sense to me.


Exactly! This is how it should be.

Researchers/Scientists with their hard earned PhDs should only concentrate on doing cutting-edge "researchy" stuff. It is hard enough that they should not be asked to learn all the intricacies/problems inherent in Software Development. That is the domain of a "Professional Software Engineer".

There is now in fact a new role called "Research Software Engineer": Software Developers working in research, developing code specific to its needs - https://www.nature.com/articles/d41586-022-01516-2 and https://en.wikipedia.org/wiki/Research_software_engineering


I've had very similar experiences working with former researchers including at a university spinout. Mechanical rather than CS. It was perplexing how they still carried the elitism that industry was mostly for people who can't hack it in academia given the quality of their work. Would be unacceptable coming from a new hire PD engineer at Apple yet you're demanding respect because you used to lead a whole lab apparently producing rubbish?


would love to connect and talk more about HPC - let me know if you'd be up for a chat :)


Emails in my profile, feel free to reach out


There is a lot of abstraction, but knowing which abstraction to use still takes knowing a lot about the hardware.

> I’m also curious if HPC practitioners have to fiddle a lot of black-box knobs to squeeze out performance?

In my experience with CUDA developers, yes the Shmoo Plot (https://en.wikipedia.org/wiki/Shmoo_plot, sometimes called a ‘wedge’ in some industries) is one of the workhorses of every day optimization. I’m not sure I’d call it black-box, though maybe the net effect is the same. It’s really common to have educated guesses and to know what the knobs do and how they work, and still find big surprises when you measure. The first rule of optimization is measure. I always think of Michael Abrash’s first chapter in the “Black Book”: “The Best Optimizer is Between Your Ears” http://twimgs.com/ddj/abrashblackbook/gpbb1.pdf. This is a fabulous snippet of the philosophy of high performance (even though it’s PC game centric and not about modern HPC.)

Related to your point about abstraction, the heaviest knob-tuning should get done at the end of the optimization process, because as soon as you refactor or change anything, you have to do the knob tuning again. A minor change in register spills or cache access patterns can completely reset any fine-tuning of thread configuration or cache or shared memory size, etc.. Despite this, some healthy amount of knob tuning is still done along the way to check & balance & get an intuitive sense of the local perf space of the code. (Just noticed Abrash talks a little about why this is a good idea.)


Could you explain how you use a shmoo plot for optimization? Do you just have a performance metric at each point in parameter space?


The shmoo plot is just the name for measuring something (such as perf) over a range of parameter space. The simplest and most straightforward application is to pick a parameter or two that you don’t know what value they should be using, do the shmoo over the range of parameter space, and then set the knobs at whatever values give you the optimal measurement.

Usually though, you have to iterate. Doing shmoos along the way can help with understanding the effects of code changes, help understand how the hardware works, and it can sometimes help identify what code changes you might need to make. A simple abstract example might be I know what my theoretical peak bandwidth is, but my program only gets 30% of peak. I suspect it has to do with how many registers are used, and I have a knob to control it, so I turn the knob and plot all possible register settings, and find out that I can get 45% of peak with a different value. Now I know it was partially registers I was limited by, but I also know to look for something else too. Then I profile, examine the code, maybe refactor or adjust some things, hypothesize, test, and then shmoo again on a different knob or two if I suspect something else is the bottleneck.
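Not CUDA, but as a toy illustration of what a one-dimensional shmoo looks like in code (my own made-up kernel and knob: it just sweeps a tile size and records a timing at each point so you can eyeball the optimum):

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <vector>

  // Toy kernel whose speed depends on a tunable "tile" knob.
  double blocked_sum(const std::vector<double>& a, std::size_t tile) {
    double total = 0.0;
    for (std::size_t start = 0; start < a.size(); start += tile) {
      std::size_t end = std::min(start + tile, a.size());
      double partial = 0.0;
      for (std::size_t i = start; i < end; ++i) partial += a[i];
      total += partial;
    }
    return total;
  }

  int main() {
    std::vector<double> a(1 << 24, 1.0);
    // The shmoo: sweep the knob over its range, record the metric at each point.
    for (std::size_t tile = 1 << 10; tile <= (std::size_t(1) << 20); tile <<= 1) {
      auto t0 = std::chrono::steady_clock::now();
      volatile double sink = blocked_sum(a, tile);   // volatile: keep the work
      auto t1 = std::chrono::steady_clock::now();
      double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
      std::printf("tile=%8zu  time=%7.2f ms  (sum=%.0f)\n",
                  tile, ms, static_cast<double>(sink));
    }
  }

In a real GPU setting the knobs would be things like block size or shared-memory carve-out rather than this toy tile parameter, but the workflow is the same: sweep, plot, pick, then re-measure after every refactor.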


Bayesian optimization is very good at optimizing the black-box knobs in a shmoo plot, usually called meta-parameters. Assuming each probe is costly, it allows you to find the optimal combination of values for many knobs (dimensions) with a minimum of probing.


HPC admin here, generally serving "long tail of science" researchers.

In today's x86_64 hardware, there's no "supercomputer memory subsystem". It's just a glorified NUMA system, and the biggest problem is putting the memory close to your core, i.e. keeping data local in your NUMA node to reduce latencies.

Your resource mapping is handled by your scheduler. It knows your hardware, hence it creates a cgroup which satisfies your needs and is as optimized as possible, and stuffs your application into that cgroup and runs it.

Currently the king of high-performance interconnects is InfiniBand, and it accelerates MPI at the fabric level. You can send messages, broadcasts and reduce results like there's no tomorrow, because when the message arrives at you, it's already reduced. When you broadcast, you only send a single message, which is broadcast at the fabric layer. Multiple-context IB cards have many queues, and more than one MPI job can run on the same node/card with queue/context isolation.

If you're using a framework for GPU work, the architecture & optimization is done at that level automatically (the framework developers generally do the hard work). NVIDIA's drivers are pure black magic and handle some parts of the optimization, too. Inter-GPU connection is handled by a physical fabric, managed by drivers and its own daemon.

If you're CPU bound, your libraries are generally hand tuned by its vendor (Intel MKL, BLAS, Eigen, etc.). I personally used Eigen, and it has processor specific hints and optimizations baked in.

The things you have to worry about are compiling your code for the correct architecture and making sure that the hardware you run on can satisfy your demands (i.e.: do not make too many random memory accesses, keep the prefetcher and branch predictor happy if you're trying to go "all-out fast" on the node, do not abuse disk access, etc.).

On the number crunching side, keeping things independent (so they can be instruction level parallelized/vectorized), making sure you're not doing unnecessary calculations, and not abusing MPI (reducing inter-node talk to only necessary chatter) is the key.

It's way easier said than done, but when you get the hang of it, it becomes like a second nature to think about these things, if these kinds of things are your cup of tea.
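For readers who haven't seen it, this is roughly what the MPI side of "only necessary chatter" looks like; a minimal allreduce sketch (assumes an MPI implementation such as MPICH or Open MPI, compiled with mpicxx and launched with mpirun):

  #include <mpi.h>
  #include <cstdio>
  #include <numeric>
  #include <vector>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank computes a local partial result...
    std::vector<double> local(1000, rank + 1.0);
    double partial = std::accumulate(local.begin(), local.end(), 0.0);

    // ...and a single collective combines them across all nodes.  On fabrics
    // with in-network reduction, the message can arrive already reduced.
    double total = 0.0;
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("ranks=%d, global sum=%f\n", size, total);
    MPI_Finalize();
  }

Run with, e.g., mpirun -n 4 ./a.out.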


Thanks for the thoughtful comment, pretty fascinating stuff.

> In today's x86_64 hardware, there's no "supercomputer memory subsystem". It's just a glorified NUMA system, and the biggest problem is putting the memory close to your core, i.e. keeping data local in your NUMA node to reduce latencies.

I mean, memory topology varies greatly by uarch (doubly so between vendors). I can't take a routine tuned to Nehalem, run it on Haswell or Skylake and expect it to stay competitive. More generally, different hardware has different bandwidth and latency ratios, which affects software design (e.g. software written for commodity Dell w/ PCIe cards probably won't translate to Cray accelerator grid connected by HPE slingshot). And then there's hardware-specific features like RNICs bypassing DRAM and writing RDMA messages directly into the receiver's cache. So I think that ccNUMA and data locality is not sufficient to reason about memory perf.


> I mean, memory topology varies greatly by uarch...

You're absolutely right; this is why I said that if you're using libraries, this burden is generally handled by them. Compilers also handle this very well.

If you're writing your own routines, the best way is to read the arch docs, maybe some low-level sites like Chips and Cheese, do some synthetic benchmarks, and write your code in a semi-informed way.

After writing the code, a suite of cachegrind, callgrind and perf is in order. See if there are any other bottlenecks, and tune your code accordingly. Add hints for your compiler, if possible.

I was able to reach insane saturation levels with Eigen plus some hand-tuned code. For the next level, I needed to change my matrix ordering, but it was already fast enough (30 minutes to 45 seconds: a 40x speedup), so I left it there.

Sometimes there is no replacement for blood, sweat and tears in this business.

I have never played with custom interconnects (Slingshot, etc.), yet, so I can't tell much.


I would contest the remark on Infiniband, since there is Slingshot which will be faster for real-world applications, and also Omnipath.


You mentioned cgroups; I am very curious what you think of the work Facebook did on the memory pressure metric.


Yes and no.

MPI and OpenMP are the primary abstractions from the hardware in HPC, with MPI being an abstracted form of distributed-memory parallel computing and OpenMP being an abstracted form of shared-memory parallel computing. Many researchers write their codes purely using those, often both in the same code. When using those, you really do not need to worry about the architectural details most of the time.

Still, some researchers who like to further optimize things do in fact fiddle with a lot of small architectural details to increase performance further. For example, loop unrolling is pretty common and can get quite confusing in my opinion. I vaguely recall some stuff about trying to vectorize operations by preferring addition over multiplication due to the particular CPU architecture, but I do not think I've seen that in practice.

Preventing cache misses is another major one, where some codes are written so that the most needed information is stored in the CPU's cache rather than memory. Most codes only handle this by ensuring column-major order loops for array operations in Fortran or row-major order loops in C, but the concept can be extended further. If you know the cache size for your processors, you could hypothetically optimize some operations to keep all of the needed information inside the cache to minimize cache misses. I've never seen this in practice but it was actively discussed in the scientific computing course I took in 2013.

The use of particular GPUs depends heavily on the problem being solved, with some being great on GPUs and others being too difficult. I'm not too knowledgeable about that, unfortunately.
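As a small sketch of the loop-order point above (C and C++ store arrays row-major, so this is the C-side mirror of the Fortran advice): both functions below compute the same sum, but the first walks memory contiguously while the second strides across rows and thrashes the cache for large n.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // Cache-friendly: the inner loop walks contiguous memory (row-major order).
  double sum_rows(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        s += a[i * n + j];
    return s;
  }

  // Cache-hostile: the inner loop jumps n elements at a time, so for large n
  // nearly every access pulls in a new cache line.
  double sum_cols(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t i = 0; i < n; ++i)
        s += a[i * n + j];
    return s;
  }

  int main() {
    const std::size_t n = 4096;
    std::vector<double> a(n * n, 1.0);
    std::cout << sum_rows(a, n) << " " << sum_cols(a, n) << "\n";
  }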


Of course, not every problem can be solved by BLAS, but if you are doing linear algebra, the cache stuff should be mostly handled by BLAS.

I'm not sure how much multiplication vs addition matters on a modern chip. You can have a bazillion instructions in flight, after all, as long as they don't have any dependencies, so I'd go with whichever option shortens the data dependencies on the critical path. The computer will figure out where to park longer instructions if it needs to.


You're right that the addition vs. multiplication issue likely does not matter on a modern chip. I just gave the example because it shows how the CPU architecture can affect how you write the code. I do not recall precisely when or where I heard the idea, but it was about a decade ago --- ages ago by computing standards.
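A small sketch of what "shortening the data dependencies on the critical path" (two comments up) tends to mean in practice; my own toy example, not anything from the thread: a single accumulator serializes every add behind the previous one, while several independent accumulators let the core keep more operations in flight (and make the compiler's vectorizer happier). Note the unrolled version reorders the floating-point additions, which is fine for a sketch but not always acceptable.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // One long dependency chain: every add waits on the previous add.
  double dot_single(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
      s += x[i] * y[i];
    return s;
  }

  // Four independent chains: the multiplies and adds can overlap in flight.
  double dot_unrolled(const std::vector<double>& x, const std::vector<double>& y) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t n = x.size(), i = 0;
    for (; i + 4 <= n; i += 4) {
      s0 += x[i]     * y[i];
      s1 += x[i + 1] * y[i + 1];
      s2 += x[i + 2] * y[i + 2];
      s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
  }

  int main() {
    std::vector<double> x(1 << 20, 0.5), y(1 << 20, 2.0);
    std::cout << dot_single(x, y) << " " << dot_unrolled(x, y) << "\n";
  }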


I wrote scientific simulation software in academia for a few years. None of us writing the software had any formal software engineering training above what we’d pieced together ourselves from statistics courses. We wrote our simulations to run independently on many nodes and aggregated the results at the end, no use of any HPC features other than “run these 100 scripts on a node each please, thank you slurm”. That approach worked very well for our problem.

I’d bet a significant part of compute work on HPC clusters in academia works the same way. The only thing we paid attention to was number of cores on the node and preferring node local storage over the shared volumes for caching. No MPI.

There are of course problems requiring “genuine” HPC clusters but ours could have run on any pile of workers with a job queue.


That's often the ideal case. Individual tasks are small enough to run on commodity hardware but large enough that you don't have an excessive number of them. That means you can write simple software without wasting effort on distributed computing.

I've seen similar things at the intersection of bioinformatics and genomics. Computers are getting bigger but the genomes aren't, and tasks that require distributed computing are getting rare.


No, the abstractions are not sufficient. We do care about these details, a lot.

Of course, not every application is optimized to the hilt. But if you want to optimize an application that far, exactly the things you're talking about are what come into play.

So yes, I would expect every competent HPC practitioner to have a solid (if not necessarily intimate) grasp of hardware architecture.


It's not intuitive, but HPC is more about scalability than performance.

You won't be able to use a supercomputer at all without scalability, and it's the one topic that is specific to it. But, of course, those computers' time is quite expensive, so you'll want to optimize for performance too. It's just secondary.


Regardless of what you do, domain knowledge tends to be more valuable than purely technical skills.

Knowing more numerical analysis will probably get you further in HPC than knowledge of specific hardware architectures.

Ideally you want both, of course.


You'd be surprised how backwards and primitive the tools used in HPC actually are.

Take for instance the so-called workload managers, of which the most popular ones are Slurm, PBS, UGE, LSF. Only Slurm is really open-source, PBS has a community edition, the rest is proprietary stuff executed in the best traditions of enterprise software which locks you into using pathetically bad tools, ancient and backwards tech with crappy / nonexistent documentation and inept tech support.

The interface between WLMs and the user who wants to use some resources is through submitting "jobs". These jobs can be interactive, but most often they are so-called "batch jobs". A batch job is usually defined as... a Unix shell script, where certain comments are parsed as instructions to the WLM. In a world with dozens of configuration formats... they chose to do this: embed configuration into shell comments.

Debugging job failures is a nightmare, mostly because WLM software has really poor quality of execution. Pathetic error reporting. Idiotic defaults. Everything is so fragile it falls apart if you so much as look at it the wrong way. Working with it reminds me of the very early days of Linux, when sometimes things just wouldn't build, or would segfault right after you tried running them, and there wasn't much you could do besides spending days or weeks trying to debug it just to get some basic functionality going.

When I have to deal with it, I feel like I'm in a steampunk movie. Some stuff is really advanced, and then you find out that this advanced stuff is propped up by some DIY retro nonsense you thought had died off decades ago. The advanced stuff is usually more on the side of hardware, while software is not keeping up with it for the most part.


You use a lot of scare quotes. Do you have any suggestions on how things could be different? You need batch jobs because the scheduler has to wait for resources to be available. It's kinda like Tetris in processor/time space. (In fact, that's my personal "proof" that workload scheduling is NP-complete: it's isomorphic to Tetris.)

And what's wrong with shell scripts? It's a lingua franca, generally accepted across scientific disciplines, cluster vendors, workload managers, .... Considering the complexity of some setups (copy data to node-local file systems; run multiple programs, post-process results, ... ) I don't see how you could set up things other than in some scripting language. And then unix shell scripts are not the worst idea.

Debugging failures: yeah. Too many levels where something can go wrong, and it can be a pain to debug. Still, your average cluster processes a few million jobs in its lifetime. If more than a microscopic portion of that would fail, computing centers would need way more personnel than they have.


> And what's wrong with shell scripts?

When used as configuration? Here are some things that are wrong:

* Configuration forced into a single line makes writing long lines inconvenient (for example, if you want Slurm with Pyxis and you need to specify the image name -- it will most likely not fit on the screen).

* Oh, and since we're mentioning Pyxis -- their image names have a pound sign in them, and now you also need to figure out how to escape it, because for some reason if used literally it breaks the comment parser.

* No syntax highlighting (because it's all comments).

* No way to create more complex configuration, i.e. no way to have any types other than strings, no way to have variables, no way to have collections of things.

* No way to reuse configuration (you have to copy it from one job file to another). I honestly don't even know what happens if you try to source a job configuration file from another job configuration.

All in all, it's really hard to imagine a worse configuration format. This sounds like a solution from some sort of a code-golfing competition where the goal was to make it as bad as possible, while still retaining some shreds of functionality.


I really like using Slurm, the documentation is great (https://slurm.schedmd.com) and the model is pretty straightforward, at least for the mostly-single-node jobs I used it for.

You can launch a job(s) via command-line, config in Bash comments, REST APIs, linking to their library, and I think a few more ways.

I found it pretty easy to setup and admin. Scaling in the cloud was way less developed when I used it, so I just hacked in a simple script that allowed scaling up and down based on the job queue size.

What do you like better and for what use-case? Mine was for a group of researchers training models, and the feature I desired most was an approximately fair distribution of resources (cores, GPU hours, etc.).


Having switched from LSF to slurm, I have to appreciate that the ecosystem is so bash-centric. Lots of re-use in the conversion. If I’d had to learn some kind of slurm-markup-language or slurmScript or find buttons in some SlurmWizard, it would have been a nightmare.


Oh LSF... I don't know if you know this. LSF is perhaps the only system alive today that I know of that uses literal patches as a means of software distribution.

First time I saw it, I had a flashback to the times when I worked for HP, and they were making some huge SAP knock-off, and that system was so labor-intensive to deploy that their QA process involved actual patches. As in, the pre-release QA cycle involved installing the system, validating it (which could take a few weeks), and if it's not considered DoD, then the developers are given the final list of things they need to fix, and those fixes would have to be submitted as patches (sometimes literal diffs that need to be applied to the deployed system with the patch tool).

This is, I guess, how the "patch version component" came to be in SemVer spec. It's kind of funny how lots of tools are using this component today for completely unrelated purposes... but yeah, LSF feels like the time is ticking there at a different pace :)


I've dug deeply into LSF in the last few years and it's like a car crash - you can't look away. It feels like something that started in the early unix days but was developed into perhaps the late 90s, but in reality LSF was only started in the 90s (in academia). As far as I can tell development all but stopped when IBM acquired it some ten years ago.


> Working with it reminds me the very early days of Linux

The other cool thing about HPC is it is one of the last areas where multi-user Unix is used! At least, if you're using a university or NSF cluster that is!

Only other place I really see multiple humans using the same machine is SDF or the Tildes


It's Saturday afternoon.

  [login1 ~:3] who | cut -d ' ' -f 1 | sort -u | wc -l
  41


Well, working in HPC, in my view the reason is that the users (researchers/scientists) just want to use the machine as they have used it in the past, and this has been going on for many years now. That makes sense, as it allows them to focus on the scientific questions, not on IT stuff. They keep using the tooling that the software package they use came with, and in some cases that is already 30 years old.


HPC software is one area where we have arguably regressed in the last 30 years. Chapel is the only light I see in the darkness


Want to elaborate more on Chapel? I've recently been tasked with integrating Chapel into our system and it's quite interesting.


Memory architecture and bandwidth are still very important, most of IBM's latest performance gains for both mainframes and POWER are reliant on some novel innovations there.


I don’t think I do HPC (I only will use up to, say, 8 nodes at a time), but the impression I get is that they are already working on quite hard problems at the high-level, so they need to lean on good libraries for the low-level stuff, otherwise it is just too much.


Kudos to Victor for assembling such a wonderful resource!

While I am not acquainted with him personally, I did my doctoral work at UT Austin in the 1990s and had the privilege of working with the resources (Cray Y-MP, IBM SP/2 Winterhawk, and mostly on Lonestar, a host name which pointed to a Cray T3E at the time) maintained by TACC (one of my Ph.D. committee members is still on staff!) to complete my work (TACC was called HPCC and/or CHPC if I recall the acronyms correctly).

Back then, it was incumbent on the programmer to parallelize their code (in my case, using MPI on the Cray T3E in the UNICOS environment) and have some understanding of the hardware, if only because the field was still emergent and problems were solved by reading the gray Cray ring-binder and whichever copies of Gropp et al. we had on-hand. That and having a very knowledgeable contact as mentioned above :) of course helped...


> Lonestar, a host name which pointed to a Cray T3E

Lonestar5 was a Cray again. Currently Lonestar6 is an oil-immersion AMD Milan cluster with A100 GPUs. The times, they never stand still.


Dealt with him via TACC for a big simulation I did and was grateful enough for his help to buy a paper copy of the first volume in the series. Very interesting though a bit outside of my area. I will look at the others and encourage anyone interested to check them out.


I am interested in the more hardware management side of HPC (how problems are detected, diagnosed, mapped into actions such as reboot/reinstall/repairs, how these are scheduled and how that is optimized to provide the best level of service, how this is done if there are multiple objectives to optimize at once e.g. node availability vs overall throughput, how different topologies affect the above, how other constraints affect the above, and in general a system dynamics approach to these problems).

I haven't found many good sources for this kind of information. If you are aware of any, please cite them in a comment below.


Assuming you are moving past just the typical nonblocking folded-Clos networks or Little's Law, and want to have a more engineering focus, queuing theory is one discipline you want to dig into.

Queuing theory seems trivial and easy how it is introduced, but it has many open questions.

Performance metrics for a system with random arrival times and independent service times with k servers (M/G/k), for example, are still an open question.

https://www.sciencedirect.com/science/article/pii/S089571770...

There are actually lots of open problems in queuing theory that one wouldn't expect.


Thanks. I have read that paper!

The real world is way more complicated...

You can think about each host as its own Markov chain: they may be serving, have hidden or diagnosed problems at various confidence levels, be en route to various remedial processes (reboot/reinstall/repairs I cited above, to simplify), require software/firmware changes, or need scheduled or opportunistic diagnostics (e.g. periodic deep-screening for otherwise silent problems).

Repair workflows are even more complicated and depend on the specifics of the fault and the connection of components (e.g. a system diagnosed with a missing GPU may be because a cable or an interposer card is not seated correctly). Also, repair time is modulated by work shifts and some details of the logistics in datacenters.

Parts can become bottlenecks too. I remember one time in the early TPUv3 days: we had delays in recovering from a large incident because of false-positive diagnoses of a $4 fan that was widely used in all systems.

Add that nowadays systems are not one host and some attached cards, but have multiple nodes: e.g. the simplest combination is a main compute node and the smart nic / IPU, but some systems can be a lot more complicated.

So these alone are hierarchical Markov chains, with inhomogeneous arrival times and service times that are themselves time-dependent functions. There are also a lot of long-term memory effects; the ensemble average is not necessarily the time average. Chaotic behavior, in the mathematical sense, is common.

Systems are built with field replaceable units (FRUs): e.g. in some generations, you can't swap one GPU in a server that has 8, you have to swap a whole block of them. You can choose when to repair, to maximize TCO/$ based on usage patterns (how many users want all GPUs vs. a smaller number).

Some systems (e.g. TPU pods) have links between accelerator trays, both within a physical rack and across them. So the usefulness of the neighboring systems is reduced while a host/tray is being serviced, and you can have the equivalent of deadlock/livelock in repair dependencies.

Cluster scheduling and management is designed to maximize service levels, minimizing disruptions. This also implies that some disruptions (e.g. when you do some repairs) are driven by what workloads you run. Workloads are power-law distributed in size, so you have small-world network dynamics and the potential for supercritical behavior: both in disruptions (picture jobs preempting other jobs as a graph) and in risks (add to the previous picture one job that triggers a hardware problem).

Multiply these by several thousands to get the size of a datacenter cluster. Add supporting compute+storage, networking, power and thermal constraints. Multiply these by the number of clusters (Some hyperscalers have global scheduling systems, that make it possible to see the whole ML fleet as one). Add rare events, because at this scale you start to think about utility and electrical grid failures.

Figuring out what to do in control systems for the current ML fleet is one problem. Simulating what kind of service the future systems a few generations of hardware, software, datacenter design down the line should provide, so you can define what you can offer, influence the designs and make the right investments... it's more complicated. Both of these two problems are my current job.


This seemed like a big topic when I was interviewing with Meta and nVidia some months ago.

Meta had a few good YouTube videos about the problems of dealing with this many GPUs at scale.


Thank you all for the replies. I picked this one, but the answer is for every other reply. I didn't know about some of these links.

Solving these problems has basically been my job for the past 6 years, for Google's TPU systems and some of the GPU systems (the non-Cloud ones). It is a pity that, after the pandemic, it has been impossible to give similar conference talks at my company (at least if you are employed in Europe, due to restrictions on travel outside one's country).

The Meta sessions are very interesting. I wasn't aware of their work on Arcadia; Nvidia has also their own systems (Nvidia Air / Omniverse) in this area.


Could you link me the YouTube videos/articles in question? It happens to be my research area and I'm interested in knowing how big companies such as meta deal with multi-GPU systems


Mark did a good video on ChatGPT infra.

[1]. https://techcommunity.microsoft.com/t5/microsoft-mechanics-b...


I don't have them bookmarked anymore, but they may have been from this playlist: [0]

[0] https://www.youtube.com/playlist?list=PLBnLThDtSXOw_kePWy3CS...


Thank you for sharing! I'll hunt it down


Mark Russinovich gives a good talk most years on the internals of Azure and the systems that run it. [1] is an example. Look for talks from other years as well.

Meta also publishes a number of papers/blogs/OSS projects on their engineering site [2]

James Hamilton of AWS gives a talk most years on their infrastructure. Worth watching multiple years [3].

[1] https://youtu.be/69PrhWQorEM?si=u7vh_Um6SQNoyeFH

[2] https://engineering.fb.com/category/data-center-engineering/

[3] https://youtu.be/AyOAjFNPAbA?si=nFRJVcQI4EiamC-O


This paper from Microsoft [1] is the coolest thing I've seen in this space. Basically workload (deep learning in this case) level optimization to allow jobs to be resized and preempted.

[1] https://arxiv.org/pdf/2202.07848.pdf


check out openbmc project and DTMF association


DMTF (not DTMF)

https://www.dmtf.org/


I took a course on scientific computing in 2013. It was cross-listed under both the computer science and applied math departments. The issue is that the field is pretty broad overall and a lot of topics were covered in a cursory manner, including anything related to HPC and parallel programming in particular. I don't regret taking the course, but it was too broad for the applications I was pursuing.

I haven't looked at what courses are being offered in several years, but when I was a graduate student, I really would have benefited from a dedicated semester-long course on parallel computing, especially going into the weeds about particular algorithms and data structures in parallel and distributed computing. Those were handled in a super cursory manner in the scientific computing course I took, as if somehow you'd know precisely how to parallelize things the first time you try. I've since learned a lot of this stuff on my own and from colleagues over the years, as many people do in HPC, but books like these would have been invaluable as part of a dedicated semester-long course.


Just amazed at how the author has created (and shared for free) such a comprehensive set of books including teaching C++ and Unix tools! There is something to learn for all Programmers (HPC specific or not) here.

Related: Jorg Arndt's "Matters Computational" book and FXT library - https://www.jjj.de/fxt/


I'm interested in what people think of the approach to teaching C++ used here. Any particular drawbacks?

I'm a very experienced Python programmer with some C, C++ and CUDA doing application level research in HPC environments (ML/DL). I'd really like to level up my C++ skills and looking through book 3 it seems aimed exactly at the right level for me - doesn't move too slowly and teaches best practices (per the author) rather than trying to be comprehensive.


C++ programmer and educator here. This (volume 3) is well-organized, good beginner-level teaching material. You probably know most of it already.

I was looking for range-based for loop, std::array and std::span and happy to see that they are all there.

Because this book relates to HPC, I'd add a few things: Return Value Optimization, move semantics, and in the recursive function section a note about Tail Call Optimization.

As beginner-level material, I can highly recommend it.


That's great - thank-you. Assuming I work through this quickly, what resources would you recommend as a follow-on?


I am not the person you asked the question to, but my recommendation would be;

1) Discovering Modern C++: An Intensive Course for Scientists, Engineers, and Programmers by Peter Gottschling - Not too thick and focuses on how to program in the language.

2) Software Architecture with C++: Design modern systems using effective architecture concepts, design patterns, and techniques with C++20 by Adrian Ostrowski et al. - Shows how to use C++ in the modern way/ecosystems i.e. with CI/CD, Microservices etc.

Optional but highly recommended;

a) Scientific and Engineering C++: An Introduction with Advanced Techniques and Examples by Barton & Nackman - Old pre-Modern C++ book which pioneered many of the techniques which have now become common. One of the best for learning C++ design.


Instead of giving you a list of books I'll give you a list of topics to learn well. They are listed in a proper learning sequence.

- Modern object initialization using {} and ().

- std::string_view

- std::map

- std::stack

- Emplace addition of objects to containers like vector and map.

- Smart pointers (std::unique_ptr, std::shared_ptr and their ilk).

- Ranges library.

- Concurrency support library (std::async, std::future, std::thread, locks and the whole deal).


> This book is notable for its coverage of MPI and OpenMP in both C, Fortran, C++, and (for MPI) Python.

"The Art of HPC", volume 2 > "Parallel Programming for Science Engineering" https://theartofhpc.com/pcse/index.html

FWIW, MPI is only one way to do Python for HPC.

ipyparallel will run MPI jobs over tunnels you create yourself IIRC.

A chapter on dask-scheduler, CuDF, CuGraph (NetworkX), DaskML, and CuPy, and dask-labextension would be more current.

Dask doesn't handle data storage for you, so it's your responsibility to make sure that the data store(s) before each barrier are not the performance bottleneck.

Dask docs > High Performance Computers: https://docs.dask.org/en/stable/deploying-hpc.html

Sources of randomness may be the bottleneck. You don't know until you profile the job across the cluster.

Re: eBPF-based tracing tools: https://news.ycombinator.com/item?id=31688180

And then something about GitOps (and ChatOps), code review and revision, and project resource quotas


I was asked to share a TA role on a graduate course in HPC a decade ago. I turned down the offer.

After a cursory glance, I can honestly say that if this book were available then, I'd have taken the opportunity.

The combination of what I perceive to be Knuth's framing of art, along with carpentry and the need to be a better devops person than your devops person is compelling.

Kudos to the author for such an achievement. UT Austin seems to have achieved in computer science what North Texas State did in music.


UT Austin really is a fantastic institution for HPC and computational methods.


Every BLAS you want to use has at least some connection to UT Austin’s TACC.


Not quite. Every modern BLAS is (likely) based on Kazushige Goto's implementation, and he was indeed at TACC for a while. But probably the best open-source implementation, "BLIS", is from UT Austin, though not connected to TACC.


Oh really? I thought BLIS was from TACC. Oops, mea culpa.


https://github.com/flame/blis/

Field et al, recent winners of the James H. Wilkinson Prize for Numerical Software.

Field and Goto both collaborated with Robert van de Geijn. Lots of TACC interaction in that broader team.


aren't the lapack people in tennessee?


Sort of like BLAS, LAPACK is more than just one implementation. Dongarra described what everybody should do from Tennessee, but other places implemented it elsewhere.


PLASMA and MAGMA are also from there.

I'm not aware of any other significant lapack-related developments, but I might just not know about them.


When I joined a small company supporting the engineers of the HPC of a large car manufacturer, I was surprised to see so many in-house developed scripts around the scheduler (LSF). Only much later, when playing on a private miniature cluster with SLURM myself, did I notice that different versions of the scheduler software were generally incompatible with each other, i.e. one couldn't use one version inside the cluster and another on an external client machine. Hence the need for glue software to inject jobs into the scheduler from outside and retrieve the results later on (IMHO devaluing the scheduler).

I would have thought that after some 30 years of high-performance distributed computing, the requirements were well known and at least the protocol for command and data exchange could be fixed. Apparently not so.


Cluster authentication actually remains the sore point for all interactions - for most systems the least-common denominator remains whether one can SSH into a login node. At which point the main job commands - sbatch/squeue - are pretty stable.

There have been some attempts to standardize basic job management APIs in the past, DRMAA being one noteworthy example. But DRMAA v2 was only ever implemented by Grid Engine, is effectively a lightly-abstracted version of their internal APIs, and has never really seen first-class adoption by Slurm/PBS/LSF.

For Slurm, the REST API is meant to be the way forward. It punts the authentication problem to, potentially, anything the admins may care to wire up through an Apache / NGINX proxy. And the basic job submission and status APIs have stabilized to the point that a client application should be able to consume nearly any version going forward.


Is there something wrong with the GitHub files? I cannot render any of the textbooks' PDF files.

https://github.com/VictorEijkhout/TheArtofHPC_pdfs/blob/main...


I think the files are too large to render in the github browser and they give an error. You can pick the 'download raw' option to download locally and read the file. Worked for me.


I just "git clone https://github.com/VictorEijkhout/TheArtofHPC_pdfs.git" on my local drive. Had it all in under a minute.


There is some really good content here for any programmer.

And with volume 3, such a contrast: the author teaches C++17 and... Fortran2008.




