Sometimes things really are compute bound, and sometimes you get a "big" workload that still fits nicely in the GPU's L2. Generative AI is mostly at the far end of "memory bound."
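As a rough roofline back-of-the-envelope for why batch-1 generative inference lands so far on the memory-bound side (the spec numbers below are approximate, assumed values, not measurements):

    # Roofline back-of-envelope: is batch-1 LLM decoding compute- or
    # memory-bound on an H100/H200-class GPU? Numbers are rough assumptions.
    peak_flops = 990e12        # ~990 TFLOPS dense FP16/BF16 tensor throughput
    bandwidths = {"H100 (HBM3)": 3.35e12, "H200 (HBM3e)": 4.8e12}  # bytes/sec

    # Ridge point: FLOPs-per-byte above which a kernel becomes compute-limited.
    for name, bw in bandwidths.items():
        print(f"{name}: ridge at ~{peak_flops / bw:.0f} FLOPs/byte")

    # Batch-1 decoding is mostly matrix-vector products: each fp16 weight
    # (2 bytes) streamed from HBM does ~2 FLOPs (multiply + add), i.e.
    # ~1 FLOP/byte -- two orders of magnitude below the ridge point.
    print(f"batch-1 decode: ~{2 / 2:.0f} FLOP/byte")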
Some ML startups (like Graphcore) seemed to bet on large caches, sparsity and clever preprocessing instead of raw memory bandwidth, but I think their strategy was compromised when model sizes exploded. Even Cerebras was kinda caught off guard when their 40GB pizza was suddenly kind of cramped.
Current ML architectures tend to be heavily optimized for ease of very large scale parallelism in training, even at the expense of a bigger model size and compute cost. So there may be some hope for different architectures as we stop treating idle GPUs as being basically available for free and start budgeting more strictly for what we use.
In hlb-CIFAR10 the MaxPooling ops are the slowest kernels now and take longer than all 7 of the convolution operations combined, if I understand correctly.
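I haven't re-profiled hlb-CIFAR10 to confirm that, but a profiler sketch along these lines (toy conv+pool stack with made-up shapes, not the actual hlb-CIFAR10 network) is how you'd check which kernels dominate:

    # Minimal torch.profiler sketch: do the max_pool2d kernels really take
    # longer than the convs? The module below is a toy stand-in.
    import torch
    from torch.profiler import profile, ProfilerActivity

    net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1),
        torch.nn.GELU(),
        torch.nn.MaxPool2d(2),
    ).cuda().half()
    x = torch.randn(512, 3, 32, 32, device="cuda", dtype=torch.half)

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            net(x)
        torch.cuda.synchronize()

    # Sort by total GPU time to compare pooling vs convolution kernels.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))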
Memory-bound operations seem to rather consistently be the limiting factor in my personal ML research work, at least. It can be rather annoying!
The memory makers bump up the speeds the memory itself is capable of through manufacturing improvements, and I guess the H100's memory controller has some headroom to accept the faster memory.
Is the error rate due to quantum tunneling at these feature sizes still a fundamental limit on transistor density, and thus also on (G)DDR and HBM performance per unit area, volume, and charge?
https://news.ycombinator.com/item?id=38056088 ; a new QC and maybe in-RAM computing architecture like HBM-PM: maybe glass on quantum dots in synthetic DNA, and then still wave function storage and transmission; scale the quantum interconnect
My understanding is that, while quantum tunneling defines a fundamental limit to the miniaturization of silicon transistors, we are still not really near that limit. The more pressing limits are around figuring out how to get the EUV light to consistently draw denser and denser patterns correctly.
>> But now, the researchers from Southampton, together with scientists from the universities of Dortmund and Regensburg in Germany, have successfully demonstrated that a beam of light can not only be confined to a spot that is 50 times smaller than its own wavelength but also “in a first of its kind” the spot can be moved by minuscule amounts at the point where the light is confined
FWIU, quantum tunneling is regarded as an error to be eliminated in digital computers, but it may be a sufficient quantum computing component: cause electron-electron wave function interaction and measure. But there is only a zero-or-1 readout in adjacent RAM transistors. Lol, "Rowhammer for qubits."
For anyone wondering how this applies to big LLMs: 144GB is big, but you'd need roughly double that just to hold GPT-3-class (175B-parameter) weights in fp16, and several times more again for gradients and optimizer state if you want to train with everything in memory at once.
Of course, even if 300GB GPUs were available tomorrow and you sold a million-dollar house to buy as many as that would allow, it'd still take years to train once.
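Just to put rough numbers on both points (the parameter count, token count, per-GPU price, and utilization below are illustrative assumptions, not official figures):

    # Back-of-envelope for a GPT-3-class model. All constants are rough
    # assumptions for illustration only.
    params = 175e9          # assumed GPT-3-class parameter count
    tokens = 300e9          # assumed training tokens

    # Holding everything in memory for mixed-precision Adam training:
    # fp16 weights + fp16 grads + fp32 master weights + two fp32 moments
    # ~= 16 bytes/param (activations not counted).
    print(f"fp16 weights alone: {params * 2 / 1e9:.0f} GB")
    print(f"weights + grads + optimizer: {params * 16 / 1e9:.0f} GB")

    # Wall-clock: ~6 FLOPs per parameter per token, on GPUs assumed to
    # sustain ~40% of ~1 PFLOP/s peak, bought at an assumed ~$30k each
    # with a $1M budget. Real utilization on a cluster this small (heavy
    # model parallelism plus offloading) would be far lower.
    gpus = 1_000_000 // 30_000
    seconds = 6 * params * tokens / (gpus * 1e15 * 0.4)
    print(f"~{seconds / 86400 / 365:.1f} years on {gpus} GPUs (optimistic)")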
They are also doing a custom 4 chip Hopper GPU for a European HPC initiative. Seems like a good move. Basically allows you to have 4 H100s connected via NVLink without needing a separate SXM board.
This is a single-chip H100 NVL. Both are GH100s with the same tweaked, 20% wider 6144-bit memory bus (versus 5120-bit on other H100s), running at a higher speed.
The HBM3e loadout is slightly different from what the H100 NVL's was going to be, but this definitely seems like a higher-binned H100. It's basically as if AMD had shipped the 7900 XT and only later started selling the 7900 XTX: same chip, but they brought up all the memory controllers on this one.
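For the curious, the headline bandwidth figures fall out of bus width times per-pin data rate; the pin rates below are approximate assumptions:

    # bandwidth (bytes/sec) = bus width in bits / 8 * per-pin rate (Gbps).
    # Pin rates here are approximate assumptions, not official figures.
    def hbm_bandwidth(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin * 1e9

    h100_sxm = hbm_bandwidth(5120, 5.2)   # 5 HBM3 stacks enabled
    h200     = hbm_bandwidth(6144, 6.25)  # all 6 stacks, faster HBM3e pins

    print(f"H100 SXM: ~{h100_sxm / 1e12:.2f} TB/s")   # ~3.33 TB/s
    print(f"H200:     ~{h200 / 1e12:.2f} TB/s")       # ~4.80 TB/s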
https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-acc...
This is an H100 141GB, not new silicon like the Nvidia page might lead one to believe.