
The H200 GPU die is the same as the H100's, but it uses a full set of faster 24GB HBM3e memory stacks:

https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-acc...

This is an H100 141GB, not new silicon like the Nvidia page might lead one to believe.
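
Rough napkin math on the memory change; the capacities and bandwidths below are approximate public figures, not exact spec-sheet values:

    # Approximate H100-vs-H200 HBM comparison; all numbers are rough public figures.
    specs = {
        "H100 SXM": {"stacks": 5, "gb_per_stack": 16, "tb_per_s": 3.35},
        "H200":     {"stacks": 6, "gb_per_stack": 24, "tb_per_s": 4.8},
    }
    for name, s in specs.items():
        raw_gb = s["stacks"] * s["gb_per_stack"]
        print(f"{name}: ~{raw_gb} GB raw HBM, ~{s['tb_per_s']} TB/s")
    # The H200 advertises 141 GB usable out of the 144 GB raw.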




It is remarkable how much GPU compute is limited by memory speed.


Depends on the workload.

Sometimes things really are compute bound, and sometimes you get a "big" workload that still fits nicely in the GPU's L2. Generative AI is mostly at the far end of "memory bound."

Some ML startups (like Graphcore) seemed to bet on large caches, sparsity, and clever preprocessing instead of raw memory bandwidth, but I think that strategy was undermined when model sizes exploded. Even Cerebras was caught somewhat off guard when its 40GB pizza suddenly felt cramped.
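
To put "memory bound" in rough numbers, here's a minimal roofline sketch; the ~1000 TFLOPS dense BF16 and ~3.35 TB/s peaks are assumed round H100-class figures, not exact specs:

    # Minimal roofline check: is a workload compute- or memory-bound?
    PEAK_TFLOPS = 1000.0   # assumed dense BF16 peak, H100-class
    PEAK_TB_S = 3.35       # assumed HBM bandwidth, H100-class

    ridge = PEAK_TFLOPS / PEAK_TB_S   # FLOPs per byte needed to saturate compute

    # Batch-1 LLM decode is roughly a GEMV: ~2 FLOPs and ~2 bytes per weight,
    # i.e. about 1 FLOP/byte, far below the ridge point.
    decode_intensity = 2 / 2
    bound = "compute" if decode_intensity > ridge else "memory"
    print(f"ridge ~{ridge:.0f} FLOP/byte; decode ~{decode_intensity:.0f} FLOP/byte -> {bound} bound")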


Current ML architectures tend to be heavily optimized for ease of very-large-scale parallelism in training, even at the expense of bigger model sizes and higher compute cost. So there may be some hope for different architectures as we stop treating idle GPUs as basically free and start budgeting more strictly for what we use.


In hlb-CIFAR10, the MaxPool ops are now the slowest kernels and take longer than all seven convolution operations combined, if I understand correctly.

Memory-bound operations are pretty consistently the limiting factor in my own ML research work, at least. It can be rather annoying!
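
If anyone wants to see the effect on their own box, here's a rough PyTorch timing sketch; it assumes torch is installed (CUDA if available), and the shapes are arbitrary rather than hlb-CIFAR10's actual ones:

    # Rough timing comparison of a conv vs. a max-pool on the same tensor.
    import time
    import torch

    dev = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(512, 64, 32, 32, device=dev)
    conv = torch.nn.Conv2d(64, 64, 3, padding=1).to(dev)
    pool = torch.nn.MaxPool2d(2)

    def bench(fn, iters=50):
        with torch.no_grad():
            fn(x)  # warmup
            if dev == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(iters):
                fn(x)
            if dev == "cuda":
                torch.cuda.synchronize()
            return (time.perf_counter() - t0) / iters

    print(f"conv: {bench(conv) * 1e3:.3f} ms/iter, maxpool: {bench(pool) * 1e3:.3f} ms/iter")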


In many cases it's the same for CPUs. AMD's CPUs with bigger caches, thanks to die-stacked SRAM (3D V-Cache), are in a different league when it comes to perf.


What would make [HBM3E] GPU memory faster?

High Bandwidth Memory > HBM3E: https://en.wikipedia.org/wiki/High_Bandwidth_Memory#HBM3E


Compared to HBM3, you mean?

The memory makers bump up the speed the memory itself is capable of through manufacturing improvements, and the H100's memory controller apparently has enough headroom to accept the faster parts.
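
Back-of-envelope, bandwidth is roughly bus width times per-pin data rate, so both the wider interface and the faster pins matter; the pin rates below are assumed approximations:

    # Bandwidth ~ (bus width in bits / 8) * per-pin rate in Gb/s -> GB/s.
    def hbm_bandwidth_gb_s(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin

    print(hbm_bandwidth_gb_s(5120, 5.2))   # ~3300 GB/s, H100 SXM-class HBM3 (assumed pin rate)
    print(hbm_bandwidth_gb_s(6144, 6.25))  # ~4800 GB/s, H200-class HBM3e (assumed pin rate)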


More technically, I suppose.

Is the error rate due to quantum tunneling at these few-nanometer feature sizes still a fundamental limit to transistor density, and thus also to (G)DDR and HBM performance per unit area, volume, and charge?

https://news.ycombinator.com/item?id=38056088 ; a new QC and maybe in-RAM computing architecture like HBM-PM: maybe glass on quantum dots in synthetic DNA, and then still wave function storage and transmission; scale the quantum interconnect

Is melamine too slow for >= HBM RAM?


My understanding is that while quantum tunneling defines a fundamental limit to the miniaturization of silicon transistors, we are still not really near that limit. The more pressing limits are around getting EUV lithography to consistently draw denser and denser patterns correctly.


From https://news.ycombinator.com/item?id=35380902 :

> Optical tweezers: https://en.wikipedia.org/wiki/Optical_tweezers

> "'Impossible' photonic breakthrough: scientist manipulate light at subwavelength scale" https://thedebrief.org/impossible-photonic-breakthrough-scie... :

>> But now, the researchers from Southampton, together with scientists from the universities of Dortmund and Regensburg in Germany, have successfully demonstrated that a beam of light can not only be confined to a spot that is 50 times smaller than its own wavelength but also “in a first of its kind” the spot can be moved by minuscule amounts at the point where the light is confined

FWIU, quantum tunneling is regarded as an error source to be eliminated in digital computers, but it may be a sufficient quantum computing component: cause electron-electron wave function interaction and measure it. But adjacent RAM transistors only give you a 0-or-1 readout. Lol, "Rowhammer for qubits."


"HBM4 in Development, Organizers Eyeing Even Wider 2048-Bit Interface" (2023) https://news.ycombinator.com/item?id=37859497


For anyone wondering how this applies to big LLMs: 141GB is big, but you'd need roughly double that to train GPT-3.x with everything held in memory at once.

Of course, even if 300GB GPUs were available tomorrow, and you sold a million-dollar house to buy as many as that would allow, it'd still take years to train once.
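
Napkin math behind that, with assumed round numbers (a 175B-parameter GPT-3-class model, ~300B training tokens, the common 6*N*D FLOPs estimate, and a guess at sustained throughput):

    # Every constant here is an assumption, for scale intuition only.
    params = 175e9
    print(f"fp16 weights alone: ~{params * 2 / 1e9:.0f} GB")   # ~350 GB

    train_tokens = 300e9
    flops = 6 * params * train_tokens        # rough 6*N*D training-FLOPs estimate
    sustained_tflops = 400                   # optimistic sustained BF16 per GPU
    n_gpus = 30                              # very roughly what ~$1M of GPUs buys
    days = flops / (n_gpus * sustained_tflops * 1e12) / 86400
    print(f"~{days:.0f} days on {n_gpus} GPUs under these assumptions")  # ~300 days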



I dunno. But that's a dual-GPU product, so it's not really 180GB.


They are also doing a custom four-chip Hopper GPU for a European HPC initiative. Seems like a good move; it basically lets you have four H100s connected via NVLink without needing a separate SXM board.


I'm holding out for 8 x MI300x on a single board. =)


This is essentially a single-chip H100 NVL. Both are GH100s with the same 20% wider 6144-bit HBM interface (versus 5120-bit on other H100s), running at higher speed.

The memory loadout is slightly different from what the H100 NVL's was going to be, but this definitely looks like a higher-binned H100. It's basically as if AMD had shipped the 7900 XT and then later started selling the 7900 XTX: same chip, but with all the memory controllers brought up on this one.




