Sometimes things really are compute bound, and sometimes you get a "big" workload that still fits nicely in the GPU's L2. Generative AI is mostly at the far end of "memory bound."
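As a rough roofline back-of-the-envelope for why batch-1 generative inference lands so far on the memory-bound side (the spec numbers below are approximate, assumed values, not measurements):

    # Roofline back-of-envelope: is batch-1 LLM decoding compute- or
    # memory-bound on an H100/H200-class GPU? Numbers are rough assumptions.
    peak_flops = 990e12        # ~990 TFLOPS dense FP16/BF16 tensor throughput
    bandwidths = {"H100 (HBM3)": 3.35e12, "H200 (HBM3e)": 4.8e12}  # bytes/sec

    # Ridge point: FLOPs-per-byte above which a kernel becomes compute-limited.
    for name, bw in bandwidths.items():
        print(f"{name}: ridge at ~{peak_flops / bw:.0f} FLOPs/byte")

    # Batch-1 decoding is mostly matrix-vector products: each fp16 weight
    # (2 bytes) streamed from HBM does ~2 FLOPs (multiply + add), i.e.
    # ~1 FLOP/byte -- two orders of magnitude below the ridge point.
    print(f"batch-1 decode: ~{2 / 2:.0f} FLOP/byte")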
Some ML startups (like Graphcore) seemed to bet on large caches, sparsity and clever preprocessing instead of raw memory bandwidth, but I think their strategy was compromised when model sizes exploded. Even Cerebras was kinda caught off guard when their 40GB pizza was suddenly kind of cramped.
Current ML architectures tend to be heavily optimized for ease of very large scale parallelism in training, even at the expense of a bigger model size and compute cost. So there may be some hope for different architectures as we stop treating idle GPUs as being basically available for free and start budgeting more strictly for what we use.
In hlb-CIFAR10 the MaxPooling ops are the slowest kernels now and take longer than all 7 of the convolution operations combined, if I understand correctly.
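I haven't re-profiled hlb-CIFAR10 to confirm that, but a profiler sketch along these lines (toy conv+pool stack with made-up shapes, not the actual hlb-CIFAR10 network) is how you'd check which kernels dominate:

    # Minimal torch.profiler sketch: do the max_pool2d kernels really take
    # longer than the convs? The module below is a toy stand-in.
    import torch
    from torch.profiler import profile, ProfilerActivity

    net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1),
        torch.nn.GELU(),
        torch.nn.MaxPool2d(2),
    ).cuda().half()
    x = torch.randn(512, 3, 32, 32, device="cuda", dtype=torch.half)

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            net(x)
        torch.cuda.synchronize()

    # Sort by total GPU time to compare pooling vs convolution kernels.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))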
Memory-bound operations seem to rather consistently be the limiting factor in my personal ML research work, at least. It can be rather annoying!
The memory makers bump up the speeds the memory itself is capable of through manufacturing improvements, and I guess the H100's memory controller has some headroom to accept the faster memory.
Is the error rate due to quantum tunneling at these feature sizes still a fundamental limit on transistor density, and thus also on (G)DDR and HBM performance per unit area, volume, and charge?
https://news.ycombinator.com/item?id=38056088 ; a new QC and maybe in-RAM computing architecture like HBM-PM: maybe glass on quantum dots in synthetic DNA, and then still wave function storage and transmission; scale the quantum interconnect
My understanding is that, while quantum tunneling defines a fundamental limit to the miniaturization of silicon transistors, we are still not really near that limit. The more pressing limits are around figuring out how to get the EUV light to consistently draw denser and denser patterns correctly.
>> But now, the researchers from Southampton, together with scientists from the universities of Dortmund and Regensburg in Germany, have successfully demonstrated that a beam of light can not only be confined to a spot that is 50 times smaller than its own wavelength but also “in a first of its kind” the spot can be moved by minuscule amounts at the point where the light is confined
FWIU, quantum tunneling is regarded as an error to be eliminated in digital computers, but it may be a sufficient quantum computing component: cause electron-electron wave function interaction and measure. But there is only a zero-or-1 readout in adjacent RAM transistors. Lol, "Rowhammer for qubits."
For anyone wondering how this applies to big LLMs: 144GB is big, but you'd need roughly double that just to hold GPT-3-class (175B-parameter) weights in fp16, and several times more again for gradients and optimizer state if you want to train with everything in memory at once.
Of course, even if 300GB GPUs were available tomorrow and you sold a million-dollar house to buy as many as that would allow, it'd still take years to train once.
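Just to put rough numbers on both points (the parameter count, token count, per-GPU price, and utilization below are illustrative assumptions, not official figures):

    # Back-of-envelope for a GPT-3-class model. All constants are rough
    # assumptions for illustration only.
    params = 175e9          # assumed GPT-3-class parameter count
    tokens = 300e9          # assumed training tokens

    # Holding everything in memory for mixed-precision Adam training:
    # fp16 weights + fp16 grads + fp32 master weights + two fp32 moments
    # ~= 16 bytes/param (activations not counted).
    print(f"fp16 weights alone: {params * 2 / 1e9:.0f} GB")
    print(f"weights + grads + optimizer: {params * 16 / 1e9:.0f} GB")

    # Wall-clock: ~6 FLOPs per parameter per token, on GPUs assumed to
    # sustain ~40% of ~1 PFLOP/s peak, bought at an assumed ~$30k each
    # with a $1M budget. Real utilization on a cluster this small (heavy
    # model parallelism plus offloading) would be far lower.
    gpus = 1_000_000 // 30_000
    seconds = 6 * params * tokens / (gpus * 1e15 * 0.4)
    print(f"~{seconds / 86400 / 365:.1f} years on {gpus} GPUs (optimistic)")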
They are also doing a custom 4 chip Hopper GPU for a European HPC initiative. Seems like a good move. Basically allows you to have 4 H100s connected via NVLink without needing a separate SXM board.
This is a single-chip H100 NVL. Both are GH100s with the same tweaked, 20% wider 6144-bit memory bus (versus 5120-bit on other H100s), running at a higher speed.
The HBM3e loadout is slightly different from what the H100 NVL's was going to be, but this definitely seems like a higher-binned H100. It's basically as if AMD had shipped the 7900 XT and only later started selling the 7900 XTX: same chip, but they brought up all the memory controllers on this one.
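For the curious, the headline bandwidth figures fall out of bus width times per-pin data rate; the pin rates below are approximate assumptions:

    # bandwidth (bytes/sec) = bus width in bits / 8 * per-pin rate (Gbps).
    # Pin rates here are approximate assumptions, not official figures.
    def hbm_bandwidth(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin * 1e9

    h100_sxm = hbm_bandwidth(5120, 5.2)   # 5 HBM3 stacks enabled
    h200     = hbm_bandwidth(6144, 6.25)  # all 6 stacks, faster HBM3e pins

    print(f"H100 SXM: ~{h100_sxm / 1e12:.2f} TB/s")   # ~3.33 TB/s
    print(f"H200:     ~{h200 / 1e12:.2f} TB/s")       # ~4.80 TB/s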
https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-acc...
This is an H100 141GB, not new silicon like the Nvidia page might lead one to believe.