This is in the article: if you aren't using the tensor cores, you aren't utilizi...

wtallis · 2024-05-13T07:08:13 1715584093

Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the die area devoted to non-tensor compute and higher precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).

So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.

smallmancontrov · 2024-05-13T14:05:18 1715609118

Mind the Dark Silicon Fraction.

Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.

wtallis · 2024-05-13T14:32:10 1715610730

That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?
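
A rough sketch of the "slower but wider" trade-off, for illustration only. It assumes dynamic power per unit scales as C * V^2 * f and that voltage scales linearly with frequency, so per-unit power goes as f^3. It ignores leakage, minimum operating voltage, and die area cost. The function names and the normalized power budget are hypothetical, not taken from any real chip.

```python
def unit_power(freq, base_power=1.0):
    """Relative dynamic power of one compute unit.

    Assumption: P ~ C * V^2 * f with V proportional to f, so P ~ f^3.
    freq is relative to a baseline clock of 1.0.
    """
    return base_power * freq ** 3


def units_within_budget(freq, power_budget):
    """How many units at this clock fit in the power budget."""
    return int(power_budget // unit_power(freq))


def throughput(freq, power_budget):
    """Relative throughput for an embarrassingly parallel workload:
    number of units times per-unit clock."""
    return units_within_budget(freq, power_budget) * freq


if __name__ == "__main__":
    budget = 1.0  # power of one unit at the baseline clock
    for f in (1.0, 0.5):
        print(f, units_within_budget(f, budget), throughput(f, budget))
```

Under these simplified assumptions, halving the clock lets eight units run in the power envelope of one, for four times the throughput. In practice the gain is smaller, since voltage cannot drop indefinitely and the extra units cost die area.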

incrudible · 2024-05-13T10:03:50 1715594630

There is not a lot of fixed function left in the modern graphics pipeline; economies of scale dictate that there is no net benefit in trimming it.

wtallis · 2024-05-13T14:43:27 1715611407

And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.

Sharlin · 2024-05-13T08:21:11 1715588471

On the H100 specifically. The figure is likely different on consumer cards.