Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the dies area devoted to non-tensor compute and higher precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).
So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.
Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.
That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?
And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.