With the cascade of different clock domains on a core and package, the control loops can spend that thermal budget effectively elsewhere; cheap idling is one of the benefits of CMOS, since logic that isn't switching draws little dynamic power.
Perhaps most weirdly, we've reached the point where that is actually desirable.
Power and heat are the limit now, and slapping on some extra circuitry that is only used for some operations makes the chip perform better.
I believe GP was counting the transistors in DRAM, not only those on the CPU.
If you really care about high performance, the ideal is to never have to wait for DRAM, whether you get there through predictive fetches or explicit cache warming. For that, the more cache you have, the better.
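
For anyone wondering what explicit cache warming looks like in practice, here is a minimal sketch, assuming GCC or Clang (`__builtin_prefetch` is a compiler builtin, not standard C). The function name `gather_sum` and the `PREFETCH_DISTANCE` value are made up for illustration; the right distance depends on the machine and has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * The best value depends on memory latency and work per iteration. */
#define PREFETCH_DISTANCE 16

/* Sum table[idx[i]] for i in [0, n).
 * Indirect accesses like this are hard for hardware prefetchers to
 * predict, so we issue an explicit hint for a line we will need soon. */
uint64_t gather_sum(const uint32_t *table, const size_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, rw = 0 (read), locality = 3 (keep in cache). */
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, and whether it actually helps depends on the access pattern and on how much cache there is to hold the warmed lines.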