SRAM also requires more power than DRAM. The simple, regular structure of SRAM arrays (compared to other logic) makes it possible to get good yields through redundancy and error correction codes, so you could in principle build giant monolithic dies, but information still can't exceed the speed of light in a medium. There just isn't enough time for signals to propagate across a big die containing gigabytes of SRAM to get the latency you expect of an L3 cache; in relative terms, most of those gigabytes are far away. Moving all that data around to do the computations without caching would also be terribly wasteful given how much energy it takes just to move the data. Instead you would probably end up with something closer to the compute-in-memory concept: map computation to ALUs close to the data, with an at-least-two-tier network (on-die, inter-die) to support reductions.
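To put some rough numbers on the propagation argument (every figure here is an assumed order-of-magnitude estimate, not a measured spec): take a roughly reticle-limited die and a typical repeatered global wire delay, and the round-trip flight time alone already eats the whole L3 latency budget.

    # Back-of-the-envelope sketch, all numbers assumed for illustration:
    # - a reticle-limited die is on the order of ~800 mm^2 (~28 mm per side)
    # - repeatered global on-chip wires run on the order of ~100 ps/mm,
    #   far slower than light in vacuum (~3.3 ps/mm)
    die_side_mm          = 28    # assumed edge length of a large SRAM die
    wire_delay_ps_per_mm = 100   # assumed repeatered global wire delay
    l3_latency_ns        = 10    # assumed L3 target (~40 cycles at 4 GHz)

    # Worst case: corner to corner and back, Manhattan distance.
    round_trip_mm = 2 * 2 * die_side_mm
    wire_delay_ns = round_trip_mm * wire_delay_ps_per_mm / 1000

    print(f"wire round trip: {wire_delay_ns:.1f} ns vs L3 target {l3_latency_ns} ns")
    # ~11 ns of pure wire flight time, before any decode/sense/tag lookup,
    # which is why L3-like latency doesn't scale to gigabyte-sized SRAM dies.

Under those assumptions you're at roughly 11 ns of wire delay alone, so the far corners of such a die simply can't be reached at L3-like latency no matter how the array is organized.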
Oh yeah, this would definitely be something like an L4 cache rather than an L3 as in AMD's X3D CPUs. The expectation is that it would serve as an alternative to DRAM (or a supplement to it), kind of like what Xeon Phi did.