Cache size is constrained by different things at different levels.
Cache (SRAM) uses 6 transistors per bit, while normal RAM (DRAM) uses 1 transistor and 1 capacitor. You can buy a LOT more RAM per dollar than you can buy cache. This has been further exacerbated by poor SRAM scaling with each new fabrication node: typically 20-30% per shrink instead of the 60-80% other transistors get, and actually 0% for TSMC's upcoming N3 node. AMD chose to fab their RDNA 3 GPU die on 5nm but their SRAM-heavy cache dies on 6nm, because the cost savings were much bigger than the effect of a slightly larger die.
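To make the density gap concrete, here's a minimal back-of-the-envelope sketch (the 6T/1T1C cell structures are from the point above; the rest is just arithmetic):

```python
# Transistor budget for 1 GiB of SRAM vs DRAM, using the classic
# 6T SRAM cell and 1T1C DRAM cell mentioned above.
GIB_BITS = 8 * 2**30

sram_transistors = 6 * GIB_BITS   # 6 transistors per bit
dram_transistors = 1 * GIB_BITS   # 1 transistor (plus 1 capacitor) per bit

print(f"SRAM: {sram_transistors:.2e} transistors per GiB")
print(f"DRAM: {dram_transistors:.2e} transistors per GiB")
# ~6x the transistors per bit, before you even account for SRAM
# scaling worse than logic on newer nodes.
```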
DRAM typically has a maximum cycle speed of around 300-500MHz because the capacitors have restrictive charge/discharge rates. When you see something like 6400MT/s, that's not random-access speed; it's the aggregate speed of slowly reading thousands of cells in parallel into a buffer, then quickly sending the results over the wire.
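A rough sketch of how those two numbers reconcile (DDR5's 16n prefetch is real; treating the headline rate divided by the prefetch depth as the internal array clock is the back-of-the-envelope part):

```python
# DDR5-6400: the interface streams wide prefetched bursts; the
# cells themselves cycle far slower.
transfer_rate_mt_s = 6400
prefetch_depth = 16            # DDR5 uses a 16n prefetch architecture

internal_array_mhz = transfer_rate_mt_s / prefetch_depth
print(f"Implied internal array clock: {internal_array_mhz:.0f} MHz")
# -> 400 MHz, right in the 300-500MHz range quoted above.
```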
Once you move away from bulk storage into high-performance caches, you have to discuss speed, latency, and associativity. As a cache gets bigger, keeping lookup times low requires more and more transistors to control the cache and keep it coherent with both the CPU cores and main memory. If you want to make that controller smaller, it means lowering clockspeeds, altering latencies, changing associativity, etc. At a physical level, faster caches also require higher-performance transistors, which need more die space and power per transistor.
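Here's a minimal sketch of where associativity shows up in a lookup; the sizes are illustrative, not any particular CPU:

```python
# Set-associative lookup: the address splits into tag / set index /
# line offset, and the tag is compared against every way in the set.
CACHE_BYTES = 32 * 1024   # illustrative 32 KB cache
LINE_BYTES = 64
WAYS = 8

SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 64 sets

def split_address(addr: int):
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset

print(split_address(0xDEADBEEF))
# Doubling CACHE_BYTES at fixed WAYS doubles SETS (bigger arrays,
# longer wires); doubling WAYS instead doubles the parallel tag
# comparators per lookup. Either way: more transistors, power, or latency.
```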
Sometimes, the tradeoffs aren't what you might expect. A great example is AMD moving from 64 KB of I-cache in Zen 1 to 32 KB in Zen 2, 3, and 4, because they found the tighter latencies were more important in their uarch than the higher hit rate.
There's a further constraint for AMD's 3D V-Cache chips, where the stacked cache reduces heat transfer and forces lower CPU clockspeeds; unfortunately, there's not much room on the substrate to add the cache beside the CPU instead. Infinity Cache on RDNA 3 also uses a significant portion of its die space for the super-fast interconnect with the GPU die, and I suspect a similar thing happens here too.
AFAIK, power and speed. There's an interesting paper on this topic: [Cache Design Trade-offs for Power and Performance Optimization: A Case Study](http://iacoma.cs.uiuc.edu/CS497/LP5a.pdf).
Surprised there are three answers that don't mention area. The larger a die is, the more expensive it is to produce, since the chance of getting a defect goes up. Wafer-scale integration works by providing sufficient redundancy and/or routing to avoid damaged areas, but that too has a cost.
Die space. There's a maximum reticle size. You can't make chips bigger than that without using multi-chip techniques like wafer scale integration, chiplets and so on.
If you do a full-reticle chip and use most of it for memory, you can get about 1 GB of SRAM. But that would also be extremely expensive.
It's not quite as bad as you might think (though it is bad).
TSMC N5 is 0.021 µm² per SRAM cell in the high-density configuration (still going to switch 5-10x faster than DRAM). That's about 47.62 bits/µm², and there are 1,000,000 µm² per mm². The EUV reticle limit is 858 mm².
That amounts to 40,857,142,857 bits, or about 4.756 GB. That's almost exactly 250 dies per wafer. TSMC N5 wafers cost $17,000, or roughly $68 per chip (you can get very close to 100% yields on this kind of very basic chip). If we use part of the chip for internal controllers and interconnects, it would be about $70 per 2-4 GB.
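Redoing the capacity arithmetic from the figures above (cell area and reticle limit as quoted; this is the same math in checkable form):

```python
# Capacity of a full-reticle die built almost entirely from
# high-density SRAM, using the numbers quoted above.
CELL_UM2 = 0.021          # TSMC N5 HD SRAM cell area, um^2 per bit
RETICLE_MM2 = 858         # EUV reticle limit
UM2_PER_MM2 = 1_000_000

bits = RETICLE_MM2 * UM2_PER_MM2 / CELL_UM2
gib = bits / 8 / 2**30
print(f"{bits:.3e} bits ~= {gib:.2f} GiB per full-reticle die")
# -> ~4.09e10 bits, ~4.76 GiB, matching the figures above.
```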
The highest-clocked DDR5 RAM I could find with minimal searching was $360 for 32GB.
The SRAM version would cost $560 at the cheapest for just the 32GB of RAM, not counting the RAM controller chips and all the packaging costs. With profits and everything included, the final price tag would likely be around $750 if you could produce in bulk and $1,000 if you could not.
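For what it's worth, the chip-count math behind that number (all the dollar figures are this thread's estimates, not vendor quotes):

```python
# 32 GB built from ~4 GB-usable SRAM dies at the estimated cost above.
COST_PER_DIE_USD = 70      # estimate from above, per 2-4 GB usable
USABLE_GB_PER_DIE = 4      # taking the optimistic end of that range
TARGET_GB = 32
DDR5_KIT_USD = 360         # the DDR5 price quoted above

dies = TARGET_GB // USABLE_GB_PER_DIE          # 8 dies
raw_silicon_usd = dies * COST_PER_DIE_USD      # $560
print(f"{dies} dies, ~${raw_silicon_usd} in raw silicon vs ${DDR5_KIT_USD} for DDR5")
```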
On the plus side, worrying about stuff like CAS latency would basically be a thing of the past, with the bottleneck once again becoming the speed over the wire. Power consumption would also be lower without the need to constantly refresh. I doubt anyone wants to pay that much money for a disproportionately marginal increase in performance, but looking at all the people forking over thousands to scalpers for GPUs, maybe I'm wrong.
You might want to check your maths on that: it's more like 30 maximum-reticle-size dies per wafer, IIRC. Just the area-based limit is pi * 150^2 / 858 ≈ 82, but you obviously lose a load because dies are rectangular and wafers are circular.
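The area-only bound, plus why rectangular packing drags it down (the ~26 mm x 33 mm reticle shape is the usual figure; exact packing losses depend on edge exclusion and scribe lines):

```python
# Upper bound on full-reticle dies per 300 mm wafer, by area alone.
import math

WAFER_RADIUS_MM = 150
RETICLE_MM2 = 858          # roughly 26 mm x 33 mm

area_bound = math.pi * WAFER_RADIUS_MM**2 / RETICLE_MM2
print(f"Area-only bound: {area_bound:.0f} dies")   # ~82
# Tiling ~26x33 mm rectangles inside a circle, minus edge exclusion,
# lands you in the tens of dies, consistent with the ~30 figure.
```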
SRAM doesn't need to be refreshed constantly like DRAM, so SRAM uses LESS power when not switching. We think of it as power-hungry because something like L1 cache is CONSTANTLY switching. It should be noted that the power cost per GB transferred still isn't worse than DRAM; there's just way more data moving through it.
Once you reach L3/L4 cache, it's not switching nearly as often, and the actual total power becomes way less than what you'd spend sending the data over from RAM. This is one reason why AMD saved so much power with Infinity Cache on RDNA 2, and why Nvidia has now followed with much larger caches of its own.
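A deliberately rough sketch of that tradeoff. The pJ/bit figures below are placeholder assumptions for illustration only; the commonly cited point is just that an off-chip DRAM access costs an order of magnitude or more per bit than a large on-die cache hit:

```python
# Illustrative energy comparison for 1 GiB of memory traffic with and
# without a large last-level cache. Both pJ/bit values are assumed
# ballpark numbers, not measurements.
PJ_PER_BIT_SRAM_LLC = 1.0    # assumed on-die large-cache access
PJ_PER_BIT_DRAM = 15.0       # assumed off-chip DRAM access
HIT_RATE = 0.5               # assumed Infinity-Cache-style hit rate

bits_moved = 8 * 2**30       # 1 GiB of traffic

with_cache_mj = bits_moved * (HIT_RATE * PJ_PER_BIT_SRAM_LLC
                              + (1 - HIT_RATE) * PJ_PER_BIT_DRAM) / 1e9
dram_only_mj = bits_moved * PJ_PER_BIT_DRAM / 1e9
print(f"~{with_cache_mj:.0f} mJ with cache vs ~{dram_only_mj:.0f} mJ DRAM-only")
```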
The big issue with SRAM is that it requires 6 transistors per bit while DRAM requires 1 transistor plus 1 capacitor, so capacity per mm² of die area is much worse. That capacitor keeps things cheap, but it's also the reason why individual DRAM cells can only operate at 300-500MHz while SRAM can easily operate into the many-GHz range.
Interesting. Why is switching SRAM so much more power-consuming than just maintaining it? I thought that with a flip-flop, every gate is used on pretty much every cycle, since it feeds back on itself.
I would have thought reading/writing SRAM is much more expensive than idling it not because of the cost of toggling the flip-flop, but because of all the cache-coherency work associated with making a change in a multicore system.