I believe those are the utterly massive register files needed to feed a modern v...

hinkley · on June 18, 2023

That would have been my guess. But I don’t think I’ve ever seen a register file big enough that I could spot it without a label. I’m almost surprised they are so tall and not wide. Or is that because I am looking at it sideways, and each register is top to bottom?

unnah · on June 19, 2023

According to Agner Fog's manuals, the physical register file of a Zen 2 FPU contains 168 vector registers of 256 bits each. In total the vector registers hold about 5 kB, much less than the 32 kB L1 data cache, which isn't that large on the die image. Even if the middle of the FPU includes the register file, the regular structure must be mostly something else.
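As a quick check of that arithmetic, here is a small sketch using only the figures quoted above. The variable names are illustrative, and the L1D size is taken as 32 KiB.

```python
# Figures as quoted in the comment above (Zen 2 FPU, per Agner Fog's manuals).
VECTOR_REGISTERS = 168
BITS_PER_REGISTER = 256
L1D_BYTES = 32 * 1024  # assumption: "32 kB" is read as 32 KiB

# Raw storage capacity of the vector register file, in bytes.
register_file_bytes = VECTOR_REGISTERS * BITS_PER_REGISTER // 8

# How many times larger the L1D is in raw capacity.
ratio = L1D_BYTES / register_file_bytes

print(f"register file: {register_file_bytes} bytes "
      f"({register_file_bytes / 1024:.2f} KiB)")
print(f"L1D holds about {ratio:.1f}x as many bytes")
```

This compares storage capacity only. It says nothing about area per bit, which depends on how many ports each cell has.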

formerly_proven · on June 19, 2023

The register file allows many more accesses per cycle, though, and should actually be truly dual-ported (one register can be read and written in the same cycle). I'm not sure exactly how this works, but Zen 2 has four vector units in the FPU, so I'd naively expect the register file to serve ~eight 256-bit reads and ~four 256-bit writes per cycle. So it should have at least that many read and write ports? Additionally, there should be a forwarding network around the ALUs. L1D has only a fraction of that connectivity. So between being dual-ported, having many more ports, and probably having the forwarding/bypass network integrated into it or at least adjacent to it, I'd expect the FPU register file to have a dramatically lower bit density than the L1D cache.
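The port estimate above can be written out as a back-of-the-envelope sketch. Everything here is an assumption rather than a confirmed Zen 2 figure: two source reads and one result write per vector unit per cycle, and the common rule of thumb that the area of a multi-ported storage cell grows roughly with the square of its port count.

```python
# Assumptions (not confirmed Zen 2 figures): each vector unit reads two
# source operands and writes one result per cycle.
VECTOR_UNITS = 4
READS_PER_UNIT = 2
WRITES_PER_UNIT = 1
REGISTER_BITS = 256

read_ports = VECTOR_UNITS * READS_PER_UNIT
write_ports = VECTOR_UNITS * WRITES_PER_UNIT
total_ports = read_ports + write_ports

# Peak bits moved in and out of the register file each cycle.
read_bits_per_cycle = read_ports * REGISTER_BITS
write_bits_per_cycle = write_ports * REGISTER_BITS

# Rule of thumb (an assumption): every port adds a wordline and a bitline
# to each cell, so cell area scales roughly with total_ports squared.
relative_cell_area = total_ports ** 2

print(f"{read_ports} read ports, {write_ports} write ports")
print(f"{read_bits_per_cycle} bits read, {write_bits_per_cycle} bits written per cycle")
print(f"~{relative_cell_area}x the area of a single-ported cell (rule of thumb)")
```

Under those assumptions, a 12-port cell would be on the order of a hundred times larger than a single-ported one. That is the kind of gap that could make about 5 kB of registers occupy a visible share of the die, though real designs use banking and other tricks that this sketch ignores.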

exmadscientist · on June 18, 2023

This is about Skylake rather than Zen 2, but it's fascinating, if the subject of what's-really-in-a-register-file is fascinating to you: https://travisdowns.github.io/blog/2020/05/26/kreg2.html

hinkley · on June 18, 2023

When I still read architecture docs like novels, there was a group experimenting with processor-in-memory architectures, where the demo was memory chips doing vector processing of data in parallel.

I wonder how wide SIMD has to get before you treat it like a CPU embedded into cache memory.

Though I guess we are already looking at SIMD instructions wider than a cache line…