
> So, you have this really powerful instruction set that's vectorised on almost every instruction, and then you have a massive 128 registers per core coupled with 32KB local cache memory which is single cycle access

SPU local store is 256 KB, not 32 KB. It's not single-cycle access either: loads have a multi-cycle latency (6, if I recall correctly). It's still wicked fast, though, about the same as an L1 cache access on the PPU.




Hmmm, just checked and you're right! I've not touched the SPU for well over a decade now, and it appears I've forgotten a lot more than I realised.

You're right about the latency too, but because it was a pipelined architecture, all instructions had some latency, so while 6 sounds high, it wasn't really significant. Now that I'm thinking about it (memories of re-arranging instructions manually, before I shifted from writing assembler directly to using GCC and intrinsics and letting the compiler worry about the interleaving), I also realise I'd forgotten the odd/even pipeline split, where you would pair instructions of different types (roughly ALU and non-ALU) so they could issue in the same cycle.
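
For anyone who never did this by hand, here's a rough toy model of why the ordering mattered. It is only a sketch: the `cycles` function, the register names, and the 2-cycle add latency are made up for illustration, and the pairing rule is far simpler than the real hardware's. The only number taken from this thread is the 6-cycle load latency.

```python
# Toy model, not a real SPU simulator: in-order issue, with at most one
# "even" (roughly ALU) and one "odd" (load/store etc.) instruction per cycle.

def cycles(program, dual_issue=True):
    """program: list of (dest, sources, pipe, latency) tuples.
    Returns the cycle at which the last result becomes available."""
    ready = {}      # register name -> cycle its value is available
    cycle = 0       # cycle in which the next instruction may issue
    used = set()    # pipes already used in the current cycle
    finish = 0
    for dest, sources, pipe, latency in program:
        operands_ready = max((ready.get(s, 0) for s in sources), default=0)
        if operands_ready > cycle:
            # Stall until the inputs arrive.
            cycle, used = operands_ready, set()
        elif used and (not dual_issue or pipe in used):
            # Cannot pair with what already issued this cycle.
            cycle, used = cycle + 1, set()
        used.add(pipe)
        ready[dest] = cycle + latency
        finish = max(finish, ready[dest])
    return finish

LOAD = ("odd", 6)   # 6-cycle load latency, as discussed above
ADD = ("even", 2)   # assumed latency, for illustration only

# Each add waits on the load right before it.
naive = []
for i in range(4):
    naive.append((f"a{i}", [], *LOAD))
    naive.append((f"b{i}", [f"a{i}"], *ADD))

# Same work, with all loads issued first so their latencies overlap.
reordered = [(f"a{i}", [], *LOAD) for i in range(4)]
reordered += [(f"b{i}", [f"a{i}"], *ADD) for i in range(4)]

# Independent even/odd instructions interleaved so they can pair up.
mixed = []
for i in range(4):
    mixed.append((f"e{i}", [], *ADD))
    mixed.append((f"o{i}", [], *LOAD))

print(cycles(naive), cycles(reordered))
print(cycles(mixed), cycles(mixed, dual_issue=False))
```

In this toy model the naive ordering finishes at cycle 26 and the reordered one at cycle 11, and the interleaved even/odd sequence finishes at cycle 9 with pairing versus 13 without. The absolute numbers mean nothing for real hardware; the point is just that the latency mostly disappears once independent work is scheduled around it.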



