A high-grade consumer GPU (a 4090) is about 80 teraflops. So rounding up to 100, an exaflop is about 10,000 consumer-grade cards' worth of compute, and a petaflop is about 10.
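The arithmetic, as a quick Python sanity check (using the rounded 100 teraflop figure from above):

    # How many ~100 TFLOPS cards add up to a petaflop / an exaflop.
    card = 100e12            # one 4090, rounded up from ~80 teraflops
    print(1e15 / card)       # 10.0     -> ~10 cards per petaflop
    print(1e18 / card)       # 10000.0  -> ~10,000 cards per exaflop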

Which doesn't help with understanding how much more impressive these are than the last clusters, but it does, to me at least, put the amount of compute these clusters have into perspective.

You're off by three orders of magnitude.

My point of reference is that back in undergrad (~10-15 years ago), I recall a class assignment where we had to optimize matrix multiplication on a CPU; typical good parallel implementations achieved about 100-130 gigaflops (on a... Nehalem or Westmere Xeon, I think?).
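For anyone curious what that measurement looks like, here's a rough sketch in Python; NumPy's BLAS-backed matmul stands in for the hand-optimized kernel from the assignment, and the FLOP count uses the standard 2*N^3 for a dense multiply:

    import time
    import numpy as np

    # Time a dense N x N matrix multiply and convert to achieved GFLOPS.
    # A dense matmul does 2*N^3 floating-point ops (N^3 multiplies plus
    # N^3 adds).
    N = 4096
    a = np.random.rand(N, N)
    b = np.random.rand(N, N)

    start = time.perf_counter()
    c = a @ b                    # multi-threaded BLAS under the hood
    elapsed = time.perf_counter() - start

    print(f"{2 * N**3 / elapsed / 1e9:.1f} GFLOPS")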


You are 100% correct, I lost a full SI prefix of performance there. Edited my message.

Which does make the clusters a fair bit less impressive, but also a lot more sensibly sized.


4090 tensor performance (FP8): 660 teraflops, 1320 "with sparsity" (i.e. max theoretical with zeroes in the right places).

https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...

But at these levels of compute, the memory/interconnect bandwidth becomes the bottleneck.
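Rough roofline arithmetic for that (the ~1 TB/s GDDR6X bandwidth is the 4090's spec-sheet figure; treat the numbers as approximate):

    # FLOPs per byte of memory traffic needed before the FP8 tensor peak
    # is even reachable; anything less arithmetic-dense is memory-bound.
    peak_flops = 660e12             # dense FP8 tensor peak, from the link above
    bandwidth  = 1.008e12           # bytes/s, ~1 TB/s GDDR6X (spec sheet)
    print(peak_flops / bandwidth)   # ~655 FLOPs per byte loaded

Large, well-tiled matmuls can reach that kind of arithmetic intensity; most other kernels can't, hence the bottleneck.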
