
Yes, the figure was 250 GFLOPS for 4 cores rather than per core; I misread. Still impressive, but more reasonable.



The floating-point FMA throughput per desktop CPU socket per clock cycle has doubled every few years, in the sequence Athlon 64 (2003) => Athlon 64 X2 (2005) => Core 2 (2006) => Nehalem (2008) => Sandy Bridge (2011) => Haswell (2013) => Coffee Lake Refresh (2018) => Ryzen 9 3950X (2019) => Ryzen 9 9950X (2024), going from 1 FP64 FMA/socket/clock cycle to 256 FP64 FMA/socket/clock cycle, with twice those numbers for FP32 FMA (1 FMA/s is counted as 2 Flop/s).
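The FMA/cycle figures translate into peak Flop/s once you pick a clock frequency. A minimal sketch (the 2.0 GHz and 4.0 GHz clocks are illustrative assumptions, not exact product specs):

```python
# Peak FP64 throughput from the FMA/socket/cycle figures above.
# Clock frequencies here are rough illustrations, not product specs.
def peak_fp64_gflops(fma_per_cycle, ghz):
    # 1 FMA counts as 2 Flop (a multiply and an add)
    return fma_per_cycle * 2 * ghz

print(peak_fp64_gflops(1, 2.0))    # 1 FMA/cycle, Athlon 64 era: 4.0 GFLOP/s
print(peak_fp64_gflops(256, 4.0))  # 256 FMA/cycle today: 2048.0 GFLOP/s
```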


I wish memory bandwidth had also doubled that often on desktops. Instead of 256 times (even more, given the 2-3 times higher core frequencies), it has increased only 14 times: DDR-400 at 6.4 GB/s => DDR5-5600 at 89.6 GB/s. The machine balance keeps falling ever further.
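The falling machine balance can be put in numbers as bytes of memory bandwidth per Flop of peak compute. A rough sketch, assuming illustrative peak figures of 1 FMA/cycle at 2 GHz then and 256 FMA/cycle at 4 GHz now (both counting 2 Flop per FMA):

```python
# Machine balance = memory bandwidth / peak compute, in bytes per Flop.
# Peak-Flop denominators are illustrative assumptions, not exact specs.
def balance_bytes_per_flop(gbytes_per_s, gflops):
    return gbytes_per_s / gflops

then = balance_bytes_per_flop(6.4, 1 * 2 * 2.0)     # DDR-400 era
now = balance_bytes_per_flop(89.6, 256 * 2 * 4.0)   # DDR5-5600 era
print(then, now)  # ~1.6 vs ~0.044 bytes per Flop
```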

While flash memory has become so fast in recent years, I haven't heard of any breakthrough technology prototypes that would bring significant progress to RAM. Let alone RAM latency, which has remained constant (+/- a few ns) through all these years.


You are right, which is why on modern CPUs the maximum computational throughput can be reached only when a large fraction of the operands is reused, so that they can be taken from the L1 or L2 caches.

Unlike the bandwidth of the main memory or of the shared L3 cache, the bandwidth of the non-shared L1 and L2 caches has increased in exactly the same ratio as the FMA throughput. Almost all CPUs have been able to do the same number of FMA operations per clock cycle as loads from the L1 cache per clock cycle (simultaneously with, typically, half that number of stores to the L1 cache per clock cycle).

Had this not been true, the computational execution units of the cores would have become useless.
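That load/store-to-FMA ratio can be turned into a tiny port-pressure model. A sketch under the generic ratio described above (F FMAs, F L1 loads, and F/2 L1 stores per cycle; not a specific CPU):

```python
# Lower bound on cycles for a kernel, given per-cycle issue limits of
# f FMAs, f L1 loads and f/2 L1 stores (the generic ratio, not a real CPU).
def cycles_lower_bound(fmas, loads, stores, f=2):
    return max(fmas / f, loads / f, stores / (f / 2))

# AXPY (y[i] += a * x[i]) on n elements: n FMAs, 2n loads, n stores.
# It is load/store bound and reaches only half of the FMA peak.
n = 1024
print(cycles_lower_bound(n, 2 * n, n))  # 1024.0 cycles for 1024 FMAs
```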

Fortunately, solving systems of linear equations and multiplying matrices are very frequent operations, and they reuse most of their operands, so they can reach the maximum computational throughput.
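The reuse in matrix multiplication is easy to quantify: an n x n FP64 matmul does about 2n^3 Flop while touching only 3n^2 elements once each, so its arithmetic intensity grows linearly with n. A quick sketch of that arithmetic:

```python
# Arithmetic intensity of naive n x n FP64 matrix multiplication:
# ~2*n^3 Flop over 3*n^2 matrices' worth of 8-byte elements (A, B, C),
# i.e. intensity = n/12 Flop/byte -- it grows linearly with n.
def matmul_intensity_flop_per_byte(n, bytes_per_elem=8):
    flops = 2 * n**3
    bytes_touched = 3 * n**2 * bytes_per_elem
    return flops / bytes_touched

print(matmul_intensity_flop_per_byte(24))    # 2.0 Flop/byte
print(matmul_intensity_flop_per_byte(1200))  # 100.0 Flop/byte
```

With enough reuse per byte, the kernel becomes compute-bound rather than bandwidth-bound, which is why BLAS-3 operations can approach the FMA peak.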



