32 full-width vector ALUs running at 3.5 GHz is probably not realistic. I think they run at around 2 GHz at most [1]. The trick is that an FMA is normally counted as two FLOPs.
[1] (* (/ 512 64) 2 2 18 2 1000 1000 1000) = 1152000000000 FLOPS, i.e. about 1.15 TFLOPS
(512-bit unit over 64-bit doubles, times 2 for FMA, times two units, over 18 cores at 2 GHz)
edit: the 10-core part has a base clock of 3.3 GHz. The 18-core part will probably be in the 2.5 GHz range at best (the best 18-core Broadwell I can find runs at 2.3 GHz, but it is a dual-socket part). Running in full AVX-512 mode will probably downclock the CPU further.
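For anyone who wants to plug in other clocks or core counts, the estimate in [1] can be sketched as a small Python function. The function name and parameters are made up for illustration, and the 2.5 GHz figure is only the speculative upper bound mentioned above, not a measured clock.

```python
def peak_flops(cores, clock_hz, vector_bits=512, element_bits=64,
               fma_units=2, flops_per_fma=2):
    """Theoretical peak double-precision FLOPS (hypothetical helper).

    lanes per vector * FLOPs per FMA * FMA units per core * cores * clock
    """
    lanes = vector_bits // element_bits  # 8 doubles fit in a 512-bit register
    return lanes * flops_per_fma * fma_units * cores * clock_hz


# Same inputs as footnote [1]: 18 cores at 2 GHz.
print(peak_flops(cores=18, clock_hz=2_000_000_000))  # 1152000000000

# Assumed best case of 2.5 GHz for the 18-core part.
print(peak_flops(cores=18, clock_hz=2_500_000_000))  # 1440000000000
```

This is a ceiling, not an expected result: it assumes every FMA unit issues a full-width FMA every cycle, and any AVX downclocking lowers `clock_hz` directly.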
> The 18-core part will probably be in the 2.5 GHz range at best (the best 18-core Broadwell I can find runs at 2.3 GHz, but it is a dual-socket part). Running in full AVX-512 mode will probably downclock the CPU further.
Indeed, the 2.2-2.3/2.7-2.8 GHz (base/boost) of the >18C E5-269X v4 CPUs is the clock for non-AVX instructions. With AVX throttling these drop by 300-400 MHz [1], and I expect the Skylake chips to behave very similarly. In fact I would not be surprised if, on average, 512-bit AVX-512 required more throttling than 256-bit AVX2.