Yeah, good point. SIMD is also superscalar on most systems.
For x86, Intel can do 3x AVX512 instructions per clock tick, as long as each instruction is simple enough (add, AND, OR, NOT, XOR, maybe even multiply if you're not counting the latency issue).
Only port 5 and the fused port 0/1 can execute 512-bit AVX-512 instructions on Skylake/Ice Lake, so I don't think you can get a reciprocal throughput better than 0.5 cycles per instruction (i.e. two per clock) on current parts. Unless you count loads and stores as well, like I mentioned.
And Ice Lake is 3 years old. If you have a newer AVX-512 chip that can do more than two 512-bit ALU ops per cycle, I would love to take a look at it.
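To spell out the arithmetic behind that 0.5 figure, here is a rough sketch. It assumes each eligible port can start one instruction per cycle and that nothing else (front end, dependencies, memory) is the bottleneck. The function name and the port counts passed in are mine, for illustration only, not measurements.

```python
from fractions import Fraction


def reciprocal_throughput(num_ports: int) -> Fraction:
    """Best-case cycles per instruction if `num_ports` ports can each
    start one independent instruction every cycle (idealized model)."""
    if num_ports <= 0:
        raise ValueError("need at least one execution port")
    return Fraction(1, num_ports)


# Assumption taken from the comment above: 512-bit ALU ops can only go to
# two places, the fused port 0/1 and port 5.
two_ports = reciprocal_throughput(2)

# What the "3x per clock" claim would need: three 512-bit-capable ports.
three_ports = reciprocal_throughput(3)

print(f"2 ports: {float(two_ports):.2f} cycles/instr, {1 / two_ports} instr/cycle")
print(f"3 ports: {float(three_ports):.2f} cycles/instr, {1 / three_ports} instr/cycle")
```

So under this model, three 512-bit ALU ops per clock would need a reciprocal throughput of about 0.33, which two ports cannot deliver.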