I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is an HPC machine from Fujitsu, for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps). There's only so much space in my brain.
AVX-512's masks are OK. They're quite cheap. There are some infelicities. I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again. This may be a symptom of AVX-512's origins in the Knights* line.
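To make the 8b complaint concrete: the workaround is an unmasked AND followed by a separate byte-masked move. A rough sketch with AVX-512BW intrinsics (the function name is mine, purely for illustration):

    #include <immintrin.h>

    // No VPANDB/VPANDW exists, so a merge-masked byte-wise AND costs an
    // unmasked 512-bit AND plus a byte-granularity masked move (VMOVDQU8).
    __m512i masked_and_epi8(__m512i src, __mmask64 k, __m512i a, __m512i b) {
        __m512i anded = _mm512_and_si512(a, b);      // plain bitwise AND
        return _mm512_mask_mov_epi8(src, k, anded);  // keep src where k is 0
    }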
It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations. I suppose this is inevitable given that the mask units need physical proximity to the vector units - so the explicit mask ops end up on the same ports as the SIMD ops.
I also wish that there were 512b compares that produced zmm registers like the old compares used to; sometimes that's the behavior you want. However, you can reconstruct that in another cheap operation iirc.
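The cheap operation I'm thinking of is VPMOVM2B (and its wider siblings), which turns the mask back into the old all-ones/all-zeros lane pattern. A quick sketch with AVX-512BW intrinsics (function name made up for illustration):

    #include <immintrin.h>

    // AVX-512 compares produce a mask rather than a vector; one extra
    // instruction (VPMOVM2B) recovers the 0xFF/0x00-per-byte result that
    // the SSE/AVX2-style compares used to give you.
    __m512i cmpeq_bytes_as_vector(__m512i a, __m512i b) {
        __mmask64 k = _mm512_cmpeq_epi8_mask(a, b);  // compare -> 64-bit mask
        return _mm512_movm_epi8(k);                  // mask -> vector of lanes
    }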
> I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is an HPC machine from Fujitsu, for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps).
Fair enough. I have high hopes for SVE, though. The first-faulting memory ops and predicate bisection features look like a vectorization godsend.
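For a taste of why, here's a rough, untested sketch of an strlen-style loop with the ACLE intrinsics - a first-faulting load plus BRKB to split the predicate at the first zero byte (the function name is mine, not from any library):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    // Scan for a terminating zero byte without reading past memory we don't
    // own: LDFF1 suppresses faults on all but the first active lane and
    // records which lanes actually loaded in the FFR.
    size_t sve_strlen_sketch(const char *s) {
        const uint8_t *p = (const uint8_t *)s;
        svbool_t all = svptrue_b8();
        size_t i = 0;
        for (;;) {
            svsetffr();                                  // mark all lanes "ok"
            svuint8_t bytes  = svldff1_u8(all, p + i);   // first-faulting load
            svbool_t  loaded = svrdffr();                // lanes actually read
            svbool_t  zeros  = svcmpeq_n_u8(loaded, bytes, 0);
            if (svptest_any(loaded, zeros)) {
                // BRKB: true for lanes strictly before the first zero byte.
                svbool_t before = svbrkb_b_z(loaded, zeros);
                return i + svcntp_b8(loaded, before);
            }
            i += svcntp_b8(all, loaded);                 // advance by lanes read
        }
    }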
> There's only so much space in my brain.
I'm still going to attempt a nerd-sniping with the published architecture manual. Fujitsu includes a detailed pipeline description, including instruction latencies. Granted, it's just one part, and it's an HPC-focused part at that. But it's not every day that this level of detail gets published in the ARM world.
> I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again.
SVE is blessedly uniform in this regard.
> It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations.
This goes both ways, though. A64FX has two vector execution pipelines and one dedicated predicate execution pipeline. Since the vector pipelines cannot execute predicate ops, I expect it is not difficult to construct cases where code gets starved for predicate execution resources.
The Fujitsu manuals are really good. Like you say, it's not often that you see that level of detail in the ARM world - or, frankly, the non-x86 world in general. From my prehistory as a Hyperscan developer back in the days before the Intel acquisition and the x86-only open source port, I have a lot of experience chasing around vendors for latency/throughput/opt-guide material. Most of it was non-public and/or hopelessly incomplete.
I salute your dedication to nerd-sniping. I need my creature comforts these days too much to spend days out there in the nerd-ghillie-suit waiting for that one perfect nerd-shot. That may be stretching the analogy hopelessly, but working with just architecture manuals and simulators is tough.
I am more aiming for nerd-artillery ("flatten the entire battlefield") these days: as my powers wane, I'm hoping that my superoptimizer picks up the slack. Despite my skepticism about SVE, I will retarget the superoptimizer to generate SVE/SVE2.