Mostly agree, but there is actually a mismatch between madd_epi16 and Arm.
Implementing Arm semantics on x86, or x86 semantics on Arm, requires ~5 instructions, but if we generalize the definition to allow reordering (e.g. Highway's ReorderWidenMulAccumulate [1]), it's only 2 instructions.
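To make the mismatch concrete, here is a rough scalar sketch of the two lane orders. `madd_pairwise` models the adjacent-pair behavior of `_mm_madd_epi16`. `madd_reordered` is a hypothetical illustration of one relaxed order (low half times low half, plus high half times high half), roughly what a widening multiply followed by a widening multiply-accumulate would give. It is not claimed to be the exact order Highway or any particular target uses. The helper names and types are made up for this example.

```cpp
#include <array>
#include <cstdint>
#include <numeric>

using I16x8 = std::array<int16_t, 8>;
using I32x4 = std::array<int32_t, 4>;

// Wrapping 32-bit add, done in unsigned to avoid signed-overflow UB.
// (Only matters when all four inputs of a lane are -32768.)
inline int32_t wrap_add(int32_t x, int32_t y) {
  return static_cast<int32_t>(static_cast<uint32_t>(x) +
                              static_cast<uint32_t>(y));
}

// Scalar model of _mm_madd_epi16: each 32-bit lane i is
// a[2i]*b[2i] + a[2i+1]*b[2i+1], i.e. ADJACENT pairs are summed.
I32x4 madd_pairwise(const I16x8& a, const I16x8& b) {
  I32x4 out{};
  for (int i = 0; i < 4; ++i) {
    out[i] = wrap_add(int32_t(a[2 * i]) * b[2 * i],
                      int32_t(a[2 * i + 1]) * b[2 * i + 1]);
  }
  return out;
}

// Hypothetical relaxed order (an assumption for illustration):
// lane i is a[i]*b[i] + a[i+4]*b[i+4], i.e. low half plus high half.
I32x4 madd_reordered(const I16x8& a, const I16x8& b) {
  I32x4 out{};
  for (int i = 0; i < 4; ++i) {
    out[i] = wrap_add(int32_t(a[i]) * b[i], int32_t(a[i + 4]) * b[i + 4]);
  }
  return out;
}

// Sum of all lanes, widened so the total itself cannot overflow.
int64_t lane_sum(const I32x4& v) {
  return std::accumulate(v.begin(), v.end(), int64_t{0});
}
```

The individual lanes differ between the two, but for inputs that don't hit the wraparound case the total over all lanes is the same. So a dot-product style reduction doesn't care which order it gets, and that is the freedom a "reorder allowed" definition exploits.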
Indeed, and your comment led me to find additional issues with my port of _mm_madd_epi16.
I agree it would perhaps be possible to find better semantics for SIMD that kinda gloss over all the differences. That would be cleaner but require a lot of names. Well I suppose that's what Highway does, isn't it?
[1] https://github.com/google/highway/blob/master/g3doc/quick_re...