There's not much in the way of contrary opinion here, so let me offer some. The approach of not tying you to a particular architecture is fundamentally wrong. The right way is to expose APIs for each processor architecture.
Here's why. SIMD offers no new capabilities (1), only more speed, and not much more at that, maybe 4x if you're lucky. It's also hard to use: it requires unnatural data layouts, and lacks many operations (e.g. integer division). None of this is specific to the .NET implementation: it's just the nature of the beast.
So successfully exploiting SIMD is not easy, and requires thinking at the level where instruction counts matter. And because the amount of parallelism is so limited, high level languages (by which I include C!) can very easily blow away any gains with suboptimal codegen. Just a handful of additional instructions can ruin your performance (2).
Here's what will go wrong with an architecture-independent SIMD API:
1. Say you invoke an operation without an underlying native instruction. The compiler is forced to implement this by transferring the data to scalar registers, performing the operation, and then transferring the result back. Game over: this exercise is likely to eat up any performance benefit.
2. To avoid this, say you limit the API to some "common subset" of all extant SIMD ISAs. The problem is, many algorithms admit vectorization only through the exotic instructions, such as POPCNT (introduced alongside SSE4.2) or the legendary vec_perm on AltiVec. If these instructions are not exposed, you can't vectorize the algorithm. Game over again.
That's why software that takes advantage of SIMD invariably has separate implementations for each supported ISA. .NET should have followed suit: expose an API for each ISA (or a mega-API that covers all ISAs), and then provide rich information about which operations are implemented efficiently, and which are not, to allow apps to choose an optimal implementation at runtime. This API would demo and market very poorly, but the engineers will love it, because it's the one that enables the most benefit from SIMD.
1: with rare exceptions, such as the new fused multiply-add support in x86
2: Several years back, VC++ generated an all-bits-1 register by loading it from memory, instead of issuing a pcmpeqd, which caused my vector implementation to underperform my scalar one. This is my fear for the .NET implementation.
This made me think of clang/gcc's vector extensions [1], which, together with __builtin_shuffle, can be used to get reasonably good cross-platform (SSE/NEON/...) SIMD code going. An example of this in use is [2].
That said, you're right: usually the best performance can only be obtained by using really specific instructions. But in my experience, a decent performance increase can be had with the generic vector extensions alone.
Moreover, if you can use the vector extensions for a large part of the code, you have to write a lot less platform-specific stuff. That is, you increase portability anyway: now you only have to rewrite, say, 5 out of 20 functions instead of all 20. Even better, they allow one to write v3 = v1 + v2 instead of v3 = _mm_add_ps(v1, v2). The first one is clearer, more portable (it will generate addps, the equivalent NEON instruction, or whatever the target provides), and plain nicer to read.
Your pcmpeqd example is a good example of an optimizer flaw. In my opinion this is orthogonal to whether to expose a specific or a generic API. The compiler should've used the most efficient instruction for that simple idiom, period (without you telling it to use pcmpeqd). If we continue your line of reasoning, we're back to assembly for everything.