There's not much in the way of contrary opinion here, so let me offer some. The approach of not tying you to a particular architecture is fundamentally wrong. The right way is to expose APIs for each processor architecture.
Here's why. SIMD offers no new capabilities (1), only more speed, and not much more at that, maybe 4x if you're lucky. It's also hard to use: it requires unnatural data layouts, and lacks many operations (e.g. integer division). None of this is specific to the .NET implementation: it's just the nature of the beast.
So successfully exploiting SIMD is not easy, and requires thinking at the level where instruction counts matter. And because the amount of parallelism is so limited, high level languages (by which I include C!) can very easily blow away any gains with suboptimal codegen. Just a handful of additional instructions can ruin your performance (2).
Here's what will go wrong with an architecture-independent SIMD API:
1. Say you invoke an operation without an underlying native instruction. The compiler is forced to implement this by transferring the data to scalar registers, performing the operation, and then transferring the result back. Game over: this exercise is likely to eat up any performance benefit.
2. To avoid this, say you limit the API to some "common subset" of all extant SIMD ISAs. The problem is, many algorithms admit vectorization only through the exotic instructions, such as POPCNT (introduced alongside SSE4.2) or the legendary vec_perm on AltiVec. If these instructions are not exposed, you can't vectorize the algorithm. Game over again.
That's why software that takes advantage of SIMD invariably has separate implementations for each supported ISA. .NET should have followed suit: expose an API for each ISA (or a mega-API that covers all ISAs), and then provide rich information about which operations are implemented efficiently, and which are not, to allow apps to choose an optimal implementation at runtime. This API would demo and market very poorly, but the engineers will love it, because it's the one that enables the most benefit from SIMD.
1: with rare exceptions, such as the new fused multiply-add support in x86
2: Several years back, VC++ generated an all-bits-1 register by loading it from memory, instead of issuing a pcmpeqd, which caused my vector implementation to underperform my scalar one. This is my fear for the .NET implementation.
This made me think of clang/gcc's vector extensions [1], which, together with __builtin_shuffle, can be used to get reasonably good cross-platform (SSE/NEON/...) SIMD code going. An example of this in use is [2].
That said, you're right: usually the best performance can only be obtained by using really specific instructions. But in my experience, a decent performance increase can be had with the generic vector extensions alone.
Moreover, if you can use the vector extensions for a large part of the code, you have to write a lot less platform-specific stuff. That is, you increase portability anyway: now you only have to rewrite, say, 5 out of 20 functions instead of all 20. Even better, they allow one to write v3 = v1 + v2 instead of v3 = _mm_add_ps(v1, v2). The first one is clearer, more portable (it will generate addps, the equivalent NEON instruction, or whatever the target provides), and plain nicer to read.
Your pcmpeqd example is a good example of an optimizer flaw. In my opinion this is orthogonal to whether to expose a specific or a generic API. The compiler should've used the most efficient instruction for that simple idiom, period (without you telling it to use pcmpeqd). If we continue your line of reasoning, we're back to assembly for everything.