What do you mean by programmable gather/scatter? GPUs already do efficient gather and scatter operations. I think even the Knights Landing AVX-512 implementation has efficient gather and scatter.
Read the linked article, and the paper linked from there. Basically the idea is that gather/scatter can be very inefficient from a cache and memory bandwidth perspective. In the worst case you're using only a single element per cache line. So the idea is to "move" the scatter/gather engine to the memory controller, and pack the vectors already in the cache rather than in the register file.
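To put a rough number on the "single element per cache line" worst case, here is a back-of-the-envelope sketch. The 64-byte line, the 8-byte elements, the line-aligned array, and the function name are my assumptions for illustration, not figures from the article or the paper.

```python
def cache_line_utilization(indices, elem_bytes=8, line_bytes=64):
    """Fraction of the bytes fetched from memory that a gather actually uses.

    Assumes (hypothetically) that the array starts on a cache-line boundary
    and that every touched line is fetched in full exactly once.
    """
    unique = set(indices)
    # Each distinct cache line touched by the gather is transferred whole.
    lines = {(i * elem_bytes) // line_bytes for i in unique}
    used = len(unique) * elem_bytes
    fetched = len(lines) * line_bytes
    return used / fetched

# 64 contiguous doubles: every fetched byte is useful.
dense = cache_line_utilization(range(64))

# 64 doubles with a stride of 8 elements: one useful element per line.
strided = cache_line_utilization(range(0, 512, 8))

print(dense, strided)  # 1.0 0.125
```

With these assumed sizes the strided gather wastes 7/8 of the bandwidth it consumes, which is the kind of traffic that packing closer to the memory controller would be trying to avoid.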
Will it work in reality? No idea, but it's an interesting idea certainly worth exploring.
More generally, I'd really like "real" vector ISAs to become mainstream [1]. We've suffered from the scourge of packed-SIMD long enough, thank you very much.
ARM SVE, RISC-V V extension, and indeed to an extent AVX-512 look pretty good.
[1] Not saying that every chip must dedicate a huge portion of the die area to a monster vector unit, I'd just like the ISA to be there so programmers and compilers can target it.
Depends. Right now they're in two different spaces. One is still hanging on to graphics/gaming, the other is coming from the dense compute space. We'll see how long it takes them to converge.
For all intents and purposes, they've already converged: the underlying GPU microarchitectures have been fairly general purpose SIMD-ish for a long time now.
And (as one example), CUDA happily runs on any gamer "consumer" GPU.
Anyone mining memory-bandwidth-hard cryptocurrencies, like Zcash, may want to evaluate these. According to the article, they’ll have 1200 GB/s, vs 900 GB/s for the top nVidia Volta card. (Of course, it’s quite likely that this increase in memory bandwidth isn’t worth it for reasons of cost, ISA suitability for the particularities of Equihash, etc., but it's hard to say without a lot of thinking-through.)
I found it fascinating that NEC's new vector machine is now a vector accelerator on a PCIe card. This reminds me of how early vector processors were add-ons to existing processors. I wonder if that changes the programming model (compared to the SX-9/SX-ACE or Cray J90s) and how.
Vector engines really suffer from numerical stability problems, so for certain kinds of problems, you'll get wrong answers (but at least you'll get them fast).
IEEE compliance only gives you guarantees on individual arithmetic operations. Vector machines take away control over the order in which operations are performed, and floating-point addition is not associative. Changing the order in which the terms of a sum are added may lead to vastly different results, due to, e.g., rounding.
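A tiny sketch of the ordering effect, in plain Python doubles. The two-lane accumulation below is only my stand-in for how a vector unit might split a reduction across lanes (not how any particular machine does it), and the function names and input values are made up for illustration.

```python
import math

def sequential_sum(xs):
    """Add terms strictly left to right, as a scalar loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total

def lane_sum(xs, lanes=2):
    """Accumulate element i into lane i % lanes, then combine the lanes.

    A hypothetical model of a vectorized reduction: same terms, different order.
    """
    acc = [0.0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x
    return sequential_sum(acc)

values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0 (the first 1.0 is rounded away against 1e16)
print(lane_sum(values))        # 2.0 (the large terms cancel inside one lane)
print(math.fsum(values))       # 2.0 (correctly rounded reference)
```

Every individual addition here is correctly rounded, yet the two orderings disagree. In this particular case the reordered sum happens to be the accurate one, so the issue is less that one order is "wrong" than that the result depends on an order you no longer control.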