Can Vector Supercomputing Be Revived?

jcbeard · on Oct 28, 2017

Better programmable gather/scatter (as described here: https://www.nextplatform.com/2017/09/14/shedding-light-dark-...) can definitely open up a wider range of applications to vectorization.

CyberDildonics · on Oct 28, 2017

What do you mean by programmable gather/scatter? GPUs already do efficient gather and scatter operations. I think Knights Landing's AVX-512 even has efficient gather and scatter.

jabl · on Oct 28, 2017

Read the linked article, and the paper linked from there. Basically the idea is that gather/scatter can be very inefficient from a cache and bandwidth perspective. In the worst case you're using only a single element per cache line. So the idea is to "move" the scatter/gather engine to the memory controller, and pack the vectors already in the cache rather than in the register file.

Will it work in reality? No idea, but it's an interesting idea certainly worth exploring.
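To put a rough number on the "single element per cache line" worst case, here is a minimal sketch. It assumes 64-byte cache lines, 8-byte elements, and an array that starts on a line boundary; the function name `gather_line_utilization` is made up for illustration and does not come from the article or the paper.

```python
def gather_line_utilization(indices, elem_bytes=8, line_bytes=64):
    """Fraction of fetched cache-line bytes that a gather actually uses.

    Assumes (for illustration only) that the array base is line-aligned
    and that every touched cache line is fetched in full.
    """
    # Map each element index to the cache line that holds it.
    lines_touched = {(i * elem_bytes) // line_bytes for i in indices}
    # Count each distinct element once; repeats add no useful bytes.
    useful_bytes = len(set(indices)) * elem_bytes
    fetched_bytes = len(lines_touched) * line_bytes
    return useful_bytes / fetched_bytes


contiguous = list(range(16))           # unit stride: 8 elements per line
strided = [i * 8 for i in range(16)]   # stride of 8 elements = one per line

print(gather_line_utilization(contiguous))  # 1.0
print(gather_line_utilization(strided))     # 0.125
```

Under these assumptions the strided gather moves eight times as many bytes as it uses, which is the waste that packing closer to memory would be trying to avoid.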

jabl · on Oct 28, 2017

More generally, I'd really like "real" vector ISAs to become mainstream [1]. We've suffered from the scourge of packed-SIMD long enough, thank you very much.

ARM SVE, RISC-V V extension, and indeed to an extent AVX-512 look pretty good.

[1] Not saying that every chip must dedicate a huge portion of the die area to a monster vector unit, I'd just like the ISA to be there so programmers and compilers can target it.

payne92 · on Oct 28, 2017

It’s very very very tough to compete with the economies of scale and ecosystem of modern GPU computing.

Vector supercomputing doesn’t need to be “revived”, it is already here.

jcbeard · on Oct 28, 2017

Depends. Right now they're in two different spaces. One still hanging on to graphics/gaming, the other coming from dense compute space. We'll see how long it takes them to converge.

payne92 · on Oct 30, 2017

For all intents and purposes, they've already converged: the underlying GPU microarchitectures have been fairly general purpose SIMD-ish for a long time now.

And (as one example), CUDA happily runs on any gamer "consumer" GPU.

davidad_ · on Oct 28, 2017

Those mining memory-bandwidth-hard cryptocurrencies, like zcash, may consider evaluating these. According to the article, they’ll have 1200 GB/s, vs 900 GB/s for the top nVidia Volta card. (Of course, it’s quite likely that this increase in memory bandwidth isn’t worth it for reasons of cost, ISA suitability for the particularities of Equihash, etc., but hard to say without a lot of thinking-through.)

jdboyd · on Oct 28, 2017

I found it fascinating that NEC's new vector machine is now a set of vector accelerators on PCIe cards. This reminds me of how early vector processors were add-ons to existing processors. I wonder if that changes the programming model (compared to the SX-9/ACE or Cray J90s) and how.

wohlergehen · on Oct 28, 2017

And judging by Intel's Xeon Phi story, the next iteration from NEC will again be a full processor...

mmarx · on Oct 28, 2017

Vector engines really suffer from numerical stability problems, so for certain kinds of problems, you'll get wrong answers (but at least you'll get them fast).

justincormack · on Oct 28, 2017

Citation? They are generally IEEE compliant; there can be some differences in denorm handling?

mmarx · on Oct 28, 2017

IEEE compliance only gives you guarantees on individual arithmetic operations. Vector machines take away control over the order in which operations are performed, but floating-point addition is not associative. Changing the order in which the terms of a sum are added may lead to vastly different results, due to, e.g., truncation.
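A small sketch of the reordering effect, using plain Python floats (IEEE 754 doubles). The `lane_sum` helper is a hypothetical stand-in for how a vector unit might accumulate per-lane partial sums before combining them; it is not any particular machine's reduction.

```python
def sequential_sum(values):
    """Add the terms strictly left to right, as scalar code would."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=2):
    """Accumulate into `lanes` independent partial sums, then combine them.

    Hypothetical model of a vectorized reduction: element i goes to
    lane i % lanes, which changes the order of the additions.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0
print(lane_sum(values))        # 2.0
```

Every individual addition here is correctly rounded, yet the two orders disagree. On this input the lane-wise order happens to hit the exact answer (2.0); other inputs favor the sequential order, so the point is only that the result depends on the order.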