Read the linked article, and the paper linked from there. Basically, the idea is that gather/scatter can be very inefficient from a cache and bandwidth perspective. In the worst case every element comes from a different cache line, so you use only a single element out of each line you fetch. So the idea is to "move" the scatter/gather engine to the memory controller, so that the vectors are already packed when they land in the cache rather than being assembled in the register file.
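To put a rough number on that worst case, here is a small sketch that counts how many cache lines a gather touches and what fraction of the fetched bytes it actually uses. The 64-byte line, the 4-byte elements, and the function name are assumptions for illustration, not anything from the article or paper. It also assumes elements are aligned and never straddle two lines.

```python
# Hypothetical illustration: cache-line utilization of a gather.
# Assumptions: 64-byte cache lines, 4-byte aligned elements (no line straddling).
CACHE_LINE_BYTES = 64


def gather_line_utilization(indices, elem_bytes=4, line_bytes=CACHE_LINE_BYTES):
    """Return (lines_touched, utilization) for gathering the elements at `indices`."""
    # Each distinct line containing a requested element must be fetched whole.
    lines = {(i * elem_bytes) // line_bytes for i in indices}
    useful_bytes = len(set(indices)) * elem_bytes
    fetched_bytes = len(lines) * line_bytes
    return len(lines), useful_bytes / fetched_bytes


# Contiguous 16 x 4-byte elements fit in one line: every fetched byte is used.
dense = gather_line_utilization(range(16))
# Stride of 16 elements (64 bytes): each element sits in its own line.
sparse = gather_line_utilization(range(0, 256, 16))
```

With these assumptions the contiguous case touches 1 line at 100% utilization, while the strided case touches 16 lines and uses only 64 of 1024 fetched bytes (6.25%). That gap is what packing at the memory controller is meant to close.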
Will it work in reality? No idea, but it's an interesting idea certainly worth exploring.