Correct, GPUs can indeed do a better job at hiding latency through massive parallelism.
My expertise might be outdated here, but the problem used to be that actually reaching that max bandwidth was just about impossible because of divergent warps and uncoalesced reads.
Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?
Divergent warps are still a huge problem (though SLIDE doesn't suffer from this, AFAIK).
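For anyone who hasn't run into the term, here's a minimal CUDA sketch of what divergence looks like; the kernel and names are made up for illustration, not taken from SLIDE or my Equihash code:

```cuda
// Illustrative only: threads in one 32-lane warp branch on their lane
// index, so the hardware executes both paths serially for that warp.
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;   // even lanes run while odd lanes idle
    else
        out[i] = in[i] + 1.0f;   // then odd lanes run while even lanes idle
}
```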
Uncoalesced reads are not a severe enough problem to make GPUs underperform CPUs. Or, put another way, uncoalesced reads hurt GPUs and CPUs roughly equally.
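Again a made-up sketch, showing a coalesced access pattern next to a strided one:

```cuda
// Illustrative only: adjacent lanes touch adjacent addresses, so a
// warp's 32 loads coalesce into a few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];             // lane k reads element k
}

// Adjacent lanes read addresses 32 floats apart, so each load becomes
// its own transaction and effective bandwidth drops sharply.
__global__ void uncoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];  // stride-32 gather
}
```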
The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block computing it for nonce n + 1). This algorithm involves a lot of data dependency, so a machine training these networks may actually be memory-latency bound (unless you are training a ton of neural networks at once and can hide the latency that way), which is extremely bad for GPUs.
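A toy CUDA sketch of the difference (hash_step is an assumed stand-in for one round of a hash, not code from either project):

```cuda
// Assumed helper: a cheap integer mix standing in for a hash round.
__device__ unsigned int hash_step(unsigned int x) {
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// Nonce search: every thread hashes its own nonce, so thousands of
// independent hashes are in flight and a stall in one warp is covered
// by progress in another.
__global__ void search(unsigned int base, unsigned int *digests, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int nonce = base + i;       // nonce i independent of nonce i+1
    digests[i] = hash_step(nonce);
}

// Data-dependent chain: step s+1 needs the result of step s, so there
// is nothing else to schedule while waiting (launch with one thread).
__global__ void chained(unsigned int seed, unsigned int *out, int steps) {
    unsigned int x = seed;
    for (int s = 0; s < steps; ++s)
        x = hash_step(x);                // serialized on the dependency
    *out = x;
}
```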