The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block nonce n + 1). This algorithm involves a lot of data dependencies, so a machine training these networks may actually be memory-latency bound (unless you are training a ton of neural networks at once and can hide the latency that way), which is extremely bad for GPUs.
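To make the contrast concrete, here's a minimal CUDA sketch (toy_hash is a made-up placeholder mix function standing in for a real hash, not anything from the original): the nonce search launches a huge number of independent threads, so the scheduler can always swap a stalled warp for another one, while a dependency chain has nothing else to run while it waits on each step.

```cuda
#include <cstdint>
#include <cstdio>

__device__ uint64_t toy_hash(uint64_t x) {
    // Placeholder mix function standing in for a real hash (e.g. SHA-256).
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return x;
}

// Independent work: each thread hashes its own nonce. A thread stalled on a
// long-latency operation is simply swapped out for another warp, so the
// latency is hidden by sheer thread count.
__global__ void search_nonces(uint64_t base, uint64_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = toy_hash(base + i);   // nonce i doesn't block nonce i+1
}

// Dependent work: step s+1 can't start until step s finishes, so any
// long-latency operation in the chain can't be overlapped with other work.
__global__ void dependency_chain(uint64_t seed, uint64_t *out, int steps) {
    uint64_t v = seed;
    for (int s = 0; s < steps; ++s)
        v = toy_hash(v);                      // serial chain, nothing to hide behind
    *out = v;
}

int main() {
    const int n = 1 << 20;
    uint64_t *d_out, h_first;
    cudaMalloc(&d_out, n * sizeof(uint64_t));
    search_nonces<<<n / 256, 256>>>(0, d_out, n);   // saturates the GPU
    dependency_chain<<<1, 1>>>(42, d_out, 1000);    // serial, latency-bound
    cudaMemcpy(&h_first, d_out, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    printf("%llu\n", (unsigned long long)h_first);
    cudaFree(d_out);
    return 0;
}
```

The point of the sketch is only the scheduling shape: the first kernel gives the GPU a million independent things to do, the second gives it one long chain, which is the situation a heavily data-dependent training workload can end up in.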