The primary issue is that large-scale GPU training is dominated by communication costs. Because, to a first approximation, gradients must be synchronized across all workers after every update, the communication overhead grows quickly and soon becomes the bottleneck as training scales up.
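As a rough illustration of why that synchronization is so expensive, here is a back-of-envelope sketch (the model size, bandwidth, and compute time below are hypothetical assumptions, not figures from the text) estimating how long a single data-parallel ring all-reduce of the full gradient takes relative to one training step:

```python
# Back-of-envelope sketch: per-step gradient all-reduce cost in data-parallel
# training. All numbers below are illustrative assumptions.

def allreduce_seconds(num_params: float, bytes_per_param: int,
                      bandwidth_gb_per_s: float, num_gpus: int) -> float:
    """Estimate wall-clock time for one ring all-reduce of the full gradient."""
    grad_bytes = num_params * bytes_per_param
    # A ring all-reduce sends roughly 2 * (N-1)/N of the gradient volume
    # over each link, so the slowest link bounds the step.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_bytes / (bandwidth_gb_per_s * 1e9)

# Hypothetical setup: 7B parameters, fp16 gradients (2 bytes each),
# 50 GB/s interconnect, 64 GPUs, 0.5 s of compute per step.
t_comm = allreduce_seconds(num_params=7e9, bytes_per_param=2,
                           bandwidth_gb_per_s=50, num_gpus=64)
t_compute = 0.5
print(f"all-reduce per step: {t_comm:.2f}s "
      f"({t_comm / (t_comm + t_compute):.0%} of each step)")
```

With these assumed numbers the synchronization alone takes on the order of half a second, comparable to the compute itself, which is why naively scaling out data parallelism runs into a communication wall.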