The primary issue is that large-scale GPU training is dominated by communication costs. Because, to a first approximation, gradients must be synchronized across all workers after every update, the communication overhead grows quickly and soon becomes the bottleneck as training scales up.
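As a rough illustration of why that synchronization is so expensive, here is a back-of-envelope sketch (the model size, bandwidth, and compute time below are hypothetical assumptions, not figures from the text) estimating how long a single data-parallel ring all-reduce of the full gradient takes relative to one training step:

```python
# Back-of-envelope sketch: per-step gradient all-reduce cost in data-parallel
# training. All numbers below are illustrative assumptions.

def allreduce_seconds(num_params: float, bytes_per_param: int,
                      bandwidth_gb_per_s: float, num_gpus: int) -> float:
    """Estimate wall-clock time for one ring all-reduce of the full gradient."""
    grad_bytes = num_params * bytes_per_param
    # A ring all-reduce sends roughly 2 * (N-1)/N of the gradient volume
    # over each link, so the slowest link bounds the step.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_bytes / (bandwidth_gb_per_s * 1e9)

# Hypothetical setup: 7B parameters, fp16 gradients (2 bytes each),
# 50 GB/s interconnect, 64 GPUs, 0.5 s of compute per step.
t_comm = allreduce_seconds(num_params=7e9, bytes_per_param=2,
                           bandwidth_gb_per_s=50, num_gpus=64)
t_compute = 0.5
print(f"all-reduce per step: {t_comm:.2f}s "
      f"({t_comm / (t_comm + t_compute):.0%} of each step)")
```

With these assumed numbers the synchronization alone takes on the order of half a second, comparable to the compute itself, which is why naively scaling out data parallelism runs into a communication wall.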