I'm basing that number on https://github.com/jcjohnson/cnn-benchmarks. I'm sure ...

jhj · on Sept 19, 2016

im2col + sgemm on the CPU, as in Caffe for instance, is really slow: you are heavily penalized for the extra memory traffic, and the sgemm tile sizes are probably not well tuned for the problem size at hand.
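
To put a rough number on the extra memory traffic, here is a minimal sketch that counts how many elements an im2col buffer holds compared with the input it was unrolled from. The function names and the example layer shape are my own illustrations, not taken from Caffe; the only assumption is the standard im2col layout, where every output position gets its own copy of a C x K x K patch.

```python
def conv_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def im2col_elems(channels, height, width, kernel, stride=1, pad=0):
    """Number of elements in the im2col matrix for one input image.

    Each of the out_h * out_w output positions stores a full
    channels * kernel * kernel patch, so overlapping pixels are duplicated.
    """
    out_h = conv_output_size(height, kernel, stride, pad)
    out_w = conv_output_size(width, kernel, stride, pad)
    return channels * kernel * kernel * out_h * out_w


def im2col_blowup(channels, height, width, kernel, stride=1, pad=0):
    """Ratio of im2col buffer size to the original input size."""
    original = channels * height * width
    return im2col_elems(channels, height, width, kernel, stride, pad) / original


if __name__ == "__main__":
    # Hypothetical layer: 64 channels, 56x56 input, 3x3 kernel, stride 1, pad 1.
    print(im2col_blowup(64, 56, 56, kernel=3, stride=1, pad=1))
```

For that hypothetical 3x3, stride-1 layer the buffer is 9x the size of the input, and it has to be written out and then read back by sgemm, which is where the bandwidth goes.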

At the roofline of performance, the difference in both memory bandwidth and arithmetic throughput between CPU and GPU is only 5-10x (for fp32; Pascal fp16 is a different story, of course), and proper implementations on the CPU will get you there.
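
For anyone unfamiliar with the roofline argument, a small sketch: attainable throughput is the minimum of peak arithmetic throughput and memory bandwidth times arithmetic intensity (flops per byte moved). The peak and bandwidth figures below are made-up round numbers chosen to fall in the 5-10x range, not measurements of any real CPU or GPU.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline model: throughput is capped by compute or by memory.

    peak_flops    -- peak arithmetic throughput, flop/s
    mem_bandwidth -- peak memory bandwidth, byte/s
    intensity     -- arithmetic intensity of the kernel, flop/byte
    """
    return min(peak_flops, mem_bandwidth * intensity)


# Illustrative placeholder machines (assumed numbers, not real hardware specs).
CPU = {"peak_flops": 1e12, "mem_bandwidth": 1e11}
GPU = {"peak_flops": 1e13, "mem_bandwidth": 7e11}


def gpu_over_cpu(intensity):
    """GPU/CPU ratio of attainable throughput at a given intensity."""
    gpu = attainable_flops(GPU["peak_flops"], GPU["mem_bandwidth"], intensity)
    cpu = attainable_flops(CPU["peak_flops"], CPU["mem_bandwidth"], intensity)
    return gpu / cpu


if __name__ == "__main__":
    for intensity in (1.0, 10.0, 100.0):
        print(intensity, gpu_over_cpu(intensity))
```

With these assumed numbers the gap stays between the bandwidth ratio (7x) and the compute ratio (10x) at every intensity, which is the point: a CPU kernel sitting at its roofline cannot be further behind than the hardware ratios allow.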

https://github.com/Maratyszcza/NNPACK