> Just for reference, Nvidia's flagship GPU (3090)'s FP32 performance is 35.5 TFLOPS.
In the context of ML, Nvidia's flagship is the A100, which has 312 TFLOPS. You can also compare with a TPU device, which has 180 TFLOPS (v2) or 420 TFLOPS (v3). You can reliably use at least the TPU v2 on Colab for free.
It's not really fair to compare a discrete GPU to a mobile GPU; I only provided this as a comparison for someone who might have one of these at home. And btw, you are talking about TF32 performance, not FP32. TF32 is a reduced-precision format: a 10-bit mantissa with an 8-bit FP32-range exponent, 19 bits in total. The A100's actual FP32 performance is lower than the 3090's, at 19.5 TFLOPS:
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...
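For anyone who wants to try it, TF32 in PyTorch is just a pair of global flags (these are the real settings; the matmul and timing below are only an illustrative sketch, and the flags only take effect on Ampere-class GPUs like the A100):

    import torch

    # Allow TF32 tensor cores for matmuls and cuDNN convolutions.
    # On pre-Ampere GPUs these flags have no effect and math stays in FP32.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # Illustrative timing of a large "FP32" matmul that may now run as TF32.
    a = torch.randn(8192, 8192, device="cuda")
    b = torch.randn(8192, 8192, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    c = a @ b
    end.record()
    torch.cuda.synchronize()
    print(f"matmul: {start.elapsed_time(end):.2f} ms")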
FP16 performance is also relevant, as a lot of people now train in FP16. The default for PyTorch/TF is still FP32.
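You can check the PyTorch side in one line (torch.get_default_dtype is the standard call; the tensor is just an example):

    import torch

    # Fresh tensors are created as 32-bit floats unless you opt in to lower precision.
    print(torch.get_default_dtype())   # torch.float32
    x = torch.randn(3)
    print(x.dtype)                      # torch.float32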
I’ve mostly been using TPUs and non-TF32 GPUs at work lately, so I don’t have any practical experience with TF32, but the sales pitch seems pretty good. Do you have any personal experience with whether it’s as much of a drop-in replacement for FP32 as they suggest?
I haven't used TF32 personally, but I think the sales pitch is not too far off. Most of the time I use mixed-precision training, which should be similar to FP16/TF32 in terms of performance. It does speed up training tremendously.
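In PyTorch that's basically the torch.cuda.amp recipe; here is a minimal sketch (the tiny linear model, optimizer, and random data are just placeholders):

    import torch
    from torch.cuda.amp import autocast, GradScaler

    model = torch.nn.Linear(1024, 1024).cuda()                # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer
    scaler = GradScaler()   # scales the loss so FP16 gradients don't underflow

    for _ in range(10):
        x = torch.randn(64, 1024, device="cuda")
        target = torch.randn(64, 1024, device="cuda")
        optimizer.zero_grad()
        with autocast():    # ops run in FP16 where safe, FP32 elsewhere
            loss = torch.nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()   # backprop on the scaled loss
        scaler.step(optimizer)          # unscale gradients, then step
        scaler.update()                 # adjust the loss scale for the next step
    print(f"final loss: {loss.item():.4f}")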