It's not really fair to compare a discrete GPU to a mobile GPU; I only provided this as a comparison for someone who maybe has one of these at home. And btw, you are talking about TF32 performance, not FP32. TF32 actually uses 19 bits (1 sign, 8 exponent, 10 mantissa), giving it FP32's range with FP16's precision. The A100's FP32 performance is actually lower than the 3090's, at 19.5 TFLOPS:
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...
FP16 performance is also relevant, as a lot of people now train in FP16. The default for PyTorch/TF is still FP32, though.
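For reference, opting into TF32 in PyTorch is just a couple of flags; a minimal sketch, assuming an Ampere-or-newer GPU and a recent PyTorch build (the default state of these flags has changed across releases, so this is illustrative, not gospel):

    import torch

    # TF32 only takes effect on Ampere (and newer) GPUs with tensor cores.
    # Whether these flags are on by default has changed across PyTorch
    # releases, so set them explicitly if you care either way.
    torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matmuls
    torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

Everything else (weights, losses, checkpoints) stays in FP32; the rounding to TF32 happens inside the tensor-core matmuls, which is why it's pitched as a drop-in change.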
I've been mostly using TPUs and non-TF32 GPUs at work lately, so I don't have any practical experience with TF32, but the sales pitch seems pretty good. Do you have any personal experience with whether it's as much of a drop-in replacement for FP32 as they suggest?
I haven't used TF32 personally, but I think the sales pitch is not too far off. Most of the time I use mixed-precision training, which should be similar to FP16/TF32 in terms of performance, and it speeds up training tremendously.
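If anyone wants to try it, mixed precision in PyTorch is torch.cuda.amp; here's a minimal sketch with a toy linear model and random data standing in for a real model and DataLoader:

    import torch
    import torch.nn as nn
    from torch.cuda.amp import autocast, GradScaler

    model = nn.Linear(512, 10).cuda()   # toy stand-in for a real model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = GradScaler()               # rescales the loss so small FP16 grads don't underflow

    for _ in range(100):                # dummy loop over random batches
        x = torch.randn(64, 512, device="cuda")
        y = torch.randint(0, 10, (64,), device="cuda")
        optimizer.zero_grad()
        with autocast():                # ops run in FP16 where safe, FP32 elsewhere
            loss = nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()   # backprop through the scaled loss
        scaler.step(optimizer)          # unscales gradients, then steps
        scaler.update()                 # adapts the scale factor over time

The autocast context is what makes it feel drop-in: the weights stay in FP32 and only the forward/backward math runs in reduced precision.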