For many of us, Inception-style CNN workloads (especially at FP32) are much more realistic than the large language models that may be better suited to taking advantage of the tensor cores. If I'm going to be memory-bottlenecked either way, I probably don't want to spend an extra $1000 on 400 tensor cores I can't take full advantage of.
If I may ask, why are Inception-style workloads still popular, rather than architectures like EfficientNet?
Also, why FP32? CNNs are some of the most robust models to train in FP16 (much easier than language models), so you could get yourself a quick XXX speedup and 2x memory savings by switching over.
(btw not intending to be accusatory or anything, I just think FP16 training deserves a lot more adoption than it currently seems to have :)
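To make that concrete, one common way to do the switch is PyTorch's automatic mixed precision, which usually only touches a few lines of the training loop. A rough sketch (the toy CNN, random data, and hyperparameters are just placeholders for a real pipeline):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Toy CNN and fake data purely to show the mixed-precision plumbing;
# the autocast/GradScaler pattern is the only part that matters here.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow

for step in range(10):
    images = torch.randn(32, 3, 224, 224, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()

    # Ops inside autocast run in FP16 where it's numerically safe and
    # stay in FP32 where it isn't (e.g. the loss reduction).
    with autocast():
        loss = nn.functional.cross_entropy(model(images), targets)

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips the step on inf/nan
    scaler.update()                # adapts the loss scale for the next step
```

autocast keeps the numerically sensitive ops in FP32, and GradScaler does the loss scaling that keeps FP16 gradients from vanishing, which is most of why CNNs train so painlessly this way.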