This needs to be taken with a massive grain of salt, as LLM training performance hugely depends on the framework used.
And by "performance" I mean the quality of the end result, the training speed, and what features device is physically capable of handling. While kinda discrete aspects in other fields, all of these are highly correlated in ML training.
One specific example I'm thinking of is support for efficient long-context training. If your stack doesn't support flash attention, for instance (or roll its own equivalent), then good luck with that.
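As a rough sketch of what I mean (assuming a PyTorch stack; the sequence length here is just an arbitrary "long context" test value): with a fused attention kernel this runs in roughly linear memory, and without one the naive path blows up quadratically with sequence length.

```python
import torch
import torch.nn.functional as F

# Assumed test setup: does this stack actually handle attention over a long
# sequence, or does it silently fall back to the O(n^2)-memory math path?
seq_len = 32_768  # arbitrary "long context" length for the sanity check
q = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# scaled_dot_product_attention dispatches to a fused (flash-style) kernel
# when the backend supports it; otherwise it materializes the full
# seq_len x seq_len score matrix and memory use explodes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```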
Another one I'm thinking of is quantization. Sometimes you want to train with a specific quantization scheme because you actually need it, and sometimes quantizing is purely a performance compromise. And vendors really love to fudge their numbers by benchmarking with the fastest, most primitive quantization the device supports.
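Toy illustration of the gap (not any vendor's actual scheme, just a made-up weight matrix with uneven channel magnitudes): the "fast" headline number often corresponds to coarse per-tensor int8, while anything quality-sensitive usually wants at least per-channel scales.

```python
import torch

# Fake weight matrix where channels have very different magnitudes,
# which is exactly where crude per-tensor quantization falls apart.
w = torch.randn(256, 256) * torch.logspace(-3, 0, 256).unsqueeze(1)

def per_tensor_int8(x):
    # One scale for the whole tensor: cheapest and fastest to run.
    scale = x.abs().max() / 127
    return (x / scale).round().clamp(-127, 127) * scale

def per_channel_int8(x):
    # One scale per output channel: more bookkeeping, much lower error.
    scale = x.abs().amax(dim=1, keepdim=True) / 127
    return (x / scale).round().clamp(-127, 127) * scale

print("per-tensor error :", (w - per_tensor_int8(w)).abs().mean().item())
print("per-channel error:", (w - per_channel_int8(w)).abs().mean().item())
```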
And by "performance" I mean the quality of the end result, the training speed, and what features device is physically capable of handling. While kinda discrete aspects in other fields, all of these are highly correlated in ML training.
One specific example I am thinking of is support for efficient long context training. If your stack doesn't, for instance, support flash attention (or roll its own custom implementation), then good luck with that.
Another I am thinking of is quantization. Sometimes you want to train with a specific quantization scheme, and sometimes doing that is just a performance compromise. And vendors really love to fudge their numbers with the fastest, most primitive quantization on the device.