If you include multimodal data then I think it's pretty obvious that training is...

nullc 15 days ago | parent | context | favorite | on: Nvidia’s $589B DeepSeek rout

If you include multimodal data then I think it's pretty obvious that training is compute limited.

Also current SOTA models are good enough that you can generate endless training data by letting the model operate stuff like a C compiler, python interpreter, Sage computer algebra, etc.