OP's claim/observation is that "trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point [of inference performance]".
His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".
There is an implicit assumption here that seems obviously false: that this "convergence point" of predictive performance represents the best that can be done with the data, which would imply that current models are perfectly modelling the generative process behind it - the human brain.
This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?
Interesting point. But does the data contain enough information to perfectly model the generative process? Maybe even a very complex and capable model like "the human brain" would fail to model the dataset better than large transformers do, if that dataset were the only thing it ever saw.
You and I can model the dataset better, but we've already been "pre-trained" on reality for decades.
Just because the dataset is large doesn't mean it contains useful information.
Perhaps, but even with an arbitrarily good training set, the LLM would still be constrained by its own architectural limits, e.g. if a problem can't be broken down into sub-problems that each require <= N sequential steps, then an N-layer transformer will never be able to solve it in a single forward pass.
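To make that depth-vs-sequential-steps point concrete, here is a toy sketch in plain Python (not actual transformer internals; every name here is made up for illustration). Each "layer" is one round of computation, so a single pass through N layers can perform at most N dependent steps. A task that is inherently a chain of k > N dependent steps (here, k rounds of hashing, where each round's input is the previous round's output) can't be collapsed into fewer rounds, though feeding the output back in - a scratchpad / chain-of-thought analogue - works around the limit:

    # Toy illustration only: one "layer" = one round of dependent computation.
    import hashlib

    def one_step(x: bytes) -> bytes:
        """One inherently sequential step: hash the previous output."""
        return hashlib.sha256(x).digest()

    def chain(x: bytes, k: int) -> bytes:
        """Ground truth: k dependent steps; step i needs step i-1's result."""
        for _ in range(k):
            x = one_step(x)
        return x

    def depth_limited_model(x: bytes, n_layers: int) -> bytes:
        """Stand-in for an N-layer model: one forward pass can perform
        at most n_layers dependent steps."""
        for _ in range(n_layers):
            x = one_step(x)
        return x

    if __name__ == "__main__":
        x0, k, n = b"seed", 8, 4
        target = chain(x0, k)
        single_pass = depth_limited_model(x0, n)
        print(single_pass == target)   # False: depth n < k, one pass falls short
        # Feeding the output back in (two passes of depth 4) covers all 8 steps:
        two_passes = depth_limited_model(depth_limited_model(x0, n), n)
        print(two_passes == target)    # True

The hashing is just a stand-in for any step whose input depends on the previous step's output and can't be parallelised away; the point is only that depth caps the number of such steps per pass.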
Even if the architectural shortcomings were all fixed, it seems "[pre-training] data is all you need" would still be false, because there is no getting around the need for personal experience, for the same reasons that it is true for us...
Perhaps most fundamentally, any action/prediction you make can only be based on the content of your own mind, not the mind of a tutor you are trying to copy. Even if the tutor diligently tries to communicate all the nuances and contingencies of a skill to you, those are still relative to his/her own internal world model, not the one in your head. You will need to practice and self-correct to adapt the instructions to yourself.