Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children raised in the same household, one of whom has a severe disability.
What I meant in this blog post was that given two NNs built from the same basic components, both sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks:
(1) A Mamba SSM and a Transformer on the Pile.
(2) Two transformers, one trained on the Pile, the other trained on Reddit comments.
All are trained to the same MMLU performance.
I'd bet big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
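To make "average responses are nearly identical" measurable, here's a rough sketch of the comparison I have in mind; the embedding model and helper names are placeholders I'm picking for illustration, not anything from the post:

```python
# Sketch: summarize each model's "typical" behavior by embedding its
# sampled responses and comparing the mean embeddings. The encoder
# choice and function names are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_response_embedding(responses, encoder):
    """Embed each sampled response and average them into one vector."""
    vecs = encoder.encode(responses)  # (n_responses, dim) numpy array
    return vecs.mean(axis=0)

def behavioral_similarity(responses_a, responses_b):
    """Cosine similarity between two models' mean response embeddings."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    a = mean_response_embedding(responses_a, encoder)
    b = mean_response_embedding(responses_b, encoder)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# responses_a / responses_b would be completions sampled from the two
# checkpoints on a shared prompt set: the Mamba vs. Transformer pair
# in (1), or the Pile-trained vs. Reddit-trained transformers in (2).
```

My bet is that the similarity comes out near 1.0 for pair (1) and noticeably lower for pair (2).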
There aren't many people these days who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments.
Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.
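To make the question concrete, here's a toy illustration (numbers made up) of what "same performance" does and doesn't pin down: two models can have identical greedy-decoded benchmark accuracy while their sampling distributions, and therefore their typical responses, differ a lot.

```python
# Toy example (made-up numbers): identical benchmark score, different behavior.
import numpy as np

rng = np.random.default_rng(0)

# Next-token distributions over answers A-D for the same question.
# Both models put their argmax on the correct answer "B", so under
# greedy decoding they score identically on the benchmark...
model_1 = np.array([0.10, 0.70, 0.10, 0.10])
model_2 = np.array([0.24, 0.28, 0.24, 0.24])
assert model_1.argmax() == model_2.argmax()

# ...but under temperature-1 sampling their responses diverge.
samples_1 = rng.choice(list("ABCD"), size=10_000, p=model_1)
samples_2 = rng.choice(list("ABCD"), size=10_000, p=model_2)
print((samples_1 == "B").mean())  # ~0.70
print((samples_2 == "B").mean())  # ~0.28
```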
If Mamba really were as capable as a Transformer on tasks requiring accurate attention over long contexts, there'd be no need for Jamba (a Mamba+Transformer hybrid).
Your argument, "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer," seems a tad circular...
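The gap shows up concretely on long-context recall probes like the sketch below (the task construction is mine; `generate` stands in for any model's text-completion function):

```python
# Hypothetical passkey-retrieval probe: bury a key in distractor text
# and check whether the model can recall it. Task design and names are
# my own, for illustration only.
import random

def make_needle_prompt(n_filler: int, key: str = "7319") -> tuple[str, str]:
    """Bury a passkey inside n_filler lines of distractor text."""
    filler = [f"Note {i}: nothing important here." for i in range(n_filler)]
    filler.insert(random.randrange(n_filler), f"The passkey is {key}.")
    prompt = "\n".join(filler) + "\nWhat is the passkey? Answer:"
    return prompt, key

def recall_accuracy(generate, n_trials: int = 50, n_filler: int = 2000) -> float:
    """Fraction of trials where the model's completion contains the key."""
    hits = 0
    for _ in range(n_trials):
        prompt, key = make_needle_prompt(n_filler)
        hits += key in generate(prompt)
    return hits / n_trials
```

As the context grows, pure SSMs tend to fall off on exactly this kind of task, which is the capability gap Jamba's attention layers are there to patch.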
Yeah, I'm not sure how anyone could interpret what you said the way people here are citing it. In the context of data for LLMs, you're obviously right. Look at Llama 3, for example: the architectural changes are minimal, yet its performance is almost at GPT-4's level. The biggest change was the dataset.