Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children raised in the same household, one of whom has a severe disability.
What I meant in this blog post was that given two NNs built from the same basic components, both sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks:
(1) A Mamba SSM and a Transformer on the Pile.
(2) Two transformers, one trained on the Pile, the other trained on Reddit comments.
All are trained to the same MMLU performance.
I'd bet big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
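To make "average responses are nearly identical" measurable, here's a rough sketch of the comparison I have in mind; the embedding model and helper names are placeholders I'm picking for illustration, not anything from the post:

```python
# Sketch: summarize each model's "typical" behavior by embedding its
# sampled responses and comparing the mean embeddings. The encoder
# choice and function names are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_response_embedding(responses, encoder):
    """Embed each sampled response and average them into one vector."""
    vecs = encoder.encode(responses)  # (n_responses, dim) numpy array
    return vecs.mean(axis=0)

def behavioral_similarity(responses_a, responses_b):
    """Cosine similarity between two models' mean response embeddings."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    a = mean_response_embedding(responses_a, encoder)
    b = mean_response_embedding(responses_b, encoder)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# responses_a / responses_b would be completions sampled from the two
# checkpoints on a shared prompt set: the Mamba vs. Transformer pair
# in (1), or the Pile-trained vs. Reddit-trained transformers in (2).
```

My bet is that the similarity comes out near 1.0 for pair (1) and noticeably lower for pair (2).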
There aren't many people these days who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments.
Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.
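To make the question concrete, here's a toy illustration (numbers made up) of what "same performance" does and doesn't pin down: two models can have identical greedy-decoded benchmark accuracy while their sampling distributions, and therefore their typical responses, differ a lot.

```python
# Toy example (made-up numbers): identical benchmark score, different behavior.
import numpy as np

rng = np.random.default_rng(0)

# Next-token distributions over answers A-D for the same question.
# Both models put their argmax on the correct answer "B", so under
# greedy decoding they score identically on the benchmark...
model_1 = np.array([0.10, 0.70, 0.10, 0.10])
model_2 = np.array([0.24, 0.28, 0.24, 0.24])
assert model_1.argmax() == model_2.argmax()

# ...but under temperature-1 sampling their responses diverge.
samples_1 = rng.choice(list("ABCD"), size=10_000, p=model_1)
samples_2 = rng.choice(list("ABCD"), size=10_000, p=model_2)
print((samples_1 == "B").mean())  # ~0.70
print((samples_2 == "B").mean())  # ~0.28
```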
If Mamba really were as capable as a Transformer on tasks requiring accurate attention over long contexts, there'd be no need for Jamba (a Mamba+Transformer hybrid).
Your argument, "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer," seems a tad circular...
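The gap shows up concretely on long-context recall probes like the sketch below (the task construction is mine; `generate` stands in for any model's text-completion function):

```python
# Hypothetical passkey-retrieval probe: bury a key in distractor text
# and check whether the model can recall it. Task design and names are
# my own, for illustration only.
import random

def make_needle_prompt(n_filler: int, key: str = "7319") -> tuple[str, str]:
    """Bury a passkey inside n_filler lines of distractor text."""
    filler = [f"Note {i}: nothing important here." for i in range(n_filler)]
    filler.insert(random.randrange(n_filler), f"The passkey is {key}.")
    prompt = "\n".join(filler) + "\nWhat is the passkey? Answer:"
    return prompt, key

def recall_accuracy(generate, n_trials: int = 50, n_filler: int = 2000) -> float:
    """Fraction of trials where the model's completion contains the key."""
    hits = 0
    for _ in range(n_trials):
        prompt, key = make_needle_prompt(n_filler)
        hits += key in generate(prompt)
    return hits / n_trials
```

As the context grows, pure SSMs tend to fall off on exactly this kind of task, which is the capability gap Jamba's attention layers are there to patch.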
Yeah, I'm not sure how anyone could interpret what you said the way people here are citing it. In the context of data for LLMs, you're obviously right. Look at Llama 3, for example: the architectural changes are minimal, yet its performance is almost at GPT-4's level. The biggest change was the dataset.