>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it you'll still be lagging behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.
>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.
>the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today.
>architecture research matters. many people just take it for granted these days.
Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.
What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks:
(1) A Mamba SSM and a Transformer on the Pile.
(2) Two transformers, one trained on the Pile, the other trained on Reddit comments.
All are trained to the same MMLU performance.
I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
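If it helps, here's roughly the comparison I have in mind, sketched with the Hugging Face API. The checkpoint names are placeholders (not real models trained to matched MMLU); the point is just: same prompts in, compare the typical completions out.

    # Sketch only: the "my-org/..." names are hypothetical stand-ins for the
    # Pile-trained Mamba and the Pile-trained transformer described above.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sample(name, prompt, n_tokens=128):
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, temperature=0.8,
                             max_new_tokens=n_tokens)
        return tok.decode(out[0], skip_special_tokens=True)

    prompts = ["Explain why the sky is blue.",
               "Write a short story about a lighthouse keeper."]
    for p in prompts:
        print(sample("my-org/mamba-pile", p))        # hypothetical SSM checkpoint
        print(sample("my-org/transformer-pile", p))  # hypothetical transformer checkpoint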
There's not many people who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments these days.
Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.
If Mamba really was as capable as a Transformer on tasks requiring accurate attending to long context, then there'd be no need for Jamba (Mamba+Transformer hybrid).
Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...
Yeah, I'm not sure how someone could interpret what you said in the way people are citing here. It's actually obvious that you are right in the context of data in LLMs. Look at Llama 3, for example: there are minimal architectural changes, yet its performance is almost at the level of GPT-4. The biggest change was in the dataset.
Well, both can be true if you interpret the "it" as "the secret sauce / competitive advantage". A good architecture is a necessary but not sufficient condition for success, but everybody uses more or less the same one currently, so data makes the difference. Until the next improvement in architecture.
I do argue that the "it" is the architecture. We have pretty much had all the data that these LLMs were trained on for a long time. The game changer was the architecture, not the data. Unless of course you are in the "code is data" camp ;).
Probably the "it" is whatever one model has that other models don't have. When everyone is using the same architecture, then the data makes the difference. If everyone has the same data, then the architecture makes the difference.
It sounds pretty obvious to say that the difference is whatever is different, but isn't that literally what both sides of this argument are saying?
edit: I do think that what the original linked essay is saying is slightly subtler than that, which is that _given_ that everyone is using the same transformer architecture, the exact hyperparameters and fine-tuning matter a lot less than the dataset does.
MLP is a universal approximator, so there’s definitely a configuration that can match an attention mechanism. Whether or not it’d be feasible to train is another question.
Not sure about feasible, but certainly not efficient.
I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.
I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.
Yes, and note that in terms of different architectures, the author (James Betker) is talking about image generators, while when he's talking about LLMs they are all the same basic architecture - transformers.
Some tasks are going to be easier to learn than others, and certainly in general you can have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (combination of architecture + size) and well trained.
That said, it's notable that all the Pareto-optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.
How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !
So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.
No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.
Seems like an objection that is slightly beside the point? The claim is not that literally any model gives the same result as a large transformer model, that's obviously false. I think the more generous interpretation of the claim is that the model architecture is relatively unimportant as long as the model is fundamentally capable of representing the functions you need it to represent in order to fit the data.
OP's claim/observation is that "trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point [of inference performance]".
His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".
There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.
This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?
Interesting point. But does the data contain enough information to perfectly model the generative process? Maybe even a very complex and capable model like "the human brain" would fail to model the dataset better than large transformers, if that was the only thing it ever saw.
You and I can model the dataset better, but we're already "pre-trained" on reality for decades.
Just because the dataset is large doesn't mean it contains useful information.
Perhaps, but even with an arbitrarily good training set, the LLM would still be constrained by its own architectural limits, e.g. if a problem can't be broken down into sub-problems that each require <= N sequential steps, then an N-layer transformer will never be able to solve it.
Even if the architectural shortcomings were all fixed, it seems "[pre-training] data is all you need" would still be false, because there is no getting around the need for personal experience, for the same reasons that is true for us...
Perhaps most fundamentally, any action/prediction you make can only be based on the content of your own mind, not the mind of a tutor you are trying to copy. Even if the tutor diligently tries to communicate all nuances and contingencies of a skill to you, those are still all relative to his/her own internal world model, not the one in your head. You will need to practice and correct to adapt the instructions to yourself.
I took the Andrew Ng Coursera machine learning course in 2015, and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent, but he essentially said that tweaking the algorithm will only make your model ~1% better, while doubling the amount of training data will have a substantially larger impact (use any old algorithm, just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc. were sitting on a goldmine: in the long run, those with the most data, not the brightest PhDs, will win this race.
The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.
The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.
The models eventually reach saturation of how much they can fit to. There’s reason to believe that current LLMs are underfit to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently utilizing another 2x data to fill out the space. Doubling the model size, on the same training data, and letting it be even more underfit could result in a better model.
> That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.
So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.
Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.
The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.
I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch to on-the-job training and learning ability, rather than data.
As a hobbyist who has trained models for different use cases ranging from object detection and recognition to text completion to image generation, I've found the best advice has consistently been to curate and annotate your dataset as perfectly as you can before worrying about anything else.
A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things they are relatively minor for most purposes.
Of course, this advice gets routinely ignored as people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throwing it all together and training it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.
It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
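For what it's worth, the mechanical part of that loop is mostly this boring. A rough sketch, assuming a JSONL file with "text" and "label" fields (the field names and filenames are just illustrative):

    import json
    from collections import Counter

    cleaned, seen = [], set()
    with open("dataset.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            text = rec.get("text", "").strip()
            if not text or rec.get("label") is None:
                continue                  # drop empty or unlabeled examples
            if text in seen:
                continue                  # drop exact duplicates
            seen.add(text)
            cleaned.append(rec)

    print(Counter(r["label"] for r in cleaned))   # eyeball the label balance

    with open("dataset.clean.jsonl", "w") as f:
        for rec in cleaned:
            f.write(json.dumps(rec) + "\n")

The filtering is the easy part; the screen time goes into actually reading the records that survive it.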
The 15T tokens that got thrown at Llama-3 didn't seem to hurt. Will be interesting to see how well Phi-2 holds up with its more curated approach; hopefully they don't get disappeared like WizardLM 2 =)
"The quality of the prompts used in SFT and the preference rankings used in PPO and DPO played a crucial role in the performance of the aligned models. Meta's team carefully curated this data and performed multiple rounds of quality assurance on annotations provided by human annotators."
This makes me sad, not because I disagree with it, but because it's basically common wisdom in the statistical and ML communities (of practitioners). In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.
That being said, if you use a linear model (like lasso) vs. a tree-based model (like XGBoost), you'll definitely see differences, but once you have a flexible enough model and a lot of data, training time and inference complexity tend to become better ways to make a model choice.
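To make that concrete, a toy sketch on synthetic nonlinear data (sklearn's GradientBoostingRegressor standing in for XGBoost, nothing tuned):

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
    for name, model in [("lasso", Lasso(alpha=0.01)),
                        ("gbdt", GradientBoostingRegressor(random_state=0))]:
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(name, round(r2, 3))   # expect the tree model to pick up the nonlinearities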
>In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.
There are countless competitions, etc. on Kaggle, AICrowd, or other platforms with an enforced standardized data set. Every entrant uses the same data set and there's a huge difference between the best and worst submissions.
Agreed, but if you look at winning submissions (which I stopped doing), a lot of them do very good feature engineering, which is not a model-related thing.
> the only people who think architecture/model choice makes a huge difference are n00bs and academics.
Are you referring to the current state of our best existing models or the potential future of ML? I find it incredibly hard to see how an LLM could implement the best “physically allowable” approximation to Solomonoff induction.
Then again, I thought it was extremely unlikely neural networks would have the abilities they currently exhibit, so who knows.
We manage to train neural nets to approximate complicated data sets via a rather simple process: backpropagation.
It is indeed a marvel that it works nearly as well as it does.
But then again, evolution is even dumber (in the sense that it only makes random choices that thrive or perish, and can't even take gradients into account), but evolution has still managed to produce intelligent critters.
I guess when you have enough dimensions greedy approaches to optimisation / hill climbing can work well enough, even when you have challenging problems?
Especially if you are allowed to move to some meta levels. Eg evolution doesn't build planes, it built brains that can figure out how to build planes. Similarly with back propagation perhaps.
Unfortunately Google brain researchers have not yet discovered my brilliance, but if you read my argument it's about the data being much more important than the model. Granted transformers are a great model, but that doesn't refute my point.
Why does it make you sad? It seems intuitive and simple. And in reality, of course, the optimisation part is not trivial. What would be better if the "it" was more complicated?
It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.
You would get into natural language modelling because you had a deep love of language. Because you think you're close to figuring language out in a systematic way, with just a few years more study.
There's a certain sadness, I think, in the revelation that the robots don't need the expertise of humanity's greatest experts and masters, they just need us to click all the squares that contain a motorcycle.
> It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.
What's sadder is coming into a field pre-deciding that the way you approach it "is the right way" and can't tolerate that different mindsets can also get results.
I don’t get this: “What that means is not only that they learn what it means to be a dog or a cat, …“
We don't have any dataset of dog or cat experience, right? OP probably means that the models learn what a dog or cat is, right?
I find the whole piece somewhat vague btw. No real insights if you ask me. Sure if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?
> OP probably means that the models learn what a dog or cat is, right?
Yes, "What it means to be" does appear to be meant that way and it didn't occur to me to interpret it the other way.
> Sure if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?
You put in a particular choice of nn architecture as well as the dataset. The insight (to the extent that it is insightful, and true) is that the architecture doesn't affect the results you get much compared to the dataset.
Ok, the first thing must just be my non-native-speaker mind then.
The second still feels like "duh". It's what these models are meant to do, right? Form an internal representation of the relations hidden in the data. It's what complex systems do: they hold models of reality and use those to predict. That is in fact what Claude Shannon meant with his definition of information. Idk, maybe I'm getting it wrong.
> It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
In some other comment I read this. Sounds very much like a curation thing. And now I'm wondering; isn't this part already covered by a lot of human beings now interacting with ChatGPT and the like?
My uneducated guess is that a company can scrape the whole world wide web and also have all the low quality content that comes with it, but then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?
People typically ask LLMs about things they DON'T know about or understand. So they are not qualified to assess the validity of their answers. Which is exactly why hallucination is such a big problem.
"Fixing" low quality data with RLHF is a waste of time. By that point it's already poisoned the model distribution, and all you're doing is steering it away from catastrophic failure cases.
Start with the best data you can, and task-train ("rlhf") behavior, not preference.
Has anyone tried removing an entire concept from a dataset and seeing if the LLM can reason its way into the concept?
I think that would be a really cool experiment.
There are probably some really good candidate concepts that just take a small leap of reasoning to reach.
But off the top of my head maybe multiplication? Or the concept of zero. Maybe the wheel?
Edit: if anyone is interested in doing this kind of stuff, hit me up. (Email in profile). I want to start doing these kinds of things as a side project.
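The data side of the experiment would probably start as something like this crude keyword scrub (patterns and filenames are just illustrative, and a real run would need far more careful leakage checking), then pretrain on the filtered corpus and probe for the concept:

    import re

    # hypothetical surface forms for the target concept ("multiplication")
    patterns = [r"\bmultipl\w*", r"\btimes table", r"\d+\s*[x×*]\s*\d+"]
    concept_re = re.compile("|".join(patterns), re.IGNORECASE)

    with open("corpus.txt") as src, open("corpus_no_multiplication.txt", "w") as dst:
        for doc in src:                      # assuming one document per line
            if concept_re.search(doc) is None:
                dst.write(doc)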
If you had one that was character-based (instead of the weird encoding they tend to use), you could directly sample without the letter "e".
Though I'm not sure its output would make much sense, and you might have to use beam search (or something like backtracking).
I wonder how you would train a model to directly speak without e. Perhaps you use the general model like above with beam search, and then train a new model to directly predict the first model's beam-searched predictions.
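Even with an ordinary subword model you can approximate the character-level version by masking, at every step, any token whose string contains an "e". A greedy sketch below (gpt2 is just a convenient small checkpoint; a beam-search or backtracking variant would build on the same mask):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # indices of every token whose decoded text contains an "e"
    banned = torch.tensor([i for i in range(len(tok))
                           if "e" in tok.decode([i]).lower()])

    ids = tok("A dog and a boy go out for a walk", return_tensors="pt").input_ids
    for _ in range(40):
        logits = model(ids).logits[0, -1]
        logits[banned] = float("-inf")       # forbid tokens containing an "e"
        next_id = torch.argmax(logits).view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)

    print(tok.decode(ids[0]))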
Yes, and it's what people seem to ignore when they talk about dethroning GPT4 as the top LLM. It's good data expressly developed for training the behaviors they want that keeps them ahead; all the other stuff (other training and filtering web data) has much less of an impact.
I don't think GPT4 is the top LLM, it's good at coding and good at understanding poorly written prompts but its high level prompt following and creativity are not great. GPT4 likes to answer a particular way and when your question matches up with that it'll seem very smart, but when it doesn't the rails it is on are very obvious.
This insight makes one wonder if the same thing applies to humans as well. Are we just the sum of our experiences? Or are the architectures of our brains much more complex and different, so that they have more influence on the outputs for the same inputs?
I think it's the latter. We may well have some subsystems that work like LLMs or other current AIs, but the overall system of a human mind seems to work in a fundamentally different way, as it's able to make good creative choices (such as the next word to say) without looking at lots of options.
Consider a chess engine that plays at grandmaster level, i.e. a human grandmaster can sometimes beat it. Even though it's not the best chess engine in the world, it simulates billions of possible scenarios to decide each move. Yet the grandmaster can still beat it sometimes, even though he clearly isn't thinking about billions of possible scenarios.

(On the question of whether human brains may in fact unconsciously process billions of possibilities when deciding a chess move, using some neurological process we haven't discovered, I've heard David Deutsch argue this would be thermodynamically impossible as it would require far more energy than the brain consumes.)

So the human grandmaster's brain must be doing something else that we don't understand. I think a similar comparison applies to how an LLM and a human choose the next word to say. An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.
That is what I have repeated over and over in the last 2 years. I consider Yi Tay's response [1] a mere technicality that is actually irrelevant. What is relevant is how predictable, how "interpolatable", the data are; how predictable we are.