>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it you'll still be lagging behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.
>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.
>the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today.
>architecture research matters. many people just take it for granted these days.
Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.
What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks:
(1) A Mamba SSM and a Transformer on the Pile.
(2) Two transformers, one trained on the Pile, the other trained on Reddit comments.
All are trained to the same MMLU performance.
I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
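If it helps, here's roughly the comparison I have in mind, sketched with the Hugging Face API. The checkpoint names are placeholders (not real models trained to matched MMLU); the point is just: same prompts in, compare the typical completions out.

    # Sketch only: the "my-org/..." names are hypothetical stand-ins for the
    # Pile-trained Mamba and the Pile-trained transformer described above.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sample(name, prompt, n_tokens=128):
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, temperature=0.8,
                             max_new_tokens=n_tokens)
        return tok.decode(out[0], skip_special_tokens=True)

    prompts = ["Explain why the sky is blue.",
               "Write a short story about a lighthouse keeper."]
    for p in prompts:
        print(sample("my-org/mamba-pile", p))        # hypothetical SSM checkpoint
        print(sample("my-org/transformer-pile", p))  # hypothetical transformer checkpoint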
There's not many people who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments these days.
Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.
If Mamba really was as capable as a Transformer on tasks requiring accurate attending to long context, then there'd be no need for Jamba (Mamba+Transformer hybrid).
Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...
Yeah, I'm not sure how someone could interpret what you said in the way people are citing here. It's actually obvious that you are right in the context of data in LLMs. Look at Llama 3, for example: there are minimal architectural changes, yet its performance is almost at the level of GPT-4. The biggest change was in the dataset.
Well, both can be true if you interpret the "it" as "the secret sauce / competitive advantage". A good architecture is a necessary but not sufficient condition for success, but everybody uses more or less the same one currently, so data makes the difference. Until the next improvement in architecture.
I do argue that the "it" is the architecture. We have pretty much had all the data that these LLMs were trained on for a long time. The game changer was the architecture, not the data. Unless of course you are in the "code is data" camp ;).
Probably the "it" is whatever one model has that other models don't have. When everyone is using the same architecture, then the data makes the difference. If everyone has the same data, then the architecture makes the difference.
It sounds pretty obvious to say that the difference is whatever is different, but isn't that literally what both sides of this argument are saying?
edit: I do think that what the original linked essay is saying is slightly subtler than that, which is that _given_ that everyone is using the same transformer architecture, the exact hyperparameters and fine-tuning matter a lot less than the dataset does.
MLP is a universal approximator, so there’s definitely a configuration that can match an attention mechanism. Whether or not it’d be feasible to train is another question.
Not sure about feasible, but certainly not efficient.
I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.
I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.
Yes, and note that in terms of different architectures, the author (James Betker) is talking about image generators, while when he's talking about LLMs they are all the same basic architecture - transformers.
Some tasks are going to be easier to learn than others, and certainly in general you can have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (combination of architecture + size) and well trained.
That said, it's notable that all the Pareto-optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.
How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !
So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.
No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.
Seems like an objection that is slightly beside the point? The claim is not that literally any model gives the same result as a large transformer model, that's obviously false. I think the more generous interpretation of the claim is that the model architecture is relatively unimportant as long as the model is fundamentally capable of representing the functions you need it to represent in order to fit the data.
OP's claim/observation is that "trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point [of inference performance]".
His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".
There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.
This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?
Interesting point. But does the data contain enough information to perfectly model the generative process? Maybe even a very complex and capable model like "the human brain" would fail to model the dataset better than large transformers, if that was the only thing it ever saw.
You and I can model the dataset better, but we're already "pre-trained" on reality for decades.
Just because the dataset is large doesn't mean it contains useful information.
Perhaps, but even with an arbitrarily good training set, the LLM would still be constrained by its own architectural limits, e.g. if a problem can't be broken down into sub-problems that each require <= N sequential steps, then an N-layer transformer will never be able to solve it.
Even if the architectural shortcomings were all fixed, it seems "[pre-training] data is all you need" would still be false, because there is no getting around the need for personal experience, for the same reasons that is true for us...
Perhaps most fundamentally, any action/prediction you make can only be based on the content of your own mind, not the mind of a tutor you are trying to copy. Even if the tutor diligently tries to communicate all nuances and contingencies of a skill to you, those are still all relative to his/her own internal world model, not the one in your head. You will need to practice and correct to adapt the instructions to yourself.
I took the Andrew Ng Coursera machine learning course in 2015, and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent, but he essentially said that tweaking the algorithm will only make your model ~1% better, while doubling the amount of training data will have a substantially larger impact (use any old algorithm, just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc. were sitting on a goldmine: in the long run, those with the most data, not the brightest PhDs, will win this race.
The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.
The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.
The models eventually reach saturation of how much they can fit to. There’s reason to believe that current LLMs are underfit to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently utilizing another 2x data to fill out the space. Doubling the model size, on the same training data, and letting it be even more underfit could result in a better model.
> That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.
So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.
Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.
The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.
I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch to on-the-job training and learning ability, rather than data.
As a hobbyist who has trained models for different use cases ranging from object detection and recognition to text completion to image generation, I've found the best advice has consistently been to curate and annotate your dataset as perfectly as you can before worrying about anything else.
A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things they are relatively minor for most purposes.
Of course, this advice gets routinely ignored as people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throwing it all together and training it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.
It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
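For what it's worth, the mechanical part of that loop is mostly this boring. A rough sketch, assuming a JSONL file with "text" and "label" fields (the field names and filenames are just illustrative):

    import json
    from collections import Counter

    cleaned, seen = [], set()
    with open("dataset.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            text = rec.get("text", "").strip()
            if not text or rec.get("label") is None:
                continue                  # drop empty or unlabeled examples
            if text in seen:
                continue                  # drop exact duplicates
            seen.add(text)
            cleaned.append(rec)

    print(Counter(r["label"] for r in cleaned))   # eyeball the label balance

    with open("dataset.clean.jsonl", "w") as f:
        for rec in cleaned:
            f.write(json.dumps(rec) + "\n")

The filtering is the easy part; the screen time goes into actually reading the records that survive it.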
The 15T tokens that got thrown at Llama-3 didn't seem to hurt. Will be interesting to see how well Phi-2 holds up with its more curated approach; hopefully they don't get disappeared like WizardLM 2 =)
"The quality of the prompts used in SFT and the preference rankings used in PPO and DPO played a crucial role in the performance of the aligned models. Meta's team carefully curated this data and performed multiple rounds of quality assurance on annotations provided by human annotators."
This makes me sad, not because I disagree with it, but because it's basically common wisdom in the statistical and ML communities (of practitioners). In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.
That being said, if you use a linear model (like lasso) vs. a tree-based model (like XGBoost), you'll definitely see differences, but once you have a flexible enough model and a lot of data, training time and inference complexity tend to become better ways to make a model choice.
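To make that concrete, a toy sketch on synthetic nonlinear data (sklearn's GradientBoostingRegressor standing in for XGBoost, nothing tuned):

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
    for name, model in [("lasso", Lasso(alpha=0.01)),
                        ("gbdt", GradientBoostingRegressor(random_state=0))]:
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(name, round(r2, 3))   # expect the tree model to pick up the nonlinearities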
>In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.
There are countless competitions, etc. on Kaggle, AICrowd, or other platforms with an enforced standardized data set. Every entrant uses the same data set and there's a huge difference between the best and worst submissions.
Agreed, but if you look at winning submissions (which I stopped doing), a lot of them do very good feature engineering, which is not a model-related thing.
> the only people who think architecture/model choice makes a huge difference are n00bs and academics.
Are you referring to the current state of our best existing models or the potential future of ML? I find it incredibly hard to see how an LLM could implement the best “physically allowable” approximation to Solomonoff induction.
Then again, I thought it was extremely unlikely neural networks would have the abilities they currently exhibit, so who knows.
We manage to train neural nets to approximate complicated data sets via a rather simple process: backpropagation.
It is indeed a marvel that it works nearly as well as it does.
But then again, evolution is even dumber (in the sense that it only makes random choices that thrive or perish, and can't even take gradients into account), but evolution has still managed to produce intelligent critters.
I guess when you have enough dimensions greedy approaches to optimisation / hill climbing can work well enough, even when you have challenging problems?
Especially if you are allowed to move to some meta levels. Eg evolution doesn't build planes, it built brains that can figure out how to build planes. Similarly with back propagation perhaps.
Unfortunately Google brain researchers have not yet discovered my brilliance, but if you read my argument it's about the data being much more important than the model. Granted transformers are a great model, but that doesn't refute my point.
Why does it make you sad? It seems intuitive and simple. And in reality, of course, the optimisation part is not trivial. What would be better if the "it" was more complicated?
It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.
You would get into natural language modelling because you had a deep love of language. Because you think you're close to figuring language out in a systematic way, with just a few years more study.
There's a certain sadness, I think, in the revelation that the robots don't need the expertise of humanity's greatest experts and masters, they just need us to click all the squares that contain a motorcycle.
> It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.
What's sadder is coming into a field pre-deciding that the way you approach it "is the right way" and can't tolerate that different mindsets can also get results.
I don’t get this: “What that means is not only that they learn what it means to be a dog or a cat, …“
We don't have any dataset of dog or cat experience, right? OP probably means that the models learn what a dog or cat is, right?
I find the whole piece somewhat vague btw. No real insights if you ask me. Sure if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?
> OP probably means that the models learn what a dog or cat is, right?
Yes, "What it means to be" does appear to be meant that way and it didn't occur to me to interpret it the other way.
> Sure if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?
You put in a particular choice of nn architecture as well as the dataset. The insight (to the extent that it is insightful, and true) is that the architecture doesn't affect the results you get much compared to the dataset.
Ok, the first thing must just be my non-native-speaker mind then.
The second still feels like "duh". It's what these models are meant to do, right? Form an internal representation of the relations hidden in the data. It's what complex systems do: they hold models of reality and use those to predict. That is in fact what Claude Shannon meant with his definition of information. Idk, maybe I'm getting it wrong.
> It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
In some other comment I read this. Sounds very much like a curation thing. And now I'm wondering; isn't this part already covered by a lot of human beings now interacting with ChatGPT and the like?
My uneducated guess is that a company can scrape the whole world wide web and also have all the low quality content that comes with it, but then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?
People typically ask LLMs about things they DON'T know about or understand. So they are not qualified to assess the validity of their answers. Which is exactly why hallucination is such a big problem.
"Fixing" low quality data with RLHF is a waste of time. By that point it's already poisoned the model distribution, and all you're doing is steering it away from catastrophic failure cases.
Start with the best data you can, and task-train ("rlhf") behavior, not preference.
Has anyone tried removing an entire concept from a dataset and seeing if the LLM can reason its way into the concept?
I think that would be a really cool experiment.
There are probably some really good candidate concepts that just take a small leap of reasoning to reach.
But off the top of my head maybe multiplication? Or the concept of zero. Maybe the wheel?
Edit: if anyone is interested in doing this kind of stuff, hit me up. (Email in profile). I want to start doing these kinds of things as a side project.
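The data side of the experiment would probably start as something like this crude keyword scrub (patterns and filenames are just illustrative, and a real run would need far more careful leakage checking), then pretrain on the filtered corpus and probe for the concept:

    import re

    # hypothetical surface forms for the target concept ("multiplication")
    patterns = [r"\bmultipl\w*", r"\btimes table", r"\d+\s*[x×*]\s*\d+"]
    concept_re = re.compile("|".join(patterns), re.IGNORECASE)

    with open("corpus.txt") as src, open("corpus_no_multiplication.txt", "w") as dst:
        for doc in src:                      # assuming one document per line
            if concept_re.search(doc) is None:
                dst.write(doc)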
If you had one that was character-based (instead of the weird encoding they tend to use), you could directly sample without the letter "e".
Though I'm not sure its output would make much sense, and you might have to use beam search (or something like backtracking).
I wonder how you would train a model to directly speak without e. Perhaps you use the general model like above with beam search, and then train a new model to directly predict the first model's beam-searched predictions.
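Even with an ordinary subword model you can approximate the character-level version by masking, at every step, any token whose string contains an "e". A greedy sketch below (gpt2 is just a convenient small checkpoint; a beam-search or backtracking variant would build on the same mask):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # indices of every token whose decoded text contains an "e"
    banned = torch.tensor([i for i in range(len(tok))
                           if "e" in tok.decode([i]).lower()])

    ids = tok("A dog and a boy go out for a walk", return_tensors="pt").input_ids
    for _ in range(40):
        logits = model(ids).logits[0, -1]
        logits[banned] = float("-inf")       # forbid tokens containing an "e"
        next_id = torch.argmax(logits).view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)

    print(tok.decode(ids[0]))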
Yes, and it's what people seem to ignore when they talk about dethroning GPT4 as the top LLM. It's good data expressly developed for training the behaviors they want that keeps them ahead; all the other stuff (other training and filtering web data) has much less of an impact.
I don't think GPT4 is the top LLM, it's good at coding and good at understanding poorly written prompts but its high level prompt following and creativity are not great. GPT4 likes to answer a particular way and when your question matches up with that it'll seem very smart, but when it doesn't the rails it is on are very obvious.
This insight makes one wonder if the same thing applies to humans as well. Are we just the sum of our experiences? Or are the architectures of our brains much more complex and different, so that they have more influence on the outputs for the same inputs?
I think it's the latter. We may well have some subsystems that work like LLMs or other current AIs, but the overall system of a human mind seems to work in a fundamentally different way, as it's able to make good creative choices (such as the next word to say) without looking at lots of options.
Consider a chess engine that plays at grandmaster level, i.e. a human grandmaster can sometimes beat it. Even though it's not the best chess engine in the world, it simulates billions of possible scenarios to decide each move. Yet the grandmaster can still beat it sometimes, even though he clearly isn't thinking about billions of possible scenarios.

(On the question of whether human brains may in fact unconsciously process billions of possibilities when deciding a chess move, using some neurological process we haven't discovered, I've heard David Deutsch argue this would be thermodynamically impossible as it would require far more energy than the brain consumes.)

So the human grandmaster's brain must be doing something else that we don't understand. I think a similar comparison applies to how an LLM and a human choose the next word to say. An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.
That is what I have repeated over and over in the last 2 years. I consider Yi Tay's response [1] a mere technicality that is actually irrelevant. What is relevant is how predictable, how "interpolatable", the data are; how predictable we are.