
> But now with reasoning systems and verifiers, we can create brand new legitimate data to train on. This can either be done offline where the developer pays to create the data or at inference time where the end user pays!

> This is a fascinating shift in economics and suggests there could be a runaway power concentrating moment for AI system developers who have the largest number of paying customers. Those customers are footing the bill to create new high quality data … which improves the model … which becomes better and more preferred by users … you get the idea.

While I think this is an interesting hypothesis, I'm skeptical. You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.

We are currently in a world where the SOTA base model seems to be capped at around GPT-4o levels. I have no doubt that in 2-3 years our base models will compete with o1 or even o3... it just remains to be seen what innovations/optimizations get us there.

The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data. But... it remains to be seen how much of the chain-of-thought reasoning you can really capture in model weights. I'm guessing some, but I wonder if there is a cap to the multi-head attention architecture. If reasoning can be transferred from reasoning models to base models, OpenAI should have already trained a new model with o3 training data, right?

Another thought is maybe we don't need to improve our base models much. It's sufficient to have them be generalists, and to improve reasoning models (lowering price, improving quality) going forward.






> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.

DeepSeek did precisely this with their Llama fine-tunes. You can try the 70B one here (might have to sign up): https://groq.com/groqcloud-makes-deepseek-r1-distill-llama-7...


Yes, but I meant it slightly differently than the distills.

The idea is to create the next-gen SOTA non-reasoning model with synthetic reasoning training data.


So you mean something like, "what if the baseline, off-the-cuff response for the next-gen models was tuned based on the results of the reasoning model excluding the reasoning itself?"

Exactly, though it may need the reasoning later to form the proper foundational logic in the weights.
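Roughly, the pipeline being described might look like this (a sketch of the idea only, not any lab's actual pipeline; reasoning_model and its generate call are hypothetical placeholders): have the reasoning model answer a prompt, strip out the chain-of-thought tokens, and keep only the prompt plus final answer as supervised fine-tuning data for the next base model.

    import json, re

    def strip_reasoning(text: str) -> str:
        # Drop <think>...</think> spans so only the final answer survives.
        return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    def build_sft_dataset(prompts, reasoning_model, out_path="sft_data.jsonl"):
        with open(out_path, "w") as f:
            for prompt in prompts:
                full_output = reasoning_model.generate(prompt)   # includes reasoning tokens
                answer_only = strip_reasoning(full_output)       # what the base model learns to emit
                f.write(json.dumps({"prompt": prompt, "completion": answer_only}) + "\n")

Whether some of the reasoning needs to stay in the training target, as suggested above, is exactly the open question.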

Every time you respond to an AI model "no, you got that wrong, do it this way," you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now.
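One plausible way to harvest that kind of correction (a sketch only; the is_correction flag would have to come from some classifier or heuristic, and no lab has published doing exactly this) is to turn the rejected reply and the post-correction reply into a preference pair for later preference training.

    def correction_to_preference(conversation):
        # conversation: list of {"role": ..., "content": ...} dicts in order.
        pairs = []
        for i, msg in enumerate(conversation):
            user_pushback = msg["role"] == "user" and msg.get("is_correction")
            if user_pushback and i >= 1 and i + 1 < len(conversation):
                pairs.append({
                    "prompt": conversation[:i - 1],              # context before the bad answer
                    "rejected": conversation[i - 1]["content"],  # the reply the user pushed back on
                    "chosen": conversation[i + 1]["content"],    # the reply produced after the correction
                })
        return pairs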

Users can be adversarial to the “truth” (to the extent it exists) without being adversarial in intent.

Dinosaur bones are either 65 million year old remnants of ancient creatures or decoys planted by a God during a 7 day creation, and a large proportion of humans earnestly believe either take. Choosing which of these to believe involves a higher level decision about fundamental worldviews. This is an extreme example, but incorporating “honest” human feedback on vaccines, dark matter, and countless other topics won’t lead to de facto improvements.

I guess to put it another way: experts don’t learn from the masses. The average human isn’t an expert in anything, so incorporating the average feedback will pull a model away from expertise (imagine asking 100 people to give you grammar advice). You’d instead want to identify expert advice, but that’s impossible to do from looking at the advice itself without giving into a confirmation bias spiral. Humans use meta-signals like credentialing to augment their perception of received information, yet I doubt we’ll be having people upload their CV during signup to a chat service.

And at the cutting edge level of expertise, the only real “knowledgeable” counterparties are the physical systems of reality themselves. I’m curious how takeoff is possible for a brain in a bottle that can’t test and verify any of its own conjectures. It can continually extrapolate down chains of thought, but that’s most likely to just carry and amplify errors.


Dirac’s prediction of antimatter came from purely mathematical reasoning, before any experimental evidence existed. Testing and verifying conjectures requires the ability to extrapolate beyond known data, rather than from it, and the ability to discard false leads based on theoretical reasoning, rather than statistical confidence.

All of this is possible in a bottle, but laughably far beyond our current capabilities.


This is a good take. What models seem to be poor at is undoing their own thinking down a path even when they can test.

If you let a model write code, test it, identify bugs, and fix them, you get an increasingly obtuse and complex code base where errors happen more often. The more it iterates, the worse it gets.

At the end of the day, written human language is a poor way of describing software. Even to a model. The code is the description.

At the moment we describe solutions we want to see to the models and they aren't that smart about translating that to an unambiguous form.

We are a long way off from describing the problems and asking for a solution, even when the model can test and iterate.


Same way corporations do it: they hire humans and other companies to do things. Organisations already have a mind of their own, with more drive to survive than an LLM.

This assumes that you give honest feedback.

Efforts to feed deployed AI models various epistemic poisons abound in the wild.


This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, which might have detectable word frequency patterns that can be detected with something cheap like tf-idf?

There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
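As a toy illustration of the word-frequency point (illustrative only; this assumes scikit-learn and a pre-collected batch of known junk, and real pipelines would be far more involved):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def flag_suspicious(submissions, known_junk, threshold=0.8):
        # Flag feedback whose tf-idf profile is suspiciously close to known machine-generated junk.
        vec = TfidfVectorizer(min_df=1)
        matrix = vec.fit_transform(known_junk + submissions)
        junk_vecs, sub_vecs = matrix[:len(known_junk)], matrix[len(known_junk):]
        sims = cosine_similarity(sub_vecs, junk_vecs).max(axis=1)
        return [s for s, sim in zip(submissions, sims) if sim >= threshold]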


> If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known?

We can use the same tactics they are using to crawl the web and scrape pages and bypass anti-scraping mechanisms.


Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.

And at the end of the day, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said these responses were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be), they have a lot of statistical tools they can use to weed out the poison.
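A minimal sketch of that "go with the 95%" filter, assuming each response accumulates good/bad votes (the vote structure and thresholds here are made up for illustration):

    from collections import Counter

    def majority_verdicts(votes, min_votes=20, min_agreement=0.9):
        # votes: dict mapping response_id -> list of "good"/"bad" labels.
        verdicts = {}
        for response_id, labels in votes.items():
            if len(labels) < min_votes:
                continue                          # too little signal to decide either way
            label, count = Counter(labels).most_common(1)[0]
            if count / len(labels) >= min_agreement:
                verdicts[response_id] = label     # confident majority; isolated poison gets outvoted
        return verdicts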


> As long as the vast majority of their data is good (and it will be)

So expert answers are out of scope? Nice, looking forward to that quality data!


If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.

If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.


I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.

They buy it through scale ai

What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?

Because paying experts for verified true data in the quantities they need isn't possible. Ilya himself said we've reached peak data (https://www.theverge.com/2024/12/13/24320811/what-ilya-sutsk...).

Why do you think we are stupid? We work at places developing these models and have a peek into how they're built...


You see a rowboat, and you need to cross the river.

Ask a dozen experts to decide what that boat needs to fit your need.

That is the specification problem, add on the frame problem and it becomes intractable.

Add in domain specific terms and conflicts and it becomes even more difficult.

Any nontrivial semantic property, anything without a clear true/false answer, is undecidable.

OpenAI will have to do what they can, but it is not trivial or solvable.

It doesn't matter how smart they are, generalized solutions are hard.


Sure not necessarily the same tactics, but as with any hacking exercise, there are ways. We can become the 95% :)

It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.

The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data. All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.


Are you suggesting some kind of invulnerability? People iterate their techniques; if big tech were so capable of avoiding poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.

Also, I don't get the narcissism part. Would it be petty to poison a website only when it's viewed by a spider? Yes, but I would also be that petty if some big company doesn't respect the boundaries I'm setting with my robots.txt on my 1-viewer cat photo blog.


It's not complete invulnerability. It is merely accepting that these methods might increase costs a little bit, but they don't cause the whole thing to explode.

The idea that a couple of bad-faith actions can destroy a 100 billion dollar company is the extraordinary claim that requires extraordinary evidence.

Sure, bad actors can do a little damage. Just like bad actors can do DDoS attempts against Google. And that will cause a little damage. But mostly Google wins. Same thing applies to these AI companies.

> Also I don't get the narcissism part

The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.


Grandparent mentioned "we"; I guess they're referring to a whole class of "black hats" resisting bad-faith scraping, who could eventually amass a relatively effective volume of poisoned sites and/or feedback to the model.

Obviously a singular poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a $100bn company. That's a straw man, and I think the people arguing for poisoning acknowledge that perfectly well. But I'd argue they can eventually manage to do at least a little damage, mostly for the lulz, while resisting scraping.

Google is full of SEO manipulators, and even though they recognize the problem and try to fix it, searching today is a mess because of that. The main difference and challenge in poisoning LLMs would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players who have the incentive to be the first result for a certain query.

As LLMs become commonplace, new incentives may eventually appear (i.e. an LLM showing a brand before others), and then it could become a much bigger problem, akin to Google's.

tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.


Global coordination for lulz exists, it's called "memes".

Remember Dogecoin or GameStop; those lulz-oriented meme outbursts had a real impact.

Equally, a particular way to gaslight LLM scrapers may become popular and widespread without any enforcement.


Didn't think of it that way, but I think you're right. As long as memes exist one could argue the LLMs are going to be poisoned in one way or another.

As someone who works in big tech on a product with a large attack surface -- security is a huge chunk of our costs in multiple ways

- Significant fraction of all developer time (30%+ just on my team?)
- Huge increase to the complexity of the system
- Large accumulated performance cost over time

Obviously it's not a 1-to-1 analogy, but if we didn't have to worry about this sort of prodding we would be able to do a lot more with our time. The point being that it's probably closer to a 2x cost factor than a 1% increase.


Who said they don't know? It's the same way companies know about hackers; knowing doesn't mean nothing ever gets hacked.

> This assumes that you give honest feedback.

You don't need honest user feedback because you could judge any message part of a conversation using hindsight.

Just ask an LLM to judge if a response is useful, while seeing what messages come after it. The judge model has privileged information. Maybe 5 messages later it turns out what the LLM replied was not a good idea.

You can also use related conversations by the same user. The idea is to extend context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.

Tesla uses the same method to flag the seconds before a surprising event, it works because it has hindsight. It uses the environment to learn what was important.
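In code, the hindsight-judging idea might look something like this (a sketch; judge_llm is a placeholder for whatever judge model you'd call, not a real API):

    def judge_with_hindsight(conversation, reply_index, judge_llm):
        # Score an assistant reply using everything that happened *after* it.
        reply = conversation[reply_index]["content"]
        followup = conversation[reply_index + 1:]        # the judge's privileged information
        prompt = (
            "Here is an assistant reply and the conversation that followed it.\n"
            f"Reply:\n{reply}\n\nWhat happened next:\n{followup}\n\n"
            "With hindsight, was the reply actually useful? Answer 'useful' or 'not useful'."
        )
        return judge_llm(prompt)                         # label can be kept as a training signal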


That goes along the lines of https://xkcd.com/810/

There are ways to analyze whether your contributions make sense from the conversation's point of view. Reasoning detects that pretty quickly. To attack, you would actually use another AI to generate not-totally-random stuff. It could still be detected.

I would assume that to use the data they would have to filter it a lot and correlate it between many users.

You can detect whether the user is a real one and trust their other chats "a bit more".


You would have to grade every user on every knowledge axis though. Just because someone is an expert in software doesn’t mean you should believe their takes on medicine, no matter how good faith their model interactions appear. I’d argue that coming up with an automated way to determine the objective truthfulness of information would be among the greatest creations of humanity (basically “solving” philosophy), so this isn’t a small task.

I've been thinking about how this happens with human cognitive development. There's a constant reinforcement mechanism that simply compares one's predicted reality with actual reality. The machines lack an authoritative reality.

If we had to grade the truthiness of data sources, our sight or other main senses would probably be #1. Some gossip we heard from a 6-year-old is near the bottom.

We know how to grade these data sources based on longitudinal experience and they are graded on multiple axes. For instance Angela is wrong about most facts but always right about matters of the heart.


Of course. Each user input would be compared with other user input and existing data in the model. Only legit and cross-referenced data could be used. Other data could still be used but marked as "possibly controversial data." A good model should know that controversial data exists too and should distinguish it from proper scientific data on each topic.

Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.

That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
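For what it's worth, a weak signal can go a long way once aggregated. A tiny simulation (assuming independent votes, which is generous) of feedback that is right only 55% of the time:

    import random

    def majority_is_correct(p_correct=0.55, n_votes=501, trials=2000):
        wins = 0
        for _ in range(trials):
            correct_votes = sum(random.random() < p_correct for _ in range(n_votes))
            wins += correct_votes > n_votes / 2
        return wins / trials

    print(majority_is_correct())   # roughly 0.99: slightly-better-than-chance feedback compounds with volume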


AI models already assume that a significant majority of their training material is honest/in good faith to begin with. So this is not new?

AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?

There’s so much data required for training that it’d be surprising if humans looked at even a small subset of it at all. They need different statistical tools to clean it up. That’s where attacks will be concentrated, naturally, and this is why synthetic data will overtake real human data, right after ‘there isn’t enough data, even if it’s too much already’.

Try a little benefit of the doubt, nuance or colloquialism. Or a bit of all three.

I am not in this space, question: are there "bad actors" that are known to feed AI models with poisonous information?

I'm not in the space either but I think the answer is an emphatic yes. Three categories come to mind:

1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).

2. Nation states like China who already require models to conform to state narratives.

3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.

There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."


The part where people disagree seems fun.

I'm looking forward to protoscience/unconventional science, and perhaps even what is worthy of the fringe or pseudoscience labels. The debunking there usually fails to address the topic, as it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?

If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.


> The part where people disagree seems fun.

Apparently, you haven't been on that Internet thingie in the last five years or so... :-)

But I do agree with your point. What's interesting is the increasing number of people who act like there's some clearly objective and knowable truth about a much larger percentage of topics than there actually is. Outside of mathematics, logic, physics and other hard sciences, the range of topics on which informed, reasonable people can disagree, at least on certain significant aspects, is vast.

That's why even the concept of having some army of "Fact Checkers" always struck me as bizarre and doomed at best, and at worst, a transparent attempt to censor and control public discourse. That more people didn't see even the idea of it as being obviously brittle is concerning.


On Wikipedia you are supposed to quote the different perspectives. No one has ever accomplished this.

We can trust Altman and Elon to weed out the "fake news". Finally we will get the answer to which is the greatest Linux distro.

> Outside of mathematics, logic, physics

No need to go outside. There are plenty of Grigori Perelmans with various levels of credibility.



Yeah, it's great comedy.

> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.

Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs. Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"


Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/

And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.

If the AI already has a larger knowledge domain space than the user then all users are bad actors. They are just too stupid to know it.

I would not really classify them as "bad" actors, but there are definitely real research lines into this. This freakonomics podcast (https://freakonomics.com/podcast/how-to-poison-an-a-i-machin...) is a pretty good interview with Ben Zhao at the University of Chicago. He runs a lab that is attempting to figure out how to trip up model training when copyrighted material is being used.

Creators who use Nightshade on their published works.

yes, example: me

I more often than not use the thumbs up on bad Google AI answers

(but not always! can't find me that easily!)


I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)

For images ones I have to turn off my brain. “Select all images that contain a crosswalk.”

What about unmarked crosswalks? Does it have to contain the crosswalk in whole or in part? That bit of white striping is just there on the edge of this image; does that count? There's a crosswalk in the background; does that count? Etc. etc.

The answer to all these questions is generally that you shouldn’t be asking. I can almost hear someone saying “You know what we mean.”


I don't know why HN users in particular fixate so heavily on fringe issues when it comes to LLMs. Same as the exaggerations of hallucinations.

Because hallucination is something that from a distance looks very unimportant, but when looked at closely is a structural problem. Some people here live very close to the LLM field.

Structural, because while a human being can be the judge of an LLM's output, a computer (or another LLM) cannot.

No amount of error correction is enough to turn an LLM's output into a reliable input to another (possibly dumb) computer system. Worse: each time that output is processed the error increases, and when the final output is shown to a user, the error might have been amplified beyond human recovery (or recognition) capacity.

Think about this: a user sends Amazon support an email asking for a refund for a stolen item.

Can this email be processed to feed an automatic refund pipeline? If the answer is no and you need a human to verify the result, then we have one reason why hallucinations matter.

And there are cases where user verification is not even possible, like:

- What is the procedure to perform CPR on a person above 80 years old?

The user can’t recover from errors in the output generated by the LLM here, because she doesn’t know the correct answer.

That being the case, you cannot build a search engine out of an LLM here. Hence hallucinations matter very much in this case too.

Not even in the case of simple information extraction from a text can you ignore hallucinations: if you provide a list of names and ask for all those starting with “A”, you cannot be certain that all the names in the output will actually start with “A”, and you most certainly cannot be certain that all the correct names will be in the output. And this behavior cannot (as of today) be corrected in the LLMs we have right now (the first part can be checked after the fact, the second part cannot).
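To make the asymmetry concrete, the first failure is mechanically checkable from the output alone, while the second is not, because checking it requires already knowing the answer (a small sketch, not tied to any particular model):

    def check_extraction(llm_output_names, source_names=None):
        # Precision: every returned name should start with "A" -- verifiable from the output alone.
        precision_ok = all(name.startswith("A") for name in llm_output_names)
        if source_names is None:
            return precision_ok, None                    # recall is unknowable without the ground truth
        # Recall: only checkable if you already have the full source list, defeating the point.
        expected = {n for n in source_names if n.startswith("A")}
        recall_ok = expected.issubset(set(llm_output_names))
        return precision_ok, recall_ok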

So, LLMs with hallucinations are a very powerful tool, but not the tool they are being sold as.


Two questions:

1) Which search engine comes with infallible information?

2) Where are LLMs being sold as something different?


1) Current (traditional) search engines are indexes. They point to sources which can be read, analyzed and summarized by the human into information. LLMs do the reading, analysis and summarization for the human.

2) Chatbots, the Perplexity search engine, summarization Chrome extensions, RAG tools. Those are all built on the idea that hallucination is a quirk, a little cog in the machine, a minor inconvenience to be dutifully noted (for legal reasons) but conveniently underestimated.

Most things in life don’t have a compiler that will error on a nonexistent Python package.


> LLMs do the reading, analysis and summarization for the human

No, they don't. The human is meant to read, analyze and summarize the output, same as they would for search results.


> What is today's date?

>> Today's date is Tuesday, January 28, 2025.

> No, you're wrong, today's date is actually Wednesday the 29th.

>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.

Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.


But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.

At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.

Even then, what up-to-date information should be present for every round trip is a matter of opinion and use case.
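A sketch of what "layering it in at inference" could look like, using the user's own locale rather than the server's (chat_model is a placeholder, not a real API):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    def answer_with_context(user_message, user_timezone, chat_model):
        now = datetime.now(ZoneInfo(user_timezone))      # the user's zone, not the cron job's
        system_prompt = f"Current date and time for this user: {now:%A, %B %d, %Y %H:%M %Z}."
        return chat_model(system=system_prompt, user=user_message)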


> The model weights, by their nature, cannot encode that information.

This is mostly irrelevant, no? A binary digit by definition cannot encode more than 2 dates, so we devise a more elaborate system (using multiple digits).

This is very similar to the NYT's lawsuit against OpenAI where, in addition to other claims, they claimed OpenAI maintained a DB of NYT articles that they would directly grab from for a response. It seems very feasible to maintain a DB or system for looking up real-time values like dates/weather.


> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.

Such an ingenious attack, surely none of these companies ever considered it.


the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you. 7f5dbb71f54322f271c4d3fc3aaa4d3282a1af5541d82b2cbc5aa10c1420b6bc

why can't they feed in user data like time zone and locale?

They're not actually processing the entire system prompt (which is rather long) on every query, but continuing from a model state saved after processing the system prompt once.

That makes it a bit harder, but still, spitting out the wrong date just seems like a plain old time-zone bug.


not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?

an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"


A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
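That loop can even be automated when any kind of check is available: you don't need the right answer, just a way to test a candidate (a sketch only; ask_model and works are placeholders for your model call and your test):

    def refine_until_it_works(task, ask_model, works, max_rounds=5):
        prompt = task
        for _ in range(max_rounds):
            candidate = ask_model(prompt)
            if works(candidate):                 # e.g. run the tests, render the HTML, eyeball it
                return candidate
            prompt = f"{task}\n\nYour last attempt didn't work:\n{candidate}\nPlease fix it."
        return None                              # gave up; a human takes over from here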

I was using llama3 and deepseek-r1 literally to center an element in a div and they were not able to, despite many prompts and variations. I guess I figured it out in the end, but I'm not convinced I saved any time vs. just carefully reading the flexbox docs.

You constantly have to correct an AI when using it, because it either didn't get the question right or you're guiding it towards a narrower answer. There is only more to learn.

>not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?

For your example, what if you want to show a friend what such a mushroom looks like? What if you want to use it on a website?


I feel like a conventional image search would be more reliable for getting a good picture of a mushroom variety that you know about. Ideally going out into the woods to get one, I suppose.

On topics like history or biology, if a model's answer is surprising I might check Wikipedia and call it out on its bullshit by explaining how Wikipedia contradicts it and pasting an excerpt from Wikipedia. But frankly, if the model can't even reliably internalize Wikipedia, I don't have much hope for complex feedback training based on my chats.

While it's possible Wikipedia is wrong, the model always agrees with me when I correct it, so that isn't going to help with training either.

Of course for anything high stakes relying on a model probably isn't a great idea.


Does it?

If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.


Nah I just insult it and tell it that it costs me 20 dollars a month and it's a huge disappointment

If such labels are collected and used to retrain the model then yes. But these models are not learning online.

ChatGPT came out with an interface that was a chat box and a thumbs-up/thumbs-down icon to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?

Really? Isn't that the point of RL used in the way R1 did?

Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function?

I believe that's what GP meant by "respond", not telling GPT they were wrong.


That is still inference. It is using a model generated from the RL process. The RL process is what used the cost function to add another model layer. Any online/continual learning would have to be performed by a different algorithm than classical LLM or RL. You can think of RL as a revision, but it still happens offline. Online/continual learning is still a very difficult problem in ML.

Yes, that makes sense. We're both talking about offline learning.

> you provide a very valuable piece of data to train on

We've been saying this "we get valuable data" thing since the 2010s [1].

When will our collective Netflix thumbs ups give us artificial super-intelligence?

[1] Especially to investors. They love that line.


Our collective Netflix thumbs-up indicators gave investors and Netflix the confidence to deploy a series of Adam Sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.

Through analytics Netflix should know exactly when people stop watching a series, or even when in a movie they exit out. They no doubt know this by user.

They know exactly what makes you stay, and what makes you leave.

I would not be surprised if in the near future movies and series are modified on the fly to ensure users stay glued to their screens.

In the distant future this might be done on a per user level.


So if I just pay OpenAI $200/mo, and randomly tell the AI, no that's wrong.

I can stop the AI takeover?


You would need a lot of Pro accounts! I would be surprised if they didn't use any algorithms for detecting well-poisoning.

You can have our thank-you cards forwarded to your cell at Guantanamo Bay.

My non-technical cousin is a heavy paying user of ChatGPT. Once she discovered that she can type incoherent stuff, with typos and whatnot, and ChatGPT will still get the gist and produce satisfying answers, she just types in tons of nonsense (to me), keeps long chat sessions, complains it is getting slow, and then gets mad when I remind her to open a new chat each time she has something new to ask that is not related to the previous one. I have my doubts that many users will provide valuable training data.

You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.

> I wonder if there is a cap to the multi-head attention architecture

I don't think there is a cap other than having good data. The model learns all the languages in the world; it has capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not an obstacle. An LLM like AlphaProof reached silver-medal level at the IMO.


I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...

>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.

The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.


> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition

It's a different kind of data from the R1 reasoning chains. When LLMs have a human in the loop, the human provides help based on their personal experience and real-world validation. Sometimes users take an idea from the LLM and try it in real life, then come back later and discuss the outcomes. This is a real-world testing loop.

In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.

This scales well: OpenAI has 300M users; I estimate up to 1 trillion interactive tokens/day. The user base is very diverse, problems are diverse, and feedback comes from user experience and actual testing. They form an experience flywheel: the more problem solving they do, the smarter it gets, attracting more users.


> I highly doubt you are getting novel, high quality data.

Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.

If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
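For a sense of how simple such a verifier can be, here is a rules-based outcome reward in the spirit of what the R1 report describes, combining an accuracy check with a format check; the exact reward functions aren't public, so treat this as an illustration:

    import re

    def outcome_reward(model_output: str, reference_answer: str) -> float:
        # Accuracy: final answer expected inside \boxed{...}; compare to the known result.
        match = re.search(r"\\boxed\{([^}]*)\}", model_output)
        correct = match is not None and match.group(1).strip() == reference_answer.strip()
        # Format: reasoning should be wrapped in <think>...</think> tags.
        format_ok = "<think>" in model_output and "</think>" in model_output
        return 1.0 * correct + 0.1 * format_ok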


> I highly doubt you are getting novel, high quality data.

That's not the point. The point is you reject low-quality data, aka noise.

And how would that work at inference time?

Why would it need to work at inference time?

> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.

Why is it promising, aren’t you potentially amplifying AI biases and errors?


It seems to work and seems very scalable; "reasoning" helps to counter biases (answers become longer, i.e. the system uses more tokens, which means more time to answer a question; likely longer answers allow better differentiation of answers from each other in the "answer space").

https://newsletter.languagemodels.co/i/155812052/large-scale...

also from the posted article

"""

The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.

This makes intuitive sense, as language itself is effectively a reasoning DSL.

"""


*SOTA = state of the art

Shouldn't the whole idea be to get away from needing data at all? If a model can really reason, it should be able to figure things out on its own.

An LLM is just a really good parser connected to a lossy compressed corpus of data.

They need to be open ended and self training to be truly useful.

Reasoning is way far away...


the main bottleneck will be model depth... you can only do so much with N layers, and recurrence has proven to be way less efficient (for now)

It doesn't need much. One good lucky answer in a thousand or maybe 10k queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already here: OpenAI has it, now DeepSeek has it too. You can be sure others also have it; Anthropic at the very least. They just never announced it officially, but go read what their CEO has been speaking and writing about.

I have looked a bit into the Anthropic CEO's writings, but if you could point me in the right direction, that would be helpful!


